A computational approach for identification of core modules from a co-expression network and GWAS data

Summary This protocol describes the application of the “omnigenic” model of the genetic architecture of complex traits to identify novel “core” genes influencing a disease-associated phenotype. Core genes are hypothesized to directly regulate disease and may serve as therapeutic targets. This protocol leverages GWAS data, a co-expression network, and publicly available data, including the GTEx database and the International Mouse Phenotyping Consortium Database, to identify modules enriched for genes with “core-like” characteristics. For complete details on the use and execution of this protocol, please refer to Sabik et al. (2020).

5. If no publicly available data exists, identify a source of relevant cells and tissue from which you can extract RNA and perform RNA-seq. This data could come from human or mouse primary cells or a relevant cell line, however, the cell type or tissue must be relevant to the trait of interest. For an example, see our study of genes influencing bone mineral density using murine osteoblasts (Sabik et al., 2020).
Note: The protocol for processing and extracting RNA will be specific to your cell type or tissue, and this part of the process will not be covered in this protocol.
Pre-processing RNA-seq data for co-expression network construction Timing: 2 h 6. The main section of this protocol assumes you have an expression matrix that is pre-processed and ready for input to WGCNA (Langfelder and Horvath, 2008). Information about processing RNA-seq data can be found in the FAQ section of the WGCNA package. Briefly, starting from raw count data, it is recommended that genes with low expression across the majority of samples are filtered out, e.g., all features with fewer than 10 counts in 90% of the samples, and a variancestabilizing transformation (DESeq2 package) is performed (Love et al., 2014). Normalized counts (RPKM/FPKM) can also be log transformed (log2(x+1)) and used to construct the network. The result of this process is a matrix of expression values (rows) by samples (columns).

OPEN ACCESS
Optional: An optional step is to remove batch effects from the expression data using PEER (Stegle et al., 2012). PEER can be used to remove both known and latent batch effects from data. The developers of PEER have produced detailed tutorials, available on their Github wiki page.
7. Finally, a quantile normalization is performed to reduce experimental noise from the expression matrix (Bolstad, 2020).

Curating lists of known disease and phenotype-associated genes
Timing: 2-5 days 8. One characteristic of core genes is that they are associated with monogenic diseases related to your trait of interest (Boyle et al., 2017). For example, given an interest in bone mineral density GWAS, one would curate a list of genes that cause monogenic bone diseases (Sabik et al., 2020). a. Thus, this step includes identifying a specific set of monogenic or Mendelian diseases that are related to your trait of interest b. Then, continue with a literature search to curate a list of genes experimentally shown to cause these diseases i. For this purpose, you may want to consult the Online Mendelian Inheritance in Man (OMIM) database (McKusick, 2007) or eDGAR (Babbi et al., 2017) 9. Another characteristic of core genes is that they are strongly associated with abnormal phenotypes in knockout models (Boyle et al., 2017), and so another resource that can be used to identify core modules is a curated list of genes which produce a phenotype related to your trait of interest when knocked out. a. There are databases of gene perturbations that result in a variety of phenotypes, including the Mouse Genome Informatics database (MGI) and the International Mouse Phenotyping Consortium (IMPC) (Bolser, 2004;Koscielny et al., 2014). i. The IMPC database systematically knocks out genes and conducts a whole panel of phenotyping relating to various organ systems, development, neurology and behavior, aging and mortality, metabolism, reproduction, etc. ii. The MGI database is a systematic repository for observed phenotypes in experimental systems. Using their mammalian phenotype browser, search for phenotypes related to your trait of interest and create a list of relevant phenotypes and their associated genes. iii. These are just two examples of data repositories that contain associations between traits or diseases and genes. Be sure to do some research to see if your specific phenotype has a more targeted repository. For example, the Bonebase Database has deep bone phenotyping for many of the knockout mice produced in the knockout mouse project (KOMP) (Rowe et al., 2018). 10. Save all of the genes from this section comma separated value (csv) files for use during the protocol. 11. While there is no strict threshold for how large these lists of genes should be, the enrichment analysis will not identify any significantly enriched modules in Step 4, module enrichment.

STEP-BY-STEP METHOD DETAILS
Note: Further discussion of the research questions underlying this protocol can be found in our two prior publications that leveraged this approach (Calabrese et al., 2016, Sabik et al., 2020. For example, in these publications, we explore the specificity of the module enrichments against other GWAS studies and more deeply characterize the topology of the genes in the annotated gene lists within the co-expression modules.
Step 1: Creating GWAS gene list Timing: 20 min In order to identify co-expression modules enriched for GWAS genes, we first identify all genes overlapping a GWAS locus, defined as the set of SNPs in high linkage disequilibrium with the lead SNP.
1. Using the previously identified GWAS study, read in the list of lead SNPs that are significantly associated with the study trait. In many cases, this will be the output of fine-mapping (Schaid, et. al. 2018) or a file containing just the lead SNPs for each independent association. For downstream analysis, generate a table with these lead SNPs and their genome coordinates. 2. In order to programmatically access the LDLink data using the LDLinkR package in R, users must register on the LDLink website in order to receive an LDLink token. The token will be emailed to you. 3. Using this table and the LDLinkR package, each lead variant is queried in the 1000 Genomes database and the variants in linkage disequilibrium, or proxy SNPs, for each lead SNP are downloaded (1000Genomes Project Consortium et al., 2012Machiela and Chanock, 2015). In using this resource, you need to be aware of the population used in the GWAS study you are analyzing. In many cases, the European (''EUR'') population is representative of the population used in the study, however, this is not always the case. Compare the available populations from 1000 Genomes against the GWAS study publication to find the appropriate option for the analysis. Using the LDproxy_batch function, all GWAS SNPs can be queried at once, and the proxy SNPs will all be written to a file on disk using the append = TRUE option.
4. Next, this set of proxy SNPs is filtered to include only those with linkage disequilibirum R 2 >= 0.7 with the lead variant. These are the regions of the genome that will determine which genes are implicated by the GWAS in cis. 5. Then, using the Genomic Ranges tool, identify all genes from the GRCh37/hg19 Ensembl gene set overlapping a GWAS bin (Lawrence et al., 2013). Depending on the build of the genome used in the GWAS in the study, the correct gene annotation reference will need to be used. Download the reference file from the UCSC table browser (Haeussler et al., 2019). a. Create a GenomicRanges representation of both the GWAS regions and the reference gene set.
b. Next, identify the overlapping genes for each region using the findOverlaps function c. If no gene intersected a given region, include the nearest upstream and downstream genes from the region. These are identified using the ''precede'' and ''follow'' functions from Ge-nomicRanges. The results are converted into data frames and filtered to include only nearest upstream and downstream genes associated with variants that do not overlap a gene. 6. Finally, the list of human genes is converted to mouse homologs using the MGI mouse to human homology map (Bolser, 2004) (http://www.informatics.jax.org/downloads/reports/HOM_Mouse HumanSequence.rpt).
Step 2: Construct co-expression network Timing: 30 min to 1 h This step uses the WGCNA package to create a co-expression network ( Figure 1) by grouping genes that co-vary across samples into modules (Langfelder and Horvath, 2008). This co-expression network serves as a basis for the enrichment analysis in the next set of steps. This section closely follows the WGCNA tutorials created by the authors of WGCNA, found here: https://horvath.genetics. ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/.
7. Check the relationship among your samples using clustering to identify outliers.
a. Clustering and plotting the resulting tree will allow visualization of the relationships among your samples. b. As a result, you may want to remove samples that are dramatically different from the mean by cutting the tree using cutHeight; more details can be found in the WGCNA data processing tutorial. 8. Next, the soft-thresholding power is empirically determined using the pickSoftThreshold function. Because it is assumed that the co-expression network will have a scale-free topology, the scale-free topology fit index is calculated across a range of powers, and the minimum power required to reach the point of diminishing returns is selected as the soft thresholding power. This power is applied to the adjacency matrix prior to the calculation of the Topological Overlap Matrix.
Note: This step may take a few minutes to run.
9. After determining the optimal soft-thresholding power, we can now calculate co-expression modules from the gene expression dataset. There are many parameters that can be tweaked in this function that can alter the resulting network. In this example, values close to the defaults, however the default values may not be optimal for your dataset. In this example, we used a min-ModuleSize of 20 genes and a mergeCutHeight of 0.15, in order to avoid having a large number of small modules. Additionally, a signed network was constructed, as Peter Langfelder recommends for biological network (Langfelder, 2018). For more information about the parameters of the network calculation step, consult the WGCNA help files in the R environment. 10. While the network is constructed, the topological overlap matrices (TOMs) are saved, however, we also suggest saving the soft-thresholding results, the tree cutting results, the module labels, colors, and eigengenes so they can be easily reloaded later without re-running any of the above code.
Pause point: This is a good stopping point, as you can always come back and reload the network.RData file to come right back to where you are right now. To reload the network, use this code: 11. Finally, network results are all labeled with the identifiers from the expression matrix, however, it would also be useful to have them labeled with alternative identifiers that map to other applications we intend to use, including gene ontology analysis software and enrichment analysis with our curated gene lists. A common identifier for both of these applications in our case is gene symbol, however, this code can be modified for other organisms and ontologies. Step 3: Gene ontology analysis Timing: 30 min Next, perform gene ontology analysis for each module in the network to identify the functional categories represented by each module (Figure 1). There are numerous tools for conducting gene ontology analysis; however, our tool of preference for this project was ToppFun within the ToppGene suite (Chen et al., 2009). This web interface tool searches for enrichment across an extensive set of functional categories, including gene ontology and pathways, but also human and mouse phenotypes, publications and published co-expression data sets, gene families, microRNAs, drugs, and diseases. The background geneset used in ToppFun is the full set of genes present in the gene ontology categories that are being analyzed for enrichment.
Pause point: From this point on, each step break can be considered a good stopping point, just be sure to save your script.
12. In order to port our modules into the web interface and for ease of browsing, write each module out into a .csv file. For example, this code chunk writes out the genes in module 1 to a file that can be uploaded to the web interface.
13. Using the column of identifiers from this table, copy and paste into the ToppFun input box and begin the analysis. For each module of genes, ToppFun will report back a summary of all the enriched gene sets. Click the button at the top of the results page reading ''Download All'', and save the results for browsing in R. ToppFun reports back the significance of each enrichment with an assortment of multiple testing correction methods, the number of hits for that category from your query ("Hit Count in Query List"), and the total number of genes in the category ("Hit Count in Genome"). ToppFun returns three different adjusted p-values: the Bonferroni, Benjamini & Hochberg, and Benjamini & Yekutieli adjusted values. We recommend using a threshold of Bonferroni corrected p-values less than 0.05 to identify significant enrichments. This is the most conservative approach, but given the large number of gene ontology categories, we have deemed the conservative approach appropriate.
Step 4: Module enrichment Timing: 20 min 14. Now that a list of genes implicated by GWAS has been produced, we can identify which coexpression modules are enriched for GWAS genes and for the genes previously identified as causal of related monogenic diseases and phenotypes in mice. a. First, we define the set of modules in our co-expression network for analysis.
b. Then, a loop is initiated that begins by defining an object containing annotated information for the genes in each module. Next, a fisher's exact test is applied to measure the statistical significance of the representation of genes implicated by the GWAS study. In c. Finally, a table is produced that condenses results for each test, reporting the p-value and the odds ratio, for each module. A p-value adjustment is applied using the Benjamini & Hochberg (FDR) method and the results are returned, sorted by adjusted p-value. We recommend a threshold of FDR < 0.05 to identify enriched modules.
d. This process can be repeated for each module from the co-expression network and the results can be compared using a scatter plot of the results (Figures 2A, 2C, and 2D), where the enrichment and significance are plotted and each point represents a module. In this case, one module is among the most highly/significantly enriched across all gene sets, which is characteristic of a ''core'' module. The module that scores highest across all gene sets is a good candidate for a ''core'' module.  e. In this step of the analysis, the user will need to set their specific f. In order to better understand the biological functions that the enrich modules represent, refer back to the results of the gene ontology analysis conducted in section 3. Filter the results of the module you identified in this section to see which gene ontology terms are enriched in your module of interest.
Step 5: LD score regression

Timing: 1 h
Another lens through which we can understand the relationship between the modules in the coexpression network and the GWAS results is through partitioned heritability analysis using LD score regression. This workflow is used to identify modules that are composed of genes related to variants that are enriched for trait heritability in the GWAS study. LD score regression is conducted using the ldsc package, which takes GWAS summary statistics, baseline linkage disequilibrium measurements, and gene sets in which to identify enrichment (Finucane et al., 2015). The partitioned heritability of the SNPs in the regions surrounding the genes in each module is calculated to identify the modules that are most highly enriched for heritability for the trait of interest.
Note: While this package is not wrapped for R, it is accessible in python using the command line, so the code chunks from this section cannot be run in R! The ldsc package wiki on github has a detailed tutorial for the following steps, including links for downloading required files.
15. The first step of LD score regression is to format the GWAS summary statistics for using in the ldsc algorithm, the merge alleles file is provided with the ldsc package

OPEN ACCESS
16. Next, a set of SNPs associated with each module are identified via the genes in each module. This function requires a gene set for each module (as Ensembl gene IDs), a file indicating the coordinates of each Ensembl gene ID (ENSG_coord.txt, provided in the ldsc documentation), the plink file for each chromosome (1000G_plinkfiles/1000G.mac5eur., provided in the ldsc documentation), and a path for the annotation output, indicating the module and the chromosome number. This will need to be run for each chromosome across all modules.
17. Finally, using all of these annotations, run ldsc using the (1) processed summary statistics, (2) the base annotation paths for the modules' annotations, (3) the SNP weights and (4) frequencies for the European 1000 Genomes data that are provided with the ldsc package. The overlap annotations flag was used because transcripts were used to generate the co-expression network, so the gene sets are non-disjoint. Finally, a base name for the output is provided. This command will output a log file, recording the command used to generate the output and a results file. The results file contains a table reporting the proportion of SNPs associated with the gene set, the proportion and standard error of heritability in those SNPs, the enrichment and standard error of the enrichment, and a p-value indicating the statistical significance of the enrichment (Table 1). This table can be filtered to identify the modules that are significantly enriched for trait heritability, and ordered by enrichment to identify the most enriched modules. Step 6: Gene-level analysis prioritization

Timing: 30 min
Once key modules enriched for genes exhibiting core-like properties are identified in sections four and five, the next step is to use these modules as a platform for identifying key genes influencing the phenotype of interest. In sections seven and eight, it may not be feasible to analyze every gene in an enriched module. The topological properties of the nodes within the module can be used to prioritize genes for colocalization and phenstat analysis. In this section, the WGCNA package is used to calculate the module membership score for each gene within a module of interest.
18. Quantitatively, the module membership score is the correlation between the module's eigengene, which describes the collective expression profile of the group of genes, and each individual gene's expression pattern. Module membership is highly correlated with intramodular connectivity, and thus, intramodular hub genes tend to have high module membership.
19. Once module membership is calculated for all transcripts, we can annotate all the genes in our modules of interest with their module membership scores, sort the list to identify those genes with the highest module membership scores, and carry those into steps seven and eight.

Timing: 1 h 30 min
Colocalization analysis is conducted to provide support for hypotheses linking individual genes from core modules with the GWAS phenotype. Colocalization analysis is conducted to evaluate whether two loci share a causal variant. In this instance, the coloc package is implemented to identify relationships between trait-associated GWAS loci and cis expression quantitative trait loci (eQTLs) from Gene-Tissue Expression project (GTEx), however, any relevant eQTL dataset could be used in this step (Giambartolomei et al., 2014, GTEx Consortium et al., 2017. The coloc package evaluates five different hypotheses regarding the relationship between the two associations and returns the probability of each as a value between zero and one: the H0 hypothesis, that neither association ll OPEN ACCESS is significant, the H1/H2 hypotheses, that only one of the associations is significant, the H3 hypothesis, that both associations are significant but do not share a causal variant, and the H4 hypothesis, that the two associations are significant and share a causal variant. We are interested in identifying the pairs of eQTL and GWAS associations that have a high value of H4 (Table 2).
Note: If no single module is identified as a potential core module in steps 4 and 5, the results may need to be aggregated to create a ranking, see troubleshooting, Problem 1.
20. First, for the colocalization analysis, eQTL information from the GTEx consortium will need to be downloaded and filtered (GTEx Consortium et al., 2017). The data are available for download at https://gtexportal.org/home/datasets.

Note:
The tarball required for this analysis is 188 Gb in size. It would not be advisable to operate on this data on a laptop, though it may be possible. It is recommended that this file is downloaded directly to a remote server using the command line.
Note: It is possible to use eQTL from any source, however, the below code is designed to work with GTEx data and may need to be modified to work with other eQTL sources.
21. Once the full eQTL data are acquired, the associations for each gene will need to be extracted from each tissue file. This can be accomplished using awk on the command line. These files can all be saved to a subdirectory of the downloaded github repo directory and read in as a part of the loop.
22. Furthermore, there are a few additional pieces of information required for input to coloc that have not been used in previous steps of this analysis. The sample size for each underlying GWAS or eQTL study, and the minor allele frequency (MAF) of the variants in each study are required. For the GTEx v7 eQTL studies, a key is provided with the sample sizes for each of these studies. The sample size for the GWAS study should be included in the summary statistics or associated manuscript. Additionally, the MAFs are included as part of the GTEx eQTL association table, and the frequencies of the alleles in the GWAS study are typically reported, however, the MAF is not always the reported frequency, so you may need to convert these frequencies if  any are above 0.5. If MAFs are not reported, they can be sourced from the 1000 Genomes project using the LDLink tools.
23. With all of this data read in, a loop can now be run that reads in and formats eQTL and GWAS data and runs coloc for each file in the eQTL data folder. The results of the five hypothesis test for each eQTL/GWAS association pait are saved in a dataframe. a. First, an empty dataframe is initiated b. Next, we initiate a loop that iterates over each eQTL file, reads it in, and formats it with appropriate variant IDs.

Timing: 1-2 hrs
While the colocalization analysis provides evidence supporting a relationship between network identified genes and a trait of interest, a causal relationship can only be demonstrated through controlled perturbation of a target and direct measurement of the phenotype of interest. The hypotheses generated by the above steps can lead to a novel set of experiments, however, databases of experimental perturbations and measured phenotypes can be mined for evidence supporting a causal relationship between a gene and a phenotype of interest. For example, the International Mouse Phenotyping Consortium has a database of phenotypes measured in 7022 strains of knockout mice as of release 12.0 (Koscielny et al., 2014 c. IMPC reports which genotype/phenotype relationships are significant above a multiple testing threshold for all genes; however, they make all data available for testing a single hypothesis. d. The data may be downloaded by clicking on a relevant phenotype icon (https://www.mousephenotype.org/data/charts?accession=MGI:3041155&allele_accession_id=MGI:4434237& pipeline_stable_id=ESLIM_001&procedure_stable_id=ESLIM_005_001&parameter_stable_ id=ESLIM_005_001_004&zygosity=homozygote&phenotyping_center=ICS), scrolling to the ''Access the results programmatically'' section and clicking to download the ''PhenStat-ready raw experiment data''. e. This data can be loaded in R and the analysis can be run using the PhenStat package (Kurbatova et al., 2015). First, the data are read in: Figure 3. Identifying core genes from a core module (A-D) represent the results of positive colocalization analyses, where there was a sufficiently high PP H4 to indicate that the BMD GWAS signal and the GTEx eQTL for the given gene shared a common genetic driver.
(E-H) represent significant differences in a phenotype (bone mineral density) in knockout mice from the IMPC database, as analyzed by PhenStat. Adapted from Figure 5 in (Sabik et al., 2020). ll OPEN ACCESS f. Next, a PhenList object is created for Phenstat g. Next, use the testDataset function to test for differences in a dependent variable, here "Value". The program will choose whether to keep specific model effects, and report if it corrects for batch, weight, and sex, or an interaction term. Additionally, it does not detect a difference in variance between the genotype groups.
h. Use the summaryOutput function to view the results of the statistical test for comparing the knockout phenotype against the control.
i. Finally, create a boxplot of the differences in the value between genotypes for both sexes using the boxplotSexGenotype function, or the boxplotSexGenotypeBatchAdjusted if the effect of batch was significant.
j. The adjusted variables can also be extracted, using the getColumnWeightBatchAdjusted function, and plotted using a different package, for example, ggplot2 ( Figures 3E-3H).

EXPECTED OUTCOMES
The expectation of this analytical pipeline is that the results of the module enrichment analyses will identify one, or a small set of modules that will harbor significant enrichment for genes that exhibit core-like properties. As an example, in the original implementation of this pipeline to study bone mineral density, one group of co-expressed genes, the purple module, was the most highly enriched for genes associated with a number of core-like properties. The purple module was the most highly enriched (1) for genes associated with genes that, when knocked out in mice, exhibited an effect on bone mineral density, (2) for genes related to osteoblast-related monogenic diseases, including osteogenesis imperfecta, hyperostosis, and osteosclerosis, and (3) for genes implicated by GWAS and GWAS heritability ( Figure 2). However, the expectation that there will be a clear subset of modules enriched across all core-like properties that may not hold, see troubleshooting Problem 1.

LIMITATIONS
This protocol relies heavily on publicly available data, which can be both a benefit and an obstacle. It may be the case that there is useful data in each of the resources listed here, however, it is likely that only a subset of applicable data is available for the phenotype in question. In this case, specific, bespoke experiments may need to be performed to demonstrate the hypotheses in question. Additionally, this protocol requires knowledge of the R language and running tools on the command line, including python packages and awk. While the vignette on Github was created to help novice users by providing example data and examples of each step of this protocol, newer users may still need to learn some of the skills underlying the protocol to execute it fully.

TROUBLESHOOTING Problem 1
In the before you begin section, in ''Curating lists of known disease and phenotype genes'' , you may find that the gene lists you identify for your trait of interest from the databases mentioned are not sufficiently large to identify any enriched modules

Potential solution
While we list a couple of options for sources of disease and phenotype associated genes, the genes relevant to your analysis may not be as heavily represented and curated gene lists may need to be manually curated from the literature.

Problem 2
In steps three and four, gene ontology analysis and enrichment analysis, it may be the case that too many of the modules from the co-expression network are too small to be significantly enriched for GO terms, GWAS genes, or genes associated with monogenic diseases.

Potential solution
This can be addressed by increasing the minModuleSize parameter in the network construction command in step two. This will result in a small number of larger modules, which may elide some nuances between smaller modules, but will allow for more successful enrichment analyses.

Problem 3
In steps three through five, gene ontology, enrichment analysis, and LD score regression, you may find that no single module is remarkable across the GWAS, phenotype, and monogenic diseaserelated gene enrichment analyses and partitioned heritability analysis.

Potential solution
If this is the case, optimizing over the module ranks for each enrichment will lead to a prioritized list of modules, which can be carried into the steps seven and eight, colocalization and Phenstat analysis.

Problem 4
In step seven, colocalization analysis, there are a limited number of tissues in GTEx. It is possible that none of the tissues present in GTEx are relevant to the GWAS phenotype at the center of the analysis.

Potential solution
In the protocol, the solution implemented is to analyze all available tissues, which can still surface pan-tissue effects. Another option is to use an eQTL dataset more fit for purpose. Recently, the eQTL Catalog has been made available by the EMBL-EBI (Kerimov et al., 2020).

Problem 5
It is assumed that there is a viable surrogate phenotype in the IMPC for step eight, however there is a limited number of phenotypes available from the IMPC and there may not be one that is fit for purpose.

Potential solution
If there is not a suitable surrogate phenotype in the IMPC database, additional experiments may be required to demonstrate a causal relationship between the gene and phenotype of interest.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Charles Farber (crf2s@virginia.edu).

Materials availability
This study did not generate new unique reagents.

Data and code availability
The accession number for the RNA sequencing data used as an example in this paper is GEO: GSE134081. The code is available on github at: https://github.com/Farber-Lab/STAR_protocols_ core_modules.