Connectivity in eQTL networks dictates reproducibility and genomic properties

Summary Expression quantitative trait locus (eQTL) analysis associates SNPs with gene expression; these relationships can be represented as a bipartite network with association strength as “edge weights” between SNPs and genes. However, most eQTL networks use binary edge weights based on thresholded FDR estimates: definitions that influence reproducibility and downstream analyses. We constructed twenty-nine tissue-specific eQTL networks using GTEx data and evaluated a comprehensive set of network specifications based on false discovery rates, test statistics, and p values, focusing on the degree centrality—a metric of an SNP or gene node’s potential network influence. We found a thresholded Benjamini-Hochberg q value weighted by the Z-statistic balances metric reproducibility and computational efficiency. Our estimated gene degrees positively correlate with gene degrees in gene regulatory networks, demonstrating that these networks are complementary in understanding regulation. Gene degrees also correlate with genetic diversity, and heritability analyses show that highly connected nodes are enriched for tissue-relevant traits.


In brief
Networks provide a way to integratively characterize large collections of genomic associations. To assess the robustness of eQTL networks, Gaynor et al. present an extensive evaluation of network and metric specifications. They present a robust method for network edge definition and show its value in downstream gene-and trait-based analyses.

INTRODUCTION
Most human traits and diseases are influenced by many genetic variants that act in concert to alter cellular function (Hawkins et al., 2010). Experimental evidence has demonstrated that the overwhelming majority of trait-associated variants are enriched within regulatory elements (Albert and Kruglyak, 2015;Ward and Kellis, 2012;GTEx Consortium, 2015). Expression quantitative trait loci (eQTL), which associate genetic variants with gene expression, yield a substantial over-representation of genomewide association study (GWAS) variants as eQTLs relative to expectation (Morley et al., 2004;Cheung et al., 2005;Schadt et al., 2005;Nicolae et al., 2010). This suggests that eQTLs play an important role in the causal pathway between genetic variants and phenotype, and it is further evidenced that tissue-specific disease-linked eQTLs are enriched in relevant tissues (Dermitzakis, 2008;Fagny et al., 2017). Common approaches to genetic and genomic analyses such as eQTL mapping consider only pairwise associations, failing to elucidate the molecular mechanisms by which multiple genetic variants relate to expression across genes (Ward and Kellis, 2012;Korte and Farlow, 2013). Integrative analyses of the complex relationships between genetic and genomic features that are reproducible and accurately represent biological relationships are thus of increasingly significant importance.
Network analyses provide an integrative approach to characterize complex genomic associations (Barabá si et al., 2011). We can identify genetic variants and genes that collectively influence cellular processes to drive phenotypes using networks (Platig et al., 2016;Fagny et al., 2017). Bipartite networks naturally MOTIVATION Expression quantitative trait locus (eQTL) analysis has been transformed by the introduction of network representations of eQTL SNP-gene associations. However, these have relied on using sometimes arbitrary thresholds to turn functions of these associations such as false discovery rates into binary presence/absence representations. We sought to identify optimal methods for defining eQTL networks and metrics that balance statistical robustness, computational complexity, and biological discovery.
represent eQTL associations, where the edges between SNPs and gene expression indicate the eQTL association (Asratian et al., 1998;Platig et al., 2016;Fagny et al., 2017;Barber, 2007). Well-defined features of a network can elucidate genetic regulation and inform function. The degree is a measure of network centrality that is associated with how essential a node is to function. For example, nodes that are more densely connected can represent natural divisions of functional relatedness. This representation has been shown to identify biological effects in chronic obstructive pulmonary disease (COPD) (Glass et al., 2014), where GWAS-identified SNPs were most central among groups of functionally related features (Platig et al., 2016;Fagny et al., 2017;Cho et al., 2014;Nicolae et al., 2010).
Many network approaches have been introduced to model genetic and genomic data. Such approaches often use methods like correlation and association-based measures to define networks; other approaches, including Bayesian network analysis, infer directed acyclic graphs to model causal effects from observational data (Zhu et al., 2004(Zhu et al., , 2012Yazdani et al., 2016;Sedgewick et al., 2019;Badsha and Fu, 2019). Some approaches permit the use of many data types (such as multiomics data), incorporate prior evidence in modeling, and may allow missing data (Howey et al., 2021). Network methodologies vary in the assumptions made, often using multivariate distributional assumptions in order to obtain conditional dependencies between nodes. These assumptions may be violated in various settings, such as genetics settings with pleiotropy (Howey et al., 2020), and complex models often cannot operate on summary statistics. A straightforward approach is to define sets of associations directly as networks and permit them to represent a large graph from which one can perform secondary analyses, such as community detection, while maintaining the complexity of the associations (Platig et al., 2016).
Existing eQTL network analyses in particular have constructed networks using thresholded estimates from the eQTL analysis regressing gene expression on genotype (Albert and Kruglyak, 2015). This approach desirably imposes a small computational burden, as the network is limited to the sparse set of edges meeting a threshold but requires an informed threshold and selection of association measure, often selected in a semiquantitative manner. Reducing eQTL associations to such indicators to build a network may be detrimental by discarding potentially valuable data, detracting from potential reproducibility, and ultimately limiting the ability to perform informative downstream analyses. Methods that are more robust than dichotimization or based on rigorously defined measures or thresholds may overcome these limitations (Zhu et al., 2004(Zhu et al., , 2012Yazdani et al., 2016;Sedgewick et al., 2019;Badsha and Fu, 2019). However, approaches that include fully weighted networks or that cannot operate on summary statistics have greater computational burden, given the need to retain and operate on output from millions of regression models, and do not necessarily ensure improved biological insight. It is thus critical to comprehensively evaluate potential network specifications in order to fully characterize eQTL network degrees, indicate robustness of networkbased findings, and provide further biological insight.
In this article, we consider a comprehensive set of network representations of the SNP-gene association specifically toward estimating degree, a measure of how central a node is within the network. In an analysis of twenty-nine tissues from the Genotype-Tissue Expression (GTEx, v.8 release) project, we construct eQTL networks and estimate degree metrics for nodes. We consider definitions that vary with respect to estimation method, model output retained, threshold in dichotomized settings, and inclusion of weights. We evaluate the network reproducibility by considering consistency of degree in split-sample tissue-specific networks and across tissues for tissue-specific networks. Given our network characterization, which balances stability of SNP and gene degree with computation efficiency, we investigate the relationship of gene degree in the eQTL network to two other gene network types, gene regulatory networks and gene co-expression networks. We then consider the biological informativeness of the tissue-specific eQTL networks by evaluating the relationship of gene degree to genetic diversity measures and heritability enrichment of blood traits. Our results demonstrate that the topology of well-defined eQTL networks relates genomic features to genetic diversity and trait heritability.

RESULTS
eQTL networks are dependent on edge and degree definition We mapped eQTLs from the genotype and RNA sequencing (RNA-seq) data for twenty-nine tissues (sample size n = 202-706) from the GTEx v.8 dataset (https://gtexportal.org/home/ datasets). After data processing primarily to impute variants and normalize expression data, we retained 5,339,781 SNPs for all observations and 24,138 genes, on average, across tissues. We performed exhaustive eQTL mapping, adjusting for sex, genotyping platform and protocol, PEER factors, and the first five principal components by tissue (STAR Methods; Table S1); cis-eQTL mapping was performed for variants within 1 Mb of a gene's annotated transcriptional start site. We identified 806,182 (liver) to 3,930,834 (thyroid) cis-eQTLs and 18,037 (liver) and 110,952 (thyroid) trans-eQTLs at a false discovery rate (FDR) threshold of 0.05. We compared our cis-eQTL associations with those reported by the GTEx Consortium; on average, 79% (SD = 2%) of our cis-eQTLs were also in the GTEx results. These differences may be attributable to the genotype preparation and analysis methods.
We constructed eQTL networks from these results based on edge definitions varying in sparsity, estimation method, and weighting ( Figure 1). The ''sparse'' representation includes edges where associations met a measure of significance below a threshold, t, where t was set equal to 0.05, 0.1, 0.15, and 0.2. These measures included the q value (QV) as defined by Storey et al. (Storey, 2002;Storey et al., 2003;Storey and Tibshirani, 2003), the local FDR (LFDR), and an adaptation of the Benjamini-Hochberg (BH) procedure for calculating the QV (see STAR Methods). These methods differ in their computational efficiency and probabilistic interpretation. We extended these unweighted sparse networks to be weighted by the eQTL Z-statistic to provide a measure of the strength of association. We also considered a so-called ''denser'' network representation that includes edges defined by the p value for all tested associations. p values permit natural definitions of network metrics. Networks were constructed for SNP-gene pairings on both a genome-wide and location-specific scale; in the genome-wide setting, edges are defined without regard to location, whereas the location-specific setting distinguished between cis and trans effects. Given the variety of methods for defining eQTL weights, we calculated degree metrics using definitions appropriate for the choice of edge weight. In the sparse settings, the degree was defined as the summation of all edges connected to a node, as is standard; for the denser representation, the degree was defined as the proportion of non-null proportion (NP) edges connected to a node. We calculated the degree metrics for the location-specific networks. In the sparse setting, we calculated the degree metrics for networks of all SNP-gene pairings from both the genome-wide and location-specific networks; the dense representation only permitted a genome-wide degree when considering all SNPgene pairings.
Most association tests did not result in network SNP-gene edges (Figure 2), as expected given the relatively small number of eQTLs across the genome. The number of edges in the sparse networks generally decreased with decreasing sample size, as expected given corresponding decreasing power. The BHbased network consistently had fewer edges, followed by the LFDR-based network. The sparse networks had more nonzero edges when relaxing the threshold t, as evidenced by the lighter shades within stacked bars in Figure 2. This is to be expected because it permits more ''suggestive'' associations as edges beyond those meeting more rigorous criteria. The NPdefined network includes edges for all associations by definition.
The distributions of the SNP degree across tissues where t = 0.05 and edges are weighted by the magnitude of the eQTL Z-statistic are given in Figure 2. The distributions are right skewed, indicating that most SNPs have few connections and few SNPs are associated with many genes or with high effect size; this trend is observed for all t as well as for gene degree. The unweighted degree estimates are less right skewed given that the strength of association is not incorporated to further inform the integer-valued degree. The unweighted-degree estimates are highly correlated with the weighed degree (STAR Methods; Table S2). The NP method yields probabilistic degrees for all nodes, most of which were estimated to be near or equal to zero. Consistent with previous findings that cis-eQTLs are more commonly identified than trans-eQTLs due to multiple testing challenges, the majority of SNP-gene edges are cis-eQTLs. The cis component constitutes 100% of the degree magnitude for most nodes, as variants with regulatory functions tend to act locally (within the megabase window used to define cis associations).

Degree definition determines stability
Within-tissue reproducibility The stability of eQTL networks and their metrics is dependent on network definition and sufficient sample size. The reproducibility of degree estimates in independent and reduced samples were evaluated by computing the SNP and gene degree metrics for random sample splits and assessing the concordance between the split-sample degree estimates via Spearman correlation. This was repeated and averaged across five subsamples for the entire network (accounting for location in the sparse networks) and was restricted to cis and trans associations to account for variability ( Figure 3; Table S3).
Of the sparse networks, the BH-based network had the highest average SNP degree correlations between sample splits for the tissues (t = 0.05; unweighted median: 0.19, interquartile range [IQR]: 0.22; weighted median: 0.44, IQR: 0.19). The LFDR-(t = 0.05; unweighted median: -0.02, IQR: 0.33; weighted median: 0.14, IQR: 0.36) and QV-based networks (t = 0.05; unweighted median: -0.10, IQR: 0.34; weighted median: À0.003, IQR: 0.36) had lower average sample-split correlations for SNP degree. Each of the thresholded degree definitions demonstrated higher correlations for the weighted measures, which allows for greater granularity in the degree distribution. The average degree correlation between the splits decreases with increasing threshold t for most definitions. A relaxed threshold may introduce more edges than can be stably estimated due to larger errors. The NP-based degree in the network of all eQTLs had averaged sample-split correlations with median -0.34 (IQR: 0.004) for SNP degree. This more dense network representation was not positively correlated across sample splits in any tissue. This is likely attributable to the instability of estimating null proportions among a large set of primarily null tests (Storey, 2002;Storey et al., 2003;Storey and Tibshirani, 2003) and a general lack of spread in the degree. The results for gene degree were notably more consistent across methods, which may be  (Table S3).
The correlations varied across tissues; the average correlation between the sample splits increases on average with increasing sample size. Previous eQTL studies and power calculations have demonstrated that typically larger sample sizes than these subsamples (n = 101-353) are required to confidently map eQTLs (Huang et al., 2018). There was moderate concordance between the subsample and full-sample degree metrics illustrating the lack of stability of an eQTL network in a small subsample, suggesting the need for a larger minimum sample size for networks and possibly reflecting effects of tissue heterogeneity. The cisspecific networks, based on a smaller set of potential eQTLs and thus with a decreased multiple-testing burden, have a notably higher correlation on average than the trans-specific networks. The complete networks accounting for location are similar to the cis-specific networks, consistent with the dominance of cis associations in eQTL analyses. The BH-based SNP degree has a notably higher correlation for the trans-specific networks, which may be attributable to the nature of the significance measure calculation, which was estimated per SNP for other measures but considered the complete set of results for the efficient BH computation. The advantage of including both cis and trans associations is that it allows both local and nonlocal regulatory effects to be modeled and results in networks that exhibit higher-order structure based on those nonlocal associations.

Cross-tissue correlation
We compared the degrees identified in the tissue-specific networks across tissues using Spearman correlation; we expect moderate correlation across all tissues given that cis-eQTLs primarily contribute to the degree measures and are more often replicated across tissues (Gamazon et al., 2018), reflecting the fact that cells need to carry out a large number of core processes, such as respiration and metabolism, independent of tissue. This correlation was again performed for the complete, cis-, and trans-specific networks (Figure 4; Table S4).
Across all eQTLs, the correlation of SNP degree across the tissues was again higher for the BH degree (t = 0.05; weighted median: 0.27, IQR: 0.14) than the QV-based SNP degree (t = 0.05; unweighted median: 0.11, IQR: 0.16, weighted median 0.20, IQR: 0.16) and the LFDR-based SNP degree (t = 0.05; unweighted median: 0.15, IQR: 0.16; weighted median: 0.26, IQR: 0.18). The correlation of NP-based SNP degree across tissues had median -0.33 (IQR: 0.01). This further suggests that this NP-based network method, using a degree based on the estimation of the proportion of non-null hypotheses when it is very small, is not reliable, as we expect positive correlation between tissues for cis-specific degrees based on shared genetic regulation.
For all levels of thresholding and for both genome-wide and location-specific definitions, the correlation between tissues is again slightly increased by including weights. Higher, or more relaxed, thresholds led to lower average pairwise correlations between tissues (Table S4). A potential contributor to these trends may be that given the relaxed threshold, there is an increased potential for identifying sub-threshold tissue-specific trans-eQTLs as trans-eQTLs have lower power for eQTL mapping than cis-eQTLs. Given the stability demonstrated by the BH-based network and computational benefits in speed and memory, the subsequent analyses are presented primarily for the BH-based weighted degree with t = 0.05, and secondarily for all other degrees as reported in the supplemental information.
Relationship to other gene networks Many types of networks have been used to study biological systems, including correlation-based and regulatory networks. We compared the eQTL networks with gene regulatory and genegene correlation networks via the gene degree. We constructed tissue-specific gene regulatory networks using Passing Attributes between Networks for Data Assimilation (PANDA), which uses gene expression, transcriptomic, and transcription factor protein-protein interaction data to infer regulatory associations  (Glass et al., 2013;Schlauch et al., 2020). PANDA network analysis provides insight into genes' regulatory function as targets. We also used weighted correlation network analysis (WGCNA), which builds gene-gene pairwise correlation networks based on gene expression data (Langfelder and Horvath, 2020). We compared the tissue-specific correlation of the gene degree in the eQTL networks to gene degree in PANDA and WGCNA networks in each tissue and then performed meta-analysis (Tables S5 and S6).
The meta-analysis correlation was estimated to be 0.05 (95% confidence interval [CI]: (0.04, 0.07)) between the BH-based weighted gene degree (t = 0.05) and the degree from the PANDA network. PANDA networks, like eQTL networks, seek to capture genetic regulatory processes, which may explain their positive correlation. The gene eQTL network degree and WGCNA degree had a meta-analysis correlation of -0.02 (95% CI: (-0.04, -0.01)). The correlations were replicated using other sparse network definitions (Table S6). These incongruent findings between other gene network types are consistent with the notion that co-expression and regulatory networks capture different biological features. A correlation-based network finds genes whose expression levels are similar to each other; genes in an eQTL network are expressed at a level correlated with the genotype at that locus. Further, cis-eQTLs, which are the dominant edges in the network, generally are associated with SNPs falling in regulatory regions, potentially disrupting transcription-factor binding. Thus the extent of a gene's complex regulatory role would more likely be similarly reflected in both eQTL and gene regulatory networks.

Degree correlates with genetic diversity
We assessed the relationship between the degree and genetic evolution and diversity by calculating the correlation of genelevel degree (BH-based weighted gene degree, t = 0.05) and both nucleotide diversity and Tajima's D at the gene level (Danecek et al., 2011;Nei and Li, 1979;Tajima, 1989). Nucleotide diversity measures genetic variation based on the number of nucleotide differences between sequences, permitting insight into a population's mutation rate. Tajima's D considers nucleotide differences as well as the number of segregating sites to then assess whether neutral evolution, as in mutation-drift equilibrium, or selection is occurring. The meta-analysis correlation across tissues between gene degree and nucleotide diversity was estimated as 0.13 (95% CI: (0.12, 0.15)); the correlation between gene degree and Tajima's D was estimated as 0.15 (95% CI: (0.14, 0.15)). These results, consistent across sparse degree definitions (Tables S7 and S8), show statistically significant, positive associations between network gene degrees and genetic diversity. Given that eGenes (genes whose expressions are associated with at least one eQTL) are, by definition, more connected in the network, this is consistent with previous findings in plants where eGenes had higher genetic variation and Tajima's D (Mä hler et al., 2017). Thus, this similarly suggests that genes less connected within the network are under relatively stronger selective constraint. These results indicate that genes that are more central to the network experience increased rates of molecular evolution, at least in terms of the regulatory processes that control them, and evolve under decreased selective constraint. However, previous analyses have found that eQTL network hubs are less likely to be associated through GWAS with disease processes than are nodes of intermediate degree (Platig et al., 2016;Fagny et al., 2017), indicating this flexibility in regulatory constraint might also be linked to the functional roles played by these eGenes.

Heritability enrichment of degree
We evaluated whether the degree was enriched for trait heritability using S-LDSC on a set of six tissue-specific networks (artery aorta, coronary, and tibial; heart atrial appendage and left ventricle; whole blood) and seven relevant complex bloodrelated traits (eosinophil, high-and low-density lipoproteins [HDLs and LDLs, respectively], platelet count, red blood cell [RBC] width, red cell count, and white cell count) as analyzed in UK Biobank . We considered both SNP-and gene-level annotations (BH-based weighted degree, t = 0.05). We conditioned, on the baseline-LD model, a heritability model comprised of 97 annotations that has been demonstrated to be highly informative by capturing functionality, conservation, histone marks, and other variant-specific features (Gazal et al., 2017). We thus account for these existing functional annotations and evaluate the added value of our network annotations in capturing trait heritability. We identified the greatest enrichment across networks for the trait RBC width, with estimates ranging from 3.29 (p = 8.84 3 10 -6 ) to The correlation of estimated SNP degrees across split tissue samples is given using the unweighted and Z-statistic weighted approaches based upon the QV, LFDR, BH, and NP for each threshold t (0.05, 0.1, 0.15, and 0.2) given on the X axis. The distribution is given for the full network and stratified into cis-and translocation-specific networks.
Cell Reports Methods 2, 100218, May 23, 2022 5 Resource ll OPEN ACCESS 4.25 (p = 2.06 3 10 -9 ) for the SNP-based annotation and ranging from 1.87 (p = 2.55 3 10 -9 ) to 2.45 (p = 2.25 3 10 -11 ) for the gene-based annotation ( Figure 5). The degree annotation was significantly enriched for the majority of tissue-trait pairings, for both the SNP-and gene-level degrees (Tables S9, S10, and S11). However, none of the t* estimates were significant in these analyses, a measure that accounts for other functional annotations. Thus, given the lack of statistical significant of the effect size t*, similar to previously constructed network annotations, the network annotations do not provide significant heritability enrichment beyond the baseline-LD model (Kim et al., 2019).

Computational cost
The network construction and metric calculation methods we used incurred significantly different computational burdens. The least costly method is the BH-based approach. This approach only requires storing SNP-gene associations meeting the particular threshold t in memory. Further, the degree is calculated via simple summation. Given the high proportion of null associations, this means oftentimes that the association will not need to be stored in memory, and one can capitalize upon current eQTL software that allows for efficient regression for large datasets (Shabalin, 2019). The QV-and LFDR-based approaches require retaining the p values for all associations intermediately, but they can be stored sparsely. The most computationally costly approach is the NP-based network, which requires an edge weight to be calculated and retained in memory for all tested associations. A summary of the computational impact is given in Table S12. We compared the computational impact of calculating the degree of 10,000 SNPs for our considered set of 24,634 genes in the largest tissue (skeletal muscle) repeated across five iterations. The run time, including I/O, was based on a 2.70 GHz laptop with 16 Gb of memory. We observe that the location-specific computation, retaining only edges meeting the minimum threshold for each of cis and trans edges, is over four times faster than the exhaustive genome-wide computation. This is significant considering that studies are currently growing in the number of genotyped SNPs as well as those that can be imputed with high confidence; therefore, scalability is of high importance.

DISCUSSION
We performed a comprehensive analysis of eQTL networks in twenty-nine different tissues from the GTEx project. Using this unique resource, we tested multiple approaches to reconstructing these networks and interrogated the stability of their resulting properties. We found that the threshold significantly impacted network size, with more stringent thresholds yielding sparser, consistently defined networks. We also observed that the stability of our edge definitions propagated through to affect node degree as well, where correlation-based analyses demonstrated that degree estimates were more consistent in split-sample and cross-tissue networks using these more stringent, sparse definitions. We explored the relationship between eQTL networks and gene regulatory networks, observing a positive correlation between gene degree in our constructed eQTL networks and gene regulatory networks. There was a negative correlation between gene degree in our eQTL networks and gene coexpression networks, suggesting that eQTL networks may be appropriate to consider alongside other networks, particularly as eQTL networks uniquely permit the inclusion of SNPs as nodes and so sample the genetic factors associated with phenotype. We also observed a connection between eQTL network topology and evolutionary processes. Specifically, more highly connected genes in eQTL networks correlate with increased evolutionary rates, indicating that they are reflective of evolutionary processes. We also observe heritability enrichment of blood-related traits for highly connected SNPs and genes in trait-relevant tissues, indicating the informativeness of these network features.
We found that a thresholding approach for constructing network edges was both highly computational efficient and led to stable network properties, including degree. In particular, a weighted BH-based network had the highest correlations in split samples and across tissues; the other thresholding approaches Resource ll OPEN ACCESS often performed well with similar trends. We also observed this consistency of the thresholding approach when we compared the eQTL network metrics with gene networks and heritability measures. Further, the thresholded methods straightforwardly separate between cis-and trans-eQTLs when calculating the FDR rates in all settings (since they are based on location-specific and genome-wide associations, respectively) and likely reflects the fact that trans effects are biologically less common than cis effects.
The NP-based method did not perform well, which may be because the null proportion is challenging to stably estimate. This measure of the proportion of signals seeks to account for sparsity, and given that eQTL signals are very sparse among a large number of tests, methods for estimating the proportion of non-null do not work well Storey, 2002). Furthermore, this approach does not capture the distinction between cis and trans effects and considers all eQTLs together in genome-wide analysis; as a continuous measure, this may be too stringent since the cis and trans effects and mechanisms are different.
Our degree findings are also of computational importance given the substantial differences in the computational cost of inferring network edge weights. The BH-, QV-, and LFDR-based networks require fewer computational resources as they can be computed using matrix computations from summarized eQTL mapping results; the use of more stringent thresholds further reduce network-storage requirements and is supported by the consistent results on the downstream analyses across Figure 5. Heritability enrichment and t* estimates for degree conditional on the baseline-LD model thresholds. The NP degree measure, similar to many other existing network approaches, requires one to exhaustively output all eQTL relationships and then perform estimation on the entirety of the output. All of these degree measures can be completely parallelized for optimal computation, but the impact is nonetheless an important consideration, as one would expect to periodically repeat such analyses genome wide as datasets evolve.
We have been able to characterize the degree of nodes in eQTL networks under different settings and explore their biological implications. Further methodological work would include pursuing fully weighted representations of the eQTL network while calculating an estimate of the proportion of null for an SNP stratified by cis-and trans-eQTLs. It would be interesting to apply this framework to other biological QTL networks and so to allow for comparisons across QTL networks, as one would expect that SNPs associated with particular traits would also affect the expression of genes encoding proteins that regulate relevant biological processes. The value of considering network metrics is demonstrated in the secondary analyses, where their distinction from other networks and relationship to genomic features such as diversity are shown. These results further support integrating eQTLs with other measures of genetic association with phenotype.
The results we present here were based on eQTL networks, where the edge weights linking SNPs and genes are based on analysis of experimental data to identify associations between the genotype at a locus and the expression of each gene in the genome. However, the lessons we learned are broadly applicable to a wide range of problems in the inference of biological networks. Many real-world network analyses focus on metrics such as degree or betweenness centrality after binarizing the edges. However, in analyzing biological networks, we often have imperfect evidence or are modeling processes that are neither always ''on'' or ''off'' but instead occur with some likelihood. Understanding how edges are estimated, and the effect that different methodological choices have on downstream analyses and the overall stability of results, is important for further optimizing network methods. Robust methods for network inference and analysis will further our understanding of gene function and help identify downstream relationships with traits and diseases.

Limitations of the study
There are multiple limitations to the work presented here. First, we focused on network specification and its effect on the degree metric using the GTEx V8 data as a test case. While a key metric, the degree may not always be of greatest interest, and thus an association-based bipartite network may be better optimized for a different measure. However, any of the commonly used network metrics are based on edges, and the edge stability analysis we present in the context of degree is likely to affect other measures in a similar manner. We also note that eQTL association analysis methods and software are consistently improving and may permit further optimized computation; the trends we highlight, particularly in terms of storing and accessing large networks, would nonetheless persist. The resulting networks, metrics, and downstream findings may include false positive nodes, particularly in settings with relaxed FDR thresholds. In our ana-lyses, we used the most stringent threshold considered, and trends we observed were largely consistent across the twentynine tissues considered. Also, the heritability analysis we performed is limited by previously noted methodological limitations, including that S-LDSC requires sufficiently large and robust annotations for stable estimation, which limited our ability to incorporate weights in this aspect of the analysis.
The challenge that we face in validating network-based methods is that there is no source of genome-wide ''ground truth.'' The work presented here relies on estimating SNP-gene associations through eQTL analysis; these associations are supported by extensive data but are not individually experimentally validated. Consequently, when investigating thresholds on the individual estimates of SNP-gene associations, we must rely on other measures. Here, we appeal to properties such as network stability for identifying optimal methods. This is a reasonable strategy as a growing body of evidence indicates that while individual SNP-gene associations may not be reliable, the overall structure of eQTL networks can inform our understanding of the biology of the system under study. Consequently, consistency in estimating the presence of SNP-gene associations can improve our overall understanding of genetic regulatory processes.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following: In the sparse, unweighted adjacency setting, all edges are binary. Thus, we take the standard summation of binary edges to obtain a count of the number of edges or connections a SNP has to a gene (or a gene to all SNPs). For the sparse weighted setting, the degree incorporates the magnitude of the test statistic Z i,i as a weighted sum.
For the denser representation of A, we estimate the proportion of significant eQTL analyses for a particular SNP, or the proportion of genes whose expression are influenced by the SNP by utilizing the proportion of true null hypotheses. The proportion of null hypotheses is given as r 0 = m 0 =m where m 0 is the number of true null hypotheses and m is the total number of hypotheses. (Storey, 2002;Storey et al., 2003Storey et al., , 2004Storey and Tibshirani, 2003). As such, the degree is given as: .; p i;m Á ; d Gene denser = 1 À r 0 À p 1;j ; p 2;j ; .; p n;j Á Thus SNPs that have higher degree if they are estimated to have fewer true null associations to the genes. This degree d NP , thus is given by the estimation of the proportion of non-null hypotheses.
factor binding motifs to the genome, protein-protein interaction data derived from the Catalog of Inferred Sequence Binding Preferences and StringDb (Weirauch et al., 2014;Szklarczyk et al., 2017), and gene expression data from GTEx V8 as inputs to PANDA. The output from PANDA was a set of twenty-nine tissue-specific gene regulatory network models linking transcription factors to genes for which there is evidence of regulation. We calculated the degree of genes in the regulatory network using the following proposed transformation to the edge weights to account for negative edge weights, where w ij is the edge weight between transcription factor i and gene j.
We compared the degree across the different types of gene networks, exploring the differences in the regulatory relationships represented in these networks. We calculated the correlation of the gene degree between our proposed eQTL network, primarily considering the Benjamini-Hochberg based weighted gene degree with a threshold of t < 0.05 as well as the other sparse definitions within the supplemental information, and each of the co-expression and regulatory networks for each tissue. We then meta-analyzed the correlations using a random effects model using the meta package (Balduzzi et al., 2019).

Degree and functional annotations
We define genomic annotations based on the estimated SNP and gene degrees. For the SNP-level degree, we defined the annotation to be the Benjamini-Hochberg based weighted degree with a corresponding threshold of t < 0.05. For the gene-level degree, we annotated all SNPs within the gene ( ± 50 KB window) with the gene's continuous-valued Benjamini-Hochberg based weighted degree with a corresponding threshold of t < 0.05. We also defined annotations for LFDR and QV with a threshold of t < 0.05. We used the 1000G European samples as reference SNPs for defining the gene-based annotation (Genomes Project Consortium, 2015).
We also considered two sets of external annotations. First, we estimate annotations capturing genetic diversity. We calculate both nucleotide diversity (p) and Tajima's D across all genes for which expression was measured. We used the window-pi and TajimaD functions of VCFtools on the previously described GTEx genotype data (Danecek et al., 2011). Second, we use the baseline-LD model (v2.2) in enrichment estimation, which contains a broad set of 97 annotations (Gazal et al., 2017). This model extends previous baseline-LD models and captures variant characteristics including functional regions, conservation, MAF, and LD-related annotations.

Correlation with genetic diversity annotations
We evaluated the correlation between degree and genetic diversity. For each tissue, we used the gene-level degree defined above primarily Benjamini-Hochberg based weighted degree with a corresponding threshold of t < 0.05, and secondarily across the other sparse definitions) and calculated the correlation to nucleotide diversity and Tajima's D in order to assess whether there is a relationship between increased network connections and genetic evolution. We summarized across tissues by meta-analyzing the correlations using a random effects model via the meta package (Balduzzi et al., 2019).

Effect size and enrichment estimation
We used stratified LD score regression (S-LDSC) to estimate the enrichment and standardized effect size of the degree-defined annotations (Gazal et al., 2017;Finucane et al., 2015). In particular, we considered seven blood traits selected to correspond with six relevant tissue-specific networks. The summary statistics for these traits were obtained from a publicly available analysis of the UK Biobank . As previously described, we define a cj as the annotation value of SNP j for the annotation c and t c as the contribution of annotation c to per-SNP heritability contribution. We consider a binary annotation in order to have sufficiently stable estimates, defined as 1 where the variant is in the top quartile of the Benjamini-Hochberg (or secondarily LFDR or QV) based weighted degree with a corresponding threshold of t < 0.05 and 0 otherwise. Then, assuming that the variance of each SNP is a linear additive contribution to the annotation: where t c is estimated as: where N is the GWAS sample size and l(j,c) is the LD score of SNP j for the annotation c. The LD score is estimated as lðj; cÞ = P k a ck r 2 jk where r jk is the correlation between SNPS j and k. We used LD scores computed from 1000G data from individuals with European ancestry (Genomes Project Consortium, 2015). The first measure of interest, effect size, is a standardized measure that describes effects unique to annotation c, conditional on all other annotations. It is defined as the proportionate change in per-SNP heritability associated with a one standard deviation increase in the value of the annotation (conditional on all the other annotations in the model),