Multi-Organ Expression Profiling Uncovers a Gene Module in Coronary Artery Disease Involving Transendothelial Migration of Leukocytes and LIM Domain Binding 2: The Stockholm Atherosclerosis Gene Expression (STAGE) Study

Environmental exposures filtered through the genetic make-up of each individual alter the transcriptional repertoire in organs central to metabolic homeostasis, thereby affecting arterial lipid accumulation, inflammation, and the development of coronary artery disease (CAD). The primary aim of the Stockholm Atherosclerosis Gene Expression (STAGE) study was to determine whether there are functionally associated genes (rather than individual genes) important for CAD development. To this end, two-way clustering was used on 278 transcriptional profiles of liver, skeletal muscle, and visceral fat (n = 66/tissue) and atherosclerotic and unaffected arterial wall (n = 40/tissue) isolated from CAD patients during coronary artery bypass surgery. The first step, across all mRNA signals (n = 15,042/12,621 RefSeqs/genes) in each tissue, resulted in a total of 60 tissue clusters (n = 3958 genes). In the second step (performed within tissue clusters), one atherosclerotic lesion (n = 49/48) and one visceral fat (n = 59) cluster segregated the patients into two groups that differed in the extent of coronary stenosis (P = 0.008 and P = 0.00015). The associations of these clusters with coronary atherosclerosis were validated by analyzing carotid atherosclerosis expression profiles. Remarkably, in one cluster (n = 55/54) relating to carotid stenosis (P = 0.04), 27 genes in the two clusters relating to coronary stenosis were confirmed (n = 16/17, P<10−27and−30). Genes in the transendothelial migration of leukocytes (TEML) pathway were overrepresented in all three clusters, referred to as the atherosclerosis module (A-module). In a second validation step, using three independent cohorts, the A-module was found to be genetically enriched with CAD risk by 1.8-fold (P<0.004). The transcription co-factor LIM domain binding 2 (LDB2) was identified as a potential high-hierarchy regulator of the A-module, a notion supported by subnetwork analysis, by cellular and lesion expression of LDB2, and by the expression of 13 TEML genes in Ldb2–deficient arterial wall. Thus, the A-module appears to be important for atherosclerosis development and, together with LDB2, merits further attention in CAD research.

used to standardize the collection procedure, including a specific order for obtaining each tissue sample in relation to the start of surgery. Anesthesia was standardized to keep systolic blood pressure <150 mm Hg throughout the operation. The time of extraction of each biopsy, deviations from the protocol, and no routine events were noted.
A total of five tissue samples were obtained. Skeletal muscle (~0.5 g ) was taken from the medial border of the apical rectus abdominus muscle close to the incision in sternum. Visceral fat (~1 g) was taken from the mediastinal tissue anterior to the pericardium and great vessels. The internal mammary artery (IMA) was dissected from the inside of the left chest wall, and 1 cm of the distal part of the artery was collected. Aortic wall samples were obtained from the whole punch used to create the proximal vein graft anastomoses at the aortic root during the operation. Liver tissue (3 mm diameter, 0.05 g) was taken from the very inferior border of the left liver lobe at the end of the operation. This part of the liver was easily accessed after the peritoneum was opened through a small incision in the diaphragm a few centimeters below the xiphoid process.
The incisions in the liver and diaphragm were sutured after removal of the biopsy. All tissue samples were taken without use of cautery, and none of the patients suffered from bleeding at the biopsy sites or other complications related to the tissue sampling. The tissue samples were rinsed with RNAlater (Qiagen) to remove blood and immediately (~10 seconds) put into vials containing fresh RNAlater. The vials were stored at room temperature until the end of surgery, at 4°C overnight, and then frozen at -80°C until further processed.
Blood glucose was measured by a glucose oxidase method (Kodak Ektachem) and insulin and pro-insulin by enzyme-linked immunosorbent assay (Dako Diagnostics).

The IMA and Non-Atherosclerosis-Related Genes
For reasons unknown, the IMA is an atherosclerosis-free artery (reference 29 in the manuscript).
Therefore, we decided to use IMA as an internal control (i.e., from the same patient) to normalize atherosclerotic arterial wall expression related to nonatherosclerotic processes (i.e., normal arterial wall gene expression). This was achieved by using the ratios of atherosclerotic arterial wall/IMA gene chip signals for each RefSeq in the cluster analysis.
However, we wanted to investigate the number of genes involved in "normal nonatherosclerosis-related activity" that differed between the aorta and IMA samples. This was done by analyzing, on a human Affymetrix GeneChip (HG-U133 Plus 2.0), total RNA from an additional human aortic root sample (a pooled aortic RNA sample from three male/female Caucasians who suffered sudden death at an early age [27-45 years]) and thus presumably had little or no aortic atherosclerosis; obtained from Clontech). We found 1.9% of the 15,042 RefSeqs or 285 RefSeqs were differentially expressed (i.e., mRNA levels, FDR=0.05) between this "normal" aorta and our IMA samples (n=40). None of the 285 RefSeqs were part of the atherosclerosis module (n=129 RefSeqs representing 128 genes).

Clustering Analysis
The principles of the clustering analysis are shown in Figure 1 in the manuscript. Geneexpression data from each tissue were clustered with a coupled two-way approach originally proposed by Getz et al. (reference 5 in the manuscript). The underlying rational for using a twoway clustering approach is mainly to enable clinical phenotypes to groups of functional associated genes rather than individual genes. A consequence of this approach is that all genes not belonging to a cluster (i.e. a putative functional group) will be excluded and not considered for the association analysis with clinical phenotypes (in our case coronary stenosis and IMT). The algorithm allows genes to belong to multiple clusters or to no cluster at all. We used SPC because it requires no filtering, is stabile against noise, and selects the optimal number of clusters. It also incorporates a cost function, which penalizes an assignment for placing genes in different clusters. On the other hand, if the distance between two gene-expression profiles is long, then the penalty for putting them into different clusters will be low. The distance d between two gene expression profiles e i and e j is computed as Using a correlation as measure as opposed to a regular distance measure (e.g. Manhattan of Euclidean distance) will prioritize genes with similar expression profiles as opposed to genes with similar expression levels. Moreover, Spearman rank correlation is non-parametric and thus more robust to extreme (outlier) data points (as compared to Pearson correlation). We use the absolute value of the Spearman rank correlation to accommodate for gene pairs with negative correlation.
The cost function uses a parameter called temperature T, whose value varies depending on how the algorithm assigns genes to clusters. At low temperature, all genes will belong to the same cluster; many genes have similar properties, and the system is in an ordered ferromagnetic state.
At high temperature, the gene properties change, and the state become unordered and paramagnetic; there will be many clusters ( Figure S1). At some value of T, the so-called superparamagnetic state, the system contains some genes in the ordered ferromagnetic state and some in the unordered paramagnetic state that then defines which genes that belongs, or do not belong, to a given cluster (hence, which genes are excluded). The identified clusters were stable over a temperature interval of at least 0.015, and we joined overlapping clusters if they were more than 60% identical and discarded clusters with more than 1000 members. Based on the mRNA GeneChip signal within individual gene clusters generated in the first step, the patients segregated into two groups. In the second step, hierarchical (agglomerative) clustering with average linkage in Mathematica was used. Manhattan distance was used to measure similarities between patients across GeneChip signal values of genes in the cluster. Small patient groups (3 patients or less) were considered outliers and therefore removed; the remaining patients were in such cases re-

Resampling Analysis
To assess the repeatability and reliability of the clusters, resampling using Jackknife analysis was performed (reference 57 in the manuscript). One patient at the time was removed in the leave-one-out cross-validation step where after data was reclustered with the SPC algorithm and gene clusters were identified over a temperature. A one-sided 90% confidence interval (CI) was calculated for each cluster (Table S1) showing the reappearance interval from the least number of genes (X) to infinity (i.e. the total number of genes in a given cluster). For example, the visceral fat cluster (n=59 RefSeqs) had a CI starting at 38 to a maximum of 59 RefSeqs. This means that at least 38 RefSeqs of the total 59 in this cluster appear in 90% of the resampling times. Since we are only interested in the lower limit, we performed a one-sided test; infinity in this case is the same as the total number of RefSeqs in a cluster.

Text Mining to Define Genes Previously Associated with CAD and
Atherosclerosis Automated text mining of abstracts in PubMed was used to establish a comprehensive list of genes previously related to CAD and atherosclerosis. Briefly, a gene was considered related if it co-occurred in the abstract text of a published article on PubMed with the MeSH terms coronary arteriosclerosis, arteriosclerosis, or atherosclerosis or the text words coronary artery disease, arteriosclerosis, or atherosclerosis. Gene symbols were extracted and compared to any gene symbol as provided by Entrez. Synonyms were converted to the most commonly used symbol and abbreviations that could be mistaken for other terms were removed. Additional genes were manually extracted from recent reviews of CAD (n=459 genes). The entire list of 2832 CAD-and atherosclerosis-related genes is found in Table S9.

Expression SNPs from GGE
In the GGE (reference 9 in the manuscript), all expression traits were tested for association with each of the genotyped SNPs. The strongest putative cis eQTL for a given expression trait was defined as the SNP most strongly associated with the expression trait over all of the SNPs typed within 1 megabase (Mb) of the transcription start or stop of the corresponding structural gene.
The association p-values were adjusted to control for testing of multiple SNPs and expression traits using an empirically determined false discovery rate (FDR) constrained to be < 10%. In the case of trans eQTL, all SNPs were tested for association to each of the expression traits.

Expression Network Analysis
Three No other network or functional analysis of different clusters, other than the three related to atherosclerosis severity, were tested.

In Silico Promoter Binding Analysis
One hundred sixty-one promoter sequences from Ensembl v. TRANSFAC v 11.2 (reference 13 in the manuscript), we identified a total of 171 known TF binding sites for these TFs, of which 81% were present in the 161 promoters (one or more binding sites), suggesting that in principle these promoters could bind the analyzed transcription factor. To investigate how specific this association may be, we selected 10255 human promoters covering the [-600,-1] region relative to transcription start sites as background. We then compared the combination of any two binding sites (two binding sites are less likely to happen by chance) appearing in the target promoter set and in the background sequences, and calculated fold enrichment for each of the cases. We calculated Bonferroni corrected P-values using Fisher's exact right-side test. We found that the combination of binding sites was statistically enriched in the target set relative to the considered background, with enrichment varying from 1.2-to 5-fold (Table S10).