Tissue-specific regulatory elements in mammalian promoters

Transcription factor-binding sites and the cis-regulatory modules they compose are central determinants of gene expression. We previously showed that binding site motifs and modules in proximal promoters can be used to predict a significant portion of mammalian tissue-specific transcription. Here, we report on a systematic analysis of promoters controlling tissue-specific expression in heart, kidney, liver, pancreas, skeletal muscle, testis and CD4 T cells, for both human and mouse. We integrated multiple sources of expression data to compile sets of transcripts with strong evidence for tissue-specific regulation. The analysis of the promoters corresponding to these sets produced a catalog of predicted tissue-specific motifs and modules, and cis-regulatory elements. Predicted regulatory interactions are supported by statistical evidence, and provide a foundation for targeted experiments that will improve our understanding of tissue-specific regulatory networks. In a broader context, methods used to construct the catalog provide a model for the analysis of genomic regions that regulate differentially expressed genes.


Introduction
Reverse engineering mammalian transcriptional regulatory circuits can be achieved using systematic methodology that includes both computational and experimental techniques, working in tandem to generate, refine and verify hypotheses. Understanding tissue-specific transcription is a necessary step for extending regulatory circuit reverse-engineering efforts from single-cell eukaryotes to metazoans. We recently demonstrated that the information in proximal promoters can predict a significant portion of tissue-specific elevated or inhibited expression (Smith et al, 2006). Here, focusing on tissue-specific regulatory pattern identification and prediction accuracy instead of proof of existence, we use refined analysis and data curation methods to discover and catalog highconfidence regulatory interactions and sites. This catalog will assist experimental efforts to reverse engineer tissue-specific transcriptional regulatory networks from the bottom up.
Numerous techniques for analysis of regulatory sequences have been proposed, and the problem of module identification is now receiving due attention (Zhou and Wong, 2004;Gupta and Liu, 2005;Zhu et al, 2005). Previously characterized binding site motifs have been used to infer transcription factor function in certain tissues (Nelander et al, 2005). Xie et al (2005) identified conserved motifs across ortholog promoters of four mammalian genomes. Robertson et al (2006) describe cisRed, a database that integrates genome annotation data, homology data and genome alignments to identify motifs with conserved sites across mammals. We analyzed proximal promoters with evidence for tissue-specific regulation in order to identify tissue-specific motifs, modules and their sites in proximal promoters. We developed a new technique for characterizing tissue-specific modules that ensures that each module component significantly improves tissue-specific module enrichment.
We integrated multiple sources of expression data to identify reliable sets of transcripts that are under tissue-specific regulation in human and mouse. Using transcription start site (TSS) annotation in Cold Spring Harbor Mammalian Promoter Database (CSHLmpd) (Xuan et al, 2005), we compiled sets of proximal promoters corresponding to transcripts with evidence for specific regulation in the selected tissues. Our analysis was based on motifs discovered de novo (called novel motifs) using DME (Smith et al, 2005a) and DME-B (Smith et al, 2006), as well as previously characterized vertebrate binding-site motifs (called known motifs) from TRANSFAC (Matys et al, 2003) and JASPAR (Sandelin et al, 2004). We evaluated motifs according to enrichment in tissue-specific promoters relative to other promoters from the same species. We showed that motifs associated with factors with known tissue-specific roles rank high for enrichment, that motif ranks are significantly correlated between human and mouse and that this same set of motifs and their corresponding cis-elements are unlikely to be identified using traditional, order-preserving alignments of ortholog promoter sequences. We constructed modules of interacting motifs (both novel and known motifs), ensuring that each component contributed significantly to the enrichment of the whole module. We annotated tissue-specific promoters with predicted tissue-specific regulatory elements and demonstrated that these sites are in excellent agreement with experimentally annotated liver-specific sites in the human albumin promoter and skeletal-muscle-specific sites in the human a-actin promoter. Both promoters are particularly well annotated with experimentally verified tissue-specific regulatory elements and permit an informative comparison. In other tissues, we gave predicted sites for tissue-specific motifs in representative promoters. The complete data and analysis are available in TCat: The Catalog of Tissue-Specific Regulatory Motifs (http://rulai.cshl.edu/tcat).

Results
We describe first steps toward cataloging high-confidence tissue-specific motifs, modules and their sites. We first collected and integrated expression and function data from various sources, and identified transcripts that are likely to be under tissue specific regulation. We demonstrated that transcripts with evidence for tissue-specific regulation from multiple expression sources in one species (human or mouse) are significantly more likely to have evidence for tissuespecificity in the other species. We analyzed and annotated proximal-promoter sets in seven representative tissues from both human and mouse, demonstrating that motifs and predicted binding sites are in agreement with experimentally verified data and that analyses in human and mouse are significantly correlated. We also showed that the top-scoring sites in orthologous tissue-specific promoters from human and mouse rarely have significant conservation of site order, suggesting that comparative genomics alone may not be sufficient to decode the regulatory signals in these proximal promoters.

Transcripts under tissue-specific regulation
Few transcripts have expression restricted to a single tissue, but many transcripts appear to be regulated in a tissue-specific manner (Su et al, 2004), and the corresponding promoters are likely to contain tissue-specific regulatory elements. To circumvent problems associated with individual sources of information, we used a voting system that combined information about expression, function and tissue specificity from different sources. Table I gives the number of transcripts with single and multiple sources of evidence (votes) for tissuespecific regulation in each tissue. Orthologs of transcripts with multiple votes for tissue-specific regulation were more likely to have evidence for specific regulation in that tissue, suggesting that the false-positive rate for calling a transcript tissue specific is lower when based on multiple votes. The number of ortholog transcript pairs with multiple votes for tissue-specific regulation in both species ranged from 1 in CD4 T cells to 69 in liver. Table II lists genes and orthologous transcripts with votes for skeletal-muscle-specific regulation in both human and mouse. Gene and transcript lists for other tissues are given in Supplementary Section 1.5.

Enrichment of known tissue-specific motifs
Knowledge of factors and corresponding binding sites that regulate tissue-specific transcription can be used to evaluate motif ranking. We measured motif enrichment in tissue-specific promoter sets using balanced error rates, evaluating motifs for their ability to distinguish tissue-specific sets (foreground sets) from background sets that are composed of non-tissue-specific promoter samples from CSHLmpd. Balanced error rates measure proportions of misclassified promoters after normalization of foreground and background sizes. We ranked motifs according to enrichment and determined whether the ranks assigned to binding-site motifs for factors with known tissue-specific roles are significantly elevated in the corresponding tissues. The results presented in Table III demonstrate that binding-site motifs for these factors ranked significantly high (Po0.01) according to a Wilcoxon signed-ranks test in almost all tissues tested. Excluding DBP in human liver and HNF-3 in mouse liver, these factors had evidence for expression in their respective tissues, and their binding-site motifs were highly enriched in our foreground sets. Results are summarized in Table IV, and the motifs with greatest enrichment in each tissue are given in TCat. Table IV also includes information for HNF-6 in liver. HNF-6 is a known liver regulator, but there is no evidence for its expression in liver based on our data, and its binding-site motif was not enriched in our liver foreground sets. In addition, C/G-or A/Trich motifs are likely to be enriched in foreground sets that are C/G or A/T rich relative to promoter base composition.
To eliminate this potential bias, we adjusted the GC content in background sets to match foreground sets. Some known tissue-specific motifs were identified as enriched only after GC content correction. Nuclear receptor binding-site motifs and E-box motifs are among the top enriched motifs in 11 and 10 of the 14 human and mouse tissues, respectively. A Wilcoxon signed-ranks test showed that nuclear receptor and E-box motifs (represented by 54 and 39 TRANSFAC motifs, respectively) are enriched in the union of our foreground sets with P-values below 6.07EÀ14 and 2.02EÀ13 (nuclear-receptor motifs) and 2.22EÀ10 and 4.57EÀ 03 (E-box) in human and mouse, respectively. These results suggest a diversity of tissue-specific roles for nuclear receptors and E-box binders, likely mediated by tissue-specific cofactors.

Tissue-specific cis-regulatory elements
Highly enriched motifs, and the associated score thresholds identified by our methods, provide a starting point for targeted experimental annotation of tissue-specific promoters. Figure 1 shows known and predicted sites mapped on the À500 to þ 100 region of the human albumin promoter and the À250 to þ 50 region of the human skeletal muscle a-actin promoter. Human albumin has known functional binding sites for HNF-1, C/EBP, AFP and NF-Y (Paonessa et al, 1988;Sawadaishi et al, 1988;Frain et al, 1990;Li et al, 1990), all of which were identified among the top predicted motifs or included in top modules for liver. Locations of known sites for C/EBP (at À437) and NF-Y (at À125) do not perfectly align with the corresponding predicted sites (at À462 and À143, respectively), but current knowledge about binding sites for those factors raises the possibility that the predicted locations are more accurate. The only predicted binding site for these factors in the human albumin promoter that is not depicted in the figure is a C/EBP site at À956. Human skeletal muscle a-actin has known sites for SRF, TEF and a known TATA box (Boxer et al, 1989;MacLellan et al, 1994), all among top predicted motifs for human skeletal muscle (TATA box motifs have high similarity with MEF-2 motifs). Figure 2 gives predictions in a representative human promoter in each of the remaining tissues.

Comparison to previous results
Previous analysis of tissue-specific patterns in regulatory regions includes analysis based on cross-species conservation (Xie et al, 2005) and coexpression (Smith et al, 2006). Xie et al (2005) identified conserved elements in orthologous promoters of four mammalian genomes. They found 59 experimentally validated motifs that are significantly The 'Votes' column gives the total number of votes for skeletal muscle-specificity in both human and mouse. Our analysis used promoter sets of size 100 for both tissues, including promoters that correspond to transcripts with a single vote for tissue-specific regulation. Tables for the remaining 6 tissues are given in supplementary material.  The A/T-rich SRF is not identified in mouse even after GC-content correction.

MyoD
Myogenic factor 3; Member of the myogenin family.
Myogenin family, E-box binding factors.

Yes
Yes Ranks in the top 5 in both species and predicted to interact with SRF and MEF-2.

Sp1
Stimulating protein 1; A ubiquitous factor with a Cys2His2 zinc finger domain.

Yes
Yes

Yes
The C/G-rich motif is higly ranked with and without GC-content correction.
Testis SRY Sex-determining region on Y chromosome; Member of the highmobility group (HMG) class of factors.
SOX factors.

Yes
No The A/T-rich motif is ranked 2nd in human after GC-content correction.
CREM Cyclic AMP-responsive element modulator; Member of the CREB/ATF subfamily of the bZIP factors.

Yes
Yes Yes Yes C/G-balanced and synergistic, ranks 1st in mouse with and without GC-content correction.

RFX
Regulatory factor X; subfamily of the forkhead factors with winged-helix binding domains.

Yes
No

Yes
The core of the RFX motif (GTTGCCA) is highly similar to the reverse of the core MYB motif (CCGTTG), ranking top in human.
A motif classifies foreground from background if it is ranked in the top 20 distinct motif classifiers. In previous work (Smith et al, 2006), we tested the hypothesis that proximal promoters contain information that can be used to predict tissue-specific expression. We were not concerned with identifying the most significant tissue-specific motifs, modules and sites. Considering the difference between the goals of the two projects, it is not surprising that the predictive models described by Smith et al (2006) have little similarity to our top motifs and modules. The most significant similarities between our top tissue-specific patterns and the predictive models of Smith et al (2006) include the enrichment of ETS in CD4 T-cell-specific promoters and the enrichment of Smith et al (2006) motifs Novel3 and Novel6 in mouse testis and Novel1 in human testis. The three novel testis motifs are very similar to motifs that rank in the top 100 in our analysis, but the enrichment of these motifs was not sufficiently high for inclusion in TCat.

Correlation between human and mouse regulatory regions
We compared motif enrichment ranks in each human foreground set to ranks in the corresponding mouse foreground set using Spearman's rank correlation test, and found that enrichment ranks across species are highly correlated (Po0.001) for all but CD4 T cells (Supplementary Table 15). In CD4 T cells, motif enrichment ranks are similar only for few highly enriched motifs. Despite the motif-enrichment ranks correlation, the order of the top predicted binding sites is not usually conserved between orthologous promoters. Fewer than 10% of orthologous pairs showed significant (Po0.01) conservation of site order. Weak site-order conservation  (Paonessa et al, 1988;Sawadaishi et al, 1988;Frain et al, 1990;Li et al, 1990) and for a-actin (Boxer et al, 1989;MacLellan et al, 1994) were mapped to the promoter from CSHLmpd to obtain their correct locations relative to the TSS.  suggests that the top tissue-specific sites would be difficult to identify using traditional cross-species alignment alone, and methods that rely on co-linear promoter alignment may have high false-negative detection rates. This evidence is in agreement with Frith et al (2006), who found that homologous transcription start sites can be separated by more than 100 nucleotides. A list of the nine genes (out of 102 candidates) with significant conservation of site order is given in Supplementary Section 2.3.

Materials and methods
The steps used in creating the catalog include (1) identifying tissue-specific transcripts, (2) identifying factors that are expressed in each tissue, (3) obtaining promoter sequences for tissuespecific transcript, and (4) identifying individual motifs and modules (i.e. sets of interacting motifs) that characterize tissue-specific promoter sets.

Identifying tissue-specific transcripts
To identify motifs and modules that regulate tissue-specific transcription, we analyzed promoters of transcripts that appear to be regulated in a tissue-specific manner. If an information source indicated that a transcript has restricted expression, unusually high expression, or a specific function in the tissue, that source voted for tissue specificity of the transcript. For each tissue, we sorted the transcripts according to the number of votes received, retaining the top 100 with distinct TSS as tissue specific. Ties in the ranking were broken according to intensity values from the GNF SymAtlas expression data (discussed below), which we have found to be the most complete and the most reliable source of tissue-based expression information. We used the same number of transcripts for each tissue to facilitate comparison across tissues, and 100 sequences provided sufficient information for our analysis while allowing identification of well-known tissue-specific motifs.

Microarray data
The GNF SymAtlas microarray data were generated using Affymetrix HG-U133A array and the custom GNF1H and GNF1M Affymetrix arrays, and include expression profiles for 79 human and 61 mouse tissues (Su et al, 2004). Among these are the seven tissues we selected to include in the catalog. Tissues were selected with consideration to data availability in GNF and other sources and interest from Zhang lab members and collaborators. A transcript received a vote for tissue specificity from this information source if it was called present and its intensity exceeded its mean across all tissues by 3 standard deviations.
The Hughes Toronto microarray data , which was generated using custom-built oligonucleotide arrays, provide mouse expression profiles for 55 tissues, including all of our tissues but CD4 T cells. A transcript received a vote for tissue specificity if it was called present in the tissue, and had intensity at least 10 standard deviations above its mean across all 55 tissues. This large number of standard deviations was required to limit the number of transcripts receiving positive votes.
The GeneNote expression profiles (Shmueli et al, 2003), which were generated using the Affymetrix GeneChip HG-U95, provide human expression data for 12 tissues, including all of our tissues but CD4 T cells and testis. The GeneNote data were used in the same way as the GNF SymAtlas data, with a transcript being called tissue specific if it is present in that tissue and has intensity at least 3 standard deviations above its mean across all 12 tissues.

EST data
dbEST is a database of expressed sequence tags (Boguski et al, 1993), and contains source information, such as the tissue of origin, for each EST. This information is used to annotate UniGene clusters with the source data, and a UniGene is said to have restricted expression in a tissue if more than half of the ESTs contributing to that UniGene have the same source tissue. A transcript received one vote for specificity in a particular tissue if the corresponding UniGene cluster is annotated as having expression restricted to that tissue.

GO terms
We associated a set of GO Terms with each tissue. This was performed by compiling a set of keywords for each tissue (e.g. 'renal' was associated with kidney; 'sperm' was associated with testis), and searching GO Term names and definitions for those keywords. This produced, for each tissue, a set of GO Terms that were subsequently reviewed to ensure that the context of the keywords was appropriate. A transcript of a gene annotated with a GO Term that is associated with a tissue received a vote for specificity in that tissue.

Selecting promoter sequences
Although regulatory elements can exist almost anywhere in the genome, they are concentrated near the TSS (Cooper et al, 2006). We used the CSHLmpd to map transcripts to promoters, using experimentally confirmed promoters from EPD (Perier et al, 1998), DBTSS (Suzuki et al, 2002) and GenBank, as well as computationally predicted promoters. For each promoter, we used the proximal sequence region of À1000 to þ 100 relative to the TSS.
Each part of our analysis is based on comparing the tissuespecific promoter sets to a background of random promoters from the same species. For each tissue, a background set was constructed by selecting 1000 transcripts uniformly at random from the set of RefSeqs for the corresponding species with TSS annotation in CSHLmpd. For each tissue, transcripts with at least one vote for specificity in that tissue were removed from consideration before selecting the background.
Because our analysis focused on proximal promoters, À1000 to þ 100 relative to the TSS, if the TSS annotation is off by several hundred base pairs, important promoter regions might be excluded. AKR1D1, identified as liver specific by our voting system, has two known TSSs within 500 bp of the first exon for the corresponding RefSeq (NM_005989) (Charbonneau and Luu-The, 1999). We used the TSS located upstream of the first exon, but could have chosen to use the other promoter, which was annotated by a generally more reliable source (DBTSS versus GenBank). Currently, in such situations, there is little information that identifies the promoter responsible for observed tissue-specific regulation, but comparative genomics and rapidly improving arrays promise better 5 0 end identification, thus improving proximal promoter annotation and association between transcripts and promoters (Kim et al, 2005;Carninci et al, 2006). Negative promoter sets can be used to cancel out patterns that are not related to tissue-specific transcription regulation. We use random negative promoter sets with and without GC-content correction; this correction cancels the inuence of genomic GC-content isochore variability.

Identifying and evaluating motifs
Given a motif M (represented as a position-frequency matrix) and a sequence S, the max-score of M in S, max-score (M, S) is the score of the top scoring subsequence of S when aligned against the scoring matrix for M. Details on constructing and using scoring matrices can be found in Stormo (2000). For a fixed threshold l, the max-score classification method classifies S as belonging to the foreground if max-score (M,S)Xl, and the background otherwise. Given a set of foreground sequences FG (i.e. the tissue-specific promoters) and a set of background sequences BG, the sensitivity of M and l under maxscore classification is seðM; l; FGÞ ¼ jfS 2 FG : max -scoreðM; SÞXlgj=jFGj; and the specificity is spðM; l; BGÞ ¼ jfS 2 BG : max -scoreðM; SÞolgj=jBGj: The balanced error rate for M and l under max-score classification is then BðM; l; FG; BGÞ ¼ 1 À ðseðM; l; FGÞ þ spðM; l; BGÞ=2: The quantity of interest in our analysis corresponds to the optimal value of l for M in distinguishing FG from BG: BðM; FG; BGÞ ¼ min l fBðM; l; FG; BGÞg: Many known motifs are similar to each other, usually owing to similar binding specificities for distinct factors or distinct origins for mofits associated with a single factor. We used MATCOMPARE (Schones et al, 2005) to eliminate redundancies in the sets of known and novel motifs.

Identifying and evaluating modules
Modules are used to classify sequences based on the max-score values of the motifs they contain. Let M¼{M 1 , y, M k } be a module and L¼{l 1 , y, l k } be an associated set of thresholds. The max-score classification for modules assigns sequence S to the foreground if and only if max-score (M i , S)Xl i holds for all i (1pipk). Given sets of sequences FG and BG, the sensitivity of M and L under max-score classification is seðM; L; FGÞ ¼ jfS 2 FG : L k i¼1 max -scoreðM i ; SÞXl i gj=jFGj; and the specificity is spðM; L; BGÞ ¼ jfS 2 BG : _ k i¼1 max -scoreðM i ; SÞol i gj=jBGj: The balanced-error rate for M and L under the max-score classification is BðM; L; FG; BGÞ ¼ 1 À ðseðM; L; FGÞ þ spðM; L; BGÞÞ=2: As with mofits, we are interested in the optimal value of L and define BðM; FG; BGÞ ¼ min L fBðM; L; FG; BGÞg: Because modules are intended to describe synergistic function of a set of motifs, we are interested in modules whose performance is better than expected given the performance of the individual motifs composing the module. For a module M composed of k motifs, let M 0 CM minimize B(M 0 , FG, BG) over all size kÀ1 modules built from motifs in M, and let M 0 ¼M\M 0 . To assess whether M, with balancederror rate u, significantly improves over M 0 and M 0 , we use the probability PrðBðM; FG; BGÞpujBðM 0 ; FG; BGÞ; ðM 0 ; FG; BGÞÞ: We estimated this probability empirically, by sampling from the distribution of balanced-error rates resulting from intersections of sets with balanced-error rates B(M 0 , FG, BG) and B(M 0 , FG, BG).
We used MODULATOR, which is available in CREAD (Smith et al, 2005b), to construct modules. Given a set of motifs, a set of foreground sequences and a set of background sequences, MODU-LATOR identifies those modules composed of the given motifs that have the best balanced-error rates. A branch-and-bound algorithm is used to simultaneously optimize the score thresholds for the motifs in a module. Modules are constructed by adding motifs to existing modules until a user-specified module size is reached or until motif addition does not significantly improve enrichment. Each time a motif is added to a module, the resulting larger module is retained only if the balanced-error rate of the larger module is improved significantly above expectation. The initial modules of size two are obtained by combining pairs of motifs.
For modules that are entirely composed of known motifs, the top 100 motifs (before eliminating redundancies) were used. Modules were allowed to contain up to four motifs for reasons of computational feasibility, but many top modules are smaller. Novel modules, which must contain at least one novel motif, were constructed using the top 100 novel motifs and the top 100 known motifs. Redundancies were removed from the lists of top modules using a procedure described in Supplementary Section 2.

Measuring the significance of motifs and modules
To measure significance of enrichment for top known motifs we used known motifs to classify randomly assembled promoter sets. We constructed 1000 foreground/background pairs for each species by selecting 100 sequences for each foreground and 1000 for each background uniformly at random from CSHLmpd. For each foreground/background pair we calculated the balanced-error rate of each known motif. The best balanced-error rates overall obtained on random samples for human and mouse were 0.364 and 0.368, respectively. We used the distribution of these error rates to identify the q-value (Storey and Tibshirani, 2003) significance of the error rate of each motif. Tissues whose highest ranking motifs fail the qo0.05 test include CD4 Tcells and heart in human, and CD4 Tcells in mouse. TCat includes q-value annotation for each ranked known motif. The full set of motifs for which qo0.05 is estimated to include five false leads per 100 predictions. We did not obtain statistical significance measures for novel motifs, because this will require running DME and DME-B more times than is computationally feasible.
Modules were identified by combining motifs whose cooccurrence was enriched in the foreground sets. We measured enrichment of modules using the balanced-error rate (analogous to that of motifs), and we required that each motif in a module contributes significantly to the enrichment of the module as a whole. To test significance, we randomly selected 100 of the 1000 foreground/background pairs used to evaluate individual known motifs, and performed the module identification procedure on each of the 100 selected pairs. The best balanced-error rates for modules that are entirely composed of known motifs (called known modules) in random human and mouse sets were 0.3145 and 0.304, respectively. We used these balanced-error rates as an estimate of the critical value for Po0.01. We opted for using P-value cutoffs instead of computing q-values because accurate q-value estimation for modules is computationally prohibitive. Top known modules in human kidney (0.2955), liver (0.3105), pancreas (0.3125) and testis (0.3) scored better than the cutoff, as did top known modules in mouse kidney (0.3005), liver (0.3025) and testis (0.2945).

Supplementary information
Supplementary information is available at the Molecular Systems Biology website (www.nature.com/msb).