Computational Identification of Transcriptional Regulators in Human Endotoxemia

One of the great challenges in the post-genomic era is to decipher the underlying principles governing the dynamics of biological responses. As modulating gene expression levels is among the key regulatory responses of an organism to changes in its environment, identifying biologically relevant transcriptional regulators and their putative regulatory interactions with target genes is an essential step towards studying the complex dynamics of transcriptional regulation. We present an analysis that integrates various computational and biological aspects to explore the transcriptional regulation of systemic inflammatory responses through a human endotoxemia model. Given a high-dimensional transcriptional profiling dataset from human blood leukocytes, an elementary set of temporal dynamic responses which capture the essence of a pro-inflammatory phase, a counter-regulatory response and a dysregulation in leukocyte bioenergetics has been extracted. Upon identification of these expression patterns, fourteen inflammation-specific gene batteries that represent groups of hypothetically ‘coregulated’ genes are proposed. Subsequently, statistically significant cis-regulatory modules (CRMs) are identified and decomposed into a list of critical transcription factors (34) that are validated largely on primary literature. Finally, our analysis further allows for the construction of a dynamic representation of the temporal transcriptional regulatory program across the host, deciphering possible combinatorial interactions among factors under which they might be active. Although much remains to be explored, this study has computationally identified key transcription factors and proposed a putative time-dependent transcriptional regulatory program associated with critical transcriptional inflammatory responses. These results provide a solid foundation for future investigations to elucidate the underlying transcriptional regulatory mechanisms under the host inflammatory response. Also, the assumption that coexpressed genes that are functionally relevant are more likely to share some common transcriptional regulatory mechanism seems to be promising, making the proposed framework become essential in unravelling context-specific transcriptional regulatory interactions underlying diverse mammalian biological processes.


Introduction
Inflammation and activation of innate immunity are essential defense responses against invading pathogens and endogenous danger signals. The innate immune response involves the initial recognition of conserved pathogen-associated molecular patterns by members of the Toll-like receptor (TLR) family [1]. The exposure of the host to gram negative bacteria, simulated by lipopolysaccharide (LPS) recognized by TLR-4, triggers intracellular signalling cascades which eventually release a lot of pro-and anti-inflammatory cytokines [2]. While the host inflammatory response is essential to resolve the infection or repair the damage and restore the system homeostasis, it also plays a central pathogenic role in a wide spectrum of diseases including sepsis [3]. Under healthy circumstances, inflammatory responses are activated, clear the pathogen in the case of infection, initialize a repair process and then abate [4]. However when anti-inflamma-tory processes fail, an amplified inflammation can turn what is normally a beneficial reparative process into a detrimental physiological state with severe, uncontrolled systemic inflammation [5].
Studies involving experimental human endotoxemia have reported rapid intravenous infusion in doses of 2-4 ng/kg body weight, which effectively induces an acute systemic inflammatory condition that mimics the early flow phase of injury and infection [6,7,8,9,10]. In human peripheral blood leukocytes, intravenous administration of endotoxin elicits dynamic and reproducible changes in the circulating leukocyte population as well as significant changes in blood leukocyte gene expression patterns [11]. This perturbation of leukocyte gene expression involves several thousands of transcripts and accompanies the systemic physiological responses during inflammation, which peaks ,4-6 hours after endotoxin exposure and resolves within 24 hours, compatible with a large and dynamic regulatory network. Transcriptional regulation is driven by incoming signals which activate transcription factors (TFs) through mechanisms such as phosphorylation or dimerization. Activated complexes are subsequently translocated into the nucleus and bind to the promoter region of target genes in the genome in order to activate or repress gene expression [12]. It is hypothesized that this regulation process is mainly controlled by the interplay between TFs and their corresponding transcription factor binding sites (TFBSs) on the proximal promoters of the target genes [13,14,15]. Recent technological developments have enabled the ability to broadly assess TF activities at a genome-wide scale. However, there is still no transcription factor-focused method that enables monitoring of all TFs at a time; technologies such as Chip-on-chip [16,17], SELEX [18,19] can identify all DNA binding sites occupied by a single TF given a condition. In order to compensate for this inability, computational techniques have become an essential tool in predicting putative TFBSs at a large scale [20,21].
Due to the fact that TFs in higher organisms regulate gene expression in a combinatorial manner rather than in isolation [22,23] and that TFBSs tend to form clusters of binding sites, known as cis-regulatory modules (CRMs) [24,25], computational methods have shifted towards discovering CRMs instead of a single TFBS. A cis-regulatory module is generally considered as the smallest functional regulatory unit [26]. From a computational standpoint, such module is mainly characterized by two factors: (i) composition which consists of a set of non-overlapping binding sites of TFs on the control regions of a gene and (ii) structural constraints that take into account the strand orientation to which TFs bind, the order and the distance between successive binding sites [27]. A variety of methods have been proposed to search for CRMs that include the structural constraints (FrameWorker [28], CMA [29]) or without the structural constraints (CREME [30], ModuleMiner [31], Stubb [32]). Some methods attempt to incorporate a priori knowledge of CRMs (HexDiff [33], ESPERR [34,35]) to increase the specificity of the prediction while some others are purely computationally discovery of CRMs (CisModule [36], CSam and D2Z-set [37]).
However, given a methodology to search for CRMs, a critical issue in predicting functional binding sites is identifying a set of relevant promoters that share common cis-regulatory modules or alternatively a set of genes that are potentially coregulated [38]. As such it is more appropriate to explore the concept of 'gene battery' originally proposed by Britten and Davidson [39] and has been further explored in the literature [40,41,42,43]. A gene battery refers to a group of genes that are coordinately expressed and/or functionally coupled since their regulatory regions respond to the same transcriptional signals [37,44]. With the assumption that genes in a gene battery are involved in key biological processes, recognized CRMs will consist of putative functional binding sites that are associated with essential transcriptional regulators. Yet, in higher eukaryotes especially in humans the problem turns to be much more difficult. One of the most critical issues is to determine which genes belong to the same gene battery. Prior studies assume that either coexpressed genes [45,46,47] or genes that belong to the same biological process [48,49] could be governed by some common regulatory mechanism. However, recent evidence suggests that co-expression or co-function alone is not sufficient to infer the existence of common regulatory mechanisms [50,51]. Oftentimes co-expressed genes can participate in a diverse array of biological functions while functionally-relevant genes can be characterized by different expression patterns [52,53]. Predicated upon these, in this study we explore the possibility that genes that are both co-expressed and functionally-relevant may be more likely to be co-regulated. Since genes within the same pathway encode for a set of interacting proteins, they are more likely to be governed by some common regulatory mechanism [54]. Therefore, the unifying hypothesis of this study is that genes that participate in the same pathway are functional relevant.
In this study, given the transcriptional profiling analysis of human blood leukocytes we hypothesized that genes that are most responsive to an external perturbation (endotoxin) and have concerted changes in their expression profiles are governed by some common regulatory mechanism. Based on our prior work, high-dimensional microarray data are decomposed into a comprehensive set of temporal responses that, in the case of transient human endotoxemia, capture the essence of a proinflammatory phase, a counter-regulatory response as well as a dysregulation in leukocyte bioenergetics [55]. Upon identification of these patterns, a number of inflammation-specific pathways are selected by evaluating the enrichment of the corresponding subsets and thereby defining a putative set of 'coregulated' genes. The CRM-se arching process, similar to FrameWorker [28], is proposed with a novel heuristic to address the issue related to multiple alternative promoters in eukaryotic genes. The definition of CRM structural constraints has been adjusted so that no parameter is required for the searching process except for the statistically significant threshold for the CRM selection. Furthermore, motivated by the work of Schones et al. [56] a precompilation step is performed by converting all promoter sequences into a set of corresponding promoter profiles of binding sites improving the estimation of the statistical significance of recognized CRMs. Overall, the present study aims to computationally identify transcription factors that are crucial to the dynamics of essential physiological processes associated with the acute human inflammatory response and provides significant insights into putative transcriptional regulatory mechanisms that underlie gene expression. Computational results were verified primarily based on available literature.

Identification of putatively 'coregulated' genes
Given the assumption that genes that are co-expressed and functionally relevant are more likely to be co-regulated, we first identify significant expression patterns from in vivo human transcriptional data [11]. Based on 3,269 differentially expressed probesets, we explored the potential of our prior work to identify highly coexpressed genes [55]. Specifically, the algorithm performs a consensus clustering and a trivial cluster removal procedure resulting in four expression patterns that describe the dynamic evolution of endotoxin-induced human inflammation ( Figure 1). The 'early-up' pattern consists of genes that are involved in critical pro-inflammatory processes (e.g. TNF, NFkBIA, C-X-C motifs -CXCL1, CXCL2, and CCL20). The 'middle-up' pattern represents an increased expression of genes with the peak at 4 hrs post-endotoxin administration, containing inflammatoryrelevant signaling pathways such as Apoptosis, Toll like receptor (TLR) signaling. The 'late-up' pattern characterizes anti-inflammatory processes and the 'down' pattern which is the most populated expression motif is characterized by genes involved in cellular bio-energetic processes e.g. oxidative phosphorylation, ribosome biogenesis and assembly.
Since pathways are robust and flexible enough to ensure cell survival under environmental changes, the corresponding genes are more likely to be characterized by a stronger coherency rather than those in a gene ontology (GO) definition. Based on the hypothesis that genes encoding a set of related proteins are more likely to be transcribed under some common regulatory mecha-nisms [54], we opt to use biological information in the context of pathways (KEGG database) rather than GO terms. All genes that participate in the same pathway are considered to be functionally relevant are related [53]. In each expression pattern, we select statistically inflammatory relevant significant pathways (p-value,0.05) based on literature information [55]. Accordingly, fourteen sets of genes that belong to a specific pathway and a pattern of gene expression are extracted. We further assume that these groups of genes represent gene batteries, and hereby are more likely to be coregulated ( Table 1).

Statistical significance of common CRMs
Within a gene battery, CRMs that are present on the control regions of corresponding genes above a frequency threshold (e.g. d = 70% of the number of genes) are considered as common CRMs. However, such CRMs can also be overrepresented in random gene sets. Therefore, in order to restrict the false positive matches and increase the statistical power of our method, we estimate the hyper-geometric p-values of common CRMs vs. a background set and only select those CRMs whose p-values exceed a pre-defined statistically significance threshold (e.g. 10 24 ). However, this threshold is very sensitive to the size of the gene battery and thus a uniform significance threshold cannot be applied for all gene batteries. As a result, we developed a heuristic procedure for estimating the significance threshold of common CRMs with respect to the size of gene batteries. The procedure is repeated 100 times for each N-size gene-set (N = 4, 5…, 20). At each iteration, the algorithm randomly selects N genes from the background set, searches for common CRMs that are present on the promoters of these genes (d = 0.7), estimates the statistical significance (p-values) for each CRM (see materials and methods), and records the minimum one. In this study, we choose the approximate values of the mean of these minimum p-values to set the criterion for the statistical significance of CRMs in a gene battery with size N (Figure 2). Consequently, for each gene battery only those CRMs that are identified with p-values less than the corresponding p-value thresholds are used to infer relevant transcription factors (see Data S1, sheet 'p-value').

Identification of inflammation-relevant transcriptional regulators
One of the key features in our analysis is the identification of significantly overrepresented CRMs in each gene battery (see Materials and methods and Algorithms S1). Based on the size of a gene battery, a corresponding significance threshold is applied to select statistically significant CRMs. Since these recognized CRMs are located on the control regions of many putatively coregulated genes (i.e. a gene battery), they are likely to be composed of functional binding sites that are activated upon the initiation of the transcriptional machinery. We therefore decompose these CRMs into a list of TFBSs to infer associated TFs which can be considered as relevant transcriptional regulators of the corresponding gene battery. In particular, TFs that are present with the high frequency among gene batteries (at least three times across fourteen gene batteries) are assumed to play a key role in the biological process ( Table 2). We identify 34 transcription factors (TFs) relevant to the human inflammatory responses, of which around 25% has been experimentally shown to be involved in the inflammatory and/or immune response based on literature evidence (discussed below) and more than half of the remaining have been computationally shown to play a critical role in the regulation of immune system [57] (see Data S1, sheet 'CRMs', 'TFs', and 'Middle-up TLR').

Putative temporal program of transcriptional regulation
The administration of a low dose of endotoxin to human subjects elicits dynamic and reproducible changes in the circulating leukocyte population by altering the expression level of numerous genes. Since the host response to endotoxin evolves dynamically, it is possible to observe a dynamic representation in the transcriptional regulatory program ( Figure 3). Due to the fact that transcription factors are characterized by pleiotropic effects [58], it is also reasonable to anticipate a significant overlap among sets of transcriptional regulators across various biological processes. On the other hand, our results also illustrate the phenomenon in which genes involved in the same function (pathway) may exhibit different expression patterns and genes within an expression pattern can participate in different functions, implying that there are different regulatory mechanisms regulating genes in the same function or in the same expression pattern. Along with this dynamic response, the regulatory mechanisms can also be

Functional characterization of inflammation-relevant pathways
Upon identification of four significant patterns of gene expression, a number of inflammation-specific pathways are selected by evaluating the enrichment of corresponding subsets in inflammation-specific pathways, including Toll-like receptor signaling, Cytokine-Cytokine receptor interaction, Apoptosis and JAK-STAT signaling cascade, etc. (Table 1). It is now well established that Toll like receptor signaling pathway is the first arm of the host defence system that is activated when endotoxin is recognized by pathogen recognition receptors [59]. During the recognition process, LPS binds and interacts with its signaling receptor (TLR4) which triggers a signal transduction cascade essential for the up-regulation of several pro-inflammatory mediators [60]. Such mediators including cytokines and chemokines interact with their appropriate receptors, giving rise to the Cytokine-Cytokine receptor signaling pathway that amplifies and propagates the inflammatory reaction throughout the cell until the system restores homeostasis [61]. Therefore, both Toll like receptor signaling and Cytokine-Cytokine receptor interaction pathways play a pivotal role in the pro-inflammatory response. Complementary to this, considerable attention has been given to the role of an excessive death of immune effector cells (apoptotic cells) during the progression of an aberrant inflammatory response [62]. The nature of apoptosis as a rectifying process has led researchers to the realization that identifying mediators that are critical in regulating the apoptotic-inflammatory imbalance might prove beneficial in treating human sepsis [63]. It is therefore reasonable to assume that apoptosis also plays a critical role in the endotoxin-induced inflammatory process. Along similar lines, JAK-STAT cascade is another highly enriched inflammationspecific pathway that exerts anti-inflammatory properties. Accordingly, recent data provide evidence that a STAT pathway from a receptor signaling system is a major determinant of key regulatory systems including feedback loops such as SOCS induction which subsequently suppresses the early induced Toll like receptor and cytokine signaling [64,65]. Endotoxin-induced inflammation also causes a widespread suppression at the transcriptional response level of genes involved in mitochondrial energy production (Oxidative phosphorylation) and protein synthesis machinery (Ribosome). Such dysregulation in leukocyte bioenergetics together with persistent decrease in mitochondrial activity can lead to reduced cellular metabolism [66], resulting in the participation of a number of critical metabolic pathways, e.g. Citrate cycle, Pyrimidine and Pyruvate metabolism.
While comparing the inflammatory relevant pathways across expression patterns, we observe that there is a dynamic evolution of these pathways during the propagation of LPS signaling. For example, Toll like receptor signaling appears to be significantly enriched with a diverse array of genes that are early, middle or late up-regulated. From a biological standpoint, since the recognition process of LPS from its signaling receptor (TLR4) leads to the transcriptional activation of cytokines and chemokines (e.g. TNF, IL1B, CCL4), the early transcriptional event (t = 2 h) can possibly reflect this initiating process [67] while at later stages of the inflammatory reaction (t = 4 h), the effect of LPS has been translated into the signal transduction cascade mediated by the up-regulated cytokines. Such a signal transduction cascade is likely to be initiated by the conserved Toll/IL1 receptor (TIR) signaling domain [65] explaining the presence of TLR signaling in the early-up response (t = 2 h). Meanwhile, the presence of this pathway in the middle-up response (t = 4 h) is indicative for amplifying the activity of pro-inflammatory transcription factors (e.g. NFkB) essential for mediating the host response at later time events. Regarding the late up-regulated transcriptional event (t = 6 h), the system activates the expression of various genes that can effectively constraint the inflammatory response. Thus, antiinflammatory mechanisms involve either the activation of JAK-STAT cascade (e.g. IL10RB, IL13RA1), or the increased expression of receptors (e.g. IL1R1, IL8RA, CCR1) that intend to replace and compensate for those that have been consumed during pro-inflammation [68]. Therefore, this cascade of events sheds significant insight on the dynamic evolution of critical proinflammatory pathways including TLR signaling, Apoptosis and Cytokine-Cytokine interaction signaling.

Biological characterization of identified transcription factors
Predicated upon the hypothesis that subsets of co-expressed genes involved in the same biological pathway are more likely to be co-regulated, their transcriptional regulators are computation-ally predicted ( Table 2). There is considerable evidence indicating the inflammatory relevance of the aforementioned inferred transcription factors including MEF2 [69], GATA [70], OCT1 [71], FKHD [72], ETSF [73], IRFF [74], NFKB [75] and CREB [76]. Specifically, the myocyte enhancer factor 2 (MEF2) transcription factor plays a central role in the transmission of extracellular signals to the genome and in the activation of genetic programs that control cell differentiation, proliferation, survival and apoptosis [77]. In addition to this, MEF2 proteins serve as the endpoints for multiple inflammatory signaling pathways including mitogen-activated protein kinase signaling pathway (MAPK) and thereby confer signal-responsiveness to downstream target genes [78]. Also, evidence [79,80] suggests the dual role of MAPK signaling and thereby the activation of MEF2 transcription factor under TLR4 and Cytokine dependent mechanism. Furthermore, the octamer transcription factor 21 (OCT-1) has also been shown to function as a stress sensor modulating the activity of genes important for the cellular response to stress [81]. Although OCT-1 is a ubiquitous transcription factor, it has recently been demonstrated that it promotes cell survival signifying its involvement in the apoptosis signaling [82]. Additional studies [83] document the involvement of octamer binding transcription factors (OCT-1) in regulating the expression of TLR4 in humans; thus making it a critical regulator of Toll like receptor signaling. Furthermore, Forkhead Transcription Factors (FKHD) also play a major role in the control of apoptosis perhaps by affecting the transcription of the gene encoding FASL [84]. Since these regulators can be the substrate of the protein kinase B (Akt) preventing their nuclear translocation, it is expected that FKHD regulators promote cellular survival and thereby control the apoptotic machinery [85]. Moreover, IFN regulatory factors (IRFF) are a family of transcription factors that regulate expression of various pro-inflammatory and anti-inflammatory genes. Research findings reveal a critical role for these interferon regulatory proteins in the control of apoptosis [86,87] while it has become evident [88,89] that such regulators are also essential for TLR gene expression including the trans-acting factors, IRF-1 and IRF-2. This implies that in addition to up-regulation of proinflammatory gene expression, TLR stimulation also results in modulation of TLR gene expression itself via interferon transcription factors.
One of the most important cellular factors involved in the regulation of the host innate immune response is the nuclear factor (NF)-kB which can be activated by a variety of stimuli including bacterial products, inflammatory cytokines and growth factors [75,90]. NF-kB is a pleiotropic transcription factor involved in the inducible expression of a diverse array of genes. As such, activation of the NF-kB signalling module involves not only the early upregulation of pro-inflammatory cytokines but also the transcriptional control of apoptosis [91]. Oftentimes, transcriptional regulation requires the participation of several transcriptional factors through protein-protein interactions, known as transcriptional co-activators or co-repressors. For example, NF-kB encompasses an important family of inducible transcriptional activators critical in the regulation of the gene expression in response to injury and inflammatory stimuli. As such, the CREBbinding protein has been identified as co-activator of the NF-kB component p65 and might play an important role in the cytokineinduced expression of various immune and inflammatory genes [92]. Such observations emphasize the role of the CREB regulator in pro-inflammatory signaling pathways including TLR signaling pathway. Further evidence [93] confers the involvement of overexpressed CREB in inducing apoptosis while the control of FASL induction which mediates programmed cell death in human T lymphocytes [94] appears to be accomplished through a series of regulatory interactions that implicate the role of NF-kB and CREB/ATF pathways [95].
Additionally, there is considerable evidence indicating the role of the early growth response-1 (member of EGR family) in regulating endotoxin induced SOCS-1 transcription [96]. SOCS-1 has been identified as a critical regulator of both adaptive cytokine signaling and innate immune responses and therefore understanding its transcriptional regulation under inflammatory conditions will no doubt be critical in understanding its role in limiting inflammatory responses [97]. Interestingly, these results demonstrate an important role of regulatory members of EGR family in regulating the endotoxin induced activity of the SOCS-1 promoter; thereby validating its presence in our computational predictions. In addition to transcriptional regulation of the antiinflammatory SOCS family, recent data [98] also discussed a critical role of IL10 signaling in SOCS-3 expression which provides for feedback attenuation of cytokine induced immune responses. Other studies [99] document that SP1 regulator may be a central mediator of IL10 induction and thereby it may also play a crucial role in cytokine homeostasis. Accordingly, our method captures the family of stimulating protein 1 (SP1F) as a putative regulator in late phase (resolution) of the inflammatory response. On the other hand, we also observe a significant overlap across various biological processes while comparing these sets of TFs but it is reasonable since transcription factors are characterized by pleiotropic effects [58] (Figure 4). TLR signaling appears to be the principal pathway that initiates the host response to endotoxin and via the cross-talk among other pathways (e.g. Apoptosis, JAK-STAT) amplifies and propagates the inflammatory reaction providing for complex non-linear responses [100].

Dynamic transcriptional regulatory program
The transcriptional regulatory program can potentially show a dynamic reorganization over time. In order to get a dynamic perspective, we focus on the apoptotic regulatory program as an illustration. Apoptosis is tightly regulated process that mainly responds to the initial stimulus followed by a cascade of events that involve the initiation and signal transduction phase. The initiation or preparation phase involves the activation of surface death receptors (extrinsic pathway); for example, TNFR responds to appropriate ligands (e.g. TNF), triggering downstream a signal transduction phase which eventually converges to the activation of effector caspases [101]. On the other hand, programmed cell death is regulated by caspase inhibitors which attenuate apoptosis. Such events would therefore reflect the early, middle and late apoptosis respectively ( Figure 5).
Despite the identification of condition (time) specific regulators, the pleiotropic role of some transcription factors across multiple conditions is also observed. Accordingly, the myeloid transcription factor, MYT1, appears to be a critical pro-apoptotic regulator that affects the progression of the early stage apoptosis. Although considerable evidence [102,103,104] indicates the role of MYT1 in cell cycle regulation through induction of Cdc2 activity, further studies [105] also implicate the activation of cyclin dependent kinases in apoptosis induced by various stimuli. Therefore, such observation implies the possible involvement of Cdc2 regulators including MYT1 in enhancing programmed cell death during the early pro-inflammatory phase (early apoptosis). In addition to regulation of apoptosis, E2F family of DNA binding proteins has also been shown to have the ability to induce not only cell cycle but also programmed cell death [106,107] by regulating the activation of pro-apoptotic genes as seen in Figure 5. Such evidence allows the identification of transcription factors with multifunctional capabilities essential for modifying coupled cellular processes including cell cycle and apoptotic behavior [102].
The retinoid X receptors (RXR) play a critical role in apoptosis. These proteins act functionally as either homodimers or heterodimers with other members of the nuclear hormones receptor family such as retinoid acid receptors (RAR) and peroxisome proliferators activated receptor (PPAR), which are all associated with the regulation of inflammation [108]. In particular, the interplay between RXR and RAR receptors has been shown to induce apoptosis which validates its role as a time specific regulator of early apoptosis [109,110]. Despite the conventional role of NF-kB as anti-apoptotic [111], our results indicate its putative role in mediating cell death by transcriptionally up-regulating early and middle pro-apoptotic genes. Such antagonistic duality for the regulatory role of NF-kB has been recently outlined in [112] signifying its involvement in a proapoptotic fashion by up-regulating the expression of Fas. Reduced Fas expression followed by a reduction in apoptosis was also observed upon endotoxin challenge in relA 2/2 (deficient) rodents suggesting a novel pro-apoptotic function for this protein in Fas induced cell death [113].
Putative regulators of middle and late apoptosis, including both pro-and anti-apoptotic transcription factors have been identified which aim at enhancing and controlling programmed cell death respectively. As such, activating binding protein (AP4R) which according to our results is present in middle apoptosis, has been shown to regulate the expression level of various caspases [114]. On the other hand, CREB binding protein has been shown to play a pivotal role in rescuing from apoptosis promoting cell survival via the induction of anti-apoptotic proteins [115,116] while evidence [117] documents GATA as a novel MAPK substrate that plays an essential role in a cytokine mediated anti-apoptotic response. Based on our predictions GATA appears to be a significant regulator during late apoptosis conferring its role as anti-apoptotic protein. Since the late transcriptional event has been previously identified as a critical anti-inflammatory component, it is expected the regulators of late apoptosis to participate in controlling apoptosis by transcriptionally activating anti-apoptotic genes such as PIK3CG [118]. For example, studies on the STAT regulator have shown that it exerts an anti-apoptotic function required for maintenance of neutrophil homeostasis [119]. Furthermore, FKHD proteins also negatively affect apoptosis signaling through Akt signaling that is implicated in promoting cell survival [85,120]. Both STAT and FKHD binding proteins exert direct regulatory links on the late apoptosis ( Figure 5).
On the other hand, regarding the case that genes are involved in different functions within a particular expression pattern, a possibility relevant to alternative promoter usage can be suggested [50,51]. Besides alternative promoters activated upon tissuespecific or condition-specific context, it is possible to hypothesize that they can be activated upon the coming set of biosignals which are transcriptional regulators, leading to the case that different protein isoforms are translated and thus the gene can be involved in different functions. Taken these together, our analysis provides significant insights on the potential regulatory interactions among transcriptional factors and their target genes which is a crucial step towards quantitative modelling of transcriptional regulatory networks [121].
In order to assess whether coexpressed genes are more likely to be coregulated, we estimate p-values of CRMs in individual gene batteries vs. the corresponding entire pattern of expression ( Table 3). The results show that the estimated p-values values are similar to those calculated for the background set, implying that the entire pattern of coexpressed genes behaves more likely the same as a random background rather than as a set of genes that share a common regulatory mechanism (see Data S1, sheet 'CRMs' and 'Middle-up TLR'). This supports our assumption related to the definition of a gene battery. Such preliminary results indicate that genes that are both coexpressed and functionally relevant are very likely to be governed by an underlying transcriptional regulatory program.

Predicting transcriptional regulators of an in vitro endotoxemia model
In order to assess the stability of our prediction, we applied the analysis to an in vitro human endotoxin model. Data are extracted from a culture of peripheral-blood-derived mononuclear cells stimulated by a high dose of LPS (100 ng/ml) [122]. Clustering approach reveals that there exist five critical transcriptional responses. Three of them characterize inflammatory phases similar to those identified in the analysis of in vivo data including an earlyup response (284 probesets), a late-up response (700 probesets), and a down regulation (226 probesets). Due to a high dose of LPS administration, it would have expected an up-(367 probesets) and a down-regulation (319 probesets) without returning to the base line after 24 hr of LPS administration. Subsequently, a similar analysis of pathway enrichment (using KEGG database) was applied for each set of genes characterizing a transcriptional response. In an overlap with the analysis of in vivo data, we select statistically inflammatory relevant significant pathways (p-value,0.05) that were selected from the analysis on the in vivo human endotoxemia model. Accordingly, nine sets of genes that belong to a specific pathway and a pattern of gene expression were extracted, corresponding to nine genes batteries used to determine critical transcriptional regulators relevant to the inflammatory response in this study (Table 4).
Subsequently, the proposed method has been applied to search for statistical significant CRMs which are decomposed into a list of TFBSs to infer associated TFs that may be functional transcription factors in the regulation of inflammatory transcriptional responses. In a similar manner with the in vivo analysis, TFs that are present with the high frequency among gene batteries (at least three times) are reported ( Table 4). We identify 27 critical TFs of which more than 80% are present in the list of relevant transcriptional regulators found in the analysis of the in vivo data including AP4R, CLOX, CREB, E2FF, EGRF, EKLF, ETSF, FKHD, HOMF, HOXF, IRFF, MAZF, NFKB, NKXH, NR2F, OCT1, RXRF, SORY, SP1F, STAT, TBPF, ZBPF. Given that different dosing amounts of LPS have been applied in two experiments, there may be different genes involved in the response of the same function between the in vivoand in vitromodel, resulting in different TFs involved in the transcriptional regulation of the same gene battery between two cases. However, the significant overlap between two final lists of predicted TFs relevant to inflammatory transcriptional responses provides promising implications of the predictive performance of the method. Therefore, the proposed framework appears to be a robust and valuable methodology to identify critical transcriptional regulators relevant to biological responses under external stimuli (see details in Data S2).

Computational issues
Regarding the computational definition of CRMs, although taking structural constraints (e.g. the strand orientation, the order, and the distance between successive TF-binding sites) into account can reduce the number of false positive matches, a strict definition of structural constraints can increase the rate of false negative Table 3. Statistical significance of selected cis-regulatory modules * .

No.
cisregulatory modules avglen-minlen-maxlen prediction for the existence of CRMs on promoter sequences. Usually there is a pre-defined selection in the length (or window size) and the distance variation between successive binding sites in the composition of CRMs. However, our analysis excludes these parameters since a distance variation of several bases or several hundreds of bases may be invariant in the cells. Instead, we estimate the CRM length based on the length of its instances present on the control regions of corresponding genes in order to select common CRMs and calculate their significance values vs. the background set. Therefore, there is only one adjustable parameter (p-value) which defines the significance level of the resulting CRMs. The statistical significance thresholds for CRMs are selected following the red curve in Figure 2 which is an approximation of the blue curve. The fluctuation of the blue curve is in part due to the random selection of the gene sets as well as due to the round-up to an integral number of the common level d (70%) compared to the gene-set size N (e.g. either N is 6 or 7, there is the same common level for those recognized CRMs -present on at least 5 genes in this case) (see Data S1, sheet 'p-value'). Predicated upon the context-specific nature of the problem as well as a number of other relevant issues (e.g. establishing the criteria to measure the performance of the prediction, building up testing datasets), testing predicted CRMs and/or relevant TFs as 'true' or 'false' remains a challenge [27]. Thus of critical importance is to evaluate and validate the computational results based on literature evidence and those experimentally verified if possible. In this study, we employed a CRM-searching approach, similar to FrameWorker method [28], to identify common CRMs in each gene battery. However, the most critical issue is that a large proportion of mammalian genes possess multiple transcription start sites (TSSs) and therefore multiple alternative promoters regulate gene expression in a context-specific manner [123,124,125]. For instance, in a recent study Singer et al. [126] developed and employed a custom microarray platform to show that there are nearly 35,000 alternative putative promoters present on around 7,000 human genes. As a result, the computational identification of CRMs becomes a combinatorial problem and oftentimes a daunting task due to the large number of alternative promoters of genes in the gene battery. For example, 7 genes that belong to Apoptosis pathway and late-up expression pattern can produce totally 5,600 combinatorial promoter sets; or 10 genes that are in Cytokine-cytokine pathway and late-up expression pattern can create 13,440 combinatorial promoter sets; while complexity further increases in the oxidative phosphorylation group (down expression pattern) characterized by 40 genes and 1,274,019,840 combinatorial promoter sets. Consequently, searching for common CRMs in all promoter combinations is computationally intense. Yet, our novelty heuristic can reduce these complexities into only one running time but still preserve the same result (see appendix, lemma 1). In a similar manner, the strategy of converting promoter sequences into promoter profiles also makes the estimation of the significance of common CRMs vs. a large background set more computationally tractable [56].
Since it is not clear how long the promoter length should be, our computational analysis extracts highly qualitatively defined promoters from Genomatix databases [26] including those with either an experimentally defined length or a default if there is no associated prior length information. This default length (500 bp upstream plus 100 bp downstream the TSSs) is also supported from a recent experiment known as genome-wide open chromatin map [127]. Additionally, we also examined how the promoter length affects the in silico inference of CRMs. Specifically, we count the number of relevant TFs that can be considered as transcriptional regulators for the group of 8 genes that belong to the middle-up expression pattern and the apoptosis pathway. For each specific length of extracted promoters (27 promoters that are relevant transcripts; 100*x upstream and 20*x downstream bases, x from 4 to 10), we applied the same procedure to search for statistically significant CRMs and then infer the list of relevant TFs. The results show that the number of relevant TFs increase linearly with respect to increasing promoter lengths (see Data S1, sheet 'Promoter length'). Thus, including prior information of the promoter lengths is very important to provide reliable computational predictions.
Another important challenge in computationally identifying TFs is associated with the fact that transcription factors can bind to regions far from the TSSs. For example, the P53 factor is a well established regulator for the programmed cell death (apoptosis) [128,129]; however such regulator is not identified as putative TF in the gene batteries relevant to apoptosis pathway. However, if we increase the promoter length up to approximately 1,000 bp P53 is identified within the statistically significant CRMs. This leads to Table 4. Critical transcription factors identified from the in vitro endotoxin study.

No. Patterns
Functions Transcription factors *   1  Early-up  Apoptosis  CLOX, E2FF, EGRF, EKLF, ETSF, FKHD, HOXC, HOXF, IRFF, MAZF, NKXH, NOLF, OCT1, RXRF, SORY, SP1F, STAT,  XBBF   2  Late-up  Apoptosis  CREB, EKLF the hypothesis that P53 might work in a cooperative manner with other TFs that bind to the distant promoter regions. Alternatively, it has been recognized that P53 can affect apoptosis via novel transcription-independent pathways although its role as a mediator of transcription is well established [130,131,132]. For instance, apoptosis can still occur when P53 mutants incapable of acting as transcription regulator are introduced [133,134]. This indicates the possibility that P53 might not directly regulate the apoptotic gene batteries as identified from our analysis. Thus, computational missing P53 as a relevant TF may be a reasonable result rather than a limitation from our computational analysis; yet, it is still a question to us in this study. However, since our analysis only searches for CRMs on the proximal promoters of genes, it should be acknowledged that we may miss some relevant transcription factors that bind to the regions far from the TSSs as well as enhancers that regulate the transcriptional process. Furthermore, we also analyzed the reasons why no statistically significant CRM is found in the down-regulated gene batteries of the oxidative phosphorylation pathway (so-called OXPHOS group). OXPHOS itself is composed of genes that are coexpressed across numerous datasets under different conditions [135,136] and it was proposed as a group of genes that might share a common regulatory mechanism [137]. However, we did not detect any complex-specific arrangement of TFBSs although it is highly enriched by a number of common TFBSs even when the promoter lengths are increased up to 1,000 bp upstream. Although this conclusion is similar to the result of a previous study [137], we note that subunits of each complex in OXPHOS group tend to have tighter coexpression with subunits of the same complex than subunits of other complex which is also proposed and discussed extensively in [137]. Based on the assumption that genes characterized by tightly coordinated expression levels are more likely to share common regulatory elements (proposed and demonstrated in [52]), we assume that genes belonging to the same complex might share some common set of regulatory signals. Therefore, we applied the same procedure of finding statistically significant CRMs on the control regions of those subgroups of genes including complex I -17 genes, complex III -6 genes, complex IV -4 genes, and complex V -13 genes. Eventually, we identified statistically significant CRMs for each complex from which relevant transcriptional regulators can be inferred. As a result, from a promoter analysis standpoint we are highly confident that subunits of each complex in OXPHOS group are more likely to be under a common regulatory mechanism rather than all the genes in the entire group (see Data S1, sheet 'OXPHOS'). However, from a computational standpoint this result raised another possibility related to whether a subset of genes within a gene battery can provide more statistically significant CRMs than the entire gene battery. Assuming that the possibility is correct, this raises two questions including: (i) what is an appropriate size of the subset as well as (ii) how genes in the subset are selected. In order to address this issue, we make a casestudy by randomly selecting a subset of N genes within the OXPHOS group (N = 17, 6, 4, and 13 respectively) and search for significant CRMs. The process is repeated 100 times and the average of minimum significance p-values is calculated. Results show that for N = 4, the average of minimum p-values is comparable to the one with N genes randomly selected from the background set ( Figure 2). Yet, for the other cases the average of minimum p-values is less significant than the ones from the background set, suggesting that random subsets of genes within a gene battery behave more or less similarly to the case from the background set. Certainly, some subsets can provide more significant CRMs than the entire gene battery but how to interpret those selected subsets and the corresponding results remains a challenge. Therefore, it should be emphasized that using prior biological knowledge might overcome some of these limitations.
Our analysis has attempted to reverse engineer the underlying regulatory network of the human blood leukocyte response to a prototypical inflammatory stimulus (endotoxin). Given the transcriptional profiling data of human blood leukocytes, an elementary set of temporal responses with putative transcriptional regulators have been identified. A key feature of the analysis is the exploration of the concept 'gene battery' which represents for a group of genes that are both co-expressed and functional relevant to identify inflammatory transcriptional regulators using a contextspecific searching approach [38]. Novel heuristics regarding to challenging issues e.g. eukaryotic genes consist of multiple alternative promoters leading to a huge computational complexity are also proposed. In order to provide a systematically unbiased in silico approach, CRM structural constraints are also adjusted so that no parameter is required except for the statistical significance thresholds. Furthermore, our analysis also allows for the reconstruction of a dynamic temporal regulatory network, making it a critical enabler for improving our understanding of how the transcriptional machinery 'program' effectively regulates key cellular processes.
Although no single analysis can identify all transcriptional regulators involved in a response, it has been demonstrated that the proposed framework can identify critical TFs that are relevant to acute inflammatory responses. Despite the fact that many methods have been proposed in the literature to search for relevant transcriptional regulators, different approaches explore different biological assumptions resulting to different sets of putative TFs which may or may not significantly overlap each other. Since the true extent of all TFs involved in the regulation of a complex response under some external stimuli is unknown, these differences could not be interpreted as the high-or low-accuracy of the methods. Instead, all of found TFs may be involved in different processes of the response but because of the limitation of the hypotheses used by the methods, they may not be recognized by a certain approach.
Novel methods are still proposed using different analytical approaches but generally they can be categorized into two main directions including mRNA expression-based [138,139,140] and TF binding pattern-based methods [28,29,30,31,36]. The first direction somehow utilizes the fundamental hypothesis that the mRNA expression level of TFs is proportional to their protein concentration but this may not be appropriate especially in higher eukaryotes since TF activation is often regulated post-translationally and acts somewhat in an independent manner of expression level. Some methods also require multiple-condition data as the input which may not be applicable when practical data are only sampled under one condition/treatment [138,139,140]. In the meanwhile, a lot of methods following to the latter direction have been developed e.g. FrameWorker [28], CMA [29], CRÈ ME [30], ModuleMiner [31], CisModule [36], BioMoby [141] etc. of which ours is among them. These are not limited by the mRNA expression proportion hypothesis but they are limited by promoter identification, TF binding profiles, and the underlying assumption to select the input set of 'co-regulated' genes.
In this study, we therefore opt to extend an available computational tool, FrameWorker, to take into account the fact that genes of higher eukaryotes contain multiple alternative promoters exploring the rich information of the Genomatix database on promoters and TF binding profiles. The underlying assumption that coexpressed genes are more likely to share some common regulatory mechanism when they are functional-relevant has been explored to predict putative functional activation of TFs in a specific context. These factors make our method become incomparable or unnecessary to compare with available methods. However, given the future availability of more complete TF binding data and other resources, the method could be enhanced by integrating protein-protein interaction to refine selected CRMs or using other tools to support the selection of relevant functions e.g. Pathway-Express [142]. Since each single method or even each direction always contains its own limitations and advantages, one possibility in future improvements could be the development of a framework to obtain a consensus result under diverse underlying hypotheses from various outputs of different methods.

Human endotoxemia model and data collection
In vivo data. The data used in this study were generated as part of the Inflammation and Host Response to Injury Large Scale Collaborative Project funded by the USPHS, U54 GM621119 [11,143]. Human subjects were injected intravenously with endotoxin (CC-RE, lot 2) at a dose of 2-ng/kg body weight (endotoxin treated subjects) or 0.9% sodium chloride (placebo treated subjects). Following lysis of erythrocytes and isolation of total RNA from leukocyte pellets [11], biotin-labelled cRNA was hybridized to the Hu133A and Hu133B arrays containing a total of 44,924 probesets for measuring the expression level of genes that can be either activated or repressed in response to endotoxin at 0 (before treatment), 2, 4, 6, 9, and 24 hr. Data are publicly available through the GEO Database (#GSE3284). ANOVA technique (p,10 24 ) was then applied to filter significantly differentially expressed probesets, resulting in 3,269 selected probesets [55]. Average expression profiles of probesets over replicates for each time-point were used as the final input data for further analyses [144]. The data have been appropriately deidentified, and appropriate IRB approval and informed, written consent were obtained by the glue grant investigators [11].
In vitro data. Isolated from peripheral blood mononuclear cells collected from three healthy humans, adherent monocytes were cultured for 10 days in RPMI medium 1640 (20% FBS/Lglutamine/20 mM Hepes/penicillin/streptomycin/50 ng/ml macrophage colony-stimulating factor) to generate peripheralblood-derived mononuclear cells [122]. These mononuclear cells were stimulated by 100 ng/ml LPS (Salmonella minnesota R595 ultra pure LPS; List Biological Laboratories, Campbell, CA) and sampled at 0 (before stimulation), 2, 4, 8, and 24 hr. Total RNA was isolated with TRIzol (Invitrogen, Carlsbad, CA) and two samples for each time-point were analyzed using HG-U133 Plus2 Affymetrix GeneChips producing mRNA expression profiles of 54,675 probesets (#GSE5504). Fold change (fold = 2.5) was then applied to filter significantly differentially expressed probesets, resulting in 2,892 selected probesets. Average expression profiles of probesets over replicates for each time-point were used as the final input data for further analyses [144].

Clustering and selection
Utilizing the concept of the agreement matrix (AM) in consensus clustering, we recently proposed a novel method to identify the core set of probesets that are most agreeable in the AM of which they belong to the same or different patterns of gene expression [55]. In order to produce the agreement matrix, a number of different clustering methods along with different metrics (Euclidean, Manhattan, and Pearson correlation) were used to reduce the bias and assumption of any specific clustering method. After identifying the core set of probesets, the AM is reduced correspondingly to those selected probesets and then the hierarchical clustering is applied on the reduced AM to produce a number of gene expression patterns. Subsequently, we applied a trivial-cluster removal procedure and obtain four significant patterns of gene expression which are shown to be critical in the dynamics of acute human inflammation.

Problem definition
In this study, genes that are both coexpressed and functionally relevant are assumed to belong to the same 'gene battery'. The problem of CRM searching can be formalized as follows: given a set of N putatively coregulated genes G~g i f g N i~1 , each of which contains K i alternative promoters g i~p ro ik f g Ki k~1 whereas each promoter is represented by a list of L ik binding sites ('promoter profiles') pro ik~b s ikl f g Lik l~1 and each binding site is a 3-tuple of corresponding transcription factor name f, position p and binding orientation o: bs ikl~v bs f ikl , bs p ikl , bs o ikl w, find a set of M cis-regulatory modules (CRMs) cCRM~crm j È É M j~1 , crm jb s jl È É Mj l~1 that are present as common over a threshold d (70% in this study) on the set of gene promoters (M j is the number of binding sites, yet to be determined, in CRM crm j ). The statistical significance of each commonly recognized CRM vs. a background set of genes is then estimated selecting only significant CRMs. The subscripts i, k, l, j indicate the gene number, the promoter number, the binding site number, and the CRM number respectively. An illustration of the computational framework is presented in Figure 6 while more details are discussed in the following section.

Discovery of TFBSs and promoter profiles
Based on a comprehensive database of promoters -Genomatix [26], a set of transcript-relevant promoters are extracted coupled with multiple alternative promoters and experimental information about the promoter length including those with either an experimentally defined length or a default if there is no associated prior length information (500 bp upstream plus 100 bp downstream the TSSs). MatInspector [20] is then applied to scan for PWM matches on those promoter sequences using optimal parameters from MatBase [26]. In order to speed up the process of discovering CRMs as outlined in [56], each promoter is remodelled with a list of L ik TFBSs ordered by their local positions on the promoter sequences and represented by the corresponding TF name (e.g. NFKB, ETSF) along with the binding orientation pro ik~b s ikl f g L ik l~1 . The conversion aims to answer two basic questions: (i) given a promoter sequence, identify whether a TFBS or a CRM is present on this promoter or not, and (ii) given a gene with K i alternative promoters, determine if a TFBS or a CRM is present on any promoter sequence of this gene. From a computational standpoint each promoter profile is loaded into a hash table whose field 'key' includes the TFBS name plus the binding orientation (e.g. +ETSF, 2PAX6, '+' as forward and '2' as backward binding orientation) and field 'value' is the position list of the corresponding TFBS with the same binding orientation. For example, if the key is '+ETSF' and the corresponding value is '373__386', we know that transcription factor ETSF is forward binding to the promoter at the local position 2373 or 2386 upstream. As a result, to decide the existence of a TFBS including the binding orientation on a promoter the process only makes a quick search in the hash keys. In a similar way, to determine the present of a CRM on a promoter the process will take into account the binding orientation from the keys and the positions from the values of corresponding keys to evaluate the structural constraints (see Data S1, sheet 'Promoter profiles').

Common cis-regulatory modules
Computationally, a cis-regulatory module crm j is a list of M j non-overlapping TFBSs ordered by their positions on the promoter sequence and characterized with their corresponding binding strand orientation. For example, CRM '+NFKB__ 2CREB__2SP1F' consists of three successive TFBSs of transcription factors NFKB, CREB, SP1F with the binding strand orientation forward, backward, and backward respectively. Besides the binding orientation and the position order of TFBSs, CRMs are also characterized by their length. If CRM A appears to be common in a gene battery of N genes, the average length of all instances of A on N genes is considered to be the length of this CRM. In the case that A presents more than one time on promoters of gene i, the length of instance A for this gene will be the minimum one. Subsequently, to estimate the common level of this CRM we only take into account those instances with the length approximate to the average one (e.g. from the half to the double). If the number of such instances over N is higher than a frequency threshold (d = 70% in this study), CRM A is considered as a common CRM of the gene battery.
However, a gene can have multiple alternative promoters and virtually in all cases, it is not known which promoter of the gene is activated. To identify activation of putative promoters, one solution would be to search for all possible combinations of promoters in the gene set. Yet given a set of N genes, each gene with K alternative promoters in average, the total combinatorial number of promoter sets is K N which is computationally intense and sometimes impossible to search for all promoter combinations. Consequently, we propose a novel heuristic where if a TFBS or a CRM is present on any promoter sequence of a gene, it is considered as present on the control regions of that gene. The heuristic results in one-time searching instead of K N but still produces the same results as the brute-force search in all combinations of promoters (see Appendix S1, procedure 'IsPresent' in Algorithms S1 and Procedures S1). Using this heuristic, the main algorithm to search for common CRMs in a gene battery, similar to FrameWorker [28], can be simply described with two primary steps as follows: (1) identify all potential TFBSs that are common in a gene battery and (2) employ the breadth first search technique to search for all possible combination of all commonly found TFBSs in step 1, each of which is a potentially common CRM yet to be determined quickly by the heuristic above (see details in Algorithms S1 and Procedures S1).
Estimating the statistical significance of cis-regulatory modules Due to the fact that a CRM can be present on promoters of many genes in the background set, we estimate the statistical significance of commonly identified CRMs for each gene battery vs. the background set to select those that are significantly overrepresented. Based on promoter profiles, the procedure can directly identify whether a particular CRM is present on any promoter sequence of a gene in real-time despite the large background set of genes. This calculation provides a corresponding hyper-geometric p-value defined as follows: where B and b is the number of genes and the number of hits respectively in the background set which is made up of 5,000 randomly selected genes in human genome; N and n is the number of genes and hits in the gene battery, respectively.

Supporting Information
Data S1 Provide detailed results for the analysis of in vivo data, including common significant CRMs recognized, the detailed data of Figure  Algorithms S1 Restate the problem of discovery CRMs in gene batteries and provide detailed pseudo-code for algorithms, including the procedure 'IsPresent' and the algorithm of the main procedure. (DOC) Appendix S1 Provide an illustration to prove that the set of common CRMs found by the proposed heuristic is the same with the set of common CRMs found from all promoter combinations. (DOC) Procedures S1 Provide source codes of some main procedure (by Perl language) including (1) check the existence of a CRM on alternative promoters of a gene, (2) search for common CRMs in a gene battery, and (3) estimate the hyper-geometric p-value of CRMs vs. the background set. (DOC)