Gene Interaction Network Suggests Dioxin Induces a Significant Linkage between Aryl Hydrocarbon Receptor and Retinoic Acid Receptor Beta

Gene expression arrays (gene chips) have enabled researchers to roughly quantify the level of mRNA expression for a large number of genes in a single sample. Several methods have been developed for the analysis of gene array data including clustering, outlier detection, and correlation studies. Most of these analyses are aimed at a qualitative identification of what is different between two samples and/or the relationship between two genes. We propose a quantitative, statistically sound methodology for the analysis of gene regulatory networks using gene expression data sets. The method is based on Bayesian networks for direct quantification of gene expression networks. Using the gene expression changes in HPL1A lung airway epithelial cells after exposure to 2,3,7,8-tetrachlorodibenzo-p-dioxin at levels of 0.1, 1.0, and 10.0 nM for 24 hr, a gene expression network was hypothesized and analyzed. The method clearly demonstrates support for the assumed network and the hypothesis linking the usual dioxin expression changes to the retinoic acid receptor system. Simulation studies demonstrated the method works well, even for small samples.

Gene expression arrays (gene chips) have enabled researchers to simultaneously monitor the approximate level of mRNA expression for a large number of genes. These mRNA expression levels are one component of the machinery that controls the function and survival of cells; the other components constitute the other major biochemical constituents of a cell such as the actual DNA sequence, protein levels, and cellular substructures. Signal transduction pathways have long been used to describe the sequence of biochemical events that control cellular function and generally include all aspects of the biochemistry of a cell. In the absence of full proteomic data (both primary proteins and modified proteins), it is valuable to understand the quantitative relationship between genes, which we will refer to as gene expression networks. The rates derived from the quantification of gene expression networks provide crude estimates for the overall rates linking genes through complicated signaling pathways. In addition, hypothesized linkages between genes will aid in focusing research efforts in other areas such as proteomics, metabolomics, and toxicologic assays.
We recently used toxicogenomic analysis to examine the response of human peripheral lung epithelial cells to 2,3,7,8tetrachlorodibenzo-p-dioxin (TCDD, dioxin) in vitro (Martinez et al. 2002). Exposure to this persistent environmental pollutant has been associated in human populations with increased risk of lung cancer and chronic obstructive pulmonary disease; therefore, understanding its mechanism of action may provide insights into the risk of persistent human exposure not only to TCDD but to other ligands of the aryl hydrocarbon receptor (AhR). In this study we showed a variety of cell-signaling pathways that exhibited a dose-dependent alteration by TCDD. One observation in this study was an alteration in retinoic acid (RA)-responsive genes. Alterations in RA homeostasis have been observed previously in rodents, leading to a retinoid-deficient state. In addition TCDD exposure in rats has been associated with increased incidence of squamous neoplastic and nonneoplastic lesions including squamous cell carcinoma of the lung and hard palate region the oral mucosa (Kociba et al. 1978). Given that alterations in retinoid signaling can affect the differentiation of squamous epithelia, it is possible that the increase in these squamous lesions may be due to a retinoid-deficient state induced by the alteration in retinoid homeostasis.
Identification of the retinoid-responsive genes in the TCDD microarray analyses suggested a functional relationship between AhR activation and retinoid homeostasis and/or signaling in the human lung epithelial cells. Although such relationships can be tested empirically, invariably a large number of functional relationships are possible within a given microarray data set; therefore, priority setting for functional validation studies is often a challenge. In this article we develop a computational approach for evaluating the likelihood that observed changes in gene expression are due to hypothesized functional relationships. We then test the AhR-retinoid interaction using this method.
Several methods have already been proposed for the analysis of gene expression data. The most commonly used methods rely on description of simple fold increases in expression, phylogenetic tree analyses, clustering methods, classification methods, or combinations of these. Methods have also been proposed to develop gene expression networks using dynamical systems defined by ordinary differential equations (Chen et al. 1999), modified linear regression methods (Gardner et al. 2003), Boolean networks (Akutsu et al. 2000) where gene expression data are converted to two states (ON and OFF), discrete networks (Hartemink et al. 2002), and many others. Bayesian networks (Friedman et al. 2000;Pe'er et al. 2001) have been proposed as a means of identifying gene interaction networks (Imoto et al. 2002;Tamayo et al. 1999) and for predicting protein-protein interactions using a combination of different types of genomic data (Jansen 2003). Many of the available methods are discussed in a recent review article (Lockhart and Winzeler 2000). Few methods exist that combine careful statistical estimation and hypothesis testing with quantitative gene interaction models to provide a systems biology-based approach for the analysis of microarray data.
In this article, a Bayesian network approach (Friedman et al. 2000;Imoto et al. 2002) previously suggested is modified to provide direct quantification of gene expression networks using microarray data for a known network. This analytical approach provides a model that can be used for mechanism-based mathematical Environmental Health Perspectives • VOLUME 112 | NUMBER 12 | August 2004

Definition of Gene Expression Network
The basic concept for Bayesian networks in the analysis of gene expression data has been described previously (Friedman et al. 2000;Imoto et al. 2002;Tamada et al. 2003). A gene expression network consists of a collection of P genes, denoted by X 1 , X 2 ,…X P , linked by weighting functions, w i (θ i ) ( i = 1,2,…P), where the subscript i denotes that this weighting function pertains to the control of gene X i by all genes linked to it and θ i denotes the vector of parameters defining the functional relationship. In cases where the relationship between individual genes is monotonic (i.e., X i either stimulates or inhibits X j but cannot have mixed effect), such a network can be easily represented graphically as in Figure 1. Figure 1 is a simple gene expression network consisting of four genes (squares) and four weighting functions (circles), with lines linking the genes and the weighting functions. Two kinds of lines appear in the model. A line with a bar implies inhibition (e.g., gene X 3 inhibits gene X 4 in Figure 1), and a line with no bar implies stimulation (e.g., gene X 1 stimulates gene X 4 ). No line between genes implies these genes have no direct relationship to each other (X 2 and X 4 are not directly linked). The weighting function combining the effects of genes X 1 and X 3 on gene X 4 is denoted by w 4 (θ 4 ) in Figure 1.
The vector W(θ θ) = [ w 1 (θ 1 ) w 2 (θ 2 ) … w P (θ P )] fully characterizes the functional relationships between genes in a gene expression network and is the target of any estimation effort to identify and quantify a network. The functional form that can be used for any individual w i (θ i ) is not restricted. One example is the log-linear gene expression network.

Log-Linear Gene Expression Network
One of the simplest types of weighting function used to describe a gene expression network is the log-linear weighting function given by the following form: where x j is the observed level of expression (or ratios of expression) of gene X j , β ji is the magnitude by which a change in one log unit of gene X j will affect the level of expression of gene X i , and I ji is an indicator variable describing the direction of the change denoted by β ji , where I ji = 1 for stimulation, I ji = -1 for inhibition, and I ji = 0 for no effect. For simplicity of notation, where we refer to T as the transition matrix and B as the parameter matrix. It is then possible to rewrite Equation 1 in its matrix form given by [2] where θ θ = [ β 11 β 12 …β pp ], and the dot represents element-by-element multiplication of B and T. In the example given by Figure  The transition matrix provides the qualitative structure of the gene expression network, and the parameter matrix quantifies the strength of the relationship between the genes. In the following we use N p (θ θ) to represent a general gene expression network with P genes and N TP (θ θ) to specifically represent a log-linear gene expression network with P genes.

Bayesian Network Estimation Procedure
Like any other biological measurement, it can be presumed that two observations taken from seemingly identical examples may differ because of uncontrolled variables or simple random fluctuation; this difference is traditionally defined as random variation about the mean behavior in a model. With random variation, x = [x 1 , x 2 …x p ] is an observation from a random matrix X = [X 1 , X 2 …X P ]. The simplest method by which random variation can be included in a gene interaction network is to assume that X i is conditional on knowledge of the other X's and θ θ follows a prescribed probability density function. Define X i = [X 1 , X 2 ,…X i-1 , X i+1 ,…X P ] and define f i (X i | -X i , θ θ) to be the conditional density of X i . If a gene has a regulatory effect on gene X i , that gene is referred to as a "parent of gene X i "; in other words, it belongs to the set referred to by Pa(X i ). Hence, for example, in the model depicted by Figure 1, Pa(X 3 ) = [X 1 , X 2 ]. This notation has been used in other cases and in the context of this modeling, the distribution could then be written as f i (X i |Pa(X i ),θ θ). A greater level of statistical complexity is possible by also presuming that the parameters have probability density functions; h i (θ i ) is referred to as the prior distribution of θ i . This formulation places the network defined by N P (θ θ) and the data into the context of classical Bayesian networks (Jensen 1996).
Suppose that we have m sets of microarray data [x 1j , x 2j ,…x Pj ] j = 1,2,…m from gene expression network N P (θ θ), where individual arrays are independent random samples from the joint density function for the genes. The joint density function for the parameters given the gene expression data, denoted g (θ θ| X ), is referred to as the posterior distribution and can be estimated using the Markov chain Monte Carlo (MCMC) method (Hastings 1970). In the examples given in this article, the Metropolis algorithm (Andrec and Prestegard 1998) is used to sample from the MCMC to generate samples from the joint density.

Specific Cases Used in This Analysis
In all analyses that follow, the gene expression network is presumed to be a log-linear network defined by N TP (θ θ) in Equations 1 and 2. It is assumed that data arise from microarrays using a relative comparison between two samples (no change results in a value of 1, increased expression > 1, decreased expression < 1), and the distributions for the log of the individual relative gene expression levels conditional on knowledge of T,θ θ and the other X's, , are assumed to be normal, with mean defined as the exponent of e in Equation 1 and with standard deviation (SD) σ. All parameters in θ θ = (B,S ), where S = [σ 1 , σ 2 ,…σ P ] are assumed to have prior distributions (normal for the elements of B and uniform for the elements of S).
Assume that the structure of T (transition matrix) is known without error. In this situation, the qualitative relationship between genes in the gene expression network is known. Taking Figure 1 as an w 1 (θ 1 ) Figure 1. A simple gene expression network consisting of four genes and four nonzero functional relationships.
example, expectation of each log (X i ) (i = 1,2,3) becomes E [log(X 1 )|T,B, - log(X 1 ) + β 23 log(X 2 ), and E[log(X 4 )|T,B, -X 4 ] = β 14 log(X 1 ) -β 34 log(X 3 ). The ultimate goal of defining a Bayesian network is to derive the posterior distribution for the parameters of interest. To derive the posterior, we must first calculate the conditional likelihood of the data, denoted L N [X |N TP (θ θ)]. The likelihood is the product of the individual conditional densities and is written In the MCMC analysis, we must assume a mean and variance for the prior normal distributions for the β's and bounds on the prior uniform distributions for the σ's. Several options were chosen for the prior means of the β's and an uninformative SD (10) was chosen for the prior variance. To develop bounds on the prior uniform distributions for the σ's, SDs were calculated for each gene across replicates, and the maximum SD observed was multiplied by 2 to set the upper bound, with 0 set as the lower bound. Given these priors and the data, MCMC iterations for each data set analyzed are run until the estimates for the posterior distributions for the β's and the σ's are stabilized.
Other distributions and methods could be used to define the priors and generate the posterior distributions for the likelihood and the parameters in the model. In considering a more complicated functional relationship between genes, Michaelis-Menten-type equations could be used to develop networks with restricted maximum and minimum linkages. Such networks would require substantially more data.
A user-friendly software package for these analyses is available from the corresponding author. Martinez et al. (2002) evaluated the change in expression of 2,091 genes in triplicate samples of HPL1A and A549 cells exposed to differing levels (0, 0.1, 1.0, and 10 nM) TCDD for 24 hr. Total RNA was extracted and, using methods described by Martinez et al., hybridized to NIEHS Human ToxChip, version 1.0 (http://dir. niehs.nih.gov/microarray/chips.htm) to obtain changes in gene expression in dioxin-treated cells (one channel) relative to the controls (second channel). They identified 68 genes that were altered in at least one cell line and 15 genes that were altered in both cell lines. Of these, they identified 11 genes that appear to be involved in the effects of TCDD on the retinoid-signaling pathway. In this article we hypothesize a gene interaction network defining the quantitative role of TCDD in altering retinoid signaling based on the current available literature. The data for these 11 genes from the HPL1A cells and the hypothesized network are analyzed using the methods described above.

Dioxin Analysis
2,3,7,8-Tetrachlorodibenzo-p-dioxin is a known human carcinogen, a suspected teratogen, and highly toxic in most mammalian species. There has been considerable speculation that TCDD alters the retinoic acid receptor (RAR)-dependent signaling pathway via alteration of the synthesis and metabolism of RA. Microarray data (Martinez et al. 2002) on changes in gene expression in HPL1A lung airway epithelial cells after exposure to TCDD at levels of 0.1, 1.0, and 10.0 nM for 24 hr identified 11 genes with significant changes at the 99% confidence level. The gene identifiers and data are given in Tables 1 and 2.

Toxicogenomics | AhR and RARB gene interaction network
Environmental Health Perspectives • VOLUME 112 | NUMBER 12 | August 2004  Table 1. Description of genes included in the gene interaction network shown in Figure 2.
Gene symbol (alternate symbols) a Accession no. a Gene name a Biological role Aldehyde dehydrogenase 3 family, memberA1 May play a role in the oxidation of lipid aldehydes, especially those generated by lipid peroxidation (Vasiliou et al. 2000); is induced in rat liver by TCDD (Unkila et al. 1993 (Varanasi et al. 1994); changes in this gene are likely to affect endogenous levels of fatty acids known to activate the retinoic X receptor, thereby modulating gene expression (Issemann et al. 1993) Figure 2 hypothesizes a gene interaction network linking the traditional TCDDinduced genes and genes in the RARdependent signaling pathway. Vitamin A (retinol) is taken up from blood and binds to the CRBP in the cytoplasm. Retinol and alcohol dehydrogenases convert the sequestered retinol to retinal, which is then converted to RA by retinal dehydrogenases such as ALDH6 (Rexer et al. 2001). It is also possible that cytochrome P450s such as CYP1A1 may also convert retinal to RA (Zhang et al. 2000). Once RA is synthesized, it binds to cytosolic RA binding proteins (such as CRABP). RA enters the nucleus, where it binds to two types of ligand-activated nuclear transcription factors, the RA receptors (e.g., RARB) and the retinoid X receptors. Several groups have hypothesized that changes observed in RA levels from dioxin exposures are mediated through increased metabolism of retinal to RA through retinal dehydrogenases or cytochrome P450s or both (Schmidt et al. 2003). Using these data together with the known AhR gene battery, we developed a hypothetical gene interaction network (Figure 2).
The predominant linkage to RAR is through upregulation of ALDH6 and CYP1A1, which synthesize RA. TCDD alters the metabolism of all-trans-RA (Schmidt et al. 2003), suggesting the linkage between ALDH6 and RARB in Figure 2. RARB has been shown to play a role in the inhibition of cellular replication (Sun et al. 2000). RARB is assumed to modify the expression levels of four genes: ELF3, NCOA2, ZNF42, and CDKN1A. These genes have been shown to be parts of the differentiation pathways of various cell types and are hypothesized to be modified by changes in the RA-signaling pathway. ELF3 is an epithelial-specific transcriptional regulator that may play a role in lung carcinogenesis (Tymms et al. 1997). NCOA2, also known as GRIP1, interacts with the five steroid hormone receptor types (Hong et al. 1997;Schmidt et al. 1998). ZNF42, also known as MZF-1, is a putative transcriptional regulator induced by RA in human myeloid cells (Hromas et al. 1991). CDKN1A is induced by RA through RARB in human neuroblastoma tumors (Cheung et al. 1998;Liu et al. 1996). Both ALDH6 and RARB affect the regulation of NCOA2, which in turn alters the regulation of ZNF42 and CDKN1A. ACOX1, the human peroxisomal acylcoenzyme A oxidase, is hypothesized in the model to upregulate both RARB and NCOA2. ACOX1 is the first enzyme of the fatty acid beta-oxidation pathway (Varanasi et al. 1994), and changes in this gene are likely to affect endogenous levels of fatty acids known to activate the retinoic X receptor, thereby modulating gene expression (Issemann et al. 1993). The second major linkage occurs between cytochrome P4501A1, CYP1A1, and RARB. An inducer of CYP1A1 (β-naphthoflavone) induced the metabolism of all-trans-RA in human intestinal epithelial cells (Lampen et al. 2000). CYP1A1 is upregulated by TCDD (Portier et al. 1993), suggesting the linkage between TCDD and genes in the RAR-signaling pathway such as RARB, NCOA2, and CRABP, a specific carrier protein for vitamin A that influences metabolism of RA and increases the sensitivity of a cell to vitamin A signaling (Boylan and Gudas 1992;Ong 1987). The CRABP promoter contains an enhancer region through which RA inhibits CRABP transcription (Means et al. 2000).
The slope parameters for all the linkages between genes (β ij in Equation 1) in Figure 2 were estimated using the Bayesian gene interaction network approach as described above. Prior probability distributions for the log gene expression values were assumed to be normally distributed with a mean of zero and a variance of 1. The SDs (σ 1 , σ 2 , …σ P ) were assumed to have uniform priors ranging from zero to two times the largest SD observed for  any one gene (an uninformative prior). A total of 100,000 MCMC samples were obtained, and the last 80% (80,000) were used to estimate the posterior density of the parameters. Although no formal MCMC stopping rule was used, analyses of the last 80,000 MCMC samples clearly supported convergence. Figure 2 also illustrates the type of results routinely obtained in Bayesian analyses. The four histograms shown in Figure 2 are the empirical posterior densities for 4 of the 19 linkages. The linkage between TCDD and CYP1A1 (A) has a distribution for which there are virtually no values below zero, indicating a strong statistical relationship in these data. The mean value, 0.269, indicates the degree of change in CYP1A1 expression as a function of the change in TCDD concentration. The means, SDs, and percentages of values below zero are summarized for all 19 linkages in Table 3. The distributions for the variances are presented in Table 4. Other uninformative priors were tried with no significant alteration in the results presented in Table 3. In our Bayesian analysis we estimate the posterior distributions for each parameter given the data. If a distribution for a given parameter has a small probability of being < 0 (such as ≤ 0.1), that parameter supports a linkage between genes.
The network depicted in Figure 2 was developed to test the hypothesis of a linkage between dioxin-responsive genes CYP1A1 and ALDH6, and the RAR-signaling gene RARB. The distribution for β CYP1A1 → RARB had a substantial mass less than zero (26% < 0, Table 3), suggesting a lack of support for the linkage between changes in message for these two genes.
Similar results were seen for β ALDH6 → RARB (20% < 0, Table 3). Examination of the joint density for β CYP1A1 → RARB and β ALDH6 → RARB suggested a negative correlation, indicating that the data may not support both linkages simultaneously. This is not surprising, as they are both acting upon the same component of RA synthesis. By forcing β CYP1A1 → RARB = 0 and again estimating the remaining parameters, we can examine the distribution of β ALDH6 → RARB under the condition that the other linkage is not present; in this case, β ALDH6 → RARB had no estimates less than zero (0%) and there was no change in the posterior distribution for the log-likelihood, suggesting almost no change in the fit of the network to the data, even though we dropped the linkage between CYP1A1 and RARB. Conversely, we can set β ALDH6 → RARB = 0 and examine the distribution of β CYP1A1 → RARB ; here also we see 0% < 0 and no change in the log-likelihood. These two analyses support the hypothesized linkage between TCDD-responsive genes and RARB-responsive genes, but only through either ALDH6 or CYP1A1, not both. Finally, setting both β CYP1A1 → RARB = 0 and β ALDH6 → RARB = 0 significantly shifts the distribution of the posterior loglikelihood to smaller values (10% reduction overall), suggesting that at least one of these linkages is needed to explain these data.
The only other linkage that did not appear to be supported by these data was the hypothesized linkage between NCOA2 and ZNF42. The distribution for β NCOA2 → ZNF42 had a mean estimate of zero, with 48.5% of the estimates less than zero. Assuming β NCOA2 → ZNF42 = 0 had no impact on the log-likelihood, suggesting this linkage was not needed in the model and that there was no correlation offset with other parameters. Given the sample size and the number of genes in the network, it is surprising that all other linkages appeared to be supported by these data, with the percentage of β values less than zero ranging from 0% for several pairs

Simulation Studies
Although the TCDD example is illustrative of the method, it does not address how well this method works under diverse conditions; this is best addressed by Monte Carlo simulations. One thousand (1,000) simulated experiments from the simple four-gene network in Figure 1 were generated by the computer using sample sizes of 50, 25, and 10 gene chips in each experiment. Twenty-two combinations of the model parameters (θ = [β 13 , β 14 , β 23 , β 34, σ 1 ,σ 2 ,σ 3 ,σ 4 ]) were considered. For each simulation, posterior distributions were calculated and summarized by their means, medians, and SDs. The MCMC process used was identical to that used for the dioxin example, with the exception that only 8,000 iterations of the Metropolis algorithm were performed, and the last 20% (1,600) values were used to calculate the summary statistics. Multiple runs with different starting points were used, with no difference in the final results (not shown). Table 5 provides representative results from two of the simulation studies. The results indicate that, when sample sizes are sufficiently large, Bayes estimates of the model parameters appear to be close to the assumed value. When sample size is reduced, SDs of the β's become larger, going from 0.2 to 0.45 as the sample size drops from 50 to 10. However, estimation itself seems to be unbiased, even in the case of only 10 replicates. In the second example Toxicogenomics | AhR and RARB gene interaction network Environmental Health Perspectives • VOLUME 112 | NUMBER 12 | August 2004  Table 3. Type of linkage, mean, SD, and percentage of the posterior distribution below zero for all gene-gene relationships in Figure 2.

From
To Type Mean SD % < 0  in Table 5, one parameter was set to zero, providing a case where there is no linkage from gene 1 to gene 3. In this case, we see that the estimation for a nonexistent link is approximately zero, and one could discard this link. Similar results were seen for all the cases studied.
To further challenge the estimation procedure, an eight-gene network was simulated and estimated. This network had 18 parameters and was a greater challenge to the Bayesian method. Because of the increase in the number of parameters, SDs were substantially larger than in the fourgene model, but the estimation was still effectively unbiased.

Discussion
Many methods have been developed for the analysis of gene expression microarray data, but few methods exist for using these data to quantify the interrelated behavior of genes within gene interaction networks. Most network-based methods are focused on network identification, not quantification. Given a hypothesized gene interaction network, this article develops and demonstrates the use of Bayesian network models as a tool for the analysis of a network using microarray data. The method allows for evaluating the strength of relationships within a hypothesized network and could also be used to test for additional linkages within the network.
There were two key points raised by these analyses. First, the application of this quantitative approach to the experimental data on TCDD effects in human lung epithelial cells clearly identified two subnetworks as significantly related to the AhR battery and the retinoid signaling. This indicates that the observed gene expression changes are consistent with the underlying hypothesized mechanism of action. In one sense this represents an alternate validation step in a tiered approach to evaluating microarray analyses. For example, ZNF42, although annotated as a retinoid-responsive gene, has not been previously validated as a retinoid-responsive gene in this cell system. The quantitative modeling suggests a highly significant relationship between ZNF42 expression and other genes in the retinoid-signaling subnetwork, which provides confidence that its alteration was indeed due to activation of the retinoidsignaling pathway. The testing of other subnetworks within a given data set can further serve to increase confidence that inferences on relationships between genes obtained from other types of analyses (evaluation of gene annotation, clustering, pathway analysis, informatic-based network mapping, literature searches) are real.
The second point from these analyses is that we were able to test the interaction between the two subnetworks (AhR and retinoid) and illustrated that a functional relationship was likely real. Such an analysis is useful in that it supports further testing of this mechanism experimentally. It may be that some fraction of the toxicity associated with chronic exposure to TCDD could be the direct result of TCDD-induced increases in RA in the cells. The hypothesized network clearly supports a significant change in gene expression associated with signaling through the RARB pathway. The quantitative linkages observed in this experiment are unlikely to hold for an in vivo system but suggest that an experiment exposing laboratory animals to TCDD, which includes both TCDD and RA measurements with gene expression measurements, would be useful. Two recent experiments address these issues to a limited extent. Schmidt et al. (2003) examined RA levels and changes in expression of CRBP1 in male Sprague-Dawley rats and saw significant changes in RA levels in kidney, liver, and serum, and a marginal change in liver CRBP1 after 28 days. They did not examine any of the genes in the network shown in Figure 2, so it is difficult to compare directly with our results. Johnson et al. (2004) used in vitro data from three experiments with AhR ligands activating genes in the heart, kidney, and thoracic aorta of mouse embryos. They used an exhaustive search of three linkages for each gene to identify the most likely gene-gene interactions. They also identified linkages to genes in the RA-signaling pathway (IGFBP-3 and IGFBP-6), but again, not the specific genes used in Figure 2.
The simulation experiments were different from the analysis of the TCDD study. In the TCDD study, the network linkages were perturbed to cause significant quantitative changes in expression, which then could be used to quantify the linkages between genes. In contrast, the simulation study used only the random variation in expression levels to quantify the network. The simulation studies indicate that the proposed method appears to be unbiased and, on average, produces the correct results. However, sample size could be a problem for small experiments with minor changes in gene expression. When the sample size is only 10 microarrays, the SD can be large relative to the expected value of the linkage between two genes, suggesting one might misinterpret a linkage as having little statistical support. This problem gets worse as the number of genes in the network increases. In contrast, large sample sizes of 50 microarrays are unlikely to have this problem.
Directed changes in the network, as in the dioxin experiment, can help overcome this problem and allow the quantification of significant linkages by as few as nine microarrays. To address this question, two additional simulations were conducted. Using the network shown in Figure 2 and the parameters estimated for the TCDD network shown in Table 3, we simulated 500 data sets consisting of nine microarrays-three for each dioxin dose; that is, we replicated the experiment 500 times using the predicted model. On average the resulting parameter estimates were identical to those observed from fitting the original data but appeared to have a slightly smaller SD than that estimated in the model. This decrease in SD could indicate a degree of Toxicogenomics | Toyoshiba et al.

1222
VOLUME 112 | NUMBER 12 | August 2004 • Environmental Health Perspectives Table 5. Mean, median, and SD from two simulation studies of the simple four-gene model (Figure 1). model misspecification, as the simulated data appear to fit better than the observed data. In addition, whereas the observed data showed a nonsignificant linkage between CYP1A1 to RARB, 48% of the simulated data sets found this linkage to be significant. Similarly, 54% found the linkage between ALDH6 and RARB to be significant. In contrast, the simulations found a significant linkage between NCOA2 and ZNF42 in only 6% of the cases (hence the Type I error appears to be good) and between TCDD and CYP1A1 in 100% of the cases (power is high). In a second simulation, the network shown in Figure 2 was again simulated, this time without TCDD included in the experimental design and using just random variation in the genes to produce the data. Again, the results were unbiased, but the SDs more than doubled. In addition, the probability of observing a significant linkage was reduced by about 20% for most linkages. This illustrates the value of stimulating the system when trying to identify gene interaction networks.

Sample
Clearly, this type of modeling approach is limited in terms of interpretation. First, the model cannot be cyclic; hence, increases in CRABP as a function of RARB that might then result in greater binding of RA in the cytosol, reducing RARB expression, could not be included. Given time-course data, it could be possible to explore this linkage using a more complicated modeling form or some other method of analysis such as semicyclic Bayesian networks. Second, the method is dependent on a parametric model, and the choice of this model could impact the overall findings from the analysis. For example, if certain genes reached their maximal expression at lower doses of TCDD, the use of a log-linear model could underestimate low-dose changes while overestimating high-dose changes. This, in turn, could lead one to accept or reject a given model incorrectly. It should be noted that this type of criticism applies to all the other network analysis methods as well. Finally, although not seen in this analysis, it is possible that the resulting distributions for the linkages between the genes could be sensitive to the choice of prior distributions, and one should be careful to evaluate if such an impact might exist with the data.
Although the approach presented here involves only gene expression data, it can easily be expanded to include other data relevant to the linkages between genes and the quantification of signal transduction pathways in cells. Data quantifying protein levels in cells could easily be folded into a general likelihood, linked via a similar model, and analyzed to quantify the entire network. Such an approach leads to rational, mechanism-driven simultaneous analyses of genomics, proteomics, and metabolomics data. In addition, the networks identified through this type of analysis can easily be combined with other mechanism-based mathematical models such as physiologically based pharmacokinetic and pharmacodynamic models to present a true, systems-biology approach for the quantification of risks from exposures to xenobiotics like dioxin. This analysis would form one module of an overall model for TCDD toxicity. For example, if microarray data were available in rats exposed to TCDD, existing models like that of Kohn et al. (2001) could easily be linked to the gene interaction network discussed above. These, in turn, could be linked to cancer data using a mechanistic model to test hypotheses regarding cancer incidence and the mechanisms involved, as shown by Brooks et al. (1999).
The method proposed here is not restricted to the log-linear model used in this analysis, nor is it linked to the statistical likelihood chosen for the analysis. Other models such as dynamic models (Chen et al. 1999) and other statistical likelihoods (Wolfinger et al. 2001) could easily be incorporated into the analysis methods.
Bayesian networks have been used in a number of settings to provide insight into the complicated linkage between variables that interact. Quantifying the distributions linking genes into networks and expanding this to include proteins and protein modifications will make it possible to quantify the impact of a given chemical agent on the signal transduction pathways in a cell. Although many different methods could be used for this, Bayesian networks have the advantage of flexibility, which will make it possible to build on existing knowledge while bringing new data into the analysis. For the dioxin study presented here, the limitations of the sample size preclude an overall conclusion concerning the validity of the final model for predictions about the role of dioxin in changes to the RARsignaling pathway. However, this analysis has strengthened the underlying hypothesis that changes in RAR signaling may play an important role in dioxin-mediated toxicity and suggest a number of experiments that could lead to a better-characterized network; this is left for future work.
In this article we used known scientific inferences and gene annotation to develop the initial tested network. This approach can also be applied to evaluating the likelihood of any hypothesized network developed by other approaches. As such, it can be applied to networks developed using other types of analyses including Bayesian, Boolean, and informatics-based approaches, as well as other known networks in the scientific literature. The ability to test hypotheses in the context of the network and to build modules that can be quantitatively linked to toxicity are first steps in a true systems-biology approach to mechanism-based use of genomics in risk assessment. This analysis is unique in that it directly addresses these uses.