Integrated Analysis of Gene Expression and Genotype Variation Data for Chronic Fatigue Syndrome

In the past few years, high throughput technologies, such as gene expression microarrays and genotyping techniques, have provided efficient ways to measure gene expression levels and genotype variation on a genome-wide scale [Schena et al., 1995; Howell et al., 1999]. Various approaches have been proposed to analyse gene expression data and genotype variation data, in order to discover the complex network of biochemical processes of complex diseases such as chronic fatigue syndrome (CFS) [Presson et al., 2008]. In the analysis of gene expression data, for example, the identification of differentially expressed genes between two groups has been of great interest, and various statistical tests have been conducted [Ghazalpour et al., 2008; Brem et al., 2002; Kang et al., 2008]. In analysing genotype variation data, logistic regression has been commonly used to model the relationship between binary clinical outcomes and discrete predictors, such as genotypes [Henshall & Goddard, 1999; Coffey et al., 2004].

level and disease.The ISM consists of two steps.The first step is to determine the causal relationship.Based on the causal relationship determined at the first step, the second step identifies significant gene expression traits of which the effects on disease status or the responses to disease status are modified by the specific genotype variation.By applying the ISM procedure to a CFS dataset, we identified a list of potential causal genes for CFS, and found an evidence for a difference in genetic mechanisms of the etiology between CFS patient and control groups.
Our ISM analyses considering the different levels of data simultaneously, allowed us to elucidate disease susceptibility and differentially expressed genes of genetically different individuals.Some results even showed that integrating genotype and expression data may help the search for new directions for the treatment of CFS that are not being detected by using only one type of data.The integrated analysis provided more information than the two separate analyses of gene expression data and genotype variation data for characterizing CFS that has several possible causes.

From genotype to phenotype
In the era of the genome project, the belief came with was that we would answer the questions on how the genes function and how they are related to diseases.The genome project successfully sequenced DNA of various species, including the human.Not only sequencing the genomes, many studies have also identified the gene functions by modifying individual genes in several animals and plants.However, many questions remain unanswered.We still do not know the functions of numerous genes, whether thay are annotated or un-annotated.Especially predicting what genes are associated with diseaserelated phenotypic variants is of particular interest and still in vague.The problem is complicated, because i. most phenotypes of medical interest are complex diseases, i.e., more than one gene or environmental effect contributes to the phenotype occurrence, ii. the underlying molecular mechanism regulating cellular functions is complicated, and iii.little genotypic data (or information) of disease-related phenotypes is available.
High throughput technologies advance for acquiring genome-wide genotyping data of many individuals with and without disease phenotypes.It is of a particular interest to segregate genotypic difference between disease-affected individuals and controls.The variation of genotypes comes from additive and epistatic effects of alleles across multiple genes, resulting in many individuals with phenotypes.Some combinations of genotypic variants result in enhanced traits, whereas other combinations are deleterious to fitness in specific environments.Phenotypic alterations are usually in matters of amount, rather than in the presence or absence of a trait.The field of statistical genetics has developed various methods and tools to map such quantitative traits to regions of chromosomes.These chromosomal regions are known as quantitative trait loci (hereafter QTLs) and are described in terms of the percentage of the variation of a trait that can be attributed to each region.
Integrated Analysis of Gene Expression and Genotype Variation Data for Chronic Fatigue Syndrome 49

Quantitative Trait Locus (QTL)
Quantitative traits refer to the characteristics or phenotypes that are quantitative, i.e., vary in degree or continuously, such as height, while dichotomous or discrete traits have two or several characteristic values.A QTL is a specific region of DNA that is associated with these quantitative phenotypic traits.The number of QTLs that explain the variation in the phenotypic trait tells us more about the genetic structure of a specific trait.For example, the research related to QTLs could provide further information about the genes that control human height.

xQTL: Various types of QTL mapping
Microarray technology has elucidated the genetics of gene expression in human populations.It has been less successful to identify genes in underlying diseases by using molecular profiling tools.Since too many genes have been identified to be associated with disease traits, determining and verifying which genes are the true disease-causing genes have been difficult.
Recently, microarray techniques have been combined with genotyping technology to facilitate the identification of key drivers of complex diseases.Figure 1 represents this approach, treating relative transcript abundances as quantitative traits when segregating populations.In this method, chromosomal regions that control the level of expression of a particular gene are mapped as expression quantitative trait loci (eQTL).

Integrative analysis
Fu et al. provided the first system-wide evidence for phenotypic buffering in Arabidopsis [Fu et al., 2009].Their approach consisted of three steps.Step 1 performed QTL mapping for transcript, protein, metabolite and phenotypic trait data.Then, Step 2 computed significance thresholds for detection of QTL hotspots per level, and finally, Step 3 detected hotspots that appeared across multiple levels.In particular, at Step 2 permutation analysis was used to compute significance thresholds for detecting QTL hotspots.For each of the 250 permutations, all > 40,000 traits were analyzed in order to map QTLs and the most significant marker for each QTL was stored.The number of significnat QTLs were counted over all traits for each marker, and the significant thresholds for hotspot detection per level were derived.For system-wide or multiple level QTL hotspots, Step 3 used the observed QTL hotspots and permutation analysis to compute significance thresholds for detecting QTL hotspots that appeared at multiple levels.Using the results obtained from per-level analysis, the markers per level were ranked from the one with the highest number of traits mapping to it, to the one with the lowest.Then, a rank-product test was performed to find markers that ranked significantly high at multiple levels [Breitling et al., 2004].For each permutated sample, the p-value was computed for the rank-product test at each of the 144 markers, and a threshold was derived for hotspot detection by the procedure controlling 5% of the false discovery rate (FDR) [Benjamini & Hochberg, 1995].
Using this approach, 162 recombinant inbred lines (RIL) of Arabidopsis thaliana were profiled for variation in transcript, protein and metabolite abundance, and were mapped to QTL for 40,580 of these molecular traits.Only six QTL hotspots were found which underlied variation in 16% of the transcript traits, 25% of the protein traits, 55% of the metabolite traits If the genotype at a certain locus is associated with the phenotype of a certain gene, this DNA region might contain a regulator of the target gene expression.It could be any functional nucleotide sequences such as protein-coding regions, microRNAs and cis-regulatory DNA motifs.The same individuals of a selected population have to be genotyped and phenotyped first.Based on the genotyped data (e.g.SNP), selecting markers that are polymorphic in the study population is in need.Then, at the heart of every eQTL study is the correlation of genotype patterns with expression levels in a genetically diverse population.The simplest mapping strategy is to split the population based on the genotypes at a specific marker and check if the expression levels of a given gene are significantly different between the two groups [Ghazalpour et al., 2008;Brem et al., 2002;Kang et al., 2008].
There have been many approaches to elucidate the variants affecting phenotypes, for example, Lan et al. explored correlation of expression profiles across a genetic dimension, namely genotypes segregating in a panel of 60 F 2 mice derived from a cross used to explore diabetes in obese mice.By combining the correlation results with linkage mapping information, they identified regulatory networks, made functinoal predictions for uncharacterized genes, and charaterized novel elements of knwon pathways [Lan et al., 2006].However, their approach did not provide any information about causality relationships among expression profile, genotype and disease.
The mixture over markers (MOM) model proposed by Kendziorski et al. combinds a transcript-based (TB) approach, refering to the repeated application of any single-phenotype mapping method to each mRNA transciprt, and a marker-based (MB) approach, refering to the repeated application, at each marker, of any method for identifying differnetially expressed transciprts [Kendziorski et al., 2006].They applied two MB approches: an empirical Bayes approach and an approach based on the Student's t-test.The MOM model is motivated from the fact that separate tests are conducted for each trascript-marker pair, and each measures evidence that the transcipt maps to that marker relative to evidence that it maps nowhere.Since a trancript can map to any of various marker locations, the evidence that a transcript maps to a particular marker should not be judged relative only to the possibility that it maps nowhere, but rather relative to the possibility that it maps nowhere or to some other markers.This model was proved useful in improving the specificity of eQTL identifications, but used only genotype variation and gene expressino data rather than disease status or trait data.
A gene-set approach based on weighted gene co-expression network analysis (WGCNA) by Presson et al. constructs a co-expression network, identifies trait-related modules within the network, uses a trait-related genetic marker to prioritize genes within the module, applies an integrated gene screening strategy to identify candidate genes and carries out causality testing to verify and/or prioritize results [Presson et al., 2008].Their work includes steps to identify trait-module association and trait-related genetic marker association, but does not provide the model-based statistical tests.
The step-wise approach proposed by Schadt et al. includes i) identifying pair-wise relationships among genotype variation, gene expression, and a complex trait, respectively investigated by identifying QTLs for the complex trait, ii) selecting gene expression traits correlated with the complex trait, iii) detecting eQTL, which overlap the identified QTL, for the selected expression traits; and iv) the likelihood based causality model selection (LCMS) test to identify the causal relationships of the genes detected with overlapping loci [Schadt et al., 2005].

Two-step integrative analysis
Schadt et al.'s approah has two major limitations.First, although the filtering step is effective in reducing the search-space, it might result in more false negatives than exhaustive search approahes in detecting causal relationships of the genes, espeically when a true causal relationship exists based on the interaction effects among genotype, gene expression and a trait of interest, but any pairwise association is weak.Second, the model does not comprehensively handle the interaction effects, which might cause different disease susceptibility.Therefore Lee et al. proposed a two-step integrative approach handling with exhaustive search and interaction effects based on LCMS test [Lee et al., 2009].In this section we provide a detailed review of the Lee et al.'s two-step procedure integrating genotype data, gene expression and clinical data, and thus elucidating mechanisms underlying disease susceptibility and progression [Lee et al, 2009].

Introduction
In figure 3, the two-step procedure is presented to illustrate the integration method based on causal relationship among the three different levels of data.In the first step, the most appropriate causality models are selected to understand the causal relationship among genetic variation, gene expression level, and disease for each gene expression-genetic variation combination.In the second step, significance testing is carried out based on a Fig. 3. Two-step procedure illustration of Lee et al.'s.In the first step, for each gene expression-genetic variation combination, the most appropriate causality models are selected.Then in the second step, significance test is carried out based on a statistical model for each combination according to the model selected in previous step.
Integrated Analysis of Gene Expression and Genotype Variation Data for Chronic Fatigue Syndrome 53 statistical model for each combination, such as logistic regression and a two-way analysis of variance (ANOVA), according to the causality model selected from the first step.Through these tests, gene expression traits whose effects on disease status or responses to disease status are modified by the genotype variation effects.

The first step: Causality model selection
The possible causal relationships among genetic variation, gene expression level and disease trait, can be summarized as three models.Figure 4 represents three simple models.Causal model assumes the simplest causal relationship with respect to mRNA expression, in which QTL acts on disease through transcript.Reactive model is the model with respect to mRNA expression, in which mRNA expression is modulated by disease.In independent model, QTL at a specific locus acts on these traits independently.Then, the best model was selected for each SNP-transcript combination, by choosing the model with the smallest Akaike Information Criterion (AIC) value which can be used to compare different models [Akaike, 1974).Lee et al. and Schadt et al. both assumed standard Markov properties for the simple graphs (Fig. 4), the joint probability distributions for the three models are as follows: • Causal Model: P S, R, D =P S P R|S P D|R , • Reactive Model: P S, R, D =P S P D|S P R|D , • Independent Model: P S, R, D =P S P R|S P D|R, S , where S represents a genotype variation, R gene expression, and D disease status.P(S) is the genotype probability distribution for marker S and is further assumed to be codominant.P(R|S) and P(R|D) are the conditional probabilities of R given genotypes S and disease status D, respectively.Lee et al. further assumed that the random variable R follows conditional normal distribution, and the random variable D has a binomial distribution.Therefore, in probability P(D|R), the random variable D has a binominal distribution with a success probability that can be modeled by a logistic regression model.P(D|S) is the probability distribution of D conditional on locus S, in which the random variable D also has a binomial distribution.Based on these assumptions, the likelihood of a correspondence to each of the joint probability distributions can be constructed.For each model, the model parameters can be estimated via a standard maximum likelihood method.The best model supported by the data is then chosen based on the AIC, which is commonly used to compare models with different numbers of parameters [Schadt et al., 2005;Lee et al. 2009]

The second step: Statistical test
Step 2 performs statistical tests to determine the significance of the genetic regulatory relationships described in the causality model selected at step 1.The response and independent variables in the statistical models depend on the causality model selected at step 1.These statistical tests can deal with the interaction effects among the three different levels of data and to elucidate differences in disease susceptibility and gene expression pattern across genetically different individuals.The two-step procedure can result in a set of candidate causal and reactive genes, whose expressions affect disease status and respond to disease status under the influence of genotype variation, respectively.

The causal model
In order to investigate gene expression traits whose effects on disease status are modified by genotype variation, the interaction effect of genotype variation and gene expression level on the disease status can be examined using logistic regression below: www.intechopen.comIntegrated Analysis of Gene Expression and Genotype Variation Data for Chronic Fatigue Syndrome 55 where π represents the probability of getting the disease; S represents the effect of genotype variation such as SNPs; R represents the effect of gene expression levels; and S × R represents the interaction effect between genotype variation and gene expression level.

The reactive model
For investigating gene expression traits whose responses to disease status are affected by genotype variations, one can fit the following two-way ANOVA model with the interaction between genotype variation and disease groups: where S represents the effect of genotype variation; D represents the effect of disease groups; and S × D represents the interaction effect between genotype variation and disease groups.

The independent model
When the independent model is selected at step 1, the effect of genotype variation on each of gene expression and disease can be investigated separately.First, the logistic regression is employed to detect genotypic markers linked to disease loci: Next, it is possible to identify genotypic markers that regulate gene expression levels, based on the one-way ANOVA model where the dependent variable is R and the independent variable is S.
In step 2, significant associations among genotype variation, gene expression and disease status are declared via statistical tests for all possible pairs of gene expression-genotype variation.Due to the large number of tests, the multiple-testing problem needs to be addressed.In order to adjust this multiplicity, Lee et al. used a step-up procedure controlling false discovery rate (FDR) [Benjamini & Hochberg, 1995].This CFS dataset has been analysed by many research groups for identifying molecular markers and elucidating pathophysiology of CFS, for finding two differentially expressed genes related with fatigue and depression, respectively, for discriminating classes of unexplained chronic fatigue based on differential gene expressions, and for examining the relationship between CFS and allostatic load based on the clinical dataset.In the CFS dataset, the expression levels of 20,160 genes were assessed from peripheral blood mononuclear cells, via custom-printed single-channel oligonucleotide chips.Quantile normalization was conducted on the gene expression data which were pre-processed by the original CDC research group.For genotype data, the whole blood DNA was extracted and specific areas of the genes of interest were amplified by PCR.

Application
For illustration, we summarized the analyses results from the multi-step procedure of Schadt et al. and the two-step approach of Lee et al.The detailed description of the results is provided in Lee et al. [Lee et al., 2009].

Multi-step procedure by Schadt et al.
The multi-step procedure proposed by Schadt et al. was applied to the same datasets for the purpose of comparison.First, a gene expression analysis was carried out to detect differentially expressed genes across clinical outcomes.Only a few genes were identified as differentially expressed (Table 1A) by three commonly used approaches such as the t-test, significance analysis of microarray (SAM) [Tusher et al., 2001] and the Bayesian regression approach [Baldi & Long, 2001].Second, genotype variation data and clinical outcomes were analyzed via logistic regression to detect the susceptibility genes of disease.Out of all 41 markers tested, nine markers were detected with significant genotype effect on initiation of CFS at a 5% significance level, while only four markers were dete ct ed with 5% FDR [Benjamini & Hochberg, 1995] (Table 1B).Interestingly, different sets of susceptible genes were identified as having statistically significant association with CFS and CFS-MDD/m.From the CFS vs. NF comparison, the seven markers in the NR3C1 gene were identified as significant markers linked to CFS.On the other hand, the CFS-MDD/m vs. NF comparison revealed the two significant markers in the COMT gene.Finally, for each of the differentially expressed genes across clinical outcomes, eQTL were searched at each of the markers that were identified at the second step, via one-way ANOVA of genotype variation and gene expression data.No significant association between gene expression level and genotype variation was found for any genotype-gene expression combination at a 5% significant level.In other words, no significant results were detected for both datasets from the Schadt et al.'s multi-step method.

Two-step integrative analysis
Lee et al. analyzed each combination of 20,160 genes and 41 SNPs with their two-step integrative analysis on two datasets, CFS vs. NF groups and CFS-MDD/m vs. NF groups.
For each gene-SNP combination, the best causal relationship was detected via the causality model selection at step 1.In comparing CFS with NF groups, the reactive model was selected for ∼70% of 20,160 genes on average, for all nine markers within two known CFSrelated genes, such as NR3C1 and COMT (Table 2).However, in comparing CFS-MDD/m with NF groups, the causal model was selected for nearly 70% genes for three markers in the NR3C1 gene.This different tendency in the model selection results between CFS and CFS-MDD/m would imply different genetic mechanisms of CFS and CFS-MDD/m.
At step 2, each gene-SNP combination data was analyzed based on one of the three statistical models, corresponding to the detected causal relationship.For all seven SNPs within NR3C1, significant causal relationships with gene expression levels were detected for either or both datasets (Table 2).Three SNPs (rs258750, rs6188 and rs852977) showed significant relationships with expression levels of a large number of genes, and can be candidates for genetic modulators of CFS-related regulatory pathways.[Lee et al., 2009] The integrative analyses were conducted respectively on two datasets, CFS vs. NF groups and CFS-MDD/m vs. NF groups.Note that the results are presented only for nine SNPs within two known CFSrelated genes (NR3C1 and COMT).For each combination of 20,160 genes and 41 SNPs, the best causal relationship was detected via causal model selection at step 1. Numbers in parenthesis indicate the numbers of genes having each causal relationship with each SNP and disease status.At step 2, each combination data was analyzed based on one of the three statistical models, corresponding to the detected causal relationship.Outside parenthesis, we present the numbers of significant genes identified by the corresponding statistical models.Three SNPs, each of which involves significant causal relationships with expression levels of more than 100 genes, are marked in bold.a Logistic regression was conducted to identify genes whose expressions have interaction effect with genotype variation on disease status.b Two-way ANOVA was conducted to identify genes whose expressions are affected by interaction between genotype variation and disease status.c Independent test was conducted to identify genes whose expressions differ according to SNP genotypes.d Gluccorticoid receptor located at 5q34. e Catechol-O-methltransferase located at 22q11.1.

Integrated Analysis of Gene Expression and Genotype Variation Data for Chronic Fatigue Syndrome 59
Next, pathway enrichment analyses were performed for these three SNPs, and the results are given in the next section.In comparing CFS with NF groups, for the rs258750 marker, 105 genes were identified with differential expression across genotypes with 5% FDR from the independent test.This result is supported by the evidence of the neuroendocrine regulation of immunity, because the gene expression data were obtained from a mononuclear cell, and the role of glucocorticoid receptor (NR3C1) gene is to regulate the level of glucocorticoid.
In the integrated analysis for comparing CFS-MDD/m with NF groups, for the rs6188 marker in the NR3C1 gene, 52 genes showed significant interaction effects with the rs6188 marker on disease status CFS-MDD/m from the logistic regression model.Also, the twoway ANOVA models yielded 217 candidate reactive genes, on which there are significant interaction effects between disease status and genotypes.Note that these candidate genes, especially reactive genes, could not be detected by Schadt et al.'s method.The Lee et al.'s two-step integration method revealed the causal association among gene expression level, genotype and disease status in depth.Candidate causal/reactive genes were detected also for rs852977 in the NR3C1 gene.However, the candidate gene set for the rs852977 is very similar to that for the rs6188, with slight differences in causality structure.This similarity would be due to a strong linkage between the two SNPs.

Pathway enrichment analysis
In comparing CFS with NF groups, Lee et al. further conducted a pathway enrichment analysis for 105 genes that were identified to have a significant relationship with the rs258750 marker from the independent test at step 2. The pathway classification showed that nine different pathways were associated with the rs258750 marker at the 5% significance level (Table 3).Out of nine pathways, four were enriched with genes involved in regulation of transcription, translation or mRNA processing, and three are related with immune system.
For comparing CFS-MDD/m with NF groups, pathway enrichment analyses were conducted on the genes that were identified to have a significant relationship with the rs6188 and/or rs852977 markers at step 2. Because of the linkage between the two SNPs, the results were similar (Tables 4 and 5), and the results was given only for the rs6188.While seven different pathways were detected at the 5% significance level for the 52 candidate causal genes, eleven different pathways were detected for the 217 candidate reactive genes (Table 4).In addition, two other pathways, whose p-values were slightly larger than the 5% significance level, are listed.
In pathway enrichment analyses of the candidate causal genes, the steroid biosynthesis pathway appears to have a direct causal effect on the disease status, CFS-MDD/m, through an integrative action of the rs6188 marker within the NR3C1 gene.The two significantly enriched biological pathways (i.e., 'IL-2 Receptor Beta Chain in T cell Activation', and 'HIV-1 Nef: negative effector of FAS and TNF') are all related to the immune system.On the other hand, the pathway enrichment analysis of the candidate reactive genes showed that several pathways related to lipid metabolism or biosynthesis, such as eicosanoid and fatty acid, appear to be important for responding to CFS-MDD/m.Furthermore, other pathways associated with neuron physiology and neurotransmitters appear to respond to CFS-MDD/m.3. Significant regulated pathways for SNP rs258750 (by courtesy of the authors) [Lee et al., 2009] Pathway enrichment analysis was conducted using 105 candidate independent genes, which were identified for rs258750.Significant biological pathways were detected via Fisher's exact test at a 5% significance level.Pathways are listed in order of significance e.g., most significant pathway presents at the top.  5. Significant regulated pathways for SNP rs852977 (by courtesy of the authors) [Lee et al., 2009] Pathway enrichment analysis was conducted using 120 candidate causal genes, which were identified for rs852977.Significant biological pathways were detected via Fisher's exact test at a 5% significance level.Pathways are listed in order of significance within each of causality model, e.g., most significant pathway presents at the top.a Name of biological pathway selected by Fisher's exact test.b Causality models selected ay step1.c Source of pathway d The number of candidate causal/reactive genes associated with pathway/the number of all genes associated with pathway.e Gene ID of candidate genes associated with pathway.

Discussion
The two-step procedure can integrate gene expression data, genotype variation data and clinical data, and identify the genetic mechanism of a complex disease.We described three different statistical tests based on the two-step procedure proposed by Lee et  Furthermore, the two-step approach provided some interesting results.First, CFS groups and CFS-MDD/m groups would appear to have different genotypes and gene expression profiles even though they had the common characteristic of chronic fatigue.In particular, CFS has major susceptibility markers within the NR3C1 gene, and CFSMDD/m seems to have major susceptibility markers within the catechol-O-methyltransferase (COMT) gene, though they are not statistically significant after FDR correction (Table 1B).The NR3C1 gene regulates the level of glucocorticoid which is the end product of the hypothalamic-pituitaryadrenal (HPA) whereas COMT catalyzes the transfer of a methyl group from Sadenosylmethionine to catecholamines, which is the principal end product of the sympathetic nervous system (SNS), of which the role is maintaining stress-related homeostasis [Elenkov et al., 2000].The different major susceptibility gene may be related with to the provoking of MDD/m.
Second, polymorphisms in the glucocorticoid receptor NR3C1 gene act on CFS and CFS-MDD/m differently.The polymorphisms (rs258750) within NR3C1 have significant effects on CFS, and the 105 gene expression levels independently.However, in the integrated analysis for comparing CFS-MDD/m and NF groups, polymorphisms within the NR3C1 gene affect the CFS-MDD/m and several gene expression levels differently.For example, the 217 genes are differentially expressed according to the rs6188 marker genotype within NR3C1 and disease status, even though polymorphisms within NR3C1 have no direct significant effects after FDR correction on the CFS-MDD/m.In addition, the 52 genes also regulate the CFS-MDD/m, through integrated action with the rs6188 marker.The different action of the NR3C1 gene on gene expression level and disease may be an outcome of other factors, such as environmental effects or polymorphisms of the COMT gene.The catecholamines which are regulated by the COMT gene, have been often been regarded as immunosuppressive [Elenkov et al., 2000].
Two pathway enrichment analyses for the 52 candidate causal genes and 217 candidate reactive genes indicated that our approach can recover plausible regulatory mechanisms of CFS-MDD/m by comparing CFS-MDD/m and NF groups.From the comparison, we noticed that the pathways related to the immune system and steroid may have causal effect on disease state through an integrative action of the NR3C1 gene.Both the NR3C1 gene that regulates the level of glucocorticoid, and the steroid that includes corticosteroids are known to regulate the immune function [Webster et al., 2002].A number of studies have found many irregularities in the immune systems in CFS patients [Natelson et al., 2002].This suggested that an important cause of CFS-MDD/m would be the immune system dysfunction, regulated by the neuroendocrine system, which rs6188 in the NR3C1 gene seems influence.Another potential implication of this comparison is that the CFS-MDD/m status and genetic polymorphisms can jointly induce different activation and expression of several lipid related metabolisms, neuron physiology differentiation, and neurotransmitters.
However, since fatigue is a core symptom in major depressive disorder [Pae et al., 2007], CFS-MDD/m patients might have fatigue due to the depression rather than unexplained causes, and hence the significant results may be related to a 'major depression disorder with melancholic features' rather than chronic fatigue.For example, the excessive hypothalamuspituitary-adrenal (HPA) axis responses, of which the end products are glucocorticoids, are known to be hallmarks of depression [Pariante & Miller, 2001;Holsboer, 2000;Pariante, 2004].Also, the major depression can be associated with the immune activation, dysfunction of neurotransmitters at synapse [Neumeister et al., 2004;Sanacora et al., 2004;Maes & Meltzer, 1995], and essential fatty acids [Van Strater & Bouvy, 2006].
The integrative analyses considering the interaction effect among different levels of data could elucidate different disease susceptibility and differentially expressed genes of genetically different individuals.Some results showed that integrating genotype and expression data may help the search for new directions for the treatment of common human diseases that are not being detected using only one type of data.The integrated analysis provided more information than the two separate analyses of gene expression data and genotype variation data for characterizing CFS that has several possible causes.
In conclusion, the two-step approach to the integration of heterogeneous data sets can be generally applied to other studies in which gene expression data, genotype variation data and clinical data are available, and it can be very useful as the importance of integrated data analysis has been increasing.The two-step approach can also be extended to datasets containing other type of data, such as protein data rather than clinical data.The two-step approach can be readily applicable to quantitative traits rather than binary clinical outcome traits, by employing linear regression analysis.Also, it can be easily applied to genome-wide association studies, and can handle environmental factors, such as age and sex, by treating these factors as covariates in the regression model.Furthermore, the two-step approach can be extended to the gene-set approach, the module based approach or co-expression network as Presson et al. [Presson et al., 2008] and Chen et al. [Chen et al., 2008] did.
However, there are some limitations to the two-step method.First, the causality models are too simple to represent true mechanisms, which would be more complicated due to possible interactions between causal-reactive genes [Schadt et al., 2005].Further considerations for more complicated models are necessary in order to identify the genetic mechanism of complex diseases.Second, the two-step approach may need large computing although it is applicable to genome-wide studies because it is not limited in the scale of data.Another limitation would be a misclassification problem in that the proposed method relies on the LCMS.The current two-step approach does not use FDR procedure to account for the model misclassification problem.In fact, FDR procedure was employed only in the second step, not in the first step for the model selection procedure that chooses the model with the minimum AIC among the three causal models.While anticipating the problem, we still employed the LCMS process because it showed good power for detecting true models in the simulation evaluated by Schadt et al.The two-step approach can be extended to account for the errors caused by the model misclassification in the first step.For example, we can test for the difference in the AIC values of three causality models, because the chance for model misclassification would be high when the difference between the smallest AIC value from the selected model and those from the other models is not large.A permutation-based nonparametric test might be developed for this testing.We think it requires a further study to control simultaneously two types of errors: causality model selection, and significant maker-gene pair identification.

Acknowledgment
The work was supported by the National Research Foundation (KRF-2008-313-C00086)

Fig. 1 .
Fig. 1. eQTL pipeline.From disease and normal individuals, genotypes and mRNA expressions are observed.
Fig. 4. Three possible causal relationships among genotype variation, mRNA level and complex disease proposed by Schadt et al.QTL, mRNA and disease represent any genotype variation like SNP, mRNA expression level of a gene, and complex disease or phenotype of interest, respectively.
Chronic fatigue syndrome (CFS) is a debilitating illness lacking consistent anatomic lesions and eluding conventional laboratory diagnosis.CFS has no confirmatory physical signs or laboratory abnormalities, and its etiology and pathophysiology are unknown.This disease characterized by chronic fatigue, lasting at least 6 months, which is accompanied by symptoms such as impairment in short-term memory or concentration, sore throat, tender lymph nodes, and muscle pain.The Centers for Disease Control and Prevention (CDC) Chronic Fatigue Syndrome Research Group produced the dataset including gene expression of 177 subjects, proteomic of 60 subjects, single nucleotide polymorphism (SNP) of 50 subjects, and clinical data of 227 subjects.All the data set is available on the following web site (http://www.camda.duke.edu/camda06/datasets/index.html).According to severity of symptoms, the patients were originally classified into five groups of CFS. Lee et al.'s study, however, only consider three groups of total 101 subjects: 46 subjects meeting the CFS research case definition (CFS), 19 subjects meeting the CFS research case definition and having 'a major depressive disorder with melancholic features' (CFS-MDD/m), and 36 subjects who show no fatigue (NF).
Lee et al.applied their two-step procedure to chronic fatigue syndrome (CFS) data to elucidate a list of potential causal genes of CFS.In this section, we provide the application of two-step procedure of Lee et al.'s4.1 Chronic Fatigue Syndrome (CFS) dataset

Table 1
[Lee et al., 2009]s for respective association of gene expression and genotype variation with disease status (by courtesy of the authors)[Lee et al., 2009]As multiple filtering steps is Schadt et al.'s procedure, the separate analyses were conducted respectively on two datasets, CFS vs. NF groups and CFS-MDD/m vs. NF groups.Bold numbers indicate p-values < 0.05.a NCBI dbSNP Build number is 125 using Human Genome Build 35.1 b Position of SNP on chromosome.1 c p-value from logistic regression with CFS vs. NF data.d p-value from logistic regression with CFS-MDD/m vs. NF data.e Glucocorticoid receptor located at 5q34.f Significant at the 5% false discovery rate (FDR).g Catechol-O-methyltransferase located at 22q11.1.

Table 4 .
[Lee et al., 2009]ted pathways for SNP rs6188 (by courtesy of the authors)[Lee et al., 2009]Pathway enrichment analysis was conducted using 52 candidate causal genes and 217 candidate reactive genes, which were identified for rs6188.Significant biological pathways were detected via Fisher's exact test at a 5% significance level.Pathways are listed in order of significance within each of causality models, e.g., most significant pathway presents at the top.a Name of biological pathway selected by Fisher's exact test.
a Name of biological pathway selected by Fisher's exact test.b Causality models selected ay step1.c b

b Source c Nodes d Gene ID e Gene name
al..For purposes of comparison, two different CFS related datasets were analyzed via the multi-step procedure proposed bySchadt et al..In these specific datasets, no significant results were detected from the multistep method of Schadt et al., while the method of Lee et al. enabled us to identify many statistically significant causal relationships, some of which were biologically supported by pathway enrichment analyses.These results demonstrated that the two-step method based on an exhaustive search investigation would provide more power.