Elsevier

Applied Soft Computing

Volume 4, Issue 1, February 2004, Pages 79-86
Routine discovery of complex genetic models using genetic algorithms

https://doi.org/10.1016/j.asoc.2003.08.003

Abstract

Simulation studies are useful in various disciplines for a number of reasons including the development and evaluation of new computational and statistical methods. This is particularly true in human genetics and genetic epidemiology where new analytical methods are needed for the detection and characterization of disease susceptibility genes whose effects are complex, nonlinear, and partially or solely dependent on the effects of other genes (i.e. epistasis or gene–gene interaction). Despite this need, the development of complex genetic models that can be used to simulate data is not always intuitive. In fact, only a few such models have been published. We previously developed a genetic algorithm (GA) approach to discovering complex genetic models in which two single nucleotide polymorphisms (SNPs) influence disease risk solely through nonlinear interactions. In this paper, we extend this approach for the discovery of high-order epistasis models involving three to five SNPs. We demonstrate that the genetic algorithm is capable of routinely discovering interesting high-order epistasis models in which each SNP influences risk of disease only through interactions with the other SNPs in the model. This study opens the door for routine simulation of complex gene–gene interactions among SNPs for the development and evaluation of new statistical and computational approaches for identifying common, complex multifactorial disease susceptibility genes.

Introduction

One goal of human genetics is to identify genes that confer increased susceptibility to certain diseases. The identification of disease susceptibility genes has the potential to improve human health through the development of new prevention, diagnosis, and treatment strategies. Although achieving this goal is an important public health endeavor, it is not easily accomplished for common diseases, such as essential hypertension, due to the complex multifactorial etiology of the disease [8], [13]. That is, risk of disease is due to a complex interplay between multiple genes and multiple environmental factors. Such gene–gene interactions (i.e. epistasis) and gene–environment interactions (i.e. plastic reaction norms) are expected to be ubiquitous in determining susceptibility to common human diseases [11], are examples of attribute interactions in data mining [4], and represent a significant statistical challenge in human genetics [11], [13], [16]. The statistical challenge is partly due to the curse of dimensionality [1]. That is, when high-order interactions are modeled, there are many genotype combinations (i.e. contingency table cells) for which there are no observed data. As a result, parametric statistical methods such as logistic regression have limited power and perhaps an increased false-positive (type I error) rate. Multifactor dimensionality reduction (MDR) is a nonparametric and genetic model-free classification method that was developed to address this problem [7], [13], [14], [15]. Although promising, the power of MDR to identify high-order gene–gene interactions has not been fully evaluated due to a lack of epistasis models in the literature that can be used to simulate data.
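The sparsity problem described above can be made concrete with a short calculation. The sketch below assumes biallelic SNPs with three genotypes each, a hypothetical sample of 200 subjects, and subjects spread uniformly at random across cells. Real genotype frequencies are unequal, so the uniform assumption is optimistic. The function names are illustrative and are not taken from the paper.

```python
def genotype_cells(num_snps: int, genotypes_per_snp: int = 3) -> int:
    """Number of multilocus genotype combinations (contingency table cells)."""
    return genotypes_per_snp ** num_snps


def expected_empty_cells(num_snps: int, sample_size: int) -> float:
    """Expected number of empty cells when each subject falls into a cell
    uniformly at random (an optimistic simplifying assumption)."""
    cells = genotype_cells(num_snps)
    # Probability that one given cell receives none of the subjects.
    prob_empty = (1.0 - 1.0 / cells) ** sample_size
    return cells * prob_empty


if __name__ == "__main__":
    # Hypothetical sample size, chosen only for illustration.
    for k in range(1, 6):
        print(k, genotype_cells(k), round(expected_empty_cells(k, 200), 1))
```

Under these assumptions, the number of cells grows as 3^k, and with five SNPs a sizeable share of the 243 cells is expected to contain no observations at all.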

The lack of appropriate models that can be used for simulation is partly due to the combinatorial and computational complexities associated with the model discovery process [2]. The goal of the present study is to develop a genetic algorithm (GA) approach to model discovery that overcomes some of the combinatorial and computational limitations. We demonstrate here that the GA strategy is able to routinely discover models of gene–gene interaction effects among three to five total SNPs. The availability of high-order gene–gene interaction models will facilitate the simulation of data of varying complexity for the evaluation of new statistical and computational methods for identifying gene–gene interactions.

Section snippets

Related research

Penetrance functions represent one approach to modeling the relationship between single nucleotide polymorphisms (SNPs) (i.e. interindividual variation in a single nucleotide at a particular location in the DNA sequence of a gene) and risk of disease. A penetrance function is simply the probability (P) that an individual will have the disease (D) given a particular genotype or combination of genotypes (G) from multiple genes (i.e. P[D|G]). A single genotype is determined by one allele (i.e. a
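As a minimal sketch of the definition P[D|G] given above, a two-SNP penetrance function can be stored as a 3×3 table and checked for main effects by computing marginal penetrances. The table values, the allele frequency of 0.5, and the Hardy-Weinberg genotype proportions are all illustrative assumptions. They are not models reported in this paper.

```python
# Hypothetical two-SNP penetrance table P[D|G]. Rows are SNP A genotypes
# (AA, Aa, aa); columns are SNP B genotypes (BB, Bb, bb). Values are illustrative.
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]


def hardy_weinberg(p):
    """Genotype frequencies (AA, Aa, aa) for allele frequency p, assuming
    Hardy-Weinberg proportions."""
    q = 1.0 - p
    return [p * p, 2.0 * p * q, q * q]


def marginal_penetrances(table, freqs_a, freqs_b):
    """Penetrance of each single-SNP genotype, averaged over the other SNP."""
    by_a = [sum(fb * table[i][j] for j, fb in enumerate(freqs_b)) for i in range(3)]
    by_b = [sum(fa * table[i][j] for i, fa in enumerate(freqs_a)) for j in range(3)]
    return by_a, by_b


def prevalence(table, freqs_a, freqs_b):
    """Overall probability of disease in the population."""
    return sum(
        freqs_a[i] * freqs_b[j] * table[i][j] for i in range(3) for j in range(3)
    )


if __name__ == "__main__":
    freqs = hardy_weinberg(0.5)  # assumed allele frequency
    print(marginal_penetrances(PENETRANCE, freqs, freqs))
    print(prevalence(PENETRANCE, freqs, freqs))
```

With these assumed values, every marginal penetrance is approximately 0.05, the same as the overall prevalence, so neither SNP shows an effect when examined alone even though the joint genotype changes risk.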

Overview of genetic algorithms

Genetic algorithms, neural networks, case-based learning, rule induction, and analytic learning are some of the more popular paradigms in machine learning [9]. Genetic algorithms perform a beam or parallel search of the solution space that is analogous to the problem-solving abilities of biological populations undergoing evolution by natural selection [5], [6]. With this procedure, a random ‘population’ of solutions to a particular problem is generated and the solutions are then evaluated for their
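The generate-and-evaluate loop described above can be sketched on a toy problem. The example below evolves bit strings rather than penetrance values. Tournament selection, one-point crossover, elitism, and all parameter values are assumptions chosen for brevity, not the configuration used in the paper.

```python
import random


def evolve(fitness, genome_length, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Toy genetic algorithm over bit strings (hypothetical settings)."""
    rng = random.Random(seed)
    # Randomly generate the initial population of candidate solutions.
    population = [
        [rng.randint(0, 1) for _ in range(genome_length)] for _ in range(pop_size)
    ]

    def tournament():
        # Pick two distinct individuals at random and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        next_population = [best[:]]  # elitism: carry the best solution forward
        while len(next_population) < pop_size:
            mother, father = tournament(), tournament()
            cut = rng.randrange(1, genome_length)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    # Toy fitness: count of ones in the bit string.
    best, history = evolve(sum, genome_length=20)
    print(history[0], history[-1])
```

Because the best individual is copied unchanged into each new generation, the best fitness recorded in `history` never decreases. Applying the same loop to model discovery would need a genome that encodes penetrance values and a fitness function that rewards interaction effects, neither of which is shown here.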

Results

For each of the 1000 runs of the GA, a best model was identified that met the criteria of Vt≥0.1 and Vm≤0.0001 for three-SNP models and Vt≥0.2 and Vm≤0.0001 for four- and five-SNP models. Thus, the GA routinely identified epistasis models with little or no independent main effects. Further, of the 1000 models identified for each number of SNPs, there were no duplicates. Thus, the GA discovered 1000 unique three-, four-, and five-SNP gene–gene interaction models exhibiting minimal independent

Discussion

We have introduced a genetic algorithm approach to identifying penetrance functions that model epistasis or gene–gene interactions among three to five SNPs in the absence of independent main effects. The development of this GA approach and our initial two-SNP version [12] was motivated by a lack of published gene–gene interaction models and a lack of methods for generating them. We find that the GA is capable of routinely discovering interesting epistasis penetrance functions with up to five

Future research and conclusion

An important next step in the discovery of complex genetic models will be to determine the diversity and number of models that exhibit epistasis in the absence of main effects for any given number of SNPs. How many total models exist that have gene–gene interaction properties? How many exist for different allele and genotype frequencies? In this study, we identified 1000 models that have different probability values for some or all of the genotype combinations. Even though the probabilities are

Acknowledgements

This work was supported by National Institutes of Health grants HL65234, HL65962, GM31304, AG19085, AG20135, and LM007450. We thank an anonymous referee for their thoughtful comments and suggestions.

Cited by (42)

  • A comparative analysis of chaotic particle swarm optimizations for detecting single nucleotide polymorphism barcodes

    2016, Artificial Intelligence in Medicine
    Citation Excerpt :

    In the linear regression model, the original PSO method was set as reference, the Beta (β) value indicates whether the CPSO methods perform better (positive) or worse (negative) than the reference, and statistical significance is defined as p-value < 0.05. Both the XOR [47] and ZZ [48] disease models were used to test all CPSO methods. In the nonlinear XOR model, the heterozygous genotype from one locus or a heterozygous genotype from another locus may increase disease risk.

  • Analysis of high-order SNP barcodes in mitochondrial D-loop for chronic dialysis susceptibility

    2016, Journal of Biomedical Informatics
    Citation Excerpt :

    We considered two epistasis models whose multiloci penetrances are shown in Additional file C: Table C1. Model 1 is the nonlinear XOR model [26] where the high risk of disease is dependent on inheriting a heterozygous genotype from one locus or a heterozygous genotype from another locus, but not all loci. Model 2 is the ZZ model [27,28] where the high risk of disease is dependent upon inheriting exactly two high risk alleles from two loci.

  • Genetic predictors of outcome following traumatic brain injury

    2015, Handbook of Clinical Neurology
    Citation Excerpt :

    In addition, linear models generally include interaction effects for genotypes that have independent main effects, although interactions in the absence of significant main effects are observed by some approaches (Millstein et al., 2006). Recently, this view has changed, using different analytical tools that perform analyses using nonlinear interactions and interpreting the model(s) in the context of known biology (Moore et al., 2004). Two of these approaches use data mining and machine learning methods and have been reviewed by Gilbert-Diamond et al. (2011).

  • Haplotype inference using a novel binary particle swarm optimization algorithm

    2014, Applied Soft Computing Journal
    Citation Excerpt :

    In the post-genome era, as high-throughput genomic technologies have become available, correlating variation in DNA sequence with phenotypic differences (diseases, skin color and so on) has attracted increasing attention [1–3]. Single Nucleotide Polymorphisms (SNPs) are the most common form of DNA variation [4–6]. Studies showed that haplotypes (the combination of SNPs alleles on the same chromosome) can provide more information than genotypes (the conflated data of two haplotypes) in association studies [7–11].

  • Genome Simulation. Approaches for Synthesizing In Silico Datasets for Human Genomics

    2010, Advances in Genetics
    Citation Excerpt :

    SimPEN is an alternative strategy for assigning disease status to individuals. SimPEN was designed explicitly to generate multilocus penetrance models where there are no main effects from any single-locus alone; with such a genetic model, all loci must be evaluated jointly to detect the genetic effect (Moore et al., 2004). In terms of heritability, these models have no additive or dominant variance—all of the trait variance due to genetics is explained by the nonadditive interaction of genotypes.
