Routine discovery of complex genetic models using genetic algorithms

doi:10.1016/j.asoc.2003.08.003

Applied Soft Computing

Volume 4, Issue 1, February 2004, Pages 79-86

https://doi.org/10.1016/j.asoc.2003.08.003

Abstract

Simulation studies are useful in various disciplines for a number of reasons including the development and evaluation of new computational and statistical methods. This is particularly true in human genetics and genetic epidemiology where new analytical methods are needed for the detection and characterization of disease susceptibility genes whose effects are complex, nonlinear, and partially or solely dependent on the effects of other genes (i.e. epistasis or gene–gene interaction). Despite this need, the development of complex genetic models that can be used to simulate data is not always intuitive. In fact, only a few such models have been published. We previously developed a genetic algorithm (GA) approach to discovering complex genetic models in which two single nucleotide polymorphisms (SNPs) influence disease risk solely through nonlinear interactions. In this paper, we extend this approach for the discovery of high-order epistasis models involving three to five SNPs. We demonstrate that the genetic algorithm is capable of routinely discovering interesting high-order epistasis models in which each SNP influences risk of disease only through interactions with the other SNPs in the model. This study opens the door for routine simulation of complex gene–gene interactions among SNPs for the development and evaluation of new statistical and computational approaches for identifying common, complex multifactorial disease susceptibility genes.

Introduction

One goal of human genetics is to identify genes that confer increased susceptibility to certain diseases. The identification of disease susceptibility genes has the potential to improve human health through the development of new prevention, diagnosis, and treatment strategies. Although achieving this goal is an important public health endeavor, it is not easily accomplished for common diseases, such as essential hypertension, due to the complex multifactorial etiology of such diseases [8], [13]. That is, risk of disease is due to a complex interplay between multiple genes and multiple environmental factors. Such gene–gene interactions (i.e. epistasis) and gene–environment interactions (i.e. plastic reaction norms) are expected to be ubiquitous in determining susceptibility to common human diseases [11], are examples of attribute interactions in data mining [4], and represent a significant statistical challenge in human genetics [11], [13], [16]. The statistical challenge is partly due to the curse of dimensionality [1]. That is, when high-order interactions are modeled, there are many genotype combinations (i.e. contingency table cells) for which there are no observed data. As a result, parametric statistical methods such as logistic regression have limited power and perhaps an increased false-positive (type I error) rate. Multifactor dimensionality reduction (MDR) is a nonparametric and genetic model-free classification method that was developed to address this problem [7], [13], [14], [15]. Although promising, the power of MDR to identify high-order gene–gene interactions has not been fully evaluated due to a lack of epistasis models in the literature that can be used to simulate data.
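To make the curse of dimensionality concrete, the short sketch below counts the contingency table cells for a given number of SNPs. It assumes biallelic SNPs with three genotypes each; the function name and the sample size of 200 are hypothetical and chosen only for illustration.

```python
def genotype_cells(num_snps, genotypes_per_snp=3):
    """Number of multilocus genotype combinations (contingency table cells)."""
    return genotypes_per_snp ** num_snps


if __name__ == "__main__":
    sample_size = 200  # hypothetical study size, for illustration only
    for k in range(1, 6):
        cells = genotype_cells(k)
        # Average subjects per cell shrinks quickly as SNPs are added.
        print(f"{k} SNP(s): {cells} cells, "
              f"{sample_size / cells:.2f} subjects per cell on average")
```

With three SNPs there are 27 cells and with five there are 243, so a study of a few hundred subjects will leave many cells sparse or empty.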

The lack of appropriate models that can be used for simulation is partly due to the combinatorial and computational complexities associated with the model discovery process [2]. The goal of the present study is to develop a genetic algorithm (GA) approach to model discovery that overcomes some of the combinatorial and computational limitations. We demonstrate here that the GA strategy is able to routinely discover models of gene–gene interaction effects among three to five total SNPs. The availability of high-order gene–gene interaction models will facilitate the simulation of data of varying complexity for the evaluation of new statistical and computational methods for identifying gene–gene interactions.

Related research

Penetrance functions represent one approach to modeling the relationship between single nucleotide polymorphisms (SNPs) (i.e. interindividual variation in a single nucleotide at a particular location in the DNA sequence of a gene) and risk of disease. A penetrance function is simply the probability (P) that an individual will have the disease (D) given a particular genotype or combination of genotypes (G) from multiple genes (i.e. P[D|G]). A single genotype is determined by one allele (i.e. a
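As an illustration of a penetrance function P[D|G], the sketch below stores a two-SNP penetrance table and computes the marginal penetrance of each single-SNP genotype. The table values, the Hardy-Weinberg genotype frequencies, the allele frequency of 0.5, and all function names are assumptions made for this example; they are not taken from the article.

```python
# Hypothetical two-SNP penetrance table P[D|G]. Rows index the genotype at
# SNP A (AA, Aa, aa); columns index the genotype at SNP B (BB, Bb, bb).
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]


def hwe_frequencies(p):
    """Genotype frequencies under assumed Hardy-Weinberg proportions."""
    q = 1.0 - p
    return [p * p, 2.0 * p * q, q * q]


def marginal_penetrance_a(table, freq_b):
    """P[D | genotype at SNP A], averaged over the genotypes at SNP B."""
    return [sum(f * pen for f, pen in zip(freq_b, row)) for row in table]


def marginal_penetrance_b(table, freq_a):
    """P[D | genotype at SNP B], averaged over the genotypes at SNP A."""
    return [sum(freq_a[i] * table[i][j] for i in range(3)) for j in range(3)]


if __name__ == "__main__":
    freqs = hwe_frequencies(0.5)  # assumed allele frequency of 0.5 at both SNPs
    print([round(x, 4) for x in marginal_penetrance_a(PENETRANCE, freqs)])
    print([round(x, 4) for x in marginal_penetrance_b(PENETRANCE, freqs)])
```

Under these assumptions every marginal penetrance equals 0.05, so neither SNP shows an effect on its own even though the joint table varies between 0 and 0.1. This is the kind of interaction-only behavior the article is concerned with.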

Overview of genetic algorithms

Genetic algorithms, neural networks, case-based learning, rule induction, and analytic learning are some of the more popular paradigms in machine learning [9]. Genetic algorithms perform a beam or parallel search of the solution space that is analogous to the problem-solving abilities of biological populations undergoing evolution by natural selection [5], [6]. With this procedure, a ‘population’ of solutions to a particular problem is randomly generated and then evaluated for their
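The general procedure can be sketched as a minimal genetic algorithm that evolves a two-SNP penetrance table. This is not the authors' implementation: the snippet above is cut off before it describes their operators, so the fitness function, truncation selection, one-point crossover, Gaussian mutation, elitism, and all parameter values below are assumptions chosen for illustration.

```python
import random

FREQS = [0.25, 0.5, 0.25]  # assumed genotype frequencies (allele frequency 0.5)


def variance(values, weights):
    """Weighted variance of values, with weights that sum to 1."""
    mean = sum(w * v for v, w in zip(values, weights))
    return sum(w * (v - mean) ** 2 for v, w in zip(values, weights))


def fitness(table):
    """Illustrative score: reward spread across cells, penalize main effects."""
    rows = [table[3 * i:3 * i + 3] for i in range(3)]
    cols = [table[j::3] for j in range(3)]
    marg_a = [sum(f * p for f, p in zip(FREQS, row)) for row in rows]
    marg_b = [sum(f * p for f, p in zip(FREQS, col)) for col in cols]
    cell_weights = [fa * fb for fa in FREQS for fb in FREQS]
    total = variance(table, cell_weights)
    main = variance(marg_a, FREQS) + variance(marg_b, FREQS)
    return total - 100.0 * main  # penalty weight is an arbitrary assumption


def evolve(pop_size=50, generations=100, mutation_rate=0.1, seed=0):
    """Evolve flat 9-cell penetrance tables; return (best table, best-fitness history)."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(9)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[:pop_size // 2]   # truncation selection
        children = [ranked[0][:]]          # elitism: carry the best over unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, 9)      # one-point crossover
            child = mother[:cut] + father[cut:]
            # Gaussian mutation, clipped so each cell stays a valid probability.
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, 0.1)))
                     if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = evolve()
    print("first and last best fitness:", history[0], history[-1])
```

Because the best individual is copied unchanged into each new generation, the best fitness in this sketch never decreases from one generation to the next.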

Results

For each of the 1000 runs of the GA, a best model was identified that met the criteria of V_t≥0.1 and V_m≤0.0001 for three-SNP models and V_t≥0.2 and V_m≤0.0001 for four- and five-SNP models. Thus, the GA routinely identified epistasis models with little or no independent main effects. Further, of the 1000 models identified for each number of SNPs, there were no duplicates. Thus, the GA discovered 1000 unique three-, four-, and five-SNP gene–gene interaction models exhibiting minimal independent

Discussion

We have introduced a genetic algorithm approach to identifying penetrance functions that model epistasis or gene–gene interactions among three to five SNPs in the absence of independent main effects. The development of this GA approach and our initial two-SNP version [12] was motivated by a lack of published gene–gene interaction models and a lack of methods for generating them. We find that the GA is capable of routinely discovering interesting epistasis penetrance functions with up to five

Future research and conclusion

An important next step in the discovery of complex genetic models will be to determine the diversity and number of models that exhibit epistasis in the absence of main effects for any given number of SNPs. How many total models exist that have gene–gene interaction properties? How many exist for different allele and genotype frequencies? In this study, we identified 1000 models that have different probability values for some or all of the genotype combinations. Even though the probabilities are

Acknowledgements

This work was supported by National Institutes of Health grants HL65234, HL65962, GM31304, AG19085, AG20135, and LM007450. We thank an anonymous referee for their thoughtful comments and suggestions.

References (16)

R. Culverhouse et al., A perspective on epistasis: limits of models displaying no main effect, Am. J. Hum. Genet. (2002)
M.D. Ritchie et al., Multifactor dimensionality reduction reveals high-order interactions among estrogen metabolism genes in sporadic breast cancer, Am. J. Hum. Genet. (2001)
R. Bellman, Adaptive Control Processes, Princeton University Press, Princeton, NJ,...
W.N. Frankel et al., Who’s afraid of epistasis?, Nat. Genet. (1996)
A.A. Freitas, Understanding the crucial role of attribute interaction in data mining, Artif. Intell. Rev. (2001)
D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading,...
J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor,...
L.W. Hahn et al., Multifactor dimensionality reduction software for detecting gene–gene and gene–environment interactions, Bioinformatics (2003)

There are more references available in the full text version of this article.
