Analysis of genetic risk assessment methods

Chronic non-communicable diseases are caused by a combination of multilocus genetic risk factors. Genetic risk assessment companies, e.g. Navigenics and 23andMe, calculate the lifetime risk of a disease under strong assumptions about the total impact of a multi-SNP genotype. The aim of this paper is to compare such risk assessment methods. A theoretical disease model that describes both environmental and genetic factors is used to evaluate the assessment methods. A system of nonlinear equations has been developed for tuning the model's parameters to real statistical parameters of the disease. The Receiver Operating Characteristic curve is used to evaluate the quality of the methods as predictive tests.


Introduction
The purpose of genetic risk assessment methods is to evaluate a quantitative index that indicates the risk of the disease D given an individual's genotype g. Calculating the risk index for a single DNA locus (Single-Nucleotide Polymorphism, SNP) associated with the disease is not complicated and amounts to solving a simple system of nonlinear algebraic equations. In this case $g \in \{N, R, R_2\}$, where R denotes one risk and one non-risk allele in the diploid, $R_2$ denotes two risk alleles, and N denotes two non-risk alleles. The conditional risk probability $p(D \mid g)$, the corresponding odds ratio $OR_g = \frac{\mathrm{odds}(p(D \mid g))}{\mathrm{odds}(p(D \mid N))}$ [3], and the relative risk $\lambda_g = \frac{p(D \mid g)}{p(D \mid N)}$ [7] serve as risk indexes.
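As an illustration of the single-locus case, the following minimal sketch recovers the conditional risks p(D | N), p(D | R) and p(D | R_2) from an average lifetime risk, genotype frequencies and odds ratios. It assumes that the average risk is the frequency-weighted sum of the conditional risks and that odds(p) = p / (1 - p). The function names, the bisection solver and all numeric inputs are hypothetical and are not taken from the cited methods.

```python
def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


def prob_from_odds(o):
    """Probability corresponding to odds o."""
    return o / (1.0 + o)


def single_locus_risks(p_d, freqs, odds_ratios, tol=1e-12):
    """Solve sum_g f_g * p(D|g) = p(D) for the baseline risk p(D|N).

    freqs and odds_ratios are dicts keyed by 'N', 'R', 'R2';
    odds_ratios['N'] is 1 by definition.
    """
    def average_risk(p_n):
        base = odds(p_n)
        return sum(freqs[g] * prob_from_odds(odds_ratios[g] * base) for g in freqs)

    # average_risk grows monotonically with p(D|N), so bisection is sufficient.
    lo, hi = 0.0, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if average_risk(mid) < p_d:
            lo = mid
        else:
            hi = mid
    base = odds(0.5 * (lo + hi))
    return {g: prob_from_odds(odds_ratios[g] * base) for g in freqs}


# Hypothetical inputs, chosen only to exercise the code.
freqs = {"N": 0.49, "R": 0.42, "R2": 0.09}
odds_ratios = {"N": 1.0, "R": 1.3, "R2": 1.7}
risks = single_locus_risks(0.10, freqs, odds_ratios)
relative_risks = {g: risks[g] / risks["N"] for g in risks}
```

The relative risks obtained this way are the kind of per-locus input that a multiplicative multi-locus index needs.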
However, chronic non-communicable diseases are caused by a combination of multi-locus genetic risk factors. The risk of multi-locus genotypes has been investigated in many papers (e.g. [3,7,2,4,8]). The main problem is to evaluate the mutual impact of numerous disease-associated loci. If there are k disease-associated loci, then there are $3^k$ different genotypes. Estimating the risk of each genotype directly would be very expensive, and no such statistical evaluation has been conducted so far. Therefore all of these papers accept the strong assumption that the impacts of different loci are mutually independent. The papers also differ in the assumed model of the total impact of the associated SNPs. The multiplicative model of the overall relative risk is accepted in [7,4] and is used by Navigenics. A similar approach is presented in [8]. The product of conditional probabilities is used in [2]; this approach is equivalent to the Navigenics approach. 23andMe uses another multiplicative model, the product of relative odds [3].
Genetic databases contain rather little data on SNP risks. Therefore the input data of the assessment methods have to be calculated from the data that are known, which requires further assumptions, e.g. Hardy-Weinberg equilibrium or additive models for the log-odds and the impact of the risk allele. It is difficult to calculate the error introduced by such assumptions, so these methods can only be evaluated experimentally.

Input of assessment methods
The assessment methods use statistical data about the prevalence of genotypes in a specific population and about the risk conferred by a single SNP. Such data have been estimated by the HapMap project [1] and the Wellcome Trust Case Control Consortium (WTCCC) [9], and can be extracted from the knowledge base SNPedia [6]:
• $p(D)$: the average lifetime risk;
• $OR^i_{R_2}$, $OR^i_R$: risk odds ratios for the homozygous ($R_2$) and heterozygous ($R$) genotypes at the i-th locus.
The superscript i on the odds ratios $OR^i$, frequencies $f^i$, relative risks $\lambda^i$ and genotype values $\{N^i, R^i, R^i_2\}$ identifies the locus.

Genetic risk indexes
Different authors have derived different indexes under different assumptions. Suppose that k diploid SNPs associated with the disease are known for an individual, so that the individual's genotype is $g = (g_1, \dots, g_k)$, where $g_i$ is the genotype at the i-th locus.
Navigenics Corporation has developed the Genetic Composite Index (GCI) [7,4], which is based on the multiplicative model of the relative risks $\lambda^i_{g_i}$. 23andMe Corporation uses a risk index [3], referred to below as OR, equal to the odds ratio of the risk probability in comparison with the average risk. An individual is considered to be at risk of developing the disease if the risk index exceeds some threshold β. The relative risk values $\lambda^i_{g_i}$ are calculated from the data presented in Section 1.
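The two multiplicative models named in the Introduction can be sketched as follows: a product of per-locus relative risks (the model behind GCI) and a product of per-locus odds ratios (the model behind the 23andMe index). This is only a sketch under the independence assumption; the normalization to the population average used by each index is deliberately left out, and the function names and numeric values are hypothetical.

```python
from math import prod


def product_of_relative_risks(genotype, rel_risk):
    """Multiply per-locus relative risks for a multi-locus genotype.

    genotype: list with one of 'N', 'R', 'R2' per locus.
    rel_risk: list of dicts, rel_risk[i][g] being the relative risk at locus i.
    """
    return prod(rel_risk[i][g] for i, g in enumerate(genotype))


def product_of_odds_ratios(genotype, odds_ratio):
    """Multiply per-locus odds ratios for a multi-locus genotype."""
    return prod(odds_ratio[i][g] for i, g in enumerate(genotype))


def is_at_risk(index_value, beta):
    """Decision rule: the individual is flagged if the index exceeds beta."""
    return index_value > beta


# Hypothetical per-locus relative risks for three loci.
rel_risk = [
    {"N": 1.0, "R": 1.2, "R2": 1.5},
    {"N": 1.0, "R": 1.1, "R2": 1.3},
    {"N": 1.0, "R": 1.4, "R2": 2.0},
]
index_value = product_of_relative_risks(["R", "N", "R2"], rel_risk)
flagged = is_at_risk(index_value, beta=2.0)
```

Sweeping the threshold `beta` over the observed index values is what produces a ROC curve for such an index.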

Theoretical disease model
We have used the stochastic model [4] of a disease as a sandbox for experiments. The model assumes that the disease is affected by environmental, known genetic and unknown factors that are mutually independent. The risk of developing the disease is simulated by the random variable H:

$$H = \sum_{i=1}^{k} v_i y_i + G + E. \tag{3}$$

Here the coefficients $v_i \ge 0$, E is the model of environmental factors and G is the model of undisclosed genetic factors; both are normally distributed random variables with zero means and standard deviations $\sigma_e$ and $\sigma_g$, respectively. The random variable $y_i$ is the model of an SNP having a large effect and is a binomially distributed variable $B(2, p_i)$, where $p_i = f^i_{R_2} + 0.5 f^i_R$ is the frequency of the risk allele; the numeral 2 corresponds to the 2 trials needed to form a diplotype.
Let us denote the multidimensional random variable by $Y = (y_1, \dots, y_k)$ and its realization by $X = (x_1, \dots, x_k)$. It is assumed that an individual will develop the disease during his or her lifetime if $H > \alpha$, where α is chosen such that the average lifetime risk $p(D)$ equals the probability $p(H > \alpha)$. The genotype X of an individual is generated according to the binomial distributions $B(2, p_i)$. Let us denote the set of all generated genotype codes by $\mathcal{X}$.
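The generation of genotypes and the role of the threshold α can be illustrated with a short simulation. The sketch assumes that H is a weighted sum of risk-allele counts plus a single normal term standing for G + E with standard deviation `sigma`, and it picks α empirically so that the fraction of simulated individuals with H > α matches p(D). All names and parameter values are hypothetical.

```python
import random


def simulate_population(n, allele_freqs, v, sigma, p_d, seed=0):
    """Simulate n individuals and split them into cases and controls."""
    rng = random.Random(seed)
    genotypes, liabilities = [], []
    for _ in range(n):
        # Two Bernoulli trials per locus give a B(2, p_i) risk-allele count.
        x = [(rng.random() < p) + (rng.random() < p) for p in allele_freqs]
        # Assumed form of H: genetic part plus one normal term for G + E.
        h = sum(vi * xi for vi, xi in zip(v, x)) + rng.gauss(0.0, sigma)
        genotypes.append(x)
        liabilities.append(h)

    # Empirical threshold: the top p_d fraction of H values are cases.
    ranked = sorted(liabilities, reverse=True)
    alpha = ranked[round(n * p_d)]
    cases = [x for x, h in zip(genotypes, liabilities) if h > alpha]
    controls = [x for x, h in zip(genotypes, liabilities) if h <= alpha]
    return alpha, cases, controls


# Hypothetical parameters for two loci.
alpha, cases, controls = simulate_population(
    n=2000, allele_freqs=[0.3, 0.2], v=[0.5, 0.3], sigma=1.0, p_d=0.1, seed=1
)
```

An empirical quantile is used here only for brevity; it sidesteps the analytical calibration of α.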
The parameters of the model (3) should be tuned to real statistical data about the disease. We have no access to clinical data; therefore the model parameters are tuned not to raw data but to statistics calculated from them. A system of nonlinear equations for tuning the model's parameters has been developed and solved.
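From equation (3), the independence of the risk factors, the formula of total probability, and the normality of the random variable G + E, we have the equation:

$$p(D) = \sum_{X \in \{0,1,2\}^k} \left[\prod_{i=1}^{k} P(y_i = x_i)\right] \left[1 - \Phi\!\left(\frac{\alpha}{\sigma} - \sum_{i=1}^{k} \frac{v_i}{\sigma} x_i\right)\right]. \tag{4}$$

Here $P(y_i = x)$ is the probability that the binomial random variable $y_i$ acquires the value x, Φ is the standard normal distribution function, and $\sigma = \sqrt{\sigma_e^2 + \sigma_g^2}$ is the standard deviation of G + E; the ratios $\alpha/\sigma$ and $v_i/\sigma$, where $i = 1, 2, \dots, k$, are assumed to be unknown variables.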
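The total-probability computation can be sketched numerically. The code below assumes that the lifetime risk is the sum, over all $3^k$ genotypes, of the genotype probability times the normal tail probability of exceeding the threshold, expressed through the ratios α/σ and v_i/σ. The function and parameter names are hypothetical.

```python
from itertools import product
from math import comb
from statistics import NormalDist


def binomial_pmf(x, p):
    """Probability that a B(2, p) variable takes the value x."""
    return comb(2, x) * p**x * (1 - p) ** (2 - x)


def lifetime_risk(alpha_s, v_s, allele_freqs):
    """Average lifetime risk p(H > alpha) under the assumed model.

    alpha_s is alpha / sigma and v_s[i] is v_i / sigma.
    """
    phi = NormalDist().cdf  # standard normal distribution function
    total = 0.0
    for x in product((0, 1, 2), repeat=len(v_s)):
        # Independence of loci: the genotype probability is a product.
        weight = 1.0
        for xi, p in zip(x, allele_freqs):
            weight *= binomial_pmf(xi, p)
        shift = sum(vi * xi for vi, xi in zip(v_s, x))
        total += weight * (1.0 - phi(alpha_s - shift))
    return total


# Hypothetical parameters for two loci.
risk = lifetime_risk(alpha_s=1.5, v_s=[0.4, 0.2], allele_freqs=[0.3, 0.2])
```

The loop visits $3^k$ genotypes, so this direct enumeration is practical only for a moderate number of loci.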
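The k equations follow from the definition of the relative risk:

$$\lambda^j_R = \frac{p(D \mid y_j = 1)}{p(D \mid y_j = 0)}, \quad j = 1, 2, \dots, k, \tag{5}$$

where

$$p(D \mid y_j = x) = \sum_{X:\, x_j = x} \left[\prod_{i \ne j} P(y_i = x_i)\right] \left[1 - \Phi\!\left(\frac{\alpha}{\sigma} - \sum_{i=1}^{k} \frac{v_i}{\sigma} x_i\right)\right],$$

Φ is the standard normal distribution function and σ is the standard deviation of G + E. The system of nonlinear equations (4)-(5) can be solved by a numerical method. We have used an approximate solution that brings the values of expressions (4)-(5) nearest to the statistics $p(D)$ and $\lambda^j_R$ calculated from real data. According to formula (3), the risk of developing the disease for an individual $X \in \mathcal{X}$ is generated by the formula

$$H(X) = \sum_{i=1}^{k} v_i x_i + \sigma\, N(0, 1),$$

where N(0, 1) is a realization of a random variable with the standard normal distribution. The individual is attributed to the positive case sample $\mathcal{X}_p$ if $H(X) > \alpha$. Otherwise, the individual is attributed to the negative case sample $\mathcal{X}_n$ (healthy individuals).

Liet. matem. rink. Proc. LMS, Ser. A, 56, 2015, 107-112.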

Experiments
The Receiver Operating Characteristic (ROC) curve is often used to evaluate the quality of a predictive test. The ROC curve is created by plotting the true positive rate TP(β) (sensitivity) on the vertical axis against the false positive rate FP(β) (fall-out) on the horizontal axis at various values of the threshold β. The larger the area under the curve (AUC), the better the indicator.
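The construction of the ROC curve and of the AUC can be sketched in a few lines. The code assumes that an individual is flagged when the risk index strictly exceeds β, and it integrates the curve with the trapezoidal rule. The function names and the scores are hypothetical.

```python
def roc_points(case_scores, control_scores):
    """Return (FP, TP) points for thresholds swept from high to low."""
    betas = sorted(set(case_scores) | set(control_scores), reverse=True)
    betas.append(float("-inf"))  # last threshold flags everyone
    points = []
    for beta in betas:
        tp = sum(s > beta for s in case_scores) / len(case_scores)
        fp = sum(s > beta for s in control_scores) / len(control_scores)
        points.append((fp, tp))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical risk-index values for diseased and healthy individuals.
curve = roc_points([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
area = auc(curve)
```

Each threshold is evaluated by a full pass over the scores, which keeps the sketch short; for a population of 100 000 individuals one would rather sort the scores once and accumulate the counts.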
Experiments have been performed with the model (3) tuned to Type 2 Diabetes (T2D) statistics. The odds ratios of the SNPs most significantly associated with known T2D susceptibility loci in individuals of European ancestry are taken from [10], which presents the results of 3 different population scans: the Diabetes Genetics Initiative (DGI), the Finland-US Investigation (FUSION) and WTCCC. The model (3) has been tuned to the data taken from the DGI database.
The system of nonlinear equations (4)-(5) has been expressed as a least squares optimization problem and solved using the Single Agent Stochastic Search (SASS) algorithm [5]. An approximate solution of the system has been used to generate a population $\mathcal{X}$ of 100 000 individuals following (3). The generated population has been used to compare the reliability of the indexes GCI and OR. The ROC curves of both indexes are presented in Fig. 1(a). The figure shows that the difference between the curves is very slight; therefore both methods produce very similar results when the disease model satisfies the same assumptions as the risk indexes.
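The least squares formulation can be illustrated with a plain random local search. This is a simplified stand-in and not the SASS algorithm of [5]: it perturbs the current point with Gaussian noise, keeps only improvements, and shrinks the step after failures. The toy system of two equations replaces the model equations, and all names and settings are hypothetical.

```python
import random


def least_squares_loss(residuals, params):
    """Sum of squared residuals of a system of equations."""
    return sum(r(params) ** 2 for r in residuals)


def random_search(residuals, start, lower_bounds, iters=5000, step=0.5, seed=0):
    """Minimize the least squares loss by a simple stochastic local search."""
    rng = random.Random(seed)
    best = list(start)
    best_loss = least_squares_loss(residuals, best)
    for _ in range(iters):
        # Gaussian perturbation, clipped so that parameters stay feasible.
        cand = [max(lb, b + rng.gauss(0.0, step)) for b, lb in zip(best, lower_bounds)]
        cand_loss = least_squares_loss(residuals, cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
        else:
            step = max(step * 0.995, 1e-6)  # search more locally after a failure
    return best, best_loss


# Toy system standing in for the tuning equations:
# x + y = 3 and x * y = 2, with x >= 0 and y >= 0.
residuals = [lambda p: p[0] + p[1] - 3.0, lambda p: p[0] * p[1] - 2.0]
start = [0.5, 0.5]
solution, final_loss = random_search(residuals, start, lower_bounds=[0.0, 0.0])
```

For the disease model, each residual would be the difference between a model statistic and its target value, such as p(D) or a relative risk, and the lower bounds would encode the constraint that the coefficients v_i are non-negative.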
Genetic databases contain no parameters for the Lithuanian population; therefore data for populations of European ancestry are used for Lithuanian patients. In order to investigate the sensitivity of the risk indexes to the choice of such a population, the risk indexes and the corresponding ROC curves have been calculated for each of the three European populations, while the population $\mathcal{X}$ has been tuned to the DGI population. The results, presented in Fig. 1(b), show that using another population that is territorially close to the one the individual comes from reduces the accuracy of the prediction, though the reduction is slight.

Conclusions
The risk indexes GCI and OR produce very similar ROC curves when the disease model satisfies the same assumptions as the risk indexes. The quality of the risk indexes as a predictive test is low. The indexes are sensitive to the choice of a population territorially close to the one the individual under investigation comes from. The choice of an improper population reduces the quality of prediction.

Fig. 1. Comparison of ROC curves: (a) comparison of different indexes; (b) sensitivity to the choice of the population.