Introduction

Personal genome tests have been offered directly to individual consumers since 2007.1 These data can be used to estimate the individual risk of diseases using the genotypes at single-nucleotide polymorphisms reported to be associated with diseases. However, genetic risk models based on known single-nucleotide polymorphisms typically have only low to moderate predictive ability for most diseases, because of the relatively low effect sizes of previously reported single-nucleotide polymorphisms.

Kalf et al.2 compared various algorithms used by three private genetic risk service companies (23andMe, deCODEme and Navigenics), and reported that the area under the curve (AUC) values of receiver operating characteristic curves differed between these companies even for the same genotypes. In addition, previous reports showed that the predicted risks differed among companies and were divergent for some traits in some individuals.3, 4, 5, 6 Although previous reports suggested that the discriminative accuracy, as reflected by the AUC obtained with currently available single-nucleotide polymorphisms, is not sufficiently high,7, 8, 9 additional variations that have not yet been discovered, along with more sophisticated algorithms, may improve the accuracy of this method.

In the present study, we examined the characteristics of estimated risks based on individual genotypes from single and multiple loci to evaluate the validity of estimating such risks.

To estimate an individual's risk of expressing a qualitative phenotype, such as a disease, from single or multiple associated genetic loci, it is first necessary to determine the average risk in the population, in addition to the population allele frequency and the odds ratio of the association. However, it is often difficult to obtain an accurate average risk for a population. This risk is usually estimated either from the results of an epidemiological study or from a meta-analysis of multiple studies, and is expressed as an interval, such as the 95% confidence interval, rather than as a point estimate.

This type of interval estimation means that the calculated risk of a subject is likely to be influenced by any change in the average risk within the interval. According to such analyses, a graph of the estimated relative risk (y axis) against the average risk of the population (x axis) can be constructed. In the present context, a relative risk is defined as the individual risk divided by the average risk of the population. In general, it is more important to know whether an individual risk is higher or lower compared with the average risk of the population rather than estimating the absolute individual risk. Therefore, it is essential to determine whether the relative risk vs average risk graph crosses the line of y=1, and if so, to determine the point at which the average population risk (x) is equal to the estimated risk of the subject.

Here, we examine the conditions in which the graph of the estimated relative risk of a subject crosses the line of y=1, and propose methods to cope with that situation.

Materials and methods

First, we describe the algorithm used to calculate individual risk based on genotypes examined in this study.

Estimating the individual risk based on a single-locus genotype

Among two alleles, A and a, at a given locus, we designate a as the allele of interest. Accordingly, the number of the alleles of interest in the genotype of a subject is 0, 1 or 2 for the genotypes AA, Aa and aa, respectively.

Let d1, d2 and d3 be the absolute risks (e.g., the probability of developing a disease) of the subjects with the genotypes AA, Aa and aa, respectively. Let r1 be the odds ratio of the risk for the comparison of genotypes Aa and AA, and let r2 be the odds ratio of the risk for the comparison of genotypes aa and Aa. Then, because of the definition of the odds ratio, the following equations hold:

$$ r_1 = \frac{d_2/(1-d_2)}{d_1/(1-d_1)} \tag{1} $$

$$ r_2 = \frac{d_3/(1-d_3)}{d_2/(1-d_2)} \tag{2} $$

Let p denote the frequency of allele a and let m denote the average risk in the population, which is usually calculated from either the incidence or prevalence in the population. If Hardy–Weinberg equilibrium is assumed, then the following equation holds:

$$ m = (1-p)^2 d_1 + 2p(1-p)\, d_2 + p^2 d_3 \tag{3} $$

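By removing the variables d2 and d3 using Equations (1)–(3), the following equation is obtained, in which d1 remains as the only unknown:

$$ m = (1-p)^2 d_1 + \frac{2p(1-p)\, r_1 d_1}{1+(r_1-1)\, d_1} + \frac{p^2\, r_1 r_2\, d_1}{1+(r_1 r_2-1)\, d_1} \tag{4} $$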

Although this equation can be solved mathematically with respect to the variable d1, the solution is too complex to present here.

When assigning values to p, m, r1 and r2, we can obtain the solution for d1 using Cardano’s method.10
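The following worked derivation is a sketch of why a cubic solver such as Cardano's method applies here. It assumes that Equation (4) is the result of substituting the odds-ratio relations into the Hardy–Weinberg average; the shorthand symbols a and b are introduced only for this illustration and do not appear in the original text.

```latex
% Shorthand (illustrative only): a = r_1 - 1, \quad b = r_1 r_2 - 1.
% Substituting d_2 = r_1 d_1 / (1 + a d_1) and d_3 = r_1 r_2 d_1 / (1 + b d_1)
% into m = (1-p)^2 d_1 + 2p(1-p) d_2 + p^2 d_3 and multiplying both sides
% by (1 + a d_1)(1 + b d_1) gives a cubic in d_1:
\begin{aligned}
0 ={}& (1-p)^2\, a b \; d_1^{3} \\
   &+ \bigl[(1-p)^2 (a+b) + 2p(1-p)\, r_1 b + p^2\, r_1 r_2\, a - m\, a b\bigr]\, d_1^{2} \\
   &+ \bigl[(1-p)^2 + 2p(1-p)\, r_1 + p^2\, r_1 r_2 - m\,(a+b)\bigr]\, d_1 \\
   &- m .
\end{aligned}
```

When r1=1 or r1r2=1, the cubic coefficient vanishes and the equation reduces to a quadratic. Only a root with 0≤d1≤1 is meaningful as a risk.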

Estimation of the risk based on genotypes for a single locus

After deriving an appropriate solution of d1 using Cardano’s method based on Equation (4) as described above, d2 and d3 can be derived using the following equations obtained from Equations (1) and (2):

$$ d_2 = \frac{r_1 d_1}{1+(r_1-1)\, d_1}, \qquad d_3 = \frac{r_2 d_2}{1+(r_2-1)\, d_2} $$

Thus, d1, d2 and d3 can be obtained if the values of p, m, r1 and r2 are known. The relative risk of an individual with the genotype AA, Aa or aa as compared with the average risk of the population (that is, d1/m, d2/m or d3/m) can also be derived. From Equation (3), we obtain

$$ (1-p)^2\, \frac{d_1}{m} + 2p(1-p)\, \frac{d_2}{m} + p^2\, \frac{d_3}{m} = 1 $$

When 0≤d1<d2<d3≤1 and 0<p<1, d1/m must be <1 and d3/m must be >1.
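The bound above can be checked in two lines. This is a short worked derivation, assuming only that m is the Hardy–Weinberg weighted average of d1, d2 and d3 and that the genotype weights sum to 1.

```latex
% Uses (1-p)^2 + 2p(1-p) + p^2 = 1 to rewrite d_1 and d_3 as weighted sums.
\begin{aligned}
m - d_1 &= 2p(1-p)\,(d_2 - d_1) + p^2\,(d_3 - d_1) > 0, \\
d_3 - m &= (1-p)^2\,(d_3 - d_1) + 2p(1-p)\,(d_3 - d_2) > 0 .
\end{aligned}
```

No such bound holds for d2/m, which can fall on either side of 1 depending on p, m, r1 and r2.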

The effects of changing the values of m, p, r1 or r2 on the relative risk of an individual were analyzed for each genotype within an appropriate interval, and graphs were constructed to examine these effects visually.
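The single-locus calculation can be sketched in Python as follows. This is a minimal illustration, not the authors' Singlelocus.R: it replaces Cardano's method with bisection on d1, which works because the population average risk increases monotonically with d1 when the odds ratios are positive. The function name `genotype_risks` and the example parameter values are assumptions made for this sketch.

```python
def genotype_risks(p, m, r1, r2, tol=1e-12):
    """Return (d1, d2, d3), the risks for genotypes AA, Aa and aa.

    p: frequency of allele a; m: average population risk (0 < m < 1);
    r1: odds ratio Aa vs AA; r2: odds ratio aa vs Aa.
    Hypothetical helper: bisection is used instead of Cardano's method.
    """
    def from_odds_ratio(d, r):
        # Risk whose odds are r times the odds of risk d.
        return r * d / (1.0 - d + r * d)

    def average_risk(d1):
        # Hardy-Weinberg weighted average of the three genotype risks.
        d2 = from_odds_ratio(d1, r1)
        d3 = from_odds_ratio(d2, r2)
        return (1 - p) ** 2 * d1 + 2 * p * (1 - p) * d2 + p ** 2 * d3

    lo, hi = 0.0, 1.0  # average_risk(0) = 0 and average_risk(1) = 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if average_risk(mid) < m:
            lo = mid
        else:
            hi = mid
    d1 = (lo + hi) / 2
    d2 = from_odds_ratio(d1, r1)
    d3 = from_odds_ratio(d2, r2)
    return d1, d2, d3


# Relative risks d/m for each genotype over a small grid of average risks.
for m in (0.05, 0.1, 0.2, 0.4, 0.8):
    d1, d2, d3 = genotype_risks(p=0.3, m=m, r1=1.5, r2=2.0)
    print(m, d1 / m, d2 / m, d3 / m)
```

Plotting the three printed columns against m would give curves of the kind described in the text, one per genotype.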

Estimation of the risk based on genotypes for multiple loci

The risk of a subject based on the data from multiple loci is calculated using the following multivariate logistic model:

$$ \ln\frac{P}{1-P} = \beta_0 + \sum_{i=1}^{n} \beta_i x_i \tag{6} $$

where βi (i=0,1,…,n) denote coefficients, and P denotes the variable for the risk. For the multilocus model, we assume that the two odds ratios, r1 and r2, are equal (i.e., the effect of an allele is additive), and xi denotes the number of copies of the allele of interest a in the genotype at the ith locus; xi is 0, 1 and 2 for genotypes AA, Aa and aa, respectively.

Since P/(1−P) is the odds, the odds ratio of the risk for the comparison of genotypes aa (i.e., xi=2) and Aa (i.e., xi=1) is exp(βi), which is equal to ri, denoting the odds ratio at the ith locus. Note that ri is the same as r1 and r2 for the ith locus in Equations (1) and (2). Therefore,

$$ \beta_i = \ln r_i \qquad (i=1,2,\ldots,n) \tag{7} $$

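By solving Equation (6) with respect to P, we get the following logistic function:

$$ P = \frac{1}{1+\exp\!\left[-\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\right)\right]} \tag{8} $$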

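The average of P in the population is

$$ E(P) = \sum_{s \in S} \Pr(s)\, \frac{1}{1+\exp\!\left[-\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\right)\right]} \tag{9} $$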

where s=(x1, x2,…, xn), and S denotes the set of all s. In Equation (9),

$$ \Pr(s) = \prod_{i=1}^{n} \binom{2}{x_i}\, p_i^{\,x_i}\, (1-p_i)^{\,2-x_i} \tag{10} $$

where pi denotes the frequency of the allele of interest at the ith locus.
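As a minimal sketch of how the population average of P can be evaluated, the Python code below enumerates every genotype vector s and weights the logistic risk by the genotype probability. It assumes Hardy–Weinberg equilibrium at each locus and independence between loci; the function names and example values are illustrative and are not taken from the authors' scripts.

```python
from itertools import product
from math import comb, exp


def genotype_probability(s, freqs):
    """Probability of genotype vector s = (x1, ..., xn).

    Assumes Hardy-Weinberg equilibrium at each locus and independent loci;
    freqs[i] is the frequency of the allele of interest at locus i.
    """
    prob = 1.0
    for x, p in zip(s, freqs):
        prob *= comb(2, x) * p ** x * (1 - p) ** (2 - x)
    return prob


def logistic_risk(beta0, betas, s):
    """Risk P for genotype vector s under the logistic model."""
    z = beta0 + sum(b * x for b, x in zip(betas, s))
    return 1.0 / (1.0 + exp(-z))


def expected_risk(beta0, betas, freqs):
    """Population average E(P): sum over all 3**n genotype vectors."""
    genotypes = product((0, 1, 2), repeat=len(freqs))
    return sum(genotype_probability(s, freqs) * logistic_risk(beta0, betas, s)
               for s in genotypes)


# Illustrative values: two loci, intercept -2, log odds ratios 0.3 and 0.5.
print(expected_risk(-2.0, [0.3, 0.5], [0.2, 0.4]))
```

Because the sum runs over 3^n genotype vectors, this direct enumeration is practical only for a modest number of loci.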

When βi (i=1, 2,…,n) and pi (i=1, 2, …,n) as well as s=(x1, x2, …, xn) are given, the right-hand side of Equation (9) is a monotonically increasing function of β0.

Therefore, if E(P) is given, β0 can be numerically determined by solving Equation (9).

When β0 is determined, P for each given s, denoted Ps, is obtained from Equation (8), and the relative risk Ps/E(P) for each subject based on the observed genotypes can be obtained using Equations (8) and (9).
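The procedure above (fix E(P), solve for β0, then evaluate Ps/E(P)) can be sketched in Python as follows. This is an illustration under stated assumptions rather than a port of MultilocusSubject.R: β0 is found by bisection on the bracket [-50, 50], genotype probabilities assume Hardy–Weinberg equilibrium and independent loci, and all function names and example values are hypothetical.

```python
from itertools import product
from math import comb, exp, log


def genotype_probability(s, freqs):
    """Probability of genotype vector s (HWE, independent loci assumed)."""
    prob = 1.0
    for x, p in zip(s, freqs):
        prob *= comb(2, x) * p ** x * (1 - p) ** (2 - x)
    return prob


def risk(beta0, betas, s):
    """Logistic risk for genotype vector s."""
    z = beta0 + sum(b * x for b, x in zip(betas, s))
    return 1.0 / (1.0 + exp(-z))


def mean_risk(beta0, betas, freqs):
    """Population average risk for a given intercept beta0."""
    genotypes = product((0, 1, 2), repeat=len(freqs))
    return sum(genotype_probability(s, freqs) * risk(beta0, betas, s)
               for s in genotypes)


def solve_beta0(m, betas, freqs, lo=-50.0, hi=50.0, iterations=200):
    """Bisection for beta0 such that mean_risk(beta0) = m.

    Relies on mean_risk increasing monotonically with beta0; the bracket
    [-50, 50] is an assumption that covers any m not extremely close to 0 or 1.
    """
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mean_risk(mid, betas, freqs) < m:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def relative_risk(s, freqs, odds_ratios, m):
    """Relative risk Ps / E(P) for genotype vector s, with E(P) = m."""
    betas = [log(r) for r in odds_ratios]  # additive model: beta_i = ln(r_i)
    beta0 = solve_beta0(m, betas, freqs)
    return risk(beta0, betas, s) / m


# Illustrative subject: genotype Aa at locus 1 and aa at locus 2.
print(relative_risk((1, 2), freqs=[0.2, 0.4], odds_ratios=[1.3, 1.8], m=0.1))
```

A useful sanity check on any such implementation is that the genotype-probability-weighted average of the relative risks over all genotype vectors equals 1.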

Thus, the relative risk Ps/E(P) based on a combination of multiple genotypes can be determined for different values of E(P). Accordingly, the graph was drawn for Ps/E(P) for the given values of m=E(P) between 0 and 1.

Results

Effect of the average population risk on the relative risks for different genotypes at a single locus

We developed an R script named Singlelocus.R (Supplementary Material 1) to solve Equations (1)–(3) for d1, d2 and d3 (penetrance parameters for genotypes AA, Aa and aa, respectively) from the odds ratios r1 and r2, the population frequency p of the allele of interest a and the average population risk m. First, we determined d1 by solving Equation (4), and then determined d2 and d3. This R script also draws the curves of d1, d2 and d3 as a function of the average population risk m.

Using this R script, we examined the effect of changes in the average population risk (m) on the relative risks for the different genotypes (AA, Aa, aa) based on the data from a single locus. Figures 1 and 2 show the results with various values of p, r1 and r2. All of the graphs reached a line of y=1 when m was 1, where y denotes the relative risk (Figures 1a–d and 2a–d). In all cases, the order of the relative risks for the genotypes AA, Aa, aa was AA≤Aa≤aa, and the relative risks of AA and Aa were equal when r1=1 (Figure 1c), whereas those of Aa and aa were equal when r2=1 (Figure 1d).

Figure 1

Relative risk of subjects with three different genotypes at a single locus calculated using the R script Singlelocus.R at varying population average risk values, m. The frequency p of the allele of interest a and two odds ratios, r1 and r2, were given as shown. n indicates the number of loci and n=1 was assumed to derive these graphs.

Figure 2

Relative risk of subjects with three different genotypes at a single locus calculated using the R script Singlelocus.R at varying population average risk values, m, but with higher values of the odds ratios r1 and r2. Other conditions and parameters are the same as described for Figure 1.

Figures 1a–d indicate that under the applied conditions (r1 and r2 close to 1), the relative risks of the individuals change in an almost linear manner as the average risk changes, and the differences between genotypes tend to decrease as the average risk increases. Thus, when r1>1 and r2>1, the relative risk for AA tends to increase, whereas that for aa tends to decrease, as the average risk increases (Figures 1a and b). The relative risk for Aa tends to increase when it is lower than 1, whereas it tends to decrease when it is higher than 1 (Figures 1a and b). None of the lines for the relative risks of different genotypes crossed the horizontal line of 1.0, indicating that whether the risk of an individual lies above or below the population average does not change when the average risk changes (Figures 1a–d).

When r1 and r2 are rather high, the relationship between the average risk and the relative risks of the subjects with different genotypes is no longer nearly linear (Figures 2a–d). Furthermore, the relative risk for genotype Aa changes from lower than the average to higher than the average as the average risk increases (Figures 2b–d), as reflected by the fact that the graph crosses the line of y=1.

Effect of the frequency of the allele of interest on the relative risks for different genotypes based on single-locus data

We next examined the effect of the frequency of the allele of interest on the relative risks for different genotypes, and the results are shown in Figures 3a–d. With odds ratios of r1>1 and r2>1, the relative risks for all genotypes tend to decrease when the frequency of the allele of interest increases (Figures 3a and b). By contrast, with odds ratios of r1<1 and r2<1, the relative risks for all genotypes tend to increase when the frequency of the allele of interest increases (Figure 3c). However, when r1=0.7 and r2=2.0 (i.e., overdominance), the lines neither increase nor decrease monotonically, but instead show peaks between 0 and 1 (Figure 3d).

Figure 3

Relative risk of subjects with three different genotypes at a single locus calculated by the R script Singlelocus.R with varying frequencies of the allele of interest a. The average population risk m and two odds ratios, r1 and r2, were given as shown. n indicates the number of loci and n=1 was assumed to derive these graphs.

Effect of the odds ratio on relative risks for different genotypes based on single-locus data

We also examined the effect of the odds ratio on the relative risks for different genotypes, and the results are shown in Figures 4a–d. These graphs indicate that the relative risks for genotype AA decrease, whereas those for genotype aa increase when the odds ratio increases (Figures 4a–d). The change in the relative risk for genotype Aa in response to changes in the odds ratio depended on the specific conditions (Figures 4a–d).

Figure 4

Relative risk of subjects with three different genotypes at a single locus calculated by the R script Singlelocus.R at varying odds ratios, r1 and r2. The average population risk m and the frequency p of the allele of interest a were given as shown. n indicates the number of loci and n=1 was assumed to derive these graphs.

Estimation of the risk based on genotypes at multiple loci using R scripts

To calculate the risk based on multiple loci, we developed an R script, MultilocusSubject.R (Supplementary Material 2), which takes as input the frequencies of the allele of interest, the odds ratios under the additive model and the genotypes of the subject at multiple loci, as well as the average population risk m. This script also calculates the relative risk in comparison with the average risk.

We also developed another R script, MultilocusCurve.R (Supplementary Material 3), to draw a graph showing how the individual relative risk changes with the average risk m. We performed an extensive simulation by inputting a variety of data to MultilocusCurve.R. All of the graphs reach y=1 when m reaches 1 (Figures 5–7). In general, the relative risk either increases or decreases monotonically when m increases from 0 to 1, and it finally reaches 1 when m is equal to 1 (Figures 5a, 6a and 7b). Therefore, in these cases, the relative risk reaches 1 only when m=1. When ri>1 and all loci have the genotype AA (that is, xi=0), the relative risk is always below 1, and increases monotonically to reach 1 when m=1 (Figure 5b). However, when ri>1 and all loci have the genotype aa (that is, xi=2), the relative risk is always above 1, and decreases monotonically to reach 1 when m=1 (Figure 5a). The graph does not cross the line of y=1 in any of these cases.

Figure 5

Relative risk of subjects with different genotype frequencies at multiple loci calculated using the R script MultilocusCurve.R with varying average population risk values, m. The genotype is expressed according to the number of the allele of interest a; that is, 0 for AA, 1 for Aa and 2 for aa. The frequency p of the allele of interest a, and a single odds ratio r for each locus were given as shown. The loci were numbered, and two or three loci were assumed for the construction of these graphs.

Figure 6

Relative risk of subjects with different genotype frequencies at multiple loci calculated using the R script MultilocusCurve.R with varying average population risk values, m. All parameters are the same as those described for Figure 5, except that three or four loci were assumed in these cases.

Figure 7

Relative risk of subjects with different genotype frequencies at multiple loci calculated using the R script MultilocusCurve.R with varying average population risk values, m. All parameters are the same as those described for Figure 5, except that five loci were assumed in these cases.

In rare cases, however, the graph does cross the line of y=1 in the interval 0<m<1 before reaching 1 at m=1, similar to the case of using data from a single locus (Figures 5c and 6b). This phenomenon occurred irrespective of the number of loci considered. The graph tended to cross the line of y=1 within the interval 0<m<1 when some of the genotypes were Aa (that is, xi=1). However, even when none of the loci had the genotype Aa, the relative risk could still cross the line of y=1 (Figure 6b).

In MultilocusCurve.R, we implemented an algorithm based on the bisection method11 to determine the value of m at which the relative risk crosses the line y=1; that is, the value of m at which the relative risk of an individual is equal to 1.
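A crossing-point search of this kind can be sketched in Python as follows. This is an illustration only and does not reproduce MultilocusCurve.R: it assumes Hardy–Weinberg equilibrium, independent loci, a search interval of [1e-4, 1 - 1e-4] for m, and at most one sign change of (relative risk - 1) inside that interval. All names and example values are hypothetical.

```python
from itertools import product
from math import comb, exp, log


def relative_risk(s, freqs, odds_ratios, m, iterations=100):
    """Relative risk Ps / m for genotype vector s when E(P) = m."""
    betas = [log(r) for r in odds_ratios]
    genotypes = list(product((0, 1, 2), repeat=len(freqs)))
    weights = []
    for g in genotypes:
        w = 1.0
        for x, p in zip(g, freqs):
            w *= comb(2, x) * p ** x * (1 - p) ** (2 - x)
        weights.append(w)

    def risk(beta0, g):
        z = beta0 + sum(b * x for b, x in zip(betas, g))
        return 1.0 / (1.0 + exp(-z))

    # Inner bisection: intercept beta0 that reproduces the average risk m.
    lo, hi = -50.0, 50.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        mean = sum(w * risk(mid, g) for w, g in zip(weights, genotypes))
        if mean < m:
            lo = mid
        else:
            hi = mid
    return risk((lo + hi) / 2, s) / m


def crossing_point(s, freqs, odds_ratios, lo=1e-4, hi=1 - 1e-4, iterations=60):
    """Outer bisection: m in (lo, hi) at which the relative risk equals 1.

    Returns None if (relative risk - 1) has the same sign at both ends.
    """
    def f(m):
        return relative_risk(s, freqs, odds_ratios, m) - 1.0

    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        return None
    for _ in range(iterations):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return (lo + hi) / 2


# Heterozygote at one locus with p = 0.5 and odds ratio 4 (illustrative values).
print(crossing_point((1,), freqs=[0.5], odds_ratios=[4.0]))
# All-AA genotype with an odds ratio above 1: no crossing is expected.
print(crossing_point((0,), freqs=[0.5], odds_ratios=[4.0]))
```

For a reported interval of m, the returned value can be compared with the interval's endpoints to decide whether the subject's relative risk changes side within it.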

Discussion

In this study, we used the R environment (R version 2.15.0; The R Foundation for Statistical Computing, ISBN 3-900051-07-0; platform: i386-pc-mingw32/i386) to implement a system for estimating the risk of a subject given known allele frequencies, odds ratios and genotypes of the subject at multiple loci, in addition to the average population risk m. For estimation of the risks based on the genotypes at multiple loci, we assumed the additive model for the effect of the allele of interest. We found that the individual relative risk of the genotype Aa can cross the line of y=1 when the allele frequency p changes. This is not expected to cause a major problem in interpretation or analysis because the estimate of the allele frequency is often accurate. However, we also found that the estimated relative risk can cross the line of y=1 in some rare cases when the average population risk changes. This may cause a problem because estimating the individual relative risk is often more important than estimating the absolute risk, and the average population risk is sometimes obtained only as an interval or an approximate value. Therefore, we propose that the relative risk should be estimated over an interval of average risk values m, followed by an examination of whether the risk becomes lower or higher than the average within the interval. If the relative risk crosses the line of y=1 within the interval, we recommend that the interpretation should be reported as either ‘the relative risk cannot be estimated’ or ‘the relative risk becomes higher or lower than the average risk of the population at the value x’.

A limitation of this study is that an inconsistent message may be acceptable for a heterozygote at a single locus; however, similar messages based on multiple loci may not be acceptable for some subjects. Another limitation is that the risk of a subject is largely influenced by factors other than the observed genotypes. Therefore, the usefulness of risk estimates based on the genotypes at a limited number of loci is itself limited.