Using synthetic datasets to bridge the gap between the promise and reality of basing health-related decisions on common single nucleotide polymorphisms

Thomas R. Wood; Nathan Owens

doi:10.12688/f1000research.21797.1

Research Article

[version 1; peer review: 1 approved, 1 approved with reservations]

Thomas R. Wood¹,², Nathan Owens³

PUBLISHED 30 Dec 2019

¹ Division of Neonatology, Department of Pediatrics, University of Washington Medical Center, Seattle, WA, USA
² Institute for Human and Machine Cognition, Pensacola, FL, USA
³ Independent Researcher, Seattle, WA, USA

Thomas R. Wood
Roles: Conceptualization, Data Curation, Formal Analysis, Methodology, Project Administration, Supervision, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Nathan Owens
Roles: Data Curation, Formal Analysis, Software, Visualization, Writing – Review & Editing

Abstract

Background: While the academic genetic literature has clearly shown that common genetic single nucleotide polymorphisms (SNPs), and even large polygenic SNP risk scores, cannot reliably be used to determine risk of disease or to personalize interventions, a significant industry of companies providing SNP-based recommendations still exists. Healthcare practitioners must therefore be able to navigate between the promise and reality of these tools, including being able to interpret the literature that is associated with a given risk or suggested intervention. One significant hurdle to this process is the fact that most population studies of common SNPs only provide average (+/- error) phenotypic or risk descriptions for a given genotype, which hides the true heterogeneity of the population and reduces the ability of an individual to determine how they themselves or their patients might truly be affected.
Methods: We generated synthetic datasets from descriptive phenotypic data published on common SNPs associated with obesity, elevated fasting blood glucose, and methylation status. Using simple statistical theory and full graphical representation of the generated data, we developed a method by which anybody can better understand phenotypic heterogeneity in a population, as well as the degree to which common SNPs truly drive disease risk.
Results: Individual risk SNPs had a <10% likelihood of affecting the associated phenotype (bodyweight, fasting glucose, or homocysteine levels). Example polygenic risk scores including the SNPs most associated with obesity and type 2 diabetes only explained 2% and 5% of the variability in the final phenotype, respectively.
Conclusions: The data suggest that most disease risk is dominated by the effect of the modern environment, providing further evidence to support the pursuit of lifestyle-based interventions that are likely to be beneficial regardless of genetics.

Keywords

Genetics, single nucleotide polymorphisms, risk, obesity, methylation, blood glucose

Corresponding author: Thomas R. Wood

Competing interests: N.O declares that they have no competing interests. T.R.W. is a director of the British Society of Lifestyle Medicine (Registered Charity: SCIO SC046920), and is the founder of an as-yet unincorporated digital health group focused on lifestyle-based interventions.

Grant information: T.R.W is supported by start-up funds from the University of Washington Department of Pediatrics.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2019 Wood TR and Owens N. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Wood TR and Owens N. Using synthetic datasets to bridge the gap between the promise and reality of basing health-related decisions on common single nucleotide polymorphisms [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2019, 8:2147 (https://doi.org/10.12688/f1000research.21797.1) First published: 30 Dec 2019, 8:2147 (https://doi.org/10.12688/f1000research.21797.1) Latest published: 30 Dec 2019, 8:2147 (https://doi.org/10.12688/f1000research.21797.1)

Introduction

Due to decreasing costs and a move towards “personalized medicine”, the use of direct-to-consumer (DTC) genetic analyses and third party interpretation services is increasing¹. Though whole genome sequencing is also increasing in popularity, most DTC products involve the analysis of common single nucleotide polymorphisms (SNPs). These SNPs are then reported, either by the testing company or a third party tool that analyses the data, with specific disease risks based on published population data such as that from genome-wide association studies (GWAS). While the academic genetic literature has clearly shown that using SNPs, including polygenic risk scores (PRSs), to determine disease risk or to personalize clinical interventions is not currently possible or evidence-based, the number of companies giving genetic-based advice on athletic ability or dietary recommendations is increasing². These risk predictions or recommendations are generally based on population average outcomes, with the heterogeneity of a given phenotype or disease risk infrequently reported. In fact, most GWAS tend to report only descriptive data (e.g. mean and standard error) for a given phenotype [such as body mass index (BMI) or fasting blood glucose] within a risk genotype. When only group averages by genotype are compared or provided, the consumer is likely to overestimate the disease risk associated with a given SNP. Presenting only simplified descriptive data, either graphically or numerically, for a given genotype gives the impression that each SNP has consistent penetrance with respect to the phenotype in question, which is known not to be the case³. Therefore, the interpretation of disease risk based on SNPs by those not involved in the original studies and without access to the original data is almost impossible.

More important than the mean population effects of a given SNP or combination of SNPs that influence a common phenotype is the likelihood of a physiologically-relevant effect in a given individual. This includes the likelihood that there is no overall effect of genotype, particularly compared to common environmental factors that drive chronic disease risk in high income countries such as diet, sleep, and exercise. In order to allow for healthcare practitioners or self-interested parties to better understand the likelihood of a given phenotype being altered by a specific genotype, we developed a method by which synthetic datasets could be generated and analyzed. This is largely possible due to the fact that the effects of SNPs on measurable phenotypes are generally considered to follow a normal distribution, with the number of alleles or weighted genetic scores being linearly associated with the target phenotype. The method outlined is not intended to be a systematic approach to the literature of SNPs and their association with disease risk, but is instead intended to give healthcare providers a simple tool with which to better understand the literature and answer the questions of their patients. Using this approach, the significant heterogeneity of population data can be better understood, particularly with respect to how a given individual may or may not display phenotypic changes based on the presence of common genotypes.

Methods

Selection of representative SNPs

To provide illustrative examples, individual studies and meta-analyses of per allele effects for common SNPs most strongly associated with risk of type 2 diabetes (Melatonin Receptor 1B, MTNR1B rs10830963), obesity (Fat mass and obesity-associated protein, FTO rs9939609), and altered methylation and nutrient handling resulting in elevated homocysteine levels (Methylenetetrahydrofolate Reductase, MTHFR rs1801131 and rs1801133) were identified from a commonly-used third-party SNP analysis tool (FoundMyFitness Genetic Report) output, as well as the online SNP wiki SNPedia.com⁴⁻⁶. Due to the significant effect of ethnicity on SNP disease penetrance, example population data that were likely to most closely match the Anglo-Scandinavian background of the first author were used in individual examples, including data from deCODE (Iceland) and the Northern Finland Birth Cohort (NFBC), which were included in large multi-population GWAS⁴,⁵. According to recently-published methods suggested by Pontzer et al., published hunter gatherer data for fasting glucose were used to provide an estimate of the effect of the Western environment on fasting glucose and diabetes risk compared to a published genetic risk score⁷.

Generation of synthetic datasets

Published per allele or per genetic risk score means were used to construct synthetic datasets for a given phenotype. All publications assumed data were normally distributed and that per allele/genetic risk score effects were linear. If data were expressed as mean with standard error (SE) or 95% confidence interval (CI), the standard deviation (SD) was calculated using the number (N) of participants in each group, where SD = SE × √N and SD = √N × (width of 95% CI)/3.92. When the descriptive data were not included in the publication, as was the case for genetic risk scores associated with obesity and fasting blood glucose⁴,⁵, they were estimated from published graphs by extracting images and determining the number of pixels in each column and error bar relative to the scale bars on the axes. In all cases, enough data were included in the manuscript body to confirm that at least one of the estimated values was correctly determined using this method (such as total number of participants, or mean values in the highest or lowest genetic risk groups). For each genotype and gene, 1,000 synthetic individuals were randomly generated to re-create a normally-distributed dataset with the same mean and SD characteristics as those in the associated publication. Numbers were generated using Python 3.7 and the NumPy (1.17.0) and Pandas (0.25.0) libraries. The necessary code is available on GitHub (https://github.com/root-causing-health/SNPGaussianDistGenerator). Visual inspection of the data (Prism version 8, GraphPad Software, San Diego, CA) confirmed that they were normally distributed.
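The conversion and generation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the linked GitHub repository: the function names (sd_from_se, sd_from_ci95, synthetic_group) and the fixed seed are hypothetical, and the example mean and SD are the NFBC TT-genotype BMI values reported in the Results.

```python
import math

import numpy as np


def sd_from_se(se, n):
    """SD = SE * sqrt(N)."""
    return se * math.sqrt(n)


def sd_from_ci95(lower, upper, n):
    """SD = sqrt(N) * (width of 95% CI) / 3.92."""
    return math.sqrt(n) * (upper - lower) / 3.92


def synthetic_group(mean, sd, size=1000, seed=None):
    """Draw `size` synthetic individuals from a normal distribution.

    The sample mean and SD will be close to, but not exactly equal to,
    the requested values because the draws are random.
    """
    rng = np.random.default_rng(seed)  # hypothetical seeding choice
    return rng.normal(loc=mean, scale=sd, size=size)


# Example: 1,000 synthetic BMI values for the NFBC TT genotype.
tt_bmi = synthetic_group(24.12, 3.87, seed=42)
```

Fixing the seed is only a convenience for reproducing a particular synthetic dataset; any seed should give a dataset with similar summary statistics.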

Statistical analysis

Each synthetic dataset was graphically represented using a violin plot to show the full distribution of the data. Percent chance of a null effect from a risk allele was calculated by determining the percent overlap of the normal distribution of the wild type phenotype with that of a risk genotype using statistics.NormalDist in Python 3.8 Beta. The percent likelihood of the phenotype in a risk allele group being at or below the mean value of the “wild type” was also calculated, and linear regression analysis was performed to determine the percent contribution of risk alleles to a given phenotype. Again, to provide graphical and statistical examples, similar analyses were performed using published multi-SNP PRSs for type 2 diabetes and obesity⁴,⁵. As of the time of writing, perhaps the largest and most comprehensive published PRS for obesity (Khera et al., 2.1 million common SNPs in >300,000 individuals) could not be analyzed using this method, as only the mean phenotype in each decile of genetic risk is presented, with no error metric in either the text or figures⁸.
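As a rough sketch of the overlap calculation, the snippet below uses statistics.NormalDist (Python 3.8 or later) with the deCODE fasting glucose values for MTNR1B rs10830963 reported in the Results. The helper names are hypothetical and the snippet illustrates the technique rather than reproducing the authors' script.

```python
from statistics import NormalDist


def null_effect_overlap(ref_mean, ref_sd, risk_mean, risk_sd):
    """Fraction of overlap between the reference and risk-genotype normals."""
    reference = NormalDist(ref_mean, ref_sd)
    risk = NormalDist(risk_mean, risk_sd)
    return reference.overlap(risk)


def prob_at_or_below_ref_mean(ref_mean, risk_mean, risk_sd):
    """Probability that a risk-genotype value is at or below the reference mean."""
    return NormalDist(risk_mean, risk_sd).cdf(ref_mean)


# deCODE mean (SD) fasting glucose in mg/dl: CC 95.2 (12.8), CG 97.0 (12.8), GG 97.9 (12.8)
overlap_cg = null_effect_overlap(95.2, 12.8, 97.0, 12.8)
overlap_gg = null_effect_overlap(95.2, 12.8, 97.9, 12.8)
below_cc_mean_gg = prob_at_or_below_ref_mean(95.2, 97.9, 12.8)

print(f"CG overlap with CC: {overlap_cg:.1%}")
print(f"GG overlap with CC: {overlap_gg:.1%}")
```

The two overlaps should agree with the 94.4% and 91.6% likelihoods of null effect reported for the CG and GG genotypes.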

Alternative methods

To encourage attempts to perform similar analyses, a number of free online tools can be used that do not require significant technical skills. After calculating mean and SD as described above, free Gaussian random number generators such as that from Random.org can be used to generate synthetic datasets. Though the Box-Muller transform used by this tool is unlikely to produce a truly normal distribution⁹, this is also unlikely to meaningfully affect the outcome. Similar online tools can be used to determine the likelihood of being at, above, or below a given point in a normal distribution to determine null effects of a given SNP or risk score (http://onlinestatbook.com/2/calculators/normal_dist.html). Finally, free online graphing software can be used to visually represent the datasets for visual examination of variability and overlap (e.g. Plotly), and perform linear regression analyses (e.g. GraphPad).

Results

FTO rs9939609 (A:T) and risk of being overweight

Published meta-analyses suggest an increase in BMI of 0.3 kg/m² per FTO rs9939609 A allele¹⁰. From this meta-analysis, data from the NFBC at 31 years of age (n=4,435) were used as a graphical example (Figure 1A)¹¹. Mean (SD) BMI across the three genotypes was 24.12 (3.87) kg/m², 24.43 (3.94) kg/m², and 24.82 (3.95) kg/m² for TT, AT, and AA respectively. In this population the risk of being overweight (BMI >25 kg/m²) was 41%, 44%, and 48%, respectively, resulting in an absolute 7% increase in risk in the AA genotype relative to TT. The likelihood of having a BMI at or below the mean of the TT genotype was 47% in those with the AT genotype, and 43% in those with the AA genotype. The likelihood of null effect (percent overlap in BMI distribution of those with AT and AA genotypes compared to TT) was 96.8% and 92.8%, respectively. Therefore, only 3.2% of AT and 7.2% of AA genotypes would be expected to display any increase in BMI due to FTO genotype relative to TT. Linear regression found a significant association between number of A copies and BMI (p=0.001, R²=0.0035), suggesting that only around 0.4% of the variability in BMI is determined by FTO genotype (Figure 1B).
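The overweight risks and the regression above can be approximated with the following sketch. It assumes equal group sizes of 1,000 per genotype, as in the synthetic datasets, and uses a hypothetical fixed seed; because the draws are random, the R² from any one run will differ somewhat from the published 0.0035 while remaining a small fraction of the total variance.

```python
from statistics import NormalDist

import numpy as np

# NFBC mean (SD) BMI by FTO rs9939609 genotype, keyed by number of A alleles.
GENOTYPES = {0: ("TT", 24.12, 3.87), 1: ("AT", 24.43, 3.94), 2: ("AA", 24.82, 3.95)}


def overweight_risk(mean, sd, threshold=25.0):
    """Probability of BMI above the overweight threshold under a normal model."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)


def allele_regression(groups, n_per_group=1000, seed=0):
    """Regress synthetic BMI on A-allele count; return slope, intercept, R^2."""
    rng = np.random.default_rng(seed)  # hypothetical seed for repeatability
    alleles, bmi = [], []
    for copies, (_, mean, sd) in groups.items():
        alleles.append(np.full(n_per_group, copies, dtype=float))
        bmi.append(rng.normal(mean, sd, n_per_group))
    x = np.concatenate(alleles)
    y = np.concatenate(bmi)
    slope, intercept = np.polyfit(x, y, 1)
    r_squared = np.corrcoef(x, y)[0, 1] ** 2
    return slope, intercept, r_squared


risks = {name: overweight_risk(mean, sd) for name, mean, sd in GENOTYPES.values()}
slope, intercept, r_squared = allele_regression(GENOTYPES)

for name, risk in risks.items():
    print(f"{name}: {risk:.0%} overweight")
print(f"R^2 = {r_squared:.4f}")
```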

Figure 1. Effect of FTO rs9939609 genotype on BMI in the NFBC cohort and linear regression of FTO rs9939609 A alleles versus BMI.

(A) Violin plot displaying 1,000 synthetic BMI datapoints per FTO rs9939609 genotype, based on published population mean and SD values from the NFBC cohort. Percent overlap between the AT and AA normal distributions with that of the “wild type” (TT) genotype are displayed as a measure of the likelihood of these risk genotypes having no overall effect on BMI. (B) Linear regression of 1,000 synthetic BMI datapoints per FTO rs9939609 A allele copy. There was a significant association between number of A copies and BMI (p=0.001, R²=0.0035), suggesting that only around 0.4% of the variability in BMI is determined by FTO genotype.

Genetic BMI risk score

Willer et al. established a BMI genetic score using eight validated SNPs associated with BMI, weighted to effect size (with FTO rs9939609 given the largest weighting)⁵. This score was applied to the European Prospective Investigation of Cancer (EPIC) Norfolk cohort, where the top 1.2% of people (risk score >12) had an average BMI 1.46 kg/m² greater than those in the bottom 1.4% (risk score <4). However, the majority of participants had risk scores in the middle of the range (6–10), with large variability across the whole range of scores (Figure 2A). In the highest genetic risk groups (genetic scores of 11, 12, and >12), the likelihood of null effect was at least 80% (Table 1). The likelihood of null effect in the most common genetic score (score of 8, 18.4% of participants) was 88.1%. This suggests that regardless of an individual’s genetic score, there is at most a 20% chance that they will display any increase in BMI due to their score relative to those 1.4% of individuals with the lowest genetic risk. Across the entire range of scores, linear regression found a significant association between risk score and BMI (p<0.001, R²=0.018), suggesting that only around 2% of the variability in BMI is determined by the eight SNPs most significantly associated with BMI (Figure 2B).

Table 1. Effect of BMI genetic score on risk of overweight and obesity.

BMI genetic risk score, as developed by Willer et al.⁵, and risk of being overweight or obese, using population mean and SD values from the EPIC Norfolk cohort. Genetic scores of 6–10 cover around 75% of the population. The likelihood of null effect of each score was determined as the percent overlap of its normal distribution with that of the lowest risk group (score <4). Even in the highest risk groups (11, 12, >12) percent overlap was at least 80%, with only 12–17% of those with a genetic score of 6–10 predicted to have BMI affected by their genotype.

Genetic BMI Score	Prevalence (%)	Mean (SD) BMI	Overweight (%)	Obese (%)	Distribution overlap with score <4 (%)
<4	1.4	25.4 (3.1)	55.1	7.1
4	3.4	25.7 (3.4)	58.2	10.2	95.3
5	7.2	25.9 (3.8)	59.3	14.2	89.1
6	12.9	26.2 (3.7)	62.6	15.6	88.2
7	17.4	26.2 (3.6)	63.1	14.4	89.3
8	18.1	26.3 (3.7)	63.9	15.5	88.1
9	15.8	26.5 (3.7)	65.7	17.3	85.9
10	10.6	26.6 (3.9)	66.0	19.1	83.4
11	7.7	26.8 (4.2)	66.7	22.1	80.9
12	2.8	27.0 (4.0)	69.2	22.6	80.0
>12	1.2	26.8 (3.8)	68.1	20.2	81.7

Figure 2. Effect of BMI genetic score on BMI in the EPIC Norfolk cohort and linear regression of genetic BMI risk score versus BMI.

(A) Violin plot displaying 1,000 synthetic BMI datapoints per group of BMI genetic risk score, as developed by Willer et al., using population mean and SD values from the EPIC Norfolk cohort. Significant variability is seen across the entire range of genetic scores, with more than 50% of individuals being overweight regardless of genotype. (B) Linear regression of 1,000 synthetic BMI datapoints per group of BMI genetic risk score, as developed by Willer et al. There was a significant association between risk score and BMI (p<0.001, R²=0.018), with around 2% of the variability in BMI determined by the eight SNPs most significantly associated with BMI.

MTNR1B rs10830963 (C:G) and fasting blood glucose

Of the common SNPs associated with increased blood sugar, rs10830963 (C:G) has one of the largest effect sizes, with each G copy associated with around a 1.3 mg/dl increase in fasting blood glucose¹². Data from the deCODE cohort (n=6,240) were used as a graphical example (Figure 3A)¹². Mean (SD) fasting blood glucose across the three genotypes was 95.2 (12.8) mg/dl, 97.0 (12.8) mg/dl, and 97.9 (12.8) mg/dl for CC, CG, and GG respectively. The likelihood of null effect was 94.4% in those with the CG genotype, and 91.6% in those with the GG genotype. Linear regression found a significant association between number of G copies and fasting blood glucose (p<0.001, R²=0.01), with around 1% of the variability in blood glucose being determined by MTNR1B rs10830963 genotype (Figure 3B).

Figure 3. Effect of MTNR1B rs10830963 genotype on fasting glucose in the deCODE cohort and linear regression of MTNR1B rs10830963 genotype versus fasting glucose.

(A) Violin plot displaying 1,000 synthetic glucose datapoints per MTNR1B rs10830963 genotype, based on published population mean and SD values from the deCODE cohort. Percent overlap between the CG and GG normal distributions with that of the “wild type” (CC) genotype are displayed as a measure of the likelihood of these risk genotypes having no overall effect on fasting blood glucose. (B) Linear regression of 1,000 synthetic glucose datapoints per MTNR1B rs10830963 G allele copy. There was a significant association between number of G copies and fasting glucose (p<0.001, R²=0.011), suggesting that only around 1% of the variability in fasting blood glucose is determined by MTNR1B genotype.

Genetic type 2 diabetes risk score

Similar to the approach of Willer et al., Dupuis et al. published a genetic risk score for elevated fasting blood glucose and risk of type 2 diabetes⁴, including MTNR1B and 15 other loci. This score was applied to the Framingham cohort, where the top 3.1% of people (risk score >22) had an average fasting blood glucose ~6 mg/dl greater than those in the bottom 4.2% (risk score <13). Similar to the obesity risk score, significant heterogeneity in blood glucose levels was seen across the range of scores (Figure 4A). The likelihood of null effect in the most common genetic score (score of 18, 14.3% of participants) was 84.5% (Table 2). In those with the highest genetic risk scores (scores 21, 22, and >22), the risk of prediabetic blood glucose levels (>100 mg/dl) was double that of those in the lowest risk group. However, even in these groups the likelihood of a given genetic score being associated with blood sugar outside of the distribution of those in the lowest risk group was only 25.5–27.7%, suggesting that fewer than 30% of people with the highest genetic risk of prediabetes experience that risk as a disease phenotype. Across the entire range of scores, linear regression found a significant association between risk score and fasting glucose (p<0.001, R²=0.049), suggesting that around 5% of the variability in fasting glucose is determined by the 16 SNPs most significantly associated with type 2 diabetes risk (Figure 4B). By comparison to the Framingham cohort, where mean (SD) fasting blood glucose was 92.5 (8.7) mg/dl in the lowest genetic risk group, free living hunter gatherers from Tukisenta and Kitava reportedly have fasting blood glucose of around 75 (8) and 65 (14) mg/dl, respectively (Figure 4C)¹³,¹⁴. Based on these data, the Tukisentans would have a 98.6% likelihood of having a blood sugar below the mean of those in the Framingham cohort with the lowest genetic risk score, with a 97.5% likelihood in the Kitavans, and normal distributions that display only 19.5% and 27.3% overlap with the lowest risk Framingham group. This translates to a 0.09% and 0.05% risk of prediabetic fasting blood glucose, respectively. Therefore, even in the lowest risk genetic group in the Framingham cohort, the risk of prediabetic fasting blood sugar levels (19.4%) is around 200–400 times higher than in hunter gatherer populations.

Table 2. Effect of glucose genetic score on risk of prediabetes.

Glucose genetic risk score, as developed by Dupuis et al.⁴, and risk of having prediabetes, using population mean and SD values from the Framingham cohort. Genetic scores of 16–19 cover around 52% of the population, and have around 30% prevalence of prediabetes. The likelihood of null effect of each score was determined as the percent overlap of its normal distribution with that of the lowest risk group (score <13). In those with the highest genetic risk scores (21, 22, and >22), the risk of prediabetic blood glucose levels (>100 mg/dl) was double that of the lowest risk group. However, even in these groups the likelihood of a given genetic score being associated with blood sugar outside of the distribution of those in the lowest risk group was only 25.5–27.7%, suggesting that fewer than 30% of people with the highest genetic risk of prediabetes experience that risk as a disease phenotype.

Genetic Glucose Score	Prevalence (%)	Mean (SD) fasting glucose (mg/dl)	Prediabetes (%)	Distribution overlap with score <13 (%)
<13	4.2	92.5 (8.7)	19.4
13	5.0	93.6 (8.8)	23.4	94.8
14	8.2	94.2 (8.6)	25.0	92.3
15	9.8	94.3 (8.8)	25.9	91.7
16	13.0	95.2 (8.9)	29.5	87.7
17	13.8	95.4 (8.7)	29.9	86.7
18	14.3	95.9 (8.9)	32.3	84.5
19	11.5	96.5 (8.7)	34.4	81.9
20	8.5	97.3 (8.8)	38.0	78.3
21	5.4	98.1 (8.6)	41.3	74.5
22	3.2	98.6 (8.7)	43.6	72.7
>22	3.1	98.6 (8.6)	43.5	72.3
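The prediabetes and overlap columns of Table 2 can be approximated directly from the mean and SD columns. The sketch below does this for three representative rows; the dictionary and function names are hypothetical, and small differences from the tabulated values are expected because the published figures are rounded.

```python
from statistics import NormalDist

PREDIABETES_MG_DL = 100.0
REFERENCE = NormalDist(92.5, 8.7)  # lowest risk group (score <13)

# Mean (SD) fasting glucose in mg/dl for three rows of Table 2.
ROWS = {"<13": (92.5, 8.7), "18": (95.9, 8.9), ">22": (98.6, 8.6)}


def summarise(mean, sd):
    """Return (prediabetes prevalence, overlap with the lowest risk group)."""
    group = NormalDist(mean, sd)
    prediabetes = 1.0 - group.cdf(PREDIABETES_MG_DL)
    return prediabetes, group.overlap(REFERENCE)


table = {score: summarise(mean, sd) for score, (mean, sd) in ROWS.items()}

for score, (prediabetes, overlap) in table.items():
    print(f"score {score}: prediabetes {prediabetes:.1%}, overlap {overlap:.1%}")
```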

Figure 4. Effect of glucose genetic score on fasting glucose in the Framingham cohort, linear regression of genetic glucose risk score versus fasting glucose, and comparison between fasting glucose in the Framingham cohort and in hunter gatherer populations.

(A) Violin plot displaying 1,000 synthetic glucose datapoints per group of glucose genetic risk score, as developed by Dupuis et al.⁴, using population mean and SD values from the Framingham cohort. Significant variability is seen across the entire range of genetic scores. Risk of prediabetes (fasting glucose >100 mg/dl) increases from 19.4% to 43.5% from the lowest to highest risk group, with the most common genetic risk profiles (scores of 16–19, ~50% of individuals) having around a 30% risk of prediabetes. (B) Linear regression of 1,000 synthetic glucose datapoints per group of glucose genetic risk score, as developed by Dupuis et al. There was a significant association between risk score and fasting blood glucose (p<0.001, R²=0.049), with around 5% of fasting glucose variability determined by the 16 SNPs most significantly associated with glucose homeostasis. (C) Violin plot displaying 1,000 synthetic glucose datapoints per group of glucose genetic risk score using population mean and SD values from the Framingham cohort, as well as using data from two hunter gatherer cohorts, the Tukisentans and Kitavans. The Tukisentans would have a 98.6% likelihood of having a blood sugar below the mean of those in the Framingham cohort with the best genetic score, with a 97.5% likelihood in the Kitavans. They display normal distributions with only 19.5% and 27.3% overlap with the lowest risk Framingham group and 0.09% and 0.05% risk of prediabetic fasting blood glucose, respectively. Even in the lowest risk genetic group in the Framingham cohort, the estimated prevalence of prediabetic fasting blood sugar levels (19.4%) is around 200–400 times higher than in hunter gatherer populations.

MTHFR rs1801131 (A:C) and rs1801133 (C:T) and homocysteine

Two common polymorphisms in the MTHFR gene, which alter in vitro enzyme activity and are associated with reduced capacity to produce 5-methyltetrahydrofolate, are frequently discussed in the popular and alternative health fields with regard to the methyl cycle and associated changes in detoxification and cellular repair pathways. In 1998, van der Put et al. described in vitro MTHFR activity of the most common combinations of alleles at rs1801131 and rs1801133, as well as homocysteine levels in the same participants⁶. In the five most common genotypes (excluding 1298AA/677TT), which together account for around 88% of the population on average, MTHFR function varies from 100% to 47.7% (Table 3). However, even in those with 47.7% function (1298AC/677CT) there is an 82.1% chance of null effect compared to 1298AA/677CC “wild type” with 100% function (Table 3). Across these common genotypes, MTHFR function only explains around 1% of the variability in homocysteine levels (p<0.001, R²=0.01; Figure 5A). The addition of 1298AA/677TT, which has around 12% prevalence in the population and is associated with a 75.2% loss of MTHFR function, increases the explained variance to 7% (Figure 5B); however, the synthetic dataset included 6.9% negative values due to the large SD in this population. This suggests significant heterogeneity of homocysteine in those with the 1298AA/677TT genotype, in whom homocysteine is not normally distributed. Indeed, though the likelihood of null effect on homocysteine levels compared to 1298AA/677CC was only 35% in those with 1298AA/677TT, the very large SD in the 1298AA/677TT group means that a large proportion of its homocysteine distribution lies below that of the “wild type”; 31.3% would be predicted to have homocysteine levels below the mean of 1298AA/677CC.
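The problem with assuming normality for 1298AA/677TT can be checked analytically. The sketch below takes the wild type as 12.9 (2.8) and assumes an SD of 12.5 for 1298AA/677TT; that SD is an assumption inferred from the Figure 5 caption, which describes the SD as 66% of the mean of 19. The variable names are hypothetical.

```python
from statistics import NormalDist

wild_type = NormalDist(12.9, 2.8)  # 1298AA/677CC homocysteine, mean (SD)
# Assumed SD: roughly 66% of the mean of 19, per the Figure 5 caption.
aa_tt = NormalDist(19.0, 12.5)     # 1298AA/677TT

summary = {
    # Percent overlap with the wild-type distribution (likelihood of null effect).
    "overlap": aa_tt.overlap(wild_type),
    # Share of 1298AA/677TT predicted to fall below the wild-type mean.
    "below_wild_type_mean": aa_tt.cdf(wild_type.mean),
    # Share of the fitted normal that lies below zero, which is impossible
    # for a real homocysteine measurement.
    "negative": aa_tt.cdf(0.0),
}

for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

The overlap and below-mean figures should land close to the reported 35% and 31.3%, and the expected negative fraction is a few percent, in the same range as the 6.9% observed in the finite synthetic sample.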

Table 3. Effect of common MTHFR SNPs on average function and homocysteine.

Average MTHFR function and homocysteine levels by rs1801131 (A1298C) and rs1801133 (C677T) SNP combination, as published by van der Put et al.⁶. Genotypes are listed in order of in vitro MTHFR enzyme function, with estimated population prevalence taken from Brown et al.¹⁵. The likelihood of null effect of each combination of MTHFR SNPs was determined as the percent overlap of its normal distribution with that of the “wild type” genotype (1298AA/677CC). Significant non-linearity between degree of MTHFR function and homocysteine is seen, with the most common genotypes, representing close to 88% of the population and 47.7–83.2% enzyme function, displaying 81–95% likelihood of null effect on resulting homocysteine levels.

Genotype (1298)	Genotype (677)	Estimated prevalence (%)	Relative function, mean (%)	Homocysteine, mean (SD)	Homocysteine distribution overlap with 1298AA/677CC (%)
AA	CC	15.3	100	12.9 (2.8)
AC	CC	20.8	83.2	13.6 (4.0)	81.5
AA	CT	22.8	66.8	12.8 (3.1)	94.9
CC	CC	8.8	61.1	13.9 (3.9)	81.0
AC	CT	19.8	47.7	14.2 (3.1)	82.1
AA	TT	12.2	24.8	19 (12.5)	35.0

Figure 5. Linear regressions of MTHFR activity versus homocysteine for the most common genotypes and MTHFR activity versus homocysteine including 1298AA/677TT.

(A) Linear regression of 1,000 synthetic homocysteine datapoints per combination of common rs1801131 (A1298C) and rs1801133 (C677T) SNPs by in vitro MTHFR activity, excluding 1298AA/677TT. There was a significant association between MTHFR function and homocysteine (p<0.001, R²=0.01), suggesting that only around 1% of the variability in homocysteine is determined by MTHFR activity across these genotypes. (B) Linear regression of 1,000 synthetic homocysteine datapoints per combination of rs1801131 (A1298C) and rs1801133 (C677T) SNPs by in vitro MTHFR activity. There was a significant association between MTHFR function and homocysteine (p<0.001, R²=0.07); however, the large SD (66% of the mean) in those with 1298AA/677TT resulted in 6.9% of predicted homocysteine levels being negative. This suggests that homocysteine in those with 1298AA/677TT is highly-variable, non-normally distributed, and that the effects of MTHFR activity on homocysteine levels are non-linear.

Discussion

The increasing prevalence of DTC genetic analyses is resulting in more and more healthcare providers being asked to interpret SNP-based disease risk by their patients, or attempting to incorporate these analyses into personalized treatment approaches. Here we demonstrate that, by using simple statistical theory and synthetic datasets generated based on published population phenotypic data from well-characterized SNPs, the likelihood of any given genotype resulting in a meaningful difference in phenotype is relatively small. For individual common SNPs determined to have large effect sizes, such as FTO rs9939609 on BMI and MTNR1B rs10830963 on fasting glucose, even those with two risk alleles have a less than 10% chance of displaying a difference in phenotype, owing to significant population variability. Additionally, baseline disease risks suggest that the vast majority of health outcomes associated with common SNPs are dominated by the environment.

The best-characterized SNP associated with risk of overweight and obesity is FTO rs9939609, with an average increase in BMI of 0.3 kg/m² per A allele¹⁰. However, an average population effect is less useful to an individual than the likelihood that they are going to be affected in the first place. For a single FTO A allele, this likelihood is around 3%, increasing to 7% in individuals with two A alleles, with 0.4% of overall BMI explained by FTO genotype. Though it may be the SNP most strongly associated with increases in BMI, the vast majority of individuals are unlikely to have their BMI meaningfully affected by their FTO genotype. Importantly, even this negligible effect of FTO on BMI is largely dominated by the environment, with recent analyses suggesting that FTO rs9939609 genotype was not associated with BMI in those born before 1942¹⁶. Similarly, analyses of both FTO rs9939609 and composite obesity genetic risk scores suggest that those who partake in regular movement or exercise (~1h of moderate-vigorous physical activity per day) have similar BMIs regardless of genetics^17,18. In the well-characterized EPIC-Norfolk cohort, the risk of being overweight was above 50% regardless of a genetic score consisting of the eight SNPs most tightly associated with BMI, again suggesting a significant environmental component. Considering that current Centers for Disease Control and Prevention data suggest that 39.8% of the adult population in the United States is obese, the degree to which common genetic SNPs contribute to BMI may be statistically significant but borderline physiologically irrelevant compared to the impact of the environment. For instance, using a small PRS for obesity by Willer et al.⁵, our analysis suggested that around 2% of BMI is determined by the eight SNPs most significantly associated with BMI. Khera et al. developed a 141-SNP PRS for obesity (that could not be analyzed here due to lack of reporting of group error/SD statistics), and even then this only explained 13.3% of phenotypic variability in bodyweight⁸.
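
As an illustration of how a per-genotype "likelihood of being affected" can be estimated from summary statistics alone, the sketch below computes the share of one normal distribution that does not overlap with another of equal SD (one minus the overlapping coefficient). This is one plausible reading of the approach and may differ from the code in the authors' repository. The per-allele effect of 0.3 kg/m² is taken from the text; the within-genotype SD of 4.0 kg/m² and the function names are assumptions made purely for illustration.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def non_overlap_fraction(mean_diff: float, sd: float) -> float:
    """Share of a normal distribution lying outside its overlap with a second
    normal distribution of equal SD whose mean differs by mean_diff.
    Equal to 1 minus the overlapping coefficient, 2 * Phi(-|diff| / (2 * sd))."""
    z = abs(mean_diff) / (2.0 * sd)
    return 1.0 - 2.0 * normal_cdf(-z)


PER_ALLELE_EFFECT = 0.3  # kg/m^2 per A allele, as quoted in the text
ASSUMED_SD = 4.0         # hypothetical within-genotype SD of BMI (not from the paper)

for copies in (1, 2):
    frac = non_overlap_fraction(copies * PER_ALLELE_EFFECT, ASSUMED_SD)
    print(f"{copies} A allele(s): {frac:.1%} of carriers fall outside the non-carrier overlap")
```

With these placeholder inputs the non-overlapping share comes out at about 3% for one allele and about 6% for two, the same order of magnitude as the figures quoted above. The result depends strongly on the assumed SD, so the published SD for the population of interest should be substituted where it is available.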

Similar results to those seen with genetic obesity risk were found when analyzing genetic risk of elevated fasting blood glucose and type 2 diabetes. Of the SNPs associated with increased fasting blood glucose, MTNR1B SNP rs10830963 (C:G) has one of the largest effect sizes, with each G copy associated with an increase in fasting blood glucose of around 1.3 mg/dl¹². In our analysis, only 5.6% of individuals with a single G copy would be expected to experience an increase in fasting blood glucose relative to those with the CC genotype, increasing to 8.2% in homozygotes. The genetic risk score developed by Dupuis et al. is more predictive, with more than a doubling of the risk of prediabetes in those with the highest genetic risk score compared to those with the lowest. However, linear regression analysis suggested that only around 5% of the variance in fasting blood glucose is explained by genetic risk. This is very similar to the proportion of explained variance that Dupuis et al. state in their original manuscript⁴, which provides some support for the use of synthetic datasets when variance and absolute numbers are not provided in the published literature. More important, however, is the way in which this information is placed into context for the consumer using DTC genetic analysis to assess disease risk. For instance, the variance in fasting blood glucose (~5%) attributed to the loci included in the genetic risk score is smaller than the variability in reproducibility of the commonly used handheld home glucometers that individuals with diabetes use to monitor blood glucose. Any effect of genetic risk is also largely a reflection of a slight amplification of the risk associated with the Western environment. Compared to hunter-gatherer populations^7,13,14, fasting glucose is around 25–30 mg/dl higher even in the lowest genetic risk group, and the risk of prediabetes is 200–400 times higher. Indeed, in a recent analysis of the Bolivian Tsimane, prevalence of type 2 diabetes was 0%¹⁹, on top of which any increase in genetic risk would be essentially meaningless. Therefore, the presence of any prediabetes appears to simply be a reflection of disease risk in the US as a whole, where more than 80% are thought to have suboptimal metabolic health, including more than 50% with fasting glucose >100 mg/dl²⁰. Based on multiple lines of evidence, close to 100% of the disease risk associated with elevated fasting blood glucose in the Western world can be attributed to the modern environment.
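
The "proportion of variance explained" referred to above can be illustrated with a small synthetic dataset and a linear regression of phenotype on allele count. The sketch below is not the authors' pipeline. The 1.3 mg/dl per-G-allele effect comes from the text, whereas the allele frequency, baseline mean, within-genotype SD, sample size, and the use of Hardy-Weinberg proportions to draw genotypes are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N = 100_000               # synthetic sample size (arbitrary)
G_ALLELE_FREQ = 0.30      # hypothetical G allele frequency
BASELINE_MEAN = 95.0      # hypothetical mean fasting glucose for CC, mg/dl
PER_ALLELE_EFFECT = 1.3   # mg/dl per G copy, as quoted in the text
WITHIN_SD = 9.0           # hypothetical within-genotype SD, mg/dl

# Draw 0, 1, or 2 G copies per person, assuming Hardy-Weinberg proportions.
g_copies = rng.binomial(2, G_ALLELE_FREQ, size=N)

# Draw each phenotype from a normal distribution centred on its genotype mean.
glucose = rng.normal(BASELINE_MEAN + PER_ALLELE_EFFECT * g_copies, WITHIN_SD)

# Simple linear regression of phenotype on allele count.
slope, intercept = np.polyfit(g_copies, glucose, deg=1)
r_squared = np.corrcoef(g_copies, glucose)[0, 1] ** 2

print(f"estimated per-allele effect: {slope:.2f} mg/dl")
print(f"variance explained (R^2): {r_squared:.2%}")
```

Under these assumptions the regression recovers a slope close to the simulated per-allele effect, while R² stays small because the between-genotype differences are dwarfed by the within-genotype spread.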

The concept of methylation capacity and its association with long-term health has recently gained a lot of interest in the alternative health community and popular press. As a result, DTC testing of common SNPs in MTHFR and related genes is being used to estimate an individual’s capacity to (re)generate methylfolate in order to guide assessment of disease risk or nutrient supplementation. One potential biomarker of methyl cycle function, including MTHFR activity, is homocysteine, which is associated with an increased risk of cardiovascular disease, dementia, and all-cause mortality when elevated^21–23. Though there are multiple pathways for the metabolism of homocysteine, one is dependent on methylfolate, and homocysteine levels are often used as a proxy for the status of the folate cycle. Importantly, SNPs resulting in decreased in vitro MTHFR function are common. The “wild type” genotype 677CC/1298AA associated with 100% MTHFR function is only found in around 15% of the population¹⁵, which makes some degree of reduced MTHFR function a more representative “normal” state. In addition to this, the degree of MTHFR function appears to be only loosely associated with homocysteine levels. For instance, only 1% of homocysteine was accounted for by the five rs1801131 (A1298C) and rs1801133 (C677T) combinations that encompass 47.7–100% mean MTHFR activity. This suggests significant redundancy in the system, such that genotype alone is unlikely to be able to inform any interventions. Additionally, homocysteine levels are more likely to be determined by factors not associated with direct enzyme function, as those with the 1298AC/677CC genotype have higher MTHFR activity than those with 1298AA/677CT (83.2% versus 66.8% relative enzyme function), but also have higher mean homocysteine levels (13.6 μmol/L versus 12.8 μmol/L)⁶. The non-linearity of the association between MTHFR and homocysteine levels is typified by carriers of the 1298AA/677TT genotype, who have around 75% loss of enzyme function and 50% higher mean homocysteine levels but, importantly, display a high degree of variability and values that do not appear to be normally distributed. Therefore, any specific recommendations to this group must be based on phenotypic measurements, including individual homocysteine levels and nutrient status. Indeed, though MTHFR is associated with the folate cycle, ensuring adequate B6 and B12 may be at least as important with respect to homocysteine levels²⁴. Homocysteine in 677TT carriers can also be significantly reduced with a small amount of supplemental riboflavin²⁵. This again suggests that phenotypic measurements and ensuring adequate environmental/nutrient status have a much greater impact than does knowledge of genotype. However, it must be cautioned that reducing homocysteine with nutritional supplements has not yet been shown to robustly improve health outcomes, though there may be a small reduction in stroke risk²⁶.

This study does have some limitations. The approach relies on both simulated and statistically ideal normal distributions based on published descriptive data rather than the data itself. However, where the methods could be tested against known data, such as the degree to which the glucose risk score explains glucose variability, the results were very similar to the original analyses. Importantly, if this approach fails to accurately recreate datasets similar to those in the published literature, then it is likely that those datasets were not normally distributed and the original analyses were therefore inappropriate. This is probably the case for homocysteine levels in individuals with the MTHFR 1298AA/677TT genotype based on the widely cited study by van der Put et al.⁶. Though all the SNPs analyzed here have low penetrance, they were specifically chosen because they are well characterized in multiple populations and commonly included in third-party DTC analyses of consumer genetic data. Though we have only highlighted a few SNPs, the techniques applied here could be used by any practitioner or interested individual to better understand their disease or outcome risk based on common genetic SNPs. Importantly, the methods described here are not intended to include a systematic exploration of the true association between common genetic polymorphisms and disease risk, but instead provide a tool that any individual can use to better understand genetic-based risk in the context of the heterogeneity of the population. Our analysis and suggestions also do not preclude the potential future utility and application of genetics in disease risk stratification when used in combination with clinical risk factors². For instance, those with the highest genetic risk for cardiovascular disease appear to be most likely to benefit from lipid-modifying therapies². Khera et al. also examined 50 risk SNPs for coronary disease in over 50,000 individuals, and found that those at the greatest genetic risk received the greatest risk reduction as a result of a healthy lifestyle²⁷. However, it is also worth mentioning that the same study showed that all groups benefited from the presence of healthy lifestyle factors regardless of genetic risk, again suggesting that an individual’s environment is the common factor driving the majority of baseline disease risk.
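
The normality caveat raised in the limitations above can be screened for before any synthetic data are generated. For a quantity that cannot be negative, such as a plasma concentration, a normal distribution with the reported mean and SD should place almost no probability below zero; if it places a noticeable share there, the underlying data were probably skewed. The sketch below is a rough heuristic rather than a formal normality test, and the mean and SD values are hypothetical placeholders, not figures from van der Put et al.

```python
import math


def fraction_below_zero(mean: float, sd: float) -> float:
    """Probability mass that a normal(mean, sd) distribution places below zero."""
    z = -mean / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical summary statistics (umol/L) for two genotype groups.
groups = {
    "narrow spread (hypothetical)": (13.0, 4.0),
    "wide spread (hypothetical)": (18.0, 12.0),
}

for label, (mean, sd) in groups.items():
    share = fraction_below_zero(mean, sd)
    flag = "normal assumption questionable" if share > 0.01 else "no obvious problem"
    print(f"{label}: {share:.2%} of a fitted normal lies below zero ({flag})")
```

The 1% cut-off used for the flag is arbitrary. A group that fails this screen would need its synthetic data drawn from a skewed distribution, or would need the individual-level data, rather than the normal approximation.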

Even though there is inherent error in our approach, it is clear that using population means to determine genetic risk and make recommendations based on genetics, as is very common in the DTC market, is likely to be highly flawed due to inherent phenotypic variability. This includes variability in risk based on common factors such as socioeconomic status and ethnicity. For instance, FTO genotypes are associated with increased BMI in Caucasians, but not in those of African origin¹⁰. For the risk of obesity, prediabetes, and type 2 diabetes in particular, the effect of the environment (diet, exercise, nutrient status) is likely to dominate the phenotype such that knowing about an individual’s risk-associated SNPs will have little benefit. A focus on genetic risk may indeed be detrimental because i) thinking that you have a risk SNP can have an effect on physiology regardless of whether you have that SNP²⁸, ii) the majority of people have average genetic risk for a given phenotype, iii) DTC genetic testing still includes significant variability and error²⁹, iv) there is little to no evidence that specific interventions for a given common SNP have any effect on health outcomes, v) communicating genetic risk does not appear to alter health behaviors³⁰, and vi) though statistically significant, the final effect of most SNPs on phenotype could often be considered physiologically irrelevant. These risks have generally been acknowledged by the scientific community performing genetic research², but the over-interpretation of risk by third parties relying on published population averages remains a significant worry, likely due to misinterpretation of the nature of the data.

Conclusions

Using simple statistical techniques, either with Python code or freely-available online tools, we have outlined a method by which healthcare providers and third-party genetic analysis tools can more accurately analyze genetic disease risk. Importantly, it is worth noting that the widely-characterized and cited SNPs for obesity, type 2 diabetes, and methylation status appear to have negligible overall effects on phenotype compared to the dominant effect of the environment.

Data availability

Code used to generate synthetic datasets: https://github.com/root-causing-health/SNPGaussianDistGenerator

Archived code and synthetic dataset as at time of publication: https://doi.org/10.5281/zenodo.3583439³¹

License: MIT

References

1. Guerrini CJ, Wagner JK, Nelson SC, et al.: Who's on third? Regulation of third-party genetic interpretation services. Genet Med. 2019.
2. Torkamani A, Wineinger NE, Topol EJ: The personal and clinical utility of polygenic risk scores. Nat Rev Genet. 2018; 19(9): 581–590.
3. Bush WS, Moore JH: Chapter 11: Genome-wide association studies. PLoS Comput Biol. 2012; 8(12): e1002822.
4. Dupuis J, Langenberg C, Prokopenko I, et al.: New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat Genet. 2010; 42(2): 105–116.
5. Willer CJ, Speliotes EK, Loos RJ, et al.: Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet. 2009; 41(1): 25–34.
6. van der Put NM, Gabreels F, Stevens EM, et al.: A second common mutation in the methylenetetrahydrofolate reductase gene: an additional risk factor for neural-tube defects? Am J Hum Genet. 1998; 62(5): 1044–1051.
7. Pontzer H, Wood BM, Raichlen DA: Hunter-gatherers as models in public health. Obes Rev. 2018; 19 Suppl 1: 24–35.
8. Khera AV, Chaffin M, Wade KH, et al.: Polygenic Prediction of Weight and Obesity Trajectories from Birth to Adulthood. Cell. 2019; 177(3): 587–596 e589.
9. Johnston D: Random Number Generators—Principles and Practices: A Guide for Engineers and Programmers. De | G Press. 2018.
10. Qi Q, Kilpelainen TO, Downer MK, et al.: FTO genetic variants, dietary intake and body mass index: insights from 177,330 individuals. Hum Mol Genet. 2014; 23(25): 6961–6972.
11. Kaakinen M, Laara E, Pouta A, et al.: Life-course analysis of a fat mass and obesity-associated (FTO) gene variant and body mass index in the Northern Finland Birth Cohort 1966 using structural equation modeling. Am J Epidemiol. 2010; 172(6): 653–665.
12. Prokopenko I, Langenberg C, Florez JC, et al.: Variants in MTNR1B influence fasting glucose levels. Nat Genet. 2009; 41(1): 77–81.
13. Lindeberg S, Eliasson M, Lindahl B, et al.: Low serum insulin in traditional Pacific Islanders--the Kitava Study. Metabolism. 1999; 48(10): 1216–1219.
14. Sinnett PF, Whyte HM: Epidemiological studies in a total highland population, Tukisenta, New Guinea. Cardiovascular disease and relevant clinical, electrocardiographic, radiological and biochemical findings. J Chronic Dis. 1973; 26(5): 265–290.
15. Brown NM, Pratt VM, Buller A, et al.: Detection of 677CT/1298AC "double variant" chromosomes: implications for interpretation of MTHFR genotyping results. Genet Med. 2005; 7(4): 278–282.
16. Rosenquist JN, Lehrer SF, O'Malley AJ, et al.: Cohort of birth modifies the association between FTO genotype and BMI. Proc Natl Acad Sci U S A. 2015; 112(2): 354–359.
17. Vimaleswaran KS, Li S, Zhao JH, et al.: Physical activity attenuates the body mass index-increasing influence of genetic variation in the FTO gene. Am J Clin Nutr. 2009; 90(2): 425–428.
18. Li S, Zhao JH, Luan J, et al.: Physical activity attenuates the genetic predisposition to obesity in 20,000 men and women from EPIC-Norfolk prospective population study. PLoS Med. 2010; 7(8): e1000332.
19. Kaplan H, Thompson RC, Trumble BC, et al.: Coronary atherosclerosis in indigenous South American Tsimane: a cross-sectional cohort study. Lancet. 2017; 389(10080): 1730–1739.
20. Araújo J, Cai J, Stevens J: Prevalence of Optimal Metabolic Health in American Adults: National Health and Nutrition Examination Survey 2009-2016. Metab Syndr Relat Disord. 2019; 17(1): 46–52.
21. Fan R, Zhang A, Zhong F: Association between Homocysteine Levels and All-cause Mortality: A Dose-Response Meta-Analysis of Prospective Studies. Sci Rep. 2017; 7(1): 4769.
22. Ganguly P, Alam SF: Role of homocysteine in the development of cardiovascular disease. Nutr J. 2015; 14: 6.
23. Smith AD, Refsum H, Bottiglieri T, et al.: Homocysteine and Dementia: An International Consensus Statement. J Alzheimers Dis. 2018; 62(2): 561–570.
24. Moll S, Varga EA: Homocysteine and MTHFR Mutations. Circulation. 2015; 132(1): e6–9.
25. McNulty H, Dowey le RC, Strain JJ, et al.: Riboflavin lowers homocysteine in individuals homozygous for the MTHFR 677C->T polymorphism. Circulation. 2006; 113(1): 74–80.
26. Marti-Carvajal AJ, Solà I, Lathyris D, et al.: Homocysteine-lowering interventions for preventing cardiovascular events. Cochrane Database Syst Rev. 2017; 8: CD006612.
27. Khera AV, Emdin CA, Drake I, et al.: Genetic Risk, Adherence to a Healthy Lifestyle, and Coronary Disease. N Engl J Med. 2016; 375(24): 2349–2358.
28. Turnwald BP, Goyer JP, Boles DZ, et al.: Learning one's genetic risk changes physiology independent of actual genetic risk. Nat Hum Behav. 2019; 3(1): 48–56.
29. Tandy-Connor S, Guiltinan J, Krempely K, et al.: False-positive results released by direct-to-consumer genetic tests highlight the importance of clinical confirmation testing for appropriate patient care. Genet Med. 2018; 20(12): 1515–1521.
30. Hollands GJ, French DP, Griffin SJ, et al.: The impact of communicating genetic risks of disease on risk-reducing health behaviour: systematic review with meta-analysis. BMJ. 2016; 352: i1102.
31. Owens N: root-causing-health/SNPGaussianDistGenerator: F1000 Publication Verison (Version 1.0). Zenodo. 2019. http://www.doi.org/10.5281/zenodo.3583439

Competing interests

N.O. declares that they have no competing interests. T.R.W. is a director of the British Society of Lifestyle Medicine (Registered Charity: SCIO SC046920), and is the founder of an as-yet unincorporated digital health group focused on lifestyle-based interventions.

Grant information

T.R.W. is supported by start-up funds from the University of Washington Department of Pediatrics.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article Versions (1)

version 1

Published: 30 Dec 2019, 8:2147

https://doi.org/10.12688/f1000research.21797.1

Copyright

© 2019 Wood TR and Owens N. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

how to cite this article

Wood TR and Owens N. Using synthetic datasets to bridge the gap between the promise and reality of basing health-related decisions on common single nucleotide polymorphisms [version 1; peer review: 1 approved, 1 approved with reservations] F1000Research 2019, 8:2147 (https://doi.org/10.12688/f1000research.21797.1)

NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 22 Feb 2021

Oliver Pain, Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK

Approved

https://doi.org/10.5256/f1000research.24028.r78395

This study illustrates and discusses several issues with the interpretability of genotype-based risk scores as they are often reported in relative terms, as opposed to absolute terms, and the model used to derive the risk scores may have been derived within populations that are not representative of the target individual. They illustrate these issues by simulating individual-level data based on reported summary statistics, and then comparing the rate of disease/or trait mean across genetic risk score categories. The authors report the proportion of the target sample in the highest genetic risk category showing a ‘null effect’, meaning the distribution of phenotype overlaps with the phenotype of individuals in the lowest genetic risk category. In a range of settings, the authors show that the absolute difference in risk across genetic risk categories is low, and therefore the approach used by many DTCs is misleading and potentially dangerous. The authors also make the important point that application of genetic findings to different populations may also lead to highly misleading results due to differences in the distribution of the phenotype and potentially large differences in the environmental contribution to the phenotype.

I found the study very interesting and I think it nicely illustrates that genetic risk should be converted to absolute risk estimates before interpretation, and that models should not be applied to individuals that are not represented by the training sample.

Specific comments:

Discussion: “Additionally, baseline disease risks suggest that the vast majority of health outcomes associated with common SNPs are dominated by the environment.”
- I think this sentence should be reworded to reflect that variance explained by current genetic risk scores is substantially lower than that explained by unmeasured factors. I say this because it currently reads as though everything the genetic risk score cannot explain is due to environmental factors, but this unexplained variance is partly due to current genetic risk scores being unable to explain the full heritability of the outcome.
Discussion: “Our analysis and suggestions also do not preclude the potential future utility and application of genetics in disease risk stratification when used in combination with clinical risk factors”.
- I was relieved to read this sentence in the discussion. Whilst I appreciate that genetics is often mis-sold as being a powerful predictor, I felt this study generally did not reflect the useful contribution to risk prediction that genetic risk scores could provide as their variance explained increases, but more importantly, as they are used in combination with non-genetic risk factors to improve risk prediction. This study spent a lot of time saying why genetics alone is currently a bad predictor, but only very briefly discussed the contribution genetic risk scores could make.
Discussion, limitation 2: “ii) the majority of people have average genetic risk for a given phenotype”
- I found this limitation confusing. The authors state that because the majority of people have the average genetic risk, focusing on genetic risk estimates may be detrimental. I do not see how this is a limitation as it merely reflects the normal distribution of genetic risk scores.
Typo: “with more than a doubling of risk of prediabetes in those with the highest genetic frisk score compared”. ‘frisk’ needs to be changed to ‘risk’.

Is the work clearly and accurately presented and does it cite the current literature?

Yes
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Yes
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Genetic epidemiology; Psychiatric genetics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report 12 Feb 2020

Yesim Aydin Son, Graduate School of Informatics, Department of Health Informatics, Middle East Technical University, Ankara, Turkey

Approved with Reservations

https://doi.org/10.5256/f1000research.24028.r58756

Interpretation of genomic variants in the clinic for the diagnosis of genetic diseases is a current challenge of bioinformatics and medical genomics. The evaluation of the performance of molecular diagnostics will be beneficial in practice. Even though the study addresses a timely problem, the authors did not fully present the current research in the area. Review of state of the art and discussions in the literature is missing.

Variant interpretation for single-gene diseases and complex genetic diseases requires different methodologies, and thus presents different challenges. In this study, obesity, a complex genetic phenotype, was selected as the study case. However, the analysis only focuses on a select few SNPs, as if these phenotypes presented a single-gene or multigenic inheritance.

GWAS is the fundamental analysis technique for complex genetic phenotypes, allowing genotyping of millions of variants from individual participants. As the authors also concluded, descriptive statistics have limitations in the analysis of complex genetic diseases and in identifying associated SNP profiles in post-GWAS research. In the last ten years, post-GWAS analyses based on data mining techniques have been under investigation, and there is an expanding literature on using data mining approaches. When a wide set of variants (SNP profiles) is selected as features in predictive studies, models based only on genomic variants can outperform phenotype-based predictions or hybrid models combining genetic, clinical, and environmental factors. In light of this information, designing a study based on basic statistical approaches is a major limitation of the study.

An additional concern is the random generation of synthetic individual genotypes. The authors do not account for the linkage disequilibrium between SNPs, or for the population frequencies of individual SNPs, while randomly generating the synthetic genotyping data.

Expanding the discussion on why descriptive statistics fail for complex genetic diseases, such as obesity and type-II diabetes, and need to study risk profiles rather than single individual risk SNPs will be much more beneficial to the community.

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Partly
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Partly

References

1. Dias R, Torkamani A: Artificial intelligence in clinical and genomic diagnostics. Genome Medicine. 2019; 11 (1). Publisher Full Text
2. Gül H, Aydin Son Y, Açikel C: Discovering missing heritability and early risk prediction for type 2 diabetes: a new perspective for genome-wide association study analysis with the Nurses' Health Study and the Health Professionals' Follow-Up Study.Turk J Med Sci. 2014; 44 (6): 946-54 PubMed Abstract | Publisher Full Text
3. Yücebaş SC, Aydın Son Y: A prostate cancer model build by a novel SVM-ID3 hybrid feature selection method using both genotyping and phenotype data from dbGaP.PLoS One. 2014; 9 (3): e91404 PubMed Abstract | Publisher Full Text

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Health Informatics, Clinical Bioinformatics, Genomic Medicine, GWAS, Molecular Diagnostics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

CITE

Report a concern

Respond or Comment

Comments on this article Comments (0)

Version 1

VERSION 1 PUBLISHED 30 Dec 2019

Open Peer Review

Reviewer Status

Reviewer Reports

	Invited Reviewers
	1	2
Version 1 30 Dec 19	read	read

Yesim Aydin Son, Middle East Technical University, Ankara, Turkey
Oliver Pain, King's College London, London, UK

Comments on this article

All Comments(0)

Add a comment

Sign up for content alerts

Browse by related subjects

Back to all reports

Reviewer Report

6 Views

22 Feb 2021 | for Version 1

Oliver Pain, Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK

6 Views Cite this report Responses(0)

Approved

This study illustrates and discusses several issues with the interpretability of genotype-based risk scores as they are often reported in relative terms, as opposed to absolute terms, and the model used to derive the risk scores may have been derived within populations that are not representative of the target individual. They illustrate these issues by simulating individual-level data based on reported summary statistics, and then comparing the rate of disease/or trait mean across genetic risk score categories. The authors report the proportion of the target sample in the highest genetic risk category showing a ‘null effect’, meaning the distribution of phenotype overlaps with the phenotype of individuals in the lowest genetic risk category. In a range of settings, the authors show that the absolute difference in risk across genetic risk categories is low, and therefore the approach used by many DTCs is misleading and potentially dangerous. The authors also make the important point that application of genetic findings to different populations may also lead to highly misleading results due to differences in the distribution of the phenotype and potentially large differences in the environmental contribution to the phenotype.

I found study very interesting and I think nicely illustrates that genetic risk should be converted to absolute risk estimates before interpretation, and that models should not be applied to individuals that are not represented by the training sample.

Specific comments:

Discussion: “Additionally, baseline disease risks suggest that the vast majority of health outcomes associated with common SNPs are dominated by the environment.”
- I think this sentence should be reworded to reflect that the variance explained by current genetic risk scores is substantially lower than that attributable to unmeasured factors. As written, it reads as though everything the genetic risk score cannot explain is due to environmental factors, whereas part of this unexplained variance arises because current genetic risk scores cannot capture the full heritability of the outcome.
Discussion: “Our analysis and suggestions also do not preclude the potential future utility and application of genetics in disease risk stratification when used in combination with clinical risk factors”.
- I was relieved to read this sentence in the discussion. Whilst I appreciate that genetics is often mis-sold as being a powerful predictor, I felt this study generally did not reflect the useful contribution to risk prediction that genetic risk scores could provide as their variance explained increases and, more importantly, as they are used in combination with non-genetic risk factors to improve risk prediction. This study spent a lot of time explaining why genetics alone is currently a bad predictor, but only very briefly discussed the contribution that genetic risk scores could make.
Discussion, limitation 2: “ii) the majority of people have average genetic risk for a given phenotype”
- I found this limitation confusing. The authors state that because the majority of people have average genetic risk, focusing on genetic risk estimates may be detrimental. I do not see how this is a limitation, as it merely reflects the normal distribution of genetic risk scores.
Typo: “with more than a doubling of risk of prediabetes in those with the highest genetic frisk score compared”. ‘frisk’ needs to be changed to ‘risk’.
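The comment on limitation ii rests on a standard statistical point: a risk score summed over many SNPs is approximately normally distributed, so most individuals fall close to the average. The sketch below illustrates this under simplifying assumptions that are not drawn from the article: an unweighted score, 30 independent SNPs, and a common risk-allele frequency of 0.3.

```python
import random
import statistics


def simulate_scores(allele_freqs, n_individuals, rng):
    """Unweighted risk score: count of risk alleles across independent SNPs."""
    scores = []
    for _ in range(n_individuals):
        score = 0
        for p in allele_freqs:
            # Two independent allele draws per SNP (Hardy-Weinberg assumption).
            score += (rng.random() < p) + (rng.random() < p)
        scores.append(score)
    return scores


rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical panel: 30 SNPs, each with risk-allele frequency 0.3.
freqs = [0.3] * 30
scores = simulate_scores(freqs, 5_000, rng)

mean = statistics.fmean(scores)
sd = statistics.pstdev(scores)
within_one_sd = sum(abs(s - mean) <= sd for s in scores) / len(scores)

print(f"mean score: {mean:.2f}, SD: {sd:.2f}")
print(f"share of individuals within one SD of the mean: {within_one_sd:.2f}")
```

With these assumptions roughly two thirds of simulated individuals lie within one standard deviation of the mean score, as expected for an approximately normal distribution.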

Is the work clearly and accurately presented and does it cite the current literature?

Yes
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Yes
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Genetic epidemiology; Psychiatric genetics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.


Reviewer Report

12 Feb 2020 | for Version 1

Yesim Aydin Son, Graduate School of Informatics, Department of Health Informatics, Middle East Technical University, Ankara, Turkey

Approved With Reservations

Interpretation of genomic variants in the clinic for the diagnosis of genetic diseases is a current challenge in bioinformatics and medical genomics. Evaluating the performance of molecular diagnostics will be beneficial in practice. Although the study addresses a timely problem, the authors did not fully present the current research in the area. A review of the state of the art and of the discussions in the literature is missing.

Variant interpretation for single-gene diseases and for complex genetic diseases requires different methodologies and thus presents different challenges. In this study, obesity, a complex genetic phenotype, is selected as the study case. However, the analysis focuses on only a few selected SNPs, as if these phenotypes showed single-gene or multigenic inheritance.

GWAS is the fundamental analysis technique for complex genetic phenotypes, allowing the genotyping of millions of variants from individual participants. As the authors also conclude, descriptive statistics have limitations in the analysis of complex genetic diseases and in identifying associated SNP profiles in post-GWAS research. Over the last ten years, post-GWAS analyses based on data mining techniques have been under investigation, and there is an expanding literature on data mining approaches. When a wide set of variants (SNP profiles) is selected as features in predictive studies, models based only on genomic variants can outperform phenotype-based predictions or hybrid models combining genetic, clinical, and environmental factors. In light of this, designing a study based on basic statistical approaches is a major limitation.

An additional concern is the random generation of synthetic individual genotypes. The authors do not account for linkage disequilibrium between SNPs, or for the population frequencies of individual SNPs, when randomly generating the synthetic genotype data.
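As an illustration of what accounting for these two factors could look like, here is a minimal sketch that samples two-SNP genotypes from haplotype frequencies defined by the allele frequencies and the linkage disequilibrium coefficient D. It is one textbook way of doing this, not the method used in the article, and the allele frequencies and D value are hypothetical.

```python
import random


def haplotype_frequencies(p_a, p_b, d):
    """Frequencies of haplotypes AB, Ab, aB, ab for two biallelic SNPs.

    p_a and p_b are the frequencies of alleles A and B; d is the linkage
    disequilibrium coefficient D.
    """
    freqs = {
        "AB": p_a * p_b + d,
        "Ab": p_a * (1 - p_b) - d,
        "aB": (1 - p_a) * p_b - d,
        "ab": (1 - p_a) * (1 - p_b) + d,
    }
    if min(freqs.values()) < 0:
        raise ValueError("D is outside the range allowed by the allele frequencies")
    return freqs


def simulate_genotypes(p_a, p_b, d, n, rng):
    """Return n (count_A, count_B) pairs, each built from two sampled haplotypes."""
    freqs = haplotype_frequencies(p_a, p_b, d)
    haps = rng.choices(list(freqs), weights=list(freqs.values()), k=2 * n)
    return [
        (h1.count("A") + h2.count("A"), h1.count("B") + h2.count("B"))
        for h1, h2 in zip(haps[0::2], haps[1::2])
    ]


rng = random.Random(2)  # fixed seed for reproducibility

# Hypothetical inputs: allele frequencies 0.4 and 0.3, with D = 0.1.
genotypes = simulate_genotypes(p_a=0.4, p_b=0.3, d=0.1, n=20_000, rng=rng)

n = len(genotypes)
freq_a = sum(a for a, _ in genotypes) / (2 * n)
freq_b = sum(b for _, b in genotypes) / (2 * n)
mean_a = sum(a for a, _ in genotypes) / n
mean_b = sum(b for _, b in genotypes) / n
covariance = sum((a - mean_a) * (b - mean_b) for a, b in genotypes) / n

print(f"observed allele frequencies: A={freq_a:.3f}, B={freq_b:.3f}")
print(f"covariance of allele counts: {covariance:.3f}")
```

Setting `d=0` recovers independent SNPs under Hardy-Weinberg proportions, so the same function covers both the independent and the correlated case.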

Expanding the discussion of why descriptive statistics fail for complex genetic diseases, such as obesity and type 2 diabetes, and of the need to study risk profiles rather than individual risk SNPs, would be much more beneficial to the community.

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Partly
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Partly

References

1. Dias R, Torkamani A: Artificial intelligence in clinical and genomic diagnostics. Genome Medicine. 2019; 11(1).
2. Gül H, Aydin Son Y, Açikel C: Discovering missing heritability and early risk prediction for type 2 diabetes: a new perspective for genome-wide association study analysis with the Nurses' Health Study and the Health Professionals' Follow-Up Study. Turk J Med Sci. 2014; 44(6): 946–954.
3. Yücebaş SC, Aydın Son Y: A prostate cancer model build by a novel SVM-ID3 hybrid feature selection method using both genotyping and phenotype data from dbGaP. PLoS One. 2014; 9(3): e91404.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Health Informatics, Clinical Bioinformatics, Genomic Medicine, GWAS, Molecular Diagnostics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

[1] Guerrini CJ, Wagner JK, Nelson SC, et al.: Who's on third? Regulation of third-party genetic interpretation services. Genet Med. 2019.

[2] Torkamani A, Wineinger NE, Topol EJ: The personal and clinical utility of polygenic risk scores. Nat Rev Genet. 2018; 19(9): 581–590.

[3] Bush WS, Moore JH: Chapter 11: Genome-wide association studies. PLoS Comput Biol. 2012; 8(12): e1002822.

[4] Dupuis J, Langenberg C, Prokopenko I, et al.: New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat Genet. 2010; 42(2): 105–116.

[5] Willer CJ, Speliotes EK, Loos RJ, et al.: Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet. 2009; 41(1): 25–34.

[6] van der Put NM, Gabreels F, Stevens EM, et al.: A second common mutation in the methylenetetrahydrofolate reductase gene: an additional risk factor for neural-tube defects? Am J Hum Genet. 1998; 62(5): 1044–1051.

[7] Pontzer H, Wood BM, Raichlen DA: Hunter-gatherers as models in public health. Obes Rev. 2018; 19 Suppl 1: 24–35.

[8] Khera AV, Chaffin M, Wade KH, et al.: Polygenic Prediction of Weight and Obesity Trajectories from Birth to Adulthood. Cell. 2019; 177(3): 587–596 e589.

[9] Johnston D: Random Number Generators—Principles and Practices: A Guide for Engineers and Programmers. De | G Press. 2018.

[10] Qi Q, Kilpelainen TO, Downer MK, et al.: FTO genetic variants, dietary intake and body mass index: insights from 177,330 individuals. Hum Mol Genet. 2014; 23(25): 6961–6972.

[11] Kaakinen M, Laara E, Pouta A, et al.: Life-course analysis of a fat mass and obesity-associated (FTO) gene variant and body mass index in the Northern Finland Birth Cohort 1966 using structural equation modeling. Am J Epidemiol. 2010; 172(6): 653–665.

[12] Prokopenko I, Langenberg C, Florez JC, et al.: Variants in MTNR1B influence fasting glucose levels. Nat Genet. 2009; 41(1): 77–81.

[13] Lindeberg S, Eliasson M, Lindahl B, et al.: Low serum insulin in traditional Pacific Islanders--the Kitava Study. Metabolism. 1999; 48(10): 1216–1219.

[14] Sinnett PF, Whyte HM: Epidemiological studies in a total highland population, Tukisenta, New Guinea. Cardiovascular disease and relevant clinical, electrocardiographic, radiological and biochemical findings. J Chronic Dis. 1973; 26(5): 265–290.

[15] Brown NM, Pratt VM, Buller A, et al.: Detection of 677CT/1298AC "double variant" chromosomes: implications for interpretation of MTHFR genotyping results. Genet Med. 2005; 7(4): 278–282.

[16] Rosenquist JN, Lehrer SF, O'Malley AJ, et al.: Cohort of birth modifies the association between FTO genotype and BMI. Proc Natl Acad Sci U S A. 2015; 112(2): 354–359.

[17] Vimaleswaran KS, Li S, Zhao JH, et al.: Physical activity attenuates the body mass index-increasing influence of genetic variation in the FTO gene. Am J Clin Nutr. 2009; 90(2): 425–428.

[18] Li S, Zhao JH, Luan J, et al.: Physical activity attenuates the genetic predisposition to obesity in 20,000 men and women from EPIC-Norfolk prospective population study. PLoS Med. 2010; 7(8): e1000332.

[19] Kaplan H, Thompson RC, Trumble BC, et al.: Coronary atherosclerosis in indigenous South American Tsimane: a cross-sectional cohort study. Lancet. 2017; 389(10080): 1730–1739.

[20] Araújo J, Cai J, Stevens J: Prevalence of Optimal Metabolic Health in American Adults: National Health and Nutrition Examination Survey 2009-2016. Metab Syndr Relat Disord. 2019; 17(1): 46–52.

[21] Fan R, Zhang A, Zhong F: Association between Homocysteine Levels and All-cause Mortality: A Dose-Response Meta-Analysis of Prospective Studies. Sci Rep. 2017; 7(1): 4769.

[22] Ganguly P, Alam SF: Role of homocysteine in the development of cardiovascular disease. Nutr J. 2015; 14: 6.

[23] Smith AD, Refsum H, Bottiglieri T, et al.: Homocysteine and Dementia: An International Consensus Statement. J Alzheimers Dis. 2018; 62(2): 561–570.

[24] Moll S, Varga EA: Homocysteine and MTHFR Mutations. Circulation. 2015; 132(1): e6–9.

[25] McNulty H, Dowey le RC, Strain JJ, et al.: Riboflavin lowers homocysteine in individuals homozygous for the MTHFR 677C->T polymorphism. Circulation. 2006; 113(1): 74–80.

[26] Marti-Carvajal AJ, Solà I, Lathyris D, et al.: Homocysteine-lowering interventions for preventing cardiovascular events. Cochrane Database Syst Rev. 2017; 8: CD006612.

[27] Khera AV, Emdin CA, Drake I, et al.: Genetic Risk, Adherence to a Healthy Lifestyle, and Coronary Disease. N Engl J Med. 2016; 375(24): 2349–2358.

[28] Turnwald BP, Goyer JP, Boles DZ, et al.: Learning one's genetic risk changes physiology independent of actual genetic risk. Nat Hum Behav. 2019; 3(1): 48–56.

[29] Tandy-Connor S, Guiltinan J, Krempely K, et al.: False-positive results released by direct-to-consumer genetic tests highlight the importance of clinical confirmation testing for appropriate patient care. Genet Med. 2018; 20(12): 1515–1521.

[30] Hollands GJ, French DP, Griffin SJ, et al.: The impact of communicating genetic risks of disease on risk-reducing health behaviour: systematic review with meta-analysis. BMJ. 2016; 352: i1102.

[31] Owens N: root-causing-health/SNPGaussianDistGenerator: F1000 Publication Verison (Version 1.0). Zenodo. 2019. http://www.doi.org/10.5281/zenodo.3583439
