Evaluating the Potential of Younger Cases and Older Controls Cohorts to Improve Discovery Power in Genome-Wide Association Studies of Late-Onset Diseases

For more than a decade, genome-wide association studies (GWASs) have been making steady progress in discovering the causal gene variants that contribute to late-onset human diseases (LODs). Polygenic late-onset diseases in an aging population display a risk allele frequency decrease at older ages, caused by individuals with higher polygenic risk scores becoming ill proportionately earlier and bringing about a change in the distribution of risk alleles between new cases and the as-yet-unaffected population. This phenomenon is most prominent for diseases characterized by high cumulative incidence and high heritability, examples of which include Alzheimer’s disease, coronary artery disease, cerebral stroke, and type 2 diabetes, while for late-onset diseases with relatively lower prevalence and heritability, exemplified by cancers, the effect is considerably smaller. In this research, computer simulations demonstrate that genome-wide association studies of late-onset polygenic diseases showing high cumulative incidence together with high initial heritability will benefit from using the youngest possible age-matched cohorts. Moreover, rather than using age-matched cohorts, study cohorts combining the youngest possible cases with the oldest possible controls may significantly improve the discovery power of genome-wide association studies.
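As a rough illustration of why the allele frequency gap between cases and controls drives discovery power, the following Python sketch estimates the number of cases needed for 80% power using a standard two-proportion normal approximation on allele counts. It is not the simulation pipeline used in this research. The function names, the genome-wide significance threshold of 5e-8, the 1:1 case-control ratio, and the shifted frequencies in the usage lines are all illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Risk allele frequency in cases implied by the control frequency and allelic OR."""
    return odds_ratio * control_freq / (1.0 + control_freq * (odds_ratio - 1.0))


def cases_needed(case_freq: float, control_freq: float, power: float = 0.8,
                 alpha: float = 5e-8, controls_per_case: float = 1.0) -> int:
    """Approximate number of cases for a two-sided two-proportion z-test on alleles.

    Assumptions (not taken from the source): unpooled-variance normal
    approximation, two alleles per individual, genome-wide alpha of 5e-8.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = std_normal.inv_cdf(power)
    variance = (case_freq * (1.0 - case_freq)
                + control_freq * (1.0 - control_freq) / controls_per_case)
    case_alleles = (z_alpha + z_power) ** 2 * variance / (case_freq - control_freq) ** 2
    return ceil(case_alleles / 2.0)  # two alleles per case


# Illustrative allele: control MAF = 0.286, OR = 1.15.
young_case_freq = case_allele_freq(0.286, 1.15)

# Hypothetical frequencies chosen only to show the direction of the effect:
# older cases carry fewer risk alleles, and so do older controls.
age_matched_old = cases_needed(case_freq=0.310, control_freq=0.284)
young_vs_old = cases_needed(case_freq=young_case_freq, control_freq=0.284)
print(age_matched_old, young_vs_old)
```

Under these assumed frequencies, the pairing of young cases with older controls has the wider allele frequency gap and therefore needs fewer cases than the older age-matched pairing.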

The SD band spans one standard deviation above and below the mean polygenic risk of cases and of the unaffected population of the same age. For highly prevalent LODs, at very old age, the mean polygenic risk of new cases falls below the risk of an average healthy person at the early-onset age.

Common, low-effect-size alleles; all plots show an allele with MAF = 0.286 and OR = 1.15. The change in the absolute magnitude of each allele frequency with age progression is relatively small. A GWAS's discovery power is a function of the difference in allele frequency between cases and controls, and the change in this difference is easy to estimate visually. In the age-matched scenario, the difference is taken between points on the red and green lines at the same mid-cohort age. In the youngest cases-older controls scenario, the difference is always taken between the leftmost point on the red line and progressively older controls on the green line.
From Supplemental Information in Oliynyk

Figure S4. Advantage of using youngest possible cases and increasingly older controls compared to classical age-matched cohorts.
(A) Relative increase in the number of cases needed for 80% discovery power in a cohort study using progressively older case and control cohorts of the same age. (B) Relative decrease in the number of cases needed for 80% discovery power in a cohort study using progressively older control cohorts compared to fixed-age young-case cohorts. The youngest age cohort for each LOD is defined as the mid-cohort age at which the cumulative incidence first reaches 0.25% of the population. The leftmost point on each LOD line is therefore the reference (youngest) cohort, and for older cohorts the multiple of cases required to achieve 0.8 statistical power is expressed relative to this earliest cohort. Although the absolute number of cases needed to achieve the required statistical power differs between alleles, the change in the multiplier with age is almost identical for all alleles within a given genetic architecture scenario.

Figure S5. Multiple of the decline in the number of cases needed for 0.8 discovery power in a cohort study using progressively older control cohorts compared to a fixed-age young-cases cohort.
The cases' mid-cohort age is the leftmost age (youngest plot point); the control mid-cohort ages are the incremental ages. The number of cases needed for 0.8 discovery power is smaller when older controls are used, particularly for LODs with the highest heritability and incidence. Common, low-effect-size alleles. A sample of nine out of 25 SNPs; MAF = minor (risk) allele frequency; OR = risk odds ratio.

Figure S6. GWAS association simulations: OR bias progression with control cohort age increasing against the constant youngest possible case cohort.
Common, low-effect-size alleles, showing two SNPs for each LOD: those with the largest and the smallest effect. The OR increase (bias) with mid-cohort age progression implies a power-law dependence on $\Delta\mathrm{age}$ relative to the age-matched youngest cohort. The confidence intervals are omitted from this plot for clarity; they are displayed in Figure S7, which shows the same data in effect size units.

Common, low-effect-size alleles, showing two SNPs for each LOD: those with the largest and the smallest effect. The confidence interval bars correspond to two-sigma (95%) confidence from the GWASs' logistic regression association. The OR increase with mid-cohort age progression implies a power law relative to $\Delta\mathrm{age}$. This plot implies that the LOD SNP age bias and the corresponding adjustment value are proportionate to the SNP effect size. For reference, Figure S9 shows absolute effect size progression.

Figure S8. GWAS association simulations: characterizing the age bias adjustment maintaining "true" OR with control cohort age progression (quadratic: $\Delta T^2$).
Common, low-effect-size alleles, showing two SNPs for each LOD: those with the largest and the smallest effect. The confidence interval bars correspond to two-sigma (95%) confidence based on the standard error of the linear regression fit. This plot depicts the adjustment, proportionate to the square of $\Delta t = t - T_Y$ (age relative to the mid-cohort age of the youngest cohort), for the normalized bias of the effect size $\beta$, calculated as $\Delta\beta/\beta$, as described in the main article.

Common, low-effect-size alleles, showing two SNPs for each LOD: those with the largest and the smallest effect. The confidence interval bars correspond to two-sigma (95%) confidence from the GWASs' logistic regression association.
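
The quadratic age-bias adjustment described for Figure S8 can be sketched as a one-parameter least-squares fit through the origin, $\Delta\beta/\beta \approx k\,\Delta t^2$. The sketch below is a minimal illustration, not the fitting procedure of the main article. It assumes that $\Delta\beta/\beta$ denotes (observed β − true β) / true β; the coefficient name `k`, the function names, and the synthetic data points are hypothetical.

```python
from math import log


def fit_quadratic_bias(delta_t, normalized_bias):
    """Least-squares estimate of k in normalized_bias ~ k * delta_t**2 (no intercept)."""
    squared = [dt * dt for dt in delta_t]
    numerator = sum(x * y for x, y in zip(squared, normalized_bias))
    denominator = sum(x * x for x in squared)
    return numerator / denominator


def adjust_beta(observed_beta, delta_t, k):
    """Remove the fitted bias, assuming observed = true * (1 + k * delta_t**2)."""
    return observed_beta / (1.0 + k * delta_t ** 2)


# Synthetic example (assumed values): controls 0 to 20 years older than the
# youngest case cohort, with a noise-free quadratic bias.
k_assumed = 0.001
ages_from_youngest = [0.0, 5.0, 10.0, 15.0, 20.0]
bias = [k_assumed * dt ** 2 for dt in ages_from_youngest]

k_hat = fit_quadratic_bias(ages_from_youngest, bias)

true_beta = log(1.15)  # effect size for an OR = 1.15 allele
observed_beta = true_beta * (1.0 + k_assumed * 10.0 ** 2)
recovered_beta = adjust_beta(observed_beta, 10.0, k_hat)
print(k_hat, recovered_beta)
```

Fitting without an intercept reflects the caption's definition of $\Delta t$ relative to the youngest cohort, where the bias is zero by construction. With real, noisy estimates an intercept term or a weighted fit may be more appropriate.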