Improving the power of genetic association tests with imperfect phenotype derived from electronic medical records

  • Original Investigation
  • Human Genetics

Abstract

To reduce costs and improve clinical relevance of genetic studies, there has been increasing interest in performing such studies in hospital-based cohorts by linking phenotypes extracted from electronic medical records (EMRs) to genotypes assessed in routinely collected medical samples. A fundamental difficulty in implementing such studies is extracting accurate information about disease outcomes and important clinical covariates from large numbers of EMRs. Recently, numerous algorithms have been developed to infer phenotypes by combining information from multiple structured and unstructured variables extracted from EMRs. Although these algorithms are quite accurate, they typically do not provide perfect classification due to the difficulty in inferring meaning from the text. Some algorithms can produce for each patient a probability that the patient is a disease case. This probability can be thresholded to define case–control status, and this estimated case–control status has been used to replicate known genetic associations in EMR-based studies. However, using the estimated disease status in place of true disease status results in outcome misclassification, which can diminish test power and bias odds ratio estimates. We propose to instead directly model the algorithm-derived probability of being a case. We demonstrate how our approach improves test power and effect estimation in simulation studies, and we describe its performance in a study of rheumatoid arthritis. Our work provides an easily implemented solution to a major practical challenge that arises in the use of EMR data, which can facilitate the use of EMR infrastructure for more powerful, cost-effective, and diverse genetic studies.

References

  • Ananthakrishnan AN, Cai T, Savova G, Cheng SC, Chen P, Perez RG, Gainer VS, Murphy SN, Szolovits P, Xia Z et al (2013) Improving case definition of Crohn’s disease and ulcerative colitis in electronic medical records using natural language processing: a novel informatics approach. Inflamm Bowel Dis 19(7):1411–1420

  • Breslow NE, Day NE et al (1980) Statistical methods in cancer research. The analysis of case–control studies, vol 1. Distributed for IARC by WHO, Geneva

  • Brinkman B, Huizinga T, Kurban S, Van der Velde E, Schreuder G, Hazes J, Breedveld F, Verweij C (1997) Tumour necrosis factor alpha gene polymorphisms in rheumatoid arthritis: association with susceptibility to, or severity of, disease? Rheumatology 36(5):516–521

  • Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM (2012a) Measurement error in nonlinear models: a modern perspective. CRC Press

  • Carroll RJ, Thompson WK, Eyler AE, Mandelin AM, Cai T, Zink RM, Pacheco JA, Boomershine CS, Lasko TA, Xu H et al (2012b) Portability of an algorithm to identify rheumatoid arthritis in electronic health records. J Am Med Inf Assoc 19(e1):e162–e169

  • Denny J, Ritchie M, Basford M, Pulley J, Bastarache L, Brown-Gentry K, Wang D, Masys D, Roden D, Crawford D (2010) PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 26(9):1205–1210

  • Denny JC, Crawford DC, Ritchie MD, Bielinski SJ, Basford MA, Bradford Y, Chai HS, Bastarache L, Zuvich R, Peissig P et al (2011) Variants near FOXE1 are associated with hypothyroidism and other thyroid conditions: using electronic medical records for genome- and phenome-wide studies. Am J Hum Genet 89(4):529–542

  • Denny JC, Bastarache L, Ritchie MD, Carroll RJ, Zink R, Mosley JD, Field JR et al (2013) Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol 31(12):1102–1111

  • Gabriel SE (1994) The sensitivity and specificity of computerized databases for the diagnosis of rheumatoid arthritis. Arthritis Rheum 37(6):821–823

  • Gonzalez-Gay MA, Garcia-Porrua C, Hajeer AH (2002) Influence of human leukocyte antigen-DRB1 on the susceptibility and severity of rheumatoid arthritis. Semin Arthritis Rheum 31(6):355–360

  • Gordon D, Finch SJ, Nothnagel M (2002) Power and sample size calculations for case–control genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum Hered 54(1):22–33

  • Kastbom A, Verma D, Eriksson P, Skogh T, Wingren G, Söderkvist P (2008) Genetic variation in proteins of the cryopyrin inflammasome influences susceptibility and severity of rheumatoid arthritis (the Swedish TIRA project). Rheumatology 47(4):415–417

  • Katz J, Barrett J, Liang M, Bacon A, Kaplan H, Kieval R, Lindsey S, Roberts W, Sheff D, Spencer R et al (1997) Sensitivity and positive predictive value of Medicare Part B physician claims for rheumatologic diagnoses and procedures. Arthritis Rheum 40(9):1594–1600

  • Kho A, Hayes M, Rasmussen-Torvik L, Pacheco J, Thompson W, Armstrong L, Denny J, Peissig P, Miller A, Wei W et al (2012) Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study. J Am Med Inf Assoc 19(2):212–218

  • Kohane I (2011) Using electronic health records to drive discovery in disease genomics. Nat Rev Genet 12(6):417–428

  • Kullback S (1959) Information theory and statistics. Wiley, New York

  • Kurreeman F, Liao K, Chibnik L, Hickey B, Stahl E, Gainer V, Li G, Bry L, Mahan S, Ardlie K et al (2011) Genetic basis of autoantibody positive and negative rheumatoid arthritis risk in a multi-ethnic cohort derived from electronic health records. Am J Hum Genet 88(1):57–69

  • Liao K, Cai T, Gainer V, Goryachev S, Zeng-treitler Q, Raychaudhuri S, Szolovits P, Churchill S, Murphy S, Kohane I et al (2010) Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care Res 62(8):1120–1127

  • Magder LS, Hughes JP (1997) Logistic regression when the outcome is measured with uncertainty. Am J Epidemiol 146(2):195–203

  • McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9(5):356–369

  • McDavid A, Crane PK, Newton KM, Crosslin DR, McCormick W, Weston N, Ehrlich K, Hart E, Harrison R, Kukull WA et al (2013) Enhancing the power of genetic association studies through the use of silver standard cases derived from electronic medical records. PLoS One 8(6):e63481

  • Neuhaus JM (1999) Bias and efficiency loss due to misclassified responses in binary regression. Biometrika 86(4):843–855

  • Newton KM, Peissig PL, Kho AN, Bielinski SJ, Berg RL, Choudhary V, Basford M, Chute CG, Kullo IJ, Li R, Pacheco JA, Rasmussen LV, Spangler L, Denny JC (2013) Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inf Assoc 20(e1):e147–e154

  • Perlis R, Iosifescu D, Castro V, Murphy S, Gainer V, Minnier J, Cai T, Goryachev S, Zeng Q, Gallagher P et al (2011) Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model. Psychol Med 1(1):1–10

  • Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38(8):904–909

  • R Development Core Team (2009) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. ISBN 3-900051-07-0, http://www.R-project.org.

  • Ritchie M, Denny J, Crawford D, Ramirez A, Weiner J, Pulley J, Basford M, Brown-Gentry K, Balser J, Masys D et al (2010) Robust replication of genotype–phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet 86(4):560–572

  • Singh J, Holmgren A, Noorbaloochi S (2004) Accuracy of veterans administration databases for a diagnosis of rheumatoid arthritis. Arthritis Care Res 51(6):952–957

  • Stahl EA, Raychaudhuri S, Remmers EF, Xie G, Eyre S, Thomson BP, Li Y, Kurreeman FA, Zhernakova A, Hinks A et al (2010) Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat Genet 42(6):508–514

  • Weyand CM, Hicok KC, Conn DL, Goronzy JJ (1992) The influence of HLA-DRB1 genes on disease severity in rheumatoid arthritis. Ann Intern Med 117(10):801–806

  • Wilke R, Xu H, Denny J, Roden D, Krauss R, McCarty C, Davis R, Skaar T, Lamba J, Savova G (2011) The emerging role of electronic medical records in pharmacogenomics. Clin Pharmacol Therapeut 89(3):379–386

Acknowledgments

JAS was supported by the National Institutes of Health (NIH) Grants T32 GM074897 and T32 CA09001 and the A. David Mazzone Career Development Award. TC was supported by NIH Grants R01 GM079330, U01 GM092691 and U54 LM008748.

Author information

Corresponding author

Correspondence to Jennifer A. Sinnott.

Appendix

Design A

In Design A, we consider a random sample of size \(n\) from the entire EMR data and calculate \(\hat{p}_{D}\) for everyone in the sample. Using assumption (\({\fancyscript{A}}\)), we see \(P(\hat{p}_{D}> c \mid {\mathbf{X}}) = P(\hat{p}_{D}> c \mid D=1) g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}) + P(\hat{p}_{D}> c \mid D=0)(1-g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})).\) Then, since \(E[T] = \int _0^\infty P(T > c)dc\) for any nonnegative random variable \(T\), and since \(\hat{p}_{D}\) takes values in \([0,1]\), we have: \(E[\hat{p}_{D}\mid {\mathbf{X}}] = \int _0^1P(\hat{p}_{D}>c \mid {\mathbf{X}})dc = g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}) \int _0^1 P(\hat{p}_{D}> c \mid D=1)dc + (1-g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}))\int _0^1P(\hat{p}_{D}> c \mid D=0)dc = \zeta _0 + (\zeta _1 - \zeta _0)g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})\) where \(\zeta _d = E[\hat{p}_{D}\mid D=d].\) Thus, letting \({\fancyscript{Y}}_A = \frac{\hat{p}_{D}- \zeta _0}{\zeta _1 - \zeta _0}\), we have \(E[{\fancyscript{Y}}_A \mid {\mathbf{X}}] = g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}).\)
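The Design A identity can be checked numerically. The following Python sketch is illustrative only: the discrete distributions of \(\hat{p}_{D}\) given \(D\), the coefficient values, and all function names are assumptions made for this example and do not come from the study data. It computes \(\zeta _0\) and \(\zeta _1\), forms \({\fancyscript{Y}}_A\), and averages it over \(D\) and \(\hat{p}_{D}\) for a given genotype.

```python
import math


def expit(t):
    """Logistic link g(t) = exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(-t))


def mean(dist):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(value * prob for value, prob in dist.items())


# Hypothetical distributions of the algorithm output p_hat given true status D.
P_HAT_GIVEN_D = {
    0: {0.05: 0.6, 0.20: 0.3, 0.60: 0.1},  # non-cases
    1: {0.30: 0.1, 0.70: 0.3, 0.95: 0.6},  # cases
}
ZETA0 = mean(P_HAT_GIVEN_D[0])  # zeta_0 = E[p_hat | D = 0]
ZETA1 = mean(P_HAT_GIVEN_D[1])  # zeta_1 = E[p_hat | D = 1]


def y_a(p_hat):
    """Rescaled outcome Y_A = (p_hat - zeta_0) / (zeta_1 - zeta_0)."""
    return (p_hat - ZETA0) / (ZETA1 - ZETA0)


def expected_y_a(z, beta0, beta1):
    """E[Y_A | Z = z], averaging over D and then over p_hat given D.

    Uses the assumption that p_hat is independent of X given D.
    """
    g = expit(beta0 + beta1 * z)
    total = 0.0
    for d, prob_d in ((1, g), (0, 1.0 - g)):
        for p_hat, prob in P_HAT_GIVEN_D[d].items():
            total += prob_d * prob * y_a(p_hat)
    return total
```

Under these assumed inputs, `expected_y_a(z, beta0, beta1)` agrees with `expit(beta0 + beta1 * z)` for each genotype value, which is the statement \(E[{\fancyscript{Y}}_A \mid {\mathbf{X}}] = g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})\).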

Design B

In Design B, we also genotype a random sample of size \(n\), but observe on everyone a perfect negative predictor \(U\) satisfying \(P(D=0 \mid U=0)=1.\) The EMR algorithm is developed among those individuals with \(U=1.\) In addition to assumption (\({\fancyscript{A}}\)), we assume that \(U\) is independent of \({\mathbf{X}}\) conditional on true disease status \(D.\) We let \({\tilde{p}}_{D}= {\tilde{p}}_{D}(U) = \hat{p}_{D}U.\) Defining \(\mu _d=E[\hat{p}_{D}\mid U=1, D=d]\) for \(d=0,1\), \(\rho = P(D=1 \mid U=1)\), \(\pi _U = P(U=1)\), and \({\tilde{\mu }}_0 = \mu _0 \frac{\pi _U - \rho \pi _U}{1-\rho \pi _U}\), we may calculate \(E[{\tilde{p}}_{D}\mid {\mathbf{X}}] = \sum _{d \in \{0,1\}} \mu _d P(U=1, D=d \mid {\mathbf{X}}) = \sum _{d \in \{0,1\}} \mu _d P(U=1 \mid D=d)P(D=d\mid {\mathbf{X}})= {\tilde{\mu }}_0\ + (\mu _1-{\tilde{\mu }}_0)g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})\) since \(P(U=1 \mid D=1)=1\) and \(P(U=1\mid D=0) = \frac{\pi _U - \rho \pi _U}{1-\rho \pi _U}\) by an application of Bayes rule. Thus, letting \({\fancyscript{Y}}_B = \frac{{\tilde{p}}_D - {\tilde{\mu }}_0}{\mu _1-{\tilde{\mu }}_0}\), we have \(E[{\fancyscript{Y}}_B \mid {\mathbf{X}}]= g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}).\)
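As a rough illustration of the Design B rescaling, the Python sketch below computes \(P(U=1 \mid D=0)\), \({\tilde{\mu }}_0\), and \({\fancyscript{Y}}_B\) from \(\mu _0\), \(\mu _1\), \(\rho \), and \(\pi _U\). The function names and any numeric inputs are hypothetical; only the formulas are taken from the derivation above.

```python
def prob_u1_given_control(rho, pi_u):
    """P(U = 1 | D = 0) = (pi_U - rho * pi_U) / (1 - rho * pi_U), by Bayes rule."""
    return (pi_u - rho * pi_u) / (1.0 - rho * pi_u)


def mu0_tilde(mu0, rho, pi_u):
    """mu~_0 = mu_0 * P(U = 1 | D = 0)."""
    return mu0 * prob_u1_given_control(rho, pi_u)


def y_b(p_hat, u, mu0, mu1, rho, pi_u):
    """Rescaled outcome Y_B = (p~ - mu~_0) / (mu_1 - mu~_0), with p~ = p_hat * U.

    For individuals with U = 0 the algorithm is not applied, so pass any
    numeric placeholder (for example 0.0) as p_hat; it is multiplied by zero.
    """
    m0 = mu0_tilde(mu0, rho, pi_u)
    p_tilde = p_hat * u
    return (p_tilde - m0) / (mu1 - m0)
```

With this scaling, an individual with \(U=1\) and \(\hat{p}_{D}=\mu _1\) maps to 1, and the average of \({\fancyscript{Y}}_B\) over non-cases is 0, consistent with \(E[{\fancyscript{Y}}_B \mid {\mathbf{X}}]= g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}).\)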

Design C

In Design C, we first partition the full EMR into a disease-mart (\(M=1\)) that includes all disease cases and a control-mart (\(M=0\)) of disease-free individuals. We develop and apply our algorithm to calculate \(\hat{p}_{D}\) only among individuals with \(M=1.\) Let \(\text {PPV}(S) = P(D=1 \mid M=1, \hat{p}_{D}> p_{S})\) denote the positive predictive value among disease-mart members whose \(\hat{p}_{D}\) exceeds the threshold \(p_{S}\), and assume a design with \(m\) controls per case sampled from the control-mart. Let \(V\) be the indicator that an individual is sampled in our study, and let \({\tilde{p}}_{D}={\tilde{p}}_{D}(M) = \hat{p}_{D}M.\)

We may calculate \(E[{\tilde{p}}_{D}\mid {\mathbf{X}}, D, V=1] = E[{\tilde{p}}_{D}\mid D, V=1] = DE[\hat{p}_{D}\mid D=1, M=1, \hat{p}_{D}>p_{S}]P(M=1 \mid D=1, V=1) + (1-D)E[\hat{p}_{D}\mid D=0, M=1, \hat{p}_{D}>p_{S}]P(M=1\mid D=0,V=1)= D\xi _1 + (1-D)\xi _0(1-\pi )\) where \(\xi _d=E[\hat{p}_{D}\mid D=d, M=1, \hat{p}_{D}>p_{S}]\) and \(\pi =P(M=0\mid D=0,V=1).\) In this calculation, we have used that \({\tilde{p}}_{D}=0\) when \(M=0\) and that \(P(M=1\mid D=1,V=1)=1\) because the initial partition has perfect sensitivity. We further calculate: \(\pi = \frac{P(M=0, D=0 \mid V=1)}{P(D=0 \mid V=1)} = \frac{P(M=0 \mid V=1)}{P(M=0 \mid V=1) + P(D=0 \mid M=1, V=1)P(M=1 \mid V=1)} = \frac{\frac{m}{m+1}}{\frac{m}{m+1} + (1-\text {PPV}(S))\frac{1}{m+1}} = \frac{m}{m+(1-\text {PPV}(S))}.\) Then, letting \({\fancyscript{Y}}_C = \frac{{\tilde{p}}_{D}- \xi _0(1-\pi )}{\xi _1 - \xi _0(1-\pi )}\), we have that \(E[{\fancyscript{Y}}_C \mid {\mathbf{X}}, V=1] = E[D \mid {\mathbf{X}}, V=1].\)
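The quantities in this calculation are simple to compute. The Python sketch below is a minimal illustration, with hypothetical function names and inputs, of how \(\pi \) follows from \(m\) and \(\text {PPV}(S)\) and how \({\fancyscript{Y}}_C\) is formed from \(\hat{p}_{D}\) and \(M\).

```python
def control_mart_share(m, ppv):
    """pi = P(M = 0 | D = 0, V = 1) = m / (m + 1 - PPV(S))."""
    return m / (m + (1.0 - ppv))


def y_c(p_hat, in_mart, xi0, xi1, m, ppv):
    """Rescaled outcome Y_C = (p~ - xi_0 (1 - pi)) / (xi_1 - xi_0 (1 - pi)).

    in_mart is the indicator M; p~ = p_hat * M, so p_hat is ignored
    (pass 0.0) for individuals sampled from the control-mart.
    """
    pi = control_mart_share(m, ppv)
    offset = xi0 * (1.0 - pi)
    p_tilde = p_hat * in_mart
    return (p_tilde - offset) / (xi1 - offset)
```

For example, with one control per case and an assumed \(\text {PPV}(S)=0.9\), `control_mart_share(1, 0.9)` evaluates \(1/1.1\), so about 91% of the sampled non-cases would come from the control-mart.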

Finally, using Bayes rule, we see that \( E[D\mid {\mathbf{X}}, V=1] =\frac{P(V=1 \mid D=1)g(\beta ^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})}{P(V=1 \mid D=1) g(\beta ^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}) + P(V=1 \mid D=0)(1-g(\beta ^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}))} =\frac{\exp (\beta ^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})}{\lambda +\exp (\beta ^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})}\) where \(\lambda = \frac{P(V=1 \mid D=0)}{P(V=1 \mid D=1)}.\) Letting \(\beta _0^* = \beta _0 - \log \{\lambda \}\), we have \(E[D \mid {\mathbf{X}}, V=1] = \frac{\exp (\beta _0^* + \beta _1Z +{\beta }_2^{\mathsf{\scriptscriptstyle {T}}}{\mathbf {W}})}{1+ \exp (\beta _0^* + \beta _1Z +{\beta }_2^{\mathsf{\scriptscriptstyle {T}}}{\mathbf {W}})}.\) Thus, \(E[{\fancyscript{Y}}_C \mid {\mathbf{X}}, V=1] = g({{\beta }^*}^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})\), where \({\beta }^* = (\beta _0^*, \beta _1, {\beta }_2^{\mathsf{\scriptscriptstyle {T}}})^{\mathsf{\scriptscriptstyle {T}}}\).
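The intercept shift can be verified numerically. In the Python sketch below, the sampling probabilities and function names are assumptions chosen for illustration. It evaluates \(E[D\mid {\mathbf{X}}, V=1]\) three ways: directly from Bayes rule, through \(\lambda \), and through the shifted intercept \(\beta _0^*\).

```python
import math


def expit(t):
    """Logistic link g(t) = exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(-t))


def case_prob_bayes(lin_pred, samp_case, samp_control):
    """E[D | X, V = 1] from Bayes rule, given P(V=1 | D=1) and P(V=1 | D=0)."""
    g = expit(lin_pred)
    return samp_case * g / (samp_case * g + samp_control * (1.0 - g))


def case_prob_lambda(lin_pred, lam):
    """The same quantity written as exp(eta) / (lambda + exp(eta))."""
    e = math.exp(lin_pred)
    return e / (lam + e)


def shifted_intercept(beta0, lam):
    """beta_0^* = beta_0 - log(lambda)."""
    return beta0 - math.log(lam)
```

All three forms agree, so only the intercept changes under the sampling scheme, while \(\beta _1\) and \({\beta }_2\) are unaffected.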

Power and bias calculations

For simplicity, we derive expressions under Design A. When using \(\hat{p}_{D}\), the estimator \({\hat{{\beta }}}\) solves \(U({\beta })=\frac{1}{n}\sum _{i=1}^n\psi _i({\beta })=\frac{1}{n}\sum _{i=1}^n {\mathbf{X}}_i({\fancyscript{Y}}_i - g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}_i))=0\), so \(\sqrt{n}({\hat{{\beta }}}- {\beta }_0) \mathop {\rightarrow }\limits ^{{\fancyscript{D}}} N(0, V({\beta }))\), where \(V({\beta })=B({\beta })^{-1}A({\beta })(B({\beta })^{-1})^{\mathsf{\scriptscriptstyle {T}}}\), with \(B({\beta }) = E\left[ g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})(g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})-1) {\mathbf{X}}{\mathbf{X}}^{\mathsf{\scriptscriptstyle {T}}}\right] \) and \(A({\beta }) = E\left[ ({\fancyscript{Y}}- g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}))^2 {\mathbf{X}}{\mathbf{X}}^{\mathsf{\scriptscriptstyle {T}}}\right] = E\left[ \left\{ \text{ Var }({\fancyscript{Y}}| D) + E[{\fancyscript{Y}}\mid D]^2 \right\} E\left[ {\mathbf{X}}{\mathbf{X}}^{\mathsf{\scriptscriptstyle {T}}}\mid D \right] \right] - 2 E\left[ E\left[ {\fancyscript{Y}}\mid D \right] E\left[ g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}) {\mathbf{X}}{\mathbf{X}}^{\mathsf{\scriptscriptstyle {T}}}\mid D \right] \right] + E\left[ g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})^2 {\mathbf{X}}{\mathbf{X}}^{\mathsf{\scriptscriptstyle {T}}}\right] \), where the last equality uses assumption (\({\fancyscript{A}}\)).
We can further expand this since \({\mathbf{X}}=(1, Z)^{\mathsf{\scriptscriptstyle {T}}}\), for SNP \(Z;\) in particular, for any function \(f\), \(E[f(Z) \mid D] = \frac{D}{P(D=1)} E[f(Z)g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})] + \frac{1-D}{P(D=0)} E[f(Z)(1-g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}))].\) Letting \(\mu _d=E[{\fancyscript{Y}}\mid D=d]\) and \(\xi _d=\text{ Var }({\fancyscript{Y}}\mid D=d)\), we can rewrite \(A({\beta })\) as: \(A({\beta }) = \left\{ \xi _0 + \mu _0^2 \right\} E[{\mathbf{X}}{\mathbf{X}}^{\mathsf{\scriptscriptstyle {T}}}] + \left\{ \xi _1 + \mu _1^2 - \xi _0 -\mu _0^2 -2 \mu _0 \right\} E[g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}){\mathbf{X}}{\mathbf{X}}^{\mathsf{\scriptscriptstyle {T}}}] + (2(\mu _0 - \mu _1) +1) E[g({\beta }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})^2{\mathbf{X}}{\mathbf{X}}^{\mathsf{\scriptscriptstyle {T}}}]\).
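The expression for \(A({\beta })\) translates into a short computation. The Python sketch below is one possible implementation under stated assumptions: \({\mathbf{X}}=(1, Z)^{\mathsf{\scriptscriptstyle {T}}}\) with \(Z\sim \text {Bin}(2, p_Z)\), user-supplied values of \(\mu _d\) and \(\xi _d\), and a standard two-sided Wald test for \(\beta _1\). The Wald power formula, the function names, and any numeric inputs are our additions for illustration and are not taken from the text.

```python
import math
from statistics import NormalDist


def expit(t):
    """Logistic link g(t) = exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(-t))


def moment(weight, beta, p_z):
    """E[weight(g(beta^T X)) X X^T] for X = (1, Z), Z ~ Bin(2, p_z)."""
    probs = {0: (1 - p_z) ** 2, 1: 2 * p_z * (1 - p_z), 2: p_z ** 2}
    out = [[0.0, 0.0], [0.0, 0.0]]
    for z, pr in probs.items():
        w = weight(expit(beta[0] + beta[1] * z)) * pr
        x = (1.0, float(z))
        for i in range(2):
            for j in range(2):
                out[i][j] += w * x[i] * x[j]
    return out


def inv2(m):
    """Inverse of a 2x2 matrix stored as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def sandwich_variance(beta, p_z, mu, xi):
    """V(beta) = B^{-1} A (B^{-1})^T with A written in terms of mu_d and xi_d."""
    exx = moment(lambda g: 1.0, beta, p_z)
    egxx = moment(lambda g: g, beta, p_z)
    eg2xx = moment(lambda g: g * g, beta, p_z)
    c0 = xi[0] + mu[0] ** 2
    c1 = xi[1] + mu[1] ** 2 - xi[0] - mu[0] ** 2 - 2 * mu[0]
    c2 = 2 * (mu[0] - mu[1]) + 1
    a = [[c0 * exx[i][j] + c1 * egxx[i][j] + c2 * eg2xx[i][j]
          for j in range(2)] for i in range(2)]
    b_inv = inv2(moment(lambda g: g * (g - 1.0), beta, p_z))
    # B is symmetric, so (B^{-1})^T equals B^{-1}.
    return matmul2(matmul2(b_inv, a), b_inv)


def wald_power(beta1, var11, n, alpha=0.05):
    """Approximate power of a two-sided Wald test of beta_1 = 0 (assumed test)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(beta1) * math.sqrt(n / var11)
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)
```

As a sanity check on the formula, setting \(\mu _0=0\), \(\mu _1=1\), and \(\xi _0=\xi _1=0\) (a perfectly observed outcome) reduces \(A({\beta })\) to \(-B({\beta })\), so `sandwich_variance` returns the usual inverse Fisher information. Positive \(\xi _d\) inflates the variance of \(\hat{\beta }_1\) and lowers the computed power.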

To compare power to results using \({\tilde{D}}\) in the misspecified model, we now consider the distribution of \({\hat{{\gamma }}}\) which solves \(U({\gamma })=\frac{1}{n}\sum _{i=1}^n\psi _i({\gamma })=\frac{1}{n}\sum _{i=1}^n {\mathbf{X}}_i({\tilde{D}}_i - g({\gamma }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}_i))=0.\) Then \(\sqrt{n}({\hat{{\gamma }}}- {\gamma }^*({\beta })) \rightarrow N(0, V^*({\beta }))\), where \({\gamma }^*({\beta })\) is a constant. To estimate \({\gamma }^*\), we can proceed as in Neuhaus (1999) and use results from work on misspecified models to see that estimates from the false model \(P_F({\tilde{D}}=1 \mid Z) = g(\gamma _0 + \gamma _1Z)\) converge to values \((\gamma _0^*, \gamma _1^*)\) that minimize the Kullback–Leibler divergence between the false model and the true model \(P_T({\tilde{D}}=1 \mid Z) = (1-S) + (\text{ SE }(S) + S - 1)g(\beta _0 + \beta _1Z)\) (Neuhaus 1999; Kullback 1959). The Kullback–Leibler divergence between these two models is \(E_Z[ \log \{ \frac{P_T({\tilde{D}}=1 \mid Z)}{P_F({\tilde{D}}=1 \mid Z)} \}P_T({\tilde{D}}=1 \mid Z)+ \log \{ \frac{P_T({\tilde{D}}=0 \mid Z)}{P_F({\tilde{D}}=0 \mid Z)} \} P_T({\tilde{D}}=0 \mid Z)].\) By taking derivatives with respect to \(\gamma _0\) and \(\gamma _1\) and setting them to 0, we find two equations: \(0= \left\{ \alpha _0 + \alpha _1g(\beta _0) -g(\gamma _0)\right\} (1-p_Z)^2 + \left\{ \alpha _0 + \alpha _1g(\beta _0+\beta _1) -g(\gamma _0+\gamma _1)\right\} 2p_Z(1-p_Z) + \left\{ \alpha _0 + \alpha _1g(\beta _0+2\beta _1) -g(\gamma _0+2\gamma _1)\right\} p_Z^2\) and \(0= \left\{ \alpha _0 + \alpha _1g(\beta _0+\beta _1) -g(\gamma _0+\gamma _1)\right\} 2p_Z(1-p_Z) +2 \left\{ \alpha _0 + \alpha _1g(\beta _0+2\beta _1) -g(\gamma _0+2\gamma _1)\right\} p_Z^2\), where \(\alpha _0=1-S\) and \(\alpha _1=\text{ SE }(S)+S-1.\) Here, we assume that the SNP \(Z\sim \text {Bin}(2, p_Z)\), where \(p_Z\) is the MAF.
Simultaneously solving these two equations for \((\gamma _0, \gamma _1)\) yields the desired \((\gamma ^*_0, \gamma ^*_1).\) The calculation of \(V^*({\gamma }^*)\) proceeds similarly to the calculation of \(V({\beta })\): \(V^*({\gamma }^*)=B^*({\gamma }^*)^{-1}A^*({\gamma }^*)(B^*({\gamma }^*)^{-1})^{\mathsf{\scriptscriptstyle {T}}}\), where \(B^*({\gamma }) = E\left[ g({\gamma }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})(g({\gamma }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})-1) {\mathbf{X}}{\mathbf{X}}^{\mathsf{\scriptscriptstyle {T}}}\right] \) as before. Here, though, \( A^*({\gamma })= (1-S)E[{\mathbf{X}}{\mathbf{X}}^{\mathsf{\scriptscriptstyle {T}}}]+(\text{ SE }(S)-3(1-S)) E[g({\gamma }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}}){\mathbf{X}}{\mathbf{X}}^{\mathsf{\scriptscriptstyle {T}}}] +(2(1-S-\text{ SE }(S)) + 1) E[g({\gamma }^{\mathsf{\scriptscriptstyle {T}}}{\mathbf{X}})^2{\mathbf{X}}{\mathbf{X}}^{\mathsf{\scriptscriptstyle {T}}}].\)
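The two equations for \((\gamma _0, \gamma _1)\) have no closed-form solution but can be solved numerically. The Python sketch below uses Newton's method started at \((\beta _0, \beta _1)\). The solver choice, the function name, the tolerance, and all numeric inputs are assumptions for illustration, since the text does not say how the equations were solved.

```python
import math


def expit(t):
    """Logistic link g(t) = exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(-t))


def limiting_gamma(beta0, beta1, spec, sens, p_z, tol=1e-12, max_iter=100):
    """Solve the two score equations for (gamma_0^*, gamma_1^*) by Newton's method.

    spec is the specificity S, sens is the sensitivity SE(S), and p_z is the
    minor allele frequency, with Z ~ Bin(2, p_z).
    """
    alpha0 = 1.0 - spec
    alpha1 = sens + spec - 1.0
    probs = {0: (1 - p_z) ** 2, 1: 2 * p_z * (1 - p_z), 2: p_z ** 2}
    g0, g1 = beta0, beta1  # starting values
    for _ in range(max_iter):
        f0 = f1 = h00 = h01 = h11 = 0.0
        for z, pr in probs.items():
            fitted = expit(g0 + g1 * z)
            # Residual between the true and the false model at genotype z.
            resid = alpha0 + alpha1 * expit(beta0 + beta1 * z) - fitted
            w = fitted * (1.0 - fitted) * pr
            f0 += resid * pr       # first equation
            f1 += z * resid * pr   # second equation
            h00 += w
            h01 += z * w
            h11 += z * z * w
        if abs(f0) < tol and abs(f1) < tol:
            break
        # Newton step: solve H * delta = (f0, f1) for the 2x2 matrix H.
        det = h00 * h11 - h01 * h01
        g0 += (h11 * f0 - h01 * f1) / det
        g1 += (h00 * f1 - h01 * f0) / det
    return g0, g1
```

With perfect classification (\(S=1\), \(\text{ SE }(S)=1\)) the function returns \((\beta _0, \beta _1)\) unchanged. With imperfect classification, \(\gamma _1^*\) is typically closer to zero than \(\beta _1\), which reflects the attenuation of the odds ratio estimate discussed by Neuhaus (1999).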

About this article

Cite this article

Sinnott, J.A., Dai, W., Liao, K.P. et al. Improving the power of genetic association tests with imperfect phenotype derived from electronic medical records. Hum Genet 133, 1369–1382 (2014). https://doi.org/10.1007/s00439-014-1466-9

  • DOI: https://doi.org/10.1007/s00439-014-1466-9
