Identifiability in biobanks: models, measures, and mitigation strategies

  • Original Investigation
Human Genetics

Abstract

The collection and sharing of person-specific biospecimens have raised significant questions regarding privacy. In particular, the question of identifiability, or the degree to which materials stored in biobanks can be linked to the names of the individuals from whom they were derived, is under scrutiny. The goal of this paper is to review the extent to which biospecimens and affiliated data can be designated as identifiable. To achieve this goal, we summarize recent research in identifiability assessment for DNA sequence data, as well as associated demographic and clinical data, shared via biobanks. We demonstrate the variability of the degree of risk, the factors that contribute to this variation, and potential ways to mitigate and manage such risk. Finally, we discuss the policy implications of these findings, particularly as they pertain to biobank security and access policies. We situate our review in the context of real data sharing scenarios and biorepositories.

Notes

  1. Other strategies leverage security measures to limit access, but are beyond the scope of this discussion. We refer the reader to Langella et al. (2008) and Lemrow et al. (2007) for further discussions on such security practices.

  2. It should be noted that Safe Harbor actually permits the first three digits of a zip code to be disclosed when the region they cover has a population greater than 20,000. We use the simplification of state of residence for illustrative purposes and because it has been observed that many organizations choose to withhold such information in their application of the policy.
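
The zip code provision mentioned in note 2 can be illustrated with a minimal sketch. It assumes the Safe Harbor convention that a three-digit prefix covering 20,000 or fewer people is replaced by "000". The function name, the lookup table, and the population counts are hypothetical and serve only as an illustration; they are not data or code from the paper.

```python
from typing import Mapping

# Hypothetical population counts per 3-digit zip prefix (illustrative values only).
PREFIX_POPULATION = {"372": 650_000, "036": 12_000}

# Safe Harbor threshold: the prefix may be disclosed only above this population.
SAFE_HARBOR_THRESHOLD = 20_000


def generalize_zip(zip_code: str, prefix_population: Mapping[str, int]) -> str:
    """Return the 3-digit prefix if its region is populous enough, else "000"."""
    prefix = zip_code[:3]
    # Unknown prefixes are treated as population 0, i.e. conservatively suppressed.
    population = prefix_population.get(prefix, 0)
    return prefix if population > SAFE_HARBOR_THRESHOLD else "000"
```

Under these assumed counts, `generalize_zip("37203", PREFIX_POPULATION)` keeps the prefix `"372"`, while `generalize_zip("03601", PREFIX_POPULATION)` is suppressed to `"000"`.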

References

  • Adam N, Wortman J (1989) Security-control methods for statistical databases: a comparative study. ACM Comput Surv 21:515–556

  • Anonymous (2011) CODIS: the combined DNA index system. DNANews.org. http://dnanews.org/codis-the-combined-dna-index-system/. Accessed 27 May 2011

  • Bayardo R, Agrawal R (2005) Data privacy through optimal k-anonymity. In: Proceedings of the 21st IEEE International Conference on Data Engineering, pp 217–228

  • Bellazzi R, Zupan B (2008) Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform 77:81–97

  • Benitez K, Malin B (2010) Evaluating re-identification risk with respect to the HIPAA Privacy Rule. J Am Med Inform Assoc 17:169–177

  • Benitez K, Loukides G, Malin B (2010) Beyond Safe Harbor: automatic discovery of health information de-identification policy alternatives. In: Proceedings of the ACM International Health Informatics Symposium, ACM Press, New York, pp 163–172

  • Bexelius C, Hoeyer K, Lynöe N (2007) Will forensic use of medical biobanks decrease public trust in healthcare services? Some empirical observations. Scand J Public Health 35:442

  • Botkin J (2001) Protecting the privacy of family members in survey and pedigree research. JAMA 285:207–211

  • Burke W, Psaty B (2007) Personalized medicine in the era of genomics. JAMA 298:1682–1684

  • Cassa C, Schmidt B, Kohane I, Mandl K (2008) My sister’s keeper? Genomic research and the identifiability of siblings. BMC Med Genomics 1:32

  • Chiang Y, Hsu T, Kuo S, Liau C, Wang D (2003) Preserving confidentiality when sharing medical database with the Cellsecu system. Int J Med Inform 71:17–23

  • Clayton D (2010) On inferring presence of an individual in a mixture: a Bayesian approach. Biostatistics 11:661–673

  • Clayton E, Smith M, Fullerton SM et al (2010) Confronting real time ethical, legal, and social issues in the Electronic Medical Records and Genomics (eMERGE) Consortium. Genet Med 12:616–620

  • Collins F (2010) Has the revolution arrived? Nature 464:674–675

  • Currie P (2005) Balancing privacy protections with efficient research: institutional review boards and the use of certificates of confidentiality. IRB 27:7–12

  • Dankar F, El Emam K (2010) A method for evaluating marketer re-identification risk. In: Proceedings of the EDBT/ICDT Workshops, ACM Press, New York

  • Eiseman E, Bloom G, Brower J, Clancy N, Olmstead S (2003) Case studies of existing human tissue repositories: “best practices” for a biospecimen resource for the genomic and proteomic era. Rand Corporation, Santa Monica

  • El Emam K (2008) Heuristics for de-identifying health data. IEEE Secur Priv Mag 6:58–61

  • El Emam K, Dankar F (2008) Protecting privacy using k-anonymity. J Am Med Inform Assoc 15:627–637

  • El Emam K, Jabbouri S, Sams S, Drouet Y, Power M (2006) Evaluating common de-identification heuristics for personal health information. J Med Internet Res 8:e28

  • El Emam K, Dankar F, Issa R et al (2009) A globally optimal k-anonymity method for the de-identification of health data. J Am Med Inform Assoc 16:670–680

  • Glaser J, Henley D, Downing D, Brinner K (2008) Advancing personalized health care through health information technology: an update from the American Health Information Community’s Personalized Health Care Workgroup. J Am Med Inform Assoc 15:391–396

  • Golle P (2006) Revisiting the uniqueness of simple demographics in the US population. In: Proceedings of the ACM Workshop on Privacy in Electronic Society, ACM Press, New York, pp 77–80

  • Green ED, Guyer MS, National Human Genome Research Institute (2011) Charting a course for genomic medicine from base pairs to bedside. Nature 470:204–213

  • Guttmacher A, Collins F (2005) Realizing the promise of genomics in biomedical research. JAMA 294:1399–1402

  • Haga S, O’Daniel J (2011) Public perspectives regarding data sharing practices in genomics research. Public Health Genomics. doi:10.1159/000324705 (published online March 24)

  • Hamburg M, Collins F (2010) The path to personalized medicine. N Engl J Med 363:301–304

  • Hansson S, Björkman B (2006) Bioethics in Sweden. Camb Q Healthc Ethics 15:285–293

  • Hindmarsh R, Abu-Bakar A (2007) Balancing benefits of human genetic research against civic concerns: essentially Yours and beyond—the case of Australia. Pers Med 4:497–505

  • Homer N, Szelinger S, Redman M et al (2008) Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 4:e1000167

  • Kaufman DJ, Murphy-Bollinger J, Scott J, Hudson KL (2009) Public opinion about the importance of privacy in biobank research. Am J Hum Genet 85:643–654

  • Kaye J (2006) Police collection and access to DNA samples. Genomics Soc Policy 2:16–72

  • Kayser M, Schneider P (2009) DNA-based prediction of human externally visible characteristics in forensics: motivations, scientific challenges, and ethical considerations. Forensic Sci Int Genet 3:154–161

  • Kohane I, Altman R (2005) Health information altruists—a potentially critical resource. N Engl J Med 353:2074–2077

  • Kullo I, Fan J, Pathak J, Savova G, Ali Z, Chute C (2010) Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease. J Am Med Inform Assoc 17:568–574

  • Langella S, Hastings S, Oster S et al (2008) Sharing data and analytical resources securely in a biomedical research grid environment. J Am Med Inform Assoc 15:363–373

  • Lemke A, Wolf W, Hebert-Beirne J, Smith M (2010) Public and biobank participant attitudes toward genetic research participation and data sharing. Public Health Genomics 13:368–377

  • Lemrow S, Colditz G, Vaught J, Hartge P (2007) Key elements of access policies for biorepositories associated with population science research. Cancer Epidemiol Biomarkers Prev 16:1533–1535

  • Li G, Wang Y, Su X (2011) Improvements on a privacy-protection algorithm for DNA sequences with generalization lattices. Comput Methods Programs Biomed. doi:10.1016/j.cmpb.2011.02.013

  • Lin Z, Hewett M, Altman R (2002) Using binning to maintain confidentiality of medical data. Proc AMIA Symp 454–458

  • Lin Z, Owen A, Altman R (2004) Genetics: genomic research and human subject privacy. Science 305:183

  • Lin Z, Altman R, Owen A (2006) Confidentiality in genome research. Science 313:441–442

  • Louie B, Mork P, Martin-Sanchez F, Halevy A, Tarczy-Hornoch P (2007) Data integration and genomic medicine. J Biomed Inform 40:5–16

  • Loukides G, Denny J, Malin B (2010a) The disclosure of diagnosis codes can breach research participants’ privacy. J Am Med Inform Assoc 17:322–327

  • Loukides G, Gkoulalas-Divanis A, Malin B (2010b) Anonymization of electronic medical records for validating genome-wide association studies. Proc Natl Acad Sci USA 107:7898–7903

  • Lowrance W, Collins F (2007) Ethics: identifiability in genomic research. Science 317:600–602

  • Lunshof J, Chadwick R, Vorhaus D, Church G (2008) From genetic privacy to open consent. Nat Rev Genet 9:406–411

  • Mailman MD, Feolo M, Jin Y et al (2007) The NCBI dbGaP database of genotypes and phenotypes. Nat Genet 39:1181–1186

  • Malin B (2005a) An evaluation of the current state of genomic data privacy protection technology and a roadmap for the future. J Am Med Inform Assoc 12:28–34

  • Malin B (2005b) Protecting genomic sequence anonymity with generalization lattices. Methods Inf Med 44:687–692

  • Malin B (2007) A computational model to protect patient data from location-based re-identification. Artif Intell Med 40:222–239

  • Malin B (2008) K-unlinkability: a privacy protection model for distributed data. Data Knowl Eng 64:294–311

  • Malin B, Sweeney L (2004) How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems. J Biomed Inform 37:179–192

  • Malin B, Karp D, Scheuermann R (2010) Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. J Investig Med 58:11–18

  • Malin B, Benitez K, Masys D (2011) Never too old for anonymity: a statistical standard for demographic data sharing via the HIPAA Privacy Rule. J Am Med Inform Assoc 18:3–10

  • McCartney C (2004) Forensic DNA sampling and the England and Wales National DNA database: a sceptical approach. Crit Criminol 12:157–178

  • McCarty C, Chisholm R, Chute C et al (2011) The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics 4:13

  • McGuire A, Gibbs R (2006) Genetics: no longer de-identified. Science 312:370–371

  • McGuire A, Fisher R, Cusenza P et al (2008a) Confidentiality, privacy, and security of genetic and genomic test information in electronic health records: points to consider. Genet Med 10:495–499

  • McGuire A, Hamilton J, Lunstroth R, McCullough L, Goldman A (2008b) DNA data sharing: research participants perspectives. Genet Med 10:46–53

  • Miller G (2009) The looming crisis in human genetics. The Economist November 13

  • Miller E (2010) Relative doubt: familial searches of DNA databases. Mich Law Rev 109:291–348

  • National Institutes of Health (2002) NIH announces statement on certificates of confidentiality. NOT-OD-02-037 March 15

  • National Institutes of Health (2003) Final NIH statement on sharing research data. NOT-OD-03-032 February 26

  • National Institutes of Health (2007) Policy for sharing of data obtained in NIH supported or conducted genome-wide association studies (GWAS). NOT-OD-07-088 August 28

  • Ng P, Murray S, Levy S, Venter C (2009) An agenda for personalized medicine. Nature 461:724–726

  • Ollier W, Sprosen T, Peakman T (2005) UK Biobank: from concept to reality. Pharmacogenomics 6:639–646

  • Ossorio P (2006) About face: forensic genetic testing for race and visible traits. J Law Med Ethics 34:277–292

  • Phillips C, Salas A, Sanchez JJ et al (2007) Inferring ancestral origin using a single multiplex assay of ancestry-informative marker SNPs. Forensic Sci Int Genet 1:273–280

  • Ritchie M, Denny J, Crawford D et al (2010) Robust replication of genotype–phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet 86:560–572

  • Roden D, Pulley J, Basford M et al (2008) Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clin Pharmacol Ther 84:362–369

  • Roses A (2004) Pharmacogenetics and drug development: the path to safer and more effective drugs. Nat Rev Genet 5:645–656

  • Samarati P (2001) Protecting respondents identities in microdata release. IEEE Trans Knowl Data Eng 13:1010–1027

  • Sankararaman S, Obozinski G, Jordan M, Halperin E (2009) Genomic privacy and limits of individual detection in a pool. Nat Genet 41:965–967

  • Subcommittee on Disclosure Limitation Methodology, Federal Committee on Statistical Methodology (2005) Report on statistical disclosure limitation methodology. Statistical Policy Working Paper 22, Office of Management and Budget. Revised by the Confidentiality and Data Access Committee

  • Sweeney L (1997) Weaving technology and policy together to maintain confidentiality. J Law Med Ethics 25:98–110

  • Sweeney L (2002a) k-anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl Based Syst 10:557–570

  • Sweeney L (2002b) Achieving k-anonymity privacy protection using generalization and suppression. Int J Uncertain Fuzziness Knowl Based Syst 10:571–588

  • U.S. Department of Health and Human Services (2002) Standards for privacy of individually identifiable health information, final rule. Federal Register, 45 CFR: 160–164

  • Vinterbo S, Ohno-Machado L, Dreiseitl S (2001) Hiding information by cell suppression. Proc AMIA Symp 726–730

  • Wang D, Liau C, Hsu T (2004) Medical privacy protection based on granular computing. Artif Intell Med 32:137–149

  • Wang R, Li Y, Wang X, Tang H, Zhou X (2009) Learning your identity and disease from research papers: information leaks in genome wide association study. In: Proceedings of the ACM Conference on Computer and Communications Security, ACM Press, New York, pp 534–544

  • Willenborg L, De Waal T (1996) Statistical disclosure control in practice. Springer Lecture Notes in Statistics. Springer, New York

  • Wolf L, Zandecki J (2006) Sleeping better at night: investigators’ experiences with certificates of confidentiality. IRB 28:1–7

  • Zerhouni E, Nabel E (2008) Protecting aggregate genomic data. Science 322:44

Acknowledgments

This work was supported, in part, by grants 1R01LM009989 and 1U01HG004603 from the US National Institutes of Health.

Corresponding author

Correspondence to Bradley Malin.

Cite this article

Malin, B., Loukides, G., Benitez, K. et al. Identifiability in biobanks: models, measures, and mitigation strategies. Hum Genet 130, 383–392 (2011). https://doi.org/10.1007/s00439-011-1042-5

  • DOI: https://doi.org/10.1007/s00439-011-1042-5
