Abstract
The technical landscape of clinical machine learning is shifting in ways that destabilize pervasive assumptions about the nature and causes of algorithmic bias. On one hand, the dominant paradigm in clinical machine learning is narrow in the sense that models are trained on biomedical data sets for particular clinical tasks, such as diagnosis and treatment recommendation. On the other hand, the emerging paradigm is generalist in the sense that general-purpose language models such as Google’s BERT and PaLM are increasingly being adapted for clinical use cases via prompting or fine-tuning on biomedical data sets. Many of these next-generation models provide substantial performance gains over prior clinical models, but at the same time introduce novel kinds of algorithmic bias and complicate the explanatory relationship between algorithmic biases and biases in training data. This paper articulates how and in what respects biases in generalist models differ from biases in prior clinical models, and draws out practical recommendations for algorithmic bias mitigation. The basic methodological approach is that of philosophical ethics in that the focus is on conceptual clarification of the different kinds of biases presented by generalist clinical models and their bioethical significance.
Notes
For an excellent discussion of the nature and causes of algorithmic bias simpliciter, that is, outside the specific context of clinical medicine, see [16].
The problem is underscored by the fact that, absent equal base rates across subpopulations or perfect predictive performance, binary classifiers cannot equalize precision and false positive/negative rates [7]. See Kleinberg et al. [41] for an analogous result for continuous risk scores. For discussion of the significance of the fairness impossibility theorems for healthcare, see Grote and Keeling [28] and Grote and Keeling [29].
The implication here is not that all performance biases are explained by biases in training data, as biases can arise at every stage of the machine learning pipeline [61]. Rather, the claim is that performance biases (at least in healthcare, where demographic data biases are widespread and pervasive) in a broad class of cases arise due to biases in data sets, such as under-representative or misrepresentative training data. Such is the extent of data biases that it makes sense for organizations like the FDA to orient their general bias mitigation advice around data representativeness [cf. 17].
This formulation is rough because, strictly speaking, tokens rather than words are input into the function; that is, input sequences of text are first tokenized [see 26].
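The distinction in this note can be made concrete with a minimal sketch. The vocabulary and greedy longest-match scheme below are illustrative assumptions (a simplified WordPiece-style subword tokenizer), not the tokenizer of any particular model discussed here; they show only why models consume tokens rather than words, with rare words split into subword pieces.

```python
# Toy subword vocabulary; "##" marks a piece that continues a word.
# Both the vocabulary and the matching scheme are hypothetical.
VOCAB = {"hypo", "##glyc", "##emia", "the", "patient", "has", "[UNK]"}

def tokenize_word(word: str) -> list[str]:
    """Greedy longest-match subword tokenization of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece covers this span
        tokens.append(piece)
        start = end
    return tokens

def tokenize(text: str) -> list[str]:
    """Whitespace-split, lowercase, then tokenize each word."""
    return [t for w in text.lower().split() for t in tokenize_word(w)]

print(tokenize("The patient has hypoglycemia"))
# ['the', 'patient', 'has', 'hypo', '##glyc', '##emia']
```

Note that the clinical term "hypoglycemia" is not a vocabulary item; the model would see three subword tokens, which is why describing the input as "words" is only an approximation.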
A third case bracketed here is sequence-to-sequence models such as Google’s T5 that include an encoder and a decoder [58].
These examples are illustrative and are not intended as an exhaustive taxonomy.
Sun et al. [71, p. 206] note that ‘the use of negative descriptors might not necessarily reflect bias among individual providers; rather, it may reflect a broader systemic acceptability of using negative patient descriptors as a surrogate for identifying structural barriers’.
Note that this issue is not unique to LLMs. Similar considerations hold for models and research studies that rely on discrete EHR data, which can also encode biases [cf. 63].
References
Abid, A., Farooqi, M., Zou, J.: Persistent anti-Muslim bias in large language models. In: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 298–306 (2021)
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al.: Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022)
Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: can language models be too big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623 (2021)
Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Adv. Neural. Inf. Process. Syst. 33, 1877–1901 (2020)
Challen, R., Denny, J., Pitt, M., Gompels, L., Edwards, T., Tsaneva-Atanasova, K.: Artificial intelligence, bias and clinical safety. BMJ Quality Safety 28(3), 231–237 (2019)
Chouldechova, A.: Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5(2), 153–163 (2017)
Chowdhury, A., Rosenthal, J., Waring, J., Umeton, R.: Applying self-supervised learning to medicine: review of the state of the art and medical implementations. Informatics 8(3), 59 (2021). (MDPI)
Cirillo, D., Catuara-Solarz, S., Morey, C., Guney, E., Subirats, L., Mellino, S., Gigante, A., Valencia, A., Rementeria, M.J., Chadha, A.S., et al.: Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare. NPJ Digital Med. 3(1), 81 (2020)
Daneshjou, R., Vodrahalli, K., Novoa, R.A., Jenkins, M., Liang, W., Rotemberg, V., Ko, J., Swetter, S.M., Bailey, E.E., Gevaert, O., et al.: Disparities in dermatology ai performance on a diverse, curated clinical image set. Sci. Adv. 8(31), eabq6147 (2022)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Department of Health and Human Services: Artificial intelligence (AI) strategy (2022)
Department of Health and Social Care: £21 million to roll out artificial intelligence across the NHS (2023)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dieterich, W., Mendoza, C., Brennan, T.: COMPAS risk scales: demonstrating accuracy equity and predictive parity. Northpointe Inc. 7(7.4), 1 (2016)
Fazelpour, S., Danks, D.: Algorithmic bias: senses, sources, solutions. Philos. Compass 16(8), e12760 (2021)
Food and Drug Administration: Artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD) action plan. Food Drug Admin., Silver Spring, MD, USA, Tech. Rep. 1 (2021a)
Food and Drug Administration: Good machine learning practice for medical device development: guiding principles (2021b)
Frosch, D.L., May, S.G., Rendle, K.A., Tietbohl, C., Elwyn, G.: Authoritarian physicians and patients’ fear of being labeled ‘difficult’ among key obstacles to shared decision making. Health Aff. 31(5), 1030–1038 (2012)
Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al.: Red teaming language models to reduce harms: methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858 (2022)
García-Méndez, S., De Arriba-Pérez, F., González-Castaño, F.J., Regueiro-Janeiro, J.A., Gil-Castiñeira, F.: Entertainment chatbot for the digital inclusion of elderly people without abstraction capabilities. IEEE Access 9, 75878–75891 (2021)
Genin, K., Grote, T.: Randomized controlled trials in medical ai: a methodological critique. Philos. Med. 2(1), 1–15 (2021)
Gianattasio, K.Z., Prather, C., Glymour, M.M., Ciarleglio, A., Power, M.C.: Racial disparities and temporal trends in dementia misdiagnosis risk in the United States. Alzheimer’s Dement: Transl. Res. Clin. Interventions 5, 891–898 (2019)
Gianfrancesco, M.A., Tamang, S., Yazdany, J., Schmajuk, G.: Potential biases in machine learning algorithms using electronic health record data. JAMA Intern. Med. 178(11), 1544–1547 (2018)
Gramling, R., Stanek, S., Ladwig, S., Gajary-Coots, E., Cimino, J., Anderson, W., Norton, S.A., Aslakson, R.A., Ast, K., Elk, R., et al.: Feeling heard and understood: a patient-reported quality measure for the inpatient palliative care setting. J. Pain Symptom Manage. 51(2), 150–154 (2016)
Grefenstette, G.: Tokenization. In: Syntactic Wordclass Tagging, pp. 117–133 (1999)
Groh, M., Harris, C., Soenksen, L., Lau, F., Han, R., Kim, A., Koochek, A., Badri, O.: Evaluating deep neural networks trained on clinical images in dermatology with the Fitzpatrick 17k dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1820–1828 (2021)
Grote, T., Keeling, G.: On algorithmic fairness in medical practice. Camb. Q. Healthc. Ethics 31(1), 83–94 (2022)
Grote, T., Keeling, G.: Enabling fairness in healthcare through machine learning. Ethics Inf. Technol. 24(3), 39 (2022)
Hall, W.J., Chapman, M.V., Lee, K.M., Merino, Y.M., Thomas, T.W., Payne, B.K., Eng, E., Day, S.H., Coyne-Beasley, T.: Implicit racial/ethnic bias among health care professionals and its influence on health care outcomes: a systematic review. Am. J. Public Health 105(12), e60–e76 (2015)
Halpern, S.D., Loewenstein, G., Volpp, K.G., Cooney, E., Vranas, K., Quill, C.M., McKenzie, M.S., Harhay, M.O., Gabler, N.B., Silva, T., et al.: Default options in advance directives influence how patients set goals for end-of-life care. Health Aff. 32(2), 408–417 (2013)
Hasan, O., Meltzer, D.O., Shaykevich, S.A., Bell, C.M., Kaboli, P.J., Auerbach, A.D., Wetterneck, T.B., Arora, V.M., Zhang, J., Schnipper, J.L.: Hospital readmission in general medicine patients: a prediction model. J. Gen. Intern. Med. 25, 211–219 (2010)
Haug, C.J., Drazen, J.M.: Artificial intelligence and machine learning in clinical medicine, 2023. N. Engl. J. Med. 388(13), 1201–1208 (2023)
Hedden, B.: On statistical criteria of algorithmic fairness. Philos. Public Aff. 49(2), 209–231 (2021)
Hellström, T., Dignum, V., Bensch, S.: Bias in machine learning–what is it good for? arXiv preprint arXiv:2004.00686, (2020)
Huang, K., Altosaar, J., Ranganath, R.: ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342 (2019)
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12), 1–38 (2023)
Jiang, H., Nachum, O.: Identifying and correcting label bias in machine learning. In: International Conference on Artificial Intelligence and Statistics, pages 702–712. PMLR, (2020)
Karystianis, G., Cabral, R.C., Han, S.C., Poon, J., Butler, T.: Utilizing text mining, data linkage and deep learning in police and health records to predict future offenses in family and domestic violence. Front. Digital Health 3, 602683 (2021)
Kelly, B.S., Judge, C., Bollard, S.M., Clifford, S.M., Healy, G.M., Aziz, A., Mathur, P., Islam, S., Yeom, K.W., Lawlor, A., et al.: Radiology artificial intelligence: a systematic review and evaluation of methods (raise). Eur. Radiol. 32(11), 7998–8007 (2022)
Kleinberg, J., Mullainathan, S., Raghavan, M.: Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807 (2016)
Laurençon, H., Saulnier, L., Wang, T., Akiki, C., Villanova del Moral, A., Le Scao, T., Von Werra, L., Mou, C., González Ponferrada, E., Nguyen, H., et al.: The BigScience ROOTS corpus: a 1.6 TB composite multilingual dataset. Adv. Neural Inf. Process. Syst. 35, 31809–31826 (2022)
Lee, C.S., Lee, A.Y.: Clinical applications of continual learning machine learning. Lancet Digital Health 2(6), e279–e281 (2020)
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)
Lin, C., Bethard, S., Dligach, D., Sadeque, F., Savova, G., Miller, T.A.: Does BERT need domain adaptation for clinical negation detection? J. Am. Med. Inform. Assoc. 27(4), 584–591 (2020)
Liu, Q., Kusner, M. J., Blunsom, P.: A survey on contextual embeddings. arXiv preprint arXiv:2003.07278, (2020)
McNeil, B.J., Pauker, S.G., Sox, H.C., Jr., Tversky, A.: On the elicitation of preferences for alternative therapies. N. Engl. J. Med. 306(21), 1259–1262 (1982)
Mitsios, J.P., Ekinci, E.I., Mitsios, G.P., Churilov, L., Thijs, V.: Relationship between glycated hemoglobin and stroke risk: a systematic review and meta-analysis. J. Am. Heart Assoc. 7(11), e007858 (2018)
Mosteiro, P., Rijcken, E., Zervanou, K., Kaymak, U., Scheepers, F., Spruit, M.: Machine learning for violence risk assessment using Dutch clinical notes. arXiv preprint arXiv:2204.13535 (2022)
Norori, N., Hu, Q., Aellen, F.M., Faraci, F.D., Tzovara, A.: Addressing bias in big data and ai for health care: a call for open science. Patterns 2(10), 100347 (2021)
Norton, S.A., Tilden, V.P., Tolle, S.W., Nelson, C.A., Eggman, S.T.: Life support withdrawal: communication and conflict. Am. J. Crit. Care 12(6), 548–555 (2003)
Obermeyer, Z., Powers, B., Vogeli, C., Mullainathan, S.: Dissecting racial bias in an algorithm used to manage the health of populations. Science 366(6464), 447–453 (2019)
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Adv. Neural. Inf. Process. Syst. 35, 27730–27744 (2022)
Panch, T., Mattie, H., Atun, R.: Artificial intelligence and algorithmic bias: implications for health systems. J. Global Health 149 (2019)
Peng, Y., Yan, S., Lu, Z.: Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474 (2019)
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., Irving, G.: Red teaming language models with language models. arXiv preprint arXiv:2202.03286, (2022)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(1), 5485–5551 (2020)
Rahimi, S., Oktay, O., Alvarez-Valle, J., Bharadwaj, S.: Addressing the exorbitant cost of labeling medical images with active learning. In: International Conference on Machine Learning in Medical Imaging and Analysis, p. 1 (2021)
Rajkomar, A., Hardt, M., Howell, M.D., Corrado, G., Chin, M.H.: Ensuring fairness in machine learning to advance health equity. Ann. Intern. Med. 169(12), 866–872 (2018)
Rajkomar, A., Dean, J., Kohane, I.: Machine learning in medicine. N. Engl. J. Med. 380(14), 1347–1358 (2019)
Rasmy, L., Xiang, Y., Xie, Z., Tao, C., Zhi, D.: Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digital Med. 4(1), 86 (2021)
Ross, A.B., Kalia, V., Chan, B.Y., Li, G.: The influence of patient race on the use of diagnostic imaging in United States emergency departments: data from the National Hospital Ambulatory Medical Care Survey. BMC Health Serv. Res. 20(1), 1–10 (2020)
Secinaro, S., Calandra, D., Secinaro, A., Muthurangu, V., Biancone, P.: The role of artificial intelligence in healthcare: a structured literature review. BMC Med. Inform. Decis. Mak. 21, 1–23 (2021)
Shamout, F., Zhu, T., Clifton, D.A.: Machine learning for clinical outcome prediction. IEEE Rev. Biomed. Eng. 14, 116–126 (2020)
Shang, J., Ma, T., Xiao, C., Sun, J.: Pre-training of graph augmented transformers for medication recommendation. arXiv preprint arXiv:1906.00346, (2019)
Sheng, E., Chang, K.-W., Natarajan, P., Peng, N.: The woman worked as a babysitter: on biases in language generation. arXiv preprint arXiv:1909.01326 (2019)
Singhal, K., Azizi, S., Tu, T., Mahdavi, S.S., Wei, J., Chung, H.W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., et al.: Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138 (2022)
Sirrianni, J., Sezgin, E., Claman, D., Linwood, S.L.: Medical text prediction and suggestion using generative pretrained transformer models with dental medical notes. Methods Inf. Med. 61(05/06), 195–200 (2022)
Stephenson, J.: Racial barriers may hamper diagnosis, care of patients with alzheimer disease. JAMA 286(7), 779–780 (2001)
Sun, M., Oliwa, T., Peek, M.E., Tung, E.L.: Negative patient descriptors: documenting racial bias in the electronic health record. Health Aff. 41(2), 203–211 (2022)
Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al.: LaMDA: language models for dialog applications. arXiv preprint arXiv:2201.08239 (2022)
Tschandl, P., Rosendahl, C., Akay, B.N., Argenziano, G., Blum, A., Braun, R.P., Cabo, H., Gourhant, J.-Y., Kreusch, J., Lallas, A., et al.: Expert-level diagnosis of nonpigmented skin cancer by combined convolutional neural networks. JAMA Dermatol. 155(1), 58–65 (2019)
Uthoff, J., Nagpal, P., Sanchez, R., Gross, T.J., Lee, C., Sieren, J.C.: Differentiation of non-small cell lung cancer and histoplasmosis pulmonary nodules: insights from radiomics model performance compared with clinician observers. Translational Lung Cancer Res. 8(6), 979 (2019)
van Wezel, M.M., Croes, E.A., Antheunis, M.L.: “I’m here for you”: can social chatbots truly support their users? A literature review. In: Chatbot Research and Design: 4th International Workshop, CONVERSATIONS 2020, Virtual Event, November 23–24, 2020, Revised Selected Papers 4, pp. 96–113. Springer (2021)
Wang, L., Mujib, M. I., Williams, J., Demiris, G., Huh-Yoo, J.: An evaluation of generative pre-training model-based therapy chatbot for caregivers. arXiv preprint arXiv:2107.13115, (2021)
Ware, O.R., Dawson, J.E., Shinohara, M.M., Taylor, S.C.: Racial limitations of fitzpatrick skin type. Cutis 105(2), 77–80 (2020)
Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al.: Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, (2021)
Weiss, K., Khoshgoftaar, T.M., Wang, D.: A survey of transfer learning. J. Big data 3(1), 1–40 (2016)
Willemink, M.J., Koszek, W.A., Hardell, C., Wu, J., Fleischmann, D., Harvey, H., Folio, L.R., Summers, R.M., Rubin, D.L., Lungren, M.P.: Preparing medical imaging data for machine learning. Radiology 295(1), 4–15 (2020)
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X.V., et al.: OPT: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)
Zhao, W., Katzmarzyk, P.T., Horswell, R., Wang, Y., Johnson, J., Hu, G.: Sex differences in the risk of stroke and HbA1c among diabetic patients. Diabetologia 57, 918–926 (2014)
Zhou, K., Ethayarajh, K., Jurafsky, D.: Frequency-based distortions in contextualized word embeddings. arXiv preprint arXiv:2104.08465, (2021)
Acknowledgement
The author is grateful to Michael Howell, Heather Cole-Lewis, Lisa Lehmann, Diane Korngiebel, Thomas Douglas, Bakul Patel, Kate Weber, and Rachel Gruner for helpful comments, alongside participants at the Workshop on the Ethics of Influence at the Uehiro Centre for Practical Ethics at the University of Oxford.
Funding
This study was funded by Google LLC and/or a subsidiary thereof (‘Google’).
Ethics declarations
Conflict of interest
The author(s) are current or former employees of Google LLC and own stock as part of the standard compensation package.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Keeling, G. Algorithmic bias, generalist models, and clinical medicine. AI Ethics (2023). https://doi.org/10.1007/s43681-023-00329-x