Abstract
Purpose
To compare the performance of ChatGPT-4 and ChatGPT-3.5 on the Taiwan urology board examination (TUBE), focusing on answer accuracy, explanation consistency, and uncertainty-management tactics for minimizing score penalties from incorrect responses across 12 urology domains.
Methods
A total of 450 multiple-choice questions from the TUBE (2020–2022) were presented to both models. Three urologists independently assessed the correctness and consistency of each response. Accuracy was defined as the proportion of correct answers; consistency as the proportion of responses whose explanations were logical and coherent. A penalty-reduction experiment additionally compared prompt variations designed to limit losses from incorrect answers. Univariate logistic regression was applied for subgroup comparisons.
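To make the subgroup comparison concrete, here is a minimal sketch in Python (using statsmodels) of how a univariate logistic regression of response correctness on a model indicator yields an odds ratio with a 95% CI. The data values and variable names are illustrative placeholders, not the study's actual records or analysis code.

import numpy as np
import statsmodels.api as sm

# Toy data: 1 = correct answer, 0 = incorrect (placeholder values)
correct = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
# Model indicator: 1 = ChatGPT-4, 0 = ChatGPT-3.5
is_gpt4 = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

X = sm.add_constant(is_gpt4)            # intercept + model indicator
fit = sm.Logit(correct, X).fit(disp=0)  # univariate logistic regression

odds_ratio = np.exp(fit.params[1])           # OR for ChatGPT-4 vs. 3.5
ci_low, ci_high = np.exp(fit.conf_int()[1])  # 95% CI on the OR scale
print(f"OR = {odds_ratio:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")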
Results
ChatGPT-4 achieved an overall accuracy of 57.8%, with annual accuracies of 64.7% (2020), 58.0% (2021), and 50.7% (2022), significantly surpassing ChatGPT-3.5 (33.8%; OR = 2.68, 95% CI [2.05–3.52]). On accuracy alone it would have passed the TUBE written exams, but penalties for incorrect answers pulled its final score below the passing threshold. Its accuracy declined over time and varied across the 12 urological domains, with domains whose knowledge is updated more frequently showing lower accuracy (53.2% vs. 62.2%, OR = 0.69, p = 0.05). A high consistency rate of 91.6% across all domains indicates that explanations were delivered coherently and logically. The simple prompt outperformed strategy-based prompts in accuracy (60% vs. 40%, p = 0.016), highlighting ChatGPT's inability to accurately self-assess uncertainty and its tendency toward overconfidence, which may hinder medical decision-making.
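The gap between raw accuracy and the final score follows from negative marking. The sketch below illustrates how a per-wrong-answer penalty depresses the final score below raw accuracy; the penalty weight and pass mark used here are assumptions for illustration only, as the abstract does not state the TUBE's exact scoring rule.

def final_score(n_correct: int, n_total: int, penalty: float = 0.25) -> float:
    """Percent score after deducting `penalty` points per wrong answer."""
    n_wrong = n_total - n_correct
    raw = n_correct - penalty * n_wrong
    return max(0.0, 100.0 * raw / n_total)

# At 57.8% raw accuracy on 450 questions, the penalized score falls
# well below the raw accuracy (and below a hypothetical 60% pass mark):
n_correct = round(0.578 * 450)      # 260 correct answers
print(final_score(n_correct, 450))  # ~47.2, vs. 57.8 raw accuracy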
Conclusions
ChatGPT-4's high accuracy and consistent explanations on the urology board examination demonstrate its potential for medical information processing. However, its weak self-assessment and overconfidence warrant caution in its application, especially by inexperienced users. These findings underscore the need for continued development of urology-specific AI tools.
Data availability
The data sets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.
Acknowledgements
We would like to acknowledge the efforts of the three independent raters who evaluated ChatGPT's responses.
Author information
Contributions
C.Y.T. contributed to conception and design, acquisition of data, analysis and interpretation of data, drafting of the manuscript, and statistical analysis. S.J.H. and H.H.H. contributed to acquisition of data and to analysis and interpretation. J.H.D. and Y.Y.H. contributed to study conception and design. P.Y.C. contributed to analysis and visualization of data, critical revision of the manuscript, and supervision.
Ethics declarations
Conflict of interest
None of the contributing authors has any conflict of interest, including specific financial interests or relationships and affiliations relevant to the subject matter or materials discussed in the manuscript. This study did not receive financial support from any third party.
Ethical approval
Institutional review board approval was not needed, since human participants were not involved.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tsai, CY., Hsieh, SJ., Huang, HH. et al. Performance of ChatGPT on the Taiwan urology board examination: insights into current strengths and shortcomings. World J Urol 42, 250 (2024). https://doi.org/10.1007/s00345-024-04957-8