Abstract
We aimed to assess large language models (LLMs)—ChatGPT (GPT-3.5 and GPT-4), BARD, and Bing—for accuracy and completeness when answering methotrexate (MTX)-related questions about treating rheumatoid arthritis. We used 23 questions on MTX concerns drawn from an earlier study. These questions were entered into each LLM, and two reviewers rated the generated responses on Likert scales for accuracy and completeness. The GPT models achieved a 100% correct-answer rate, while BARD and Bing each scored 73.91%. In terms of accuracy (completely correct responses), GPT-4 achieved 100%, GPT-3.5 scored 86.96%, and BARD and Bing each scored 60.87%. BARD produced 17.39% incorrect responses and 8.7% non-responses, while Bing recorded 13.04% incorrect responses and 13.04% non-responses. The ChatGPT models produced significantly more accurate responses than Bing in the “mechanism of action” category, and the GPT-4 model showed significantly higher accuracy than BARD in the “side effects” category. There were no statistically significant differences among the models in the “lifestyle” category. For completeness (fully comprehensive responses), GPT-4 achieved 100%, followed by GPT-3.5 at 86.96%, BARD at 60.86%, and Bing at 0%. In the “mechanism of action” category, both ChatGPT models and BARD produced significantly more comprehensive outputs than Bing. In the “side effects” and “lifestyle” categories, the ChatGPT models showed significantly higher completeness than Bing. The GPT models, particularly GPT-4, demonstrated superior performance in providing accurate and comprehensive patient information about MTX use. However, the study also identified inaccuracies and shortcomings in the generated responses.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
The authors thank Assoc. Prof. Burhan Coskun, MD, for proofreading.
Funding
The work referenced in this article did not receive any specific funding from governmental, corporate, or nonprofit sources.
Author information
Authors and Affiliations
Contributions
BNC: study design, data collection, data interpretation, manuscript preparation, literature search. BY: data collection, data interpretation, manuscript preparation. GO: statistical analysis, data interpretation, manuscript preparation. ED: data interpretation, critically reviewing the manuscript for key intellectual content. YP: study design, critically reviewing the manuscript for key intellectual content. All authors reviewed and approved the final version of the manuscript before publication. All authors have agreed to be accountable for all aspects of the work and to ensure that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest to disclose.
Ethics approval
Ethics committee approval was not required for this study because neither human nor animal data were used.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Coskun, B.N., Yagiz, B., Ocakoglu, G. et al. Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use. Rheumatol Int 44, 509–515 (2024). https://doi.org/10.1007/s00296-023-05473-5