Introduction

Thyroid hormones play a critical role in many vital processes, including regulation of the body’s basal metabolic rate, growth, and neural development. Hypothyroidism is characterized by thyroid hormone deficiency, and in iodine-sufficient settings its most common cause is Hashimoto’s thyroiditis. During pregnancy, the risk of hypo- and hyperthyroidism increases regardless of medical history1,2. Moreover, overt hypothyroidism in particular has negative effects on both pregnancy and fetal health3,4. Therefore, the treatment and follow-up of hypothyroidism during pregnancy are crucial. Levothyroxine (LT4) is used for treatment when indicated by maternal thyroid function and thyroid peroxidase antibody (TPOAb) levels. The serum thyroid stimulating hormone (TSH) levels of pregnant women treated with LT4 should be closely monitored throughout pregnancy to detect under- or overtreatment2,5.

Artificial intelligence (AI) is a field within computer science dedicated to developing intelligent machines capable of emulating human-like thinking and behavior. AI systems are engineered to acquire knowledge from their surroundings and make decisions by processing the information they gather6. Chat Generative Pretrained Transformer (ChatGPT), whose first version was released by OpenAI in November 2022, is a large language model trained on a huge amount of text data and designed to provide information on a wide range of topics over the internet7. In March 2023, the latest version, ChatGPT-4, was introduced with a new feature: image evaluation. ChatGPT performs well on both open-ended questions and multiple-choice single-answer questions in the field of medicine8,9. However, Chen et al.10 reported that ChatGPT was generally unreliable for cancer treatment recommendations. In addition, Dash et al.11 evaluated the use of GPT-3.5 and GPT-4 to support real-world information needs in healthcare delivery and found that, although the responses were largely devoid of harm, less than 20% of them agreed with the response from an informatics consultation service. Despite its promising potential, ChatGPT still has deficiencies in accuracy and knowledge.

The number of studies evaluating ChatGPT on endocrinological diseases is limited, and to the best of our knowledge, none has addressed hypothyroidism during pregnancy12. Thus, this study aimed to evaluate the reliability and readability of ChatGPT-4 answers about hypothyroidism during pregnancy using open-ended questions and patient scenarios.

Materials and methods

Question source and processing

In line with the recommendations on hypothyroidism and iodine in the latest guideline of the American Thyroid Association (ATA) on the diagnosis and treatment of thyroid disease during pregnancy and the postpartum period, a total of 19 questions were created in English by two specialists in endocrinology13. Disagreements between the authors were adjudicated by a third specialist in endocrinology. The questions cover the negative pregnancy outcomes, treatment, follow-up strategy, and treatment goals of overt and subclinical hypothyroidism. The questions were designed as open-ended questions and patient scenarios, phrased as if formulated by patients. To remove ambiguity, all questions were grammatically edited by the two authors before being submitted to ChatGPT. An example question and the answer given by ChatGPT-4 are shown in Table 1.

Table 1 Example question and response from ChatGPT-4.

ChatGPT

GPT is a large language model (LLM) that generates text following a given input prompt. ChatGPT is a version of GPT-3.5 (later GPT-4) that was fine-tuned and optimized for conversation14. Its database contains information up to the year 2021, and unlike search engines, ChatGPT generates text word by word15. In this study, the latest version, ChatGPT-4, was used to answer the questions because it has been reported to outperform the previous version16. Each question was asked twice on different days to test for variation in answers. In addition, a new chat session was opened in ChatGPT for each question to prevent retention bias. Reproducibility was assessed by assigning each response to one of two categories based on its grade, i.e., according to the presence or absence of misinformation. The first category included the grades “comprehensive” and “correct but inadequate”, while the second category comprised “some correct and some incorrect” and “completely incorrect”. A question whose two responses differed in accuracy and therefore fell into different categories was defined as non-repeatable. This method of evaluating the reproducibility of responses has been used in similar studies before17. All questions and ChatGPT-4 answers are available online in “Supplementary materials”.
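The reproducibility rule can be summarized as follows. The sketch below is only illustrative: the grade labels come from the text, but the function names and data structure are hypothetical and do not represent the authors’ actual procedure.

```python
# Illustrative sketch of the reproducibility check described above. The grade labels
# are taken from the text; the function names are hypothetical.
ACCURATE = {"comprehensive", "correct but inadequate"}                     # category 1: no misinformation
INACCURATE = {"some correct and some incorrect", "completely incorrect"}   # category 2: contains misinformation

def accuracy_category(grade: str) -> int:
    """Map a response grade to category 1 (accurate) or category 2 (misinformation)."""
    return 1 if grade in ACCURATE else 2

def is_repeatable(grade_session1: str, grade_session2: str) -> bool:
    """A question is repeatable if the answers from both sessions fall in the same category."""
    return accuracy_category(grade_session1) == accuracy_category(grade_session2)

# Example: both grades are in category 1, so the question counts as repeatable.
print(is_repeatable("comprehensive", "correct but inadequate"))  # True
```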

Evaluation of reliability, quality, and readability

Each ChatGPT-4 response was independently evaluated by two endocrinologists with at least 10 years of experience, based on the ATA guidelines and clinical practice. Responses were considered misleading if they contained at least one misleading statement. Reliability and quality scores were assigned by both authors, and a consensus score was then determined (Fig. 1).

Figure 1
figure 1

Flow chart of question selection and consensus score.

The DISCERN scale is a three-part instrument used in previous studies to evaluate the reliability and quality of online health information18,19. The first section consists of eight questions assessing the reliability of the publication. The second section consists of seven questions assessing the quality of the information about treatment options. The last section addresses the overall quality of the publication as a source of information about treatment options. In this study, however, because not all of our questions were related to treatment, we used a modified DISCERN (mDISCERN) scale, consisting of only the first part of the DISCERN scale, to evaluate the reliability of ChatGPT responses (Table 2). Each mDISCERN item was scored as 1 for a “no” answer, 2–4 for a partial answer, and 5 for a “yes” answer, and the item scores were summed. A total score below 40% of the maximum (8–15) was graded as poor, 40–79% (16–31) as fair, and 80% or above (32–40) as good19.
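For clarity, the scoring rule can be expressed as a short helper. This is a sketch reflecting the thresholds stated above, not the authors’ implementation; the function name is hypothetical.

```python
# Hypothetical helper reflecting the mDISCERN scoring rule above. Each of the 8
# reliability items is scored 1 (no), 2-4 (partial), or 5 (yes); totals range 8-40.
def mdiscern_grade(item_scores: list[int]) -> tuple[int, str]:
    assert len(item_scores) == 8 and all(1 <= s <= 5 for s in item_scores)
    total = sum(item_scores)
    if total <= 15:          # below 40% of the maximum score
        grade = "poor"
    elif total <= 31:        # 40-79%
        grade = "fair"
    else:                    # 80% and above (32-40)
        grade = "good"
    return total, grade

print(mdiscern_grade([4, 4, 4, 4, 3, 4, 4, 4]))  # (31, 'fair')
```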

Table 2 Contents of mDISCERN, GQS and Readability indexes.

The Global Quality Scale (GQS), used in previous studies, was applied to assess the quality of ChatGPT responses18. On this scale, 1 point indicates poor quality and 5 points indicate excellent quality (Table 2). The scale is also used for quality classification: 1–2 points represent low quality, 3 points moderate quality, and 4–5 points high quality20.

Finally, the readability of ChatGPT responses was evaluated using the widely used Flesch Reading Ease (FRE) score, Flesch–Kincaid grade level (FKGL), Gunning Fog Index (GFI), Coleman–Liau Index (CLI), and Simple Measure of Gobbledygook (SMOG) tools (Table 2)21. The FRE score ranges from 0 to 100; the higher the score, the easier the passage is to read. The FKGL corresponds to the United States school grade level, and the GFI estimates the level of education required to understand the text. The CLI likewise corresponds to the US reading grade level, while the SMOG estimates the years of education an average person needs to understand a piece of writing. Accordingly, the FRE score decreases as the FKGL, CLI, SMOG, and GFI scores increase. For texts targeting the general public, the recommended minimum FRE score is 60, while the other four readability indexes should be below 7. An FRE score of 60–70 corresponds to a grade level of 8–9 (approximately 13–15 years of age). The five readability scores were calculated by copying ChatGPT’s response to each question into a free online readability calculator22.
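The same five indexes can also be computed programmatically, for example with the open-source Python package textstat. This is shown only as an illustrative equivalent of the online calculator used in the study; the sample text is invented for demonstration.

```python
# Illustrative computation of the five readability indexes with the "textstat" package
# (the study itself used a free online calculator).
import textstat

def readability_report(text: str) -> dict:
    return {
        "FRE":  textstat.flesch_reading_ease(text),   # 0-100, higher = easier to read
        "FKGL": textstat.flesch_kincaid_grade(text),  # US school grade level
        "GFI":  textstat.gunning_fog(text),           # years of education required
        "CLI":  textstat.coleman_liau_index(text),    # US reading grade level
        "SMOG": textstat.smog_index(text),            # years of education required
    }

sample = (
    "Levothyroxine is the standard treatment for overt hypothyroidism in pregnancy. "
    "Serum TSH should be checked about every four weeks in the first half of pregnancy. "
    "The dose is adjusted to keep TSH within the trimester-specific reference range."
)
print(readability_report(sample))
```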

Ethical approval

Since ChatGPT-4 used in the present study is a public application, and there is no human/animal participant, ethics committee approval was not required.

Statistical analysis

The agreement between the two authors who independently evaluated the ChatGPT responses was tested using the weighted Cohen’s kappa coefficient. The Shapiro–Wilk test was used to assess the normality of the distributions. Normally distributed data were presented as mean ± SD, and non-normally distributed data as median (minimum–maximum). Categorical variables were expressed as numbers and percentages. Relationships between variables were evaluated with the Pearson correlation test for parametric variables and the Spearman correlation test for non-parametric variables. All statistical analyses were performed using SPSS version 22 (IBM SPSS Statistics for Windows, Armonk, NY: IBM Corp).
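The workflow can be reproduced in open-source software as follows. This is a minimal sketch: the study used SPSS, the arrays below are hypothetical placeholder data, and the choice of quadratic weighting for the kappa is an assumption, since the weighting scheme is not specified above.

```python
# Minimal sketch of the statistical workflow described above, on placeholder data.
import numpy as np
from scipy.stats import shapiro, pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

rater1 = [4, 3, 5, 4, 2, 4, 5, 3]   # hypothetical ordinal scores from author 1
rater2 = [4, 4, 5, 4, 2, 3, 5, 3]   # hypothetical ordinal scores from author 2

# Inter-rater agreement for ordinal scores: weighted Cohen's kappa
# (quadratic weighting is assumed here; linear weighting is also common).
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")

# Normality (Shapiro-Wilk) decides between Pearson and Spearman correlation.
x = np.array([30, 28, 33, 31, 26, 29, 34, 27], dtype=float)  # e.g. mDISCERN totals
y = np.array([4, 4, 5, 4, 3, 4, 5, 3], dtype=float)          # e.g. GQS scores
normal = shapiro(x).pvalue > 0.05 and shapiro(y).pvalue > 0.05
r, p = pearsonr(x, y) if normal else spearmanr(x, y)

print(f"weighted kappa = {kappa:.2f}, r = {r:.2f}, p = {p:.3f}")
```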

Results

Reliability and quality

The weighted Cohen’s kappa coefficient between the two authors was 0.76 and 0.87 for reliability in the first and second sessions, respectively, and 0.81 and 0.79, respectively, for quality.

Responses were highly consistent between the first and second sessions, and all ended with advice to consult a healthcare professional. No misleading information was found in any of the ChatGPT-4 responses. In terms of reproducibility, ChatGPT provided reproducible responses for all questions (100%).

The mean mDISCERN score after consensus for the answers to the 19 questions was 30.26 ± 3.14, and the median GQS score was 4 (2–4) (Table 3). The score distribution of ChatGPT-4 responses according to the mDISCERN scale and the quality classification is presented in Table 4. Most of the answers showed moderate (78.9%) or good (21.1%) reliability. In terms of quality, the responses were mostly of high quality (84.2%), followed by moderate quality (10.5%). Only one response was of low quality: in the patient scenario concerning the necessity of levothyroxine treatment in a 28-week pregnant patient with isolated free T4 deficiency, the response contained no information about iodine deficiency in pregnancy.

Table 3 Summary of mDISCERN, GQS, and readability results of ChatGPT-4 responses after consensus scoring.
Table 4 Score distribution of ChatGPT-4 responses according to the mDISCERN scale and quality classification.

Readability

The median FRE was 32.20 (13.00–37.10), indicating that the text was difficult to read. The level of education required to understand the answers was most often college level (9 [47.3%]), followed by college graduate (5 [26.3%]). All findings are summarized in Table 3.

Correlation analysis

The mDISCERN score was moderately positively correlated with the GQS. However, neither mDISCERN nor GQS was correlated with the readability formulas. As expected, FRE was significantly negatively correlated with the other readability formulas. The correlation analysis between the reliability, quality, and readability scores of ChatGPT-4 responses is shown in detail in Table 5.

Table 5 Correlation analysis between reliability, quality, and readability scores of ChatGPT-4 responses.

Discussion

In the present study, the reliability and readability of ChatGPT-4 responses about hypothyroidism during pregnancy were evaluated using open-ended questions and patient scenarios. Our results revealed that ChatGPT-4 responses had moderate to good reliability and high quality. In terms of readability, the responses were difficult or very difficult to read and mostly required a college-level education to understand.

Overt maternal hypothyroidism during pregnancy is associated with an increased risk of preterm birth, low birth weight, pregnancy loss, and lower IQ in the offspring. Subclinical hypothyroidism may lead to varying degrees of adverse pregnancy outcomes. While overt hypothyroidism during pregnancy requires LT4 treatment, it is recommended that the need for treatment in subclinical hypothyroidism be decided by individual evaluation of the patient13. Therefore, the management of hypothyroidism in pregnancy is of great importance in terms of treatment and follow-up. Patients often have concerns about their disease and find it difficult to obtain accurate information, and the internet is increasingly their first source of reference for health-related information23. Unlike search engines (e.g., Google), ChatGPT is an advanced AI language model that has recently become increasingly popular15. Therefore, the reliability and readability of ChatGPT responses are important. Aiming to evaluate the accuracy and reliability of the ChatGPT model in medical responses, Johnson et al.24 created 284 medical questions that were subjectively categorized as easy, medium, or difficult by 33 physicians from 17 specialties and obtained answers from ChatGPT; they reported that ChatGPT produced mostly accurate information.

ChatGPT has a wide range of uses for patients and clinicians, from basic fact-based questions to complex clinical questions. In the United States Medical Licensing Examination (USMLE), ChatGPT was found to perform at or near the passing threshold on multiple-choice single-answer questions8. Similarly, ChatGPT answers to multiple-choice single-answer questions from the Medical School Admission Test (MSAT) were reported to be successful25. In contrast, Suchman et al.26 used ChatGPT-3 and ChatGPT-4 to answer the 2022 and 2021 American College of Gastroenterology self-assessment tests and found that both models failed to pass the multiple-choice single-answer tests; they therefore do not recommend using the models in their current form for medical education in gastroenterology.

Upon examining studies in which ChatGPT’s answers to open-ended patient questions were evaluated, we found a study on blepharoplasty that reported that ChatGPT-4 generally provided fast and sound medical advice without using excessive medical jargon27. On self-management and education of diabetes in the field of endocrinology, Sng et al.28 evaluated ChatGPT’s answers to questions under four main headings: diet and exercise, hypoglycemia and hyperglycemia education, insulin storage, and insulin administration. They found that while ChatGPT generally provided easy and accurate answers, it could also produce some incorrect statements, such as not accepting that insulin analogs can be stored at room temperature after opening28. Similarly, another study evaluated patient questions about obesity surgery using ChatGPT and reported that 131 (86.8%) of 151 questions were answered comprehensively29. Chen et al.10 reported that ChatGPT responses were generally unreliable with respect to cancer treatment recommendations. The high quality and moderate-to-good reliability of ChatGPT-4 responses related to hypothyroidism during pregnancy in the present study are similar to most studies in the literature27,29. Unlike in the studies by Sng et al.28 and Chen et al.10, no misleading information was found in the present study, possibly because we used the newer version of ChatGPT (ChatGPT-4). Because ChatGPT is a trained language model that learns with human feedback, its newer version draws on a larger database. In contrast to the studies by Sng et al.28 and Cox et al.27, the present study concludes that ChatGPT-4’s responses are not easy to understand, indicating that patients must have a certain level of education to understand the answers.

A previous study investigated the use of ChatGPT in cases requiring a more advanced multidisciplinary approach by comparing the recommendations of a multidisciplinary tumor board with those of ChatGPT for primary breast cancer cases30. The agreement rate was 64.2%, which may be attributed to the more complex patient scenarios in the questions or to the fact that treatment options in oncology change very quickly while the ChatGPT database is limited to information up to the year 2021. Lukac et al. noted that ChatGPT mostly provides general responses and that, for individual therapy, the current version does not provide specific recommendations for the treatment of patients with primary breast cancer, so it cannot replace a multidisciplinary board in the real world30. Another study evaluating ChatGPT models on complex clinical questions requiring a multidisciplinary approach was conducted by Hirosawa et al.31, who requested differential diagnoses for 52 clinical vignettes in internal medicine and evaluated the accuracy of the top 5 and top 10 answers given by ChatGPT. The accuracy rate of ChatGPT was reported as 80%. The authors emphasized the potential utility of ChatGPT-4 as a complementary tool for physicians, despite the fact that the evaluation was derived from a limited dataset of case reports from a single source31.

The present study determined that ChatGPT-4 responses are difficult or very difficult to read and mostly require a college-level education to understand. Moreover, the correlation analysis suggested that readability was not related to reliability or quality. These results are similar to those of a study by Momenaei et al.32 on the readability and suitability of the responses generated by ChatGPT-4 about the surgical treatment of retinal diseases; in that study, however, the questions were prepared for clinicians, so an advanced reading level is acceptable. In the present study, questions targeting both patients and clinicians were created, so the development of ChatGPT-4 should aim to increase readability. Alternatively, separate user interfaces for patients and clinicians could be considered.

Our study had some limitations. First, responses were analyzed only in English, so the results cannot be generalized to all languages; although ChatGPT-4 is available in many languages, studies in languages other than English are few. Second, there is a lack of standardized tools for evaluating ChatGPT-4 responses, which creates heterogeneity when comparing studies. Third, ChatGPT does not give a single fixed answer to a question, although consistency was found between the answers in different sessions. Finally, the database contains information only up to 2021; as the model is not connected to the internet, it is currently unable to conduct an updated literature review.

Conclusion

In conclusion, the present study reports that ChatGPT-4 responses about hypothyroidism during pregnancy do not contain misleading information and have moderate to good reliability. However, the responses are difficult or very difficult to read and require a college-level education to understand, which will limit easy use by the general public. Although the answers were considered safe in our study and in most ChatGPT-4 studies in the literature, clinicians and patients should be aware of the model’s limitations. Although ChatGPT-4 has significant potential, at present it can serve only as an auxiliary source of information for counseling about hypothyroidism in pregnancy. Further studies in different languages and with more comprehensive questions are needed to better evaluate ChatGPT-4 performance. Finally, efforts should be made to improve the reliability and readability of ChatGPT and to develop domain-specific models.