Introduction

Thyroid hormones play a critical role in many vital processes, including regulation of the body’s basal metabolic rate, growth, and neural development. Hypothyroidism is characterized by thyroid hormone deficiency, and in iodine-sufficient settings its most common cause is Hashimoto’s thyroiditis. During pregnancy, the risk of hypo- and hyperthyroidism increases regardless of medical history1,2. Moreover, overt hypothyroidism in particular has negative effects on both pregnancy and fetal health3,4. Therefore, the treatment and follow-up of hypothyroidism during pregnancy are crucial. Levothyroxine (LT4) is used for treatment when indicated by maternal thyroid function and thyroid peroxidase antibody (TPOAb) levels. The serum thyroid stimulating hormone (TSH) levels of pregnant women treated with LT4 should be closely monitored throughout pregnancy to detect under- or overtreatment2,5.

Artificial intelligence (AI) is a field within computer science dedicated to developing intelligent machines capable of emulating human-like thinking and behavior. AI systems are engineered to acquire knowledge from their surroundings and make decisions by processing the information they gather6. Chat Generative Pretrained Transformer (ChatGPT), whose first version was released by OpenAI in November 2022, is a large language model trained on a huge amount of text data and designed to provide information on a wide range of topics over the internet7. In March 2023, the latest version, ChatGPT-4, was introduced with a new feature: image evaluation. ChatGPT performs well on both open-ended questions and multiple-choice single-answer questions in the field of medicine8,9. However, Chen et al.10 reported that ChatGPT was generally unreliable for cancer treatment recommendations. In addition, Dash et al.11 evaluated the use of GPT-3.5 and GPT-4 to support real-world information needs in healthcare delivery and found that, although the responses were largely devoid of harm, less than 20% of them agreed with the response from an informatics consultation service. Despite its promising potential, ChatGPT still has deficiencies in accuracy and knowledge.

The number of studies evaluating ChatGPT on endocrinological diseases is limited, and to the best of our knowledge, none has addressed hypothyroidism during pregnancy12. Thus, this study aimed to evaluate the reliability and readability of ChatGPT-4 answers about hypothyroidism during pregnancy using open-ended questions and patient scenarios.

Materials and methods

Question source and processing

In line with the recommendations on hypothyroidism and iodine in the latest guideline of the American Thyroid Association (ATA) on the diagnosis and treatment of thyroid disease during pregnancy and the postpartum period, a total of 19 questions were created in English by two specialists in endocrinology13. Disagreements between the authors were adjudicated by a third specialist in endocrinology. The questions cover the negative pregnancy outcomes, treatment, follow-up strategy, and treatment goals of overt and subclinical hypothyroidism. The questions were designed as open-ended questions and patient scenarios, phrased as if formulated by patients. To remove ambiguity, all questions were grammatically edited by the two authors before being submitted to ChatGPT. An example question and the answer given by ChatGPT-4 are shown in Table 1.

Table 1 Example question and response from ChatGPT-4.

ChatGPT

GPT is a large language model (LLM) that generates text following a given input prompt. ChatGPT is a version of GPT-3.5 (later GPT-4) that was fine-tuned and optimized for conversation14. Its database contains information up to the year 2021, and unlike search engines, ChatGPT generates text word by word15. In this study, the latest version, ChatGPT-4, was used to answer the questions because it has been reported to outperform the previous version16. Each question was asked twice on different days to test for variation in answers. In addition, a new chat session was opened in ChatGPT for each question to prevent retention bias. Reproducibility was assessed by assigning each response to one of two categories based on its grade, i.e., according to the presence or absence of misinformation. The first category included the grades “comprehensive” and “correct but inadequate”, while the second category comprised “some correct and some incorrect” and “completely incorrect”. A question whose two responses differed in accuracy and therefore fell into different categories was defined as non-repeatable. This method of evaluating the reproducibility of responses has been used in similar studies before17. All questions and ChatGPT-4 answers are available online in “Supplementary materials”.
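The reproducibility rule can be summarized as follows. The sketch below is only illustrative: the grade labels come from the text, but the function names and data structure are hypothetical and do not represent the authors’ actual procedure.

```python
# Illustrative sketch of the reproducibility check described above. The grade labels
# are taken from the text; the function names are hypothetical.
ACCURATE = {"comprehensive", "correct but inadequate"}                     # category 1: no misinformation
INACCURATE = {"some correct and some incorrect", "completely incorrect"}   # category 2: contains misinformation

def accuracy_category(grade: str) -> int:
    """Map a response grade to category 1 (accurate) or category 2 (misinformation)."""
    return 1 if grade in ACCURATE else 2

def is_repeatable(grade_session1: str, grade_session2: str) -> bool:
    """A question is repeatable if the answers from both sessions fall in the same category."""
    return accuracy_category(grade_session1) == accuracy_category(grade_session2)

# Example: both grades are in category 1, so the question counts as repeatable.
print(is_repeatable("comprehensive", "correct but inadequate"))  # True
```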

Evaluation of reliability, quality, and readability

Each ChatGPT-4 response was independently evaluated by two endocrinologists with at least 10 years of experience, based on the ATA guidelines and clinical practice. Responses were considered misleading if they contained at least one misleading statement. Reliability and quality scores were assigned by both authors, and a consensus score was then determined (Fig. 1).

Figure 1
figure 1

Flow chart of question selection and consensus score.

The DISCERN scale is a three-part instrument used in previous studies to evaluate the reliability and quality of online health information18,19. The first section consists of eight questions assessing the reliability of the publication. The second section consists of seven questions assessing the quality of the information about treatment options. The last section addresses the overall quality of the publication as a source of information about treatment options. In this study, however, because not all of our questions were related to treatment, we used a modified DISCERN (mDISCERN) scale, consisting of only the first part of the DISCERN scale, to evaluate the reliability of ChatGPT responses (Table 2). Each mDISCERN item was scored as 1 for a “no” answer, 2–4 for a partial answer, and 5 for a “yes” answer, and the item scores were summed. A total score below 40% of the maximum (8–15) was graded as poor, 40–79% (16–31) as fair, and 80% or above (32–40) as good19.
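For clarity, the scoring rule can be expressed as a short helper. This is a sketch reflecting the thresholds stated above, not the authors’ implementation; the function name is hypothetical.

```python
# Hypothetical helper reflecting the mDISCERN scoring rule above. Each of the 8
# reliability items is scored 1 (no), 2-4 (partial), or 5 (yes); totals range 8-40.
def mdiscern_grade(item_scores: list[int]) -> tuple[int, str]:
    assert len(item_scores) == 8 and all(1 <= s <= 5 for s in item_scores)
    total = sum(item_scores)
    if total <= 15:          # below 40% of the maximum score
        grade = "poor"
    elif total <= 31:        # 40-79%
        grade = "fair"
    else:                    # 80% and above (32-40)
        grade = "good"
    return total, grade

print(mdiscern_grade([4, 4, 4, 4, 3, 4, 4, 4]))  # (31, 'fair')
```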

Table 2 Contents of mDISCERN, GQS and Readability indexes.

The Global Quality Scale (GQS), used in previous studies, was applied to assess the quality of ChatGPT responses18. On this scale, 1 point indicates poor quality and 5 points indicate excellent quality (Table 2). The scale is also used for quality classification: 1–2 points represent low quality, 3 points moderate quality, and 4–5 points high quality20.

Finally, the readability of ChatGPT responses was evaluated using the widely used Flesch Reading Ease (FRE) score, Flesch–Kincaid grade level (FKGL), Gunning Fog Index (GFI), Coleman–Liau Index (CLI), and Simple Measure of Gobbledygook (SMOG) tools (Table 2)21. The FRE score ranges from 0 to 100; the higher the score, the easier the passage is to read. The FKGL corresponds to the United States school grade level, and the GFI estimates the level of education required to understand the text. The CLI likewise corresponds to the US reading grade level, while the SMOG estimates the years of education an average person needs to understand a piece of writing. Accordingly, the FRE score decreases as the FKGL, CLI, SMOG, and GFI scores increase. For texts targeting the general public, the recommended minimum FRE score is 60, while the other four readability indexes should be below 7. An FRE score of 60–70 corresponds to a grade level of 8–9 (approximately 13–15 years of age). The five readability scores were calculated by copying ChatGPT’s response to each question into a free online readability calculator22.
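The same five indexes can also be computed programmatically, for example with the open-source Python package textstat. This is shown only as an illustrative equivalent of the online calculator used in the study; the sample text is invented for demonstration.

```python
# Illustrative computation of the five readability indexes with the "textstat" package
# (the study itself used a free online calculator).
import textstat

def readability_report(text: str) -> dict:
    return {
        "FRE":  textstat.flesch_reading_ease(text),   # 0-100, higher = easier to read
        "FKGL": textstat.flesch_kincaid_grade(text),  # US school grade level
        "GFI":  textstat.gunning_fog(text),           # years of education required
        "CLI":  textstat.coleman_liau_index(text),    # US reading grade level
        "SMOG": textstat.smog_index(text),            # years of education required
    }

sample = (
    "Levothyroxine is the standard treatment for overt hypothyroidism in pregnancy. "
    "Serum TSH should be checked about every four weeks in the first half of pregnancy. "
    "The dose is adjusted to keep TSH within the trimester-specific reference range."
)
print(readability_report(sample))
```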

Ethical approval

Since ChatGPT-4 used in the present study is a public application, and there is no human/animal participant, ethics committee approval was not required.

Statistical analysis

The agreement between the two authors who independently evaluated the ChatGPT responses was tested using the weighted Cohen’s kappa coefficient. The Shapiro–Wilk test was used to assess the normality of the distributions. Normally distributed data were presented as mean ± SD, and non-normally distributed data as median (minimum–maximum). Categorical variables were expressed as numbers and percentages. Relationships between variables were evaluated with the Pearson correlation test for parametric variables and the Spearman correlation test for non-parametric variables. All statistical analyses were performed using SPSS version 22 (IBM SPSS Statistics for Windows, Armonk, NY: IBM Corp).
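The workflow can be reproduced in open-source software as follows. This is a minimal sketch: the study used SPSS, the arrays below are hypothetical placeholder data, and the choice of quadratic weighting for the kappa is an assumption, since the weighting scheme is not specified above.

```python
# Minimal sketch of the statistical workflow described above, on placeholder data.
import numpy as np
from scipy.stats import shapiro, pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

rater1 = [4, 3, 5, 4, 2, 4, 5, 3]   # hypothetical ordinal scores from author 1
rater2 = [4, 4, 5, 4, 2, 3, 5, 3]   # hypothetical ordinal scores from author 2

# Inter-rater agreement for ordinal scores: weighted Cohen's kappa
# (quadratic weighting is assumed here; linear weighting is also common).
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")

# Normality (Shapiro-Wilk) decides between Pearson and Spearman correlation.
x = np.array([30, 28, 33, 31, 26, 29, 34, 27], dtype=float)  # e.g. mDISCERN totals
y = np.array([4, 4, 5, 4, 3, 4, 5, 3], dtype=float)          # e.g. GQS scores
normal = shapiro(x).pvalue > 0.05 and shapiro(y).pvalue > 0.05
r, p = pearsonr(x, y) if normal else spearmanr(x, y)

print(f"weighted kappa = {kappa:.2f}, r = {r:.2f}, p = {p:.3f}")
```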

Results

Reliability and quality

The weighted Cohen’s kappa coefficient between the two authors was 0.76 and 0.87 for reliability in the first and second sessions, respectively, and 0.81 and 0.79, respectively, for quality.

Responses were highly consistent between the first and second sessions, and all ended with advice to consult a healthcare professional. No misleading information was found in any of the ChatGPT-4 responses. In terms of reproducibility, ChatGPT provided reproducible responses for all questions (100%).

The mean mDISCERN score after consensus for the answers to the 19 questions was 30.26 ± 3.14, and the median GQS score was 4 (2–4) (Table 3). The score distribution of ChatGPT-4 responses according to the mDISCERN scale and the quality classification is presented in Table 4. Most of the answers showed moderate (78.9%) or good (21.1%) reliability. In terms of quality, the responses were mostly of high quality (84.2%), followed by moderate quality (10.5%). Only one response was of low quality: in the patient scenario concerning the necessity of levothyroxine treatment in a 28-week pregnant patient with isolated free T4 deficiency, the response contained no information about iodine deficiency in pregnancy.

Table 3 Summary of mDISCERN, GQS, and readability results of ChatGPT-4 responses after consensus scoring.
Table 4 Score distribution of ChatGPT-4 responses according to the mDISCERN scale and quality classification.

Readability

The median FRE was 32.20 (13.00–37.10), indicating that the text was difficult to read. The level of education required to understand the answers was most often college level (9 [47.3%]), followed by college graduate (5 [26.3%]). All findings are summarized in Table 3.

Correlation analysis

The mDISCERN score was moderately positively correlated with the GQS. However, neither mDISCERN nor GQS was correlated with the readability formulas. As expected, FRE was significantly negatively correlated with the other readability formulas. The correlation analysis between the reliability, quality, and readability scores of ChatGPT-4 responses is shown in detail in Table 5.

Table 5 Correlation analysis between reliability, quality, and readability scores of ChatGPT-4 responses.

Discussion

In the present study, the reliability and readability of ChatGPT-4 responses about hypothyroidism during pregnancy were evaluated using open-ended questions and patient scenarios. Our results revealed that ChatGPT-4 responses had moderate to good reliability and high quality. In terms of readability, the responses were difficult or very difficult to read and mostly required a college-level education to understand.

Overt maternal hypothyroidism during pregnancy is associated with an increased risk of preterm birth, low birth weight, pregnancy loss, and lower IQ in the offspring. Subclinical hypothyroidism may lead to varying degrees of adverse pregnancy outcomes. While overt hypothyroidism during pregnancy requires LT4 treatment, it is recommended that the need for treatment in subclinical hypothyroidism be decided by individual evaluation of the patient13. Therefore, the management of hypothyroidism in pregnancy is of great importance in terms of treatment and follow-up. Patients often have concerns about their disease and find it difficult to obtain accurate information, and the internet is increasingly their first source of reference for health-related information23. Unlike search engines (e.g., Google), ChatGPT is an advanced AI language model that has recently become increasingly popular15. Therefore, the reliability and readability of ChatGPT responses are important. Aiming to evaluate the accuracy and reliability of the ChatGPT model in medical responses, Johnson et al.24 created 284 medical questions that were subjectively categorized as easy, medium, or difficult by 33 physicians from 17 specialties and obtained answers from ChatGPT; they reported that ChatGPT produced mostly accurate information.

ChatGPT has a wide range of uses for patients and clinicians, from basic fact-based questions to complex clinical questions. In the United States Medical Licensing Examination (USMLE), ChatGPT was found to perform at or near the passing threshold on multiple-choice single-answer questions8. Similarly, ChatGPT answers to multiple-choice single-answer questions from the Medical School Admission Test (MSAT) were reported to be successful25. In contrast, Suchman et al.26 used ChatGPT-3 and ChatGPT-4 to answer the 2022 and 2021 American College of Gastroenterology self-assessment tests and found that both models failed to pass the multiple-choice single-answer tests; they therefore do not recommend using the models in their current form for medical education in gastroenterology.

Upon examining studies in which ChatGPT’s answers to open-ended patient questions were evaluated, we found a study on blepharoplasty that reported that ChatGPT-4 generally provided fast and sound medical advice without using excessive medical jargon27. On self-management and education of diabetes in the field of endocrinology, Sng et al.28 evaluated ChatGPT’s answers to questions under four main headings: diet and exercise, hypoglycemia and hyperglycemia education, insulin storage, and insulin administration. They found that while ChatGPT generally provided easy and accurate answers, it could also produce some incorrect statements, such as not accepting that insulin analogs can be stored at room temperature after opening28. Similarly, another study evaluated patient questions about obesity surgery using ChatGPT and reported that 131 (86.8%) of 151 questions were answered comprehensively29. Chen et al.10 reported that ChatGPT responses were generally unreliable with respect to cancer treatment recommendations. The high quality and moderate-to-good reliability of ChatGPT-4 responses related to hypothyroidism during pregnancy in the present study are similar to most studies in the literature27,29. Unlike in the studies by Sng et al.28 and Chen et al.10, no misleading information was found in the present study, possibly because we used the newer version of ChatGPT (ChatGPT-4). Because ChatGPT is a trained language model that learns with human feedback, its newer version draws on a larger database. In contrast to the studies by Sng et al.28 and Cox et al.27, the present study concludes that ChatGPT-4’s responses are not easy to understand, indicating that patients must have a certain level of education to understand the answers.

A previous study investigated the use of ChatGPT in cases requiring a more advanced multidisciplinary approach by comparing the recommendations of a multidisciplinary tumor board with those of ChatGPT for primary breast cancer cases30. The agreement rate was 64.2%, which may be attributed to the more complex patient scenarios in the questions or to the fact that treatment options in oncology change very quickly while the ChatGPT database is limited to information up to the year 2021. Lukac et al. noted that ChatGPT mostly provides general responses and that, for individual therapy, the current version does not provide specific recommendations for the treatment of patients with primary breast cancer, so it cannot replace a multidisciplinary board in the real world30. Another study evaluating ChatGPT models on complex clinical questions requiring a multidisciplinary approach was conducted by Hirosawa et al.31, who requested differential diagnoses for 52 clinical vignettes in internal medicine and evaluated the accuracy of the top 5 and top 10 answers given by ChatGPT. The accuracy rate of ChatGPT was reported as 80%. The authors emphasized the potential utility of ChatGPT-4 as a complementary tool for physicians, despite the fact that the evaluation was derived from a limited dataset of case reports from a single source31.

The present study determined that ChatGPT-4 responses are difficult or very difficult to read and mostly require a college-level education to understand. Moreover, the correlation analysis suggested that readability was not related to reliability or quality. These results are similar to those of a study by Momenaei et al.32 on the readability and suitability of the responses generated by ChatGPT-4 about the surgical treatment of retinal diseases; in that study, however, the questions were prepared for clinicians, so an advanced reading level is acceptable. In the present study, questions targeting both patients and clinicians were created, so the development of ChatGPT-4 should aim to increase readability. Alternatively, separate user interfaces for patients and clinicians could be considered.

Our study had some limitations. First, responses were analyzed only in English, so the results cannot be generalized to all languages; although ChatGPT-4 is available in many languages, studies in languages other than English are few. Second, there is a lack of standardized tools for evaluating ChatGPT-4 responses, which creates heterogeneity when comparing studies. Third, ChatGPT does not give a single fixed answer to a question, although consistency was found between the answers in different sessions. Finally, the database contains information only up to 2021; as the model is not connected to the internet, it is currently unable to conduct an updated literature review.

Conclusion

In conclusion, the present study reports that ChatGPT-4 responses about hypothyroidism during pregnancy do not contain misleading information and have moderate to good reliability. However, the responses are difficult or very difficult to read and require a college-level education to understand, which will limit easy use by the general public. Although the answers were considered safe in our study and in most ChatGPT-4 studies in the literature, clinicians and patients should be aware of the model’s limitations. Although ChatGPT-4 has significant potential, at present it can serve only as an auxiliary source of information for counseling about hypothyroidism in pregnancy. Further studies in different languages and with more comprehensive questions are needed to better evaluate ChatGPT-4 performance. Finally, efforts should be made to improve the reliability and readability of ChatGPT and to develop domain-specific models.