Named Entity Recognition (NER) is a sub-task of Natural Language Processing (NLP) that involves identifying and categorizing proper nouns and other "named" entities in text into predefined categories. These categories include person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
NER is a crucial step in the information extraction process as it helps convert unstructured text into structured data. This data is then used in various downstream applications, including but not limited to machine translation, question answering, and knowledge graph construction.
In the broader context of NLP, NER works alongside other tasks such as part-of-speech tagging, dependency parsing, and coreference resolution to analyze the structure and meaning of text. For instance, understanding who or what an entity refers to in a text can be critical for determining the text's overall meaning. In summary, NER is a vital component of NLP, playing a crucial role in turning raw text into structured, interpretable data, which in turn can fuel a range of NLP applications.
Natural Language Processing (NLP) can be applied to medical speech processing in several ways to improve healthcare services and enhance patient outcomes. Natural language processing has significant potential in medical speech processing in English and other languages, including Polish. Here are a few possible applications: Medical Transcriptions, Patient-Doctor Communication, Real-time Translation, Medical Research, Speech Therapy, Mental Health Analysis, and more.
To implement these applications, one must consider that languages have unique syntax, grammar, and vocabulary. Therefore, creating a specialized NLP model for Polish medical speech processing would require extensive training on Polish medical data.
The following script gives an example of using spaCy (scispaCy model [33]) for basic text-processing tasks like tokenization, lemmatization, and named entity recognition, which might help process medical diagnoses.
In this example, the processed sentence is the English translation of "Pacjent cierpi na przewlekłą niewydolność nerek", i.e., "The patient suffers from chronic kidney failure." The script prints noun phrases and verbs and looks for named entities to be marked.
```python
import spacy

nlp = spacy.load("en_core_sci_sm")

# Process the sentence
sentence = "The patient suffers from chronic kidney failure."
doc = nlp(sentence)

print("Tokenization & Lemmatization:")
for token in doc:
    print(f"Token: {token.text}, Lemma: {token.lemma_}")

print("\nNoun Phrases and Verbs:")
for chunk in doc.noun_chunks:
    print(f"Noun Phrase: {chunk.text}")
for token in doc:
    if token.pos_ == "VERB":
        print(f"Verb: {token.text}")

print("\nNamed Entities:")
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")
```
This script tokenizes the sentence, performs lemmatization, extracts noun phrases, identifies verbs, and looks for named entities. Its effectiveness in a medical context is significantly enhanced by a model trained on medical data, which can recognize specific medical terms and conditions. To run the script, spaCy and scispaCy must be installed:
pip install spacy scispacy
The en_core_sci_sm model loaded by the script is a scispaCy model; unlike spaCy's general-purpose models, it is not fetched with python -m spacy download but installed with pip from the model archive linked on the scispaCy releases page.
The script gives a simple example of how to process text. Still, for more complex tasks such as understanding medical terminology or predicting diagnoses, one would likely need a specialized model trained on medical text data in languages like Polish. Training such models could involve techniques like transfer learning and fine-tuning for specific tasks, which is considerably more complex. NLP has further applications in this context as well.
These are just a few examples of how NLP can be applied to medical speech processing. As technology advances and datasets become more available, the potential for NLP applications in healthcare will continue to grow.
Here is another example script in Python for processing a medical diagnosis in the Polish language using the spaCy library for natural language processing:
```python
import spacy

nlp = spacy.load("pl_core_news_sm")

# Polish medical diagnosis
diagnosis_text = "Pacjent cierpi na przewlekłą niewydolność nerek."
doc = nlp(diagnosis_text)

print("Tokens and Parts of Speech:")
for token in doc:
    print(f"{token.text} ({token.pos_})")

print("\nEntities and their Labels:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")
```
In this script, the spaCy library loads the Polish language model (pl_core_news_sm). The text (diagnosis_text) is processed with the loaded model, the entities and their labels are extracted, and the list of entities is printed on the screen. One can perform further processing or analysis on these entities as required. The spaCy library and the Polish language model must be installed for this script to work. Performance would be better with a dedicated medical Polish language model; a general Polish language model is used in this example.
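As a sketch of such further processing, the recognized entities can be grouped by label into a simple structured record for downstream use. The snippet below is model-agnostic; the sample (text, label) pairs stand in for spaCy's (ent.text, ent.label_) output and are assumed for illustration, not actual model output.

```python
from collections import defaultdict

def group_entities_by_label(entities):
    """Group (text, label) pairs by their entity label."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

# Hypothetical (text, label) pairs, standing in for (ent.text, ent.label_)
sample = [("Pacjent", "persName"), ("niewydolność nerek", "condition")]
structured = group_entities_by_label(sample)
```

A record like this, mapping labels to entity mentions, is the kind of structured output that downstream applications (e.g., knowledge graph construction) can consume directly.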
To illustrate the use of a BERT model for NLP-assisted text annotation, we follow a step-by-step approach: loading a pre-trained BERT model, tokenizing the input text, and using the model to predict entities within the text. The transformers library from Hugging Face is used for this purpose. The example focuses on annotating medical entities such as symptoms, medications, and medical conditions in the provided text.
The code snippet below outlines the basic steps to use a BERT model for annotating the given text. Note that for specific medical entity recognition, one would ideally use a model trained on a medical corpus or fine-tune a BERT model on a dataset that includes the types of entities of interest. We used the Hugging Face medical documents NER model (“bert-medical-ner-proj”) for demonstration purposes, but customized fine-tuning should be considered for a real-world application.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load a pre-trained model and tokenizer
model_name = "medical-ner-proj/bert-medical-ner-proj"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Initialize the NLP pipeline for named entity recognition
nlp = pipeline("ner", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")

# The text to be annotated
text = ("Mr. Smith, a 65-year-old male, complains of persistent headaches "
        "and fatigue over the last two months. He has a history of "
        "hypertension, which was diagnosed five years ago. His blood "
        "pressure readings have been consistently above 140/90 mmHg. He is "
        "currently on Hydrochlorothiazide 25mg once daily. He follows a low "
        "salt diet but admits to inconsistent physical activity.")

# Use the pipeline to predict entities
entities = nlp(text)

# Print out the entities and their types
for entity in entities:
    start = entity['start']
    end = entity['end']
    print(f"Text: {text[start:end]}, Entity: {entity['entity_group']}")
```
This code prints the entities recognized in the text together with their types (e.g., PER for person, LOC for location). Since a generic NER model is used, it may not identify all medical-specific entities, or may not classify them with the precision of a specialized medical NER model. Table 1 presents the results of the bert-medical-ner-proj model together with a simple mapping to the desired classification, i.e., problem to Symptoms, etc.
Table 1
The results of medical NER performed by BERT
| Text | Entity | Mapped classification |
|---|---|---|
| Mr | B_person | |
| . Smith | I_person | |
| persistent | B_problem | Symptoms |
| headaches | I_problem | Symptoms |
| fatigue | B_problem | Symptoms |
| He | B_person | |
| hyper | B_problem | Symptoms |
| tension | I_problem | Symptoms |
| which | B_pronoun | |
| His | B_person | |
| blood pressure readings | I_test | Diagnosis |
| He | B_person | |
| Hydro | B_treatment | Medication |
| chlorothiazide | I_treatment | Medication |
| He | B_person | |
| a | B_treatment | Medication |
| low salt diet | I_treatment | Medication |
| activity | I_problem | Symptoms |
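The simple mapping applied in Table 1 can be sketched as a small lookup over the model's BIO-style labels. The label names follow the model output shown above; the mapping itself is an assumption for illustration.

```python
# Mapping from the base entity label (after the BIO prefix) to the
# clinical classification used in Table 1 (illustrative assumption)
LABEL_TO_CATEGORY = {
    "problem": "Symptoms",
    "test": "Diagnosis",
    "treatment": "Medication",
}

def map_label(bio_label):
    """Strip the BIO prefix (e.g. 'B_problem' -> 'problem') and map the
    base label to a clinical category; return '' when no mapping exists."""
    base = bio_label.split("_", 1)[-1].lower()
    return LABEL_TO_CATEGORY.get(base, "")
```

For example, map_label("B_problem") yields "Symptoms", while labels such as B_person are left unmapped, matching the empty cells in the table.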
We can observe that, for example, values of medical measurements are not present in the NER results. For medical applications, exploring models specifically trained on medical datasets, as mentioned earlier, would yield better results. The model and code may need improvements to get the following categorized results:
"Symptoms": ["headaches'', "fatigue"],
"Diagnosis": ["hypertension"],
"Medication": ["Hydrochlorothiazide"],
"Measurements": ["blood pressure", "140/90 mmHg"],
"Lifestyle": ["low salt diet", "physical activity"].
To achieve a categorized result as specified, we would need to adapt the approach to not only extract entities using a pre-trained BERT model but also categorize them according to specific categories like Symptoms, Diagnosis, Medication, Measurements, and Lifestyle. Since general-purpose NER models do not directly provide such specific categorizations, a custom processing step is required to map recognized entities to these categories.
This process involves two main steps: first, extracting entities with the pre-trained NER model; second, mapping each recognized entity to one of the target categories.
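The second, mapping step can be sketched as a small keyword lookup over the extracted entity texts. The keyword lists below are illustrative assumptions, not an exhaustive medical vocabulary.

```python
# Illustrative keyword lists per target category (assumed, not exhaustive)
CATEGORY_KEYWORDS = {
    "Symptoms": ["headache", "fatigue", "pain"],
    "Diagnosis": ["hypertension", "failure", "diabetes"],
    "Medication": ["hydrochlorothiazide"],
    "Measurements": ["blood pressure", "mmhg"],
    "Lifestyle": ["diet", "physical activity", "exercise"],
}

def categorize_entity(entity_text):
    """Assign an entity to the first category whose keyword appears
    in the entity text (case-insensitive); None if no keyword matches."""
    text = entity_text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return category
    return None

def categorize_entities(entities):
    """Group a list of entity strings into the target categories."""
    result = {}
    for ent in entities:
        category = categorize_entity(ent)
        if category:
            result.setdefault(category, []).append(ent)
    return result
```

For instance, categorize_entities(["persistent headaches", "low salt diet"]) places the first entity under "Symptoms" and the second under "Lifestyle".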
Such keyword-based categorization is fairly basic, relying on the presence of specific keywords in the entity text. For more accurate and nuanced categorization, the keyword lists should be extended (e.g., with terms indicating medical measurements or lifestyle factors), or the model fine-tuned on annotated medical data.