Named Entity Recognition (NER) is a sub-task of Natural Language Processing (NLP) that involves identifying and categorizing proper nouns and other "named" entities in text into predefined categories. These categories include person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
NER is a crucial step in the information extraction process as it helps convert unstructured text into structured data. This data is then used in various downstream applications, including but not limited to machine translation, question answering, and knowledge graph construction.
In the broader context of NLP, NER works alongside other tasks such as part-of-speech tagging, dependency parsing, and coreference resolution to analyze the structure and meaning of text. For instance, understanding who or what an entity refers to in a text can be critical for determining the text's overall meaning. In summary, NER is a vital component of NLP, playing a crucial role in turning raw text into structured, interpretable data, which in turn can fuel a range of NLP applications.
Natural Language Processing (NLP) can be applied to medical speech processing in several ways to improve healthcare services and enhance patient outcomes. Natural language processing has significant potential in medical speech processing in English and other languages, including Polish. Here are a few possible applications: Medical Transcriptions, Patient-Doctor Communication, Real-time Translation, Medical Research, Speech Therapy, Mental Health Analysis, and more.
To implement these applications, one must consider that languages have unique syntax, grammar, and vocabulary. Therefore, creating a specialized NLP model for Polish medical speech processing would require extensive training on Polish medical data.
The following script gives an example of using spaCy (scispaCy model [33]) for basic text-processing tasks like tokenization, lemmatization, and named entity recognition, which might help process medical diagnoses.
In this example, the processed sentence is the English translation of "Pacjent cierpi na przewlekłą niewydolność nerek", i.e., "The patient suffers from chronic kidney failure." The script prints noun phrases and verbs and looks for named entities to be marked.
```python
import spacy

nlp = spacy.load("en_core_sci_sm")

# Process the sentence
sentence = "The patient suffers from chronic kidney failure."
doc = nlp(sentence)

print("Tokenization & Lemmatization:")
for token in doc:
    print(f"Token: {token.text}, Lemma: {token.lemma_}")

print("\nNoun Phrases and Verbs:")
for chunk in doc.noun_chunks:
    print(f"Noun Phrase: {chunk.text}")
for token in doc:
    if token.pos_ == "VERB":
        print(f"Verb: {token.text}")

print("\nNamed Entities:")
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")
```
This script tokenizes the sentence, performs lemmatization, extracts noun phrases, identifies verbs, and looks for named entities. Its effectiveness in a medical context is significantly enhanced by a model trained on medical data, which can recognize specific medical terms and conditions. To run the script, spaCy and scispaCy must be installed:
pip install spacy scispacy
The en_core_sci_sm model loaded by the script is a scispaCy model; unlike spaCy's general-purpose models, it is not fetched with python -m spacy download but installed with pip from the model archive linked on the scispaCy releases page.
The script gives a simple example of how to process text. Still, for more complex tasks such as understanding medical terminology or predicting diagnoses, one would likely need a specialized model trained on medical text data in languages like Polish. Training such models could involve techniques like transfer learning and fine-tuning for specific tasks, which is considerably more complex. NLP has further applications in this context as well.
These are just a few examples of how NLP can be applied to medical speech processing. As technology advances and datasets become more available, the potential for NLP applications in healthcare will continue to grow.
Here is another example script in Python for processing a medical diagnosis in the Polish language using the spaCy library for natural language processing:
```python
import spacy

nlp = spacy.load("pl_core_news_sm")

# Polish medical diagnosis
diagnosis_text = "Pacjent cierpi na przewlekłą niewydolność nerek."
doc = nlp(diagnosis_text)

print("Tokens and Parts of Speech:")
for token in doc:
    print(f"{token.text} ({token.pos_})")

print("\nEntities and their Labels:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")
```
In this script, the spaCy library loads the Polish language model (pl_core_news_sm). The text (diagnosis_text) is processed with the loaded model, the entities and their labels are extracted, and the list of entities is printed on the screen. One can perform further processing or analysis on these entities as required. The spaCy library and the Polish language model must be installed for this script to work. Performance would be better with a dedicated medical Polish language model; a general Polish language model is used in this example.
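As a sketch of such further processing, the recognized entities can be grouped by label into a simple structured record for downstream use. The snippet below is model-agnostic; the sample (text, label) pairs stand in for spaCy's (ent.text, ent.label_) output and are assumed for illustration, not actual model output.

```python
from collections import defaultdict

def group_entities_by_label(entities):
    """Group (text, label) pairs by their entity label."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

# Hypothetical (text, label) pairs, standing in for (ent.text, ent.label_)
sample = [("Pacjent", "persName"), ("niewydolność nerek", "condition")]
structured = group_entities_by_label(sample)
```

A record like this, mapping labels to entity mentions, is the kind of structured output that downstream applications (e.g., knowledge graph construction) can consume directly.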
To illustrate the use of a BERT model for NLP-assisted text annotation, we follow a step-by-step approach: loading a pre-trained BERT model, tokenizing the input text, and using the model to predict entities within the text. The transformers library from Hugging Face is used for this purpose. The example focuses on annotating medical entities such as symptoms, medications, and medical conditions in the provided text.
The code snippet below outlines the basic steps to use a BERT model for annotating the given text. Note that for specific medical entity recognition, one would ideally use a model trained on a medical corpus or fine-tune a BERT model on a dataset that includes the types of entities of interest. We used the Hugging Face medical documents NER model (“bert-medical-ner-proj”) for demonstration purposes, but customized fine-tuning should be considered for a real-world application.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load a pre-trained model and tokenizer
model_name = "medical-ner-proj/bert-medical-ner-proj"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Initialize the NLP pipeline for named entity recognition
nlp = pipeline("ner", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")

# The text to be annotated
text = ("Mr. Smith, a 65-year-old male, complains of persistent headaches "
        "and fatigue over the last two months. He has a history of "
        "hypertension, which was diagnosed five years ago. His blood "
        "pressure readings have been consistently above 140/90 mmHg. He is "
        "currently on Hydrochlorothiazide 25mg once daily. He follows a low "
        "salt diet but admits to inconsistent physical activity.")

# Use the pipeline to predict entities
entities = nlp(text)

# Print out the entities and their types
for entity in entities:
    start = entity['start']
    end = entity['end']
    print(f"Text: {text[start:end]}, Entity: {entity['entity_group']}")
```
This code prints the entities recognized in the text together with their types (e.g., PER for person, LOC for location). Since a generic NER model is used, it may not identify all medical-specific entities, or may not classify them with the precision of a specialized medical NER model. Table 1 presents the results of the bert-medical-ner-proj model together with a simple mapping to the desired classification, i.e., problem to Symptoms, etc.
Table 1
The results of medical NER performed by BERT
| Text | Entity | Mapped classification |
|---|---|---|
| Mr | B_person | |
| . Smith | I_person | |
| persistent | B_problem | Symptoms |
| headaches | I_problem | Symptoms |
| fatigue | B_problem | Symptoms |
| He | B_person | |
| hyper | B_problem | Symptoms |
| tension | I_problem | Symptoms |
| which | B_pronoun | |
| His | B_person | |
| blood pressure readings | I_test | Diagnosis |
| He | B_person | |
| Hydro | B_treatment | Medication |
| chlorothiazide | I_treatment | Medication |
| He | B_person | |
| a | B_treatment | Medication |
| low salt diet | I_treatment | Medication |
| activity | I_problem | Symptoms |
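The simple mapping applied in Table 1 can be sketched as a small lookup over the model's BIO-style labels. The label names follow the model output shown above; the mapping itself is an assumption for illustration.

```python
# Mapping from the base entity label (after the BIO prefix) to the
# clinical classification used in Table 1 (illustrative assumption)
LABEL_TO_CATEGORY = {
    "problem": "Symptoms",
    "test": "Diagnosis",
    "treatment": "Medication",
}

def map_label(bio_label):
    """Strip the BIO prefix (e.g. 'B_problem' -> 'problem') and map the
    base label to a clinical category; return '' when no mapping exists."""
    base = bio_label.split("_", 1)[-1].lower()
    return LABEL_TO_CATEGORY.get(base, "")
```

For example, map_label("B_problem") yields "Symptoms", while labels such as B_person are left unmapped, matching the empty cells in the table.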
We can observe that, for example, values of medical measurements are not present in the NER results. For medical applications, exploring models specifically trained on medical datasets, as mentioned earlier, would yield better results. The model and code may need improvements to get the following categorized results:
"Symptoms": ["headaches'', "fatigue"],
"Diagnosis": ["hypertension"],
"Medication": ["Hydrochlorothiazide"],
"Measurements": ["blood pressure", "140/90 mmHg"],
"Lifestyle": ["low salt diet", "physical activity"].
To achieve a categorized result as specified, we would need to adapt the approach to not only extract entities using a pre-trained BERT model but also categorize them according to specific categories like Symptoms, Diagnosis, Medication, Measurements, and Lifestyle. Since general-purpose NER models do not directly provide such specific categorizations, a custom processing step is required to map recognized entities to these categories.
This process involves two main steps: first, extracting entities with the pre-trained NER model; second, mapping each recognized entity to one of the target categories.
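The second, mapping step can be sketched as a small keyword lookup over the extracted entity texts. The keyword lists below are illustrative assumptions, not an exhaustive medical vocabulary.

```python
# Illustrative keyword lists per target category (assumed, not exhaustive)
CATEGORY_KEYWORDS = {
    "Symptoms": ["headache", "fatigue", "pain"],
    "Diagnosis": ["hypertension", "failure", "diabetes"],
    "Medication": ["hydrochlorothiazide"],
    "Measurements": ["blood pressure", "mmhg"],
    "Lifestyle": ["diet", "physical activity", "exercise"],
}

def categorize_entity(entity_text):
    """Assign an entity to the first category whose keyword appears
    in the entity text (case-insensitive); None if no keyword matches."""
    text = entity_text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return category
    return None

def categorize_entities(entities):
    """Group a list of entity strings into the target categories."""
    result = {}
    for ent in entities:
        category = categorize_entity(ent)
        if category:
            result.setdefault(category, []).append(ent)
    return result
```

For instance, categorize_entities(["persistent headaches", "low salt diet"]) places the first entity under "Symptoms" and the second under "Lifestyle".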
Such keyword-based categorization is fairly basic, relying on the presence of specific keywords in the entity text. For more accurate and nuanced categorization, the keyword lists should be extended (e.g., with terms indicating medical measurements or lifestyle factors), or the model fine-tuned on annotated medical data.