Application of machine reading comprehension techniques for named entity recognition in materials science

Materials science is an interdisciplinary field that studies the properties, structures, and behaviors of different materials. The scientific literature contains rich materials science knowledge, but manually analyzing these papers to find material-related data is a daunting task. In information processing, named entity recognition (NER) plays a crucial role, as it can automatically extract entities in the field of materials science, which is of significant value for tasks such as building knowledge graphs. The sequence labeling methods typically used for named entity recognition in materials science (MatNER) often fail to fully utilize the semantic information in the dataset and cannot effectively extract nested entities. Herein, we propose converting the sequence labeling task into a machine reading comprehension (MRC) task. The MRC method effectively solves the challenge of extracting multiple overlapping entities by transforming it into the task of answering multiple independent questions. Moreover, by integrating prior knowledge from queries, the MRC framework allows for a more comprehensive understanding of the contextual information and semantic relationships within materials science literature. State-of-the-art (SOTA) performance was achieved on the Matscholar, BC4CHEMD, NLMChem, SOFC, and SOFC-Slot datasets, with F1-scores of 89.64%, 94.30%, 85.89%, 85.95%, and 71.73%, respectively, using the MRC approach. By effectively utilizing semantic information and extracting nested entities, this approach holds great significance for knowledge extraction and data analysis in the field of materials science, thus accelerating the development of the field. 
Scientific contribution We have developed an innovative NER method that enhances the efficiency and accuracy of automatic entity extraction in the field of materials science by transforming the sequence labeling task into an MRC task; this approach provides robust support for constructing knowledge graphs and other data analysis tasks. Supplementary Information The online version contains supplementary material available at 10.1186/s13321-024-00874-5.


Introduction
The field of materials science has witnessed a significant surge in research and literature in recent years. While scientific publications offer valuable and reliable data, the manual analysis of a vast number of papers to extract essential materials information is an arduous undertaking. Manual extraction is time-consuming, impeding researchers' access to the information they need. Emerging technologies in natural language processing (NLP) offer promising solutions for extracting relevant information from scientific literature. Among them, automatically recognizing named entities in a given text is an important NLP task. In materials science, the identification of various materials, compounds, elements, and other entities is crucial for extracting and structuring materials science knowledge from unstructured texts. However, named entity recognition in materials science (MatNER) [1][2][3][4][5] is extremely challenging because the literature contains many entity types and complex variants of them, such as acronyms, misspellings, and synonyms of compound names.
In the early stages, named entity recognition (NER) mainly relied on rule-based and handcrafted feature methods [3,[6][7][8][9][10]. These methods required the manual definition of rules and feature templates and demanded substantial domain knowledge. Owing to the complexity and limitations of such rules and features, they adapted poorly to new languages and domains. As machine learning gained popularity, statistical and machine learning techniques were increasingly applied to NER. These approaches leverage annotated datasets to train models, enabling them to learn the statistical patterns and contextual information associated with entities in text. Common algorithms include Hidden Markov Models (HMM) [11], Conditional Random Fields (CRF) [12], and deep learning models. Deep learning has made significant progress in the field of NER [13][14][15][16][17]. Deep learning models can automatically learn text feature representations, extracting and classifying information through multi-layer neural networks. For example, Bidirectional Long Short-Term Memory networks (BiLSTM) [14], a variant of Recurrent Neural Networks (RNNs) [18], can capture the contextual dependencies between entities, improving the accuracy of NER. Furthermore, the emergence of large-scale pre-trained language models such as ELMo [19] and BERT [20] has greatly benefited NER. Owing to its strong performance, pre-training BERT on large corpora and fine-tuning it on target datasets has become a mainstream approach. In the field of materials science, Gupta et al. [21] used MatSciBERT, i.e., BERT pre-trained on a materials science corpus, to recognize materials science entities, and their method achieved SOTA performance on multiple materials science datasets. The ability of deep learning methods to automatically learn features yields more competitive performance than feature engineering methods.
Existing methods typically approach the MatNER task by treating it as a sequence labeling problem, training a model to assign labels to individual tokens in a given sequence. However, these methods have limitations in effectively capturing semantic information and cannot address the nested entity problem. Motivated by the recent trend of transforming NLP tasks into machine reading comprehension (MRC) tasks [22][23][24][25][26][27], a MatSciBERT-MRC method based on the MRC framework was proposed in this study. In the MRC framework, each type of materials science entity is encoded through a natural language query and extracted from the given context by answering that query, thus more effectively utilizing the information in the training data and improving the generalization ability of the model. Recent studies have converted various NLP tasks into MRC tasks. For instance, Levy et al. [23] cast the relation extraction task as a QA task by parameterizing each relation type R(x,y) as a question Q(x), with y being the answer. Similarly, McCann et al. [24] achieved competitive performance by uniformly implementing ten different NLP tasks in a question answering framework. In NER, Li et al. [26] applied BERT under the MRC framework to texts from general domains, while Sun et al. [27] attained significant performance on texts from the biomedical domain.
To our knowledge, no specific study has focused on NER in materials science under the MRC framework. Herein, we aim to identify entities in materials science, which differs from previous research [26,27]. Additionally, the impact of different MRC strategies on the MatNER task is explored. The performance of MatSciBERT-MRC was evaluated on five public materials science datasets and compared with traditional sequence labeling models. Experimental results show that MatSciBERT-MRC performs well in detecting various material names, compounds, elements, etc., achieving new SOTA performance. This research provides materials science researchers with a powerful tool for handling large-scale materials science literature and data more accurately and efficiently. Accurately identifying and extracting key information can accelerate the materials research process and open up more possibilities for materials design and discovery.

Datasets construction
The input to a traditional sequence labeling task is a sequence X = {x_1, x_2, ..., x_N}, where x_i represents the i-th word in the sequence. In this study, the labeled NER data needs to be transformed into triples of (Context, Query, Answer). The Context is a given input sequence X, the Query is a query sentence Q_y designed for an entity label y, and the Answer is the span of the target entity. In the MRC task, the query sentence Q_y must be constructed to elicit the relevant information. Specifically, for each type of entity, keywords or phrases associated with label y can be combined into a query sentence, whose length can be chosen according to the specific requirements of the task.
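As an illustration, the transformation from BIO-labeled sequences into (Context, Query, Answer) triples can be sketched as follows. This is a minimal sketch, not the paper's implementation: the helper names and the single example query are illustrative assumptions.

```python
# Sketch: convert BIO-tagged NER data into (Context, Query, Answer) MRC triples.
# The query text below is an illustrative placeholder for one label type.

QUERIES = {
    "MAT": "Look up any inorganic solids or alloys, any non-gaseous elements.",
}

def bio_to_spans(tokens, tags):
    """Extract (start, end, label) spans (inclusive token indices) from BIO tags."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i - 1, tags[start][2:]))
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i - 1, tags[start][2:]))
            start = None
        # "I-" tags simply continue the current span
    if start is not None:
        spans.append((start, len(tags) - 1, tags[start][2:]))
    return spans

def build_mrc_triples(tokens, tags):
    """One (Context, Query, Answer) triple per entity type in QUERIES."""
    spans = bio_to_spans(tokens, tags)
    triples = []
    for label, query in QUERIES.items():
        answers = [(s, e) for s, e, l in spans if l == label]
        triples.append({"context": " ".join(tokens),
                        "query": query,
                        "answers": answers})
    return triples

tokens = ["LiFePO4", "is", "a", "cathode", "material"]
tags = ["B-MAT", "O", "O", "O", "O"]
print(build_mrc_triples(tokens, tags))
```

Note that one sentence yields one triple per entity type, which is how the MRC formulation turns nested or overlapping entities into independent questions.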

Query generation
The generation of queries is a crucial process, as queries encode prior knowledge about the labels and ultimately influence the final results of MatNER tasks. In this study, queries were created with annotation guidelines as references. These guidelines consist of instructions provided by dataset producers to annotators to describe the label categories. It is essential that queries be expressed in a general yet precise manner to eliminate ambiguity. Table 1 presents examples of queries constructed for the Matscholar [1] dataset.

Model details
In this study, BERT [20] was used as the model backbone, with MatSciBERT [21] weights, to identify entities in the field of materials science. Figure 1 depicts the implementation of the MatNER task using BERT in the MRC framework. Initially, the combined sequence {[CLS], q_1, q_2, ..., q_m, [SEP], x_1, x_2, ..., x_n} is formed by concatenating the query Q_y with the sequence X, where the special tokens [CLS] and [SEP] mark the boundaries of the query and the context. The combined sequence is fed into the BERT model, which outputs the context representation. Since only context predictions are required, the query representations can be discarded, as they are not part of the prediction target.
In the MRC framework, there are two prevalent approaches to span selection. The first employs a pair of n-class classifiers, one predicting the start index and the other the end index. These classifiers extract features using a pre-trained model such as BERT and output an n-dimensional probability distribution over positions, from which the highest-probability start and end positions are selected to form a span. The second approach uses two binary classifiers, one predicting for each position whether it is a start index and the other whether it is an end index. These binary classifiers likewise extract features with a pre-trained model and output a binary probability distribution for each position. This approach can output multiple start and end indices for a given context and query, making it possible to extract all entities relevant to the query. The second approach is utilized in this study, and a detailed explanation of its workings is provided below.
For start-index prediction, a softmax is applied to determine whether each token is a start index:

K_start = softmax(E · Q_start)    (1)

where E is the context representation output by BERT, Q_start is a weight matrix to be learned, and each row of K_start gives the probability distribution of the corresponding index being the starting position of an entity.

The model then predicts the probability of each token being an end index:

K_end = softmax(E · Q_end)    (2)

where Q_end is a second weight matrix to be learned, introduced to determine the ending position of an entity for a given query, and each row of K_end gives the probability distribution of the corresponding index being the ending position of an entity.

For each given X, there may exist multiple possible start and end indices, so simply matching them based on proximity is not a reasonable approach. Instead, applying argmax to each row of K_start and K_end yields all predicted start and end indices:

I_start = {i | argmax(K_start^(i)) = 1},  I_end = {j | argmax(K_end^(j)) = 1}    (3)

where the superscripts i and j denote the i-th and j-th rows of the respective matrices.
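A minimal, pure-Python sketch of the two-binary-classifier strategy described above. The toy token representations and weight matrices are illustrative assumptions, and the start–end pairing uses a simple nearest-end heuristic rather than the paper's exact matching rule:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(E, Q):
    """Plain list-of-lists matrix product E (n x d) times Q (d x 2)."""
    return [[sum(e * q for e, q in zip(row, col)) for col in zip(*Q)] for row in E]

def predict_spans(E, Q_start, Q_end):
    """Two binary classifiers: K = softmax(E . Q); index i is a start (or end)
    if the argmax of its row is class 1. Starts are then paired with the
    nearest unused end at or after them (a simple heuristic)."""
    K_start = [softmax(r) for r in matmul(E, Q_start)]
    K_end = [softmax(r) for r in matmul(E, Q_end)]
    starts = [i for i, r in enumerate(K_start) if r[1] > r[0]]
    ends = [j for j, r in enumerate(K_end) if r[1] > r[0]]
    spans, used = [], set()
    for s in starts:
        for e in ends:
            if e >= s and e not in used:
                spans.append((s, e))
                used.add(e)
                break
    return spans

# Toy example: token 0 should be a start, token 1 an end.
E = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
Q_start = [[-1.0, 1.0], [0.0, -1.0]]
Q_end = [[0.0, -1.0], [-1.0, 1.0]]
print(predict_spans(E, Q_start, Q_end))  # → [(0, 1)]
```

Because every position is classified independently, multiple spans can be emitted for a single query, which is what allows overlapping and nested entities to be recovered.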

Datasets and experiment settings
Five different NER datasets were considered, i.e. BC4CHEMD [28], Matscholar [1], NLMChem [29], SOFC-Slot, and SOFC [30], to represent various text sources and questions related to materials science. Table 2 displays the statistics of all datasets used in this study; they were selected because they are publicly available and ensure a comprehensive evaluation of the proposed method.
The experiments in this study were conducted using Python 3.8.12 and torch 1.12.1. Training of the models was performed on a single GTX 3060 GPU. Due to computational constraints, many previous works in the field of materials science utilized the BERT-base model. To ensure comparability with these works, all BERT models employed in this study were based on the BERT-base [20] architecture, which consists of 12 transformer layers, a 768-dimensional hidden layer, and a 12-head multi-head attention mechanism. For details on the hyperparameters used in the experiments, please refer to Table 3.

Evaluation metrics
In the experimental phase, the F1-score was employed as the metric to assess the overall performance of the model, with precision and recall used to evaluate the model's ability to correctly identify positive and negative samples. Precision is the ratio of correctly identified positives (true positives) to the total number of predicted positives.
Recall is the ratio of true positives to the total number of actual positives. The F1-score is the harmonic mean of precision and recall.
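These entity-level metrics can be sketched as follows, counting an entity as correct only when its span and label both match. This is a minimal illustration, not the evaluation script used in the paper:

```python
def span_prf(pred_spans, gold_spans):
    """Entity-level precision, recall, and F1 over predicted vs. gold
    (start, end, label) spans; an exact span-and-label match counts as a
    true positive."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One of two predictions is correct, and one of two gold entities is found.
p, r, f1 = span_prf([(0, 1, "MAT"), (3, 4, "MAT")],
                    [(0, 1, "MAT"), (5, 6, "DSC")])
print(p, r, f1)  # → 0.5 0.5 0.5
```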

The effect of different BERT models on NER performance
To investigate the effects of different BERT models on NER performance, MatSciBERT [21], BioBERT [31], and SciBERT [32] were evaluated on the NER task, with the aim of determining which BERT model yields the best performance for NER in materials science. The results are presented in Table 4. Overall, MatSciBERT achieved higher scores than BioBERT and SciBERT on all the datasets (p < 0.05). Unlike SciBERT, MatSciBERT has been trained on a large corpus of texts from the field of materials science, covering multiple research areas, journals, and data sources, and thus encompassing a broad body of materials science knowledge and domain-specific terminology. This gives MatSciBERT greater adaptability and accuracy when working with materials science texts. The results also indicate a significant difference between the materials science literature used to pre-train MatSciBERT and the biomedical literature used to pre-train BioBERT; each scientific discipline exhibits substantial variation in ontology, vocabulary, and domain-specific symbols. The pre-training corpus therefore plays a pivotal role in determining model performance, and MatSciBERT, trained on an extensive collection of materials science publications, proves the better fit for the MatNER task in this experiment.

MRC vs sequence labeling frameworks
A detailed comparison of the performance of BERT models under the MRC framework and the sequence labeling framework was conducted; the results are presented in Table 5. In MatSciBERT-Softmax, each token in the sequence is classified by applying the Softmax function to the MatSciBERT output. MatSciBERT-CRF learns the constraint relationships between labels through a CRF layer to ensure the plausibility of the predicted label sequence. MatSciBERT-BiLSTM-CRF adds a BiLSTM-CRF head to strengthen context modeling, enabling the model to better learn semantic information from the context. Among the three sequence labeling models (MatSciBERT-CRF, MatSciBERT-BiLSTM-CRF, and MatSciBERT-Softmax), MatSciBERT-CRF achieves the best performance on four of the five datasets. It can be inferred that the CRF's ability to capture the interdependence between labels guarantees the plausibility of the predicted label sequence and significantly enhances the accuracy of entity recognition. On the BC4CHEMD dataset, however, MatSciBERT-BiLSTM-CRF performs best among the three sequence labeling models, possibly because BC4CHEMD is relatively large, allowing the BiLSTM layer to learn more contextual semantic information from the dataset. Unlike these three methods, MatSciBERT-MRC turns the MatNER task into a machine reading comprehension problem and predicts the answer span x_{start,end} from the input sequence X and query Q_y. As shown in Table 5, MatSciBERT-MRC brings a substantial improvement to the MatNER task compared with the sequence labeling methods (p < 0.05). By encoding crucial prior knowledge into the query, the MRC approach effectively mitigates the problems of sparse tagging, corpus size, and sentence length, leading to improvements on all the datasets. The experimental results clearly showcase the superior entity identification capabilities of BERT within the MRC framework compared to the sequence labeling framework, particularly in the domain of materials science.

The effect of different MRC strategies on NER performance
The influence of different span prediction strategies was also evaluated, specifically the effect of the end-index information in the MatSciBERT-MRC model on NER performance. To assess this, a baseline model called MatSciBERT-MRC-base was designed and compared with MatSciBERT-MRC. In implementation, MatSciBERT-MRC-base only replaces the end-index prediction of MatSciBERT-MRC described above with one that predicts the end index independently of the start index, while keeping everything else unchanged. A performance comparison between the two models is presented in Table 6. Both models achieve competitive average F1 scores, but overall MatSciBERT-MRC outperformed MatSciBERT-MRC-base on four of the five datasets (p < 0.05). This advantage is likely because the full model considers the start index when predicting the end index, allowing for more accurate entity boundary prediction: the start index provides context that helps the model determine the most likely end position of an entity, thereby reducing boundary errors. Additionally, independently predicting the start and end indices can lead to invalid spans, such as an end index preceding its start index or spans that do not correspond to valid entities. On the BC4CHEMD dataset, however, the base model performed better than MatSciBERT-MRC, possibly because the entity structure in this dataset is relatively simple, allowing the baseline model to capture it adequately.
Based on the experimental findings, it has been observed that the model's performance can be influenced to a certain degree by the implementation of different end_index functions.Furthermore, considering the start index during the prediction of the end index has been found to enhance the overall performance of the model.
Due to the extensive data processing required in the early stage of the MRC pipeline, the number of entities decreases and the labels become imbalanced. To mitigate this, different loss functions were considered. Focal Loss down-weights well-classified samples so that training focuses on hard, minority-class samples, while Label Smoothing introduces some noise to make the labels relatively soft, thereby alleviating overfitting to the training data.
In these experiments, three distinct loss functions were employed to evaluate the performance of the model. The results in Table 7 demonstrate that using Focal Loss leads to a measurable improvement in model performance (p < 0.05). This can be attributed to Focal Loss effectively addressing class imbalance, thereby enhancing the model's ability to classify difficult samples. In conclusion, selecting an appropriate loss function is crucial for improving the performance of MRC models; when dealing with class imbalance, Focal Loss can be an effective choice.
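A minimal sketch of binary Focal Loss, showing how the (1 - p_t)^γ factor suppresses the contribution of easy, well-classified examples. The γ = 2, α = 0.25 defaults are common choices in the literature, not values reported here, and this is an illustrative implementation rather than the paper's code:

```python
import math

def focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Mean binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p_t is the predicted probability of the true class."""
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
    return total / len(probs)

# A confident correct prediction contributes almost nothing,
# while a badly misclassified one still produces a large loss.
easy = focal_loss([0.9], [1])
hard = focal_loss([0.1], [1])
print(easy, hard)
```

The effect is that gradient updates are dominated by the hard, minority-class tokens, which is why the loss helps on the imbalanced start/end labels produced by the MRC formulation.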
Our findings show that incorporating appropriate loss functions and span prediction strategies can significantly improve the performance of the model on imbalanced datasets.

The effect of different query constructs on NER performance
The structure of the query plays a crucial role in determining the final outcomes. In this section, different approaches to query construction and their implications were explored, using the label "MAT" in the Matscholar dataset as an example. Several common methods for query construction were employed:
• Keywords: The query describes the label using keywords. For example, the query for the label MAT is "inorganic material".
• Rule-based template filling: Queries are generated using templates. The query for the label MAT is "Which inorganic material is mentioned in the text?".
• Wikipedia: Queries are constructed using Wikipedia definitions. The query for the label MAT is "Materials made from inorganic substances alone or in combination with other substances".
• Synonyms: Words or phrases with identical or closely similar meanings to the original keyword, extracted from the Oxford Dictionary. The query for the label MAT is "Inorganic material".
• Keywords + Synonyms: Keywords are combined with their synonyms.
• Annotation guideline: This is the method used in this paper. The query for the label MAT is "Look up any inorganic solids or alloys, any nongaseous elements."
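The strategies above can be collected into a simple lookup, shown here for the Matscholar "MAT" label. The query strings follow the examples in the text; the template helper is an assumed convenience, not part of the paper's code:

```python
def template_query(label_description):
    """Rule-based template filling for a label description."""
    return f"Which {label_description} is mentioned in the text?"

# Query-construction strategies for the "MAT" label (strings from the text).
MAT_QUERIES = {
    "keywords": "inorganic material",
    "template": template_query("inorganic material"),
    "wikipedia": ("Materials made from inorganic substances alone "
                  "or in combination with other substances"),
    "synonyms": "Inorganic material",
    "annotation_guideline": ("Look up any inorganic solids or alloys, "
                             "any nongaseous elements."),
}

print(MAT_QUERIES["annotation_guideline"])
```

Swapping the strategy then only changes which string is concatenated with the context, leaving the rest of the MRC pipeline untouched.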
The experimental results for MatNER on the Matscholar dataset are presented in Table 8. The BERT-MRC model performs better than the BERT-Tagger model in all settings. The annotation guideline method outperforms the other methods because it provides clearer and more detailed label definitions (p < 0.05). These guidelines typically include instructions provided by dataset producers to annotators, making the label category descriptions more explicit and specific and thereby reducing ambiguity and errors in the annotation process. In contrast, the Wikipedia method falls short, which can be attributed to the relatively general definitions provided by Wikipedia; these may not precisely align with the specific annotations required. These findings highlight the importance of query construction in MatNER tasks: carefully designed queries can significantly improve the performance of MatNER models.

Performance comparison with other methods
Finally, the proposed method was compared with previous studies on five materials science datasets. A significant improvement was observed over the previous SOTA models. The F1 scores on the Matscholar, BC4CHEMD, NLMChem, SOFC, and SOFC-Slot datasets were 89.64%, 94.30%, 85.89%, 85.95%, and 71.73%, respectively, representing improvements of +0.89%, +2.55%, +1.08%, +0.91%, and +1.39% over the previous SOTA performances.
These results demonstrate the superior performance of our method in the field of materials science, surpassing previous benchmarks. This improvement is crucial for information extraction and entity recognition tasks in materials science. Through experiments on these five datasets, the robustness, generality, and effectiveness of our method have been validated across multiple datasets and different scenarios (Table 9). In addition, to further verify the effectiveness of our model, a fivefold cross-validation was conducted on the datasets; for details, please refer to Table S2 in the supporting information.

Case study
A comprehensive case study was conducted to further investigate the distinctions between MatSciBERT-MRC and MatSciBERT-CRF. The outcomes are presented in Table 10. Based on the case studies of Matscholar, BC4CHEMD, and SOFC-Slot, the MatSciBERT-MRC model provides accurate demarcation of entity boundaries, such as "UV-light illumination", "docosahexaenoic acids", and "ceria-based ceramics". It can therefore be inferred that the MatSciBERT-MRC model successfully identifies words and phrases related to entity categories and provides accurate boundary information, whereas the MatSciBERT-CRF model has limitations in determining boundaries, possibly owing to the difficulties CRF models encounter with complex syntactic structures. Furthermore, in the case studies of NLMChem and SOFC, the MatSciBERT-MRC model identified "ceramic cell" entities that MatSciBERT-CRF failed to capture and corrected MatSciBERT-CRF's misidentification of "Immunocytochemistry (ICC)" entities, further validating its superiority in entity recognition tasks. In addition, since our materials datasets lack nested cases, nested cases were artificially created to evaluate both models. MatSciBERT-CRF can only recognize the "BWT-Pt catalysts" entity, while MatSciBERT-MRC can also recognize the "BWT-Pt" entity nested within it. MatSciBERT-MRC thus addresses a limitation of sequence labeling architectures and efficiently handles nested entities. From these examples, it can be inferred that MatSciBERT-MRC excels at accurately identifying entity boundaries while mitigating label inconsistency and resolving entity nesting issues, highlighting its robustness and practicality across scenarios in materials science information extraction. This advancement is not only pivotal for materials science but also has broader implications: the method can be adapted for named entity recognition in domains such as chemistry and the biosciences, where the MRC-based approach can effectively handle diverse texts, including those on chemical synthesis, chemical property analysis, and biological processes.

Conclusion
In summary, BERT in the MRC framework was employed to conduct the named entity recognition in materials science (MatNER) task. Compared to BERT in the sequence labeling framework, BERT (i.e., MatSciBERT) in the MRC framework improves the recognition of target entities. Moreover, the MRC framework has the advantage of incorporating prior knowledge, and its performance can be further enhanced through query design. The proposed approach achieves SOTA performance on five MatNER datasets. The results clearly indicate that utilizing BERT in the MRC framework with thoughtfully designed queries can significantly improve the accuracy of MatNER models. By demonstrating the versatility and effectiveness of BERT in the MRC framework, our findings contribute to the development of more accurate and efficient natural language processing tools. These tools can be instrumental across a range of applications in materials science, chemistry, the biosciences, and other fields, enabling precise extraction of named entities from different types of scientific texts.

Fig. 1
Using BERT to perform MatNER in the MRC framework

Table 1
Examples of constructed queries

Table 2
Statistics on datasets

Table 3
The detailed hyper-parameters of MatSciBERT-MRC

Table 4
Performance comparison for different BERT models. The bold marking indicates the highest F1 score in the comparison of different BERT models

Table 5
Performance comparison for different models. The bold marking indicates the highest F1 score in the comparison of all model architectures, while the underline marking indicates the best F1 score among models excluding the MRC architecture

Table 6
Performance comparison for different end index strategies. The bold marking indicates the highest F1 score in the comparison of different end index strategies

Table 7
Performance comparison for different loss functions. The bold marking indicates the highest F1 score in the comparison of different loss functions

Table 8
Performance comparison for different Query constructs

Table 9
Performance comparison with other existing methods

Table 10
Representative results of case study