How Do Your Biomedical Named Entity Recognition Models Generalize to Novel Entities?

The amount of biomedical literature on new biomedical concepts is rapidly increasing, which necessitates a reliable biomedical named entity recognition (BioNER) model for identifying new and unseen entity mentions. However, it is questionable whether existing models can effectively handle them. In this work, we systematically analyze the three types of recognition abilities of BioNER models: memorization, synonym generalization, and concept generalization. We find that although current best models achieve state-of-the-art performance on benchmarks based on overall performance, they have limitations in identifying synonyms and new biomedical concepts, indicating they are overestimated in terms of their generalization abilities. We also investigate failure cases of models and identify several difficulties in recognizing unseen mentions in biomedical literature as follows: (1) models tend to exploit dataset biases, which hinders the models' abilities to generalize, and (2) several biomedical names have novel morphological patterns with weak name regularity, and models fail to recognize them. We apply a statistics-based debiasing method to our problem as a simple remedy and show the improvement in generalization to unseen mentions. We hope that our analyses and findings will facilitate further research into the generalization capabilities of NER models in a domain where their reliability is of utmost importance.


I. INTRODUCTION
Recently, more than 3,000 biomedical papers are being published per day on average [1], [2]. Searching these documents efficiently or extracting useful information from them would be of great help to researchers and practitioners in the field. Biomedical named entity recognition (BioNER), which involves identifying biomedical named entities in unstructured text, is a core task to do so since entities extracted by BioNER systems are utilized as important features in many downstream tasks such as drug-drug interaction extraction [3].
One important desideratum of BioNER models is to be able to generalize to unseen entity mentions. This generalization capability is of paramount importance in the biomedical domain due to the following reasons. First, various expressions for a biomedical entity (i.e., synonyms) continue to be made. For instance, pharmaceutical companies come up with marketing-appropriate names such as Gleevec to replace old names (usually identifiers) such as CGP-57148B and STI-571, whereas entities in other domains such as countries and companies are relatively unchanged. Second, new biomedical entities and concepts such as the novel coronavirus disease 2019 (COVID-19) constantly emerge, which can have a direct impact on human life and health.
In contrast to the importance of generalizing to new entities in the biomedical literature, there has been little systematic analysis of the generalizability of BioNER models. While recent works have made great efforts to push the state-of-the-art (SOTA) on various benchmarks [4]-[7], it is questionable whether a high overall performance on a benchmark indicates true generalization. We conducted a pilot study to check if current BioNER models are reliable in identifying new entities. Specifically, we trained BioBERT [7] on the NCBI corpus [8], and then tested how many spans containing the novel entity COVID-19 the model can extract from PubMed sentences. As a result, the model extracted only 45.7% of all the spans, although it achieved high overall performance on NCBI (90.5% in recall). From this, we conclude that existing BioNER models may have limitations in identifying unseen entities, and their generalizability should be explored in a more systematic way beyond measuring overall performance.
In this work, we analyze how well existing BioNER models generalize to unseen mentions. First, we define three types of recognition abilities that BioNER models should possess:
• Memorization: The most basic ability is to identify the entity mentions that were seen during training. We call this type of mention a memorizable mention. If there is no label inconsistency, even a simple rule-based model would recognize memorizable mentions easily.
• Synonym generalization: Biomedical names are expressed in various forms, even when they refer to the same biomedical concepts. For instance, Motrin and Ibuprofen are the same entity, but their surface forms are highly different [9]. A BioNER model should be robust to these morphological variations (i.e., synonyms).
• Concept generalization: While synonym generalization deals with recognizing new surface forms of existing entities, concept generalization refers to the generalization to novel entities or concepts that did not exist before. New biomedical concepts such as COVID-19 are sometimes very different from conventional entities in terms of their surface forms and the context in which they appear, which makes it difficult to identify them.
In terms of the three recognition types that we define, we partition the entity mentions in the test sets (or validation sets) into three splits based on mention and CUI (Concept Unique Identifier) overlaps with the training sets, as shown in Table 1. This gives us several advantages. First, we can compare models' generalization abilities in detail. For instance, we find that the gap in performance between BioBERT and BERT [10] is mainly from synonym and concept generalization, not memorization (Section III). Also, our classification is simple and can be easily adapted to other datasets and other downstream tasks in the biomedical domain such as relation extraction and normalization. We focus on two popular BioNER benchmarks in this work: NCBI-disease [8] and BC5CDR [11].
On the three test splits, we investigate the generalizability of existing BioNER models. Despite their SOTA performance on the benchmarks, they have limitations in their generalizability. Specifically, the models perform well on memorizable mentions, but find it difficult to identify unseen mentions. For the disease mentions in the BC5CDR corpus, BioBERT achieved a recall of 93.3% on memorizable mentions, but 74.9% and 73.7% on synonyms and new concepts, respectively. Also, the models cannot recognize the newly emerging biomedical concept COVID-19 well. Surprisingly, BioBERT recognized only 3.4% of the spans containing COVID-19 when trained on BC5CDR. From these observations, we conclude that existing BioNER models achieve high performance on benchmarks, but they are overestimated in terms of their generalizability.
Also, we identify several difficulties in recognizing unseen mentions. First, through a qualitative analysis of error cases on Syn and Con splits, we find BioNER models can rely on the class distributions of each word in the training set, reducing the models' abilities to generalize. Since BioNER datasets are relatively small for training large neural networks, models may be sensitive to such dataset bias. Second, after examining the failure for COVID-19, we conclude models are not robust to new entities when they do not follow conventional surface patterns. This is an important issue to be addressed since many biomedical entities have rare morphologies (see Table 8 for examples), and such entities will continue to appear in biomedical literature.
The two difficulties can be viewed as models' biases on statistical cues and surface patterns. In order to show they are addressable, we apply a simple statistics-based debiasing method [12]. Specifically, we use the class distributions of words in the training set as bias prior distributions. This reduces the training signals from words whose surface forms are very likely to be entities (or non-entities), mitigating models' bias on class distributions and name regularity. In experiments, we demonstrate our debiasing method consistently improves the generalization to synonyms, new concepts, and entities with unique forms including COVID-19.
To sum up, we make the following contributions:
• We first define memorization, synonym generalization, and concept generalization and systematically investigate existing BioNER models in this regard.
• We raise the overestimation issue in terms of BioNER models' generalizability to unseen mentions and provide empirical evidence to support our claim.
• We identify two types of bias as the main difficulty in generalization in BioNER and show that they are addressable using a current debiasing method.

A. PARTITIONING BENCHMARKS
We describe how we partition benchmarks. Several BioNER datasets provide entity mentions and also CUIs that link the entity mentions to their corresponding biomedical concepts in databases. We utilize overlaps in mentions and CUIs between training and test sets in the partitioning process. Let $(x_n, e_n, c_n)$ be the $n$-th data example of a total of $N$ examples in a test set. $x_n$ is the $n$-th sentence, $e_n = [e_{(n,1)}, \dots, e_{(n,T_n)}]$ is a list of entity mentions, and $c_n = [c_{(n,1)}, \dots, c_{(n,T_n)}]$ is a list of CUIs, where $T_n$ is the number of the entity mentions (or CUIs) in the sentence. We partition all mentions $e_{(n,t)}$ in the original test set into three splits as follows:
• $\mathrm{Mem} := \{e_{(n,t)} : e_{(n,t)} \in E_{\mathrm{train}},\ c_{(n,t)} \in C_{\mathrm{train}}\}$
• $\mathrm{Syn} := \{e_{(n,t)} : e_{(n,t)} \notin E_{\mathrm{train}},\ c_{(n,t)} \in C_{\mathrm{train}}\}$
• $\mathrm{Con} := \{e_{(n,t)} : e_{(n,t)} \notin E_{\mathrm{train}},\ c_{(n,t)} \notin C_{\mathrm{train}}\}$
where $E_{\mathrm{train}}$ is the set of all entity mentions in the training set, and $C_{\mathrm{train}}$ is the set of all CUIs in the training set. We describe the partitioning process in detail in the Appendix.
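The three set definitions above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each mention is a (surface form, CUI) pair with a single CUI, the function name `partition_mentions` and the identifiers `CUI_A`, `CUI_B`, `CUI_C` are hypothetical, and mentions that are seen in training but carry an unseen CUI are set aside because that case is covered by the rules in the Appendix.

```python
from typing import Dict, List, Tuple

Mention = Tuple[str, str]  # (surface form, CUI); hypothetical representation


def partition_mentions(train: List[Mention], test: List[Mention]) -> Dict[str, List[Mention]]:
    """Split test mentions into Mem / Syn / Con by overlap with the training set."""
    e_train = {surface for surface, _ in train}  # E_train: all training surface forms
    c_train = {cui for _, cui in train}          # C_train: all training CUIs
    splits: Dict[str, List[Mention]] = {"Mem": [], "Syn": [], "Con": [], "unassigned": []}
    for surface, cui in test:
        if surface in e_train and cui in c_train:
            splits["Mem"].append((surface, cui))
        elif surface not in e_train and cui in c_train:
            splits["Syn"].append((surface, cui))
        elif surface not in e_train and cui not in c_train:
            splits["Con"].append((surface, cui))
        else:
            # Seen surface form with an unseen CUI: not covered by the three
            # definitions; kept separate in this sketch.
            splits["unassigned"].append((surface, cui))
    return splits


# Toy data with made-up identifiers.
train = [("ibuprofen", "CUI_A"), ("breast cancer", "CUI_B")]
test = [("ibuprofen", "CUI_A"), ("motrin", "CUI_A"), ("covid-19", "CUI_C")]
splits = partition_mentions(train, test)
print({name: [s for s, _ in items] for name, items in splits.items()})
```

Here "motrin" lands in Syn because its concept was seen under a different name, while "covid-19" lands in Con because neither its surface form nor its concept appears in training.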

B. DATASETS
We use two popular BioNER benchmarks with CUIs to systematically investigate models' memorization, synonym generalization, and concept generalization abilities. Additionally, we automatically construct a dataset consisting of the novel entity COVID-19.

1) NCBI-disease
The NCBI-disease corpus [8] is a collection of 793 PubMed articles with manually annotated disease mentions and the corresponding concepts in Medical Subject Headings (MeSH) or Online Mendelian Inheritance in Man (OMIM).

2) BC5CDR
The BC5CDR corpus [11] was proposed for disease name recognition and chemical-induced disease (CID) relation extraction tasks. The corpus consists of 1,500 PubMed articles with manually annotated disease and chemical mentions and the corresponding concepts in MeSH. We denote the disease-type dataset as BC5CDR dis and the chemical-type dataset as BC5CDR chem .

3) COVID-19
We construct a dataset to see if a model trained on current benchmarks can identify the newly emerging biomedical concept COVID-19. We sampled 5,000 sentences containing "COVID-19" from the entire PubMed abstracts through March 2021 and annotated all COVID-19 occurrences in the sentences, which results in 5,237 labels. Note that only the exact term "COVID-19" was considered, and synonyms for COVID-19 were not considered in this dataset creation process. Table 1 shows the statistics of the splits of the benchmarks. We found that a significant portion of the benchmarks correspond to Mem, implying that current BioNER benchmarks are highly skewed to memorizable mentions. In Section III, we discuss the overestimation problem that such overrepresentation of memorizable mentions may cause.

III. GENERALIZABILITY OF BIONER MODELS
This section describes baseline models and evaluation metrics and analyzes the three recognition abilities of the models.

A. BASELINE MODELS
We use four current best neural net-based models and two traditional dictionary-based models as our baseline models. See the Appendix for implementation details.

1) Neural Models
We use BioBERT [7], BlueBERT [14], and PubMedBERT [13]. The models are all pretrained language models (PLMs) for the biomedical domain, with similar architectures. They differ in their vocabularies, weight initialization, and training corpora. See the Appendix for more details. Also, we use BERT [10] to compare general and domain-specific PLMs in terms of generalization in BioNER.

2) Dictionary Models
Traditional approaches in the field of BioNER are based on pre-defined dictionaries [15]. To compare the generalization abilities of traditional and recent approaches, we set two types of simple dictionary-based extractors as baseline models. DICT train uses all the entity mentions in a training set (i.e., E train ) as a dictionary and classifies text spans as entities when the dictionary includes the spans. If candidate spans overlap, the longest one is selected. DICT syn expands the dictionary to use entity mentions in the training set as well as their synonyms, which are pre-defined in biomedical databases.
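A dictionary-based extractor of this kind can be sketched as follows. The sketch assumes whitespace tokenization, case-insensitive matching, and a cap on span length (`max_len`), none of which are specified above; the function name `dict_extract` is hypothetical. Overlaps are resolved by keeping the longest candidate span first, which is one straightforward reading of the selection rule.

```python
from typing import Iterable, List, Tuple


def dict_extract(tokens: List[str], dictionary: Iterable[str], max_len: int = 10) -> List[Tuple[int, int]]:
    """Return non-overlapping (start, end) token spans found in the dictionary.

    When candidate spans overlap, the longest one is kept.
    """
    entries = {tuple(entry.lower().split()) for entry in dictionary}
    lowered = [t.lower() for t in tokens]

    # Enumerate every span (up to max_len tokens) that is a dictionary entry.
    candidates = [
        (i, j)
        for i in range(len(lowered))
        for j in range(i + 1, min(len(lowered), i + max_len) + 1)
        if tuple(lowered[i:j]) in entries
    ]

    # Longest spans first; ties are broken by the earlier start position.
    candidates.sort(key=lambda span: (-(span[1] - span[0]), span[0]))

    taken = [False] * len(tokens)
    selected = []
    for start, end in candidates:
        if not any(taken[start:end]):
            selected.append((start, end))
            for k in range(start, end):
                taken[k] = True
    return sorted(selected)


tokens = "patients with generalized seizures and seizures".split()
spans = dict_extract(tokens, {"seizures", "generalized seizures"})
print([" ".join(tokens[s:e]) for s, e in spans])
```

With both "seizures" and "generalized seizures" in the dictionary, the extractor returns the longer phrase where the two overlap and the single word where only it matches.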

B. METRICS
Following conventional evaluation metrics in BioNER, we use the precision (P), recall (R), and F1 score (F1) at an entity level to measure overall performance [16]. We only use recall when evaluating three recognition abilities (i.e., Mem, Syn, and Con) since it is impossible to classify false positives into each recognition type. For COVID-19, we use a relaxed version of recall: if "COVID-19" is contained in the predicted spans, we classify this prediction as a true positive. Table 2 shows the performance of the baseline models.
BioBERT outperforms other baseline models on NCBI-disease based on overall performance. For the BC5CDR corpus, PubMedBERT is the best performing model. BERT performs worse than domain-specific PLMs, but is far superior to dictionary models. DICT syn outperforms DICT train in recall due to its larger biomedical dictionary, but the precision scores decrease in general. Note that the performance of DICT syn on Mem is lower than that of DICT train as there exists annotation inconsistency between benchmarks and biomedical databases. We elaborate on this in the Appendix. Memorization can be easily obtained compared to the other two abilities. Although the dictionary models are the simplest types of BioNER models without learnable parameters, they work well on Mem. The degree of difficulty in recognizing synonyms and new concepts varies from dataset to dataset. The models' performance on Syn is lower than that on Con for BC5CDR dis , but the opposite holds for BC5CDR chem .

2) Overestimation of Models
The neural models perform well on Mem, but they achieved relatively low performance on Syn and Con across all benchmarks. For instance, BioBERT achieved 93.3% recall on Mem, but only 74.9% and 73.7% recall on Syn and Con, respectively. Also, the neural models perform poorly on COVID-19 despite their high F1 scores. BioBERT performed the best, but the score is only 45.7% recall. Even more surprisingly, all the models hardly identify COVID-19 when trained on BC5CDR dis . To sum up, current BioNER models have limitations in their generalizability.
As shown in Table 1, a large number of entity mentions in existing BioNER benchmarks are included in Mem. This overrepresentation of memorizable mentions can lead to an overestimation of the generalization abilities of models. One may believe that a model has high generalization ability because of its high performance on benchmarks, but the model may simply be highly fit to memorizable mentions. Taking these results into account, we would like to emphasize that researchers should be wary of falling into the trap of overall performance and mistaking a model's high performance for generalization ability at validation and inference time.

3) Effect of Domain-specific Pretraining
Domain-specific PLMs consistently outperform BERT on Syn and Con. These results show that pretraining on domain-specific corpora mostly affects synonym generalization and concept generalization. On the other hand, BERT and domain-specific PLMs achieve similar performance on Mem because memorization does not require much domain-specific knowledge and the models have the same architecture and capacity.
In particular, we find the gap in performance between BERT and domain-specific PLMs is drastic in the generalization ability to abbreviations. Table 3 shows that neural models' performances on abbreviations on the Con splits of NCBI-

IV. ANALYSIS
In this section, we analyze which factors make the generalization to unseen biomedical names difficult based on failures of models on (1) Syn and Con splits, and (2) COVID-19. For simplicity, we will focus on only BERT and BioBERT.

A. DATASET BIAS
We qualitatively analyze the error cases of BioBERT by sampling a total of 100 incorrect predictions from the Syn and Con splits of BC5CDR dis . As a result, we found 36% of the error cases occur because the model tends to rely on statistical cues in the dataset and make biased predictions. Table 4 shows the examples of the biased predictions.
In the first example, the model failed to extract the whole phrase "acute encephalopathy." All the words "encephalopathy" in the training set are labeled as "B," so the model classified the word as "B," resulting in an incorrect prediction. In the second example, there are four entity mentions: two mentions are full names "anterior infarction" and "inferior infarction," and the others are their corresponding abbreviations "ANT-MI" and "INF-MI." As the abbreviations are enclosed in parentheses after the full names, it should be easy for a model to identify the abbreviations in general if the model can extract the full names. Interestingly, although BioBERT correctly predicted the full names in the example, it failed to recognize their abbreviations. This is because "MI" is only labeled as "B" in the training set, and so the model was convinced that "MI" is only associated with the label "B." In the last example, about 73% of the words "defects" are labeled as "I" in the training set as components of entity mentions such as birth defects and atrial septal defects. However, the word "epithelial" is only labeled as "O," so the model did not predict the phrase "epithelial defects" as an entity. From these observations, we hypothesize that BioNER models are biased to class distributions in datasets. Specifically, models tend to over-rely on the class distributions of each word in the training set, causing the models to fail when the class distribution shifts in the test set.

B. WEAK NAME REGULARITY
Name regularity refers to patterns in the surface forms of entities [18], [19]. For example, many disease names have patterns such as " disease" and " syndrome." These patterns are regarded as useful features for identifying unseen mentions and are often implemented in NER systems after being handcrafted. However, little analysis has been done on the difficulties a model can face when extracting novel entities that do not have common name patterns such as COVID-19. In this section, we hypothesize that the cause of models' failure to recognize COVID-19 is its rare morphology and perform detailed analyses to support the hypothesis.

1) Cause of Failing to Recognize COVID-19
We have already confirmed in Table 2 that models fail to recognize COVID-19. To see if this failure is due to the rare surface form of COVID-19, we replace all occurrences of "COVID-19" in the COVID-19 dataset with the more disease-like mention "COVID," while maintaining context. Interestingly, BioBERT can recognize the entity well after the replacement, as shown in Table 5.
Next, we train models with entity mentions having similar surface forms to COVID-19 and see how the performance changes on COVID-19. First, we randomly generate 3-5 capital letters and 1-3 numbers. We then combine the generated letters and numbers using the pattern "Abbreviation-Number" and create pseudo entities such as IST-5, CHF-113, and SRS-3517. We randomly select 1 or 10 entity mentions in the training set that are abbreviations and replace them with different pseudo entities. We then train BioBERT on the modified training set and test how well the model recognizes COVID-19. As shown in Table 6, augmenting COVID-19-like name patterns improves the ability to recognize COVID-19.
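The augmentation step can be sketched as below. The length ranges follow the text (3-5 capital letters, 1-3 digits); the function names, the flat list of mentions, and the way abbreviation positions are supplied are illustrative assumptions, since the text does not say how abbreviations are detected or how the replaced mentions are mapped back into sentences.

```python
import random
import string
from typing import List


def make_pseudo_entity(rng: random.Random,
                       min_letters: int = 3, max_letters: int = 5,
                       min_digits: int = 1, max_digits: int = 3) -> str:
    """Build one "Abbreviation-Number" pseudo entity, e.g. of the form ABC-12."""
    letters = "".join(rng.choice(string.ascii_uppercase)
                      for _ in range(rng.randint(min_letters, max_letters)))
    digits = "".join(rng.choice(string.digits)
                     for _ in range(rng.randint(min_digits, max_digits)))
    return f"{letters}-{digits}"


def replace_with_pseudo_entities(mentions: List[str],
                                 abbreviation_indices: List[int],
                                 k: int, seed: int = 0) -> List[str]:
    """Replace k randomly chosen abbreviation mentions with pseudo entities."""
    rng = random.Random(seed)
    chosen = rng.sample(abbreviation_indices, k)
    augmented = list(mentions)
    for idx in chosen:
        augmented[idx] = make_pseudo_entity(rng)
    return augmented


# Toy training mentions; positions 0, 2 and 3 are assumed to be abbreviations.
mentions = ["DMD", "breast cancer", "ALS", "CHF"]
augmented = replace_with_pseudo_entities(mentions, [0, 2, 3], k=1, seed=0)
print(augmented)
```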
Note that low performance on COVID-19 is not due to lack of sufficient context. Models fail even if there is enough information in the context to determine whether COVID-19 is a disease, e.g., "treatment of COVID-19 patients with hypoxia" and "The 2019 novel coronavirus pneumonia (COVID-19) is an ongoing global pandemic with a worldwide death toll." Also, the small number of training data is not the cause for the failure. We trained BioBERT on the MedMentions corpus [20], which contains several times more disease mentions than NCBI-disease and BC5CDR dis , but the model extracted only 12.7% of COVID-19. From these observations, we conclude that the biggest difficulty in recognizing COVID-19 is the generalization to a novel surface form.

2) Comparison of NCBI-disease and BC5CDR
When trained on NCBI-disease and BC5CDR dis , the gap in performance between the models on COVID-19 is remarkable. This can be caused by three factors. First, the BC5CDR corpus contains a number of chemical mentions with the pattern "{Abbreviation}-{Number}" such as "MK-486" and "FLA-63," so models may wrongly infer that the pattern indicates a chemical type rather than a disease type. Second, NCBI-disease contains several times more abbreviations than BC5CDR dis in the training set, which could help generalization to COVID-19, which is also an abbreviation. Lastly, NCBI-disease has the entity "EA-2" in the training set with a similar pattern to COVID-19, while BC5CDR dis does not have any disease entity with the pattern. Replacing "EA-2" with "EA" dramatically decreases the performance of BioBERT from 45.7 to 11.2, which supports our claim.

C. DEBIASING METHOD
We hypothesize BioNER models tend to rely on class distributions and name regularity experienced during training, making it difficult to generalize to unseen entities, especially entities with rare patterns (e.g., COVID-19). To support our hypothesis and see if such bias can be handled, we adopt a bias product method [21], which is a kind of debiasing method effective in

1) Formulation
Bias product [21] trains an original model using a biased model such that the original model does not learn much from spurious cues. Let $p_{(n,i)} \in \mathbb{R}^K$ be the probability distribution over $K$ target classes of the original model at the $i$-th word in the $n$-th sentence, and $b_{(n,i)} \in \mathbb{R}^K$ be that of the biased model. We add $\log(p_{(n,i)})$ and $\log(b_{(n,i)})$ element-wise, and then calculate a new probability distribution $\hat{p}_{(n,i)} \in \mathbb{R}^K$ by applying the softmax function over $K$ classes as follows:
$$\hat{p}_{(n,i)} = \mathrm{softmax}\left(\log(p_{(n,i)}) + \log(b_{(n,i)})\right). \tag{1}$$
We minimize the negative log-likelihood between the combined probability distribution $\hat{p}_{(n,i)}$ and the ground-truth label. This assigns low training signals to words with highly skewed class distributions. As a result, it prevents the original model from being biased towards statistical cues in datasets. Note that only the original model is updated, and the biased model is fixed during training. At inference, we use only the probability distribution of the original model, $p_{(n,i)}$.
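The effect of Eq. (1) on the training signal can be illustrated numerically. The snippet below is a standard-library sketch for a single word, not the training code used in the experiments: the function names and the small constant `eps` (added to avoid log(0)) are assumptions, and in practice the same computation would be applied to model outputs inside a deep learning framework.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bias_product_nll(p: List[float], b: List[float], gold: int,
                     eps: float = 1e-12) -> Tuple[float, List[float]]:
    """Combine model distribution p with bias distribution b as in Eq. (1)
    and return (negative log-likelihood of the gold class, combined distribution)."""
    combined = softmax([math.log(pi + eps) + math.log(bi + eps) for pi, bi in zip(p, b)])
    return -math.log(combined[gold]), combined


p = [0.5, 0.3, 0.2]                 # original model's prediction for one word
uniform_bias = [1 / 3, 1 / 3, 1 / 3]
skewed_bias = [0.98, 0.01, 0.01]    # word almost always carries class 0 in training

loss_uniform, combined_uniform = bias_product_nll(p, uniform_bias, gold=0)
loss_skewed, combined_skewed = bias_product_nll(p, skewed_bias, gold=0)
print(round(loss_uniform, 4), round(loss_skewed, 4))
```

With a uniform bias the combined distribution equals `p`, so the loss is the ordinary negative log-likelihood. With a bias that is already confident about the gold class, the combined distribution is close to correct regardless of `p`, so the loss (and hence the gradient reaching the original model) is much smaller.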
In previous works, biased models are usually pretrained neural networks using hand-crafted features as input [21]-[24]. On the other hand, [12] used data statistics as the probability distributions of the biased model, which is computationally efficient and performs well. Similarly, we calculate the class distribution of each word using training sets, and then use the statistics. The probability that our biased model predicts the $k$-th class is defined as follows:
$$b^{k}_{(n,i)} = \frac{\sum_{m=1}^{N} \sum_{j=1}^{L_m} \mathbb{1}\left(x_{(m,j)} = x_{(n,i)}\right) \, y^{k}_{(m,j)}}{\sum_{m=1}^{N} \sum_{j=1}^{L_m} \mathbb{1}\left(x_{(m,j)} = x_{(n,i)}\right)}, \tag{2}$$
where $N$ is the number of sentences in the training set, $L_m$ is the length of the $m$-th sentence, $x_{(n,i)}$ is the $i$-th word in the $n$-th sentence, and $\mathbb{1}(\cdot)$ is the indicator function. If the ground-truth label of the word $x_{(n,i)}$ is the $k$-th class, $y^{k}_{(n,i)} = 1$, otherwise 0.
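The per-word class statistics amount to counting, for every word in the training set, how often it carries each label. A minimal sketch follows; the function name, the B/I/O label order, the case-sensitive matching, and the toy sentences are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def word_class_distributions(sentences: List[List[str]],
                             labels: List[List[str]],
                             classes: Sequence[str] = ("B", "I", "O")) -> Dict[str, List[float]]:
    """For each training word, return the fraction of its occurrences per class."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for words, tags in zip(sentences, labels):
        for word, tag in zip(words, tags):
            counts[word][tag] += 1
    distributions = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        distributions[word] = [tag_counts[c] / total for c in classes]
    return distributions


# Toy training data (not taken from any benchmark).
sentences = [
    ["birth", "defects", "were", "seen"],
    ["atrial", "septal", "defects"],
    ["congenital", "defects"],
    ["no", "defects", "in", "MI"],
]
labels = [
    ["B", "I", "O", "O"],
    ["B", "I", "I"],
    ["B", "I"],
    ["O", "O", "O", "B"],
]
bias = word_class_distributions(sentences, labels)
print(bias["defects"], bias["MI"])
```

In this toy set "defects" is labeled "I" in three of its four occurrences and "MI" is always "B", so their bias distributions are strongly skewed, which is the situation the debiasing method is meant to down-weight.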

2) Effect of Debiasing
We explore how the debiasing method affects models' generalization abilities. Table 7 shows models' performance changes after applying the debiasing method. The method decreases memorization because it reduces models' bias towards memorizable mentions and their class distributions. On the other hand, the method consistently improves the performance on Syn and Con on the benchmarks. Debiasing methods usually decrease overall performance on benchmarks [22], [24], which is consistent with our results. With recent efforts to reduce bias while maintaining overall performance [24], our debiasing method could be improved in future work. Also, the debiasing method changes the model's behavior and corrects the errors in the first and third examples in Table 4.
We also see if our debiasing method can improve the generalizability to entities with weak name regularity. Before testing the method, we crawled a list of rare diseases and their descriptions from the NORD (National Organization for Rare Disorders) database based on our hypothesis that rare diseases tend to have more unique surface forms than common diseases. Disease names were filtered out if BioBERT trained on NCBI-disease successfully extracted them based on the descriptions. Since descriptions provide sufficient context to recognize entities, e.g., "African iron overload is a rare disorder characterized abnormally elevated levels of iron in the body," an entity's surface form would be rare if a model fails to recognize the entity from the description. Thus, we assumed that the remaining diseases after filtering have weak name regularity. Finally, we obtained 8 diseases from the database and collected PubMed abstracts in which the diseases appear. Table 8 shows the list of diseases and their frequency of occurrence. All diseases are different from conventional patterns, and their CUIs are unseen based on the NCBI-disease training data. We tested our debiasing model on the diseases along with COVID-19 using the same relaxed version of recall as for COVID-19. As a result, our debiasing method consistently improved the generalization to rare patterns as shown in Table 8.
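The relaxed recall used for COVID-19 and the rare diseases counts a target occurrence as recognized whenever some predicted span contains it, even if the predicted boundaries are wider. A possible sketch, assuming spans are given as half-open (start, end) offsets per sentence (the function name and this representation are illustrative assumptions):

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open (start, end) offsets within a sentence


def relaxed_recall(gold_spans: List[List[Span]], pred_spans: List[List[Span]]) -> float:
    """Fraction of gold target occurrences fully contained in some predicted span."""
    hits = 0
    total = 0
    for golds, preds in zip(gold_spans, pred_spans):
        for g_start, g_end in golds:
            total += 1
            # A prediction counts if it covers the whole target occurrence.
            if any(p_start <= g_start and g_end <= p_end for p_start, p_end in preds):
                hits += 1
    return hits / total if total else 0.0


# Sentence 1: the target is token 2 and the model predicts tokens 2-3 (a wider span).
# Sentence 2: the target is token 0 and the model predicts nothing.
gold = [[(2, 3)], [(0, 1)]]
pred = [[(2, 4)], []]
score = relaxed_recall(gold, pred)
print(score)
```

Under exact-match recall the first prediction would be an error because its boundaries differ from the gold span; the relaxed version credits it, so the score isolates whether the target was detected at all.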

3) Side Effects of Debiasing
Our debiasing method prevents models from overtrusting the class distributions and surface forms of mentions, which sometimes makes the models predict spans of text that have never appeared in the training set as entities. Although this exploration of debiased models helps find unseen mentions as shown in Table 7 and Table 8, it comes with some side effects. To analyze this, we sample 100 cases from the test set of BC5CDR dis that an original BioBERT model predicted correctly, but a debiased one failed. Among all cases, we find 23 abnormal predictions of the debiased model and classify them into three categories as shown in Table 9. The most frequent type is predicting spans that are not noun phrases. As shown in the first example in the table, although "Loss of" is an incomplete phrase, the model predicted it. Also, the model predicted the word "infarcted" as an entity although the word is an adjective and is only labeled as "O" in the training set. The second type is related to name regularity. We found that the model sometimes excluded strong patterns from its predictions. For instance, as shown in Table 9, the model predicted entities without "syndrome" and "injury". When using the debiasing method, there can be a trade-off between performance for entities with weak name regularity and those with strong name regularity. Lastly, the model occasionally predicts special symbols. As shown in the last row of the table, the model predicted the word "sarcomas" with a comma. The model also recognized parentheses as entities. From these results, we conclude that the debiasing method can lead to abnormal predictions by encouraging models to predict rare (or never appeared) classes of words and spans during training.

V. RELATED WORK

A. BIONER MODELS
In recent years, BioNER has received significant attention for its potential applicability to various downstream tasks in biomedical information extraction. Traditional methods in BioNER are based on hand-crafted rules [25]-[27] or biomedical dictionaries [28], [29]. However, these methods require the knowledge and labour of domain experts and are also vulnerable to unseen entity mentions. With the development of deep learning and the advent of large training data, researchers shifted their attention to neural models [4], [30], which are based on recurrent neural networks (RNNs) with conditional random fields (CRFs) [31]. These models automatically learn useful features in datasets without the need for human labour and achieve competent performance in BioNER. The performance of BioNER models has been further improved with the introduction of multi-task learning on multiple biomedical corpora [5], [6], [32]. Several works demonstrated the effectiveness of jointly learning the BioNER task and other biomedical NLP tasks [33]-[36]. Recently, pretrained language models such as BioBERT achieved SOTA results in many tasks such as relation extraction and question answering, and also in BioNER [7], [13], [14].

B. GENERALIZATION TO UNSEEN MENTIONS
Generalization to unseen mentions has been an important research topic in the field of NER [37]-[39]. Despite recent attempts to analyze the generalization of NER models in the general domain [18], [40]-[42], there are few studies in the biomedical domain. Several studies investigated transferability of BioNER models across datasets [43], [44]. On the other hand, we study the generalization to new and unseen mentions based on our new data partitioning method. Note that they did not split benchmarks and evaluated models based on overall performance, so our method can be applied to their experimental setups in future work.

C. DATASET BIAS
While many recent studies pointed out dataset bias problems in various NLP tasks such as sentence classification [45]-[47] and visual question answering [48], no prior work has raised bias problems regarding BioNER benchmarks. Our work is the first to deal with dataset bias in BioNER and to demonstrate the effectiveness of the debiasing method. Recent works found that low label consistency (the degree of label agreement of an entity on the training set) decreases the performance of models on general NER benchmarks [40], [41]. In this work, we show that high label consistency can also harm generalization when the label distribution of the test set is different from that of the training set.

VI. CONCLUSION
In this work, we thoroughly explored the memorization, synonym generalization, and concept generalization abilities of existing BioNER models. We found current best NER models are overestimated, tend to rely on dataset biases, and have difficulty recognizing entities with novel surface patterns. Finally, we showed that the generalizability can be improved using a current debiasing method. We hope that our work can provide insight into the generalization abilities of BioNER models and new directions for future work.

APPENDIX. DETAILS IN PARTITIONING BENCHMARKS
We classify the set of mentions $\{e_{(n,t)} : e_{(n,t)} \in E_{\mathrm{train}},\ c_{(n,t)} \notin C_{\mathrm{train}}\}$ into Mem for single-type datasets (e.g., NCBI-disease and CDR). If a dataset is multi-type (e.g., MedMentions), we classify the mentions into Con. Since there are entity mentions that are mapped to more than one CUI, $c_{(n,t)}$ does not have to be a single CUI and may be a list of CUIs. In this case, we classify the mentions into Con if none of the CUIs in the list is included in $C_{\mathrm{train}}$, and otherwise into Syn. We classify mentions with the unknown CUI "-1" into Con because unknown concepts in the training and test sets are usually different. We lowercase mentions and remove punctuation in them when partitioning benchmarks.
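These additional rules can be combined into a single classification helper, sketched below. The names `normalize` and `classify` and the CUIs are hypothetical. One point is an interpretation rather than something stated above: the unknown CUI "-1" is treated as never belonging to the training CUIs, so a mention whose surface form was seen in training still falls under the single-type rule.

```python
import string
from typing import Iterable, List, Set

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(mention: str) -> str:
    """Lowercase a mention and strip ASCII punctuation before comparison."""
    return mention.lower().translate(_PUNCT_TABLE)


def classify(mention: str, cuis: Iterable[str],
             e_train: Set[str], c_train: Set[str],
             single_type: bool = True) -> str:
    """Assign a test mention to Mem, Syn, or Con.

    e_train must already contain normalized training mentions.
    """
    known: List[str] = [c for c in cuis if c != "-1"]   # "-1" marks an unknown concept
    seen_mention = normalize(mention) in e_train
    seen_concept = any(c in c_train for c in known)     # at least one CUI seen in training

    if seen_mention and (seen_concept or single_type):
        return "Mem"   # seen surface form (unseen CUI tolerated for single-type data)
    if not seen_concept:
        return "Con"   # no listed CUI appears in training
    return "Syn"       # unseen surface form of a seen concept


e_train = {normalize("Breast Cancer")}
c_train = {"CUI_A"}
print(classify("Breast cancer.", ["CUI_A"], e_train, c_train))
print(classify("mammary carcinoma", ["CUI_A", "CUI_X"], e_train, c_train))
print(classify("mammary carcinoma", ["CUI_X", "CUI_Y"], e_train, c_train))
```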

APPENDIX. MODEL COMPARISON
Our neural baseline models (i.e., BERT, BioBERT, BlueBERT, and PubMedBERT) have the same model architecture, a Transformer-based encoder [49] with a linear classifier. They differ in vocabulary, initialization method, and training corpus during pre-training, as summarized in Table 10. First, BERT is trained on Wikipedia and the BookCorpus [50] from scratch using the vocabulary within the corpora. BioBERT and BlueBERT are initialized with BERT's weights and further trained on PubMed articles. Additionally, BlueBERT is trained on the MIMIC-III corpus, which consists of clinical notes. PubMedBERT is also trained on the PubMed corpus, but it is trained from scratch with the PubMed vocabulary.

Model            Vocab.      Init.  Corpus
PubMedBERT [13]  PubMed      -      PubMed
BlueBERT [14]    Wiki+Books  BERT   PubMed+MIMIC
BioBERT [7]      Wiki+Books  BERT   PubMed
BERT [10]        Wiki+Books  -      Wiki+Books

for BlueBERT, 7 and the BiomedNLP-PubMedBERT-base-uncased-abstract model for PubMedBERT. 8 The maximum input sequence length is set to 128. Sentences longer than 128 are divided into multiple sentences at the preprocessing stage. We trained and tested our models on a single Quadro RTX 8000 GPU. For our synonym dictionaries, we used the July 2012 version of MEDIC [51] and the November 2019 version of CTD (Comparative Toxicogenomics Database), provided by Sung et al. [9].

For all models, we used a batch size of 64 and searched the learning rate over {1e-5, 3e-5, 5e-5}. For our debiasing method, we smooth the probability distribution of the biased model using temperature scaling [52], since excessive penalties for bias can hinder the learning process. We searched the scaling parameter over {none, 1.1}, where none indicates that temperature scaling is not applied. We chose the best hyperparameters based on the F1 score on the development set. The selected hyperparameters are described in Table 11. Note that all results are averaged over 5 runs with randomly selected seeds.
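As an illustration of the smoothing step, temperature scaling of a probability distribution can be sketched as follows. This is a sketch under assumptions: the function name `temperature_scale` and the toy distribution are made up for the example, and raising each probability to the power 1/T before renormalizing is used here as the equivalent of dividing the logits by T before the softmax. How the smoothed distribution enters the debiasing loss is not shown.

```python
def temperature_scale(probs, temperature=None):
    """Smooth a probability distribution with temperature scaling.

    probs       -- sequence of probabilities that sum to 1
    temperature -- T > 1 flattens the distribution; None leaves it unchanged
                   (the "none" setting in the hyperparameter search)
    """
    if temperature is None:
        return list(probs)
    # p_i ** (1 / T), renormalized, equals softmax(logits / T).
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


# Toy label distribution predicted by a biased model (made-up numbers).
biased = [0.90, 0.05, 0.05]
smoothed = temperature_scale(biased, temperature=1.1)
print(smoothed)
```

With T = 1.1, the most confident probability decreases slightly and the remaining mass is spread over the other labels, which softens the penalty derived from the biased model.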
The original BioBERT model [7] was trained not only on the training set but also on the development set, after the best hyperparameters were chosen based on the development set. This approach generally improves performance when the number of training examples is insufficient and is commonly used in BioNER studies. In contrast, we did not use the development set for training, which results in lower performance of BERT and BioBERT than that reported by Lee et al. [7].

APPENDIX. ANNOTATION INCONSISTENCY IN BIOMEDICAL DATABASES
As shown in Table 2, the performance of DICT syn on Mem is lower than that of DICT train because of annotation inconsistency between the benchmarks and the biomedical databases. For example, "seizures" and "generalized seizures" are entities with the same concept in the databases, so the dictionary of DICT syn includes both "seizures" and "generalized seizures." However, in BC5CDR dis only "seizures" is annotated. Since dictionary models predict the longest text spans that are in their dictionaries, DICT syn predicts "generalized seizures," resulting in an incorrect prediction. In addition, dictionary models cannot generalize to new concepts, yet DICT syn achieved a recall of 1.4 on Con of BC5CDR dis. This is also due to annotation inconsistency, i.e., there are mentions with the same surface forms but different CUIs.
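The effect of longest-span matching can be reproduced with a small sketch. The function `longest_match`, its `max_len` parameter, and the example sentence are assumptions made for illustration and are not the implementation of our dictionary models.

```python
def longest_match(tokens, dictionary, max_len=5):
    """Greedy left-to-right dictionary tagger that prefers the longest span.

    tokens     -- list of word tokens
    dictionary -- set of lowercased entity names
    max_len    -- maximum span length (in tokens) to try
    Returns a list of (start, end) token spans, with end exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate span first, then shorter ones.
        for j in range(min(len(tokens), i + max_len), i, -1):
            if " ".join(tokens[i:j]).lower() in dictionary:
                match_end = j
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end))
            i = match_end
    return spans


tokens = "The patient had generalized seizures".split()
dict_train = {"seizures"}                         # names seen in training
dict_syn = {"seizures", "generalized seizures"}   # expanded with synonyms

print(longest_match(tokens, dict_train))  # [(4, 5)]
print(longest_match(tokens, dict_syn))    # [(3, 5)]
```

If the gold annotation covers only "seizures", the span predicted with the expanded dictionary is counted as an error even though both names refer to the same concept.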

APPENDIX. TOKENIZATION ISSUE
Following Lee et al. [7], we split words into subwords based on punctuation. For example, "COVID-19" is split into three words: "COVID", "-," and "19". This tokenization makes it easy to deal with nested entities. If "SARS-CoV" were not split into subwords, a model could not detect "SARS" as a disease. However, this tokenization is not optimal for predicting the whole word "SARS-CoV" as a virus.
To see whether the dramatically low performance is due to the tokenization issue, we preprocessed "COVID-19" as a single word and tested BioBERT on it. As a result, the performance of BioBERT improved from 45.7 to 54.1 and from 3.4 to 15.4 when trained on NCBI-disease and BC5CDR dis , respectively. Although the change in tokenization clearly boosts performance, the performance improvements observed in Table 5 and Table 6 are not explained by the tokenization issue alone. The main reason for the failure to recognize COVID-19 is that models are vulnerable to unique name patterns.
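The two tokenization settings compared above can be sketched as follows. The function `tokenize`, the `keep_whole` argument, and the regular expression are assumptions made for illustration; the actual preprocessing scripts may differ in how they treat whitespace and non-ASCII characters.

```python
import re

# Runs of letters/digits, or a single punctuation character.
_PIECES = re.compile(r"[^\W_]+|[^\w\s]|_")


def tokenize(text, keep_whole=frozenset()):
    """Split text on whitespace, then split each word at punctuation.

    Words listed in keep_whole are kept as single tokens, mirroring the
    experiment in which "COVID-19" is preprocessed as one word.
    """
    tokens = []
    for word in text.split():
        if word in keep_whole:
            tokens.append(word)
        else:
            tokens.extend(_PIECES.findall(word))
    return tokens


print(tokenize("COVID-19"))                           # ['COVID', '-', '19']
print(tokenize("SARS-CoV"))                           # ['SARS', '-', 'CoV']
print(tokenize("COVID-19", keep_whole={"COVID-19"}))  # ['COVID-19']
```

With the default setting, "SARS" is available as a separate token and can be labeled as a nested disease mention, whereas keeping a name whole exposes its full surface pattern to the model as a single unit.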