Few-shot Named Entity Recognition with Joint Token and Sentence Awareness

ABSTRACT Few-shot learning has rapidly emerged as a viable means of completing various tasks, and few-shot models have recently been applied to Named Entity Recognition (NER). Prototypical networks show high efficiency on few-shot NER. However, existing prototypical methods only consider the similarity of tokens in the query and support sets and ignore the semantic similarity among the sentences which contain these entities. We present a novel model, Few-shot Named Entity Recognition with Joint Token and Sentence Awareness (JTSA), to address this issue. Sentence awareness is introduced to probe the semantic similarity among sentences, while token awareness is used to explore the similarity of tokens. To further improve the robustness and results of the model, we adopt a joint learning scheme for few-shot NER. Experimental results demonstrate that our model outperforms state-of-the-art models on two standard few-shot NER datasets.


INTRODUCTION
Few-shot learning (FSL) can reduce the burden of annotated data and quickly generalize to new tasks without training from scratch (usually with only one or five examples per category). Few-shot learning has made remarkable progress in many areas, such as computer vision (CV) [1,2,3] and relation classification (RC) [4,5,6,7,8]. But FSL progress is much slower in named entity recognition (NER), mainly because entity recognition, a token-level classification task, is more fine-grained and complicated than sentence-level classification. Among current few-shot models, the prototypical network [4] is a simple and powerful approach for few-shot NER. The basic idea is to learn the prototype of each predefined entity class, then classify the query samples according to their closest prototype [9,10]. Most existing few-shot NER models mainly focus on the massive semantics hidden in the token space [10,11]; for example, Tong et al. [10] utilized a clustering method to divide the Other class for learning entity prototypes further. However, they ignore the rich semantics in the sentences containing the multiple entity classes. Meanwhile, the experiments of these methods were either performed on coarse-grained entity types [10] or on the slot filling of dialogue tasks [11], which is quite inefficient for few-shot NER.
Sentence-level semantic information can help few-shot NER, mainly for two reasons: 1) Entity Relation. A large number of sentences contain two or more entities. In fact, sentences are representations of the relations between entities, although these relations do not need to be identified and classified in the NER task. The entity relations in sentences can be used to improve the entity prototypes in few-shot NER. 2) O-class Positive and Negative Impact. Sentence-level semantics can leverage the rich semantics in the other class (O-class) to learn entity prototypes. The sentence semantics are embedded in sentence-level representations, focusing on the contextual information in sentences without the impact of other class labels. The sentence embedding can help represent each predefined entity class, and in this way the other-class noise issue can be handled.

This paper proposes a novel model, few-shot NER with Joint Token and Sentence Awareness (JTSA). The token awareness module aims to learn the associations between tokens in the support set and filter out the tokens that have a more significant impact on recognizing entities. In contrast, the sentence awareness module learns semantic information from the sentences to improve few-shot NER. In practice, sentences often contain rich semantics of the entities and can provide abundant knowledge for discovering the best prototype of each entity class. The prototypes in token space are obtained by abstracting the essential semantics of words, and in sentence space by embedding the semantic information of sentences that include multiple different entity classes. To improve few-shot NER further, we join the token and sentence modules for the final classification, which learns the prototypes of entities in sentences better. The model joins the token and sentence modules for deep interaction between tokens and sentences, adopting their respective useful semantic information.
Our model leverages the sentence-level prototype to calibrate the token-level prototype. It can also effectively alleviate the noise impacts of the O-class tokens to improve the few-shot NER.
We conduct a variety of experiments on the recently released FEW-NERD [12] dataset. FEW-NERD is a large-scale human-annotated few-shot NER dataset with 66 fine-grained entity types [12]. The experimental results demonstrate that our model outperforms the current SOTA approaches in few-shot NER. The subsequent ablation experiments show the significance of sentence-level awareness. Our contributions can be summarized as follows:

• We propose a novel module, sentence awareness, to leverage the entity relations in sentences to improve few-shot NER. The module also addresses the O-class issue and introduces a significant solution for adopting the useful semantic information of O-class words while alleviating their noise impact.
• To improve few-shot NER further, we also propose token awareness to highlight tokens more helpful for recognizing entities, as well as a novel approach with joint token and sentence awareness. The approach leverages the respective advantages of tokens and sentences to improve the experimental results.
• We conduct experiments on a large-scale few-shot NER dataset with 66 fine-grained entity types, and compare our approach with multiple state-of-the-art baselines. The overall results strikingly outperform the SOTA approaches on the few-shot NER task. Further ablation studies show the effectiveness of our model and its modules.

Named Entity Recognition
In Natural Language Processing (NLP), Named Entity Recognition (NER) aims to identify entities (person, location, organization, drug, time, clinical procedure, biological protein, etc.) in unstructured text, and it has been studied and developed widely for decades [13,14,15]. NER serves as a fundamental task in NLP, alongside question answering, information retrieval, and relation extraction. Neural networks have significantly improved the results of the NER task in the last few years [16,17,18,19,20,21,22]. Although neural NER models have achieved superior performance, they need large-scale training data, and it is challenging to obtain massive annotated data. Recently, few-shot learning has emerged to handle this issue.

Few-Shot Learning
Few-shot learning has rapidly emerged as a viable means of completing various tasks, and many few-shot models have been widely used for classification. The Siamese neural network was applied to few-shot classification by Koch et al. [1]; it utilized a convolutional architecture to rank the similarity between inputs naturally. The matching network [2], proposed in 2017, used external memories to enhance the neural network and added an attention mechanism, with cosine distance as the similarity metric for predicting relations. In 2018, Sung et al. [3] proposed the relation network for few-shot learning, which learns an embedding and a deep non-linear distance metric for comparing query and sample items. Moreover, Euclidean distance empirically outperforms the more commonly used cosine similarity on multiple tasks; thus, a simpler and more efficient model, the prototypical network, was proposed by Snell et al. [4], using standard Euclidean distance as the distance function. In 2019, Gao et al. [6] introduced a hybrid attention-based prototypical network, a more efficient prototypical network that trains a weight matrix for the Euclidean distance.

Few-Shot Named Entity Recognition
Few-shot Named Entity Recognition refers to the NER task with only one or a few examples per category [23,24,25,26,27]. Hofer et al. [24] study named entity recognition for electronic health records, collecting 10 samples from the target dataset for few-shot learning. Yang and Katiyar [25] present a few-shot NER system based on nearest neighbor learning and structured inference, and show that the nearest neighbor classifier in this feature space is more effective for the few-shot NER task. Hou et al. [11] focus on the spoken language understanding task and leverage label semantics to classify the entities. Tong et al. [10] propose an approach that mines undefined classes from the Other class, adjusting a single Other-class prototype to multiple prototypes by clustering. This can reduce the O-class's negative impact on the identification of target entities [10]. These state-of-the-art methods operate at the token level, and most of them aim to recognize coarse-grained entity types [12]. Ding et al. [12] release a fine-grained dataset, FEW-NERD, which is a large-scale human-annotated few-shot NER dataset with 66 fine-grained entity types [12]. They also present a token-level few-shot NER model with superior performance on FEW-NERD. But the above methods neglect sentence-level semantics. In contrast, we propose a novel few-shot NER model with both token and sentence awareness.

METHODOLOGY
In this section, we give a detailed introduction to the implementation of our proposed model JTSA, which is shown in Figure 1. JTSA consists of three main parts: a token awareness module, a sentence awareness module, and a joint learning scheme.

Problem Definition
Following Ding et al. [12], we regard NER as a sequence labeling problem. NER aims to assign each token x_k in the input sequence a label y_k, which is either one of the classes in the pre-defined class set Y or does not belong to any entity (Other class) [12]; m is the maximum length of a sentence. In few-shot learning, a system is trained on annotated data from source domains and evaluated on unseen target domains. In each query sample (x, y), x and y represent the entity and its corresponding class label respectively. N and K come from the definition of the N-way K-shot NER task, and K' is the number of test entities per class. In this paper, we follow the definition of Ding et al. [12] for few-shot NER. Specifically, the task first randomly selects N entity classes (N-way), then K samples are randomly chosen (K-shot) from each class. In each instance s_i, a word which does not belong to any predefined entity class is regarded as O-class (other class, or none-of-the-above), and the O-class is treated as the (N + 1)-th class label. Thus, the support set contains N + 1 classes in total.
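As an illustration of the episodic setup just described, the sketch below samples one N-way K-shot episode. The data layout and function name are ours and deliberately simplified: a real FEW-NERD sentence can contain entities of several classes, whereas here each class simply owns a pool of annotated examples.

```python
import random

def sample_episode(dataset, n_way, k_shot, k_query, seed=None):
    """Sample one N-way K-shot episode from {class_label: [examples]}.

    Returns (support, query) dicts mapping each of the N sampled classes
    to K support examples and K' (= k_query) disjoint query examples.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)          # N-way selection
    support, query = {}, {}
    for c in classes:
        picks = rng.sample(dataset[c], k_shot + k_query)  # K + K' per class
        support[c] = picks[:k_shot]
        query[c] = picks[k_shot:]
    return support, query
```

In the real task the O-class is then added on top of these N classes as the (N + 1)-th label, since it cannot be sampled like a normal class.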

Implementation Details
Our model is based on the prototypical network, since it is a simple and effective method for few-shot learning. For the few-shot NER task, the prototypical network assumes that each entity class has a prototype which represents that class, and that each entity clusters around the prototype of the class it belongs to. The purpose of the prototypical network is to learn the representation of the prototype p_i for each class, and then predict the class label of a query entity q. The query entity q is classified in three steps: first, the prototypical network gets the representations of the prototypes P = {p_1, p_2, ..., p_n} for all classes from the support set; second, it calculates the distance between the query entity q and each of the prototypes; finally, the query entity q is assigned to the closest class.
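The three steps above can be sketched as a minimal numpy illustration (not the paper's implementation; in the actual model the token embeddings come from a BERT encoder, and squared Euclidean distance stands in for the unspecified similarity function of Eq. 3):

```python
import numpy as np

def prototypes(support_emb, support_lbl, n_classes):
    # Step 1: prototype p_i = mean of support embeddings labeled i.
    return np.stack([support_emb[support_lbl == i].mean(axis=0)
                     for i in range(n_classes)])

def classify(query_emb, protos):
    # Step 2: squared Euclidean distance to every prototype.
    d = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    # Step 3: assign each query token to its closest prototype.
    return d.argmin(axis=1)
```

With two well-separated support clusters, queries near each cluster are mapped to that cluster's class, which is the entire mechanism the token- and sentence-awareness modules later refine.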
In the first step, we encode each token in the support set into a D-dimensional embedding f_h(x) through an embedding function f_h with learnable parameters h.

Figure 1. Architecture of our JTSA model. Blue, green, and yellow circles indicate different types of entities in the support set, while orange marks the entity in the query set that needs to be correctly assigned to one of these three types by our model. The orange box is our Token Awareness Module, which uses the similarity between support-sample and query-sample tokens to construct the prototypes in our network. The blue part is our Sentence Awareness Module, which considers the association between sentences and filters out the support samples that have a greater impact on entity classification. The green part is our Joint Learning Scheme, which joins token awareness and sentence awareness to enhance the ability to identify entity classes.
In our model, the encoder f_h is the pre-trained language model BERT [28] with transformers. Our model then calculates the prototype of each class as the mean of the embeddings of its tokens, p_i = (1/|c_i|) Σ_{x∈c_i} f_h(x) (Eq. 2), where |c_i| is the number of tokens with class label c_i.
The second step utilizes the similarity function in Eq. 3 to calculate the distance between the entity q and each prototype p_i.
In the last step, we obtain the class distribution for the entity q by Eq. 4,
where |C| stands for the number of classes in the class set C. Furthermore, during meta-testing, we add a Viterbi decoder module to capture the transfer rules between adjacent entity labels, and then modify the class distribution g_h(y = c_i | x_q) by the transition distribution g(y', y) as in Eq. 5.
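These two steps can be sketched as follows, assuming the standard softmax-over-negative-distances form for Eq. 4 and log-domain emission/transition scores for the Eq. 5 Viterbi decoder (function names and array shapes are ours):

```python
import numpy as np

def class_distribution(dists):
    # Eq. 4 (sketch): softmax over negative distances -> p(y = c_i | x).
    s = np.exp(-dists - (-dists).max(axis=-1, keepdims=True))
    return s / s.sum(axis=-1, keepdims=True)

def viterbi(emission_logp, transition_logp):
    """Best label sequence under per-token scores plus transition scores.

    emission_logp: (T, C) log p(y_t | x_t); transition_logp: (C, C).
    """
    T, C = emission_logp.shape
    score = emission_logp[0].copy()
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        # cand[i, j] = score of ending at label j via previous label i.
        cand = score[:, None] + transition_logp + emission_logp[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With neutral transitions the decoder reduces to per-token argmax; the benefit appears when the learned transition distribution penalizes implausible adjacent label pairs.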
To achieve better class representation of named entity, we design the sentence awareness module and token awareness module to get a task adaptive class prototype.

Sentence Awareness
A traditional prototypical network obtains the prototype of an entity class by simply averaging all the word embeddings in it. However, in the real world, the semantics of sentences differ. The primary goal of our sentence awareness is to consider the contribution of each sentence and construct a different weight matrix for each query.
Our sentence awareness module depends on the similarity of sentences. As shown in the sentence awareness module in Figure 1, when predicting a query entity q appearing in a query sentence s_q, our sentence awareness module captures the similar sentences from the support set S. The entities in these sentences are more interrelated with the query entity q, and the weight of each sentence is updated according to its degree of relevance.
First of all, we encode the sentence s_q with a 1-dimensional convolutional neural network to obtain a continuous low-dimensional sentence embedding h_q. Then, each sentence s_l in the support set is encoded in the same way to generate the embedding h_{s_l}. The process is shown in Eq. 6.
Eq. 7 presents the way to calculate the similarity between sentences, where d is the distance function in Eq. 3.
Thus, the prototype p_i^{sa} in our model is defined as in Eq. 8, which pays more attention to the correlation between the query sentence s_q and the support set S.
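A toy sketch of Eqs. 6-8 follows, under our own assumptions about details the text leaves unstated: a single conv filter bank with max-over-time pooling for the sentence encoder, and a softmax over query/support sentence similarities as the reweighting in Eq. 8. All names are hypothetical.

```python
import numpy as np

def conv1d_encode(tok_emb, W):
    """Toy 1-D convolutional sentence encoder (assumed architecture).

    tok_emb: (T, D) token embeddings; W: (k, D, F) filter bank.
    Returns an F-dim sentence embedding via max-over-time pooling.
    """
    k = W.shape[0]
    T = tok_emb.shape[0]
    feats = np.stack([np.einsum('kd,kdf->f', tok_emb[t:t + k], W)
                      for t in range(T - k + 1)])
    return feats.max(axis=0)

def sentence_weighted_prototype(per_sentence_protos, sent_sims):
    # Eq. 8 (sketch): reweight each support sentence's contribution to
    # the class prototype by a softmax of its similarity to the query
    # sentence, so more relevant sentences dominate.
    w = np.exp(sent_sims - sent_sims.max())
    w = w / w.sum()
    return (w[:, None] * per_sentence_protos).sum(axis=0)
```

With equal similarities this degenerates to the plain prototypical average; a strongly matching sentence pulls the prototype toward its own contribution.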
Furthermore, in the few-shot NER task, the O-class (none-of-the-above) is special and creates great challenges for recognizing query entities, because it contains all the entities which cannot be classified into any class in the class set, and these entities have huge gaps between each other. In the sentence awareness module, we leverage the sentence semantics to alleviate this problem.

Token Awareness Module
In this paper, we propose the token awareness mechanism, which focuses on the tokens in the support set that are more relevant and have features more similar to the query entity. Our token awareness module is shown in Figure 1. When identifying entities in query sentences, the module captures the tokens that are more associated with the query entities from the support set and then updates the token weights according to the degree of relevance. Our model calculates the correlation as in Eq. 9, where the correlation coefficient b_j represents the similarity between the query entity q and the entity sample x_j in class c_i.
Then, for each query sample q, the prototype of each class is defined as in Eq. 10. For both the Sentence Awareness Prototype (SAP) module and the Token Awareness Prototype (TAP) module, the optimization goal is to minimize the cross-entropy loss function in Eq. 11, where g_h represents our SAP or TAP model, λ is the weight-decay parameter, and l is the cost function measuring the divergence between the true label and the predicted label.
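A minimal sketch of Eqs. 9-11, assuming (since the exact similarity form is not restated here) that b_j is a softmax over negative squared Euclidean distances between the query and each support token, and omitting the weight-decay term of Eq. 11:

```python
import numpy as np

def token_aware_prototype(query_emb, class_support_emb):
    # Eqs. 9-10 (sketch): b_j = softmax_j(-||q - x_j||^2); the class
    # prototype is the b-weighted sum of its support-token embeddings,
    # so tokens similar to the query dominate the prototype.
    d = ((class_support_emb - query_emb[None, :]) ** 2).sum(-1)
    b = np.exp(-d - (-d).max())
    b = b / b.sum()
    return (b[:, None] * class_support_emb).sum(axis=0)

def cross_entropy(probs, label):
    # Eq. 11 without the weight-decay term: -log p(true label).
    return -np.log(probs[label])
```

Unlike the plain prototypical mean, this prototype is query-dependent: each query sample q induces its own weighting over the support tokens.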

Joint Learning Scheme
In the few-shot NER task, distinct few-shot settings and granularities place different requirements on the model. In practice, our sentence awareness module, which pays more attention to the correlation between sentences, is superior when the class labels are fine-grained and the number of samples in the support set is extremely small. The token awareness module, which focuses on the information shared between entities, helps a lot when evaluating on a coarse-grained dataset with several samples in the support set. Therefore, to better coordinate the two modules and make our model perform well in various scenarios, we explore a simple and verifiable method, JTSA. As shown in the joint learning scheme in Figure 1, we combine the probability distributions predicted by the sentence awareness module and the token awareness module to obtain the modified result. If one module considers the probabilities of two class labels to be close when classifying an entity, it has difficulty identifying the entity and is likely to be wrong; introducing the other module can effectively correct this error. When predicting the label of a query sample q, our JTSA model gets the class distribution by Eq. 12, which joins the distributions of the two modules, where δ and γ are hyper-parameters of model reliability, obtained over multiple episodes. For the final results, we utilize the Viterbi decoder as in Eq. 5.
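A sketch of the Eq. 12 combination, assuming (the exact functional form is not spelled out here) a reliability-weighted mixture of the two modules' distributions followed by renormalization:

```python
import numpy as np

def joint_distribution(p_sap, p_tap, delta=0.5, gamma=0.5):
    # Eq. 12 (sketch): mix the SAP and TAP class distributions with the
    # reliability hyper-parameters delta and gamma, then renormalize.
    p = delta * p_sap + gamma * p_tap
    return p / p.sum(axis=-1, keepdims=True)
```

This reproduces the correction behavior described above: if SAP is only marginally confident in a wrong label while TAP is strongly confident in the right one, the joint distribution recovers the correct argmax.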

EXPERIMENTS
In this section, we demonstrate the experiments and implementations in detail to show that our model is effective and superior. Firstly, we present the hyper-parameters and the FEW-NERD datasets used with our proposed model. Then, the results and comparisons with existing state-of-the-art models are provided by evaluating our model on datasets of different granularities. Finally, we study the validity of each component of our JTSA model, including the sentence awareness module, the token awareness module, and the joint learning scheme.

Datasets
For N-way K-shot NER tasks, we evaluate our proposed model JTSA on two open benchmarks: FEW-NERD(INTRA) and FEW-NERD(INTER), which are presented in Table 1.

FEW-NERD(INTER) randomly splits 60% of the fine-grained types into the training set, 20% into the validation set, and 20% into the test set; that is, the coarse-grained types are shared, and a single set may contain all the coarse-grained types.

Experimental Setup
We evaluate our proposed model JTSA on 5-way 1-shot, 5-way 5-shot, 10-way 1-shot, and 10-way 5-shot tasks. The hyper-parameters of our model are reported in Table 2. A pre-trained BERT module is used to extract the initial word embedding representations. The batch size is set to 2 and the number of queries is 1. Our model is trained for 10,000 iterations, validated for 1,000 iterations, and tested for 500 iterations. We use AdamW as the optimizer, and the learning rate is set to 1e-4 with (0.1 * training iterations) warmup steps.
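The learning-rate schedule implied by this setup can be sketched as follows. Only the base rate (1e-4), the total iterations (10,000), and the 10% warmup fraction come from the text; the linear decay after warmup is our assumption, since the decay shape is not specified.

```python
def lr_at_step(step, base_lr=1e-4, total_steps=10_000, warmup_frac=0.1):
    # Linear warmup over the first 10% of training iterations, then a
    # linear decay to zero (decay shape assumed, not stated in the paper).
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * (total_steps - step) / max(1, total_steps - warmup)
```

With these defaults the rate climbs from 0 to 1e-4 over the first 1,000 steps and returns to 0 at step 10,000.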

Overall Performance
In this part, we assess our proposed model JTSA from different perspectives on the two benchmark datasets FEW-NERD(INTER) and FEW-NERD(INTRA), and then compare our method with existing state-of-the-art approaches.
For FEW-NERD(INTER), we first evaluate our SAP and TAP models. As shown in Table 3, SAP achieves higher performance than existing state-of-the-art models on 1-shot tasks. On the 5-way 1-shot task, the F1 score of our SAP improves by 5.78%, and on the 10-way 1-shot task the improvement is around 4%. This sufficiently demonstrates that the sentence awareness module in SAP effectively aids entity type identification by integrating the structural information of sentences when there are few support samples. On multiple-shot tasks, our TAP model achieves a significant improvement due to the advantages of the token awareness module. Owing to the specificity of the NER task, an entity is regarded as "O-class" when its type is not included in the predefined types. This inevitably introduces a large number of futile samples, a problem our sentence awareness is designed to solve: the module utilizes the semantics in sentences to filter out the samples which may interfere with entity recognition.

Table 3. Overall performance on FEW-NERD(INTER). The NNshot, Proto, and Struct models are from [12]; we add a Viterbi decoder to Proto as our baseline ProtoNet, and we evaluate them on our dataset.
For coarse-grained FEW-NERD(INTRA), the empirical results reported in Table 4 suggest that the dataset is challenging for all existing models, since the query samples share little information with the reference. However, across various few-shot settings, the improvement of our TAP model, which relies on our token awareness module, is greater than on FEW-NERD(INTER). The reason is that there is a huge gap between query entities and entities in the support set, while our token awareness has a significant capability to filter out the entities in the support set that are less correlated with the query sample.

Table 4. Overall performance on FEW-NERD(INTRA). The NNshot, Proto, and Struct models are from [12]; we add a Viterbi decoder to Proto as our baseline ProtoNet, and we evaluate them on our dataset.
To comprehensively exploit the correlation information of entities and sentences, we construct the joint model JTSA, our main proposed method. The token awareness module and the sentence awareness

module complement each other and have advantages in different scenarios. As the results in Table 3 and Table 4 show, our JTSA model achieves superior performance in various few-shot settings; meanwhile, it is better than the structshot model [12], which is strong in the single-shot setting, and the prototypical network, which is suitable for the multiple-shot setting.

Convergence Speeds
Firstly, we compare the convergence speed of our proposed SAP model with existing state-of-the-art methods (the structshot model [12] and the prototypical network [4]) on the FEW-NERD(INTER) benchmark on 1-shot tasks. The results are reported in Figure 2: the red curve represents our SAP model, the blue curve the structshot model, and the green curve the prototypical network.

Secondly, we also compare our TAP model with the two baseline methods on multiple-shot tasks and show the results on the 5-shot task in Figure 3. Although the convergence speed is almost the same initially, our model performs better on the validation set, which suggests that it has strong generalization capability. In the later stage of training, our TAP model converges much faster and reaches higher optima, and both criteria generally exceed those of the prototypical network.

Ablation Study
In this part, we conduct sets of subsidiary experiments to demonstrate the effect of our main contributions, including the sentence awareness module, the token awareness module, and the joint learning scheme.

Sentence Awareness Module
To show the effect of our sentence awareness, we take a prediction on a 5-way 1-shot task as an example. Given a query sample q with the true label "Organization-religion", the NER model needs to determine the type label of q. In this experiment, we calculate the distance of each feature dimension between the query q and the prototypes, and compare the results of our SAP model with the prototypical network. Figure 4 presents the visualization; the darker the color of the bar, the closer the distance. In this figure, the SAP model provides a higher level of confidence when predicting the label of query q, since most of the prototype feature dimensions obtained from our SAP model are more similar to query q. The overall distance across the whole feature set in our SAP model is 20% lower than that of the prototypical network. To sum up, sentence awareness, which considers the similarity of sentences, is crucial for a prototypical network.

Token Awareness Module
As in the experiments above, we randomly extract samples to evaluate our TAP model on a 5-way 1-shot task; the query entity comes from the "other-biologything" type. Figure 5 illustrates that the prototypes calculated by our TAP model with the token awareness module are more similar to the query q, and the distance of ours is only 67% of that of the prototypical network. Thus, the query sample is easier to classify correctly with the token awareness module.

Joint Learning Scheme
To further demonstrate the effect of our joint learning scheme, we extract two groups of data and evaluate our model on a 5-way task. Figure 6 illustrates how the TAP model with the token awareness module corrects the prediction of the SAP model with the sentence awareness module. We show the probability distributions of our SAP, TAP, and JTSA models in determining the type label of the entity "stock" in the sentence "It began focusing on foreign exchange transaction in 1976 and listed its shares on the Jakarta stock exchange in 1989". The true label of the entity "stock" is "organization-government/governmentagency", but the SAP model believes that the entity should be labeled "building-hospital" with over 55% confidence. In this case, our joint model JTSA can take advantage of token awareness (TAP) and finally gives the correct type label with 80% confidence.

Figure 6. An example of SAP, TAP, and JTSA predicting the class label of the entity "jakarta stock exchange" from the sentences.

Figure 7 presents how the sentence awareness module works when predicting the label of the word "the" in the phrase "republic of the Philippines Commission on elections (Comelec)". We can find that the TAP model is uncertain whether the word "the" belongs to "other-class" or "event-election", which results in misidentification. In contrast, the SAP model predicts the correct type "event-election" with high confidence. From the above, we believe that the joint learning scheme of our JTSA model, which joins token awareness and sentence awareness, is meaningful and achieves the best performance in a variety of scenarios.

Error Indicator Analysis
Following Ding et al. [12], we analyze our model from four aspects. Figure 8 presents the comparison between our proposed models and the baselines. All of our models SAP, TAP, and JTSA achieve lower error rates than the baselines in most situations ("FP", "WITHIN", and "OUTER"); for example, the error rate on "FP" is reduced to 50% of that of the traditional prototypical network.

Figure 8. Comparison results on four error indicators. An "FP" error means an entity of class "O" is predicted as another class. An "FN" error means an entity is incorrectly predicted as class "O". A "WITHIN" error means the coarse-grained class of the query entity is predicted correctly, but the fine-grained class label is wrong. An "OUTER" error means the entity is predicted with a wrong coarse-grained class label.

The results of these experiments sufficiently illustrate that token awareness and sentence awareness effectively recognize entities of class "O" and alleviate the problem in similarity comparison caused by the ambiguity of the "O" class. On the other hand, our models have the lowest error rates for "WITHIN" and "OUTER", reduced by 10%-24% and 30%-40% respectively, indicating that our token awareness and sentence awareness mechanisms are superior on specific classes, especially coarse-grained classes with more significant semantic differences. In addition, our joint model JTSA shows further reductions on "FP", "WITHIN", and "OUTER" compared with the results of TAP and SAP, which further demonstrates the significance of our joint learning scheme.

CONCLUSION
In this paper, we proposed JTSA, a state-of-the-art few-shot NER model. Our model contains three modules: a token awareness module, a sentence awareness module, and a joint learning scheme. The token awareness module captures the connections between entities at the token level. The sentence awareness module incorporates sentence information to capture sentence-level relationships between entities. The joint learning scheme then combines these two modules to strengthen the ability to identify entity classes and reduce NER errors. Experimental results show that both awareness modules contribute positively to entity recognition in different contexts, and that the joint learning scheme enables our final model to achieve advanced results on both coarse-grained and fine-grained NER.