CokeBERT: Contextual Knowledge Selection and Embedding towards Enhanced Pre-Trained Language Models

Several recent efforts have been devoted to enhancing pre-trained language models (PLMs) by utilizing extra heterogeneous knowledge in knowledge graphs (KGs), and have achieved consistent improvements on various knowledge-driven NLP tasks. However, most of these knowledge-enhanced PLMs embed static sub-graphs of KGs ("knowledge context"), ignoring the fact that the knowledge required by PLMs may change dynamically according to the specific text ("textual context"). In this paper, we propose a novel framework named Coke to dynamically select contextual knowledge and embed knowledge context according to textual context for PLMs, which avoids the effect of redundant and ambiguous knowledge in KGs that does not match the input text. Our experimental results show that Coke outperforms various baselines on typical knowledge-driven NLP tasks, indicating the effectiveness of utilizing dynamic knowledge context for language understanding. Besides the performance improvements, the dynamically selected knowledge in Coke can describe the semantics of text-related knowledge in a more interpretable form than conventional PLMs. Our source code and datasets will be available to provide more details for Coke.


Introduction
Pre-trained language models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa have achieved state-of-the-art performance on a wide range of natural language processing (NLP) tasks. As some research (Poerner et al., 2019) suggests that these PLMs still struggle to learn factual knowledge, intensive recent efforts (Lauscher et al., 2019; Yoav et al., 2019; Yu et al., 2019; Peters et al., 2019; He et al., 2019; Liu et al., 2020) have therefore been devoted to leveraging rich heterogeneous knowledge in knowledge graphs (KGs) to enhance PLMs.
Figure 1: An example of capturing knowledge context from a KG and incorporating it for language understanding. Different sizes of circles express different entity importance for understanding the given sentence.
An ideal process for injecting factual knowledge into PLMs is to first identify the entities mentioned in the input text ("textual context"), then dynamically select sub-graphs ("knowledge context") centered on these mentioned entities from KGs, and finally embed the selected knowledge context for PLMs. Intuitively, on the one hand, knowledge context contributes to better language understanding by serving as an effective complement to textual context. For example, given the two entities Steph Curry and Klay Thompson in Figure 1, we can infer that they play for the same basketball team, which is not explicitly stated in the given sentence. On the other hand, not all knowledge in KGs is relevant to the textual context; e.g., the fact (Riley, Daughter of, Steph Curry) has no positive effect on understanding the given sentence.
We argue that it is meaningful to dynamically select appropriate knowledge context that matches the specific textual context for enhancing PLMs. However, most knowledge context utilized in existing knowledge-enhanced PLMs does not closely match the textual context: (1) ERNIE only uses the entities mentioned in the text as knowledge context and only injects the embeddings of these entities into PLMs, ignoring informative neighbors in KGs; (2) KnowBert (Peters et al., 2019), K-BERT (Liu et al., 2020) and K-ADAPTER (Wang et al., 2020) consider more information as knowledge context than ERNIE (e.g., entity properties in KGs), yet their knowledge context is still static and cannot change dynamically according to the textual context. As mentioned above, not all information in a static knowledge context matches the textual context, and redundant or ambiguous knowledge may interfere with understanding the semantics. Hence, how to dynamically select and embed knowledge context according to textual context for PLMs remains a challenge.
To alleviate the issue, we propose a novel framework named Coke to dynamically select knowledge context matching textual context and embed the dynamic context for enhancing PLMs: (1) For dynamically selecting knowledge context according to textual context, we propose a novel semantic-driven graph neural network (S-GNN). Given an entity mentioned in textual context, S-GNN leverages an attention mechanism to filter out irrelevant KG information by assigning scores to neighbors (1-hop, 2-hop, etc.) and relations between entities based on textual context. The score weighs how well the information in KGs matches the textual context and helps Coke dynamically select an appropriate sub-graph as the knowledge context of the given entity mention.
(2) For dynamically embedding knowledge context, given a mentioned entity, S-GNN recursively computes its representation conditioned on both its pre-trained entity embedding and the information aggregated from the selected contextual sub-graph, making Coke aware of both global and local KG information and able to grasp the text-related information.
(3) By fusing the embeddings of dynamic knowledge context into PLMs with specific training and adaptation strategies, Coke improves language understanding and benefits downstream applications.
Following existing work, we conduct experiments on four datasets for two typical knowledge-driven tasks, i.e., entity typing and relation classification. The experimental results show that Coke outperforms various baselines, indicating the effectiveness of dynamically selecting and embedding knowledge context for PLMs. Moreover, qualitative analyses also suggest that, compared with state-of-the-art knowledge-enhanced PLMs, our model not only achieves competitive results but also provides a more interpretable approach to describing specific words based on their dynamic knowledge context.

Related Work
Intuitively, two types of context are involved in language understanding: (1) the semantic information of the text (textual context), and (2) the factual knowledge related to the text (knowledge context). Typical PLMs focus on capturing information from the textual context, like ELMO (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2019), XLNET, and RoBERTa. In order to enable PLMs to better understand the knowledge context, intensive efforts have been devoted to injecting various factual knowledge of KGs into PLMs. ERNIE links entity mentions in textual context to their corresponding entities in KGs and then injects the pre-trained embeddings of the corresponding entities into PLMs. Although ERNIE has shown the feasibility and effectiveness of fusing knowledge embeddings for enhancing PLMs, it still does not consider the informative neighbors of entities.
To this end, various models have been proposed to further incorporate a wider range of knowledge information. KnowBert (Peters et al., 2019) and KRL (He et al., 2019) employ attention mechanisms to learn more informative entity embeddings based on the entity-related sub-graphs. Nevertheless, the computation of entity embeddings is independent of textual context. K-BERT (Liu et al., 2020) heuristically converts textual context and entity-related sub-graphs into united input sequences, and leverages a Transformer (Vaswani et al., 2017) with a specially designed attention mechanism to encode the sequences. Unfortunately, it is not trivial for the heuristic method in K-BERT to convert the second or higher order neighbors related to textual context into a sequence without losing graph structure information. K-ADAPTER (Wang et al., 2020) proposes variant frameworks to inject factual knowledge in different domains, yet still suffers from an issue similar to that of K-BERT. Although most existing knowledge-enhanced PLMs utilize both textual context and knowledge context, their knowledge context cannot change with textual context: ERNIE uses single entities, KRL and KnowBert embed sub-graphs independently of textual context, and K-BERT and K-ADAPTER use fixed sub-graphs. In contrast, our proposed Coke model can leverage dynamic sub-graphs of arbitrary size as knowledge context according to textual context.
There are also several PLM methods for capturing knowledge from only textual context. SpanBERT and ERNIE 1.0-Baidu (Yu et al., 2019) propose to predict masked variable-length spans or entity mentions to encourage PLMs to learn multi-token phrases. WKLM (Xiong et al., 2019) is trained to distinguish whether an entity mention has been replaced with the name of another entity of the same type, in order to learn entity types. LIBERT (Lauscher et al., 2019) and SenseBERT (Yoav et al., 2019) extend PLMs to predict word relations (e.g., synonym and hyponym-hypernym) and word supersenses respectively to inject lexical-semantic knowledge. Moreover, there are also efforts on continual knowledge infusion (Yu et al., 2020; Wang et al., 2020). Although these models do not use extra knowledge context to understand factual knowledge, they are complementary to our work and can be used together with it towards better PLMs.

Methodology
As shown in Figure 2, Coke consists of three modules: (1) the Text Encoder computes embeddings for the input text, i.e., the textual context; (2) the Dynamic Knowledge Context Encoder first dynamically selects knowledge context according to textual context, and then computes contextual knowledge embeddings conditioned on both textual context and KG context; (3) the Knowledge Fusion Encoder fuses both textual context and dynamic knowledge context embeddings for better language understanding. In this section, we first give the notations and then present the three modules in detail.

Notations
A KG is denoted by $\mathcal{G} = \{(h, r, t) \mid h, t \in \mathcal{E}, r \in \mathcal{R}\}$, where $\mathcal{E}$ and $\mathcal{R}$ are the sets of entities and relations respectively. Each fact $(h, r, t) \in \mathcal{G}$ indicates that there is a relation $r$ between the head entity $h$ and the tail entity $t$. Given a token sequence $S = \{w_j\}_{j=1}^{N}$ of length $N$, some tokens in the sequence may correspond to certain entities in $\mathcal{E}$. We name these tokens "entity mentions" and denote their mentioned entities in KGs as $\{e_j\}_{j=1}^{M}$, where $M$ is the number of mentioned entities.

Text Encoder
Similar to existing knowledge-enhanced PLMs, Coke leverages an $L$-layer bidirectional Transformer encoder (Vaswani et al., 2017; Devlin et al., 2019) to embed the input text (tokens) $S = \{w_j\}_{j=1}^{N}$ and obtain its textual context representations, which is denoted as T-Encoder(·):
$$\{\mathbf{w}_j\}_{j=1}^{N} = \text{T-Encoder}(\{w_j\}_{j=1}^{N}) \tag{1}$$
As T-Encoder(·) is the same as that used in BERT, we refer readers to the original paper (Devlin et al., 2019) for more details.

Dynamic Knowledge Context Encoder
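Constructing Raw Knowledge Context As KGs are often large in scale, we first construct a raw knowledge context for computational efficiency. Then we dynamically select and embed the appropriate knowledge context that matches the textual context. Specifically, given an entity $m \in \mathcal{E}$ mentioned by the input text $S = \{w_j\}_{j=1}^{N}$, we define its raw knowledge context $\mathcal{G}_m$ as a sub-graph of $\mathcal{G}$ centered on $m$. The entities of $\mathcal{G}_m$ are at most $K$ hops away from $m$. Formally, we define the 0-hop away entity set as $\mathcal{E}_m^0 = \{m\}$. Then the $i$-hop away entity set $\mathcal{E}_m^i$ can be defined recursively as
$$\mathcal{E}_m^i = \{e \mid (e, r, e') \in \mathcal{G} \lor (e', r, e) \in \mathcal{G},\ e' \in \mathcal{E}_m^{i-1}\} \tag{2}$$
Intuitively, all entities in $\mathcal{E}_m^i$ (whether head or tail entities) only have relations to the entities in $\mathcal{E}_m^{i-1}$. Then, the raw knowledge context $\mathcal{G}_m$ and its entity set $\mathcal{E}_m$ can be defined as
$$\mathcal{E}_m = \bigcup_{i=0}^{K} \mathcal{E}_m^i, \qquad \mathcal{G}_m = \{(h, r, t) \in \mathcal{G} \mid h, t \in \mathcal{E}_m\} \tag{3}$$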
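The construction of the raw knowledge context can be illustrated with a minimal Python sketch. It is not the released implementation: the function name `build_raw_knowledge_context`, the toy triples (other than the Riley fact quoted in the introduction), and the choice to keep each entity only in the hop set where it is first reached are all assumptions made for illustration.

```python
from collections import defaultdict


def build_raw_knowledge_context(triples, mention, max_hops):
    """Return per-hop entity sets and the sub-graph within `max_hops` of `mention`."""
    # Treat facts as undirected: an entity is a neighbor whether it is head or tail.
    neighbors = defaultdict(set)
    for head, _relation, tail in triples:
        neighbors[head].add(tail)
        neighbors[tail].add(head)

    hop_sets = [{mention}]  # hop_sets[0] is the 0-hop set {m}
    reached = {mention}
    for _ in range(max_hops):
        frontier = set()
        for entity in hop_sets[-1]:
            frontier |= neighbors.get(entity, set())
        frontier -= reached  # simplification: keep only newly reached entities
        hop_sets.append(frontier)
        reached |= frontier

    # Keep every fact whose head and tail both lie inside the selected entity set.
    sub_graph = [(h, r, t) for h, r, t in triples if h in reached and t in reached]
    return hop_sets, sub_graph


# Toy KG: only (Riley, Daughter of, Steph Curry) appears in the text above;
# the remaining triples are invented for illustration.
toy_triples = [
    ("Steph Curry", "member of", "Golden State Warriors"),
    ("Klay Thompson", "member of", "Golden State Warriors"),
    ("Riley", "Daughter of", "Steph Curry"),
    ("Golden State Warriors", "located in", "San Francisco"),
]
hop_sets, sub_graph = build_raw_knowledge_context(toy_triples, "Steph Curry", max_hops=2)
one_hop_sets, one_hop_graph = build_raw_knowledge_context(toy_triples, "Steph Curry", max_hops=1)
```

With K = 2 the sketch reaches Klay Thompson through the shared team, which is the kind of higher-order neighbor that a 1-hop context would miss.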

Figure 2: (a) The upper part is the overall framework of Coke (panel labels: K× DK-Encoder, P× K-Encoder) and illustrates how to generate entity representations. (b) The lower part is an example of inserting special tokens into the input sequence for specific tasks during fine-tuning.

Selecting and Embedding Knowledge Context
To dynamically select informative features in $\mathcal{G}_m$ and embed these features for PLMs, we propose a semantic-driven graph neural network (S-GNN). For each entity in $\mathcal{G}_m$, i.e., $e \in \mathcal{E}_m$, we initialize its input features for S-GNN with its embedding pre-trained by TransE (Bordes et al., 2013) (other knowledge embedding models can also provide pre-trained embeddings for S-GNN), and name the initialized features $\mathbf{e}^0$. In order to fully transfer the structure and knowledge information among entities in $\mathcal{G}_m$, S-GNN consists of several hidden layers that aggregate information following the structure of $\mathcal{G}_m$. At the $i$-th layer, given an entity $e \in \mathcal{E}_m$, S-GNN aggregates information from each neighboring entity $n$ and the relation $r$ between them in $\mathcal{G}_m$ (Eq. 4), where $\mathbf{n}^{i-1}$ is the embedding of $n$ at the $(i-1)$-th layer, $\mathbf{n}$ and $\mathbf{r}$ are the entity and relation embeddings respectively pre-trained by TransE, $W^i$ is a learnable linear matrix, and $[\cdot;\cdot]$ denotes the horizontal concatenation of vectors. The embedding of $e$ at the $i$-th layer is then computed by Eq. (5), where $N_e$ is the neighboring set of $e$ and $f^i(\cdot)$ is the function that aggregates information at the $i$-th layer, introduced in detail next.
As not all information in the raw knowledge context $\mathcal{G}_m$ is useful for understanding the input text tokens $S = \{w_j\}_{j=1}^{N}$, we design a special semantic attention mechanism as the function $f^i$ in Eq. (5) to filter out irrelevant information and aggregate essential information. The attention function $f^i$ (Eq. 6) weights each neighbor using a query vector $\mathbf{q}$ and a key vector $\mathbf{k}_n$.
To dynamically select information according to textual context, the query vector $\mathbf{q}$ comes from the embedding of the input text (tokens): it is obtained by applying a learnable linear matrix $W^i$ and bias vector $b^i$ for the query vector at the $i$-th layer, followed by $\sigma = \tanh(\cdot)$, to $\mathbf{s}$, the semantic embedding of the whole input text (tokens). Specifically, following BERT (Devlin et al., 2019), we place a special token [CLS] at the beginning of the input sequence, and $\mathbf{s}$ is the output embedding of [CLS] computed by Eq. (1).
The key vector $\mathbf{k}_n$ is based on the embedding of the relation between the entity $e$ and its neighboring entity $n$, and is computed with a learnable linear matrix $W^i$ and bias vector $b^i$ for the key vector at the $i$-th layer. Two triples with head and tail entities switched get the reverse key vectors.
In summary, S-GNN utilizes textual context to adjust the weights of feature aggregation, and finally selects and embeds knowledge related to the textual context into embeddings for PLMs. Hence, given the mentioned entity $m$, the output embedding of $m$ at the last layer of S-GNN is its final embedding computed from its dynamic knowledge context. For simplicity, given the input text (tokens) $\{w_j\}_{j=1}^{N}$ and the mentioned entities $\{e_j\}_{j=1}^{M}$, the whole computation to obtain dynamic knowledge context embeddings is denoted as DK-Encoder(·).
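To make the idea of text-conditioned aggregation concrete, the following is a minimal pure-Python sketch of one attention step. It is an illustration under stated assumptions rather than the authors' S-GNN: the query is a tanh-activated linear map of the [CLS] embedding and each key is a tanh-activated linear map of a relation embedding, as described above, but the dot-product scoring, the softmax normalization, and all names (`semantic_attention`, `w_q`, `w_k`, the toy vectors) are hypothetical choices.

```python
import math


def affine_tanh(weight, bias, vector):
    """Compute tanh(W v + b) for a matrix W given as a list of rows."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weight, bias)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_attention(text_emb, relation_embs, messages, w_q, b_q, w_k, b_k):
    """Weight neighbor messages by how well their relation matches the text."""
    query = affine_tanh(w_q, b_q, text_emb)  # built from the [CLS] embedding
    keys = [affine_tanh(w_k, b_k, rel) for rel in relation_embs]  # one key per neighbor
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)  # assumed normalization, not taken from the paper
    dim = len(messages[0])
    aggregated = [sum(w * msg[d] for w, msg in zip(weights, messages)) for d in range(dim)]
    return weights, aggregated


# Toy 2-d example: the first relation points in the same direction as the text
# embedding, the second in the opposite direction.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
weights, aggregated = semantic_attention(
    text_emb=[1.0, 0.0],
    relation_embs=[[1.0, 0.0], [-1.0, 0.0]],
    messages=[[1.0, 0.0], [0.0, 1.0]],
    w_q=identity, b_q=zero_bias, w_k=identity, b_k=zero_bias,
)
```

In this toy setting the neighbor whose relation aligns with the text receives the larger weight, so its message dominates the aggregated vector; a neighbor with an opposing relation is down-weighted rather than removed.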

Knowledge Fusion Encoder
The knowledge fusion encoder aims to fuse the information of the contextual entity embeddings $\{e_j\}_{j=1}^{M}$ and the text (token) embeddings $\{w_j\}_{j=1}^{N}$. We leverage an encoder K-Encoder(·) similar to that of existing knowledge-enhanced PLMs for this purpose, and refer readers to the original work for more details. Roughly speaking, K-Encoder(·) consists of $P$ aggregators. As shown in Figure 2, each aggregator contains two multi-head self-attentions, which take the text (token) embeddings and the contextual knowledge embeddings respectively, and a multi-layer perceptron (MLP) that fuses the two heterogeneous features.

Training Details
Pre-Training Strategies To incorporate knowledge embeddings into language understanding, we randomly mask token-entity alignments and let the model learn to predict the corresponding entities for the tokens whose alignments are masked. We refer to this as a denoising entity auto-encoder (dEA), which is one of the pre-training tasks for existing knowledge-enhanced PLMs. Besides, we choose BERT BASE (Devlin et al., 2019), RoBERTa BASE, and RoBERTa LARGE as our base models. Considering that our base models are originally pre-trained with different pre-training tasks, we have two different training objectives for them.
For Coke BERT BASE, which is based on BERT BASE, the training objective can be described as:
$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}} + \mathcal{L}_{\text{dEA}}$$
where $\mathcal{L}_{\text{MLM}}$ and $\mathcal{L}_{\text{NSP}}$ are the loss functions for masked language modeling and next sentence prediction respectively, and $\mathcal{L}_{\text{dEA}}$ is the denoising entity auto-encoder (dEA) loss. For Coke ROBERTA BASE and Coke ROBERTA LARGE, which are based on RoBERTa BASE and RoBERTa LARGE respectively, the training objective can be described as:
$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{dEA}}$$
where the next sentence prediction loss is removed.
Fine-Tuning for Downstream Tasks Coke applies a fine-tuning procedure similar to BERT and takes the final output embedding of the first token [CLS] for various common NLP tasks. Similar to previous knowledge-enhanced PLMs, for knowledge-driven tasks such as entity typing and relation classification, we apply specific fine-tuning procedures. As shown in Figure 2, to help Coke combine context information and entity mentions attentively, we modify the input sequence with mention markers. We attend the token which is in front of the entity mention as

Experiments
In the experiments, we first introduce the training dataset and other training details of our model. After that, we give an empirical analysis to show the usefulness of the selected knowledge context. Then we compare Coke with several strong baselines on two typical knowledge-driven tasks, entity typing and relation classification. Finally, we perform an ablation study to show the effectiveness of our dynamic knowledge context encoder.

Training Dataset
We use English Wikipedia as our pre-training corpus and align the entity mentions to Wikidata with the widely-used entity linking tool TAGME (Ferragina and Scaiella, 2010). There are nearly 4,500M subwords and 140M entities in the pre-training corpus, and we sample 24,267,796 fact triples, including 5,040,986 entities in Wikidata. We conduct our experiments on the following datasets: FIGER, Open Entity, FewRel, and TACRED. The statistics of these datasets are shown in Table 1. Besides, we use publicly released knowledge embeddings of Wikidata.

Experimental Settings
Training and Parameter Settings In experiments, we choose BERT BASE (Devlin et al., 2019), RoBERTa BASE and RoBERTa LARGE as our base models. To reduce the cost of training from scratch, we adopt these models' released parameters to initialize our text encoder, and the rest of the parameters of Coke are all initialized randomly. For optimization, we set the learning rate to $5 \times 10^{-5}$, the max sequence length to 256, and the batch size to 32, with the remaining settings largely following the original PLMs. For fine-tuning, we use the same parameters as pre-training except for the batch sizes and the learning rates. In all downstream tasks, we select the batch size from {16, 32, 64} and the number of epochs from {5, 6, 7, 8, 9, 10}, and set the learning rate to $2 \times 10^{-5}$. All values in these ranges perform well. Besides, to prevent Coke from overfitting on FIGER, we use a large batch size of 1024. Further details of training and hyperparameter settings are given in the Appendix.
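For readers who want to reproduce the fine-tuning search, the ranges above can be enumerated with a short Python sketch. The dictionary keys and the helper `iter_configs` are hypothetical names chosen for illustration; only the numeric values come from the settings described above.

```python
from itertools import product

# Fine-tuning search space as stated in the text; key names are illustrative.
FINE_TUNE_GRID = {
    "batch_size": [16, 32, 64],
    "learning_rate": [2e-5],
    "num_epochs": [5, 6, 7, 8, 9, 10],
}


def iter_configs(grid):
    """Yield one dict per combination of the listed hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(FINE_TUNE_GRID))
```

Each element of `configs` can then be passed to whatever training entry point is in use; the sketch deliberately does not assume any particular training script.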
Baselines We split baseline models into three groups: BERT BASE based models, RoBERTa BASE based models, and RoBERTa LARGE based models. For the sake of fairness, all models only incorporate factual knowledge from Wikidata. For knowledge-enhanced PLMs like ERNIE, KnowBert, and K-BERT, we re-implement them or use their released code for our experiments, and report the results that match those in the original papers. As K-ADAPTER is similar to K-BERT and has no released code, we directly compare with K-BERT rather than K-ADAPTER.

Empirical Analysis for Dynamically Selecting Knowledge Context
To demonstrate that Coke is able to capture useful information from KGs, we design qualitative and quantitative experiments to evaluate Coke.
In the qualitative experiment, given the same entity mentions in different contexts, we adopt PLMs to select text-related 1-hop triples ("1-hop knowledge context") from Wikidata, in a way similar to Eq. (6) without summation. More specifically, we use the [CLS] embedding of the input text (tokens) computed by these PLMs to attend to each neighboring triple of the entity mentions.
As shown in Table 2, when given the sentence ". . . Bill Gates and Mark Zuckerberg dropped out of Harvard . . ." indicating the relation alumni between Mark Zuckerberg and Bill Gates, our model pays more attention to the factual knowledge of their education. Yet when given the sentence "Bill Gates and Mark Zuckerberg are working together . . ." indicating the cooperation between Mark Zuckerberg and Bill Gates, the factual knowledge of their enterprises is considered by our model. We find that the importance scores of the attended triples are interpretable and can help us understand the semantics more clearly.
In the quantitative experiment, we annotate the test sets of FewRel and TACRED. Given a sample, including its context and the corresponding entity mentions, we manually annotate its 1-hop triples by judging the relevance between the context and the triples. Finally, we extract 15981 instances from FewRel and 5684 instances from TACRED. By ranking the importance scores of all triples for an entity mention and setting a threshold, we obtain positive triples and negative triples to calculate F1 scores for evaluation.
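The thresholding step of the quantitative evaluation can be sketched as follows. This is a minimal illustration, not the evaluation script used in the experiments: the function name `f1_at_threshold`, the triple labels, and the scores are made-up toy values, and the threshold is treated as inclusive by assumption.

```python
def f1_at_threshold(scores, relevant, threshold):
    """F1 of the triples scored at or above `threshold` against annotated ones."""
    predicted = {triple for triple, score in scores.items() if score >= threshold}
    true_positives = len(predicted & relevant)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or empty annotations
    precision = true_positives / len(predicted)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Invented importance scores for three 1-hop triples of one entity mention.
toy_scores = {
    "educated at": 0.6,
    "founder of": 0.3,
    "spouse": 0.1,
}
toy_relevant = {"educated at"}  # triples an annotator marked as relevant

f1_loose = f1_at_threshold(toy_scores, toy_relevant, threshold=0.25)
f1_strict = f1_at_threshold(toy_scores, toy_relevant, threshold=0.5)
```

A lower threshold admits the irrelevant "founder of" triple and lowers precision, while the stricter threshold keeps only the annotated triple, which is why the choice of threshold matters when comparing models.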
To fairly demonstrate the effectiveness of extracting triples via Coke, we choose ERNIE as our baseline model, which inherently aligns the language embedding space and the KG embedding space using the same training data as Coke. As shown in Table 3, the F1 scores of Coke are better than those of the baseline model by 14.8%-17.8% on FewRel and 14.5%-18.3% on TACRED.

Overall Evaluation Results
In this section, we compare our models with various effective PLMs on entity typing and relation classification, including both vanilla PLMs and knowledge-enhanced PLMs.
Entity Typing Given an entity mention and its corresponding sentence, entity typing requires classifying the entity mention into its types. For this task, we fine-tune Coke on FIGER (Ling et al., 2015) and Open Entity (Choi et al., 2018). The training set of FIGER is labeled with distant supervision, and its test set is annotated by humans. Open Entity is a completely manually-annotated dataset. We compare our model with the baseline models described in the Baselines paragraph of the Experimental Settings.
As shown in Table 4, Coke achieves comparable F1 scores on Open Entity. On FIGER, Coke significantly outperforms BERT BASE and RoBERTa BASE by 3.7% and 3.5% Micro scores respectively. Besides, the performance of Coke is better than that of the other baseline models as well. This suggests that Coke is better able to mitigate the noisy-label problem in FIGER than the baseline models mentioned above.
Moreover, we find that the domain of FIGER is similar to Wikidata; this is consistent with the observation in the empirical analysis section, and further highlights the importance of selecting knowledge context across domains.
Relation Classification Relation classification aims to determine the correct relation between two entities in a given sentence. We fine-tune Coke on two widely-used benchmark datasets, FewRel (Han et al., 2018) and TACRED (Zhang et al., 2017). We also compare our model with the baseline models described in the Baselines paragraph of the Experimental Settings. On FewRel, Coke significantly outperforms BERT BASE and RoBERTa BASE by 4.5% and 3.8% F1 scores respectively, as shown in Table 4. This demonstrates that Coke can capture the relation between two entities better than ERNIE by considering the information of higher-order neighbors, especially on the small dataset FewRel.
Besides, Coke models have comparable results with other baseline models on TACRED but achieve substantial improvements on FewRel. As we mentioned before, the domain of FewRel is more similar to Wikidata and therefore it gains more benefit from pre-training.

Ablation Study
In order to indicate the effect of S-GNN on the process of dynamically selecting knowledge context, we conduct essential ablation studies for different modules in S-GNN.

K-Hop Sub-Graphs
In this section, we explore the effects of the dynamic knowledge context encoder. There are two main components in the dynamic knowledge context encoder: raw knowledge context construction and S-GNN. Coke applies raw knowledge context construction to sample K-hop sub-graphs, and then uses S-GNN to embed informative knowledge in the raw context. From Figure 3, we find that Coke incorporating the 2-hop sub-graph outperforms Coke incorporating the 1-hop sub-graph by 0.4% to 0.6%. This suggests that considering a wider range of knowledge can lead to better entity embeddings.
Attention Mechanism In S-GNN, attention is an essential mechanism. It is responsible for weighing how well knowledge matches the text and helps compute the final dynamic contextual embeddings. To further demonstrate the effect of the attention mechanism, we replace it with a mean-pooling operation to aggregate features. From Figure 4, we find that the attention mechanism outperforms both the mean-pooling mechanism and fixed embeddings (ERNIE), indicating the effectiveness of our attention mechanism.
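The difference between the two aggregation variants in this ablation can be shown with a small Python sketch. The function names and toy vectors are hypothetical, and the attention weights are supplied by hand rather than computed from text; the point is only that mean pooling is the special case in which every neighbor receives the same weight.

```python
def weighted_pool(messages, weights):
    """Sum neighbor messages, each scaled by its weight."""
    dim = len(messages[0])
    return [sum(w * msg[d] for w, msg in zip(weights, messages)) for d in range(dim)]


def mean_pool(messages):
    """Mean pooling, written as weighted pooling with uniform weights."""
    uniform = [1.0 / len(messages)] * len(messages)
    return weighted_pool(messages, uniform)


# Three toy neighbor messages; suppose only the first is relevant to the text.
toy_messages = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mean_vec = mean_pool(toy_messages)
attn_vec = weighted_pool(toy_messages, [0.8, 0.1, 0.1])  # hand-picked weights
```

Under mean pooling the two irrelevant neighbors dominate the result, whereas text-dependent weights let the single relevant neighbor dominate, which is the behavior the ablation is designed to isolate.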

Conclusion and Future Work
We have proposed an effective and general framework that enables PLMs to dynamically select appropriate knowledge context according to textual context and then insert the embedded knowledge into PLMs. The experiments demonstrate that Coke achieves results comparable with state-of-the-art knowledge-enhanced PLMs on entity typing and relation classification. Dynamically selecting knowledge context according to textual context is also more interpretable than injecting all knowledge context from KGs. In the empirical analysis, Coke demonstrates effective selection of knowledge context as well. This direction may lead to more general and effective language understanding. In the future, we will continue to explore how to inject other types of knowledge (e.g., linguistic knowledge) in conjunction with factual knowledge to further enhance PLMs. It is also an interesting direction to explore how to continually inject emerging factual knowledge into PLMs without re-training the whole model.