A General Method for Transferring Explicit Knowledge into Language Model Pretraining



Introduction
Recently, substantial work has shown that pretrained models [1][2][3][4] can learn language representations over large-scale text corpora, which are beneficial for many downstream NLP tasks. For example, XLNet [5] obtains new state-of-the-art results on twelve NLP tasks, including reading comprehension, question answering, and text classification. Researchers [6][7][8] have ascertained that pretraining allows models to learn syntactic and semantic information of language that is then transferred to other tasks. Most existing works model the representations by predicting a missing word only through its contexts. This operates solely at the word token level [9], without considering the background knowledge in the text.
Background knowledge [10] comprises all of the world knowledge that the reader brings to the task of reading. This can include episodic (events), declarative (facts), and procedural (how-to) knowledge, as well as related vocabulary. Background knowledge influences the efficacy of understanding and has been considered an indispensable part of language understanding [11][12][13]. For instance, one major step in improving reading is to improve prior knowledge of the topics being read [14]. We argue that background knowledge can lead to better language understanding. For example, given the sentence "Xiaomi was officially listed on the main board of HKEx," the background knowledge may include: Xiaomi is a science and technology company, HKEx refers to Hong Kong Exchanges and Clearing Limited, and main board is an economic term. Knowing these knowledge facts can help us better understand the word senses and the sentence topic.
Explicit knowledge is knowledge that can be readily articulated, codified, stored, and accessed (https://en.wikipedia.org/wiki/Explicit_knowledge), such as Freebase [15] and DBpedia [16]. For example, "Xiaomi is a science and technology company" is a typical piece of explicit knowledge. It can be stored in knowledge bases, represented in the form of an SPO (subject, predicate, object) triplet [17], where the subject and object are entities and the predicate is a relation between those entities. This paper seeks to find the impact of explicit knowledge on transformer pretraining.
How to inject explicit knowledge from external sources into transformer-based language models has gradually become a research hotspot [18]. There are improved models based on BERT [19][20][21] or GPT [22], which prove that injecting extra knowledge can significantly enhance the original models. The difference between these methods lies in the ways knowledge is injected. For example, ERNIE [19] refines the transformer architecture by using entity-level masking and phrase-level masking. KALM [22] signals the existence of entities to the input of the transformer in pretraining and adds an entity prediction task to the output.
In this paper, we propose a simple but general method for transferring knowledge into LM pretraining without changing the model architecture. Taking XLNet as a running example, given a sentence, we first use a dictionary lookup method to map it to its knowledge facts. A knowledge injection layer is then designed to combine external knowledge with the original sentence, in a way that is close to natural language and can be accepted by XLNet without losing structural information. Finally, we take the output of the knowledge injection layer directly as the input of XLNet and design a three-stage training method to save training time and hardware cost. To investigate the impact of explicit knowledge, we leverage open-domain and domain-specific knowledge in combination with XLNet and test the resulting performance on various NLP tasks.
Our contributions in this paper are threefold: (1) proposal of a simple but general knowledge transferring method for language model pretraining, (2) proposal of K-XLNet as an implementation of the proposed method on XLNet, and (3) empirical verification of the effectiveness of K-XLNet on various downstream tasks. The rest of the paper is organized as follows. Section 2 summarizes related work. In Section 3, we elaborate on our knowledge transferring method, taking XLNet as a running example. Section 4 reports experimental results, and finally Section 5 concludes the whole paper.

Related Work
In recent years, with the rapid development of deep learning, pretraining technology [1-3, 23, 24] in the field of natural language processing has made great progress. Many efforts [4, 5] are devoted to pretraining language representation models that learn language representations over large-scale corpora and then utilizing the representations for downstream NLP tasks such as question answering [25] and text classification [26].
Petroni et al. [8] discovered that language models embed some facts and relationships in their weights during pretraining. This can help explain the performance of these models on semantic tasks [27, 28]. However, relying only on the text corpus, they have some tendency to hallucinate knowledge, whether through bias or incorrect knowledge in the training data [18]. Moreover, they [22] are still far from ready to serve as an "unsupervised multitask learner." There are still gaps between model pretraining and task-specific fine-tuning [29]. Pretraining models usually learn universal language representations from general-purpose large-scale text corpora but lack domain- or task-specific knowledge. This leads to the need for huge efforts in task-specific fine-tuning or overparameterization [30].
Recent work shows that combining language models with knowledge is a promising way to improve them. Knowledge bases such as WordNet [31], Freebase [15], DBpedia [16], and ConceptNet [32] contain a large amount of reliable knowledge. By leveraging DBpedia, K-BERT [21] has revealed promising results in knowledge-driven applications such as named entity recognition, entity typing, and relation extraction. Based on BERT and improved by refining the transformer architecture with Baidu Baike (a wiki-like online Chinese encyclopedia), ERNIE [19] and ERNIE (THU) [20] achieved good results on five Chinese natural language processing tasks, including natural language inference, semantic similarity, named entity recognition, sentiment analysis, and question answering. Based on GPT-2 [33], KALM [22] significantly improved downstream tasks such as zero-shot question answering by adding entity signals to the input of the transformer and an entity prediction task to the output. KEPLER [34] incorporated factual knowledge into language representations by jointly training with knowledge embedding and masked language modeling objectives. This also makes it work well as an inductive knowledge embedding model for knowledge graph link prediction.
Our work is similar to AMS [35] and K-BERT [21], in which knowledge triplets are aligned with textual sentences. However, different from them, we focus on finding the direct impact of explicit knowledge on pretraining. Instead of changing the transformer architecture, we feed sentences and the related knowledge facts into a knowledge injector. The injector is designed to combine the knowledge with the original text and generate knowledge-enriched output as the input of the transformer. A three-stage training method is proposed to combine domain knowledge and the pretrained model flexibly. Taking XLNet as a running example, we investigate the impact of open-domain and domain-specific knowledge on pretraining in various downstream NLP tasks.

Methodology
We propose a simple but general method for transferring knowledge into language model pretraining. In this paper, we take XLNet as a running example. Figure 1 shows the overall framework. We can see that the proposed method does not change the original architecture of XLNet. A knowledge injection layer is designed and connected to Transformer-XL. We elaborate on the proposed method in the following three sections.

Knowledge Matching.
For a given sentence, how to effectively match its related knowledge is the primary problem we need to solve. We take SPO triplets in a knowledge base (e.g., DBpedia) as the source of knowledge facts. A term lexicon is first built by leveraging all the subjects in the knowledge base. Then, we identify subject terms in the given sentence by means of a lexicon lookup. During parsing, if one term is a substring of another term (e.g., New York and New York Times), we choose the longest term as the subject term. We finally get the related SPO triplets through a frequency-based dictionary lookup: if there are multiple matched terms in the knowledge base, we choose the one with the highest frequency. Subject term matching is also known in natural language processing as entity linking [36]. Instead of using a more sophisticated entity linker [37], we use a dictionary lookup: it is more efficient for large-scale text corpora than entity linking methods, and a highly tuned entity linker may propagate its own biases into the transformer.
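The longest-match lookup with frequency-based disambiguation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the lexicon and frequency values are hypothetical stand-ins for terms harvested from a knowledge base such as DBpedia.

```python
def match_subjects(tokens, lexicon, max_len=5):
    """Greedy longest-match lexicon lookup with frequency-based disambiguation.

    tokens:  list of word tokens from the sentence.
    lexicon: maps a surface form (tuple of tokens) to a list of
             (entity_id, frequency) candidates from the knowledge base.
    Returns a list of (start_index, surface_form, entity_id) matches.
    """
    matches = []
    i = 0
    while i < len(tokens):
        hit = None
        # Prefer the longest span starting at i ("New York Times" over "New York").
        for j in range(min(len(tokens), i + max_len), i, -1):
            span = tuple(tokens[i:j])
            if span in lexicon:
                # Among multiple knowledge-base entries for the same surface
                # form, keep the most frequent one.
                entity = max(lexicon[span], key=lambda cand: cand[1])[0]
                hit = (i, span, entity)
                break
        if hit:
            matches.append(hit)
            i += len(hit[1])  # skip past the matched span
        else:
            i += 1
    return matches
```

The `max_len` cap bounds the number of spans probed per position, keeping the lookup linear in sentence length for a fixed maximum term length.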

Knowledge Injection.
We aim to inject the structured, high-confidence knowledge information in KBs into the high-coverage, contextual information found in language models. This injection would permit the model to incorporate some of the information found in the KB to improve its performance on many downstream tasks. However, it is a challenge to fuse lexical, syntactic, and knowledge information in the pretraining procedure for language representation [20].
Instead of designing a special pretraining objective for the fusion, we aim to integrate knowledge naturally into the original text in a way that conforms to the grammatical rules of natural language. For example, given the sentence "Xiaomi listed in Hong Kong" and a knowledge fact (Xiaomi, is_a, science and technology company), we generate a knowledge-enriched sentence like "Xiaomi, a science and technology company, listed in Hong Kong." In this way, we let the pretraining model use clues in the knowledge-enriched text to learn word representations that better reflect how language conveys knowledge. Therefore, we treat the knowledge injection problem as a machine translation problem and design a knowledge injector with a structure similar to the transformer used in natural language translation [38], shown as the knowledge injection layer in Figure 1. It mainly consists of two modules, i.e., an embedding layer and a transformer. For the input sentence and knowledge facts, the embedding layer first converts them, together with positional information, into an embedding representation and then feeds it into the transformer for knowledge combination.
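The target output format of the injection can be illustrated by a rule-based sketch. Note that in our method this combination is learned by a transformer; the function below only shows the intended appositive form for an `is_a` triplet and is purely illustrative.

```python
def inject_knowledge(sentence, triplet):
    """Illustrates the target knowledge-enriched form: the object of an
    `is_a` triplet is inserted as an appositive after the subject.
    (The paper's injector is a learned transformer; this rule-based
    sketch only shows the intended output format.)
    """
    subject, predicate, obj = triplet
    if predicate == "is_a" and subject in sentence:
        # Pick "a" or "an" so the result stays grammatical.
        article = "an" if obj[0].lower() in "aeiou" else "a"
        return sentence.replace(subject, f"{subject}, {article} {obj},", 1)
    return sentence  # no applicable fact: leave the sentence unchanged
```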

Embedding Layer.
The function of the embedding layer is to convert the given sentence and its related knowledge facts into an embedding representation that can be fed into the following transformer.
In order to express the positional information of the original sentence and the SPO triplets, we splice the original sentence with the matched knowledge triplets and use a two-dimensional array of two elements to encode the position of each word. The first element is the sequence number of the word's position, called the absolute position index. The second element is the sequence number of the word that matches the subject in the original sentence, called the relative position index. Figure 2 shows an embedding example. The encoding of each word in the original sentence is composed of two elements with the same value, because each word in the original sentence matches itself. For instance, the encoding of the first word "Xiaomi" is [1, 1]. For the matched knowledge triplets, each SPO triplet is horizontally spliced into a whole sequence.

Transformer.
As shown in Figure 1, the transformer consists of an encoder and a decoder, with the same structure as the typical transformer [39]. It takes advantage of positional embedding as a mechanism to encode order within a knowledge-enriched sentence. The encoder stacks 6 identical layers, each of which uses two sublayers: a multihead attention mechanism and a feed-forward network, coupled with layer normalization and residual connections. The multihead attention mechanism computes a softmax distribution for each word within a sentence, including the word itself, as shown in the following equation:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

The input consists of queries Q and keys K of dimension d_k and values V of dimension d_v. The queries, keys, and values are linearly projected h times, allowing the model to jointly attend to information from different representations. On top of the multihead attention, there is a feed-forward network that consists of two layers with a ReLU activation in between. Each encoder layer takes as input the output of the previous layer, allowing it to attend to all positions of the previous layer. The decoder has an architecture similar to the encoder's, but with two multihead attention sublayers.
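The scaled dot-product attention used inside each multihead attention sublayer can be sketched as follows (single head, no learned projections, for clarity):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Q, K: arrays of shape (n, d_k); V: array of shape (n, d_v).
    Returns an (n, d_v) array of value vectors mixed by attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n, n): similarity of each query to each key
    # Row-wise softmax (shifted by the row max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a convex combination of V's rows
```

In the full model, Q, K, and V are first linearly projected h times and the h attention outputs are concatenated and projected again.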
We use it to inject the matched knowledge triplets into the original sentence. Regarding the matched knowledge triplets as a different language from the original sentence, we turn the knowledge injection problem into a machine translation problem and aim to translate them into a language that conforms to natural language grammar. In this way, the knowledge-enriched output can be naturally used as the input of Transformer-XL.
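The two-dimensional position encoding described in the Embedding Layer section can be sketched as follows. The paper's example only fixes the sentence-word case ("Xiaomi" → [1, 1]); that triplet tokens simply continue counting in the absolute index is our assumption for illustration.

```python
def position_indices(sentence_tokens, triplets, subject_positions):
    """Sketch of the two-dimensional position encoding: each token gets
    [absolute_index, relative_index]. Words in the original sentence match
    themselves, so both indices coincide. Triplet tokens continue counting
    in the absolute index (an assumption beyond the example in Figure 2),
    while their relative index points back to the matched subject's
    position in the sentence. Indices are 1-based, as in the paper.
    """
    # Original sentence: each word matches itself -> [i, i].
    positions = [[i, i] for i in range(1, len(sentence_tokens) + 1)]
    abs_idx = len(sentence_tokens)
    for triplet_tokens, subj_pos in zip(triplets, subject_positions):
        for _ in triplet_tokens:
            abs_idx += 1
            positions.append([abs_idx, subj_pos])
    return positions
```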

XLNet Connection.
XLNet is an advanced transformer-based language model. We take it as a running example for combining knowledge with language model pretraining. The XLNet connected after the knowledge injection layer does not include a tokenization module; it is essentially the Transformer-XL, since tokenization and encoding have already been performed in the embedding layer.
Normally, retraining the pretrained model is necessary to refine it by leveraging knowledge information. However, the cost of retraining is very high, in terms of both time and hardware. We propose a simple and general way to resolve this problem, inspired by the mainstream idea of pretraining and fine-tuning. Figure 3 shows the process of training K-XLNet. It mainly has three stages: XLNet pretraining, task-specific fine-tuning, and K-XLNet training for the specific task. The first two stages are consistent with the usual two-stage pretraining model. Instead of pretraining K-XLNet on a large-scale general corpus, we train it on specific tasks by leveraging external knowledge. In addition to saving cost, this approach makes it easy to flexibly test the effects of different knowledge bases on different downstream tasks. The following experiments show that this method is effective.
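The three-stage procedure can be sketched as a pipeline. The stage functions below are hypothetical placeholders, not real APIs; in practice they would wrap XLNet pretraining (or, as in our setting, loading a released checkpoint), task fine-tuning, and training the knowledge injection layer.

```python
def train_k_xlnet(pretrain, fine_tune, train_with_knowledge,
                  general_corpus, task_data, knowledge_base):
    """Schematic of the three-stage training in Figure 3.

    All three stage callables are placeholders supplied by the caller.
    """
    # Stage 1: standard XLNet pretraining on a large general corpus
    # (reusing a publicly released checkpoint avoids the retraining cost).
    model = pretrain(general_corpus)
    # Stage 2: task-specific fine-tuning, as in the usual two-stage recipe.
    model = fine_tune(model, task_data)
    # Stage 3: train K-XLNet on the task data with the knowledge injection
    # layer, leveraging the external knowledge base.
    model = train_with_knowledge(model, task_data, knowledge_base)
    return model
```

Swapping the `knowledge_base` argument is what makes it cheap to test different knowledge bases against the same fine-tuned model.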

Experiment
In this section, we evaluate the performance of K-XLNet on seven downstream tasks: one English domain-specific task and six Chinese open-domain tasks.

Domain-Specific Task.
We first compare the performance of K-XLNet with the original XLNet on an English domain-specific task, namely, sentiment classification of movie reviews.
To be specific, we use the IMDB [42] dataset for this test. It includes 25,000 positive reviews and 25,000 negative reviews, which we divided into three parts: train, dev, and test. We used the train part to fine-tune the model and then evaluated its performance on the dev and test parts. For knowledge injection in K-XLNet, we used MovieKG and DBpedia separately. Table 1 shows the experimental results.
It can be seen that K-XLNet is superior to the original XLNet in both parameter settings (base and large). This shows that our approach of injecting knowledge into XLNet is effective. In addition, MovieKG performs better than DBpedia, indicating that domain knowledge is preferable for domain-specific tasks.
We further investigate the effect of different SPO triplet (knowledge) amounts in K-XLNet. In this test, we use MovieKG for knowledge injection and set the number of knowledge triplets to 1,000, 5,000, 6,000, and 7,000, respectively. The results are shown in Figure 4.
We can see that from 1k to 5k, the performance of the K-XLNet models improves as more knowledge is injected. After 5k, the performance tends to be stable or even slightly decreases. This suggests that, when using knowledge for model improvement, more is not necessarily better.
In the following experiments, we set the triplet (knowledge) amount to 5k for all K-XLNet models and compare our K-XLNet to the original XLNet using the Large setting, since the Base setting shows a similar performance trend.

Security and Communication Networks
It is used to recognize named entities in text, including person names, place names, organization names, and so on.
Similarly, the open-domain datasets are split into three parts: train, dev, and test, used for fine-tuning, model selection, and model testing, respectively. Table 2 shows the test results of the various models in terms of accuracy. We can see that K-XLNet consistently performs better than XLNet on the six open-domain tasks. To be specific, the improvements are significant on the NER task, but not on the sentence classification tasks. Moreover, the model leveraging HowNet performs better than the one using CN-DBpedia on sentence classification tasks, but the opposite holds on the NER task. The above observations show that knowledge injection into XLNet is also effective on open-domain downstream tasks and that it is important to choose an appropriate knowledge base for the specific task.

Discussion.
This work focuses on transferring knowledge facts into language model pretraining. The above experiments show that the proposed method achieves good results on all the downstream tasks. However, the improvements are limited due to the following reasons: (1) In knowledge matching, a simple frequency-based dictionary lookup is used to match knowledge facts from the KB. Its accuracy depends on the ambiguity of terms in the text and the popularity of matched terms in the KB. The higher the matching accuracy, the better the model effect. For this task, we need to trade off time efficiency against matching precision. (2) In knowledge transferring, we use a typical transformer [39] to learn the knowledge-enriched sentence representation. Because of the lack of sufficient computational resources, we train the K-XLNet model on the datasets for the downstream tasks instead of retraining on a large corpus (shown in Figure 3). This may limit its performance to some extent. Moreover, as shown in Figure 4, more injected knowledge is not always better. These observations indicate that injecting relevant explicit knowledge is useful, but the relationship between explicit knowledge and downstream tasks is not yet well understood and is worth further study.

Conclusion
In this paper, we propose a simple but general knowledge transferring method for language model pretraining. Taking XLNet as a running example, we construct K-XLNet to show the effectiveness of our method. Extensive experiments show that K-XLNet performs better than XLNet on both open-domain and domain-specific tasks. This work suggests some interesting directions for future work. For example, we can further explore different machine learning techniques to improve the knowledge matching. It would also be interesting to investigate the potential adaptability of explicit knowledge to different downstream tasks.

Data Availability
The databases and test datasets used in this paper are public data, and the download links can be found in the relevant references.

Conflicts of Interest
The authors declare that they have no conflicts of interest.