PluDG: enhancing task-oriented dialogue system with knowledge graph plug-in module

Task-oriented dialogue systems continue to face significant challenges because they require not only an understanding of the dialogue history but also domain-specific knowledge. However, knowledge is often dynamic, making it difficult to integrate effectively into the learning process. Existing large language model approaches primarily treat knowledge bases as textual resources and fail to capture the underlying relationships between facts within the knowledge base. To address this limitation, we propose a novel dialogue system called PluDG. We regard the knowledge as a knowledge graph and propose a knowledge extraction plug-in, Kg-Plug, to capture the features of the graph and generate prompt entities that assist the system’s dialogue generation. In addition, we propose Unified Memory Integration, a module that enhances the comprehension of the sentence’s internal structure and optimizes the knowledge base’s encoding location. We conduct experiments on three public datasets and compare PluDG with several state-of-the-art dialogue models. The experimental results indicate that PluDG achieves significant improvements in both accuracy and diversity, outperforming current state-of-the-art dialogue models.


INTRODUCTION
Building task-oriented dialogue systems has become a prevalent research subject in both academic and business settings. The commonly used method to create a dialogue system is to develop an end-to-end system, which increases efficiency by generating responses directly from a knowledge base and dialogue history (Lu et al., 2023a; Lu et al., 2023b; Liu et al., 2023b). Figure 1 depicts all the data required by a task-oriented dialogue system. To make full use of the external knowledge base information, Madotto, Wu & Fung (2018) proposed Mem2Seq. The model enhances the MemNN framework (Sukhbaatar et al., 2015) using a sequence generation framework and incorporates a global multi-hop attention mechanism to copy words directly from the dialogue history or knowledge base. In addition, some researchers propose that the relationships between entities in an external knowledge base should be considered rather than treating the entities as isolated triples. Banerjee & Khapra (2019) achieved state-of-the-art results in goal-directed dialog systems using GCN (Kipf & Welling, 2016) to combine structural information with encoded sequences and developed contextual graphs for constructing hybrid dialogues in different languages.

Later, Zhao et al. (2023) proposed a multi-task learning method based on graph attention networks for modeling a multi-domain task-oriented dialogue system. On the other hand, researchers have also utilized large language models (LLMs) in task-oriented dialogue systems by treating response generation as a natural language generation task. One such system is UBAR (Yang, Li & Quan, 2021), a modularly designed task-based dialogue system based on GPT-2 that facilitates module replacement and functional extensions for different domains and scenarios. Rony, Usbeck & Lehmann (2022) proposed DialoKG, a model that incorporates knowledge into the GPT-2 architecture. To achieve this, the model leverages the structural information of the knowledge base by treating each entity as a sequence and calculating its weight with respect to the dialogue history with the help of RoBERTa (Liu et al., 2019). Nevertheless, because the information contained in knowledge bases is usually structured, consisting of entities and their relations, LLMs that treat entities as sequences may struggle to capture these structured relations (Shen et al., 2021; Liu et al., 2023a).
To address this limitation, this article presents a novel method called PluDG (PLUgins-Assisted Dialogue Generation). Specifically, we design a plug-and-play module called Kg-Plug, which treats knowledge as a knowledge graph. Kg-Plug utilizes LR-GCN modules, which apply low-rank decomposition to the GCN weights, for feature extraction. Furthermore, it employs an attention mechanism to align the graph features with the dialogue history and obtain the prompt entities, which are inferred from the dialogue history and knowledge base and are related to the user's true intent. The prompt entities are then provided to the decoder for dialogue generation. We employ a GPT-2-based decoder for generating responses and enhance it with an entity memory ensemble embedding, which utilizes special tokens and embeddings to improve GPT-2's ability to produce contextually appropriate results.
Our article makes the following major contributions:
• We propose PluDG, a task-oriented dialogue system that integrates a plug-and-play Kg-Plug component into a GPT-2-based decoder. PluDG learns intrinsic graph structure information from the knowledge base and obtains entity hints to pass to the decoder for better response generation.
• We propose a novel embedding technique for GPT-2, named Unified Memory Integration (UMI), which utilizes multi-layered and position embeddings that are aware of the structure of the dialogue history, knowledge base, and prompt entities.
• Experimental results on three benchmark datasets show the superior performance of PluDG compared to other state-of-the-art models. Our model outperforms existing approaches on BLEU and Entity F1, particularly on datasets with complex knowledge-base information.

RELATED WORKS
Task-oriented dialog systems have commonly been built with an end-to-end approach. Originally, researchers considered the KB and dialogue history as sequences. Lately, many researchers have emphasized the importance of preserving the connection between entities in the KB to achieve improved bot responses. The most recent studies have applied pre-trained language models to enhance dialog systems.
RNN-based dialogue systems. Wen et al. (2016) proposed a web-based task-oriented dialogue system capable of directly learning parameters from raw data. Later, Wu, Socher & Xiong (2019) proposed GLMP, which integrates the external knowledge base. The external knowledge utilized an end-to-end memory network (MN), storing word-level information about the knowledge base and conversation history. Regrettably, prior studies have failed to acknowledge the plentiful structural information present in knowledge bases, specifically the graph structural information formed by entity-entity relationships.
Knowledge graph-augmented dialogue systems. Graph neural networks are also used by some researchers to encode knowledge-base entities. He et al. (2020) developed Fg2Seq, which can integrate the latent semantics of conversation history, improving the description of entities and enabling better inference of knowledge related to conversation history. Wu, Harris & Zhao (2022) employed a GMN to comprehend the intrinsic patterns in the dialogue history and their connection with the KB. Although these methods treat the KB as a graph, their decoders are still based on RNNs, which do not capture contextual information as well as GPT.
Pre-trained language model-based dialogue systems. Madotto et al. (2020) employed a strategy called knowledge embedding to embed knowledge bases directly into model parameters. This approach does not require dialogue state tracking or template responses as inputs and can dynamically update its knowledge base through fine-tuning. Recently, Huang, Quan & Wang (2022) proposed a task-oriented dialog model that employs an Auto-regressive Entity Generation technique, which consists of three major components: a GPT-2 that generates replies, an entity generator that identifies entities in the responses, and a final stage that embeds the entities to generate the ultimate dialog response. It is an end-to-end task-oriented dialogue model that combines natural language processing and generation methods.
In contrast to previous studies, our work introduces PluDG, a novel task-oriented dialogue system. PluDG incorporates Kg-Plug, a plug-and-play component, to extract features from the knowledge base and align them with the dialogue context before passing prompt entities to the decoder. Additionally, to enhance the decoder's comprehension of the underlying semantic information, we employ the UMI module to provide the structure of the knowledge base and dialogue history.

METHOD
Prior to presenting the complete method, we provide a description of the problem.
For the given dialogue history, we denote the utterance of the user as $U$ and the system's response as $S$. For a given turn $i$, the dialogue consists of $T_i$, which is made up of $U_i$ and $S_i$:

$$T_i = (U_i, S_i).$$

If we assume that there have been $K$ turns in the dialogue history, then the entire history can be defined as

$$T = \{T_1, T_2, \ldots, T_K\}.$$

Regarding the knowledge base, we utilize the triple format $G = (e, r, o)$ to represent various entities and their relationships. Note that $e$ refers to the entities, $r$ represents the relationships, and $o$ represents objects. For instance, the $i$th potential triple may be $G_i = (\text{j\_restaurant}, \text{place}, \text{north})$.
Suppose there are $n$ entities for a given turn $i$; we then use $K_i$ to denote the knowledge base constructed in the format mentioned above:

$$K_i = \{G_1, G_2, \ldots, G_n\}.$$

The probability distribution of responses generated by the language model in the $i$th turn is formally defined as follows:

$$P(S_i \mid T, K_i) = \prod_{j=1}^{N} p(s_j \mid s_{1:j-1}, T, K_i),$$

where $S_i = [s_1, \ldots, s_N]$ represents the response generated in the $i$th round of the system, and $N$ is the maximum number of words in the response $S_i$. The subscript $1{:}j-1$ denotes elements 1 to $j-1$.

Overview
LLMs that treat entities as sequences may struggle to capture structured relations when processing knowledge bases. To address this problem, we propose a model called PluDG. The model is composed of the Kg-Plug module and the decoder. More details are shown in Fig. 2.

Kg-Plug module
The Kg-Plug module is designed as a plug-and-play component, as illustrated in Fig. 3. It treats the provided knowledge as a graph and employs LR-GCN for feature extraction. Subsequently, it infers the most probable entity hints based on the dialogue history information. Finally, the prompt entities are passed to the decoder.

Utterance encoder
Assuming that there are $K$ turns in the dialogue history, the history contains $2K-1$ utterances, where the $i$th utterance includes $L_i$ words. The words in the $i$th utterance are denoted $w_{il}$, where $l \in [1, L_i]$. First, a Bi-GRU, which includes both a forward unit and a backward unit, is used to obtain the hidden representation of each word from its embedding $\mathrm{Emb}(w_{il})$. Next, a self-attention unit with trainable parameters $W_w$, $b_w$, and $u_w$ captures the contextual information of each token and produces a semantic representation $v_i$ of the utterance. Lastly, a GRU is utilized to encode the utterance vectors $v_i$.
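The attention step of the utterance encoder can be illustrated with a small sketch. It assumes the common form in which each hidden state is projected through a tanh layer, scored against a context vector, normalised with a softmax, and used to weight the hidden states; the exact equations are not reproduced in this text, so the function name `attention_pool` and the argument layout are hypothetical, and the Bi-GRU that would produce `hidden` is omitted.

```python
import math

def attention_pool(hidden, W, b, u):
    """Pool per-word hidden states into one utterance vector.

    hidden: list of word vectors (the Bi-GRU outputs in the paper).
    W, b, u: stand-ins for the trainable parameters W_w, b_w, u_w.
    Assumed form: score = u . tanh(W h + b), weights = softmax(scores).
    """
    scores = []
    for h in hidden:
        proj = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_k)
                for row, b_k in zip(W, b)]
        scores.append(sum(p * q for p, q in zip(proj, u)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Weighted sum of the hidden states, one value per dimension.
    pooled = [sum(a * h[m] for a, h in zip(alpha, hidden))
              for m in range(len(hidden[0]))]
    return alpha, pooled
```

With a zero context vector every word receives the same weight, so the pooled vector reduces to the mean of the hidden states; a non-zero context vector shifts weight toward the words it scores highly.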

Context knowledge encoder
The Context Knowledge Encoder is employed to extract hidden information from both the dialogue history and knowledge base.
Context-KB Alignment. Following Chen et al. (2017), the Context-KB Alignment module aims to capture the alignment representation of each entity in the knowledge base through the incorporation of dialogue history. To achieve this goal, an attention mechanism is employed to align the dialogue history embedding with the knowledge base entity embedding, allowing for the creation of a coherent representation of the graph. Specifically, the module concatenates each word $w_{il}$ with the entity representation $e$, applies a tanh activation, and derives attention scores through a Softmax operation. These scores are then multiplied with the corresponding words and summed to generate the aligned representation $f^i_{align}(e)$ of the entity with respect to the conversation history. Here $W_e$, $b_e$, and $u_e$ are trainable weight parameters, and $[;]$ denotes concatenation. Next, the $j$th entity embedding $\mathrm{Emb}(e_j)$ is concatenated with its correspondingly aligned embedding $f^i_{align}(e_j)$. In this way, we obtain a sequence of history-aligned entity input representations. The sequence is then passed to a GRU unit to obtain a more robust history-aligned entity representation $f_{ij}$ for each entity $e_j$.
Knowledge Graph Encoder. In this section, we introduce a GCN (Kipf & Welling, 2016) to extract the intrinsic features of the knowledge graph. Inspired by Hu et al. (2021), we apply low-rank decomposition to the weights of the GCN and name the new module LR-GCN. For a given weight matrix $W_0 \in \mathbb{R}^{x \times z}$, we replace the update with $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{x \times y}$, $A \in \mathbb{R}^{y \times z}$, and $y \ll \min(x, z)$.
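The low-rank reparameterisation can be illustrated with a short sketch. It only shows the arithmetic of forming $W_0 + BA$ and the resulting parameter counts; the helper names are hypothetical, and which of the matrices are trained or kept fixed in LR-GCN is not stated in this text, so the sketch makes no assumption about that.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def low_rank_update(W0, B, A):
    """Return W0 + B A, where W0 is x-by-z, B is x-by-y and A is y-by-z."""
    delta = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, delta)]

def parameter_counts(x, y, z):
    """Entries in a full x-by-z matrix versus in the factors B and A."""
    return x * z, y * (x + z)
```

For example, with $x = z = 128$ (the hidden size reported in the training settings) and an illustrative rank $y = 4$, the factors hold 1,024 entries against 16,384 for the full matrix.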
In this section, we represent each entity as a node, where $N$ represents the set of nodes. The relationships between entities are denoted as edges, and $R$ represents the set of edges. Following the Context-KB Alignment operation, each entity has $2K-1$ representations, one for each of the $2K-1$ utterances in the dialog history. To capture the features of each node and its neighborhood, we apply the GCN to the graph. In the GCN update (Eq. (11)), $N^r_i$ denotes the set of neighborhood indices of entity $i$ under relation $r \in R$, and $W_r$ and $W_o$ are trainable parameters. The activation function $\sigma(\cdot)$ used in this research is ReLU.
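A single relational graph-convolution step of this kind can be sketched as follows. The sketch assumes a relational GCN update in which each node sums relation-specific messages from its neighbours, averaged per relation, adds a self-connection, and applies ReLU; the averaging constant and the data layout are assumptions, and the low-rank factorisation of the weights is left out for brevity.

```python
def matvec(W, v):
    """Multiply a matrix (nested lists) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rgcn_layer(h, neighbors, W_rel, W_self):
    """One relational GCN step over a small graph.

    h: dict mapping node -> feature vector.
    neighbors: dict mapping relation -> {node: [neighbour nodes]}.
    W_rel: dict mapping relation -> weight matrix (W_r in the text).
    W_self: weight matrix of the self-connection (W_o in the text).
    """
    out = {}
    for i, h_i in h.items():
        acc = matvec(W_self, h_i)
        for r, adj in neighbors.items():
            nbrs = adj.get(i, [])
            for j in nbrs:
                msg = matvec(W_rel[r], h[j])
                # Assumed normalisation: mean over neighbours under r.
                acc = [a + m / len(nbrs) for a, m in zip(acc, msg)]
        out[i] = [max(0.0, a) for a in acc]  # ReLU
    return out
```

Nodes without neighbours under any relation keep only their self-connection term, so isolated entities still receive a representation.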
Finally, a pooling method is used to fuse $g_{ij}$ and $f_{ij}$ into a question and text representation matrix $G_f$, where $W_g$, $b_g$, and $u_g$ are trainable weight parameters and $[;]$ denotes concatenation.

Entity reasoner
The entity reasoner is an important component of the Kg-Plug. In this component, we concatenate the Utterance Encoder's output and the Context Knowledge Encoder's output as $q^r_0$ and then use two-hop attention to obtain the final entity probability.
Dong and Chen (2023), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.1707

Two-Hop update. In the reasoning stage, following MemNN, we design a two-hop update mechanism to identify the precise entity. For clarity, we denote the number of hops as $X$, where $X = 2$. Given the hidden state $q^r_0$, we use learnable attention in each hop to search for deeper information. In the final hop, we use the Softmax function to obtain the final entity probability $p_{ent}$.
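The two-hop update can be sketched in the style of an end-to-end memory network. This is a minimal illustration under stated assumptions: the query attends over entity memory vectors, the attended read-out is added to the query, and the last hop returns the softmax scores as entity probabilities. The learnable projections of the actual model are omitted, and the function and argument names are hypothetical.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def multi_hop_entity_probs(q0, memory, hops=2):
    """Refine the query over `hops` hops and return entity probabilities.

    q0: initial query vector (q_0^r in the text).
    memory: one vector per candidate entity.
    Every hop but the last updates the query with an attention read-out;
    the last hop returns the attention distribution itself.
    """
    q = list(q0)
    for _ in range(hops - 1):
        attn = softmax([dot(q, m) for m in memory])
        read = [sum(a * m[k] for a, m in zip(attn, memory))
                for k in range(len(q))]
        q = [q_k + r_k for q_k, r_k in zip(q, read)]
    return softmax([dot(q, m) for m in memory])
```

In this simplified form the second hop sharpens the distribution toward the entity that the first hop already favoured, which is the intended effect of searching for deeper information.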

Decoder
In this study, the decoder is based on the GPT-2 model and is responsible for generating the final response.
Unified Memory Integration. As shown in Fig. 4, to incorporate entity structural information from the knowledge base and prompt entities from Kg-Plug into GPT-2, we use various embedding techniques, including entity embedding and type embedding, as well as the traditional word token and positional embedding. These techniques enable the decoder to extract the knowledge graph structure, which is linearized into a sequence as input, with special tokens ([NAME], [ADDR], etc.) to separate the subject, relation, and object of an entity. The entity embedding layers capture entity-level information about each word token, and the type embedding distinguishes the relevant tokens. Furthermore, we incorporate speaker information into the dialogue history. To differentiate between the system's response and the user's utterance, we employ the [SYS] token for system responses and the [USR] token for user utterances. Additionally, we use [Query] to indicate the user's current utterance for clear separation.
For generating responses, the GPT-2 decoder relies heavily on the input sequence, and the sequence of tokens plays a crucial role in determining the output.We position the prompt entities after the history, as shown in Fig. 4, in order to enhance the generation process.By doing so, we hope the decoder can draw upon a more precise context, which improves its ability to understand user queries and generate appropriate responses.
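The input layout described above can be sketched as a simple linearisation routine. The special tokens [NAME], [ADDR], [USR], [SYS], and [Query] come from the text; the ordering of the knowledge base before the history, the type labels, the whitespace tokenisation, and the address value in the usage note are assumptions made for illustration only.

```python
def build_decoder_input(entities, history, query, prompt_entities):
    """Linearise KB entities, dialogue history and prompt entities.

    entities: list of dicts such as {"name": ..., "addr": ...}.
    history: list of (speaker, utterance) pairs, speaker "USR" or "SYS".
    query: the user's current utterance.
    prompt_entities: entity hints, placed after the history.
    Returns the token list and a parallel list of type labels.
    """
    tokens, types = [], []

    def add(new_tokens, type_label):
        tokens.extend(new_tokens)
        types.extend([type_label] * len(new_tokens))

    for entity in entities:
        for field, value in entity.items():
            # A field such as "name" becomes the special token [NAME].
            add(["[" + field.upper() + "]", value], "kb")
    for speaker, utterance in history:
        add(["[" + speaker + "]"] + utterance.split(), "history")
    add(["[Query]"] + query.split(), "history")
    add(list(prompt_entities), "prompt")
    return tokens, types
```

For instance, calling it with the entity `{"name": "dojo_noodle_bar", "addr": "some_street"}` (the address is a made-up placeholder), one earlier user and system turn, the query "what is the address", and the prompt entity `dojo_noodle_bar` yields a sequence that starts with `[NAME]` and ends with the prompt entity, with one type label per token.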
The probability $p_{final}$ of each response word is calculated from the embedded tokens, with each Transformer layer updating the hidden state as follows:

$$h^t_l = \mathrm{TransformerBlock}(h^t_{l-1}), \tag{19}$$

where $e_x$ represents $x$ in one-hot representation, $W_v$ represents the word vector mask, $W_p$ is the position mask, and $l \in L$ indexes the Transformer layers.

EXPERIMENTS

Datasets
We evaluate our model on three publicly available benchmark datasets: CamRest (Wen et al., 2016), In-Car Assistant (Eric & Manning, 2017), and MultiWOZ 2.1 (Budzianowski et al., 2018). Details of each dataset are provided below:
• CamRest. The dataset comprises dialogs in the restaurant reservation domain, consisting of 676 multi-turn dialogs with an average of five turns per dialog. Additionally, each dialog has an average of 22.5 KB triples.
• MultiWOZ 2.1. The dataset comprises three distinct domains: attractions, hotels, and restaurants. Each dialog in the dataset has an average of 5.6 turns and 54.4 KB triples.
To process the data, we followed the method used by Rony, Usbeck & Lehmann (2022) and divided the dataset into training, validation, and test sets containing 1,839, 117, and 141 dialogs, respectively.

Metrics
We evaluate our model with two metrics that are widely used in dialogue studies, BLEU (Papineni et al., 2001) and Entity F1, which ensures a fair comparison with previous work.
• BLEU. The Bilingual Evaluation Understudy (BLEU) metric measures the n-gram overlap between generated responses and gold standard responses.
• Entity F1. We use Entity F1 to assess the system's ability to produce relevant entities that can accomplish specific tasks by retrieving accurate entities from the provided knowledge base. To compute the Entity F1 score, we micro-average the precision and recall over the knowledge base entities of the generated responses.
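The micro-averaged Entity F1 can be illustrated with a short sketch. It assumes that the entities of each generated and gold response have already been extracted as sets; the extraction step and the function name are not taken from the source.

```python
def entity_f1(predicted, gold):
    """Micro-averaged F1 over entities.

    predicted, gold: parallel lists with one set of entities per response.
    Counts are pooled over all responses before precision and recall
    are computed, which is what micro-averaging means.
    """
    true_pos = false_pos = false_neg = 0
    for pred_set, gold_set in zip(predicted, gold):
        true_pos += len(pred_set & gold_set)
        false_pos += len(pred_set - gold_set)
        false_neg += len(gold_set - pred_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

Because the counts are pooled first, responses that mention many entities weigh more heavily in the score than responses that mention few.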

Model training
Cross-entropy is used to guide the model-training process. Specifically, the negative log-likelihood is calculated between the predicted and actual distributions over the training data. Here $D$ is the dialogue dataset consisting of the dialogues $D_1, D_2, \ldots, D_{|D|}$, and $|D|$ is the number of dialogues in the dataset. Let $s^j_i$ be the word output by the model at the $i$th time step of its response for $D_j$. Here, $n$ represents the maximum response length, while the dialogue history $T$ and knowledge base $K$ are given by $D_j$.
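The objective can be illustrated with a minimal sketch. It assumes that the loss sums the negative log-probabilities of the gold words in each response and averages over dialogues; the exact normalisation is not recoverable from this text, and the model that would produce the probabilities is replaced by plain numbers.

```python
import math

def negative_log_likelihood(gold_token_probs):
    """Average negative log-likelihood over dialogues.

    gold_token_probs: one list per dialogue holding the probability the
    model assigned to each gold response word, given the preceding
    words, the dialogue history and the knowledge base.
    """
    total = sum(-math.log(p)
                for dialogue in gold_token_probs
                for p in dialogue)
    return total / len(gold_token_probs)
```

A model that assigns probability 1 to every gold word reaches a loss of zero, and the loss grows as the probability of any gold word falls.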

Training settings
We employed the PyTorch framework to implement our model, which was trained on an NVIDIA GeForce RTX 3070 with 8 GB of GPU memory. In our experiments, the Kg-Plug's embedding dimensions and hidden units were set to 128, and the batch size was set to 8. Additionally, we set the number of hops for the Entity Reasoner to 2.
For the decoder, we used the standard pretrained GPT-2 with 137M parameters. The model was trained end-to-end with the AdamW optimizer, with the learning rate set to 6.25e-5 and the decay set to 1e-8. For all the datasets, the dropout ratio was set to 0.2. More hyper-parameters used to train PluDG are listed in Table 1.

Evaluation results
Table 2 illustrates the superior performance of our model compared to baseline models on three datasets, as demonstrated by both BLEU (Papineni et al., 2001) and Entity F1 metrics. Additionally, we present the architectures the models utilized. Our experimental results indicate that PluDG achieves a BLEU score of 23.0 and an Entity F1 score of 76.9 on the CamRest dataset, along with a significantly improved BLEU score of 21.6 and an Entity F1 score of 69.5 on

Ablation study
To assess the necessity of each component in PluDG, we conducted an ablation study by removing the Kg-Plug and Unified Memory Integration (UMI) modules and analyzing their impact on the performance of the framework. As shown in Table 3, our results indicate that these two modules are essential for achieving high performance in task-oriented dialogue generation tasks.
After removing Kg-Plug, a component added as a plug-in to the model, we observed a significant drop in performance on various evaluation indicators, particularly the Entity F1 scores on CamRest and In-Car Assistant, both of which decreased by more than three points. We speculate that the prompt entities provided to the GPT-2 decoder play a vital role in generating responses. Similarly, removing the UMI module leads to a performance drop across all three datasets. Although the BLEU score on the In-Car dataset experienced the most significant drop, exceeding 15 points, the Entity F1 indicator increased. Thus, we

Significance test
To rigorously assess the significance of the performance improvement of our proposed method, we conducted an evaluation using the t-test. We compared PluDG with the best baseline model. The comparison is divided into BLEU significance, Entity F1 significance, and the significance of both combined. The results, presented in Table 4, demonstrate that PluDG exhibits significantly improved performance metrics compared to the best baselines, with all p-values below the 0.05 significance level.

Comparison with other GNN models
Our proposed approach, PluDG, exhibits significant improvements over existing baselines. We hypothesize that this improvement can be traced to the powerful graph feature extraction of the Kg-Plug. To test our hypothesis, we compared LR-GCN with three other GNNs, namely GIN, GSE, and GAT, each of which was modified to directly replace our original LR-GCN modules, ensuring a fair evaluation. Figure 5A illustrates that LR-GCN outperforms the other GNNs in terms of BLEU on the CamRest and MultiWOZ 2.1 datasets, but its score is comparatively lower on the In-Car Assistant dataset. In Fig. 5B, LR-GCN exhibits a slightly higher Entity F1 score than the other GNNs on the CamRest and MultiWOZ 2.1 datasets, and significantly outperforms them on the In-Car Assistant dataset. Overall, while different GNNs offer unique advantages for specific datasets, our LR-GCN approach demonstrates the most substantial cumulative improvement in the two evaluation metrics across all three datasets. We attribute this observation to LR-GCN's use of low-rank factorization of the weight matrices, which helps prevent overfitting and potentially better captures the global characteristics of the entire graph.

Case study
Figure 6 displays the responses of PluDG along with multiple baseline models over three rounds of the In-Car Assistant dataset. In the first round, PluDG accurately answered the question, but with less romance compared to the ground truth. In contrast, DialoKG's response was slightly less satisfactory, while Fg2Seq provided a more comprehensive, yet mechanical, reply. In the second round, DialoKG barely met the expectations of the ground truth. Conversely, Fg2Seq mechanically responded to the first-round responses. On the other hand, PluDG offered nearly correct answers and generated smoother, more engaging responses. In the third round, it appears that all three models responded similarly.
Overall, the responses generated by PluDG are more contextually appropriate and comprehensible to humans. Taking the three cases together, despite the remaining gap between the sentences generated by PluDG and the reference entities of the real responses, the first two cases demonstrate PluDG's capability to generate semantically similar responses and provide more informative replies.

Table 4 Result of the significance test.
significantly impacts the output. By incorporating more semantic information, the GPT-2 model obtains a more accurate and comprehensive context, leading to more relevant responses. Additionally, when all the extra modules were removed, we observed a drop in all indicators, performing even worse than the previous baseline model. In conclusion, our ablation study emphasizes the critical importance of Kg-Plug and UMI in PluDG, as they are essential for achieving state-of-the-art performance in task-oriented dialogue tasks.