1 Introduction

Dialogue systems solve practical problems in our daily life, which is filled with much redundant information. A dialogue system can respond quickly to user questions based on a knowledge base, and it has many real-life applications, such as personal assistants, E-commerce customer service, and chatbots. Dialogue systems can be classified in various ways; according to the type of dialogue context, they are divided into single-turn and multi-turn systems. While single-turn conversation is well developed, multi-turn conversation has received increasing attention from researchers in recent years because of its more complex contexts and its relevance to real-life scenarios. In multi-turn dialogues, using the contextual information effectively becomes an important task, as there are often multiple turns of interaction between the user and the dialogue system, and the topic often changes from turn to turn.

Existing approaches for multi-turn dialogue generation can be categorized into two groups: handling complex contexts and integrating additional relevant knowledge. The first group constructs different methods to deal with diverse contexts, which is also the traditional and classic way to solve the multi-turn dialogue generation problem. Researchers have been attempting to extract important information from complex contexts. One of the earliest efforts on multi-turn dialogue generation is the HRED model proposed by Serban et al. [1, 2], in which multi-turn dialogue modeling is enhanced by adding additional encoders. The HRED model provides many useful ideas for modeling multi-turn conversations, but the neural networks used in it are RNN-based. Later, the Transformer [3] quickly replaced RNN-based networks with its superior performance and computational speed, and many researchers now work on multi-turn dialogue based on the Transformer architecture. The ReCoSa model [4] uses complex attention mechanisms to obtain important word information in the context, combining Transformer-based self-attention with conventional attention; after obtaining word-level information, it performs information fusion and then uses the decoder to generate the sentence with maximum probability. However, this approach still risks losing important information about the current turn: the answers of this model are often repetitive, and the decoder lacks directionality. Because the model is inadequate for capturing important information about the current turn of conversation, the final generated answers lack thematic consistency.

The second group of researchers believes that the current multi-turn contexts alone are not enough to support diverse answers, so they look for more relevant knowledge and adopt ingenious methods to integrate it into the context. Zhang et al. [5] propose the Short-text Topic-level Attention Relevance with Biterm Topic Model (STAR-BTM) and integrate implicit topic information. However, implicit information can hardly attend to explicit information, such as the user's question in the current turn. Conditional Historical Generation (CHG) [6] utilizes more relevant historical dialogues, so that the model can see questions from the same previous scenarios. Zhou et al. [7] propose a commonsense conversational model, which retrieves relevant knowledge graphs from a knowledge base and then encodes them. Although these methods have achieved good results, in real life the application scenarios of dialogue systems are very complex, and there is often no auxiliary information to help the model improve the quality of sentences. What's more, the retrieval method greatly increases the number of model parameters, which increases training time and slows down model inference.

Dependency parsing analyzes sentence structure and parses a sentence into a tree that captures the relationships between words. This structure tree is used by Shi et al. [8] with good results. A natural way to handle such trees is to transform the tree structure into a graph structure and apply graph embedding methods. In the process of constructing a knowledge graph, entity linking is also necessary. Azzalini et al. [9] use deep learning to capture the semantic properties of data. Building knowledge from subject-predicate-object triples has also attracted increasing attention; Sikos et al. [10] survey many recent methods for constructing knowledge graphs.

In this paper, we propose a model that extracts an auxiliary graph structure for multi-turn dialogue generation, called EAGS. We believe that syntactic dependency relationships can replace external related background knowledge, since the dependency structure contains the explicit information of a sentence. Our EAGS model combines implicit and explicit information through semantic and structure extraction. We also store the trained subgraphs, which can be retrieved as external knowledge for multi-turn dialogue generation in a specific domain. We split the contexts in the dataset into multiple sentences and parse each sentence into a dependency tree. Because a tree is a special graph, we employ a graph convolutional neural network (GCN) to model the graph features. We use two publicly available datasets to validate the effectiveness of our model: the Ubuntu multi-turn dialogue dataset [11] and the Daily Dialogue multi-turn dialogue dataset [12]. The relevant experiments validate the effectiveness of our proposed approach.

Table 1 shows an example of a multi-turn conversation selected from the Daily Dialogue dataset. We segment the contexts according to the order in which users speak, yielding Utterance 1, Utterance 2, Utterance 3, Utterance 4, Utterance 5, and the Current Turn. We believe the current turn contains more useful information, so we give it a higher attention weight. Our proposed EAGS model consists of several existing modules. Although each module has been proposed in prior work, we are the first to combine these techniques with knowledge graph information for better results in the multi-turn dialogue generation task.

Table 1 An example of a multi-turn dialogue from the Daily Dialogue dataset; we split the multi-turn dialogue into several utterances and the current turn

The contributions of this paper are summarized as follows:

  • We propose the EAGS model, which integrates syntactic dependency information as a substitute for external knowledge. Due to the use of syntactic dependencies, our model can combine implicit and explicit information.

  • We use graph embedding methods to model the dependency tree, and we propose a cross attention method to combine semantic-level attention and structure-level attention.

  • We build the subgraphs that reach an equilibrium state in each context. These domain-specific subgraphs can be retrieved as auxiliary knowledge for multi-turn dialogue generation.

  • We design a multi-task training approach to enhance graph local features and semantic local features. The multi-task learning method can promote local features and keep them balanced with the global features.

  • We conduct experiments on the Ubuntu large-scale English multi-turn dialogue community dataset and the Daily Dialogue dataset. The experimental results show that our model performs well on both automatic and human evaluation compared with existing baseline models.

2 Related work

There are many application scenarios for multi-turn conversation: intelligent customer service bots in e-commerce, voice assistants, blog posts and replies on social media platforms, and so on. Such data can help users get a better shopping experience, and Yin et al. [13] apply NLP techniques to tweets on Twitter to analyze the effects of COVID-19. Despite many existing research works on single-turn dialogue generation [14,15,16], multi-turn dialogue generation has gained increasing attention from both academia and industry in recent years. Existing approaches for multi-turn dialogue generation can be categorized into two groups: handling complex contexts and integrating additional relevant knowledge.

2.1 Handling complex contexts

Multi-turn dialogue generation models are mainly based on the encoder-decoder architecture, first proposed in the Sequence to Sequence model [1]. This model started the task of dialogue generation, because unlike retrieving an answer index from a knowledge base, it can combine words from the vocabulary according to the user's questions to generate a logical answer sentence. The Sequence to Sequence (Seq2Seq) model is widely used in single-turn dialogue generation. In real life, however, multi-turn dialogue has more application scenarios, and it is obviously unreasonable to simply concatenate a multi-turn dialogue into a single-turn context. The hierarchical recurrent encoder-decoder architecture (HRED) [1] was therefore proposed to capture context information. HRED combines an additional hierarchical encoder to model part of the conversation. From then on, researchers began to focus on how to represent complex contexts. Later, Serban et al. propose the VHRED model [17] with hidden variables, based on HRED. This method introduces hidden variables into the intermediate state of HRED to improve the diversity of the generated dialogue. The performance of the two models is similar, and VHRED with hidden variables is more robust. The Transformer [3] then attracted more and more attention because of its ability to extract natural language features and its remarkable computing speed. The ReCoSa model [4], based on the attention mechanism, extracts sentence and word features via long self-attention. However, the ReCoSa model tends to generate repetitive answers, such as 'I don't know', 'Me too', and so on. The hierarchical self-attention network (HSAN) [18] can find the most important words and utterances in contexts simultaneously; it uses a hierarchical encoder to update the word and utterance representations with their position information respectively.

2.2 Integrating additional relevant knowledge

The second group of researchers believes that handling the contexts alone is not enough, so they introduce more external background knowledge, including implicit topic information, context-related knowledge graphs, context-related historical dialogue information, and so on. There is much recent work on topic-level implicit information mining. Zhang et al. [5] propose the STAR-BTM model, which finds latent topic-level information and integrates it into dialogue generation. Xing et al. [19] design a neural topic segmentation model; they enhance a hierarchical attention Bi-LSTM network to better model context by adding a topic-related auxiliary task and restricted self-attention. Shuai et al. [20] propose a Topic Enhanced Multi-head Co-Attention model (TMCA) based on hierarchical networks to better capture the interactions between sentences via implicit topic information. The CMTE model is designed by Li et al. [21], who represent topics with topically related words; the CMTE model focuses not only on coherence with the context, but also on bringing up new chatting topics. Some researchers introduce knowledge graph structures with external knowledge. Li et al. [22] propose a topic-level knowledge-aware dialogue generation model to capture context-aware topic-level knowledge information; they decompose the given knowledge graph into a set of topic-level sub-graphs and integrate graph features into their model. Jiang et al. [23] conduct probabilistic topic modeling from the perspective of data privacy in industry. Wu et al. [24] propose a MHKD-Seq2Seq framework to utilise knowledge from other sources. A data manipulation method is proposed by Cao et al. [25], which can introduce explicit personas into generation models. However, these models all rely on external knowledge; in real life, the multi-turn dialogue tasks we have to deal with do not come with a well-defined knowledge graph, and building a new knowledge graph is very difficult. There are also works that integrate retrieval and generative methods. Zhu et al. [26] use adversarial training to combine generated sentences with sentences obtained from retrieval and get good results, but this method is based on single-turn dialogue, and generative adversarial networks (GANs) are hard to train. CHG [6] focuses more on the integration of dialogue and historical information. Li et al. [27] propose a novel subspace clustering framework, which can map non-linear basic theme data into a latent space.

2.3 Graph neural networks

Graphs are a kind of data structure that models a set of objects (nodes) and their relationships (edges). Recently, research on analyzing graphs with machine learning has been receiving more and more attention. Based on CNNs and graph embedding, graph convolutional neural networks (GCNs) [28] were proposed to aggregate information from graph structure. Ying et al. [29] develop a GCN algorithm that integrates efficient random walks and convolutions to generate node embeddings. GCNs have many applications. Yao et al. [30] use a GCN for text classification, learning document and word embeddings for better classification. Li et al. [31] use ontology information to constrain the knowledge representation learning model, called TransO; TransO incorporates rich ontology information with explicit relations. In the process of training entity embeddings on knowledge graphs, Zhang et al. [32] propose the hyper-relational feature learning network (HRFN) to use meta-learned relation features from the dataset. There is also much work on knowledge graph question answering. Saxena et al. [33] propose an effective method for multi-hop KGQA over sparse knowledge graphs. In a community question-answering (CQA) system, Jing et al. [34] propose a knowledge-enhanced attentive answer selection model, which can consider professional knowledge and limits of authority. Jian et al. [35] propose a knowledge-aware dialogue generation model to address the issue of introducing common sense into open-domain dialogue systems. Atwood et al. [36] incorporate the h-hop transition probability matrix into the convolution operation. The graph attention network (GAT) [37] was then proposed, which considers the importance of all neighbors of the current node; GAT uses self-attention to build the graph attention layer. Previous approaches are inherently transductive and do not naturally generalize to unseen nodes, so GraphSAGE [38] was presented to leverage node feature information and generate node embeddings for previously unseen data. Many researchers have also applied graphs in practice. Yu et al. [39] propose a novel deep learning framework to tackle the time series prediction problem in the traffic domain. There are similar applications of graphs in multi-turn dialogue generation. Zhou et al. [7] propose a conversation generation model, which attentively reads the retrieved knowledge graphs and the knowledge triples within each graph to facilitate better generation through a dynamic graph attention mechanism. Xue et al. [40] focus on dynamic network embedding and refine the category hierarchy with typical learning models. Cai et al. [41] use physical interactions and design an influence diffusion model to take into account both cyber and physical user interactions in an effective and practical way. Zhang et al. [42] propose the TransRHS approach, using relational structures to build a more complete knowledge graph. Liu et al. [43] propose KGVQL, a knowledge graph interactive visual query language, to improve end users' understanding of knowledge graphs. Knowledge Tracing (KT) traces the state of evolving mastery of particular knowledge or concepts, which can also be represented as a graph structure. Song et al. [44] propose Bi-CLKT to address deep knowledge tracing problems; this method obtains discriminative representations of concepts based on graph-level contrastive learning. Song et al. [45] try to establish connections between exercises across concepts and enhance the model's interpretability.

3 EAGS model

In this section, we illustrate our model in detail; its architecture is depicted in Figure 1.

Figure 1

The architecture of the EAGS model. Solid lines represent the direction of vector flow, and dotted lines represent the addition of multi-task losses. The blue circles represent the words in a sentence, and the box composed of green circles represents the distribution of word vectors

3.1 Multi-task training

In this subsection, we explain how multi-task learning works. Multi-task learning is a kind of joint learning: multiple tasks learn in parallel, and their results affect each other.

In the EAGS model, the node attention and structure attention serve as auxiliary tasks that enhance the node representation and structure representation respectively. The semantic representation of nodes can affect the structural representation of the dependency graphs, and the relation aggregation of the dependency graphs can in turn serve as auxiliary information for the semantic nodes. Multi-task learning is only used during training. The EAGS model proposed in this paper is somewhat time-consuming to train, but many of the time-consuming computational modules are not used in actual inference. The EAGS model contains a GCN module, a Bi-GRU module, an encoder module, and a multi-task decoder module; these modules account for a large number of its parameters, which makes training time-consuming. Once training is done, the time cost during actual inference is not significant.

The loss of the Node decoder can be calculated as:

$$Loss_{Node}=P\left(y_t\vert C,y_1,...,y_{t-1};\theta\right)=P\left(y_t\vert h^{NR};\;\theta\right)$$
(1)

The loss of the Structure decoder can be calculated as:

$$Loss_{Structure}=P\left(y_t\vert C,y_1,...,y_{t-1};\theta\right)=P\left(y_t\vert h^{SR};\;\theta\right)$$
(2)

The total loss of the EAGS model is the sum of the three losses.

$$Loss = Loss_{Node} + Loss_{Structure} + Loss_{Fusion}$$
(3)

Algorithm 1 shows the calculation of the loss value in multi-task learning (see the sketch after the algorithm). After obtaining the total loss, we use the Adam optimizer to optimize all model parameters, which amplifies the effect of local feature extraction and keeps it balanced with the global features. The GCN is jointly optimized with the whole model during training, and the node embeddings obtained from the Bi-GRU are fed into the GCN to further fuse structural information. The role of the GCN in the EAGS model is to extract structural information from the syntactic dependency tree; in this paper, we use the graph embedding representation to extract structural relationships between different nodes.

Algorithm 1

Multi-task algorithm.
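As a rough illustration of how the three losses in (1)-(3) are combined during training, the following PyTorch sketch sums three cross-entropy terms over a shared target sequence; the function and argument names are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def multitask_loss(node_logits, structure_logits, fusion_logits, targets, pad_id=0):
    """Sum the three decoder losses of (1)-(3); each logits tensor is
    (batch, seq_len, vocab) and targets is (batch, seq_len) of word ids."""
    def nll(logits):
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1), ignore_index=pad_id)
    loss_node = nll(node_logits)            # auxiliary node decoder, eq. (1)
    loss_structure = nll(structure_logits)  # auxiliary structure decoder, eq. (2)
    loss_fusion = nll(fusion_logits)        # main fusion decoder
    return loss_node + loss_structure + loss_fusion   # eq. (3)
```

The summed loss is then passed to a single Adam optimizer so that the auxiliary gradients also flow into the shared encoder and GCN parameters.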

3.2 Hierarchical encoder

In the multi-turn dialogue generation task, the context contains several sentences, and we divide it into multiple utterances according to the order in which users speak. We believe the most important turn in the context is the current turn, which contains the question the user most wants to ask. This method not only lets the current turn serve as a query to find the most relevant information in previous turns, but also adds auxiliary information about user turn-taking. Furthermore, every separate turn can be parsed by the dependency parsing algorithm, so the hierarchical architecture we design fits the downstream dependency parsing operations.

Given contexts, which can be divided as follows:

$$Contexts = \left\{ {utterance{_{1}},utterance{_{2}},...,utterance{_{n}}} \right\},$$
(4)

each utterancei in contexts can be represented as \(utterance{_{i}} = \left \{ {{w_{1}},{w_{2}},...,{w_{n}}} \right \}\), where wi represents a word in the utterance.

Different utterances are encoded by various Bi-GRU neural networks in a hierarchical way. Given an utterance utterancei as input, a standard GRU first encodes each input utterance to a fixed-dimension vector. The GRU is calculated as follows:

$$\begin{array}{@{}rcl@{}} {z_{t}} &=& \sigma \left({{W_{z}}\left[ {{h_{t - 1}},{x_{t}}} \right]} \right) \end{array}$$
(5)
$$\begin{array}{@{}rcl@{}} {r_{t}} &=& \sigma \left({{W_{r}}\left[ {{h_{t - 1}},{x_{t}}} \right]} \right) \end{array}$$
(6)
$$\begin{array}{@{}rcl@{}} {\tilde{h}_{t}} &=& \tanh \left({W\left[ {{r_{t}}*{h_{t - 1}},{x_{t}}} \right]} \right) \end{array}$$
(7)
$$\begin{array}{@{}rcl@{}} {h_{t}} &=& \left({1 - {z_{t}}} \right)*{h_{t - 1}} + {z_{t}}*{{\tilde{h}}_{t}} \end{array}$$
(8)

where σ is an activation function. We use a Bi-directional GRU (Bi-GRU) to get the forward and backward latent distributions. In addition, we integrate positional embeddings into the contexts so as to capture the importance of different contexts at the sentence level. The TOS (turn of sentence) embedding follows the original positional embedding [3]. The TOS embedding makes the distribution of each turn uniform, and it helps the following attention modules pay more attention to the turn characteristics in complex contexts. The TOS embedding is calculated as follows:

$$\begin{array}{@{}rcl@{}} TOS\;\left({pos,2i} \right) &= &\sin \left({\frac{{pos}}{{{{10000}^{\frac{{2i}}{{{d_{model}}}}}}}}} \right) \end{array}$$
(9)
$$\begin{array}{@{}rcl@{}} TOS\;\left({pos,2i + 1} \right) &=& \cos \left({\frac{{pos}}{{{{10000}^{\frac{{2i}}{{{d_{model}}}}}}}}} \right), \end{array}$$
(10)

where dmodel represents the hidden vector dimension in the EAGS model.
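The sinusoidal TOS embedding of (9)-(10) mirrors the original positional encoding of [3], with the turn index in place of the token position. A minimal sketch follows; the function name and tensor shapes are assumptions.

```python
import torch

def tos_embedding(num_turns: int, d_model: int) -> torch.Tensor:
    """Turn-of-sentence embedding, eqs. (9)-(10): sin on even dims, cos on odd dims."""
    pos = torch.arange(num_turns, dtype=torch.float).unsqueeze(1)      # (T, 1) turn index
    i = torch.arange(0, d_model, 2, dtype=torch.float)                 # even dimensions (2i)
    angle = pos / torch.pow(torch.tensor(10000.0), i / d_model)        # (T, d_model / 2)
    tos = torch.zeros(num_turns, d_model)
    tos[:, 0::2] = torch.sin(angle)
    tos[:, 1::2] = torch.cos(angle)
    return tos                                                          # (T, d_model)
```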

The final contexts representation is shown as follows:

$$Contexts = \left\{\left[{\overset{\rightharpoonup}{h}}_{1}, {\overset{\leftharpoonup}{h}}_{1}\right] + {TOS_{1}},...,\left[ {\overset{\rightharpoonup}{h}}_{T}, {\overset{\leftharpoonup}{h}}_{T}\right] + {TOS_{t}} \right\},$$
(11)

where TOSi is the turn-of-sentence embedding, a positional embedding indicating the order of turns, and \({\overset {\rightharpoonup }{h}}\) and \({\overset {\leftharpoonup }{h}}\) denote the forward and backward sentence vectors respectively. In the same way, the current turn representation is:

$$CurrentTurn = \left\{ {\left[{\overset{\rightharpoonup}{h}},{\overset{\leftharpoonup}{h}}\right]} \right\}$$
(12)
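Putting (5)-(12) together, the sketch below encodes each utterance with a Bi-GRU, concatenates the forward and backward final states, and adds the TOS embedding per turn, reusing the tos_embedding helper sketched above; module names and dimensions are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Encode each utterance with a Bi-GRU and add the turn-of-sentence embedding."""
    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.bigru = nn.GRU(d_model, d_model // 2, batch_first=True, bidirectional=True)

    def forward(self, utterances: torch.Tensor) -> torch.Tensor:
        # utterances: (num_turns, max_len) word ids for one dialogue context
        emb = self.embed(utterances)                       # (T, L, d_model)
        _, h_n = self.bigru(emb)                           # h_n: (2, T, d_model // 2)
        sent = torch.cat([h_n[0], h_n[1]], dim=-1)         # [h_fwd ; h_bwd] -> (T, d_model)
        sent = sent + tos_embedding(sent.size(0), sent.size(1))  # add TOS per turn, eq. (11)
        return sent
```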

3.3 Graph constructing

In order to transform an utterance into a graph, we employ a syntactic (dependency) parsing algorithm. In the EAGS model, we use Stanford's syntactic analysis method for English, and it can easily be replaced by other methods that transform a specific sentence into a graph structure. The parsing result is a tree, which can be represented as a graph, because a tree is a special graph. The graph nodes represent the words in the sentence, and the edges express the grammatical relationships between words, such as nominal subject, passive nominal subject, etc. Table 2 and Figure 2 show the process of parsing sentences and constructing the parsing graph; we use Utterance 5 and the current turn as examples.

Table 2 An example in Daily dialogue dataset
Figure 2

Examples of converting sentences into graphs

From Figure 2, we can see that every sentence has been transformed into a graph, which can be expressed by triplets, as illustrated in the sketch below.
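As an illustration of this sentence-to-triplet step, the snippet below uses Stanza, a Python interface to Stanford's dependency parser (the choice of library and the example sentence are assumptions; the paper only states that Stanford's syntactic analysis is used), and emits (head word, relation, dependent word) triplets.

```python
import stanza

# one-time setup: stanza.download("en")
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

def sentence_to_triplets(sentence: str):
    """Parse a sentence and return (head word, dependency relation, dependent word) triplets."""
    doc = nlp(sentence)
    triplets = []
    for sent in doc.sentences:
        for word in sent.words:
            # word.head is 1-based; 0 means the word attaches to the artificial ROOT
            head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
            triplets.append((head, word.deprel, word.text))
    return triplets

print(sentence_to_triplets("I would like to book a table for two."))  # illustrative input
```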

3.4 Graph encoder

In this subsection, we explain how to model the graph structure. Graph convolutional neural networks (GCNs) are widely used as structure encoders for aggregating node information in natural language processing. A GCN uses a message passing mechanism to update the current node from its neighbor nodes, and every node is updated with this mechanism in a GCN layer. The complete network consists of several such layers. Information interacts between nodes through multiple GCN layers, and eventually the nodes reach a steady state through the message passing mechanism, at which point we consider that the nodes have fully absorbed the structural information of the syntactic dependency graphs. Figure 3 shows the message passing mechanism employed in the EAGS model. Each GCN node represents a word in the sentence. We use the GRU outputs at every time step as the initial GCN node embeddings, instead of the random initialization used in previous work, to enhance the semantic-level representation of the extracted graph. EAGS is trained jointly, so that the representation of each node is not only updated by its neighbor nodes but also aggregates word semantic vectors. The message passing paradigm between nodes is calculated as:

$$h^{\left({l + 1} \right)} = \sigma \left({{b^{\left(l \right)}} + \sum\limits_{j \in N\left(i \right)} {\frac{1}{{{c_{ji}}}}h_{j}^{\left(l \right)}{W^{\left(l \right)}}} } \right),$$
(13)

where \(h^{\left ({l+1} \right )}\) represents the node representation at layer l + 1, \(N_{\left (i \right )}\) is the set of neighbors of node i, cji is the product of the square roots of the node degrees, \({b^{\left (l \right )}}\) is the bias, and σ is an activation function. In this paper, the GCN parameters are not initialized randomly; in order to improve the training speed of the model and make full use of the semantic information of the Bi-GRU, we use each word's embedding representation input to the GRU as the initial representation of the word in the GCN.
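A self-contained sketch of the graph convolution in (13), using symmetric degree normalization so that cji is the product of the square roots of the node degrees; this dense-adjacency implementation is illustrative and is not the exact module used in EAGS.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One message-passing layer implementing eq. (13) with a dense adjacency matrix."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=True)   # W^(l) and b^(l)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, in_dim) node features, adj: (num_nodes, num_nodes) with self-loops
        deg = adj.sum(dim=1)                                   # node degrees
        norm = deg.clamp(min=1).pow(-0.5)                      # 1 / sqrt(deg)
        # c_ji = sqrt(deg_j) * sqrt(deg_i): normalize messages from both endpoints
        adj_norm = norm.unsqueeze(1) * adj * norm.unsqueeze(0)
        return torch.relu(adj_norm @ self.weight(h))           # sigma( ... )
```

In EAGS the initial node features h would be the Bi-GRU outputs for the words of the utterance rather than random vectors, as described above.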

Figure 3

The message passing mechanism of the GCN. A GCN network usually has many layers; we use the last layer as the final representation of the nodes

The context node representation is:

$${h^{Utteranc{e_{i}}}} = Average\left({{{w_{1}},{w_{2}},...,{w_{n}}}} \right),$$
(14)

where \(h^{Utterance_{i}}\) represents the ith utterance's sentence vector, and wi represents a word in Utterancei.

The current turn node representation is:

$${h^{Current}} = Average\left({{{w_{1}},{w_{2}},...,{w_{n}}}} \right),$$
(15)

where wi represents the word in the current turn. We use average pooling to get the sentence representation.

3.5 Node level attention

In this subsection, we describe the node-level attention. In the EAGS model, word attention can be seen as node-level attention, because the representation of each node corresponds to a word representation in the sentence. To find the critical turn in the contexts, we use the attention mechanism [3]. The context attention layer and attention formula are calculated as follows:

$$Attention = softmax \left({\frac{{Q{K^{T}}}}{{\sqrt {{D_{k}}} }}} \right)V,$$
(16)

where Q, K, V represent the query, key, and value matrices respectively, and \({\sqrt {{D_{k}}}}\) is the square root of the vector dimension. The node-level attention can be calculated as follows:

$$h^{Node}=Attention\;\left(CurrentTurn,Contexts\right),$$
(17)

where hNode is the Node fusion vector of current turn and contexts.
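A compact sketch of (16)-(17): scaled dot-product attention with the current-turn representation as the query and the context representations as keys and values; the linear projections and multi-head splitting of [3] are omitted for brevity.

```python
import math
import torch

def attention(query: torch.Tensor, keys: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention, eq. (16): softmax(Q K^T / sqrt(D_k)) V."""
    d_k = keys.size(-1)
    scores = query @ keys.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ values

# eq. (17): h_node = Attention(CurrentTurn, Contexts)
# current_turn: (1, d_model), contexts: (num_turns, d_model)
# h_node = attention(current_turn, contexts, contexts)
```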

3.6 Structure level attention

In this subsection, we describe the structure-level attention. In the EAGS model, different utterances in the contexts are extracted as dependency graphs. Different graphs have various structures and relations, which are relevant to the current turn and can be used as auxiliary information to enhance the final fusion vectors. The structure-level attention can be calculated as follows:

$$h^{Structure}=Attention\;\left(h^{Current},h^{Utterance_i}\right),$$
(18)

where hStructure is the structure fusion vector of current turn and contexts.

3.7 Decoder

In this subsection, we present the decoder, which is trained jointly with the hierarchical encoder and the graph encoder. Based on the Sequence to Sequence (Seq2Seq) architecture, we use several decoders instead of only one. The final outputs are generated only by the main decoder; the other two decoders are used to enhance the representation of their own aspects, namely the structure representation and the node representation. In the multi-task decoder, each sub-decoder has the same structure as the main decoder, consisting of a Transformer decoder, but the loss values for the different sub-tasks are calculated separately by the model and then combined.

Given an input response Response = {y1,y2,…,ym}, we use a mask operator on the response during training [3] to avoid revealing the ground-truth answers: for each word yt, we mask \(\left \{ {{y_{t + 1}},...,{y_{m}}} \right \}\) and only see \(\left \{ {{y_{1}},...,{y_{t - 1}}} \right \}\).

$$Response = \left\{ {\left({{w_{1}} + {P_{1}}} \right),\left({{w_{2}} + {P_{2}}} \right),...,\left({{w_{t - 1}} + {P_{t - 1}}} \right)} \right\},$$
(19)

where Pi represents the positional embedding of words.
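A minimal sketch of the target-side mask described above, so that position t only attends to earlier positions; the helper name and its use before the softmax are assumptions.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """True entries mark future positions that must be hidden during decoding."""
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# typical use inside attention, before the softmax:
# scores = scores.masked_fill(causal_mask(t), float("-inf"))
```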

In addition, we also use the attention component, feeding the matrix of response vectors as the query, key, and value matrices through different linear projections. The response vector can be calculated as:

$$h\;^{Res}=Attention\left(Response\right),$$
(20)

where hRes represents the self-attention outputs vectors.

The node decoder focuses on semantic-level enhancement, which can be calculated as follows:

$$h^{\;NR}=Attention\;\left(h^{Node},h^{Res}\right),$$
(21)

where hNR represents the node hierarchical fusion of contexts and current turn.

The structure decoder focuses on structure-level enhancement, which can be calculated as follows:

$$h^{\;SR}=Attention\left(h^{Structure},h^{Res}\right),$$
(22)

where hSR represents the structure hierarchical fusion of contexts and current turn.

The log-likelihood of the corresponding response sequence is:

$$P\left({Y \vert C;\theta } \right) = \prod\limits_{t = 1}^{T^{\prime}} {P\left({{y_{t}} \vert C,{y_{1}},...,{y_{t - 1}};\;\theta } \right)},$$
(23)

where the parameter 𝜃 denotes all model parameters, which can be optimized through the different deep neural modules.

In order to integrate the node features and structure features, we concatenate the node attention vectors and structure attention vectors.

$$h^{Fusion} = Concat\left(h^{Node},h^{Structure}\right)$$
(24)

After passing through the composed attention, shown in the pink box in Figure 1, we generate the words of the response through softmax:

$$Loss_{Fusion} = P\left({{y_{t}} \vert C,{y_{1}},...,{y_{t - 1}};\;\theta } \right) = P\left({{y_{t}} \vert h^{Fusion};\;\theta } \right),$$
(25)

where P(yt) is the probability of the most likely word in the generated answer sentence, and the parameter 𝜃 denotes all model parameters, which can be optimized through the different deep neural modules.
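To make (24)-(25) concrete, a rough sketch of the final step: the node and structure attention vectors are concatenated, projected to the vocabulary, and either read off with softmax at inference time or turned into a cross-entropy loss during training. Layer names and dimensions are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Concatenate node and structure vectors (eq. 24) and predict the next word (eq. 25)."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, vocab_size)

    def forward(self, h_node, h_structure, targets=None, pad_id=0):
        h_fusion = torch.cat([h_node, h_structure], dim=-1)   # (batch, seq_len, 2 * d_model)
        logits = self.proj(h_fusion)                          # (batch, seq_len, vocab)
        if targets is None:
            return F.softmax(logits, dim=-1)                  # P(y_t | h_fusion; theta)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1), ignore_index=pad_id)  # Loss_Fusion
```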

4 Experiment

4.1 Datasets

We use the Ubuntu community multi-turn dialogue dataset [11] and the Daily Dialogue dataset [12] to evaluate the performance of our proposed model. In these multi-turn dialogue datasets, each turn is distinguished by a special symbol in the context, which makes it very easy to distinguish the turns and incorporate TOS embeddings to enhance the turn features. We use the official "__eot__" marker as the criterion to segment the different turns, as sketched below.
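A trivial sketch of this segmentation, assuming the raw context is a single string with the official "__eot__" delimiter and that the last segment is the current turn:

```python
def split_turns(context: str):
    """Split a raw Ubuntu context into turns on the official '__eot__' marker."""
    turns = [t.strip() for t in context.split("__eot__") if t.strip()]
    return turns[:-1], turns[-1]   # (previous utterances, current turn)
```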

4.2 Baselines

  • Seq2Seq: Classic Sequence to sequence model with attention mechanism [46].

  • HRED: Hierarchical Recurrent Encoder-Decoder [1], which adds an additional encoder.

  • VHRED: a variant of HRED; the implicit variable information makes the VHRED model more robust [17].

  • ReCoSa: Relevant Context with Self-Attention model [4], which uses long-distance attention and self-attention to capture key word and sentence features.

  • STAR-BTM: Multi-turn dialogue generation with implicit topic level information [5].

  • CHG: Utilizing historical dialogue representation to learn a historical dialogue selection model [6].

  • HSAN: A recent hierarchical self-attention mechanism, which combines the important word features and utterances features in contexts together [18].

4.3 Experiment settings

In order to make a fair comparison among all baseline methods, the hidden layer size is set to 256, the batch size is set to 128, 8 attention heads are used, the Adam optimizer is used, and the learning rate is 0.001. We use PyTorch to run all models on three Tesla T4 GPUs.

4.4 Human evaluation

We randomly sample 200 messages from the Ubuntu test set to conduct the human evaluation, as it is extremely time-consuming. We recruit 5 evaluators to judge the responses from three aspects [47].

  • Appropriateness: a response is logical and appropriate to its message.

  • Informativeness: a response has meaningful information relevant to its message.

  • Grammaticality: a response is fluent and grammatical.

4.5 Automatic evaluation

We use perplexity (PPL) [1], BLEU [48], and Dist-1 and Dist-2 [49] to evaluate our responses, where Dist-k is the number of distinct k-grams normalized by the total number of words in the response and measures diversity.

PPL is used in natural language processing to measure the quality of a language model and is related to the model loss. BLEU measures the overlap between the candidate response and the reference. Dist-1 and Dist-2 measure the diversity of texts from different angles. Therefore, a good model usually has smaller PPL values and larger BLEU, Dist-1, and Dist-2 values.
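A small sketch of the Dist-k computation as described above: the number of distinct k-grams divided by the total number of generated tokens, aggregated over all responses; exact normalization details in [49] may differ.

```python
def distinct_k(responses, k: int) -> float:
    """Dist-k: distinct k-grams over total tokens, across all generated responses."""
    kgrams, total_tokens = set(), 0
    for resp in responses:
        tokens = resp.split()
        total_tokens += len(tokens)
        kgrams.update(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return len(kgrams) / max(total_tokens, 1)

# example: distinct_k(["i am fine", "i am not sure"], 2)
```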

4.6 Experiment design

We conduct extensive experiments on both datasets to verify the effectiveness of our EAGS model. Firstly, we compare our proposed multi-turn dialogue generation model EAGS with the existing multi-turn dialogue generation models (as shown in Table 5). In order to verify the necessity and importance of the key modules of our proposed model, we run several groups of experiments.

The EAGS model contains a novel graph structure: it utilizes syntactic features of the sentence itself and does not need external knowledge. In order to ensure the fairness and consistency of the baseline models, we simply added the graph encoder to the other baseline models (as shown in Table 6). The simple fusion method directly concatenates the sentence vectors obtained from the average pooling of graph nodes with the baseline models' sentence vectors. We also employ several attention mechanisms to extract the most important information from the contexts; this method gives the current turn a higher attention weight and retains more valid information about the question. We therefore remove different modules to check the contribution of each attention module to the model (as shown in Table 7). We do not use a single decoder to decode the final answer sentence; instead, we use a multi-task method to help balance the self-promotion of local features and the global performance (as shown in Table 8). We also conduct ablation experiments to verify which module brings the greatest improvement to the model (as shown in Table 9).

5 Analysis

The test results are shown in the tables. We use two evaluation criteria: one is human evaluation, and the other is the mainstream automatic metrics originally developed for machine translation. In terms of overall performance, our model achieves good results using only the current context information, without introducing external knowledge. This method is more suitable for the real-life situation where auxiliary data are lacking.

5.1 Human evaluation

The human evaluation focuses more on areas that are not covered by the automatic assessment. From the human evaluation in Tables 3 and 4, our method has the highest average score in terms of appropriateness, informativeness and grammaticality.

Table 3 Human evaluation results of mean score, proportions of three levels (+ 2, + 1, and 0 represent excellent, good and average respectively) on Ubuntu dataset
Table 4 Human evaluation results of mean score, proportions of three levels (+ 2, + 1, and 0 represent excellent, good and average respectively) on Daily dialogue dataset

Compared with the classic Seq2Seq model, our EAGS model shows significant improvements in all indicators. Compared with the ReCoSa model, which does not integrate any auxiliary information, our EAGS model significantly improves appropriateness, informativeness, and grammaticality after fusing syntactic structure information. Compared with the model using implicit topic information, the answer sentences generated by the EAGS model, which combines explicit syntactic structure information and implicit semantic information, are better in terms of human evaluation. This also shows that after fusing syntactic structure information, our EAGS model does not generate repetitive answers, such as yes, me too, etc. Compared with the model that integrates context history information, the EAGS model with a syntactic structure graph for each turn generates higher quality answer sentences.

5.2 Automatic evaluation

In the automatic evaluation experiments, we test our EAGS model on both datasets. From Table 5, we can see that our EAGS model achieves good results on various indicators without introducing external auxiliary knowledge. Compared with the classic Seq2Seq model and the HRED model, EAGS improves dramatically. Compared with the ReCoSa and HSAN models, which apply complex operations to the contexts, our model performs better after fusing graph information. The Dist-1 score of our model on the Ubuntu dataset does not exceed the ReCoSa model, which may be related to the excessive length of the multi-turn dialogue sentences in the Ubuntu dataset. On the Daily Dialogue dataset, where sentences are shorter than in the Ubuntu dataset, our model achieves very good results; this may be related to our different context splitting method and the use of syntactic graph structure information. Compared with the STAR-BTM and CHG models, which integrate topic-level implicit information and historical background information respectively, the EAGS model with the syntactic structure graph is better. This should benefit from the combination of the explicit information from syntactic parsing of sentences and the implicit semantic information of the nodes representing words.

Table 5 Performance of different models on Ubuntu dataset and Daily dialogue dataset

In order to ensure the fairness and consistency of the baseline models, we simply add the graph encoder to the other baseline models. From Table 6, we can find that after simply adding graph structure information to the baseline models, the loss of these models increases, but the quality of the generated answer sentences improves slightly. Comparing the STAR-BTM and CHG models before and after this change, we find their performance indicators improve; we believe that the structure graph information covers explicit information that implicit topic information and historical information lack.

Table 6 Comparison of our complete model and simple fusion graph information model on Ubuntu and Daily dialogue dataset

In the attention ablation experiment in Table 7, we remove the node attention and structure attention respectively. Removing the structure attention has the greatest impact on the model. However, no matter which attention module is removed, our model still performs better than the previous baseline models on most indicators. We believe that structure attention learns more syntactic relations, such as nominal subject, passive nominal subject, etc. Moreover, segmenting contexts according to the users' utterance order is also better suited to the attention mechanism for finding the key turns.

Table 7 Attention ablation experiment on Daily dialogue dataset

In the multi-task ablation experiment in Table 8, we modify the final loss by removing the node decoder loss and the structure decoder loss respectively. From Table 8, we can see that the performance indicators of the EAGS model decrease after removing the multi-task module, but the decline is smaller than when removing the attention module. We believe that multi-task learning improves the final performance of the model when there are auxiliary tasks. Our EAGS model can self-correct the structural and semantic aspects of nodes, and the local losses can affect the global performance. However, using multi-task learning may increase the number of model parameters and the time needed to optimize the model.

Table 8 Multi-task ablation experiment on Daily dialogue dataset

We also remove several important parts of our EAGS model: the graph encoder and decoder form one group, and the node encoder and decoder form another. The graph encoder is responsible for aggregating sentence structures; the node encoder is responsible for optimizing the semantic embedding of each word, and each node corresponds to a word in the sentence. From Table 9, we can find that removing the node encoder-decoder or the graph encoder-decoder has a great impact on the model. After removing the structure module or the node module, the downward trend of the model is similar to that of removing the attention module.

Table 9 Modules ablation experiment on Daily dialogue dataset

In general, we believe that without any external knowledge, our EAGS model is more suitable for real-life situations. Using the sentence's native syntactic dependency information helps the final performance of the model, it is feasible to use a GCN to handle the dependency tree, and the multi-task learning method helps the model better aggregate local features.

6 Conclusion and future work

In this paper, we proposed the EAGS model, which improves the quality of multi-turn dialogue generation without the help of external information. We believe that syntactic dependency information can replace external knowledge. We used a GCN and a multi-task learning method to fit our model. Furthermore, we stored the subgraphs built by the EAGS model; these subgraphs can be used as external knowledge graph information in future multi-turn dialogue generation tasks. Experimental results showed the superiority of our proposed EAGS model. In the future, we will continue to mine the user's character information contained in the complex multi-turn contexts and combine it with dialogue systems.