Multi-Turn Response Selection in Retrieval-Based Chatbots with Hierarchical Residual Matching Network

Response selection in retrieval-based chatbots aims to find the most relevant response in a candidate repository given the conversation context. A key challenge in this task is how to measure the matching degree between the conversation context and a response using rich semantic information. In this paper, we propose a hierarchical residual matching network (HRMN) that fully extracts and exploits the rich semantic information in the conversation history and response for the multi-turn response selection task. We empirically verify HRMN on two benchmark datasets and compare it against advanced approaches. Evaluation results demonstrate that HRMN outperforms strong baselines and yields a distinct improvement in response selection.


Introduction
The rapid development of artificial intelligence has enabled fundamental breakthroughs in the construction of chatbots [1]. Existing chatbots are either generation-based or retrieval-based. A generation-based chatbot generates new, highly coherent responses given the conversation context [2][3][4]. In contrast, a retrieval-based chatbot tries to find the most relevant response in a repository of candidate responses given the conversation context [5,6]. In this paper, we focus on the problem of response selection for retrieval-based chatbots, since retrieved responses have the advantage of being informative.
Recent studies have shown that capturing and making full use of the rich semantic information in the conversation history is essential for selecting the next utterance [6][7][8][9]. A deep neural network is one way to discover richer and more useful semantic information, but it has two shortcomings: (1) the deeper the network, the more abstract the captured semantic information, and matching on highly abstract context vectors loses the relationships among utterances [6]; (2) it is a recognized fact that deep networks are difficult to train, which boils down not only to vanishing/exploding gradients but also to difficulties in feature propagation.
To cope with these issues, this paper proposes a hierarchical residual matching network (HRMN) for multi-turn response selection. The HRMN distills hierarchical semantic information from the multi-turn conversation context and the response candidate.

Model
First, we formulate the response selection problem. Suppose that we have a dataset D = {C, R, Y}, where C = {u_1, u_2, …, u_n} represents the conversation context with {u_i}_{i=1}^{n} as the utterances, R is a response candidate, and Y ∈ {0, 1} is a binary label. The goal of the selection task is to learn a matching model from the conversation dataset D that reflects the matching degree between C and R. To this end, we devise a hierarchical residual matching network that consists of a multiple semantic encoder layer, a semantic matching layer, a residual layer, and an aggregation-prediction layer.
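The excerpt does not spell out the training objective. A standard choice for learning a matching model from (C, R, Y) triples, sketched here as an assumption in plain Python, is binary cross-entropy on the sigmoid of the matching score:

```python
import math

def matching_loss(score, label):
    """Binary cross-entropy between the predicted matching degree
    g(C, R) (squashed to (0, 1) by a sigmoid) and the label Y in {0, 1}.
    `score` is the raw matching logit produced by the model."""
    p = 1.0 / (1.0 + math.exp(-score))  # sigmoid of the matching logit
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# A positive (context, response) pair should receive a high score:
loss_pos = matching_loss(2.0, 1)  # small loss: high score, Y = 1
loss_neg = matching_loss(2.0, 0)  # large loss: high score, Y = 0
```

Under this objective, minimizing the loss over D pushes matched pairs toward score 1 and mismatched pairs toward 0.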
Multiple semantic encoder layer: Our intuition is that the dialogue context contains manifold hierarchical semantic information that is beneficial to response selection. Therefore, we design multiple encoders to exhaustively distill multifarious semantic information from the dialogue context. As shown in Fig. 1, the multiple semantic encoder layer stacks an embedding sub-layer, a GRU-based sub-layer, a CNN-based sub-layer, and a Transformer-based sub-layer; the different sub-layers help the model extract diverse hierarchical semantic information. More concretely, the embedding sub-layer, which joins pre-trained word embeddings and character embeddings, represents the original information. The GRU-based sub-layer captures contextual information. The CNN-based sub-layer drives the model to extract higher-level semantic units instead of simple semantic information [10]; notably, we perform multi-scale convolution with four different kernel sizes (1, 2, 3, 4). Following [9], we utilize a Transformer-based sub-layer to fetch latent semantic information such as co-reference, which is beneficial for matching utterance and response at a high semantic level and is also verified by our experiments.

Fig. 1: The architecture of the multiple semantic encoder layer.

Semantic matching layer: Our model matches the response and the multi-turn context under the matching-aggregation framework. The interaction between the conversational context and the response provides important information for determining the matching degree. Therefore, we employ a bi-directional CNN [11] as the interaction function over the aforementioned hierarchical semantic information. Specifically, for the hierarchical representations obtained from the multiple semantic encoder layer, {U_enc} and {R_enc} with enc ∈ {emb, gru, cnn, trans}, we compute the interaction score s_enc = U_enc A R_enc^T, where A is a linear transformation matrix.
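The multi-scale CNN sub-layer above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the random filter banks stand in for learned weights, and max-pooling over time is our assumption for how each scale is summarized.

```python
import numpy as np

def multi_scale_conv(x, kernel_sizes=(1, 2, 3, 4), filters=8, seed=0):
    """Toy multi-scale 1-D convolution over one utterance.

    x: (seq_len, dim) word representations from the previous sub-layer.
    For each kernel size k we convolve k consecutive word vectors with a
    random filter bank (a stand-in for learned weights), apply ReLU, and
    max-pool over time, so each scale contributes `filters` features.
    """
    rng = np.random.default_rng(seed)
    seq_len, dim = x.shape
    feats = []
    for k in kernel_sizes:
        W = rng.standard_normal((k * dim, filters))  # hypothetical learned filters
        # Slide a window of k words over the sequence (padding omitted).
        windows = np.stack([x[i:i + k].reshape(-1)
                            for i in range(seq_len - k + 1)])
        fmap = np.maximum(windows @ W, 0.0)          # ReLU feature map
        feats.append(fmap.max(axis=0))               # max-pool over time
    return np.concatenate(feats)  # (len(kernel_sizes) * filters,)

utterance = np.random.default_rng(1).standard_normal((10, 16))
vec = multi_scale_conv(utterance)
print(vec.shape)  # (32,)
```

The four kernel sizes (1, 2, 3, 4) mirror the paper's multi-scale convolution, each capturing n-gram-like semantic units of a different width.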
Moreover, we use element-wise multiplication and element-wise subtraction operations [12] to obtain the semantic interaction matrices of the conversational context and the response. We use SemMatch(u, r) to denote all operations of the semantic matching layer.
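A minimal numpy sketch of the interaction at one semantic level follows. The soft-alignment step is our assumption: the excerpt does not say how the two sequences are brought to a common shape before the element-wise operations, so here the bilinear scores are reused as attention weights to align the response to each utterance position.

```python
import numpy as np

def interact(U, R, A):
    """Toy interaction between utterance states U (m, d) and response
    states R (n, d) at one semantic level.

    The bilinear term U A R^T gives the (m, n) interaction scores;
    a softmax over the response axis yields a soft alignment R' of the
    response to each utterance position, after which element-wise
    multiplication and subtraction produce the interaction features.
    """
    scores = U @ A @ R.T                               # (m, n) interaction scores
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)               # soft alignment weights
    R_aligned = w @ R                                  # (m, d) per-position summary
    return np.concatenate([U * R_aligned, U - R_aligned], axis=1)

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 8))   # 5 utterance positions, dim 8
R = rng.standard_normal((7, 8))   # 7 response positions, dim 8
A = rng.standard_normal((8, 8))   # linear transformation matrix
feats = interact(U, R, A)
print(feats.shape)  # (5, 16)
```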
Residual layer: The purpose of the residual layer is not only to improve gradient flow, so that the deep network trains better, but also to let the model match comprehensively on residual semantic information. To this end, we design two residual techniques: concatenate residual and cross residual. As shown in Fig. 2, we splice the cross-layer semantic information of utterances and responses into two groups; the two groups carry different types of semantic information and are never identical at the same position. The intuition behind the maximum operation is to retain the group with the higher correlation. The final matching score is g = Σ_enc g_enc.
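The exact formulation of the residual layer is only partially described in the excerpt. As a hedged illustration of the concatenate-residual idea, the sketch below assumes it simply splices the matching features from every semantic level so that lower-level (residual) information reaches the final predictor directly instead of passing only through the topmost layer:

```python
import numpy as np

def concatenate_residual(level_feats):
    """Toy 'concatenate residual': rather than matching only on the
    topmost representation, the matching features from every semantic
    level (embedding, GRU, CNN, Transformer) are spliced together, giving
    gradients and low-level features a direct path to the predictor."""
    return np.concatenate(level_feats, axis=-1)

rng = np.random.default_rng(0)
# Hypothetical 16-dim matching features from each of the four levels.
levels = {name: rng.standard_normal(16)
          for name in ("emb", "gru", "cnn", "trans")}
fused = concatenate_residual(list(levels.values()))
print(fused.shape)  # (64,)
```

The cross-residual variant would mix the two groups of cross-layer features before fusion; its precise form is not recoverable from this excerpt.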

Experiment
We conduct experiments on two well-studied multi-turn response selection datasets: the Ubuntu Corpus V1 [13] and the Douban Conversation Corpus [6]. The Ubuntu Corpus V1 is a domain-specific dataset containing multi-turn English dialogues about Ubuntu system troubleshooting, while the Douban Conversation Corpus is a multi-turn Chinese conversation dataset crawled from Douban groups.

Next, we briefly introduce the evaluation metrics, along with several state-of-the-art approaches that we compare against. Following previous studies [6,13], we adopt standard information retrieval metrics: Recall at position k in n candidates (R_n@k), Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Precision at 1 (P@1). The competitor baselines can be roughly divided into three categories: (1) basic methods, including TF-IDF, LSTM [11], and CNN [14]; (2) single-turn matching methods, which concatenate the context utterances together to match a response, including [16] and Attentive-LSTM [17]; (3) multi-turn matching methods, which make use of the conversation history in matching, including Multi-View [8], SMN [6], DUA [7], and DAM [9].

Table 1 reports the evaluation results of HRMN as well as the previous models. Overall, HRMN, with either the concatenate residual technique or the cross residual technique, outperforms the other models on all metrics and datasets. This shows that our model is able to select the best-matched response. The three categories of baselines show a consistent trend on both datasets over all metrics: basic methods < single-turn matching methods < multi-turn matching methods. This further demonstrates that considering not only the matching degree between responses and utterances in previous turns but also the relationships among utterances improves response selection performance. The state-of-the-art matching model, DAM, performs worse than our models.
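For reference, the per-example ranking metrics above can be computed as follows. This is a minimal sketch: each candidate list is assumed to be already sorted by model score, with label 1 marking a true response.

```python
def recall_at_k(ranked_labels, k):
    """R_n@k for one example: fraction of true responses ranked in the
    top k, where n = len(ranked_labels) candidates are scored."""
    return sum(ranked_labels[:k]) / max(sum(ranked_labels), 1)

def mrr(ranked_labels):
    """Reciprocal rank of the first true response (0.0 if none)."""
    for i, y in enumerate(ranked_labels, 1):
        if y:
            return 1.0 / i
    return 0.0

def p_at_1(ranked_labels):
    """1.0 iff the top-ranked candidate is a true response."""
    return float(ranked_labels[0] == 1)

def average_precision(ranked_labels):
    """AP for one example; MAP is its mean over the test set."""
    hits, total = 0, 0.0
    for i, y in enumerate(ranked_labels, 1):
        if y:
            hits += 1
            total += hits / i
    return total / max(hits, 1)

# One Ubuntu-style example: 10 candidates, true response ranked 2nd.
ranked = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(recall_at_k(ranked, 1), recall_at_k(ranked, 2), mrr(ranked), p_at_1(ranked))
# 0.0 1.0 0.5 0.0
```

Dataset-level scores are the averages of these quantities over all test examples.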
The reason is that although DAM employs the semantic information of all stacked layers for matching, its naïve fusion of semantic information weakens the measurement of the matching degree and does not fully utilize the semantic information. This again indicates that making full use of the richer semantic information is essential for selecting the next utterance. One notable point is that the best performance on the Ubuntu corpus is achieved by HRMN with the cross residual technique, while on the Douban corpus the best performance comes from HRMN with the concatenate residual technique. The difference may stem from the nature of the two datasets, which cover different domains (system troubleshooting and social networks). Furthermore, HRMN with either residual technique achieves a steep improvement on both corpora, which also proves its compatibility across domains.

We then design ablation studies to investigate (1) the effect of using only one type of encoder, such as RNN, CNN, or Transformer, instead of multiple encoders, and (2) the effect of each encoder in the multiple encoder layer. For (1), we utilize only one type of encoder in the multi-encoder layer and compare it with HRMN, denoting the model as HRMN with {X}, where X ∈ {GRU, CNN, Transformer}; the concatenate residual technique is employed throughout. For (2), we remove the encoders layer by layer from HRMN and denote the models as HRMN−{X}. The results are shown in Table 2. For question (1), we find that using only one type of encoder results in performance degradation, which implies that multiple encoders enjoy the advantage of capturing various semantic information. For question (2), we conclude that each encoder is useful and that removing any one leads to a performance drop.

We next study the effect of the residual technique by designing two experiments.
In the first experiment, we remove only the residual technique but retain multi-hierarchy matching, denoting the model HMN; in the second, we match a response with the utterances only at the topmost semantic level, namely the output of the latent relationship encoder sub-layer, denoting the model MN. As demonstrated in Table 3, comparing HRMN with HMN shows that residual semantic information is useful for further improving the performance of the model. We also find that matching only on the topmost abstract semantic information results in dramatic performance degradation, which underlines the importance of the residuals and further demonstrates the effect of multi-hierarchy matching.

Conclusion
In this paper, we propose the hierarchical residual matching network (HRMN) for the multi-turn response selection task. Our model sheds new light on how to make full use of various semantic information for measuring the matching degree between conversation context and response. Meanwhile, we present two kinds of residual techniques to overcome the defects of deep networks; to our knowledge, there is no prior work using these residual techniques in a multi-turn conversation model. Experiments on two well-studied datasets demonstrate that our proposed model significantly outperforms the baseline models by a large margin on all metrics.