Dialogue manager domain adaptation using Gaussian process reinforcement learning

Spoken dialogue systems allow humans to interact with machines using natural speech. As such, they have many benefits. By using speech as the primary communication medium, a computer interface can facilitate swift, human-like acquisition of information. In recent years, speech interfaces have become ever more popular, as is evident from the rise of personal assistants such as Siri, Google Now, Cortana and Amazon Alexa. Recently, data-driven machine learning methods have been applied to dialogue modelling and the results achieved for limited-domain applications are comparable to or outperform traditional approaches. Methods based on Gaussian processes are particularly effective as they enable good models to be estimated from limited training data. Furthermore, they provide an explicit estimate of the uncertainty which is particularly useful for reinforcement learning. This article explores the additional steps that are necessary to extend these methods to model multiple dialogue domains. We show that Gaussian process reinforcement learning is an elegant framework that naturally supports a range of methods, including prior knowledge, Bayesian committee machines and multi-agent learning, for facilitating extensible and adaptable dialogue systems.


Introduction
Spoken dialogue systems allow humans to interact with machines using natural speech. As such, they have many benefits. By using speech as the primary communication medium, a computer interface can facilitate swift, human-like acquisition of information. In recent years, systems with speech interfaces have become ever more popular, as is evident from the rise of personal assistants such as Siri, Google Now, Cortana and Amazon Alexa. Statistical approaches to dialogue management have been shown to reduce design costs and provide superior performance to hand-crafted systems particularly in noisy environments [1]. Traditionally, spoken dialogue systems were built for limited domains described by an underlying ontology, which is essentially a structured representation of the database of entities that the dialogue system can talk about.
The semantic web is an effort to organise the large amount of information available on the Internet into a structure that can be more easily processed by a machine designed to perform reasoning on this data [2]. Knowledge graphs are good examples of such structures. They typically consist of a set of triples, where each triple represents two entities connected by a specific relationship. Current knowledge graphs have millions of entities and billions of relations and are constantly growing. There has been a significant amount of work in spoken language understanding focused on exploiting knowledge graphs in order to improve coverage [3,4]. More recently there have also been efforts to build statistical dialogue systems that operate on large knowledge graphs, but these have so far been limited to the problem of belief tracking [5,6]. In this article, we address the problem of decision-making in multi-domain dialogue systems. This is a necessary step towards open-domain dialogue management. A previously proposed model for multi-domain dialogue management [7] assumes a dialogue expert for each domain and a central controller which decides which dialogue expert should be given control. The dialogue experts are rule-based and the central controller is optimised using reinforcement learning. Related work in [8] proposes a domain-independent feature representation of the dialogue state so that the dialogue policy can be applied to different domains. Here, we explore multi-domain dialogue management which retains a separate statistical model for each domain.
Moving from a limited domain dialogue system that operates on a relatively modest ontology size to an open domain dialogue system that can converse about anything in a very large knowledge graph is a non-trivial problem. An open domain dialogue system can be seen as a (large) set of limited domain dialogue systems. If each of them were trained separately then an operational system would require sufficient training data for each individual topic in the knowledge graph, which is simply not feasible. What is more likely is that there will be limited and varied data drawn from different domains. Over time, this data set will grow but there will always be topics within the graph which are rarely visited.
The key to statistical modelling of multi-domain dialogue systems is therefore the efficient reuse of data. Gaussian processes are a powerful method for efficient function estimation from sparse data. A Gaussian process is a Bayesian method which specifies a prior distribution over the unknown function and then, given some observations, estimates the posterior [9]. A Gaussian process prior consists of a mean function, which is what we expect the unknown function to look like before we have seen any data, and a kernel function, which specifies the prior knowledge of the correlation of function values for different parts of the input space. For every input point, the kernel specifies the expected variation of where the function value will lie and, once given some data, the kernel therefore defines the correlations between known and unknown function values. In that way, the known function values influence the regions where we do not have any data points. Also, for every input point the Gaussian process defines a Gaussian distribution over possible function values with mean and variance. When used inside a reinforcement learning framework, the variance can be used to guide exploration, avoiding the need to explore parts of the space where the Gaussian process is very certain. All this leads to very data-efficient learning [10].
In this article, we explore how a Gaussian process-based reinforcement learning framework can be augmented to support multi-domain dialogue modelling focussing on three inter-related approaches. The first makes use of the Gaussian process prior. The idea is that where there is little training data available for a specific domain, a generic model can be used that has been trained on all available data. Then, when sufficient in-domain data becomes available, the generic model can serve as a prior to build a specific model for the given domain. This idea was first proposed in [11].
The second approach is based on a Bayesian committee machine [12]. The idea is that every domain or sub-domain is represented as a committee member. If each committee member is a Bayesian model, e.g. a Gaussian process, then the committee too is a Bayesian model, with mean and variance estimate. If a committee member is trained using limited data its estimates will carry a high uncertainty so the committee will rely on other more confident committee members, until it has seen enough training data. This method was proposed in [13]. It is similar to Products of Gaussians which have previously been applied to problems such as speech recognition [14].
Finally, we extend the committee model to a multi-agent setting where committee members are seen as agents that collaboratively learn. This overarching framework subsumes the first two approaches and provides a practical approach to on-line learning of dialogue decision policies for very large scale systems. It constitutes the primary contribution of this article.
The remainder of the paper is organised as follows. In Section 2, the use of Gaussian process-based reinforcement learning (GPRL) is briefly reviewed. The key advantage of GPRL in this context is that in addition to being data efficient, it directly supports the use of an existing model as a prior, thereby facilitating incremental adaptation. In Section 3, various strategies for building a generic policy are considered and evaluated. We then review the Bayesian committee machine in Section 4.1. Following that, in Section 4.2, we present a multi-domain dialogue manager based on the committee model. In Section 5, we describe how multi-agent learning can be applied to the policy committee model. Then, in Section 6, we present the experimental results. Finally, in Section 7, conclusions together with future research directions are presented.

Gaussian process reinforcement learning
The input to a statistical dialogue manager is typically an N-best list of scored hypotheses obtained from the spoken language understanding unit. Based on this input, at every dialogue turn, a distribution of possible dialogue states called the belief state, b ∈ B, an element of belief space, is estimated. The belief state must accurately represent everything that happened prior to that turn in the dialogue. The quality of a dialogue is defined by a reward function r(b, a) and the role of a dialogue policy π is to map the belief state b into a system action a ∈ A, an element of action space, at each turn so as to maximise the expected cumulative reward.
The expected cumulative reward for a given belief state b and action a is defined by the Q-function:

$$Q(b, a) = E_{\pi}\left( \sum_{\tau=t+1}^{T} \gamma^{\tau-t-1} r_{\tau} \,\middle|\, b_t = b, a_t = a \right) \tag{1}$$

where $r_{\tau}$ is the immediate reward obtained at time τ, T is the dialogue length and γ is a discount factor, 0 < γ ≤ 1. Optimising the Q-function is then equivalent to optimising the policy π.
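As a minimal illustration of the quantity inside the expectation, the following sketch computes the discounted return of a single observed dialogue. The function name and the example reward sequence are hypothetical; the Q-function is the expectation of this quantity over dialogues generated by the policy, not the value for any single dialogue.

```python
def discounted_return(rewards, gamma=1.0):
    """Discounted sum of the rewards observed after taking an action.

    rewards[0] is the reward received immediately after the action,
    rewards[1] the one after that, and so on until the dialogue ends.
    """
    total = 0.0
    # Accumulate from the end of the dialogue backwards:
    # G_k = r_k + gamma * G_{k+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-turn dialogue: -1 per turn, plus 20 for success
# added to the final turn.
example_rewards = [-1.0, -1.0, -1.0 + 20.0]
print(discounted_return(example_rewards, gamma=1.0))
print(discounted_return(example_rewards, gamma=0.5))
```

Averaging such returns over many dialogues that start from the same belief state and action would give a Monte Carlo estimate of Q(b, a); GP-Sarsa instead estimates the Q-function directly from the sequence of immediate rewards.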
GP-Sarsa is an on-line reinforcement learning algorithm that models the Q-function as a Gaussian process [15]:

$$Q(b, a) \sim \mathcal{GP}\left(m(b, a),\, k((b, a), (b, a))\right) \tag{2}$$

where m(·, ·) is the prior mean function and the kernel k(·, ·) is factored into separate kernels over belief and action spaces, $k((b, a), (b', a')) = k_B(b, b')\, k_A(a, a')$.

For a training sequence of belief state-action pairs $\mathbf{B} = [(b_0, a_0), \ldots, (b_t, a_t)]^T$ and the corresponding observed immediate rewards $\mathbf{r} = [r_1, \ldots, r_t]^T$, the posterior of the Q-function for any belief state-action pair (b, a) is given by:

$$Q(b, a) \mid \mathbf{r}, \mathbf{B} \sim \mathcal{N}\left(\bar{Q}(b, a),\, \mathrm{cov}((b, a), (b, a))\right) \tag{3}$$

where the posterior mean and covariance take the form:

$$\bar{Q}(b, a) = m(b, a) + \mathbf{k}(b, a)^T H^T \left(H K H^T + \sigma^2 H H^T\right)^{-1} (\mathbf{r} - H\mathbf{m}),$$
$$\mathrm{cov}((b, a), (b, a)) = k((b, a), (b, a)) - \mathbf{k}(b, a)^T H^T \left(H K H^T + \sigma^2 H H^T\right)^{-1} H\, \mathbf{k}(b, a) \tag{4}$$

Here $\mathbf{k}(b, a) = [k((b_0, a_0), (b, a)), \ldots, k((b_t, a_t), (b, a))]^T$, $\mathbf{m} = [m(b_0, a_0), \ldots, m(b_t, a_t)]^T$, K is the Gram matrix [9], H is a band matrix with diagonal [1, −γ] and $\sigma^2$ is an additive noise factor which controls how much variability in the Q-function estimate is expected during the learning process.

Since the Gaussian process for the Q-function defines a Gaussian distribution for every belief state-action pair (3), when a new belief point b is encountered, for each action a ∈ A, there is a Gaussian distribution over Q-values. Sampling from these Gaussian distributions gives Q-values $\hat{Q}(b, a) \sim \mathcal{N}(\bar{Q}(b, a), \Sigma^Q(b, a))$, where $\Sigma^Q(b, a) = \mathrm{cov}((b, a), (b, a))$, from which the action with the highest sampled Q-value can be selected:

$$\pi(b) = \arg\max_{a} \left\{ \hat{Q}(b, a) : a \in A \right\} \tag{5}$$

To use GPRL for dialogue, a kernel function must be defined on both the belief state space B and the action space A. Here we use the Bayesian Update of Dialogue State (BUDS) dialogue model [16]. The action space consists of a set of slot-dependent and slot-independent summary actions. Slot-dependent summary actions include requesting the slot value, confirming the most likely slot value and selecting between the top two slot values. Summary actions are mapped to master actions using a set of rules and the kernel is defined as:

$$k_A(a, a') = \delta_a(a') \tag{6}$$

where $\delta_a(a') = 1$ iff $a = a'$, 0 otherwise. The belief state consists of the probability distributions over the Bayesian network hidden nodes that relate to the dialogue history for each slot and the user goal for each slot.
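The GP-Sarsa posterior and the sampling-based action selection can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: it assumes a zero prior mean, a plain linear kernel on flat belief vectors multiplied by the delta kernel on actions, a dense (non-sparsified) Gram matrix, and hypothetical function names and parameter values.

```python
import numpy as np


def kernel(x, y):
    """Linear kernel on beliefs times a delta kernel on actions (assumed form)."""
    (b, a), (b2, a2) = x, y
    return float(np.dot(b, b2)) * (1.0 if a == a2 else 0.0)


def gp_sarsa_posterior(points, rewards, query, gamma=0.9, sigma2=1.0):
    """Posterior mean and variance of Q at `query`, assuming a zero prior mean.

    points  : t + 1 visited (belief, action) pairs
    rewards : t immediate rewards observed between consecutive pairs
    """
    n = len(points)
    K = np.array([[kernel(p, q) for q in points] for p in points])
    # Band matrix H with rows [..., 1, -gamma, ...], shape t x (t + 1).
    H = np.zeros((n - 1, n))
    for i in range(n - 1):
        H[i, i], H[i, i + 1] = 1.0, -gamma
    k = np.array([kernel(p, query) for p in points])
    G = H @ K @ H.T + sigma2 * (H @ H.T)
    w = np.linalg.solve(G, H @ k)            # G^{-1} H k (G is symmetric)
    mean = float(w @ np.asarray(rewards))    # k^T H^T G^{-1} r
    var = kernel(query, query) - float((H @ k) @ w)
    return mean, var


def select_action(points, rewards, belief, actions, rng, **kwargs):
    """Sample a Q-value for every action and return the arg max."""
    samples = {}
    for action in actions:
        mean, var = gp_sarsa_posterior(points, rewards, (belief, action), **kwargs)
        samples[action] = rng.normal(mean, np.sqrt(max(var, 0.0)))
    return max(samples, key=samples.get)


# Hypothetical three-turn history with two observed rewards.
history = [
    (np.array([0.7, 0.3]), "request"),
    (np.array([0.9, 0.1]), "confirm"),
    (np.array([0.95, 0.05]), "inform"),
]
observed = [-1.0, 19.0]
new_belief = np.array([0.8, 0.2])
print(select_action(history, observed, new_belief,
                    ["request", "confirm", "inform", "select"],
                    np.random.default_rng(0)))
```

For an action that has never been taken, the delta kernel makes every covariance with the training data zero, so the posterior stays at the prior and its full variance is available for exploration.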
The dialogue history nodes can take a fixed number of values, whereas user goals range over the values defined for that particular slot in the ontology and can have very high cardinalities. User goal distributions are therefore sorted according to the probability assigned to each value since the choice of summary action does not depend on the values but rather on the overall shape of each distribution. The kernel function over both dialogue history and user goal nodes is based on the expected likelihood kernel [17], which is a simple linear inner product. The kernel function for belief space is then the sum over all the individual hidden node kernels:

$$k_B(b, b') = \sum_{h} \langle b_h, b'_h \rangle \tag{7}$$

where $b_h$ is the probability distribution encoded in the h-th hidden node.

Distributed dialogue policies
One way to build a dialogue manager which can operate across a large knowledge graph is to decompose the dialogue policy into a set of topic-specific policies that are distributed across the class nodes in the graph. Initially, there will be relatively little training data and the system will need to rely on generic policies attached to high-level generic class nodes which have been trained on whatever examples are available from across the pool of derived classes. As more data is collected, specific policies can be trained for each derived class (footnote 1). An example of this is illustrated in Fig. 1. On the left side is the initial situation where conversations about hotels and restaurants are conducted using a generic model, $M_V$, trained on example dialogues from both the hotel and restaurant domains. Once the system has been deployed and more training data has been collected, specific restaurant and hotel models $M_R$ and $M_H$ can be trained (footnote 2). This type of multi-domain model assumes an agile deployment strategy which can be succinctly described as "deploy, collect data, and refine". Its viability depends on the following assumptions:
1. it is possible to construct generic policies which provide acceptable user performance across a range of differing domains;
2. as sufficient in-domain data becomes available, it is possible to seamlessly adapt the policy to improve performance, without subjecting users to unacceptable disruptions in performance during the adaptation period.
In GPRL, the computation of Q(b, a) requires the kernel function to be evaluated between (b, a) and each of the belief-action points in the training data. If the training data consists of dialogues from subdomains (restaurants and hotels in this case) which have domain-specific slots and actions, a strategy is needed for computing the kernel function between domains.
If domains are organised in a class hierarchy, it is expected that they share some of the slots. Calculating the kernel for shared parts of the belief state is straightforward:

$$k_B(b^R, b^H) = \sum_{h \in R \cap H} \langle b^R_h, b^H_h \rangle \tag{8}$$

where R and H are the considered subdomains. When goal nodes are paired with differing cardinalities (e.g. name might have different cardinality for different domains), the shorter vector is padded with zeros. Pairing of non-matching slots is achieved by treating them as abstract slots (slot-1, slot-2, etc.) so that they become the same in both subdomains according to some heuristics. Hence, for example, food is matched with dogs allowed, and so on. As in the case of shared slots, when goal nodes are paired with differing cardinalities, the shorter vector is padded with zeros. Other adaptation strategies are also possible but may result in increasing the dimensionality (see for example [19]).

Footnote 1: cf. the analogy with speech recognition adaptation using regression trees [18].
Footnote 2: Here a model M is assumed to include input mappings for speech understanding, a dialogue policy π and output mappings for generation. In this article, we are only concerned with dialogue management and hence the dialogue policy component π of each model.
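The zero-padding strategy for paired slots can be sketched as follows. The function names, the dictionary representation of a belief state and the slot pairing are hypothetical and serve only to illustrate the kernel computation between a restaurant and a hotel belief state; goal distributions are sorted by probability before comparison, as described for the single-domain kernel.

```python
import numpy as np


def padded_inner_product(p, q):
    """Inner product of two distributions, zero-padding the shorter one."""
    n = max(len(p), len(q))
    p = np.pad(np.asarray(p, dtype=float), (0, n - len(p)))
    q = np.pad(np.asarray(q, dtype=float), (0, n - len(q)))
    return float(p @ q)


def cross_domain_belief_kernel(belief_r, belief_h, pairing):
    """Sum of node kernels over paired slots of two domains.

    belief_r, belief_h : dict mapping slot name -> list of probabilities
    pairing            : list of (slot in R, slot in H) pairs
    """
    total = 0.0
    for slot_r, slot_h in pairing:
        # Sort by probability so only the shape of each distribution matters.
        dist_r = sorted(belief_r[slot_r], reverse=True)
        dist_h = sorted(belief_h[slot_h], reverse=True)
        total += padded_inner_product(dist_r, dist_h)
    return total


# Hypothetical belief states: "area" is shared, "food" is matched with
# "dogsallowed" as an abstract slot.
restaurant_belief = {"area": [0.6, 0.3, 0.1], "food": [0.5, 0.5]}
hotel_belief = {"area": [0.2, 0.8], "dogsallowed": [0.9, 0.1]}
slot_pairing = [("area", "area"), ("food", "dogsallowed")]
print(cross_domain_belief_kernel(restaurant_belief, hotel_belief, slot_pairing))
```

Because the padded entries are zero, values that exist in only one domain contribute nothing to the inner product, so the kernel remains a valid linear kernel on the padded vectors.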

Bayesian committee machine
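The Bayesian committee machine is an approach to combining estimators that have been trained on different datasets. It is particularly suited to Gaussian process regression [12]. Here we apply the method to combine the outputs of multiple estimates of Q-values $\{Q_i\}_{i=1}^{M}$ with mean $\bar{Q}_i$ and covariance $\Sigma^Q_i$. Following the description in [12], the combined mean $\bar{Q}$ and covariance $\Sigma^Q$ are calculated as:

$$\bar{Q}(b, a) = \Sigma^Q(b, a) \sum_{i=1}^{M} \Sigma^Q_i(b, a)^{-1}\, \bar{Q}_i(b, a),$$
$$\Sigma^Q(b, a)^{-1} = -(M - 1)\, k((b, a), (b, a))^{-1} + \sum_{i=1}^{M} \Sigma^Q_i(b, a)^{-1} \tag{9}$$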
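For a single belief state-action pair the committee estimate reduces to scalar arithmetic, which the following sketch illustrates. It assumes the standard Bayesian committee machine rule of [12] applied to scalar Q-values; the function name and the numbers are hypothetical, and `prior_variance` stands for the kernel value k((b, a), (b, a)).

```python
def bcm_combine(means, variances, prior_variance):
    """Combine M Gaussian Q-value estimates with a Bayesian committee machine.

    means, variances : posterior mean and variance of each committee member
    prior_variance   : prior variance k((b, a), (b, a)) shared by the members
    """
    m = len(means)
    # Combined precision: the sum of member precisions, corrected for the
    # prior having been counted M times instead of once.
    precision = sum(1.0 / v for v in variances) - (m - 1) / prior_variance
    variance = 1.0 / precision
    # Precision-weighted average of the member means.
    mean = variance * sum(q / v for q, v in zip(means, variances))
    return mean, variance


# Hypothetical example: one confident member and one member that has seen
# no relevant data, so its variance still equals the prior variance.
combined_mean, combined_variance = bcm_combine(
    means=[10.0, 0.0], variances=[0.5, 4.0], prior_variance=4.0
)
print(combined_mean, combined_variance)
```

In this example the uninformed member leaves the estimate of the confident member unchanged, which matches the intuition that the committee relies on its more confident members.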

Multi-domain dialogue manager
Section 3 introduced the notion of a generic policy, which can be trained from data coming from different domains, and a specific policy that can be derived from a generic policy using additional in-domain data. In order to produce a generic policy that works across multiple domains, a kernel function must be defined on belief states and actions that come from different domains. In that case, domains are organised in a class hierarchy so it is reasonable to assume that there are shared portions of the belief for different domains. These portions relate to shared slots and are directly mapped to each other and for slots which are different, the mapping can be defined either manually or by using some similarity metric.
When using a Bayesian committee machine, it is possible to have two domains which have no shared slots. Therefore, a different approach is required for building policies that can operate (and be trained) on belief states and actions that come from different domains. The approach is as follows. The slots from each domain are divided into semantic classes. We have three semantic classes: the name slot refers to the name of the entity in the database; informable slots are the ones the user can specify to constrain a search, for instance slot food or slot batteryrating; requestable slots are the ones about which the user can request further information, such as slot phone or slot dimension.
Then, the following steps are taken:
1. For each semantic class and for each slot in that semantic class, the normalised entropy η is calculated by

$$\eta(s) = -\sum_{v \in V_s} \frac{p(s = v) \log p(s = v)}{|V_s|} \tag{10}$$

where s is a slot that takes values v from a set $V_s$ and where p(s = v) is the empirical probability that an entity in the database with slot s takes value v for that slot. For example, if all entities in the database for the restaurant domain have area=centre, then that slot has a normalised entropy equal to 0. The measure is normalised so that slots that take different numbers of values can be compared. This measure provides an indicator of how useful each slot is in the dialogue. For instance, in this case it is not useful for the system to ask the user about their preference for slot area since the answer provides no information gain.
2. For each domain, and for each semantic class, the slots are sorted based on their normalised entropy and given abstract names (slot-1, slot-2, etc.).
3. The kernel function between belief states and actions from two domains is calculated over the slots that are paired by abstract name. For unpaired slots j, the elements of the belief state relating to slot j are disregarded and, if one of the actions relates to slot j, the action kernel is considered to be 0.
This slot matching process is illustrated in Figure 3.
This approach has three important properties:
1. once semantic classes are defined, the process requires no further human intervention to define the relationship between slots that come from different domains;
2. it provides a well-defined computable relationship between any two domains; and
3. the kernel function that is defined in step 3 is positive definite so the Gaussian process is well-defined.
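The entropy-based slot matching can be illustrated with a short sketch. Several details here are assumptions rather than the procedure used in the experiments: the entropy is normalised by the number of distinct values observed in the database, slots are ranked by decreasing normalised entropy, and the function names and toy databases are hypothetical.

```python
import math
from collections import Counter


def normalised_entropy(values):
    """Entropy of the empirical value distribution divided by the number of
    distinct values (the choice of normaliser is an assumption of this sketch)."""
    counts = Counter(values)
    total = len(values)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / len(counts)


def abstract_names(database, slots):
    """Rank slots by decreasing normalised entropy and name them slot-1, slot-2, ..."""
    ranked = sorted(
        slots,
        key=lambda s: normalised_entropy([entity[s] for entity in database]),
        reverse=True,
    )
    return {f"slot-{i}": slot for i, slot in enumerate(ranked, start=1)}


def pair_slots(names_a, names_b):
    """Pair the slots of two domains that received the same abstract name."""
    return {names_a[n]: names_b[n] for n in names_a if n in names_b}


# Hypothetical toy databases for one semantic class in each domain.
restaurants = [
    {"food": "thai", "area": "centre"},
    {"food": "french", "area": "centre"},
    {"food": "thai", "area": "centre"},
]
hotels = [
    {"dogsallowed": "yes", "pricerange": "cheap"},
    {"dogsallowed": "no", "pricerange": "cheap"},
]
pairs = pair_slots(
    abstract_names(restaurants, ["food", "area"]),
    abstract_names(hotels, ["dogsallowed", "pricerange"]),
)
print(pairs)
```

Slots left without a partner, for example when one domain has more slots in a semantic class than the other, simply do not appear in the pairing and would be disregarded in the kernel computation.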

Multi-agent learning in the policy committee framework
In the standard reinforcement learning framework there is a single agent that is trying to solve a specific task in a given environment. However, for complex tasks it has been shown [20] that it is more effective to decompose the problem into sub-tasks and introduce a distinct agent to solve each subtask. In this case, each agent needs to take into account only part of the state space and this can significantly speed up the learning process. Learning in such multi-agent systems is typically performed in three steps [20]. First, each agent proposes an action. Second, a gating mechanism, which can be either handcrafted or optimised automatically, is deployed to select the actual system action. Finally, the reward is distributed among the agents and they each re-estimate their policy.
The multi-agent framework can be seen as an extension of the policy committee model (see Figure 4). In fact, the first two steps are exactly the same: each committee member estimates its own Q-function and then Eq. 9 is used as the gating mechanism to automatically combine the outputs. The multi-agent framework, however, includes a third step, which is to distribute the reward so that each agent (i.e. committee member) can learn from every dialogue. Intuitively, the reward should be given to the agent for the domain that the dialogue relates to. Three alternative approaches to distributing the reward are considered:
naïve approach: The total reward that the system obtains is directly fed back to each committee member [20].
winner-takes-all approach: The total reward that the system obtains is fed back to the committee member that proposed the highest Q-value for the action that was finally chosen by the gating mechanism [21].
reward scaling approach: The total reward is redistributed to each committee member in such a way as to reflect its contribution to the final action chosen by the gating mechanism [20].
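The three reward distribution schemes can be contrasted in a small sketch. It assumes that each committee member is summarised by a single positive contribution score for the dialogue (for instance an average ratio of Q-value to Q-variance); the function name, the score and the use of `None` to mean "no feedback" are illustrative assumptions.

```python
def distribute_reward(total_reward, scores, strategy):
    """Return the reward passed to each committee member.

    scores   : one positive contribution score per committee member
    strategy : "naive", "winner" or "scale"
    A value of None means the member receives no feedback for this dialogue.
    """
    n = len(scores)
    if strategy == "naive":
        # Every member receives the full reward.
        return [total_reward] * n
    if strategy == "winner":
        # Only the member with the highest score receives the reward.
        best = max(range(n), key=lambda i: scores[i])
        return [total_reward if i == best else None for i in range(n)]
    if strategy == "scale":
        # Each member receives a share proportional to its score.
        norm = sum(scores)
        return [total_reward * s / norm for s in scores]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical dialogue with total reward 17 and two committee members.
for name in ("naive", "winner", "scale"):
    print(name, distribute_reward(17.0, [3.0, 1.0], name))
```

The proportional split in the scaling scheme is only well defined when the scores are positive, which is why that restriction is stated as an assumption above.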

Experimental set-up
In order to investigate the effectiveness of the methods discussed above, a variety of experimental contrasts were examined using an agenda-based simulated user operating at the dialogue act level [22,23]. The reward function allocates −1 at each turn to encourage shorter dialogues, plus 20 at the end of each successful dialogue. The user simulator includes an error generator and this was set to generate incorrect user inputs 15% of the time.
The proposed methods were also incorporated into a real-time spoken dialogue system in which policies were trained on-line using subjects recruited via Amazon Mechanical Turk. Each user was assigned specific tasks in a given subdomain and then asked to call the system in a similar set-up to that described in [24,25]. After each dialogue, the users were asked whether they judged the dialogue to be successful or not. Based on that binary rating, the subjective success was calculated as well as the average reward. An objective rating was also computed by comparing the system outputs with the assigned task specification. During training, only dialogues where the objective and subjective scores agreed were used.
In order to examine the ability of the proposed methods to operate on multiple domains, four different domains were used:
SFR consisting of restaurants in San Francisco
SFH consisting of hotels in San Francisco
L6 consisting of laptops with 6 properties that the user can specify
L11 same as L6 but with 11 user-specifiable properties.
A description of each domain with slots sorted according to their normalised entropy is given in Table 1.

Generic policy performance in simulation
In order to investigate the effectiveness of the generic policies discussed in Section 3, generic policies were trained and then tested in two domains, SFR and SFH, using equal numbers of restaurant and hotel dialogues. In addition, in-domain policies were trained as a reference.
For each condition, 10 policies were trained using different random seeds and varying numbers of training dialogues. Each policy was then evaluated using 1000 dialogues in each subdomain. The overall average reward, success rate and number of turns are given in Table 2 together with a 95% confidence interval. The most important measure is the average reward, since the policies are trained to maximise this.
As can be seen from Table 2, all generic policies perform better than the in-domain policies trained only on the data available for that subdomain (i.e. half of the training data available for the generic policy in this case) and this is especially the case when training data is limited. This suggests that the provision of generic policies in a large multi-domain system will indeed provide robustness against the user moving into a domain for which there is very little training data.

Adaptation of in-domain policies using a generic policy as a prior in simulation
We now investigate the effectiveness of using a generic policy as a prior for training an in-domain policy as in the right hand side of Fig. 1. In order to examine the best and worst case, the generic priors (from the 10 randomly seeded examples) that gave the best performance and the worst performance on each sub-domain trained with 500 and 5000 dialogues were selected. This results in four prior policies for each subdomain: generic-500-worst, generic-500-best, generic-5000-worst and generic-5000-best.
In addition, a policy with no prior was also trained for each subdomain (i.e. the policy was trained from scratch). After every 1000 training dialogues each policy was evaluated with 1000 dialogues. The results are given in Figs. 5 and 6 with a 95% confidence interval. Performance at 0 training dialogues corresponds to using the generic policy as described in the previous section, or using a random policy for the no-prior case. The results from Figs. 5 and 6 demonstrate that the performance of the policy in the initial stages of learning is significantly improved using the generic policy as a prior, even if that prior is trained on a small number of dialogues and even if it was the worst performing prior from the batch of 10 training sessions. These results also show that the use of a generic prior does not limit the optimality of the final policy. In fact, the use of a prior can be seen as resetting the variance of a GP, which may lead to better sample efficiency [26]. This may be the reason why in some cases the no-prior policies never catch up with the adapted policies, as in Figure 6.
In Table 3, the performance of the best performing generic prior is compared to the performance of the same policy adapted using an additional 50K dialogues. The results show that additional in-domain adaptation has the potential to improve the performance further. So when enough training data is available, it is beneficial to create individual in-domain policies rather than continuing to train the generic policy.

Adaptation in interaction with human users
To examine performance when training with real users, rather than a simulator, two training schedules were performed in the SFR subdomain: one training from scratch without a prior, and the other performing adaptation using the best generic prior obtained after 5000 simulated training dialogues. For each training schedule three sample runs were performed and the results were averaged to reduce any random variation. Fig. 7 shows the moving average reward as a function of the number of training dialogues. The moving window was set to 100 dialogues so that after the initial 100 dialogues each point on the graph is an average of 300 dialogues (3 sample runs × window size). The shaded area represents a 95% confidence interval. The initial parts of the graph exhibit more randomness in behaviour because the number of training dialogues is small.
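The smoothing used for such learning curves can be sketched as follows, assuming the per-dialogue rewards of each sample run are stored as rows of an array. The function name is hypothetical and the sketch covers only the averaging, not the confidence interval.

```python
import numpy as np


def moving_average_reward(runs, window=100):
    """Moving average of the reward, averaged over sample runs.

    runs   : array of shape (n_runs, n_dialogues) with per-dialogue rewards
    window : number of consecutive dialogues in the moving window
    Each output point averages n_runs * window dialogues.
    """
    runs = np.asarray(runs, dtype=float)
    weights = np.ones(window) / window
    # "valid" keeps only the points for which a full window is available.
    per_run = np.array([np.convolve(run, weights, mode="valid") for run in runs])
    return per_run.mean(axis=0)


# Tiny hypothetical example: 3 runs of 4 dialogues with a window of 2,
# so each point is an average of 6 dialogues.
toy_runs = np.arange(12).reshape(3, 4)
print(moving_average_reward(toy_runs, window=2))
```

With three runs and a window of 100, each point of the resulting curve is an average over 300 dialogues, as in the set-up described above.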
The results show an upward trend in performance, particularly for the policy that uses no prior. However, the performance obtained with the prior is significantly better than without a prior both in terms of the reward and the success rate. Equally importantly, unlike the system trained from scratch with no prior, the users of the adapted system are not subjected to poor performance during the early stages of training.

Policy committee evaluation with simulated user
In the previous section, the benefit of training generic models was demonstrated when training data is sparse. Here we investigate whether the use of a Bayesian committee machine can improve robustness further. The contrasts studied were as follows:
INDOM In-domain policy: trained only on in-domain data; other data is not taken into consideration and action selection is based only on the in-domain policy. This is the baseline.
GEN Single generic policy: one policy trained on all available data (as in Section 3).
MBCM Multi-policy Bayesian committee machine: as described in Section 4.1. There is one committee member for each domain and each committee member is trained only on in-domain data. However, for action selection, the estimates of all committee members are taken into account using Eq. 9, both during training and testing.
GOLD Gold standard: this is the performance of the single policy where all training data comes from the same domain, i.e. for N domains, GOLD has N times the number of in-domain dialogues for training as provided to INDOM.
We examine two cases: when the training data is limited, with only 250 dialogues available for each domain, and when there is more training data available, 2500 for each domain. In the evaluation of generic policies in Section 6.2, the test domains were relatively similar. Here, we consider more diverse domains:
• Multi-domain system for SFR, SFH and L6, where the domains have different slots but each domain has the same number of slots, and
• Multi-domain system for SFR, SFH and L11, where not only are the slots different, but also one of the domains, L11, has many more slots than the others.
For each contrast described above, 10 policies were trained on the simulated user using different random seeds. Each policy was then evaluated using 1000 dialogues in each domain. The overall average reward, success rate and number of turns are given in Table 4 together with 95% confidence intervals. We do not report results on the SFH domain as policies on this domain behave similarly to the ones on the SFR domain (see Figs. 5 and 6).
There are several conclusions to be drawn from the results given in Table 4. First, as shown in Section 6.2, generic policies make use of data that comes from different domains and this improves performance over an in-domain baseline, even in the case presented here where the domains are very different. The multi-policy MBCM results in performance which is either significantly better than other methods or statistically indistinguishable from other methods. In the case of limited training data, its performance is at least as good as the gold standard. Another advantage of MBCM is that it does not require storing a separate generic policy model but only ever produces in-domain models that have the ability to contribute to action selection for other domains. Unlike domain-independent policy models [8], MBCM allows flexible selection of committee members. The usefulness of each committee member in the MBCM multi-policy model is explored in Table 5 for the SFR domain. As can be seen from the results, all committee members contribute to performance gains. However, not all committee members are equally important. In this case, for good performance on the SFR domain, the SFH committee member is more useful than the L11 committee member.

Policy committee evaluation with human users
In order to fully examine the effectiveness of the proposed adaptation scheme, policies were also trained in direct interaction with human users. We compare two set-ups: one where an in-domain L6 policy is trained online and another where a multi-policy Bayesian committee machine is trained from scratch using data from the SFR, SFH and L6 domains, which produces a policy committee which can operate on all three domains. To the best of our knowledge, this is the first time a dialogue policy has been trained on multiple domains on-line in interaction with real users. Fig. 8 shows the moving average reward as a function of the number of training dialogues for the L6 domain comparing the in-domain (INDOM) policy and the multi-policy Bayesian committee machine (MBCM) as defined in Section 4.2. The performance of the MBCM policy was only shown on training dialogues that came from the L6 domain, but in fact it was also trained on SFR and SFH domains in parallel. The training data across the domains was equally distributed. Each plot is an average of three sample runs. The moving window was set to 100 dialogues so that after the initial 100 dialogues each point on the graph is an average of 300 dialogues. The shaded area represents a 95% confidence interval. The initial parts of the graph exhibit more randomness in behaviour because the number of training dialogues is small. The results show that the multi-policy Bayesian committee machine consistently outperforms the in-domain policy. The caveat is that the computational complexity linearly increases with the number of committee members. Therefore this method would require a technique that selects and removes committee members as needed. An exploration of such a method goes beyond the scope of this article.

Multi-agent simulation results
Finally, we examine the effectiveness of extending the policy committee model to multi-agent learning. The contrasts studied were as follows:
NAÏV Naïve approach - The total reward is given to each committee member during every interaction regardless of the current domain.
WINN Winner-takes-all approach - The total reward is given to the policy member which on average gave the highest Q-value Q-variance ratio during the whole dialogue, $\Sigma^{Q}_{i}(b,a)^{-1} Q_{i}(b,a)$ from Eq. 9, for each action taken by the system.
SCALE Scale received reward according to all committee members' Q-value estimate - The total reward is distributed to each policy committee member in proportion to the average Q-value Q-variance ratio, $\Sigma^{Q}_{i}(b,a)^{-1} Q_{i}(b,a)$ from Eq. 9, for each action that the system took relative to the Q-value Q-variance ratios of the other committee members for the taken action.
MBCM Multi-policy Bayesian committee machine - Each committee member is trained only on in-domain data, so the reward is passed only to the committee member which is specific to that domain (see Section 4.2 for details).
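The four reward assignment schemes above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: `distribute_reward` and its arguments are hypothetical names, `avg_ratio[i]` stands for member i's Q-value Q-variance ratio averaged over the actions taken in the dialogue, `None` is used to mean "no reward passed to this member", and the SCALE branch assumes the averaged ratios are positive so that proportional shares are well defined.

```python
def distribute_reward(total_reward, scheme, avg_ratio, domain_member=None):
    """Return the reward passed to each committee member (None = no update).

    avg_ratio: per-member average Q-value Q-variance ratio for the dialogue.
    domain_member: index of the member tied to the current domain; only
    the MBCM scheme needs this domain knowledge.
    """
    m = len(avg_ratio)
    if scheme == "NAIV":
        # Every member receives the full reward regardless of domain.
        return [total_reward] * m
    if scheme == "WINN":
        # Only the member with the highest average ratio is rewarded.
        winner = max(range(m), key=lambda i: avg_ratio[i])
        return [total_reward if i == winner else None for i in range(m)]
    if scheme == "SCALE":
        # Reward is split in proportion to each member's average ratio.
        total = sum(avg_ratio)
        if total <= 0 or min(avg_ratio) < 0:
            raise ValueError("this sketch assumes positive ratios")
        return [total_reward * r / total for r in avg_ratio]
    if scheme == "MBCM":
        # Only the member specific to the current domain is rewarded.
        if domain_member is None:
            raise ValueError("MBCM requires the current domain's member")
        return [total_reward if i == domain_member else None for i in range(m)]
    raise ValueError(f"unknown scheme: {scheme}")


ratios = [1.0, 3.0, 4.0]  # made-up values for three committee members
for name in ("NAIV", "WINN", "SCALE"):
    print(name, distribute_reward(20.0, name, ratios))
print("MBCM", distribute_reward(20.0, "MBCM", ratios, domain_member=0))
```

Only the MBCM branch needs to know which domain the dialogue belongs to; the other three depend solely on the committee's own estimates.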
We consider a multi-domain system for SFR, SFH and L11. Two scenarios are examined: (a) when the training data is limited, with only 250 dialogues available for each domain, and (b) when there is more training data available, 2500 for each domain.
For each method described above, 10 policies were trained on the simulated user using different random seeds. Each policy was then evaluated using 1000 dialogues on each domain. The overall average reward, success rate and number of turns are given in Table 6 together with 95% confidence intervals. We do not report results on the SFH domain as policies on this domain behave similarly to the ones on the SFR domain (see Figs. 5 and 6).
The following conclusions can be drawn from these results. First, on a smaller dataset the WINN approach, which chooses the winning committee member to pass the total reward to, is less effective than the approaches which distribute the reward. This is expected, as in the latter case the agent's policy learns from a larger set of dialogues, which is particularly useful in the early stages of the optimisation process. On larger datasets, the winner-takes-all approach gives performance similar to or better than that of the approaches which distribute the reward. This means that if a large amount of data is available, one can afford to use the model which chooses which subset of data to optimise the policy on. In this case, that is the data which has the most accurate reward estimate. If we average results across the domains and the sizes of the training data, however, we can see that it is generally more effective to use the approaches which distribute the reward.
It is also important to understand behaviour when a new domain is added alongside a set of existing agents which themselves are not yet fully trained. We are interested in the performance in both the new domain and the existing domains. To investigate this, two agents operating in the SFR and SFH domains were pre-trained with 250 dialogues each using the SCALE reward distribution mechanism. Performance was then evaluated in the SFR domain and in the new, as yet untrained, L11 domain. The L11 agent was then trained with 250 dialogues in that domain. Again, the performance was tested in both L11 and SFR. Finally, training continued with another 250 dialogues for each of the three domains (SFR, SFH and L11), and the performance in the SFR and L11 domains was tested for a final time. The results are shown in Table 7. The performance of the dialogue manager in the L11 domain when trained only with SFR and SFH dialogues is very poor, which is expected as these are very different domains. However, with the addition of 250 L11 dialogues, the performance dramatically improves. What is more, adding these L11 dialogues does not impede performance in the SFR domain; in fact, it improves slightly. With an additional 750 dialogues spread across all three domains, the performance significantly improves in both the L11 and SFR domains.

Multi-agent human user evaluation
To ensure that the benefits of the proposed reward distribution approach suggested by the above simulation results carry over into systems trained on-line, two systems were also trained in direct interaction with human users. First, a multi-policy Bayesian committee machine (MBCM) was trained from scratch using data from the SFR restaurant, the SFH hotel and the L6 laptop domains. This MBCM policy committee operates on all three domains but depends on knowledge of the current domain for policy updating. This is compared to the committee reward scaling (SCALE) machine, presented in Section 6.7, which distributes the reward to every committee member for each dialogue regardless of the domain. The system was deployed in a telephone-based set-up, with subjects recruited via Amazon MTurk, and a recurrent neural network model was used to estimate the reward [27]. Fig. 9 shows the moving average reward as a function of the number of training dialogues for the L6 domain, comparing the MBCM and SCALE committee approaches. The committees were also trained on the SFR and SFH domains in parallel. The training data was equally distributed across the domains. As in Section 6.6, each plot is an average of three sample runs. The moving window was set to 100 dialogues so that after the initial 100 dialogues each point on the graph is an average of 300 dialogues. The shaded area represents a 95% confidence interval. As can be seen from the reward graph for the SCALE approach, the results confirm that it is not necessary for the committee to be aware of the domain. On the contrary, distributing reward to each committee member according to its contribution can even produce better performance than only sending the reward signal to the committee member dedicated to the current domain.

Conclusion
This paper has described three models which support dialogue system domain extension. First, a distributed multi-domain dialogue architecture was proposed in which dialogue policies are organised in a class hierarchy aligned to an underlying knowledge graph. The class hierarchy allows a system to be deployed by using a modest amount of data to train a small set of generic policies. As further data is collected, generic policies can be adapted to give in-domain performance. Using Gaussian process-based reinforcement learning, it has been shown that it is possible to construct generic policies which provide acceptable in-domain user performance, and better performance than can be obtained using under-trained domain-specific policies. To construct a generic policy, a design consisting of all common slots plus a number of abstract slots which can be mapped to domain-specific slots works well. It has also been shown that once sufficient in-domain data becomes available, it is possible to seamlessly adapt to improve performance, without subjecting users to unacceptable disruptions in performance during the adaptation period and without limiting the final performance compared to policies trained from scratch.
An alternative to hierarchically structured policies is the distributed committee model which uses estimates from different policies for action selection at every dialogue turn. The results presented have shown that this model is particularly useful for training multi-domain dialogue systems where the data is limited and varied. As shown both in simulations and in real user trials, the Bayesian policy committee approach gives superior performance to the traditional one-policy approach across multiple domains and allows flexible selection of committee members during testing.
Finally, the basic policy committee model was extended using ideas from multi-agent learning to distribute the reward signal among the committee members. This model is particularly useful in real-world scenarios where the domain is a priori unknown and, indeed, may change during a dialogue. In simulations, the proposed approach achieves performance close to that of the approach which relies on explicit domain information to assign reward, while in a real human trial it produced better performance.
For future work, these methods will be applied to a dialogue manager operating over a large knowledge graph in order to demonstrate that they do indeed scale and offer a viable approach to building truly open domain spoken dialogue systems which learn on-line in interaction with real users.