Advances and Challenges in Conversational Recommender Systems: A Survey

Recommender systems exploit interaction history to estimate user preference and have been heavily used in a wide range of industry applications. However, static recommendation models struggle to answer two important questions well due to inherent shortcomings: (a) What exactly does a user like? (b) Why does a user like an item? The shortcomings stem from the way static models learn user preference, i.e., without explicit instructions and active feedback from users. The recent rise of conversational recommender systems (CRSs) changes this situation fundamentally. In a CRS, users and the system can dynamically communicate through natural language interactions, which provide unprecedented opportunities to explicitly obtain the exact preference of users. Considerable efforts, spread across disparate settings and applications, have been put into developing CRSs, but existing models, technologies, and evaluation methods are still far from mature. In this paper, we provide a systematic review of the techniques used in current CRSs. We summarize the key challenges of developing CRSs in five directions: (1) Question-based user preference elicitation. (2) Multi-turn conversational recommendation strategies. (3) Dialogue understanding and generation. (4) Exploitation-exploration trade-offs. (5) Evaluation and user simulation. These research directions involve multiple research fields such as information retrieval (IR), natural language processing (NLP), and human-computer interaction (HCI). Based on these research directions, we discuss some future challenges and opportunities, and we provide a road map for researchers from multiple communities to get started in this area. We hope this survey can help to identify and address challenges in CRSs and inspire future research.


Introduction
Recommender systems have become an indispensable tool for information seeking. Companies such as Amazon and Alibaba in e-commerce, Facebook and WeChat in social networking, Instagram and Pinterest in content sharing, and YouTube and Netflix in multimedia services all need to properly link items (e.g., products, posts, and movies) to users. An effective recommender system that is both accurate and timely can help users find the desired information and bring significant value to the business. Therefore, the development of recommendation techniques continues to attract academic and industrial attention.
Static recommendation models are typically trained offline on historical behavior data and then used to serve users online [33]. Despite their wide usage, they fail to answer two important questions: 1. What exactly does a user like? The learning process of a static model is usually conducted on historical data, which may be sparse and noisy. Moreover, a basic assumption of static models is that all historical interactions represent user preference. Such a paradigm raises critical issues. First, users might not like the items they chose, as they may make wrong decisions [182,183]. Second, the preference of a user may drift over time, which means that a user's attitudes towards items may change, and capturing the drifted preference from past data is even harder [71]. In addition, for cold users who have few historical interactions, modeling their preferences from data is difficult [87]. In short, a static model can hardly capture the precise preference of a user. 2. Why does a user like an item? Figuring out why a user likes an item is essential for improving recommender model mechanisms and thus increasing their ability to capture user preference. There are many factors affecting a user's decisions in real life [110,17,49]. For example, a user might purchase a product out of curiosity or because of the influence of others [202], or the purchase may be the outcome of deliberate consideration. It is common that different users purchase the same product with different motivations. Thus, treating different users equally, or treating different interactions by the same user equally, is not appropriate for a recommendation model. In reality, it is hard for a static model to disentangle the different reasons behind a user's consumption behavior. Even though much effort has been made to alleviate these problems, existing solutions rest on limited assumptions.
For example, a common setting is to exploit a large amount of auxiliary data (e.g., social networks, knowledge graphs) to better interpret user intention [152]. However, such additional data may also be incomplete and noisy in real applications. We believe the key difficulty stems from an inherent mechanism: the static mode of interaction fundamentally limits the way in which user intention can be expressed, causing an asymmetric information barrier between users and machines.
The emergence of conversational recommender systems (CRSs) changes this situation in profound ways. There is no widely accepted definition of CRS. In this paper, we define a CRS to be:

A recommendation system that can elicit the dynamic preferences of users and take actions based on their current needs through real-time multiturn interactions using natural language.
Our definition highlights two properties of CRSs: one is multi-turn interaction and the other is natural language. By a narrow definition, conversation means multi-turn dialogues in either written or spoken form; from a broader perspective, conversation means any form of interaction between users and systems using natural language. Conversational interaction is a natural solution to the long-standing asymmetry problem in information seeking. Through natural language-based interactions, CRSs can easily elicit the current preference of a user and understand the motivations behind a consumption behavior. Figure 1 shows an example of a CRS where a user turns to the agent for music suggestions. Combining the user's previous preference (loving Jay Chou's songs) with the intention elicited through conversational interactions, the system can easily offer the desired recommendations.
Since the birth of recommender systems, researchers have realized the importance of human-machine interaction. Some studies propose interactive recommender systems [57,177,20,226], which mainly focus on improving the recommendation strategy online by leveraging real-time user feedback on previously recommended items. However, the way most of these methods interact with users suffers from low efficiency, as there are too many items. An alternative solution is to leverage the attribute information of items, which is self-explanatory for understanding users' intentions and can quickly narrow down the candidate items. The critiquing-based recommender system is such a solution, designed to elicit users' feedback on certain attributes rather than on items.
[Figure 1: A toy example of a conversational recommender system in music recommendation. In the depicted dialogue, the user asks for some relaxing pop music and confirms a preference for Jay Chou; the system first suggests "Mojito" (by Jay Chou), which the user finds too popular, then an older song, "Malt Candy" (also by Jay Chou), which the user has already heard many times, and finally a newly released song that the user agrees to try.]
It can be viewed as an early form of CRS [166,168,12,154,135,23,108,107]. Critiquing is like a salesperson who collects user preference by proactively asking questions about item attributes. For example, when seeking mobile phones, a user may follow the hints of the system and provide feedback such as "cheaper" or "longer battery life." Based on such feedback, the system recommends more appropriate items; this procedure repeats several times until the user finds satisfactory items or gives up. The mechanism gives the system an improved ability to infer user preference and helps quickly narrow down recommendation candidates. Though effective, existing interactive and critiquing methods are constrained in their expressiveness, since users can only interact with the system through a few predefined options. The integration of a conversational module in CRSs allows for more flexible forms of interaction, e.g., in the form of tags, template utterances, or free natural language. Undoubtedly, user intention can be more naturally expressed and comprehended through a conversational module.
Recently, attracted by the power of CRSs, many researchers have devoted themselves to exploring this topic. These efforts are spread across a broad range of task formulations, in diverse settings and application scenarios. We collected the papers related to CRSs by searching for "Conversation* Recommend*" on DBLP. Although many studies have been done on CRSs, there is no uniform task formulation. Jannach et al. [73] surveyed CRSs and classified the methods according to aspects such as knowledge sources and interaction modalities, e.g., methods based on forms or natural language, or driven by the system or the user. Some researchers focus on the dialogue ability of CRSs and try to build models based on end-to-end dialogue systems [94,25,104,223] or deep language models [225]. However, these models aim to learn the patterns in human conversation corpora and are usually non-transparent and hard to interpret. As shown in the work of Jannach and Manzoor [72], the end-to-end methods [94,25] perform poorly in human evaluation in terms of both recommendation and response generation. Therefore, an explicit conversation strategy is necessary and requires substantial research effort.
In this survey, we present CRSs under a general framework that consists of three decoupled components, as illustrated in Figure 3. Specifically, a CRS is made of a user interface, a conversation strategy module, and a recommendation engine. The user interface serves as a translator between the user and the machine: it extracts information from the raw utterances of the user and transforms it into machine-understandable representations, and it generates meaningful responses to the user based on the conversation strategy. The conversation strategy module is the brain of the CRS and coordinates the other two components; it decides the core logic of the CRS, such as eliciting user preference, maintaining multi-turn conversations, and leading new topics. The recommendation engine is responsible for modeling relationships among entities (e.g., user-item interactions or item-item linkages), learning and recording user preferences on items and their attributes, and retrieving the required information.
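To make the decoupling concrete, the following Python sketch wires the three components together in a minimal interaction loop. All class and method names here are our own hypothetical illustration, not taken from any cited system; the "language understanding" is reduced to parsing toy `like:<attr>` strings.

```python
class UserInterface:
    """Translates raw user utterances into structured signals and back."""

    def parse(self, utterance):
        # A real system would run NLU here; we assume toy
        # "like:<attr>" / "dislike:<attr>" utterances.
        polarity, _, attribute = utterance.partition(":")
        return {"attribute": attribute, "liked": polarity == "like"}

    def render(self, action, payload):
        if action == "ask":
            return f"Do you like {payload}?"
        return f"How about these items: {payload}?"


class RecommendationEngine:
    """Records elicited preferences and retrieves matching items."""

    def __init__(self, catalog):
        self.catalog = catalog  # item id -> set of attributes
        self.liked, self.disliked = set(), set()

    def update(self, signal):
        target = self.liked if signal["liked"] else self.disliked
        target.add(signal["attribute"])

    def candidates(self):
        return [item for item, attrs in self.catalog.items()
                if self.liked <= attrs and not (self.disliked & attrs)]


class ConversationStrategy:
    """The 'brain': decides whether to keep asking or to recommend."""

    def decide(self, engine, unasked):
        # Naive rule: recommend once the candidate set is small enough,
        # or when there is nothing left to ask about.
        if len(engine.candidates()) <= 2 or not unasked:
            return "recommend", engine.candidates()
        return "ask", unasked.pop()
```

In a real CRS each component would be far richer (neural language understanding, a learned policy, a trained recommender model), but the interfaces between the three components would look broadly similar.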
There are many challenges within these three components; we summarize the five main ones as follows.
• Question-based User Preference Elicitation. CRSs provide the opportunity to explicitly elicit user preference by asking attribute questions. There are two important questions that need to be answered: (1) What to ask? and (2) How to adjust the recommendations based on user response? The former focuses on constructing questions to elicit as much information as possible; the latter leverages the information in user responses to make more appropriate recommendations.
• Multi-turn Conversational Recommendation Strategies. The system needs to repeatedly interact with a user and adapt to the user's responses dynamically over multiple turns. An effective strategy concerns when to ask questions and when to make recommendations, i.e., letting the model choose between (1) continuing to ask questions so as to further reduce preference uncertainty, and (2) generating a recommendation based on the estimate of the current user preference. Generally, the system should aim at a successful recommendation using the least number of turns, as users will lose their patience after too many turns [88]. Furthermore, some sophisticated conversational strategies try to proactively lead dialogues [189,4], which can introduce diverse topics and tasks in CRSs [104,225,90,185].
• Natural Language Understanding and Generation. Communicating like a human being continues to be one of the hardest challenges in CRSs. For understanding user interests and intentions, some CRS methods define the model input as pre-defined tags that capture semantic information and user preferences [31,88,89,230]. Some methods extract the semantic information from users' raw utterances via slot-filling techniques and represent user intents as slot-value pairs [208,161,141]. For generating human-understandable responses, CRSs use many strategies, such as directly providing a recommendation list [230,208] or incorporating recommended items in a rule-based natural language template [161,88,89]. Moreover, some researchers propose end-to-end frameworks to enable CRSs to precisely understand users' sentiments and intentions from raw natural language and to generate readable, fluent, consistent, and meaningful natural language responses [94,104,141,25,223].
• Trade-offs between Exploration and Exploitation (E&E). One problem of recommender systems is that each user can only interact with a few items out of the entire dataset; a large number of items that a user may be interested in will remain unseen by the user. For cold-start users (who have just joined the system and have zero or very few interactions), the problem is especially severe. Thanks to their interactive nature, CRSs can actively explore the unseen items to better capture the user preference. In this way, users benefit from having chances to express their intentions and obtain better-personalized recommendations. However, exploration comes at a price. As users only have limited time and energy to interact with the system, a failed exploration wastes time and loses the opportunity to make accurate recommendations. Moreover, exposing unrelated items hurts the user experience, compared to exploiting the already captured preference by recommending items of high confidence [150,93,52]. Therefore, pursuing the E&E trade-off is a critical issue in CRSs.
• Evaluation and User Simulation. Evaluation is an important topic. Unlike static recommender models that are optimized on offline data, CRSs emphasize the user experience during dynamic interactions. Hence, we should not only consider the turn-level evaluation for both recommendation and response generation, but also pay attention to the conversation-level evaluation. Besides, evaluating CRSs requires a large number of online user interactions, which are expensive to obtain [93,71,66]. Hence, using simulated users is necessary. Developing reliable user simulators is challenging and remains an open problem.
The five challenges are allocated to the corresponding components as illustrated in Figure 3, where trading off the E&E balance is exclusive to the recommender engine, and handling natural language understanding and generation is exclusive to the conversation module. The remaining three challenges are related to both components. We will discuss existing solutions with regard to these five directions, and we hope to inspire the audience to push the frontiers of building CRSs.
The remainder of this paper is organized as follows. In the next several sections, we discuss the main challenges in CRSs. Specifically, in Section 2, we illustrate how CRSs can elicit user preferences by asking informative questions. In Section 3, we describe the strategies CRSs use to interact with users in a multi-turn conversation. In Section 4, we point out the problems and present solutions in dialogue understanding and generation for CRSs. In Section 5, we discuss how CRSs can balance the exploration-exploitation trade-off for cold-start users. In Section 6, we explore metrics for evaluating CRSs and present techniques for user simulation. In Section 7, we envision some promising future research directions. Finally, in Section 8, we conclude this survey.

Question-based User Preference Elicitation
A user looking for items with specific attributes may get access to them by actively searching. For instance, a user may search for "iphone12 red 256gb", where the key phrases "red" and "256gb" are attributes of the item iPhone 12. In this scenario, users construct the query themselves, and the performance relies on both the search engine and the user's expertise in constructing queries. Even though there are efforts to help users complete queries by suggesting possible options based on what they have entered [109,5,35,13], users still need to figure out appropriate query candidates. Besides, searching in this way requires users to be familiar with each item they want, which often does not hold in practice. Recommender systems introduce users to potential items that they may like. However, traditional recommender systems can only utilize static historical records as input, which results in the two main limitations mentioned in Section 1.
Fortunately, CRSs can bridge the gap between search engines and recommender systems. Empowered by real-time interactions, CRSs can proactively consult users by asking questions. With the feedback returned by users, CRSs can directly comprehend users' needs and attitudes towards certain attributes and hence make proper recommendations. Even if users are not satisfied with the recommended items, a CRS has the opportunity to adjust its recommendations during the interaction process.
Question-driven methods focus on the problem of what to ask in conversations. Generally, there are two kinds of methods: (1) asking about items [215,32,151], or (2) asking about attributes/topics/categories of items [88,89].

Asking about Items
Early studies directly ask users for opinions about an item itself [215,180,32,233,171]. Unlike traditional recommender systems, which need to estimate user preferences in advance, CRSs can construct and modify the user profile during the interaction process. In traditional recommender models, the recommended items are produced in a relatively stable way from all candidates. In the CRS scenario, the recommended items should be updated after the system receives feedback from a user, and the update could be a complete change in order to adapt to the user's real-time preferences. Hence, instead of merely updating model parameters online, some explicit rules or mechanisms are required. We introduce three kinds of methods that can elicit users' attitudes towards items and quickly adjust recommendations. Most of these methods did not use natural language in their user interface, but a natural language-based interface can easily be integrated to make a CRS.
Choice-based Methods. The main idea of choice-based preference elicitation is to recurrently let users choose their preferred items or item sets from the current given options. The common strategies include (1) choosing an item from two given options [151], (2) selecting an item from a list of given items [75,53,144], and (3) choosing a set of items from two given lists [105]. After the user chooses preferred items, the methods change the recommendations according to the user's choice. For example, Loepp et al. [105] use the matrix factorization (MF) model [6] to initialize the embedding vectors of users and items, then select two sets of items from the item embedding space as candidate sets and let a user choose one of the two sets. It is important to ensure that the two candidate sets are as different or distinguishable as possible. To achieve this, the authors adopt a factor-wise MF algorithm [6], which factorizes the user-item interaction matrix and obtains the embedding vectors one by one in decreasing order of explained variance. Hence, the factors, i.e., different dimensions of embedding vectors, are ordered by distinctiveness. Then, the authors iteratively select two item sets with only a single factor value varying. For example, if two factors represent the degree of Humor and Action of movies, respectively, then the two candidate sets are one set of movies with a high degree of Humor and another with a low degree of Humor, while the degree of Action of the two sets is fixed to the average level. When a user chooses one item set, the user's preference embedding vector is set to the average of the embedding vectors of the chosen items. The choice becomes harder as the interaction process continues. Users can choose to ignore the question, which means the users cannot tell the difference between the two item sets or they do not care about it. Carenini et al. 
[14] further explore other strategies to select query items, e.g., selecting the most popular or the most diverse items in terms of users' history.
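The factor-wise selection step can be sketched as below. This is our own simplified illustration, assuming pretrained item embeddings; for brevity it contrasts items only along the chosen factor and omits fixing the remaining factors to their average values.

```python
import numpy as np

def build_choice_sets(item_emb, factor, k=3):
    """Return two item sets that differ mainly along one latent factor.

    item_emb: (n_items, d) matrix of item embedding vectors.
    factor:   index of the latent factor currently being elicited.
    """
    order = np.argsort(item_emb[:, factor])  # low -> high factor value
    return order[:k], order[-k:]

def update_user_embedding(item_emb, chosen_set):
    """Set the user's preference vector to the mean of the chosen items."""
    return item_emb[chosen_set].mean(axis=0)
```

Repeating this over factors ordered by explained variance yields the iterative elicitation loop described above; a user who ignores a question simply triggers no embedding update.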
Bayesian Preference Elicitation. In addition, there are studies based on a probabilistic view of preference elicitation, which has been researched for a long time [18,9,171]. Basically, there is a utility (or score) function u(θ_u, x_i) representing user u's preference for item i, where θ_u and x_i denote the user and item representation vectors. Usually, it is written as a linear function: u(θ_u, x_i) = θ_uᵀ x_i. In a Bayesian setting, user u's preference is modeled by a probability distribution instead of a deterministic vector, which means that θ_u is sampled from a prior user belief P(θ_u). Therefore, the utility of an item for the user is computed as the expectation: U(x_i) = E_{θ_u ∼ P(θ_u)}[u(θ_u, x_i)]. The item with the maximum expected utility is taken as the recommendation: i* = argmax_i U(x_i). Based on the utility function, the system can select some items to query, and the user belief distribution can be updated based on the user's feedback. Specifically, given a user response r to the question q, the posterior user belief P(θ_u | q, r) follows from Bayes' rule: P(θ_u | q, r) ∝ P(r | q, θ_u) P(θ_u). As for the query strategy, i.e., selecting which items to ask about, there are different criteria. For example, Boutilier [9] proposes a partially observable Markov decision process (POMDP) framework as a sequential query strategy, while Vendrov et al. [171] and Guo and Sanner [54] use the expected value of information (EVOI) paradigm as a relatively myopic strategy to select items to query. Furthermore, queries can be classified into two types: (1) a pairwise comparison query, in which users choose which of two items or item sets they prefer [32,54,151]; or (2) a slate query, where users choose from multiple given options [171].
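The expectation and the Bayesian update above can be approximated with a weighted-particle representation of the belief P(θ_u). The sketch below is our own illustration, using a Bradley-Terry-style likelihood for a pairwise comparison query; it is not the implementation of any cited work.

```python
import numpy as np

def expected_utilities(particles, weights, item_emb):
    """U(x_i) = E_theta[theta^T x_i]; for a linear utility this is exactly
    the utility under the posterior-mean theta."""
    mean_theta = weights @ particles          # weighted average user vector
    return item_emb @ mean_theta

def update_belief(particles, weights, chosen, rejected, tau=1.0):
    """Reweight particles after the user chose `chosen` over `rejected`.

    Likelihood: P(choice | theta) = sigmoid(theta^T (x_c - x_r) / tau).
    """
    diff = chosen - rejected
    likelihood = 1.0 / (1.0 + np.exp(-(particles @ diff) / tau))
    new_weights = weights * likelihood        # Bayes rule, then renormalize
    return new_weights / new_weights.sum()
```

Particles aligned with the chosen direction gain weight, so repeated queries concentrate the belief; a query-selection criterion such as EVOI would be layered on top of this update.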
Interactive Recommendation. Interactive recommendation models are mainly based on reinforcement learning. Some researchers adopt multi-armed bandit (MAB) algorithms [215,32,180]. The advantage is two-fold. First, MAB algorithms are efficient and naturally support conversational scenarios. Second, MAB algorithms can exploit the items that users liked before and explore items that users may like but have never tried. Other researchers formulate interactive recommendation as a meta-learning problem that can quickly adapt to new tasks [233,87]; a task here is to make recommendations based on several conversation histories. Both meta-learning and MAB-based methods have the capability of balancing exploration and exploitation, which we describe further in Section 5.
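As a minimal illustration of the bandit view, the sketch below runs Bernoulli Thompson sampling over items, treating each like/dislike answer as bandit feedback. This is the generic textbook algorithm, not the implementation of any cited method.

```python
import random

class ThompsonItemElicitor:
    """Bernoulli Thompson sampling with one Beta posterior per item."""

    def __init__(self, n_items, seed=0):
        self.alpha = [1.0] * n_items  # pseudo-counts of "liked"
        self.beta = [1.0] * n_items   # pseudo-counts of "disliked"
        self.rng = random.Random(seed)

    def select(self):
        # Sample a plausible like-probability for each item from its
        # Beta posterior, then query the item with the highest sample.
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def feedback(self, item, liked):
        if liked:
            self.alpha[item] += 1
        else:
            self.beta[item] += 1
```

Sampling from the posterior lets uncertain items occasionally win the argmax (exploration), while items with many observed likes win most of the time (exploitation), so the querying naturally concentrates on items the user is likely to enjoy.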
Recently, researchers have incorporated deep reinforcement learning (DRL) models into interactive recommender systems [214,20,191,216,65,231,24,68,97,131,226,232,176]. Unlike MAB-based methods, which usually assume the user preference is unchanged during the interaction, DRL-based methods can model a dynamic preference and long-term utility. For example, Mahmood and Ricci [112] introduce a model-based technique and use the policy iteration algorithm [163] to acquire an adaptive strategy. Model-free frameworks such as the deep Q-network (DQN) [214,216,231,226] and deep deterministic policy gradient (DDPG) [65] are also used in interactive recommendation scenarios. However, most reinforcement learning (RL)-based methods suffer from low efficiency and cannot handle cold-start users. Zhou et al. [226] propose to integrate a knowledge graph into the interactive recommendation to mitigate these problems.
However, directly asking about items is inefficient for building the user profile, because the candidate item set is large and, in real-world CRS applications, users get bored as the number of conversation turns increases. It is more practical to ask attribute-centric questions, i.e., to ask users whether they like an attribute (or topic/category in some works), and then make recommendations based on these attributes [208,88]. Therefore, the estimation and utilization of a user's preferences towards attributes become a key research issue.

Asking about Attributes
Asking about attributes is more efficient because knowing whether users like or dislike an attribute can significantly reduce the recommendation candidates. The challenge is to determine a sequence of attributes to ask so as to minimize the uncertainty about current user needs [120,165]. The aforementioned critiquing-based methods fall into this category. Besides, there are other kinds of methods; we introduce the mainstream branches below.

Fitting Patterns from Historical Interaction
A conversation can be deemed a sequence of entities, including consumed items and mentioned attributes, and the objective is to learn to predict the next attribute to ask or the next item to recommend. Therefore, sequential neural networks such as the gated recurrent unit (GRU) model [29] and the long short-term memory (LSTM) model [62] can naturally be adopted in this setting, due to their ability to capture long- and short-term dependencies in user behavioral patterns.
An exemplar work is the question & recommendation (Q&R) model proposed by Christakopoulou et al. [31], where the interaction between the system and a user is implemented as a selection system. In each turn, the system asks the user to choose one or more distinct topics (e.g., NBA, Comics, or Cooking) from a given list, and then recommends items in these topics to the user. It contains a trigger module to decide whether to ask a question about attributes or to make a recommendation. The triggering mechanism can be as simple as a random mechanism, can be more sophisticated, e.g., using criteria capturing the user's state, or can even be user-initiated. At the t-th time step, the next topic c_t that the user will click can be predicted from the user's watching history x_1, …, x_{t-1} as P(c_t | x_1, …, x_{t-1}). After the user clicks a topic c_t, the model can recommend an item v_t based on the conditional probability P(v_t | x_1, …, x_{t-1}, c_t). Both conditional probabilities are implemented with the GRU architecture [29]. This algorithm has been deployed on YouTube for obtaining preferences from cold-start users.
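The shape of this computation can be illustrated with a toy numpy GRU encoder: the history is encoded into a hidden state, and the next-topic distribution is a softmax over topic embeddings. The weights below are random stand-ins, so this is only a structural sketch, not a trained Q&R model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyGRU:
    """Minimal GRU cell that encodes a history of embedding vectors."""

    def __init__(self, d_in, d_h, rng):
        s = 1.0 / np.sqrt(d_h)
        self.Wz, self.Wr, self.Wh = (
            rng.uniform(-s, s, (d_h, d_in + d_h)) for _ in range(3))

    def encode(self, xs):
        h = np.zeros(self.Wz.shape[0])
        for x in xs:
            xh = np.concatenate([x, h])
            z = sigmoid(self.Wz @ xh)                       # update gate
            r = sigmoid(self.Wr @ xh)                       # reset gate
            h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
            h = (1 - z) * h + z * h_tilde
        return h

def next_topic_probs(history, gru, topic_emb):
    """P(c_t | x_1..x_{t-1}) as a softmax over topic embeddings."""
    h = gru.encode(history)
    scores = topic_emb @ h
    exp = np.exp(scores - scores.max())                     # stable softmax
    return exp / exp.sum()
```

The item-recommendation distribution P(v_t | x_1, …, x_{t-1}, c_t) would be computed the same way, with the clicked topic's embedding appended to the history before encoding.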
Zhang et al. [208] propose a "System Ask User Response" (SAUR) paradigm. For each item, they utilize the rich review information and convert a sentence containing an aspect-value pair into a latent vector via the GRU model. Then they adopt a memory module with an attention mechanism [159,83,119] to perform both the next-question generation task (determining which attribute to ask) and the next-item recommendation task. Again, they develop a heuristic trigger to decide whether it is time to display the top-recommended items to users or to keep asking questions about attributes. One limitation of the work is that the authors assume all information in reviews supports the purchasing behavior; however, this is not true, as users may complain about certain aspects of the purchased items, e.g., a user may write "64 gigabytes is not enough". Using such information without discrimination will mislead the model and deteriorate performance.
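The attention-based memory read in SAUR-style models can be illustrated roughly as follows; this is a generic soft attention read over memory slots, not the authors' exact module.

```python
import numpy as np

def attention_read(query, memory):
    """Soft read: weights = softmax(memory . query), output = weighted sum.

    query:  (d,) vector summarizing the conversation so far.
    memory: (n_slots, d) matrix of stored latent vectors.
    """
    scores = memory @ query
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    weights = exp / exp.sum()
    return weights @ memory, weights
```

In SAUR, the memory slots would hold the latent vectors of review sentences, and the read result would feed both the next-question and next-item predictors.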
The utterances produced by the system, i.e., the questions, are constructed with predefined language patterns or templates, meaning that what the system needs to pay attention to are only the aspect and the value. This is a common setting in state-of-the-art CRS studies because the core task here is recommendation instead of language generation [31,88,89].
Note that these kinds of methods have a common disadvantage: learning from historical user behaviors does not help the model understand the logic behind the interaction. As interactive systems, these models do not consider how to react when users reject a recommendation, i.e., they merely fit the preferences in historical interactions and lack an explicit strategy for dealing with different kinds of feedback.

Reducing Uncertainty
Unlike sequential neural network-based methods that do not have an explicit strategy to handle all kinds of user feedback, some studies try to build a straightforward logic to narrow down item candidates.
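One straightforward instance of such logic is a greedy, entropy-based loop: ask the unasked attribute whose yes/no answer splits the remaining candidates most evenly, then prune the candidates inconsistent with the answer. The sketch below is our own illustration of this idea, not a specific cited method.

```python
import math

def best_attribute(candidates, attrs_of, asked):
    """Greedily pick the unasked attribute with the highest split entropy.

    candidates: iterable of item ids still possible.
    attrs_of:   item id -> set of attributes.
    asked:      attributes already asked about.
    """
    items = list(candidates)
    all_attrs = set().union(*(attrs_of[i] for i in items)) - set(asked)
    best, best_h = None, -1.0
    for a in all_attrs:
        p = sum(a in attrs_of[i] for i in items) / len(items)
        if p in (0.0, 1.0):
            continue  # the answer would be uninformative
        h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
        if h > best_h:
            best, best_h = a, h
    return best

def prune(candidates, attrs_of, attribute, liked):
    """Keep only the items consistent with the user's yes/no answer."""
    return [i for i in candidates
            if (attribute in attrs_of[i]) == liked]
```

An even 50/50 split yields one bit of information per question, so in the ideal case the candidate set shrinks roughly geometrically with the number of turns.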
Critiquing-based Methods. The aforementioned critiquing model is typically equipped with a heuristic tactic to elicit user preference on attributes [23,188,108,107]. In traditional critiquing models, the critique on an attribute value (e.g., "not red" for color or "less expensive" for price) is used for reconstructing the candidate set by removing the items with the unsatisfactory attributes [23,117,155,172,12,154]. Neural vector-based methods instead absorb the critique into a latent vector, which is responsible for generating both the recommended items and the explained attributes. For example, Wu et al. [188] propose an explainable neural collaborative filtering (CE-NCF) model for critiquing. They use the neural collaborative filtering model [60] to encode the preference of user u for item i as a latent vector, which is then used to produce both the rating score and the explained attribute vector. The attributes are composed of a set of key-phrases such as "golden, copper, orange, black, yellow," and each dimension of the explained attribute vector corresponds to a certain attribute. When a user dislikes an attribute and critiques it in real-time feedback, the system updates the explained attribute vector by zeroing out the corresponding dimension.
[Figure: An illustration of interactive path reasoning in CPR, where light orange, light blue, and light gold vertices represent the user, attributes, and items respectively; for example, the artist Michael Jackson is an item, and rock and dance are attributes.] The graph connects the user to items and attributes as well as other relevant entities. An edge between two vertices represents their relation; for example, a user-item edge indicates that the user has interacted with the item, and a user-attribute edge indicates that the user has affirmed an attribute in a conversation session. A conversation session in CPR is expressed as a walk in the graph.
It starts from the user vertex and travels through the graph with the goal of reaching one or multiple item vertices that the user likes as the destination. Note that the walk is navigated by the user through conversation. This means that, at each step, the system needs to interact with the user to find out which vertex to go to next, and takes actions according to the user's response.
We now go through an example in Figure 1 to better understand the process. A user, Tom, is seeking a recommendation of music artists. The walk starts from the user vertex ("Tom"), and the session is initialized by the user-specified attribute ("dance"). Accordingly, the system makes its first step from "Tom" to "dance". Afterwards, the system either identifies an adjacent attribute vertex (cf. Sec 4.1) on the graph to consult the user about, or recommends a list of items. If the user confirms his preference for the asked attribute, the system will transit to that attribute vertex. However, if the user rejects the attribute, or rejects a recommendation, the system will stay at the same vertex and consult the user about another attribute. The session repeats this cycle multiple times until the recommended items are accepted by the user.
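The pruning effect of the graph can be seen in a few lines: the attributes eligible for the next question are only the neighbours of the current vertex. The sketch below is our own simplified reading of the framework, with a toy adjacency dictionary standing in for the heterogeneous graph.

```python
def adjacent_attributes(graph, path, asked):
    """Attributes reachable from the last confirmed vertex, minus those
    already asked about or already on the path.

    graph: dict mapping each vertex to the set of its neighbours.
    path:  list of confirmed vertices (the user, then attributes).
    """
    return graph[path[-1]] - set(asked) - set(path)

def step(graph, path, attribute, accepted):
    """Walk to the attribute vertex if the user confirms it, else stay put."""
    return path + [attribute] if accepted else path
```

Because candidate questions are restricted to graph neighbours rather than the full attribute vocabulary, the decision space at each turn stays small even when the attribute space is large.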
The proposed CPR framework, as a new angle on conducting conversational recommendation, conceptually brings several merits to the development of CRSs: 1. It is highly explainable. It models conversational recommendation as an interactive path reasoning problem on the graph, with each step confirmed by the user. Thus, the resultant path is the actual reason for the recommendation. This makes better use of the fine-grained attribute preference than existing methods that only model attribute preference in latent space, such as [13]. 2. It facilitates the exploitation of abundant information by introducing the graph structure. By limiting the candidate attributes to ask about to the adjacent attributes of the current vertex, the candidate space is largely reduced, a significant advantage compared with existing CRS methods like [13,24] that treat almost all attributes as candidates. 3. It is an aesthetically appealing framework which demonstrates the natural combination and mutual promotion of conversation systems and recommendation systems. On the one hand, the path walked over the graph provides natural dialogue state tracking for the conversation system, which is believed to be efficient in making the conversation more logically coherent [12,14]; on the other hand, being able to directly solicit attribute feedback from the user, the conversation provides a shortcut to prune off search branches in the graph.
To validate the effectiveness of CPR, we provide a simple yet effective implementation called SCPR (Simple CPR), targeting the multi-round conversational recommendation (MCR) scenario (cf. Sec 3). We conduct experiments on the Yelp and LastFM datasets, comparing SCPR with state-of-the-art CRS methods [13,24] that also use the information of users, items, and attributes but do not use a graph. We analyze the properties of each method under different settings, including different types of questions (binary and enumerated) and different granularities of attributes. We find that SCPR outperforms existing methods on recommendation success rate, especially in settings where the attribute space is larger.
In summary, our contributions are twofold: • We propose the CPR framework to model conversational recommendation as a path reasoning problem on a heterogeneous graph, which provides a new angle on building CRSs. To the best of our knowledge, this is the first time graph-based reasoning has been introduced to multi-round conversational recommendation.
• To demonstrate the effectiveness of CPR, we provide a simple instantiation, SCPR, which outperforms existing methods in various settings. We find that the larger the attribute space is, the more improvement our model achieves.

RELATED WORK
The success of a recommendation system hinges on offering the relevant items of user interest accurately and timely. In the beginning, recommendation systems were largely built on the collaborative filtering hypothesis to infer a distributed representation of the user profile. Representative models include matrix factorization [11] and factorization machines [9,20]. However, by nature, these approaches suffer from two intrinsic problems. The first is the inability to capture dynamic user preferences, due to the strict assumption that a user's interest is static over a long-term horizon [23]. The second is weak explainability, as the user preference representation is only a continuous vector. Later works try to introduce Markov models [21] and multi-armed bandit methods [28] to solve the dynamics problem, but the explainability remains unsatisfactory.
Recently, graph-based recommendation methods have attracted increasing research attention. One line of research leverages the better expressiveness of the graph: such methods either explore implicit properties like collaborative signals [25,35] from the global connectivities, or focus on yielding better representations of users and items by incorporating latent network embeddings [30]. Another line of research conducts explicit reasoning over the graph to make the recommendation process explainable.

In critiquing-based systems, when a user critiques an attribute, the corresponding dimension of the attribute vector is set to zero. The updated attribute vector is then used to update the user's latent vector, and consequently the recommendation score is updated. Following this setting, Luo et al. [108] change the base NCF model to a variational autoencoder (VAE) model, and this generative model helps the critiquing system achieve better computational efficiency, improved stability, and faster convergence.

Reinforcement Learning-driven Methods.
Reinforcement learning is also used in CRSs to select the appropriate attributes to ask [161,88,89]. Empowered by a deep policy network, the system not only selects the attributes but also determines a controlling strategy for when to change the topic of the current conversation; we elaborate on this in Section 3.1, where we describe how reinforcement learning helps the system form a multi-turn conversational strategy.

Graph-constrained Candidates.
A graph is a prevalent structure to represent the relationships among different entities, and it is natural to utilize graphs to sift items given a set of attributes. For example, Lei et al. [89] propose an interactive path reasoning algorithm on a heterogeneous graph in which users, items, and attributes are represented as nodes, and an edge connecting two nodes represents a relationship between them, e.g., a user purchased an item, or an item has a certain value for an attribute. With the help of the graph, a conversation can be converted to a path on the graph, as illustrated in Figure 4. The authors compare the uncertainty of the preference for attributes and choose the attribute with the maximum uncertainty to ask about. Here, the preference for a certain attribute is modeled by the average preference for items that have this attribute. Hence, the searching space and overhead of the algorithm can be significantly reduced by utilizing the graph information. Other studies apply graph neural networks (GNNs) to learn powerful representations of both items and attributes, so the semantic information in the learned embedding vectors can help end-to-end CRS models generate appropriate recommendations. For example, the graph convolutional network (GCN) model and its variants [81,149] are adopted on the knowledge graph in recent CRS models [25,223,192,98].
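As an illustration, the graph-constrained candidate reduction can be sketched as follows (toy graph and vertex naming of my own; the actual CPR implementation scores these candidates with learned models):

```python
# Hedged sketch of graph-constrained candidate selection: instead of treating
# all attributes as candidates, only attribute vertices adjacent to the
# current vertex on the user-item-attribute graph are considered next.

def adjacent_attribute_candidates(graph, current_vertex, asked):
    """Attribute vertices adjacent to the current vertex, minus attributes
    already asked in this session."""
    return [v for v in graph.get(current_vertex, [])
            if v.startswith("attr:") and v not in asked]

# Toy heterogeneous graph as an adjacency list; "user:", "item:", "attr:"
# prefixes mark the vertex type.
graph = {
    "user:tom": ["item:song1", "attr:dance"],
    "attr:dance": ["item:song1", "item:song2", "attr:electronic", "attr:pop"],
    "attr:electronic": ["item:song1", "attr:dance"],
}

candidates = adjacent_attribute_candidates(graph, "attr:dance", asked={"attr:dance"})
```

Because only the neighbors of the current vertex are enumerated, the cost of choosing the next question scales with the local degree rather than the full attribute vocabulary.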
Other Methods. There are other attempts to make recommendations based on user feedback on attributes. For example, Zou et al. [230] propose a question-driven recommender system that incorporates real-time feedback from users into an extended matrix factorization model, which considers only the user rating data.
The basic assumption is that if a user likes an item, then he/she will like the attributes of this item. Thereby, in each turn, the system selects the attribute that carries the maximum amount of uncertainty to ask about. In other words, if an attribute is known to be shared by most items that a user likes, then there is no need to ask about this attribute. Similarly, there is no need to ask about attributes that users dislike. Only when it is unclear whether a user likes an attribute does asking about it provide the most information. The parameters in the matrices can be updated after users provide feedback. Besides, using ideas similar to the aforementioned models based on asking about items, MAB-based models [207,95] and Bayesian approaches [113] have also been developed for attribute-asking CRSs.
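The maximum-uncertainty selection rule can be sketched as follows (toy data and helper names of my own, using binary entropy of attribute presence over the candidate set):

```python
import math

# Hedged sketch: the attribute whose presence among the current candidate
# items has the highest binary entropy is the most informative one to ask
# about; attributes shared by all (or no) candidates reveal nothing.

def attribute_entropy(candidate_items, attribute):
    p = sum(attribute in item for item in candidate_items) / len(candidate_items)
    if p in (0.0, 1.0):      # shared by all or none: asking reveals nothing
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_attribute_to_ask(candidate_items, attributes):
    return max(attributes, key=lambda a: attribute_entropy(candidate_items, a))

# Each candidate item is represented by its attribute set.
items = [{"jazz", "vocal"}, {"jazz"}, {"jazz", "vocal"}, {"classical"}]
# "vocal" splits the candidates in half, so it carries the most uncertainty.
best = pick_attribute_to_ask(items, ["jazz", "classical", "vocal"])
```

Whatever the user answers about the chosen attribute, roughly half of the candidates can be pruned, which is exactly why a maximum-entropy question is efficient.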

Section Summary
We list the common CRS models in Table 1, where the models are characterized by different dimensions, which are the asking entity (item or attribute), the asking mechanism, the type of user feedback, and the multi-turn strategy that we will describe in the next section.
In most interactive recommendations [232,176,205,40] and critiquing methods [23,188,108,107], the system keeps asking questions, and each question is followed by a recommendation. This process only terminates when users quit, either satisfied or impatient. The setting is unnatural and will likely hurt the user experience during the interaction process: asking too many questions may turn the interaction into an interrogation. Moreover, during the early stages of interaction, when the system has not yet confidently modeled the user preferences, recommendations with low confidence should not be exposed to the user [150].
In other words, there should be a multi-turn conversational strategy to control how to switch between asking and recommending, and this strategy should change dynamically in the interaction process.

Multi-turn Conversational Strategies for CRSs
Question-driven methods focus on the problem of "What to ask", and the multi-turn conversational strategies discussed in this section focus on "When to ask" or, from a broader perspective, "How to maintain the conversation". A good strategy can not only make the recommendation at the proper time (with high confidence) and adapt flexibly to users' feedback, but also maintain the conversation topics and adapt to different scenarios to make users feel comfortable in the interaction.

Table 1: Characteristics of common CRS models in different dimensions. The strategy indicates whether the work considers an explicit strategy to control multi-turn conversations, e.g., whether to ask or recommend in the current turn.

  Type of user feedback:
    Providing an utterance — Yes [25,104]; No [223,98]
    Providing preferred attribute values — Yes [192]; No [123]

Conversation Strategies for Determining When to Ask and Recommend
Most CRS models do not carefully consider a strategy for determining whether to continue interrogating users by asking questions or to make a recommendation. However, a good strategy is essential in the interaction process to improve the user experience. The strategy can be a rule-based policy, i.e., making a recommendation after every fixed number of question-asking turns [207], a random policy [31], or a model-based policy [31].
In the SAUR model [208], a trigger is set to activate the recommendation module when the confidence is high. The trigger is simply implemented as a sigmoid function on the score of the most probable item, i.e., if the score of the candidate item is high enough, then the recommendation step is triggered, else the system will keep asking questions.
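A minimal sketch of such a score-based trigger, with hypothetical threshold and bias values (the actual SAUR trigger is learned end-to-end):

```python
import math

# Hedged sketch of a SAUR-style recommendation trigger: a sigmoid is applied
# to the score of the top-ranked item, and the system recommends only when
# the resulting confidence passes a threshold; otherwise it keeps asking.

def should_recommend(top_item_score, threshold=0.8, bias=-2.0):
    confidence = 1.0 / (1.0 + math.exp(-(top_item_score + bias)))
    return confidence >= threshold

# Low score -> keep asking; high score -> trigger the recommendation module.
decisions = [should_recommend(0.5), should_recommend(5.0)]
```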
Though straightforward and easy to control, these strategies cannot capture rich semantic information, e.g., which topics are being discussed now or how deeply the topics have been explored. This information can directly affect the conversation topic; thereby, a more sophisticated strategy is necessary. Recently, reinforcement learning (RL) has been adopted by many interactive recommendation models for its potential in modeling complex environments [214,20,191,216,231,24,68,97,131,205,226]. Therefore, it is natural to incorporate RL into the CRS framework [161,88,89,167,141,76]. For instance, Sun and Zhang [161] propose a conversational recommender model (CRM) that uses the architecture of a task-oriented dialogue system. In CRM, a belief tracker is used to track the user's input, and it outputs a latent vector representing the current state of the dialogue and the user preferences that have been captured so far. Afterward, the state vector of the belief tracker is input into a deep policy network to decide whether to recommend an item or to keep asking questions. Specifically, with m facets there are m + 1 actions: m actions for choosing one facet to ask about, and the last one yields a recommendation. The deep policy network uses the policy gradient method to make decisions. Finally, the model receives rewards from the environment, which include user feedback towards the questions and the reward from the automatic evaluation of recommendation results.
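The (m + 1)-way action space can be illustrated with a toy policy (a plain softmax over illustrative logits, standing in for CRM's trained deep policy network):

```python
import math

# Hedged sketch of the CRM action space: with m facets there are m + 1
# actions, the first m asking about a facet and the last yielding a
# recommendation. Logit values here are purely illustrative.

def softmax(logits):
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def choose_action(logits, facets):
    """logits has len(facets) + 1 entries; the extra entry means 'recommend'."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return facets[idx] if idx < len(facets) else "recommend"

facets = ["genre", "artist", "era"]
action = choose_action([0.2, 1.5, 0.1, 0.9], facets)   # highest logit: ask "artist"
```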
• We propose a three-stage solution, EAR, integrating and revising several RC and CC techniques to construct a solution that works well for conversational recommendation.
• We build two CRS datasets by simulating user conversations to make the task suitable for offline academic research. We show that our method outperforms several state-of-the-art CRS methods and provide insights on the task. Datasets and source code will be released to promote further studies.

MULTI-ROUND CONVERSATIONAL RECOMMENDATION SCENARIO
Following [8], we denote one trial of recommendation as a round. This paper considers conversational recommendation as an inherently multi-round scenario, where a CRS interacts with the user by asking attributes and recommending items multiple times until the task succeeds or the user leaves. To distinguish the two, we term the setting single-round where the CRS only makes recommendations once, ending the session regardless of the outcome, as in [8,31]. We now introduce the notation used to formalize our setting. Let u ∈ U denote a user from the user set U and v ∈ V denote an item from the item set V. Each item v is associated with a set of attributes P_v which describe its properties, such as the music genres "classical" or "jazz" for songs in LastFM, or tags such as "nightlife", "serving burgers", or "serving wines" for businesses in Yelp. We denote the set of all attributes as P and use p to denote a specific attribute. Following [31,40], a CRS session starts with u's specification of a preferred attribute p_0; the CRS then filters out candidate items that contain the preferred attribute p_0. Then in each turn t (t = 1, 2, ..., T; T denotes the last turn of the session), the CRS needs to choose an action: recommend or ask.
• If the action is recommend, we denote the recommended item list as V_t ⊂ V and the action as a_rec. The user then examines whether V_t contains his desired item. If the feedback is positive, the session succeeds and can be terminated. Otherwise, we mark V_t as rejected and move to the next round.
• If the action is ask (where the asked attribute is denoted as p_t ∈ P and the action as a_ask(p_t)), the user states whether he prefers items that contain the attribute p_t or not. If the feedback is positive, we add p_t to P_u, the set of preferred attributes.
[Figure 5: The estimation-action-reflection workflow. Credits: Lei et al. [88].]
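The multi-round loop above can be simulated in a few lines (a hypothetical greedy policy and a simulated user of my own; a real CRS learns both sides):

```python
# Hedged sketch of a multi-round session: the session starts from a
# user-specified preferred attribute, then alternates ask/recommend actions
# until the target item is recommended or the turn budget runs out.

def run_session(items, preferred_attrs, target_item, max_turns=10):
    candidates = [i for i in items if preferred_attrs <= items[i]]
    asked = set(preferred_attrs)
    for turn in range(1, max_turns + 1):
        if len(candidates) <= 1:                       # action: recommend
            return (turn, "success") if target_item in candidates else (turn, "fail")
        askable = set().union(*(items[i] for i in candidates)) - asked
        if not askable:                                # nothing left to ask
            return (turn, "success") if target_item in candidates else (turn, "fail")
        attr = sorted(askable)[0]                      # stand-in for a learned policy
        asked.add(attr)
        if attr in items[target_item]:                 # simulated user says "yes"
            candidates = [i for i in candidates if attr in items[i]]
    return (max_turns, "fail")

# Items mapped to their attribute sets; the user starts by specifying "dance".
items = {
    "song_a": {"dance", "pop"},
    "song_b": {"dance", "electronic"},
    "song_c": {"rock"},
}
result = run_session(items, preferred_attrs={"dance"}, target_item="song_b")
```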
The CRM state is a latent vector capturing the information of facet-value pairs, which is hard to interpret. In this respect, some studies explore better ways to construct the RL state so that the multi-turn conversation strategy can better adapt to a dynamic environment. For example, Lei et al. [88] propose an Estimation-Action-Reflection (EAR) framework, which assumes that the model should only ask questions at the right time. The right time, in their definition, is when (1) the item candidate space is small enough; (2) asking additional questions is determined to be less useful or helpful, from the perspective of either information gain or user patience; and (3) the recommendation engine is confident that the top recommendations will be accepted by the user. The workflow of the EAR framework is illustrated in Figure 5, where the system has to decide whether to continue asking questions about attributes or to make a recommendation based on the available information.
To determine when to ask a question, they construct the state of the RL model from four factors:
• Entropy information of each attribute among the attributes of the current candidate items. Asking about attributes with a large entropy helps to reduce the candidate space, which benefits finding the desired items in fewer turns.
• User preference on each attribute. An attribute with a high predicted preference is likely to receive positive feedback, which also helps to reduce the candidate space.
• Historical user feedback. If the system has asked about a number of attributes that the user has approved, it may be a good time to recommend.
• Number of remaining candidates. If the candidate list is short enough, the system should turn to recommending to avoid wasting more turns.
Building on these vectors capturing the current state, the RL model learns the proper timing to ask or recommend, which is more intelligent than a fixed heuristic strategy.
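The four state factors can be assembled into a toy state vector as follows (each factor reduced to a single scalar for illustration; EAR itself uses richer learned vectors):

```python
import math

# Hedged sketch of an EAR-style RL state built from the four factors above;
# all reductions (averages, normalization constant) are my own choices.

def build_state(candidates, attr_prefs, accepted_attrs, max_candidates=100):
    def entropy(p):
        return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    attrs = set().union(*candidates)
    # 1. average entropy of attributes over the current candidate items
    ent = sum(entropy(sum(a in c for c in candidates) / len(candidates))
              for a in attrs) / len(attrs)
    # 2. predicted user preference on attributes (mean of supplied scores)
    pref = sum(attr_prefs.values()) / len(attr_prefs)
    # 3. historical feedback: attributes the user has already confirmed
    history = len(accepted_attrs)
    # 4. number of remaining candidates, normalized
    remaining = len(candidates) / max_candidates
    return [ent, pref, history, remaining]

candidates = [{"jazz"}, {"jazz", "vocal"}]
state = build_state(candidates, {"jazz": 0.9, "vocal": 0.4}, {"jazz"})
```

An RL policy fed with such a state can learn, for example, that a low `remaining` value or a long `history` should push the action toward recommending.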
During the conversation, the recommendation module takes the items in the previous list of recommendations that are not chosen by users as negative samples. However, Lei et al. [88] mention that this setting deteriorates the performance of the recommendation results. The reason, as they analyze it, is that rejecting a recommended item does not mean that the user dislikes it: the user may actually like it but have overlooked it, or may simply want to try other new things.
Furthermore, Lei et al. [89] extend the EAR model by proposing the CPR model. By integrating a knowledge graph consisting of users, items, and attributes, they model conversational recommendation as an interactive path reasoning problem on the graph. A toy example of a conversation generated by the CPR model is shown in Figure 4. Unlike the EAR model, where the attributes to ask about are selected without constraints from all attribute candidates, CPR chooses the attributes to ask and the items to recommend strictly following the paths on the knowledge graph, which renders the results interpretable.
In terms of the timing to ask or recommend, CPR makes an important improvement: the action space of the RL policy contains only two actions, asking an attribute or making item recommendations. This largely reduces the difficulty of learning the RL policy. The CPR model is much more efficient than the EAR model because the searching space of attributes in CPR is constrained by the graph. The integration of knowledge improves the multi-turn conversational reasoning ability.

Conversation Strategies from A Broader Perspective
Although learning from query-answering interactions can enable the system to understand and respond to human queries directly, the system still lacks intelligence. One reason is that most CRS models assume that users always bear in mind what they want, so that the task is simply to obtain the preference by asking questions. However, users who resort to recommendation might not have a clear idea about what they really want, just like a human asking a friend for restaurant suggestions: before that, he may not have a certain target in mind, and his decision can be affected by his friend's opinions. Therefore, CRSs should not only ask clarifying questions and interrogate users, but also take responsibility for leading the topics and influencing users' minds. Towards this objective, some studies try to give CRSs certain personalities or endow CRSs with the ability to lead the conversation, which can make the dialogues more attractive and engaging. Similar efforts can also be found in the field of proactive conversation [122,189,4].

Multi-topic Learning in Conversations
Borrowing the idea from proactive conversation, Liu et al. [104] present a new task which places conversational recommendation in the context of multi-type dialogues. In their model, the system can proactively and naturally lead a conversation from a non-recommendation dialogue (e.g., question answering or chitchat) to a recommendation dialogue, taking into account the user's interests and feedback. During the interaction, the system can learn to flexibly switch between multiple goals. To address this task, they propose a multi-goal driven conversation generation (MGCG) framework, which consists of a goal-planning module and a goal-guided responding module. The goal-planning module conducts dialogue management to control the dialogue flow: it takes recommendation as the main goal and completes the natural topic transitions as short-term goals. Specifically, given the user's historical utterances as context X and the last goal g_{t-1}, the module estimates the probability of changing the goal, i.e., P(g_t ≠ g_{t-1} | X, g_{t-1}). In downstream tasks, the goal is changed when this probability exceeds 0.5. Based on the current goal, the framework produces responses from an end-to-end neural network.

Table 2: The commonly used multi-turn strategies in CRSs.

  Main Mechanism                Asking Method                           When to Ask and Recommend   Publication
  Asking questions (explicit)   Asking 1 turn; recommending 1 turn      Fixed                       [31,203]
  Asking questions (explicit)   Asking turn(s); recommending 1 turn     Fixed                       [230]
                                                                        Adaptive                    [161]
  Asking questions (explicit)   Asking turn(s); recommending turn(s)    Adaptive                    [88,89,95]
  Asking questions (implicit)   Contained in natural language           Adaptive                    [94,25,223]
  Leading diverse topics or exploring special abilities                                             [104,225,143,90,185]
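The goal-switching decision can be sketched with a toy logistic estimator standing in for MGCG's learned goal-planning module (features and weights are illustrative, not from the paper):

```python
import math

# Hedged sketch: score the probability that the conversation goal should
# change given some context features, and switch when it exceeds 0.5.

def goal_change_prob(features, weights, bias=0.0):
    """Toy logistic estimator standing in for the learned module."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def plan_goal(features, weights, last_goal, next_goal):
    return next_goal if goal_change_prob(features, weights) > 0.5 else last_goal

# Hypothetical features: (user mentioned a movie entity, turns on current goal)
goal = plan_goal([1.0, 3.0], [1.2, 0.4], "chitchat", "recommendation")
```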
Learning a multi-type conversational model requires a dataset that supports multi-type dialogues. Therefore, Liu et al. [104] create a dataset, denoted as DuRecDial, with various types of interaction. In DuRecDial, two human workers are asked to conduct the conversation based on a given profile, which contains the information of age, gender, occupation, preferred domains, and entities. The workers must produce utterances that are consistent with their given profiles, and they are encouraged to produce utterances with diverse goals, e.g., question answering, chitchat, or recommendation. Then these dialogue data are labeled with goals and goal descriptions by templates and human annotation.
Further, Zhou et al. [225] release a topic-guided conversational recommendation dataset. They collect review data from Douban Movie, a movie review website, to construct the recommended movies, topic threads, user profiles, and utterances, and they associate each movie with concepts in ConceptNet [156], a commonsense knowledge graph, to provide rich topic candidates. They then use rules to generate multi-turn conversations with diverse topics based on the user profile and topic candidates. On top of the proposed dataset, a new task of topic-guided conversational recommendation is defined: given the user profile, the user's interaction sequence, and the historical utterances, the system should predict the next topics and ultimately make the recommendation.

Special Ability: Suggesting, Negotiating, and Persuading
There are miscellaneous tasks beyond preference elicitation and recommendation for an intelligent interactive system, which require the CRS to possess different abilities to react in different scenarios. This is a high-level and abstract requirement. A lot of effort has been put into helping the machine improve its topic-guiding ability. For instance, in conversational search, where traditional work has mainly attempted to better understand a user's information needs by resolving ambiguity, Rosset et al. [143] propose to lead the conversation with questions that the user may want to ask next. For example, if a user queried "Nissan GTR Price", then the system can provide question suggestions, including those that help the user complete a task ("How much does it cost to lease a Nissan GT-R?"), weigh options ("What are the pros and cons of the Nissan GT-R?"), explore an interesting related topic ("Is the Nissan GT-R the ultimate streetcar?"), or learn more details ("How much does the 2020 Nissan GTR cost?"). These question suggestions can lead the user to an immersive search experience with diverse and fruitful future outcomes.
In addition, Lewis et al. [90] propose a system that is capable of engaging in negotiations with users. They define the problem as an allocation problem: there are some items that need to be allocated between two people, where each item has a different value for each person, and neither knows the values of the other. Hence, the two people have to converse and negotiate with each other to reach an agreement on the division of these items. Instead of optimizing relevance-based likelihood, the model should pursue maximal profit for both parties. The authors use RL to tackle this problem, and they interleave RL updates with supervised updates to prevent the model from diverging from human language.
Wang et al. [185] develop a model that tries to persuade users to take certain actions, which is very promising for conversational recommendation. They train the model, according to conversational contexts, to learn and predict the ten persuasion strategies (e.g., logical appeal or emotional appeal) used in the corpus, and they analyze which strategies work better conditioned on the background (personality, morality, value systems, willingness) of the user being persuaded.
Though some of these efforts are applied to specific application scenarios in dialogue systems, these techniques can be adopted in the multi-turn strategy in CRSs and thus push the development of CRSs.

Section Summary
The multi-turn conversation strategies of CRSs discussed in this section are summarized in Table 2. The main focus of the conversation strategy is to determine when to elicit user preference by asking questions and when to make recommendations. As a recommendation should only be made when the system is confident, an adaptive strategy can be more promising compared to a static one. Besides this core function, we introduce some strategies from a broader perspective. These strategies can extend the capability of CRSs by means of leading multi-topic conversations [104,225] or showing special ability such as suggesting [143], negotiating [90], and persuading [185].

Dialogue Understanding and Generation in CRSs
An important topic for CRSs is conversing with humans in natural language; thus, understanding human intentions and generating human-understandable responses are critical. However, most CRSs only extract key information from preprocessed structural data and present the result via rule-based template responses [208,230,88,89]. This not only requires a lot of labor to construct the rules or templates but also makes the result rely on the preprocessing. Recently, we have witnessed the development of end-to-end learning frameworks in dialogue systems, which have been studied for years to automatically handle the semantic information in raw natural language [50]. We introduce these natural language processing (NLP) technologies for dialogue systems and describe how they help CRSs understand user intention and sentiment and generate meaningful responses.

Dialogue Understanding
Understanding users' intentions is a key requirement for the user interface of a CRS, as downstream tasks, e.g., recommendation, rely heavily on this information. However, most CRSs pay attention to the core recommendation logic and the multi-turn strategy, while they circumvent extracting user intention from raw utterances and require preprocessed inputs such as rating scores [215,32,233,87], YES/NO answers [230,88,89], or another type of value or orientation [31,208] towards the queried items or attributes. This is unnatural in real-life human conversation and imposes constraints on user expression. Thereby, it is necessary to develop methods to extract the semantic information in users' raw language input, either in an explicit or an implicit way.
We introduce how dialogue systems use NLP technologies to address this problem and give examples of CRSs that use these technologies to understand user intention.

Slot Filling
A common way for dialogue systems to extract useful information is to predefine some aspects of interest and use a model to fill in the values of these aspects from users' input, a.k.a. slot filling [36,37,198,118,197,130]. Sun and Zhang [161] first consider extracting semantic information from the raw dialogue in CRSs. They propose a belief tracker to capture facet-value pairs, e.g., (color, red), from user utterances. Specifically, given a user utterance e_t at time step t, the input to the belief tracker is the n-gram vector n_t = n-gram(e_t), whose dimension is the corpus size; only the positions corresponding to the words in utterance e_t are set to 1, while the other positions are set to 0. Suppose there are M types of facet-value pairs; for a given facet m ∈ {1, 2, …, M}, the user's sequential utterances e_1, e_2, ⋯, e_t are encoded by an LSTM model [62] to learn a latent vector for this facet. The size of this vector is set to the number of values, e.g., the number of available colors. The vector capturing the facet-value information is later used in the recommendation module and the policy network. Besides, Ren et al. [141] and Tsumita and Takagi [167] also employ recurrent neural network (RNN)-based methods to extract the facet-value information as input for downstream tasks in their CRSs.
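The n-gram input vector can be sketched as follows (toy vocabulary of my own; the per-facet LSTM that consumes these vectors is omitted):

```python
# Hedged sketch of the belief-tracker input: the utterance is mapped to a
# bag-of-n-grams indicator vector over a fixed vocabulary, with 1 marking
# n-grams present in the utterance and 0 elsewhere.

def ngram_vector(utterance, vocab, n=2):
    tokens = utterance.lower().split()
    grams = set(tokens)                                                  # unigrams
    grams |= {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}  # bigrams
    return [1 if g in grams else 0 for g in vocab]

# Toy vocabulary mixing unigrams and a bigram.
vocab = ["red", "shoes", "red shoes", "blue"]
vec = ngram_vector("I want red shoes", vocab)
```

In the real system, the vector dimension equals the corpus size, so the vector is extremely sparse; a sequence of such vectors over the dialogue turns feeds the facet-specific LSTM.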
However, explicitly modeling semantic information as aspect-value pairs can be a limitation in scenarios where doing so is difficult and also unnecessary. Besides, aspect-value pairs cannot precisely express information such as user intent or sentiment. Therefore, some recent CRSs use end-to-end neural frameworks to implicitly learn representations of users' intentions and sentiment.

Intentions and Sentiment Learning
Neural networks are famous for extracting features automatically, so they can be used to extract users' intentions and sentiment in CRSs. A classic example in CRSs is the end-to-end framework proposed by Li et al. [94], which takes the user's raw utterances as input and directly produces the responses in the interaction. They collect the REDIAL dataset through the crowdsourcing platform Amazon Mechanical Turk (AMT). They pair up AMT workers and give each of them a role. The movie seeker has to explain what kind of movie he/she likes and asks for movie suggestions. The recommender tries to understand the seeker's movie tastes and recommends movies. All exchanges of information and recommendations are made using natural language; every movie mention is tagged using the "@" symbol to let the machine know it is a named entity. In this way, the dialogues in the REDIAL data contain the required semantic information that can help the model learn to answer users with recommendations and reasonable explanations. In addition, three questions are asked to provide labels for supervised learning: (1) Whether the movie was mentioned by the seeker, or was a suggestion from the recommender ("suggested" label). (2) Whether the seeker has seen the movie ("seen" label): one of Have seen it, Haven't seen it, or Didn't say.
(3) Whether the seeker liked the movie or the suggestion ("liked" label): one of Liked, Didn't like, or Didn't say. The three labels are collected from both the seeker and the recommender.
In this way, although the facet-value constraints are removed, all kinds of information, including mentioned items and attributes, user attitude, and user interest, are preserved and labeled in the raw utterances. The CRS model needs to directly learn users' sentiment (or preferences), and it makes recommendations and generates responses based on the learned sentiment. The deep neural network-based model consists of four parts: (1) A hierarchical recurrent encoder, implemented as a bidirectional GRU [29], that transforms the raw utterances into a latent vector with the key semantic information retained. (2) Each time a movie entity is detected (via the "@" identifier convention), an RNN model is instantiated to classify the seeker's sentiment or opinion regarding that entity. (3) An autoencoder-based recommendation module that takes the sentiment prediction as input and produces an item recommendation. (4) A switching decoder that generates the response and decides whether the name of the recommended item is included in the response. The model generates a complete sentence, which might contain a recommended item, to answer each user utterance.
Besides RNN-based neural networks, some CRSs adopt convolutional neural network (CNN) models [141,104], which have been proven to be very effective for modeling the semantics of raw natural language [80]. However, deep neural networks are often criticized for being non-transparent and hard to interpret [11]. It is not clear how deep language models can help CRSs understand user needs.
To answer this question, Penha and Hauff [132] investigate bidirectional encoder representations from transformers (BERT) [38], a powerful NLP pretraining technology developed by Google, to analyze whether its parameters can capture and store semantic information about items such as books, movies, and music for CRSs. The semantic information includes two kinds of knowledge needed for conducting conversational search and recommendation, namely content-based and collaborative-based knowledge. Content-based knowledge requires the model to match the titles of items with their content information, such as textual descriptions and genres. In contrast, collaborative-based knowledge requires the model to match items with similar ones, according to community interactions such as ratings. The authors use three probes on the BERT model (i.e., tasks that examine a trained model regarding certain properties) to achieve this goal. The results show that both collaborative-based and content-based knowledge can be learned and remembered. Therefore, the end-to-end language model has potential as part of CRS models to interact with humans directly in real-world applications with complex contexts.

Response Generation
A natural language-based response of a CRS should meet at least two levels of standards: the lower-level standard requires the generated language to be proper and correct; the higher-level standard requires that the response contain meaningful and useful information about the recommended results.

Generating Proper Utterances in Natural Language
Many CRSs use template-based methods to generate responses in conversations [161,88,89]. However, template-based methods suffer from producing repetitive and inflexible output, and they require intense manual work. Besides, template-based responses can make users uncomfortable and hurt the user experience. Hence, it is important to automate response generation in CRSs to produce proper and fluent responses. This is also the objective of dialogue systems, so we introduce two veins of technologies for producing responses in dialogue systems. Retrieval-based Methods. The basic idea is to retrieve the appropriate response from a large collection of response candidates. This can be formulated as a matching problem between an input user query and the candidate responses. The most straightforward method is to measure the inner product of the feature vectors representing a query and a response [190]. A key challenge is to learn a proper feature representation [190]. One strategy is to use neural networks to learn representation vectors for the user query and each candidate response, respectively; a matching function then combines the two representations and outputs a matching probability [63,164,139,45,175]. An alternative strategy is to combine the representation vectors of the query and the response first, and then apply a neural method on the combined representation pair to further learn the interaction [181,174,127,106]. These two strategies have their own advantages: the former is more efficient and suitable for online serving, while the latter is more effective, since the matching information is sufficiently preserved and mined [190].
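The simplest inner-product matcher can be sketched with toy bag-of-words vectors standing in for the learned neural representations:

```python
# Hedged sketch of retrieval-based response selection: score each candidate
# response by the inner product of fixed feature vectors; here the vectors
# are toy bag-of-words counts rather than learned embeddings.

def bow(text, vocab):
    tokens = text.lower().split()
    return [tokens.count(w) for w in vocab]

def best_response(query, candidates, vocab):
    q = bow(query, vocab)
    def score(response):
        return sum(a * b for a, b in zip(q, bow(response, vocab)))
    return max(candidates, key=score)

vocab = ["movie", "comedy", "dinner", "recommend"]
response = best_response(
    "recommend a comedy movie",
    ["how about dinner", "try this comedy movie"],
    vocab,
)
```

Replacing `bow` with a neural encoder and the inner product with a learned matching function recovers the two representation-based strategies described above.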
Generation-based Methods. Unlike retrieval-based methods, which select existing responses from a database of template responses, generation-based methods directly produce a complete sentence from the model. The basic generation model is a recurrent sequence-to-sequence model, which sequentially feeds in each word of the query as input and then generates the output words one by one [162]. Compared to retrieval-based methods, generation-based methods face several challenges. First, the generated answer is not guaranteed to be a well-formed natural language utterance [194]. Second, even when the generated response is grammatically correct, we can still distinguish a machine-generated utterance from a human-generated one, since the machine response lacks basic commonsense [201,221], personality [136,217], and emotion [220]. Even worse, generation models are prone to produce safe answers, such as "OK" or "I don't understand what you are talking about," which fit almost any conversational context but only hurt the user experience [91,138]. Ke et al. [79] propose to explicitly control the function of the generated sentence: for the same user query, the system can answer in different tones. The interrogative tone can be used to acquire further information; the imperative tone to make requests, directions, instructions, or invitations that elicit further interactions; and the declarative tone to make statements or explanations. Another problem is how to evaluate the generated response, since there is no standard answer; we further discuss this in Section 6.
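The safe-answer phenomenon is easy to reproduce with a toy decoder. Below, the next-token probabilities are invented purely for illustration; greedy decoding collapses to a generic reply, while banning known generic tokens yields a contentful one:

```python
# Toy next-token distributions standing in for a trained seq2seq decoder
# (all probabilities are made up). Greedy decoding picks the argmax at every
# step; because generic tokens ("ok") receive probability mass in almost any
# context, greedy search tends to collapse to such "safe answers".
next_token_probs = {
    "<start>":   {"ok": 0.40, "i": 0.35, "sure": 0.25},
    "ok":        {"<end>": 0.90, "then": 0.10},
    "then":      {"<end>": 1.00},
    "sure":      {"<end>": 1.00},
    "i":         {"recommend": 0.70, "agree": 0.30},
    "agree":     {"<end>": 1.00},
    "recommend": {"inception": 0.60, "<end>": 0.40},
    "inception": {"<end>": 1.00},
}

def decode(policy, max_len=10):
    """Generate a response one token at a time, like an RNN decoder."""
    token, out = "<start>", []
    while len(out) < max_len:
        token = policy(next_token_probs[token])
        if token == "<end>":
            break
        out.append(token)
    return " ".join(out)

greedy = lambda dist: max(dist, key=dist.get)
# A crude fix: forbid known "safe" tokens, forcing a contentful reply.
avoid_safe = lambda dist: max((t for t in dist if t not in {"ok", "sure"}),
                              key=dist.get)
print(decode(greedy))      # -> "ok"
print(decode(avoid_safe))  # -> "i recommend inception"
```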
Researchers borrow ideas from dialogue systems and apply these technologies in the user interface of CRSs. For instance, Li et al. [94] generate responses with a decoder in which a GRU model [29] decodes the context from the previous component (i.e., the predicted sentiment towards items) to predict the next utterance step by step. Liu et al. [104] adopt the responding model of Wu et al. [189] and propose both a retrieval-based model and a generation-based model to produce responses in their CRS.
However, a correct sentence does not mean it can fulfill the recommendation task; at the very least, the name of the recommended entity should be mentioned in the generated sentence. Hence, Li et al. [94] use a switch to decide whether the next predicted word is a movie name or an ordinary word; Liu et al. [104] introduce an external memory module that stores all related knowledge, enabling the model to select appropriate knowledge for proactive conversations. Besides, other efforts aim to guarantee that generated responses are not only proper and accurate but also meaningful and useful.
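The switch idea can be sketched as mixing two output distributions. Everything below (vocabulary, probabilities, threshold) is an invented toy, not the exact mechanism of Li et al. [94]:

```python
import numpy as np

vocab = ["i", "recommend", "watching", "tonight"]   # ordinary words
movies = ["Inception", "Titanic"]                    # entity (movie) names

def next_word(p_vocab, p_movie, switch_prob):
    """Hypothetical switch mechanism: a learned scalar decides whether the
    next token comes from the ordinary-word decoder or from the
    recommender's distribution over movie names."""
    assert np.isclose(p_vocab.sum(), 1) and np.isclose(p_movie.sum(), 1)
    if switch_prob >= 0.5:                 # switch fires: emit a movie name
        return movies[int(np.argmax(p_movie))]
    return vocab[int(np.argmax(p_vocab))]  # otherwise: emit an ordinary word

p_vocab = np.array([0.1, 0.6, 0.2, 0.1])   # decoder favours "recommend"
p_movie = np.array([0.8, 0.2])             # recommender favours "Inception"
print(next_word(p_vocab, p_movie, switch_prob=0.9))  # -> Inception
print(next_word(p_vocab, p_movie, switch_prob=0.1))  # -> recommend
```

In the real model the switch is a learned probability and both distributions come from trained networks; the point is only that recommended entities enter the utterance through a separate channel from ordinary vocabulary.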

Incorporating Recommendation-oriented Information
There is a major limitation of CRSs that use end-to-end frameworks as the user interface: only items mentioned in the training corpus have a chance of being recommended, since items that have never been mentioned are not modeled by the end-to-end model. Therefore, the performance of this method is greatly limited by the quality of the human recommendations in the training data. To overcome this problem, Chen et al. [25] propose to incorporate domain knowledge to assist the recommendation engine. The incorporation of a knowledge graph mutually benefits the dialogue interface and the recommendation engine in the CRS: (1) the dialogue interface helps the recommender engine by linking related entities in the knowledge graph, where the recommendation model is based on the R-GCN model [149] to extract information from the knowledge graph; (2) the recommender system also helps the dialogue interface: by mining words with high probability, the dialogue can connect movies with certain biased vocabularies and thus produce consistent and interpretable responses.
Following this line, Zhou et al. [223] point out the remaining problems of the dialogue interface in CRSs. Although Chen et al. [25] introduced an item-oriented knowledge graph to enable the system to understand movie-related concepts, the system still cannot comprehend some words in the raw utterances, for example, "thriller", "scary", or "good plot". In essence, the problem originates from the fact that the dialogue component and the recommender component correspond to two different semantic spaces, namely word-level and entity-level semantic spaces. Therefore, Zhou et al. [223] incorporate and fuse two special knowledge graphs, i.e., a word-oriented graph (ConceptNet [156]) and an item-oriented graph (DBpedia [8]), to enhance semantic understanding in both components. The representations of the same concepts in the two knowledge graphs are forced to align with each other via the mutual information maximization (MIM) technique [170,199]. Furthermore, a self-attention-based recommendation model is proposed to learn the user preference and adjust the representations of the corresponding entities in the knowledge graph. Then, equipped with these representations containing both semantics and users' historical preferences, the authors use an encoder-decoder model to extract user intention from the raw utterances and directly generate responses containing recommended items. Besides, some researchers try to improve the diversity or explainability of the generated responses in CRSs. For example, Liu et al. [104] propose multi-topic learning that can handle diverse dialogue types in CRSs. To enhance the interpretability of CRSs, Chen et al. [27] design an incremental multi-task learning framework that integrates review comments as side information; the CRS can thus simultaneously produce a recommendation and a sentence as an explanation, e.g., "I recommend Mission Impossible, because it is by far the best of the action series." Moreover, Luo et al.
[108] use a VAE-based architecture to learn a latent representation for generating recommendations and fitting user critiques. Therefore, their model can better understand users' intentions from raw comments and thus generate more interpretable responses. Gao et al. [51] consider attribute and review information and rewrite a coherent and meaningful answer from a selected prototype answer, which can address the safe-answer problem in the response [91,138].

Section Summary
We classify CRSs in Table 3 in terms of two dimensions: where the input comes from and how the output is generated. Generally, interactive recommendations [232,176,205,40], critiquing methods [23,188,108,107], and CRSs focusing on the multi-turn conversation strategy [32,31,88,89,95] tend to use pre-annotated input and rule-based or template-based output; dialogue systems [201,221,51] and CRSs emphasizing dialogue ability [94,25,223] are more likely to take raw natural language as input and automatically generate responses. In the future, user understanding and response generation in CRSs will remain a critical research field, as they serve as the interface of CRSs and directly impact the user experience.

Exploration-Exploitation Trade-offs
One challenge for CRSs is handling cold-start users who have few historical interactions. A natural way to tackle this is the idea of the Exploration-Exploitation (E&E) trade-off. With exploitation, the system takes advantage of the best option currently known; with exploration, the system takes some risk to collect information about unknown options. To achieve long-term optimization, one might make short-term sacrifices: in the early stages of E&E, an exploration trial may fail, but it teaches the model not to take that action too often in the future. Although the E&E trade-off is mainly used for the cold-start scenario in CRSs, it can also be used to improve recommendation performance for all users (both cold-start and warmed-up users) in recommender systems.
The multi-armed bandit (MAB) is a classic problem formulated to illustrate the E&E trade-off, and many algorithms have been proposed to solve it. In CRSs, MAB-based algorithms are introduced to help the system improve its recommendations. Besides, some CRSs use meta learning to balance E&E. We first introduce MAB and common MAB-based algorithms in recommender systems, then we present examples of how CRSs balance E&E in their models.

Multi-Armed Bandits in Recommendation
We first introduce the general MAB problem and the classic methods, then we introduce how recommender systems use MAB-based methods to achieve the E&E balance.

Introduction to Multi-Armed Bandits
MAB is a classic reinforcement learning problem that well demonstrates the E&E dilemma [77,3]. The name comes from the scenario in which a gambler at a row of slot machines (each known as a "one-armed bandit") wants to maximize his expected gain and has to decide which machines to play, how many times to play each machine, in which order to play them, and whether to continue with the current machine or try a different one. The problem is difficult because all the slot machines are black boxes whose properties, i.e., the probabilities of winning, can only be estimated from the rewards observed in previous trials.
Formally, the problem is to maximize the cumulative reward $\sum_{t=1}^{T} r_{a(t),t}$ after $T$ rounds of arm selection, where $r_{a,t}$ is the reward obtained when arm $a$ is selected at trial $t$ and $N$ is the total number of arms. Figure 6 illustrates an example in which a gambler decides which arm to choose. For each arm, a reward distribution is estimated from previous trials. The gambler can simply select the second arm, which has the maximal mean reward $Q(a)$, or he can take the uncertainty $\Delta(a)$ into account and select the third arm, which has the maximal upper confidence bound $Q(a) + \Delta(a)$. Each time he plays an arm, a new reward value is observed, and the reward distribution of that arm is updated accordingly.
Equivalently, the problem can be formulated as minimizing the regret, i.e., the difference between the expected cumulative reward of the true optimal arm and that of the selected arms: $R(T) = \mathbb{E}\left[\sum_{t=1}^{T} \left(r_{a^*,t} - r_{a(t),t}\right)\right]$, where $a^*$ is the arm with the maximum expected payoff at all times. Commonly used bandit strategies include the greedy strategy, i.e., the exploit-only strategy that always selects the arm with the currently highest estimated reward; the random strategy, i.e., a trivial explore-only strategy; and $\epsilon$-greedy, which mixes the greedy and random strategies via a trigger with probability $\epsilon$. Other classic models include the Upper Confidence Bound (UCB) [2,3] and Thompson Sampling (TS) [19], which are introduced next.
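To make the classic strategies concrete, here is a small self-contained simulation of $\epsilon$-greedy and UCB on toy Bernoulli arms. The arm probabilities, horizon, and hyperparameters are invented for illustration:

```python
import math
import random

def run_bandit(true_probs, policy, rounds=5000, seed=0):
    """Simulate Bernoulli arms; return total reward and per-arm pull counts."""
    rng = random.Random(seed)
    n = [0] * len(true_probs)     # pulls per arm
    q = [0.0] * len(true_probs)   # estimated mean reward per arm
    total = 0
    for t in range(1, rounds + 1):
        a = policy(q, n, t, rng)
        r = 1 if rng.random() < true_probs[a] else 0
        n[a] += 1
        q[a] += (r - q[a]) / n[a]  # incremental mean update
        total += r
    return total, n

def eps_greedy(eps):
    def policy(q, n, t, rng):
        if rng.random() < eps:                           # explore
            return rng.randrange(len(q))
        return max(range(len(q)), key=lambda a: q[a])    # exploit
    return policy

def ucb(q, n, t, rng):
    for a, cnt in enumerate(n):   # pull each arm once first
        if cnt == 0:
            return a
    # maximise estimated mean plus confidence width Q(a) + Delta(a)
    return max(range(len(q)),
               key=lambda a: q[a] + math.sqrt(2 * math.log(t) / n[a]))

arms = [0.2, 0.5, 0.7]            # hidden win probabilities; arm 2 is optimal
for name, pol in [("eps-greedy", eps_greedy(0.1)), ("UCB", ucb)]:
    total, pulls = run_bandit(arms, pol)
    print(name, "most-pulled arm:", pulls.index(max(pulls)))
```

With enough rounds, both policies concentrate their pulls on the optimal arm while still spending some trials on the others, which is exactly the regret-minimizing behaviour described above.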

Recommendation via MAB-based Methods
Traditional bandit methods treat items as independent arms and ignore item features [92]. However, directly estimating each item's probability of being chosen from the accumulated rewards is rather inefficient due to the large number of items. In recommendation, there is a rich set of features on users and items, and whether a user $u$ would choose an item $i$ can be predicted from the features of both $u$ and $i$. Motivated by this, Li et al. [92] propose a linear upper confidence bound model called LinUCB, the first bandit model to consider contextual information (i.e., user/item features) in recommender systems. They assume the expected payoff of an arm (item) $a$ is linear in its $d$-dimensional feature vector $\mathbf{x}_{t,a}$ with an unknown coefficient vector $\boldsymbol{\theta}^*_u$; namely, $\mathbb{E}[r_{t,a} \mid \mathbf{x}_{t,a}] = \mathbf{x}_{t,a}^{\top} \boldsymbol{\theta}^*_u$ for all trials $t$, where the feature vector $\mathbf{x}_{t,a}$ contains the features of both the user $u$ and the arm (item) $a$. Using the UCB criterion mentioned above, the picked arm is recommended to the user.
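A compact sketch of the LinUCB idea (a ridge-regression estimate of the coefficients plus a confidence bonus) under a toy linear-reward environment. For simplicity this uses a single shared coefficient vector rather than per-arm parameters; the environment, dimensions, and $\alpha$ value are our own assumptions:

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB-style sketch: each arm's context x combines user and
    item features; the score is the estimated mean payoff plus an
    uncertainty bonus."""
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)        # ridge-regression Gram matrix
        self.b = np.zeros(dim)
        self.alpha = alpha

    def select(self, contexts):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b      # current coefficient estimate
        scores = [x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)
                  for x in contexts]
        return int(np.argmax(scores))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# Toy run: hidden linear payoff theta*, noisy rewards, 5 candidates per round.
rng = np.random.default_rng(0)
theta_star = np.array([0.6, -0.4, 0.2])
bandit = LinUCB(dim=3, alpha=0.5)
for _ in range(500):
    contexts = rng.normal(size=(5, 3))
    a = bandit.select(contexts)
    reward = contexts[a] @ theta_star + 0.1 * rng.normal()
    bandit.update(contexts[a], reward)
theta_hat = np.linalg.inv(bandit.A) @ bandit.b
print(np.round(theta_hat, 2))   # should approach theta_star
```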
There are also studies pointing out that exploration in MABs should be diverse rather than repeatedly recommending similar items [137,103,40]. For instance, Ding et al. [40] consider the fact that users may have different preferences regarding the diversity of items; e.g., a user with a specific interest may prefer a relevant item set to a diverse one, while a user without a specific interest may prefer a diverse item set to explore his interests. Therefore, the authors propose a bandit learning framework that considers the user's preferences on both item relevance features and diversity features. It offers a way to trade off the accuracy and diversity of recommendation results.
Besides, Yu et al. [203] use a cascading bandit in a visual-dialog-augmented interactive recommender system. In cascading bandits, each arm corresponds to a list of recommended items; the user examines the recommended list from the first item to the last and selects the first attractive one [84,229]. This setting is practical for online recommender systems and search engines, and it has an excellent advantage: it provides reliable negative samples, which are critical for recommendation, a problem that has drawn much research attention [22,39,186,96]. Since the system can be sure that the items before the first selected one are not attractive, it can easily obtain reliable negative samples. Another contribution is the use of items' visual appearance and user feedback to design more efficient exploration.
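The negative-sample property of the cascading model can be stated directly in code. A tiny helper, with invented names, that splits a ranked list by the position of the first click:

```python
def split_feedback(ranked_items, first_click_index):
    """In the cascading model the user scans the list top-down and clicks
    the first attractive item: everything ranked above the click is a
    reliable negative sample, and everything below it is unobserved."""
    if first_click_index is None:          # no click: the whole list failed
        return [], list(ranked_items), []
    positives = [ranked_items[first_click_index]]
    negatives = list(ranked_items[:first_click_index])
    unobserved = list(ranked_items[first_click_index + 1:])
    return positives, negatives, unobserved

pos, neg, rest = split_feedback(["a", "b", "c", "d"], first_click_index=2)
print(pos, neg, rest)   # ['c'] ['a', 'b'] ['d']
```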
In addition, there are other efforts to enhance bandit methods in different recommendation scenarios. For instance, Chou et al. [30] observe that a user typically chooses only one or a few arms among the candidates, leaving out the informative non-selected arms. They propose the concept of pseudo-rewards, which embeds estimates of the hidden rewards of non-selected actions under the bandit setting. Wang et al. [180] consider dependencies among items and explicitly formulate the item dependencies as clusters on arms, where arms within a single cluster share similar latent topics; they adopt a generative process based on a topic model, in which dependent arms are assumed to be generated from the same cluster. Yang et al. [196] consider situations with exploration overheads, i.e., non-zero costs associated with executing a recommendation (arm) in the environment, so the policy should be learned under a fixed exploration cost constraint; they propose a hierarchical learning structure to address the problem. Sakhi et al. [145] note that the online bandit signal is sparse and uneven, so they utilize massive offline historical data; the difficulty is that most offline data is irrelevant to the recommendation task, and the authors propose a probabilistic model to address it.
The advantage of multi-armed bandit methods is their ability to conduct online learning, enabling the model to learn the preferences of cold-start users and adjust the strategy quickly after several trials to pursue a global optimum.


Multi-Armed Bandits in CRSs
As the classic algorithm for E&E trade-offs, MAB-based models can be seamlessly plugged into the online recommendation setting [204,216], interactive recommendation [215,177], and CRSs [32,207,95].
The ability to interact with users enables CRSs to directly use MAB-based methods to aid recommendation. A classic MAB-based CRS model is proposed by Christakopoulou et al. [32], which uses naive MAB methods to enhance an offline probabilistic matrix factorization (PMF) model [146]. They first initialize the model parameters using offline data, then leverage real-time user feedback to update the parameters via several common multi-armed bandit models, including the aforementioned greedy strategy, random strategy, UCB [2,3], and TS [19]. On the one hand, online updating improves performance over the initialized model; on the other hand, offline initialization helps the bandit methods reduce computational complexity.
As mentioned above, the original MAB methods ignore item features, which can be very helpful in recommendation. Hence, Zhang et al. [207] propose a conversational upper confidence bound (ConUCB) algorithm that applies the LinUCB model in the CRS context. Instead of asking about items, ConUCB asks the user about one or more attributes (key-terms in their work). Specifically, they assume that user preferences on attributes propagate to items, so the system can analyze user feedback on queried attributes to quickly narrow down the item candidates. The strategies for selecting attributes and arms depend on both the attribute-level and arm-level rewards, i.e., the feedback on attributes and items is absorbed into the model parameters for future use. In addition, the authors employ a hand-crafted function to determine when to ask about attributes or make a recommendation, e.g., asking at fixed intervals of rounds.
However, hand-crafted strategies are fragile and inflexible, as the system should make a recommendation only when its confidence is high. Therefore, Li et al. [95] propose a Conversational Thompson Sampling method (ConTS) to automatically alternate between asking questions about attributes and recommending items. They achieve this goal by unifying all attributes and items in the same arm pool, so that an arm selected from the pool can be either a recommendation of an item or a question about an attribute. The flowchart of ConTS is illustrated in Figure 7.

Table 4
MAB-based methods adopted by interactive recommender systems (IRSs) and CRSs.
MAB in CRSs: traditional bandit methods in CRSs [32]; conversational upper confidence bound [207]; conversational Thompson sampling [95]; cascading bandits augmented by visual dialogues [203].
Meta learning for CRSs: learning to learn the recommendation model [87,233,187].

ConTS assumes each user $u$'s preference vector $\tilde{\mathbf{u}}$ is sampled from a prior Gaussian distribution, $\tilde{\mathbf{u}} \sim \mathcal{N}(\boldsymbol{\mu}_u, \lambda^2 \mathbf{B}_u^{-1})$, where $\boldsymbol{\mu}_u$, $\lambda$, and $\mathbf{B}_u$ are parameters. For each newly arriving user, the mean of the prior, $\boldsymbol{\mu}_u$, is initialized as the average of the existing users' preference vectors. The expected reward of an arm $a$ (which can be either an item or an attribute) for user $u$ is also formulated as a Gaussian, since the Gaussian family is conjugate to itself. The expected reward is written as $r(a, u, \mathcal{P}_u) = \mathbf{x}_a^{\top} \tilde{\mathbf{u}} + \sum_{p \in \mathcal{P}_u} \mathbf{x}_a^{\top} \mathbf{x}_p$, where $\mathcal{P}_u$ denotes the user's currently known preferred attributes obtained in the conversation so far, and $\mathbf{x}_a$ is the embedding vector of arm $a$. The term $\mathbf{x}_a^{\top} \tilde{\mathbf{u}}$ models the general preference of user $u$ for arm $a$, and the term $\sum_{p \in \mathcal{P}_u} \mathbf{x}_a^{\top} \mathbf{x}_p$ models the affinity between arm $a$ and the preferred attributes $\mathcal{P}_u$. ConTS then selects the arm with the maximal reward, $a(t) = \arg\max_{a \in \mathcal{A}} r(a, u, \mathcal{P}_u)$. If the selected arm $a(t)$ is an attribute, the system queries the user's preference on that attribute; if it is an item, the system recommends it. After obtaining the user's feedback, parameters such as $\boldsymbol{\mu}_u$, $\mathcal{P}_u$, and $\mathbf{B}_u$ are updated accordingly.
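The sample-score-update loop of this style of method can be sketched as follows. This is a deliberately simplified toy in the spirit of ConTS, with made-up embeddings, arm names, and a generic linear-Gaussian posterior update rather than the paper's exact formulas:

```python
import numpy as np

class ConTSLike:
    """Simplified sketch: items and attributes share one arm pool, and
    Thompson sampling drives the ask/recommend choice."""
    def __init__(self, dim, mu_init, lam=1.0):
        self.B = np.eye(dim)          # posterior precision
        self.f = self.B @ mu_init     # sufficient statistic; mu = B^-1 f
        self.lam = lam

    def choose(self, arms, preferred_attrs, rng):
        B_inv = np.linalg.inv(self.B)
        mu = B_inv @ self.f
        # Sampling is the key E&E step: the mean exploits known preference,
        # the covariance explores uncertain preference.
        u_tilde = rng.multivariate_normal(mu, self.lam ** 2 * B_inv)
        affinity = (np.sum([arms[p] for p in preferred_attrs], axis=0)
                    if preferred_attrs else np.zeros(len(mu)))
        scores = {name: x @ u_tilde + x @ affinity for name, x in arms.items()}
        return max(scores, key=scores.get)   # "ask:*" -> query, "rec:*" -> recommend

    def update(self, x, reward):
        self.B += np.outer(x, x)
        self.f += reward * x

rng = np.random.default_rng(0)
arms = {"ask:genre=thriller": np.array([1.0, 0.0]),
        "rec:Inception":      np.array([0.8, 0.6])}
agent = ConTSLike(dim=2, mu_init=np.array([0.1, 0.1]))  # mean of existing users
arm = agent.choose(arms, preferred_attrs=[], rng=rng)
agent.update(arms[arm], reward=1.0)   # user accepted / answered "yes"
print(arm in arms)                    # True: either an ask or a rec action
```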

Meta Learning for CRSs
Beyond multi-armed bandits, there is work trying to balance exploration and exploitation via meta learning. For instance, Zou et al. [233] formulate interactive recommendation as a meta-learning problem, where the objective is to learn a learning algorithm that takes the user's historical interactions as input and outputs a model (policy function) that can be applied to new users. The authors follow the idea of meta reinforcement learning [43] and use Q-learning [121] to learn the recommendation policy. The exploration strategy is the aforementioned $\epsilon$-greedy, where the model selects the item with the maximum Q-value with probability $1-\epsilon$ and chooses a random item with probability $\epsilon$.
In addition, Lee et al. [87] address the cold-start problem in recommendation via a model based on the Model-Agnostic Meta-Learning (MAML) algorithm [46]. The learned recommendation model can quickly adapt to a cold-start user's preferences in the fine-tuning stage by asking the user a few questions about certain items (called evidence candidates in their work). A drawback of this work is that the evidence candidates are selected only once, and the query process is conducted only at the beginning when cold-start users arrive. It would be better to extend this strategy to a CRS setting and develop a dynamic multi-round query strategy to further enhance the recommendation.

Section Summary
In this section, we introduce how a CRS can address the cold-start problem and balance E&E via interactive models such as MAB-based methods and meta learning methods. The solutions are summarized in Table 4. There is still much room for CRSs to develop models that address the E&E problem and thereby improve the user experience.

Evaluation and User Simulation
In this section, we discuss how to evaluate CRSs, which is an underexplored problem. We group attempts to evaluate CRSs into two classes: (1) turn-level evaluation, which evaluates a single turn of the system output, with metrics designed to assess the recommendation task and the response generation task; (2) conversation-level evaluation, which globally evaluates the performance of the multi-turn conversation strategy, for which user simulation is important. We first introduce the commonly used datasets in CRSs, then the metrics, methods, and problems in turn-level evaluation of CRSs. Finally, we discuss four strategies of user simulation for automatically evaluating multi-turn conversations online.

Datasets and Tools
We list the statistics of commonly used CRS datasets in Table 5. Some studies collect human-human conversation data from crowdsourcing platforms such as Amazon Mechanical Turk (AMT) [94,123,104,56]. Most studies, however, simulate online user interaction from the historical records in traditional recommendation datasets, e.g., MovieLens, LastFM, Yelp, and Amazon data.
There are many different settings in CRSs, making comparisons between different models difficult. Recently, Zhou et al. [222] implemented an open-source toolkit, called CRSLab, for building and evaluating CRSs. They unify the tasks in existing CRSs into three sub-tasks, namely recommendation, conversation, and policy, which correspond to the three components in Figure 3: the recommendation engine, the user interface, and the conversation strategy module, respectively. Several models and metrics are implemented under the three tasks, and the toolkit contains an evaluation module that supports not only automatic evaluation but also human evaluation through an interactive interface, which makes the evaluation of CRSs more intuitive. However, to date, the majority of the implemented methods are based on end-to-end dialogue systems [94,25,223] or deep language models [225]; CRSs that focus on multi-turn conversation strategies ([89,88]) are absent.

Turn-level Evaluation
The fine-grained evaluation of CRSs is conducted on the output of each single turn, which involves two tasks: language generation and recommendation.

Evaluation of Language Generation
For CRS models that generate natural-language responses to interact with users, the quality of the generated responses is critical. Thus, we can adopt the metrics used in dialogue response generation to evaluate the output of a CRS. Two example metrics are BLEU [128] and Rouge [99]. BLEU measures the precision of generated words or n-grams compared to the ground truth, i.e., how many of the words in the machine-generated utterance appear in the ground-truth reference utterance. Rouge measures the recall, i.e., how many of the words or n-grams in the ground-truth reference utterance appear in the machine-generated utterance. Other metrics measuring fluency [15,124,42], consistency [48,86], readability [85], and informativeness [67] can also be used to evaluate the natural language output of CRS models. For more metrics and evaluation methods on text generation, we refer the reader to the comprehensive survey by Celikyilmaz et al. [16].
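The intuition behind the two metrics can be captured in a few lines. Below is a single-reference, single-n simplification (real BLEU combines several n-gram orders over multiple references and adds a brevity penalty):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n=1):
    """BLEU-style clipped n-gram precision: of the candidate's n-grams,
    the fraction that also occur in the reference."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

def ngram_recall(candidate, reference, n=1):
    """ROUGE-N-style recall: of the reference's n-grams, the fraction the
    candidate recovered."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(ref[g], cand[g]) for g in ref)
    return overlap / max(sum(ref.values()), 1)

ref = "i would recommend the movie inception"
hyp = "i recommend the movie"
print(round(ngram_precision(hyp, ref), 2), round(ngram_recall(hyp, ref), 2))
# -> 1.0 0.67: every generated word is in the reference (perfect precision),
#    but the candidate recovers only four of the six reference words.
```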
However, CRSs based on end-to-end dialogue frameworks or deep language models may have limited usability in practice. Recently, Jannach and Manzoor [72] evaluated two state-of-the-art end-to-end frameworks [94,25] and showed that both models face three critical issues: (1) for each system, about one-third of the system utterances are not meaningful in the given context and would probably lead to a breakdown of the conversation in a human evaluation; (2) less than two-thirds of the recommendations were considered meaningful in a human evaluation; (3) neither of the two systems truly "generated" utterances, as almost all system responses were already present in the training data. Their analysis shows that human assessment and expert analysis remain necessary for evaluating CRS models, as no single metric can evaluate all aspects of a CRS. CRS models and their evaluation still have a long way to go.

Evaluation of Recommendation
The performance of recommendation models is evaluated by comparing the predicted results with the records in the test set. There are two kinds of metrics for measuring the performance of recommender systems:
• Rating-based Metrics. These metrics assume the user feedback is an explicit rating score, e.g., an integer in the range of one to five. Therefore, we can measure the divergence between the scores predicted by the model and the ground-truth scores given by users in the test set. Conventional rating-based metrics include the Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE), where RMSE is the square root of the MSE.
• Ranking-based Metrics. These metrics are used more frequently than rating-based metrics. They require that the relative order of the predicted items be consistent with the order of the items in the test set. Thereby, there is no need for explicit rating scores from users, and implicit interactions (e.g., clicks, plays) can be used to evaluate models. For example, a good evaluation result means that the model recommends only the items in the test set, or that items with higher scores in the test set are recommended at higher ranks than items with lower scores. Frequently used ranking-based metrics include hits, precision, recall, F1-score, Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG) [74].
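As a concrete illustration of the two metric families, here is a minimal implementation of RMSE and NDCG (single ranking, graded relevance); the example numbers are invented:

```python
import math

def rmse(predicted, actual):
    """Rating-based: root mean squared error between predicted and true scores."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

def dcg(relevances):
    """Discounted cumulative gain: relevance discounted by log2 of the rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """Ranking-based: DCG of the predicted order divided by the DCG of the
    ideal (descending-relevance) order."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal else 0.0

print(round(rmse([4.1, 3.0, 5.2], [4, 3, 5]), 3))
# relevance of test items in the order the model ranked them:
print(round(ndcg([3, 0, 2]), 3))
```

Placing the relevance-2 item above the irrelevant one would raise the NDCG to 1.0, which is the sense in which ranking metrics reward order rather than score accuracy.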
Recently, it has become common for researchers to speed up evaluation by sampling a small set of irrelevant items and calculating the ranking-based metrics only on this small set [60,44,64,195]. However, Krichene and Rendle [82] point out and prove that some metrics, such as average precision, recall, and NDCG, are inconsistent with the exact metrics when calculated on the sampled set. This means that if recommender A outperforms recommender B on a sampled metric, it does not imply that A beats B when the metric is computed exactly. Therefore, the authors suggest avoiding sampling during evaluation; if sampling is necessary, using the corrected metrics they propose is a better choice.
The biggest problem with these evaluation methods is that real-world user interactions are very sparse, and a large fraction of items never has a chance of being consumed by a user. However, this does not mean that the user dislikes all of them: perhaps the user has never seen them, or simply lacks the resources to consume them [100,21]. Hence, taking the consumed items in the test set as the user's ground-truth preferences can introduce evaluation biases [195,21]. Unlike static recommender systems, CRSs can ask real-time questions, so the system can verify whether a user is satisfied with an item by collecting online feedback. Such an online user test can avoid these biases and provide conversation-level assessments of the CRS model.

Conversation-level Evaluation
We first introduce the metrics used in online user tests, and then describe how evaluation is conducted via user simulation.

Online User Test
For conducting an online user test, appropriate metrics should be designed to assess the recommendation performance in multi-turn conversations. For example, the average turn (AT) is a global metric to optimize in a CRS, as the model should capture the user's intention and make successful recommendations to finish the conversation in as few turns as possible [88,89,95]. A similar metric is the recommendation success rate (SR@t), which measures how many conversations have ended with a successful recommendation by the t-th turn. Besides, the ratio of failed attempts, e.g., how many of the questions asked by the system are rejected or ignored by users, can be a feasible way to measure whether the system's decisions satisfy users. Defining suitable metrics to assess the user experience in the conversation remains an open problem.
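Given per-session success turns from a simulator or user study, AT and SR@t can be computed directly. A small sketch, with an invented convention of counting failed sessions at the turn limit:

```python
def conversation_metrics(success_turns, max_turn=15):
    """success_turns: for each session, the turn at which the recommendation
    succeeded, or None if the session hit the turn limit. Failed sessions
    are counted as max_turn when averaging (a common convention)."""
    n = len(success_turns)
    avg_turn = sum(t if t is not None else max_turn for t in success_turns) / n
    def success_rate_at(t):
        return sum(1 for s in success_turns if s is not None and s <= t) / n
    return avg_turn, success_rate_at

sessions = [3, 7, None, 5]        # one session never succeeded
at, sr = conversation_metrics(sessions)
print(at, sr(5), sr(15))          # 7.5 0.5 0.75
```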
Though effective, online user evaluation has critical problems: (1) online interaction between humans and CRSs is slow, and it usually takes weeks to collect enough data to make an assessment statistically significant [93,52,213]; (2) collecting users' feedback is expensive in terms of engineering and logistic overhead [71,70,193] and may hurt the user experience when recommendations fail to satisfy users [150,93,52,26]. Therefore, a natural solution is to use simulated users, both at the model development stage and at the evaluation stage.

User Simulation
There are generally four types of strategies in simulating users: (1) using the direct interaction history of users, (2) estimating user preferences on all items, (3) extracting from user reviews, and (4) imitating human conversational corpora.
• Using Direct Interaction History of Users. The basic idea is similar to the evaluation of traditional recommender systems, where a subset of human interaction data is set aside as the test set. If the items recommended by a CRS are in a user's test set, the recommendation is deemed successful. As user-machine interactions are relatively rare, there is a need to generate/simulate interaction data for training and evaluation. Sun and Zhang [161] make the strong assumption that users visit restaurants after chatting with a virtual agent. Based on this assumption, they create a crowdsourcing task that uses a schema-based method to collect dialogue utterances from the Yelp dataset. In total, they collect 385 dialogues and simulate 875,721 dialogues from them via a process called delexicalization. For instance, "I'm looking for Mexican food in Glendale" is converted to the template "I'm looking for <Category> in <City>"; these templates are then used to generate dialogues from the rating data and the rich information in the Yelp dataset. Lei et al. [88,89] use click data in the LastFM and Yelp datasets to simulate conversational user interactions. Given an observed user-item interaction, they treat the item as the ground-truth item to seek and its attributes as the oracle set of attributes preferred by the user in this session. First, the authors randomly choose an attribute from the oracle set as the user's initialization of the session. The session then enters a loop of a "model acts, simulator responds" process, in which the simulated user responds "yes" if the queried entity is contained in the oracle set and "no" otherwise. Most CRS studies adopt this simulation method because of its simplicity [230,31,20]. However, the sparsity problem in recommender systems remains: only a few values in the user-item matrix are known, while most entries are missing, which precludes simulation on those items.
• Estimating User Preferences on All Items. Using direct user interactions to simulate conversations has the drawback mentioned above, i.e., a large number of items that have not been seen by a user are treated as disliked items. To overcome this bias in the evaluation process, some research proposes to obtain user preferences on all items in advance. Given an item and its auxiliary information, the key to simulating user interaction is to estimate or synthesize the preference for this item. For example, Christakopoulou et al. [32] ask 28 participants to rate 10 selected items, and then estimate the latent vectors of these users' preferences based on their matrix factorization model. By adding noise to the latent vectors, they simulate 50 new user profiles and calculate these new users' preferences on any item with the same matrix factorization model. Zhang et al. [207] propose to use ridge regression to compute user preferences based on the known rewards in historical interactions and users' features; they then synthesize the user's reactions (rewards) to each item according to the computed preferences. This kind of method can theoretically simulate complete user preferences without exposure bias. However, because the user preferences are computed or synthesized, they could deviate from real user preferences. Huang et al. [66] analyze the phenomena of popularity bias [157,134] and selection bias [114,61,158] in simulators built on logged interaction data and try to alleviate the model performance degradation due to these biases; it remains to be seen to what degree generated interactions of the type described above are subject to similar biases.
• Extracting Information from User Reviews. Besides user behavior history, many e-commerce platforms have textual review data. Unlike the consumption history, an item's review data usually explicitly mentions attributes, which can reflect users' personalized opinions on the item.
Zhang et al. [208] transform each textual review of part of the Amazon dataset into a question-answer sequence to simulate the conversation. For example, when a user mentions a blue Huawei phone with the Android system in a review of a mobile phone X, the conversation sequence constructed from this review is (Category: mobile phone → System: Android → Color: blue → Recommendation: X). Zhou et al. [225] also construct simulated interactions by leveraging user reviews. Based on a given user profile and its historical watching records, the authors construct a topic thread that consists of topics (e.g., "family" or "job seeking") extracted from reviews of these watched movies. The topic thread is organized by a rule and eventually leads to the target movie, and the synthetic conversation is fleshed out by retrieving the most related reviews under the corresponding topics. A noteworthy problem is that the aspects mentioned in reviews may describe drawbacks of the products, which does not help explain why a user has chosen a product. For example, when a user complains that the 64-Gigabyte capacity of a phone is not enough, this should not simply be converted to (Storage capacity: 64 Gigabytes) for the CRS to learn. Thus, employing sentiment analysis on the review data is necessary, and only attributes with positive sentiment should be considered as reasons for choosing the item [209,211].
• Imitating Humans' Conversational Corpora. In order to generate conversational data without biases, a feasible solution is to use real-world two-party human conversations as the training data [169]. By using this type of data, a CRS can directly mimic human behavior by learning from a large number of real human-human conversations. For example, Li et al. [94] ask workers from AMT to converse on the topic of movie recommendation. Using these conversational corpora as training data, the model can learn how to respond properly based on the sentiment analysis result. Liu et al. [104] conduct a similar data collection process. Besides collecting dialogues about recommendation, they also construct a knowledge graph and define an explicit profile for each worker who seeks recommendations. Therefore, the conversational topics can contain many non-recommendation scenarios, e.g., question answering or social chitchat, which are more common in real life. To evaluate this kind of model, besides considering whether the user likes the recommended item, we have to consider whether the system responds properly and fluently.
The BLEU score [129] is used to measure the fluency of these models in mimicking human conversations [10,210]. This kind of method also has drawbacks. First, when collecting a human conversational corpus, two workers need to enter the task at the same time, which is a rigorous setting and thus limits the scale of the dataset. Second, designers usually impose many requirements that restrict the direction of the conversation. Therefore, the collected conversations are constrained and cannot fully cover real-world scenarios. When imitating a collected corpus, learning a conversation strategy is very sensitive to the quality of the collected data. Vakulenko et al. [169] analyze the characteristics of different human-human corpora, e.g., in terms of initiative taking, and show that there are important differences between human-human and human-machine conversations.
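For reference, BLEU combines modified n-gram precisions with a brevity penalty. A self-contained sentence-level sketch (unsmoothed, so any zero n-gram overlap yields a zero score):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram precisions
    (n = 1..max_n) times a brevity penalty. Tokens are whitespace-split."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("i would like a scary movie", "i would like a scary movie"))  # 1.0
```

Production evaluations typically use corpus-level BLEU with smoothing (e.g., as implemented in NLTK or SacreBLEU) rather than this bare form.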
Recently, Zhang and Balog [206] have investigated using user simulation in evaluating CRSs. They organize the action sequence of the simulated user as a stack-like structure called the user agenda. A dynamic update of the agenda is regarded as a sequence of pull or push operations, where dialogue actions are removed from or added to the top. Figure 8 shows an example of a dialogue between the simulated user and a CRS. At each turn, the simulated user updates its agenda by either a push or a pull operation based on the dialogue state and the CRS's action. The authors define a set of actions and transition rules on these actions to let the simulated user imitate real users' intentions. For example, the Disclose action indicates that the user expresses a need either actively or in response to the agent's question, e.g., "I would like to arrange a holiday in Italy". After this action, the simulator can transition to either the Inquire action or the Reveal action, depending on how the CRS model acts.
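The agenda mechanism can be illustrated with a minimal stack sketch. The user action names follow [206], but the acceptability table and the push policy here (re-pushing the same action when the agent's response is inappropriate) are simplifying assumptions for illustration:

```python
class AgendaSimulator:
    """Toy agenda-based user simulator in the spirit of [206]: pending user
    actions live on a stack. If the agent's response is appropriate for the
    top action, it is pulled (popped); otherwise a replacement action with
    the same goal is pushed (here simply the same action, i.e., a retry)."""

    # Which agent actions satisfy each user action (illustrative subset).
    ACCEPTABLE = {"Disclose": {"Elicit", "List"},
                  "Inquire": {"List", "Elicit"},
                  "Navigate": {"Show"},
                  "Complete": {"Bye"}}

    def __init__(self, agenda):
        self.agenda = list(agenda)  # top of the stack = end of the list

    def step(self, agent_action):
        top = self.agenda[-1]
        if agent_action in self.ACCEPTABLE.get(top, set()):
            self.agenda.pop()        # pull: the user's need was met
        else:
            self.agenda.append(top)  # push: retry the unmet goal
        return top

# The dialogue terminates when the agenda (bottom: Complete) is empty.
sim = AgendaSimulator(["Complete", "Navigate", "Inquire", "Disclose"])
sim.step("Elicit")    # appropriate -> Disclose is pulled
sim.step("Chitchat")  # inappropriate -> Inquire is re-pushed
print(sim.agenda)     # ['Complete', 'Navigate', 'Inquire', 'Inquire']
```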

Section Summary
In this section, we review the metrics, methods, and challenges in the turn-level and conversation-level evaluation of CRSs. Turn-level evaluation measures the recommendation and language generation performance of the CRS in a single turn; conversation-level evaluation measures how the CRS performs over the multi-turn conversation, e.g., how many turns it needs to fulfill the recommendation task. Since online user tests are expensive to conduct, researchers use user simulators to assist in model training and testing. We summarize four strategies for simulating users.
The evaluation of CRSs still needs a lot of effort, ranging from constructing large-scale, dense conversational recommendation data to proposing uniform evaluation methods that can compare different CRS methods along both the recommendation and the conversation aspects.

Future Directions and Opportunities
Having described key advances and challenges in the area of CRSs, we now envision some promising future directions.

Jointly Optimizing Three Tasks
The recommendation task, the language understanding and generation task, and the conversation strategy in CRSs are usually studied separately, within the three components in Figure 3, respectively. The three components share certain objectives and data with each other [25,111,88,223]. For example, the user interface feeds extracted aspect-value pairs to the recommendation engine, and then integrates the entities produced by the recommendation engine into the generated response. However, each component also holds exclusive data that does not benefit the others. For instance, the user interface may use the rich semantic information in reviews but not share it with the recommendation engine [94]. Besides, the components may work in an end-to-end framework that lacks an explicit conversation strategy to coordinate them in the multi-turn conversation [94,25], so the performance is not satisfactory in human evaluation [72].
Therefore, the three tasks should be jointly learned and guided by an explicit conversation strategy for their mutual benefit. For instance, what if the conversation strategy module were able to plan future dialogue acts based on item-item relationships such as complementarity and substitutability [115,173,101]?

Bias and Debiasing
It is inevitable that a recommender system encounters various types of bias [21]. Some types of bias, e.g., popularity bias [1,157] and conformity bias [211,102], can be mitigated by introducing interaction between the user and the system. For example, a static recommender may not be sure whether a user will follow the crowd and like popular items. Popularity bias is thus introduced into the recommender system, since popular items have a higher probability of being recommended. This, however, can be avoided in CRSs, because a CRS can query the user's attitude towards popular items in real time and avoid recommending them if the user gives negative feedback.
Nevertheless, some types of bias persist. For example, even though a recommender system may provide access to a large number of items, a user can only interact with a small set of them. If these items are chosen by a model or a certain exposure mechanism, users have no choice but to keep consuming them; this is the exposure bias [100]. Moreover, users often select or consume the items they like and ignore the disliked ones even when those items have been exposed to them, which introduces the selection bias [114,61,158], also known as positivity bias [66,134]: rating data is often missing not at random, and the missing ratings are more likely to correspond to items the user dislikes [61]. These types of bias can be amplified in the feedback loop and may hurt the recommendation model [153,160]. For instance, a CRS model polluted by biased data might repeatedly generate the same items even though users indicated they would like other ones.
There are relatively few efforts to study the bias problem in CRSs. The exploration-exploitation methods introduced in Section 5 can alleviate some types of bias in CRSs, and Huang et al. [66] make an attempt to remove positivity bias in the user simulation stage for interactive recommendation. Moreover, Chen et al. [21] present a comprehensive survey of different types of bias and describe a number of debiasing methods for recommender systems (RSs); it provides useful perspectives for debiasing CRSs.
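One standard family of debiasing techniques for exposure and selection bias, used broadly in recommendation research (though not tied to any specific CRS work above), is inverse propensity scoring (IPS): each observed feedback signal is reweighted by the inverse of the probability that its item was exposed, yielding an unbiased estimate over the full item space. A minimal sketch:

```python
# Minimal inverse-propensity-scoring (IPS) sketch (illustrative; the data
# format is an assumption). Each observation is a pair
# (observed_reward, exposure_propensity); dividing by the propensity
# upweights feedback on rarely exposed items.
def ips_estimate(observations):
    """observations: list of (observed_reward, exposure_propensity) pairs."""
    n = len(observations)
    return sum(r / p for r, p in observations) / n

# A popular item (propensity 0.9) and a rarely shown one (propensity 0.1):
obs = [(1.0, 0.9), (1.0, 0.1)]
naive = sum(r for r, _ in obs) / len(obs)  # ignores the exposure mechanism
debiased = ips_estimate(obs)               # upweights the rare item's signal
print(naive, round(debiased, 2))           # 1.0 5.56
```

The same reweighting can be applied inside a training loss, which is how IPS-style corrections are commonly used to counter missing-not-at-random feedback.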

Sophisticated Multi-turn Conversation Strategies
The multi-turn strategies considered in current studies of CRSs are relatively naive. For example, some work uses a hand-crafted function to determine when to ask about attributes or make a recommendation, e.g., making a recommendation after a fixed number of turns [207]. Studies based on end-to-end dialogue systems or deep neural language models fare even worse: they do not have an explicit strategy to control the multi-turn conversation at all [94,25]. Besides, some strategies can be problematic in handling users' negative feedback. For instance, Lei et al. [88] consider updating the model parameters when the user dislikes a recommended item. However, simply taking rejected items as negative samples would distort the model's judgement of the queried attributes. For example, a user's rejection of a recreation video might be due to the fact that they have watched it before; it does not mean that they dislike recreation videos. To overcome this problem, the model should consider more sophisticated strategies such as recognizing reliable negative samples [22,39,186,96] as well as disentangling user preferences on items and attributes [110,184].
We have witnessed some studies using RL as the multi-turn conversation strategy, determining model actions such as whether to ask or to recommend [161,88,89]. However, there is a lot of room for improvement in designing the state, action, and reward in RL. For instance, more sophisticated actions can be taken into consideration, such as answering open-domain questions raised by users [227] or chatting about non-task-oriented topics for entertainment purposes [190,104]. Besides, more advanced and intuitive RL technologies can be considered to avoid the difficulties of basic RL models, e.g., being hard to train and slow to converge [178]. For example, inverse RL (IRL) [126] can be used to learn a proper reward function from observed examples in CRS scenarios where user behavior patterns are too varied for the reward to be defined by hand. Meta-RL [43,179] can be adopted in CRSs, where interactions are sparse and varied, to speed up the training process and to improve learning efficiency on novel subsequent tasks.
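The ask-or-recommend decision can be cast as an RL action choice. The sketch below is a toy tabular Q-learning setup; the state (number of confirmed attributes), reward values, and environment dynamics are illustrative assumptions, not the design of any cited work:

```python
import random

class AskRecommendPolicy:
    """Toy tabular Q-learning over a coarse dialogue state (number of
    confirmed attributes). Actions: 'ask' or 'recommend'."""

    ACTIONS = ["ask", "recommend"]

    def __init__(self, epsilon=0.1, alpha=0.5, gamma=0.9):
        self.q = {}  # (state, action) -> estimated value
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma

    def act(self, state):
        if random.random() < self.epsilon:
            return random.choice(self.ACTIONS)          # explore
        return max(self.ACTIONS, key=lambda a: self.q.get((state, a), 0.0))

    def update(self, state, action, reward, next_state):
        best_next = max(self.q.get((next_state, a), 0.0) for a in self.ACTIONS)
        old = self.q.get((state, action), 0.0)
        self.q[(state, action)] = old + self.alpha * (
            reward + self.gamma * best_next - old)

# Toy environment: a recommendation succeeds only after 3 confirmed attributes.
random.seed(0)
policy = AskRecommendPolicy()
for _ in range(500):
    state = 0
    for _ in range(10):
        action = policy.act(state)
        if action == "ask":
            reward, next_state = -0.1, state + 1  # small cost per question
        elif state >= 3:
            reward, next_state = 1.0, state       # accepted recommendation
        else:
            reward, next_state = -1.0, state      # premature, rejected
        policy.update(state, action, reward, next_state)
        state = next_state
        if action == "recommend" and reward > 0:
            break  # session ends on a successful recommendation

best = max(AskRecommendPolicy.ACTIONS,
           key=lambda a: policy.q.get((0, a), 0.0))
print(best)  # the learned greedy choice in the initial state
```

Even this toy setup surfaces the design questions raised above: richer states, larger action sets (question answering, chitchat), and hand-tuned rewards quickly become hard to specify, which motivates IRL and meta-RL.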

Knowledge Enrichment
A natural idea to improve CRSs is to introduce additional knowledge. In the early stages of the development of CRSs, only the recommended items themselves were considered [32].
Later, the attribute information of items was introduced to assist in modeling user preferences [31]. More recent studies consider the rich semantic information in knowledge graphs [223,89,192,123]. For example, to better understand concepts in a sentence such as "I am looking for scary movies similar to Paranormal Activity (2007)", Zhou et al. [223] propose to incorporate two external knowledge graphs (KGs): a word-oriented KG providing relations (e.g., synonyms, antonyms, or co-occurrence) between words, so as to comprehend the concept "scary" in the sentence; and an item-oriented KG carrying structured facts about the attributes of items.
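The two-KG idea can be illustrated with a toy lookup. The dictionaries below stand in for real knowledge graphs (they are assumptions for illustration, not the graphs or linking method used in [223]):

```python
# Toy illustration of grounding a user utterance with two KGs:
# a word-oriented KG relates the mention "scary" to the genre word
# "horror", and an item-oriented KG exposes item attributes, so the
# mention can be linked to candidate items.
word_kg = {"scary": {"horror", "frightening"}}    # word -> related words
item_kg = {"The Shining": {"genre": "horror"},    # item -> attribute facts
           "Toy Story": {"genre": "animation"}}

def candidates_for(mention):
    """Items whose genre matches the mention or any related word."""
    related = word_kg.get(mention, set()) | {mention}
    return [item for item, attrs in item_kg.items()
            if attrs.get("genre") in related]

print(candidates_for("scary"))  # ['The Shining']
```

Real systems replace these dictionaries with graph embeddings over large KGs such as ConceptNet and DBpedia, but the division of labor between the word-level and item-level graphs is the same.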
Besides knowledge graphs, multimodal data can also be integrated into originally text-based CRSs, since it can enrich the interaction along new dimensions. Some studies exploit the visual modality, i.e., images, in dialogue systems [203,97,34,212]. For example, Yu et al. [203] propose a visual dialog augmented CRS model. The model recommends a list of items as photos, and the user gives text-based comments as feedback. The images not only help the model learn more informative representations of entities, but also enable the system to better convey information to the user. Beyond the visual modality, other modalities could also benefit CRSs and be integrated. For example, spoken natural language can convey users' emotions as well as sentiments towards certain entities [133].

Better Evaluation and User Simulation
The evaluation of CRSs still has a long way to go. As introduced in Section 6.3, evaluating a CRS requires real-time feedback, which is expensive to obtain in real-world situations [71]. Thus, most CRSs adopt user simulation techniques to create an environment for training and evaluation [206]. However, simulated users cannot fully replace human beings, and how to simulate users with maximum fidelity still needs further research. Feasible directions include designing systematic simulation agendas [206,148] and building dense user interactions for reliable simulation.
In addition, CRSs work on different datasets and have various assumptions and settings. Therefore, developing comprehensive evaluation metrics and procedures to assess the performance of CRSs remains an open problem. Recently, Zhou et al. [222] have implemented an open-source CRS toolkit, enabling comparison between different CRS models. However, the implemented models are mainly based on end-to-end dialogue systems [94,25,223] or deep language models [225]; models focusing on explicit conversation strategies [88,89] are absent.

Conclusion
Recommender systems are playing an increasingly important role in information seeking and retrieval. Despite having been studied for decades, traditional recommender systems estimate user preferences only in a static manner, e.g., through historical user behaviors and profiles, and offer no opportunity to communicate with users about their preferences. This inevitably suffers from a fundamental information asymmetry problem: a system will never know precisely what a user likes (especially when their preferences drift frequently) or the exact reason a user likes an item. The vision of conversational recommender systems (CRSs) brings a promising solution to such problems. With interactive abilities as well as a natural language-based user interface, CRSs can dynamically obtain explicit user feedback in natural language, while increasing user engagement and improving user experience. This bold vision holds great potential for the future of recommender systems and actively contributes to the development of the next generation of information seeking techniques.
Although building CRSs is an emerging field, we have seen great efforts from different perspectives. In this survey, we acknowledge those efforts, with the aim of summarizing existing studies and providing insightful discussions. We tentatively gave a definition of the CRS and introduced a general framework of CRSs that consists of three components: a user interface, a conversation strategy module, and a recommender engine. Based on this decomposition, we distilled five existing research directions, namely: (1) question-based user preference elicitation; (2) multi-turn conversational recommendation strategies; (3) dialogue understanding and generation; (4) exploitation-exploration trade-offs for cold users; (5) evaluation and user simulation. For each direction, we reviewed the existing efforts and their limitations in one section, forming the primary structure of this survey. Despite the progress on the above five directions, more interesting problems remain to be explored in the field of CRSs, such as: (1) joint optimization of the three components; (2) bias and debiasing methods in CRSs; (3) sophisticated multi-turn conversational recommendation strategies; (4) multimodal knowledge enrichment; (5) better evaluation and user simulation.
Our discussion above provides a comprehensive retrospective of the current progress of CRSs, which can serve as a basis for the further development of this field. With this survey, we issue a call to arms for this emerging and interesting field. We hope it can inspire researchers and practitioners from both industry and academia to push the frontiers of CRSs, making the ideas and techniques of CRSs more prevalent in the next generation of information seeking techniques.