Overview of the Ninth Dialog System Technology Challenge: DSTC9

This paper introduces the Ninth Dialog System Technology Challenge (DSTC-9). This edition of the DSTC focuses on applying end-to-end dialog technologies to four distinct tasks in dialog systems: (1) task-oriented dialog modeling with unstructured knowledge access, (2) multi-domain task-oriented dialog, (3) interactive evaluation of dialog, and (4) situated interactive multi-modal dialog. This paper describes the task definition, provided datasets, baselines, and evaluation set-up for each track. We also summarize the results of the submitted systems to highlight the overall trends of the state-of-the-art technologies for the tasks.


I. INTRODUCTION
The Dialog System Technology Challenge (DSTC) is one of the leading series of research competitions in the space of dialog systems. Since its inception in 2013, DSTC has been accelerating the development of dialog technologies by bringing leading researchers and engineers together to solve important problems in dialog systems. The challenge has evolved every year to cater to the demands and interests of the dialog community and to foster the development of the technology.
The first Dialog State Tracking Challenge [1] used human-to-bot dialogs in the bus timetable domain. Dialog State Tracking Challenges 2 [2] and 3 [3] used a restaurant reservation application, which introduced more complicated and dynamic dialog states. Dialog State Tracking Challenge 4 [4] and Dialog State Tracking Challenge 5 [5] moved to tracking human-to-human dialogs in mono- and cross-language settings. Starting with the sixth challenge [6], the DSTC rebranded itself as the "Dialog System Technology Challenge" and organized multiple tracks in parallel to address a wider variety of dialog-related problems. The tracks in DSTC-6 focused on end-to-end conversation modeling and dialog breakdown detection. DSTC-7 [7] focused on developing end-to-end dialog technologies for noetic response selection [8], [9], grounded response generation [10], and audio visual scene-aware dialog [11]. More recently, DSTC-8 [12] focused on a diverse set of four tracks: multi-domain task completion, predicting responses, audio visual scene-aware dialog, and schema-guided dialog state tracking.

Every author has equal contribution
For the ninth edition of the DSTC, we received nine track proposals from leading research organizations and top universities. The proposals went through a formal peer review process focusing on each task's (a) potential impact on the community, (b) novelty, (c) feasibility, and (d) potential participants. The DSTC-8 participants were also asked to provide feedback on the track proposals through a survey, and their responses were considered in the evaluation. We ended up with four main tracks, comprising three newly introduced tasks and one follow-up task from DSTC-8.
Track 1, Beyond Domain APIs: Task-oriented Conversational Modeling with Unstructured Knowledge Access, aims to support frictionless task-oriented scenarios, where the flow of the conversation does not break when users have requests that are out of the scope of the APIs/DB but are potentially already available in external knowledge sources. Track 2, Multi-domain Task-oriented Dialog Challenge II, is a continuation of last year's track and focuses on end-to-end multi-domain task completion dialog and cross-lingual multi-domain dialog state tracking. Track 3, Interactive Evaluation of Dialog, aims to take the first step in expanding dialog research beyond datasets and challenges the participants to develop dialog systems that can converse effectively in interactive environments with real users. Track 4, SIMMC: Situated Interactive Multi-Modal Conversational AI, aims to lay the foundations for real-world assistant agents that can handle multi-modal inputs and perform multi-modal actions.
The following sections describe the details of each track.

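II. TRACK 1 - BEYOND DOMAIN APIS: TASK-ORIENTED CONVERSATIONAL MODELING WITH UNSTRUCTURED KNOWLEDGE ACCESS

A. Track Overview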
Most prior work on task-oriented dialog systems has been restricted to a limited coverage of domain APIs. However, users often have domain-related requests that are not covered by the APIs. This challenge track aims to expand the coverage of task-oriented dialog systems by incorporating external unstructured knowledge sources. There are three main tasks in this track, as introduced in [13]: knowledge-seeking turn detection, knowledge selection, and knowledge-grounded response generation (Table I).

B. Data
This challenge track uses two different data sets (Table II). The first data set is an augmented version of MultiWOZ 2.1 [14] that includes newly introduced knowledge-seeking turns in the MultiWOZ conversations. The data augmentation was done incrementally by the crowdsourcing tasks described in [13]. A total of 22,834 utterance pairs were newly collected based on 2,900 knowledge candidates from the FAQ webpages about the domains and the entities in the MultiWOZ databases. For the challenge track, we divided the whole data set into three subsets: train, validation, and test. The first two sets were released in the development phase along with the ground-truth annotations and human responses for participants to develop their models.
In the evaluation phase, we released the test split of the augmented MultiWOZ 2.1 together with a second set of conversations collected from scratch about touristic information for San Francisco. To evaluate the generalizability of models, the new conversations cover knowledge, locales, and domains that are unseen in the train and validation data sets. In addition, this test set includes not only written conversations but also spoken dialogs, to evaluate system performance across different modalities [15]. All the backend resources for this data collection were also released, including 9,139 knowledge snippets and 855 database entries for San Francisco.

C. Evaluation Criteria
Each participating team submitted up to five system outputs, each of which contains the results for all three tasks on the unlabeled test instances. We first evaluated each submission using the task-specific objective metrics (Table III) by comparing to the ground-truth labels and responses. Considering the dependencies between the tasks in the pipelined architecture, the final scores for knowledge selection and knowledge-grounded response generation are computed by taking into account the recall and precision of the first-step knowledge-seeking turn detection, where s(x) is the knowledge selection or response generation score in a target metric for a single instance x ∈ X. Then, we aggregated the multiple scores across different tasks and metrics into a single overall score computed by the mean reciprocal rank, as follows:

$$\mathrm{overall}(e) = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{\mathrm{rank}_i(e)} \tag{2}$$

where rank_i(e) is the ranking of the submitted entry e in the i-th metric against all the other submissions and M is the number of metrics we considered.
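The two scoring steps described above can be sketched in a few lines. This is a minimal illustration, not the official scoring script: the function names, the F-measure form used to combine s(x) with detection precision and recall, and the tie-handling rule are all assumptions made for the example.

```python
from typing import Dict, List


def detection_aware_f1(scores: List[float], n_pred: int, n_ref: int) -> float:
    """Combine per-instance scores s(x) with detection precision and recall.

    ASSUMPTION: `scores` holds s(x) for turns that were both predicted and
    labeled as knowledge-seeking; `n_pred` and `n_ref` count the predicted
    and ground-truth knowledge-seeking turns. The F-measure form is one
    plausible reading of the text, not a formula taken from the paper.
    """
    total = sum(scores)
    precision = total / n_pred if n_pred else 0.0
    recall = total / n_ref if n_ref else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank_scores(results: Dict[str, List[float]]) -> Dict[str, float]:
    """Overall score per entry: the mean over metrics of 1 / rank_i(e).

    ASSUMPTION: higher is better for every metric, and tied entries share
    the best rank among them.
    """
    entries = list(results)
    n_metrics = len(next(iter(results.values())))
    overall = {}
    for e in entries:
        reciprocal_sum = 0.0
        for i in range(n_metrics):
            # Rank 1 means no other entry has a strictly higher value.
            rank = 1 + sum(results[o][i] > results[e][i] for o in entries)
            reciprocal_sum += 1.0 / rank
        overall[e] = reciprocal_sum / n_metrics
    return overall
```

With three hypothetical entries scored on two metrics, an entry that ranks second and first obtains (1/2 + 1/1) / 2 = 0.75, which beats an entry that ranks first and third, (1/1 + 1/3) / 2 ≈ 0.67.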
Based on the overall objective score, we selected the finalists to be manually evaluated by the following two crowdsourcing tasks:
• Appropriateness: This task asks crowd workers to score how well a system output is naturally connected to a given conversation on a scale of 1-5.
• Accuracy: This task asks crowd workers to score the accuracy of a system output based on the provided reference knowledge on a scale of 1-5.
In both tasks, we assigned each instance to three crowd workers and took their average as the final human evaluation score for the instance. Those scores were then aggregated.

TABLE IV: Objective evaluation results of the best entry from each team in the featured metrics for the Track 1 tasks. Team 0 is the baseline. Bold denotes the best result in each column and * indicates the finalists.

D. Results
We received 105 entries in total, submitted by 24 participating teams. To preserve anonymity, the teams were identified by numbers from 1 to 24, while our baseline [13] was marked as Team 0. Table IV shows the objective evaluation results of the best entry from each team in the featured metrics. The full scores for all the submitted entries and the other metrics are available in the track repository (https://github.com/alexa/alexa-with-dstc9-track1-dataset). Most entries outperformed the baseline in all three tasks. In particular, the best entry from Team 3 achieved over 99% F-measure for knowledge-seeking turn detection, as well as the highest scores in the BLEU and ROUGE variants for the response generation task. On the other hand, Team 19 was the best in the knowledge selection metrics, and Team 15 was better than all the other teams in METEOR for generation. We calculated the overall score (Equation 2) of each entry and selected 12 finalists, corresponding to the best entry from each of the top 12 teams.
Table V shows the final ranking of the Track 1 participating teams based on the human evaluation scores of the finalist entries. The top three teams (Teams 19, 3, and 10) all used ensembles of large-scale pre-trained language models in their best entries. Team 19 won the challenge track with the highest scores for both Accuracy and Appropriateness, most likely because of their better performance in the knowledge selection task (as seen in the objective evaluation results). To compare the importance of each task towards end-to-end performance, we calculated Spearman's rank correlation coefficient between the ranked lists of all the entries for every pair of objective and human evaluation metrics. Recall@1 for the knowledge selection task shows a strong correlation of 0.8601 with the averaged human evaluation ranking, which is considerably higher than the 0.7692 and 0.6503 obtained with F-measure for knowledge-seeking turn detection and BLEU-1 for response generation, respectively. This implies that knowledge selection is a key task for improving end-to-end performance.
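For readers who want to reproduce this kind of analysis, the following is a minimal sketch of Spearman's rank correlation, computed as the Pearson correlation of the rank vectors. The function names are illustrative, and averaging the ranks of tied values is an assumption, since the text does not say how ties were handled.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upward; tied values get their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation between the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant ranking")
    return cov / (var_x * var_y) ** 0.5
```

Here `x` would hold one objective metric per entry and `y` the corresponding human evaluation score; a value close to 1 means the two metrics order the entries almost identically.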

III. TRACK 2 - MULTI-DOMAIN TASK-ORIENTED DIALOG CHALLENGE II
We provide two tasks in the multi-domain task-oriented dialog setting. One is the end-to-end task-oriented dialog task, which aims to address the complexity of building end-to-end dialog systems. The other is cross-lingual dialog state tracking (DST), which addresses the language adaptation problem for the DST task.

A. End-to-end Task-oriented Dialog Task
This task is a continuation of last year's task at DSTC8 [16]. Participants develop an end-to-end task-oriented dialog system that takes natural language as input and generates a natural language response as output in the travel planning setting. Both the evaluation results of last year's challenge [17] and an empirical analysis of models in ConvLab [18] show that the best rule-based pipeline systems outperform systems assembled from state-of-the-art component-wise machine learning models. From the results, we have also observed a discrepancy between the performance of component-wise models under corpus-based evaluation and that of the entire system under end-to-end evaluation. These findings are consistent with the landscape of dialog development technology stacks in the industry. Interestingly, however, the winning team built their model on GPT-2 [19] and achieved a significant improvement over other teams with regard to success rate, understanding score, and response score in the human evaluation phase. Meanwhile, using similar model training paradigms, SOLOIST [20] and SimpleTOD [21] soon achieved top performance on the MultiWOZ leaderboard by leveraging GPT-2 [22].
This year, we continue with the end-to-end task-oriented dialog task, aiming to promote the technology of building end-to-end dialog systems one step further.Like last year, participants are encouraged to explore all possible approaches, and there is no restriction on dialog system architecture.
1) Data: Participants are expected to build dialog systems based on MultiWOZ 2.1 [14], a multi-domain dialog dataset spanning 7 distinct domains and containing over 10,000 dialogs in the travel planning setting. Compared with MultiWOZ 2.0 [23], MultiWOZ 2.1 re-annotated the states to fix noisy annotations and incorporated user dialog act annotations. Although the dialog system is evaluated on MultiWOZ 2.1, participants can leverage any public datasets, pre-trained models, or other resources to build the dialog system.
2) Evaluation Criteria: ConvLab-2 [24] is employed as the platform for dialog development and evaluation. As the successor of ConvLab [25], ConvLab-2 provides a user simulator and an evaluator for MultiWOZ 2.1 so that the participants can effectively run offline experiments and evaluations. Specifically, we offer two evaluation approaches:
a) Automatic Evaluation: The dialog system is evaluated by conversing with an end-to-end user simulator. The simulator is constructed by assembling a BERT-based natural language understanding model [26], an agenda-based user simulator [27], and a rule-based natural language generation module. A dialog is successful only if all requested slots are filled with grounded values in the database and the booking is successful. We report metrics including success rate, book rate, number of turns for each dialog, and precision/recall/F1 score for slot prediction.
b) Human Evaluation: In human evaluation, Amazon Mechanical Turk workers (MTurkers) communicate with the dialog systems via natural language, judge whether the dialog is successful, and provide scores for language understanding correctness and response appropriateness on a 5-point Likert scale. Since MTurkers do not directly access the back-end database, we also report the success rate with grounding, obtained after verifying whether the requested slot values returned by the dialog systems match the database records. We use the average of the success rates with and without grounding as the final ranking criterion.
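The final ranking criterion for human evaluation is a simple average, as the short sketch below shows. The team names and success rates are hypothetical placeholders, not challenge results, and the function names are illustrative.

```python
from typing import Dict, List, Tuple


def final_success(with_grounding: float, without_grounding: float) -> float:
    """Average of the success rates with and without database grounding."""
    return (with_grounding + without_grounding) / 2


def rank_teams(results: Dict[str, Tuple[float, float]]) -> List[str]:
    """Order teams by the averaged success rate, best first.

    `results` maps a team name to (success with grounding, success without
    grounding); both values are hypothetical inputs supplied by the caller.
    """
    return sorted(results, key=lambda team: final_success(*results[team]), reverse=True)
```

For instance, a team with rates (0.70, 0.75) averages 0.725 and ranks ahead of a team with rates (0.60, 0.80), which averages 0.70, even though the latter has the higher ungrounded success rate.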
3) Results: As per our submission policy, each team was allowed to submit up to 5 models. We received 34 models in total from 10 teams. Table VI lists the automatic evaluation results for the best model of each team. We filtered out low-performance models based on the automatic evaluation results, while keeping the best model of each team, and sent the remaining models for human evaluation. With this process, 21 models were assessed in human evaluation, with the performance of each team's best model listed in Table VII.
Team 1 achieved the top performance in both automatic and human evaluation by constructing an end-to-end dialog system with the pre-trained dialog generation model PLATO-2 [28]. This model generates the dialog state, system action, and system response simultaneously, given the dialog context. The dialog state is used as the constraint for the database query, and the system action is then refreshed according to the query results to re-generate the final system response. Team 2 achieved the same ranking as Team 1 in the human evaluation using a similar hybrid end-to-end neural model. It borrows ideas from [20] and [29], uses GPT-2 as the backend for pre-training and fine-tuning, and adds various pre- and post-processing modules to improve the model's generalization ability. An additional fault tolerance mechanism is also added to correct errors.
4) Summary: Compared with the challenge results at DSTC8, there is a trend of shifting from building dialog systems by assembling component-wise modules to end-to-end learning. In DSTC8, out of 11 teams with valid submissions, one team used GPT-2-based models, one team used word DST + word policy, and the remaining nine teams used component-wise models. This year, 8 out of 10 teams used end-to-end learning by leveraging transformer-based models. The top three systems in both automatic evaluation and human evaluation were all built using transformer-based end-to-end learning, and they achieved much better performance in human evaluation than the systems at DSTC8.

B. Cross-lingual Dialog State Tracking Task
We introduce the task of cross-lingual dialog state tracking, which requires the participants to build a dialog state tracker for the target language given a training set in the source language and a small development set in the target language. Based on the recently proposed large-scale multi-domain task-oriented dialog datasets MultiWOZ 2.1 [14] and CrossWOZ [30], we offer two sub-tasks: 1) cross-lingual transfer from English to Chinese using the MultiWOZ 2.1 dataset and 2) cross-lingual transfer from Chinese to English using the CrossWOZ dataset.
Following a similar scheme as in DSTC-5 [5], we provided machine translations of the original dataset. We collected 500 new dialogs in the target language as the test set. The performance of each dialog state tracker is evaluated on the test set and compared with the reference annotation.
1) Data: Compared with previous datasets [5], [31], [32] for cross-lingual transfer learning in task-oriented dialog, MultiWOZ 2.1 and CrossWOZ are much larger. MultiWOZ 2.1 contains over 10,000 dialogs, and CrossWOZ contains over 6,000 dialogs. They are also more challenging due to the multi-domain setting. For each sub-task, we prepared data in a similar way: a) collected 500 new dialogs in the source language, b) translated the ontology to the target language, and c) translated the original dialogs and the new dialogs. We released 250 new dialogs without any annotation as a public test set and reserved the other 250 dialogs as a private test set.
a) Test Data Collection: To collect new dialogs, we adapted the data collection website of CrossWOZ, where paired workers can converse synchronously and make annotations. New user goals were generated by the goal generator from ConvLab-2. Following the Wizard-of-Oz setting, one worker acts as the user who needs to accomplish the allocated goal, and the other acts as the system that uses the database to provide information. During the conversation, both sides need to annotate the dialog acts of their utterances, and the system should also annotate the dialog states, which are queries over the database.
b) Ontology Translation: We extracted the ontology from the dialog act and dialog state annotations of both the original and test datasets. Then we used Google Translate to translate it to the target language. For some slots whose values may not be faithfully translated, such as "name" and "address", we employed human translators to correct the translations. This process is vital to ensure the translation consistency of the same values in different contexts.
c) Dialog Translation: To make sure that the translations of values in a dialog are faithful to the ontology dictionary, we first replaced the values that appeared in the dialog with their translations in the dictionary. Then we used Google Translate to translate the resulting code-switching sentences from the original dataset and the test set. In this way, the translated dialogs and the corresponding annotations do not conflict. 250 dialogs were sampled from the original dataset as the development set. Human translators were employed to proofread the translations of the development and test sets.
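The value-replacement step that precedes machine translation can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the example dictionary entries are hypothetical, matching is exact and case-sensitive, and the subsequent call to a machine translation service is deliberately left out.

```python
import re
from typing import Dict


def replace_values(utterance: str, dictionary: Dict[str, str]) -> str:
    """Replace ontology values in an utterance with their dictionary translations.

    Longer values are tried first so that a short value (e.g. a city name)
    does not break up a longer value that contains it. The result is a
    code-switching sentence that would then be sent to a translator.
    """
    if not dictionary:
        return utterance
    values = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in values))
    # A single pass avoids re-replacing text inside an inserted translation.
    return pattern.sub(lambda match: dictionary[match.group(0)], utterance)


# Hypothetical ontology dictionary entries, for illustration only.
example_dictionary = {
    "Cambridge": "剑桥",
    "Cambridge Arts Theatre": "剑桥艺术剧院",
}
mixed = replace_values("I want to visit Cambridge Arts Theatre in Cambridge.", example_dictionary)
```

Because the slot values in the annotations are translated with the same dictionary, the values in the translated utterance and in the translated dialog state stay identical.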
2) Evaluation Criteria: We evaluate the performance of the dialog state tracker using the following metrics:
a) Joint Goal Accuracy. This metric evaluates whether the predicted dialog state is exactly equal to the ground truth.
b) Slot Accuracy. This metric evaluates whether each slot's predicted label is exactly equal to the ground truth, averaged over all slots.
c) Slot Precision/Recall/F1. These metrics evaluate the overlap between the predicted labels and the ground truth for non-empty slots, micro-averaged over dialog turns.
Each submission contains the predictions for the public test set and the model that is used to make predictions for the private test set. The results are averaged over the public and private test sets. The final ranking is based solely on the joint goal accuracy.
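The three metrics above can be sketched as follows. This is an illustrative implementation, not the official evaluation script: it assumes each turn's dialog state is a dictionary from slot name to value that contains only non-empty slots, and the function names are our own.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]  # slot name -> value; empty slots are simply absent


def joint_goal_accuracy(preds: List[State], golds: List[State]) -> float:
    """Fraction of turns whose predicted state exactly equals the ground truth."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def slot_accuracy(preds: List[State], golds: List[State], slots: List[str]) -> float:
    """Fraction of (turn, slot) pairs labeled correctly, counting empty slots."""
    correct = sum(
        p.get(s, "") == g.get(s, "") for p, g in zip(preds, golds) for s in slots
    )
    return correct / (len(golds) * len(slots))


def slot_prf(preds: List[State], golds: List[State]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over non-empty slot-value pairs."""
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        p_items, g_items = set(p.items()), set(g.items())
        tp += len(p_items & g_items)
        fp += len(p_items - g_items)
        fn += len(g_items - p_items)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Joint goal accuracy is the strictest of the three: a single wrong slot value makes the whole turn count as incorrect, which is why it can be far lower than slot accuracy on the same predictions.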
3) Results: The results of the MultiWOZ (en→zh) and CrossWOZ (zh→en) sub-tasks are shown in Tables VIII and IX, respectively. During the evaluation, we found that the CrossWOZ test data miss many "name" labels when the user accepts the attraction/hotel/restaurant recommended by the system. Therefore, we utilized the database search results and heuristic rules to correct empty "name" labels and provided an updated leaderboard for CrossWOZ in Table X. Both of the CrossWOZ leaderboards are valid, but the updated one is preferred. We adapted SUMBT [33] as the baseline model and trained it on the translated training set of the original dataset for both sub-tasks. We received 10 models for MultiWOZ (en→zh) and 8 models for CrossWOZ (zh→en) from the same 3 teams. We briefly introduce their best models here. Team 1 incorporated a four-class state operation prediction task into the CHAN model [34]. Team 2 modified SOM-DST [35] and used the ontology and some handcrafted rules to post-process the generated values. Team 3 formulated dialog state tracking as a sequence generation problem and used mBART to generate pairs of slot names and slot values. All of their best models were trained using the translated data in the target language.
4) Summary: To our surprise, all the best models were trained on monolingual machine-translated data rather than on both the original data and the translations. Teams 2 and 3 even obtained negative results when training XLM/mBART on the original data and the translations simultaneously. The performance of "Translate-Train" partially depends on the machine translator, which may be why Teams 1 and 2 augmented the data by using another translator to translate the original dataset. Teams 1 and 2 modified DST models that are state-of-the-art on the English MultiWOZ 2.1 dataset and obtained strong performance on the Chinese MultiWOZ 2.1, verifying these models' language portability.

IV. TRACK 3 - INTERACTIVE EVALUATION OF DIALOG

A. Track overview
The aim of dialog research is to create systems that can be effectively used in interactive settings by real users [36]. Despite this, the majority of research is performed on static datasets. For example, the task of response generation is typically done by producing a response for a static dialog context [37]. This track is intended to move dialog research beyond datasets and evaluate models in interactive environments with real users.
This track consists of two sub-tasks: (1) static evaluation and (2) interactive evaluation. The first subtask challenges participants to build response generation models that are evaluated in a static manner, using the Topical-Chat corpus [38]. The second subtask aims to extend dialog models beyond datasets and assess them in an interactive setting with real users, using DialPort [39]. In the first subtask, models must generate a response to a fixed dialog context. In contrast, in the second subtask, they must have a back-and-forth interaction with a real user. Through the two subtasks, this track challenges participants to take strong response generation models and develop strategies for making them effective in interactive settings.

B. Data
Participants in this track were free to train on any publicly available data or use any pre-trained models. The static evaluation in the first subtask was carried out on the Topical-Chat corpus [38]. Topical-Chat is a large collection of human-human knowledge-grounded open-domain conversations that consists of 11,319 dialogs and 248,014 utterances. For each conversational turn, several relevant facts are provided. Models must leverage these facts to generate a response. This dataset was chosen because it is, to our knowledge, the largest knowledge-grounded open-domain dataset presently available. Additionally, the choice of usable facts provides a mechanism for systems to tailor responses to a specific user's interests.
Since we continuously performed human evaluation over the duration of the challenge and used reference-free evaluation metrics [40], it was not strictly necessary for models to be trained on the Topical-Chat corpus. A strong pre-trained dialog model may perform well on the first subtask despite not being trained on the corpus.
The second subtask was not tied to a dataset. The interactive evaluation was carried out on DialPort [39] with real users recruited through Facebook Advertising.

C. Evaluation Criteria
The first subtask was evaluated using (1) ongoing human evaluation and (2) three automatic metrics: METEOR [41], BERTscore [42], and USR [40]. Human evaluation was carried out on Amazon Mechanical Turk with the annotation questionnaire used to collect the FED dataset [43]. Over the duration of the challenge, we carried out evaluation on the Topical-Chat frequent validation set. For human evaluation, 30 context-response pairs were sampled, and each one was labeled by 3 annotators. For the final evaluation, we carried out automatic evaluation on the frequent test set and performed human evaluation on 100 randomly sampled context-response pairs. The 100 dialog contexts used for the final evaluation were the same across the different systems.
The evaluation for the second subtask consisted of (1) collecting dialogs through conversations with real users on DialPort and (2) post-hoc assessment of the collected dialogs. Participants submitted dialog models (via an API) to DialPort. Real users were recruited through Facebook Advertising to interact with the submitted dialog systems. After gathering a sufficient number of conversations, we performed post-hoc assessment of the dialogs with the FED metric [43] and with human evaluation on Amazon Mechanical Turk, using the annotation questionnaire used to collect the FED dataset [43].
Throughout the challenge, we aimed to collect at least 100 conversations for each submitted system, excluding any dialogs with offensive terms (e.g., curse words, racist phrases). For each system, 100 conversations were evaluated with the FED metric and on Amazon Mechanical Turk, with 3 annotators labeling each dialog.
For the final submission, we gathered dialogs for all systems over the same time period. Ultimately, given a Facebook Advertising budget of $2500 and 11 systems (including two baselines), we obtained 4651 conversations (after removing offensive dialogs) with a total of 41,640 turns. We considered only the conversations that are at least four turns in length (2960 in total) for the final post-hoc assessment. For each system, we carried out human evaluation with 200 conversations of suitable length. Throughout the challenge, all individuals who interacted with the systems on DialPort did so for free, of their own volition, thereby avoiding common problems observed with paid users [44].
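The filtering steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: the blocklist, the substring matching rule, and the uniform random sampling of 200 conversations are our own choices, since the text does not specify how offensive dialogs were detected or how the evaluated conversations were chosen.

```python
import random
from typing import Iterable, List

Conversation = List[str]  # one string per turn


def select_for_assessment(
    conversations: List[Conversation],
    blocklist: Iterable[str],
    min_turns: int = 4,
    sample_size: int = 200,
    seed: int = 0,
) -> List[Conversation]:
    """Drop offensive and short conversations, then sample a fixed number.

    ASSUMPTION: a conversation is offensive if any turn contains a
    blocklisted term as a substring (case-insensitive); a real filter
    would likely be more careful than this.
    """
    terms = [t.lower() for t in blocklist]
    clean = [
        conv for conv in conversations
        if not any(term in turn.lower() for turn in conv for term in terms)
    ]
    long_enough = [conv for conv in clean if len(conv) >= min_turns]
    if len(long_enough) <= sample_size:
        return long_enough
    # A fixed seed keeps the evaluated subset reproducible.
    return random.Random(seed).sample(long_enough, sample_size)
```

Applying the length filter after the offensive-term filter mirrors the order in the text, where 4651 clean conversations were reduced to 2960 with at least four turns.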

D. Results
The challenge received 33 submissions to the first subtask and 9 submissions to the second subtask.
Table XI shows the results of the static evaluation on the Topical-Chat corpus [38] for the 10 best-performing systems according to the human evaluation. All of the top 10 systems used either pre-trained models or additional data, highlighting the importance of pre-training for open-domain response generation. This observation aligns with previous research, which has seen strong performance in open-domain response generation through the use of large-scale pre-training [45], [46].
In addition to performing ongoing human evaluation throughout the challenge, we assessed systems in the first subtask using three evaluation metrics. METEOR [41] and BERTscore [42] are referenced evaluation metrics that compare a generated output to a ground-truth response. In contrast, USR [40] is a reference-free evaluation metric that uses pre-trained models and self-supervised training objectives to estimate the quality of a response. Though none of the evaluation metrics is a perfect predictor of the final ranking, we find that USR correlates better with the system-level human performance (Spearman: 0.35, p < 0.05) than either METEOR (Spearman: 0.23, p > 0.05) or BERTscore (Spearman: 0.22, p > 0.05). The relatively low system-level correlation highlights the importance of performing ongoing human evaluation throughout the challenge. The poor performance of automatic metrics may in part be a consequence of the fact that several submissions did not fine-tune on the Topical-Chat corpus and instead relied on open-domain response generation capabilities learned through large-scale pre-training. As such, while the responses were favored by human annotators, the automatic metrics penalized them either for not having high word overlap with the ground truth (METEOR, BERTscore) or for not resembling the utterances in the Topical-Chat corpus (USR).
The results for the second subtask are shown in Table XII. System 6 is our DialoGPT baseline [45], fine-tuned on the Topical-Chat corpus without knowledge grounding. System 11 is our Transformer baseline, which was trained on the Topical-Chat corpus and uses tf-idf sentence similarity to retrieve relevant knowledge at inference time. The best-performing model, System 1, leverages large-scale pre-training in addition to strategies for producing more diverse responses. This system achieved first place in both subtasks: System 1 in Table XII corresponds to System 2 in Table XI.
FED [43], an unsupervised evaluation metric for interactive dialog, is shown to be a moderate predictor of the final ranking, with a system-level Spearman correlation of 0.49 (p = 0.13), though it correctly predicts the top two systems. We also note that the average number of turns for a particular system is a strong indicator of its quality here (Spearman: 0.94, p < 0.01). Real users are more inclined to interact with a better system, making the number of turns an important metric for assessing systems in interactive settings [47].
While many of the submissions in the first subtask perform similarly, the scores in Table XII are much more varied. This signifies that interactive evaluation tests the capabilities of systems more exhaustively and is therefore a more indicative measure of a system's capabilities. This observation was also made in prior work [43] when analyzing dialogs from Meena [46].
The Interactive Evaluation of Dialog track demonstrates both the feasibility and the importance of evaluating dialog systems in interactive settings with real users. We show that, with an advertising budget of $2500, we can collect more than 4000 dialogs on DialPort (2960 dialogs with at least 4 turns). The results of interactive evaluation are more varied (Table XII), suggesting that back-and-forth interactions with real users are challenging for dialog systems and that interactive evaluation is a better reflection of a system's capabilities.

V. TRACK 4 - SIMMC: SITUATED INTERACTIVE MULTI-MODAL CONVERSATIONAL AI

A. Track overview
The SIMMC challenge aims to lay the foundations for real-world assistant agents that can handle multimodal inputs and perform multimodal actions. We thus focus on task-oriented dialogs that encompass a situated multimodal user context in the form of a co-observed image or virtual reality (VR) environment. The context is dynamically updated on each turn based on the user input and the assistant action. Moon et al. [48] provide more details on the datasets and the models we provide.

B. Data
SIMMC contains about 13k human-to-human dialogs (totaling about 169k utterances). We chose shopping experiences, specifically furniture and fashion, as the domain for the SIMMC datasets because of the dynamic environment created by these domains, where rich multimodal interactions happen around visually grounded items.
SIMMC offers four key advantages over previous multimodal dialog datasets:
1) SIMMC assumes a co-observed multimodal context between a user and an assistant and records the ground-truth item appearance logs of each item that appears. SIMMC tasks emphasize semantic processing of the input modalities, while work in this area has traditionally focused heavily on raw image processing.
2) Compared with conventional task-oriented conversational datasets, the agent actions in the SIMMC datasets span a diverse action space (e.g., "rotate", "search", and "add to cart").
3) Agent actions can be enacted at both the object level (e.g., changing the view of a specific object within a scene) and the scene level (e.g., introducing a new scene or an image).
4) SIMMC emphasizes semantic processing. The proposed SIMMC annotation schema allows for a more systematic and structural approach to visual grounding of conversations, which is essential for solving challenging problems in real-world scenarios.
Datasets were collected through the SIMMC Platform [49], an extension to ParlAI [50] for multimodal conversational data collection and system evaluation that allows human annotators to each play the role of either the assistant or the user.

TABLE XIII: Summary of each team's results on the Test-Std split, averaged over Furniture and Fashion (*Team 5 submitted results only for Fashion). Best results from each team are shown. (1) API prediction via accuracy, perplexity, and attribute accuracy; (2) response prediction via BLEU, recall@k (k=1,5,10), mean rank, and mean reciprocal rank (MRR); (3) dialog state tracking (DST) via slot and intent prediction F1. ↑: higher is better, ↓: lower is better.

C. Evaluation Criteria
We present three subtasks primarily aimed at replicating human-assistant actions in order to enable rich and interactive shopping scenarios.
Subtask 1: Structural API Call Prediction focuses on predicting the assistant action as an API call given the dialog and the multimodal contexts as inputs. Since accuracy does not account for the existence of multiple valid actions, we use perplexity (defined as the exponentiation of the Shannon entropy) alongside accuracy. To also measure the correctness of the predicted action (API) arguments, we use attribute accuracy compared to the collected datasets.
Subtask 2: Response Prediction examines the relevance of the assistant response in the current turn. We evaluate in two ways: (a) as a conditional language modeling problem, where the closeness between the generated and ground-truth responses is measured using the BLEU-4 score, and (b) as a retrieval problem, where we measure the model performance when retrieving ground-truth responses from a pool of 100 candidates (randomly chosen and unique to each turn).
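As a rough illustration of the metrics named above, the following Python sketch computes perplexity for Subtask 1 and recall@k, mean rank, and MRR for the retrieval setting of Subtask 2. It is not the official evaluation script. The function names are hypothetical, and perplexity is estimated here as the exponentiated mean negative log-likelihood assigned to the ground-truth action, which is an assumption about how the entropy is computed in practice.

```python
import math
from typing import Dict, Sequence


def perplexity(gold_probs: Sequence[float]) -> float:
    """Exponentiated mean negative log-likelihood of the ground-truth actions.

    gold_probs[i] is the probability the model assigned to the ground-truth
    API call of turn i (assumed to be strictly positive).
    """
    if not gold_probs:
        raise ValueError("need at least one turn")
    mean_nll = -sum(math.log(p) for p in gold_probs) / len(gold_probs)
    return math.exp(mean_nll)


def rank_of_ground_truth(scores: Sequence[float], gt_index: int) -> int:
    """1-based rank of the ground-truth candidate within its pool.

    A candidate is ranked ahead of the ground truth only if its score is
    strictly higher (tie handling is an assumption of this sketch).
    """
    gt_score = scores[gt_index]
    return 1 + sum(1 for s in scores if s > gt_score)


def retrieval_metrics(
    gt_ranks: Sequence[int], ks: Sequence[int] = (1, 5, 10)
) -> Dict[str, float]:
    """recall@k, mean rank and mean reciprocal rank from 1-based ranks."""
    if not gt_ranks:
        raise ValueError("need at least one turn")
    n = len(gt_ranks)
    metrics = {f"recall@{k}": sum(1 for r in gt_ranks if r <= k) / n for k in ks}
    metrics["mean_rank"] = sum(gt_ranks) / n
    metrics["mrr"] = sum(1.0 / r for r in gt_ranks) / n
    return metrics
```

In the challenge setting, each turn's pool would hold 100 candidates, so every rank passed to `retrieval_metrics` lies between 1 and 100.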

D. Results
The challenge saw a total of 11 model entries from 5 teams across the world, setting a new state-of-the-art in all three subtasks (Table XIII).
For each subtask, we listed metrics in a priority order, and the entry with the most favorable performance on the highest-priority metric was considered to be a candidate winner. Any entries within one standard error of this candidate's performance were also considered as candidates. Where there was more than one candidate, as in subtask 1, we used the next metric in the priority list and repeated this process until we had a single winner.
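The selection procedure described above can be sketched in Python as follows. This is an illustrative reading of the text, not the organizers' code: the `Entry` class and `select_winner` function are hypothetical names, and the sketch assumes that "one standard error" refers to the standard error of the current best entry.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Entry:
    name: str
    # metric name -> (value, standard error); layout is an assumption
    scores: Dict[str, Tuple[float, float]]


def select_winner(
    entries: Sequence[Entry], priority: Sequence[Tuple[str, bool]]
) -> List[Entry]:
    """Filter candidates metric by metric, in priority order.

    priority is a list of (metric_name, higher_is_better) pairs. The
    function returns the remaining candidates; more than one entry is
    returned only if the metrics run out before the tie is broken.
    """
    candidates = list(entries)
    for metric, higher_is_better in priority:
        sign = 1.0 if higher_is_better else -1.0
        best = max(candidates, key=lambda e: sign * e.scores[metric][0])
        best_value, best_stderr = best.scores[metric]
        # Keep every entry within one standard error of the best entry.
        candidates = [
            e for e in candidates
            if abs(e.scores[metric][0] - best_value) <= best_stderr
        ]
        if len(candidates) == 1:
            break
    return candidates
```

For example, with made-up scores where two entries are within one standard error on accuracy, the tie falls through to perplexity, where lower is better:

```python
entries = [
    Entry("A", {"accuracy": (0.800, 0.01), "perplexity": (3.0, 0.1)}),
    Entry("B", {"accuracy": (0.795, 0.01), "perplexity": (2.5, 0.1)}),
    Entry("C", {"accuracy": (0.700, 0.01), "perplexity": (2.0, 0.1)}),
]
winners = select_winner(entries, [("accuracy", True), ("perplexity", False)])
```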
The winner of the structural API call prediction subtask (subtask 1) was a BART [51] model (BART-Large) from Team 4 that jointly predicted the dialog state (subtask 3), API call (subtask 1) and response (subtask 2a) as a single target given the dialog history, multimodal context and user utterance. This model was one of two runners-up on subtask 2a, and the runner-up on subtask 3.
The winner of the response retrieval subtask (subtask 2b) was a BART-based Bi-encoder [52], [53], [54], also from Team 4, whose weights were initialized from the jointly trained BART model that won subtask 1. This model achieved a mean reciprocal rank (MRR) of 0.67, 0.29 points ahead of the runner-up team on this subtask.
The winner of the response generation and DST subtasks (subtask 2a and subtask 3) was an ensemble of GPT-2 [22] models from Team 3 that were of differing sizes (large and small) and used differing portions of the training and development sets. Each GPT-2 model was independently trained on the joint tasks (subtask 2a and subtask 3) using a simple language model loss that optimized over the concatenated dialog history, multimodal context, user utterance, dialog state and response. Dialog states were preprocessed before training, and an ensemble beam search over each model's prediction was used to generate the final prediction.
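To make the idea of a single concatenated training sequence concrete, the following Python sketch flattens one turn into text that a causal language model could be trained on. It is only an illustration: the separator tokens, the field layout, and the `serialize_turn` function are hypothetical and are not taken from the submitted systems.

```python
from typing import Dict, List

# Hypothetical separator tokens; the submitted systems may use other markers.
CONTEXT_END = "<EOC>"
STATE_END = "<EOB>"
RESPONSE_END = "<EOS>"


def serialize_turn(
    history: List[str],
    multimodal_context: str,
    user_utterance: str,
    dialog_state: str,
    response: str,
) -> Dict[str, str]:
    """Flatten one turn into a prompt, a target, and the full sequence.

    The order follows the text: dialog history, multimodal context and
    user utterance first, then the dialog state and the response.
    """
    prompt = " ".join(history + [multimodal_context, user_utterance, CONTEXT_END])
    target = f"{dialog_state} {STATE_END} {response} {RESPONSE_END}"
    return {"prompt": prompt, "target": target, "text": f"{prompt} {target}"}
```

At inference time, a model trained this way would be given only the prompt and asked to generate the dialog state and the response in one pass.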

VI. CONCLUSIONS
This paper summarizes the four tracks of the Ninth Dialog System Technology Challenge (DSTC9). The Beyond Domain APIs track expands the coverage of current task-oriented dialog systems by incorporating external unstructured knowledge sources. The Multi-domain Task-oriented Dialog Challenge II focuses on end-to-end multi-domain task completion dialog and cross-lingual multi-domain dialog state tracking. The Interactive Evaluation of Dialog track expands dialog research beyond datasets and encourages the development of dialog systems that can converse effectively in interactive environments. The Situated Interactive Multi-Modal Conversational AI track focuses on real-world assistant agents that can handle multi-modal inputs and perform multi-modal actions. All the datasets and resources introduced for every track will remain publicly available after the challenge period to support future dialog system research.

Subtask 3: Dialog State Tracking (DST) aims to systematically track the dialog acts and the associated slot pairs across multiple turns, as represented in the flexible ontology developed to represent the SIMMC multimodal context. We use the intent and slot prediction metrics (F1), in line with prior work in DST.
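A slot-level F1 score of the kind mentioned above could be computed as in the following Python sketch. It assumes micro-averaging over all turns and exact matching of (slot, value) pairs; the official evaluation script may differ in both respects, and the function name `slot_f1` is hypothetical.

```python
from typing import Dict, Sequence, Set, Tuple

SlotPair = Tuple[str, str]  # (slot name, slot value)


def slot_f1(
    predicted: Sequence[Set[SlotPair]], gold: Sequence[Set[SlotPair]]
) -> Dict[str, float]:
    """Micro-averaged precision, recall and F1 over per-turn slot sets."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    tp = fp = fn = 0
    for pred_slots, gold_slots in zip(predicted, gold):
        tp += len(pred_slots & gold_slots)   # correctly predicted pairs
        fp += len(pred_slots - gold_slots)   # predicted but not in gold
        fn += len(gold_slots - pred_slots)   # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}
```

Intent F1 can be computed the same way by treating each turn's predicted dialog act as a one-element set.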

TABLE I: Summary of Track 1 tasks

TABLE II: Statistics of the Track 1 data sets

TABLE III: Objective evaluation metrics for the Track 1 tasks

TABLE V: Human evaluation results for the Track 1 finalists

TABLE VI: Automatic Evaluation Result (Best Submissions)

TABLE XI: Results for subtask 1. For brevity, we only show the top 10 submissions (out of 33) according to the human evaluation. This table only reports the overall USR metric and the overall impression of the response from the human evaluation. The full evaluation results may be found here.

TABLE XII: Results for subtask 2. This table reports for each system: the overall FED metric, the overall impression of the dialogs from the human evaluation, as well as the average number of dialog turns. The full results can be found here. Systems 6 and 11 are our DialoGPT and Transformer baselines, respectively.