DxFormer: a decoupled automatic diagnostic system based on decoder–encoder transformer with dense symptom representations

Abstract

Motivation: Symptom-based automatic diagnostic systems query a patient's potential symptoms through continuous interaction and make predictions about possible diseases. Several studies use reinforcement learning (RL) to learn an optimal policy over the joint action space of symptoms and diseases. However, existing RL (and non-RL) methods focus on disease diagnosis while neglecting symptom inquiry. Although these systems achieve considerable diagnostic accuracy, they still fall far below the accuracy upper bound because they interact with patients for only a few turns and recall too few symptoms. To address this problem, we propose DxFormer, a new automatic diagnostic framework that decouples symptom inquiry from disease diagnosis so that the two modules can be optimized independently. The transition from symptom inquiry to disease diagnosis is determined parametrically by a stopping criterion. In DxFormer, each symptom is treated as a token, and symptom inquiry and disease diagnosis are formalized as a language generation task and a sequence classification task, respectively. We use an inverted Transformer, i.e. a decoder–encoder structure, to learn dense symptom representations by jointly optimizing the REINFORCE reward and the cross-entropy loss.

Results: We conduct experiments on three real-world medical dialogue datasets, and the results verify the feasibility of increasing diagnostic accuracy by improving symptom recall. By decoupling symptom inquiry from diagnosis, DxFormer overcomes the shortcomings of previous RL-based methods, greatly improves symptom recall, and achieves state-of-the-art diagnostic accuracy.

Availability and implementation: Code and data are available at https://github.com/lemuria-wchen/DxFormer.

Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
The combination of the internet and healthcare brings substantial benefits and far-reaching positive effects, improving service efficiency and promoting social equity. Automated disease diagnosis is one of the rising needs in this new healthcare model; its goal is to simulate the actual diagnostic process of doctors.
The process of disease diagnosis can be viewed as a sequence of queries and answers. Doctors choose relevant questions to ask the patient to gain a better understanding of the patient's physical condition (Martinez et al., 2020; Janisch et al., 2020). In symptom-based automatic disease diagnosis, the agent has two types of actions: inquiring about the patient's possible symptoms, and predicting the patient's disease (Peng et al., 2018). The diagnostic process thus consists of several turns of symptom inquiry followed by a final turn of disease diagnosis.
Recent years have witnessed an emerging trend of research on automatic diagnosis. Given its interactive nature, most researchers model the problem with reinforcement learning (RL) (Wei et al., 2018; Yu et al., 2019; Xu et al., 2019; Liao et al., 2020). At each turn of interaction, the agent chooses an action from the joint action space of all symptoms and diseases. Correct symptom inquiries and disease diagnoses are positively rewarded, and the policy is learned by maximizing the expected cumulative reward.
Although RL-based approaches have made progress in developing automatic diagnostic agents, they lack exploration of symptom modeling and consequently suffer from poor symptom recall. As shown in Figure 1, the symptom recall of most systems is below 30%. In these systems, the agent often asks the patient about only one or two symptoms and rushes to make a diagnosis. In this case, the system approximately degenerates into a pure disease classifier, and the insufficiently recalled symptoms leave the diagnostic accuracy far below its upper bound.
We believe traditional RL approaches perform poorly in symptom recall for two reasons: 1) In the traditional RL framework, the action spaces of symptom inquiry and disease diagnosis are coupled (joint), and the agent selects an action from this joint space at each step. Once the chosen action is a disease, the entire session terminates. The agent may be "afraid" of asking wrong symptoms that incur negative rewards and thus make a diagnosis eagerly, converging to a locally optimal solution. Under this setting, the number of interactive turns is determined by the reward and training configuration and is not controllable; 2) In most past studies, symptoms are one-hot encoded as separate categories and a simple MLP is used for policy learning. This setting handles high-dimensional action spaces poorly. Moreover, since symptoms serve as the input features of the disease classifier, increasing the number of symptoms leads to a sparser feature space, which may become intractable for the classifier.
To alleviate these problems, we propose DxFormer, a decoupled automatic diagnostic framework built on a Transformer-based decoder–encoder (not encoder–decoder) structure. The decoder performs symptom inquiry: symptoms are treated as tokens, i.e. words in natural language, and symptom inquiry is modeled as conditional text generation. The encoder performs disease diagnosis: the symptoms obtained by the decoder are fed to it as an input sequence, and diagnosis is modeled as sequence classification. The decoder is encouraged to discover implicit symptoms and the encoder to make correct diagnoses; the two work together in a decoupled manner and can be trained simultaneously with little interference with each other. Dense representations of symptoms are learned by optimizing the joint objectives. At runtime, the termination of symptom inquiry is determined by a stopping criterion: the agent switches from symptom inquiry to disease diagnosis only when the encoder's confidence in its diagnosis reaches a certain threshold or the maximum number of turns is reached.
To evaluate DxFormer, we conduct experiments on three real-world structured medical dialogue datasets: Dxy, MZ-4 and MZ-10. Experimental results verify that DxFormer greatly improves symptom recall and diagnostic accuracy compared with previous state-of-the-art methods. We also conduct ablation experiments to demonstrate the effectiveness of DxFormer's components, and further discuss the impact of the maximum number of turns and the stopping-criterion threshold on model performance.
The main contributions of this paper can be summarized as follows: 1) We propose DxFormer, a decoupled system for automatic diagnosis based on an inverted Transformer, in which dense representations of symptoms are learned by optimizing the joint objectives of the decoder and encoder; 2) We discuss the impact of the maximum number of turns and the stopping-criterion threshold on model performance and suggest ways to use DxFormer in practice; 3) Extensive experiments show that the proposed model achieves new state-of-the-art (SOTA) results on all three public datasets.

Formalization
In this section, we introduce several key concepts related to automatic diagnostic systems to help the reader understand our motivation.
MCR In practice, annotated structured Medical Consultation Records (MCRs) are used to build automatic diagnostic systems (Wei et al., 2018; Zeng et al., 2020). A large number of MCRs organized by disease category are available in online medical communities such as Haodafu. Generally, each MCR consists of the patient's self-provided report (i.e. self-report), a multi-turn doctor–patient dialogue and the corresponding disease category. The self-report can be viewed as the first sentence in the dialogue.
Symptom Attribute Symptoms are ubiquitous in actual doctor–patient conversations; they are the main topics discussed in medical dialogues and an important basis for doctors' diagnoses (Zeng et al., 2020; Chen et al., 2022). However, symptoms alone are not very informative: additional annotation is needed to establish the relationship between symptoms and the patient. Generally, there are two kinds of relationships between a given symptom and the patient: 1) Positive (POS): the patient is sure to have the symptom; 2) Negative (NEG): the patient is sure not to have the symptom. The annotator is required to find all symptom entities mentioned in the dialogue and identify their relationship with the patient. In this paper, we refer to this relationship as the Attribute of the symptom.
Structured MCR Let S denote the set of all possible symptoms, D the set of possible diseases, and A the set of possible attributes. A structured MCR can then be denoted as {(s_1, a_1), (s_2, a_2), ..., (s_n, a_n), d}, where s_i ∈ S is the i-th symptom that appears in the dialogue (with the self-report as the first utterance), a_i ∈ A is the corresponding attribute of s_i, and d ∈ D is the disease label.
Explicit & Implicit Symptoms Generally, symptoms appearing in the self-report are regarded as explicit symptoms, while the others are implicit symptoms. In our notation, the first k symptoms are the explicit symptoms, denoted S_exp = {s_1, ..., s_k}, and the implicit symptoms are denoted S_imp = {s_{k+1}, ..., s_n}. k is usually small because patients typically mention only one or two symptoms in their self-reports. Implicit symptoms are unknown during inference, so the agent needs to find as many of them as possible to obtain a more complete symptom profile of the patient before making a diagnosis.
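As a concrete illustration, the definitions above can be sketched in Python. The record below is a toy example: the symptom names, attribute labels and field names (`symptoms`, `disease`, `num_explicit`) are our own illustrative choices, not a prescribed schema.

```python
# A toy structured MCR. The first k symptoms (from the self-report) are
# explicit; the rest, extracted from the dialogue, are implicit and
# hidden from the agent at inference time.
mcr = {
    "symptoms": [("cough", "POS"), ("fever", "POS"),
                 ("runny_nose", "NEG"), ("diarrhea", "POS")],
    "disease": "upper_respiratory_infection",
    "num_explicit": 2,  # k: number of symptoms mentioned in the self-report
}

def split_symptoms(record):
    """Split a structured MCR into explicit and implicit symptom lists."""
    k = record["num_explicit"]
    return record["symptoms"][:k], record["symptoms"][k:]

s_exp, s_imp = split_symptoms(mcr)  # S_exp has k entries, S_imp the rest
```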

Patient Simulator
We denote the patient simulator as P. P can be viewed as a function whose input is any symptom and whose output is the patient's attribute for that symptom. Note that if the symptom is not among the implicit symptoms, an Unknown (UNK) attribute is returned.
Agent Given a patient's explicit symptoms S_exp and their attributes, the task of the agent is to choose a symptom from S to present to the patient simulator P, choose the next symptom after receiving the feedback, and so on for several turns. The dialogue terminates when the agent finally makes a diagnosis, i.e. selects a disease from D. The goal of the agent is to learn a policy that efficiently finds implicit symptoms, obtains more complete information about the patient, and finally makes a correct diagnosis.
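A minimal patient simulator matching this description can be sketched as follows; the class and method names are illustrative, not taken from the paper's code.

```python
class PatientSimulator:
    """Answers symptom queries from the ground-truth implicit symptoms.

    Returns the annotated attribute (POS/NEG) if the queried symptom is
    among the patient's implicit symptoms, and "UNK" otherwise.
    """

    def __init__(self, implicit_symptoms):
        # implicit_symptoms: iterable of (symptom, attribute) pairs
        self.implicit = dict(implicit_symptoms)

    def query(self, symptom):
        return self.implicit.get(symptom, "UNK")

p = PatientSimulator([("runny_nose", "NEG"), ("diarrhea", "POS")])
```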

Accuracy Bound
To explain our motivation, we first discuss how automatic diagnostic systems are evaluated. Most related literature considers two metrics: symptom recall and diagnostic accuracy. Symptom recall (SX-Rec) is the proportion of implicit symptoms inquired by the agent. Assuming the sequence of symptoms asked by the agent for a patient is S_agt, SX-Rec measures the agent's ability to find implicit symptoms and takes a value between 0 and 1. Diagnostic accuracy (DX-Acc) is the final metric we aim to improve, since making a correct diagnosis is the agent's most important step. Notably, the input of a disease classifier is the patient's symptom features. If SX-Rec equals 0, only the explicit symptoms S_exp can be used as features; we call the accuracy of the system in this case the accuracy lower bound (Acc-LB). When SX-Rec is 1, both explicit and all implicit symptoms can be used as features; the accuracy in this case is called the accuracy upper bound (Acc-UB).
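Under this definition, SX-Rec is a simple set ratio; the sketch below treats the edge case of a patient with no implicit symptoms as perfect recall, which is our own convention.

```python
def symptom_recall(s_agt, s_imp):
    """SX-Rec: fraction of implicit symptoms that the agent asked about.

    s_agt: symptoms asked by the agent; s_imp: ground-truth implicit
    symptoms. Returns 1.0 when there are no implicit symptoms (our own
    convention for the degenerate case).
    """
    implicit = set(s_imp)
    if not implicit:
        return 1.0
    return len(set(s_agt) & implicit) / len(implicit)
```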
We present the accuracy bounds for the three datasets (Dxy, MZ-4 and MZ-10) in Table 1 as a rough reference. We train support vector machine (Cortes and Vapnik, 1995) (SVM) classifiers using five-fold cross-validation on the training set and report the accuracy on the test set. Among them, Acc-LB means that only explicit symptoms are used as features, Acc-UB means that all symptoms are used, Acc-UB (P) means that explicit symptoms and positive implicit symptoms are used, and Acc-UB (N) means that explicit symptoms and negative implicit symptoms are used.
The results in Table 1 illustrate that the diagnostic accuracy of previous SOTA systems is far from the accuracy upper bound of SVM classifier, especially for MZ-10 ( Figure 1). It also suggests that both positive and negative implicit symptoms are useful for disease diagnosis. Besides, the results also give a non-rigorous reference for the agent, that is, the accuracy of a reasonable agent should be roughly between Acc-LB and Acc-UB.
The above discussion leads to the core motivation of this paper, which is to improve the diagnostic accuracy (DX-Acc) via improving the symptom recall (SX-Rec).

Method
In this section, we will introduce the two components of DxFormer, namely the decoder for symptom inquiry and the encoder for disease diagnosis.

Decoder for Symptom Inquiry
For symptom inquiry, if there were no limit on the number of turns, the agent could find all implicit symptoms by simply traversing the symptoms in S. In that case, the recall would equal 1 and the accuracy could reach its upper bound. However, the size of S, i.e. the action space of symptom inquiry, can be large, so this policy is plainly inefficient. Fortunately, owing to the sequential nature of symptom inquiry and the apparent co-occurrence between symptoms (Liao et al., 2020), training a more efficient agent is promising.

Architecture
In DxFormer, we liken the process of symptom inquiry to a language model (Bengio et al., 2003). Symptoms are regarded as words, and symptom attributes as word features, so symptom inquiry can be regarded as a text-generation problem. We use a multi-layer Transformer decoder, a variant of the Transformer (Radford et al., 2018), as the autoregressive model. The model applies a multi-headed self-attention operation over the historical symptom–attribute sequence, followed by position-wise feed-forward layers, to produce an output distribution over target symptoms.

Input Representation
Dense input representations are designed in this work. For each symptom, the input embedding is the sum of the corresponding symptom, attribute and position embeddings; a visual example is shown in Figure 2(a). For symptoms and attributes, we use embedding layers to map any symptom in S and any attribute in A into dense vectors of the same dimension. For position embeddings, we adopt the sinusoidal position encoding of the original Transformer (Vaswani et al., 2017). The input sequence of symptoms is the concatenation of the explicit symptoms and the agent's previously asked symptoms.
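The input construction can be sketched as follows. The dimensions are illustrative, the lookup tables are randomly initialized stand-ins for learned embedding layers, and the sinusoidal formula follows the original Transformer recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_symptoms, n_attrs, max_len, dim = 100, 4, 32, 16  # illustrative sizes

# Stand-ins for learned embedding tables (randomly initialized here).
sym_emb = rng.normal(size=(n_symptoms, dim))
attr_emb = rng.normal(size=(n_attrs, dim))

def sinusoidal_positions(max_len, dim):
    """Fixed sinusoidal position encodings (original Transformer recipe)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pos_emb = sinusoidal_positions(max_len, dim)

def input_representation(symptom_ids, attr_ids):
    """Input embedding = symptom + attribute + position embeddings."""
    t = len(symptom_ids)
    return sym_emb[symptom_ids] + attr_emb[attr_ids] + pos_emb[:t]

# Three symptoms with attributes encoded as ids 0/1 (illustrative).
x = input_representation([3, 17, 42], [0, 1, 0])  # shape (3, dim)
```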

RL formalization
As the decoder for symptom inquiry, our goal is to maximize symptom recall within a certain number of turns, which differs from the objective of language models, i.e. maximizing the conditional likelihood. Because this objective is non-differentiable, we cast our generative model in RL terminology, as in Ranzato et al. (2015). The Transformer decoder can be viewed as an agent interacting with the patient simulator P; its parameters θ define a policy p_θ that selects an action, i.e. predicts the next possible symptom given the known symptoms.

Reward Setting
Considering factors such as efficiency and rationality of symptom inquiry, we design the following reward mechanism.
Priori Reward A particular disease is often related to a certain group of symptoms rather than to all symptoms (Liao et al., 2020). Therefore, the agent is encouraged to ask about symptoms related to the specific disease. We achieve this through the disease–symptom co-occurrence frequency matrix computed on the training set. For each symptom in S_agt, if the co-occurrence frequency of the symptom and the disease corresponding to the case exceeds a certain threshold, a positive reward of +1 is given; otherwise a negative reward of −1 is given.

[Table 2. Data statistics of Dxy, MZ-4 and MZ-10; values of the form "a/b" in the last two columns give the average and maximum, respectively. Exp is short for Explicit and Imp for Implicit.]

Ground Reward The agent is encouraged to inquire about implicit symptoms to increase symptom recall. For any symptom in S_agt, a reward of +2.5 is given if the symptom is also in S_imp, and −0.5 otherwise. The priori reward prevents the agent from asking unrelated, strange symptoms, while the ground reward facilitates the discovery of implicit symptoms.
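The two-part reward can be sketched as below; the co-occurrence threshold value and the function signature are our own illustrative choices (the paper does not specify them).

```python
def step_reward(symptom, disease, s_imp, cooccur, threshold=0.1):
    """Per-symptom reward: priori reward plus ground reward.

    Priori: +1 if the symptom co-occurs with the case's disease more
    often than `threshold` in the training set, else -1 (the threshold
    value here is illustrative, not the paper's).
    Ground: +2.5 if the symptom is a true implicit symptom, else -0.5.
    `cooccur` maps (disease, symptom) to a co-occurrence frequency.
    """
    priori = 1.0 if cooccur.get((disease, symptom), 0.0) > threshold else -1.0
    ground = 2.5 if symptom in s_imp else -0.5
    return priori + ground
```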

Training Objective
The final reward of each action in S_agt equals the sum of these two rewards, and the training objective of the decoder for symptom inquiry is to minimize the negative expected reward:

L(θ) = −E_{τ∼p_θ}[R(τ)],

where τ = {r_1, r_2, ..., r_m} is the random reward sequence induced by p_θ, and R(τ) = Σ_{i=1}^{m} r_i. To compute the gradient ∇_θ L(θ), we use the REINFORCE algorithm (Williams, 1992), which approximates the expected reward with a single Monte-Carlo sample from p_θ for each training example in the minibatch. As in traditional text generation, we initialize the decoder parameters θ by language-model pre-training with a maximum-likelihood objective.
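A minimal, self-contained REINFORCE sketch on a toy categorical policy (not the paper's symptom decoder) illustrates the single-sample Monte-Carlo gradient estimate; the reward function and sizes are made up for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(logits, reward_fn, rng, lr=0.1):
    """One REINFORCE update on a toy categorical policy.

    Draws a single Monte-Carlo action from the softmax policy p_theta,
    scores it with reward_fn, and ascends reward * grad log p(action).
    """
    probs = softmax(logits)
    action = rng.choice(len(probs), p=probs)
    reward = reward_fn(action)
    grad_log_p = -probs            # d/dlogits log softmax = one_hot - probs
    grad_log_p[action] += 1.0
    return logits + lr * reward * grad_log_p, action, reward

rng = np.random.default_rng(0)
logits = np.zeros(4)
# Toy reward: asking "symptom 2" is correct (+1), anything else is -1.
for _ in range(200):
    logits, _, _ = reinforce_step(logits, lambda a: 1.0 if a == 2 else -1.0, rng)
```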

Encoder for Disease Diagnosis
We decouple disease diagnosis from symptom inquiry for the following reasons: 1) Reward design is difficult in a coupled system, where the agent may struggle over whether to continue asking about symptoms or to make a diagnosis; 2) Quite a few unknown (UNK) symptoms may be collected during symptom inquiry, and these may interfere with the disease classifier.

Architecture
Upon the termination of symptom inquiry, we extract all positive and negative symptoms obtained by the agent. We adopt a multi-layer Transformer encoder to encode these symptoms, followed by an average-pooling layer and a linear layer to produce an output distribution over target diseases. As shown in Figure 2(b), the Transformer encoder in our disease classifier differs from the decoder for symptom inquiry in three ways: 1) The encoder is bidirectional, while the decoder is unidirectional; 2) Position embeddings are removed from the encoder's input representation, since intuitively the disease classifier should be insensitive to the order of symptoms; 3) Our encoder is shallower than the decoder (M < N), mainly because symptom inquiry is more complex and requires more parameters. Note that the encoder and decoder share the parameters of the symptom and attribute embeddings.
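The classification head can be sketched as follows (omitting the Transformer layers themselves, and using randomly initialized stand-ins for the learned weights). The mean pooling makes the prediction insensitive to symptom order, mirroring the removal of position embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_diseases = 16, 4  # illustrative sizes

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Classification head: average pooling over the encoder's symptom
# representations, then a linear layer over target diseases.
W = rng.normal(size=(dim, n_diseases))
b = np.zeros(n_diseases)

def diagnose(symptom_states):
    """symptom_states: (seq_len, dim) array of encoded symptoms."""
    pooled = symptom_states.mean(axis=0)   # order-insensitive pooling
    return softmax(pooled @ W + b)

probs = diagnose(rng.normal(size=(5, dim)))  # distribution over 4 diseases
```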

Training
In DxFormer, the encoder is trained jointly with the decoder. Given a patient's explicit symptoms, we obtain S_sdec and S_gdec from the decoder using sampling decoding and greedy decoding, respectively: S_sdec is used to compute the REINFORCE reward, and S_gdec is used to compute the cross-entropy loss between the predicted and true disease distributions. The final loss equals the sum of the negative REINFORCE reward and the cross-entropy loss. The parameters are initialized by pre-training the decoder and encoder simultaneously on the ground-truth implicit symptoms.

Stopping Criterion
The transition from decoder to encoder is controlled by the stopping criterion. During training, we specify a maximum number of turns T_max for symptom inquiry, and the agent's goal is to find as many symptoms as possible within T_max turns and make a correct diagnosis. However, always asking T_max turns is inefficient: the key symptom information may already have been found, and continuing to question the patient to obtain unknown or insignificant symptoms is unnecessary.
We take a simple but effective approach to stopping the symptom inquiry early. After each turn, the currently obtained symptom information is fed to the encoder to obtain a probability distribution over the possible diseases. The symptom inquiry is terminated once the probability of the chosen disease exceeds a certain threshold ε. We analyze the effect of ε in § 4.7.

[Table 3. Experimental results of DxFormer on the Dxy, MZ-4 and MZ-10 datasets.]
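This stopping criterion can be sketched as an interaction loop; the three callables stand in for the decoder, the patient simulator and the encoder, and their signatures are our own illustrative choices.

```python
def run_dialogue(next_symptom, answer, classify, t_max=10, eps=0.99):
    """Symptom-inquiry loop with an early-stopping criterion.

    next_symptom(history): decoder stand-in, proposes the next symptom.
    answer(symptom): patient-simulator stand-in, returns the attribute.
    classify(history): encoder stand-in, returns {disease: probability}.
    Inquiry stops once the top disease probability reaches eps, or after
    t_max turns; the most probable disease is then returned.
    """
    history = []
    probs = classify(history)  # diagnosis from explicit symptoms only
    for _ in range(t_max):
        symptom = next_symptom(history)
        history.append((symptom, answer(symptom)))
        probs = classify(history)
        if max(probs.values()) >= eps:
            break
    return max(probs, key=probs.get), history
```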

Experimental Datasets
We evaluate DxFormer on three public real-world medical dialogue datasets: Dxy, MZ-4 and MZ-10, each of which consists of a number of annotated structured MCRs as described in § 2.1. Detailed statistics of the datasets are listed in Table 2.
MZ-4 (Wei et al., 2018) The first human-labeled dataset for evaluating automatic diagnostic systems, collected from the pediatric department of the Baidu Muzhi Doctor website 2. MZ-4 includes four diagnosed diseases: children's bronchitis, children's functional dyspepsia, infantile diarrhea infection and upper respiratory infection.
Dxy (Xu et al., 2019) An annotated medical dialogue dataset collected from Dingxiang Doctor 3, a popular Chinese online healthcare website. Dxy includes five diagnosed diseases: allergic rhinitis, upper respiratory infection, pneumonia, children's hand-foot-mouth disease and pediatric diarrhea.

MZ-10 (Chen et al., 2022) A dataset with multi-level annotations, expanded from MZ-4 to cover 10 diseases, including typical diseases of the digestive, respiratory and endocrine systems. MZ-10 also contains more symptoms.
Notably, MZ-4 and MZ-10 contain on average 5.5 and 6.6 implicit symptoms per record, respectively, and more than 20 at most, significantly more than Dxy. This suggests that symptom inquiry is harder on the MZ-4 and MZ-10 datasets.

Baselines
We compare DxFormer with some state-of-the-art models for automatic disease diagnosis that use different techniques, including reinforcement learning (RL), generative adversarial network (GAN), and variational autoencoder (VAE).
DQN (Wei et al., 2018) An agent based on the Deep Q-Network (DQN) algorithm that adopts the joint action space of symptoms and diseases, where a positive reward is given to the agent upon a successful diagnosis.
REFUEL (Peng et al., 2018) A policy-based RL method with reward shaping and feature rebuilding, where a branch that reconstructs the symptom vector is used to guide the policy gradient.

KR-DQN (Xu et al., 2019) An improved RL method based on DQN that integrates relational refinement branches and knowledge-routed graphs to strengthen the relation between diseases and symptoms.

2 https://muzhi.baidu.com/
3 https://dxy.com/
GAMP (Xia et al., 2020) A GAN-based policy-gradient network. GAMP uses the GAN framework to avoid random symptom trials and adds mutual information to encourage the model to select the most discriminative symptoms.
BSODA (He et al., 2022) A non-RL bipartite framework that uses an information-theoretic reward to collect symptoms and a multimodal variational autoencoder (MVAE) for disease prediction with a two-step sampling strategy.
We use an open-source implementation for DQN, REFUEL, KR-DQN and GAMP, since none of these papers provides an official repository or code, and symptom recall is not reported in most of them.

Model Configuration
DxFormer is composed of a 4-layer decoder and a 1-layer encoder. For Dxy, the embedding and hidden size is set to 128 and the feed-forward size to 256; for MZ-4 and MZ-10, the embedding and hidden size is set to 512 and the feed-forward size to 1024. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 3 × 10^−4 for maximum-likelihood pre-training and 1 × 10^−4 for RL training. All experiments are performed on 4 Nvidia Tesla V100 32G GPUs. Following the conventional setting, all baseline models as well as DxFormer set the maximum number of turns for symptom inquiry to 10, and the threshold ε is set to 0.99.

Overall Performance
In Table 3, we compare DxFormer with the baseline models. Under the same setting, DxFormer greatly improves symptom recall: compared with the baseline with the highest recall, SX-Rec improves by roughly 12–27 absolute percentage points across the three datasets. The diagnostic accuracy of DxFormer also surpasses all previous SOTA results; in particular, on MZ-10, DxFormer improves accuracy by about 14 absolute percentage points over the best baseline.
Considering the baselines' performance on MZ-10, their diagnostic accuracy is only on par with the lower bound of the SVM classifier, which suggests that these systems approximately degenerate into disease classifiers that are very weak at finding implicit symptoms. This illustrates the advantage of DxFormer when faced with more diseases and symptoms. It is worth noting that DxFormer's average number of turns (# Turns) is higher than most systems', suggesting that its performance may not be Pareto optimal. However, as noted in the introduction, the average number of turns is not controllable in traditional RL-based methods, so comparing models at the same number of turns is difficult. Nevertheless, from a clinical point of view, asking about 6–9 symptoms on average in exchange for improved diagnostic accuracy seems acceptable. A slightly larger number of turns is actually an advantage, since finding enough implicit symptoms must come at the expense of more turns.

Ablation Studies
To verify the effectiveness of each component of DxFormer, we conduct ablation experiments. DxFormer-Sparse is an agent identical to DxFormer except for the input representation: it uses one-hot representations of symptoms and attributes, and the concatenation of the one-hot vectors is fed as input. DxFormer-SVM uses the same decoder as DxFormer but replaces the encoder with an SVM classifier.
The ablation results in Table 3 show that both DxFormer-Sparse and DxFormer-SVM perform worse than DxFormer, confirming that the dense representation of symptoms is effective. Notably, the two variants can still beat previous SOTAs, illustrating the effectiveness of our decoupled framework. In fact, in our early attempts, models such as RNN-MLP and LSTM-MLP also worked well, although not as well as DxFormer. The decoder–encoder framework and the stopping criterion together contribute to DxFormer's excellent performance.

Effect of Max Number of Turns
In Figure 3(b), we analyze the effect of T_max on model performance on the MZ-4 and MZ-10 datasets. We find the promising result that both SX-Rec and DX-Acc increase as T_max increases. This confirms the core motivation of this paper: it is feasible to improve diagnostic accuracy by increasing symptom recall. We also find that once SX-Rec reaches a certain value (about 70%), DX-Acc gradually converges to its upper bound, which is significantly higher than that of the SVM classifier. On MZ-4 and MZ-10, when the agent is allowed to interact with the patient more than 20 times, the diagnostic accuracy reaches about 78% and 70%, respectively, much higher than the current SOTA results.
We argue that this gap in the upper bound comes from two aspects: 1) the SVM classifier is based on a sparse rather than dense representation of symptoms; 2) the SVM classifier does not exploit the sequential features of symptom inquiry. Besides, DxFormer's SX-Rec is also higher than that of a strong rule-based agent we additionally created for comparison. These results are encouraging and clearly illustrate DxFormer's advantage.

Effect of Stopping Criterion Threshold
Once the maximum number of turns T_max is chosen and fixed, we can further control the balance between accuracy and efficiency through the threshold ε of the stopping criterion. Figure 4 shows the effect of ε on diagnostic accuracy (DX-Acc) and the average number of turns (# Turns) on the MZ-4 and MZ-10 datasets, given T_max = 10. Both DX-Acc and # Turns tend to decrease as the threshold decreases from 1.0 to 0.9, but the downward trend of DX-Acc is more gradual, especially at the beginning. This suggests that an appropriate threshold can reduce the average number of turns with little loss of diagnostic accuracy.
In DxFormer, T_max and ε together determine the balance between accuracy and efficiency. In practice, we recommend first choosing a T_max that allows DX-Acc to converge to its upper bound, and then selecting an appropriate threshold to achieve acceptable accuracy and efficiency.

Related Work
Automatic Disease Diagnosis Deep reinforcement learning (Mnih et al., 2013; Silver et al., 2016) has been applied to automatic diagnosis (Tang et al., 2016; Kao et al., 2018). Peng et al. (2018) proposed a reward-shaping and feature-rebuilding method for fast disease diagnosis; however, their data are simulated and cannot reflect real diagnostic situations. For medical dialogue systems, Wei et al. (2018) annotated the first medical dataset for dialogue systems and used a Deep Q-Network (DQN) to collect additional symptoms via conversation with patients. Xu et al. (2019) released another medical dialogue dataset and introduced prior knowledge to improve diagnostic accuracy. Liao et al. (2020) proposed a hierarchical reinforcement learning framework with a master–worker structure to simulate real medical consultations. There are also related studies based on non-RL frameworks (Xia et al., 2020; He et al., 2022).

RL-based Text Generation RL is also a popular alternative in text generation, especially when the training objective is not the traditional maximum likelihood. Ranzato et al. (2015) used the REINFORCE algorithm to maximize the BLEU of generated sequences, addressing the exposure-bias problem of traditional seq2seq models. Li et al. (2016) used a policy-gradient algorithm to maximize the mutual information of generated responses in dialogue systems. Rennie et al. (2017) proposed self-critical sequence training (SCST), which uses the output of the model's own test-time inference algorithm to normalize the rewards it experiences in image captioning.

Conclusions and Future Work
In this work, we propose DxFormer, a decoupled system for automatic diagnosis that learns dense representations of symptoms by optimizing the joint objectives of its decoder and encoder. We explore the balance between the accuracy and efficiency of the system and suggest ways to use DxFormer in practice. Extensive experiments show that the proposed model achieves new state-of-the-art (SOTA) results on all three public datasets.
In future work, we hope to explore how the recall of positive versus negative symptoms affects diagnostic accuracy. We also hope to further explore automatic disease diagnosis based on more features: since symptoms are only one factor in making a diagnosis, we believe that building a more refined patient profile through an automated consultation system is an essential step toward improving diagnostic accuracy.