Improving Detection and Categorization of Task-relevant Utterances through Integration of Discourse Structure and Ontological Knowledge

Information extraction from conversational data is particularly challenging because the task-centric nature of conversation allows humans to communicate implicit information effectively, which is challenging for machines. The challenges may differ between utterances depending on the role of the speaker within the conversation, especially when relevant expertise is distributed asymmetrically across roles. Further, the challenges may also increase over the course of the conversation as more shared context is built up through information communicated implicitly earlier in the dialogue. In this paper, we propose the novel modeling approach MEDFILTER, which addresses these insights in order to increase performance at identifying and categorizing task-relevant utterances, and in so doing, positively impacts performance at a downstream information extraction task. We evaluate this approach on a corpus of nearly 7,000 doctor-patient conversations where MEDFILTER is used to identify medically relevant contributions to the discussion (achieving a 10% improvement over SOTA baselines in terms of area under the PR curve). Identifying task-relevant utterances benefits downstream medical processing, achieving improvements of 15%, 105%, and 23% respectively for the extraction of symptoms, medications, and complaints.


Introduction
In this paper, we propose a novel modeling approach that embodies insights regarding the organization of task-oriented conversations in order to improve performance at utterance classification over SOTA baseline approaches. Task-oriented conversations involve sharing task-relevant information that may be useful as the task ensues (Liu et al., 2019a; Kazi and Kahanda, 2019). Unfortunately, human-to-human conversations are less well structured than expository text, which is more often the source material for information extraction and summarization. Expository text is typically structured top-down and organized around information flow. Task-oriented conversations, on the other hand, are typically organized around the task, and knowledge of task structure provides an implicit scaffold for understanding. Thus speakers feel free to elide or imply important information rather than making it explicit. These challenges have been well documented (Waitzkin, 1989; Lacson et al., 2006). Prior work in utterance classification is a source of SOTA modeling approaches that perform relatively well despite these challenges while leaving much room for improvement.
Our evaluation in this paper specifically focuses on doctor-patient interactions. Doctor-patient interactions are task-oriented, expert-layperson interactions in which the concerns voiced by the layperson (e.g., symptoms), the underlying issue identified by the expert (e.g., complaint), and the prescribed solutions (e.g., medications) play a crucial part. Customer-service chats are another example of such dialogue. As in the general case, topic switching abounds: the doctor may jump from a question about a symptom to a statement providing an initial assessment and then back again, with or without waiting for a reply from the patient (which may, itself, be responsive or introduce a new concern). In addition, the participants make unequal contributions to different parts of the schema due to the inherent asymmetry between their roles in terms of knowledge and authority. Despite these challenges, humans are able to communicate very effectively in this way. Because of this effective implicit communication, the challenges for machines increase as the conversation progresses and more shared context is built up, in part because of a certain amount of shared domain knowledge, despite differences in the extent and phrasing of it. In response to these insights, our proposed model, which we refer to as MEDFILTER, integrates elements of discourse structure and ontological knowledge to improve utterance classification, the impact of which is also observed in a downstream extraction task. We evaluate the approach on a corpus of nearly 7,000 doctor-patient interactions as a case study.

Figure 1: Overview of MEDFILTER. MEDFILTER first encodes each utterance of the given conversation using a BERT-based encoder (A). The obtained utterance embedding is concatenated with contextual information like speaker role, position of the utterance in the conversation, and ontological knowledge (B). This is then fed to a MS-BiLSTM (C1) for medical relevance identification. MS-BiLSTM leverages speaker role information to learn speaker-specific context for each utterance. This contextual representation is concatenated with the utterance embedding (C2) and passed through another MS-BiLSTM (C3), which focuses on fine-grained categorization. Both tasks are jointly learned. Refer to Section 3 for more details.
Our proposed method, MEDFILTER, is illustrated in Figure 1 and described in detail in Section 3. Its architecture specifically reflects an awareness of the challenges above and begins to address them. In particular, the speaker's role (i.e., doctor, patient, or other) and position within the interaction are both introduced as structuring variables. Insights from ontological knowledge are also made available through a domain ontology: specifically, the Unified Medical Language System (UMLS) (Bodenreider, 2004). From a more technical perspective, the architecture introduces a novel Multi-Speaker BiLSTM to learn role-specific context representations. MEDFILTER also benefits from the incorporation of a hierarchical loss that jointly learns the coarse-grained task of predicting medical relevance to improve fine-grained topic-based utterance classification. The ability to extract medically relevant utterances from doctor-patient conversations and categorize them into medical topics/categories has a substantial practical impact in medical practice (Finley et al., 2018; Quiroz et al., 2019).

Related Work
Dialogue Summarization: In addition to the challenges noted earlier in the paper, other linguistic phenomena such as backchannels, false starts, and topic diffusion are prominent in human-to-human conversations. They add noise, which challenges the capabilities of otherwise effective summarization approaches such as pointer-generator networks (See et al., 2017; Liu et al., 2019b).
Some prior work has relied on an Information Extraction (IE) based approach to extract details about individual medical entities such as symptoms or medications (Du et al., 2019; Selvaraj and Konam, 2019). However, recently, multiple studies (Lacson et al., 2006; Kocaballi et al., 2019; Liu et al., 2019a,b; Park et al., 2019) have shown the benefits of using the topical structure in goal-oriented dialogues to improve summarization. Within that scope, Liu et al. (2019a) introduce key-point sequences that describe the logical topic flow of the summary of customer-service chats. They propose a hierarchical transformer to predict these topics (key-points) for each utterance and use them as auxiliary labels to guide the summarization. This past work inspires our work, in which we extend the approach and then apply it in the more challenging domain of doctor-patient interactions. We consider it more challenging both in terms of the number of utterances per conversation (avg. 225 vs. 20) and topic switches (Kocaballi et al., 2019). To improve on the key-point sequence utterance-level topic classification approach (Liu et al., 2019a), we propose MEDFILTER, which models speaker-specific context augmented with ontological knowledge and a hierarchical loss function.

Intent Classification:
The problem of classifying utterances into medical topics/categories has many similarities with the task of utterance-level intent classification (Zhang et al., 2019; Budzianowski et al., 2018b; Qu et al., 2019). In our case, medical categories act as coarse-grained intents that drive the content of the discussion. Much of the previous work in intent classification caters to creating better dialog agents that condition their responses on the intent of the previous utterance (Budzianowski et al., 2018a; Bocklisch et al., 2017). For instance, Chen et al. (2019) and Kim et al. (2017) propose intent classification as a text classification task where each utterance is considered a complete, independent command. However, this is not true in our case, as the discussion about a medical category might range over multiple utterances, each dependent on context. Hence, we tackle the classification problem as a sequence-labeling task.
Sequence Labeling in Dialogue: Most prior work that employs sequence labeling for utterance classification in dialogues (Raheja and Tetreault, 2019; Liu et al., 2017; Jiao et al., 2019a) evaluates their systems on dialogue-act classification (Shriberg et al., 1998, 2004) or emotion recognition datasets (Poria et al., 2019). In this paper, we adopt state-of-the-art modeling approaches from the emotion recognition task (Jiao et al., 2019a,b) to serve as baselines in our evaluation, since our task has not previously been benchmarked.

Proposed Method: MEDFILTER
The overall architecture of MEDFILTER is shown in Figure 1. The input to MEDFILTER is a transcribed clinical conversation $C$ of the form $\{u_1, u_2, \ldots, u_n\}$, where each $u_i$ represents an utterance. Each utterance in the conversation is passed through a BERT-based encoder (Fig. 1A and Sec. 3.1) to get a fixed-dimensional representation. Contextual information such as speaker role, the utterance's position in the conversation, and ontological knowledge (Fig. 1B and Sec. 3.2) is then appended to the BERT representation. The encoding is input to the coarse Multi-Speaker BiLSTM (MS-BiLSTM) model (Fig. 1C1), followed by a fully-connected layer, to classify the relevance of utterances for topical classification. The representation created by MS-BiLSTM (Coarse) is then concatenated with the utterance encoding (Fig. 1C2) and the resulting vector is fed to the fine-grained MS-BiLSTM (Fig. 1C3) to classify utterances into different medical categories (Sec. 3.3). MEDFILTER is jointly optimized on both classification tasks.

BERT-based Encoder
Given the superior modeling capabilities of long-range dependencies in Transformer-based models (Vaswani et al., 2017), we use pre-trained BERT (Devlin et al., 2019) for encoding each utterance $u_i$. We first encode each token in the utterance using BERT, i.e., $[h^{\text{BERT}}_{i1}, h^{\text{BERT}}_{i2}, \ldots, h^{\text{BERT}}_{im}]$, where $h^{\text{BERT}}_{ij}$ represents the BERT encoding of the $j$-th token of $u_i$. Then, following Reimers and Gurevych (2019), we use MEAN pooling to obtain a representation of the entire utterance ($h^{\text{Text}}_i$). Since the original pre-trained BERT model is trained on general corpora such as Wikipedia, it might not generalize well to our corpus. Therefore, we further fine-tune the BERT model in a supervised manner on the task of predicting the utterance type.
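To make the encoding step concrete, the following is a minimal sketch of mean-pooled BERT utterance encoding with the Hugging Face transformers library; the model name, truncation settings, and masking details are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch of the utterance encoder: mean-pooling BERT token embeddings
# (Reimers and Gurevych, 2019). Assumes bert-base-uncased; fine-tuning on
# the utterance-type task (as in the paper) would happen before this step.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def encode_utterance(text: str) -> torch.Tensor:
    """Return a fixed-dimensional MEAN-pooled representation h^Text_i."""
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_states = bert(**batch).last_hidden_state   # (1, m, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding
    return (token_states * mask).sum(1) / mask.sum(1)    # (1, 768)

h_text = encode_utterance("Have you been taking the lisinopril every day?")
```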

Contextual Information
In addition to encoding the text of an utterance, we also make use of the following types of contextual information.
1. Speaker Role Info: In conversations in general, speaker identity helps ground co-references like I and You. In doctor-patient conversations, each of the speakers plays a specific role in the goals of the interaction. For example, the doctor is more likely to discuss medications than the patient. To allow the representation to be sensitive to speaker information, we map the speaker roles, namely doctor, patient, and other, to a $d$-dimensional embedding ($h^{\text{speaker}}_i$) which is learned during training and given to the model along with the text-based representation.
2. Positional Info: Clinical conversations often follow a pattern where topics like symptoms and complaints are discussed earlier in the dialog and prescribed medications are narrated in the middle or toward the end. To include this signal in MEDFILTER, we partition all the utterances in a conversation into $k$ equal parts based on their position. For instance, if the conversation has 40 utterances and $k = 4$, then the first 10 utterances belong to the 1st partition, the next 10 to the 2nd, and so on. As with the speaker role information, a trainable embedding is associated with each partition ($h^{\text{position}}_i$).
3. Ontological Knowledge: UMLS (Unified Medical Language System) (Bodenreider, 2004) is a combination of a semantic network and a metathesaurus. The semantic network consists of a set of 127 broad subject categories, or semantic types, which provide a consistent categorization of all concepts represented in the metathesaurus. In MEDFILTER, we use QuickUMLS (Soldaini and Goharian, 2016), which identifies clinical mentions in an utterance and retrieves the associated UMLS Concept Unique Identifiers (CUIs) and semantic types, to inform our model about the type of medical phrases present in the input. We believe that types such as Pharmacologic Substance, Symptoms, and Diseases can be helpful in correctly classifying the utterances. We assign a trainable embedding to each semantic type. However, since each utterance can contain multiple clinical mentions of varied semantic types, we average the semantic-type embeddings over the mentions present in the utterance and pass the result to the model ($h^{\text{semantic}}_i$). A sketch of how these features are assembled is given below.
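The following is a minimal sketch of assembling the extended utterance representation by concatenating the text embedding with trainable speaker-role, position, and averaged semantic-type embeddings. The embedding dimension, the three-role vocabulary, and the bucketing rule are illustrative assumptions; the semantic-type IDs are presumed to come from QuickUMLS.

```python
# Sketch of the contextual features from Sec. 3.2. Dimensions are ours.
import torch
import torch.nn as nn

NUM_ROLES, NUM_PARTITIONS, NUM_SEMTYPES, D = 3, 4, 127, 32

role_emb = nn.Embedding(NUM_ROLES, D)      # doctor / patient / other
pos_emb = nn.Embedding(NUM_PARTITIONS, D)  # k positional buckets
sem_emb = nn.Embedding(NUM_SEMTYPES, D)    # UMLS semantic types

def extend(h_text, role_id, utt_index, n_utts, semtype_ids):
    # Positional bucket: utterance i of n falls into partition i*k // n.
    bucket = torch.tensor(min(NUM_PARTITIONS - 1,
                              utt_index * NUM_PARTITIONS // n_utts))
    # Average semantic-type embeddings over all mentions in the utterance.
    h_sem = (sem_emb(torch.tensor(semtype_ids)).mean(0)
             if semtype_ids else torch.zeros(D))
    return torch.cat([h_text.squeeze(0),                # h^Text_i
                      role_emb(torch.tensor(role_id)),  # h^speaker_i
                      pos_emb(bucket),                   # h^position_i
                      h_sem])                            # h^semantic_i
```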

Utterance Prediction
The classifier takes in the extended representation for each utterance $u_i$ in the conversation, given as $h_i = [h^{\text{Text}}_i; h^{\text{speaker}}_i; h^{\text{position}}_i; h^{\text{semantic}}_i]$. To explicitly model the separate roles performed by each speaker (as discussed in Section 1), we propose a novel module, Multi-Speaker BiLSTM (MS-BiLSTM), that includes speaker-level BiLSTMs to learn the context for each speaker type separately. We note, for example, that when the doctor is prescribing medications to the patient, she is more likely to expand on her previous utterance in order to discuss different details about the medicine, whereas the patient is most likely to give simple acknowledgments or ask questions in her turn. Having separate speaker-level BiLSTMs allows MS-BiLSTM to model this difference in the use of context.
MS-BiLSTM takes $h_i$ and $s_i$ (the utterance's speaker) as input. $h_i$ is passed through a background BiLSTM ($\text{BiLSTM}_{bg}$) and the speaker-level BiLSTMs ($\text{BiLSTM}_s$). Thus, if there are 3 speaker roles in the conversation, then the extended representation for each utterance ($h_i$) is input to 4 BiLSTMs (1 background BiLSTM + 3 speaker BiLSTMs). The hidden representations from $\text{BiLSTM}_{bg}$ and $\text{BiLSTM}_{s_i}$ are combined using a sigmoid gate that is learned during training:
$$\hat{h}^{s_i}_i = \text{BiLSTM}_{s_i}(h_i), \qquad \hat{h}^{bg}_i = \text{BiLSTM}_{bg}(h_i),$$
$$g_i = \sigma\big(W_g\,[\hat{h}^{s_i}_i ; \hat{h}^{bg}_i] + b_g\big), \qquad \tilde{h}_i = g_i \odot \hat{h}^{s_i}_i + (1 - g_i) \odot \hat{h}^{bg}_i.$$
Each speaker-level BiLSTM ($\text{BiLSTM}_{s_i}$) only receives gradients for that speaker's utterances ($u_i$), thus focusing on role-specific context. The gate between $\hat{h}^{s_i}_i$ and $\hat{h}^{bg}_i$ controls the relative importance of the role-specific and general-context representations learned by the speaker-level and background BiLSTMs respectively.
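A minimal PyTorch sketch of MS-BiLSTM under these equations follows; the exact gating parameterization (a linear layer over the concatenated hidden states) is our reading of the description above, not the authors' released code.

```python
# Sketch of Multi-Speaker BiLSTM: a background BiLSTM plus one BiLSTM per
# speaker role, combined per utterance by a learned sigmoid gate.
import torch
import torch.nn as nn

class MSBiLSTM(nn.Module):
    def __init__(self, d_in, d_hid, num_roles=3):
        super().__init__()
        self.bg = nn.LSTM(d_in, d_hid, bidirectional=True, batch_first=True)
        self.spk = nn.ModuleList(
            nn.LSTM(d_in, d_hid, bidirectional=True, batch_first=True)
            for _ in range(num_roles))
        self.gate = nn.Linear(4 * d_hid, 2 * d_hid)

    def forward(self, h, roles):
        # h: (1, n, d_in); roles: (n,) LongTensor of speaker-role ids.
        h_bg, _ = self.bg(h)                               # (1, n, 2*d_hid)
        h_spk = torch.stack([lstm(h)[0] for lstm in self.spk], dim=2)
        # Select, per utterance, the output of its own speaker's BiLSTM;
        # outputs of the other roles' BiLSTMs are unused at that position.
        idx = roles.view(1, -1, 1, 1).expand(-1, -1, 1, h_bg.size(-1))
        h_s = h_spk.gather(2, idx).squeeze(2)              # (1, n, 2*d_hid)
        g = torch.sigmoid(self.gate(torch.cat([h_s, h_bg], dim=-1)))
        return g * h_s + (1 - g) * h_bg                    # gated mixture
```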
In this paper, we focus on classifying an utterance into one or more of three categories, namely symptoms, complaints, and medications. However, these categories can be combined to create a coarse-grained task of predicting whether an utterance is medically relevant. We leverage this coarse-grained supervision to create a hierarchical model with a joint-learning loss.
Hierarchical Modeling: In this architecture, the extended representation ($h_i$) and the corresponding speaker role ($s_i$) are first passed through a coarse-grained MS-BiLSTM and a fully-connected layer followed by a softmax to be classified into one of the two categories {Medically Relevant, Irrelevant}. The representation $h^{\text{coarse}}_i$ learned by this MS-BiLSTM models the differences between medically relevant and irrelevant text, which can also benefit fine-grained classification. Hence, $h^{\text{coarse}}_i$ is concatenated with $h_i$ and sent to the fine-grained MS-BiLSTM, which performs the multi-label classification into the three categories discussed earlier. Both tasks are jointly optimized, and the hyperparameter $\beta$ controls the relative strength of the medical-relevance classification loss ($\mathcal{L}_{\text{coarse}}$):
$$\mathcal{L} = \mathcal{L}_{\text{fine}} + \beta \, \mathcal{L}_{\text{coarse}}.$$
Such a loss function could also be used in other utterance classification tasks where classes follow a hierarchical structure. For instance, in emotion classification (Poria et al., 2019), the fine-grained categories (e.g., happiness, anger, etc.) can be combined to create an emotive class, and a coarse-grained classifier could be used to learn features that differentiate between emotive and neutral utterances.
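The joint objective can be sketched as follows, assuming binary cross-entropy for both heads; the coarse head described above uses a softmax over two classes, which we fold into a single-logit BCE for brevity.

```python
# Sketch of the joint hierarchical loss L = L_fine + beta * L_coarse.
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def joint_loss(coarse_logits, coarse_gold, fine_logits, fine_gold, beta=0.5):
    # coarse_*: (n, 1) medical relevance;
    # fine_*:   (n, 3) symptom / complaint / medication (multi-label).
    l_coarse = bce(coarse_logits, coarse_gold)
    l_fine = bce(fine_logits, fine_gold)
    return l_fine + beta * l_coarse
```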

Corpus Description
Our data set comprises 6,862 annotated transcripts of real, de-identified doctor-patient conversations with an average of 225 utterances per conversation, primarily from the doctor and patient but occasionally including contributions from nurses, caregivers, and other attendees as well. The annotation guidelines were developed by a team of professional medical scribes and NLP experts. Annotators were trained to identify the medically relevant utterances in a given conversation and assign one or more (out of 15 possible) tags to each utterance. Each of these tags represents a medical category like symptom, previous medical history, or diagnosis. Most conversations contain some informal, social interactions with utterances that are irrelevant to the downstream clinical tasks (an example dialogue is included in the Appendix, Sec. A.2).

In this work, we leverage the labels to train MEDFILTER on the task of utterance classification and focus on three categories, namely symptoms, complaints, and medications, where medications include past/current medications taken by the patient and prescriptions given by the doctor. We choose the above-mentioned categories as they are found in every office visit and most closely generalize to other domains like customer-service chats. However, our approach can easily be generalized to capture other aspects such as previous medical history, diagnosis, and assessments as well. We set aside random samples of 627 and 592 conversations for validation and testing respectively.

Baselines
Since sequence-labeling models have not previously been applied to utterance classification in doctor-patient conversations, we compare our proposed method, MEDFILTER, against baseline methods that give SOTA results on utterance-level emotion recognition data sets. HiGRU-sf (Jiao et al., 2019b) is a hierarchical gated recurrent unit (HiGRU) framework with an utterance-level GRU and a conversation-level GRU. BiF-AGRU (Jiao et al., 2019a) denotes a two-level BiGRU fusion model with a uni-directional AGRU for attentive context representation. UniF-BiAGRU is similar to BiF-AGRU, but uses a uni-directional GRU for contextual utterance representation and a bi-directional AGRU for attentive context. For implementation, we use the official code provided by the authors.

Evaluation Metric: We use the mean area under the PR curve (AUC), a widely used metric in the multi-label classification setting (Riedel et al., 2013; Mintz et al., 2009), as our evaluation metric. It is also used for early stopping and hyperparameter tuning.
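The metric can be computed as sketched below, using scikit-learn's average_precision_score as the per-category AUC-PR estimate; treating average precision as the area under the PR curve is a standard approximation.

```python
# Sketch of the evaluation metric: mean AUC under the PR curve across
# categories.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_pr_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """y_true, y_score: (num_utterances, num_categories) arrays."""
    aucs = [average_precision_score(y_true[:, c], y_score[:, c])
            for c in range(y_true.shape[1])]
    return float(np.mean(aucs))
```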

Utterance Classification Results
MEDFILTER performs better than any of the baseline approaches in assigning utterances in doctor-patient conversations to medically relevant categories. Table 1 presents the AUC scores for different utterance-labeling models on our test set. Each result is the mean of 5 independent runs with different seeds.
A BERT-based classifier that passes the mean of token-level embeddings through an FC layer gives a low score of 33.5 AUC. When the BERT encoder is fine-tuned along with the classification layer (BERT-FT), the performance jumps to 36.2, underlining the benefits of fine-tuning BERT (Devlin et al., 2019). We also find that using Clinical BioBERT-FT (fine-tuned) does not beat BERT-FT. This is partly because the former is further pre-trained on MIMIC notes (Alsentzer et al., 2019), which are much more formal than medical conversations, and thus the additional knowledge does not transfer well to our corpus. Adding context to BERT-based models, e.g., via a BiLSTM, gives substantial boosts. End-to-end fine-tuned BERT BiLSTM (BERT BiLSTM FT) performs worse than BERT-FT BiLSTM, which passes fine-tuned BERT embeddings through a BiLSTM as non-learnable features. MEDFILTER, which further includes contextual information, uses MS-BiLSTM in place of BiLSTM, and optimizes a hierarchical loss, significantly outperforms all baselines, gaining 1.4 absolute AUC points over BERT-FT BiLSTM (the 2nd best). It also surpasses emotion recognition SOTA methods like HiGRU-sf by 4.1 AUC points.
Ablation Results: To understand the importance of each module in MEDFILTER, we perform a cumulative ablation study (Figure 3). We find that removing individual modules results in notably reduced performance. The model that does not incorporate hierarchical modeling shows a dip of 0.4 AUC points. This suggests that the information learned in the medical-relevance prediction layer aids the final classification task. Further, replacing MS-BiLSTM with a simple BiLSTM leads to a drop of an additional 0.6 AUC points, revealing the importance of modeling speaker-specific context. Without contextual information, we see a further reduction in performance.

Impact of Utterance Classification on Downstream Medical Extraction
The results in the previous section portray the effectiveness of MEDFILTER at sorting important utterances in clinical conversations into medically relevant categories. Such filtering, when included in the pipeline (for example, as a pre-processing step), can assist downstream medical processing methods to focus on utterances that contain information pertinent to their tasks (Figure 2), by improving the signal-to-noise ratio in the input. In this section, we evaluate whether the use of MEDFILTER to prune irrelevant utterances is advantageous for symptom, medication, and complaint extraction.

Task Setup
The extractor takes the conversation as input and outputs the symptoms/medications/complaints discussed within. Conversation-level labels for all three extraction tasks are taken from a predefined set provided by the corpus annotators. For symptoms, they include 14 coarse-grained classes that represent different body systems (e.g., cardiovascular) and 178 fine-grained ones for the corresponding issues (e.g., palpitations). Given the small size of the training data, we use the coarse-grained body systems for symptom extraction. We then manually curate a list of different symptoms corresponding to each body system using UMLS and use their UMLS CUIs as labels. For medications, we manually link medication labels to their corresponding UMLS (Bodenreider, 2004) concepts and group them using hierarchies from the NCI Thesaurus (Sioutos et al., 2007). We pass each medication name through QuickUMLS to get a list of possible CUIs for the term in UMLS. We take the candidate CUI with a similarity of 1 and find its NCI hierarchy in the UMLS metathesaurus. The four topmost nodes in the hierarchy are extracted and act as the pseudo-label for that CUI (see the sketch below). In order to reduce the class imbalance, some of these hierarchies are combined to form a coarser label. This reduces the number of labels to 31. Finally, an Others label is added, which covers medicine names (in the test set) that do not correspond to any of the previous 31 labels. This brings the label count to 32 for medications. Complaints in our corpus range from follow-up visits to disease names to vaccine requests. Similar to medication extraction, we leverage SNOMED-CT hierarchies to constrain the tag list to 11, where the first 10 represent diseases of different body systems and Others encompasses complaints like follow-up, vaccine requests, medication refill requests, etc. (Table A11).
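The medication pseudo-label construction can be sketched as follows. QuickUMLS usage follows its documented match() API; `nci_hierarchy` is a hypothetical lookup table (CUI to NCI ancestor list, most general first) standing in for the query against the UMLS metathesaurus.

```python
# Sketch of mapping a medication name to an NCI-hierarchy pseudo-label.
from quickumls import QuickUMLS

matcher = QuickUMLS("/path/to/quickumls/index")
nci_hierarchy = {}  # hypothetical: populated from the UMLS metathesaurus,
                    # e.g., {"C0065374": ["Drug", "Cardiovascular Agent", ...]}

def medication_pseudo_label(med_name: str):
    for candidates in matcher.match(med_name, best_match=True):
        for cand in candidates:
            if cand["similarity"] == 1.0 and cand["cui"] in nci_hierarchy:
                # The four topmost ancestors act as the pseudo-label.
                return tuple(nci_hierarchy[cand["cui"]][:4])
    return ("Others",)
```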
We use the same train/val/test split as defined for the utterance classification experiments in Section 4.1.The performance of the extraction pipeline is evaluated on Micro and Macro-F1 scores.

Extractor Details
All three extraction tasks are modeled as multi-label classification. We leverage a state-of-the-art medical entity-linking tool, QuickUMLS (Soldaini and Goharian, 2016), that takes in a conversation and outputs UMLS CUIs corresponding to all identified candidate concepts. Concepts with a similarity measure of 1 are chosen as predictions. For symptom extraction, the predictions are compared against a manually created list of CUIs (presented in Appendix Table A12) for symptoms associated with each of the 14 body systems. The presence of a symptom of body system b is determined by the presence of the predicted CUIs in the target list for that body system. We compare the NCI and the SNOMED-CT hierarchies of the predicted concepts against the label hierarchies for medications and complaints, respectively. Concepts that do not fit into one of the specific categories are grouped under the label Others. In the next section, we report the results for the best-performing filtering thresholds.
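A sketch of the symptom extractor under these choices follows; `BODY_SYSTEM_CUIS` is a hypothetical stand-in for the manually curated lists of Appendix Table A12, with example CUIs we believe correspond to the named symptoms.

```python
# Sketch of symptom extraction: run QuickUMLS over the (filtered)
# conversation, keep exact matches, and flag a body system whenever one of
# its curated symptom CUIs appears.
from quickumls import QuickUMLS

matcher = QuickUMLS("/path/to/quickumls/index")
BODY_SYSTEM_CUIS = {"cardiovascular": {"C0030252"},    # e.g., palpitations
                    "musculoskeletal": {"C0231528"}}   # e.g., myalgia

def extract_symptoms(utterances):
    text = " ".join(utterances)  # utterances kept by MEDFILTER
    predicted = {c["cui"]
                 for cands in matcher.match(text, best_match=True)
                 for c in cands if c["similarity"] == 1.0}
    return {system for system, cuis in BODY_SYSTEM_CUIS.items()
            if predicted & cuis}
```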

Results
We find that the performance of the baseline medication and symptom extractor, QuickUMLS (All Text), is substantially boosted by filtering out irrelevant utterances (Tables 2 and 3). Pruning medically irrelevant utterances using MEDFILTER (MR MEDFILTER) improves Micro-F1 by 29.3 and 4.7 points for medication and symptom extraction, respectively. If only the medication/symptomatic utterances (MU/SU) are input to the extractors, the results improve further.
Results for complaint extraction are shown in Table 4. We find that the QuickUMLS extractor does not perform well on complaint extraction. However, consistent with the trends for the other two categories, pruning irrelevant utterances before sending the conversation through the extractor improves performance. The Micro-F1 score increases from 35.6 for All Text to 43.7 for CU MEDFILTER.
Pruning done using MEDFILTER is more beneficial than pruning with BERT-FT BiLSTM (the 2nd best utterance classifier in Table 1) for medication and complaint extraction; however, the two perform equally well for symptom extraction. This suggests that the benefits from the inclusion of discourse structure, domain knowledge, and a hierarchical loss function do not transfer well to symptom extraction. In Section 7, we investigate the kinds of utterance classification errors MEDFILTER makes that need to be addressed to further improve the symptom extraction pipeline.

Discussion
Why does contextual information help? Ablation results (Figure 3) show that incorporating speaker role information and UMLS semantic-type information provides significant improvements in AUC scores for utterance classification. In Figure 4a, we plot the proportion of utterances from different medical categories against their speakers. While both parties contribute equally to symptom discussions, there is a clear asymmetry in the number of medication and complaint utterances spoken by the doctor and the patient, explaining the contribution of speaker role information in differentiating medication/complaint utterances from others. We also plot the distribution of the four most frequent UMLS semantic types present in the utterances of different medical categories (Figure 4b). For medications, we find that UMLS entities with semantic type Pharmacologic Substance are present in more than 55% of the medication utterances, indicating that their detection is a knowledge-dependent task. Similarly, and supporting our hypothesis, Disease/Syndrome and Sign/Symptom are the most frequent semantic types in complaint and symptom utterances, respectively.
Error Analysis: In this section, we present a deeper analysis of some of the systematic knowledge-extraction errors made by MEDFILTER that limit its performance in recognizing medically relevant utterances.
1. Informal Language: The model sometimes overlooks informal references to symptoms. For instance, utterances such as PT: I feel something unusual in my leg or PT: My heart beats funny! discuss musculoskeletal and cardiovascular symptoms but do not use medical terms to refer to them. These patterns seem to be more frequent in patient utterances, likely because patients are less familiar with medical terminology. Off-the-shelf entity linkers, like QuickUMLS (Soldaini and Goharian, 2016), do not transfer well to spoken medical conversations. They are unable to recognize the correct UMLS concepts (and semantic types) corresponding to the colloquial symptomatic phrases, which reduces their effectiveness as features. For instance, for the utterance PT: My heart is racing., QuickUMLS outputs (1A) rather than (1B).

2. Manifestations of Symptoms: Internal symptoms often manifest themselves physically as a digression from the natural ability to perform typical activities. For instance, when the patient says I can't do anything after I'm back from office or I can only walk up one flight of stairs, she might be implicitly mentioning a cardiovascular symptom. A sizeable subset of such examples includes usage of duration or frequency to convey the implicit deviation, like Duration: PT: I only sleep for 4 hours, Frequency: PT: I go to the bathroom 10 times at night, Quantity: DR: I see you are up to 300 pounds now.

Future Work

For a system to correctly classify the samples of the above two categories, it needs both to generalize to patient-generated language and to have a semantic understanding of whether the description strays from normal. Incorporating data from online self-disclosure sites like medical subreddits and discussion forums (Basaldella and Collier, 2019) during training might prove beneficial for learning better representations for such vocabulary. Concept normalization data sets (Miftahutdinov and Tutubalina, 2019; Lee et al., 2017) could also be leveraged in this regard. Our approach of training the BERT encoder separately from the context encoder would allow MEDFILTER to learn from such non-dialogue resources.
Extraction tasks (Section 6) mostly evaluate the ability of MEDFILTER to recognize utterances that contain the most information about the name or type of the medication, symptom, or complaint. However, to quantify the context-level benefits of MEDFILTER, especially the speaker-specific context modeling (MS-BiLSTM), on downstream processing, we need to evaluate the system on problems like regimen extraction (Selvaraj and Konam, 2019) or symptom summarization (Liu et al., 2019b). Such tasks require utterance classification models to correctly identify utterances that discuss fine-grained details about the topic and would therefore evaluate a model's ability to solve multiple challenges like coreference resolution, speaker-specific context detection, thread identification, etc. Such an evaluation is part of future work.

Conclusion
In this paper, we have proposed a novel text classification approach that specifically leverages insights into the organization of task-oriented conversations in order to improve performance at topic-based utterance classification over SOTA baseline approaches. In particular, we have demonstrated that our utterance classification model, MEDFILTER, benefits from discourse information, domain knowledge, speaker-specific context modeling, and a hierarchical loss to reach a new state-of-the-art performance on a doctor-patient interactions dataset. We find that using topic-based utterance classification in general, and MEDFILTER in particular, as a pre-processing step before medical extraction tasks significantly improves the extraction scores. We believe that the contributions made in this work would also generalize to other kinds of expert-lay dialogue like customer-service chats.
Figures A2 and A3 show the precision-recall curves for each category separately. MEDFILTER improves utterance classification for all three categories. For symptom classification, the AUC score improves from 52.2 to 53.5. However, the symptom extraction results (Section 6 in the main paper) suggest that most of this improvement is on identifying utterances that discuss fine-grained details of the symptom discussion, not on recognizing the utterance that contains the actual symptom name.

C.2 Performance on Validation Set
Table A5 shows the performance of different utterance classification models on the validation set.

The Micro-F1 vs. threshold curve for medication extraction is shown in Figure A4. It suggests that the QuickUMLS medication extractor has low precision, which is substantially improved when we prune irrelevant utterances. However, the graph for SE (Figure A5) shows that the extractor's performance is dominated by its recall. Pruning helps with improving the precision but does not help with the low recall. This explains the lower gains for symptom extraction as compared to medication extraction (Table 3 in the main paper).

D.2 Oracle Results
Table A6 contains results for medication extraction when medically relevant (MR) or medication (MU) utterances are chosen using an oracle (MR/MU Oracle). Similarly, oracle results for symptom and complaint extraction are shown in Tables A7 and A8, respectively. We find that there is still substantial room for improvement in the symptom extraction pipeline. Just by improving the topic-based utterance classifier, one could observe a potential jump of 5 Micro-F1 points in symptom extraction. However, we do not observe this trend for medication extraction, where the topic classification done by MEDFILTER performs much better than the Oracle. Note that where within the conversation a medication name gets extracted from is irrelevant to the task.
The Oracle picks utterances that would be sufficient for a human to identify the medications discussed in the dialogue. However, they might not be adequate for an automatic string-matching based extractor like QuickUMLS. Since QuickUMLS uses non-contextual surface-level features to identify medication names, it looks for phrases (in the input given to it) that match the surface requirements. So, it is possible for the Oracle utterances not to contain the proper surface-level forms that QuickUMLS could leverage for extracting medications. The utterances categorized as medication utterances by MEDFILTER, on the other hand, even though sometimes incorrect, might contain the medication names in the form QuickUMLS expects, thus improving the score over the Oracle. One should note, however, that a perfect downstream extractor would not suffer from these side effects.

D.3 Supervised Extractor
For symptom extraction (SE), we also show the benefits of using topic-based utterance classification with a supervised classification-based SE approach that leverages a BiLSTM with attention (BiLSTM-Attn) for the problem of predicting the symptoms present in a conversation.

D.3.1 BiLSTM-Attn
Each utterance in the conversation is passed through an embedding layer and a BiLSTM layer to obtain a contextualized representation:
$$h_i = \text{BiLSTM}(e(s_i), h_{i-1}),$$
where $e(\cdot)$ is the embedding function. The final states of the BiLSTM are re-weighted using attention, calculated as shown in Equation A1. This allows our model to pay attention to important utterances in the conversation to extract symptom information. We pass $h_{\text{final}}$ through a linear classifier and a sigmoid layer to get logits for each possible symptom label (Table A10).
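A minimal PyTorch sketch of this extractor follows; the attention parameterization (a single linear scoring layer standing in for Equation A1) is our assumption.

```python
# Sketch of BiLSTM-Attn: attention weights over per-utterance BiLSTM states
# produce a conversation vector that feeds the multi-label symptom classifier.
import torch
import torch.nn as nn

class BiLSTMAttn(nn.Module):
    def __init__(self, d_in, d_hid, num_labels):
        super().__init__()
        self.lstm = nn.LSTM(d_in, d_hid, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * d_hid, 1)
        self.clf = nn.Linear(2 * d_hid, num_labels)

    def forward(self, utt_embs):                   # (1, n, d_in)
        h, _ = self.lstm(utt_embs)                 # (1, n, 2*d_hid)
        a = torch.softmax(self.attn(h), dim=1)     # weight per utterance
        h_final = (a * h).sum(dim=1)               # attention-weighted summary
        return torch.sigmoid(self.clf(h_final))    # per-label probabilities
```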

D.3.3 Results
We present the results for symptom extraction (SE) using a BiLSTM-Attn model in Table A9. We find that using topic-based utterance classification to remove irrelevant utterances before passing the conversation through the BiLSTM-Attn improves performance.

Figure 2: MEDFILTER as a part of the extraction pipeline.

Figure 4: Contextual Information: Different speaker roles contribute asymmetrically towards different medical topics/categories in the dialogue (Figure 4a). Furthermore, phrases with UMLS semantic types Pharmacologic Substance, Sign/Symptom, and Disease/Syndrome occur quite frequently in medication, symptom, and complaint utterances respectively (Figure 4b).

Figure A4: Medication Extraction: Micro-F1 vs. Threshold.

Similar to the QuickUMLS-based extractor, we use Micro and Macro F1 scores to evaluate the performance of the supervised extraction pipeline. The BiLSTM-Attn (All Text) model takes in the entire conversation as input, whereas the other variants are given only a subset of utterances. MR Oracle/MEDFILTER models are trained on the medically relevant utterances as output by the oracle. Similarly, SU Oracle/MEDFILTER models are trained on the Oracle symptom utterances in each conversation in the training set. Topic-based classification is therefore used as a pre-processing step in the pipeline.

Table 1: Utterance classification results on the test set (avg. (std. dev.)). Results on the validation set are shown in the Appendix. The improvements are statistically significant (p < 0.01).

Table A5: Results on the validation set and the number of trainable parameters corresponding to each utterance classification model. The time taken by models that use BERT-FT is shown as a sum of two numbers, as fine-tuning BERT is done only once and is then used for both BERT-FT BiLSTM and MEDFILTER.