Does the Magic of BERT Apply to Medical Code Assignment? A Quantitative Study

Unsupervised pretraining is an integral part of many natural language processing systems, and transfer learning with language models has achieved remarkable results in many downstream tasks. In the clinical application of medical code assignment, diagnosis and procedure codes are inferred from lengthy clinical notes such as hospital discharge summaries. However, it is not clear if pretrained models are useful for medical code prediction without further architecture engineering. This paper conducts a comprehensive quantitative analysis of various contextualized language models' performance, pretrained in different domains, for medical code assignment from clinical notes. We propose a hierarchical fine-tuning architecture to capture interactions between distant words and adopt label-wise attention to exploit label information. Contrary to current trends, we demonstrate that a carefully trained classical CNN outperforms attention-based models on a MIMIC-III subset with frequent codes. Our empirical findings suggest directions for improving the medical code assignment application.


Introduction
Clinical notes generated by healthcare professionals are part of electronic health records and provide an essential source for intelligent healthcare applications (Zhang et al., 2020). Medical information management aims to assign standard medical codes to each clinical document for categorization purposes, which requires professional medical knowledge and is usually costly and error-prone (Hsia et al., 1988; Stanfill et al., 2010). The International Classification of Diseases (ICD) system, the most widely used coding system, provides a global standard for reporting diseases and health conditions. With the rapid development of machine learning and natural language processing (NLP), manual code assignment can be replaced with automatic coding systems (Farkas and Szarvas, 2008; Crammer et al., 2007; Farzandipour et al., 2010). Practical medical code assignment requires capturing semantic concepts (Falis et al., 2019) and tackling the challenges of lengthy note encoding and large-dimensional code schemes.
Pretrained language models (PTMs) such as BERT (Devlin et al., 2019) learn contextualized text representations and have started a new era in NLP. NLP applications benefit from large-scale pretraining on massive corpora, and universal language representations from PTMs have been successfully utilized in downstream tasks via transfer learning. In clinical NLP, incorporating pretrained contextualized language models to encode lengthy clinical notes for large-scale medical code prediction has not been well studied. Recently, Li and Yu (2020), ? and Dong et al. (2020) performed preliminary experiments with pretrained models; however, these three pilot studies failed to achieve satisfactory results or provide in-depth analysis.
This paper investigates language models pretrained in various domains. A language domain here defines a distribution over a topical field such as biomedical documents and clinical notes. Specifically, we investigate the following three research questions (RQ1-3).
RQ1: What kind of BERT pretraining works best? We adopt domain-specific corpora and BERT variants pretrained with the different domain adaptation schemes illustrated in Fig. 1, and compare their performance on the MIMIC-III benchmark.
RQ2: What kind of BERT fine-tuning formulation works best for long notes? We employ classical fine-tuning, develop a hierarchical architecture for long clinical notes, and consider label-aware feature representation.
RQ3: Are BERT models better than convolution-based approaches? We reproduce convolutional neural networks (CNNs) with classical pretrained word embeddings and conduct a comparison.
Understanding medical text is a long-standing research problem. We study an essential task in medical information management and diagnosis support: medical code assignment, i.e., assigning medical codes to clinical notes. Language models can either be pretrained using a large corpus of medical text or trained from scratch and fine-tuned for a specific task. Several pretrained models have been published for medical NLP, and they differ in the collections of medical texts used for pretraining, e.g., biomedical, clinical, and medical-related social domains. We study the usefulness and relative merits of these different pretrained models and suggest changes to the neural network architecture to improve performance with long notes: a hierarchical model for longer notes and label-wise attention to leverage relevant information about medical codes. Despite our careful attempts, we find that fine-tuning pretrained models performs worse than carefully training conventional neural architectures from scratch. Hence, our results provide practical guidance for building medical information management systems: pretrained models offer a convenient plug-and-play solution; however, robustly training existing standard models offers an appealing alternative with good performance in practice.
Our contributions are as follows.
• This paper conducts a comprehensive quantitative study to investigate the effect of knowledge transfer via mixed-domain and task-adaptive language model pretraining in different domains, and a thorough comparative study to answer the research questions.
• We propose a hierarchical BERT architecture with a label attention mechanism to enhance contextualized representation with label awareness for long clinical notes.
• We demonstrate that the classical CNN model with appropriate training can improve the predictive performance, achieving new state-of-the-art results on frequent medical codes (MIMIC-III top-50 dataset).

Related Work
Rule-based and machine learning-based methods have been studied for diagnosis code assignment from clinical notes (Medori and Fairon, 2010; Perotte et al., 2014). Perotte et al. (2014) proposed an SVM-based classification algorithm with a flat and a hierarchy-based classifier. Recently, the research trend has turned to deep neural networks. Convolutional neural networks are one popular category with many model architectures proposed, including CAML, which applies CNNs and a label-wise attention mechanism (Mullenbach et al., 2018), MultiResCNN, which uses residual connections (Li and Yu, 2020), and DCAN, which utilizes dilated convolutions (Ji et al., 2020a). Recurrent neural networks have also been extensively studied to capture sequential dependency in clinical notes. Such recurrent models include AttentiveLSTM (Shi et al., 2017), HA-GRU (Baumel et al., 2018), and the tree-of-sequences LSTM network (Xie and Xing, 2018). Attention mechanisms for matching important diagnosis snippets are widely integrated into CNN- and RNN-based models (Shi et al., 2017; Dong et al., 2020). CAML (Mullenbach et al., 2018) introduced a label-wise attention mechanism to learn label-aware document representations.
Understanding clinical notes requires professional medical knowledge. Many methods incorporate external knowledge sources to enhance neural architectures and facilitate clinical text understanding. Prakash et al. (2017) proposed a condensed memory network model with iterative condensation of external memory for network updating and data retrieval from Wikipedia. Bai and Vucetic (2019) used Wikipedia articles of medical codes to jointly learn knowledge-aware embeddings, and Cao et al. (2020) utilized the ICD code hierarchy with hyperbolic representation. Another direction to incorporate knowledge is through language pretraining and transfer learning. Li and Yu (2020) and ? reported preliminary results with semantic knowledge transfer. This paper conducts a comprehensive quantitative analysis.
Pretrained language models are trained on an auxiliary task, such as masked language modeling, which predicts a word or sequence based on the surrounding context, and they yield improvements in many NLP tasks (Xipeng et al., 2020). Pretraining on such auxiliary tasks benefits from large-scale training on unlabeled corpora that are readily available from the web or textbooks. Erhan et al. (2010) hypothesized that pretraining acts as a type of regularization and found that pretrained models exhibit lower generalization errors on average. Several pretrained models in specific domains have been released, such as BioBERT (Lee et al., 2020) and ClinicalBERT (Alsentzer et al., 2019). They have also been applied in many domain applications; for example, Mulyar et al. (2019) applied contextualized language models for phenotyping, and Huang et al. (2019) used pretrained models to encode clinical notes to predict hospital readmission.

Method
We develop fine-tuning with different architectures, including a fully connected classifier and a hierarchical classifier with an extra transformer on top (Fig. 2), to address the long-document challenge. Label-wise attention, which learns label-aware document representations with these two fine-tuning architectures, is described in Section 3.3.

Pretraining Domains
We study three types of domains: 1) general domains such as book corpora and general Wikipedia articles; 2) domains that are closely related to the target clinical domain; 3) the target clinical domain. Assigning ICD codes from clinical notes is a task in the clinical domain. We consider biomedical and health-related social domains as candidate domains closely related to the clinical domain. Inspired by domain- and task-adaptive pretraining (Gururangan et al., 2020), we investigate different ways of pretraining models for medical code assignment: 1) pretrain only on general domains and immediately transfer to the target clinical domain; 2) continue pretraining on close domains, such as the biomedical domain, and on clinical domains, and transfer to the target clinical domain; 3) pretrain on close domains from scratch and transfer to the target clinical domain; 4) pretrain on mixed domains and further fine-tune on the target domain. The methods that fall into these classes can be summarized in three categories:
Pretraining in General Domains does not involve specific topics or genres. We use BERT (Devlin et al., 2019) pretrained on two unsupervised prediction tasks, i.e., masked language modeling and next sentence prediction, using the BooksCorpus (Zhu et al., 2015) and English Wikipedia.
Mixed-domain Pretraining uses a mixture of domains. We consider both continued pretraining and pretraining from scratch. Domain-adaptive pretraining has been shown to further improve predictive performance (Gururangan et al., 2020). We use 1) BlueBERT (Peng et al., 2019) pretrained with PubMed text and MIMIC-III clinical notes; 2) BioBERT (Lee et al., 2020) continually pretrained on domain-specific data from PubMed abstracts and PMC full-text articles; 3) BioRedditBERT, initialized from BioBERT and continually pretrained on health-related posts from health-themed forums on Reddit; 4) PubMedBERT (Gu et al., 2020), domain-specific and pretrained from scratch in the biomedical domain using PubMed publications; and 5) SapBERT, which benefits from self-aligning biomedical entities to the Unified Medical Language System (UMLS) ontology and non-parametric metric learning, leading to a better-separated embedding space.
In-domain Continued Pretraining continues pretraining in the target clinical domain, which is also called task-specific pretraining (Howard and Ruder, 2018; Gururangan et al., 2020). We use ClinicalBERT (Alsentzer et al., 2019), which adds a second phase of pretraining in the clinical domain using clinical notes and discharge summaries.

Fine-tuning with Hierarchical Structure
Clinical notes, consisting of patient history and discharge summaries, are often long documents. To address this, we develop two fine-tuning strategies, one based on straightforward truncation and the other on a hierarchical architecture. In the first approach, we truncate clinical notes to 512 tokens, take the final hidden state of the first token [CLS] as the pooled representation of the truncated note (denoted as $C \in \mathbb{R}^{d_h}$), and apply a fully connected network (FCN) with sigmoid activation as the classifier to predict output probabilities. This straightforward fine-tuning structure is denoted as BERT-trun, and it serves as the baseline. In addition, we propose a hierarchical fine-tuning structure to deal with long notes, shown in Fig. 2. Lengthy clinical notes (with more than 512 tokens) are first divided into several shorter subsequences to build the lower-level contextualized representation. An additional transformer network (Vaswani et al., 2017) is built on top to capture the second-level sequential dependencies between the note segments. The classifier follows the same setup as the truncated version, i.e., an FCN with sigmoid. We call this hierarchical structure BERT-hier.
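The segmentation step of BERT-hier can be sketched as follows. This is only an illustration: the function name `split_into_segments` is hypothetical, the 2500-token cap is taken from the experimental setup, and the sketch ignores the special tokens ([CLS], [SEP]) that a real BERT tokenizer would add to each segment.

```python
from typing import List


def split_into_segments(token_ids: List[int],
                        segment_len: int = 512,
                        max_len: int = 2500) -> List[List[int]]:
    """Truncate a note to max_len tokens and cut it into consecutive segments.

    Each segment would be encoded separately by BERT; the resulting segment
    representations are then fed to the second-level transformer.
    """
    truncated = token_ids[:max_len]
    return [truncated[i:i + segment_len]
            for i in range(0, len(truncated), segment_len)]


# Stand-in for the token ids of a long clinical note (assumed data).
note = list(range(3000))
segments = split_into_segments(note)
print(len(segments), [len(s) for s in segments])
```

Under these assumptions, a 3000-token note is cut to 2500 tokens and yields four full segments plus one shorter final segment.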
The learning objective is the binary cross-entropy loss:
$$\mathcal{L} = \sum_{i=1}^{m} \left[ -y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i) \right],$$
where $y_i \in \{0, 1\}$ is the ground-truth label, $\hat{y}_i$ is the sigmoid score for prediction, and $m$ is the number of ICD codes. We use the AdamW optimizer (Loshchilov and Hutter, 2019) to fine-tune the model with backpropagation.
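As a small worked example of the binary cross-entropy objective, the sketch below evaluates the loss for one note, assuming the loss is summed over the m codes (a mean over codes would differ only by the factor 1/m). The helper name `bce_loss` and the clipping constant `eps` are illustrative choices, not part of the described method.

```python
import math
from typing import Sequence


def bce_loss(y_true: Sequence[int], y_prob: Sequence[float],
             eps: float = 1e-12) -> float:
    """Binary cross-entropy summed over the codes of a single note."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total


# Two codes, both predicted with probability 0.5: the loss is 2 * ln(2).
loss = bce_loss([1, 0], [0.5, 0.5])
print(round(loss, 6))
```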

Label-wise Attention
To further connect the document representation with label information, we introduce label-wise attention (Mullenbach et al., 2018) in the fine-tuning procedures. The label attention network (LAN) prioritizes essential information in the hidden note representation relevant to the medical ICD codes. The LAN calculates the attention score $A \in \mathbb{R}^{n \times m}$, which measures the importance between each pair of medical codes and words in the document. It is defined via a dot product:
$$A = \mathrm{softmax}(HU),$$
where $H \in \mathbb{R}^{n \times h}$ is the hidden encoding of the BERT encoder's last layer, $U \in \mathbb{R}^{h \times m}$ is the parameter matrix of the label attention layer (also known as the query), $n$ is the number of tokens in the document, $h$ is the hidden representation dimension, and $m$ is the number of ICD codes. The attention layer's output is then calculated by multiplying the attention $A$ with the hidden representation from the last layer of the BERT encoder. The attentive representation $V \in \mathbb{R}^{m \times h}$, formalized as $V = A^{\top} H$, is further used for medical code classification, representing sequential dependency and label awareness.
The BERT-trun and BERT-hier fine-tuning architectures can both integrate the label-wise attention mechanism.
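The shapes involved in label-wise attention can be checked with a minimal NumPy sketch. It assumes, following the usual formulation of label-wise attention, that the scores HU are normalized with a softmax over the n tokens separately for each code; the random matrices stand in for the BERT encodings and the learned query matrix.

```python
import numpy as np


def label_wise_attention(H: np.ndarray, U: np.ndarray):
    """H: (n, h) token encodings; U: (h, m) label queries.

    Returns V of shape (m, h), one document vector per code,
    and the attention matrix A of shape (n, m).
    """
    scores = H @ U                                        # (n, m) dot products
    scores = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=0, keepdims=True)                  # softmax over tokens (assumed axis)
    V = A.T @ H                                           # (m, h) label-aware representation
    return V, A


rng = np.random.default_rng(0)
n, h, m = 6, 4, 3                  # toy sizes: tokens, hidden dim, codes
H = rng.normal(size=(n, h))
U = rng.normal(size=(h, m))
V, A = label_wise_attention(H, U)
print(V.shape, A.shape)
```

Each column of A sums to one, so every row of V is a convex combination of token encodings weighted by their relevance to one code.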

Experimental Setup
This paper focuses on assigning ICD codes to textual discharge summaries from a hospital stay.
Preprocessing. We use raw notes, ICD diagnoses, and procedures for patients from the public clinical MIMIC-III dataset for experiments. Discharge summaries labeled with a set of ICD-9 diagnosis and procedure codes include descriptions of procedures performed by a physician, diagnosis notes, the patient's medical history, and discharge instructions. Addenda of admissions are concatenated into a single document. The NLTK package is used for tokenization, and all tokens are converted to lowercase. Non-alphabetic characters, such as numbers and punctuation, are removed. All documents are truncated at 512 tokens for a single BERT encoder (BERT-trun) and at 2500 tokens for hierarchical BERT encoders (BERT-hier).
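The preprocessing steps can be approximated with the sketch below. It is an assumption-laden simplification: a regular expression replaces the NLTK tokenizer so that the example has no dependencies, the function name `preprocess` is hypothetical, and the sample sentence is invented for illustration.

```python
import re
from typing import List


def preprocess(note: str, max_tokens: int = 512) -> List[str]:
    """Lowercase a note, keep only alphabetic tokens, and truncate.

    max_tokens would be 512 for BERT-trun and 2500 for BERT-hier.
    """
    tokens = re.findall(r"[a-z]+", note.lower())  # drops digits and punctuation
    return tokens[:max_tokens]


sample = "Pt. admitted on 2101-03-04 with CHF; EF 25%."
print(preprocess(sample))
```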
Training. Following prior publications and recommendations for fine-tuning BERT, we adopt common settings. For example, the dropout probability is 0.1. The Adam optimizer (Kingma and Ba, 2014) is used to optimize CNN-based models, and the AdamW optimizer (Loshchilov and Hutter, 2019) is used to fine-tune BERT variants. We utilize a linear learning rate scheduler with warmup and layer-wise learning rates when fine-tuning BERT-based models. For retraining the CNN-based models, we use the CBOW variant of word2vec (Mikolov et al., 2013) and adopt static word embeddings. We set the batch size for the MIMIC-III top-50 and full sets to 8, and select the learning rate from the range 1e-6 to 1e-3. All the models are run on a Linux cluster with Nvidia P100 or V100 GPUs. For the MIMIC-III top-50 data set, the fine-tuning architecture with truncated notes and a linear classifier has about 109M parameters. In comparison, the same architecture for the MIMIC-III full set has about 115M parameters. For the hierarchical architecture, the models for the MIMIC-III top-50 and full code data sets have 115M and 128M parameters, respectively. Generally speaking, fine-tuning pretrained models requires a GPU with large memory; however, it is arguably less expensive than training from scratch.
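To make the scheduling concrete, here is a minimal sketch of a linear schedule with warmup and of layer-wise learning rates. The exact settings are not given in the text, so the warmup length, the total number of steps, the base learning rate, and the geometric per-layer decay factor below are all assumed values chosen for illustration.

```python
from typing import List


def linear_warmup_decay(step: int, total_steps: int,
                        warmup_steps: int, base_lr: float) -> float:
    """Increase linearly to base_lr during warmup, then decay linearly to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


def layer_wise_lrs(base_lr: float, num_layers: int, decay: float) -> List[float]:
    """Top layer gets base_lr; each layer below is scaled by `decay` (assumed scheme)."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total_steps=1000,
                                    warmup_steps=100, base_lr=2e-5))
print(layer_wise_lrs(base_lr=1.0, num_layers=3, decay=0.5))
```

Lower layers receive smaller learning rates under this scheme, on the assumption that they encode more general features that should change less during fine-tuning.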

Results
We conduct a series of experiments with different pretrained models on various domains using two fine-tuning architectures and reproduce classical CNN-based models with word embeddings from scratch. We make our code publicly available at https://agit.ai/jsx/MCA_BERT. Several evaluation metrics are used, including micro and macro F1 scores and the area under the receiver operating characteristic curve (AUC-ROC). We also evaluate precision at k (P@k), where k = 5 for the MIMIC-III subset with the top-50 frequent codes and k = 8, 15 for the full MIMIC-III set, given the observation that most medical documents are assigned no more than 20 codes. For example, the macro precision is calculated as the average of the per-label precision across all labels, given by:
$$\text{Macro-P} = \frac{1}{m} \sum_{l=1}^{m} \frac{\mathrm{TP}_l}{\mathrm{TP}_l + \mathrm{FP}_l},$$
where $\mathrm{TP}_l$ and $\mathrm{FP}_l$ are the numbers of true and false positives of code $l$, and $m$ is the number of ICD codes. Micro scores give more weight to frequent labels by considering all labels jointly. For example, the micro precision is defined as:
$$\text{Micro-P} = \frac{\sum_{l=1}^{m} \mathrm{TP}_l}{\sum_{l=1}^{m} \left( \mathrm{TP}_l + \mathrm{FP}_l \right)}.$$
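The difference between macro and micro precision, and the computation of precision at k, can be illustrated with a small pure-Python sketch. The function names and the toy label matrices are invented for this example; labels with no positive predictions are given a precision of 0 here, which is one common convention and an assumption on our part.

```python
from typing import List, Tuple


def macro_micro_precision(y_true: List[List[int]],
                          y_pred: List[List[int]]) -> Tuple[float, float]:
    """Rows are documents, columns are codes; entries are 0/1."""
    m = len(y_true[0])
    tp = [0] * m
    fp = [0] * m
    for t_row, p_row in zip(y_true, y_pred):
        for l in range(m):
            if p_row[l] == 1:
                if t_row[l] == 1:
                    tp[l] += 1
                else:
                    fp[l] += 1
    per_label = [tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0
                 for l in range(m)]
    macro = sum(per_label) / m
    denom = sum(tp) + sum(fp)
    micro = sum(tp) / denom if denom else 0.0
    return macro, micro


def precision_at_k(y_true_row: List[int], scores_row: List[float], k: int) -> float:
    """Fraction of the k highest-scored codes that are true codes."""
    top = sorted(range(len(scores_row)),
                 key=lambda l: scores_row[l], reverse=True)[:k]
    return sum(y_true_row[l] for l in top) / k


y_true = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]
y_pred = [[1, 1, 0], [1, 1, 1], [1, 0, 1]]
macro, micro = macro_micro_precision(y_true, y_pred)
print(macro, micro)
print(precision_at_k([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1], k=2))
```

In this toy case the frequent first code is always predicted correctly, so the micro precision (4/7) is higher than the macro precision (1/2), which weights the poorly predicted rare code equally.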

Pretraining in Close Domains Improves Prediction (RQ1)
This section studies pretraining in different domains (Sec. 3.1) to evaluate which pretraining scheme works best for medical coding in the clinical domain. Results on the MIMIC-III top-50 and full code sets are shown in Table 2, where the hierarchical fine-tuning architecture is used. Overall, pretraining in mixed domains improves predictive performance to some extent over BERT-base pretrained in general domains. These results show the effectiveness of transfer learning in enhancing the learning capacity on downstream tasks. Among all pretrained models from various domains, PubMedBERT, pretrained from scratch on biomedical article corpora, achieves comparatively better performance. Specifically, it leads to improvements of 3.9% and 3.3% in macro and micro F1 scores on the MIMIC-III top-50 code set. One possible explanation is that domain-specific pretraining makes the downstream classifier concentrate on specialized semantic knowledge. In contrast, for ClinicalBERT, which is pretrained on three types of domain, the model's attention may be distracted by relatively broad information. However, it still performs better than BERT-base, whose semantic knowledge comes only from general domains.

Effect of Pretraining with More Data (RQ1)
Can a larger pool of unsupervised pretraining data lead to performance gains in downstream clinical prediction? To answer this question, we conduct experiments on the effect of more pretraining data using three groups of pretrained models. In the biomedical domain, there are 200K abstracts from PubMed and 270K full-text articles from PubMedCentral. For the clinical domain, the CATEGORY value of the MIMIC-III dataset includes 'Discharge Summary', 'ECG', 'Radiology', and 'Echo'. 'Discharge summary' indicates that the note is a discharge summary. Besides, there are also other free-text notes. For example, 'Report' indicates a full report, and 'Addendum' indicates additional text added to the previous report.
The results with the BERT-hier fine-tuning architecture on two sets of the MIMIC-III dataset are shown in Table 3. We also study the effect using truncated short notes with the fully connected fine-tuning architecture, with results reported in Table 4. These two tables suggest that pretraining with more data leads to better performance in most cases.

Hierarchical Fine-Tuning Improves Prediction (RQ2)
This section answers the second research question (RQ2) by comparing the BERT-trun and BERT-hier fine-tuning architectures. Table 5 shows the results on the MIMIC-III top-50 code set using the two fine-tuning architectures, each combined with either a fully connected classifier or the label attention network. With appropriate training tricks, our results are slightly better than the preliminary results reported by ?. Mixed-domain pretrained models, such as PubMedBERT and ClinicalBERT, gain increases in evaluation scores. However, the predictive performance of most BERT-trun variants suffers from the loss of information when long notes are truncated into short ones. These results show that the proposed hierarchical fine-tuning architecture effectively utilizes long sequences and boosts performance. The label-wise attention mechanism can further improve the prediction in most cases.

Reproducing CNN Outperforms Advanced Methods on Frequent Codes (RQ3)
Several recent advances for medical code assignment are CNN-based models. This section investigates whether improved training of CNNs can improve prediction. Inspired by the performance improvements owing to training tricks, we reproduce the CNN model to check whether appropriate training strategies lead to better scores.
Baselines. CNN (Kim, 2014) is built on pretrained word vectors with 1D convolution and max-pooling for text classification. CAML (Mullenbach et al., 2018) integrates CNNs and a label-wise attention mechanism to learn rich representations. It has a variant called DR-CAML that uses ICD code descriptions to regularize the loss function. MultiResCNN (Li and Yu, 2020) combines residual learning (He et al., 2016) with the concatenation of multiple channels that use different convolutional filters, achieving good performance in most settings. HyperCore (Cao et al., 2020) utilizes hyperbolic embedding and co-graph representation with the code hierarchy. It achieves slightly better performance than MultiResCNN.
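The vanilla CNN baseline can be pictured with a minimal NumPy forward pass: a 1D convolution over word embeddings, max-pooling over positions, and a sigmoid output per code. This is a sketch under assumptions: the weights are random rather than trained, no nonlinearity is applied after the convolution, and the toy sizes are far smaller than a real configuration.

```python
import numpy as np


def cnn_encode(E: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """E: (n, d) word embeddings; W: (f, k, d) filters of width k; b: (f,).

    Returns one max-pooled feature per filter, shape (f,). Requires n >= k.
    """
    n, _ = E.shape
    _, k, _ = W.shape
    windows = np.stack([E[i:i + k] for i in range(n - k + 1)])   # (n-k+1, k, d)
    feature_maps = np.einsum("wkd,fkd->wf", windows, W) + b      # (n-k+1, f)
    return feature_maps.max(axis=0)                               # max over positions


def predict_codes(features: np.ndarray, W_out: np.ndarray,
                  b_out: np.ndarray) -> np.ndarray:
    """Linear layer followed by a sigmoid: one probability per code."""
    logits = features @ W_out + b_out
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
n, d, k, f, m = 20, 8, 4, 16, 5     # toy sizes (assumed): tokens, emb dim, width, filters, codes
E = rng.normal(size=(n, d))
features = cnn_encode(E, rng.normal(size=(f, k, d)) * 0.1, np.zeros(f))
probs = predict_codes(features, rng.normal(size=(f, m)) * 0.1, np.zeros(m))
print(features.shape, probs.shape)
```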
Reproducing CNNs. Table 6 shows the reproduced results compared with baselines. The hyperparameters are as follows: the learning rate is 0.003, the filter size is 4, the number of filters is 500, the dropout probability is 0.2, and word2vec embeddings are static. Our retrained vanilla CNN achieves better performance than two recent advanced models, MultiResCNN (Li and Yu, 2020) and HyperCore (Cao et al., 2020), on the MIMIC-III top-50 code dataset. Moreover, the retrained CAML model achieves a significantly higher P@15 score on the MIMIC-III full code dataset. These results suggest that a simple model with appropriate training can achieve decent performance in NLP applications such as this clinical application.
How did the CNN- and BERT-based models perform on the full code set? We bin ICD codes into different groups according to the frequency observed in the training set of the MIMIC-III full code dataset to study the effect of code frequency on the model's predictive performance. We take ClinicalBERT as a representative clinical note encoder and compare three fine-tuning strategies with the CNN and CAML models. Fig. 3 shows the models' predictive behavior on frequent and less frequent codes, taking F1 scores as the evaluation metrics. The plots show that the predictive performance of all compared models decreases when the ICD code groups have fewer training samples. When dealing with few-shot codes, model performance drops sharply. These results again suggest that medical code prediction algorithms should focus on less frequent codes and enhance robustness on them.
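Grouping codes by training-set frequency can be sketched as follows. The bin edges and the tiny label set are hypothetical and chosen only to show the mechanics; the actual groups used for Fig. 3 are not specified in the text.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def bin_codes_by_frequency(train_labels: Sequence[Sequence[str]],
                           edges: Sequence[int]) -> Dict[Tuple[int, int], List[str]]:
    """Assign each code to the half-open frequency bin [lo, hi) it falls into."""
    counts = Counter(code for codes in train_labels for code in codes)
    bins: Dict[Tuple[int, int], List[str]] = {}
    for code, count in sorted(counts.items()):
        for lo, hi in zip(edges[:-1], edges[1:]):
            if lo <= count < hi:
                bins.setdefault((lo, hi), []).append(code)
    return bins


# Toy training labels: one list of ICD-9 codes per document (assumed data).
train_labels = [["401.9", "428.0"], ["401.9"], ["401.9", "038.9"], ["428.0"]]
bins = bin_codes_by_frequency(train_labels, edges=[1, 2, 3, 10])
print(bins)
```

Per-group F1 scores could then be computed by restricting the evaluation to the codes in each bin.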

Discussion
Pretrained models in both general and specific domains have shown effectiveness in capturing contextual information. However, they encounter limitations in this study. On the dataset with the top-50 frequent codes, fine-tuning pretrained models can achieve good performance; however, PTM fine-tuning does not work well for high-dimensional structured prediction with a full label set that has more than 8,000 labels. This suggests that future work should focus on less frequent codes.
Self-attention-based models suffer from a complexity of $O(n^2 d)$, where $n$ is the sequence length and $d$ is the dimension of the hidden representation, making it hard to encode extremely long documents. We investigated how to incorporate the pretrained BERT model and its variants with a hierarchical fine-tuning architecture to tackle lengthy clinical document encoding. Nevertheless, CNN-based models (Mullenbach et al., 2018; Li and Yu, 2020; Ji et al., 2020a) and RNN-based models (Shi et al., 2017) perform considerably well with a relatively small model scale and remain a meaningful direction. Recently, some improved transformer-based models such as Longformer, Linformer, and Big Bird (Zaheer et al., 2020) aim to solve the problem of encoding long documents and mitigate the quadratic complexity. We leave these emerging models as future work.
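A back-of-the-envelope calculation shows why segmenting helps with the quadratic term. The sketch below compares the n²d cost of one hypothetical self-attention layer over a full 2500-token note with the cost of attending within 512-token segments plus a second-level layer over the segments. It ignores constants, the number of layers, and feed-forward costs, and d = 768 is the hidden size of BERT-base.

```python
def attention_cost(n: int, d: int) -> int:
    """Dominant n^2 * d term of one self-attention layer (constants ignored)."""
    return n * n * d


def hierarchical_cost(n: int, d: int, segment_len: int = 512) -> int:
    """Cost of attending within each segment plus across segment representations."""
    full, rest = divmod(n, segment_len)
    lengths = [segment_len] * full + ([rest] if rest else [])
    lower = sum(attention_cost(s, d) for s in lengths)
    upper = attention_cost(len(lengths), d)   # second-level transformer over segments
    return lower + upper


flat = attention_cost(2500, 768)
hier = hierarchical_cost(2500, 768)
print(flat, hier, round(flat / hier, 2))
```

Under these simplifying assumptions, the hierarchical layout needs roughly one fifth of the attention computation of a flat encoder over the same note.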

Conclusion
This paper presented a comprehensive quantitative analysis of medical code assignment from clinical notes using various pretrained BERT models. We compared the behavior of several different domain-specific BERT variants. To solve the problem of lengthy clinical note encoding, we developed two fine-tuning architectures: 1) a fully connected network with simple truncation into short notes; 2) a hierarchical fine-tuning architecture with long note segmentation and an additional Transformer on top. Moreover, we employed label attention to facilitate label-aware representation learning. Through intensive experiments, we found that the magic of BERT does not apply to the task of assigning ICD codes from clinical notes. In contrast, we found that a simple CNN trained from scratch can achieve superior predictive performance on frequent codes, achieving a new state of the art on the MIMIC-III top-50 dataset. This demonstrates how recent training strategies can improve old models. Our results furthermore suggest that medical code assignment algorithms should pay more attention to less frequent codes.