Hierarchical Label-wise Attention Transformer Model for Explainable ICD Coding

International Classification of Diseases (ICD) coding plays an important role in systematically classifying morbidity and mortality data. In this study, we propose a hierarchical label-wise attention Transformer model (HiLAT) for the explainable prediction of ICD codes from clinical documents. HiLAT first fine-tunes a pretrained Transformer model to represent the tokens of clinical documents. We subsequently employ a two-level hierarchical label-wise attention mechanism that creates label-specific document representations. These representations are in turn used by a feed-forward neural network to predict whether a specific ICD code is assigned to the input clinical document of interest. We evaluate HiLAT using hospital discharge summaries and their corresponding ICD-9 codes from the MIMIC-III database. To investigate the performance of different types of Transformer models, we develop ClinicalplusXLNet, which is continually pretrained from XLNet-Base on all the MIMIC-III clinical notes. The experimental results show that HiLAT+ClinicalplusXLNet outperforms the previous state-of-the-art models in F1 scores for the 50 most frequent ICD-9 codes from MIMIC-III. Visualisations of attention weights present a potential explainability tool for checking the face validity of ICD code predictions.


Introduction
The International Classification of Diseases (ICD) system is maintained by the World Health Organization (WHO) and is widely used to systematically code diseases for different purposes, including health statistics, medical reimbursement and billing, decision support, and medical and health services research [1,2]. Currently, trained clinical coders manually perform the ICD coding process. Manual coding is costly, laborious, and error-prone [3][4][5] and has traditionally suffered from a lack of workforce supply [5]. Thanks to developments in natural language processing (NLP) and machine learning, automated ICD coding has been an active research task for over two decades [6,7], but has yet to be widely implemented at scale [8].
Automated ICD coding is formulated as a multi-label text classification task that assigns a subset of ICD codes to a given clinical document [9][10][11]. In recent years, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been widely used for automated ICD coding tasks [4,9,10,[12][13][14][15][16][17][18][19][20][21]. To improve the performance of ICD coding models, Shi et al. [13] proposed an attentive long short-term memory architecture that added a label-wise attention layer on top of a hierarchical LSTM network. Similarly, a hierarchical attention bidirectional gated recurrent unit model [12] and a label-wise convolutional attention model for multi-label classification [9] were introduced, employing label-wise attention mechanisms to capture the important sentences or text snippets correlated with each code in the document. Motivated by the success of attention mechanisms for ICD coding, several studies [10,14,16,19] have used different label-wise attention mechanisms with various configurations of CNNs or RNNs and achieved state-of-the-art (SOTA) results on the publicly accessible Medical Information Mart for Intensive Care (MIMIC-III) dataset [22]. In addition to better performance, attention-based explainability can be incorporated to increase users' trust in a post-hoc manner by highlighting the text snippets that contribute most to specific label predictions.
Here, label-wise attention mechanisms can help explain why a model assigns a particular subset of codes to a given document by assigning different weight scores to different text snippets or words in the document.
The Transformer is an encoder-decoder architecture using stacked multi-head self-attention and position-wise fully connected feed-forward layers [23]. Some studies [24,25] have embedded the Transformer encoder/decoder in their model architectures to predict ICD codes and achieved better results than previous CNN- or RNN-based models, but pretrained Transformer-based language models, especially Bidirectional Encoder Representations from Transformers (BERT) [26], have become popular and dominated the SOTA in many NLP tasks [27]. However, pretrained BERT models suffer from a discrepancy problem caused by symbols (such as [MASK]) that are manually added during the pretraining process but are never seen in fine-tuning datasets. To resolve this, Yang et al. [28] proposed another type of Transformer-based model, XLNet, a generalized autoregressive pretraining method for language understanding. XLNet has been demonstrated to consistently outperform BERT on 20 NLP tasks [28].
Although many studies have attempted to fine-tune pretrained Transformer-based language models (e.g., BERT or its variants) on ICD coding tasks [11,[29][30][31][32][33], only a few have investigated pretrained XLNet, and none of these outperformed the SOTA approaches. In addition, BERT imposes a maximum sequence length (512 tokens) on the input data because of the quadratic computational complexity of self-attention. It is therefore a challenge to apply BERT to long documents that contain thousands of words. Technically, a text sequence of any length can be input to XLNet for fine-tuning; however, memory usage and computational cost increase quadratically as the sequence length grows.
Motivated by the success of label-wise attention mechanisms and Transformer-based models in ICD coding tasks and the robustness of XLNet in many NLP tasks, we propose a Hierarchical Label-Wise Attention Transformer model (HiLAT) for explainable ICD coding. The contributions of this study are:

• Pretrained XLNet models. In the biomedical NLP community, there are several domain-specific BERT models pretrained on biomedical datasets, such as BioBERT [34], ClinicalBERT [35], BlueBERT [36] and PubMedBERT [37]. They have been shown to outperform general-domain pretrained BERT models on many biomedical NLP tasks [34][35][36][37]. However, there are only two biomedical domain-specific pretrained XLNet models (ClinicalXLNet [28]), which were pretrained on 1) nursing notes (n=1,077,792) and 2) discharge summaries (n=59,652) from MIMIC-III, respectively. We create two new XLNet models by continual pretraining from the XLNet-Base checkpoint on all the clinical notes of MIMIC-III combined (n=2,083,180) and on all the clinical notes of MIMIC-III excluding discharge summaries (n=2,023,528).

• A HiLAT architecture for explainable ICD coding from discharge summaries. HiLAT takes discharge summaries as input and assigns a subset of ICD codes to each of them. The attention weights produced by HiLAT make predictions explainable by highlighting the clinical text that influenced the assignment of ICD codes.

Attentional CNNs/RNNs
With the success and effectiveness of attention mechanisms and deep neural networks in NLP tasks, an increasing number of studies have explored combining the two for the ICD coding task and have achieved SOTA results.
Mullenbach et al. [9] proposed an attentional convolutional network (CAML) to automatically assign ICD code sets to discharge summaries using the MIMIC-II and MIMIC-III datasets. CAML utilized a label-wise attention layer above a single-channel CNN layer to attend to the text within the document that was most relevant to each specific code. A model variant, Description Regularized CAML (DR-CAML), was built to improve performance for rarely observed codes by regularizing the model parameters. Sadoughi et al. [14] improved CAML by using a four-channel CNN with maximum pooling across the channels and a label-dependent attention layer (MVC-LDA).
They further proposed a variant of MVC-LDA (MVC-RLDA) that regularizes the attention layer using ICD code descriptions. Li et al. [18] proposed a Multi-Filter Residual Convolutional Neural Network (MultiResCNN) that combined a multi-filter convolutional layer and a residual convolution layer with a label-wise attention layer to capture text patterns of different lengths and enlarge the receptive field for ICD coding.
Baumel et al. [12] presented a Hierarchical Attention bidirectional Gated Recurrent Unit (HA-GRU) network to effectively encode long clinical documents via two levels of GRU layers. A label-wise attention layer was added to the second GRU layer to focus on the relevant sentences for each label.
However, HA-GRU could only provide sentence-level attention for each label. To enhance model explainability, Dong et al. [19] proposed a Hierarchical Label-wise Attention Network (HLAN), which applied word-level label-wise attention to HA-GRU. Shi et al. [13] used a hierarchical label-wise attention LSTM architecture (AttentiveLSTM) to perform ICD coding. They explored two types of attention mechanism: hard selection, which selected the maximum attention score for each code, and soft attention, which applied a softmax function to normalize the attention scores. Vu et al. [10] proposed a label attention model (LAAT) and a hierarchical joint learning model (JointLAAT) for ICD coding. LAAT used a bidirectional LSTM network to produce input feature representations and a label-wise attention layer to learn label-specific vectors of clinical text for each label. JointLAAT first employed a LAAT model to predict normalized codes (the first three characters of the ICD codes) and then concatenated the projected normalization output with the label-specific vectors of another LAAT model to generate the final prediction.

Transformer-based Architecture
Transformer-based architectures, in particular pretrained Transformer language models, have become popular for a wide range of NLP tasks. Biswas et al. [24] proposed a Transformer-based code-wise attention model (TransICD) that used a Transformer encoder to capture contextual word representations. Similarly, Zhou et al. [25] presented an Interactive Shared representation network with self-Distillation mechanism (ISD) that employed a bidirectional multi-layer Transformer decoder to extract interactive shared representations from features captured from clinical notes via a CNN network. Transfer learning, i.e., fine-tuning pretrained Transformer language models, is the prevalent approach for many downstream NLP tasks [36,40]. Feucht et al. [30] proposed a description-based label attention classifier (DLAC) to provide explainable ICD coding on discharge summaries. DLAC used pretrained Transformer models to represent the documents and Word2vec [41] to encode ICD code descriptions; a label attention classifier was then employed to predict ICD codes for a given document. The pretrained model Longformer [42] combined with DLAC (Longformer-DLAC) achieved the best performance in that study. Pascual et al. [32] investigated fine-tuning another BERT variant (PubMedBERT) for the ICD coding task, called BERT-ICD. Due to the maximum input sequence length of BERT, they proposed five strategies for reducing long texts: the first 512 tokens, the last 512 tokens, a mix of the first 256 and last 256 tokens, different fixed-length chunks, and meaningful paragraphs.

Datasets and Preprocessing
We use three datasets: MIMIC-III-CN, comprising all the clinical notes of MIMIC-III combined (n=2,083,180); MIMIC-III-CN-DS, comprising all the clinical notes of MIMIC-III excluding discharge summaries (n=2,023,528); and MIMIC-III-50, consisting of the discharge summaries coded with at least one of the 50 most frequent diagnosis and procedure codes. The first two of these datasets are used for continual pretraining of XLNet, and the third is used to develop and evaluate the proposed models for the ICD coding task. The details and descriptive statistics of the three datasets are shown in Table 1.

On manually investigating a sample of discharge summaries, we find that they contain structured sections such as "past medical history", "discharge medications", "allergies", "discharge diagnosis", "attending", and "discharge disposition", as shown in Supplementary Fig. S1. The discharge diagnosis section contains extensive diagnosis information and is located at the end of the discharge summary. Because the discharge summaries are split sequentially into 10 chunks with a fixed sequence length of 510 tokens, with padding tokens inserted for shorter summaries and longer summaries truncated, there is a risk of losing this important information for ICD coding of long discharge summaries. To avoid this risk, we move the "discharge diagnosis", "discharge disposition", and "discharge conditions" sections to the beginning of the discharge summaries. The same preprocessing used for our pretraining task is also applied to the discharge summaries, and we further remove words that do not contain any alphabetic characters.
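As an illustration of this preprocessing step, the sketch below moves selected sections to the front and drops words without alphabetic characters. It assumes simple colon-terminated section headers and the section names as written above; the actual notes vary in formatting, so this is an approximation rather than the exact pipeline used in the study.

```python
import re

# Illustrative section names to move to the front; casing/wording may differ in real notes.
FRONT_SECTIONS = {"discharge diagnosis", "discharge disposition", "discharge condition"}

def reorder_and_clean(summary: str) -> str:
    """Move selected sections to the front of a discharge summary and drop
    tokens with no alphabetic characters (a sketch, not the exact pipeline)."""
    # Split before lines that look like section headers, e.g. "Discharge Diagnosis:".
    parts = re.split(r"\n(?=[A-Za-z][A-Za-z /]+:)", summary)
    front, rest = [], []
    for part in parts:
        header = part.split(":", 1)[0].strip().lower()
        (front if header in FRONT_SECTIONS else rest).append(part)
    text = "\n".join(front + rest)
    # Keep only words containing at least one alphabetic character.
    words = [w for w in text.split() if re.search(r"[A-Za-z]", w)]
    return " ".join(words)
```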

Pretraining XLNet
We utilize the original implementation of XLNet to pretrain the models. First, the training data is converted to TensorFlow records. Second, we further pretrain two XLNet models from the XLNet-Base checkpoint using the two generated datasets, MIMIC-III-CN and MIMIC-III-CN-DS, respectively. XLNet-Base was pretrained on BookCorpus, English Wikipedia, Giga5, ClueWeb 2012-B and Common Crawl by Yang et al. [28]. When training is finished, the last checkpoint is selected as our final pretrained model. Finally, we use the Hugging Face Transformers library to convert the checkpoint to a PyTorch model, which can then be loaded for the ICD coding task.
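Once converted, the checkpoint can be loaded like any other Transformers model. A minimal sketch is shown below; the model directory name is a placeholder, not the released checkpoint path.

```python
import torch
from transformers import XLNetModel, XLNetTokenizerFast

# "path/to/clinicalplus-xlnet-base" is a placeholder for the directory containing
# the converted PyTorch checkpoint and its tokenizer files.
tokenizer = XLNetTokenizerFast.from_pretrained("path/to/clinicalplus-xlnet-base")
model = XLNetModel.from_pretrained("path/to/clinicalplus-xlnet-base")

inputs = tokenizer("chest pain and shortness of breath", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
print(hidden.shape)
```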

HiLAT
The architecture of our proposed HiLAT model for ICD coding is shown in Fig. 1. Overall, the HiLAT model comprises four layers. Before feeding data into HiLAT, a discharge summary is split into multiple chunks, each with a maximum of 510 tokens. The first layer is a pretrained Transformer language model that creates hidden representations for the tokens of each chunk. The second layer is a token-level label-wise attention layer, in which a label-specific chunk representation is produced by applying attention weights to the tokens within a chunk. The representations from all the chunks for a specific label are stacked together and fed into the third layer, the chunk-level attention layer, to generate label-specific document representations. The last layer is the classifier layer, which consists of multiple single feed-forward neural networks (FFNNs); each classifier predicts the probability of assigning a specific ICD code to the input discharge summary.

[Fig. 1 caption excerpt: the label-specific chunk representations for ICD code 2 from the n chunks are combined into a matrix C = [c12, c22, c32, …, cn2]; given this label-specific chunk representation matrix, a chunk-level attention layer produces the document representation for an ICD code, and an FFNN binary classifier determines whether that ICD code is related to the input discharge summary.]
We implement and train HiLAT using PyTorch and the Hugging Face Transformers library.

Transformer layer
We use the pretrained Transformer models as the Transformer layer and fine-tune all of their parameters during training. The input document D is tokenized by the relevant Transformer tokenizer into a sequence of tokens. We split D sequentially into 10 chunks with a fixed length of 510 tokens per chunk. Documents shorter than 5,100 tokens are padded and longer documents are truncated. The two special tokens [CLS] and [SEP] are also appended to each chunk.
The output of the last layer of the pretrained Transformer model is used to represent the tokens $\{t_{n1}, t_{n2}, \dots, t_{nm}\}$ in chunk $n$. All the token representation vectors are formulated as a matrix $H_n = [h_{n1}, h_{n2}, \dots, h_{nm}] \in \mathbb{R}^{m \times d}$, where $m$ is the number of tokens in a chunk (512, including the two special tokens) and $d$ is the output size of the pretrained Transformer model used in HiLAT.
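A minimal sketch of this chunking and encoding step is shown below. It assumes the generic XLNet-Base tokenizer and encoder from the Transformers library, and it omits attention masks and batching for brevity; it is not the authors' exact implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MAX_CHUNKS, CHUNK_LEN = 10, 510  # 510 content tokens + 2 special tokens per chunk

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
encoder = AutoModel.from_pretrained("xlnet-base-cased")

def encode_chunks(document: str) -> torch.Tensor:
    """Return token representations H of shape (MAX_CHUNKS, CHUNK_LEN + 2, hidden_size).
    Padding/truncation to a fixed 5,100-token budget mirrors the description above."""
    ids = tokenizer(document, add_special_tokens=False)["input_ids"]
    ids = ids[: MAX_CHUNKS * CHUNK_LEN]                                    # truncate long documents
    ids += [tokenizer.pad_token_id] * (MAX_CHUNKS * CHUNK_LEN - len(ids))  # pad short ones
    chunk_reps = []
    for i in range(MAX_CHUNKS):
        chunk = ids[i * CHUNK_LEN : (i + 1) * CHUNK_LEN]
        chunk = tokenizer.build_inputs_with_special_tokens(chunk)  # add the sep/cls tokens
        input_ids = torch.tensor([chunk])
        chunk_reps.append(encoder(input_ids).last_hidden_state[0])  # (tokens, hidden_size)
    return torch.stack(chunk_reps)
```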

Token-level attention layer
The label attention mechanism proposed by Vu et al. [10] is employed in HiLAT. In the token-level attention layer, we transform $H_n$ into $\mathcal{L}$ label-specific vectors representing chunk $n$, where $\mathcal{L}$ denotes the size of the label set. First, the label-specific attention weights are computed as:

$$Z_n = \tanh(W H_n^{\top}) \quad (1)$$

$$A_n = \mathrm{softmax}(U^{\top} Z_n) \quad (2)$$

where $W \in \mathbb{R}^{a \times d}$ is a weight matrix. Equation (1) results in a matrix $Z_n \in \mathbb{R}^{a \times m}$. $U \in \mathbb{R}^{a \times \mathcal{L}}$ is a matrix comprising the $\mathcal{L}$ label representation vectors, which is multiplied with $Z_n$ to compute the label-specific token-level attention weights $A_n \in \mathbb{R}^{\mathcal{L} \times m}$. The $\ell$-th row of $A_n$ contains the attention weights over the 512 tokens in chunk $n$ for the $\ell$-th label in $\mathcal{L}$. The softmax activation function in equation (2) makes the token-level attention weights sum to 1 for each label.

Subsequently, $H_n^{\top}$ is multiplied with the transpose of the token-level attention matrix to generate a matrix $C_n \in \mathbb{R}^{d \times \mathcal{L}}$ for chunk $n$:

$$C_n = H_n^{\top} A_n^{\top} \quad (3)$$

The $\ell$-th column of $C_n$ is the label-specific representation of chunk $n$ for the $\ell$-th label in $\mathcal{L}$. We use $c_{n\ell}$ to denote the label-specific chunk representation for chunk $n$ and label $\ell$.

To generate the input of the next layer, the $\ell$-th column is extracted from each of the matrices $C_1, \dots, C_n$ to form a new matrix $C^{\ell} \in \mathbb{R}^{d \times n}$ ($\ell \in \{1, \dots, \mathcal{L}\}$) as:

$$C^{\ell} = [c_{1\ell}, c_{2\ell}, \dots, c_{n\ell}] \quad (4)$$
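A possible PyTorch implementation of the token-level label-wise attention (equations (1)-(3)) is sketched below; the module and variable names are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn as nn

class TokenLevelLabelAttention(nn.Module):
    """Token-level label-wise attention (a sketch of Eqs. (1)-(3))."""
    def __init__(self, hidden_size: int, attn_size: int, n_labels: int):
        super().__init__()
        self.W = nn.Linear(hidden_size, attn_size, bias=False)   # W in R^{a x d}
        self.U = nn.Linear(attn_size, n_labels, bias=False)      # U in R^{a x L}

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (m, d) token representations of one chunk
        Z = torch.tanh(self.W(H))            # (m, a)
        A = torch.softmax(self.U(Z), dim=0)  # (m, L), attention over tokens per label
        C = H.transpose(0, 1) @ A            # (d, L): label-specific chunk representations
        return C
```

Stacking the $\ell$-th columns of $C_1, \dots, C_n$ across chunks then yields $C^{\ell}$ (equation (4)).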

Chunk-level attention layer
This layer employs the same attention mechanism as the previous one. The matrix $C^{\ell}$ is input into the chunk-level attention layer to produce the document representation $v^{\ell}$ for label $\ell$. The computations are:

$$Z^{\ell} = \tanh(W_c\, C^{\ell}) \quad (5)$$

$$\alpha^{\ell} = \mathrm{softmax}(u_c^{\top} Z^{\ell}) \quad (6)$$

$$v^{\ell} = C^{\ell} (\alpha^{\ell})^{\top} \quad (7)$$

Here, $W_c \in \mathbb{R}^{a \times d}$ is multiplied with $C^{\ell}$ and passed through the hyperbolic tangent activation function to produce $Z^{\ell} \in \mathbb{R}^{a \times n}$. The chunk-level attention weight vector $\alpha^{\ell} \in \mathbb{R}^{n}$ for label $\ell$ is computed from a randomly initialized vector $u_c \in \mathbb{R}^{a}$ and the matrix $Z^{\ell}$ with a softmax function. Finally, the document representation vector $v^{\ell} \in \mathbb{R}^{d}$ for label $\ell$ is produced using equation (7).
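Under the same naming assumptions, the chunk-level attention (equations (5)-(7)) could be implemented as follows; for convenience the chunk representations are laid out with chunks as rows (the transpose of $C^{\ell}$ above).

```python
import torch
import torch.nn as nn

class ChunkLevelAttention(nn.Module):
    """Chunk-level attention producing one document vector per label (sketch of Eqs. (5)-(7))."""
    def __init__(self, hidden_size: int, attn_size: int):
        super().__init__()
        self.W_c = nn.Linear(hidden_size, attn_size, bias=False)  # W_c in R^{a x d}
        self.u_c = nn.Parameter(torch.randn(attn_size))           # u_c in R^{a}

    def forward(self, C_l: torch.Tensor):
        # C_l: (n_chunks, d) label-specific chunk representations for one label
        Z = torch.tanh(self.W_c(C_l))               # (n_chunks, a)
        alpha = torch.softmax(Z @ self.u_c, dim=0)  # (n_chunks,) chunk-level attention weights
        v_l = alpha @ C_l                           # (d,) document representation for the label
        return v_l, alpha
```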

Classifier layer
Given the document representation $v^{\ell}$, we use a linear layer as the classifier. The probability $p^{\ell}$ for label $\ell$ is calculated by the classifier with a sigmoid transformation:

$$p^{\ell} = \sigma(w_{\ell}^{\top} v^{\ell} + b_{\ell}) \quad (8)$$

where $w_{\ell} \in \mathbb{R}^{d}$ is a weight vector and $b_{\ell}$ is a bias parameter. We use a threshold of 0.5 to predict the binary output for label $\ell$. The training procedure aims to minimize the binary cross entropy loss:

$$\mathrm{Loss} = -\sum_{\ell=1}^{\mathcal{L}} \left[ y_{\ell} \log p^{\ell} + (1 - y_{\ell}) \log\left(1 - p^{\ell}\right) \right] \quad (9)$$

where $y_{\ell} \in \{0, 1\}$ indicates whether ICD code $\ell$ is assigned to the document.
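A sketch of the per-label classifiers and the training loss is shown below. It uses BCEWithLogitsLoss, which combines the sigmoid and binary cross entropy of equations (8)-(9) in a numerically stable way; the tensor shapes and hidden size are illustrative placeholders.

```python
import torch
import torch.nn as nn

n_labels, hidden_size = 50, 768   # MIMIC-III-50 label set; hidden size is illustrative

# One weight vector and bias per label, applied to that label's document vector (Eq. (8)).
classifiers = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in range(n_labels)])
loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + binary cross entropy (Eq. (9))

doc_reprs = torch.randn(n_labels, hidden_size)        # v^l stacked for all labels (placeholder)
targets = torch.randint(0, 2, (n_labels,)).float()    # ground-truth code assignments y_l

logits = torch.cat([clf(doc_reprs[l]) for l, clf in enumerate(classifiers)])  # (n_labels,)
loss = loss_fn(logits, targets)
preds = (torch.sigmoid(logits) > 0.5).long()          # 0.5 decision threshold per label
```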

Attention explainability
There are two levels of attention in our proposed HiLAT. The token-level attention weights measure each token's contribution to a specific label prediction within a chunk. The chunk-level attention weights show each chunk's contribution to a specific label prediction within a document. To obtain the global contribution of each token, we calculate the global token attention weights of chunk $n$ for label $\ell$ as:

$$g_{n\ell} = \alpha_{n}^{\ell} \, a_{n\ell} \quad (10)$$

where $g_{n\ell} \in \mathbb{R}^{m}$ is the global token attention vector for chunk $n$ and label $\ell$, $a_{n\ell} \in \mathbb{R}^{m}$ is the token-level attention weight vector for chunk $n$ and label $\ell$ (the $\ell$-th row of the matrix $A_n$), and $\alpha_{n}^{\ell}$ is the chunk-level attention weight (a scalar) for chunk $n$ and label $\ell$ (the $n$-th element of the chunk-level attention vector $\alpha^{\ell}$).
In Transformer models, a token does not necessarily correspond to a word, because tokenizers such as WordPiece and SentencePiece may split a word into one or more tokens. The global attention vector $g_{n\ell}$ is therefore defined over tokens. To calculate word-level attention weights for explainability visualization, the attention weights of the tokens belonging to the same word are summed together and then normalized to give the attention weight for the word. Fig. 2 shows an example of the word attention calculation.
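The following sketch combines the two attention levels (equation (10)) and aggregates sub-token weights into word weights. The sub-token-to-word mapping is assumed to come from a fast tokenizer's word_ids() output; variable names are illustrative.

```python
import torch

def word_attention(token_weights: torch.Tensor, chunk_weight: float, word_ids: list):
    """Combine token- and chunk-level attention into per-word weights (a sketch).

    token_weights: (m,) token-level attention for one chunk and one label (a_{nl})
    chunk_weight:  scalar chunk-level attention for that chunk and label (alpha_n^l)
    word_ids:      length-m list mapping each sub-token to its word index (None for specials)
    """
    global_weights = token_weights * chunk_weight        # global token attention g_{nl}
    n_words = max(w for w in word_ids if w is not None) + 1
    word_scores = torch.zeros(n_words)
    for tok_idx, w in enumerate(word_ids):
        if w is not None:
            word_scores[w] += global_weights[tok_idx]    # sum sub-token weights per word
    return word_scores / word_scores.sum()               # normalize for visualization
```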

Pretrained XLNet models
We train the XLNet models on eight TPU v3 cores with 128GB of total TPU memory for one million training steps.

For fine-tuning HiLAT, the hyperparameter search space is shown in Supplementary Table S1. The optimal settings are a batch size of 16, a learning rate of 5e-5, an optimizer weight decay of 0.1, 2,500 training steps, and 500 warmup steps. The dropout rate is always set to 0.1. We train the models on eight TPU v3 cores. Table 2 shows the evaluation results for HiLAT on MIMIC-III-50. We experiment with six different pretrained Transformer language models in the Transformer layer, and also report the number of parameters and training times.
The language models pretrained on datasets that include MIMIC-III discharge summaries (ClinicalXLNet and ClinicalplusXLNet) perform better than ClinicalplusXLNet-DS, which was pretrained on the dataset excluding discharge summaries. However, training HiLAT with the XLNet variants is almost four times slower than with the BERT variants because of the use of all possible permutations of the sequences to capture bidirectional context. We compare our proposed HiLAT against 13 baselines, as shown in Table 3. HiLAT beats all the baselines on most metrics by a notable margin. Compared with the current SOTA model (ISD), HiLAT performs better on three of the five evaluation metrics, improving micro-AUC, macro-F1 and micro-F1 by 0.1%, 1.1%, and 1.8%, respectively.

Ablation Study
To better understand the effectiveness of our proposed approach, we perform an ablation study based on HiLAT+ClinicalplusXLNet using the MIMIC-III-50 dataset. The results of the ablation experiments are shown in Table 4. All of the ablation experiments perform worse than HiLAT+ClinicalplusXLNet, demonstrating that all five factors contribute improvements to our model. In particular, fine-tuning all Transformer layers and the hierarchical label-wise attention mechanism improve performance by a large margin.

Model Explainability
The attention weights of the hierarchical label-wise attention mechanism of HiLAT can be used to tag the words in the chunks as relevant to each specific code. Fig. 3 shows examples of model explainability. For the ICD-9 diagnosis codes "276.2 Acidosis" and "428.0 Congestive heart failure, unspecified", our model successfully tags the keywords "metabolic acidosis … increased anion gap metabolic acidosis" and "acute on chronic systolic congestive heart failure", respectively. The keywords "cardiac catherization on and second cardiac catherization on with bms" and "cardiac catherization where two stents" are highlighted for the ICD-9 procedure code of "88.56 Coronary arteriography using two catheters". The patient with chronic lymphocytic leukemia was admitted to ICU due to fatigue and weight loss.
Our model assigns the ICD-9 diagnosis code "518.81 Acute respiratory failure" to the hospital admission according to the keyword "dyspnea". However, the discharge summary does not specifically mention respiratory failure, and this code would not be assigned on the basis of the symptom of dyspnea alone. The ICD-9 diagnosis code "995.92 Severe sepsis" is predicted according to the highlighted keywords "possible septic physiology"; our model does not correctly learn the meaning of the qualifier "possible". The next two examples are from a patient who had shortness of breath and was intubated for respiratory distress. Our model correctly identifies that the patient was intubated during the ICU stay. However, coding standards specify that the ICD-9 procedure code "96.04 Insertion of endotracheal tube" is recorded only when intubation occurs without ventilation, demonstrating the challenge of learning when a code should not be recorded on the basis of somewhat arbitrary coding rules. The ICD-9 procedure code "96.71 Continuous invasive mechanical ventilation for less than 96 consecutive hours" is predicted by our model, whereas the clinical coder assigned "96.72 Continuous invasive mechanical ventilation for 96 consecutive hours or more" to the hospital stay because the patient was extubated after 7 days. Models that incorporate temporal understanding of free text are required to correctly learn such information. The procedure code "96.6 Enteral infusion of concentrated nutritional substances" in the last example of Fig. 4 is predicted correctly by our model, but the highlighted text snippets do not provide meaningful explainability.

Discussion
In this study, we propose a hierarchical label-wise attention Transformer model (HiLAT) to automatically predict ICD codes from discharge summaries. In addition, we pretrain two XLNet language models, ClinicalplusXLNet and ClinicalplusXLNet-DS, using the clinical notes of MIMIC-III. The experimental results show that our HiLAT+ClinicalplusXLNet model outperforms all the baseline models in macro- and micro-F1 scores by a large margin for predicting ICD-9 codes on the MIMIC-III-50 dataset, which comprises the discharge summaries coded with at least one of the 50 most frequent diagnosis and procedure codes. The ablation study demonstrates that all the components of our proposed approach contribute improvements to model performance.
Explainable ICD coding will build users' trust in automated coding models and help to remove obstacles to deploying automated coding systems in practice [46,47]. We demonstrate that the attention weights produced by HiLAT can highlight the text snippets contributing to specific ICD code predictions, offering a potential tool for checking the face validity of those predictions.

We tried to train HiLAT on all discharge summaries in MIMIC-III, which involve an extremely large number of ICD-9 codes (n=8,929). We encountered a memory leak after training for about 10 hours because the model's computation graph in the backward pass was too large for the available physical memory. In an empirical comparison study [49], Yogarajan et al. compared fine-tuning three pretrained BERT models with traditional neural networks such as CAML for the ICD coding task, using five different label-set sizes on two datasets. They found that the Transformer models could not outperform the traditional neural networks when the label-set size was greater than 300.
Exploring the utility for the ICD coding task of Transformer variants that have been studied for extreme multi-label classification, in particular X-Transformer [50] and XR-Transformer [51], could be a fruitful avenue for further research.
Although HiLAT provides label-wise explainability visualizations using the hierarchical attention weights, there is a need to evaluate the quality of the explanations objectively. This will help contribute knowledge to the debate about the usefulness of attention mechanisms for explainability [52,53] and will be a crucial direction of our future work. We plan to explore the use of the mutual information metric [54] to compare the keywords assigned high attention weights by HiLAT with the ground truth of ICD codes, including the ICD code descriptions and other ICD code mapping terms from the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) and the Unified Medical Language System (UMLS).

Conclusion
We present an approach for automated ICD coding from discharge summaries: the HiLAT model, which combines a hierarchical label-wise attention mechanism with a pretrained Transformer language model, ClinicalplusXLNet. HiLAT extracts label-wise text representations from discharge summaries and maps them to the corresponding ICD codes. We also demonstrate the use of the label-wise attention weights produced by HiLAT to highlight the relevant keywords contributing to its specific ICD code predictions. HiLAT can potentially be applied to different types of multi-label text classification tasks to achieve SOTA results, especially in the clinical health domain with the help of our pretrained language model ClinicalplusXLNet. HiLAT can be deployed to augment and streamline current manual processes for clinical coding, noting that our results relate to performance for predicting the 50 most frequent ICD codes only. Other applications with immediate promise include automated identification from clinical notes of patients with specific conditions who are eligible for recruitment into clinical trials, and of specific clinical endpoints (e.g., major cardiovascular events) for clinical trials and real-world evidence studies.
CRediT authorship contribution statement

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.