An Improved Double Channel Long Short-Term Memory Model for Medical Text Classification

Medical and healthcare Internet communities contain a large number of symptom consultation texts, and Chinese word segmentation in the health domain is complex, which leads to low accuracy of existing algorithms for medical text classification. Deep learning models are effective at extracting abstract features of text. However, for large samples of complex text data, especially for words with ambiguous meanings in the field of Chinese medical diagnosis, a word-level neural network model is insufficient. Therefore, to support the triage and precise treatment of patients, we present an improved Double Channel (DC) mechanism as a significant enhancement to Long Short-Term Memory (LSTM). In this DC mechanism, two channels receive word-level and char-level embeddings, respectively, at the same time. Hybrid attention is proposed to combine the output at the current time step with the cell state at the current time step and then use attention to calculate the weight. By calculating the probability distribution of the weights of the input data at each time step, the weight score is obtained, and a weighted summation is then performed. Finally, the data input at each time step is subjected to trade-off learning to improve the generalization ability of the model. Moreover, we conduct an extensive performance evaluation on two different datasets: cMedQA and Sentiment140. The experimental results show that the DC-LSTM model proposed in this paper achieves significantly better accuracy and ROC results than the basic CNN-LSTM model.


Introduction
People consult medical experts online in healthcare communities, asking for treatment plans by describing their symptoms or seeking a recommended hospital and department. Using a deep learning algorithm to classify disease symptom text can optimize the allocation of medical resources and improve the efficiency of medical treatment. Text categorization is a classic problem in the field of Natural Language Processing (NLP). For effective medical diagnosis, we propose using an improved LSTM model to classify medical consultation text. The commonly used classification methods include Naïve Bayes [1], Support Vector Machine (SVM) [2], and Decision Trees [3]. These classic machine learning classification algorithms have achieved significant results in text classification tasks. With the research and development of neural networks and deep learning, the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) have been found to exhibit excellent performance in text classification tasks. At the same time, some models for Chinese question answering in the medical field have been proposed. Jain and Dodiya [4] proposed a rule-based framework for a medical question-answering system. Yin et al. [5] designed an algorithm for clustering and similarity evaluation of similar questions and answers to address the low efficiency of online healthcare consultation. Feng et al. [6] used CNNs to learn the representation of question and answer combinations and further used it to calculate different questions and candidates. Zhang et al. [7] proposed an end-to-end word embedding multiscale CNN model for question and answer matching in the medical field.
Our primary contribution is a new Double Channel LSTM model, called DC-LSTM. We add a hybrid attention mechanism to LSTM, which allows the deep neural network to learn long sequences selectively in each batch of training. The proposed model can learn different forms of features, enhance the learning and expressive ability of the model, and prevent overfitting. The experimental results show that the DC-LSTM model can significantly improve accuracy compared with other CNN or RNN models. In medical diagnosis classification in particular, this model can help people quickly choose the right outpatient department for medical treatment and improve the efficiency of outpatient service.

Related Work
CNN and RNN are two typical deep neural network models. CNN is typically applied in image processing [8] and speech recognition [9], while RNN is usually applied in machine translation and text sequence problems [10]. At the same time, some improved models have been derived. The LSTM, proposed by Hochreiter and Schmidhuber in 1997 [11], is a network model derived from the RNN and is suitable for processing and predicting important events with relatively long intervals and delays in time series.

CNN Model.
CNN is a common deep learning network architecture. Its design was inspired by the work of Hubel and Wiesel on the visual cognition mechanism in biology. With the growth of data volume and computing power, CNN has become a research hotspot in recent years. A CNN consists of convolution, activation, and pooling layers. Many researchers have proposed improved models for text classification, such as fastText [12], textCNN [13], and Bi-LSTM [14,15], which are very effective, but they still have problems such as limited generality and difficulty in extracting specific context features.

Convolutional Layer.
Convolution is the most basic operation in a CNN. Each convolutional layer is composed of several convolution units, and the parameters of each convolution unit are optimized by the backpropagation algorithm. The purpose of the convolution operation is to extract different features of the input. Deeper networks can iteratively extract more complex features from low-level features. The convolution kernel scans the input features at a regular stride, multiplies the input features element-wise by the kernel weights within the receptive field, sums all the results, and adds the bias. The formula used is

$$Z^{l+1}(i,j) = \sum_{k=1}^{K_l}\sum_{x=1}^{f}\sum_{y=1}^{f}\left[Z_k^{l}(s_0 i + x,\; s_0 j + y)\, w_k^{l+1}(x,y)\right] + b, \qquad L_{l+1} = \frac{L_l + 2p - f}{s_0} + 1,$$

where $b$ is the bias; $Z^{l}$ and $Z^{l+1}$ represent the input and output of the $(l+1)$th convolution layer; $L_{l+1}$ is the size of $Z^{l+1}$; $Z(i,j)$ is the corresponding element of the feature matrix; $K_l$ is the number of channels of the input and $w_k^{l+1}$ is the kernel weight for channel $k$; $f$ is the size of the convolution kernel; $s_0$ is the step size (stride) of the convolution; and $p$ is the padding number.
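As a minimal sketch of the scan-multiply-sum-add-bias operation described above, the following pure-Python code applies a single f × f kernel to one square input channel with stride s0 and zero padding p. The function names conv_output_size and conv2d and the toy input are illustrative assumptions, not part of the original work, and the output-size rule is the standard one, assumed here to match the definition of the layer size.

```python
def conv_output_size(in_size, f, s0, p):
    """Standard output-size rule: (L + 2p - f) // s0 + 1."""
    return (in_size + 2 * p - f) // s0 + 1


def conv2d(z, kernel, bias=0.0, s0=1, p=0):
    """Single-channel 2D convolution of a square input (hypothetical helper)."""
    n, f = len(z), len(kernel)
    # Zero-pad the input by p cells on every side.
    size = n + 2 * p
    padded = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            padded[i + p][j + p] = z[i][j]
    out_size = conv_output_size(n, f, s0, p)
    out = [[0.0] * out_size for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            acc = bias
            # Multiply the receptive field by the kernel element-wise and sum.
            for x in range(f):
                for y in range(f):
                    acc += padded[s0 * i + x][s0 * j + y] * kernel[x][y]
            out[i][j] = acc
    return out


z = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # toy 3 x 3 input
k = [[1, 0], [0, 1]]                   # toy 2 x 2 kernel
result = conv2d(z, k, bias=1.0)
```

With a 3 × 3 input, a 2 × 2 kernel, stride 1, and no padding, the output is 2 × 2, which agrees with the size rule.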

Pooling Layer.
After the feature extraction of the convolutional layer, the output feature matrix is passed to the pooling layer for further feature selection and information filtering. The pooling layer contains a preset pooling function, which replaces the value at a single point of the feature matrix with a statistic of its neighboring region. The definition of pooling is

$$A_k^{l}(i,j) = \left[\sum_{x=1}^{f}\sum_{y=1}^{f} A_k^{l}(s_0 i + x,\; s_0 j + y)^{p}\right]^{1/p},$$

where $A_k^{l}$ is the $k$th channel of the feature matrix at layer $l$, $f$ is the size of the pooling window, $s_0$ is the size of the pooling step, and $p$ is a prespecified parameter. When $p = 1$, this is called average pooling (the result is proportional to the mean of the window), and when $p \to \infty$, this is called maximum pooling.
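The role of the parameter p can be checked numerically. The sketch below assumes the general Lp pooling form (the p-th root of the sum of p-th powers over each window) and non-negative inputs; the function name lp_pool and the toy matrix are hypothetical. With p = 1 the result is the window sum, which is the average up to the constant factor f × f, and with a large p the result approaches the window maximum.

```python
def lp_pool(a, f, s0, p):
    """Lp pooling over f x f windows with stride s0 (non-negative inputs assumed)."""
    out_size = (len(a) - f) // s0 + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Collect the f x f window whose top-left corner is (s0*i, s0*j).
            window = [a[s0 * i + x][s0 * j + y] for x in range(f) for y in range(f)]
            row.append(sum(v ** p for v in window) ** (1.0 / p))
        out.append(row)
    return out


a = [[1, 2], [3, 4]]
sum_pooled = lp_pool(a, f=2, s0=2, p=1)      # window sum; divide by f*f for the mean
near_max = lp_pool(a, f=2, s0=2, p=50)       # close to the window maximum
```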

LSTM Model.
LSTM is a special RNN structure that can learn long-term dependencies. An RNN can propagate historical information through a chained neural network architecture. When processing sequential data, it looks at the current input $x_t$ and the previous hidden state output $h_{t-1}$ at each time step. However, as the gap between two time steps becomes larger, it becomes more difficult for the traditional RNN to learn long-term dependencies. The LSTM was proposed to address this long-term dependency problem and has achieved significantly good performance in the statistical machine translation task of Chen et al. [15], making LSTM a successful model. The structure of the LSTM model is shown in Figure 1.
The LSTM architecture provides a series of repeating modules for each time step, as in a standard RNN. These modules are called cells. At each time step, the output of the module is controlled by a set of gates in $\mathbb{R}^d$ as a function of the old hidden state $h_{t-1}$ and the input at the current time step $x_t$: the forget gate $f_t$, the input gate $i_t$, and the output gate $o_t$. These gates together determine how to update the current memory unit $c_t$ and the current hidden state $h_t$. We use $d$ to represent the memory dimension in LSTM, and all vectors in this architecture share the same dimension. The LSTM transition functions are defined as follows:

$$
\begin{aligned}
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),\\
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),\\
q_t &= \tanh(W_q \cdot [h_{t-1}, x_t] + b_q),\\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),\\
c_t &= f_t \odot c_{t-1} + i_t \odot q_t,\\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

where $\sigma$ is the logistic sigmoid function, whose output lies in [0, 1]; tanh is the hyperbolic tangent function, whose output lies in [−1, 1]; $W$ and $b$ denote the weight matrices and biases of each gate; and $\odot$ denotes element-wise multiplication. LSTM is specifically designed to learn long-term dependencies in time-series data, so we place an LSTM on top of the convolutional layer to learn such dependencies in the higher-level feature sequences.
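As an illustration of the gate computations, the following pure-Python sketch performs a single LSTM time step using the standard formulation, which is assumed here to match the transition functions of this section. The function names, the parameter layout (one weight matrix and bias per gate acting on the concatenation of the previous hidden state and the current input), and the candidate vector q_t are illustrative choices, not details taken from the original work.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, v):
    """Return W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to (W, b) acting on [h_prev, x_t]."""
    hx = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    i_t = [sigmoid(v) for v in affine(*params["i"], hx)]    # input gate
    f_t = [sigmoid(v) for v in affine(*params["f"], hx)]    # forget gate
    o_t = [sigmoid(v) for v in affine(*params["o"], hx)]    # output gate
    q_t = [math.tanh(v) for v in affine(*params["q"], hx)]  # candidate values
    # New cell state: keep part of the old state and add part of the candidate.
    c_t = [f * c + i * q for f, c, i, q in zip(f_t, c_prev, i_t, q_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy check with memory dimension d = 2, input dimension 1, and all-zero weights,
# so that every gate equals 0.5 and the candidate equals 0.
zero = ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0])
params = {"i": zero, "f": zero, "o": zero, "q": zero}
h_t, c_t = lstm_step([1.0], [0.0, 0.0], [1.0, -1.0], params)
```

In this toy setting the forget gate halves the old cell state, which makes the gating effect easy to verify by hand.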

Word Embedding.
GloVe stands for Global Vectors for word representation [16]. It is a count-based word representation tool that uses overall corpus statistics. It produces a vector of real numbers for each word that captures some semantic properties between words, such as similarity and analogy. We can calculate the semantic similarity between two words by computing a measure on their vectors, such as the Euclidean distance or the cosine similarity.
Word2Vec is a three-layer neural network, which consists of the input layer, the hidden layer, and the softmax layer. The training process trains the central words and context words by constructing a pseudo-supervised task called a Fake Task. The weights of the hidden layer are used as the trained word vectors. According to what is used as input and output, it has two methods: the Skip-Gram method and the CBOW method. The former uses the central word as input and the context words of the central word as the labels to be predicted. The latter is just the opposite: it uses the context words as input and the central word as the output to be predicted. After the training is completed, the output information is discarded, and the weights of the hidden layer are used as the trained word vectors. The Word2Vec model solves the computational bottleneck in the NNLM model. It can easily process tens of millions of text samples and can use variable-length sequences as input. With this advantage, the neural network model can model more complex contexts, and word vectors can contain richer semantic information.
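The two similarity measures mentioned above can be written in a few lines. This is a minimal sketch; the function names and the example vectors are made up for illustration and are not real GloVe embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical 2-dimensional "word vectors".
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```

Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to both.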
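The difference between the two training methods lies in how (input, label) pairs are built from a sentence. The sketch below generates such pairs for a context window of one word on each side; the function names and the three-word example sentence are hypothetical, and the network training itself is omitted.

```python
def context_indices(i, length, window):
    """Indices within `window` positions of i, excluding i itself."""
    return [j for j in range(max(0, i - window), min(length, i + window + 1)) if j != i]


def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: the central word is the input, each context word a label."""
    return [(center, tokens[j])
            for i, center in enumerate(tokens)
            for j in context_indices(i, len(tokens), window)]


def cbow_pairs(tokens, window=1):
    """(context_words, center) pairs: the context words are the input, the central word the label."""
    return [([tokens[j] for j in context_indices(i, len(tokens), window)], center)
            for i, center in enumerate(tokens)]


tokens = ["low", "fever", "cough"]
sg = skipgram_pairs(tokens)
cb = cbow_pairs(tokens)
```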

Methodology
Traditional CNN and RNN networks often use only word-level embedding, so the semantic features are limited. These traditional models have very limited capability, especially for words whose semantics must be determined from context. Therefore, it is necessary to expand the channels and use multilevel embedding to improve the diversity of input features. At the same time, the relative importance of each word in the text differs for the modality expressed. Some words contribute more to the modality and some contribute less, and the emotions expressed by the words also differ in priority. Therefore, to solve the problem of not being able to selectively learn the emotional characteristics of each word, we add hybrid attention after the LSTM to make trade-offs among words with different emotions, improve the learning ability of the LSTM model, and improve the neural network's learning of distinctive features. This makes it possible to improve generalization and prevent overfitting at the same time. The model is divided into three parts, CNN-LSTM, Double Channel, and hybrid attention, of which Double Channel is the most important structure.

CNN-LSTM.
In our work, we use both CNN and LSTM. The improved CNN structure in our model is similar to the ConvNets proposed by Zhang et al. [17] in 2015.
The ConvNets model is nine layers deep, with six convolutional layers and three fully connected layers. Its structure is shown in Figure 2.
On the other hand, RNNs have been widely exploited to deal with variable-length sequence input. However, as the input sequence becomes longer, RNNs may suffer from vanishing or exploding gradients, which makes it more difficult to learn information from a longer time context. LSTM is one of the popular variations of RNN and was proposed to solve this problem by introducing a gate structure in each LSTM unit. The forget gate decides what information is to be discarded from the cell state, the input gate determines how much new input information is added to the cell state, and the output gate determines what value to output based on the current state of the cell. We introduce the Double Channel mechanism in the CNN-LSTM model and input multiple levels of embeddings at the same time to acquire multiple levels of features, in order to solve the problem that word-level and character-level features cannot be extracted at the same time.
In this model, according to the embedding granularity used, the structure is divided into a Char-Channel and a Word-Channel. The model structure is the same in each channel and is divided into two parts: a CNN and an LSTM neural network. In the CNN part, the convolution result $c$ is calculated from the input sequence $X$ and the convolution kernel $K$. To simplify the formulation, the LSTM calculation process described above is written as $\mathrm{LSTM}(x)$. The convolutional neural network and the long short-term memory network can be combined in two ways: in series or in parallel. At present, the series structure is the most commonly used, but it suffers from information loss, because information is compressed and lost in the convolution process. With the series structure, the LSTM receives only the compressed information and loses most of the time-series characteristics, so it cannot give full play to its advantages. To solve this problem, the parallel structure is chosen over the series structure, and the results of the two branches are concatenated. The structure in each channel is recorded as equation (5), from which we obtain the basic description of the Char-Channel and the Word-Channel. Their input is the corpus $x$, and their outputs are $C_{out}$ and $W_{out}$, respectively, with $W_{out} = \mathrm{channel}_{emb=V_g}(x)$.
$V_e$ and $V_g$ in equations (6) and (7) are the trained word-level and char-level embedding vectors. We concatenate the outputs of the two channels to form the hidden layer output $h$. Then, the hidden layer output is sent to the fully connected layer, and the softmax layer is used for the classification output, $y = \mathrm{softmax}(\mathrm{dense}(h))$.
The Double Channel structure is shown in Figure 3.
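The data flow of the Double Channel structure can be sketched as follows: each channel feeds the same embedded input to two parallel branches and concatenates their outputs, the two channel outputs are concatenated into the hidden layer output, and a dense layer followed by softmax produces class probabilities. This is only a sketch of the wiring under stated assumptions. The branch functions max_over_time and last_step are trivial stand-ins for the CNN and LSTM parts, both channels reuse the same stand-ins (a real model would presumably train separate parameters per channel), and all names, shapes, and weights are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(W, b, h):
    """Fully connected layer: W h + b."""
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]


def max_over_time(seq):
    """Stand-in for the CNN branch (assumption): per-dimension max over time steps."""
    return [max(col) for col in zip(*seq)]


def last_step(seq):
    """Stand-in for the LSTM branch (assumption): the vector at the last time step."""
    return list(seq[-1])


def channel(embedded, cnn_branch, lstm_branch):
    """Parallel structure: both branches see the same input; outputs are concatenated."""
    return cnn_branch(embedded) + lstm_branch(embedded)


def double_channel(char_emb, word_emb, W, b):
    c_out = channel(char_emb, max_over_time, last_step)  # Char-Channel output
    w_out = channel(word_emb, max_over_time, last_step)  # Word-Channel output
    h = c_out + w_out                                    # hidden layer output
    return softmax(dense(W, b, h))


char_emb = [[0.1, 0.2], [0.3, 0.0]]  # two characters, embedding size 2
word_emb = [[0.5, 0.5]]              # one word, embedding size 2
W = [[1.0] * 8, [0.0] * 8]           # two classes, eight hidden features
probs = double_channel(char_emb, word_emb, W, [0.0, 0.0])
```

Keeping the branches parallel means the LSTM stand-in receives the uncompressed embedded sequence rather than the output of the convolution, which is the motivation given above for preferring the parallel structure.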

Hybrid Attention.
The weight score $\omega$ is an important component of the dynamic adaptive weight, and its calculation method is given in equations (10) and (11), where $h_t'$ is the LSTM output at time $t$, $c_t$ is the cell state in the LSTM at time $t$, $h$ is the hidden layer output, $w_a$ is a randomly initialized weight matrix, $v_a$ is a randomly initialized vector, and $b$ is a randomly initialized bias. Next, the score $\omega$ is calculated over the sequence, where $x$ is the length of the sequence. Finally, the output vector $c_i$ is obtained by weighting with the dynamic adaptive weight.
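Because the attention equations are not reproduced in this text, the following sketch shows only one plausible reading of the description, and every modelling choice in it is an assumption: the LSTM output and the cell state at each time step are combined by element-wise addition, the score of each time step is computed as v_a · tanh(w_a h + b), the weights are obtained by a softmax over the time steps, and the output is the weighted sum. The function name hybrid_attention and the toy inputs are hypothetical.

```python
import math


def hybrid_attention(outputs, states, w_a, v_a, b):
    """Attention over time steps using outputs and cell states (assumed formulation)."""
    # Assumption: combine the output and the cell state by element-wise addition.
    combined = [[o + c for o, c in zip(h_t, c_t)] for h_t, c_t in zip(outputs, states)]
    scores = []
    for h in combined:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h)) + bi)
                  for row, bi in zip(w_a, b)]
        scores.append(sum(v * u for v, u in zip(v_a, hidden)))
    # Softmax over time steps gives a probability distribution of weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the combined vectors.
    dim = len(combined[0])
    context = [sum(w * h[k] for w, h in zip(weights, combined)) for k in range(dim)]
    return weights, context


outputs = [[1.0, 0.0], [0.0, 1.0]]   # two time steps, dimension 2
states = [[0.0, 0.0], [0.0, 0.0]]
w_a = [[1.0, 0.0], [0.0, 1.0]]       # identity matrix for an easy hand check
weights, context = hybrid_attention(outputs, states, w_a, [1.0, 0.0], [0.0, 0.0])
```

In the toy example the first time step receives the larger weight because only its first component contributes to the score.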

DC-LSTM Overview.
In order to better obtain semantic representations and extract text features, we designed DC-LSTM to first embed the input text sequence, obtain the vector representation of the sequence, and then convolve the sequence using the convolution layer. The model can extract word-level semantic features, and pooling reduces the size of the input data and of the output as well as the risk of overfitting. The data processed by the convolution layer is sent to the LSTM layer, which can analyze the temporal characteristics of the data. This algorithm can extract contextual semantic information, ignore secondary information, and ensure the accuracy of classification tasks. The DC-LSTM model is shown in Figure 4.

Experiments
In order to verify the reliability of the model, a complete comparison experiment was designed. The two datasets, cMedQA and Sentiment140, were used to compare the DC-LSTM model proposed in this paper with the basic CNN-LSTM model.

Datasets.
To better validate the model's effects, the cMedQA medical diagnosis dataset and the Sentiment140 Twitter dataset are used for verification experiments. The cMedQA dataset is a Chinese text dataset with 792,099 medical consultations, which cover the Andriatria, Internal Medicine, OAGD, Oncology, Pediatric, and Surgical departments. The distribution of cMedQA is shown in Figure 5. The question-answering pairs have been preprocessed and classified into different categories. Each QA pair is encoded as a sequence.
Meanwhile, to verify the model's generality, we also select another dataset unrelated to medicine: Sentiment140. The Sentiment140 dataset is a Twitter sentiment analysis dataset created and organized by three computer science students from Stanford University, Alec Go, Richa Bhayani, and Lei Huang, with 1.6 million training samples and 498 test samples. These data are divided into negative, neutral, and positive categories according to emotional polarity. The detailed description of these two datasets is shown in Table 1.
Some samples in cMedQA dataset are as shown in Table 2.
As we can see from Table 2, the dataset includes three main features, namely, department, title, and ask. Title and ask indicate the consultant's symptoms, while department is the answer which indicates the department of treatment. Title points out the core demands of consultants, while ask further describes the content of demands, which puts forward higher requirements for the ability of the model to explain the context.

Evaluation.
The accuracy, precision, recall, and $F_1$-score are used as evaluation indicators to measure the performance of the model.
Let TP denote True Positive, FP False Positive, TN True Negative, and FN False Negative. Then

$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{precision} = \frac{TP}{TP + FP}, \quad \mathrm{recall} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$

Accuracy refers to the proportion of correctly predicted samples among all samples, precision refers to the proportion of samples predicted as positive that are actually positive, recall refers to the proportion of all positive samples that are correctly predicted, and the $F_1$-score is the harmonic mean of precision and recall.
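The four indicators follow directly from the confusion-matrix counts. The sketch below uses the standard definitions for a single positive class; the function name and the counts in the example are made up for illustration and are not results from the experiments. It does not guard against zero denominators, and for the multi-class setting the same computation would be applied per class.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1


# Hypothetical counts for one class.
accuracy, precision, recall, f1 = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```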

Hyperparameter.
It is well known that the choice of hyperparameters directly affects the training of the model, so it is important to choose a set of suitable hyperparameters. The settings of the hyperparameters are shown in Table 3.

Experiment Results.
This section presents comparison experiments between our proposed improved model, DC-LSTM, and the CNN, LSTM, CNN-LSTM, and GRU models. The environment used in this paper is Keras with TensorFlow [18] as the backend for development and verification, CUDA [19] as the GPU acceleration environment, and cuDNN [20] as the numerical computing library for the deep neural network. The experimental results are shown in Table 4.
It can be seen from Table 4 that the DC-LSTM model proposed in this paper is superior to other models such as CNN, LSTM, CNN-LSTM, and GRU in terms of accuracy, precision, recall, and $F_1$-score on both the cMedQA dataset and the Sentiment140 dataset. The model has good generalization ability: it not only performs well in the medical field but also performs well on other datasets. Figure 6 further shows that the AUC value of the DC-LSTM model on the data of the various outpatient departments of cMedQA is higher than 0.9, which is also significantly higher than that of the other models. In general, the improved DC-LSTM model improves on many evaluation indexes. This is due to the introduction of a multichannel mechanism, which can make full use of the attention mechanism's ability to calculate text weights and of the powerful temporal feature learning ability of LSTM. Within the channels, learning the semantic information carried by both word vectors and character vectors allows the model to learn more features and more fine-grained features.

Conclusions
In this paper, we find that the basic CNN-LSTM model cannot perform differential learning when dealing with complex long-sequence data. After analyzing the possible causes, we propose an improved method, called DC-LSTM, which multiplies each time step of the sequence by a weight (w). The model calculates the weight score, calculates the probability distribution of the weight score, and applies the hybrid attention according to the probability distribution. Experimental results have shown that DC-LSTM can effectively distinguish the emotional level of different words in sentences and assign different learning weights to different words, so that it can learn the sentiment features of each word in a differentiated way.
Table 2 (recovered sample text): 烧心, 打隔, 咳嗽低烧, 以有4年多 (Heartburn, interval, cough, and low fever, more than 4 years).

Data Availability
The cMedQA data used to support the findings of this study have been deposited in the GitHub repository (https://github.com/liangsbin/Chinese-medical-dialogue-data).
The Sentiment140 data used to support the findings of this study have been deposited on the Kaggle website (https://www.kaggle.com/kazanova/sentiment140).

Conflicts of Interest
The authors declare that they have no conflicts of interest.