ICU false alarm identification based on convolutional neural networks

Background: In the intensive care unit (ICU), excessive false alarms greatly burden medical staff and waste medical resources. To alleviate false alarms in the ICU, we constructed classification models using convolutional neural networks, which can deal directly with time series and avoid extracting features manually. Results: Combined with a grouping strategy, we tried two basic network structures, i.e. DGCN and EDGCN. After that, based on EDGCN, which proved better, ensemble networks were also constructed to elevate performance further. Considering the limited sample size, we also experimented with different data expansions. Finally, we tested our model in the online sandbox and obtained a score of 78.14. Conclusions: Although this performance is slightly lower than the best scores that have been reported, our models are end-to-end: the original time series are automatically mapped into a binary output, without manual feature extraction. In addition, our method innovatively uses grouped convolution to make full use of the information in multi-channel signals. Finally, we discuss potential solutions for further improving performance.


Introduction
In the intensive care unit (ICU), the alarms issued by monitoring equipment can promptly notify the medical staff of patients' abnormal conditions and protect patients from danger. However, these automatic alarms are designed for high sensitivity rather than specificity, leading to excessive false alarms. According to previous research, nearly 88.8% of alarms in the ICU are false [1], and false arrhythmia alarms are commonly due to single-channel ECG artifacts and low-voltage signals [2]. Such frequent false alarms inevitably burden medical staff greatly, and the noise generated by the alarms also stimulates patients, resulting in patient delirium [3]. Therefore, it is quite necessary to examine ICU alarms carefully and reduce false alarms as much as possible.
In 2015, PhysioNet launched a competition on reducing false arrhythmia alarms in the ICU, and since then more researchers have devoted themselves to this area. Researchers have tried many approaches, varying from dedicated decision rules to non-dedicated machine learning.
The rule-based decision-making systems focus on how to use multi-modal signals to estimate heart rate more accurately. In the decision-making process, many heart rate estimation algorithms and decision rules are fine-tuned according to the characteristics of the signal. These improvements include invalid data or noise detection [4,5,6], signal filtering [7], signal quality indices (SQIs) or assessment (SQA) [7,8,9,10], pulse detection [7,11], QRS detection [4,6,9,10,12,13], spectral features [8], heartbeat estimation [4,6,9], and multi-rule fusion [4,6,10,12], etc. For example, in the challenge, Plesinger and colleagues [6] established a model based on a series of decision rules, obtained a final score of 81.39 on the hidden testing set of the competition, and ranked 1st. Krasteva and colleagues [13] first used ADLib (Schiller AG) and the Pulse Wave Analysis Module (PWAM) to analyze the ECG, arterial blood pressure (ABP) and photoplethysmogram (PPG) signals (if available) respectively. Then, based on the extracted characteristics, they constructed a rule-based decision model to make the judgement. They scored 79.49 on the final hidden test set.
Machine learning has gained much attention in recent years. However, due to the nature of routinely monitored physiological signals, including heterogeneity, low signal-to-noise ratio, high deficiency, and high temporal resolution over long periods of time [14], applying machine-learning approaches to them is quite challenging, and much human intervention is required. Machine learning algorithms usually focus on extracting features that can distinguish different alarms, and use algorithms such as majority voting [15,16], SVM [17,18], decision trees [19], and random forests [20,21,22] for classification. Most machine-learning algorithms use the same input parameters as the rule-based strategies, such as SQIs [19,20,22], heart rate [15,19,22], QRS detection [17], etc. In addition to basic statistical and spectral features, certain derived features designed according to the alarm type are also used [20,22]. Due to the large number of features, machine learning solutions often require feature dimensionality reduction, such as PCA [15,23] and feature selection [22]. For example, considering that false alarms were usually caused by low QRS complex amplitude and high noise, Au-Yeung and colleagues [22] first processed the signal to ensure that the heartbeats could be correctly labeled and then extracted features from the signal. In particular, they extracted a series of SQIs to distinguish noise or artifacts from real physiological signals. After that, they trained a random forest model using all extracted features. Their model scored 83.08 on the hidden set. Alternatively, Hooman and colleagues [24] proposed a neural network training method based on a genetic evolution algorithm. They used the features extracted by Liu [4] for network training. According to cross-validation on the open dataset, their methods obtained the highest score of 86.81. Gajowniczek and colleagues [21] proposed a weight-based random forest method.
They reported that their model performed better than standard random forest, adaboost and other ensemble models on the open dataset.
Deep learning has also been applied to this field. Schwab and colleagues [14] proposed Distantly Supervised Multitask Networks to reduce false alarms. Their innovative neural network includes multiple perceptual blocks and multiple auxiliary task models, combined with a final classification unit, and they trained the auxiliary tasks and the final classification unit alternately. They reported better results than those of other methods on their own datasets. Nevertheless, in their method, extracting and selecting features manually was still necessary. Lehman and colleagues [25] first performed an FFT on the ECG, and then used an auto-encoder to reconstruct the signal. The low-dimensional representation derived from the encoder was used as the input feature of a neural network for classification. Focusing on the ventricular tachycardia alarm (VTA), their method showed a performance improvement of nearly 1% over Plesinger's [6] on their own validation set.
In general, the above methods rely greatly on manually extracted features. Although Lehman and colleagues [25] got rid of part of the feature extraction work through the auto-encoder, they still performed a lot of signal preprocessing, such as FFT, peak detection, etc. To break through the limitation of manual feature extraction, we attempted to accomplish both feature extraction and final classification automatically. As forerunners, Schirrmeister and colleagues [26] successfully designed a deep convolutional neural network (CNN) that takes raw EEG signals as input and outputs classification labels. Their method performs the same as, or even better than, traditional methods using pre-extracted spectral features. Moreover, the features used are not fixed a priori.
Building on their basic CNN ideas, we designed a deep grouped convolutional neural network (DGCN) for the ICU false alarm task. The main difference from Schirrmeister's network lies in that we took a grouping strategy instead of a simple two-dimensional convolution, to ensure that information from multi-channel heterogeneous signals, possibly of different intrinsic time scales, can be extracted and reserved separately for the final classification. That is, the multi-channel raw signals are directly inputted into the network and convolved separately by different kernels according to their signal type. The features automatically extracted from different channels are all reserved until they are fused at the end of the network to complete the classification of true or false alarms. Grouping first, combined with fusing last, achieves information complementation across different channels. Based on the DGCN, an embedded deep grouped convolutional network (EDGCN) has also been developed. We compared the above two networks and constructed ensemble models using the EDGCN architecture. The score of our best-performing model is 78.17 on the hidden set of the 2015 PhysioNet competition.

Data Descriptions
We used the database of the 2015 PhysioNet/CinC challenge, i.e. reducing false arrhythmia alarms in the ICU. It includes a total of 1,250 arrhythmia alarm records, each classified as one of five arrhythmia types: asystole (ASY), extreme bradycardia (EBR), extreme tachycardia (ETC), ventricular tachycardia (VTA), and ventricular fibrillation or flutter (VFB) [3]. Each alarm was labelled as 'True' or 'False', as confirmed by clinical experts. However, only 750 records are open access. The remaining 500 were embedded in a sandbox environment for testing services provided by the challenge organizers [3]. Therefore, we used the available 750 records for modeling as well as cross-validation, and tested our model in the sandbox environment.
The basic statistical information about the open-access 750 records is shown in Table 1.
Each record contains three or four synchronized physiological time series, including two leads of ECG and one or two supplementary signals such as arterial blood pressure (ABP), photoplethysmogram (PPG), or respiration (RESP) [3]. The number of records in which each series type occurs is shown in Table 2. As noted, ECG lead II is the series type that appears most frequently, followed by ECG lead V, PPG, ABP, and RESP in order. Considering that the remaining series types appear too rarely to give sufficient samples, we used only the top 5 most frequent series types as our model inputs, except for the ensemble models. According to the data description from the challenge organizer, all these time series have already been resampled to 12 bit, 250 Hz, and band- as well as notch-filtered; therefore, we did not apply extra filtering. As to the series length, 375 records are 5 minutes long, and 375 are 5.5 minutes long. As stated by the organizer, all alarms occur at 5:00, and the suspected arrhythmia event triggering the alarm is believed to occur between 4:50 and 5:00 [27]. Considering that there might have been additional unannotated arrhythmia events in the 5 minutes preceding the alarm, we used only the 15 s duration from 4:45 to 5:00, in order to eliminate the uncertainty of the unannotated series duration while retaining as much information as possible.
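For concreteness, the window extraction described above can be sketched as follows (a minimal sketch; the function name and array layout are our own illustration, not from the original implementation):

```python
import numpy as np

FS = 250          # sampling frequency in Hz, per the challenge description
ALARM_T = 5 * 60  # alarm time in seconds (5:00)
WIN_S = 15        # analysis window: the 15 s from 4:45 to 5:00

def extract_window(record, fs=FS, alarm_t=ALARM_T, win_s=WIN_S):
    """Cut the 15 s segment ending at the alarm from a (channels, samples) array."""
    end = alarm_t * fs
    start = end - win_s * fs
    return record[:, start:end]

# Example: a 5.5-minute record with 3 channels.
rec = np.zeros((3, int(5.5 * 60 * FS)))
seg = extract_window(rec)
print(seg.shape)  # (3, 3750)
```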

Two basic Network Structures
We tried two networks. One is a deep CNN combined with a grouping strategy (DGCN). The other employs an extra embedding layer to introduce arrhythmia type information, and we name it the embedded deep grouped convolutional network (EDGCN).

DGCN Net
The DGCN model adopts a five-layer structure as shown in Fig. 1. Layers 1-4 are of basically the same structure, which sequentially comprises a convolution stage, a batch normalization [28] stage, a Leaky ReLU activation [29,30] stage, and a max pooling [31] stage.
Considering the different physiological mechanisms and temporal scales underlying different signals, i.e. ECG, blood pressure, and respiration, we use a grouped convolution strategy in order to extract the specific characteristics of each signal. That is, as shown in Fig. 1, the different input series are treated as different groups, marked by different colors, i.e. red, green, or blue. Each group, represented by retaining its original color, is not mixed until the last stage. To be specific, in each layer from 1 to 4, each group is independently convolved by 15 (layers 1 and 2) or 30 (layers 3 and 4) different kernels of stride 1, length 10, width K_i, and no padding, in which K_i represents the number of channels in each group of the i-th layer's input. Thus for N input series or groups, there are N × 15 (or N × 30) different two-dimensional kernels, generating N × 15 (or N × 30) different convolved series. Then, batch normalization [28] is conducted to alleviate gradient vanishing and speed up model training. The following stage is Leaky ReLU activation. Finally, we use max pooling of width 3 and stride 3, i.e. a downsampling ratio of 3. Each neuron's receptive field in each layer, in terms of the length of the original series, is illustrated in Fig. 2, where we also depict the output length of each layer. Layer 5 is the decision layer, which comprises a convolution, an exponential linear unit (ELU) activation function [32], a fully-connected (FC) layer (via convolution), and a softmax, in order. In order to integrate the information extracted from the different groups, we first use N convolution kernels of stride 1, length 3, and width K_5 × N to combine the different group results of the preceding fourth layer, and then use the ELU activation function to perform a nonlinear transformation.
After that, we use 2 convolution kernels whose size is the same as the output of the previous step to perform convolution, acting as an FC layer, and get a result vector of dimension two. Dropout can effectively alleviate overfitting; therefore, we adopted it in the FC layer, with the dropout ratio set to 0.3. Finally, we transform the result into a decision probability through a softmax.
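The grouped architecture above can be sketched in PyTorch as follows. This is a minimal sketch under our own assumptions: the class and variable names are ours, N = 5 groups and 12 s crops (3,000 samples) are assumed, and the paper's FC-via-convolution head is approximated here by a linear layer with dropout 0.3:

```python
import torch
import torch.nn as nn

N = 5  # number of input series/groups: II, V, ABP, RESP, PPG

class DGCN(nn.Module):
    """Sketch of the grouped CNN: each signal group is convolved
    separately (groups=N) through layers 1-4 and only mixed in layer 5."""
    def __init__(self, n_groups=N):
        super().__init__()
        def block(c_in, c_out):
            # conv (length 10, stride 1, no padding) -> BN -> LeakyReLU -> maxpool 3/3
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=10, stride=1, groups=n_groups),
                nn.BatchNorm1d(c_out),
                nn.LeakyReLU(),
                nn.MaxPool1d(kernel_size=3, stride=3),
            )
        self.features = nn.Sequential(
            block(n_groups, n_groups * 15),       # layer 1: 15 kernels per group
            block(n_groups * 15, n_groups * 15),  # layer 2
            block(n_groups * 15, n_groups * 30),  # layer 3: 30 kernels per group
            block(n_groups * 30, n_groups * 30),  # layer 4
        )
        # Layer 5: N ungrouped kernels of length 3 fuse all groups, then ELU.
        self.fuse = nn.Sequential(
            nn.Conv1d(n_groups * 30, n_groups, kernel_size=3),
            nn.ELU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.3),
            nn.LazyLinear(2),  # stands in for the FC-via-convolution stage
        )

    def forward(self, x):
        x = self.features(x)
        x = self.fuse(x)
        return torch.softmax(self.classifier(x), dim=1)

x = torch.randn(4, N, 3000)  # batch of 4 crops, 12 s at 250 Hz
out = DGCN()(x)
print(out.shape)             # torch.Size([4, 2])
```

With a 3,000-sample input, the per-layer output lengths are 997, 329, 106, and 32, and 30 after the fusing convolution.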

EDGCN
EDGCN is mostly the same as DGCN, except that in layer 5 we embed an external vector, namely the one-hot coding of the arrhythmia symptom for each alarm, into the model. To be specific, the N × 30 output series of length L_9 from layer 4 are first group-convolved with N corresponding kernels of stride 1, length 3, and width 30 × N, after which we get N series of length L_10 = L_9 − 2. Following the ELU stage, we reshape the output matrix of N rows and L_10 columns into a vector. Then we concatenate the one-hot coding of the arrhythmia type, of length 5, with the reshaped vector to obtain a vector of length N × L_10 + 5. After that, there is another convolution stage with 32 kernels of length N × L_10 + 5, working as an FC layer. Finally, we use another FC layer to compress the 32-dimensional output vector from the previous stage into a 2-dimensional vector, and use a softmax classification to get the final result. Dropout is also adopted in the FC layer, with the dropout ratio set to 0.3. The block chart of layer 5, as well as the data shape in each stage, is illustrated in Fig. 3.
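The embedding step can be illustrated with a small shape sketch (the batch size and the value L_10 = 30 here are hypothetical, chosen only to show the N × L_10 + 5 arithmetic):

```python
import numpy as np

N, L10 = 5, 30
feats = np.random.randn(4, N, L10)       # batch of 4, N groups, L_10 samples each
types = np.array([0, 1, 3, 4])           # e.g. ASY, EBR, VTA, VFB for the 4 samples
onehot = np.eye(5)[types]                # 5-dim one-hot arrhythmia coding
vec = np.concatenate([feats.reshape(4, -1), onehot], axis=1)
print(vec.shape)  # (4, 155), i.e. N * L_10 + 5 = 5*30 + 5
```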

Data Preprocessing
In the data preprocessing stage, we simply did the necessary NaN filling, using forward filling. That is, for each missing value, we first examine the preceding point. If it is a valid value, we fill the missing value with it; whereas if it is not, which happens irregularly, we simply retain the NaN.
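A minimal sketch of this one-step forward filling (the function name is ours). Note that a NaN is filled only when its immediate predecessor is valid, so runs of consecutive NaNs are retained, as described above:

```python
import numpy as np

def forward_fill_once(x):
    """Replace a NaN with its immediate predecessor only when that
    predecessor is itself valid; other NaNs are kept."""
    out = x.copy()
    nan_here = np.isnan(out[1:])
    prev_ok = ~np.isnan(out[:-1])
    fill = nan_here & prev_ok
    out[1:][fill] = out[:-1][fill]
    return out

res = forward_fill_once(np.array([1.0, np.nan, np.nan, 4.0]))
print(res)  # [ 1.  1. nan  4.]
```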

Data expansion
Noting that the sample size of the training set (750) might be insufficient, we performed a data expansion using a sliding window of length 12 s and step 0.5 s. Thus, for each 15 s original series, we obtained 7 crops (or slices) and expanded the training set to 7 times its original size. When applying the trained model, we also used the same cropping strategy, and merged the results of all 7 crops by voting. Then, considering that the expanded data has considerable redundancy, we randomly added artificial noise, generated via formula (1), to each slice.
In formula (1), x_noise = x + A·ε, where x_noise represents the slice after adding noise, ε represents random noise obeying the standard normal distribution, and A is the amplitude coefficient of the added noise, set to 10% of the original signal's peak-to-peak amplitude. Waveforms of a representative series before and after noise addition are shown in Fig. 5.
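The two expansion steps can be sketched as follows. At 250 Hz, a 15 s segment holds 3,750 samples and each 12 s crop 3,000, so a 0.5 s (125-sample) step yields exactly 7 crops; the function names and the fixed random seed are our own illustration:

```python
import numpy as np

FS = 250
rng = np.random.default_rng(0)

def slide_crops(seg, win_s=12, step_s=0.5, fs=FS):
    """Expand one (channels, 3750) segment into overlapping 12 s crops."""
    win, step = int(win_s * fs), int(step_s * fs)
    n = (seg.shape[1] - win) // step + 1
    return np.stack([seg[:, i * step:i * step + win] for i in range(n)])

def add_noise(x, ratio=0.10, rng=rng):
    """x_noise = x + A * eps, eps ~ N(0, 1), with A = 10% of the
    signal's peak-to-peak amplitude, per formula (1)."""
    A = ratio * (np.nanmax(x) - np.nanmin(x))
    return x + A * rng.standard_normal(x.shape)

seg = np.zeros((5, 15 * FS))
crops = slide_crops(seg)
print(crops.shape)  # (7, 5, 3000)

x = np.sin(np.linspace(0, 4 * np.pi, 3000))
y = add_noise(x)
```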

Performance evaluation
Besides the frequently used measures of accuracy (ACC), area under the receiver operating characteristic curve (AUC-ROC), true positive rate (TPR), and true negative rate (TNR), we also adopt the compound Score measure designed by Clifford and colleagues [3] as an evaluation index for the model. As shown by its definition in formula (3), Score = 100 × (TP + TN)/(TP + TN + FP + 5 × FN), the Score index punishes false negatives (FN), considering that it is quite dangerous to report a true alarm as false.
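Assuming the standard definition from the 2015 challenge [3], with false negatives penalized five-fold, the Score can be computed as:

```python
def challenge_score(tp, tn, fp, fn):
    """PhysioNet/CinC 2015 Score: correct decisions over all decisions,
    with each false negative counted five times in the denominator."""
    return 100 * (tp + tn) / (tp + tn + fp + 5 * fn)

# A perfect classifier scores 100; each FN costs five times as much as an FP.
print(challenge_score(tp=80, tn=40, fp=10, fn=2))  # 85.71...
```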

Experiments
We experimented with the two network structures introduced in Section 2.2. In both networks, we selected the ECG leads II and V, ABP, RESP, and PPG as the model inputs. We adopted the logistic loss and used the Adam optimizer [33]. The hyperparameters, which were exactly the same in DGCN and EDGCN, are listed in Table 3. We also compared 4 different data treatments: the complete 15 s signals without expansion, pure 12 s slicing, and 12 s slicing with either noise addition or 0-1 normalization.
In machine learning, model ensembling usually achieves better performance. Therefore, we also tried the combination models illustrated in Fig. 6. That is, we trained four sub-models (in the DGCN or EDGCN structure) independently, using different input signals, and then combined them by averaging their outputs and comparing the mean with a threshold to determine a binary output. The overall execution flow of our experiments is shown in Fig. 7. All experiments were run with 5-fold cross-validation. Fig. 8 gives the profiles of the 5-fold cross-validation performances, i.e. ACC, AUC-ROC, and Score, of the DGCN, EDGCN, and combined models on the open dataset. Different colors represent different data expansions. From Fig. 8, it can be seen that in almost all cases, models with data expansion stabilize much more quickly in the first 50 epochs. This means that 12 s slicing does help accelerate training by increasing the amount of training data. Meanwhile, models using 12 s slicing without 0-1 normalization perform better than both the complete 15 s signals without expansion and 12 s slicing with 0-1 normalization. However, after 12 s slicing, noise addition itself does not bring an obvious improvement. The possible reason is that interference in real environments is usually not Gaussian stochastic noise.
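The ensembling step reduces to averaging the sub-models' true-alarm probabilities and thresholding the mean. A sketch with hypothetical probabilities (the 0.5 threshold is our assumption, not stated in the original):

```python
import numpy as np

def ensemble_decision(probs, threshold=0.5):
    """Average the sub-models' probabilities for the positive class and
    threshold the mean to obtain the binary ensemble output."""
    return float(np.mean(probs) >= threshold)

# Hypothetical outputs of the four sub-models for one alarm:
print(ensemble_decision([0.9, 0.6, 0.4, 0.7]))  # 1.0 (mean 0.65 >= 0.5)
print(ensemble_decision([0.1, 0.2, 0.3, 0.2]))  # 0.0 (mean 0.20 <  0.5)
```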

Results
In order to make a clearer comparison between models, we list in Table 4 the best performance measures for each model. From Table 4, we can see that on the open dataset, the combined EDGCN models have the highest scores, followed by the EDGCN models; the DGCN models perform most poorly. This implies that the additional arrhythmia type embedding in EDGCN does supply information helpful for classification. What is more, the ensembling strategy can improve performance further.
For the same network architecture, models using 12 s slicing without 0-1 normalization perform best, followed by models using 12 s slicing with 0-1 normalization; models using the complete 15 s signals perform worst. On the other hand, noise addition only slightly elevates performance compared with pure 12 s slicing. Finally, we tested our combined models in the official sandbox environment. The performance measures are listed in Table 5. As shown, the combined EDGCN model using 12 s slicing and normalization got the best score of 78.17. It is interesting that on the hidden set of the sandbox environment, data expansion with 0-1 normalization achieves better performance, in contrast to the results on the open dataset. One potential reason may be that the hidden dataset differs in amplitude from the open dataset, possibly resulting from different sampling systems. Normalization during slicing can exactly erase this difference in absolute amplitude while retaining the relative shape information of the waveform.
For comparison, in Table 5 we also list Plesinger's results, which ranked first in the challenge. We can see that our best score is still slightly lower than Plesinger's. Compared with Plesinger's model, the poor performance on the ASY alarm (no higher than 75, in contrast to 97) constitutes the main deficiency. By definition, ASY means no beats for 4 s. It is not difficult for a rule-based model to detect real ASY. However, in our models, as determined by the kernel lengths, the time scale reaches only approximately 2.4 s by the last convolution, and the final FC stage simply compresses together features whose time scale is no longer than 2.4 s. This means that we cannot actually examine a 4 s continuous series comprehensively, which may be one of the main reasons that all our models perform much worse than Plesinger's on ASY.

Conclusion and Discussion
In this work, we developed an end-to-end model for ICU false alarm identification, through which the original time series can be automatically mapped into a binary output. One principal advantage of this kind of method is that manual feature extraction is not a prerequisite: convolution, which actually acts as filtering, conducts the feature extraction by itself. In addition, through the grouping strategy, we make full use of the various synchronized time series that represent different physiological aspects. However, as shown by the results, the score on the official hidden set is not that good, although it is not too bad either. On reflection, we believe the reasons, as well as the potential solutions, might lie in the following aspects. Firstly, although we adopted the grouped convolution strategy in order to carefully retain the characteristics of different series types, we did not truly adopt different receptive fields or time scales for them. As we know, ECG has a much higher frequency than respiration, and PPG as well as ABP differs greatly from ECG in waveform, reflecting the typical temporal response differences between electrophysiology and hemodynamics. However, we adopted only one set of kernel lengths for all series types, especially in the first convolution stage, which determines the basic coarse-grained scale. As Fig. 2 shows, in the first convolution stage, the receptive field is set to 0.04 s. This is a trade-off between the 'fast' ECG and the 'slow' RESP, with the direct consequence that it may be exactly appropriate for neither. Therefore, designing different kernel lengths for different series types is a potential way to improve model performance.
Secondly, as the arrhythmia type varies, the requisite or shortest continuous series length also varies. However, our methods did not vary accordingly to suit the different arrhythmia types. For example, as discussed in the Results, for ASY the model lacks a comprehensive scale of 4 s. Therefore, in future work, we should tailor the kernel lengths more specifically to particular arrhythmia types.
Thirdly, in order to relieve the problem of insufficient samples, we adopted data expansion by sliding a 12 s window with a stride of 0.5 s over the 15 seconds preceding the alarm. On the one hand, this operation does help elevate model performance, as can be inferred from Fig. 8 and Table 4; on the other hand, it may introduce slices with mistaken labels, e.g. slices labelled as 'True' alarms that happen to contain no arrhythmia event. This happens especially when an arrhythmia event triggers the alarm quickly, with a delay no longer than 5 seconds. Unfortunately, this problem cannot be solved by simply increasing or decreasing the window length. Instead of data expansion, obtaining more reliable real samples is the ultimate solution.