PS-DeVCEM: Pathology-sensitive deep learning model for video capsule endoscopy based on weakly labeled data

We propose a novel pathology-sensitive deep learning model (PS-DeVCEM) for frame-level anomaly detection and multi-label classification of different colon diseases in video capsule endoscopy (VCE) data. Our proposed model is capable of coping with the key challenge of apparent colon heterogeneity caused by several types of diseases. Our model is driven by attention-based deep multiple instance learning and is trained end-to-end on weakly labeled data, using video labels instead of detailed frame-by-frame annotation. The spatial and temporal features are obtained through ResNet50 and residual long short-term memory (residual LSTM) blocks, respectively. Additionally, the learned temporal attention module provides the importance of each frame to the final label prediction. Moreover, we developed a self-supervision method to maximize the distance between classes of pathologies. We demonstrate through qualitative and quantitative experiments that our proposed weakly supervised learning model gives superior precision and F1-score, reaching 61.6% and 55.1% respectively, compared to three state-of-the-art video analysis methods. We also show our model's ability to temporally localize frames with pathologies, without frame annotation information during training. Furthermore, we collected and annotated the first and largest VCE dataset with only video labels. The dataset contains 455 short video segments with 28,304 frames and 14 classes of colorectal diseases and artifacts. The dataset and code supporting this publication will be made available on our home page.


Introduction
There are several colorectal diseases and abnormalities that interfere with the normal working of the colon, including colorectal cancer, polyps, ulcerative colitis, and diverticulitis. Screening and detection of colorectal diseases at an early stage could improve disease management and diagnosis. Video capsule endoscopy (VCE) is a noninvasive imaging procedure of the large bowel that does not require sedation or gas insufflation (Shi et al. (2015)). A single VCE procedure produces approximately 50,000 images and takes 45-90 minutes to review. Therefore, a machine learning system can be used to complement gastroenterologists for fast and accurate diagnosis (Li et al. (2011)).
The detection and classification of colorectal diseases is a very challenging problem due to apparent colon heterogeneity.
In fact, colon data contains a high degree of apparent heterogeneity. Earlier detection and segmentation methods are based on fully supervised convolutional networks (e.g., Ronneberger et al. (2015)). These methods do not consider long-term temporal dependencies between frames to improve the performance of detection algorithms. Moreover, they rely on the assumption that pixel-level or frame-level annotation data is available, and they are trained in a fully supervised manner. However, this assumption is very limiting in a clinical setting, as it is expert-intensive and time-consuming to obtain a precise annotation of the pathologies per image. In addition, from a clinical application perspective, a gastroenterologist is required to check for various types of pathologies during a single examination, and computer-aided diagnostic techniques are expected to detect as many diseases as possible to circumvent misdiagnosis. However, the number of pathologies that earlier methods handle is limited to classes of diseases such as polyps (Bernal et al. (2017)) and angiodysplasia (Shvets et al. (2018)). Moreover, the datasets used to train such models lack the class variety needed for clinical application.
To address these challenges, we propose PS-DeVCEM, a new weakly supervised learning approach for learning frame-level multi-label classification from a given video label. Our model exploits robust deep residual features that are invariant to the apparent heterogeneity in colon data. We also use residual LSTM units to take into account long-term photometric and appearance variability. Our approach is based on an objective function that minimizes within-video similarity between positive and negative frame features, while at the same time learning video-level prediction and the contribution of each frame. The proposed method requires only video labels, which can easily be obtained from VCE reader software tags, such as those of the RAPID reader (GivenImaging), as part of the normal working procedure. We formulate the aggregation of positive and negative frame labels using the Bernoulli distribution. The network is trained by optimizing the sum of the log-likelihood and self-supervision losses.

[Fig. 1: Overview of PS-DeVCEM. Frame features are extracted with ResNet50 (He et al. (2016)) pre-trained on ImageNet (Deng et al. (2009)). The feature embedding is computed by passing through a residual LSTM block. Finally, the embeddings are aggregated with learned weights. The output of the network is video-level class probabilities for each pathology. For details please refer to Sec. 3.]

The main contributions of this work are:

• An end-to-end trainable, attention-based deep MIL model that takes a video as input and detects key-frames with video-level prediction.
The features from each frame in the video are aggregated using learned weights for final video prediction. In addition, we assume temporal dependency between neighboring frames, which is modeled using residual bidirectional long short-term memory blocks.
• A new VCE dataset suited for weakly supervised learning problems. The dataset contains 455 short video segments extracted using RAPID reader software (GivenImaging).
There are 28,304 frames with a total of 14 classes of diseases from 40 patients.
• A detailed comparison of existing weakly supervised learning algorithms (Ray and Craven (2005); Andrews et al. (2003)) on the proposed dataset.

We organize the rest of the paper as follows. Section 2 briefly reviews previous work on video analysis and multiple instance learning (MIL). In Section 3, we present PS-DeVCEM along with the self-supervision method. In Section 4, we present the dataset and discuss comparisons of different MIL methods and benchmarks; in addition, we present experiments with different configurations of the proposed method. Finally, in Section 5, we conclude the paper with a discussion and future directions.

Related Work
In general, there are two approaches to modeling video context: short- and long-context modeling. In these methods, long- and short-range dependencies can be memorized by sequentially running the network over individual frames. However, designing an architecture for video analysis is a challenging task, as it involves computationally expensive components such as the temporal information fusion strategy, the frame feature representation (as compared to end-to-end training), and spatio-temporal feature fusion. The basic building blocks for video analysis with deep learning include spatial feature extraction units, such as ResNet (He et al. (2016)) and VGG (Simonyan and Zisserman (2014)), and temporal feature extraction units, such as optical flow and LSTM (Graves (2013)). An LSTM is combined with a CNN for activity recognition in long-term recurrent convolutional networks for visual recognition and description (Donahue et al. (2015)). Other approaches extract spatio-temporal features jointly using 3D convolutions, such as C3D (Tran et al. (2015)). Spatio-temporal C3D features are used in (Sultani et al. (2018)) for anomaly detection in natural videos. Most current state-of-the-art methods use two-stream networks, such as ActionVLAD (Girdhar et al. (2017)), at the expense of high computational complexity. This is usually done by independently fusing extracted spatial and optical-flow features.
In many endoscopic pathology detection problems, labels are relatively scarce and expensive to obtain. One such case is VCE, where annotating pathologies frame by frame is arduous and time-consuming for medical doctors. Therefore, weakly supervised approaches such as MIL (Maron and Lozano-Pérez (1998); Møllersen et al. (2018)) or fully unsupervised detection and segmentation methods are required to address this issue. MIL is a type of weakly supervised learning problem where only group-level annotation, also known as bag-level annotation, is available. The instances within the bag are not labeled.
For example, the annotation could be a general statement about the category of the pathology in the video, without information about the location within the video or frame labels. In the MIL problem formulation (Ilse et al. (2018)), it is assumed that a positive-bag video contains at least one instance of a given pathology, while a negative-bag video depicts none.
MIL algorithms can be divided into two categories, depending on whether the data consists of independent samples (images) or temporal sequences (video).

Independent samples (images):
These methods assume that the data within a positive or negative bag are independent samples. The simplest approach to MIL is single-instance learning (SIL) (Ray and Craven (2005)), which assigns each instance the label of its bag, creating a supervised learning problem but mislabeling negative instances in positive bags (Doran and Ray (2014)).
In (Andrews et al. (2003)), the standard support vector machine (SVM) formulation (Suykens and Vandewalle (1999)) is modified so that the constraints on instance labels correspond to the fact that at least one instance in each bag is positive (Doran and Ray (2014)). Similarly, the SVM formulation is adapted to sparse positive bags in (Bunescu and Mooney (2007)). In deep MIL approaches, a pooling layer (Pinheiro and Collobert (2015)) is used to aggregate instance features into one bag representation; finally, a fully connected (FC) layer with a sigmoid is used to predict the bag labels.

Temporal based (video):
In (Paul et al. (2018)), spatial and temporal features are extracted using a two-stream network (RGB and optical-flow streams), and a co-activity similarity loss is proposed to maximize the distance between multiple activities. In (Nguyen et al. (2018)), the problem of untrimmed videos is considered by extracting segment features and applying a sparsity loss on the attention weights used for aggregating the segment features.
In general, image-based MIL approaches do not provide temporal localization for the detected pathology and are not suitable for video data.

Proposed method (PS-DeVCEM)
We begin by formally defining MIL and establishing the notation that will be used in the rest of this paper.
Let a video be denoted by V = {x_1, x_2, x_3, ..., x_N}, where x_n is the feature of the n-th frame and N is the number of frames in the video. We assume a label is available for each video V, given by G, with unknown frame labels y = {y_1, y_2, y_3, ..., y_N}.
Earlier works in MIL assume binary classification, where y_n ∈ {0, 1} (Ilse et al. (2018); Wang et al. (2018)). Here, however, we assume a general multi-label classification problem, where y_n can take values in the set of all K possible classes, P = {p_1, p_2, p_3, ..., p_K}, with p_k defined as the k-th abnormality in the dataset; a given video can be labeled as containing multiple abnormalities, i.e. Y_i ⊆ P. Using the above notation, the MIL constraint can be represented (per class in the multi-label setting) as

Y = 0 if Σ_{n=1}^{N} y_n = 0, and Y = 1 otherwise,

where Y is the predicted video label. An alternative MIL constraint formulation is given by the maximum class probability over the frames:

Y = max_n {y_n}.

It is important to note that the frame-level labels y_n are not available during the training phase; only the video label G is provided. Therefore, our goal is to infer the video label Y and the frame labels y_n by propagating information from the video level to the frame level with a neural network. The motivation for using neural networks is that they are easy to train in an end-to-end fashion. Moreover, previous works (Ilse et al. (2018); Wang et al. (2018)) report that neural-network-based MIL outperforms classical instance-based approaches such as SIL (Ray and Craven (2005)).
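To make the notation concrete, the following toy example (a minimal PyTorch sketch; the tensor names, sizes, and the 0.5 decision threshold are illustrative, not from the paper) computes the max-based MIL constraint for a multi-label problem:

```python
import torch

# Hypothetical frame-level class probabilities y_n^k for one video
# with N = 30 frames and K = 14 pathology classes.
frame_probs = torch.rand(30, 14)

# Max-based MIL constraint: the video-level probability of class k is
# the maximum probability of class k over all frames of the video.
video_probs = frame_probs.max(dim=0).values   # shape (14,)

# Multi-label video prediction Y with a 0.5 decision threshold.
video_label = (video_probs > 0.5).int()
```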

Residual LSTM
There are three different approaches to producing a video-level feature representation: the instance aggregation approach (Andrews et al. (2003)), the group aggregation approach (Cheplygina et al. (2015)), and the embedded-space aggregation approach (Ilse et al. (2018)). The approaches differ in whether they estimate frame-level probabilities or aggregate the embeddings. Instance aggregation works by combining instance-level predictions, while group-level aggregation uses group similarity to cluster positive and negative samples. Embedded-space aggregation merges instance features and learns a group-level classifier (Wang et al. (2018)). In VCE, and in medical imaging applications in general, experts are more interested in frame-level pathology predictions than in video-level predictions. Hence, instance aggregation approaches are suitable for medical applications, because frame-level predictions are paramount: they provide interpretation and explanation for the video prediction. Our approach is based on an aggregation of embeddings with learned aggregation weights, i.e. attention, which gives frame-level inference for the final video prediction.
The framework (illustrated in Fig. 1) consists of N fully convolutional encoder networks that extract features for each frame. The encoder network Φ_θ is a ResNet50 (He et al. (2016)) pre-trained on ImageNet (Deng et al. (2009)); however, other networks such as VGG (Simonyan and Zisserman (2014)), DenseNet (Huang et al. (2017)), or similar architectures could also be used. The temporal dependency between frames is modeled using residual LSTM blocks, as shown in Fig. 2. A residual LSTM block consists of a bidirectional LSTM composed of two LSTM units that leverage a residual connection (Graves and Schmidhuber (2005); Hochreiter and Schmidhuber (1997)). The main motivation for using residual connections is to make training easier and avoid performance degradation in deeper networks (He et al. (2016)). The biggest advantage of bidirectional LSTM networks lies in their capability to preserve information over time through recurrence.
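The residual bidirectional LSTM block can be sketched as follows (a minimal PyTorch illustration of the idea in Fig. 2; the class name, layer sizes, and the projection used to match dimensions are our assumptions, not details from the paper):

```python
import torch
import torch.nn as nn

class ResidualBiLSTM(nn.Module):
    """Bidirectional LSTM block with a residual (skip) connection."""
    def __init__(self, in_dim: int = 2048, hidden_dim: int = 1024):
        super().__init__()
        # Bidirectional LSTM: outputs 2 * hidden_dim features per frame.
        self.bilstm = nn.LSTM(in_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Project the input only if needed so it can be added to the output.
        self.proj = (nn.Linear(in_dim, 2 * hidden_dim)
                     if in_dim != 2 * hidden_dim else nn.Identity())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N frames, in_dim) frame features from the encoder.
        h, _ = self.bilstm(x)
        return h + self.proj(x)  # residual connection eases training

# Example: 30 ResNet50 frame features of dimension 2048.
frames = torch.randn(1, 30, 2048)
out = ResidualBiLSTM()(frames)  # (1, 30, 2048)
```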

Temporal attention
Attention has been shown to improve the performance of recurrent neural networks in language translation (Vaswani et al. (2017)) and activity recognition (Sharma et al. (2015)) tasks.
Attention is mainly used to ease the modeling of long-term dependencies. However, the application of attention to MIL has been limited, mainly to modeling the MIL pooling operation (Ilse et al. (2018)). Inspired by (Ilse et al. (2018)), the temporal attention is parameterized using a two-layered neural network. The attention block is shown in Fig. (3). However, as shown in Fig. (1), the attention block is trained on residual temporal features rather than on the frame features x_i as in (Ilse et al. (2018); Raffel and Ellis (2015)).
The MIL pooling operator aggregates the activations of the frame feature representations produced by the residual block. The MIL pooling layer is given by (Ilse et al. (2018)):

z = Σ_{n=1}^{N} α_n h_n,  where  α_n = exp(wᵀ tanh(V h_nᵀ)) / Σ_{j=1}^{N} exp(wᵀ tanh(V h_jᵀ)),

where w ∈ R^{L×1} and V ∈ R^{L×M} are the parameters of a two-layered neural network. Such a formulation keeps the pooling differentiable, allowing the gradient of the cost function to flow through the attention weights during end-to-end training.
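The attention pooling layer above can be sketched in PyTorch as follows (a minimal illustration; the module name and the particular sizes chosen for L and M are ours):

```python
import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    """Attention MIL pooling in the spirit of Ilse et al. (2018)."""
    def __init__(self, feat_dim: int = 2048, attn_dim: int = 256):
        super().__init__()
        self.V = nn.Linear(feat_dim, attn_dim, bias=False)  # V ∈ R^{L×M}
        self.w = nn.Linear(attn_dim, 1, bias=False)         # w ∈ R^{L×1}

    def forward(self, h: torch.Tensor):
        # h: (N frames, feat_dim) residual LSTM outputs for one video.
        scores = self.w(torch.tanh(self.V(h)))   # (N, 1) unnormalized
        alpha = torch.softmax(scores, dim=0)     # attention weights α_n
        z = (alpha * h).sum(dim=0)               # pooled video embedding
        return z, alpha.squeeze(-1)

pool = TemporalAttentionPooling()
z, alpha = pool(torch.randn(30, 2048))  # alpha sums to 1 over the frames
```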
Self-supervision

The temporal attention weights are used to split the frames of a video into a positive bag (frames with α_n > 1/N) and a negative bag (frames with α_n ≤ 1/N), with feature embeddings

Z+_bag = (1/B⁺) Σ_{α_n > 1/N} h_n  and  Z−_bag = (1/B⁻) Σ_{α_n ≤ 1/N} h_n,

where B⁺ and B⁻ are the cardinalities of the sets α > 1/N and α ≤ 1/N, respectively. Finally, the positive and negative bag feature embeddings, Z+_bag and Z−_bag, are used to train a two-layered neural network. The network is trained with a ground-truth value of "1" if the bags are the same and "0" otherwise. In other words, the proposed self-supervision acts as a regularizer by maximizing the distance between positive and negative bags in the embedding space.
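A minimal sketch of this self-supervision step (the 1/N threshold follows the text above; the bag-pairing network is a plain two-layered MLP of our choosing, and all sizes are illustrative):

```python
import torch
import torch.nn as nn

def split_bags(h: torch.Tensor, alpha: torch.Tensor):
    """Split frame embeddings into positive/negative bag means.

    h: (N, D) residual LSTM frame embeddings; alpha: (N,) attention
    weights. Assumes both bags are non-empty for this sketch.
    """
    n = h.shape[0]
    pos = alpha > 1.0 / n
    z_pos = h[pos].mean(dim=0)    # Z+_bag, averaged over B+ frames
    z_neg = h[~pos].mean(dim=0)   # Z-_bag, averaged over B- frames
    return z_pos, z_neg

# Two-layered network scoring whether a (Z+, Z-) pair comes from the
# same video; hidden size is illustrative.
pair_net = nn.Sequential(nn.Linear(2 * 2048, 256), nn.ReLU(),
                         nn.Linear(256, 1), nn.Sigmoid())

h = torch.randn(30, 2048)
alpha = torch.softmax(torch.randn(30), dim=0)
z_pos, z_neg = split_bags(h, alpha)
same_video_prob = pair_net(torch.cat([z_pos, z_neg]))
```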

Loss function
The inputs of our model consist of a sequence of video frames and their corresponding pathology label (ground truth).
Considering that we would like to learn both the temporal attention and the video-level predictions, we formulate the loss function shown in Eq. (7). The purpose of the training process is to minimize the loss L:

L = −(1/M) Σ_{m=1}^{M} [g_m log y_m + (1 − g_m) log(1 − y_m)] − (1/M) Σ_{m=1}^{M} [g^bag_m log y^bag_m + (1 − g^bag_m) log(1 − y^bag_m)],  (7)

where M is the size of the training set, g are the ground-truth labels, and y are the predicted probabilities.
The first term of Eq. (7) minimizes the video-level prediction loss. Note that, unlike (Sharma et al. (2015); Xu et al. (2015)), here the attention is learned implicitly, without any constraint in the loss function. The second term represents the self-supervision loss, which is a negative log-likelihood based on the Bernoulli distribution; g^bag_m is the ground-truth label, which is "1" if the bags are the same and "0" otherwise, and y^bag_m is the predicted probability for the bag pair.
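A sketch of how Eq. (7) could be computed (under our reading of the equation; the function and variable names are ours, and both terms are implemented as binary cross-entropy, i.e. Bernoulli negative log-likelihood averaged over the batch):

```python
import torch
import torch.nn.functional as F

def ps_devcem_loss(video_probs, video_labels, bag_probs, bag_labels):
    """Combined loss: video-level prediction + self-supervision terms.

    video_probs/video_labels: (M, K) multi-label predictions/targets.
    bag_probs/bag_labels: (M,) self-supervision pair predictions/targets.
    """
    video_loss = F.binary_cross_entropy(video_probs, video_labels)
    selfsup_loss = F.binary_cross_entropy(bag_probs, bag_labels)
    return video_loss + selfsup_loss

# Hypothetical batch: M = 4 videos, K = 14 pathology classes.
video_probs = torch.rand(4, 14, requires_grad=True)
video_labels = torch.randint(0, 2, (4, 14)).float()
bag_probs = torch.rand(4, requires_grad=True)
bag_labels = torch.ones(4)  # e.g., all bag pairs from the same video
loss = ps_devcem_loss(video_probs, video_labels, bag_probs, bag_labels)
loss.backward()
```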

Experiment
In this section, we provide an analysis of our proposed PS-DeVCEM model and evaluate our temporal attention method. We then compare our model with representative state-of-the-art methods (Ray and Craven (2005); Andrews et al. (2003)). The contents of the dataset are summarized in Table 1.
Dataset splitting: For a proper ablation study and benchmarking, we split the dataset into two groups: train and test. Data splitting can be formulated as a statistical sampling problem, and various statistical sampling techniques could be employed (May et al. (2010)). In our case, we used simple random sampling with 50% of the data for training and 50% for testing, while ensuring that the train and test sets contain roughly equal numbers of pathological findings and artifacts. The train and test videos are sampled randomly from all patients, to ensure that the trained model learns to distinguish the diseases rather than the patients. Table 1 outlines the pathologies and the number of videos in the training and testing sets.
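For illustration, the simple random 50/50 split could look like the following sketch (the function name and seed are ours; the balancing of pathologies and artifacts described above is omitted here):

```python
import random

def split_videos(video_ids, seed=0):
    """Simple random 50/50 train/test split of video segment IDs."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    half = len(ids) // 2
    return ids[:half], ids[half:]     # train IDs, test IDs

train_ids, test_ids = split_videos(range(455))
```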
Data augmentation: We randomly flip the video segments horizontally or vertically and randomly zoom into parts of the video segment to prevent the network from overfitting. We acknowledge that extensive data augmentation techniques (for instance, swapping the temporal order or perspective distortion) would likely lead to improved performance. However, since the purpose of this evaluation is to benchmark different methods, we rely on simple data augmentation techniques.

Implementation: Our model is implemented with the PyTorch library on a single NVIDIA TITAN X GPU. The images are resized to a fixed spatial size of 224 × 224 before being fed to the encoder networks. The encoder network follows the typical architecture of ResNet50 (He et al. (2016)), which has been widely used as the base network in many vision applications. The encoders are shared and initialized with weights pre-trained on the ImageNet dataset. The last fully-connected layer of the network is truncated, and the output of the average pooling layer is used as the frame representation. We set the sequence length to 30 frames per video segment, with a bidirectional LSTM hidden-state dimension of 1024.

[Table 1: Content of the PS-DeVCEM dataset. Note that some of the video segments are labeled with multiple pathologies. Each video is labeled by one gastroenterologist and checked by a second gastroenterologist for quality control. On average, the training and test data have 1.74 and 1.85 labels per video, respectively. In the training data, the numbers of videos having one, two, three, four, and five labels are 107, 88, 16, 14, and 2, respectively. In the test data, the numbers of videos having one, two, three, four, five, and six labels are 98, 89, 21, 17, 2, and 1, respectively.]

Learning on frame features (AttenConv): The temporal attention block is fed with the extracted feature from each frame, x_i. The frame representation and temporal attention are given in Eqs. (8) and (9). Each convolutional feature is weighted with the computed value α_n before being fed into the LSTM network, and the final state of the LSTM (h_N) is used for training the neural network. This is equivalent to applying temporal attention to the extracted features and modeling the temporal information with the LSTM. Therefore, after temporal attention the extracted feature x_n becomes x̂_n = α_n x_n (see the code sketch below).

Self-supervision (PS-DeVCEM): We extended our earlier experiment, AttenLSTM, by introducing the self-supervision network introduced in Section 3.3.
The self-supervision block is trained to maximize the distance between the high- and low-attention weighted frame feature representations. The main purpose of this experiment is to examine the efficacy of self-supervision on the overall accuracy of the method and on the temporal attention.
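For reference, a minimal sketch of the AttenConv variant described above (the module names and sizes are ours; attention is computed on the convolutional frame features before the LSTM, unlike the residual-feature attention used in PS-DeVCEM):

```python
import torch
import torch.nn as nn

# AttenConv sketch: attention weights computed on convolutional frame
# features x_n, which are rescaled (x̂_n = α_n · x_n) before the LSTM.
feat_dim, attn_dim = 2048, 256
attn = nn.Sequential(nn.Linear(feat_dim, attn_dim), nn.Tanh(),
                     nn.Linear(attn_dim, 1))
lstm = nn.LSTM(feat_dim, 1024, batch_first=True, bidirectional=True)

x = torch.randn(1, 30, feat_dim)        # one video, 30 frame features
alpha = torch.softmax(attn(x), dim=1)   # (1, 30, 1) attention weights
h, _ = lstm(alpha * x)                  # weighted features into the LSTM
video_repr = h[:, -1]                   # last-timestep output, standing
                                        # in for the final state h_N
```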
Ablation study results: Table 2 lists the results of the four variants of our framework discussed above, along with the temporal attention weights shown in Fig. (6). In these experiments, we only used the RGB stream. As shown in Table 3, the proposed PS-DeVCEM improves the F1-score and precision when using residual LSTM blocks and self-supervision. However, in special cases where a pathology exists throughout the video (Fig. (8)), our proposed method underestimates the frame attention weights (i.e., the video frames are visually similar but the attention values tend to differ). This is due to the dataset imbalance in the training examples for each pathology.
Self-supervision: In order to understand how the self-supervision loss affects the detection performance, we include a further experiment, shown in Table 3.

[Fig. 7: Temporal attention weights for a video whose ground-truth label is "Bleeding". In (a), the video feature is computed as the weighted sum of the frame features x_i using attention weights computed from the convolutional features, while in (b) and (c) it is computed using the output of the residual LSTM block. In (Ilse et al. (2018)), each instance is assumed to be permutation-invariant, and the attention module is not able to localize the keyframes; our approach considers neighboring instances to be similar and therefore gives a smoother and better localization of the keyframes.]

[Table 3: Comparison with state-of-the-art methods. (Nguyen et al. (2018)), W-TALC (Paul et al. (2018)), and attention-based deep MIL (Ilse et al. (2018)) are deep-neural-network-based methods, while SIL (Ray and Craven (2005)) and MissSVM (Zhou and Xu (2007)) are based on SVM classifiers.]

[Table 3 columns: Method | Precision | Recall | F1-score | Specificity, with rows for SIL (Ray and Craven (2005)) and the other compared methods.]

Discussion: By using self-supervision and residual LSTM blocks, we effectively optimize the performance of the proposed approach. By first classifying groups of frames into positive and negative classes, and then progressively classifying the frames as a whole into separate categories, the self-supervision mechanism improves feature discrimination between similar classes. Compared to metric-learning techniques such as Paul et al. (2018), as shown in Table 3, the proposed method relies on weak supervision to improve the positive and negative class feature representations. Alternative approaches to the self-supervision mechanism, including metric learning and Siamese network (Bromley et al. (1994)) variants, could be used, with the caveat that each video may contain multiple pathologies and some pathologies are more likely to occur together than others. Furthermore, since such approaches rely on a strong self-supervision signal, dataset imbalance and representation learning need to be taken into consideration. From the confusion matrix plot in Fig. (7), we can observe that similar classes, such as debris and erosion, or erosion and ulceration, are challenging to separate visually. With the proposed weak self-supervision, we are able to improve the discriminative feature representation without directly addressing the class imbalance problem. However, we note that the frame-level inference could be influenced by the following points. Firstly, the dataset is collected with the central part of each video tagged for pathologies. This could influence the learning process in practical settings, since it can bias the learning algorithm to memorize the location of the tagged pathology. Secondly, the residual LSTM blocks aggregate information temporally, which could misalign the attention to an incorrect segment of the video: a higher attention weight could be given to the frame location where the most temporal information is available. One approach to addressing the above issues is to collect additional data with longer sequences. However, despite being trained in a purely weakly supervised manner, our approach gives state-of-the-art results for pathology detection.
As shown in Fig. (9), temporal information aggregation using attention units improves the overall performance of all the methods. However, both the attention method and the input to the attention units affect performance on the video classification task as well as on frame localization. The experimental results in the ablation study indicate that feeding residual LSTM block outputs to two-layered neural network attention units gives better performance than the alternative approaches.

Conclusion
In this work, we proposed PS-DeVCEM: a pathology-sensitive, end-to-end deep model based on weakly labeled capsule endoscopy data. We introduced a self-supervision method and residual LSTM blocks for video- and frame-level prediction, further improving the interpretability of the proposed framework. Furthermore, we developed the first VCE dataset with video labels aimed at the MIL formulation, with a total of 455 short video segments. Experimental results on the PS-DeVCEM dataset show that the proposed method achieves the best performance on the precision and F1-score metrics. Finally, we believe that the PS-DeVCEM dataset and the proposed approach will inspire similar work, as the dataset and code will be made available with this publication.
As future work, we plan to improve video frame localization through domain knowledge of the pathologies. Moreover, some pathologies, such as inflammations, exhibit longer temporal dependencies, and handling these can further improve performance. Furthermore, we plan to diversify our dataset and collect more videos to improve the frame-level localization of pathologies.
Appendix A. Training Details

Fig. (A.10a) shows a pathology appearing across different frames; in this case, our method is able to localize the disease from the video labels alone. Fig. (A.10b) depicts Erosion and Erythema; although the prediction for the video is correct, the network attention did not span all the frames with the diseases. In Fig. (A.10c), the segment shows Tumor and Bleeding; the Bleeding is not detected by our method, but some parts of the segment containing the tumor were localized.