Extraction of Key-frames From Endoscopic Videos by using Depth Information

Early detection of colorectal cancer (CRC) can reduce the risk of death. Polyps are the precursor to such cancer. Analyzing the polyps from the most significant frames out of thousands of endoscopy frames is vital for diagnosing and understanding disease. In this article, a deep learning-based monocular depth estimation (MDE) technique is proposed to select the most informative frames (key-frames) of an endoscopic video. In most cases, ground truth depth maps of polyps are not readily available, and that is why the transfer learning approach is adopted in our method. An endoscopic modality generally captures thousands of frames. In this scenario, it is quite essential to discard low-quality and clinically irrelevant frames of an endoscopic video while the most informative frames should be retained for clinical diagnosis. In this view, a key-frame selection strategy is proposed by utilizing the depth information of polyps. In our method, image moment, edge magnitude, and key points are considered for adaptively selecting the key-frames. One important application of our proposed method could be the 3D reconstruction of polyps with the help of extracted key-frames. It gives a surgeon a real-time 3D view of the polyp surface for resection which involves detaching the polyp from its mucosa layer. Also, polyps are localized with the help of extracted depth maps.


I. INTRODUCTION
Endoscopy is a minimally invasive, state-of-the-art medical modality for investigating the gastrointestinal (GI) tract. During endoscopy, an endoscopist looks for tumors in the mucosa. Such tumor-like growths are called polyps and, if not treated early, may lead to cancer [1]. These polyps are generally found in the colon region and turn into cancerous cells at their advanced stage. Colonoscopy is the medical procedure adopted to detect such anomalies in the colon. Colorectal cancer (CRC) is among the most frequently occurring cancers and a significant cause of death worldwide [2]. Wireless Capsule Endoscopy (WCE) is a non-invasive modality for monitoring the condition of the internal viscera of the human body. The WCE capsule moves along the GI tract to capture images. It is extensively used to detect polyps in the colon, which become cancerous if left untreated. Colorectal cancer is the third most prevalent cancer today [3]. During colonoscopy, doctors comprehensively analyze the detected polyp regions to find dysplasia in them. Depending on the nature of the polyp, they may opt for laparoscopic surgery. However, the number of frames captured during the entire colonoscopy procedure is so large that it is difficult for the surgeon to infer useful clinical information. Therefore, video summarization techniques are adopted that retain only the clinically informative frames.
The capsule moves under the peristalsis movement, and it is challenging to control the motion and orientation of the camera. Thus, redundant and clinically non-significant frames are generally obtained in a video sequence. WCE takes nearly 8 hours, capturing close to 50000 frames. A large part of the data is clinically not significant and needs to be removed [4].
Several methods have been proposed for the detection, localization, and classification of polyps in endoscopy frames [5], [6], [7], [8]. A recent work focusing on video summarization, rather than on the detection of anomalies such as bleeding or ulceration, was proposed by Li et al. [9]. Iakovidis et al. [10] used clustering-based methods for video summarization. Similar work based on a clustering technique was proposed by Avila et al. [11]. However, clustering-based methods are not suitable in noisy environments, and endoscopy frames are generally susceptible to noise. Also, redundant frames are captured during endoscopy, which makes clustering methods perform poorly. Researchers are also working on visual attention models, such as saliency maps, for finding key-frames of videos [12]. Another visual saliency-based attention model was proposed by Ezaj et al. [13], who used motion, color, and texture features for hysteroscopy video summarization. A color histogram comparison-based method was adopted by Mendi et al. [14]. They compared the color histograms of successive frames in a video sequence, and key-frames were selected using k-means and PCA whenever a significant change in content was observed. However, this model does not suit endoscopic videos, as most of the frames have similar color information. Recently, dictionary learning-based approaches have been proposed for video summarization [15]. In [16], a gastroscopic video summarization technique based on a dictionary learning approach is proposed. Key-frames are very important and help in better prognosis and clinical management of the disease. Therefore, colonoscopy frames that need immediate medical attention are considered for this study. Malignant polyps usually have a convex shape and are more textured than benign polyps. Seitz et al. [17] reported that polyp size is correlated with the degree of dysplasia: a large, convex polyp is associated with more severe dysplasia.
Getting a 3D view of the polyp surface can significantly help in resection [18]. A good 3D reconstruction of an object in an image entails dense depth estimation. The 3D view gives shape and size information of a polyp. Depth estimation of endoscopic images is a challenging task as the endoscopic images are monocular.
Attempts have been made to solve it as a per-pixel regression problem; however, supervised learning methods require a lot of training data. It is not easy to acquire depth data without stereo cameras or expensive depth sensors, as is the case with endoscopy videos. Thus, unsupervised methods are being given more importance. Depth estimation in endoscopic video frames provides clinically relevant information to a physician, and 3D reconstruction from monocular images helps in diagnosis and surgical planning. Recently, depth estimation, especially monocular depth estimation (MDE), has gained high research interest. This is due to its application in scene understanding, robotics, autonomous driving, and Augmented Reality (AR). Finding depth from a single image is an ill-posed problem, since many different real-world 3D scenes can project to the same 2D image. Humans perceive depth from cues such as perspective, prior knowledge of the sizes of objects, or occlusion. In the literature, both supervised and unsupervised methods have been employed for estimating depth.
Eigen et al. [19] introduced a multi-scale approach that takes care of both global scene structure and local neighboring pixel information. A scale-invariant loss is used for MDE. Similarly, Xu et al. [20] formulated MDE as a continuous conditional random field (CRF) problem. They fused the multi-scale estimates computed from the inner semantic layers of a CNN within a CRF framework. Instead of finding continuous depth maps, Fu et al. [21] estimated depth using an ordinal regression approach. A space-increasing discretization method is introduced, which allows objects at larger depths to have a lesser influence on the depth maps than objects nearer to the observer.
Depth is generally obtained using sensors like LIDAR or Kinect, or by using stereo cameras. Sensors are expensive, and stereo cameras are not generally used in endoscopy due to several restrictions. Obtaining ground-truth training data for depth estimation is very difficult in endoscopic imaging, so supervised methods are not feasible for depth estimation in endoscopic images. Finding correspondences between two images for 3D reconstruction is also difficult in endoscopy videos, because it is not easy to find corresponding features across the frames.
Hence, unsupervised and semi-supervised methods are employed for MDE. Garg et al. [22] used binocular stereo image pairs to train CNNs, minimizing a loss function formed by warping the left-view image into the right view of the stereo pair. Godard et al. [23] improved this method by using a left-right consistency criterion. They trained CNNs on stereo images but used a single image for inference. They introduced a new CNN architecture that computes end-to-end MDE, and the network was trained with an efficient reconstruction loss function. The state-of-the-art unsupervised MDE method, the Monodepth model [23], has limited application to in-vivo images such as endoscopic images. This is because most models leverage outdoor scenes [24] and a few indoor scenes [25] for training, and they use high-end sensors or stereo cameras, while WCE only captures monocular images. Hence, it is important to devise a strategy to perform MDE on medical imaging datasets, which generally do not have ground truth depth information. That is why a transfer learning approach is adopted in our method for estimating depth. Transfer learning refers to a learning method where what has been learned in one setting is exploited to improve generalization in another setting [26]. Zero-shot learning is the extreme case of transfer learning where no labeled examples are present. In our method, a zero-shot learning approach for MDE [27] is employed.
The proposed method consists of two main steps. The first step focuses on depth estimation, and the second step extracts key-frames. As mentioned above, a zero-shot learning approach is adopted for depth estimation in endoscopic videos. We propose a framework to select the most informative frames of an endoscopic video sequence. Our method employs a three-criteria approach to identify the key-frames, and it is unique in the sense that it considers depth information to find them. Finally, any of the selected key-frames can be used for 3D reconstruction using a GUI. Experimental results demonstrate the effectiveness of our method in choosing the key-frames and in subsequent polyp visualization. The proposed method is elucidated in Section II. Experimental results and conclusions are discussed in Section III and Section IV, respectively.

II. PROPOSED METHOD

A. DEPTH ESTIMATION
Due to the unavailability of ground truth depth data in endoscopy video datasets, a transfer learning approach is adopted for MDE in our proposed method. Lasinger et al. [27] proposed a zero-shot learning approach for depth estimation, and their work inspires the depth estimation step of our method.
This section explains how we use monocular images to learn relative depth. As demonstrated in Figure 2, we model monocular relative depth perception as a regression problem. To regress pixel-wise relative depth end-to-end from a batch of input images $I$, we create a non-linear function $y = f(I, \delta)$ parameterized by $\delta$. The network is built on a feedforward ResNet architecture that generates multi-scale feature maps [28]. To improve predictions, a progressive refinement technique is used to combine the multi-scale features.
The model was trained on depth maps obtained in three different ways. First, part of the data contains depth maps obtained using LIDAR sensors, which gives depth maps of high quality. Second, the Structure from Motion (SfM) approach is employed to estimate depth. Third, depth information is obtained from stereo images of the 3D movies dataset: optical flow is used to find motion vectors between the two stereo images, and the left-right image disparity is then used to find a depth map. The dataset contains images with varying aspect ratios, and black bars on frame borders sometimes appear in the estimated depth maps. So, all the images are cropped to extract only the center portion of the frame. This ensures that the framework can handle images of varying aspect ratios; it also means that the method focuses more on the central part of the image frame. Using the distance of an object from the camera to predict depth leads to sparse 3D reconstructions. This is because depth is estimated by tracking corresponding features over a series of frames, and the induced parallax is then used for triangulation and depth estimation. However, the resulting parallax is small for distant features (like the sky) and does not allow proper reconstruction. Thus, distant objects like the sky are not considered while estimating depth, which addresses the issue of finding correspondences for distant objects.
The disparity map is found by stereo matching using optical flow, which successfully handles moderate displacements. The horizontal component of the flow vectors is used as a reference for finding the disparity map. Optical flow is estimated by taking either the left or the right image as a reference and finding the flow from the other. Next, the consistency between the left and right estimates is checked, and pixels whose disparities differ by more than one pixel are discarded.
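The left-right consistency check described above can be illustrated with a minimal NumPy sketch. The function name, the nearest-pixel lookup, and the sign convention (a consistent match flows back by the opposite amount) are assumptions made for illustration; the text does not specify these details.

```python
import numpy as np

def left_right_consistency_mask(flow_lr, flow_rl, max_diff=1.0):
    """Boolean mask of left-image pixels whose horizontal flow agrees
    with the right-to-left flow to within max_diff pixels.

    flow_lr, flow_rl: (H, W) arrays holding the horizontal flow component
    (hypothetical inputs, e.g. from any optical flow estimator).
    """
    flow_lr = np.asarray(flow_lr, dtype=float)
    flow_rl = np.asarray(flow_rl, dtype=float)
    h, w = flow_lr.shape
    xs = np.arange(w)[None, :]   # column index of every pixel
    ys = np.arange(h)[:, None]   # row index of every pixel
    # Column where each left pixel lands in the right image (nearest pixel).
    x_target = np.rint(xs + flow_lr).astype(int)
    inside = (x_target >= 0) & (x_target < w)
    x_clipped = np.clip(x_target, 0, w - 1)
    # For a consistent match, forward and backward flow cancel out.
    round_trip = flow_lr + flow_rl[ys, x_clipped]
    return inside & (np.abs(round_trip) <= max_diff)
```

Pixels that fall outside the image or fail the one-pixel tolerance are masked out and would simply be excluded from the disparity supervision.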
The datasets on which the model is trained are unique because they contain both positive and negative disparities. However, training on ground truth data from different sources has some constraints: 1) The dataset contains images that have only depth (from LIDAR sensors) or disparity images; 2) Data obtained from the SfM technique gives depth images for which scale is not known; 3) The 3D movies dataset gives a ground truth depth which has an unknown shift.
Loss function. A shift- and scale-invariant loss function is chosen to address the problems pertaining to training on three different datasets. Let $d \in \mathbb{R}^N$ be the computed inverse depth and $d' \in \mathbb{R}^N$ be the ground truth inverse depth, where $N$ is the number of pixels in a frame. Here $s$ and $t$ represent scale and shift, respectively, and they are positive real numbers. The alignment can be written in vector form by taking $\vec{d}_i = (d_i, 1)^\top$ and $p = (s, t)^\top$, so that the loss becomes a least-squares problem in $p$. This problem has a closed-form solution $p^{opt}$, and substituting $p^{opt}$ back into the vector form gives the final data term.
Regularization term. A multi-scale, scale-invariant regularization term is used, which performs gradient matching in the inverse depth space. This biases discontinuities to be sharp and to coincide with ground truth discontinuities. The regularization term is defined on the x and y gradients of $Q^k$, where $Q^k$ is the difference of the inverse depth maps at scale $k$. We use $k = 4$ scale levels, halving the image resolution at each level. Also, the scale is applied before finding the x and y gradients.
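The two terms described above can be sketched in NumPy as follows. This is an illustrative sketch, not the training code: it assumes a mean-squared data term with a factor of 1/2, approximates the resolution halving by strided subsampling, normalizes every scale by the full-resolution pixel count, and uses hypothetical function names.

```python
import numpy as np

def align_scale_shift(d, d_gt):
    """Closed-form least-squares scale s and shift t so that s*d + t fits d_gt."""
    d = np.asarray(d, dtype=float).ravel()
    d_gt = np.asarray(d_gt, dtype=float).ravel()
    a = np.stack([d, np.ones_like(d)], axis=1)   # rows are (d_i, 1)
    # Normal equations: (sum a_i a_i^T) p = sum a_i d'_i
    p = np.linalg.solve(a.T @ a, a.T @ d_gt)
    return p[0], p[1]

def ssi_loss(d, d_gt):
    """Scale- and shift-invariant data term (assumed form: half the MSE)."""
    s, t = align_scale_shift(d, d_gt)
    residual = s * np.asarray(d, dtype=float) + t - np.asarray(d_gt, dtype=float)
    return 0.5 * np.mean(residual ** 2)

def gradient_matching(d_aligned, d_gt, num_scales=4):
    """Multi-scale gradient matching on the difference map Q = d_aligned - d_gt."""
    q_full = np.asarray(d_aligned, dtype=float) - np.asarray(d_gt, dtype=float)
    n = q_full.size
    total = 0.0
    for k in range(num_scales):
        step = 2 ** k                      # halve the resolution at each level
        q = q_full[::step, ::step]
        grad_x = np.abs(np.diff(q, axis=1)).sum()
        grad_y = np.abs(np.diff(q, axis=0)).sum()
        total += (grad_x + grad_y) / n
    return total
```

Passing the aligned prediction `s * d + t` to `gradient_matching` reflects the statement that the scale is applied before the gradients are taken.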
Modified loss function. The final loss function for a training set of size $M$ takes the regularization term into consideration through a weighting factor $\alpha$, which is taken as 0.5.
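For reference, one formulation consistent with the definitions above is sketched here. It assumes a mean-squared data term, as in the zero-shot formulation of [27]; the normalization constants, the use of the aligned prediction in $Q^k$, and the placement of $\alpha$ on the regularization term are assumptions rather than details stated in the text.

```latex
% Data term in scalar form and in vector form, with \vec{d}_i = (d_i, 1)^\top and p = (s, t)^\top
\mathcal{L}_{ssi}(d, d') = \min_{s,t} \frac{1}{2N} \sum_{i=1}^{N} \left( s\, d_i + t - d'_i \right)^2
                         = \min_{p} \frac{1}{2N} \sum_{i=1}^{N} \left( \vec{d}_i^{\top} p - d'_i \right)^2

% Closed-form least-squares solution
p^{opt} = \left( \sum_{i=1}^{N} \vec{d}_i \vec{d}_i^{\top} \right)^{-1} \left( \sum_{i=1}^{N} \vec{d}_i\, d'_i \right)

% Data term after substituting p^{opt}
\mathcal{L}_{ssi}(d, d') = \frac{1}{2N} \sum_{i=1}^{N} \left( \vec{d}_i^{\top} p^{opt} - d'_i \right)^2

% Multi-scale gradient matching, with Q_i = \vec{d}_i^{\top} p^{opt} - d'_i evaluated at scale k, K = 4
\mathcal{L}_{reg}(d, d') = \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{N} \left( \left| \nabla_x Q_i^{k} \right| + \left| \nabla_y Q_i^{k} \right| \right)

% Final loss over a training set of size M, with \alpha = 0.5
\mathcal{L} = \frac{1}{M} \sum_{m=1}^{M} \left[ \mathcal{L}_{ssi}\left(d^{m}, d'^{m}\right) + \alpha\, \mathcal{L}_{reg}\left(d^{m}, d'^{m}\right) \right]
```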

B. SELECTION OF KEY-FRAMES
During colonoscopy, not all the captured frames are clinically significant. Most of the frames may contain redundant information or may not be useful from a diagnostic perspective. Such frames need to be discarded, and the clinically informative frames need to be retained. It is also strenuous and time-consuming for a physician to investigate each frame of a video sequence. Thus, we propose a key-frame selection technique. Subsequently, 3D reconstruction is done to perform further analysis of the polyps. The key-frame selection method is given in Fig. 1.
Color space conversion. Our dataset contains images in the RGB color space. Taking cues from the human visual system, which works on saliency, we changed the color space from RGB to COC, which gives a better perception in medical imaging [29].
The image is subsequently used to find key-frames. A frame should satisfy three criteria before being selected as a key-frame: 1) It should be significantly different from neighboring frames. 2) It should give significant depth information of a polyp. 3) The polyp should not be occluded in it. These requirements are formulated as follows.
Image moment: Image moments give information about the shape of a region along with its boundaries and texture. Hu moments [30] are considered as they are invariant to translation, scaling, and rotation. The moment distance between consecutive frames is calculated and used to identify the redundant frames of a video: frames with a higher moment distance are considered as key-frames. The moment distance $d$ between two images is calculated over the seven Hu moments, indexed by $i = 1, \dots, 7$.
Edge density: In our proposed method, only the key-frames that have significant depth information are considered for the 3D reconstruction of a polyp. It is observed that polyp images having more edges have more depth information. The edge information can be obtained from the gradient magnitude of an image. Before finding the gradients, images were smoothed using a Gaussian kernel.
Horizontal and vertical gradients are obtained using the Sobel operators $S_x$ and $S_y$, and the gradient magnitude $\Delta S$ is then calculated from the two responses as $\Delta S = \sqrt{S_x^2 + S_y^2}$.
Key-point detection: The proposed moment-based key-frame detection method may capture some occluded frames. So, the objective is to select non-occluded key-frames from the group of key-frames extracted by our image moment and edge density criteria. For this, a key-point detection-based technique is used.
For key-point detection and extraction, we used ORB (Oriented FAST and Rotated BRIEF). ORB is computationally fast and robust to noise in endoscopic images. Frames containing fewer ORB points correspond to occluded polyps.
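The edge-magnitude criterion described earlier in this subsection can be sketched with plain NumPy. The 3×3 Gaussian kernel, the edge-replicated padding, and the use of the mean gradient magnitude as the per-frame score are assumptions made for illustration. Hu moments and ORB key points would typically come from an image-processing library such as OpenCV and are not reproduced here.

```python
import numpy as np

# 3x3 Gaussian smoothing kernel and Sobel kernels (kernel size is an assumption).
GAUSS = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Correlate a 2-D image with a 3x3 kernel using edge-replicated padding."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def edge_magnitude_score(depth_map):
    """Mean Sobel gradient magnitude of a Gaussian-smoothed depth map."""
    smooth = filter3x3(depth_map, GAUSS)
    grad_x = filter3x3(smooth, SOBEL_X)
    grad_y = filter3x3(smooth, SOBEL_Y)
    magnitude = np.sqrt(grad_x ** 2 + grad_y ** 2)
    return float(magnitude.mean())
```

A flat depth map scores zero, while a map with a sharp depth transition scores higher, which matches the intent of favoring frames with more depth structure.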
Adaptive key-frame selection. After finding the moment distance ($d$), edge magnitude ($s$), and the number of ORB points ($p$), we normalize these scores using min-max normalization, so that each of the three scores lies in the range 0 to 1, both values inclusive. Instead of adding the three scores directly, we use dynamic weights to capture the changes in a video: the variable showing greater variation is given more weight. Here, $w_i$ is the weight of the $i$-th normalized score. To consider intra-variable changes, we used the sum of the magnitudes of the differences between consecutive frame scores as a measure to find the weights, and then normalized this measure so that it can be used as a weight for finding a fused score. The weights are thus obtained from $d_1$, $s_1$, and $p_1$, the sums of magnitudes of differences between consecutive frame scores, and $f$ is the fused score obtained by adaptively weighting the three frame scores. The frames with the highest fused scores are selected according to a threshold value, which was set to 0.5. The variation of each criterion with frame number is shown in Fig. 3.
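The adaptive fusion above can be sketched in a few lines of Python. The sketch assumes that the variation measures are computed on the normalized scores, that the weights are each measure divided by the sum of the three, and that a frame is kept when its fused score is at least the threshold; these details and the function names are illustrative assumptions.

```python
def min_max(scores):
    """Min-max normalize a list of scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:                      # constant score carries no information
        return [0.0] * len(scores)
    return [(x - lo) / (hi - lo) for x in scores]

def variation(scores):
    """Sum of magnitudes of differences between consecutive frame scores."""
    return sum(abs(b - a) for a, b in zip(scores, scores[1:]))

def fuse_scores(moment, edge, orb, threshold=0.5):
    """Return (weights, fused scores, indices of selected key-frames)."""
    d, s, p = min_max(moment), min_max(edge), min_max(orb)
    d1, s1, p1 = variation(d), variation(s), variation(p)
    total = d1 + s1 + p1
    if total == 0:                    # no variation at all: fall back to equal weights
        weights = [1.0 / 3.0] * 3
    else:
        weights = [d1 / total, s1 / total, p1 / total]
    fused = [weights[0] * a + weights[1] * b + weights[2] * c
             for a, b, c in zip(d, s, p)]
    keys = [i for i, f in enumerate(fused) if f >= threshold]
    return weights, fused, keys
```

For example, with moment distances `[0, 1, 0, 1]`, edge scores `[0, 0, 0, 1]`, and a constant ORB count, the moment criterion receives three quarters of the weight because it varies the most across frames.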

III. EXPERIMENTAL RESULTS
The proposed method is evaluated on a publicly available dataset. This dataset contains colonoscopic video sequences from three classes, namely adenoma, serrated, and hyperplasic. The adenoma class contains 40 sequences from different patients, the serrated class contains 15, and the hyperplasic class contains 21 [32]. In this work, we consider only the frames from the adenoma (malignant) class. We also considered only narrow-band imaging (NBI) frames, as they require less preprocessing. The adenoma class contains both patchy and convex polyp sequences, and the frames with convex polyps are taken for estimating the depth. A few convex and patchy polyp images of the dataset are shown in Fig. 4.

FIGURE 4. Panel labels: convex polyps; patchy polyps.

We used a model pre-trained on diverse datasets by Lasinger et al. [27] in our work. A ResNet-based multi-scale architecture as proposed by Xian et al. [33] is used for depth estimation. The Adam optimizer is used with a learning rate of $10^{-4}$ for layers that are randomly initialized and $10^{-5}$ for layers initialized with pre-trained weights. Decay rates for the optimizer are set at $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and training uses a batch size of 8. Because the image aspect ratios differ, images are cropped and augmented for training. The input size of the frames is 384 × 384.

VOLUME 4, 2016

FIGURE 5. Key-frames obtained by our method and their corresponding depth maps. The polyp is visible from different viewing angles in these selected frames.

FIGURE 6. Comparison of MDE on two input images, one an outdoor scene and the other an endoscopy image (columns: input image, Monodepth [23], zero-shot [27]). The depth map by Monodepth [23] performs well for the outdoor environment while giving unsatisfactory results for the endoscopy image. However, the zero-shot learning method [27] clearly performs well for medical images but cannot accurately estimate the depth in outdoor scenes.
For endoscopic images, the zero-shot approach adopted in our method gives better depth estimates than the state-of-the-art Monodepth model. The depth estimation results are shown in Fig. 6, where the first column shows the input images, while the second and third columns show the comparative results of the Monodepth model [23] and the zero-shot cross-dataset transfer pre-trained model [27]. The comparison shows that Monodepth performs better than our method in outdoor environments. However, the zero-shot learning method is more accurate in predicting depth in endoscopic images.
Our method is the first of its kind in which key-frames are extracted from an endoscopic video using depth maps. Also, it is robust to occlusions. As redundant frames are discarded in our method, it is more convenient for physicians to analyze the essential frames of a video sequence. As explained earlier, the moment distance criterion between consecutive frames is used to ensure that redundant frames are identified and then discarded. The edge magnitude criterion leverages the depth image data to select the best frames. Frames with fewer ORB points have occluded polyps, and these frames are redundant. The three criteria are adaptively weighted and thresholded to obtain the essential frames for 3D reconstruction.
The selected key-frames are finally used to reconstruct the 3D surface of the polyp. We have used Facebook's 3D image GUI to view the reconstructed polyp surface; a video is available at https://youtu.be/PJKfk0Mqu2I.

[31], the last two rows of images are frames taken from a video sequence of the publicly available dataset [32].
3D visualization of a polyp helps in surgeries involving the removal of the polyp from its root. This gives better visualization of polyps for diagnosis. Fig. 5 shows some of the results of key-frame extraction and the corresponding depth maps. To our knowledge, there are no publicly available datasets with depth maps for endoscopic frames, nor methods evaluated on such data. Thus, a comparison between different methods for predicting depth from endoscopic images could not be performed.
Another application of our proposed method could be the automatic segmentation of polyps in endoscopic images. The depth maps generated by our proposed method can further be used for polyp localization. The Canny edge detector is applied to the depth maps, and subsequently, the polyp boundary is determined using connected component analysis. Fig. 7 shows localized polyps in some of the endoscopic image samples. The segmentation performance on some of the sequences of the CVC-Clinic Database [31] is shown in Table 1. This dataset contains 25 colonoscopy video sequences. Each sequence contains an average of 25 frames. We defined mIoU as the mean intersection over union between the segmented polyp masks and the ground truth masks. In polyp segmentation, an IoU score of ≥ 0.5 is generally considered good [34].
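The mIoU measure defined above can be computed as in the following sketch. The function names are hypothetical, and the convention of returning 1.0 when both masks are empty is an assumption.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                    # both masks empty: treat as a perfect match
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(pred_masks, gt_masks):
    """Mean IoU over paired lists of predicted and ground truth masks."""
    return float(np.mean([iou(p, g) for p, g in zip(pred_masks, gt_masks)]))
```

With this definition, a frame whose predicted mask covers twice the area of a fully contained ground truth mask has an IoU of 0.5, which sits exactly at the threshold mentioned above.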

IV. CONCLUSION
Our proposed method can determine depth maps using a zero-shot learning approach. The zero-shot learning method performs well on previously unseen domains such as endoscopic images. Through this, we extended MDE to in-vivo images, which would be helpful in analyzing medical images. The essential frames are picked out from WCE videos with the help of depth information and the proposed three-criteria selection strategy. The threshold value for the final fused score must be set empirically to extract the key-frames. Experimental results show the efficacy of the proposed method in selecting key-frames from endoscopic videos and in the subsequent segmentation of detected polyps in the key-frames with the help of extracted depth maps. Also, the 3D model could be used in clinical diagnosis and surgeries. One possible extension of this work could be the visualization of polyps in detected key-frames in an augmented reality framework.