QI-BRiCE: Quality Index for Bleeding Regions in Capsule Endoscopy Videos

Abstract: With the advent of services such as telemedicine and telesurgery, continuous quality monitoring for these services has become a challenge for network operators. Quality standards for such services are application specific, as medical imagery is quite different from general-purpose images and videos. This paper presents a novel full-reference objective video quality metric that focuses on estimating the quality of wireless capsule endoscopy (WCE) videos containing bleeding regions. This research focuses on bleeding regions in the gastrointestinal tract, as bleeding is one of the major reasons behind several diseases within the tract. The method jointly estimates the diagnostic and perceptual quality of WCE videos, and its predictions correlate highly with the subjective differential mean opinion scores (DMOS). The proposed method combines motion quality estimates, bleeding regions' quality estimates based on a support vector machine (SVM), and perceptual quality estimates computed from the pristine and impaired WCE videos. Our method, Quality Index for Bleeding Regions in Capsule Endoscopy (QI-BRiCE) videos, is the first of its kind, and the results show high correlation in terms of Pearson's linear correlation coefficient (PLCC) and Spearman's rank order correlation coefficient (SROCC). An F-test is also provided in the results section to assess the statistical significance of the proposed method.

need to employ stringent criteria for maintaining high standards in the multimedia content provided to end users. Such high standards can be maintained by continuously monitoring the quality of the multimedia content being transmitted through communication systems. Quality of service (QoS) and QoE are vital aspects in assessing the validity and reliability of multimedia telemedicine applications. Unlike telemedicine, the entertainment domain has seen intense research in the field of quality estimation and modelling for multimedia services and applications. An inefficient and costly way of estimating quality is to employ volunteers who provide subjective measurements of the quality of the videos in question. The participants in these tests can be experts (physicians, doctors etc.) or non-experts. Another way is to develop objective metrics whose outputs are highly correlated with subjective measurements. The need of the hour is to develop efficient and accurate video quality metrics (VQM) for quality estimation of specific medical videos, such as ultrasound videos [4], endoscopy videos [3] and laparoscopic videos. With such metrics available, doctors and physicians will have enough confidence to use medical multimedia content for various purposes, such as diagnosis, even after the data has been processed and transmitted over wireless channels.
Wireless transmission poses two major challenges: limited resources, such as available bandwidth, and the error-prone nature of the channels through which the data is transmitted. Bandwidth limitations force network operators to adopt lossless [5][6][7] and lossy [8,9] compression algorithms in order to make sure that there is no interruption (stalling, frame freezing etc.) in services to the end users. Medical videos are considered highly sensitive content, as they contain vital information such as disease traces, which help doctors and physicians to perform diagnosis [3][10][11][12]. However, with modern video compression standards such as H.265, network operators can compress medical videos with minimum loss of perceptual and diagnostic quality. High efficiency video coding (HEVC), or H.265, offers up to 50% bandwidth savings compared to its predecessors.
For medical videos, perceptual quality is less important than diagnostic quality in the context of video quality assessment (VQA) [3]. In telemedicine and telesurgery, the end users are mostly physicians and doctors, who are more interested in the diagnostic quality of the videos. The diagnostic quality of medical videos mainly depends on the clarity of the sensitive content [3]. Objective VQMs for medical videos should therefore estimate quality based on both the diagnostic and the perceptual quality of the videos.
Wireless capsule endoscopy (WCE) is a process in which a pill-shaped electronic device, as shown in Fig. 1, is swallowed by a patient and captures and transmits video of the gastrointestinal (GI) tract to a post-processing workstation. A typical WCE video contains roughly 60,000 frames [13], which require a lot of bandwidth when transmitted wirelessly. An efficient compression method is therefore needed to avoid wasting network resources without losing clarity in any diagnostic data. HEVC allows such compression with a minimum amount of degradation in the diagnostic quality [3,4]. However, such compression requires continuous quality monitoring, a need that efficient objective VQMs designed specifically for quality estimation of WCE videos can address. The most common type of anomaly that occurs in the human GI tract is GI bleeding, which leads to various kinds of fatal diseases. Several works have highlighted the importance of GI bleeding in human beings [14].
This paper presents a novel objective full reference (FR) VQM: Quality Index for Bleeding Regions in Capsule Endoscopy (QI-BRiCE). The metric jointly estimates the perceptual and diagnostic quality of impaired WCE videos that contain bleeding regions. To the best of the authors' knowledge, no objective VQM has previously been designed specifically for WCE videos. The following section surveys the state of the art and, at its end, highlights the principal contributions of this work.

Background and Related Work
Although limited, there have been efforts in designing, standardizing and modelling video quality metrics for estimating the quality of medical videos. This section first discusses the published works on video quality assessment (VQA) for medical videos and then discusses the state-of-the-art FR objective VQMs.

VQA in the Context of Medical Videos & Images
The authors in [15][16][17] have conducted VQA studies for various types of magnetic resonance (MR) images. All three works included medical experts in their subjective tests and studied different types of distortions in MR images. In [15], the authors carried out subjective tests to assess the quality of MR images of the human brain, spine, knee and abdomen distorted with 6 types of distortion (Rician and white Gaussian noise, Gaussian blur, discrete cosine transform (DCT), JPEG and JPEG 2000 compression) at 5 different levels. In another study [16], MR images of brain, liver, breast, foetus, hip, knee, and spine were studied by considering the impact of a set of common distortions (ghosting, edge ghosting, white and coloured noise) on the perceived quality. A similar study in [17] considers the perceptual impact of different types of distortions and noise in MR images. The study in [18] investigates the effects of blurring, colour, gamma parameters, noise, and image compression on animal digital pathology images. In this study, the test subjects belonged to both expert and non-expert categories. In [19], a subjective study comprising both expert and non-expert subjects is presented for studying the effects of angular resolution and light field reconstruction of 3D heart images. The authors in [20] conducted subjective tests with several medical experts and concluded that the highly compressed endoscopic videos presented to the experts did not modify their perception and opinion. Another study in [21], on H.264 encoded laparoscopic videos, was conducted to evaluate the impact of resolution and constant rate factor (CRF) changes on overall image and semantic quality. In [3], the authors presented a detailed objective and subjective study for HEVC compressed WCE videos. The study included both expert and non-expert participants and identified maximum compression levels for WCE videos from the viewpoint of diagnostic and perceptual quality.
A similar study was conducted in [4] for HEVC compressed ultrasound videos. In [22], subjective tests were conducted on 4 videos representing different stages of a laparoscopic surgery. A quality threshold in terms of bitrate was derived from the experts' opinions about MPEG2 compressed laparoscopic surgery videos. The authors in [23] studied the impact of delay, jitter, and packet loss ratio (PLR) on ophthalmology videos from the viewpoint of telemedicine. In [24], the authors conducted subjective tests to investigate the impact of H.264 and HEVC compression on hepatic ultrasound videos. A detailed and comprehensive survey related to medical VQA is available in a recent publication [14]. Finally, an FR VQM specifically designed for cardiac ultrasound videos is presented in [25].

Objective Quality Metrics
Objective VQA is an economical and low-complexity method of assessing the quality of videos for the purpose of network optimization. Network operators employ objective FR VQA models because the results of an objective VQA model function as feedback to the network. Based on the results, the network operator optimizes the network in order to overcome encoder and transmission errors. For medical purposes, this is very important, as the diagnostic information in medical videos must be preserved. Objective VQA models can be classified into three major categories. The first is the Full Reference (FR) model, in which the source or original video is present at the reception side and the quality is assessed by comparing the original video with the received video. The second is the Reduced Reference (RR) model, in which only some features of the original video are present at the reception side and the quality is assessed by comparing the features of the original and the received video. The third is the No Reference (NR) model, in which no information about the original video is available at the reception side.
A detailed review of FR metrics can be found in [26][27][28]. A brief description of the FR VQMs used in this work is given as follows. Peak signal-to-noise ratio (PSNR) is based on statistical measurements. The mean square error (MSE) is calculated over the pixels of each frame of a video sequence and serves as the noise term in the ratio of signal over noise. Structural similarity index metric (SSIM) [29] measures the quality of the video based on the structural similarity between the original video and the impaired video. The similarity is measured based on luminance, contrast and structural comparison. Its multi-scale extension, the multi-scale SSIM index metric (MSSSIM) [30], measures the quality of the image on multiple scales, with one as the lowest scale and M as the highest scale. The contrast and structural comparisons are calculated at each scale j, whereas the luminance comparison is calculated only at scale M. The overall evaluation of the video is obtained by combining these measurements on different scales. Visual signal-to-noise ratio (VSNR) [31] uses contrast thresholds to identify the impairments in the video sequences. All the impairments above these thresholds are mapped to represent the quality of the video sequences. Information fidelity criterion (IFC) [32] is based on natural scene statistics (NSS); the reference video is transformed to the wavelet domain and then information based on NSS is extracted from it. The same information is extracted from the impaired video. Both extracted quantities are combined to form a model for estimating the visual quality of the video sequence. In the visual information fidelity (VIF) [33] metric, the reference video is quantified and certain information is extracted from it by transforming each frame of the video into the wavelet domain. This reference information is based on the human visual system (HVS), i.e., it is the information that can easily be extracted by the human brain from a video sequence. The same information is then extracted from the impaired video sequence.
The two quantities are then combined in order to measure the visual quality of the distorted image. Pixel-based VIF (VIFP) [33] is a lower complexity version of the VIF metric. The information extracted from the reference and distorted videos is based on the pixels of each frame of the video sequences. Universal quality index (UQI) [34] measures the structural impairments in a video sequence and then maps these measurements to a model that can predict the visual quality based on these degradations. Noise quality measure (NQM) [35] considers the variation in contrast sensitivity, local luminance mean and contrast measures of the video sequence. This metric is a weighted signal-to-noise ratio measure between the reference and the processed video sequence. The weighted signal-to-noise ratio (WSNR) [35] metric uses a contrast sensitivity function (CSF) and defines WSNR as the ratio of the average weighted signal power to the average weighted noise power. It is measured on the dB scale.
The FR metrics explained in this section are freely available online for research and academic purposes. A full understanding of the mathematical models of these metrics can be found in their corresponding publications. These FR metrics are simulated in this paper for comparison purposes with the recommended simulation parameters taken from the corresponding publications.
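As a concrete illustration of the simplest metric in this list, the following minimal sketch computes MSE and PSNR for a single pair of 8-bit grayscale frames. The frames are small made-up arrays, and the function names `mse` and `psnr` are chosen here for illustration; they are not taken from any of the cited implementations.

```python
import math

def mse(reference, distorted):
    """Mean squared error between two equally sized frames (lists of rows)."""
    total = 0.0
    count = 0
    for ref_row, dis_row in zip(reference, distorted):
        for r, d in zip(ref_row, dis_row):
            total += (r - d) ** 2
            count += 1
    return total / count

def psnr(reference, distorted, peak=255.0):
    """PSNR in dB for a given peak value; infinite for identical frames."""
    err = mse(reference, distorted)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)

# Made-up 2x2 frames, for illustration only.
ref = [[52, 55], [61, 59]]
dis = [[54, 55], [61, 57]]
print(mse(ref, dis), psnr(ref, dis))
```

For a video, a per-frame PSNR computed this way would typically be averaged over all frames; that aggregation step is an assumption here and is not specified in this section.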
Based on the presented survey of related works, and to the best of the authors' knowledge, there has so far been no VQM that is specifically designed for WCE videos. As emphasized in earlier sections, GI bleeding is the most common type of abnormality that occurs in the GI tract of human beings. An efficient bleeding detection algorithm is needed that can highlight the bleeding regions or pixels in WCE videos. The diagnostic quality of a WCE video containing bleeding regions mainly depends on the detection of such regions. We combined a number of observations from the detailed subjective and objective VQA study presented in [3] with the bleeding detection algorithm presented in [14]. The basic flowchart of the QI-BRiCE model can be seen in Fig. 2. The proposed quality estimation method, QI-BRiCE, takes into account the following estimates to build a quality index:
• Motion estimates between the pristine video and the impaired video, which are used to build a motion-quality model. Motion estimation models are available in [36,37].
• Bleeding pixels' estimates of the pristine and impaired videos, which are used to build a quality model for detected bleeding regions using the method in [14].
• Quality estimates of non-bleeding video frames using the VIFP FR-VQM [33], which is the best performing metric for WCE videos based on the results from [3].
The next section contains detailed explanation and step by step implementation of our proposed method QI-BRiCE.
The rest of the paper is organized as follows: Section 3 contains all the necessary theoretical and mathematical details about the proposed FR-VQM. Section 4 encompasses the details about the preparation of the video datasets that are used for the model's evaluation and validation in Section 5. A comparison with other contemporary methods is also provided in Section 5. Statistical significance tests, for further validating the proposed model as compared to other VQMs, are provided in Section 6. A brief discussion of the results, the conclusion and the future work are provided in Sections 7-9, respectively.

Proposed VQM: QI-BRiCE
The contemporary FR-VQMs presented in Section 2 are designed to evaluate the visual or perceptual quality of a video. As these methods are not application specific, they are considered general purpose FR-VQMs. For quality estimation of medical data, application specific VQMs are needed and, so far, there has been limited work in this field of research [25]. In this section, we present an FR method that jointly estimates the visual and diagnostic quality of WCE videos that contain bleeding regions.

Motion Quality Estimates
In Fig. 3, it can be observed that there is a significant difference between a compressed WCE video and an original one. The compressed WCE video was compressed at QP 41 using HEVC. This shows that compression clearly affects the temporal information between frames of a video. In order to measure the degradation due to compression in WCE videos, we use the frame difference information between consecutive frames of WCE videos. The frame difference is calculated using (1) and (2), and it gives an estimate of how much motion degradation has occurred between consecutive frames of the compressed WCE video under consideration. To decrease the computational complexity of this process, we first convert the pristine and impaired videos into a binary color space, in which each pixel is represented by a single bit (0 or 1). Compared to the natural RGB color space, where each pixel is represented by 24 bits, the binary color space requires significantly less computational time and provides the same level of accuracy, as shown in [2]. The motion degradation is thus estimated by first calculating the frame difference between consecutive frames for the original and impaired WCE videos as follows.
where $M_R(n)$ and $M_D(n)$ are the frame difference calculations for the original/reference and impaired/distorted WCE videos, respectively. $B_R$ and $B_D$ represent the $n$th frame of the reference and distorted video in the binary color space, $i, j$ are the coordinates of each pixel and $N$ is the total number of frames in the clips, with $n = 1, 2, 3, \ldots, N$. The frame difference measurements for reference and distorted videos shown in Fig. 4 were plotted using (1) and (2). Further, in (3) and (4) we take the mean of the frame difference measurements from the previous equations.
$avgM_R$ and $avgM_D$ are the means of the motion estimates for the reference and distorted videos, respectively. Finally, subtracting the mean motion estimate $avgM_D$ of the distorted video from the mean motion estimate $avgM_R$ of the reference video, we get the average motion degradation for the distorted video, as given in (5). In order to have a model for the motion quality $Q_M$, we simply subtract the average motion degradation from 1, as shown in (6).
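The steps described in this subsection can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that the frame difference is the count of pixels that change between consecutive binary frames, that the count is normalized by the frame size so that the result lies between 0 and 1, and that the absolute value of the difference of the means is used as the degradation. The function names and the toy frames are hypothetical.

```python
def frame_difference(frames):
    """Fraction of pixels that change between each pair of consecutive frames.

    `frames` is a list of binary frames, each a list of rows of 0/1 values.
    The per-pixel normalization is an assumption made for this sketch.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        changed = sum(
            abs(c - p)
            for prev_row, curr_row in zip(prev, curr)
            for p, c in zip(prev_row, curr_row)
        )
        pixels = sum(len(row) for row in curr)
        diffs.append(changed / pixels)
    return diffs

def motion_quality(ref_frames, dist_frames):
    """Motion quality as 1 minus the average motion degradation."""
    m_r = frame_difference(ref_frames)
    m_d = frame_difference(dist_frames)
    avg_m_r = sum(m_r) / len(m_r)
    avg_m_d = sum(m_d) / len(m_d)
    # Absolute value is an assumption so that the result never exceeds 1.
    degradation = abs(avg_m_r - avg_m_d)
    return 1.0 - degradation

# Toy 2x2 binary clips with three frames each, for illustration only.
ref_clip = [[[0, 0], [0, 0]], [[1, 0], [0, 0]], [[1, 1], [0, 0]]]
dist_clip = [[[0, 0], [0, 0]], [[0, 0], [0, 0]], [[1, 0], [0, 0]]]
print(frame_difference(ref_clip), motion_quality(ref_clip, dist_clip))
```

In this toy case the distorted clip loses half of the motion present in the reference clip, which lowers the motion quality below 1; an undistorted clip compared with itself gives exactly 1.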
Next, we calculate the diagnostic quality estimates using the detected bleeding regions from the WCE videos.

Quality Estimates for Bleeding-Pixels
In this section, a model for quality estimation of bleeding pixels is presented, which mainly depends on the bleeding detection process. We use the method presented in [13] for the detection of bleeding regions in WCE videos; an overview of this method is given in the next subsection.

Bleeding Detection in WCE Videos
The bleeding detection method presented in [13] uses color threshold analysis along with an optimal support vector machine (SVM) classifier to identify bleeding regions in WCE videos. Threshold analysis is used in the HSV color space to build up features for the training of the SVM classifier. The trained SVM classifier accurately distinguishes between bleeding and non-bleeding regions in WCE videos. A flowchart of this method is given in Fig. 5, where the training of the SVM based model is also shown. The details of this method can be found in the corresponding publication [13].
The next subsection presents the quality estimation model for bleeding regions.

Quality Estimation for Bleeding Pixels
From the bleeding detection method explained briefly in the previous subsection, we use the information about the detected bleeding pixels. As the bleeding regions in the WCE video frames are the ones used for the diagnosis of different GI tract diseases, quality estimation for such frames is more important than for non-bleeding frames. Using (7) and (8), we calculate the number of bleeding pixels from all the detected frames that contain bleeding traces. Fig. 6 contains examples of WCE video frames that contain bleeding regions and their corresponding bleeding detection results using the method in [13].
Figure 5: Flowchart of the bleeding detection method [13]. (a) Training model for the bleeding detection method. (b) Bleeding detection method.
Figure 6: An example result of the bleeding detection method in [13].
$R_{BP}$ and $D_{BP}$ are the average number of bleeding pixels, from all the WCE video frames that contain bleeding regions, for the reference and distorted WCE video, respectively.
The next step in quality estimation for bleeding regions is to calculate the ratio between $D_{BP}$ and $R_{BP}$, as shown in (9). The maximum value of this ratio is 1 and the minimum is 0. A value of 1 means that $R_{BP} = D_{BP}$, i.e., the bleeding detection method [13] gives the same results for the reference and distorted WCE videos.
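The ratio described above can be sketched in a few lines. The sketch assumes that a bleeding detector has already produced a binary mask per frame (1 for a bleeding pixel, 0 otherwise); the detector itself is not reproduced here. The function names, the toy masks, and the clamping of the ratio to 1 are assumptions made for illustration.

```python
def average_bleeding_pixels(masks):
    """Average bleeding-pixel count over the frames that contain bleeding.

    `masks` is a list of binary masks (lists of rows of 0/1 values),
    assumed to come from a bleeding detector.
    """
    counts = [sum(sum(row) for row in mask) for mask in masks]
    bleeding_counts = [c for c in counts if c > 0]
    if not bleeding_counts:
        return 0.0
    return sum(bleeding_counts) / len(bleeding_counts)

def bleeding_quality(ref_masks, dist_masks):
    """Ratio of distorted to reference average bleeding pixels, capped at 1."""
    r_bp = average_bleeding_pixels(ref_masks)
    d_bp = average_bleeding_pixels(dist_masks)
    if r_bp == 0:
        raise ValueError("reference video has no detected bleeding pixels")
    # Capping at 1 is an assumption, matching the stated maximum of the ratio.
    return min(d_bp / r_bp, 1.0)

# Toy masks for three 2x2 frames; the middle frame has no bleeding.
ref_masks = [[[1, 1], [1, 0]], [[0, 0], [0, 0]], [[1, 1], [1, 1]]]
dist_masks = [[[1, 0], [1, 0]], [[0, 0], [0, 0]], [[1, 1], [1, 0]]]
print(bleeding_quality(ref_masks, dist_masks))
```

Here the detector finds fewer bleeding pixels in the distorted clip than in the reference clip, so the ratio drops below 1.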
Using (6) and (9), we can formulate the diagnostic quality estimation model for WCE videos as follows.
Now, we have a diagnostic quality estimation model in (10) which serves as the measure of diagnostic quality for the WCE videos containing bleeding traces. Next, we calculate the visual quality of the WCE videos.

Quality Estimates for Non-Bleeding-Pixels
From the observations in Section 4.1, we found that the contemporary FR-VQM visual information fidelity for pixels (VIFP) is the best performing metric in terms of correctly estimating the quality of WCE videos. Although VIF [33] and IFC [32] were among the best performing metrics as well, they require more computational time than VIFP [33].
So, for the non-bleeding regions in WCE videos, i.e., for calculating the visual quality $Q_{Vis}$, we have used VIFP for all the frames that do not contain bleeding regions.
where $F_{NBP}$ represents the frames that do not contain bleeding regions and $N_{NBP}$ is the total number of such frames. Using (11), we estimate the visual quality of the WCE videos, and we can now move to the final quality estimation for the WCE videos.

Quality Metric: QI-BRiCE
The joint estimation of diagnostic and visual quality is performed by taking a product of $Q_{Diag}$, from (10), and $Q_{Vis}$, from (11). As the diagnostic information is of high importance in medical videos, we assign different weights to $Q_{Diag}$ and $Q_{Vis}$, where $Q_{Diag}$ contains the diagnostic quality estimates.
Using (12), we can estimate the overall quality of a WCE video that contains bleeding regions. We calculated the best values for the weighting factors $w_1$ and $w_2$ based on the highest degree of overlap between the subjective measurements and our proposed model's predictions. For optimal performance of our model, we assigned 70% weightage to $Q_{Diag}$ and 30% to $Q_{Vis}$, so that $w_1 = 0.7$ and $w_2 = 0.3$. The weightage reflects the fact that, in medical videos, diagnostic quality is more important than visual or perceptual quality. In the following sections, we evaluate the performance of the presented FR-VQM against other contemporary FR-VQMs.
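The weighting can be illustrated with a short sketch. The exact form of (12) is not reproduced in this text; the sketch assumes the weighted-sum combination that the Conclusion describes, with the weights 0.7 and 0.3 stated above. The function name `qi_brice` and the input values are hypothetical.

```python
def qi_brice(q_diag, q_vis, w1=0.7, w2=0.3):
    """Weighted combination of diagnostic and visual quality.

    Assumes a weighted sum; the weights are expected to add up to 1.
    """
    if abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w1 * q_diag + w2 * q_vis

# Hypothetical component scores in the range 0 to 1.
score = qi_brice(q_diag=0.8, q_vis=0.6)
print(round(score, 2))
```

With these weights, a drop in the diagnostic component lowers the overall score more than an equal drop in the visual component, which reflects the stated priority of diagnostic quality.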

Subjective Tests
The subjective tests and their corresponding results used to evaluate our presented model in this paper are thoroughly presented in [3]. In this section, we briefly explain the WCE video dataset, subjective tests, scores and the corresponding results.
For the evaluation of our proposed method, we have used 2 original WCE videos containing bleeding regions. These two pristine videos correspond to the diseases Angiodysplasia and Phlebectasia, which are the most common GI bleeding diseases, as shown in Fig. 7. These videos were compressed using the HEVC video encoder (HM 8.0 software) [3] at eight different compression levels. The compression level was controlled using the quantization parameter (QP) in HEVC. We compressed the videos at QP values of 27, 29, 31, 33, 35, 37, 39 and 41. This gives 16 processed video clips, or 18 videos in total including the two pristine videos. The details about these videos are given in Tab. 1.
The observers in our subjective tests consisted of 6 experts (clinicians) and 19 non-experts; the description and selection process of these participants is provided in [3]. As explained in [3], after screening of the observers, one non-expert's measurements were discarded. The DSCQS type-II method was used to evaluate the quality of the clips on a 5-point continuous rating scale ranging from 1 to 5. In the DSCQS type-II method, a participant is shown two videos, an original and a processed video, at the same time, but the participant is unaware which one is the original. As the participants view both clips, they are asked to rate them on the scale separately.
The recorded opinion scores (OS) on the five-point rating scale are converted to a normalized scale that ranges between 0 and 100. The details about the scoring method are provided in [3].

Performance Evaluation and Results
To evaluate the performance of our method QI-BRiCE, we have used the correlation analysis i.e., correlation between the expert and non-expert subjective measurements and the model's predictions. A high correlation means the performance of the proposed method is good and vice versa.
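The two correlation measures used in this analysis, PLCC and SROCC, can be computed as in the following minimal sketch. The scores are made-up numbers, not results from this study, and the helper names are chosen here for illustration. SROCC is computed as the Pearson correlation of the ranks, with tied values sharing their mean rank.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up metric outputs and DMOS values for five clips.
predictions = [0.9, 0.8, 0.6, 0.5, 0.3]
dmos = [10, 20, 35, 50, 70]
print(pearson(predictions, dmos), spearman(predictions, dmos))
```

In this toy data the metric output falls as DMOS rises, so both coefficients are negative; when reporting results, the magnitude of the correlation is what indicates agreement between the metric and the subjective scores.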
Furthermore, we have used a third-order polynomial curve fitting model to improve the performance of our proposed method. This is done by fitting the output of our proposed method to the curve fitting model, which results in better quality prediction. In the presented work, the fitting is done using robust least squares regression with bisquare weights. The curve fitting is performed for both the expert and non-expert measurements and the results are shown in Fig. 8.
Further, Tab. 2 shows the results in terms of the Pearson linear correlation coefficient (PLCC) and the Spearman rank order correlation coefficient (SROCC). It can be observed that our method exhibits the highest correlation in terms of both PLCC and SROCC. To further assess the performance of our method, we have performed statistical significance tests to see which objective metric is statistically superior to the others. We have used the F-test, which is based on the errors between the average DMOS and the objective metrics' predictions. For a particular objective metric, this test leads to one of three conclusions, i.e., the metric is statistically superior, inferior or equal to another metric. Similar tests have been conducted in [1][2][3].
Note: A symbol value of "1" indicates that the statistical performance of the VQA model in the row is superior to that of the model in the column. A symbol value of "0" indicates that the statistical performance of the model in the row is inferior to that of the model in the column and "=" indicates that the statistical performance of the model in the row is equivalent to that of the model in the column.
In an F-test, the ratio of the variance of the residual error from one objective metric to that of another metric is calculated. Using (13) [3], as follows, the residual errors between the objective metric predictions and the DMOS are calculated.
where $S^{fitted}_j$ represents the fitted score of the objective VQA model for the $j$th WCE clip, $DMOS_j$ represents the DMOS for the same clip and $N$ is the total number of WCE clips.
where var represents variance and the F-test is applied to this ratio at the 95% confidence level. In an F-test, the null hypothesis states that the variances of the error residuals of two objective metrics are equal. If the null hypothesis is rejected, it is concluded that one of the metrics is superior to the other. The ratio calculated using (14) is compared to an F-critical value.
The metric with higher variance in error residuals is kept in the numerator while calculating the F-ratio in (14). If the F-ratio is greater than the F-critical value then it is concluded that the metric in denominator is superior to the metric in numerator, hence the null hypothesis is rejected.
The results for the F-test are given in Tab. 3. The F-critical value can be calculated using the significance level and the number of video clips. From Tabs. 3a and 3b, it can be summarized that the performance of QI-BRiCE, for expert DMOS, is statistically superior to that of PSNR, SSIM, MSSSIM, VSNR, UQI, NQM and WSNR but it is statistically equivalent to the performance of VIF, VIFP and IFC.

Conclusion
In this paper, we have presented a novel FR VQM, QI-BRiCE, that estimates the diagnostic and perceptual quality of impaired WCE videos containing bleeding regions. The diagnostic quality is measured by considering motion quality estimates and bleeding regions' quality estimates, whereas perceptual quality is measured using the contemporary VQM VIFP. Diagnostic and perceptual quality are then combined using a weighted sum approach. The method outperforms other contemporary FR VQMs in terms of PLCC and SROCC. In addition, QI-BRiCE is statistically superior to most of the FR VQMs.

Future Work
Potential future extensions of the presented work include, but are not limited to, the following. The method can be enhanced to include other anomalies in WCE videos, such as various types of GI tumors. Other anomaly detection approaches can be combined with the proposed method in order to estimate the quality of WCE videos containing types of anomalies other than GI bleeding.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.