Reduced-reference Video Quality Metric Using Spatial Information in Salient Regions

In multimedia transmission, it is important to rely on an objective quality metric which accurately represents the subjective quality of processed images and video sequences. Maintaining acceptable Quality of Experience in video transmission requires the ability to measure the quality of the video seen at the receiver end. Reduced-reference metrics make use of side-information that is transmitted to the receiver for estimating the quality of the received sequence with low complexity. This attribute enables real-time assessment and the detection of visual degradation caused by transmission and compression errors. A novel reduced-reference video quality metric, known as the Spatial Information in Salient Regions Reduced Reference Metric, is proposed. The proposed approach makes use of spatial activity to estimate the distortion of the received sequence after concealment. The statistical elements analysed in this work are based on extracted edges and their luminance distributions. Results show that the proposed edge dissimilarity measure correlates well with the DMOS scores from the LIVE Video Database.


Introduction
In emerging multimedia systems and applications, user requirements go beyond connectivity, and users now expect services to meet their requirements on quality. It is therefore important to evaluate the quality of the received video sequence with minimal reference to the transmitted one. In this paper, a novel reduced-reference video quality metric using an edge-based feature on salient regions is developed and analysed. The metric relies on saliency as well as on spatial information (SI) differences within frames of the transmitted and received videos. The resulting video quality measure is known as the Spatial Information in Salience Region Reduced Reference Metric (SISRR). The aim of this work is to develop an improved reduced-reference (RR) video quality metric (VQM) that combines the spatial information values with the saliency maps of the transmitted and received sequences to estimate the quality of the received video in the presence of errors.
Regions in a video frame are considered salient if they attract the visual attention of the viewer. Visual attention has been investigated in numerous research fields such as cognitive psychology, neuroscience and computer vision [1]. Saliency models can generally be divided into two categories based on their applications. Some saliency detection models are developed to predict human fixations [2]-[9] and others are developed to identify salient regions or objects [10]-[14]. Fixation prediction models generally generate sparse, separated salient regions, whereas salient object detection produces smooth, connected salient regions. In video quality assessment, salient object detection is the approach most often employed. The aim of this paper is to design a reduced-reference quality metric that makes use of salient region detection. Each pixel in a saliency map represents the importance of the object that contains the pixel in the scene. A saliency map allows high-resolution analysis of the most relevant parts of the visual field. Therefore, image or video processing guided by a saliency map can be very efficient when processing complex scenes, since it concentrates on the regions that would be fixated by the fovea [15].

Previous Work
The visual quality assessment task may seem simple, but it involves a set of complex mechanisms that are not completely understood. Koch and Ullman [16] proposed a model for generating saliency maps that became one of the most influential models in the field. The model has since been extended, and more advanced maps have been proposed over time [3,17]. Fundamentally, these models find the dissimilarity of intensity, colour, orientation and other factors to determine the saliency of the image or frame. For each property, the dissimilarity values are summed over several scales to form conspicuity maps. The conspicuity maps are then normalised and merged to construct the saliency map. From the saliency map, maxima are identified based on the thresholds designated for the models. Other approaches define saliency as local complexity, where scale-localised features with high entropy are measured [18]. Another approach measures saliency using several variations of a local symmetry operator, as proposed in [19]. That approach was observed to agree better with human perception than the contrast saliency work of [3].
The work in [20] considered prior high-level knowledge instead of the typical bottom-up process when computing saliency maps. The high-level knowledge consists of a combination of trained decision trees, where windows of different sizes are grouped into different object classes. Pixels grouped as non-background are then considered to be in the saliency map.
There are two general approaches to determining low-level saliency: biological models and computational models [21]. These approaches can be further divided according to the number of scales involved in the model algorithm, either a single scale or a number of scales. Salient region detection is based on a contrast determination filter applied over a few scales to produce several saliency maps. These maps are then pooled to create the final saliency map. This method has previously been used in [21] and has produced good results in detecting salient regions. The segmentation would be based on the hill-climbing algorithm. The novelty of this work is to find saliency maps that incorporate features relevant to the human visual system (HVS), namely the edges as well as the spatial energy of the whole frame. In general, all saliency detection methods are based on finding the local contrast of the image or frame by comparing regions using different features. These features consist of colour, intensity and orientation. Typically, each feature produces its own saliency map, and the combination of all the feature maps generates the final saliency map [22]-[25]. An approach that finds the saliency map using center-surround differences over several feature maps of colour, intensity and orientation is proposed in [23]. This method reduced the computational time by using integral images in Visual Object Detection with a Computational Attention System (VOCUS). However, because the feature saliency maps are reduced to a lower scale, the final saliency map has a lower resolution and loses data. In contrast, the work done in this paper maintains the resolution by resizing the feature saliency maps at each scale. In [24], Hu et al. presented a thresholding analysis approach instead of a scale-space one. The approach uses histogram entropy thresholding of colour, intensity and orientation. In that work, the measures used are based on the spatial compactness measure as well as the saliency density. The spatial compactness measure is obtained by rounding up the exterior body of the salient regions, and the saliency density is a function that weighs each individual magnitude of the saliency features before combining them. A biologically driven computational model of saliency-based spatial attention is proposed in [25]. The features taken into account to generate the saliency maps are luminance, colour and orientation at different scales. These magnitudes are then aggregated and combined for each location in the image and, following a bottom-up approach, are merged to generate the final saliency map. The work in [22] proposed a local contrast-based method that generates the saliency map using a single-scale operation and does not consider any biological model. The input to the operation is an image that has been resized, colour quantised using the CIELab space and then subdivided into pixel blocks. The proposed operation sums the differences between image pixels and their surrounding pixels within a small neighbourhood, which produces the output saliency map.

Proposed Method
The proposed RR metric generates saliency maps using low-level features as well as the spatial information of the frames of both the reference and the distorted sequences. The algorithm for finding salient regions consists of computing the saliency map at different scales of the input image or frame, adding the saliency maps from the different scales at each pixel, and then averaging and normalising the added values. This process generates the comprehensive visual saliency map. Figure 1 shows the flow diagram of the proposed SISRR. No segmentation is performed in this work, because the aim is to produce an RR video quality metric using the information from the salient regions combined with the SI, and segmentation is not needed for that purpose. In this work, the local contrast of a region within a video frame with respect to its neighbourhood is used to determine the saliency. The local contrast is found by calculating the distance between the average feature vector of the pixels within the region and the average feature vector of the pixels in its neighbourhood, as suggested by Achanta et al. [21], who used two low-level features, colour and luminance. The method is easy to implement, noise tolerant and fast to compute compared to other, more complex saliency models. The saliency detection captures the salient regions successfully, and the region scope is neither too limited nor too wide [26]. The flow diagram of the visual computation is shown in Figure 2.
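As an illustration of the multi-scale local contrast computation described above, the following Python sketch may help. It is not the authors' implementation: the function name `contrast_saliency`, the window sizes in `inner` and `outer_sizes`, and the use of `scipy.ndimage.uniform_filter` for the region means (in place of integral images, which yield the same box averages) are all assumptions made for this example.

```python
import numpy as np
from scipy.ndimage import uniform_filter


def contrast_saliency(lab, inner=1, outer_sizes=(9, 17, 33)):
    """Multi-scale local-contrast saliency for an (H, W, C) feature image.

    `lab` is assumed to hold one CIELab-like feature vector per pixel.
    `inner` and `outer_sizes` are illustrative window widths, not values
    taken from the paper.
    """
    lab = np.asarray(lab, dtype=np.float64)
    # Mean feature vector of the inner region R1 (no filtering across channels).
    mean_inner = uniform_filter(lab, size=(inner, inner, 1), mode="nearest")
    saliency = np.zeros(lab.shape[:2])
    for size in outer_sizes:
        # Mean feature vector of the surrounding region R2 at this scale.
        mean_outer = uniform_filter(lab, size=(size, size, 1), mode="nearest")
        # Contrast: Euclidean distance between the two mean vectors.
        saliency += np.linalg.norm(mean_inner - mean_outer, axis=2)
    saliency /= len(outer_sizes)  # average over the scales
    peak = saliency.max()
    # Normalise to [0, 1]; a flat frame has no contrast and stays at zero.
    return saliency / peak if peak > 1e-12 else saliency
```

Because the filter window is scaled rather than the image, the returned map has the same height and width as the input frame.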
For each scale, the contrast value $c_{i,j}$ is determined as the distance $D$ between the average feature vectors of the region $R_1$ and the region $R_2$. The coordinate $(i, j)$ is the pixel position within the frame, as shown in Figure 3, and the contrast value can be calculated as follows:
$$c_{i,j} = D\left[\frac{1}{N_1}\sum_{p=1}^{N_1}\mathbf{v}_p,\ \frac{1}{N_2}\sum_{q=1}^{N_2}\mathbf{v}_q\right]$$
where $N_1$ and $N_2$ are the number of pixels in regions $R_1$ and $R_2$ respectively, and $\mathbf{v}$ is the vector of feature elements corresponding to a pixel. In order to generate feature vectors for colour and luminance, the CIELab colour space [28] is used. The average feature vector values of $R_1$ and $R_2$ are computed by using the integral image approach as applied in [29]. In this method, filter region scaling is performed instead of image scaling. This allows saliency maps to be generated with the same size and resolution as the input. Filtering is performed at three different scales for each frame, as seen in Figure 3, and the final saliency map is determined as the summation of saliency values across all three scales $S$:
$$m_{i,j} = \sum_{S} c_{i,j}$$
where $m_{i,j}$ is an element of the summation saliency map $M$. The saliency map itself is obtained by pixel-wise summation of saliency values across the scales.
Another feature to be extracted is the spatial information (SI) itself. The SI measurement evaluates the amount of spatial detail. It is closely related to the perception of the human viewer, who notices changes or distortions that occur spatially. The SI measurement is standardised in ITU-T Recommendation P.910. The measurement has low computational complexity, as it is calculated using a well-known technique, the Sobel filter. The Sobel filter is a simple high-pass, edge-enhancement digital filter which is widely used in image processing. In short, SI is an indicator of edge energy. In order to calculate the value of SI for one video frame, a Sobel filter is first applied to the luminance values. The SI value of frame $F_n$ at time $n$ is then equal to the standard deviation of the image resulting from convolving frame $F_n$ with the Sobel kernel:
$$SI_n = \mathrm{std}_{space}\left[\mathrm{Sobel}(F_n)\right]$$
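The per-frame SI computation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the Sobel output is taken to be the gradient magnitude from the horizontal and vertical kernels, and the helper `salient_edge_map`, together with its `threshold` parameter, is a hypothetical way of restricting edge energy to salient pixels. The paper does not give the exact rule used to combine SI with the saliency map, so that helper should not be read as the authors' method.

```python
import numpy as np
from scipy.ndimage import sobel


def spatial_information(luma):
    """Return the Sobel gradient magnitude of a luminance frame and its SI.

    SI is taken as the standard deviation over all pixels of the
    Sobel-filtered frame.
    """
    luma = np.asarray(luma, dtype=np.float64)
    gx = sobel(luma, axis=1, mode="nearest")  # horizontal gradient
    gy = sobel(luma, axis=0, mode="nearest")  # vertical gradient
    magnitude = np.hypot(gx, gy)
    return magnitude, float(magnitude.std())


def salient_edge_map(luma, saliency, threshold=0.5):
    """Hypothetical combination: keep edge energy only where saliency >= threshold."""
    magnitude, _ = spatial_information(luma)
    return np.where(np.asarray(saliency) >= threshold, magnitude, 0.0)
```

A frame with uniform luminance yields an SI of zero, while any frame containing edges yields a positive value.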

Simulation Setup
The test sequences were obtained from the LIVE Video Quality database [30]. The proposed metric is tested against the subjective quality score, DMOS, provided with the database. The DMOS values range from 0 to 100, where a smaller value indicates better quality and a larger value indicates worse quality, and were collected using the subjective test model specified in ITU-R BT.500-11. The subjective study was conducted using a single stimulus procedure with hidden reference removal, and the subjects indicated the quality of the video on a continuous scale. Subjects also viewed each of the reference videos to facilitate the computation of difference scores using hidden reference removal. From the database, 80 distorted video sequences were obtained from 10 different high-quality reference videos with a wide variety of content.
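The idea of difference scores under hidden reference removal can be illustrated with a short Python sketch. The function name `dmos_from_scores` and the array layout are assumptions made for this example, and any per-subject normalisation or rescaling applied when the database scores were prepared is deliberately omitted.

```python
import numpy as np


def dmos_from_scores(ref_scores, dist_scores):
    """Average difference scores over subjects for one reference video.

    ref_scores:  shape (subjects,), each subject's score for the hidden reference.
    dist_scores: shape (subjects, videos), the same subjects' scores for the
                 distorted versions of that reference.
    """
    ref = np.asarray(ref_scores, dtype=np.float64)[:, None]
    diff = ref - np.asarray(dist_scores, dtype=np.float64)  # larger = worse quality
    return diff.mean(axis=0)
```

With this sign convention a larger value indicates a larger quality loss, which matches the DMOS interpretation given above.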
The set of 80 distorted videos covers two different distortion types: H.264 compression and simulated transmission of H.264 compressed bitstreams through error-prone wireless networks, as these types of distortion relate most closely to the work performed in this paper. The diversity of distortion types tests the ability of the proposed objective model to predict visual quality consistently across distortions. The H.264 compression system produces fairly uniform spatial and temporal distortions in the video. Network losses, however, cause transient distortions in the video, both spatially and temporally. The H.264 compressed videos exhibit typical compression artifacts such as blur, blocking, ringing and motion compensation mismatches around the edges of the main body in the frame. Videos affected by wireless transmission errors exhibit errors that are restricted to small regions of a frame. Owing to the small packet sizes, errors sustained by an H.264 compressed video stream in a wireless environment are spatio-temporally localised or temporally transient, and appear as glitches in the video. A packet transmitted over a wireless channel is susceptible to transmission errors due to various factors such as shadowing, attenuation, fading and multiuser interference.
All ten uncompressed high-quality YUV sequences used have a resolution of 768 x 432 pixels. Each sequence was assessed by 29 valid human subjects in a single stimulus study where the scores are based on a continuous quality scale. The DMOS from the subjective evaluations are compared with the differences between the transmitted and received comprehensive visual saliency maps. The scope of this work is mainly streaming video over a multicast network, which requires the transported bitstream to be decoded and displayed in real time. In this work, all of the frames are used to determine the most suitable similarity measure. However, only the last reference frame in each GOP, from both the reference and received sequences, is used as input to the proposed quality assessment system. This keeps the overhead bit rate as low as possible and makes the scheme practical and realistic for real-time wireless transmission over a multicast network. The system outputs a value that quantifies the quality of the distorted sequence. The LIVE Video Database has been evaluated by many researchers and has been verified with various objective performance metrics [32]-[41].
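One way to select the last reference frame of each GOP is sketched below in Python. The sketch assumes that the frame types are available as a string of `I`, `P` and `B` characters, that GOPs have a fixed length, and that "reference frame" means an I or P frame; the function name `last_reference_frames` is hypothetical and none of these details are specified in the text.

```python
def last_reference_frames(frame_types, gop_size):
    """Return the index of the last I or P frame in each fixed-length GOP."""
    picks = []
    for start in range(0, len(frame_types), gop_size):
        gop = frame_types[start:start + gop_size]
        refs = [i for i, t in enumerate(gop) if t in ("I", "P")]
        if refs:  # skip a GOP that contains no reference frame
            picks.append(start + refs[-1])
    return picks


# Two GOPs of nine frames each.
print(last_reference_frames("IBBPBBPBB" * 2, gop_size=9))  # [6, 15]
```

Only the features of the selected frames would then need to be sent as side-information, which is what keeps the overhead bit rate low.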

Result Analysis
The first analysis performed is the correlation between the DMOS and the quality measure resulting from the differences between the reference and the distorted sequences. The correlation coefficients between the two are used to compare the relative performance of the metrics in terms of prediction accuracy, monotonicity and consistency. Table 1 shows the linear correlation coefficient (LCC) of every objective quality metric for each sequence, together with its average. Averaged across all sequences, SISRR, with an LCC of 0.851, outperforms SSIM and VIFP, whereas PSNR has the highest correlation at 0.901. This shows that with the additional information provided by the saliency features, the quality measure correlates better with the subjective quality scores than when only the edge feature is used. Table 2 shows the LCC of all objective models for each distortion type. The LCCs of all sequences are averaged for each distortion type, and the overall average across both distortions shows that SISRR, with an LCC of 0.941, outperforms PSNR and VIFP, whereas SSIM has the highest correlation at 0.943. This is another improvement for SISRR over EDIRR, whose average LCC across both distortions is 0.938. Table 3 compares the performance of all objective models using the LCC, the Spearman rank-order correlation coefficient (SROCC) and the Kendall rank correlation coefficient (KRCC) for the entire LIVE Video Quality Database. These correlation coefficients are computed on each sequence and then averaged together to get the results. The results once again show that PSNR has the highest correlation, followed by the proposed metric, SISRR, which outperforms SSIM and VIFP. The results reported for the different distortion types in Table 2 also show that SISRR outperforms PSNR and VIFP on wireless distortion-induced images, and that it performs on a par with PSNR, SSIM and VIFP for H.264 compression distortions. It can be observed from Table 3 that these results are comparable with the verified results in [33], where extensive work has been done to compare subjective scores.
The performance of all objective quality assessment methods used in this paper was validated using metrics relating to prediction accuracy, monotonicity and consistency, as recommended in [42]. In addition, the proposed method exhibits very little complexity relative to all other methods (except PSNR), as shown in Table 4. Complexity was measured as the average execution time on a PC with an Intel i7-2600 CPU @ 3.40 GHz and was normalised relative to the execution time of PSNR. All test metrics were implemented in Matlab except MOVIE, which is implemented in C. Figure 4 shows the screenshots, the salient pixels, the SI pixels and the comprehensive visual saliency maps obtained with the proposed method on the Bluesky sequence from the LIVE database. From the results, it can be observed that the proposed method highlights only the edges within the salient regions. This observation is reflected in the correlations achieved with the DMOS, as edge differences have a higher perceptual significance to quality in terms of structural distortion.
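The three correlation measures used in this section can be computed with SciPy, as in the sketch below. The function name `correlation_report` is an assumption for this example, and the sketch applies no nonlinear mapping between the objective scores and the DMOS before the LCC is computed, a step that some evaluation protocols include.

```python
from scipy.stats import kendalltau, pearsonr, spearmanr


def correlation_report(objective, dmos):
    """Return LCC, SROCC and KRCC between objective scores and DMOS."""
    return {
        "LCC": float(pearsonr(objective, dmos)[0]),     # prediction accuracy
        "SROCC": float(spearmanr(objective, dmos)[0]),  # monotonicity
        "KRCC": float(kendalltau(objective, dmos)[0]),  # rank agreement
    }
```

Since a lower DMOS means better quality, a metric whose value rises with quality gives negative coefficients, so magnitudes are what should be compared across metrics.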

Conclusion
In this paper, the possibility of extracting video quality information with an RR video quality metric that analyses the SI and the salient regions using low-level features of luminance and colour is examined. A novel method named Spatial Information in Salience Region Reduced Reference Metric (SISRR) is proposed. The metric compares the combination of spatial and salient information of the original and distorted sequences, based on the idea that edge differences have a higher perceptual significance to quality in terms of structural distortion. The method is easy to implement, has low complexity and is fast enough to be used in real-time applications. The saliency maps have the same high resolution as the input frames.
Even though the results show that PSNR has the highest correlation with DMOS, it is followed by the proposed metric, SISRR, which outperforms SSIM and VIFP. The results obtained from SISRR show moderate correlations with the quality values estimated by a number of full-reference objective quality metrics, which indicates its suitability for simple, albeit less accurate, video quality assessment. It was also shown to outperform some full-reference metrics when tested on the wireless distortion part of the LIVE video database.


ISSN: 1693-6930 TELKOMNIKA Vol. 16, No. 3, June 2018: 965-973

Figure 4. Bluesky screenshot of the (a) reference (ref.) frame (b) tested frame. Saliency map of the (c) ref. frame (d) tested frame. SI filtered of the (e) ref. frame (f) tested frame. The comprehensive visual saliency map of the (g) ref. frame and (h) tested frame.

Table 1. The LCC between DMOS and SISRR, PSNR, SSIM and VIFP for each test sequence

Table 2. The LCC between DMOS and EDIRR, PSNR, SSIM and VIFP for different distortion sources

Table 3. Comparing the quality scores between SISRR and DMOS using LCC, SROCC and KRCC

Table 4. Comparison of the performance of video quality assessment (VQA) algorithms for wireless distortion (LIVE database)

Reduced-Reference Video Quality Metric Using Spatial .... (Farah Diyana Abdul Rahman)