Multi-metric approach for block-level video quality assessment

Developing an objective video quality metric that accurately estimates perceived video quality is challenging. Developing a metric that can additionally be embedded in the rate distortion optimization process of a video codec can be even harder given that decisions have to be made locally. In this paper, we present a method for combining a number of existing state of the art objective video quality metrics at the coding block level by employing a fusion of local content features for deciding how to best utilize the chosen metrics. Our results indicate promising performance in terms of the correlation of the developed locally-acting quality metric with the overall perceived quality of the video.


Introduction
Although video quality has been traditionally evaluated using Mean Squared Error (MSE), it is already known that it does not linearly correlate with the perceived quality due to the human visual system properties that are not captured by it [1]. The most reliable method to assess the quality of the compressed videos is through the subjective assessment of the perceived quality. This, however, for a real-time system is impractical due to the time constraints imposed. As a solution, many different objective quality metrics that purport to correlate well with perceived quality have been proposed. However, the performance of these metrics varies widely on different video content [2].
The literature is rich in quality metrics that claim better correlation with perceptual quality than MSE. These metrics were initially designed either for images, such as the Structural Similarity Index (SSIM) [1], Peak Signal to Noise Ratio based on HVS (PSNR-HVSM) [3], Multi-Scale SSIM (MS-SSIM) [4], Visual Information Fidelity (VIF) [5] and Feature Similarity Index (FSIM) [6], or for video, such as the Perception-based Video Metric (PVM) [7], the Motion-based Video Integrity Evaluation (MOVIE) index [8] and the Video Quality Metric (VQM) [9]. Although most of the aforementioned metrics correlate better with perceived quality than PSNR [10] for compressed video, they cannot operate as an integral part of the rate-distortion optimization (RDO) process, either because they are highly complex (e.g. MOVIE) or because they do not offer the additive property, i.e. the measured quality of a region is not equal to the sum of the measured quality of its parts. RDO addresses this problem by utilizing the sum of absolute differences (SAD) and sum of absolute transformed differences (SATD) metrics, which offer such a property up to the coding tree unit (CTU) level; RDO optimizations are performed on each level of block segmentation. However, this is limited to the size of the CTU. Several CTUs are never assessed together and therefore their collective score is never calculated for the purposes of RDO. Our work is a method for assessing the overall quality at a different segmentation level, as our method segments the CTUs based on their content characteristics. Some of the metrics above have been tested within an RDO framework. SSIM is a typical example of such an attempt (e.g. [11][12][13]), having been applied to RDO [14] and quantization [15]. SSIM is also an example of a metric that does not offer the additive property, which makes it difficult to use in the RDO process.
✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work.
It is also important to note that improving PSNR [16] by adapting it to subjective quality evaluation scores has been researched extensively. Choosing amongst all these metrics is a challenge in itself, as each offers a different level of performance for different content. One way to address metric selection is through the fusion of several metrics using machine learning techniques. VMAF [17] is a good example of a practical quality metric that fuses VIF [5], DLM [18] and motion information (i.e. frame differencing). Being trained on a large and varied dataset, VMAF shows higher correlation with subjective quality than other objective quality metrics. However, it evaluates the overall frame quality, which is not ideal in an RDO environment where block-level quality estimation is required.
Motivated by the above, this work introduces a block-level fusion of objective metrics for video quality assessment (BVQA). BVQA is the result of fusing state-of-the-art objective metrics based on their spatio-temporal content at a block level. A diagrammatic outline of the proposed method to develop BVQA models is depicted in Fig. 1. First, a small-scale study is performed on a set of best-performing objective metrics. Then, based on this, a content analysis takes place. In particular, content-based clustering of the video blocks is performed to group blocks with similar content features and quality. Considering this grouping of content and quality, different block-level quality prediction models are developed. The aim here is to identify the fusion of metrics that performs best in specific scenarios, as some metrics might perform better than others in relatively static scenes. BVQA does not aim to provide the equation that accurately describes subjective quality based on simple objective metrics but rather attempts to estimate it. Moreover, three categories of models with different levels of complexity are examined. All three categories consider the content features in the fusion of metrics into models and take advantage of the fact that the correlation of the objective metrics with the perceptual quality depends on the content features. To the best of our knowledge, this is the first time a content-driven fusion of objective metrics at a block level has been proposed in the literature.
The rest of the paper is arranged as follows: in Section 2 we conduct a small study on the performance of state-of-the-art (SOA) metrics at a block level. In Section 3, we perform a content analysis of the different blocks of the considered video sequences with the aim of identifying groups of content that have similar quality performance. Based on this, we introduce a content-driven multi-metric fusion approach at a block level in Section 4. Finally, conclusions are drawn in Section 5.

Quality evaluation at the block level
In recent years, the assumption of optimizing on short video clips (at a ''chunk'' or ''shot'' level) has been adopted either with respect to the trade-off between streaming performance and coding efficiency [19][20][21][22] or because of the trade-off between the presentation duration and the scoring for subjective quality purposes [23]. If we assume that for short-duration videos (up to 5 s) the spatial and temporal characteristics are consistent (within one shot), then the perceived quality after compression is expected to be effectively the same across all frames. Moreover, if we consider sequences with no apparent viewing pattern, the foveation effects are omitted and the perceived video quality is not expected to change dramatically within a frame. To this end, the dataset employed here is one with sequences of one shot and without an obvious focal point. This is the BVI_Texture dataset [24], which contains 20 full high definition (HD) video sequences at 60 fps and is annotated with differential mean opinion scores (DMOS). This specific dataset has been selected for two important reasons. Firstly, it satisfies the criterion for spatial and temporal homogeneity that allows the extrapolation of the content evaluation scores to the block level. Secondly, the subjective tests were performed in our lab and the raw subjective scores were available. Few datasets at HD resolution and 60 fps with no apparent viewing pattern also provide subjective assessment scores.
The sequences were encoded using HEVC HM 16.2 (CTC Low Delay mode) at four different compression levels (different quantization levels), and we then computed the value of seven objective quality assessment metrics for each block, as reported in Table 1. We would like to note that we did not use metrics that have shown better correlation with perceptual quality, such as PVM, MOVIE and VQM, due to their high complexity and their design to operate at a frame level. The sequences were divided into 64 × 64 pixel blocks, so that the size and positioning coincide with the block partitioning of HEVC HM (i.e. CTUs). This created a total of $11.52 \times 10^6$ data points (i.e. 20 sequences at four different compression levels, 300 frames per sequence, 480 blocks per frame), each pairing a subjective quality score with an objective quality metric value at a block level.
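The data-point count above can be checked with a short sketch. It assumes, as an inference from the reported 480 blocks per frame, that only blocks lying entirely inside the 1920 × 1080 frame are counted; the function name `block_origins` is illustrative and not part of the original work.

```python
# Sketch: count the 64x64 blocks per full HD frame and the resulting data points.
# Assumption: partial blocks at the bottom frame border are not counted.
FRAME_WIDTH, FRAME_HEIGHT = 1920, 1080
BLOCK_SIZE = 64  # same size as an HEVC CTU


def block_origins(width, height, block):
    """Top-left corners of all blocks that fit entirely inside the frame."""
    return [(x, y)
            for y in range(0, height - block + 1, block)
            for x in range(0, width - block + 1, block)]


blocks_per_frame = len(block_origins(FRAME_WIDTH, FRAME_HEIGHT, BLOCK_SIZE))

# 20 sequences x 4 compression levels x 300 frames x blocks per frame
data_points = 20 * 4 * 300 * blocks_per_frame

print(blocks_per_frame)  # 480
print(data_points)       # 11520000
```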
Looking into the raw data pairs prior to any processing, we report the absolute values of the linear (Lin) and rank (Rnk) correlation coefficients in Table 1 when the metrics are calculated at a per block (Blk), per frame (Frm) and per sequence (Seq) level. The metrics are calculated and fused at a block level, and the block-level correlation is computed there. Then, based on the segmentation of the frame (depending on the class to which each block belongs), a weighted average is computed per frame (weighted because of the different number of blocks per content class). At this level the frame correlation with the subjective scores is computed. Finally, all frame scores are averaged over the number of frames and the sequence-level correlation over all sequences is computed. In Table 1, we observe that in most cases the best performing metrics are FSIM and MS-SSIM. Another important observation is that the correlation coefficients increase as we move from the block level to the frame and then to the sequence level. This is expected because of the different distributions of the metric values at the different spatial levels. Furthermore, in order to give an idea of the complexity in terms of execution times, in the bottom row of Table 1 we report the relative average complexity of the metrics as ratios of the average execution time over the minimum average execution time. As can be seen, PSNR requires the lowest execution time on average. On the other hand, FSIM, which is one of the best performing metrics in this table, is also the most expensive in terms of execution time.
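The aggregation from block to frame to sequence level described above can be sketched as follows. This is a minimal standard-library illustration, not the code used for Table 1; the function names and the toy values are assumptions made for the example.

```python
from statistics import mean


def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Rank (Spearman) correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def frame_score(block_scores, block_classes):
    """Average of per-class mean scores, weighted by the class block counts."""
    per_class = {}
    for score, cls in zip(block_scores, block_classes):
        per_class.setdefault(cls, []).append(score)
    total = len(block_scores)
    return sum(len(v) / total * mean(v) for v in per_class.values())


def sequence_score(frame_scores):
    """Sequence-level score: mean of the frame scores."""
    return mean(frame_scores)


if __name__ == "__main__":
    # Hypothetical toy values: two frames of four blocks each.
    frames = [frame_score([0.9, 0.8, 0.7, 0.6], ["K1", "K1", "K2", "K2"]),
              frame_score([0.8, 0.8, 0.5, 0.5], ["K1", "K2", "K2", "K2"])]
    print(sequence_score(frames))
```

The same `pearson` and `spearman` functions would then be applied to the block, frame or sequence scores against the corresponding DMOS values, taking absolute values as in Table 1.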

Content analysis
In this section, we study the quality performance of video blocks with similar content features. To this end, we propose clustering the blocks into groups according to their content. As a first step, we calculate three spatio-temporal features for all blocks of the considered sequences, which help identify content characteristics. The selected features are edge entropy (EDGE_ENT), spatial information (SI) and temporal information (TI); the latter two are also used in the ITU-T P.910 recommendation [25]. SI is based on the Sobel filter and expresses the temporal maximum of the spatial standard deviation of the Sobel-filtered luminance frame. TI represents the temporal maximum of the spatial standard deviation of the pixel-wise differences between adjacent frames. To determine the edge entropy of a block, we first search for regular and homo-directional edges in the scene using the directional edge entropy approach [26,27]. A Sobel filter is applied to determine the horizontal and vertical gradients and, after determining the direction of the edges in every block, we calculate the 73-bin histogram for the values −180° to 180°, equivalent to a resolution of 5° per bin. The edge entropy is given by:

$$\mathrm{EDGE\_ENT} = -\sum_{i=1}^{73} p_i \log p_i, \qquad p_i = \frac{n_i}{\sum_{j=1}^{73} n_j},$$

where $n_i$ is the number of observations for the bin $i$. The data collected during feature extraction are then randomized, and one tenth of them are selected for $k$-means clustering (due to software and memory limitations). To avoid cluster biasing, especially in the case of TI, all features were normalized to the range [0, 1]. Then, in order to select the optimal number of clusters, we employed the Expectation Maximization (EM) algorithm [28] and the elbow method [29]. According to the latter method, we check the ratio of the within-class to the across-class distortion, computed from the data points $\mathbf{x}$ with coordinates (EDGE_ENT, SI, TI) and the centroid $\mathbf{c}_k$ of the $k$-th cluster, for every number of clusters up to the maximum considered, $K_{\max}$.
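The edge entropy feature described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: bins are assumed to be centred on −180°, −175°, …, 180° (which yields 73 bins), pixels with zero gradient magnitude are ignored, and the logarithm is taken to base 2.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def edge_entropy(block, bin_width=5, min_magnitude=1e-9):
    """Entropy of the edge-direction histogram of a 2D luminance block."""
    height, width = len(block), len(block[0])
    counts = [0] * (360 // bin_width + 1)  # 73 bins for a 5 degree resolution
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    pixel = block[y + j - 1][x + i - 1]
                    gx += SOBEL_X[j][i] * pixel
                    gy += SOBEL_Y[j][i] * pixel
            if math.hypot(gx, gy) < min_magnitude:
                continue  # assumption: flat pixels carry no edge direction
            angle = math.degrees(math.atan2(gy, gx))  # in [-180, 180]
            counts[round((angle + 180) / bin_width)] += 1
    total = sum(counts)
    if total == 0:
        return 0.0  # no edges at all in the block
    return -sum(n / total * math.log2(n / total) for n in counts if n)


if __name__ == "__main__":
    # A block containing one vertical edge has a single edge direction.
    vertical = [[0] * 4 + [255] * 4 for _ in range(8)]
    print(edge_entropy(vertical))
```

A block whose edges all share one direction yields an entropy of zero, whereas a block with edges in many directions yields a higher value.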
During the elbow method application, $k$-means clustering was applied following a five-fold centroid initialization. Fig. 2(a) depicts the distortion ratio for the different numbers of clusters tested. By inspecting this figure, we observe that the distortion ratio converges at seven clusters. The same number of clusters is suggested by EM clustering when applied to the three content attributes and the seven SOA metrics. Therefore, we conclude that seven is the optimal number of clusters for our data. Fig. 2(b) shows the clustered blocks in the content feature space.
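The normalization and elbow analysis can be illustrated with a small standard-library sketch. The exact distortion ratio used in the paper is not reproduced here; the sketch assumes the ratio of the summed squared distances of points to their own centroid (within-class) over the summed squared distances of the assigned centroids to the global mean (across-class). The five restarts, the seed and all function names are likewise assumptions made for the example.

```python
import random


def normalise(points):
    """Min-max scale every feature dimension to [0, 1]."""
    dims = list(zip(*points))
    lo = [min(d) for d in dims]
    hi = [max(d) for d in dims]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(p, lo, hi)) for p in points]


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, k, rng, iterations=50):
    """Plain Lloyd iterations started from k randomly sampled points."""
    centres = rng.sample(points, k)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(p, centres[j]))
                  for p in points]
        new_centres = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new_centres.append(centroid(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return centres, labels


def distortion_ratio(points, k, restarts=5, seed=0):
    """Within-class over across-class distortion for the best of several restarts."""
    rng = random.Random(seed)
    overall = centroid(points)
    best = None
    for _ in range(restarts):
        centres, labels = kmeans(points, k, rng)
        within = sum(sq_dist(p, centres[l]) for p, l in zip(points, labels))
        across = sum(sq_dist(centres[l], overall) for l in labels)
        if best is None or within < best[0]:
            best = (within, across)
    within, across = best
    return within / across if across > 0 else float("inf")


if __name__ == "__main__":
    # Hypothetical (EDGE_ENT, SI, TI) triplets forming two tight groups.
    raw = [(0, 0, 0), (0.2, 0, 0), (0, 0.2, 0), (0, 0, 0.2),
           (10, 10, 10), (9.8, 10, 10), (10, 9.8, 10), (10, 10, 9.8)]
    features = normalise(raw)
    for k in range(2, 5):  # candidate numbers of clusters up to K_max = 4
        print(k, distortion_ratio(features, k))
```

Plotting the ratio against the number of clusters and looking for the point where it stops decreasing noticeably corresponds to the elbow inspection of Fig. 2(a).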

Dataset partition for training and testing
Prior to the development of the multi-metric fused models, and to avoid over-fitting, we divide our dataset into two parts: 70% of the data are used for fitting our models and 30% for testing purposes. The partitioning is performed at a sequence level, ensuring that only blocks from sequences of the training set are extracted for training purposes (and likewise for testing). Due to the small number of sequences, a selection method based on the content characteristics is suggested. In particular, a scoring system is formulated to indicate which sequences best represent the total population. This scoring is based on the uniformity and coverage [30] of the block data in each sequence against the total population of available data points. Table 4 lists the uniformity and coverage of the three chosen content attributes. The uniformity is calculated using the entropy of the histogram bins that evenly span the whole set of sequences, whereas the coverage is calculated for the normalized dimensions of the three content features, as explained in [30]. Finally, the product of the mean value of the uniformity over the dimensions and of the coverage generates a score (i.e. Score = U_mean ⋅ T) that indicates how well each sequence represents the population. Based on this score, we select the six sequences that are located between the 35th and 65th percentiles. This decision is motivated by the need to partition the dataset into two representative sets suitable for training and testing. Choosing sequences from the same percentile range for training (i.e. featuring great coverage and uniformity) would result in poor performance in testing.
In Table 4, the chosen sequences are highlighted in light grey. As can be observed, the choice of the middle six sequences based on the score allowed the division of the population into training and testing sets that adequately represent the whole population and are referred to in Table 4 as ''Training Seqs'' and ''Testing Seqs''. Indeed, while the overall population scores a total of .457, the training and testing subsets follow closely with a score .436 and .332 respectively. Next, we inspect the distribution of the blocks across the 7 clusters in Table 5. It can be seen that although the training set adequately represents the total population, the testing set includes a higher percentage of blocks in some clusters (e.g. K5) against others (e.g. K3). This is expected to impact on the performance of the prediction models between the testing and the training set.
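The sequence selection can be sketched as below. This is an illustration only: the uniformity is assumed to be the histogram entropy normalised by the number of bins (so that it lies in [0, 1]), the coverage is treated as a precomputed input because its computation follows [30], and all sequence names and score values are hypothetical.

```python
import math


def uniformity(values, bins=10, lo=0.0, hi=1.0):
    """Histogram entropy with logarithm base `bins`; 1.0 means evenly filled bins."""
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    total = len(values)
    return -sum(c / total * math.log(c / total, bins) for c in counts if c)


def representativeness(uniformities, coverage):
    """Score = mean uniformity over the feature dimensions times the coverage."""
    return sum(uniformities) / len(uniformities) * coverage


def middle_sequences(scores, lower_pct=35, upper_pct=65):
    """Names of the sequences ranked between the two percentiles of the score."""
    ranked = sorted(scores, key=scores.get)
    n = len(ranked)
    return ranked[n * lower_pct // 100: n * upper_pct // 100]


if __name__ == "__main__":
    # Hypothetical scores for 20 sequences (placeholders, not the Table 4 values).
    scores = {f"seq{i:02d}": i / 20 for i in range(20)}
    print(middle_sequences(scores))
```

With 20 sequences, the ranks between the 35th and 65th percentiles correspond to six sequences, which matches the 30% share reserved for testing.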

Model fitting
Our hypothesis here is that a better performing quality assessment metric can be obtained by combining several other state-of-the-art metrics in a content-dependent manner. The fusion of the multiple metrics is achieved by applying multivariate fitting of the objective metrics and the content features. We have designed different families of predictors that use a different combination and number of inputs, and consequently also differ in computational complexity. The first two families, LL and LH, are linear combinations of input metrics. In particular, LL models are a linear combination of up to three state-of-the-art quality metrics, whereas LH models combine all considered quality metrics. The third family of models, NL, consists of non-linear combinations of quality metrics and content features. The software used for the model fitting is Eureqa Pro [31,32]. We would like to note that we used a justified hold-out method instead of a random $k$-fold cross-validation (see Section 4.1).
The predicted DMOS, $\mathrm{DMOS}_p$, is continuous and limited to the range [0, 5] according to the reported range of the collected DMOS values. The fitted models in all three families of predictors are reported in Table 6.
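To make the idea of a linear fusion concrete, the sketch below fits a first-order model by ordinary least squares, clamps the prediction to [0, 5] and reports common goodness-of-fit measures. It is not the Eureqa-based fitting used in this work and it does not enforce the positive-weight restriction of the LL family; the function names and the toy data are assumptions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(metrics, dmos):
    """Least-squares weights [bias, w_1, ..., w_n] via the normal equations."""
    rows = [[1.0] + list(m) for m in metrics]
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * d for r, d in zip(rows, dmos)) for i in range(n)]
    return solve(ata, atb)


def predict(weights, metric_values, lo=0.0, hi=5.0):
    """Linear prediction clamped to the DMOS range [lo, hi]."""
    raw = weights[0] + sum(w * v for w, v in zip(weights[1:], metric_values))
    return min(max(raw, lo), hi)


def goodness_of_fit(actual, predicted):
    """R^2, mean squared error and mean absolute error."""
    n = len(actual)
    mean_a = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    return {"R2": 1 - ss_res / ss_tot, "MSE": ss_res / n, "MAE": mae}


if __name__ == "__main__":
    # Hypothetical block data: two metric values per block and a DMOS target.
    metrics = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
    dmos = [3.0, 1.5, 3.5, 4.0]
    weights = fit_linear(metrics, dmos)
    predictions = [predict(weights, m) for m in metrics]
    print(weights, goodness_of_fit(dmos, predictions))
```

In a content-dependent setting, one such model would be fitted per cluster, using only the training blocks assigned to that cluster.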
In order to assess the goodness of fit of the models, the following metrics are reported in Table 7: $R^2$, Lin, MSE and mean absolute error (MAE). We observe that the models fit reasonably well and provide good DMOS prediction. LL predictors consider only positive weight coefficients, resulting in solutions that are simple linear combinations of a few metrics. This limitation is removed for the LH family of predictors, where all metrics are considered for the first-order linear fitting. This introduces a clear computational overhead, as all metrics have to be calculated within the RDO. Finally, for the NL predictors, non-linear formulas are examined that can potentially combine all seven metrics as well as the three primary content features. In this case, the goodness-of-fit metrics improve. Although the NL predictors could result in an arithmetically more complex solution, since they include floating-point multiplication and division of the individual features and metrics, only a subset of the features was selected during the fitting, resulting in an overall execution time that is lower than that of the LH models. To provide an indication of the computational complexity of the proposed models, we followed the same approach as earlier in Table 1. Thus, in the last row of Table 7 we report the relative average complexity of the three models with reference to the minimum objective metric execution time, i.e. that of PSNR. As expected from the execution times of the different SOA quality metrics, the BVQA-LL and BVQA-NL models are the fastest to compute due to the smaller number of input metrics. Fig. 3 illustrates the process of using the proposed block-level quality assessment to predict the expected perceived quality per block. After extracting the content features at a block level from the original video blocks, the blocks are classified into one of the seven clusters identified above. Then, the objective quality metric values are computed using the encoded video blocks. The content feature values, the assigned class and the quality metric values are fed into the BVQA models and the perceived video quality per block is estimated.
For the BVQA model validation, we evaluate the performance of the method against the best performing metrics from Table 3, namely FSIM, MS-SSIM and PSNR-HVSM. In Table 8, we list the linear and rank correlation coefficients between the original DMOS [24] and the predicted $\mathrm{DMOS}_p$ using the BVQA models for the training set. For the training set, we verify that the model has been fitted correctly for each cluster by observing the first couple of columns. Each fitting solution (LL, LH and NL models) shows a linear and rank correlation between .78-.83 and .73-.75 at the block level, respectively. As can be observed, the correlation values increase at the frame (.86-.91 and .83-.85) and at the sequence level (.88-.93 and .84-.86). This shows the effectiveness of the method, as high correlation with the DMOS scores is achieved at the sequence level overall.
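The per-block prediction flow of Fig. 3 can be sketched as follows. The sketch assumes that a block is assigned to the cluster with the nearest centroid in the normalized (EDGE_ENT, SI, TI) space and that each cluster has its own linear model; the centroids, weights and metric values are placeholders and are not the fitted models of Table 6. Only two clusters are used to keep the example short.

```python
def nearest_cluster(features, centroids):
    """Index of the centroid closest to the block's content features."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2
                                 for f, c in zip(features, centroids[k])))


def predict_block_dmos(features, metrics, centroids, models):
    """Return (cluster index, predicted DMOS clamped to [0, 5]) for one block."""
    cluster = nearest_cluster(features, centroids)
    bias, weights = models[cluster]
    raw = bias + sum(weights[name] * metrics[name] for name in weights)
    return cluster, min(max(raw, 0.0), 5.0)


# Placeholder centroids in the normalized (EDGE_ENT, SI, TI) space.
CENTROIDS = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9)]

# Placeholder per-cluster linear models: (bias, {metric name: weight}).
MODELS = {
    0: (5.0, {"MS-SSIM": -4.0}),
    1: (6.0, {"MS-SSIM": -3.0, "FSIM": -2.5}),
}

if __name__ == "__main__":
    block_features = (0.8, 0.95, 0.7)  # from the original block
    block_metrics = {"MS-SSIM": 0.9, "FSIM": 0.8, "PSNR": 35.0}  # from the encoded block
    print(predict_block_dmos(block_features, block_metrics, CENTROIDS, MODELS))
```

Metrics that a cluster's model does not use, such as PSNR in this toy example, need not be computed for blocks of that cluster, which is where the complexity savings of the smaller models come from.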

Model validation
To further verify the BVQA models, we use two other datasets. The first is the testing set of BVI_Texture and the second is the VQEG-HD3 dataset. We have selected the latter as it complies with the assumption we made of sequences without an apparent viewing pattern and it is annotated with subjective scores. It is expected that the model performance will deviate for these two datasets compared to the training set, mainly because the available sequences annotated with subjective scores are not numerous and diverse enough to cover the space of features and objective quality metrics.
The results of deploying the BVQA models on the testing sequences of BVI_Texture are reported in Table 9. As anticipated, the correlation values drop in the testing set. However, BVQA outperforms the state-of-the-art quality metrics in terms of rank correlation. The drop in performance in the testing set is a natural effect of the variability and randomness of the selected blocks from the video sequences, as well as of the small number of sequences available for training. Finally, we present the results on another dataset with HD videos from the Video Quality Experts Group (VQEG), the VQEG-HD3 dataset [33]. For this dataset, as reported in Table 10, all tested metrics achieve lower linear and rank correlation values compared to those from the testing sequences of the BVI_Texture dataset (see Table 9). This is expected due to the different content characteristics of this dataset. Nevertheless, in most cases, BVQA outperforms the state-of-the-art objective quality metrics on this dataset.

Conclusion
We presented a multi-metric fusion approach that delivers a video quality assessment method at a block level that correlates better with perceptual quality than the state-of-the-art objective metrics. This approach is a step towards combining several well-performing metrics into one, exploiting the advantages of using objective metrics that are embeddable in the RDO process in a content-dependent manner. At the same time, the advantage of developing a block-level quality metric is that it can be used within the RDO environment. The first results of BVQA are promising in terms of the correlation of the developed locally-acting quality metric with the overall perceived quality of the video. This allows us to argue that, within this group of content, this combination of metrics produces a quality estimate closer to the average experience. Consequently, the RDO is expected to be more efficient, as it will be using a model that is informed by a higher level of content awareness (what is around the block) and not just by the content of the block.

Limitations and challenges for future work
Recently, with the aim of optimizing the trade-off between the encoding pipeline and the streaming performance, videos are split into ''chunks'' (often at a shot level) of a few seconds, as proposed for example in [19,20]. The presented multi-metric fusion method is built on the assumption that for short videos that could represent one shot, the scene content is homogeneous across all tested frames. We have also assumed no apparent viewing patterns for this work. It is, however, important to take into account the perceptual significance of specific parts of a frame, either because of visual salience and/or semantic importance. Thus, the challenge is to extend our method to take into account the perceptual importance of specific areas that might be points of interest for most viewers.
Furthermore, the results presented in this paper were based only on a limited number of sequences coming from two datasets in order to conform with the method assumptions. We followed a hold-out validation method using a justified splitting of the sequences that was based on the relative coverage and uniformity of the low-level features of the dataset at a sequence level. The challenge arising from this is to further test the method against new datasets and perform a cross-validation with randomized splits.
Finally, the biggest challenge is the natural next step of integrating the proposed method into the RDO of a video encoder, and computing the effectiveness (gains both in quality and bit rate) and the efficiency (complexity overhead) of the BVQA-based optimized encodings.