30 Years of Video Coding Evolution: What Can We Learn from It in Terms of QoE?


Abstract. From the beginnings of ITU-T H.261 to H.265 (HEVC), each new video coding standard has aimed at halving the bitrate at the same perceptual quality by redundancy and irrelevancy reduction. Each improvement has been explained by comparably small changes in the video coding toolset. This contribution aims at starting the Quality of Experience (QoE) analysis of the accumulated improvements over the last thirty years. Based on an overview of the changes in the coding tools, we analyze the changes in the quantized residual information. Visual comparison and statistical measures are performed and some interpretations are provided towards explaining how irrelevancy reduction may have led to such a huge reduction in bitrate. The interpretation of the results in terms of QoE paves the way towards an understanding of the coding tools in terms of visual quality. It may help in understanding how the irrelevancy reduction has been improved over the decades. Understanding how the differences of the residuals relate to known or yet unknown properties of the human visual system may enable a closer collaboration between perception research and video compression research.

Introduction
The efficient reduction of bitrate in multimedia coding is one of the core applications of research on perception. The encoding of audio signals has been largely improved by audio perception research in both speech and music transmission. For video encoding, the gain due to this knowledge transfer seems less significant. No major difference in the coding structure itself can be noticed from the ITU-T H.261 standard in the early nineties to the recent High Efficiency Video Coding (HEVC) standard. The encoding loop is still very similar, containing block-by-block processing with a Discrete Cosine Transform, quantization, and an entropy coding step, as well as translational motion compensation between successive frames. However, many differences have been accumulated over the years of continuous development. While it may be impossible to judge the perceptual impact of each of the thousands of contributions that have been made to improve video encoding efficiency, and while it may still be hard to interpret the difference between one standard and its successor, the coding performance enhancement over the last thirty years seems worth investigating from a perceptual viewpoint.
While a large part of the aforementioned coding gain may stem from a redundancy reduction without any link to the human visual system, we observe a bitrate reduction by a factor of around ten, which seems hard to attribute solely to the general progress in redundancy reduction (e.g., the switch to arithmetic coding) or to an improved representation of the video signal. In this contribution, we aim at identifying and characterizing the irrelevancy reduction part. Instead of taking an algorithmic perspective and assessing the perceptual impact of each of the coding steps, we take a signal-oriented perspective by comparing the intermediate images in the H.261 and HEVC encoders, such as the predicted image. This allows for a more holistic approach with the focus on the perceptual differences in terms of QoE. The results of the complex encoder operations (especially after several frames encoded with consecutive predictions) can thus be perceptually evaluated on well-chosen video content. In this contribution, we present the results of the analysis of the predicted image using both signal-based statistics as well as visual inspection of the video frames. To the best of our knowledge, this is the first work which reverses the viewpoint of video coding improvement and human vision research by taking this holistic perspective.
The remainder of the paper is organized as follows. Section 2 provides a brief overview of the evolution of the video coding tools in the last decades. In Sec. 3, the analysis of signal-based and QoE-interpreted differences between the investigated video coding standards is described. Section 4 presents the results of the statistical and perceptual analysis. Section 5 provides the final conclusions and suggests future work.

Evolution of Video Coding
This section sets the context for our research. It first introduces the concept of QoE and its measurement, then compares the two video coding approaches investigated later in the paper, i.e. H.261 and H.265, and finally provides brief evidence of the progress that has been made in this context over the decades. See [3] for algorithms that take both the bitstream and the decoded video, and potentially the reference video, as input for their quality prediction.

Measuring Quality of Experience
The number of research papers proposing methods for video quality assessment is very large and continuously growing. Good reviews may be found in [4] and [5]. The research focus in these algorithms is on the prediction of image and video quality for a particular application scope, which includes the content and the usage scenario and, in the case of algorithms that consider the bitstream information, on one specific video coding algorithm. In this paper, however, the focus is on the difference between video codecs and its impact on the perceived quality, which may be considered as an inverse view. When it comes to the quality assessment, databases containing annotated image and video samples are necessary. Some popular and extensively used ones can be found in [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] and [16].
In terms of QoE, it is worth mentioning that the expectations of human observers regarding video quality in the era of H.261 (early 1990s) were different from today's expectations; thus, the analysis should be extended to Human Influence Factors as well.
It may be difficult to quantitatively name a bitrate improvement factor for all contents. Judging from the goals of the standardization groups, their core test experiments, and independent tests, at least H.264 and H.265 have each likely halved the bitrate, leading to a total factor of four at the same perceptual quality. Conversely, a factor of four in bitrate has been shown in various subjective experiments to change the perceptual quality significantly, and in most cases dramatically. Thus, there seems to be sufficient evidence that the coding performance improvements of the video coding standards have led to a significantly higher perceptual quality at the same bitrate.

Approach to Analyzing QoE Differences
The basic idea of hybrid multimedia coding is to combine redundancy and irrelevancy reduction. In redundancy reduction, the multimedia signal is transformed into a representation that minimizes the signal energy.
In video coding, this is done by a Differential Pulse Code Modulation (DPCM) loop that includes a predictor that uses previous image contents that have already been transmitted. Those may stem from previously transmitted data of the same image (intra-prediction) or from image patterns of temporally preceding (or following) images, which may be motion compensated (inter-prediction). The resulting residual image contains less energy than the original image in most cases. The energy is further compacted by using a frequency transformation (DCT for H.261, Integer-DCT for H.265). None of the redundancy reduction steps is lossy. The loss occurs in the quantization of the DCT coefficients. This step is supposed to remove irrelevancy, i.e. image information that is not mandatory to perceive important content. The frequency transform helps in identifying the significance of the information for humans, in particular via the DC component. Later on, the signal is entropy coded, thus minimizing the required bitrate, which is another redundancy reduction step.
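As a minimal illustration of the transform-and-quantize step described above, the following sketch uses a floating-point orthonormal DCT and a uniform quantizer. This is our own simplification, not the integer-exact transforms or quantizer designs of H.261 or H.265; the function names are ours. It shows why quantization is the only lossy step:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix (the transforms of H.261 and H.265
    # are integer approximations of this idea).
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def code_block(residual_block, qstep):
    # Forward transform, uniform quantization, and reconstruction of a
    # single 8x8 residual block; only the rounding in the quantizer loses
    # information.
    C = dct_matrix(residual_block.shape[0])
    coeffs = C @ residual_block @ C.T   # 2-D DCT
    q = np.round(coeffs / qstep)        # lossy: irrelevancy removal
    rec = C.T @ (q * qstep) @ C         # inverse transform
    return q, rec

rng = np.random.default_rng(0)
block = rng.integers(-20, 21, size=(8, 8)).astype(float)
q, rec = code_block(block, qstep=16)
print("non-zero coefficients:", np.count_nonzero(q))
print("max reconstruction error:", np.abs(rec - block).max())
```

With a large quantization step, most coefficients round to zero, which is exactly the energy compaction the subsequent entropy coding step exploits.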
QoE analysis needs to focus on the irrelevancy reduction, as this removes irretrievable information from the signal. Taking a different perspective, the reference video and the decoded video differ by a certain amount that can be interpreted as added noise. The shape of this noise should be optimized such that the QoE impact is minimized.
Two approaches may be used. The first one consists of analyzing the signal difference between the reference video and the decoded video. This is the well-known classical approach of video quality estimation by Full-Reference methods, and it is not the goal of this paper. The second one consists of analyzing the information that is transmitted from the sender to the receiver. In video coding, this information is identical to the transmitted bitstream, and its closest visual representation is the quantized residual signal. This signal is calculated in the encoder just before entropy coding and is available in the decoder immediately after entropy decoding. From a perceptual point of view, interpreting and understanding the characteristics of this signal gives an approximation of what difference is important for the Human Visual System (HVS) to perceive within a frame's time (e.g. 20 ms at 50 fps). This interpretation and understanding is limited by many factors, notably the assumption that motion estimation and compensation sufficiently approximate the motion prediction of the HVS, and the assumption that the transmitted information resembles what the HVS requires in addition to the previous image and the prediction of object movements. The first limitation may be relativized by the fact that block-based object motion estimation algorithms perform reasonably well. For the second limitation, we propose the hypothesis that a significant improvement towards the requirements of the HVS relates to the efficiency gain in video coding explained earlier; i.e., the difference between the quantized residual information of H.261 and H.265 should hint at the difference information required by the HVS.
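The second approach can be made concrete with a toy inter-frame DPCM step (our own illustration; no transform and no motion compensation), contrasting the full-reference difference of the first approach with the quantized residual that is actually transmitted:

```python
import numpy as np

def dpcm_step(frame, prediction, qstep):
    # One inter-frame DPCM step: the quantized residual is the signal
    # that is actually transmitted to the decoder.
    residual = frame - prediction
    q_residual = np.round(residual / qstep) * qstep
    decoded = prediction + q_residual
    return q_residual, decoded

rng = np.random.default_rng(1)
prev = rng.uniform(0, 255, size=(16, 16))      # previously decoded frame
cur = prev + rng.normal(0, 4, size=(16, 16))   # frame with small temporal change

q_res, dec = dpcm_step(cur, prev, qstep=8)

# View 1 (full-reference): difference between reference and decoded video.
fr_noise = cur - dec
# View 2 (this paper's focus): the transmitted quantized residual itself.
print("energy of FR difference:     ", np.sum(fr_noise ** 2))
print("energy of quantized residual:", np.sum(q_res ** 2))
```

In this toy loop, the full-reference noise is bounded by half the quantization step per pixel, while the residual carries everything the decoder needs beyond its prediction.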

Analysis
In this section, the results of the statistical and perceptual analysis are presented.

Statistical Results
We assume that an improvement of quality may be captured by the residuals of successive frames. It is worth noting first that the residuals have changed significantly over the years. Today, the residuals of successive frames carry less data/information than in the past, but they still seem capable of covering all information important for the HVS. Therefore, in the following, we compare the residuals of the older H.261 video coding standard to those of the current H.265 video coding standard.
We decided to analyze video sequences in CIF (352×288) resolution for the investigation of the residual behavior of the H.261 and H.265 video codecs. This choice was made because such a low resolution can easily be depicted in a publication for visual inspection and because smaller images are faster to process. The processing time concerns not only the encoding process but also the extraction of the residuals and the calculation of the signal-based statistics presented in the following. The sequences were downloaded from the Video trace library [42] in YUV 4:2:0 format.
We selected four different contents, namely Akiyo, Bus, Football and Hall, on the basis of the spatial-temporal indicator (SI/TI) diagram (see Fig. 1 for more detail), in order to cover a broad range of content in terms of spatial and temporal information. Only the first five frames were used for the creation of the SI/TI diagram and for the analysis; it is worth noting that these first five frames represent the behavior of the whole sequence in terms of SI/TI activity. Each sequence was encoded by a simple MATLAB implementation of the H.261 encoder and by the HM reference software (version 16.20) of the H.265 standard. The encoding process was restricted to Intra (I) and Predictive (P) frames in order to minimize the difference between the standards. The remaining parameters, except for the Quantization Parameter (QP), were kept at their defaults. As the deployed standards use different representations of the quantization parameter, we adjusted the quantization settings such that similar quantization step sizes were used for both H.261 and H.265. To cover a wide range of quality degradation, five different quantization steps were selected; the chosen steps included 2 and 16.

The first analysis consists of the visual comparison of the quantized residuals of both standards. We show the residuals of the first two frames of each sequence, which appeared very similar to the residuals of the other frames. As the residuals represent differences, they may contain positive and negative values; in order to visualize the data as an image, we added the mid-gray value 128 to each residual pixel value. When we compare the residuals visually (see Fig. 4), it can be seen that more blocks are distinguishable with increasing quantization step size. This is more pronounced in the case of the H.261 codec than in the case of the H.265 codec.
Thus, the higher bitrate of H.261 seems to be partly due to a higher energy of the residual information, which in turn requires more bitrate. Furthermore, the coded (non-zero) blocks are often positioned at different locations for the H.261 and H.265 codecs. This is caused by the fact that the DC component of the H.261 codec is always quantized by a division by eight, independently of the selected quantization step. As the DC value is less often quantized to zero, more blocks are visibly different from their surroundings. For the observer, this may finally lead to a distortion recognized as a blockiness effect and to a quality degradation. These coded blocks also cause an increase in residual energy, as can be seen in Tab. 1. It is evident from the table that the H.265 residuals have a lower energy in comparison to the H.261 residuals, and that the difference increases with higher quantization steps. The energy was computed according to Eq. (1):

E = \sum_{x=1}^{m} \sum_{y=1}^{n} A_{x,y}^2,   (1)

where A_{x,y} is the value of the residual at the coordinates x, y, and m, n are the width and height of the image, respectively. There is also a difference in the sizes of the blocks. The flexibility of block size selection in H.265 enables it to use smaller block sizes than H.261, in particular at higher quantization steps. When we take a look at the frequencies stored in the blocks of the residuals, we can see that the H.265 residuals encoded with the higher quantization steps contain lower frequencies in comparison to the H.261 residuals. It is also evident that the residuals contain more information (higher energy) when the scene contains faster motion.
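Eq. (1) maps directly to a few lines of NumPy; the following sketch (function name ours) computes the energy of a residual frame:

```python
import numpy as np

def residual_energy(residual):
    # Energy as in Eq. (1): the sum of squared residual values A_{x,y}
    # over the whole m x n residual frame.
    return float(np.sum(residual.astype(float) ** 2))

res = np.array([[3, -1], [0, 2]])
print(residual_energy(res))  # 9 + 1 + 0 + 4 = 14.0
```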
In the next step, an analysis of the correlation of the residual signal is presented. Concerning the correlation between horizontally adjacent pixels, see Fig. 2, the H.265 residuals show a higher correlation than those of the H.261 codec. In Fig. 2, a cross-like structure appears in the middle of the scatter plot for the H.265 residuals at the higher quantization steps. This indicates that pixels requiring a large residual value are co-located with zero-valued pixels. This behavior is not observed in the case of the H.261 residuals. The values of the correlation between adjacent pixels for each quantization step are presented in Tab. 2; they were computed according to Eq. (2):

r = \frac{\sum_{x=1}^{m-1} \sum_{y=1}^{n} (A_{x,y} - \bar{A})(A_{x+1,y} - \bar{A})}{\sum_{x=1}^{m} \sum_{y=1}^{n} (A_{x,y} - \bar{A})^2},   (2)

where A_{x,y} is the value of the residual at position x, y, and \bar{A} is the average value of the residual.
The next analysis concerns the inter-standard pixel-by-pixel correlation. We compare the residual values of H.261 to those at the identical pixel positions in H.265; Fig. 3 shows the result in the form of a scatter plot. Please note that H.265 discerns two types of residual values. The first type is the residual of intra-frame predicted blocks, which therefore contains differences within a frame. The second type, which is also present in H.261, is the residual information of inter-predicted blocks, which therefore covers differences between frames. The H.265 encoder decides whether a particular block is represented by intra or inter prediction.

Perceptual Interpretation
The above-mentioned differences in the residuals may also have a different impact on observers when asked for a QoE evaluation. The higher correlation of the H.265 residuals, for instance, may lead to a lower difference between adjacent blocks and ultimately to a less pronounced blockiness effect. A preview of the distortions caused by the investigated video compression standards is depicted in Fig. 5. In our analysis, this effect was evidenced by the intra-standard correlation analysis, and it was seen to get more pronounced when the quantization step size increases. The numerical characterization (correlation analysis) of the visual artifact (blockiness), executed by comparing the same video content between an older standard (H.261) and a newer one (H.265), may be used to further improve the coding tools for the residual information in upcoming standards.
From the visual inspection of the quantized residuals above, it also seems that the H.265 codec produces distortions that resemble a blurring effect more than a blockiness effect. This may be caused by the above-mentioned quantization difference as well as by the variation in block sizes. Blurriness may be perceived as a more commonplace effect in daily life and may thus be more acceptable to human viewers [43]. However, the same authors found in [44] that blockiness and blurriness are detected at a similar level and perceived with similar annoyance when the resulting signal distortion is similar. As this change appears in the residual information of the inter-predicted blocks, the temporal effect of these distortions needs to be considered. In H.261, the stronger blockiness may lead to a flickering of blocks or to structured noise in consecutive frames. The blurriness in the H.265 residual may also lead to degradation, but this noise may appear more as random noise, such as camera noise, rather than as a structured, localized signal distortion. Further evaluating this effect on the residual may help in better understanding the QoE gain that H.265 has achieved over H.261. In the long term, as mentioned before, the intentional shaping of the residual towards a desired signal may further increase the perceptual quality.

Conclusions
In this paper, we focused on introducing a possible path towards the QoE analysis of differences between video coding standards. The quantized residual in the video encoders (or decoders) was analyzed as the visual representative of the introduced degradation. The signal-based analysis helped in quantifying the perceived difference between the residuals of the H.261 and H.265 codecs. An interpretation of the visualization of the residual and of the signal-based characteristics towards an understanding of QoE differences was provided. Although only very basic signal processing and visual analysis has been performed, some interesting results were documented that hint at the potential of the approach, both for understanding the gain in irrelevancy reduction in video coding and for the possibility of designing video coding tools towards an improved QoE.
Future work should focus on a subjective assessment of the human perception of quantized residuals. This would not only lead to an improved understanding of how the same perceptual quality can be reached at a lower bitrate today, but would also pave the way towards further improvements of video coding by shaping the information contained in the quantized residual with optimized video coding tools.