Influence of Bit Depth on Subjective Video Quality Assessment for High Resolutions

This paper deals with the influence of bit depth on subjective video quality assessment. Eight video sequences, each representing a different content prototype, were analysed. Subjective evaluation was performed using the ACR method. The sequences were encoded with 8-bit and 10-bit bit depth. The two most widely used compression standards, H.264 and H.265, were evaluated at bitrates of 1, 3, 5, 10 and 15 Mbps in Full HD and UHD resolution. Finally, the perceived quality of both compression standards was compared using the subjective tests, with emphasis on bit depth. The results indicate that 10-bit bit depth is not appropriate in practice for Full HD resolution in the bitrate range from 1 to 15 Mbps; for Ultra HD resolution, it is appropriate only for videos encoded with the H.265/HEVC compression standard.


Introduction
In recent years, the level of video multimedia services has increased rapidly. This evolution was made possible by the increase in the bandwidth of communication networks. Although the capacity of certain network access technologies reaches hundreds or thousands of megabits per second (depending on the type of technology), video compression remains an active research topic.
The rest of the paper is organised as follows. The first part briefly characterises the H.264 and H.265 compression standards. The second part briefly describes the objective metrics used in our experiments. The last part describes the measurements and experimental results.

State of the art
Even if in papers [1][2][3] the coding efficiency comparison of well-known and most used compression standards as H.264/AVC, H.265/HEVC and VP9 using objective metrics has been

Objective video quality assessment methods
Video quality assessment is commonly divided into two groups: objective and subjective assessment.
Subjective evaluation is based on assessments by observers, who score the video quality on an appropriate scale. An advantage of this approach is its accuracy, since the scores come directly from the end recipients of the video information; its drawbacks are that it is very time-consuming and requires many people (in accordance with ITU-R BT.500-13, a minimum of 15 observers is needed for each test).
In contrast, objective quality assessment is performed by computers, which allows quick evaluation at any time and is not limited by the assessment duration or the number of repetitions. For this reason, objective assessment is the approach mainly used. It relies on computational methods that produce values scoring the video quality. A big advantage of this type of assessment is its repeatability. Many objective metrics exist today; the most widely used are the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM).
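The PSNR is the oldest objective metric and, considering its simplicity and computing speed, still one of the most used. It is defined in [14]. The PSNR belongs to the pixel-based metrics. Its value is derived from the mean square error (MSE) and defined as:
$$\mathrm{MSE} = \frac{1}{T \cdot X \cdot Y} \sum_{t=1}^{T} \sum_{i=1}^{X} \sum_{j=1}^{Y} \left[ x_o(i,j,t) - x_r(i,j,t) \right]^2$$
$$\mathrm{PSNR} = 10 \log_{10} \frac{I^2}{\mathrm{MSE}}$$
where x_o and x_r are the pixel values of the two compared sequences (original and reconstructed), X·Y is the size of a frame in pixels, T is the number of frames in the sequence and I is the maximum value that a pixel can take. The value I is defined by the bit depth as follows:
$$I = 2^b - 1$$
where b is the bit depth.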
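The PSNR definition above can be sketched in a few lines of Python. This is only a minimal illustration, assuming NumPy and a sequence stored as an array of shape (T, Y, X); the function name `psnr` and its parameters are hypothetical and do not describe the measurement tool used in the paper.

```python
import numpy as np

def psnr(original, reconstructed, bit_depth=8):
    """PSNR in dB between two sequences of shape (T, Y, X)."""
    # Convert to float64 so differences of unsigned samples cannot wrap around.
    ref = np.asarray(original, dtype=np.float64)
    rec = np.asarray(reconstructed, dtype=np.float64)
    if ref.shape != rec.shape:
        raise ValueError("sequences must have identical shapes")
    peak = 2 ** bit_depth - 1            # I = 2^b - 1
    mse = np.mean((ref - rec) ** 2)      # mean over all T * X * Y samples
    if mse == 0:
        return float("inf")              # identical sequences
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because the peak value I grows with the bit depth, the same absolute error gives a higher PSNR for 10-bit than for 8-bit samples, which is worth keeping in mind when PSNR values of different bit depths are compared.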
The SSIM metric measures structural distortion instead of pixel error. It measures three components (the luminance, the contrast and the structural similarity) and combines them into one final value that determines the quality of the test sequence (Figure 1).
The SSIM metric correlates very well with subjective perception [15]. The results lie in the interval [0, 1], where 0 represents the worst and 1 the best quality.
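To illustrate how the three SSIM components are combined, the following sketch computes a simplified, single-window SSIM from global frame statistics. It is an assumption-laden illustration only: the reference SSIM is computed over local windows and averaged, the constants k1 = 0.01 and k2 = 0.03 are the commonly used defaults, and the name `global_ssim` is hypothetical. Unlike the windowed metric in typical use, this global variant can become negative for strongly anti-correlated frames.

```python
import numpy as np

def global_ssim(ref_frame, test_frame, bit_depth=8, k1=0.01, k2=0.03):
    """Simplified SSIM from whole-frame statistics (no sliding window)."""
    x = np.asarray(ref_frame, dtype=np.float64)
    y = np.asarray(test_frame, dtype=np.float64)
    peak = 2 ** bit_depth - 1
    c1 = (k1 * peak) ** 2                # stabilises the luminance term
    c2 = (k2 * peak) ** 2                # stabilises the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    # Luminance, contrast and structure terms folded into one expression.
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```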

Measurement procedure
In this experiment, eight video test sequences were used. The sequences are part of the database [16]. The following list gives a short description of each sequence.
• Bund nightscape: city night shot. The scene is time-lapsed; the dynamic segments of the scene are moving cars and pedestrians on the pavement; the static segments are urban buildings. The camera captures the scene from a static position (Figure 2).
• Campfire party: night scene close to a fire. In the foreground, there is a flaming bonfire (fast changes of temporal and luminance information). In the background, there is a group of nearly static people. At the end of the sequence, the camera zooms in on the group of people (Figure 3).
• Construction field: shot of a construction site, where the static background is represented by buildings under construction and the dynamic objects by construction vehicles (an excavator) and walking workers. The slow-motion scene is captured statically (Figure 4).
• Fountains: daytime shot of a city fountain. The foreground consists of squirting water (many edges in the picture); the static background is formed by trees and buildings. The camera is static and the scene has low motion dynamics (Figure 5).
• Marathon: marathon competition. The runners are multiple moving objects with moderate dynamics; the background is a static road. The camera is static and captures the scene from a high point of view (Figure 6).
• Runners: a running race, but in contrast to the Marathon scene, there are fewer runners. The camera is static, located in front of the runners and slightly angled to the side (higher spatial information). The scene is relatively dynamic (Figure 7).
• Tall buildings: shot of a modern city. The static objects are skyscrapers, a river and the urban infrastructure; the slow-moving objects are represented by city traffic. The camera moves slowly from the left to the right side. The scene is characterised by changes of spatial and temporal information (Figure 8).
• Wood: forest scenery. A shot of the trees in the forest (the captured objects are static). The camera moves from the left to the right side, and the motion accelerates during the sequence. The spatial and temporal information values are relatively high (Figure 9).
Generally, the compression difficulty is directly related to the spatial and temporal information of a sequence. Following [17], the spatial information (SI) and temporal information (TI) were calculated using the Mitsu tool [18], and the results were plotted in the spatial-temporal information plane (Figure 10). In accordance with [17], eight test sequences were used in the test.
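As a rough illustration of how SI and TI values can be obtained, the sketch below assumes the definitions of ITU-T P.910 (SI as the maximum over time of the spatial standard deviation of the Sobel-filtered luma frame, TI as the maximum over time of the spatial standard deviation of the frame difference). Whether [17] and the Mitsu tool use exactly these definitions is an assumption, and the function names are hypothetical.

```python
import numpy as np

def sobel_magnitude(frame):
    """Gradient magnitude of a 2-D luma frame using 3x3 Sobel kernels."""
    f = np.asarray(frame, dtype=np.float64)
    # Horizontal gradient: right column minus left column (interior pixels only).
    gx = (f[:-2, 2:] + 2 * f[1:-1, 2:] + f[2:, 2:]) \
       - (f[:-2, :-2] + 2 * f[1:-1, :-2] + f[2:, :-2])
    # Vertical gradient: bottom row minus top row.
    gy = (f[2:, :-2] + 2 * f[2:, 1:-1] + f[2:, 2:]) \
       - (f[:-2, :-2] + 2 * f[:-2, 1:-1] + f[:-2, 2:])
    return np.hypot(gx, gy)

def spatial_temporal_information(luma):
    """Return (SI, TI) for a luma sequence of shape (T, Y, X)."""
    luma = np.asarray(luma, dtype=np.float64)
    # SI: maximum over time of the spatial std of the Sobel-filtered frame.
    si = max(sobel_magnitude(frame).std() for frame in luma)
    # TI: maximum over time of the spatial std of successive frame differences.
    if luma.shape[0] < 2:
        ti = 0.0
    else:
        ti = max((luma[n] - luma[n - 1]).std() for n in range(1, luma.shape[0]))
    return si, ti
```

Each sequence then contributes one (SI, TI) point to the spatial-temporal information plane.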
All sequences were uncompressed, in the *.yuv format and in UHD resolution (3840 × 2160 px). The aspect ratio of all sequences was 16:9, the frame rate was 30 fps (frames per second) and the chroma subsampling was 4:4:4. The length of each sequence was 300 frames, that is, 10 s.
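The arithmetic behind these figures can be checked with a short sketch. It assumes one byte per sample (8-bit storage) for the raw file, which the text does not state; with 4:4:4 subsampling every pixel carries three full-resolution samples. The function name and parameters are hypothetical.

```python
def raw_sequence_stats(width=3840, height=2160, fps=30, frames=300,
                       bytes_per_sample=1, samples_per_pixel=3):
    """Size and bitrate of an uncompressed 4:4:4 sequence."""
    # bytes_per_sample=1 is an assumption (8-bit storage); use 2 for 10-bit
    # samples stored in 16-bit words.
    frame_bytes = width * height * samples_per_pixel * bytes_per_sample
    total_bytes = frame_bytes * frames
    duration_s = frames / fps
    raw_mbps = frame_bytes * 8 * fps / 1e6
    return frame_bytes, total_bytes, duration_s, raw_mbps

frame_bytes, total_bytes, duration_s, raw_mbps = raw_sequence_stats()
# Ratio between the raw bitrate and the highest tested bitrate (15 Mbps).
compression_ratio_at_15_mbps = raw_mbps / 15
```

Under these assumptions the raw stream is close to 6 Gbps, so even the highest tested bitrate of 15 Mbps corresponds to a compression ratio of roughly 400:1.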

3. Then, the sequences were decoded back to the *.yuv format using the same tool.
4. Finally, the quality of these sequences was compared with that of the reference (uncompressed) sequences and evaluated. This was done using the MSU Measuring Tool Pro version 3.0 [22]. The PSNR and SSIM objective metrics were used for the measurements.
The whole procedure of measurement and evaluation is shown in Figure 11.

Experimental results
All experiments were performed on the eight abovementioned video test sequences with different codecs, video resolutions, bitrates and bit depths. All parameters are listed in Table 1.
For every combination of test parameters, the values of PSNR and SSIM were computed. The obtained dataset consists of 320 values of PSNR and SSIM. Because the number of results is large, only the following derived parameters are reported:
• The PSNR difference between 10- and 8-bit bit depth: $\mathrm{PSNR}_{10bit} - \mathrm{PSNR}_{8bit}$
• The relative SSIM difference between 10- and 8-bit bit depth, with the 8-bit value as the reference: $\frac{\mathrm{SSIM}_{10bit} - \mathrm{SSIM}_{8bit}}{\mathrm{SSIM}_{8bit}} \cdot 100\,\%$
In the first part, the full HD video sequences were analysed. Tables 2-5 and Figure 13 show the difference between codecs. The last column of each table contains the average values computed for a specific bitrate. Figure 12 shows the coding efficiency comparison of H.264 and H.265 in full HD with 8-bit bit depth (left) and with 10-bit bit depth (right).
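As a sanity check on the size of the dataset, the parameter combinations can be enumerated directly, as in the sketch below. The two helper functions mirror the reported difference measures; the relative SSIM formula (difference divided by the 8-bit value, in percent) is an assumption based on the statement that 8-bit bit depth was chosen as the reference, and all names are hypothetical.

```python
from itertools import product

SEQUENCES = ["Bund nightscape", "Campfire party", "Construction field",
             "Fountains", "Marathon", "Runners", "Tall buildings", "Wood"]
CODECS = ["H.264", "H.265"]
RESOLUTIONS = ["Full HD", "UHD"]
BITRATES_MBPS = [1, 3, 5, 10, 15]
BIT_DEPTHS = [8, 10]

def parameter_grid():
    """All (sequence, codec, resolution, bitrate, bit depth) combinations."""
    return list(product(SEQUENCES, CODECS, RESOLUTIONS,
                        BITRATES_MBPS, BIT_DEPTHS))

def psnr_difference(psnr_10bit, psnr_8bit):
    """PSNR gain of 10-bit over 8-bit encoding, in dB."""
    return psnr_10bit - psnr_8bit

def relative_ssim_difference(ssim_10bit, ssim_8bit):
    """SSIM gain of 10-bit over 8-bit in percent (8-bit as assumed reference)."""
    return (ssim_10bit - ssim_8bit) / ssim_8bit * 100.0
```

The grid contains 8 × 2 × 2 × 5 × 2 = 320 combinations, which matches the stated size of the dataset; each table of differences for one codec and one resolution therefore holds 8 × 5 = 40 cases.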
The relative SSIM difference between the H.264 and H.265 compression standards is more significant for low bitrates in full HD resolution (Figure 13). For 8-bit bit depth, the values are very close for bitrates of 10 Mbps and higher. If 10-bit bit depth is used, the value of relative SSIM differs for the scenes Fountains and Campfire party at bitrates of 5 Mbps and higher. The trend of relative SSIM is similar for 8- and 10-bit bit depth, but the values for 8-bit bit depth are slightly higher for lower bitrates.
From Table 6, we can state that the best PSNR differences in FHD are achieved by the sequences Construction field and Runners (scenes with slow dynamics); the worst result is in the scene Wood, which contains the most spatial and temporal information. The average enhancement with 10-bit bit depth is only 0.05 dB. In general, the quality enhancement strongly depends on the content of the test sequence.
Table 7 shows the influence of 10-bit bit depth for the H.265 compression standard. Comparing Table 7 with Table 6, we can state that the efficiency of H.265 with 10-bit bit depth is markedly higher than that of H.264 (as many as 34 positive cases). In the remaining cases, the negative difference is very small and not significant. This leads to the conclusion that videos encoded in this way are suitable for practical implementation.
Table 8 indicates slightly better results in comparison with Table 6; this result will probably correlate better with evaluations from subjective video quality assessment. The table indicates a relative quality enhancement with 10-bit bit depth for H.264 and FHD resolution in 23 cases. This leads to the conclusion that the efficiency of this compression standard with 10-bit bit depth is low and not appropriate for practical applications.
Table 9 shows the relative SSIM differences between 8-bit and 10-bit bit depths for H.265 in full HD resolution. The reference chosen was 8-bit bit depth. In all sequences, 10-bit bit depth outperformed 8-bit; the only exceptions are the sequences Runners and Wood, which have a high level of spatial information, but even there the negative differences are very small, close to zero.
From Tables 6-9 and Figure 11, we can state that the benefit of 10-bit bit depth for H.264 in full HD resolution is not significant (quality increased in only about half of the cases); for H.265 the increase in quality is higher, and 10-bit bit depth should be a good choice for practical application.
In the next part, the ultra HD video sequences were analysed.

Tables 10-14 show the difference between codecs. The last column of each table contains the average values computed for a specific bitrate.
The relative SSIM difference between the H.264 and H.265 compression standards is more significant for low bitrates (Figure 15). For 8-bit bit depth, the values are very close for bitrates of 10 Mbps and higher. Comparing the results from Figures 13 and 15, we can state that a larger percentage improvement in SSIM is achieved with UHD resolution.