An Efficient Hierarchical Video Coding Scheme Combining Visual Perception Characteristics

The saliency of different visual perception characteristics is the key to building a low-complexity video coding framework. This paper proposes a hierarchical video coding scheme based on the human visual system (HVS). The scheme uses a joint video coding framework consisting of a visual perception analysis layer (VPAL) and a video coding layer (VCL). In the VPAL, an effective detection algorithm for visual perception characteristics is proposed to obtain the visual region of interest (VROI), based on the correlation between coding information (such as motion vectors and prediction modes) and visual attention. The interest priority of the VROI is then set according to the visual perception characteristics. In the VCL, an optional encoding method is developed that uses the interest priority results from the VPAL. As a result, the scheme achieves information reuse and complementarity between visual perception analysis and video coding. Experimental results show that the proposed hierarchical video coding scheme effectively alleviates the conflict between complexity and accuracy. Compared with H.264/AVC (JM17.0), it reduces video coding time by approximately 80% while maintaining good video image quality, which significantly improves video coding performance.


Introduction
Due to the rapid growth of multimedia services, video compression has become essential for reducing the bandwidth and storage required in many applications. The prospects of video coding technology are broad, ranging from national defense, scientific research, education, and medicine to aerospace engineering. However, with limited bandwidth and storage resources, new requirements have been raised for existing video coding standards, such as higher resolution, higher image quality, and higher frame rate.
In order to achieve low complexity, high quality, and a high compression ratio, the International Telecommunication Union (ITU-T) and the International Organization for Standardization (ISO/IEC) set up the Joint Collaborative Team on Video Coding (JCT-VC) and released the next-generation video coding proposal, High Efficiency Video Coding (HEVC) [1,2], in January 2010. HEVC inherits the hybrid coding framework of H.264/AVC, which was launched by ITU-T and ISO/IEC in 2003. HEVC focuses on new video coding techniques that resolve the conflict between compression ratio and coding complexity. Moreover, HEVC aims to adapt to many different types of network transmission and to carry more information processing services [3]. It has become one of the hottest research areas in signal and information processing for "real-time," "high compression-ratio," and "high-resolution" technologies and applications [4,5].
To date, many scholars have carried out extensive work on fast video coding algorithms or on visual perception analysis, but few have combined the two techniques in a single video coding framework to jointly optimize video coding performance [6,7].
Tsapatsoulis et al. [8] detected the region of interest using color, brightness, direction, and complexion, but they ignored motion visual characteristics [9]. Wang et al. [10] built a visual attention model to extract the region of interest using motion, brightness, face, text, and other visual characteristics. Tang et al. [11,12] and Lin and Zheng [13] obtained the region of interest from motion and texture. Fang et al. [14,15] proposed region-of-interest extraction methods based on the wavelet transform or operating in the compressed domain. Because the global motion estimation algorithm is too complicated, it is difficult to extract the visual region of interest. The HVS-based video coding algorithms mentioned above focus on optimizing bit allocation under limited bit resources. They do not optimize the allocation of computing resources with respect to the region of interest, and they also neglect the additional computational complexity introduced by visual perception analysis.
On the other hand, Kim et al. [16] reduced the loss of rate-distortion performance under limited computing resources by controlling the number of motion estimation search points. Saponara et al. [17] adjusted the number of reference frames, the prediction mode, and the motion estimation search range according to the sum of absolute differences (SAD). Su et al. [18] set the parameters of motion estimation and mode decision to achieve a self-adaptive computational complexity controller. These computing resource optimizations do not distinguish regions according to the saliency of visual perception. Such algorithms apply the same coding algorithm to all content in a video and thus ignore differences in perception across video scenes.
Therefore, using visual perception principles to optimize computing resource allocation is of theoretical significance, and such optimization further improves the computational efficiency of the video coding standard. In this paper, H.264/AVC (JM17.0) is taken as the experimental platform, and visual perception analysis is combined with fast video coding algorithms so that their respective advantages complement each other. The proposed method uses visual perception principles to allocate computing resources more effectively and, on that basis, yields an efficient hierarchical video coding algorithm based on visual perception characteristics.

Visual Perception Characteristics Analysis for VPAL
Rapid and effective visual analysis that detects the visual region of interest is the key to optimizing coding resources. We propose an efficient hierarchical video coding algorithm based on visual perception characteristics.

Temporal Visual Characteristics Analysis and Detection.
Under ideal conditions, foreground movement produces a nonzero motion vector, which receives high attention from the HVS. Because the background has no relative movement, it produces a zero motion vector, which receives low attention from the HVS. The motion vector can therefore be regarded as the temporal characteristic for visual perception analysis. Under real conditions, however, changes in external lighting and in encoder settings such as the quantization parameter (QP), motion search strategy, and rate-distortion optimization introduce random nonzero motion vector noise in the background. In addition, horizontal displacement of the camera produces global motion vectors. It is therefore necessary to develop appropriate motion vector detection to filter out motion vector random noise and translational motion vector errors.

Motion Vector Random Noise Filtering.
The motion vector noise filter is based on the following principle. Owing to motion continuity and integrity, the movement characteristics of the current coding block are strongly correlated with those of the blocks at the corresponding positions in the previous frame. We define the motion reference region, denoted $C_{rr}$, as the set of encoded macroblocks in the previous frame whose positions are correlated with the current coding block. If the current coding block has a nonzero motion vector $\vec{V}$ but there is no motion vector in the reference region $C_{rr}$, then $\vec{V}$ is considered motion vector random noise and should be filtered.
Therefore, the definition of the reference region $C_{rr}$ is one of the key factors determining the result of motion vector noise filtering.
In this paper, taking a QCIF-format encoded video sequence as an example, $C_{rr}$ is defined as shown in Figure 1. Take macroblock $MB_1$, which has the same coordinates $(x, y)$ as the current coding block, as the initial search point, and move $i$ macroblocks horizontally in the direction opposite to $\vec{V}$ to reach macroblock $MB_2$. Then, starting again from $MB_1$, move $j$ macroblocks vertically in the direction opposite to $\vec{V}$ to reach macroblock $MB_3$. Next, extend vertically from $MB_2$ and horizontally from $MB_3$ to reach macroblock $MB_4$. The rectangular region bounded by the four macroblocks $MB_1$, $MB_2$, $MB_3$, and $MB_4$ is the motion reference region $C_{rr}$. Here, $(x, y)$ represents the position coordinates of the current coding block, and $\vec{V}_x$ and $\vec{V}_y$ represent the horizontal and vertical components of $\vec{V}$, respectively. The step counts $i$ and $j$ are defined by formula (1), in which $|\vec{V}_x|$ and $|\vec{V}_y|$ represent the amplitudes of $\vec{V}_x$ and $\vec{V}_y$, and $w$ and $h$ represent the width and height of the current coding block, respectively. If any of the three macroblocks $MB_2$, $MB_3$, and $MB_4$ lies outside the encoding frame, the macroblocks on the frame boundary are used as the corner points of the motion reference region $C_{rr}$.

Figure 1: Schematic diagram of the motion reference region $C_{rr}$.
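The construction of the motion reference region can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the argument names, and the use of the ceiling of amplitude divided by block size for the step counts are hypothetical choices, since the exact form of formula (1) is not reproduced here. Positions are in macroblock units and motion vector components are in pixels.

```python
import math

def motion_reference_region(x, y, vx, vy, w, h, mbs_wide, mbs_high):
    """Return (x_min, y_min, x_max, y_max) of the reference region, in macroblock units.

    (x, y)   : macroblock coordinates of the current coding block
    (vx, vy) : motion vector components in pixels
    (w, h)   : width and height of the current coding block in pixels
    """
    # ASSUMPTION: step counts are the ceiling of amplitude / block size.
    i = math.ceil(abs(vx) / w)
    j = math.ceil(abs(vy) / h)

    # Step in the direction opposite to the motion vector.
    step_x = -1 if vx > 0 else 1
    step_y = -1 if vy > 0 else 1
    x_far = x + step_x * i
    y_far = y + step_y * j

    # Corners that fall outside the frame are replaced by boundary macroblocks.
    x_far = max(0, min(x_far, mbs_wide - 1))
    y_far = max(0, min(y_far, mbs_high - 1))

    return (min(x, x_far), min(y, y_far), max(x, x_far), max(y, y_far))


# QCIF (176 x 144 pixels) holds 11 x 9 macroblocks of 16 x 16 pixels.
region = motion_reference_region(5, 4, 20, -8, 16, 16, 11, 9)
```

With these inputs the block steps two macroblocks left and one macroblock down, so the region spans columns 3 to 5 and rows 4 to 5.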
Motion vector random noise is detected according to formula (2):

$$T_1(x, y) = \begin{cases} 3, & |\bar{V}_{rr}| = 0 \\ 2, & |\vec{V}| \ge |\bar{V}_{rr}| \\ T_2(x, y), & \text{otherwise} \end{cases} \tag{2}$$

In formula (2), $(x, y)$ represents the coordinates of the current coding macroblock, and $\bar{V}_{rr}$ represents the mean motion vector in $C_{rr}$.
If $|\bar{V}_{rr}| = 0$, there is no motion vector in $C_{rr}$; $\vec{V}$ is set to 0 and $T_1(x, y) = 3$, which means that $\vec{V}$ is caused by motion vector random noise.
If $|\vec{V}| \ge |\bar{V}_{rr}|$, then $T_1(x, y) = 2$, which means that the current macroblock has more salient motion characteristics than the neighboring macroblocks and belongs to the foreground dynamic region.
Otherwise, $T_1(x, y)$ is set to $T_2(x, y)$, and the motion vector is then tested to determine whether it is translational. Translational motion vector detection distinguishes whether the macroblock belongs to the background region or to a foreground translational region, both of which have motion characteristics similar to those of the neighboring macroblocks.
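The three-way decision just described can be expressed as a short Python sketch. The function and argument names are hypothetical, and the mean motion vector is interpreted here as the component-wise mean over the reference region, which is an assumption. The translational test is passed in as a callback so that the sketch does not depend on how that second stage is defined.

```python
import math

NOISE = 3               # value stated in the text for random noise
FOREGROUND_DYNAMIC = 2  # value stated in the text for the foreground dynamic region

def classify_temporal(mv, region_mvs, translational_check):
    """Classify a macroblock with a nonzero motion vector mv = (vx, vy).

    region_mvs          : motion vectors found in the motion reference region
    translational_check : zero-argument callable returning the second-stage label
    """
    if region_mvs:
        # ASSUMPTION: component-wise mean of the vectors in the region.
        mean_x = sum(v[0] for v in region_mvs) / len(region_mvs)
        mean_y = sum(v[1] for v in region_mvs) / len(region_mvs)
    else:
        mean_x = mean_y = 0.0

    mean_mag = math.hypot(mean_x, mean_y)
    if mean_mag == 0:
        return NOISE  # no motion in the reference region: treat mv as noise
    if math.hypot(mv[0], mv[1]) >= mean_mag:
        return FOREGROUND_DYNAMIC  # more salient than the neighborhood
    return translational_check()   # defer to translational detection


label = classify_temporal((3, 4), [(1, 0), (1, 0)], lambda: 1)
```

Deferring the last case to a callback keeps the cheap noise test separate from the more expensive pixel-level comparison.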

Translational Motion Vector Detection. Consider

$$\mathrm{SAD}(x, y) = \sum_{i=1}^{M} \sum_{j=1}^{N} \left| s(i, j) - c(i, j) \right| \tag{3}$$

In formula (3), $(x, y)$ represents the coordinates of the current macroblock, $s(i, j)$ represents a pixel of the current macroblock, $c(i, j)$ represents the corresponding pixel of the macroblock at the same position in the previous frame, and $M$ and $N$ represent the numbers of pixels in the horizontal and vertical directions of the current macroblock, respectively.
The larger the value of $\mathrm{SAD}(x, y)$, the greater the difference between the corresponding macroblocks in neighboring frames; in this case the current macroblock belongs to the foreground translational region within a translational background. The SAD is compared against $\bar{S}$, the mean SAD of all macroblocks regarded as background in the previous frame:

$$\bar{S} = \frac{1}{\mathrm{Num}} \sum_{(x, y) \in B} \mathrm{SAD}(x, y) \tag{4}$$

In formula (4), $B$ represents the background region in the previous frame, $\sum_{(x, y) \in B} \mathrm{SAD}(x, y)$ represents the sum of the SAD values in $B$, and Num represents the number of terms in the sum.
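The SAD comparison described above can be illustrated with plain Python lists. The helper names are hypothetical, and the strict "greater than" comparison against the background mean is an assumption, because the text only says that a larger SAD indicates a foreground translational macroblock.

```python
def sad(current, previous):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(c - p)
        for row_c, row_p in zip(current, previous)
        for c, p in zip(row_c, row_p)
    )

def mean_background_sad(sad_map, background):
    """Mean SAD over the macroblocks labeled as background in the previous frame.

    sad_map    : dict mapping (x, y) macroblock coordinates to SAD values
    background : iterable of (x, y) coordinates of background macroblocks
    """
    values = [sad_map[pos] for pos in background]
    return sum(values) / len(values) if values else 0.0

def is_foreground_translational(sad_value, background_mean):
    # ASSUMPTION: strictly larger than the background mean means foreground.
    return sad_value > background_mean


sad_map = {(0, 0): 10, (1, 0): 20, (2, 0): 90}
threshold = mean_background_sad(sad_map, [(0, 0), (1, 0)])
```

In this toy frame the background mean is 15, so the macroblock at (2, 0) with SAD 90 would be treated as foreground translational, while the one at (0, 0) would stay in the background.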
In this paper, temporal visual characteristics analysis and detection are realized by preprocessing in two layers; the flowchart of the proposed algorithm is given in Figure 2.
The current encoding frame is divided into regions with different temporal visual perception characteristics according to $\vec{V}$ and the motion vector correlation between neighboring macroblocks. Figure 3 shows part of the experimental results for a typical video surveillance sequence (Hall), an indoor activity sequence (Salesman), and outdoor sequences with camera panning (Coastguard and Foreman); the proposed method separates foreground from background effectively.

Spatial Visual Characteristics Analysis and Detection.
Existing research results have shown that mode decision agrees well with visual attention. When the spatial visual characteristics vary sharply or the image content contains rich moving details, macroblocks select sub-block prediction modes in intraframe or interframe coding with high probability and receive high attention from human eyes. When the spatial visual characteristics vary slowly or the image content contains smooth movement, macroblocks select macroblock prediction modes in intraframe or interframe coding with high probability and receive low attention from human eyes [19,20]. In this paper, the prediction mode decision is regarded as the spatial characteristic for visual perception analysis, as defined in formula (6). In formula (6), the mode variables represent the prediction mode of the current macroblock in the corresponding frame.
If the mode chosen is an intra mode, $S(x, y) = 2$, which means that the spatial visual characteristic saliency is the highest and the macroblock belongs to the sensitive region.
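A mapping from prediction mode to spatial saliency degree might look like the following Python sketch. Only the value 2 for intra modes comes from the text. The values 1 and 0, the mode name strings, and the choice of which inter partitions count as sub-block modes are assumptions made for illustration, guided by the statement that sub-block modes attract more attention than macroblock modes.

```python
# Hypothetical mode names; partitions of 8x8 and smaller are treated as sub-block modes.
SUB_BLOCK_INTER_MODES = {"P8x8", "P8x4", "P4x8", "P4x4"}

def spatial_saliency(mode):
    """Return an assumed spatial saliency degree for a macroblock prediction mode."""
    if mode.startswith("Intra"):
        return 2  # stated in the text: intra mode has the highest saliency
    if mode in SUB_BLOCK_INTER_MODES:
        return 1  # ASSUMPTION: sub-block inter modes are moderately salient
    return 0      # ASSUMPTION: macroblock-level inter modes and skip are least salient


degrees = [spatial_saliency(m) for m in ("Intra4x4", "P8x8", "P16x16", "Skip")]
```

Because the mode is already chosen by the encoder, this lookup adds almost no computation, which is the point of reusing coding information for perception analysis.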

Hierarchical Coding Scheme for VCL
H.264/AVC achieves higher compression, but at the cost of continually increasing coding complexity, so real-time performance is a major challenge. Studies have shown that prediction mode decision and motion estimation (ME) account for approximately 80% of the computation in the encoder [21]. Building on previous research on fast mode decision and fast motion estimation algorithms, computing resources are optimized through intraprediction mode decision, interprediction mode decision, the motion estimation search range, and the number of reference frames. The hierarchical video coding scheme proposed here is developed based on the visual perception characteristic analysis results from the preceding sections.

Priority Setting for Visual Region of Interest.
In formula (7), $\mathrm{ROI}(x, y)$ represents the priority setting for the visual region of interest, $T(x, y)$ represents the salient degree of the temporal visual characteristic, $S(x, y)$ represents the salient degree of the spatial visual characteristic, and $(x, y)$ represents the coordinates of the current macroblock.

Settings for the Resource Allocation Optimization.
In order to improve real-time performance while maintaining video image quality and compression bit rate, macroblocks in the region of interest should be optimized first. Given limited computing resources and limited bit resources, the hierarchical video coding algorithm based on visual perception characteristics is proposed as shown in Table 1.
The fast intraprediction mode decision algorithm in Table 1 uses the macroblock histogram to define macroblock smoothness characteristics [19]. If the macroblock is flat, only the Intra 16 × 16 mode is chosen. If the macroblock is rough, only the Intra 4 × 4 mode is chosen. If the macroblock has nonsalient texture, both the Intra 16 × 16 and Intra 4 × 4 modes are traversed.
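The three-way intra candidate selection can be sketched as follows. The smoothness measure used here (the share of pixels falling in the most populated histogram bin) and both thresholds are hypothetical stand-ins, since the measure from [19] is not specified in this section. Only the mapping from flat, rough, and intermediate macroblocks to candidate modes follows the text.

```python
from collections import Counter

def intra_candidates(luma, flat_threshold=0.8, rough_threshold=0.3):
    """Return the intra modes to evaluate for one macroblock.

    luma : flat list of luminance samples of the macroblock
    """
    histogram = Counter(luma)
    # ASSUMPTION: smoothness = fraction of samples in the tallest histogram bin.
    peak_share = max(histogram.values()) / len(luma)

    if peak_share >= flat_threshold:
        return ["Intra16x16"]            # flat macroblock
    if peak_share <= rough_threshold:
        return ["Intra4x4"]              # rough macroblock
    return ["Intra16x16", "Intra4x4"]    # nonsalient texture: try both


flat_block = [128] * 256
modes = intra_candidates(flat_block)
```

Restricting the candidate list is where the time saving comes from: flat and rough macroblocks skip one of the two intra mode searches entirely.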
The fast interprediction mode decision algorithm in Table 1 uses early termination for specified modes, which are selected according to the probability of intermode decisions [20].
The fast motion estimation search algorithm in Table 1 uses a dynamic search strategy based on the correlation of motion vectors and the degree of motion of the coding block [22].
Figure 5: Comparison of the rate-distortion performance (PSNR-Y in dB).

The Scientific World Journal

Description: The symbol "+" means enhancement or increase; the symbol "−" means decrement or decrease. PSNR-Y means the peak signal-to-noise ratio of luminance, which represents the quality of the reconstructed video image. ΔPSNR-Y means the difference in PSNR-Y. ΔROI-PSNR-Y means the ΔPSNR-Y in the nonzero region of the visual perception characteristics mark.

Experimental Results and Analysis
To verify the rationality and performance of the proposed hierarchical video coding algorithm, experiments were performed.
The video coding framework diagram proposed in this paper is shown in Figure 4.

Environment and Configuration.
(i) The standard video sequence: see Table 3.

Experimental Results and Performance Analysis. The statistical results in Table 2 show the performance of the hierarchical video coding scheme compared with the H.264/AVC (JM17.0) standard algorithm on ten typical sequences.
Compared with the H.264/AVC standard algorithm, at QP values of 28, 32, and 36, the hierarchical video coding algorithm reduces coding time by 78.55%, 78.88%, and 79.22% on average, while the bit rate increases by 1.93%, 1.74%, and 1.27% on average (less than 3%). The PSNR-Y decreases by 0.188 dB on average (the maximum decrease is less than 0.3 dB). In the nonzero region of the visual perception characteristics mark, which is the region of human visual attention, the PSNR-Y decreases by 0.153 dB on average (the maximum decrease is less than 0.25 dB). Relative to the region that is not of visual interest, the hierarchical video coding scheme therefore gives priority to preserving quality in the region where visual perception characteristics are salient.
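As a small worked example of how such relative figures are usually computed, the sketch below uses the common signed percentage change relative to the reference encoder, where a negative value means a decrease. The function name and the sample timings are hypothetical. The three per-QP time savings are the values quoted above.

```python
def percent_change(reference, proposed):
    """Signed change of `proposed` relative to `reference`, in percent."""
    return (proposed - reference) / reference * 100.0


# Hypothetical timings: 100 s for the reference encoder, 21.45 s for the proposed one.
delta_time = percent_change(100.0, 21.45)

# Average of the coding-time reductions reported for QP 28, 32, and 36.
reported_savings = [78.55, 78.88, 79.22]
average_saving = sum(reported_savings) / len(reported_savings)
```

The average of the three reported reductions is about 78.9%, which is consistent with the roughly 80% saving stated in the abstract.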
In terms of bit rate control, the two rate-distortion curves are very close as shown in Figure 5. It means that the proposed method inherits the advantages of low bit rate and high quality in H.264/AVC.
In terms of reconstructed video image quality, the proposed method keeps the average PSNR reduction below 0.2 dB, which is less than the minimum difference perceptible to the human eye (0.5 dB). It therefore maintains good reconstructed video image quality.
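For reference, the luminance PSNR used in these comparisons follows the standard definition for 8-bit video, ten times the base-10 logarithm of the squared peak value divided by the mean squared error. The sketch below is a minimal pure-Python version. The function name and the list-of-rows frame representation are choices made for this example.

```python
import math

def psnr_y(reference, reconstructed, peak=255):
    """PSNR in dB between two luminance frames given as lists of rows."""
    squared_error = 0
    count = 0
    for row_ref, row_rec in zip(reference, reconstructed):
        for a, b in zip(row_ref, row_rec):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)


original = [[100, 100], [100, 100]]
decoded = [[101, 99], [101, 99]]
quality = psnr_y(original, decoded)
```

Every sample in this toy frame is off by one, giving a mean squared error of 1 and a PSNR of about 48.13 dB. ΔPSNR-Y is then the difference between two such values for the proposed and reference encoders.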
In terms of coding speed, the statistical results in Figure 6 show that the computational complexity of the proposed method is lower than that of the coding algorithm in reference [14] and of H.264/AVC (JM17.0). It reduces coding time by about 85% on average compared with the standard H.264/AVC algorithm and is well suited to sequences with gentle movement, such as Akiyo and News.
Extensive experimental results show that the proposed hierarchical video coding scheme based on visual perception analysis accelerates coding while maintaining good subjective video image quality. The results also demonstrate the feasibility of the low-complexity visual perception analysis method based on coding information. The consistency between the saliency degree of the visual perception characteristics and the HVS indicates the rationality of the hierarchical video coding algorithm based on visual perception characteristics.

Conclusion
This paper presents an efficient hierarchical video coding scheme based on visual perception characteristics. To achieve high coding performance, the scheme adopts a video coding framework consisting of a video coding layer and a visual perception analysis layer. On the one hand, the two layers greatly reduce computation time: the visual perception analysis layer uses the video stream information from the coding layer to extract the visual region of interest. On the other hand, the two layers allocate coding resources reasonably: the video coding layer uses the visual perception characteristics from the perception analysis layer. Together, these techniques form a hierarchical video coding method and improve coding performance effectively. Experimental results show that the proposed algorithm maintains good video image quality and coding efficiency while improving the allocation of computational resources in H.264/AVC. The proposed algorithm balances good subjective video quality, high compression bit rate, and fast coding speed, and it lays a foundation for further study of fast video coding algorithms in HEVC.