A HARDWARE-EFFICIENT BLOCK MATCHING UNIT FOR H.265/HEVC MOTION ESTIMATION ENGINE USING BIT-SHRINKING

The main objective of this work is to enhance the processing performance of the recently introduced video codec H.265/HEVC. Since most of the computations of H.265/HEVC still occur in the motion estimation engine, which is inherited from its predecessor H.264/AVC, we propose a bit-shrinking approach with a modified logic functionality to design an efficient and simplified block matching unit that replaces the already used Sum of Absolute Differences (SAD) unit. The hardware complexity of the proposed unit itself is reduced, and the number of its generated output bits is reduced as well, which in turn simplifies all the subsequent units of motion estimation. The hardware complexity, the consumed power and the processing delay of the motion estimation engine are therefore reduced significantly with only marginal deterioration in both the bit-rate and the peak signal-to-noise ratio (PSNR) of the tested High Definition (HD) and Ultra-High Definition (UHD) H.265/HEVC compressed videos. We simulate our design using HM16.6 and perform system logic synthesis using Synopsys's Design Compiler, targeting ASIC, for evaluation purposes.


INTRODUCTION
For almost a decade, H.264/AVC [1] has been the de facto standard for video coding. Its impressive performance opened wide doors for watching and exchanging videos almost everywhere. Although H.264/AVC can handle High Definition (HD) videos, the sizes of these videos using H.264/AVC are still a concern, especially when using smart handheld devices with limited storage and power constraints. Recently, H.265/HEVC [2] has been introduced in the literature to provide better video coding performance than the legacy H.264/AVC. The former can provide up to twice the compression ratio of the latter, while maintaining the same video quality. Put differently, the former can provide much better video quality than the latter for the same video compression ratio. This huge increase in performance is due to the many enhanced techniques and methodologies that have been introduced in H.265/HEVC. Some of these enhancements have tackled the block matching criterion of motion estimation, the powerful engine of video coding. Although block matching in H.265/HEVC still relies on the Sum of Absolute Differences (SAD), which is inherited from H.264/AVC, the maximum block size has been enlarged in the former to 64x64 pixels instead of 16x16 pixels. Thus, the number of computations, and consequently the number of SAD components and their constituent logic gates, vastly increased in H.265/HEVC. Hence, the hardware complexity, the processing delay and the power consumption of the logic gates increased vastly as well. Taking the processing delay as an example, Grois et al. conducted experiments in a recent work [3] to compare the performance of H.264/AVC, VP9 [4] and H.265/HEVC. They reported that the typical total encoding time of VP9 is around 130 times higher than the total encoding time of H.264/AVC and 7.35 times lower than the total encoding time of H.265/HEVC for the same PSNR value.
This means that the total encoding time of H.265/HEVC is 955.5 times higher than that of H.264/AVC for the same PSNR value, which is actually a huge increase in the encoding time. The motion estimation engine by itself consumes around 60% to 80% of the total encoding time of H.264/AVC [5], while it consumes around 80% of the total encoding time of H.265/HEVC [6]. Thus, a pragmatic strategy to elevate the performance of H.265/HEVC relies on enhancing the performance of the core component of the motion estimation engine which is the block matching unit.
The main focus of this work is the block matching unit of the motion estimation engine. We propose in this paper a simplified and hardware-efficient block matching unit that replaces SAD. The framework of the proposed work is the motion estimation engine of the state-of-the-art video codec, H.265/HEVC. The main contributions and the differences from others in the literature are explained thoroughly in the following section.
The rest of the paper is organized as follows. We survey some of the related work in the literature and clarify our contribution in section 2. We then discuss the proposed designs of the block matching unit in section 3. We provide numerical analysis and evaluations with discussion in section 4 and finally conclude the paper in section 5.

LITERATURE REVIEW AND CONTRIBUTION
Many works in the literature have considered simplifying either the software and/or the hardware design of motion estimation in order to enhance the performance of video processing. Fast matching algorithms such as three steps search [7], four steps search [8], hexagon-based search [9], diamond search [10] and adaptive rood pattern search [11] tend to shrink the number of matched macro-blocks immensely. Most of these algorithms are only suitable for software implementations which are not as efficient as hardware implementations for real-time applications. Online arithmetic [12] and saturation arithmetic [13] are used to reduce the computational complexity of SAD. Nevertheless, the tree-adder implementation, where the SAD computations are performed, is still either time consuming or hardware costly. Vanne et al. performed in [14] arithmetic operations accompanied with several early termination mechanisms and sophisticated SAD computation control. Another approach based on SAD [15] takes the difference pixel count (DPC) as the selection criterion. Yeo et al. [16] use an XOR function instead of adders to simplify the matching criterion. One of the most recent promising approaches is to reduce the pixel resolution from eight bits to fewer bits. The works introduced in [17]- [20] use a one-bit transform by converting video frames into a single-bit-plane. On the other hand, the works introduced in [21]- [22] follow a bit-truncation approach where least significant bits are eliminated to simplify the hardware. One-bit transform and bit-truncation approaches deteriorate both the compression ratio and the video quality. The amount of deterioration in the latter depends on the number of truncated bits. Recently, Manjunatha and Sainarayanan recommended in [23] the use of a 1-bit full adder which consists of XOR, AND and OR gates instead of the commonly used 1-bit full adder which consists of XOR and NAND gates for performing SAD. 
In their work, they showed enhancements in the consumed power, latency and area when using the former rather than the latter. Following a different approach, we introduced in [24] a modified XOR function that replaces conventional SAD of the motion estimation engine of H.264/AVC. We showed enhancements over many proposed designs in the literature. All of the aforementioned authors who evaluated their works based on video coding adopted the motion estimation engine of H.264/AVC with a maximum block size of 16x16 and low-quality videos, CIF and QCIF, in their evaluations. Some recent works, which consider the enhancements of the motion estimation engine of H.265/HEVC, have also been proposed in the literature. We discuss some of the efforts which consider hardware implementations as follows. Sanchez et al. evaluated in [25] the use of the Multi-Point Diamond Search (MPDS) algorithm [26] in H.265/HEVC. They found that it is more hardware-friendly than the Enhanced Predictive Zonal Search (EPZS), which is implemented in the standard, at the expense of small amounts of deterioration in the video quality and compression ratio. Sinangil and Sze proposed in [27] a new hardware-aware search algorithm for HEVC motion estimation. They reported enhancements in area and bandwidth when using their algorithm. Jaja et al. proposed in [28] two fast motion estimation algorithms based on the structure of the triangle and the pentagon for H.265/HEVC. In their experimental evaluations, they found that the proposed algorithms can offer up to 63% and 61.9% speed-up in run-time when compared with the original algorithms of the standard. Medhat et al. proposed in [31] a high-throughput motion estimation system to process Ultra-High Definition (UHD) videos in H.265/HEVC. The system embeds two parallel processing paths for the integer-pel and the fractional-pel motion estimation.
Their synthesis showed that the system is able to encode UHD videos at 30fps with only small deteriorations in PSNR and compression ratio. Ye et al. proposed in [32] a parallel clustering tree search (PCTS) algorithm for integer-pel motion estimation that processes the prediction units (PU) simultaneously with a parallel scheme. The hardware implementation of PCTS can support quad-full HD (QFHD) videos at 30fps in real-time. All of the aforementioned efforts still rely on the conventional SAD as the block matching unit when designing algorithms to enhance the motion estimation of H.265/HEVC.
In this paper, we introduce several enhancements on our previous work [24]. They are mentioned in the following. Our new modified block matching unit is implemented in the motion estimation engine of H.265/HEVC which has a different hardware-efficiency with the increased block size than the one used by H.264/AVC. We also introduce in this paper a new bit-shrinking approach to reduce the number of generated output bits of the matching unit which is reflected on all subsequent stages of the motion estimation engine. We perform system logic synthesis using Synopsys's Design Compiler [33], targeting ASIC, to evaluate our design and compare it with the conventional SAD and other works in the literature. We consider the number of gates, the consumed power and the processing delay as performance metrics for evaluation purposes. The obtained results show the superiority of our design in all performance metrics. We also run extensive simulations using HM16.6 [34] to measure the video quality and the compression ratio. We apply our simulations on both HD and UHD videos for evaluation purposes and consider a block size of 64x64.
It is worth mentioning that the proposed block matching unit can be utilized by many motion estimation algorithms in the literature, such as the ones proposed in [25]-[32], to replace conventional SAD and hence enhance performance.

PROPOSED BLOCK MATCHING UNIT
Let us first give a brief description of the sum of absolute differences operation. SAD is a measure of similarity between a block of pixels of the present frame and a block of pixels of a previous reference frame of a video. It estimates motion between the considered frames to remove redundant information and consequently reduces the sizes of the videos. Considering $N \times N$ as the macro-block size, $C_t(i, j)$ as the current pixel of a macro-block, $R_{t-1}(i+u, j+v)$ as the candidate pixel of a macro-block of a reference frame and $[-p, p-1]$ as the search range, SAD is defined by the following equation.
$$SAD(u, v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| C_t(i, j) - R_{t-1}(i+u, j+v) \right|; \quad -p \le u, v \le p-1 \quad (1)$$
The absolute difference function in Equation (1) can be described as:
$$|a - b| = \begin{cases} a - b, & a \ge b \\ b - a, & a < b \end{cases} \quad (2)$$
where $a$ and $b$ denote the two pixel values being compared. The sum of absolute difference function involves the comparison of two pixels. These pixels have unsigned representations of luminance values. The unsigned nature of these pixels burdens the system, since their difference may be negative. To avoid this, the sum of absolute difference function is designed in several ways in hardware. One way is to subtract the two unsigned numbers and then convert a negative result into a positive one with the aid of an XOR gate, as shown in Figure 1. Another design subtracts the first number from the second one and the second number from the first one simultaneously, and then selects the positive result via a multiplexer. The former implementation encounters a longer delay than the latter, since the critical path goes through two adders in the former while it goes through only one adder and one multiplexer in the latter. Note that the two adders in the second design operate in parallel.
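The SAD criterion and the two-subtractor design described above can be illustrated in software. The following is a minimal sketch, not the hardware design itself: the function names, the displacement `(u, v)`, the half-range `p` and the frame layout (lists of rows) are assumptions made only for this example.

```python
def abs_diff_mux(a: int, b: int) -> int:
    """Absolute difference in the style of the two-subtractor design:
    both differences are formed, then the non-negative one is selected."""
    d1 = a - b
    d2 = b - a
    return d1 if d1 >= 0 else d2  # plays the role of the multiplexer


def sad(cur, ref, top, left, u, v, n):
    """SAD between the n x n block of `cur` at (top, left) and the block
    of `ref` displaced by (u, v). Frames are lists of rows of integers."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += abs_diff_mux(cur[top + i][left + j],
                                  ref[top + i + u][left + j + v])
    return total


def full_search(cur, ref, top, left, n, p):
    """Exhaustive search over displacements in [-p, p-1]. Candidates that
    fall outside the reference frame are skipped (an assumption of this
    sketch). Returns (best_cost, u, v), or None if nothing fits."""
    h, w = len(ref), len(ref[0])
    best = None
    for u in range(-p, p):
        for v in range(-p, p):
            inside = (0 <= top + u and top + u + n <= h
                      and 0 <= left + v and left + v + n <= w)
            if not inside:
                continue
            cost = sad(cur, ref, top, left, u, v, n)
            if best is None or cost < best[0]:
                best = (cost, u, v)
    return best
```

In this sketch the cost of one block match grows with the square of the block size, and the full search repeats it for every displacement in the range, which is why the matching unit dominates the workload.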
There are several hardware implementations for the full adder. The basic, simplest and most widely used implementation is the Ripple Carry Adder (RCA). As the name indicates, the carry ripples from the n-th bit to the next one and so on until it reaches the most significant bit. Thus, the most significant bit of the output cannot be calculated until all the preceding bits are calculated one after another. It can be deduced that the larger the number of calculated bits, the longer the delay in generating the final result. Another implementation of a full adder is the Carry Look Ahead Adder (CLA). The latter is faster than the RCA, but more complex, which leads to a larger implementation area and more power consumption. Considering the circuit shown in Figure 1(b), the sum of absolute difference function for one pixel matching is implemented using two 8-bit full adders and one multiplexer. These units are built from many logic gates as shown in Figure 2. Each macro-block matching consists of many SAD operations. Since a very large number of macro-block matchings occur for motion estimation during the processing of a video, an enormous number of SAD operations is performed. For example, it requires almost 66,846,720 SAD operations or 534,773,760 SAD operations to process only one single frame of an HD video with the resolution of 1920x1080 or a UHD video with the resolution of 3840x2160, respectively, using a 64x64 macro-block size and a [−64, 63] search range. Furthermore, performing SAD operations on different block sizes increases these numbers by multiples. This is hardware-costly and also makes the video processing encounter considerable delay and power consumption. Hardware implementation with parallel SAD operations is the ultimate solution to meet the real-time constraints when using conventional SAD in full search motion estimation.
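The carry chain of a ripple carry adder can be made concrete with a bit-level model. The sketch below is illustrative only: it uses a textbook 1-bit full adder built from XOR, AND and OR operations, and the function names and the default width of 8 bits are assumptions of the example, not details taken from Figure 2.

```python
def full_adder(a: int, b: int, cin: int):
    """Textbook 1-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x: int, y: int, width: int = 8, cin: int = 0):
    """Add two `width`-bit numbers one bit at a time. Each stage needs the
    carry of the stage below it, which is the source of the long delay.
    Returns (sum modulo 2**width, carry_out)."""
    carry = cin
    result = 0
    for k in range(width):
        s, carry = full_adder((x >> k) & 1, (y >> k) & 1, carry)
        result |= s << k
    return result, carry


def ripple_carry_sub(x: int, y: int, width: int = 8):
    """Compute x - y as x + (~y) + 1. A carry_out of 1 means x >= y,
    a carry_out of 0 means the result is negative (two's complement)."""
    mask = (1 << width) - 1
    return ripple_carry_add(x, (~y) & mask, width, cin=1)
```

For example, `ripple_carry_sub(3, 10)` yields a carry-out of 0, which is the condition a SAD unit uses to decide that the opposite difference must be selected.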
Bit-truncation has been proposed in [21]-[22] as a promising approach to reduce the cost of the conventional SAD unit. The main concept of bit-truncation is to perform conventional SAD on pixels whose least significant bits have been eliminated. Without truncation, the number of generated output bits is 8.
The larger the number of truncated bits, the simpler the SAD operation. As can be deduced from Equation (3), the bit-truncation approach does not really simplify the full adder circuit itself, so the propagation delay is shortened only by the number of truncated bits. The bit-truncation approach also requires a special memory design to support the variable number of bits that are fed into the SAD unit [21].
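A short sketch may help to picture bit-truncation. It assumes one common reading of the approach, namely that the least significant bits of both 8-bit pixels are dropped before the absolute difference is taken; the exact form used in [21]-[22] may differ, and the names `truncated_abs_diff` and `ntb` are hypothetical.

```python
def truncated_abs_diff(a: int, b: int, ntb: int) -> int:
    """Absolute difference of two 8-bit pixels after dropping the `ntb`
    least significant bits of each (assumed reading of bit-truncation)."""
    return abs((a >> ntb) - (b >> ntb))


def truncated_output_bits(ntb: int, pixel_bits: int = 8) -> int:
    """Number of output bits left after truncating `ntb` bits."""
    return pixel_bits - ntb
```

Under this reading, truncating two bits leaves six output bits and truncating five bits leaves three, while the subtraction on the remaining bits is still an ordinary adder-based operation.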
Considering the hardware-costly SAD and its long processing time and large power consumption, we introduced in [24] a block matching unit with modified XOR functionality that produces almost the same outputs as the SAD unit for the many different combinations of the inputs and is built from a much smaller number of logic gates. It is described by Equations (4) and (5), whose operands are the individual bits of the two matched pixels, each identified by its bit position within the pixel and by the frame the pixel belongs to. We denote the matching unit, described by Equation (5) and shown in Figure 3, by MXOR.
The proposed unit outperforms SAD in terms of hardware-complexity, processing delay and power consumption as explained briefly below.
In a regular Ripple Carry Adder, which is used in the computation of SAD as mentioned before, the highest significant bit of a result depends on the carry which is generated after summing all lower significant bits. This encounters a long delay, since the carry keeps propagating through the adder as explained earlier. In the proposed matching unit, the computation of each output bit depends only on the neighbouring lower significant bit. Thus, the propagation of bits is limited to only one bit position. This saves a considerable amount of time in producing the result of the matching process. Furthermore, the circuit of the proposed matching unit which evaluates each bit is composed of only an XOR gate, an inverter and an AND gate. Hence, the number of constructing logic gates of the proposed unit is enormously less than that of SAD. This is clearly illustrated when observing and comparing the units shown in Figure 2 and Figure 3. The simplification in designing the unit leads to a shorter processing delay along with a smaller unit area and less power consumption.
Since most, but not all, of the output bits of SAD and the modified XOR match, a deterioration in the performance of the video codec occurs. By implementing the modified XOR in the motion estimation engine of the H.264/AVC video codec and taking sample CIF and QCIF videos, we showed in [24] that the amounts of deterioration in both video quality and compression ratio are marginal. In this work, we introduce a new approach, bit-shrinking, to further reduce the computational burden of motion estimation based on the previously introduced MXOR. It is explained as follows.
Bit-shrinking reduces the number of the generated output bits of the matching unit. This is directly reflected on the following stages; namely the parallel adder tree and the compare and select unit shown in Figure 4. The purpose of the adder tree is to accumulate the generated output bits of the matching unit for each sub-block. Therefore, reducing the number of the output bits will obviously reduce the complexity of the processing elements and the used registers in the following stages. Hence, the hardware architecture is simplified and all the related performance metrics in terms of processing delay, power consumption and hardware cost are reduced.
Considering Equation (4) as the general form representation of the block matching unit, the per-pixel matching function in that equation, written with a hat, represents several circuit designs according to the level of bit-shrinking. To simplify the Boolean representation of this function, let us first define $\Omega$ and $\Psi$ in terms of the individual bits of the two matched pixels, each identified by its bit position and by the frame the pixel belongs to.
First, by combining the lowest three output bits of the unit shown in Figure 3 through an XOR gate, one bit is generated. It represents the least significant bit of the output and is described by $2^{0}(\Omega_0 \oplus \Omega_1 \oplus \Omega_2)$. The remaining bits are shifted by $2^{-2}$. Thus, the number of the generated output bits is reduced from eight to six. The matching unit is denoted by MXOR2 and shown in Figure 5. It is also described by Equation (8).
Combining the 2nd and 3rd output bits of the unit shown in Figure 5 through another XOR gate reduces the number of the generated output bits to five rather than six. The least significant bit is described by $2^{0}(\Omega_0 \oplus \Omega_1 \oplus \Omega_2)$. The next higher significant bit, which is generated based on the new combination, is described by $2^{1}(\Psi_3 \oplus \Psi_4)$ and the remaining output bits are shifted by $2^{-3}$. The matching unit is denoted by MXOR3 and shown in Figure 6. It is also described by Equation (9).
In a different manipulation, combining the 2nd, 3rd and 4th output bits of the unit shown in Figure 5 through an XOR gate reduces the number of the generated output bits to four. The least significant bit is described by $2^{0}(\Omega_0 \oplus \Omega_1 \oplus \Omega_2)$. The next higher significant bit, which is generated based on the new combination of the three bits, is described by $2^{1}(\Psi_3 \oplus \Psi_4 \oplus \Psi_5)$ and the remaining output bits are shifted by $2^{-4}$. The matching unit is denoted by MXOR4 and shown in Figure 7. It is also described by Equation (10).
Furthermore, combining the two most significant output bits of the unit shown in Figure 7 through an XOR gate reduces the number of the generated output bits to only three bits. This matching unit is denoted by MXOR5 and is described by Equation (11).
The generated output bits of the proposed matching units deviate from the original generated output bits of SAD. The stronger the shrinking, the larger the deviation. This leads to a degradation in the coding performance of the video codec, which is illustrated by deteriorations in both the bit-rate and PSNR. In the following section, we show that the amounts of deterioration are marginal and that our proposed approach actually pays off.
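The grouping of output bits performed by bit-shrinking can be sketched as follows. The example treats the eight output bits of the unit of Figure 3 as a given 8-bit word and only models how they are merged through XOR gates; it does not model how those eight bits are produced, and the table `GROUPS` and the function names are assumptions of the sketch.

```python
from functools import reduce
from operator import xor

# Groups of output-bit positions of the 8-bit unit, least significant first.
# All bits of a group are XOR-ed into a single output bit.
GROUPS = {
    "MXOR":  [[0], [1], [2], [3], [4], [5], [6], [7]],
    "MXOR2": [[0, 1, 2], [3], [4], [5], [6], [7]],
    "MXOR3": [[0, 1, 2], [3, 4], [5], [6], [7]],
    "MXOR4": [[0, 1, 2], [3, 4, 5], [6], [7]],
    "MXOR5": [[0, 1, 2], [3, 4, 5], [6, 7]],
}


def shrink(word: int, unit: str) -> int:
    """Merge the bits of an 8-bit matching result according to `unit`."""
    out = 0
    for pos, group in enumerate(GROUPS[unit]):
        bit = reduce(xor, ((word >> k) & 1 for k in group))
        out |= bit << pos
    return out


def output_bits(unit: str) -> int:
    """Number of output bits generated by the given unit."""
    return len(GROUPS[unit])
```

With these groups, `output_bits` gives 8, 6, 5, 4 and 3 bits for MXOR through MXOR5, matching the counts stated in the text. Because XOR merges bits rather than discarding them, every input bit still influences the shrunk result.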

NUMERICAL ANALYSIS AND EVALUATIONS
First, we perform ASIC system logic synthesis using Synopsys's Design Compiler (version I-2013.12-SP5-10, with the TSMC 90nm general-purpose nominal-threshold-voltage library) for the proposed matching units, the conventional SAD, the modified implementation of SAD [23] (we denote it by MISAD) and finally the matching units of the bit-truncation approach. We consider the latter, since this approach is the closest to our proposed bit-shrinking approach. Taking the number of generated output bits as a reference, our units described by Equations (8), (9), (10) and (11) are analogous to the units which generate six output bits (NTB2), five output bits (NTB3), four output bits (NTB4) and three output bits (NTB5), respectively, of the bit-truncation approach.
We consider the following performance metrics for evaluation purposes: the hardware complexity in terms of the number of two-input NAND gates, the processing delay in terms of the critical path in nanoseconds and the consumed power in microwatts. The power estimation is under a global operating voltage of 1.1V, a load capacitance of 0.2 pF, an operating frequency of 100MHz and with a medium confidence level. Table 1 shows the obtained results. From the table, there is a huge increase in performance when using any of the proposed matching units compared with the conventional SAD. The number of two-input NAND gates, consumed power and processing delay of MXOR, which generates the same number of output bits as the conventional SAD, are only 18%, 19% and 14%, respectively, of the corresponding values of the conventional SAD, while these values increase to 24%, 24% and 31% of the conventional SAD, respectively, when considering MXOR5, which generates only 3 output bits. Table 1 also shows that the results of MISAD are very close to the results of the conventional SAD. The number of two-input NAND gates, consumed power and processing delay of MISAD are 96.5%, 98.7% and 90.9%, respectively, of the corresponding values of the conventional SAD, which indicates a very modest improvement of this contemporary work [23]. Table 1 also shows the superiority of our matching units when compared with the matching units of the bit-truncation approach with the same number of generated output bits. Thus, in terms of hardware efficiency illustrated by the adopted three metrics, our proposed matching units outperform the conventional SAD, MISAD and all the matching units of the bit-truncation approach.
The benefits gained by bit-shrinking are not limited to the matching unit only, as we mentioned before; they also extend to the later stages of the motion estimation engine (shown in Figure 4). Bit-shrinking reduces the complexity of the adder tree in terms of a lower-width adder and smaller accumulating registers for each sub-block in a 64x64 macro-block. Furthermore, the compare and select unit is also reduced in terms of the sizes of the compare units as well as the used registers. For example, the size of each accumulating register reduces from 12 bits, 14 bits, 16 bits, 18 bits and 20 bits for 8-bit pixel matching in a conventional SAD for sub-block sizes of 4x4, 8x8, 16x16, 32x32 and 64x64, respectively, down to only 7 bits, 9 bits, 11 bits, 13 bits and 15 bits, respectively, when using MXOR5.
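The register sizes quoted above follow from a worst-case argument: the accumulator must hold the largest per-pixel output added over all pixels of the sub-block. The sketch below reproduces that arithmetic; the function name is hypothetical, and the worst-case reasoning is our reading of how the widths arise rather than a statement from the synthesis flow.

```python
def accumulator_bits(output_bits: int, block_size: int) -> int:
    """Bits needed to accumulate the matching outputs of one
    block_size x block_size sub-block without overflow."""
    pixels = block_size * block_size
    worst_case_sum = pixels * (2 ** output_bits - 1)
    return worst_case_sum.bit_length()


if __name__ == "__main__":
    for size in (4, 8, 16, 32, 64):
        print(size, accumulator_bits(8, size), accumulator_bits(3, size))
```

For power-of-two block sizes this reduces to the number of output bits plus the base-2 logarithm of the pixel count, so every output bit removed by bit-shrinking removes one bit from every accumulator and comparator downstream.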
The significant performance elevation introduced by our units may boost the widespread adoption of H.265/HEVC, especially in modest devices such as smartphones which have storage and power constraints. Nevertheless, we have to show that the compression ratio and video quality are not much affected by the proposed units to support our statement. Therefore, we perform intensive simulations taking into consideration both the compression ratio and video quality for different kinds of frame sequences of the standard. We perform the simulations using HEVC reference software HM16.6 to compare the video quality illustrated by PSNR and the compression ratio illustrated by bit-rate. The PSNR is the most frequently used indicator by the research community to measure picture quality. It is defined by:
$$PSNR = 20 \log_{10} \left( \frac{255}{\sqrt{MSE}} \right); \quad (12)$$
where 255 is the largest pixel value for 8-bit representation and MSE is the Mean Square Error between a noise-free M×N monochrome frame I and its noisy approximation K. It is defined by:
$$MSE = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left[ I(i, j) - K(i, j) \right]^2. \quad (13)$$
The simulations are conducted first on six HD videos: Tennis, Beauty, Bosphorus, Honey Bee, Jockey and Ready Steady Go. The resolution of these videos is 1080p. We consider main profile level 6.2 with Random-Access (IBBB frame sequences) and Low-Delay-P (IPPP frame sequences) configurations. The block search range is [−64, 63] with maximum CU size 64 and maximum CU partition depth 4 with full search motion estimation. The number of frames taken in each sequence is 120. The simulations are carried out on a Windows 8 platform with an Intel i7 Extreme CPU @ 2.93GHz and 8GB RAM. Table 2 shows the obtained results of PSNR and the bit-rate of a conventional SAD as well as the deviations in these two metrics when using all matching units compared with SAD. As expected and discussed in the previous section, due to the partial mismatch of the generated output bits between the proposed units and the conventional SAD, a deterioration occurs in both PSNR and bit-rate.
Still, the amount of decrease in PSNR and the amount of increase in bit-rate are very small, as clearly illustrated by the results obtained from the simulations performed on all the tested videos. Among our proposed matching units, the maximum deviation of PSNR occurs when processing the Tennis video with a value of -0.04dB using MXOR5, while the maximum increase in bit-rate also occurs when processing the Tennis video with a value of 1.5%. These values are considered very small and hence the deteriorations are actually marginal even at the peaks. Table 3 summarizes the deviations in PSNR and bit-rate by showing averages. Note that the amount of degradation in the performance is very small when using our approach compared with the conventional SAD. The amount of degradation in PSNR ranges from 0.001dB and 0.002dB when considering MXOR to 0.013dB and 0.018dB when considering MXOR5 for IPPP and IBBB frame sequences, respectively. On the other hand, the amount of increase in the bit-rate ranges from 0.09% and 0.07% when considering MXOR to 0.543% and 0.684% when considering MXOR5 for IPPP and IBBB frame sequences, respectively. It can be deduced from the table that the amounts of deterioration in both PSNR and bit-rate are small when using either the bit-truncation or the bit-shrinking approach. Thus, we can confidently state that the video quality and the compression ratio are not practically affected when following these approaches. Note that MISAD performs the SAD operation with a different implementation of the constituent 1-bit full adder unit, as explained earlier in Section 2. Therefore, the generated output bits of MISAD and the conventional SAD are exactly the same. Hence, there is no deterioration in either bit-rate or PSNR when using MISAD instead of the conventional SAD.
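For readers who wish to reproduce the quality metric, the PSNR of Equation (12) can be computed as in the following sketch. It assumes 8-bit monochrome frames stored as lists of rows and uses the standard mean square error; the function names are illustrative and are not taken from HM16.6.

```python
import math


def mse(ref, test):
    """Mean square error between two frames of identical dimensions."""
    m, n = len(ref), len(ref[0])
    total = sum((ref[i][j] - test[i][j]) ** 2
                for i in range(m) for j in range(n))
    return total / (m * n)


def psnr(ref, test, peak: int = 255):
    """PSNR in dB; returns infinity when the two frames are identical."""
    err = mse(ref, test)
    if err == 0:
        return math.inf
    return 20 * math.log10(peak / math.sqrt(err))
```

Since PSNR is logarithmic, the reported deviations of a few hundredths of a dB correspond to a change of well under one percent in the mean square error.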
To evaluate the proposed matching units in processing very high resolution videos, we also conduct simulations on 4K UHD videos with the resolution of 3840x2160. The obtained results of the extensive simulations that are performed on both HD and UHD videos using HM16.6 validate the correctness of the functionalities of the proposed matching units in replacing SAD. The videos have been processed and encoded successfully using H.265/HEVC with marginal deteriorations in the values of both PSNR and bit rate as illustrated in Table 3 and Table 4.
Figure 9. (a) The code of conventional SAD operation of HM16.6 (b) The code of MXOR implemented in HM16.6.

CONCLUSION
We proposed in this paper a hardware-efficient block matching unit for the H.265/HEVC motion estimation engine. We followed a bit-shrinking approach with a modified logic functionality to reduce the number of constituent two-input NAND gates, the consumed power and the processing delay of the matching units. We performed extensive simulations to evaluate our design. We considered HD and UHD videos in our evaluations, since their usage has been spreading tremendously in recent years. The results obtained show the superiority of our approach over SAD, MISAD and the bit-truncation approach taking the three adopted performance metrics into consideration. The results show only small amounts of deterioration in video quality and compression ratio when shrinking the number of output bits. The amounts of increase in the number of two-input NAND gates, consumed power and processing delay when using MXOR units with bit-shrinking instead of the pure MXOR unit are small. Nevertheless, the MXOR units with bit-shrinking actually outperform the pure MXOR, since they significantly reduce the hardware complexity of all subsequent stages including the parallel adder tree and the compare and select units of the motion estimation engine.
It is important to clarify that we propose the matching units only for hardware implementations and not for software implementations. The design manipulates the pixels at the bit level, which makes the proposed units run efficiently on hardware. As mentioned above, considerable enhancements have been shown when we evaluated the hardware design of the proposed matching units and compared them with the hardware designs of others. Software implementations of the proposed matching units are not efficient, since extra computations are performed by the motion estimation engine when compared with the conventional SAD. The code shown in Figure 9 (a) is the code extracted from HM16.6 that performs one pixel matching using SAD, while the code shown in Figure 9 (b) is the code implemented in HM16.6 to perform the same operation using MXOR, which is taken here as an example. The codes to compute MXOR2 through MXOR5 are even longer than the code of MXOR. Hence, extra time is needed by the software to perform the coding process when using our proposed units instead of the conventional SAD. We emphasize the fact that we only performed the software simulations using HM16.6 to validate the functionalities of the matching units and also to measure the amounts of deterioration in both PSNR and bit-rate when using the proposed units instead of SAD.