Adaptive search range adjustment and multiframe selection algorithm for motion estimation in H.264/AVC

Abstract. The H.264/AVC video standard adopts a fixed search range (SR) and fixed reference frame (RF) for motion estimation. These fixed settings result in a heavy computational load in the video encoder. We propose a dynamic SR and multiframe selection algorithm to improve the computational efficiency of motion estimation. By exploiting the relationship between the predicted motion vector and the SR size, we develop an adaptive SR adjustment algorithm. We also design a RF selection scheme based on the correlation between the different block sizes of the macroblock. Experimental results show that our algorithm can significantly reduce the computational complexity of motion estimation compared with the JM15.1 reference software, with a negligible decrease in peak signal-to-noise ratio and a slight increase in bit rate. Our algorithm also outperforms existing methods in terms of its low complexity and high coding quality.


Introduction
Motion estimation (ME) is used in many video coding standards, such as MPEG-1, MPEG-4, and H.264/AVC, to remove interframe redundancies.There are two main factors that determine the performance of ME.One is the size of the search range (SR), and the other is the number of reference frames (RFs).The values of these two factors are set manually in the configuration file of the encoder and are used as fixed parameters in the whole ME process.Although larger SR and RF values enable better coding performance, they also impose a heavy encoding complexity.In the overall coding system, about 50% of the computation time and 90% of the memory access are dedicated to ME. 1,2 Indeed, the memory access bandwidth required to read reference pixels from external memory has gradually become the bottleneck of real-time applications.
Because the coding video differs considerably, fixed SR and RF settings are not suitable for all situations.For example, if the video sequence is motionless, a small SR and single RF are sufficient to determine the best motion vector (MV).On the other hand, if the sequence contains a high degree of movement, we can adaptively adjust the above settings to achieve better video quality.In such cases, dynamic SR adjustment and multiframe selection algorithms have become important in improving coding efficiency and satisfying the needs of real-time applications.
Many algorithms that adaptively adjust the SR size and select the RFs have been proposed.For instance, the influence of the SR on the compression rate and coding quality has been examined in an extensive series of experiments. 3he maximum-likelihood estimator has been used to model the probability density function of the motion vector difference (MVD), with the optimal SR then calculated for a given hitting probability. 4,5However, the integral operation and likelihood estimator make the algorithm somewhat complicated.In Ref. 6, images are divided into motionless and moving regions, and different SR selection methods are given for each.Using the spatial correlation, the SR can be set as a function of the maximum of the neighboring matching errors. 7Similarly, the block can be classified as motionless, slow, or violent based on the magnitude of the predicted motion vector (PMV), allowing a 3 × 3, 7 × 7, or 9 × 9 search area to be selected accordingly. 8,9In Ref. 10, three different SRs are selected based on certain thresholds.However, with only a few fixed SRs and thresholds, the adaptability of these algorithms [8][9][10] is insufficient to cope with the diversity of video sequences.In Ref. 11, the correlation between different block partitions is used to predict the SR size.The SR of the current block has been calculated based on the relationship between the probability density function of MVD and the MV distribution of the neighboring blocks. 12An analysis of the relationship between the compression ratio and the size of the SR has been shown to allow a smaller SR for macroblocks (MBs) with smaller compression ratios. 13In Ref. 14, a linear SR model is defined based on the 1-bit image plane.
Existing multiframe selection algorithms are generally classified into three categories.In the first, the MV composition technique is used to reduce the calculation load of multiframe ME, which uses the MVs from the previous RFs to generate those for the current RF.The activity dominant vector selection criterion can be used to obtain the composited MV. 15 The MVs of two successive frames can synthesize the MV of the current RF, 16 with only a small range around the composited MV needed for ME.In Ref. 17, the MVs with the same characteristics are first merged and then the two most relevant MVs are selected to generate the MVof the current block.To apply the MV composition algorithm, the MVs of different block types must be stored for further calculation.Therefore, this approach inevitably involves a large additional memory space.
In the second category, the candidate reference frames (CRFs) are predicted from the temporal and spatial neighbors.The priorities of all RFs can be computed from the spatial and temporal correlation of the RF index and MVs, 18 or the coding modes and RFs from some reference region may be used to determine the CRFs for the current block. 19In Ref. 20, the initial RF is dynamically selected using the mean of the RF indexes of spatially neighboring MBs.
The third category speeds up the multiframe selection procedure by employing an early termination strategy.For instance, based on the spatial and temporal consistencies of the motion field, multiframe ME may only be selected for a small region of the MBs, 21 or the experimentally obtained sum of absolute difference (SAD) curve could be used to skip the third and fourth RFs. 22In Ref. 23, the relationship between the SAD and the all-zero quantized transform coefficients was deduced.If the SAD is less than some threshold, the multiframe ME procedure should be terminated immediately.Alternatively, the MVs and SADs from previous RFs can be used to automatically confirm whether to search the last two RFs. 24ifferent from the above-mentioned methods, we propose a different dynamic SR and multiframe selection algorithm for the H.264/AVC ME process.For the SR adjustment scheme, we first analyze the relationship between the PMV and the SR and then use this relationship to adjust the SR size.For the multiframe selection scheme, we focus on the correlation between different block sizes of the MB to adaptively select RFs.Simulation results show that the proposed algorithm achieves a higher speedup while maintaining almost the same video quality and total bit rate.
The rest of this paper is organized as follows.Sections 2 and 3 describe the algorithms for dynamic SR adjustment and multiframe selection, respectively.Our simulation results are presented in Sec. 4, and we give our conclusions in Sec. 5.
2 Proposed SR Adjustment Algorithm

Motivation and Analysis
The SR size can be successfully reduced using the algorithms mentioned in Sec. 1, but the efficiency can be further improved.In the H.264/AVC coding standard, the PMV decides the center of a search area.Therefore, the SR size depends on the location of the PMV more than any other factors, such as the maximum matching error 7 or the motion activity. 8The relationship between the SR and the PMV is illustrated in Fig. 1.The vector mv 0 denotes the search center and is calculated as the median operator over the neighboring MVs, and mv n is the final MV of block N obtained by ME.The vector mvd n is the difference between mv 0 and mv n .Intuitively, when PMV (mv 0 ) is accurate (i.e., when mv 0 is close to mv n ), fewer regions need to be searched before the final motion vector (mv n ) is obtained.Hence, we can consider an adaptive SR adjustment algorithm which depends on the PMV.
To determine an MV within the search window, SAD has been widely used to measure the prediction distortion between the current and the candidate blocks.For a block of size W × H SAD is defined as where currði; jÞ denotes the pixel value of the current block and ref is the pixel value of the candidate block for a given motion vector mvðmvx; mvyÞ.In our scheme, we define the Dif metric to evaluate the accuracy of the PMV.In other words, we use Dif to measure the distance between the MV and its predictor.It is defined as follows: where SAD I ½c; rðmv 0 Þ and SAD I ½c; rðmv i Þ are the SADs computed at the PMV (mv 0 ) and the MV (mv i ) of block I, respectively.A small Dif value implies an accurate predictor.Thus, we may expect a high probability of a small SR.
To reveal the relationship between Dif and the SR size, the experimental results from a variety of video sequences are shown in Fig. 2. In this figure, jmvd n j represents the magnitude of the difference between MV and PMV, and  average Dif is the average Dif value of blocks with the same jmvd n j.A clear positive correlation can be seen, especially for jmvd n j values up to 11.Although average Dif does not increase monotonously over the whole jmvd n j range, there is a high probability that average Dif will become larger as jmvd n j increases.Hence, we can develop an adaptive SR selection algorithm, which uses Dif as the criterion for determining the SR.That is, the small Dif means the accurate PMV and the small SR value, whereas the large Dif means the large SR value.
For the current block N in Fig. 1, mv n is not available until the ME process has been performed across the whole search area, so we must set a predictor for SAD N ½c; rðmv n Þ before it can be used.The influence of this predictor on coding quality and efficiency is briefly analyzed as follows.In Eq. ( 2), if a larger predictor is selected for SAD N ½c; rðmv a Þ, a smaller Dif value will be obtained.According to Fig. 2, this should result in a smaller SR size.Although a higher coding speed can be achieved with a smaller SR, the coding quality may be lost if the true MV falls outside this smaller SR.In contrast, a smaller predictor gives a larger SR and a higher coding quality.Hence, our purpose is not to get an accurate estimate for SAD N ½c; rðmv n Þ but to achieve a tradeoff between the coding quality and the coding efficiency.Therefore, we replace SAD N ½c; rðmv n Þ with a smaller predictor: where SAD A ½c; rðmv a Þ, SAD B ½c; rðmv b Þ, SAD C ½c; rðmv c Þ, and SAD E ½c; rðmv e Þ are the SADs computed from the MVs of the spatially neighboring blocks (A is the left block, B is the upper block, C is the upper-right block) and the colocated block (E).
To estimate the SR size of block N, let Dif N , Dif A , Dif B , Dif C , and Dif E denote the Dif values of the current block N and its four neighbors (left, upper, upper-right, and colocated block), respectively.Dif A , Dif B , Dif C , and Dif E are subtracted from Dif B , Dif C , and Dif N , and the four differences are collected in the set ΔDif: The SR is then adjusted according to the following conditions.
Condition 1: If any element in ΔDif is smaller than 0, Dif N is smaller than the Dif value of (at least) one of its four neighbors.Thus, it can be determined that the PMV of block N is closer to its MV than its neighbor, and therefore the following smaller SR should be set for block N: In Eq. ( 5), SR A , SR B , SR C , and SR E are the SRs of blocks A, B, C, and E, respectively.
Condition 2: If all elements in ΔDif are greater than 0 and Dif N > α × maxðDif A ; Dif B ; Dif C ; Dif E Þ, where α is an adjustment parameter, the PMV of the current block N is far from its MV, which is likely to be due to complex motion.In such cases, in order to guarantee the coding quality, a larger SR should be set as follows: where R is a fixed SR given in the configuration file of the encoder.Based on extensive experiments, we set the value of α to 2. Condition 3: If conditions 1 and 2 are not satisfied, the two minimum elements in ΔDif (corresponding to the blocks that have the closest Dif values to the current block N) are selected.These blocks are marked b 1 and b 2 , and the SR of block N is set according to After determining the new SR according to Eqs. ( 5)-( 7), the ME is performed within this new SR.

Proposed Algorithm
For clarity and completeness, we summarize the detailed description of the proposed algorithm as follows: Dif N is calculated for current block N according to Eqs. ( 2) and ( 3), and ΔDif is obtained according to Eq. ( 4).The following conditions are then used to adjust the size of SR.
After finding the MV within the new SR defined by our algorithm, the Dif and SR values of the block N are saved for the next coding block.
In order to verify the correctness of the proposed algorithm, we analyzed the hit ratio to see whether our new SR included the result of ME.The hit ratios for different quantization parameters (QPs) are shown in Table 1.It can be seen that our algorithm gives a high accuracy of 93.28% to 99.61%.The Bus, Mobile, and Stefan sequences contain a high level of motion and thus obtain relatively low hit ratios.For the Akiyo, Paris, and Flower sequences, which contain slower motion, the proposed algorithm yields a hit ratio of around 99%.
3 Proposed Multiframe Selection Algorithm

Motivation and Analysis
Figure 3 describes the relationship between multiframe and variable block size ME.As shown in Fig. 3, the multiframe ME process is performed on each block size of the MB.These blocks are different partitions of the same MB, so their best RFs are highly correlated.According to this feature, the best RFs of the upper divisions can be used as CRFs for the lower divisions.However, using this up-to-down method to predict the CRFs of the lower divisions has two shortcomings.First, if the MB contains the object boundary or the complex content, then the probability that the lower and upper divisions will share the same best RF will be small.Second, the lower partitions have more subblocks than the upper partitions.Therefore, this up-todown prediction method is not very efficient.For example, the P8 × 8 division contains 36 sub-blocks in total.If the number of CRFs is 2, then the 36 sub-blocks will require 72 ME procedures, whereas for the 16 × 16 division, only two ME procedures are required.Therefore, reducing the number of CRFs in the lower divisions is critical for improving the efficiency of ME.
In summary, the 8 × 8 block is used as the basic unit in this study to determine the best RFs for the other coding modes.We use the 8 × 8 block as it better reflects the motion and texture characteristics of the MB and improves the efficiency as much as possible.For convenience, the RFs are hereafter marked as ref 0 ; ref 1 ; ref 2 ; ref 3 : : : in accordance with their distance from the current frame.
Due to the temporal consistency of a video sequence, it is highly probable that blocks in static or slow-moving regions will select ref 0 as the optimal RF.In this work, the movement complexity (MC) of the MB is defined as MC ¼ maxðjMV 0 j; jMV 1 j; jMV 2 j; jMV 3 jÞ; (8)   where MV 0 , MV 1 , MV 2 , and MV 3 denote the MVs of four 8 × 8 blocks at ref 0 .
To inspect the relationship between movement complexity and ref 0 , various video sequences with different motion activities are used as the test material.Table 2 documents the average experimental results for different QPs.T B is the number of blocks with the same movement complexity, and RF_ratio represents the percentage of blocks whose ultimate RF is ref 0 .For the static blocks (which MC is equal to 0), an average of 97.87% of them select ref 0 as the best RF.For the News, Container, and Akiyo sequences, the percentage of blocks choosing ref 0 among T B is greater than 99%.Table 2 also shows that the RF ratio decreases as the value of MC increases.For the Stefan sequence, in particular, when the MC is 3, only 68.32% of T B select ref 0 as their ultimate RF.Therefore, we can conclude that when the value of MC is less than some threshold Th (set to 2 in this paper), there is a very high probability that only ref 0 will be used as the best RF.From this result, we conclude that the search across other RFs can be skipped.
If MC is larger than Th, it indicates that the motion is complex.Thus, we perform ME on the remaining RFs for each 8 × 8 block.The reference maps are then generated according to the correlation between the 8 × 8 block and the other blocks.Figure 4

Proposed Algorithm
Based on the above description, the pseudocode for our multiframe selection algorithm can be written as shown in Fig. 5.The initial number of CRFs is usually set to 5. As shown in Fig. 5, ME is first performed on ref 0 for the four 8 × 8 blocks, and the MVs of the four blocks are used to calculate the motion complexity according to Eq. ( 8).If MC is less than the threshold Th (set as 2), then ref 0 is set as the CRF.Otherwise, ME is conducted on from ref 1 to ref 4 for the four 8 × 8 blocks, respectively.The next step is to determine a candidate reference frames set (CRFS) for other partitions based on the reference maps in Fig. 4. ME is then performed on the RFs in the CRFS for the 8 × 8 sub-block, 16 × 16 block, 16 × 8 block, and the 8 × 16 block.In the ME process for H.264/AVC, the coding order is P16 × 16, P16 × 8, P8 × 16, and then P8 × 8.As the proposed algorithm changes the coding order, i.e., the 8 × 8 block is processed first, the MVs of the second, third, and fourth 8 × 8 blocks must be repredicted after finishing the whole ME process, otherwise there will be some inconsistency between the decoder and encoder.
Using the pseudocode above, we investigate the number of RFs under the best-and worst-case scenarios.In the best case, only one RF (ref 0 ) is required for all block divisions.In the worst case, all five RFs are required for the 8 × 8 blocks; each of the 16 × 16 block, 16 × 8 block, and 8 × 16 block have two CRFs.However, subpartitions such as 8 × 4, 4 × 8, and 4 × 4 will always have one RF.Therefore, the average number of RFs required for the entire coding of the MB is ½ð4 To verify the effectiveness of the above method, the average hit ratio for different QPs at each block size was compared to the full RF search method.The results are shown in Table 3.As can be seen, our algorithm achieves a high average accuracy of 91% to 98% for all block partitions.For the 16 × 16 block size, the hit ratio averages above 95%.Even the 8 × 8 sub-block averages over 91%, with a worst-case performance of 84.75% with the Mobile sequence.

Simulation Results and Analysis
To evaluate the performance of the proposed approach, it was implemented on the H.264/AVC reference software JM15.1.Various standard sequences were used, including CIF, 4CIF, and HD sizes, at 30 fps for 100 frames.The simulations were performed with rate distortion (RD) optimization and Hadamard transform enabled, QP ¼ 28, 32, 36, and 40, IPPP sequence types, CAVLC entropy coding, and one or five RFs.The default SR was set to 16 for CIF, 32 for 4CIF, and 64 for HD sequences.For a numerical comparison, the Bjontegaard delta peak signal-to-noise ratio (BDPSNR) and Bjontegaard delta bit rate (BDBR) were used to evaluate the coding quality.These quantities are recommended for comparing the difference between two RD curves. 25A negative BDPSNR (dB) or positive BDBR (%) indicates a coding loss over the fullsearch algorithm in JM15.1.For an objective comparison, the average used reference frame (URF) and average computational complexity (CPX) are used to measure the complexity reduction.CPX (%) 4 is the speed-up ratio of the SR adjustment algorithm, which is defined as the ratio of the number of search points in the proposed method to the number of search points in the full-search motion estimation (FME) algorithm.For example, a CPX value of 21% implies that the number of search points in the proposed method is 21% of the number in the FME algorithm.The average URF represents the average number of RFs used by each MB, with a smaller URF signifying fewer RFs.
For an objective comparison of coding performance, the proposed approach was compared with corresponding methods.Our SR adjustment method was compared to two recent schemes. 4,11It can be seen from Table 4 that the proposed algorithm reduced the number of search points by an average of 2.56%, albeit at the cost of a 0.14% increase in bit rate and a 0.06 dB reduction in peak signal-to-noise ratio (PSNR).
When the hitting probability of MVDs was set to 0.7, the method of Ko et al. 4 exhibited a good performance, with an average CPX of 4.94%.However, the implementation cost of this method is much higher than that of the proposed scheme due to the overheads inherent in the integral operation and likelihood estimation.For several sequences, such as Stefan, Station2 Hd, and Soccer 4Cif, the method of Ref. 11 exhibited better coding quality than our method in terms of BDPSNR and BDBR, but our approach can further deduct an average of 11% search points with only a slight bit rate increase and small PSNR decrease.
For further comparison, Table 5 depicts the ME time of the proposed method compared to the UMHexagonS algorithm in JM15.1.This shows that the proposed method reduces the ME time by 9.15% on average, with a marginal increase in BDBR (þ0.07%) and a small loss of BDPSNR (−0.01 dB).For sequences with static and slow motion activity, such as Akiyo Cif and Hall Cif, UMHexagonS achieves a high speed-up factor, because the ME process is terminated after finishing the small and middle diamond pattern search.In contrast, our algorithm works well for CIF sequences with high motion levels and 4CIF and HD sequences with larger SR values.
We compared our multiframe selection algorithm with two well-known efficient algorithms. 19,23The comparison results are listed in Table 6.For most sequences, the method of Lee et al. 19 yielded a high bit rate increment and PSNR loss (average of 0.09 dB BDPSNR decrease and 0.61% BDBR increase), with 1.88 frames searched compared to the JM15.1 reference encoding.The method of Ref. 23 achieved good PSNR and bit rate performance (average 0.06 dB BDPSNR decrease and 0.45% BDBR increase) with a higher speed-up factor (1.52 RFs used).The proposed algorithm gave superior results to both of these techniques in terms of coding complexity reduction.It outperformed the method of Lee et al. 19 in all aspects, with a 0.63-frame reduction in the number of searched RFs, which implies a reduction of 12% in the number of ME operations.At the same time, the proposed method improved the URF value by 0.31 frames compared to the method of Huang et al., 23 with a better bit rate and the same PSNR.
It should be mentioned that the URF and CPX parameters are virtually independent of the programming quality or hardware platform.Therefore, it is clear that the performance of the proposed methods is superior to the corresponding methods evaluated above.It must also be noted that the proposed methods can be combined with any fast ME method, such as Fast Full search, EPZS, and UMHexagonS, in the JM reference encoder.

Conclusions
In this paper, we presented an efficient adaptive SR adjustment and multiframe selection algorithm for ME.The proposed SR adjustment algorithm adjusts the SR for ME based on the accuracy of PMV.We also designed an RF selection scheme with the aim of reducing the complexity of multiframe ME based on the relationship between the different block sizes of the MB.The described technique significantly reduces the CPX of ME in H.264/AVC.Through a comparative analysis, it was found that the proposed approach gave an average RF saving of 3.75 frames (1.25 URFs) with a setting of five RFs and a reduction of 97% in the number of search points with a setting of one RF, with a negligible bit rate increment and a minimal loss of image quality.These features will be important in the development of real-time coding applications for video conferencing and telephony.In future work, we will combine our two algorithms to further speed up the ME process.

Fig. 1
Fig. 1 Relationship between the search range (SR) and PMV.

Fig. 2
Fig. 2 Relationship between Dif and jmvd n j. (Test conditions: encoded frame ¼ 100; number of reference framesðRFsÞ ¼ 1; quantization parameter ðQPÞ ¼ 28; SR ¼ 16; group structure is IPP.) shows the reference maps for other blocks, where ref a , ref b , ref c , and ref d denote the optimal RFs of the first, second, third, and fourth 8 × 8 block, respectively.The CRFs for each partition are described as follows: a.The P16 × 16 mode is used to encode regions containing slow motion.Thus, if more than two RFs are contained in Fig. 4(a), the motion of the current block is complex and the P16 × 16 mode cannot be selected.In this case, only ref 0 is selected as the best RF; otherwise all RFs contained in Fig. 4(a) are selected as CRFs for the 16 × 16 block.b.For the 16 × 8 block, ref a and ref b are regarded as CRFs for the upper 16 × 8 division, whereas ref c and ref d are CRFs for the lower 16 × 8 division in Fig. 4(b).c.For the 8 × 16 block, ref a and ref c are regarded as CRFs for the left 16 × 8 division, whereas ref b and ref d are CRFs for the right 16 × 8 division in Fig. 4(c).d.For the three 8 × 8 subdivisions, the best RF for the 8 × 8 block is directly used as the best RF for its subdivisions in Fig. 4(d).
Fig. 3 Multiple RFs motion estimation (ME) with various block size.

Table 3
Average hit ratios of the proposed algorithm for each block size (test conditions are the same as Table2).

Table 4
Performance comparison of the proposed algorithm on JM 15.1 (with one RF).

Table 5
Performance comparison with UMHexagonS algorithm (test conditions are the same as Table4).