HIGH-PERFORMANCE BLOCK MATCHING ALGORITHM FOR HIGH BIT-RATE REAL-TIME VIDEO COMMUNICATION

Although hardware capabilities are growing exponentially along with the capacity of communication channels, high-quality video encoders for real-time applications are still considered an open area of research. The majority of researchers interested in video encoders target their investigations towards motion estimation and block matching algorithms. Many algorithms have been proposed that aim to reduce the total number of mathematical operations required when compared to Full Search. However, their results often converge to local minima and a significant amount of computation is still required. Therefore, in this research, a hierarchy-based block matching method that facilitates the transmission of high bit-rate videos over standard communication methods is proposed. The proposed algorithm works in the frequency domain, where it examines the similarities between a chosen subset of frequencies, which significantly reduces the total number of comparisons and the total mathematical computations required per block.


INTRODUCTION
Digital videos consist of successive frames sampled over a period of time. Those successive frames carry high data redundancy. Therefore, eliminating redundant data can be extremely helpful in reducing the size of digital video. Several types of compression techniques have been proposed in the past few decades. Those compression techniques are classified as being either lossless or lossy. The former type is achieved by eliminating redundant bits and reproduces the exact original dataset. The latter is achieved by eliminating the least important bits and reproduces a similar copy that might be indistinguishable from the original by the human visual system. Lossy compression techniques achieve better compression and are more applicable to digital videos, while lossless techniques are by nature more applicable to digital images. Lossy compression exploits the limits of the human visual system by producing data that is sufficient to be classified as "good enough". The latest compression standards have set the architectures for video codecs as consisting of the following basic blocks: prediction, transform and entropy coding. Prediction estimates the position of a current block inside a video frame. The transform process converts a block of pixels into the frequency domain. Entropy coding encodes video data into a compressed bit stream.
Usually, consecutive frames contain the same still or moving objects, creating a high correlation between consecutive frames. Therefore, researchers have investigated methods that examine the object movements in a video sequence in order to produce motion vectors that represent the estimated motion. In turn, those estimated vectors are forwarded to motion compensation methods that use them to simulate the object movement, achieving data compression. Motion estimation and compensation methods are considered the most important techniques for eliminating temporal redundancy in successive video frames. However, those techniques are more applicable to translational motion and still have their limits when applied to rotational motion, which is difficult to estimate and requires other techniques for processing. Therefore, motion estimation algorithms usually assume the following: object movement is translational, illumination is uniform across the spatial and temporal domains, occlusions of objects by others are neglected, and uncovered background is not considered.
Various coding methods have been proposed for video compression. Those coding techniques include intra-frame and inter-frame coding, which are used to minimize the total number of bits required to transmit or store videos. In intra-frame coding, each frame is coded separately, and this type of coding includes transformation, quantization and frame encoding. Inter-frame coding exploits the temporal redundancy and is usually applied in video coding in order to achieve the actual compression. In this type of coding, motion estimation and compensation algorithms are normally applied to eliminate the temporal redundancy that exists between successive frames. Various motion estimation approaches have been proposed in literature; amongst those approaches, block matching algorithms have been shown to be more suitable because of their reliability and simplicity. Block matching algorithms estimate the motion of objects in successive frames on the basis of rectangular blocks. These algorithms assume that all the pixels within a block have the same motion behaviour [8].
In block matching algorithms, frames are divided into $N \times N$ blocks, where all blocks in the current frame are matched with candidate blocks within a search area (window) on the reference frame (considering that candidate blocks have a translation movement in other frames) and the displacement motion vector is recorded for the best matched candidate. In inter-frame coding, the motion vector and the residual frame (resulting from subtracting the input frame from the prediction of the reference frame) are usually transmitted. At the receiver side, the decoder builds the frame difference and adds it to the reconstructed reference frame. Therefore, data compression is achieved by eliminating inter-frame redundancy. This demonstrates the fact that better prediction methods give smaller error signals and a reduced transmission bit-rate [19].
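The relationship between the motion vector, the residual and the reconstructed block can be illustrated with a minimal sketch. The function names and the sign convention (residual taken as current minus prediction) are assumptions made for illustration and are not taken from the paper.

```python
def predicted_block(reference, top, left, mv, n):
    """Copy the n x n block that the motion vector (dy, dx) points to in the reference frame."""
    dy, dx = mv
    rows = reference[top + dy:top + dy + n]
    return [row[left + dx:left + dx + n] for row in rows]


def residual(current_block, prediction):
    """Encoder side: element-wise difference between the current block and its prediction."""
    return [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(current_block, prediction)]


def reconstruct(prediction, residual_block):
    """Decoder side: add the transmitted residual back onto the motion-compensated prediction."""
    return [[p + r for p, r in zip(rp, rr)] for rp, rr in zip(prediction, residual_block)]
```

Under this convention, a better prediction leaves a residual with smaller values, which is what reduces the number of bits that need to be transmitted.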
In this paper, in addition to the introduction section, section 2 provides an up-to-date literature review of motion estimation algorithms. Section 3 introduces the transformation process. In section 4, the proposed hierarchical search algorithm is described along with the proposed matching criterion. Section 5 provides the experimental results and analysis. Finally, section 6 concludes this research.

LITERATURE REVIEW
A large number of block matching algorithms have been proposed over the last decades, such as the traditional methods found in [2]-[5], [7], [11], [15]-[16] and [24]. Amongst the available block matching algorithms, full search leads to the best possible match of the block in the reference frame with a block in another frame by calculating the cost function at each possible location in the search window. The resulting motion compensated frame has the highest peak signal-to-noise ratio when compared to any other block matching algorithm. However, this is the most computationally expensive block matching algorithm [8].
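A minimal sketch of full search in the pixel domain is given below to make the cost of the exhaustive scan concrete. It assumes frames stored as lists of rows, uses the sum of absolute differences as the cost function, and skips candidates that fall outside the frame; the function names and the boundary handling are illustrative assumptions.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb))


def get_block(frame, top, left, n):
    """Extract the n x n block whose top-left pixel is (top, left)."""
    return [row[left:left + n] for row in frame[top:top + n]]


def full_search(current, reference, top, left, n, w):
    """Evaluate every displacement in [-w, w] x [-w, w] and return the best (dy, dx) and its cost."""
    height, width = len(reference), len(reference[0])
    target = get_block(current, top, left, n)
    best_mv = (0, 0)
    best_cost = sad(target, get_block(reference, top, left, n))
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > height or x + n > width:
                continue  # candidate block would leave the frame
            cost = sad(target, get_block(reference, y, x, n))
            if cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost
```

Every candidate costs one full-block comparison, so the work grows with both the window area and the block area, which is the cost the faster algorithms reviewed in this section try to avoid.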
Optimized block matching algorithms speed up the exhaustive search required by full search algorithms based on fixed search patterns. Researchers in this domain have investigated the use of many algorithms in order to enhance the traditional search algorithms. In [33], Diaz Cortes et al. proposed a block matching algorithm that combines harmony search with a fitness approximation model. The authors considered the motion vectors in search window as potential matches. The authors applied a fitness function in order to evaluate the matching quality of each motion vector in addition to a strategy to decide which motion vectors can be estimated amongst the rest of the motion vectors. In [34], the authors proposed a hierarchy-based motion estimation algorithm using Gaussian image pyramid and unidirectional estimates of motion vectors at the top level. In their work, the authors proposed the use of five candidates for each motion vector. At the bottom level of the hierarchy, the motion vectors are corrected based on the sum of absolute difference values of the blocks. Moreover, in their work, the unidirectional motion vectors are assigned to bidirectional motion vectors.
In [35], Abdelazim et al. proposed the use of a cross search algorithm in the H.265 standard, which deals with high-efficiency video coding. In their work, the authors proposed a speed optimization technique based on frequency-domain phase correlation that enables compressing videos rapidly while maintaining the video quality. In [36], Jia and Ding proposed a fast sub-pixel motion estimation algorithm. In their work, the authors proposed a scheme to skip the sub-pixel search process in smooth prediction units. Moreover, the authors proposed a fast sub-pixel search algorithm based on texture direction analysis in order to reduce the computational complexity. In [37], the authors presented a low computational complexity systolic hardware architecture for the full search block matching algorithm. The proposed architecture is based on a one-bit transform-based full search algorithm, performs full search for four macro-blocks in parallel, and was implemented in VHDL. In [38], the authors presented a three-step searching method in order to estimate the motion vectors of high-resolution image sequences using a low number of computations. The searching strategy of this algorithm is carried out in three steps: a large-area search, an adaptive directional search and a small-area search.
In [39], Arora et al. proposed a dynamic zero motion pre-judgment technique along with an adaptive diamond pattern search-based algorithm in order to enhance the search efficiency and accuracy of motion estimation. The dynamic zero motion pre-judgment is used for early identification of the stationary blocks. For the remaining, non-stationary blocks, an initial search center is used which has a high probability of being near the actual motion vector. The variable size diamond pattern is used to obtain the global minimum. In [40], Kovacevic et al. presented a motion estimation technique that combines recursive block-matching and customized phase plane correlation. In [41], Kamble et al. developed an approach for video coding using a modified three-step search block matching algorithm and weighted finite automata coding. In their work, the proposed block matching algorithm is based on the combination of rectangular and hexagonal search patterns and is used to compute motion vectors. The proposed weighted finite automata are used for the coding with a focus on reducing the encoding time. In order to reduce the encoding time, the authors in [47] proposed another approach for fractal coding using the weighted finite automata. The authors of [42] proposed a motion estimation method for image stabilization, integrating the speeded up robust features algorithm, modified random sample consensus and the Kalman filter. The authors achieved video stabilization with filtered motion parameters using the modified adjacent frame compensation.
In [43], the authors presented an enhanced version of the dynamic pattern search algorithm by means of reducing the search point computation. In their work, the algorithm starts by identifying the stationary blocks; then, the search points within the search area were evaluated for minimum distortion. The proposed work has been compared with other techniques like full search, diamond search and hexagon search. In [44], the authors proposed a two-step approach for enhancing the accuracy of initial search center prediction that is applied in the H.264 standard, in order to improve the motion estimation speed in video encoding. In their work, candidate blocks are identified in the first step for initial search center prediction. In the next step, the search is refined to obtain best possible initial search center.
In [45], the authors presented a hybrid approach for motion estimation. The hybrid method combines the dynamic zero motion pre-judgment technique with the initial search centers technique. In their work, calculating the initial search centers shifts according to the process of zero motion prejudgment. In [46], the authors analyzed various tools involved in fast motion estimation algorithms. Moreover, the authors proposed a number of improvements in order to achieve a fast hybrid algorithm.
However, fewer researchers have investigated applying motion estimation algorithms in the frequency domain. Young and Kingsbury [22] proposed an alternate block matching method by applying a motion estimation technique based on overlapped transforms. Argyriou and Vlachos [1] proposed the use of gradient correlation in the frequency domain. Edrem et al. [6] estimated the motion parameters using a harmonic retrieval approach. Tzimiropoulos et al. [21] proposed a method for detecting symmetries in real images in the frequency domain. In their approach, the authors used motion estimation techniques to sequentially determine associated parameters. Pingault and Pellerin [17] tested motion transparency phenomena in video sequences based on the frequency domain. Their method contains an algorithm that introduces a new statistical model.
Hierarchical motion estimation algorithms, in which several searching methods are applied at different levels, are widely used for their accuracy. However, applying those algorithms in the frequency domain has not yet been investigated properly. In hierarchical block matching techniques, the reliability of motion vectors is related to the block size, where large blocks are more likely to converge on local minima. Moreover, such algorithms combine the advantages of selecting large blocks and small blocks at different levels. Various research topics on hierarchical search algorithms have been tackled in literature [25]-[29], [31].
In this work, a motion estimation algorithm based on a two-level hierarchy is proposed with a new block matching criterion to be applied at both levels of the hierarchy, as can be seen in Figure 1. The next section introduces the transformation method applied in this research.


TRANSFORM DOMAIN
Video compression reduces the spatio-temporal redundancy that exists in the frame data and between consecutive frames using intra-frame and inter-frame coding methods. Intra-frame coding involves a spatial-to-frequency transformation of the video frame and quantization of the frame frequencies, removing the high frequencies that represent insignificant visual details in a given frame. Regardless of the transformation method that is applied, it should be computationally acceptable and invertible [18]. In inter-frame coding, compression is achieved by exploiting the temporal redundancy using proper motion estimation and compensation algorithms. Various spatio-temporal transformation methods have been proposed in literature and are either image-based or block-based methods [9].
Block-based transformation methods are the most applicable for use in video coding, since motion estimation algorithms are based on block matching methods. In this work, the Discrete Fourier Transform (DFT) is chosen, as it allows working in the frequency domain in a manner comparable to the other transformation methods available in literature.

The Discrete Fourier Transform
Based on Fourier theory, a complex signal can be decomposed into an infinite series of cosine and sine terms and a group of coefficients that can be determined. The original function $f(t)$ can be decomposed into a series of basis states, based on (1).
The relationship above can be simplified as shown in (2). In order to use the Fourier transform with discrete input data, such as the data available in digital videos and images, integrals are replaced by sums, $T$ is replaced by $N$, $f(t)$ changes to $x_n$ and $c_n$ is replaced by $X_n$, which gives the Discrete Fourier Transform shown in (3) and its inverse shown in (4). The DFT reveals periodicities in input data as well as the relative strengths of any periodic components [32].
Using (3), the N input samples (pixels) in a given block are converted into N frequency samples. The DFT is a coefficient matrix multiplication as shown in (5).
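For reference, the discrete relations described above usually take the standard form written out below. The sign and normalisation convention (unnormalised forward transform, $1/N$ on the inverse) is an assumption and may differ from the one behind (3)-(5); the frequency index is written $k$ here to keep it distinct from the sample index $n$.

```latex
% Forward DFT of N samples x_0, ..., x_{N-1}
X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi n k / N}, \qquad k = 0, \dots, N-1

% Inverse DFT
x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{\,i 2\pi n k / N}, \qquad n = 0, \dots, N-1

% Matrix form with W = e^{-i 2\pi / N}
\begin{bmatrix} X_0 \\ X_1 \\ \vdots \\ X_{N-1} \end{bmatrix}
=
\begin{bmatrix}
W^{0 \cdot 0} & W^{0 \cdot 1} & \cdots & W^{0 (N-1)} \\
W^{1 \cdot 0} & W^{1 \cdot 1} & \cdots & W^{1 (N-1)} \\
\vdots & \vdots & \ddots & \vdots \\
W^{(N-1) 0} & W^{(N-1) 1} & \cdots & W^{(N-1)(N-1)}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{bmatrix}
```

Written this way, each of the $N$ outputs is a sum over $N$ inputs, which is where the quadratic cost of the direct computation comes from.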
The above calculation is of order $N^2$. In order to reduce the DFT complexity, a number of researchers investigated the use of different patterns in $W^{nk}$. Amongst those approaches, the Fast Fourier Transform (FFT) is proposed as a computational method of order $N \log N$. In this research, the Cooley-Tukey algorithm is used, as it is the most common FFT algorithm available in this domain [32], to transform the video frames with different block sizes at different levels of the two-level hierarchy. Moreover, only part of the frequencies is considered in the block matching criterion to get the best match, as will be demonstrated in the next section.
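The difference between the direct DFT and the Cooley-Tukey FFT can be illustrated with a small one-dimensional sketch. This is a textbook recursive radix-2 variant that assumes a power-of-two input length; it is not claimed to be the implementation used in this research, and the function names are illustrative.

```python
import cmath


def dft(x):
    """Direct DFT: every output sums over all N inputs, giving order N^2 work."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT (input length must be a power of two)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out
```

Each level of the recursion does order $N$ work and there are $\log_2 N$ levels, which gives the $N \log N$ behaviour mentioned above. A block-based two-dimensional transform can be obtained by applying the one-dimensional transform to the rows and then to the columns of each block.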

THE PROPOSED HIERARCHICAL SEARCH ALGORITHM
Hierarchical block matching algorithms normally start the search process with small blocks and use their motion vectors as starting points to search for larger blocks at the next levels of the hierarchy. The block sizes selected at each level affect the reliability of the produced motion vectors, where large blocks are more likely to result in local minima. Generally, in the spatial domain, three-level hierarchical searches are used, as the data in its original form (pixel domain) is highly correlated. However, data in the frequency domain is decorrelated, which allows the number of hierarchical levels needed for the matching process to be reduced. Therefore, in this research, a two-level hierarchy in the frequency domain is used and shown to be sufficient. The proposed algorithm uses well-known search algorithms at each level of the hierarchy with a new matching criterion (described in section 4.1.1) to be used at each level. The steps of the proposed algorithm are summarized in Algorithm-1 and visually represented in Figure 2.

The Proposed Matching Criterion
In order to compare algorithms in this domain, the standard Sum of Absolute Differences (SAD) shown in (6) is used. The performance of the algorithm is highly dependent on the matching criterion. However, when applying the matching criterion in the spatial domain, the number of required computations cannot be reduced, as this would directly affect the matching results, since frame pixels are highly correlated. In the frequency domain, where frame data is highly decorrelated, reducing the number of required computations is more appropriate. The FFT produces frequency coefficients arranged in a pattern where the corners of the coefficient block contain the lowest frequencies, which describe the general vertical and horizontal information in the pixel block. The rest of the coefficients in the block are higher frequencies that describe vertical and horizontal details in the pixel block. In this research, only the coefficients at the four corners of the transformed block are considered in the SAD matching criterion. Therefore, the total number of computations is reduced to a constant of 4 subtractions and 4 additions for each candidate block at each search position, instead of the $N^2$ operations required by other algorithms in the spatial domain.
Information in these parts of the block is adequate to distinguish the desired block from the surrounding blocks, as can be seen later in the experimental results section.

Algorithm-1
Step 1: Sub-sample level-1 (the lowest level, which consists of the video frame at its full resolution) by a factor of 2 in the vertical and horizontal directions to produce level-2.
Step 2: Transform the frames at level-1 and level-2 into the frequency domain using the FFT, with a $4 \times 4$ block size at level-2 and $8 \times 8$ at level-1.
Step 3: Start the search at level-2 with $4 \times 4$ blocks, using the TSS search algorithm (described in section 4.1.2) and the proposed matching criterion described in section 4.1, to get a coarse motion vector that is passed to level-1.
Step 4: Apply the two-dimensional logarithmic search algorithm (described in section 4.1.3) with $8 \times 8$ blocks at level-1, based on the proposed matching criterion described in section 4.1, in order to get the final motion vector.
Step 5: Apply the resulting motion vectors from step 4 to the previous frame in order to obtain the next predicted frame.
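The four-corner matching criterion described in this section can be sketched as follows. The sketch assumes an unshifted two-dimensional DFT layout (zero frequency at index (0, 0)) and measures each coefficient difference by the modulus of the complex difference; how complex coefficients are compared is not specified in the text, so that choice, like the function names, is an assumption. The direct transform used here is only for illustration on small blocks.

```python
import cmath


def dft2(block):
    """Direct 2-D DFT of a square block (illustrative; an FFT would be used in practice)."""
    n = len(block)
    w = [[cmath.exp(-2j * cmath.pi * u * y / n) for y in range(n)] for u in range(n)]
    return [
        [
            sum(block[y][x] * w[u][y] * w[v][x] for y in range(n) for x in range(n))
            for v in range(n)
        ]
        for u in range(n)
    ]


def corners(coeffs):
    """Pick the four corner coefficients of a transformed block."""
    last = len(coeffs) - 1
    return [coeffs[0][0], coeffs[0][last], coeffs[last][0], coeffs[last][last]]


def corner_sad(coeffs_a, coeffs_b):
    """SAD restricted to the four corner coefficients (assumed: modulus of complex difference)."""
    return sum(abs(a - b) for a, b in zip(corners(coeffs_a), corners(coeffs_b)))
```

Because only four coefficients are compared, the cost per candidate no longer depends on the block size, provided the transformed blocks are already available.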

The TSS Algorithm
The steps of the TSS algorithm are applied at level-2 of the hierarchy (the sub-sampled level defined in Algorithm-1) to both the current and previous frames, as shown in Algorithm-2.

Algorithm-2
Step 1: Set the window size (W).
Step 2: Set the step size (S) to $2^N$, where $N = 2$.
Step 3: Start with the search location at the center and apply the following:
a) Search the eight locations at +/-S around location (0, 0) and select the one with the minimum SAD, based on the matching criterion described in section 4.1.
b) Set the search origin to the selected location and reduce the step size by a factor of 2.
c) Repeat the search until S = 1; the location with the minimum SAD is taken as the best match at level-2.
Step 4: Pass the obtained coarse motion vector to level-1.

In the TSS algorithm, the number of evaluated candidates is reduced by a factor of 9 compared to the full search algorithm: instead of evaluating 225 blocks, the TSS evaluates only 25.
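The three-step search can be sketched independently of the matching criterion by passing the cost in as a function. In the sketch below, `cost(dy, dx)` is a hypothetical callback that returns the matching cost of a displacement (or `None` when the candidate is not valid), and the initial step of 4 is an assumption chosen so that the steps are 4, 2 and 1.

```python
def three_step_search(cost, step=4):
    """Three-step search around (0, 0); returns (best displacement, best cost, evaluations)."""
    cy, cx = 0, 0
    best = cost(0, 0)
    evaluated = 1
    while step >= 1:
        best_pos = (cy, cx)
        # Examine the eight neighbours at distance `step` around the current origin.
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dy == 0 and dx == 0:
                    continue
                c = cost(cy + dy, cx + dx)
                evaluated += 1
                if c is not None and c < best:
                    best, best_pos = c, (cy + dy, cx + dx)
        cy, cx = best_pos  # move the origin to the best candidate found so far
        step //= 2         # halve the step: 4 -> 2 -> 1 -> stop
    return (cy, cx), best, evaluated
```

With three steps, the sketch evaluates the center once plus eight neighbours per step, i.e. 1 + 3 × 8 = 25 candidates, which matches the count given above.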

The Two-Dimensional Logarithmic Search Algorithm
The two-dimensional logarithmic search algorithm is closely related to the three-step search algorithm. This algorithm requires more steps than the three-step search; however, it has a better accuracy. The two-dimensional logarithmic search algorithm is described in Algorithm-3:

Algorithm-3
Step 1: Set the window size.

The motion vectors resulting from this search are applied to the previous frame in order to obtain the next predicted image frame. The two-dimensional logarithmic search (TDLS) algorithm is related to the TSS algorithm; however, it is used for a large search window size.
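For orientation, a commonly described textbook form of the two-dimensional logarithmic search is sketched below. It is not a reconstruction of Algorithm-3: the plus-shaped pattern, the rule of halving the step only when the center stays best, the final 3 × 3 refinement, and the names `cost`, `window` and `step` are all assumptions made for illustration.

```python
def log_search_2d(cost, window=7, step=4):
    """Textbook-style 2-D logarithmic search; returns (best displacement, best cost, evaluations)."""
    cache = {}

    def c(dy, dx):
        # Candidates outside the search window are not evaluated.
        if abs(dy) > window or abs(dx) > window:
            return None
        if (dy, dx) not in cache:
            cache[(dy, dx)] = cost(dy, dx)
        return cache[(dy, dx)]

    cy, cx = 0, 0
    while step > 1:
        # Center first, then the four plus-shaped neighbours at distance `step`.
        candidates = [(cy, cx), (cy - step, cx), (cy + step, cx), (cy, cx - step), (cy, cx + step)]
        scored = [(c(y, x), (y, x)) for y, x in candidates if c(y, x) is not None]
        best_pos = min(scored, key=lambda s: s[0])[1]  # ties keep the center
        if best_pos == (cy, cx):
            step //= 2            # center is best: refine the step
        else:
            cy, cx = best_pos     # otherwise move and keep the step
    # Final stage: examine the full 3 x 3 neighbourhood of the last center.
    candidates = [(cy + dy, cx + dx) for dy in (0, -1, 1) for dx in (0, -1, 1)]
    scored = [(c(y, x), (y, x)) for y, x in candidates if c(y, x) is not None]
    best_cost, best_pos = min(scored, key=lambda s: s[0])
    return best_pos, best_cost, len(cache)
```

Because the center only moves to a strictly better candidate, the number of steps is not fixed in advance, which is why this family of searches may take more steps than the three-step search.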

EXPERIMENTAL RESULTS AND DISCUSSION
In order to test the efficiency of the proposed methods, two sets of standard testing videos are used in this research (shown in Table 1 and Table 2). The first set comprises a total of six standard videos of type CIF with an aspect ratio of 4:3. The CIF (Common Interchange Format) is a video format initially proposed in the H.261 standard. This video format is used to standardize the horizontal and vertical resolutions of YCbCr sequences in video signals. This type of video is used in standard video teleconferencing systems. CIF defines a video sequence with a resolution of 352 × 288 and a frame rate of 30 frames per second with color encoding using the YCbCr 4:2:0 standard, where the selected video sequences consist of 300 frames in each sequence. The second set of videos comprises a total of 3 High Definition (HD) videos (1080p) with an aspect ratio of 16:9 and color encoding using the YCbCr 4:4:4 standard, where the selected video sequences consist of 500 frames in each sequence.
The selected videos from both sets are well-known standard videos that are used to test the efficiency and compare the work with other benchmark algorithms. The video sequences in both sets are selected with increasing motion complexity, ranging from slow to high. More than 3300 video frames from the different sequences were used in the experiments and are listed in Table 1 and Table 2. The well-known peak signal-to-noise ratio (PSNR) is used to evaluate the performance of the proposed motion estimation algorithm; it is computed from the pixel values of the original and predicted frames. High PSNR values indicate better quality. A PSNR result above 30 dB means that changes caused by compression algorithms cannot be visually recognized.
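The PSNR between an original and a predicted frame can be computed as in the following sketch. It uses the usual definition based on the mean squared error and assumes 8-bit samples (peak value 255); the function name and the frame representation as lists of rows are illustrative choices.

```python
import math


def psnr(original, predicted, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    total = 0
    count = 0
    for row_o, row_p in zip(original, predicted):
        for o, p in zip(row_o, row_p):
            total += (o - p) ** 2
            count += 1
    mse = total / count
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(peak * peak / mse)
```

A smaller mean squared error gives a larger PSNR, so higher values correspond to predicted frames that are closer to the originals.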
Applying the PSNR between the original and reconstructed frames measures the efficiency of the proposed work. Therefore, Table 1 compares the obtained PSNR values of the proposed algorithm with those from other state-of-the-art algorithms in this domain using videos from the first set. Figure 3 and Figure 4 visually represent the results in Table 1. Table 2 shows and compares the PSNR results of the proposed work with those of state-of-the-art algorithms using HD videos (1080p) in the second set. Using the first set of test videos (CIF), and as can be seen in Table 1, the proposed work outperforms the standard three-step search [13], two-dimensional logarithmic search [10] and the diamond search algorithm [23] by 22%, 28% and 23% on average, respectively. Moreover, when compared to the well-known KSHS algorithm [20], ETSS [12], CDMHS [14] and FHFS [30], the proposed algorithm outperforms those algorithms by 12%, 16%, 7% and 2%, respectively.
Using the HD (1080p) videos set, the proposed work outperforms the standard three-step search [13], two-dimensional logarithmic search [10] and the diamond search algorithm [23] by 12%, 26% and 24% on average, respectively. Moreover, when compared to the well-known KSHS algorithm [20], ETSS [12], CDMHS [14] and FHFS [30], the proposed algorithm outperforms those algorithms by 11%, 10%, 9% and 5%, respectively. Figure 5 provides a visual representation of the average PSNR values shown in Table 1 and Table 2 and represents the results of the proposed algorithm compared to the rest of the state-of-the-art algorithms when applied to HD (1080p) high-resolution videos and normal CIF standard videos. Figure 6 shows samples of the reconstructed frames from the HD (1080p) videos listed in set 2. Figure 7, Figure 8 and Figure 9 provide samples of the reconstructed frames from set 1, which contains the well-known standard CIF videos.
The complexity of the proposed work is compared against the benchmark block-based motion estimation algorithms. Let $w$ be the search window size and $N$ the block size (hierarchy-based algorithms use various block sizes at each level of the hierarchy). The full search algorithm evaluates every candidate position in the window with a full-block comparison at both levels. Using an appropriately sized search window ($w$) and an appropriate block size ($N$) for each algorithm, the proposed algorithm requires less than 1% of the total number of additions and the total number of absolute differences compared to the full search algorithm. When compared to the rest of the algorithms, the proposed algorithm requires less than 5% of the total complexity required by the cross-diamond modified hierarchical search algorithm and less than 15% of the total complexity required by the Kalman simplified hierarchical search. This complexity reduction can be attributed to the substantial reduction in the total number of operations required by the proposed matching criterion.
Generally, in order to evaluate the use of motion estimation algorithms in the frequency domain, a performance comparison is conducted that evaluates the standard full search algorithm when applied in both the pixel and frequency domains. Table 3 presents the resulting PSNR values of the full search algorithm implemented in both domains based on the standard set of HD (1080p) test videos (the first 50 frames of each video are included in the test). As shown in Table 3, the resulting average PSNR in the frequency domain is slightly better than that in the pixel domain. This small enhancement does not cover the cost of the extra complexity caused by the transformation process. However, the frequency domain can achieve far better results (in terms of complexity reduction) when accompanied by proper search and matching techniques.
The direct implementation of the Discrete Fourier Transform (DFT) requires $O(N^2)$ operations. However, when using the Fast Fourier Transform (FFT), this can be reduced to $O(N \log N)$, resulting in a substantial difference in the tractability of the DFT. Because the transition between the domains can be computed efficiently, working in the frequency domain becomes practical.

Figure 5. Average PSNR values from Table 1 and Table 2, representing the results of the proposed algorithm compared to the rest of the state-of-the-art algorithms when applied to HD (1080p) high-resolution videos and normal CIF standard videos.
Figure 6. From top to bottom and from left to right, the reconstructed frames from the "Park_Joy", "In-To-Tree", "Station" and "Blue_Sky" video sequences.
Figure 7. Sample of standard videos consisting of low motion activities; from left to right, reconstructed "Akiyo" and reconstructed "Mother and Daughter" video frames.
Figure 8. Sample of standard videos consisting of moderate motion activities; from left to right, reconstructed "News" and reconstructed "Hall" video frames.
Figure 9. Sample of standard videos consisting of highly complex motion activities; from left to right, reconstructed "Flower Garden" and reconstructed "Football" video frames.

CONCLUSIONS
Digital videos consist of successive frames sampled over a period of time and carry high data redundancy. Digital video sizes can be massively reduced by eliminating redundant bits, which can be achieved by proper compression methods. Various types of compression methods have been proposed in literature, and during the last few years many algorithms have been proposed to compress the massive amount of data available in digital videos while maintaining as much of the visual quality as possible. Motion estimation techniques based on block matching algorithms have been widely used for this purpose. In block matching techniques, each video frame is divided into blocks of similar sizes that contain frame pixels. Object movements across successive video frames are searched and investigated on a block basis. In this work, block matching is applied in the frequency domain, where a group of carefully chosen frequencies that identify each block distinctively is tested. The algorithm proposed in this research reduces the total number of required operations, significantly reducing the algorithm's complexity. The proposed algorithm has been tested using standard test videos and has been shown to outperform other state-of-the-art algorithms. Two sets of standard test videos were used in this work: the first set comprises the standard CIF videos and the other set comprises the standard HD (1080p) videos. Using the standard set of CIF videos, the proposed work outperforms the standard three-step search, two-dimensional logarithmic search and the diamond search algorithms by 22%, 28% and 23% on average, respectively. Moreover, when compared to the well-known Kalman simplified hierarchical search algorithm, the enhanced three-step search algorithm, the cross diamond modified hierarchical search and the frequency-based hierarchical fast search, the proposed algorithm outperforms those algorithms by 12%, 16%, 7% and 2%, respectively.
Moreover, using the standard HD (1080p) videos set, the proposed work outperforms the standard three-step search, two-dimensional logarithmic search and the diamond search algorithms by 12%, 26% and 24% on average, respectively. When compared to the well-known Kalman simplified hierarchical search algorithm, the enhanced three-step search algorithm, the cross diamond modified hierarchical search and the frequency-based hierarchical fast search, the proposed algorithm outperforms those algorithms by 11%, 10%, 9% and 5%, respectively. The complexity of the proposed work is compared against the benchmark block-based motion estimation algorithms. Results show that the proposed algorithm requires less than 1% of the total number of additions and the total number of absolute differences when compared to the full search algorithm. Moreover, when compared to the rest of the algorithms, the proposed work requires less than 5% of the total complexity required by the cross-diamond modified hierarchical search algorithm and less than 15% of the total complexity required by the Kalman simplified hierarchical search.