An R-D optimized transcoding resilient motion vector selection

Selection of the motion vector (MV) has a significant impact on the quality of an encoded, and particularly a transcoded, video in terms of rate-distortion (R-D) performance. The conventional motion estimation process in most existing video encoders ignores the rate of residuals and uses only the rate and distortion of the motion compensation step. As a consequence, the selected MV depends on the quantization parameter. Hence, an MV selected for high bit rate compression may not be suitable for low bit rate compression when the video is transcoded with the motion information reuse technique, resulting in R-D performance degradation. In this paper, we propose an R-D optimized motion selection criterion that takes into account the effect of the residual rate in the MV selection process. Based on the proposed criterion, a new two-piece Lagrange multiplier selection is introduced for the motion estimation process. Analytical evaluations indicate that the proposed scheme results in MVs that are less sensitive to changes in bit rate or quantization parameter. As a result, MVs in the encoded bitstream may be reused even after the encoded sequence has been transcoded to a lower bit rate using re-quantization. Simulation results indicate that the proposed technique improves the quality performance of coding and transcoding without any computational overhead.


Introduction
In ubiquitous multimedia access, the same content is transferred between different users with various devices and applications, each with different capabilities in resources such as network bandwidth, display resolution, frame rate, and decoding bit rate. Therefore, there is an urgent need to use adaptation techniques to support moving from one specification to another. One such technique is SNR adaptation, or bit rate reduction, which targets the network bandwidth and/or decoding bit rate limitations of different users.
Scalable video coding (SVC) is a viable solution to address content adaptation. However, as further discussed in Subsection 2.3, deployment of SVC decoders has been limited in practical applications. On the other hand, transcoding is a suitable alternative to accommodate users' requirements by modifying the compressed bitstream. The most straightforward realization of a transcoding architecture is to cascade a decoder and an encoder. However, the cascading scheme imposes a considerable computational complexity. Transcoding is normally performed at intermediate network nodes, where computational power is usually limited; hence, several methods have been proposed in the literature to meet the computational constraint. One such method for bit rate reduction is typically implemented by simple re-quantization [1]. Among existing methods for re-quantization [1,2], the cascaded pixel-domain transcoding (CPDT), which cascades a decoder and a simplified encoder, is computationally effective and has no drift problem.
In CPDT architecture, considering the significant computational complexity of motion estimation (ME) and mode decision (MD) processes [3], the cascaded encoder is simplified to further reduce the transcoding complexity. This simplification is realized by either performing a fast ME and mode refinement [4,5], or skipping ME and MD altogether by reusing the original motion vectors (MV) and modes. Hence, as shown in Figure 1, MVs and modes are extracted from the high quality bitstream and reused in the transcoded bitstream. However, new residuals are calculated by motion compensation, re-quantized by new quantization parameter (QP 2 ), and then re-encoded into the transcoded bitstream. Hence, in this CPDT architecture, the rate-distortion (R-D) performance of the transcoded bitstream is significantly affected by MVs in the original bitstream.
Motion vectors in most practical encoders are selected by conventional motion estimation methods, using a simplified Lagrangian cost function [6]. This in turn may have a negative effect on the performance of video coding. Furthermore, the selected MV is dependent on QP; hence, the MV selected for high bit rate compression is not suitable for low bit rate compression. This feature has a negative impact on the quality performance when a video bitstream is transcoded using CPDT with the MV and mode reuse technique. The quality degradation is more critical when simplified mode decision methods are used on the encoder side to achieve low complexity encoding.
The joint multi-rate optimization method [7] is the most recent R-D optimized encoder that is designed to be used with CPDT transcoding system. In this method, the video is coded with three different QP values at the same time, and the best modes are selected based on a joint optimization. Although this method improves the quality performance of the transcoded bitstream, it suffers from high computational overhead at the encoder side.
In this paper, we propose a new R-D optimized criterion to select MVs that are suitable for both high and low bit rate compressions. We firstly propose a two-piece model for the R-D curve of a block. Then, based on two-piece R-D model, a two-piece Lagrange multiplier selection is introduced for motion estimation process. We show that our proposed scheme results in motion vectors that are less sensitive to changes in bit rate or QP. As a result, MVs which are selected for high bit rate bitstream may be reused even after the bitstream has been transcoded. Compared to the conventional motion estimation method, the proposed motion estimation scheme not only improves the quality of the encoder but also results in a far better R-D performance when the bitstream is transcoded using CPDT with MV reuse technique.
The rest of the paper is organized as follows: Section 2 reviews the background on rate-distortion optimization, the existing R-D models, and the SVC deployment problems. The proposed two-piece R-D model is presented in Section 3. Section 4 describes the proposed motion estimation method and two-piece Lagrange multiplier selection. The application of the proposed motion estimation in a transcoding system is presented in Section 5. The simulation results for R-D performance and computational complexity are reported in Section 6, followed by concluding remarks in Section 7.

Background and literature review
In this section, we first summarize the R-D optimization tool in a video encoder and existing R-D models. Then, we highlight the limitations of the SVC deployment in real applications.

Rate-distortion optimization
A video encoder should optimally determine the coding parameters for each block in order to generate an R-D optimized bitstream. These coding parameters (P B ) mainly include QP, prediction mode, and MV, which are optimally selected for each block or macroblock by minimizing the Lagrangian cost function [8]. In a basic method, the optimization is performed at the block level by checking all the combinations of all the coding parameters using (1).
In (1), R B , D B , and P B are the rate, distortion, and coding parameters of a block, respectively [9], and λ is the Lagrange multiplier. However, a large number of parameter combinations (mode and MV) must be considered. Although the R-D models presented in Subsection 2.2 below can be used for fast optimization, in a typical encoder such as H.264/AVC [10,11], the selection of QP, MV, and mode is performed separately in three consecutive stages of rate control (RC), motion estimation, and mode decision, respectively. Performing the residual coding process for all MV candidates of a block is still computationally intensive [9]; hence, practical motion estimation methods ignore the residual coding and determine the best MV using the data available during the motion estimation process [12,13]. The early motion estimation techniques only minimized the motion compensation distortion using a block matching algorithm [12]. The more recent motion estimation algorithms, such as the one adopted in the joint model (JM) of the H.264/AVC video encoder, are based on the methods proposed in [6] and [13]. In these methods, the MV is selected by minimizing the motion compensation cost, as defined in (2), where R MC and D MC are the number of bits required to code the motion vector and the distortion of motion compensation, respectively, λ is the Lagrange multiplier, which is determined based on QP, and SA is the search area. It should be noted that R MC includes the number of bits to code the motion vector difference (the difference between the actual motion vector (mv) and the motion vector predictor) and the reference frame index; for brevity, only mv appears in the notation. According to (2), the selected MV depends on λ (and QP). In particular, when λ (and the corresponding QP) is small, a bigger MV with a larger number of bits (R MC ) is selected. On the other hand, when λ (and the corresponding QP) is large, a smaller MV is selected.
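The λ dependence described above can be illustrated with a small sketch. It assumes that (2) has the usual Lagrangian form, i.e., the selected MV minimizes D MC (mv) + λ · R MC (mv) over the search area. The candidate list, the numbers, and the function name `conventional_mv` are hypothetical and serve only as an illustration.

```python
from typing import NamedTuple, Sequence, Tuple


class Candidate(NamedTuple):
    mv: Tuple[int, int]  # motion vector (x, y)
    rate: float          # R_MC: bits needed to code the MV
    dist: float          # D_MC: motion compensation distortion


def conventional_mv(cands: Sequence[Candidate], lam: float) -> Candidate:
    """Pick the candidate minimizing D_MC + lam * R_MC (assumed form of (2))."""
    return min(cands, key=lambda c: c.dist + lam * c.rate)


# Hypothetical candidates of one block (not measured data).
CANDS = [
    Candidate((0, 0), 2, 900.0),
    Candidate((1, 0), 6, 600.0),
    Candidate((3, -2), 12, 480.0),
    Candidate((7, 5), 20, 450.0),
]

low_qp_choice = conventional_mv(CANDS, lam=2.0)    # small lambda (low QP)
high_qp_choice = conventional_mv(CANDS, lam=60.0)  # large lambda (high QP)
print(low_qp_choice.mv, high_qp_choice.mv)
```

With these numbers, the small multiplier selects the most expensive candidate and the large multiplier selects a cheap one, which is the behavior that makes a reused MV unsuitable after re-quantization.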
These models estimate rate or distortion based on residual, DCT, quantized, or scanned data, which requires a large amount of computation when used in motion estimation process as they should be evaluated for all motion vector candidates. Furthermore, all the existing R-D models have been proposed for the residual coding part, without considering the motion compensation process. In the following sections, we propose a two-piece R-D model which considers both the prediction and residual coding processes.

SVC limitations
SVC suffers from several shortcomings in practical applications; the most important are mentioned next. There are limited practical software or hardware implementations of SVC encoders and decoders for real-time applications. An SVC decoder also requires more decoding power, which is more critical in mobile applications. Bit rate changes are coarse and limited to the number of layers, and the bit rate cannot be lower than that of the base layer. It should be noted that the bit rate of the base layer cannot be very small because of its negative effect on the R-D performance of the enhancement layers. Finally, compared to single layer coding, SVC imposes about 10% bit rate overhead for the enhancement layer [41]. Because of these problems, transrating is a suitable alternative to accommodate each user's requirements in ubiquitous multimedia access by modifying a compressed bitstream.

The proposed two-piece R-D model
We analyze the R-D optimization in a video encoding process in two steps to extract the new optimization criterion for motion vector selection. For this purpose, we first define the motion compensation and residual coding R-D curves for a block. Then, using a simple model for the R-D curve of residuals, we analyze the R-D optimization within a block. As a result of this step, which is referred to as local optimization, we propose a new two-piece R-D model for the block. This model is later used in Section 4 to analyze the R-D optimization among different blocks in the video sequence. The result of this step, referred to as global optimization, is the new motion estimation criterion and the two-piece Lagrange multiplier selection.

R-D curve of motion compensation
During motion estimation, a search area with a typical size of (2 s +1) × (2 s +1) is considered, where s is the search range. Each mv candidate is coded with R MC (mv) bits, and motion compensation with this mv results in a prediction distortion value of D MC (mv). This distortion can be represented as the energy of residuals, for example, with the sum of squared differences (SSD), sum of absolute differences (SAD), or sum of absolute transformed differences (SATD) metrics. Figure 2 shows all the R-D points of the motion search for one block. Each point on the convex curve may be selected as the best MV, depending on the value of the Lagrange multiplier. Hence, we define these points as the motion compensation R-D curve for that block. It should be noted that the motion compensation R-D curve of a block is independent of its corresponding residual coding process. The motion compensation R-D curves of a few other blocks are illustrated in Figure 3. In this figure, the bold part (B0_MC and B1_MC) represents the motion compensation R-D curve of the corresponding block.

R-D curve of residual coding
A typical residual coder, such as the one used in the H.264/AVC, transforms the residual values, quantizes by a QP, and then uses an entropy coder. By applying different values of QP, several R-D points are obtained which form the R-D curve of residual coding. The residual R-D curve can be represented by an exponential form [6] with a time constant of TC, as expressed in (3), where R Res is the number of bits representing the quantized residuals, D Res is the corresponding quantization distortion which equals to the distortion between the original and the reconstructed residuals [42], and D 0 is the energy of residuals. As described in Subsection 4.3, the same model is also valid for SAD and SATD but with a doubled TC. Several examples of residual R-D curve are shown in Figure 3 for two blocks (B0 and B1). Each curve corresponds to a different MV of a certain block. For example, B0_Res2 is the residual R-D curve for B0 when its second MV is selected.
As it will be described in Subsection 4.3, we have assumed that TC is only related to the size of the block and is independent of the residual values. Hence, the second part of the model of (3), representing the residual coding behavior of that block by an exponential function, is independent of the motion compensation process of that block. The first part of the model of (3) reflects the motion compensation process of the block which is independent of residual coding of that block as well, because D 0 is related only to the original pixels values of the current block and the reconstructed pixels values of the reference frame.

R-D curve of a block
A block rate (R B ) is the sum of the rates of MVs and residuals as presented in (4). The rate of prediction mode has not been considered in (4), since it is fixed during ME step of each block. The block distortion (D B ) is the distortion between the original and the reconstructed block. It can be easily shown that the block distortion (D B ) is equal to the corresponding distortion of the residuals (D Res ) as presented in (5).
We recall that we have used the exponential model of (3) for the R-D curve of residuals. Setting R Res to zero (e.g., by quantizing all residual values to zero) in (5), D B equals D 0 . In this case, the block distortion also equals the energy of the residual, in other words D MC (mv). Hence, D 0 equals D MC (mv), and we express the distortion of the block as (6). This model separates the effects of the motion compensation and residual coding processes on the distortion of the block.
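For reference, the relations (3) to (6) used above can be written out as follows. This is a reconstruction from the surrounding definitions (an exponential residual model with time constant TC, a block rate that is the sum of the MV and residual rates, and D 0 = D MC (mv)), so the exact notation of the original equations is an assumption.

```latex
\begin{align}
D_{Res}(R_{Res}) &= D_0 \, e^{-R_{Res}/TC} \tag{3} \\
R_B &= R_{MC}(mv) + R_{Res} \tag{4} \\
D_B &= D_{Res} \tag{5} \\
D_B &= D_{MC}(mv) \, e^{-R_{Res}/TC} \tag{6}
\end{align}
```

In (6), the factor D MC (mv) depends only on the motion compensation step and the exponential factor depends only on the residual coding step, which is the separation noted in the text.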

Two-piece model for R-D curve of a block
We formulate the formation of a block R-D curve as a constrained optimization problem for that block, as presented in (7). This formulation expresses that, for any value of R B , the MV and R Res should be optimally determined to minimize the distortion value of the block. As mentioned before, this step is referred to as local optimization, because it is the optimum bit allocation between the motion compensation (R MC ) and residual coding (R Res ) processes of each block.
The MV for the block can be found by minimizing (6), subject to the constraint on R B . We minimize (6) in the logarithmic domain, as presented in (8). Using (4) in (8), and according to the derivation in (9), the initial MV is determined by (10), which minimizes D B subject to the constraint on R B . We refer to this MV as the local MV (LMV), as it is selected based on the local optimization within the block.
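The derivation referred to in (8) to (10) can be sketched as follows. It is a reconstruction under the assumption that (6) has the exponential form D B = D MC (mv) · exp(−R Res / TC) and that (4) is R B = R MC (mv) + R Res ; the original equations may differ in notation or scaling.

```latex
\begin{align}
\ln D_B &= \ln D_{MC}(mv) - \frac{R_{Res}}{TC} \tag{8} \\
        &= \ln D_{MC}(mv) - \frac{R_B - R_{MC}(mv)}{TC}
         = \left[ \ln D_{MC}(mv) + \frac{R_{MC}(mv)}{TC} \right] - \frac{R_B}{TC} \tag{9} \\
LMV &= \arg\min_{mv \in SA} \left\{ \ln D_{MC}(mv) + \frac{R_{MC}(mv)}{TC} \right\} \tag{10}
\end{align}
```

For a fixed R B , the last term of (9) is a constant, so the minimizer in (10) does not depend on R B (or on QP). This is why the same LMV is optimal at every residual rate.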
Figure 4 displays a graphical representation for the cost function in (10). In this figure, the distortion axis is in logarithmic scale which corresponds to the logarithm function in (10). In this graphical representation, residual R-D curves, modeled as exponential form, are seen as several lines with the equal slopes of −1 / TC , where TC is the time constant of the R-D curves which is used as a parameter in the model of (3). Therefore, different residual R-D curves of a block are parallel lines, and do not cross each other. Hence, according to (10), the residual R-D curve corresponding to LMV will definitely fall below all remaining residual R-D curves of that block. This implies once again that LMV is the best MV for the block in all R Res values.
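The claim that the LMV is the best MV at every residual rate can be checked numerically. The sketch below assumes the exponential model D B = D MC (mv) · exp(−R Res / TC), the criterion ln D MC + R MC / TC for LMV, and TC = n / (2 ln 2) for a 16 × 16 block; the candidate values and function names are hypothetical.

```python
import math

# Assumed time constant of a 16x16 block: n / (2 ln 2) with n = 256 symbols.
TC = 256 / (2 * math.log(2))

# Hypothetical (R_MC in bits, D_MC as SSD) pairs for the MVs of one block.
CANDS = {
    "MV1": (2, 9000.0),
    "MV2": (8, 7000.0),
    "MV3": (14, 6500.0),
    "MV4": (40, 6000.0),
}


def local_mv(cands, tc):
    """LMV under the assumed criterion: minimize ln(D_MC) + R_MC / TC."""
    return min(cands, key=lambda k: math.log(cands[k][1]) + cands[k][0] / tc)


def best_mv_at_block_rate(cands, tc, r_b):
    """Brute force: minimize D_MC * exp(-(R_B - R_MC) / TC) over MVs with R_MC <= R_B."""
    feasible = {k: v for k, v in cands.items() if v[0] <= r_b}
    return min(
        feasible,
        key=lambda k: feasible[k][1] * math.exp(-(r_b - feasible[k][0]) / tc),
    )


lmv = local_mv(CANDS, TC)
winners = {r_b: best_mv_at_block_rate(CANDS, TC, r_b) for r_b in (40, 100, 400, 1000)}
print(lmv, winners)
```

Because the residual curves are parallel lines in the logarithmic domain, the brute-force winner is the same candidate for every block rate at which all MVs are feasible, and it coincides with the LMV.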
As shown in Figure 4, the LMV of B0 is the third motion vector (MV3), and B0_Res3 is its corresponding residual R-D curve, falling below the other residual curves (B0_Res1 and B0_Res2). For block B1, as another example, the LMV is the second motion vector (MV2). Hence, B1_Res2 falls below the other R-D curves of B1. In Figure 5, an actual R-D curve of a block of size 8 × 8 is presented for the football sequence (quarter common intermediate format (QCIF)), which is quite similar to our model in Figure 4.
As a conclusion, the R-D curve of each block can be formed based on a two-piece model. As shown in Figure 6, the first piece corresponds to the R-D curves of motion compensation of that block up to LMV. From that point on, the block R-D curve continues on the residual R-D curve corresponding to this LMV. As described in the next section, we have extended the use of this two-piece model to the bit allocation between different blocks of a frame to derive a new optimization condition for motion vector selection.

The proposed motion estimation method
To develop a new optimization condition for motion estimation, we analyze the R-D optimization when blocks of a video are coded using inter prediction. We first analyzed the optimization within each block in the previous section which resulted in a local motion vector (i.e., LMV) and a two-piece R-D model for each block. In this section, we analyze the optimization among different blocks in a video. We refer to this step as global bit allocation.
Figure 6 The proposed two-piece model for R-D curve of a block (B0 and B1).

The proposed optimization condition for MV selection
To determine the optimum point for each block, we should consider the optimization among all the blocks in a video. The optimum point for each block is found by applying the Lagrange theorem of (1) to the two-piece R-D model of each block. Based on the value of λ, the optimum point may be either on the motion compensation piece or on the residual coding piece. As shown in Figure 7 for B0, when the value of λ is smaller than a threshold (λ < λ Thr ), the optimum point falls on the residual R-D curve. In this case, LMV is selected as the best MV for this block.
In the other case (λ > λ Thr ), the optimum point falls on the motion compensation R-D curve, and consequently, for the example, shown in Figure 7, MV2 is selected as the best MV for this block. We refer to this MV as global MV (GMV), hereafter, because it is found in global bit allocation step. GMV is found by applying the Lagrange theorem of (1) to the motion compensation R-D curve of the block as presented in (11).
In this case, as shown in Figure 7, we note that the rate of the GMV of the block is smaller than that of LMV. This means that the GMV selected by (11) is the final MV of the block only when its rate is smaller than that of LMV. On the contrary, when its rate is larger than that of LMV, GMV is not the final MV of the block, and LMV is selected instead. Consequently, we select the final MV of the block as whichever of LMV and GMV has the smaller rate, using (12).
The proposed motion estimation algorithm that finds the optimum MV for a block with the proposed optimization condition can be summarized as in Figure 8. In this algorithm, by checking all mv candidates in the search area, LMV and GMV are found using (10) and (11), respectively. Then, the MV of the block is selected using (12).
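The steps summarized above can be sketched in code. This is a minimal illustration, not the algorithm of Figure 8 itself: it assumes that (10) minimizes ln D MC + R MC / TC, that (11) minimizes D MC + λ · R MC , and that (12) keeps whichever of the two has the smaller motion rate. The candidate values and names are hypothetical.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class Candidate(NamedTuple):
    mv: Tuple[int, int]  # motion vector (x, y)
    rate: float          # R_MC in bits
    dist: float          # D_MC (SSD assumed here)


def select_mv(cands: Sequence[Candidate], lam: float, tc: float) -> Candidate:
    """Select the final MV from LMV and GMV (assumed forms of (10)-(12))."""
    # LMV: local optimum, independent of lambda. A tiny floor avoids log(0).
    lmv = min(cands, key=lambda c: math.log(max(c.dist, 1e-9)) + c.rate / tc)
    # GMV: conventional Lagrangian optimum on the motion compensation curve.
    gmv = min(cands, key=lambda c: c.dist + lam * c.rate)
    # Final MV: GMV only if it is cheaper to code than LMV, otherwise LMV.
    return gmv if gmv.rate < lmv.rate else lmv


TC_MB = 256 / (2 * math.log(2))  # assumed time constant of a 16x16 block

CANDS = [
    Candidate((0, 0), 2, 9000.0),
    Candidate((1, 0), 8, 7000.0),
    Candidate((3, -2), 14, 6500.0),
    Candidate((7, 5), 40, 6000.0),
]

high_rate_choice = select_mv(CANDS, lam=5.0, tc=TC_MB)    # small lambda
low_rate_choice = select_mv(CANDS, lam=100.0, tc=TC_MB)   # large lambda
print(high_rate_choice.mv, low_rate_choice.mv)
```

In this example, the small multiplier would make the conventional search pick the 40-bit candidate, but the selection is capped at the LMV; with the large multiplier, the GMV is cheaper than the LMV and is kept.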

The proposed two-piece Lagrange multiplier
The proposed motion estimation should find LMV for each block using (10), where the logarithm function should be evaluated for each motion vector candidate. Based on the simulation results, this increases the execution time of the encoder by about 20% to 30% for different configurations. Based on the proposed two-piece model, we introduce a new two-piece Lagrange multiplier for the motion estimation process to eliminate this computational overhead. The aim is to introduce a Lagrange multiplier that, when used with the conventional motion estimation algorithm, selects the same motion vector that is selected by the proposed optimization condition.
According to the proposed criterion in (12), the final MV of a block should not go beyond the LMV of that block. On the other hand, as shown in Figure 7, the LMV of a block is also selected when the value of the Lagrange multiplier is equal to λ Thr . The value of λ Thr can be found for each block as the negative slope of the residual R-D curve that corresponds to LMV at the point where the residual rate is zero. Based on the model in (5) and (6) for the residual coding curve, λ Thr is calculated using (13). It should be noted that the LMV of the current block is found by the motion search process; hence, LMV and consequently D MC (LMV) in (13) are not yet available. We propose to estimate D MC (LMV) using the information of the collocated block in the previously coded frame. Hence, we estimate the value of λ Thr using (14), where MV′ and D′ MC are the motion vector and the corresponding motion compensation distortion of the collocated block in the previously coded frame. As shown in Figure 7, when λ > λ Thr , the selected GMV is used as the final MV of the block. On the other hand, when λ < λ Thr , the final MV of the block is LMV, which is selected by λ Thr . Hence, we propose to define the two-piece Lagrange multiplier (λ 2-piece ) as in (15). It is worth noting that each case in the model of (15) corresponds to one case in the model of (12). Then, the final MV of the block is selected using (16).
Based on the proposed two-piece Lagrange multiplier, before starting the motion search process for a block, the value of the λ Thr is estimated using (14). Then, the value of the two-piece Lagrange multiplier is determined by (15). Then, the conventional motion estimation process is executed.
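The procedure above can be sketched as follows. The code assumes that the threshold is the negative slope of the exponential residual curve at zero residual rate, λ Thr ≈ D′ MC (MV′) / TC, and that the two-piece multiplier is the larger of λ and λ Thr . These forms are inferred from the text rather than copied from (13) to (15), and all names and numbers are hypothetical.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class Candidate(NamedTuple):
    mv: Tuple[int, int]
    rate: float  # R_MC in bits
    dist: float  # D_MC (SSD assumed here)


def lambda_two_piece(lam: float, d_mc_collocated: float, tc: float) -> float:
    """Two-piece multiplier: never smaller than the estimated threshold."""
    lam_thr = d_mc_collocated / tc  # assumed form of (14)
    return max(lam, lam_thr)        # assumed form of (15)


def conventional_search(cands: Sequence[Candidate], lam: float) -> Candidate:
    """Unmodified Lagrangian search, D_MC + lam * R_MC."""
    return min(cands, key=lambda c: c.dist + lam * c.rate)


TC_MB = 256 / (2 * math.log(2))  # assumed time constant of a 16x16 block
D_COLLOCATED = 6400.0            # hypothetical D'_MC of the collocated block

CANDS = [
    Candidate((0, 0), 2, 9000.0),
    Candidate((1, 0), 8, 7000.0),
    Candidate((3, -2), 14, 6500.0),
    Candidate((7, 5), 40, 6000.0),
]

lam_small = lambda_two_piece(5.0, D_COLLOCATED, TC_MB)    # clamped to lam_thr
lam_large = lambda_two_piece(100.0, D_COLLOCATED, TC_MB)  # left unchanged
mv_small = conventional_search(CANDS, lam_small)
mv_large = conventional_search(CANDS, lam_large)
print(mv_small.mv, mv_large.mv)
```

No logarithm is evaluated per candidate: the only extra work is one division and one comparison per block before the usual search starts.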

Optimization parameters selection
The proposed motion estimation algorithm has two optimization parameters, namely λ and time constant (TC). The parameter of λ is the Lagrange multiplier which is used in mode decision and is defined as (17) for H.264/AVC, where c is about 0.85 according to experimental results [43].
In this section, we discuss the value of TC in the model of (3). The R-D function for a source with normal distribution is modeled as (18) [42], where r and d are average rate and distortion (SSD) per symbol, expressed by (19) and n is the number of symbols in the residual block. Using (19) in (18), the R-D curve of the residual is extracted as (20). Comparing (20) to the model we used in (3), TC can be expressed as (21) for each block.
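The chain from (18) to (21) can be written out as follows. This is a reconstruction that assumes (18) is the standard R-D function of a Gaussian source with per-symbol variance σ² = D 0 / n and rate measured in bits; the original equations may use a different but equivalent notation.

```latex
\begin{align}
r &= \tfrac{1}{2} \log_2 \frac{\sigma^2}{d} \tag{18} \\
r &= \frac{R_{Res}}{n}, \qquad d = \frac{D_{Res}}{n} \tag{19} \\
D_{Res} &= D_0 \, 2^{-2 R_{Res}/n} = D_0 \, e^{-\frac{2 \ln 2}{n} R_{Res}} \tag{20} \\
TC &= \frac{n}{2 \ln 2} \tag{21}
\end{align}
```

Under this assumption, a 16 × 16 block (n = 256) has a time constant of about 184.7 bits, and TC scales linearly with the number of symbols in the block.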
Hence, the value of TC is calculated for each block based on the number of coefficients in that block. As the value of n is related to the size of block, TC can be calculated based on the size of residual block as (22), where W and H are width and height of the block, and TC MB is the time constant of MB with the size of 16 × 16. For the case that the distortion of motion compensation is calculated in terms of SAD, the values of λ and TC can also be calculated as below. The relation between SAD and SSD can be expressed as (23) where α is approximately 1.25 for the residuals with zero mean Gaussian distribution. Hence, the Lagrange multiplier for SAD can be expressed as (24) which has also been noted in [43].
Substituting the model of (3) in (23), SAD can be expressed as (25) which implies that the TC for SAD distortion metric can be expressed as (26).
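The parameter choices of this subsection can be collected in a few helper functions. The sketch assumes the widely used H.264/AVC form λ = c · 2^((QP − 12) / 3) for (17), TC = n / (2 ln 2) for (21), linear scaling with block area for (22), the square-root relation for the SAD multiplier in (24), and a doubled time constant for SAD in (26). The function names are hypothetical.

```python
import math


def lambda_mode(qp: int, c: float = 0.85) -> float:
    """Mode-decision multiplier for SSD, assumed form of (17)."""
    return c * 2 ** ((qp - 12) / 3)


def lambda_sad(qp: int) -> float:
    """Multiplier for SAD-based motion search, assumed form of (24)."""
    return math.sqrt(lambda_mode(qp))


def time_constant_ssd(width: int, height: int) -> float:
    """TC for an SSD-based model: TC_MB scaled by block area, TC_MB = 256 / (2 ln 2)."""
    tc_mb = 256 / (2 * math.log(2))
    return tc_mb * (width * height) / 256


def time_constant_sad(width: int, height: int) -> float:
    """TC for a SAD-based model: twice the SSD value, assumed form of (26)."""
    return 2 * time_constant_ssd(width, height)


tc_16x16 = time_constant_ssd(16, 16)
tc_8x8 = time_constant_ssd(8, 8)
print(round(tc_16x16, 2), round(tc_8x8, 2), round(lambda_mode(30), 2))
```

All values depend only on QP and block size, so they can be tabulated once per encoder configuration rather than computed per block.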

Transcoding-resilient motion estimation
As described in the Introduction section, the motion vector selected by conventional motion estimation using (2) is influenced by λ, and consequently by QP. Hence, the MV selected for high bit rate compression may not be suitable for low bit rate compression when the video is transcoded with motion information reuse, resulting in quality degradation.
In contrast, based on the theoretical analyses and empirical results presented below, we demonstrate that the MV selected by the proposed motion estimation process is less affected by QP. In other words, it is suitable for both high bit rate and low bit rate compressions. Hence, the MV selected for high bit rate compression can also be used for low bit rate compression when transcoding the video, without degrading the quality significantly. Thus, the motion re-estimation stage during transcoding is no longer required, leading to a much simpler and faster adaptation, while at the same time providing better quality.
The proposed motion estimation uses (12) to select the motion vector. In high bit rate compression, QP and its corresponding λ value are small, leading to a large R MC (GMV) according to (11). As a result, the LMV is the most dominant motion vector selected by the proposed motion estimation according to (12). This observation is supported by the experimental results performed over a wide range of QPs, for a variety of video contents and resolutions (city, crew, harbor, soccer in 4CIF and football, garden, mobile, Paris, silent, Stefan, students, tennis in QCIF). Based on the experimental analysis, presented in Figure 9, for QPs in the range of 20 to 30, near 90% of MVs selected by the proposed motion estimation method are LMVs.
Furthermore, the LMV of a block is not significantly affected by changes in QP because, as presented in (10), LMV only depends on the motion compensation rate (R MC ) and the distortion (D MC ). As described below in more detail, R MC is unchanged, and D MC is almost unchanged when transcoding a video.
If the selected motion vectors remain unchanged for neighboring blocks (which is valid when transcoding with mode and MV reuse), the predicted motion vectors and consequently the rate of the candidate motion vectors (R MC ) remain unaffected for the current block.
To analyze the behavior of D MC , we have compared D MC values of a block for the same candidate motion vectors when different QP values are used to code the reference frame. Let the distortion ratio (DR) be defined as (27), which represents the ratio of the motion compensation distortions of a block for the same candidate motion vector (mv) when the reference frame is encoded with two different QP values, QP 1 and QP 2 .
The probability distribution function (PDF) of the distortion ratio is depicted in Figure 10 for block sizes of 16 × 16 and 4 × 4 when the QP value for the reference data is changed from 20 to 30 and from 20 to 40 for blocks of football and foreman sequences of size QCIF. It can be seen that most of the ratios are near 1.0, which implies that the distortion value of a candidate motion vector takes nearly the same value when QP changes. As a result, the motion compensation rate (R MC ) and distortion (D MC ), and consequently the LMV of the block are not significantly affected when QP changes.
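The distortion ratio used in this analysis can be computed as in the following sketch. It assumes that DR is the distortion obtained with the reference coded at QP 2 divided by the distortion obtained with the reference coded at QP 1 , and it uses SAD as the distortion metric; both choices, as well as the tiny sample blocks, are assumptions made for illustration.

```python
from typing import List

Block = List[List[int]]


def sad(block: Block, prediction: Block) -> int:
    """Sum of absolute differences between a block and its prediction."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block, prediction)
        for a, b in zip(row_a, row_b)
    )


def distortion_ratio(current: Block, pred_qp1: Block, pred_qp2: Block) -> float:
    """D_MC with the QP2 reference divided by D_MC with the QP1 reference."""
    d1 = sad(current, pred_qp1)
    d2 = sad(current, pred_qp2)
    if d1 == 0:
        raise ValueError("distortion with the QP1 reference is zero")
    return d2 / d1


# Hypothetical 2x2 block and its predictions for the same candidate mv,
# taken from reference frames reconstructed at two different QP values.
current = [[10, 12], [14, 16]]
pred_qp1 = [[9, 12], [15, 18]]
pred_qp2 = [[8, 12], [15, 18]]

dr = distortion_ratio(current, pred_qp1, pred_qp2)
print(dr)
```

A histogram of such ratios over all blocks and candidates gives the kind of distribution discussed above; values concentrated near 1.0 mean that the ranking of candidates is largely preserved after re-quantization.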
To summarize, as indicated by the analyses and the simulation results, LMV is the most dominant MV that is selected by the proposed motion estimation method. Furthermore, LMV is largely unaffected by changes in QP. This suggests that the MV selected by the proposed method is not significantly affected by QP for all practical purposes.
It should be noted that the two-piece Lagrange multiplier, proposed in Subsection 4.2, is designed to select the same motion vector that is selected using the proposed condition of (12). Hence, the property described in this section is also valid for the MV which is selected by the proposed two-piece Lagrange multiplier. This property can also be justified based on the concept of the two-piece Lagrange multiplier. When coding a video with a low QP using the conventional condition of (2), the Lagrange multiplier (λ) is small, and a large MV is selected, which is not suitable for large QP values. According to (15), however, the two-piece Lagrange multiplier (λ 2-piece ) is never smaller than λ Thr , and hence, at low QP, it is larger than λ. As a result, the MV selected by λ 2-piece is smaller and also suitable for encoding with a larger QP.

Simulation results
We analyze the quality performance of the proposed motion estimation for an encoding and a transcoding system in Subsections 6.1 and 6.2, respectively. The complexity of the proposed algorithm and the quality-complexity trade-off are presented in Subsections 6.3 and 6.4, respectively. Motion estimation with the proposed two-piece Lagrange multiplier in (15) has the same R-D performance as motion estimation with the proposed condition in (12), while it imposes no computational overhead on the encoder. Hence, we present only the results for the proposed two-piece Lagrange multiplier method, which is referred to as PropME hereafter.
PropME has been evaluated with JM16.2 [44], the reference software of H.264/AVC, and compared with its motion estimation algorithm referred to as the conventional motion estimation (ConvME) hereafter. The baseline profile and its corresponding configuration file have been used for the simulations. IntraPeriod option is set to 50. UMHex motion search method with the search range of 32 and number of reference frames of 1 and 5 have been used in analyses.
Different test sequences have been coded when RC is enabled and disabled. In disabled RC scenario (RC = OFF), constant QP values of 20, 24, 32, 36, and 40 are used for encoding. In enabled RC scenario (RC = ON), initial QP is set to different values, mentioned above, and the target bit rate is set to the corresponding bit rate produced in disabled RC scenario. The average coding efficiency gain for all QP values has been calculated using the Bjøntegaard delta rate (BDR) [45] that presents the bit rate expansion of a method with respect to an anchor method. As a negative value of BDR means coding efficiency gain, we have presented the negative of BDR (−BDR%) in the tables and figures. When comparing the coding efficiency for a specific QP value, BDR method cannot be used, because it calculates the overall coding gain when having several R-D points. Instead, we have used the bit rate (BR) and PSNR values of the method and the anchor for that specific QP to calculate the coding efficiency. It should be noted that the PSNR values of different methods are very close for the same QP value. For a fair comparison, however, the slight change in PSNR has been compensated for in the bit rate, hence the coding efficiency is measured by calculating the percentage of bit rate reduction (BRR) at the same quality using (28).
In (28), PsnrBrRatio is the ratio of the PSNR change to the percentage of bit rate change. The value of PsnrBrRatio is calculated for each test sequence, test configuration, and QP value using the PSNR and bit rate values of two consecutive QP values (QP and QP′), as shown in (29).
The value of PsnrBrRatio is about 5 which means a 0.5-dB difference in PSNR corresponds to a 10% difference in bit rate [45]. As presented in tables below, the average values of BRR over different QP values (Avr.) are almost the same as the values of BDR. This indicates that the BRR method of (28) accurately estimates the coding efficiency.
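The compensation idea can be illustrated with a short sketch. The exact forms of (28) and (29) are not reproduced here; the code assumes that PsnrBrRatio is the PSNR difference divided by the relative bit rate difference between two consecutive QP values, and that BRR adds the PSNR gain, converted to an equivalent relative bit rate saving, to the raw bit rate saving. All names and numbers are hypothetical.

```python
import math


def psnr_br_ratio(psnr_qp: float, br_qp: float,
                  psnr_next: float, br_next: float) -> float:
    """PSNR change (dB) per unit of relative bit rate change (assumed form of (29))."""
    return (psnr_qp - psnr_next) / ((br_qp - br_next) / br_qp)


def bit_rate_reduction(br_anchor: float, psnr_anchor: float,
                       br_method: float, psnr_method: float,
                       ratio: float) -> float:
    """Bit rate reduction in percent at equal quality (assumed form of (28))."""
    raw_saving = (br_anchor - br_method) / br_anchor
    psnr_compensation = (psnr_method - psnr_anchor) / ratio
    return 100.0 * (raw_saving + psnr_compensation)


# Hypothetical operating points: 0.5 dB corresponds to a 10% bit rate change.
ratio = psnr_br_ratio(psnr_qp=36.0, br_qp=1000.0, psnr_next=35.5, br_next=900.0)

# Hypothetical comparison: 5% fewer bits and 0.05 dB higher PSNR than the anchor.
brr = bit_rate_reduction(br_anchor=1000.0, psnr_anchor=36.0,
                         br_method=950.0, psnr_method=36.05, ratio=ratio)
print(round(ratio, 3), round(brr, 3))
```

Under these assumptions, a ratio of 5 converts the 0.05 dB gain into an extra 1% saving, for a total of about 6%.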
We have tested the R-D performance of PropME in combination with several well-known MD approaches in the reference software, referred to as MD0, MD1, MD2, and MD4. MD1 is a high complexity full search mode decision, also known as RDO-ON in JM software, where a macroblock is fully coded for each mode in order to determine its final rate and distortion. Then, the best mode is selected using the Lagrangian cost of (1). On the other hand, MD0 is a low complexity simplified mode decision method, also known as RDO-OFF in JM software, where the residual is not coded and the motion compensation cost in (2) is used to determine the best mode. MD2 and MD4 are similar to MD1 and MD0, respectively, but they use an early skip selection method to reduce the computational cost, especially at low bit rate compression. Although the MD0 (and MD4) approach degrades the quality performance, it has a much lower computational cost and is mostly used in practical hardware and real-time software implementations of H.264/AVC to meet power consumption, gate count, and/or real-time constraints. For a comprehensive analysis, we also evaluated PropME when different distortion metrics, referred to as D0 and D1 hereafter, are used for motion estimation. In both cases, SAD is used in integer motion estimation (IME), because SSD and SATD impose an intolerable complexity when used with IME. In the D0 and D1 cases, the distortion metric of fractional motion estimation is set to SATD and SSD, respectively. It should be noted that SATD requires the Hadamard transform; hence, its computational complexity is higher than that of SSD. According to the simulation results, the overall encoder complexity when using D1 is 15% to 35% less than when using D0.
Based on the above abbreviations, we present different test configurations with DxMDy notation. For example, in D1MD0 configuration, D1 means that the distortion metric of fractional motion estimation is set to SSD, and MD0 means that mode decision method is the simplified mode decision.
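As a reading aid for the DxMDy notation, the following minimal sketch decodes a configuration label into the options described above. The function and dictionary names are hypothetical and are not part of the JM software; only the meaning of each label is taken from the text.

```python
import re

# Meaning of each label as described in the text. The names of these
# dictionaries and of parse_config are hypothetical illustrations.
FRACTIONAL_ME_METRIC = {"D0": "SATD", "D1": "SSD"}
MODE_DECISION = {
    "MD0": "simplified (RDO-OFF)",
    "MD1": "full search (RDO-ON)",
    "MD2": "full search (RDO-ON) with early skip",
    "MD4": "simplified (RDO-OFF) with early skip",
}


def parse_config(label):
    """Split a DxMDy label such as 'D1MD0' into its two components."""
    match = re.fullmatch(r"(D[01])(MD[0124])", label)
    if match is None:
        raise ValueError(f"not a DxMDy configuration label: {label!r}")
    distortion, mode_decision = match.groups()
    return {
        # SAD is used for integer motion estimation in both D0 and D1.
        "integer_me_metric": "SAD",
        "fractional_me_metric": FRACTIONAL_ME_METRIC[distortion],
        "mode_decision": MODE_DECISION[mode_decision],
    }


if __name__ == "__main__":
    print(parse_config("D1MD0"))
```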
The anchor method, which is also reported with each result, depends on the comparison. When reporting the gain of PropME over ConvME, the anchor is the corresponding configuration with ConvME. When comparing the quality-complexity performance of various combinations of motion estimation, mode decision, and distortion metric, the anchor is the D1MD0 configuration with ConvME.

Quality performance for video encoding
In the simplified mode decision (MD0 or MD4), which uses (2), the R-D cost of each mode of a macroblock is calculated by accumulating the motion compensation R-D costs of all the corresponding sub-blocks. These R-D costs are then compared to select the best mode for the macroblock. When using ConvME, a large MV may be inappropriately selected, especially for small block sizes (e.g., Inter 8 × 8). As a result, a small R-D cost may be calculated for small block size modes, which are then mostly selected as the best mode of the macroblock. This results in inappropriate best mode selection and R-D performance degradation. On the other hand, PropME limits the MV to be smaller than LMV, and hence it prevents an inappropriately small motion compensation R-D cost. Consequently, the simplified mode decision is not trapped in selecting small block size modes, resulting in better R-D performance as reported below in this section.
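The accumulation described above can be sketched as follows. This is only an illustration under stated assumptions: the per-block cost is assumed to take the usual Lagrangian form J = D + λR, with D the motion compensation distortion and R the MV bits, since (2) is not reproduced in this section. The function names and all numbers are hypothetical.

```python
def motion_compensation_cost(distortion, mv_bits, lam):
    """Assumed form of the per-block cost: J = D + lambda * R."""
    return distortion + lam * mv_bits


def mode_cost(sub_blocks, lam):
    """Accumulate the motion compensation costs of all sub-blocks of a mode."""
    return sum(motion_compensation_cost(d, r, lam) for d, r in sub_blocks)


def simplified_mode_decision(candidates, lam):
    """Return the mode whose accumulated cost is smallest."""
    return min(candidates, key=lambda mode: mode_cost(candidates[mode], lam))


if __name__ == "__main__":
    # Hypothetical (distortion, MV bits) pairs for each sub-block of a mode.
    candidates = {
        "Inter16x16": [(1000, 6)],
        "Inter8x8": [(230, 14)] * 4,  # lower distortion, costly MVs
    }
    for lam in (1, 20):
        print(lam, simplified_mode_decision(candidates, lam))
```

With these made-up values, a small λ lets the expensive MVs of the small blocks go almost unpenalized, so the small block size mode wins; a larger λ reverses the choice.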
In the full search mode decision (MD1 or MD2), the R-D costs of the different modes of the macroblock are instead calculated using the actual bit rate and final distortion of all the corresponding sub-blocks. In this case, the best mode, which also has the best MVs, is selected as the final mode. Hence, full search mode decision methods compensate for most of the shortcomings of the motion estimation process. This has been confirmed by the simulations, where PropME with MD1 (and MD2) resulted in a coding efficiency gain of only about 0.1% for video coding applications. Consequently, there was no point in reporting these results in detail.
The coding efficiency results of PropME over ConvME are presented for all the test sequences in Table 1 for the D1MD0 configuration. As will be presented below in more detail, PropME with the D1MD0 configuration shows the best quality and computational performance in both video coding and transcoding applications. The bit rate reduction is more significant for the sequences with higher motion activity, where it is about 2.5 times the average value. Furthermore, the R-D performance improvement is more noticeable in high PSNR ranges (low QPs). This is mostly because at high PSNR, where QP and consequently λ are very small, ConvME results in a large MV, whereas PropME limits the rate of the MV to the rate of LMV instead. With less rate spent on the MV, the R-D performance of the coded video improves. On the other hand, at low PSNR values (larger QPs), λ has a larger value and the optimum point mostly lies on the motion compensation R-D curve. In this case, PropME normally selects the same MV as ConvME does, so they show almost the same R-D performance.
The bit rate reduction is higher for higher-resolution video sequences (4CIF and HD). The reason is that the MV values are larger in high-resolution video sequences and they consume more bits. Furthermore, each block corresponds to a smaller area in the real world; hence, the motion compensation is more accurate, residual values are small, and residual coding generates fewer bits. As a result, the share of the MV bits in the total bits is higher, and optimum selection of the MV leads to a higher bit rate reduction.

The coding efficiency results have been presented in Table 2 for different video sizes, different configurations, and different number of reference frames. The averaging is performed among different QP values and all test sequences.
Figure 11 Bit rate-PSNR performance of different transcoding methods for football (QCIF) test sequence. Curves: Encode, Cascaded, CPDT (PropME), CPDT (ConvME).

Quality performance for video transcoding
The R-D performance of PropME in the transcoding system has been evaluated when the high quality bitstream is generated at the encoder using both ConvME and PropME, and then it is transcoded by the CPDT method with mode and MV reuse [1]. In our test scenario, as shown in Figure 1, each video sequence is coded by a constant QP 1 equal to 20 (or by the corresponding target bit rate and initial QP 1 of 20 when RC is enabled), which generates the high quality bitstream. Then, CPDT transcoding is performed on this bitstream by re-quantizing the residuals with constant QP 2 values of 24, 28, 32, 36, and 40 (or by the corresponding target bit rates and initial QP 2 values when RC is enabled). These QPs generate bitstreams with approximately 50%, 32%, 22%, 16%, and 14% of the bit rate of the high quality stream, respectively. Figure 11 shows the quality performance of different transcoding methods for the football test sequence with QCIF resolution. The performance of the video when originally coded by QP 2 ('Encode' curve) is also presented in the figure as an upper bound of quality performance. It is shown that the cascaded transcoding method ('Cascaded' curve) has the best performance. However, it suffers from high computational complexity at the transcoding node. The CPDT transcoding method on the sequence that is generated by ConvME ('CPDT(ConvME)' curve) has the worst R-D performance, while transcoding the bitstream that was generated by PropME ('CPDT(PropME)' curve) results in a lower bit rate at the same PSNR quality.
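To give a sense of how much coarser the re-quantization becomes at each QP 2, the sketch below computes the H.264/AVC quantization step size, which doubles for every increase of 6 in QP, and its ratio to the step at QP 1 = 20. The function name is hypothetical, and the step ratio is not a predictor of the bit rate percentages quoted above, which are experimental results.

```python
# Quantization step sizes for QP 0..5 in H.264/AVC; the step doubles for
# every increase of 6 in QP.
QSTEP_BASE = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)


def qstep(qp):
    """Return the quantization step size for an H.264/AVC QP in 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))


if __name__ == "__main__":
    qp1 = 20
    for qp2 in (24, 28, 32, 36, 40):
        ratio = qstep(qp2) / qstep(qp1)
        print(f"QP2={qp2}: Qstep={qstep(qp2):g}, ratio to QP1={ratio:.2f}")
```

For instance, moving from QP 20 to QP 32 is an increase of 12, so the step size is exactly four times larger.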
The results for all test sequences and the configuration of D1MD0 are presented in Table 3, which shows the coding efficiency when the high quality bitstream is transcoded by the CPDT method ('Transcoding QP2' columns in Table 3). The bit rate reduction is presented for the encoding of the high quality bitstream ('Enc. QP1' column in Table 3) which is coded with QP 1 of 20. The average of encoding and transcoding bit rate reduction and BDR are also presented in the last two columns of Table 3. The bit rate reduction is more noticeable when the transcoder aims to produce a bitstream with lower bit rates or quality (e.g., 'QP 2 (40)' compared to 'QP 2 (36)'). Similar to the encoding application, the bit rate reduction is more significant for the sequences with higher motion activity or higher resolution. The reasons for the above observations are the same as the ones discussed in Subsection 6.1 for video coding application. The results of average bit rate reduction are presented in Table 4 for different video sizes, different configurations, and different number of reference frames.

Complexity performance analysis
We experimentally study the computational complexity overhead of the PropME over ConvME using the encoding time reported by JM16.2 software. In this study, we extracted the extra execution time (EET) of each method using (30).
The underlying platform is a computational grid of different personal computers running the Windows operating system. It should be noted that the different methods for the same video and QP, which are used together to compute the EET, were scheduled to be executed on the same computer in the computational grid. The average values of EET are reported in Table 5 for different video sizes and different QP values when the encoding configuration is D1MD0. EET with the other configurations shows similar results. The results in Table 5 indicate that the computational overhead of PropME over ConvME is 0.4% on average, which is negligible.
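As an illustration of how such an overhead figure can be obtained from encoding times, the sketch below assumes that EET is the relative increase in encoding time over the anchor, expressed in percent. This form is an assumption, since (30) is not reproduced in this section; the function names and the timing values are hypothetical.

```python
def extra_execution_time(method_seconds, anchor_seconds):
    """Assumed EET: relative increase in encoding time over the anchor, in %."""
    if anchor_seconds <= 0:
        raise ValueError("anchor encoding time must be positive")
    return 100.0 * (method_seconds - anchor_seconds) / anchor_seconds


def average_eet(time_pairs):
    """Average EET over (method_seconds, anchor_seconds) pairs.

    Each pair is expected to come from runs of the same video and QP on
    the same computer, as in the measurement setup described in the text.
    """
    values = [extra_execution_time(m, a) for m, a in time_pairs]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical encoding times in seconds: (PropME, ConvME).
    runs = [(100.5, 100.0), (200.6, 200.0), (50.2, 50.0)]
    print(f"average EET = {average_eet(runs):.2f}%")
```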

Complexity-quality performance analysis
We have studied the quality performance of PropME for video coding and transcoding in Subsections 6.1 and 6.2, respectively, and the complexity performance in Subsection 6.3. In this subsection, we study the complexity-quality performance of different combinations of motion estimation algorithms and mode decision methods. For this purpose, we extracted the BDR and EET parameters of all combinations with respect to D1MD0 with ConvME. Then, we summarized and compared the quality-complexity performance of the different methods for video coding application in Figure 12 and for video transcoding application in Figures 13 and 14.
For video coding application, the quality-complexity comparison is presented for different configurations and motion estimation methods in Figure 12. As noted in Subsection 6.1, PropME and ConvME have almost the same quality performance when the mode decision is MD1 or MD2. Hence, only the results of PropME are presented in the figure for MD1 and MD2. In this figure, BDR and EET have been extracted by averaging over all video sizes. It can be observed from Figure 12 that in the case of simplified mode decision (i.e., MD0 and MD4), using D1 instead of D0 reduces the computational overhead of the encoder by about 15% to 35%. At the same time, the coding efficiency of ConvME is also reduced, while the coding efficiency of PropME is increased. This indicates that PropME works properly when D1 is used to reduce the complexity of the encoder. This configuration is useful when a very low complexity encoder is needed.
Results for video transcoding application are presented in Figures 13 and 14, where BDR is calculated using the R-D points of video encoding with QP 1 of 20 and video transcoding with QP 2 of 24, 28, 32, 36, and 40. Figure 13 presents the quality-complexity comparison of different configurations and the joint multi-rate optimization method [7]. Since PropME has only 3% better coding efficiency than ConvME when used with MD1 and MD2, only the PropME results are reported for these mode decisions in Figure 13. According to Figure 13, PropME provides better performance in all configurations. It is worth noting that PropME with MD0 or MD4 also provides better performance than MD1 and MD2. The joint multi-rate optimization method is the best in quality performance, but it is computationally expensive.
Figure 12 Quality-complexity comparison for different configurations of video encoding in video coding application. Rate control = OFF, (Anchor: D1MD0 (ConvME)). Legend: D1MD0, D1MD4, D0MD0, and D0MD4 with ConvME; D1MD0, D1MD4, D0MD0, D0MD4, D1MD1, D1MD2, D0MD1, and D0MD2 with PropME.
Figure 14 Quality-complexity comparison for different transcoding methods. Rate control = OFF, (Anchor: D1MD0 (ConvME) + CPDT). Axis: normalized fps of transcoding. Legend: D1MD4 (PropME) + CPDT; D0MD1 (ConvME) + Shen_MV; D0MD1 (ConvME) + Shen_est; D0MD1 (ConvME) + Shen_real; D0MD1 (ConvME) + Cascade.
In Figure 14, we compare the performance of the combination of PropME and the CPDT transcoding method with the fast transcoding methods proposed in [5], namely Shen_est, Shen_real, and Shen_MV. The different combinations are presented in Table 6.
As the computational complexities of the transcoding methods differ widely, we present the normalized frame rate per second (NFPS) at which transrating can be performed. NFPS is the number of frames that can be transrated by a given method divided by that of the CPDT method. In Figure 14, it can be observed that PropME with the CPDT transcoding method results in the fastest transcoding with acceptable quality performance. The different variations of the methods presented in [5] not only have a lower transcoding frame rate but also require higher computational complexity at the encoder side, as they use the D0MD1 configuration.
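The NFPS definition above amounts to a simple ratio, sketched below. The function names and the frame counts and timings are hypothetical and serve only to illustrate the normalization against CPDT.

```python
def frames_per_second(frames, seconds):
    """Transrating throughput of one method."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return frames / seconds


def normalized_fps(method_fps, cpdt_fps):
    """NFPS: throughput of a method divided by that of the CPDT method."""
    if cpdt_fps <= 0:
        raise ValueError("CPDT frame rate must be positive")
    return method_fps / cpdt_fps


if __name__ == "__main__":
    # Hypothetical measurements: 300 frames transrated by each method.
    cpdt = frames_per_second(300, 2.5)
    other = frames_per_second(300, 10.0)
    print(normalized_fps(cpdt, cpdt))   # CPDT itself is the reference
    print(normalized_fps(other, cpdt))  # a slower method falls below 1
```

By construction, CPDT has an NFPS of 1, and any method that transrates fewer frames in the same time has an NFPS below 1.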

Conclusions
In this paper, we proposed a new optimization condition for MV selection. The motion vector selected by the proposed motion estimation is less sensitive to changes in bit rate or quantization parameter, making it suitable for both high bit rate and low bit rate compression. As a result, the MV in the high quality bitstream may be used even after the sequence has been transcoded. This enables fast transcoding, where the bit rate is reduced by re-quantization of residuals, and modes and motion vectors are reused from the high bit rate bitstream to eliminate the motion re-estimation cost in the adaptation node. Compared to the conventional motion estimation algorithm, the proposed motion estimation scheme not only improves the R-D performance of the coded bitstream but also results in significantly improved transcoding efficiency. These improvements have been achieved with negligible computational overhead. This makes the proposed motion estimation method suitable for video encoding in ubiquitous multimedia systems where different users have different network bandwidth and/or decoding bit rate capabilities.