Mobile Video Communications Based on Fast DVC to H.264 Transcoding

Nowadays, mobile devices demand multimedia services such as video communications, owing to advances in mobile communications systems (such as 4G) and the integration of video cameras into mobile devices. However, these devices have limited computing power and resources, which constrains the complexity of the algorithms they can run. For this reason, in order to establish video communications between mobile devices, it is necessary to use low-complexity encoding techniques. Traditional video codecs (such as H.264/AVC (ISO/IEC, 2003)) do not meet these low-complexity requirements because in H.264/AVC most of the complexity lies at the encoder side. Consequently, mobile video communications based on low-complexity H.264/AVC configurations imply a penalty in terms of Rate-Distortion (RD). In contrast, Distributed Video Coding (DVC) (Girod et al., 2005), and particularly Wyner-Ziv (WZ) video coding (Aaron et al., 2002), provides a novel video paradigm where the complexity of the encoder is reduced by shifting it to the decoder (Brites et al., 2008). Taking into account the benefits of both paradigms, WZ to H.26X transcoders have recently been proposed in the multimedia community to support mobile-to-mobile video communications. The transcoding framework provides a scheme where transmitter and receiver execute lower-complexity algorithms and the majority of the computation is moved to the network, where the transcoder is located. This complexity is thus assumed by a transcoder, which has more resources and no battery limitations. Nevertheless, real-time communications require this conversion from WZ to H.264/AVC to be performed with a short delay, so the transcoding process must be executed as efficiently as possible.

full of multicore systems, an approach is proposed to execute WZ decoding in a parallel way. On the other hand, while WZ decoding is in progress, some information can be gathered and sent to the H.264/AVC encoder in order to reduce the complexity of the encoding algorithm. In this work, the search area of the Motion Estimation (ME) process is reduced by means of Motion Vectors (MVs) calculated in the WZ decoding algorithm. In this way, the complexity of the two most complex tasks of this framework (WZ decoding and H.264/AVC encoding) is largely reduced, making the transcoding process more efficient.

Wyner-Ziv video coding
The first practical Wyner-Ziv framework was proposed at Stanford (Aaron et al., 2002), and this work was widely referenced and improved in later proposals. As a result, an architecture called DISCOVER, which outperforms the earlier Stanford one, was proposed in (Artigas et al., 2007). This architecture provided a reference for the research community and was later improved upon with the VISNET-II architecture (Ascenso et al., 2010), which is depicted in Figure 1. In this architecture, the encoder splits the sequence into two kinds of frames, Key Frames (K) and Wyner-Ziv Frames (WZ), in module (1). K frames are encoded by an H.264/AVC encoder in (2). WZ frames, on the other hand, are sent to the WZ encoder, where the information is first quantized (3a) and BitPlanes (BPs) are extracted (3b); in (3c) each BP is independently channel encoded and several parity bits are calculated, which are stored in a buffer (3d). On the decoder side, K frames are first decoded by an H.264/AVC decoder (4). From these frames, Side Information (SI) is calculated in (5), which represents an estimate of each original WZ frame that is not available at the decoder. For this estimate, the Correlation Noise Model (CNM) module (6) generates a Laplacian distribution, which models the residual between the SI and the original frame. Afterwards, SI and CNM are sent to the turbo decoder, which corrects the differences between the SI and the original frame by means of iterative decoding (requesting several parity bits from the encoder through the feedback channel). Finally, the decoded bitplanes are reconstructed in module (7c).
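The quantization (3a) and bitplane extraction (3b) steps of the WZ encoder can be illustrated with a minimal sketch. It assumes 8-bit pixel-domain frames and a uniform quantizer that simply keeps the most significant bits; the function names are hypothetical and the actual VISNET-II quantizer may differ.

```python
import numpy as np

def quantize(frame, bits=3):
    """Uniformly quantize 8-bit pixels to 2**bits levels by keeping the top `bits` bits.
    Assumption: a plain right shift stands in for the codec's real quantizer."""
    return (frame.astype(np.uint8) >> (8 - bits)).astype(np.uint8)

def extract_bitplanes(symbols, bits=3):
    """Return `bits` bitplanes, most significant first, each flattened to 1-D."""
    flat = symbols.ravel()
    return [((flat >> b) & 1).astype(np.uint8) for b in range(bits - 1, -1, -1)]

def rebuild_symbols(bitplanes, shape):
    """Inverse of extract_bitplanes: merge bitplanes back into quantized symbols."""
    symbols = np.zeros(int(np.prod(shape)), dtype=np.uint8)
    for plane in bitplanes:
        symbols = (symbols << 1) | plane
    return symbols.reshape(shape)

frame = np.array([[0, 255], [128, 37]], dtype=np.uint8)
q = quantize(frame)               # quantized symbols in the range 0..7
planes = extract_bitplanes(q)     # three bitplanes, MSB first
restored = rebuild_symbols(planes, q.shape)
```

Each bitplane would then be channel encoded independently, which is why the decoder can reconstruct the quantized symbols one bitplane at a time.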

H.264/AVC
H.264/AVC, or MPEG-4 Part 10 Advanced Video Coding (AVC), is a video compression standard developed by the ITU-T Video Coding Experts Group (ITU-T VCEG) together with the ISO/IEC Moving Picture Experts Group (MPEG). In fact, the ITU-T H.264 standard and the ISO/IEC MPEG-4 Part 10 standard are technically identical (ISO/IEC, 2003).
The main purpose of H.264/AVC is to offer a good-quality standard that considerably reduces the output bit rate of the encoded sequences compared with previous standards, while offering substantially higher picture quality and definition. H.264/AVC promises a significant advance compared with the commercial standards currently most in use. To this end, H.264/AVC includes a large number of compression techniques and innovations compared to previous standards; it allows more highly compressed video sequences to be obtained and provides greater flexibility for implementing the encoder. Figure 2 shows the block diagram of the H.264/AVC encoder.

Fig. 2. H.264/AVC encoder diagram
ME is the most time-consuming task in the H.264/AVC encoder. It is a process which removes the temporal redundancy between images by comparing the current image with earlier or later images (reference images), looking for a pattern that indicates how movement is produced inside the sequence.
To improve the encoding efficiency, H.264/AVC allows the use of partitions resulting from dividing the macroblock (MB) in different ways. Greater flexibility for the ME and Motion Compensation (MC) processes and greater motion vector precision give greater reliability to the H.264/AVC encoding process. The ME process is thus carried out many times, once for each partition and sub-partition. This feature is known as variable block size ME.
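To make the cost of ME concrete, the following is a minimal sketch of full-search block matching using the sum of absolute differences (SAD). It is not the H.264/AVC reference implementation: the function name, the SAD criterion and the integer-pixel search are illustrative assumptions. The nested loops show why the cost grows with the square of the search range and is paid again for every partition.

```python
import numpy as np

def full_search(cur, ref, top, left, block=16, search=4):
    """Find the displacement (dy, dx) within +/-search that minimizes the SAD
    between the block of `cur` at (top, left) and a block of `ref`."""
    h, w = ref.shape
    target = cur[top:top + block, left:left + block].astype(np.int32)
    best_mv, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + block > h or x + block > w:
                continue
            cand = ref[y:y + block, x:x + block].astype(np.int32)
            sad = int(np.abs(target - cand).sum())
            if best_sad is None or sad < best_sad:
                best_mv, best_sad = (dy, dx), sad
    return best_mv, best_sad

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(32, 32))
cur = np.roll(ref, shift=(2, -1), axis=(0, 1))   # synthetic global motion
mv, sad = full_search(cur, ref, top=8, left=8, block=8, search=4)
```

With a search range of 4 pixels, up to 81 candidate positions are evaluated per block, so any reduction of the search area translates directly into fewer SAD computations.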

Parallel Wyner-Ziv
The DVC framework is based on displacing the complexity from encoders to decoders. Nevertheless, reducing the complexity of decoders as much as possible is also desirable. In traditional feedback-based WZ architectures (Aaron et al., 2002), the rate control is performed at the decoder and is controlled by means of the feedback channel; this is the main reason for the decoder complexity, since every time a parity chunk arrives at the decoder, the turbo decoding algorithm (one of the most computationally demanding tasks (Brites et al., 2008)) is called. Taking this fact into account, several approaches try to reduce the complexity of the decoder, which usually induces a rate-distortion penalty. However, due to technological advances, new parallel hardware is beginning to be introduced into practical video coding solutions. These new features of computers pose a new challenge to the research community with regard to integrating their algorithms into a parallel framework; this opens a new door in multimedia research. It is true that, for traditional standards, several approaches have been proposed since multicores appeared on the market, but this chapter focuses on parallel computing applied to the WZ framework.
Having said this, in 2010 several different parallel solutions for WZ were proposed. In particular, in (Oh et al., 2010) Oh et al. proposed a WZ parallel execution carried out by Graphic Processing Units (GPUs). In this proposal, the authors focus on designing a parallel distribution for a Slepian-Wolf decoder based on rate-adaptive Low-Density Parity-Check (LDPC) codes with Accumulator (LDPCA). LDPC codes are composed of many bit nodes with few dependencies between them, so they propose a parallel execution in three kernels (steps): i) kernels for check node calculations, ii) kernels for bit node calculations, and iii) kernels for termination condition calculations. On a GPU, they achieve decoding 4 to 5 times faster for QCIF and 15 to 20 times faster for CIF. On the other hand, in (Momcilovic et al., 2010) Momcilovic et al. proposed a WZ LDPC parallel decoding based on multicore processors. In this work, the authors parallelize several LDPC approaches. On a quad-core machine, they achieve a speedup of about 3.5. Both approaches propose low-level parallelism for a particular LDPC/LDPCA implementation.
This chapter presents a WZ to H.264/AVC transcoder which includes a higher-level parallel WZ video decoding algorithm implemented on a multicore system. The reference WZ decoding algorithm is adapted to a multicore architecture, which divides each frame into several slices and distributes the work among the available cores. In addition, the proposed algorithm is scalable because it does not depend on the hardware architecture, the number of cores or even the implementation of the internal Wyner-Ziv decoder. Therefore, the time reduction can be increased simply by increasing the number of cores as technology advances. Furthermore, the proposed method can also be applied to WZ architectures with or without a feedback channel (Sheng et al., 2010).

WZ to H.26x transcoding
Nowadays, mobile-to-mobile video communications are becoming more and more common. Transcoding from a format with a low-cost encoder to a format with a low-cost decoder provides a practical solution for these types of communications. Although H.264/AVC has been included in multiple transcoding architectures from other coding formats (such as MPEG-2 to H.264/AVC (Fernandez-Escribano et al., 2007, 2008) or even homogeneous H.264/AVC (De Cock et al., 2010)), proposals for WZ to H.26x transcoding to support mobile communications are rather recent and there are only a few approaches so far.
In 2008, the first WZ transcoder architecture was introduced by Peixoto et al. in (Peixoto et al., 2010). In this work, they presented a WZ to H.263 transcoder for mobile video communications. However, H.263 offers lower performance than codecs based on H.264/AVC, and they did not fully exploit the correlation between the WZ MVs and the traditional ME: the MVs were only used to determine the starting centre of the ME process.
In our previous work, we proposed the first transcoding architecture from WZ to H.264/AVC (Martínez et al., 2009). This work introduced an improvement to accelerate the H.264/AVC ME stage using the Motion Vectors (MVs) gathered in the WZ decoding stage. Nevertheless, this transcoder is not flexible, since it only applies the ME improvement when transcoding from WZ frames to P frames. In addition, it only allows transcoding from WZ GOPs of length 2 to IPIP H.264/AVC GOP patterns, so it is neither flexible nor practical, given the high bit rate that such patterns generate. Furthermore, this work used a less realistic WZ implementation. For this reason, the approach presented in this chapter improves this part by introducing a better and more realistic WZ implementation based on the VISNET-II codec (Ascenso et al., 2010), which implements lossy key frame coding and on-line correlation noise modeling, and uses a more realistic procedure at the decoder for the stopping criterion.

Introduction
The main task of a transcoder is to convert one source coding format into another. In the case of mobile video communications, the transcoding process should be done as fast as possible. In addition, a flexible transcoder should take into account the conversion between the input and the output patterns. In order to provide a flexible and fast transcoding architecture, the architecture displayed in Figure 3 is proposed. This architecture is composed of a Wyner-Ziv decoder and an H.264/AVC encoder with several modifications or extra modules. In particular, the WZ decoder is redesigned to parallelize the decoding process, and the black modules in Figure 3 have been included or modified to obtain a faster H.264/AVC encoding. Details are given in the following subsections.

Parallelization of WZ decoding
WZ video coding accumulates the majority of the complexity on the decoder side. A study of each module inside the decoder scheme (Figure 1) shows that most of this complexity is concentrated in the Channel Decoder module (Brites et al., 2008). This module receives successive chunks of parity bits. Then, the quantized symbol stream associated with each bitplane is obtained in an iterative process, which is based on the residual statistics calculated by the CNM. This procedure stops when a condition based on probabilities is satisfied. Obviously, the complexity of the decoder increases when more bitplanes (in the pixel domain) or coefficient bands (in the transform domain) are decoded.
At this point, as the first stage of the transcoding process, a WZ decoding architecture is proposed which distributes the decoding complexity across several processing units. The proposed architecture is shown in Figure 3. The approach is a flexible and scalable architecture which distributes the parallel decoding between two parallelism levels: GOPs and frames. First, the K frames of the input bitstream are stored in a K-frame buffer. Then, at the first parallelism level, the WZ frames between two K frames delimit a GOP structure, and therefore each GOP decoding procedure is carried out independently by a different core. Additionally, for each WZ frame inside a GOP, an SI is calculated and then split into several parts. Each part of the frame is then assigned to a core, which executes the iterative turbo decoding procedure in order to decode the corresponding part of the reconstructed WZ frame. Therefore, each spatial division of the frame is decoded independently, using the feedback channel to request parity bits from the encoder. When all the parts of a given frame have been decoded, they are joined in spatial order and the frame is reconstructed. Finally, a sequence joiner receives the decoded frames and the key frames and reorganizes the sequence in its temporal order.
Concerning the scheduler, a dynamic scheduler is implemented. That means that whenever a core is idle and there is a pending task, the task is assigned to that core. The number of tasks is always equal to, or greater than, the number of cores, so there are always tasks in the scheduler queue until the end of the decoding stage is reached. However, partial decoding of each frame requires a synchronization barrier. To illustrate this, Figure 4 shows the decoding time line for a sequence composed of 5 GOPs (with length = 2) on a multicore with four cores. As can be seen, decoder initialization takes some time at the beginning of the decoding process. After that, each core receives a task (defined by a thread) from the scheduler. When a thread finishes decoding a part of a frame, it can continue decoding other parts of the same frame. If there are no more parts of this frame to decode, the core has to wait until the remaining parts of the same frame have been decoded. This is a consequence of the synchronization barrier implicit in the reconstruction of each frame. In Figure 4, a waiting thread is labeled as being in an idle state. In addition, while the sequence decoding process is finishing, there are not enough tasks for the available cores, so several cores change their status to idle until the decoding process finishes. Nevertheless, real sequences are composed of many GOPs, and decoder initialization and ending times are much shorter than the whole decoding time.
The size of the K-frame buffer S is defined by Equation 1, where i is the number of GOPs which can be executed in parallel. For example, in the execution in Figure 4, a 4-core processor can execute two GOPs at the same time, so three stored K frames provide enough tasks for four cores. In addition, it is not necessary to fill the buffer completely; it can be filled progressively during the decoding process. For different GOP lengths, the buffer size would be the same, since every WZ GOP length only needs two K frames to start decoding the first WZ frame.

$$S = i + 1 \qquad (1)$$
Finally, considering that the parity bits can be requested from the encoder without following a sequential order, the Parity Position (PP) is calculated, which determines the position of the first parity bit to be sent. PP is calculated by Equation 2, where I is the Intra period, P is the position of the current GOP, Q is the quantization parameter, W is the width of the image and H is its height.
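The frame-level parallelism and the implicit synchronization barrier described in this subsection can be sketched as follows. This is a structural illustration only: `decode_part` is a hypothetical stand-in for the turbo decoding of one slice, the GOP-level parallelism and the feedback channel are omitted, and Python threads are used here merely to show the task structure (the implementation described in the chapter relies on OpenMP).

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def k_frame_buffer_size(parallel_gops):
    """Equation 1: S = i + 1, with i the number of GOPs decoded in parallel."""
    return parallel_gops + 1

def decode_part(si_part):
    """Hypothetical placeholder for the iterative turbo decoding of one slice.
    Here it only copies the side information it receives."""
    return si_part.copy()

def decode_frame_parallel(side_info, parts, pool):
    """Split the SI into horizontal slices, decode them as independent tasks,
    and join the results in spatial order."""
    slices = np.array_split(side_info, parts, axis=0)
    # The pool hands a pending task to whichever worker becomes idle
    # (dynamic scheduling). Collecting every result before stacking plays
    # the role of the per-frame synchronization barrier.
    decoded = list(pool.map(decode_part, slices))
    return np.vstack(decoded)

side_info = np.arange(144 * 176).reshape(144, 176)   # QCIF-sized dummy SI
with ThreadPoolExecutor(max_workers=4) as pool:
    frame = decode_frame_parallel(side_info, parts=9, pool=pool)
```

Because `pool.map` returns results in submission order, the slices are rejoined in their spatial order regardless of which worker finishes first.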

H.264/AVC transcoding approach
In order to provide fast and flexible transcoding at the H.264/AVC encoder side, we have to study two issues: firstly, how MVs generated during the SI process could help to reduce the time used in ME; secondly, taking into account that DVC and H.264/AVC can build different GOPs, how to map MVs between different GOP combinations in order to provide flexibility.

Reducing motion estimation complexity
Within the WZ decoding process, an important task is the SI generation stage, which is the first step in the process for generating the WZ frames from K frames. VISNET-II performs Motion Compensated Temporal Interpolation (MCTI) to estimate the SI. The first step of this method, shown in Figure 5, consists of matching each forward frame MB with a backward frame MB inside the search area. The process checks all the possibilities inside the search area and chooses the MV that generates the lowest residual. Half of this MV represents the displacement of the interpolated MB (more details about the SI generation process can be found in (Ascenso et al., 2005)). MVs generated in the WZ decoding stage therefore contain approximate information about the amount of movement in the frame. Following this idea, the present approach proposes to reuse the MVs to accelerate the H.264/AVC encoding stage by reducing the search area of the ME stage. Moreover, the reduction is adjusted for every input DVC GOP and every H.264/AVC GOP in an efficient and dynamic way. As shown in Figure 6, the search area for each MB is defined by a circle with a radius dependent on the incoming SI MV (Rmv). This search area can vary between a minimum (defined by Rmin) and a maximum (limited by the H.264 search area). In particular, the radius will vary depending on the type of frame and the distance to the reference frame, as will be explained in section 4.2.2. Furthermore, a minimum area is considered because MVs are calculated from 16x16 MBs in the SI process, whereas H.264/AVC can work with partitions smaller than 16x16. Besides, SI is an approximation of the frame, so some changes could occur when the frame is completely reconstructed. For these reasons, this minimum was set at 4 pixels.
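The dynamic search area described above can be sketched as a small helper. Only the minimum radius of 4 pixels comes from the text; the default maximum of 16 pixels, the rounding up of the MV length and the function names are assumptions made for illustration.

```python
import math

def search_radius(mv, r_min=4, r_max=16):
    """Radius of the circular search area for one MB.
    mv: (dx, dy) motion vector inherited from the SI generation stage.
    r_min: minimum radius (4 pixels, as stated in the text).
    r_max: assumed limit standing in for the configured H.264 search range."""
    r_mv = math.hypot(mv[0], mv[1])          # length of the incoming SI MV
    return min(max(math.ceil(r_mv), r_min), r_max)

def in_search_area(candidate, radius):
    """True if the candidate displacement (dx, dy) lies inside the circle."""
    return candidate[0] ** 2 + candidate[1] ** 2 <= radius ** 2

radius_static = search_radius((0, 0))        # falls back to the minimum radius
radius_moving = search_radius((3, 4))        # follows the MV length
radius_fast = search_radius((30, 40))        # capped at the assumed maximum
```

An ME loop would then evaluate only the candidate displacements for which `in_search_area` holds, instead of the full square search window.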

Mapping GOPs from DVC to H.264
One desired feature of every transcoder is flexibility. To achieve it, an important process known as GOP mapping must be performed with care. In the second part of the transcoder, a DVC to H.264/AVC conversion is proposed which allows every mapping combination and which uses techniques to reduce the time spent in the transcoding process. To extract MVs, the distance used to calculate the SI is considered first. For example, Figure 7 shows the transcoding process for a DVC GOP of length 4 to an H.264/AVC IPPP pattern (baseline profile). In step 1, DVC starts to decode the frame labeled WZ 2, and the MVs generated in its SI generation are discarded because they are not closely correlated with the actual movement (low accuracy). When the WZ 2 frame has been reconstructed (through the entire WZ decoding algorithm, giving WZ' 2 ), in step 2 the WZ decoding algorithm starts to decode frames WZ 1 and WZ 3 by using the reconstructed frame WZ' 2 . At this point, the MVs V 0-2 and V 2-4 generated in this second iteration of the DVC decoding algorithm are stored. These MVs will be used to reduce the H.264/AVC ME process. Notice that for larger GOP sizes the procedure is the same. In other words, MVs are stored and reused when the distance between the SI and each of its two reference frames is 1. Finally, V 0-2 and V 2-4 are halved because P frames have their reference frame at a distance of one, whereas the MVs were calculated for a distance of two during the SI process.
For more complex patterns, which include mixed P and B frames (main profile), this method can be extended in a similar way with some changes. Figure 8 shows the transcoding from a DVC GOP of length 4 to an H.264 IBBP pattern. MVs are stored by following the same procedure. However, in this case the way they are applied in H.264/AVC changes. For P frames, MVs are multiplied by a factor of 1.5 because MVs were calculated for a distance of 2 and P frames have their references at a distance of 3. For B frames, the factor depends on the position of the frame, and it differs for backward and forward searches.
As can be observed, this procedure can be applied to both K and WZ frames. Therefore, following this method, the proposed transcoder can be used for transcoding from every DVC GOP to every H.264/AVC GOP.
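The two scaling factors mentioned above (0.5 for the IPPP pattern and 1.5 for P frames in the IBBP pattern) are both the ratio between the H.264/AVC reference distance and the distance used in the SI process. A minimal sketch under that reading follows; the function name is hypothetical, and the B-frame case, whose factors depend on frame position and search direction, is deliberately left out.

```python
def scale_mv(mv, si_distance, ref_distance):
    """Rescale an SI motion vector (dx, dy), computed across `si_distance`
    frames, to an H.264/AVC reference located `ref_distance` frames away."""
    factor = ref_distance / si_distance
    return (mv[0] * factor, mv[1] * factor)

v_0_2 = (8, -4)                                    # made-up MV stored during SI generation
mv_ippp = scale_mv(v_0_2, si_distance=2, ref_distance=1)     # halved for IPPP
mv_ibbp_p = scale_mv(v_0_2, si_distance=2, ref_distance=3)   # factor 1.5 for P in IBBP
```

The rescaled vector would then feed the search radius computation for the corresponding MB.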

Experimental results
The proposed transcoder has been evaluated by using four representative QCIF sequences with different motion levels. These sequences were coded at 15 fps and 30 fps using 150 and 300 frames respectively. In the DVC to H.264/AVC transcoder, the DVC stage was generated by the VISNET-II codec using the pixel domain (PD) with BP = 3 as the quantization setting, a trade-off between RD performance and complexity constraints, although any number of BPs could be used. In addition, sequences were encoded in DVC with GOPs of length 2, 4 and 8 to evaluate different patterns. The parallel decoder was implemented by using an Intel C++ compiler (version 11.1), which combines a high-performance compiler with the Intel Performance Libraries to provide support for creating multi-threaded applications. In addition, it provides support for OpenMP 3.0 (OpenMP, 2011). In order to test the performance of parallel decoding, it was executed on an Intel i7-940 multicore processor (Intel, 2011), although the proposal is not dependent on particular hardware. For the experiments, each frame was split into 9 parts, so each task decodes a ninth of the frame. This value is a good choice for QCIF frames (176x144), 16x16 macroblocks (this is the size of the block in the SI generation, so a QCIF frame has 99 16x16 blocks) and a processor with 4 cores (8 simultaneous threads with hyper-threading).
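The arithmetic behind the choice of 9 parts can be checked with a short sketch. It assumes that parts are formed from whole 16x16 macroblocks and are as equal in size as possible; the helper names are hypothetical.

```python
def macroblock_grid(width, height, mb=16):
    """Number of macroblock columns and rows in a frame."""
    return width // mb, height // mb

def split_evenly(total, parts):
    """Sizes of `parts` groups that share `total` items as equally as possible."""
    base, extra = divmod(total, parts)
    return [base + 1 if k < extra else base for k in range(parts)]

cols, rows = macroblock_grid(176, 144)      # QCIF frame
total_mbs = cols * rows                     # 16x16 blocks per frame
part_sizes = split_evenly(total_mbs, 9)     # macroblocks handled by each part
```

For QCIF this gives 11 columns by 9 rows, so 9 parts divide the 99 macroblocks with no remainder, whereas a split into 8 parts (one per hardware thread) would leave parts of unequal size.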
During the decoding process, the MVs generated by the SI generation stage were sent to the H.264/AVC encoder; hence this does not involve any increase in complexity. In the second stage, the transcoder performs a mapping from every DVC GOP to every H.264/AVC GOP using QP = 28, 32, 36 and 40. In our experiments we have chosen different H.264/AVC patterns in order to analyze the behavior for the baseline profile (IPPP GOP) and the main profile (IBBP pattern). These patterns were transcoded by the reference and the proposed transcoder. The H.264/AVC reference software used in the simulations was the JM reference software (version 17.1). As mentioned in the introduction, the framework described is focused on communications between mobile devices; therefore, a low-complexity configuration must be employed. For this reason, we have used the default configuration for the H.264/AVC main and baseline profiles, only turning off RD Optimization. The reference transcoder is composed of the whole DVC decoder followed by the whole H.264/AVC encoder. In order to analyze the performance of the proposed transcoder in detail, results are reported for each of the two stages separately, and global results are also presented.
The performance of the proposed DVC parallel decoding is shown in Table 1 (for 15 and 30 fps sequences). PSNR and bitrate (BR) display the quality and bitrate measured by the reference WZ decoding. To calculate the PSNR difference, the PSNR of each sequence was estimated before transcoding starts and after transcoding finishes. Then the PSNR of the proposed transcoding was subtracted from the reference one for each H.264/AVC RD point, as defined by Equation 3. However, Table 1 does not include results for ΔPSNR because the quality obtained by DVC parallel decoding is the same as that of the reference decoding, since both iterate until a given threshold is reached (Brites et al., 2008).
Equation 4 was applied in order to calculate the bitrate increment (ΔBR) between the reference and proposed DVC decoders as a percentage. A positive increment thus means that a higher bitrate is generated by the proposed transcoder. As the results of Table 1 show, when DVC decodes smaller and less complex parts, the turbo decoder (as part of the DVC decoder) sometimes converges faster, with fewer iterations, which implies that fewer parity bits are requested and thus a bitrate reduction. However, generally speaking, the turbo codec yields better performance for longer inputs. For this reason, the bitrate increment is positive in some cases and negative in others.
Comparing different GOP lengths, in short GOPs most of the bitrate is generated by the K frames.When the GOP length increases, the number of K frames is reduced and then WZ frames contribute to reducing the global bitrate in low motion sequences (like Hall) or increasing it in high motion sequences (Foreman or Soccer).Generally, decoding smaller pieces of frame (in parallel) works better for high motion sequences, where the bitrate is similar or even lower in some cases.
Concerning the time reduction (TR), it was estimated as a percentage by using Equation 5. In this case, a negative time reduction means decoding time saved by the proposed DVC decoding.
Results for the second stage of the transcoder are shown in Tables 2 and 3. In this case, both H.264/AVC encoders (reference and proposed) start from the same DVC output sequence (as DVC parallel decoding obtains the same quality as the reference DVC decoding), which is quantized with four QP values. For these four QP values, ΔPSNR and ΔBR are calculated as specified in Bjøntegaard and Sullivan's common test rule (Sullivan et al., 2001). TR is given by Equation 5. In Table 2, DVC decoded sequences are mapped to an IPPP pattern. In this case the RD loss is negligible and TR is around 40%. For 30 fps sequences, the proposed method is more accurate and the RD loss is even lower. In addition, Figure 9 plots the RD points for each of the four QP values simulated. As can be observed, the RD points of the reference and proposed transcoders are very close. For the IBBP pattern (Table 3), the conclusions are similar. Comparing both patterns, the IBBP pattern generates a slightly higher RD loss, but H.264/AVC encoding is performed faster (up to 48%). This is because B frames have two reference frames, and the dynamic ME search area reduction is carried out in both of them. Figure 10 plots the RD points for each of the four QP values when an IBBP pattern is used. As can be observed, the RD penalty is negligible.
Finally, to analyze the global transcoding improvement, Tables 4, 5, 6 and 7 summarize global transcoding performance. In this case, Bjøntegaard and Sullivan's common test rule (Sullivan et al., 2001) was not used because it is a recommendation only for H.264/AVC. Instead, to estimate the PSNR obtained by the transcoder, the original sequences were compared with the output sequences after transcoding. The PSNR measured at the four QP points is displayed as an average (mean ΔPSNR). To estimate the BR generated by the reference and the proposed transcoder, the BR generated by both stages (DVC decoding and H.264/AVC encoding) was added. Then Equation 4 was applied and the result was averaged over the four H.264/AVC QPs (mean ΔBR). As the DVC decoding contributes most of the bitrate, results are very similar to those in Table 1. In order to evaluate the TR, total transcoding time was measured for the reference and proposed transcoder. Then Equation 5 was applied and a mean was calculated over the four H.264/AVC QPs (mean TR). As DVC decoding takes up most of the transcoding time, improvements in this stage have a bigger influence on the overall transcoding time, and so the TR obtained is similar to that in Table 1, reducing the complexity of the transcoding process by up to 73% (on average).
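Equations 3 to 5 are not reproduced in this text, so the following sketch is only one reading of the verbal definitions given above: ΔPSNR as the reference PSNR minus the proposed PSNR, and ΔBR and TR as percentage changes relative to the reference, so that a negative value means a saving. The function names and every number below are made up for illustration and are not results from the tables.

```python
def delta_psnr(psnr_ref, psnr_prop):
    """Assumed form of Equation 3: PSNR of the proposed transcoder subtracted
    from the reference PSNR, in dB."""
    return psnr_ref - psnr_prop

def percent_change(ref, prop):
    """Assumed form of Equations 4 and 5: change of the proposed value relative
    to the reference, in percent. Negative means the proposed value is lower."""
    return 100.0 * (prop - ref) / ref

def mean(values):
    """Average of a metric over the H.264/AVC QP points."""
    return sum(values) / len(values)

# Made-up measurements for four QP points (not taken from the chapter).
times_ref = [100.0, 96.0, 92.0, 88.0]    # reference transcoding time per QP
times_prop = [30.0, 28.8, 27.6, 26.4]    # proposed transcoding time per QP
tr_per_qp = [percent_change(r, p) for r, p in zip(times_ref, times_prop)]
mean_tr = mean(tr_per_qp)
```

Under this convention a time reduction is reported as a negative percentage, which matches the sign convention stated for TR in the text.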

Conclusions
This chapter analyzes the transcoding framework for video communications between mobile devices. In addition, a WZ to H.264/AVC transcoder designed to support mobile-to-mobile video communications is proposed. Since the transcoder device accumulates the highest complexity from both video coders, reducing the time spent in this process is an important goal. With this aim, two approaches are proposed in this chapter to speed up WZ decoding and H.264/AVC encoding. The first stage is improved by using parallelization techniques, while the second stage is accelerated by reusing information generated during the first stage. As a result, with both approaches a time reduction of up to 73% is achieved for the complete transcoding process with negligible RD losses. In addition, the presented transcoder performs a mapping for different GOP patterns and lengths between the two paradigms by using an adaptive algorithm, which takes into account the MVs gathered in the side information generation process.

As shown in Table 1, DVC decoding time is reduced by up to 70% on average. TR is similar for different GOP lengths, but it works better for more complex sequences.

Table 1. Performance of the proposed DVC parallel decoder for 15 and 30 fps sequences (first stage of the proposed transcoder).

Table 3. Performance of the proposed transcoder mapping method for IBBP H.264 pattern with 15 fps and 30 fps sequences.

Table 4. Performance of the proposed transcoder for 15 fps sequences and IPPP pattern.

Table 5. Performance of the proposed transcoder for 30 fps sequences and IPPP pattern.

Table 6. Performance of the proposed transcoder for 15 fps sequences and IBBP pattern.

Table 7. Performance of the proposed transcoder for 30 fps sequences and IBBP pattern.