Side Information Generation for Distributed Video Coding Using Spatiotemporal Joint Bilateral Upsampling

Distributed video coding presents a viable solution for power-constrained multimedia communication. However, its relatively low coding efficiency compared to the conventional video coding schemes remains a challenging issue. The rate-distortion performance of distributed video coding is highly dependent on the quality of side information generated at the decoder and various techniques have been proposed to improve the side information quality in block-based and frame-based distributed video coding architectures. In this paper, a robust spatiotemporal joint bilateral upsampling based side information generation method is proposed. The proposed side information generation method is based on a block-based low-complexity distributed video coding architecture with adaptive block coding mode classification. A partially reconstructed Wyner-Ziv (WZ) frame with skip and key blocks is downsampled and spatiotemporal error concealment and joint bilateral upsampling are used to generate the side information. Simulation results show that the proposed method improves the quality of side information significantly while keeping low computational complexity.


Introduction
Video coding technology has played a key role in the growth of today's multimedia society. The widespread deployment of digital video applications and services is largely built on the predictive video coding paradigm, in which the encoder exploits the redundancy and irrelevancy in the video. This type of video coding is well suited to broadcasting and other one-to-many transmission systems where video is encoded once and decoded many times. However, resource-constrained environments call for a low-complexity encoder, at the expense of a high-complexity decoder, while still maintaining high coding efficiency.
Distributed video coding (DVC) has emerged as a new video compression paradigm for applications with resource-constrained devices because it enables low-complexity encoding and is naturally robust against transmission errors. Over the past decade, several practical implementations of DVC have been proposed, including the Stanford codec [1], the PRISM codec [2], and the DISCOVER codec [3]. However, current DVC architectures still have several technical limitations that prevent their widespread use in real-world applications. In particular, there is still a significant gap in compression efficiency between current DVC solutions and conventional predictive video coding techniques.
In this paper, we propose a novel side information (SI) generation scheme based on spatiotemporal joint bilateral upsampling (STJBU), which is simple and applicable to any block-based DVC architecture. The proposed method consists of three steps: (1) downsampling of a partially reconstructed WZ frame, (2) SI generation for the WZ blocks in the downsampled WZ frame, and (3) upsampling of the WZ frame using the proposed STJBU algorithm.
International Journal of Distributed Sensor Networks
The rest of the paper is organized as follows. Section 2 reviews related work on SI generation in DVC. Section 3 briefly introduces the low-complexity DVC (LC-DVC) architecture [19] on which the proposed SI generation technique is based. Section 4 explains the proposed STJBU-based SI generation method. Simulation results are presented in Section 5, and Section 6 concludes the paper.

Related Work
In the past few years, various approaches have been proposed to improve the performance of DVC. The main issues restricting the use of current DVC architectures in practical applications are their low coding efficiency, high decoding latency, and reliance on a feedback channel. In particular, since the coding efficiency of DVC is strongly affected by the SI quality, extensive research has been devoted to improving the quality of SI.
A multiple motion hypotheses pixel-based temporal interpolation method is proposed in [4], where global and local motion estimation is incorporated. This work has been extended to an adaptive pixel-based temporal interpolation scheme [5] which can adaptively switch between spatial interpolation and forward/backward temporal extrapolation for SI generation. Similarly, a mode decision scheme is presented in [6] to determine the interpolation mode for each block by combining forward and backward motion vectors. Recently, optical flow has been exploited for SI generation to compensate for the weaknesses of block-based methods. An optical flow based SI generation algorithm is proposed in [7] which improves the SI quality by obtaining more accurate motion vectors. A similar method proposed in [8] uses optical flow to improve the SI quality and block clustering to increase local adaptivity in the noise modeling. In general, the complex motion estimation process used in these methods incurs high computational complexity and long decoding time.
In the SI generation method proposed in [9], seed blocks are selected first and these blocks are used for motion estimation of the other blocks. In [10], extra information for the WZ blocks is transmitted to help the block matching process at the decoder. Another method called frame-hash uses a highly compressed WZ frame with zero motion vectors to improve the quality of SI [11]. However, the performance of hash-based DVC schemes is highly dependent on the accuracy of the rate allocation mechanism. An alternative is to use multiple resolutions in encoding WZ frames. Recently, several SI generation methods based on a mixed-resolution (MR) DVC architecture have been proposed [12][13][14][15][16][17][18]. In the MR-DVC architecture, the SI quality is improved by exploiting the spatial relationship between the original frames and the scaled ones.
Spatial low-pass filtering is used in image processing to replace a pixel by a uniform or weighted average of its neighboring pixels. An edge-preserving bilateral filter was originally proposed in [20] to alleviate the drawback of spatial low-pass filtering when it is performed over discontinuous regions. It takes into account both the geometric closeness of pixels and their photometric similarity. This noniterative filter smooths images while preserving edges by means of a nonlinear combination of nearby pixel values. The joint bilateral filter proposed in [21] extends the bilateral filter to two correlated images: it filters one image with weights generated from the other image. An alternative joint edge-preserving filter, the guided filter, has been proposed in [22]; it is derived from a local linear model and can perform filtering in constant time. In [23], the joint bilateral filter is further extended to image pairs with different resolutions, which is known as joint bilateral upsampling. In [24], a multiresolution bilateral filtering is proposed where the bilateral filter is combined with wavelet thresholding to provide an image denoising framework. Joint bilateral filtering has been successfully applied in a variety of image processing and computer vision applications such as photo enhancement and stereo matching [25].

Architecture of Low-Complexity DVC
A simple, unidirectional LC-DVC architecture is proposed in [19]. In the LC-DVC encoder, an incoming frame is adaptively classified as a key frame or a WZ frame. Key frames are encoded using the H.264/AVC encoder in intra mode. WZ frames are divided into nonoverlapping 4 × 4 blocks, which are further classified into skip, key, and WZ blocks. The classification map resulting from the block classification process is compressed using arithmetic coding and sent to the decoder. The skip blocks are not transmitted and are reconstructed at the decoder with the help of the previous frame. The key blocks are encoded using H.264/AVC in intra mode. The WZ blocks are transformed and quantized, and the bit planes are extracted and encoded using BCH codes.
At the decoder, the key frames are decoded using the H.264/AVC decoder. For a WZ frame, the key blocks are decoded first, and then the skip blocks are copied from the colocated blocks in the previous frame according to the classification map. As shown in Figure 1, this yields a partially reconstructed WZ frame which contains the key and skip blocks. Then, the SI for the WZ blocks is generated using the proposed method, which can be applied to any block-based DVC architecture.
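The partial reconstruction step described above can be illustrated with a minimal Python sketch. The label values `SKIP`, `KEY`, and `WZ`, the function name, and the array layout are assumptions made for illustration only; they are not taken from the LC-DVC codec in [19]. The key blocks are assumed to be already decoded and stored at their positions in a frame-sized array.

```python
import numpy as np

# Hypothetical block labels for the classification map (not from the paper).
SKIP, KEY, WZ = 0, 1, 2
BLOCK = 4  # 4 x 4 blocks, as stated in the text


def partially_reconstruct(prev_frame, decoded_key_blocks, class_map):
    """Build a partially reconstructed WZ frame from skip and key blocks.

    prev_frame         : 2-D array, previously reconstructed frame
    decoded_key_blocks : 2-D array with decoded pixels at key-block positions
    class_map          : 2-D array of block labels, shape (H // 4, W // 4)
    Returns the partial frame and a boolean mask of the missing WZ pixels.
    """
    # Expand the block-level map to pixel resolution.
    ones = np.ones((BLOCK, BLOCK), dtype=class_map.dtype)
    pixel_map = np.kron(class_map, ones)

    frame = np.zeros_like(prev_frame)
    # Skip blocks are copied from the colocated blocks of the previous frame.
    frame[pixel_map == SKIP] = prev_frame[pixel_map == SKIP]
    # Key blocks take their decoded (intra-coded) pixel values.
    frame[pixel_map == KEY] = decoded_key_blocks[pixel_map == KEY]
    # WZ blocks remain empty until side information is generated for them.
    return frame, pixel_map == WZ
```

The returned mask marks the pixels for which SI still has to be generated.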

Proposed SI Generation Algorithm
The procedure of the proposed SI generation method is shown in Figure 2 and can be divided into 3 steps: (1) downsampling of a partially reconstructed WZ frame, (2) SI generation in the downsampled partially reconstructed WZ frame, and (3) upsampling of the error-concealed WZ frame using the proposed STJBU algorithm.

Downsampling of the Partially Reconstructed WZ Frame.
In order to reduce the computational complexity of spatiotemporal SI generation, the partially reconstructed WZ frame is first downsampled. Downsampling has been used in various image and video compression applications to improve the compression efficiency while reducing the computational complexity [26][27][28][29][30]. The simplest downsampling method is to retain only every $k$th sample, which creates a lower resolution signal downsampled by a factor of $k$. However, this simple method causes aliasing in the downsampled signal. In this paper, four different downsampling methods are used: nearest neighbor, bilinear, bicubic, and Lanczos downsampling.

Nearest Neighbor Downsampling.
The intensity of a pixel in the downsampled image is the intensity of the nearest pixel in the original image, as shown in (1).
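As a minimal sketch of nearest neighbor downsampling, the following Python function assumes an integer downsampling factor and a top-left aligned sampling grid, in which case the nearest pixel is simply every `factor`-th sample in each direction. The function name and the alignment convention are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def downsample_nearest(image, factor):
    """Nearest neighbor downsampling by an integer factor.

    Assumes a top-left aligned grid, so the output pixel (i, j) takes the
    value of the input pixel (i * factor, j * factor). No low-pass filter
    is applied, which is why this method is prone to aliasing.
    """
    return image[::factor, ::factor]
```

For a QCIF frame of 176 × 144 pixels and a factor of 2, this returns an 88 × 72 frame.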
SI Generation at a Lower Resolution.
After the partially reconstructed WZ frame is downsampled, SI is generated for the WZ blocks by exploiting the spatial and temporal correlation. In a low-delay DVC system, the decoder cannot wait for the future frame to arrive before starting the SI generation process, so it must use only the previously reconstructed frame for temporal information. Since the proposed DVC method is block-based and uses a unique block classification scheme, the decoder is guaranteed that the four adjacent neighbors of every WZ block are key or skip blocks. In this paper, we consider two different methods for SI generation at a lower resolution: bilinear error concealment and inpainting.

Bilinear Interpolation.
SI generation at the decoder can be regarded as an error concealment (EC) process in which the WZ blocks have to be estimated using EC techniques. Among various spatial error concealment techniques [31][32][33][34], bilinear error concealment [31] is chosen to estimate the WZ blocks because it is simple yet highly efficient. Bilinear interpolation is a spatial error concealment method which uses the spatially adjacent blocks to recreate the missing pixels by weighted averaging. Each pixel of a WZ block is addressed by its vertical and horizontal coordinates, both of which range from 0 to one less than the WZ block size. The reference pixels are the pixels directly above and below the WZ block and the pixels directly to its left and right. The estimated pixel is calculated by (5) as a weighted average of these reference pixels, where the weights defined in (6) are inversely proportional to the distance of the neighboring pixels from the estimated pixel.
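The weighted averaging described above can be sketched in Python as follows. This is an illustrative sketch rather than the exact formulation of (5) and (6): the weights are assumed to be the reciprocals of the distances to the four boundary pixels, and the function name, argument names, and the requirement that the block does not touch the frame border are all assumptions.

```python
import numpy as np


def bilinear_conceal(frame, top, left, n):
    """Estimate a missing n x n block from its four boundary pixel lines.

    (top, left) is the position of the block's first pixel. The block is
    assumed not to touch the frame border, so all four neighbors exist.
    """
    out = frame.astype(np.float64).copy()
    t = out[top - 1, left:left + n]   # pixels directly above the block
    b = out[top + n, left:left + n]   # pixels directly below the block
    l = out[top:top + n, left - 1]    # pixels directly to the left
    r = out[top:top + n, left + n]    # pixels directly to the right

    i = np.arange(n).reshape(-1, 1)   # vertical coordinate inside the block
    j = np.arange(n).reshape(1, -1)   # horizontal coordinate inside the block

    # Assumed weights: reciprocal of the distance to each boundary pixel.
    w_t, w_b = 1.0 / (i + 1), 1.0 / (n - i)
    w_l, w_r = 1.0 / (j + 1), 1.0 / (n - j)

    num = w_t * t[None, :] + w_b * b[None, :] + w_l * l[:, None] + w_r * r[:, None]
    den = w_t + w_b + w_l + w_r
    out[top:top + n, left:left + n] = num / den
    return out
```

Because the weights are normalized by their sum, a block surrounded by a constant intensity is reconstructed with exactly that intensity.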

Region-Filling Inpainting.
EC in the lower resolution frame can also be regarded as a hole-filling problem. The region-filling inpainting technique proposed in [35][36][37] fills holes within the image by propagating linear structures (also called isophotes) into the target region by diffusion. This iterative process includes three steps, namely, patch priority computation, propagation of texture and structure information, and confidence value updating. The initial setting includes the specification of the target region (Ω), the definition of the source region (Φ) by subtracting the target region from the entire image, and the specification of the template window size (Ψ), which is usually set to be slightly larger than the largest distinguishable texture element in the region Φ. Once the parameters are determined, the iterative inpainting process runs automatically until all pixels have been filled. In general, region-filling inpainting incurs high computational complexity, but the processing time is reduced in the proposed method since inpainting is performed on the lower resolution WZ frame.
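For reference, the patch priority in the widely used exemplar-based formulation of region-filling inpainting is the product of a confidence term and a data term. It is assumed here that the cited works follow this formulation; the symbols $C$, $D$, $\nabla I_p^{\perp}$, $n_p$, and $\alpha$ do not appear in the text above and are introduced only for this sketch. For a patch $\Psi_p$ centered on a pixel $p$ of the fill front:

```latex
\begin{align*}
P(p) &= C(p)\, D(p), \\
C(p) &= \frac{\sum_{q \in \Psi_p \cap \Phi} C(q)}{\lvert \Psi_p \rvert}, \qquad
D(p) = \frac{\lvert \nabla I_p^{\perp} \cdot n_p \rvert}{\alpha}.
\end{align*}
```

Here $\lvert \Psi_p \rvert$ is the area of the patch, $\nabla I_p^{\perp}$ is the isophote direction at $p$, $n_p$ is the unit normal to the fill front, and $\alpha$ is a normalization factor (255 for 8-bit gray-level images). In this formulation the confidence is initialized to 1 in Φ and 0 in Ω, and after a patch is filled the newly filled pixels inherit the confidence of the patch center.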

Spatiotemporal Joint Bilateral Upsampling.
After applying EC, the error-concealed frame is upsampled using the proposed STJBU method. STJBU is an extension of joint bilateral upsampling (JBU) [24]. JBU is itself an extension of bilateral filtering [23]; it uses both a domain filter and a range filter to adaptively combine pixels based on their geometric closeness and their photometric similarity. The difference between JBU and bilateral filtering is that the range filter in JBU is applied to a second guidance image.
In the proposed method, JBU cannot be applied directly because the target reference pixels used for the range filter are not available. To solve this problem, the temporal correlation between consecutive frames is exploited: the previous frame serves as the second guidance image for the range filter. The colocated block in the previous frame is found by boundary matching and is used as the reference block for the range filter. The scheme of STJBU is shown in Figure 3. Given a previously decoded frame at high resolution $\tilde{I}$ and a low resolution input $S_{\downarrow}$, which is the error-concealed downsampled WZ frame, a spatial filter is applied to the low resolution input $S_{\downarrow}$, while the range filter is jointly applied to the previous high resolution frame $\tilde{I}$. The upsampled WZ frame $S$ is obtained using the following:

$$S(p) = \frac{1}{k_p} \sum_{q_{\downarrow}} S_{\downarrow}(q_{\downarrow})\, f(\lVert p_{\downarrow} - q_{\downarrow} \rVert)\, g(\lVert \tilde{I}(p) - \tilde{I}(q) \rVert),$$

where $p$ and $q$ denote integer positions in the high resolution frame grid, $p_{\downarrow}$ and $q_{\downarrow}$ denote the corresponding coordinates in the lower resolution frame grid, $f$ is the domain filter centered over $p_{\downarrow}$, $g$ is the range filter centered at the image value at $p$, and the normalization term $k_p$ is the sum of the $f \cdot g$ filter weights, which ensures that the weights for all the pixels add up to one.
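The upsampling step can be sketched in Python as below. This is a simplified illustration under several assumptions that are not taken from the paper: Gaussian domain and range filters with hypothetical parameters `sigma_d` and `sigma_r`, an integer upsampling factor, a window radius of at least 1, and a guide frame that is used at colocated positions (zero motion) instead of the boundary-matched reference block described above.

```python
import numpy as np


def joint_bilateral_upsample(low, guide, factor, radius=2, sigma_d=1.0, sigma_r=10.0):
    """Upsample `low` by an integer `factor`, guided by the high resolution `guide`.

    low   : (h, w) error-concealed low resolution WZ frame
    guide : (h * factor, w * factor) previously decoded high resolution frame
    """
    low = np.asarray(low, dtype=np.float64)
    guide = np.asarray(guide, dtype=np.float64)
    h, w = low.shape
    H, W = guide.shape
    out = np.empty((H, W))
    for py in range(H):
        for px in range(W):
            # Coordinates of p in the low resolution grid (may be fractional).
            cy, cx = py / factor, px / factor
            y0, x0 = int(round(cy)), int(round(cx))
            num, k_p = 0.0, 0.0
            for qy in range(max(y0 - radius, 0), min(y0 + radius, h - 1) + 1):
                for qx in range(max(x0 - radius, 0), min(x0 + radius, w - 1) + 1):
                    # Domain filter f: spatial closeness in the low resolution grid.
                    f = np.exp(-((qy - cy) ** 2 + (qx - cx) ** 2) / (2 * sigma_d ** 2))
                    # Range filter g: similarity in the guide frame between p and
                    # the high resolution position that corresponds to q.
                    gy, gx = min(qy * factor, H - 1), min(qx * factor, W - 1)
                    g = np.exp(-((guide[py, px] - guide[gy, gx]) ** 2) / (2 * sigma_r ** 2))
                    num += low[qy, qx] * f * g
                    k_p += f * g
            out[py, px] = num / k_p  # normalization so the weights sum to one
    return out
```

The double loop is written for readability; a practical decoder would vectorize it or restrict it to the WZ blocks.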

Simulation Results
To evaluate the performance of the proposed SI generation technique, we conducted experiments using four standard test sequences, Hall Monitor, Akiyo, Mother and Daughter, and Foreman, in QCIF format (176 × 144) at 15 frames per second. The luminance component of the key and WZ frames and the classification map are taken into account in the bitrate computation. The GOP size is adaptive, with a maximum GOP size of 5. Only the DC band and the first two AC bands of the WZ blocks are refined using the BCH code.

Comparison of Different Downsampling Methods.
First, we compare the performance of the four downsampling methods introduced in Section 4.1. For each test sequence, the first 50 frames are used for simulation. For the experiments, the frames are downsampled using the different downsampling methods and then upsampled using the proposed STJBU. The resulting frames are compared to the original frames to calculate the peak signal-to-noise ratio (PSNR). Table 1 shows the average PSNR value of the four test sequences when the different downsampling methods are applied.
As shown in Table 1, the nearest neighbor downsampling algorithm has the lowest computational complexity but produces the lowest quality. The Lanczos algorithm is much more complex than the other methods but gives the best quality; its processing time is almost 10 times longer than that of the other methods. The bilinear and bicubic downsampling algorithms have lower computational complexity with acceptable output quality. Considering the trade-off between performance and processing speed, bilinear downsampling is chosen to downsample the partially reconstructed WZ frames.
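The PSNR used in this comparison follows the standard definition based on the mean squared error. A minimal Python sketch is given below; the function name and the default peak value of 255 (8-bit luminance samples) are assumptions for illustration.

```python
import numpy as np


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal size."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```

The per-sequence value is then the average of `psnr` over the frames used in the simulation.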

Comparison of the Reconstructed WZ Frame Quality.
This section compares the visual quality of the SI generated by the proposed method with that of the hybrid spatiotemporal error concealment [19]. The Akiyo and Hall Monitor sequences were encoded and decoded using the LC-DVC architecture with the QPISlice value set to 30. For the experiments, we use two different EC techniques along with STJBU. In the following, we refer to the different techniques as defined in Table 2.
The simulation results in Figure 4 illustrate the visual quality of the WZ frames obtained by the different methods for the Akiyo sequence. As can be seen in Figure 4, the proposed methods produce WZ frames with higher PSNR than those obtained by the hybrid EC [19]. Specifically, Proposed 1 (inpainting + STJBU) achieves better performance than Proposed 2 (BI + STJBU) because image inpainting is more effective than simple BI in error concealment, although it increases the computational complexity. However, since image inpainting is applied to a lower resolution image, the proposed method maintains low computational complexity.
The RD performance comparisons for the four test sequences are shown in Figures 5, 6, 7, and 8. Taking the Hall Monitor sequence as an example, it can be seen in Figure 7 that Proposed 1 gives the best RD performance, even better than SDP-DVC [39]. However, the RD performance of Proposed 2 is lower than that of LC-DVC and DISCOVER. Both Proposed 1 and Proposed 2 perform better than H.264/AVC in intra mode with an extremely simple encoder. The BI-based error concealment used in Proposed 2 enables a very simple encoder, but it reduces the SI quality and the RD performance.

Comparison of the RD Performance.
As shown in Figure 8, the proposed method performs worse for higher motion sequences such as the Foreman sequence. However, it should be noted that DISCOVER uses bidirectional motion estimation for SI generation and relies on a feedback channel. Therefore, the DISCOVER codec incurs extremely long decoding time and system delay. Since the proposed method is very simple, has low system delay, and does not require a feedback channel while producing a rate-distortion performance comparable to state-of-the-art DVC methods, it can be a promising solution for video applications in resource-limited environments.

Conclusion
In this paper, we present a robust STJBU-based SI generation method. The proposed method consists of three steps: (1) downsampling of a partially reconstructed WZ frame, (2) SI generation for the WZ blocks in the downsampled WZ frame, and (3) upsampling of the WZ frame using the proposed STJBU algorithm. Results show that the proposed method improves the visual quality of the SI by preserving edges and improves the RD performance by more than 1 dB in comparison to other DVC architectures. The proposed SI generation method is simple and can be integrated into any existing block-based DVC architecture. Moreover, with its low complexity and low latency, the proposed method can be a promising solution for video applications in resource-limited environments with a tight delay bound.

Figure 1: An example of a partially reconstructed WZ frame.

Lanczos Downsampling.
The output pixel value of the downsampled image is obtained by convolution with the Lanczos kernel.
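For reference, the standard Lanczos kernel is a sinc function windowed by the central lobe of a second, stretched sinc function, and it is zero outside the window. The Python sketch below evaluates this standard definition; the support parameter `a` and its default value of 3 are assumptions, since the kernel size used in the paper is not stated here.

```python
import numpy as np


def lanczos_kernel(x, a=3):
    """Standard Lanczos kernel: sinc(x) * sinc(x / a) for |x| < a, else 0.

    np.sinc is the normalized sinc, sin(pi * x) / (pi * x), with sinc(0) = 1.
    """
    x = np.asarray(x, dtype=np.float64)
    return np.where(np.abs(x) < a, np.sinc(x) * np.sinc(x / a), 0.0)
```

The wider support of this kernel compared with the bilinear and bicubic kernels is consistent with the higher processing time reported for Lanczos downsampling.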

Figure 2: Flowchart of the proposed SI generation scheme.

Figure 6: RD performance comparison for the Mother and Daughter sequence.

Figure 7: RD performance comparison for the Hall Monitor sequence.

Figure 8: RD performance comparison for the Foreman sequence.

Table 1: Comparison of different downsampling methods.

Table 2: Different SI generation methods being compared.