Spatial-Aided Low-Delay Wyner-Ziv Video Coding

In distributed video coding, the quality of the side information (SI) plays an important role in Wyner-Ziv (WZ) frame coding. Usually, SI is generated at the decoder by motion-compensated interpolation (MCI) from the past and future key frames, under the assumption that the motion trajectory between adjacent frames is translational with constant velocity. However, this assumption does not always hold, and thus the coding efficiency of WZ coding is often unsatisfactory for video with high and/or irregular motion. The situation becomes more serious in low-delay applications, since only motion-compensated extrapolation (MCE) can be applied to yield SI. In this paper, a spatial-aided Wyner-Ziv video coding (SA-WZVC) scheme for low-delay applications is proposed. In SA-WZVC, at the encoder, each WZ frame is coded as in the existing common Wyner-Ziv video coding (WZVC) scheme, and the auxiliary information is additionally coded with low-complexity DPCM. At the decoder, the auxiliary information is decoded first, and SI is then generated with its help by spatial-aided motion-compensated extrapolation (SA-MCE). Theoretical analysis shows that, when a good tradeoff between auxiliary information coding and WZ frame coding is achieved, SA-WZVC achieves better rate distortion performance than conventional MCE-based WZVC without auxiliary information. Experimental results also demonstrate that SA-WZVC can efficiently improve the coding performance of WZVC in low-delay applications.


Introduction
Recently, new applications such as wireless video surveillance and wireless sensor networks have been emerging. In these applications, a lightweight encoder is required because the computation and memory resources on sensors are scarce. Furthermore, these systems typically have a large number of encoders and only one or a few decoders. As a result, conventional hybrid video coding architectures such as H.26x and MPEG-x are no longer applicable, because they follow a one-to-many model with one high-complexity encoder and many low-complexity decoders. In theory, distributed source coding (DSC) provides an ideal solution to this problem. The Slepian-Wolf theorem shows that, under certain conditions, correlated sources that are encoded separately and decoded jointly can achieve the same coding performance as joint encoding and decoding [1]. Later, Wyner and Ziv extended this theory to lossy source coding with side information (SI) at the decoder [2], which is more suitable for practical video coding. Many researchers have applied practical WZ coding techniques to video coding [3-5]. One advantage of WZ coding is the low computational complexity of the encoder, as in the schemes proposed in [4, 5]. In these schemes, the motion correlation does not need to be exploited at the encoder, and the frames are compressed only by a low-complexity channel coding method such as turbo codes. At the WZ decoder, motion estimation with high computational complexity is applied to exploit the temporal correlation during SI generation. Subsequently, the errors between the original information and the SI are corrected by using the parity bits received from the encoder. Another advantage of WZVC is robustness: the WZVC system is drift-free because no motion estimation or motion-compensated prediction is performed at the encoder. The WZVC system is also regarded as one type of joint source-channel coding system [6], since it can be used as a systematic lossy forward error protection method for conventional video coding.
In [3], two typical SI generation approaches are introduced: motion-compensated interpolation (MCI) and motion-compensated extrapolation (MCE). For MCI, the SI for the current frame is obtained by performing motion compensation on the adjacent previously and subsequently decoded pictures. However, in low-delay applications, the temporally subsequent pictures cannot be used as references to generate SI. Therefore, MCE is adopted to generate SI in low-delay applications: the motion between the decoded frames at time $t-2$ and time $t-1$ is estimated, and the estimated motion is used to extrapolate the SI at time $t$. However, the performance of MCE-based low-delay WZVC is often unsatisfactory because the motion field cannot be well estimated [3]. This situation can be improved by the auxiliary information-aided method, in which partial information of the current frame is sent as auxiliary information to help the decoder improve the accuracy of the motion field for MCE. In [7], a frame is partitioned into intra- and WZ macroblocks by a pattern similar to the H.264/AVC FMO grouping method. The subset of intra-macroblocks is employed as auxiliary information and helps to estimate the SI with a temporal concealment method. The auxiliary information-aided method can also be used to improve the quality of SI in the case of MCI. In [5], quantized DCT-domain coefficients, named hash bits, are used as the auxiliary information. In [8], a coarse representation of the frame is used to assist motion estimation at the decoder. These auxiliary information-aided WZ coding schemes report significant performance improvements.
The discrete wavelet transform (DWT) is highly desirable for video coding due to its intrinsic multiresolution structure and energy compaction property. For hybrid video coding, the DWT has been applied in many state-of-the-art coding schemes to obtain spatial scalability, such as [9, 10]. The DWT has also been widely used in the DVC paradigm. In [11], the author explored the high-order statistical correlation among the transform coefficients by using the DWT and SPIHT algorithms. In [12], exploiting the close correlation among hyperspectral images from neighboring frequency bands, the authors propose a prediction model based on linear prediction techniques; under this model, the correlation among bit-planes from neighboring DWT bands is exploited. In [13], the authors used shift-invariant redundant discrete wavelet transform (RDWT) reference frames for finding matching blocks, to overcome the inefficiency of motion estimation in the critically sampled wavelet domain. In [14], the authors proposed a context correlation model between the source and its SI in the wavelet transform domain. Compared to RDWT-domain motion estimation and motion compensation, spatial-domain motion estimation and motion compensation are usually able to yield better prediction efficiency [9].
To improve low-delay WZ coding, this paper proposes a spatial-aided WZ video coding scheme. The spatial DWT, which inherently supports spatial scalability, is used to generate the auxiliary information. At the encoder, each WZ frame is first decomposed by a spatial 2D-DWT, and its low-pass subband is used as the auxiliary information. The auxiliary information is encoded by DPCM, which removes part of the correlation among adjacent auxiliary information. Then, the whole frame is encoded by a DCT-domain Wyner-Ziv encoder. At the decoder, the auxiliary information is decoded first. SI is then generated by the SA-MCE algorithm, in which the motion field for generating SI is obtained by performing motion estimation, in the spatial domain, between the spatial auxiliary information and the low-pass subband of previously decoded frames. With the help of the auxiliary information, a more precise motion field can be obtained. Hence, the spatial-aided Wyner-Ziv video coding (SA-WZVC) approach is able to achieve better rate distortion performance than conventional MCE-based WZVC without auxiliary information. In addition, due to the inherent decomposition structure of the wavelet transform, scalability can be achieved easily.
The remainder of this paper is organized as follows: Section 2 describes the proposed scheme in detail. Section 3 analyzes the rate distortion performance of the proposed spatial-aided WZ coding method theoretically and compares it with the conventional MCE-based low-delay WZ coding. By using the theoretical model, some numerical results are presented. In Section 4 simulation results are given.

Spatial-Aided Low-Delay Wyner-Ziv Video Coding Scheme.
As shown in Figure 1, the framework of the spatial-aided low-delay WZ coding is similar to the framework presented in [4]. The key frames of the video sequence are compressed using a conventional intra-frame codec. The remaining frames, namely the WZ frames, are encoded by the spatial-aided low-delay WZ codec. At the encoder, the auxiliary information generation module is applied to the original WZ frames. The generated spatial auxiliary information is encoded by DPCM, while the whole WZ frame is encoded by DCT transform domain Wyner-Ziv video coding (WZVC) as proposed in [3]. At the decoder, the spatial auxiliary information is decoded first. Subsequently, with the help of the decoded spatial auxiliary information, the spatial-aided motion-compensated extrapolation (SA-MCE)-based SI generation algorithm is performed. Finally, the WZ frame is decoded by the DCT-domain WZ decoder. Each part of the system is described in detail below.

Spatial Auxiliary Information Coding.
There are many methods for generating auxiliary information, such as those in [5, 7, 8]. Considering its energy compaction characteristics, the DWT is adopted here as the tool to generate the auxiliary information. At the encoder, for each WZ frame, a one-level 2D-DWT with the biorthogonal 9/7 filter is applied to decompose the original frame, and the low-low (LL) subband of the current frame is used as the spatial auxiliary information. As a result, the resolution of the spatial auxiliary information is a quarter of that of the original frame. To reduce the temporal redundancy, DPCM is performed between adjacent LL subbands: the difference between the current LL subband and its previously reconstructed reference is calculated, the residues are DCT transformed and quantized, and the quantized coefficients are finally encoded by the CAVLC entropy encoder used in H.264/AVC. If the reference frame is a key frame, the LL subband of the full-resolution reconstructed intra-frame is obtained by DWT to form the reference for DPCM coding.

Figure 1: Framework of spatial-aided low-delay WZ codec.
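The closed-loop DPCM idea described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the codec used in the paper: the residual is quantized directly in the sample domain (the DCT and CAVLC stages are omitted), subbands are plain 2D lists, and the function names `dpcm_encode` and `dpcm_decode` are hypothetical.

```python
def _reconstruct(indices, reference, step):
    """Add the dequantized residual to the reference subband."""
    return [[r + q * step for q, r in zip(qrow, rrow)]
            for qrow, rrow in zip(indices, reference)]


def dpcm_encode(subbands, reference, step):
    """Closed-loop DPCM over a sequence of LL subbands (2D lists).

    Assumption: uniform quantization of the sample-domain residual;
    the paper additionally applies a DCT and entropy coding.
    """
    encoded = []
    recon_ref = [row[:] for row in reference]
    for band in subbands:
        indices = [[int(round((b - r) / step)) for b, r in zip(brow, rrow)]
                   for brow, rrow in zip(band, recon_ref)]
        # The encoder predicts from the *reconstructed* subband, exactly as
        # the decoder will, so quantization errors do not accumulate.
        recon_ref = _reconstruct(indices, recon_ref, step)
        encoded.append(indices)
    return encoded


def dpcm_decode(encoded, reference, step):
    """Rebuild every LL subband from the quantized residual indices."""
    decoded = []
    recon_ref = [row[:] for row in reference]
    for indices in encoded:
        recon_ref = _reconstruct(indices, recon_ref, step)
        decoded.append(recon_ref)
    return decoded
```

Because prediction always starts from the reconstructed reference, the per-sample error of every decoded subband stays within half a quantization step in this sketch, regardless of how many WZ frames follow the key frame.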

Wyner-Ziv Frame Coding.
At the encoder, the whole WZ frame is encoded by DCT transform domain WZ coding [3]. First, a block-wise DCT is applied to the whole WZ frame to exploit the statistical dependencies within the frame. The transform coefficients are grouped together to form coefficient bands, and a different M-level uniform scalar quantizer is applied to each band. Next, the bit-planes are extracted and each bit-plane is organized into fixed-length binary codewords. Each codeword is fed to the Slepian-Wolf (SW) encoder, whose output is the parity bits. The SW coder is implemented using a rate-compatible punctured turbo (RCPT) code. These parity bits are then punctured into different blocks and stored in a buffer. The blocks of parity bits, also called WZ bits, are successively transmitted to the decoder upon request. At the decoder, the spatial auxiliary information of the current WZ frame is decoded first. Then, the SI of the whole WZ frame is generated with the help of the auxiliary information by the SA-MCE method presented in Section 2.5. Subsequently, the DCT is applied to the generated full-resolution SI, and the coefficients in each DCT block are grouped into subbands according to the DCT band partition pattern. The DCT coefficient $Y_i$ of the SI at the $i$th position in the current subband is used to evaluate the bit-plane probabilities: for every original coefficient $X_i$, the value of $Y_i$ is used to evaluate the probability of each bit of $X_i$ being 1 or 0. The probability evaluation and the correlation model are described in the next subsection.
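The quantization and bit-plane extraction steps can be sketched as follows. This is an illustrative example only: the dynamic range `max_abs`, the clamping behavior, and the function names are assumptions that do not come from the paper, and the turbo encoder that consumes the bit-planes is not modeled.

```python
import math


def uniform_quantize(coeff, levels, max_abs):
    """Map a coefficient in [-max_abs, max_abs) to an index 0..levels-1.

    Assumption: out-of-range values are clamped to the outermost bins.
    """
    step = 2.0 * max_abs / levels
    index = int(math.floor((coeff + max_abs) / step))
    return max(0, min(levels - 1, index))


def extract_bitplanes(indices, num_bits):
    """Split quantizer indices into bit-planes, most significant first."""
    return [[(q >> (num_bits - 1 - j)) & 1 for q in indices]
            for j in range(num_bits)]


def merge_bitplanes(planes):
    """Inverse of extract_bitplanes: rebuild the quantizer indices."""
    values = [0] * len(planes[0])
    for plane in planes:
        values = [(v << 1) | b for v, b in zip(values, plane)]
    return values
```

For example, the indices 5, 2, 7, 0 with three bits per index yield the planes `[1, 0, 1, 0]`, `[0, 1, 1, 0]`, and `[1, 0, 1, 0]`; each plane would then be passed to the Slepian-Wolf encoder in turn.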

Correlation Model.
Once the turbo decoder obtains the side information, the a priori probability of the bit-plane currently being decoded is calculated first. According to simulation results, the distribution of the difference between the source and its SI conforms to a Laplacian model; thus, the Laplacian model is taken as the probability density function for calculating the a priori probability. The probability of the $j$th bit of $X_i$ being 0 or 1 is calculated by (1). Let $b_i^j$ denote the $j$th bit-plane at position $i$ in the current subband, and let its estimate be evaluated given the previously decoded bits $\{b_i^{j-1}, \ldots, b_i^0\}$, where $b_i^0$ is the most significant bit. In (1), $S_i$ is the sign bit: if the coefficient $X_i$ is positive, $S_i$ equals 0; otherwise, $S_i$ equals 1. For each coefficient band, a different Laplacian parameter $\alpha$ is adopted, and the value of $1/\alpha$ is determined by offline training.
In (2), $Z_i$ represents the integer formed by the $j$th bit $b_i^j$ and the previously decoded, more significant bits. The offset is an estimated value used to compensate for the lower part of $Z_i$; if $X_i$ is partitioned into $m$ bins, the offset equals $2^{m-j-1}$. Finally, $a$ is used to adjust the sign of the value $(Z_i + \text{offset})$ and is defined in (3).

EURASIP Journal on Image and Video Processing
According to (1), (2), and (3), the transition probabilities on the branches of the turbo code trellis can be obtained. When the decoder receives the parity bits, the trellis is traversed several times. If the bit-error rate (BER) of the current bit-plane converges to an acceptable value, the request for parity bits stops and the current bit-plane is considered successfully decoded. Otherwise, more parity bits are requested. After the current bit-plane is decoded, it is used in calculating the a priori probability of the next bit-plane as defined in (1).
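To make the role of the Laplacian model concrete, the following sketch estimates the probability that the next bit-plane value is 1, given the SI coefficient and the more significant bits already decoded. It is a hedged illustration and not a transcription of (1)-(3): the normalization over the two bit hypotheses, the zero-based bit indexing, the choice of the offset as the midpoint of the still-unknown lower bits, and the name `bit_probability` are all assumptions made here.

```python
import math


def laplacian_pdf(diff, alpha):
    """Laplacian density (alpha / 2) * exp(-alpha * |diff|)."""
    return 0.5 * alpha * math.exp(-alpha * abs(diff))


def bit_probability(y, decoded_bits, sign_bit, num_bits, alpha):
    """Estimate P(next bit = 1) for one coefficient magnitude.

    y            -- side-information coefficient (signed)
    decoded_bits -- more significant magnitude bits, MSB first
    sign_bit     -- 0 for a positive coefficient, 1 otherwise
    num_bits     -- number of magnitude bit-planes (assumed convention)
    """
    j = len(decoded_bits)                 # index of the bit being decoded
    weight = 1 << (num_bits - 1 - j)      # magnitude weight of that bit
    prefix = sum(b << (num_bits - 1 - k) for k, b in enumerate(decoded_bits))
    sign = -1 if sign_bit else 1
    offset = weight / 2.0                 # midpoint of the undecoded bits
    likelihoods = []
    for bit in (0, 1):
        z = prefix + bit * weight         # integer with the known upper bits
        likelihoods.append(laplacian_pdf(sign * (z + offset) - y, alpha))
    return likelihoods[1] / (likelihoods[0] + likelihoods[1])
```

In this sketch, an SI value that lies closer to the reconstruction point of the hypothesis "bit = 1" produces a probability above 0.5, and a larger `alpha` (a more reliable SI) pushes the estimate further toward 0 or 1.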

SA-MCE-Based Side Information Generation.
Motion-compensated extrapolation is a general method in low-delay WZ coding schemes. For the MCE method, as shown in [3], the motion between the decoded frames at time $t-1$ and time $t-2$ is estimated, and the estimated motion is used to extrapolate the SI at time $t$. However, because no information about the current frame is available, the MCE method is not very effective. Therefore, a spatial auxiliary information-aided MCE method is adopted in this paper.
The proposed SA-MCE SI generation scheme is depicted in Figure 2, and the detailed procedure is as follows. First, to obtain motion information for motion compensation at high resolution, the low-resolution auxiliary information needs to be upsampled. Motion search can then be performed either between the current upsampled low-resolution frame and the previous upsampled low-resolution frames (L-L), or between the current upsampled low-resolution frame and the previous reconstructed high-resolution frames (L-H). Because the high-pass subbands are missing, the upsampled low-resolution frames suffer from artifacts such as blending, aliasing, and tiling. As shown in [15], these artifacts in the upsampled low-resolution frames (L) can disturb block matching against blocks in the high-quality reference frame (H). The previous upsampled low-resolution frames have the same kind of artifacts, so the effect of the artifacts can be nullified by matching frames that exhibit similar artifacts. Therefore, it is necessary to perform DWT and IDWT to obtain the upsampled frame of the LL band, even when the previous frame is a key frame. The inverse DWT operator $\mathrm{IDWT}_{L0}$ is used to upsample the LL subband and is defined as

$$\Delta X_{LL(t)} = \mathrm{IDWT}_{L0}\bigl(X_{LL(t)}\bigr), \tag{4}$$

where $X_{LL(t)}$ is the LL subband at time $t$ and $\Delta X_{LL(t)}$ is the upsampled LL frame at time $t$. The $\mathrm{IDWT}_{L0}$ operator is an inverse DWT in which the LL subband is $X_{LL(t)}$ and the high-pass subbands are all set to zero. Second, motion estimation is performed between the upsampled spatial auxiliary information $\Delta X^{w}_{LL(t)}$ and its reference $\Delta X^{r}_{LL(t-1)}$. The reference can be the upsampled LL band of a reconstructed key frame or the upsampled LL band of a reconstructed WZ frame.
In this work, the MVs between the upsampled low-resolution frames are directly used for full-resolution MCE. The previously reconstructed full-resolution frame (either a key frame or a WZ frame) is used as the reference frame for MCE in (5), where $X^{r}_{F(t-1)}$ denotes the reconstructed full-resolution frame at time $t-1$ and $X^{w}_{F(t)}$ denotes the motion-compensated full-resolution frame at time $t$. Because of the interband correlation of the DWT coefficients, the prediction of the high-pass subbands of the current WZ frame is also obtained through this motion compensation. Consequently, a full-resolution prediction signal of the current WZ frame, $X^{w}_{F(t)}$, is generated by (5).
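The two steps above, motion estimation between upsampled LL frames followed by motion compensation of the full-resolution reference, can be sketched as below. This is a simplified illustration under assumptions that are not taken from the paper: integer-pixel full search with a sum-of-absolute-differences cost, a fixed block size, frames stored as 2D lists of equal size, and hypothetical function names. The wavelet upsampling of (4) is assumed to have been done beforehand.

```python
def sad(cur, cx, cy, ref, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(cur[cy + r][cx + c] - ref[ry + r][rx + c])
               for r in range(size) for c in range(size))


def estimate_motion(cur_up, ref_up, block, search):
    """Full-search block matching between two upsampled LL frames.

    Returns {(bx, by): (dx, dy)}; ties prefer the shorter vector.
    """
    height, width = len(cur_up), len(cur_up[0])
    mvs = {}
    for by in range(0, height - block + 1, block):
        for bx in range(0, width - block + 1, block):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    rx, ry = bx + dx, by + dy
                    if rx < 0 or ry < 0 or rx + block > width or ry + block > height:
                        continue  # candidate block leaves the frame
                    key = (sad(cur_up, bx, by, ref_up, rx, ry, block),
                           abs(dx) + abs(dy))
                    if best is None or key < best[0]:
                        best = (key, (dx, dy))
            mvs[(bx, by)] = best[1]
    return mvs


def compensate(ref_full, mvs, block):
    """Apply the low-resolution-derived MVs to the full-resolution reference."""
    pred = [row[:] for row in ref_full]
    for (bx, by), (dx, dy) in mvs.items():
        for r in range(block):
            for c in range(block):
                pred[by + r][bx + c] = ref_full[by + dy + r][bx + dx + c]
    return pred
```

The design point this sketch tries to capture is that the MVs are found on frames that share the same upsampling artifacts, while the pixels that form the prediction are copied from the artifact-free full-resolution reference.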
From the numerical results of the rate distortion analysis, it can be found that the performance of WZVC is enhanced when the quality of the auxiliary information is sufficiently high. Hence, more bits are allocated to auxiliary information coding than to WZ frame coding, which makes the quality of the DPCM-coded LL band high. Statistically, the objective quality of the DPCM-coded LL band is found to be better than that of the LL band of the extrapolated prediction $X^{w}_{F(t)}$ in most cases. Therefore, the DPCM-coded LL subband is substituted for the LL band of the full-resolution prediction $X^{w}_{F(t)}$, where $X^{w}_{LL(t)}$ is the DPCM-coded LL subband of the WZ frame at time $t$ and $X^{w}_{H(t)}$ represents the three high-pass subbands of $X^{w}_{F(t)}$, obtained by a DWT operation. Finally, the side information $Y^{w}_{F(t)}$ used for WZ decoding is generated from $X^{w}_{LL(t)}$ and $X^{w}_{H(t)}$.

In conventional hybrid video coding, motion estimation is performed at the encoder, and the accuracy of motion estimation is assumed to be related only to the finite precision used to represent the motion vectors. In the MCE-based WZVC scheme, motion estimation is performed at the decoder. Since the current frame is unavailable at the decoder, motion estimation is performed between two previously reconstructed reference frames, and the obtained MVs are used to extrapolate the SI of the current frame. When the motion trajectory among adjacent frames is not translational with constant velocity, the MVs between the two previous frames do not exactly match the MVs between the current frame and its previous reference frame. Therefore, the quality of the side information may not be satisfactory. In our spatial-aided WZVC scheme, reduced-resolution spatial information is encoded and transmitted to the decoder. The underlying idea is that motion estimation at the decoder then has access to spatial auxiliary information, so this partial description of the current frame may help in obtaining a more accurate estimate of the motion model.

Rate Distortion Analysis
In this work, signal power spectrum and Fourier analysis tools are used to analyze SA-WZVC. These tools are widely used in the rate distortion analysis of hybrid video coding schemes. The rate distortion performance of conventional MCP-based video coding is analyzed in [16]. Fractional-pixel motion search, long-term motion search, and multi-hypothesis prediction are then studied in [17-20]. Recently, signal power spectrum methods have also been introduced into Wyner-Ziv coding. In [21], the authors presented a theoretical rate distortion model to examine WZVC performance and compare it with conventional motion-compensated prediction (MCP)-based video coding. The theoretical results show that although WZVC can achieve as much as 6 dB gain in PSNR over conventional video coding without motion search, it still falls 6 dB or more behind the best MCP-based inter-frame video coding schemes. In [22], the authors studied a theoretical rate distortion model for an auxiliary hash-based WZVC scheme. In that scheme, the hash consists of the high-pass DCT coefficients and is used to perform motion estimation at the decoder. The analysis shows that, at high rates, hash-based motion modeling can achieve virtually the same coding efficiency as motion-compensated predictive coding, whereas at medium or low rates a significant coding loss is observed. In this work, these theoretical analysis tools are extended to investigate the rate distortion performance of our spatial-aided low-delay WZVC scheme. Some ideas and theoretical tools are borrowed from the above works; the discussion is worthwhile because the optimal generation and coding methods for auxiliary information have not been fully explored yet.

Rate Distortion Analysis of Auxiliary Information-Aided Wyner-Ziv Coding.
In the following discussion, a rate difference model of the SA-WZVC scheme versus the conventional MCE-based WZVC scheme is established. This rate difference model relates the accuracy of the motion model to the power spectral density (PSD) of the quantization noise signal. Furthermore, numerical results of the theoretical model are presented. The results demonstrate that the SA-WZVC scheme can achieve a rate saving compared with conventional MCE-based WZVC when a good tradeoff between auxiliary information coding and WZ coding is achieved.

Rate Distortion Analysis. The prediction residual $e(t)$ is

$$e(t) = s(t) - \hat{s}(t),$$

where $s(t)$ denotes the input source and $\hat{s}(t)$ denotes the MCP frame for conventional video coding or the SI for WZVC. According to [16, 19], the power spectrum of the prediction residual, $\Phi_{ee}(\omega)$, is expressed in terms of the following quantities: $\Phi_{SS}(\omega)$ is the spatial power spectral density (PSD) of the original frame $s(t)$; $\Delta = (\Delta_x, \Delta_y)$ is the motion vector error, that is, the difference between the used motion vectors (MVs) and the true MVs; and $\theta$ is the noise term introduced by the quantization step.
If only the prediction inaccuracy introduced by either SA-MCE or MCE is considered, the assumption $\Phi_{SS}(\omega) \gg \theta$ can be made according to [16]. The difference in rate between intra-frame coding of the prediction error $e$ and intra-frame coding of $s$ then follows from [16] or [19]. Hence, the rate difference $\Delta R_f$ between SA-WZVC and MCE-based WZVC is given by (10), which depends on the MV errors of the two schemes: with $(d_x, d_y)$ denoting the true MVs, (11) defines the error of the MVs obtained by the MCE algorithm, and (12) defines the error of the MVs obtained by the SA-MCE algorithm.
In our scheme, the spatial auxiliary information $s_l(t)$ is coded by the DPCM method, and the prediction residual is denoted as $e_l(t)$. The R(D) function for the DPCM coding of the spatial auxiliary information is given by (13); it depends on the PSD of $e_l(t)$ and on $\Phi_{S_l S_l}(\omega)$, the PSD of the spatial auxiliary information.
Since the MVs of DPCM coding are $(0, 0)$, the motion vector error is equal to the true MVs. The rate difference, which takes the spatial correlation of the prediction error $e_l(t)$ and the original signal $s_l(t)$ into account, is widely used to measure the bit-rate reduction. It represents the maximum bit-rate reduction possible by optimum encoding of the prediction error, compared to optimum intra-frame encoding of the signal for the same mean-squared reconstruction error [19]. To obtain an upper bound on the rate reduction, the rate difference is measured by comparing the prediction error of the auxiliary information, $e_l(t)$, with the prediction error of the low-pass subband of a full-resolution frame that is encoded with the MCE-based WZVC method.
For MCE-based WZVC, the R(D) function of the low-pass subband coding is expressed analogously, in terms of the prediction error of the low-pass subband and the PSD of the quantization error introduced into the low-pass subband. The MVs in (11) can be taken as the subpixel-accuracy MVs of the low-pass subband.
To coincide with the motion compensation of the low-pass subband coding, these subpixel-accuracy MVs can be reduced to integer-pixel accuracy. Hence, for a one-level DWT, the MVs in (11) are reduced to half scale, and the MV error is expressed relative to the true MVs of the low-pass subband, which yields the PSD of the low-pass subband prediction error. According to (13)-(18), the rate difference between the DPCM coding of the spatial auxiliary information and the low-pass subband of the full-resolution frame encoded by MCE-based WZVC is derived as (19). Since the PSD of the spatial signal is assumed to be much larger than the PSD of the quantization noise, (19) can be simplified to (20). According to (10) and (20), the overall rate saving $\Delta R$ is given by (21). The first part of (21) can be considered as the overhead of the auxiliary information coding. The second part of (21) is the coding gain from the spatial auxiliary information-aided motion-compensated extrapolation.
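The dependence of the rate on the MV error can be explored numerically with a small sketch. It does not reproduce (10) or (21). Instead, it assumes the standard power-spectral model from the motion-compensated prediction literature, in which the residual PSD relative to the signal PSD is $2(1 - P(\omega)) + \theta/\Phi_{SS}$, with an isotropic Gaussian displacement error so that $P(\omega) = \exp(-\sigma^2 |\omega|^2 / 2)$, and a constant noise-to-signal ratio over the band. The names `rate_difference` and `noise_ratio` are hypothetical.

```python
import math


def residual_psd_ratio(wx, wy, sigma2, noise_ratio):
    """Phi_ee / Phi_SS under the assumed Gaussian displacement-error model."""
    p = math.exp(-0.5 * sigma2 * (wx * wx + wy * wy))
    return 2.0 * (1.0 - p) + noise_ratio


def rate_difference(sigma2_a, sigma2_b, noise_ratio=0.01, n=64):
    """Rate of scheme A minus rate of scheme B, in bits per sample.

    sigma2_a, sigma2_b -- MV error variances of the two schemes
    noise_ratio        -- assumed constant theta / Phi_SS (must be > 0)
    n                  -- grid points per axis for the midpoint rule
    A negative result means scheme A needs fewer bits.
    """
    step = 2.0 * math.pi / n
    total = 0.0
    for ix in range(n):
        wx = -math.pi + (ix + 0.5) * step
        for iy in range(n):
            wy = -math.pi + (iy + 0.5) * step
            ratio_a = residual_psd_ratio(wx, wy, sigma2_a, noise_ratio)
            ratio_b = residual_psd_ratio(wx, wy, sigma2_b, noise_ratio)
            total += math.log2(ratio_a / ratio_b)
    # (1 / (8 * pi^2)) times the integral over [-pi, pi]^2
    return total * step * step / (8.0 * math.pi ** 2)
```

Under these assumptions, calling `rate_difference` with a smaller MV error variance for scheme A than for scheme B returns a negative value, which corresponds to the coding-gain term discussed above; the overhead term of the auxiliary information would have to be added separately to obtain an overall saving.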

Numerical Results.
The rate saving of SA-WZVC versus MCE-based WZVC is examined as follows. According to the statistics of the displacement error and the PSD ratio of the quantization noise, the rate difference is obtained by (21). The numerical results of the theoretical analysis are shown in Table 1, where different qualities of auxiliary information are used in SA-WZVC. These lead to different displacement errors and different overheads for auxiliary information coding, and therefore to different rate savings.
In the simulation, the Foreman CIF sequence is used and twenty WZ frames are encoded. One-level 9/7 wavelet decomposition is adopted to generate the spatial auxiliary information. The quality of the key frames is the same in SA-WZVC and MCE-based WZVC. Once the quantization scheme of MCE-based WZVC is fixed, the quantization error introduced into the low-pass subband is fixed as well, so the PSD ratio of the two quantization errors in (21) is determined only by the quantization error of the auxiliary information. SNR represents the correlation between the MVs generated by the MCE method and the MVs generated by the SA-MCE method, and is computed from these two sets of MVs. For the same reason, when the quantization errors of the WZ frame and the key frame are fixed, the MVs generated by the MCE method are constant too, so the SNR is affected only by the variance of the MVs generated by SA-MCE. Also, $\Delta R_f$ is the rate difference between spatial-aided WZ coding and MCE-based WZ coding defined in (10), and $\Delta R$ is the overall rate saving defined in (21), which comprises the overhead coding and the rate saving of WZ coding. The simulation results show that there exists a tradeoff between auxiliary information coding and WZ frame coding. As the quantization error of the auxiliary information decreases, the SNR increases and the rate saving of WZ coding $\Delta R_f$ increases. This illustrates that if more bits are allocated to auxiliary information coding, the MVs generated with the help of the high-quality auxiliary information become more accurate: the variance of the MV error $\sigma^2_{\Delta d_2}$ decreases, and therefore the rate saving of WZ coding $\Delta R_f$ increases. However, the overhead of the auxiliary information coding also increases, which reduces the overall rate saving.
Conversely, if the quality of the auxiliary information decreases, both the accuracy of the MVs and the rate saving of WZ coding $\Delta R_f$ decrease. The quality of the auxiliary information is therefore important because it determines the coding tradeoff. It can be concluded that, if the bit allocation strategy is optimal, a promising coding gain can be achieved.

Experimental Results and Analysis
In this section, the proposed scheme is implemented to verify the coding efficiency of the spatial-aided low-delay WZ coding. The key frames are H.264/AVC intra-coded using the reference software JM 9. The spatial auxiliary information is generated by applying DWT decomposition to the original frames, and the DWT is implemented with the biorthogonal 9/7 filter. The entropy coding method adopted in the DPCM coding of the spatial auxiliary information is CAVLC in JM 9. For the low-delay WZ coding of the whole frame, the DCT-domain WZ coding scheme described in Section 2.3 is used. A rate-compatible punctured turbo (RCPT) encoder is adopted as the Slepian-Wolf codec, and the acceptable bit-error rate at the decoder is set to $10^{-3}$. The parameter of the Laplacian distribution model is obtained by offline fitting of the difference between the original frame and its side information frame. Because the distributions differ, the parameters of each bit-plane may have different values. For the various sequences, different parameters of the Laplacian distribution model are also obtained by offline training.
The Foreman, News, and Tempete sequences at CIF resolution are used in testing. For each sequence, 168 frames are encoded with the coding structure I-W-...-W-I. The QP for DPCM coding of the spatial auxiliary information equals the QP of the key frames minus two. Five different QPs are chosen for key frame coding: 20, 24, 28, 32, and 36.

Evaluation of Spatial-Aided Wyner-Ziv Video Coding.
The overall RD performance of "SA-WZVC" is compared with that of the scheme proposed in [7]. In Figures 3(a), 3(b), and 3(c), "SA-WZVC" denotes the proposed spatial-aided WZ coding. A one-level 2D-DWT with the biorthogonal 9/7 filter is applied to generate the auxiliary information. The GOP size adopted in the simulation is 6. The scheme proposed in [7] is implemented and denoted as "Hybrid Intra/WZVC" in Figures 3(a), 3(b), and 3(c); its auxiliary information is generated in the spatial domain. Compared with the RD performance of "Hybrid Intra/WZVC," our method achieves a promising improvement. The quality of the key frames is the same in our proposed method and in the Hybrid Intra/WZVC scheme of [7]. The curve labeled "H.264 Intra" shows the results of H.264/AVC intra-frame coding. From the comparison with the overall RD performance of intra-frame coding and DPCM coding, it can be observed that the proposed method efficiently improves the rate distortion performance of WZVC in low-delay applications.
The shares of the bit-rate used by key frame coding, auxiliary information coding, and WZ coding are presented in Tables 2(a), 2(b), and 2(c).
In Table 2, $QP_k$ represents the quantization parameter of key frame coding. The QP for DPCM coding of the spatial auxiliary information equals the QP of the key frames minus two. According to Tables 2(a), 2(b), and 2(c), at the high bit-rate point, most of the bit-rate is consumed by intra-coding of the key frames and by auxiliary information coding, while WZ frame coding takes a much lower percentage. At the low bit-rate point, the rate consumed by WZ coding cannot be

From the simulation results, it can be concluded that a longer GOP size degrades the RD performance for test sequences with high motion, such as Foreman and Tempete. In fact, the quality of the key frames is very important for the overall RD performance (including both key frames and WZ frames). To investigate this phenomenon, the frame-by-frame PSNR of the decoded frames and the distribution of the bit-rate for the Foreman sequence are presented in Figures 5(a) and 5(b). According to Figure 5(a), within one GOP the quality of the WZ frames near the beginning is better than that of the WZ frames near the end. At the same time, according to Figure 5(b), the bits consumed by WZ coding of the later frames increase compared with those of the earlier WZ frames. This is because, as the frame index increases within one GOP, the quality of the reference frames decreases, which degrades the SI quality. More bits are then needed in WZ decoding to correct the larger errors between the SI and the original signal. Therefore, the performance of WZVC may decrease for long GOP sizes, and the key frame has to be refreshed at a proper period. For sequences with smooth motion, such as News, a longer GOP size can improve the RD performance. How to find a proper GOP size for low-delay WZVC is a topic of our future research.

Experiments with Multilevel DWT.
If more than one level of wavelet decomposition is carried out, auxiliary information with a smaller resolution is generated, and the overhead of the auxiliary information becomes negligible for the whole system. A simulation of SA-WZVC using two-level decomposition has been carried out. In this case, the lowest-pass subband, with a resolution of 88 × 72, is transmitted as the auxiliary information. The higher-resolution SI is extrapolated with the aid of the lower subband by using the SA-MCE method, and the higher-resolution frames are then refined by the WZ coding method. The RD performance is shown in Figures 6(a), 6(b), and 6(c). Compared with one-level DWT decomposition, there is a performance loss with two-level DWT. A careful study shows that the correlation between the SI and the original information becomes weaker. This can be attributed to two factors: the energy contained in the auxiliary information decreases due to the multilevel DWT, and the accuracy of the motion information is diminished because the MVs are generated with the aid of less complete auxiliary information. The decreased correlation increases the rate cost of WZ coding, and this increase is not compensated by the rate reduction in overhead coding.
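The resolution of the lowest-pass subband follows directly from the number of decomposition levels, since each level halves both dimensions. The short sketch below illustrates this arithmetic; the rounding up for odd dimensions and the name `ll_resolution` are assumptions made for the example.

```python
def ll_resolution(width, height, levels):
    """Size of the lowest-pass (LL) subband after `levels` 2D-DWT levels.

    Assumption: odd dimensions are rounded up at each level.
    """
    for _ in range(levels):
        width = (width + 1) // 2
        height = (height + 1) // 2
    return width, height
```

For a CIF frame of 352 × 288 samples, one level gives a 176 × 144 subband and two levels give the 88 × 72 subband mentioned above, that is, one sixteenth of the original number of samples.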

Conclusions
In this paper, a spatial-aided low-delay WZ coding scheme has been presented. In this scheme, the low-pass subband of the WZ frame generated by DWT is used as the spatial auxiliary information and encoded by DPCM. At the decoder, the spatial auxiliary information is decoded first. By performing motion estimation on the upsampled spatial auxiliary information, more accurate MVs are obtained than with MCE-based SI generation. This improvement makes a high-efficiency low-delay WZ coding scheme possible. In our further study, a more general analysis carried out at full scale only will be considered: the low-pass subband is coded and transmitted as auxiliary information, and the high-pass subbands could be encoded independently by the spatial-aided low-delay WZVC method. In this case, all of the impacts of decimation, subsequent interpolation, and simple coarse quantization could be considered at full scale in a more general manner. Moreover, to fully explore the characteristics of the proposed SA-WZVC in low-delay applications, the case of a longer GOP size and the case of one I-frame followed only by WZ frames will be studied and realized in further research. How to find a proper GOP size for low-delay WZVC is also a future research topic.