Compressive Video Coding: A Review of the State-Of-The-Art



Introduction
Video coding and its related applications have advanced substantially in recent years. Major coding standards such as MPEG [1] and H.26x [2] are well developed and widely deployed. These standards were developed mainly for applications such as DVDs, where the compressed video is played back many times by the consumer. Since compression needs to be performed only once while decompression (playback) is performed many times, it is desirable that the decoding process be as simple and fast as possible. Therefore, essentially all current video compression schemes, such as the various MPEG standards and H.264 [1,2], involve a complex encoder and a simple decoder. The exploitation of spatial and temporal redundancies for data compression at the encoder typically makes the encoding process 5 to 10 times more computationally complex than the decoding process [3]. For video encoding to be performed in real time at frame rates of 30 frames per second or more, the encoding process has to be performed by specially designed hardware, which increases the cost of cameras.
In the past ten years, we have seen substantial research and development of large sensor networks in which a large number of sensors are deployed. For some applications, such as video surveillance and sports broadcasting, these sensors are in fact video cameras. For such systems, there is a need to re-evaluate conventional strategies for video coding. If the encoders are made simpler, then the cost of a system involving tens or hundreds of cameras can be substantially reduced in comparison with deploying current camera systems. Typically, data from these cameras can be sent to a single decoder and aggregated. Since some of the scenes captured may be correlated, computational gain can potentially be achieved by decoding these scenes together rather than separately. Decoding can be a simple reconstruction of the video frames, or it can be combined with detection algorithms specific to the application at hand. Thus there are benefits in combining reduced-complexity cameras with flexible decoding processes to deliver modern applications which were not anticipated when the various video coding standards were developed.
Compressed Sensing (CS) shows that, for signals which possess some "sparsity" properties, the sampling rate required to reconstruct these signals with good fidelity can be much lower than the lower bound specified by Shannon's sampling theorem. Since video signals contain substantial amounts of redundancy, they can be regarded as sparse signals and CS can potentially be applied. The simplicity of the encoding process is traded off against a more complex, iterative decoding process. The reconstruction process of CS is usually formulated as an optimization problem, which potentially allows one to tailor the objective function and constraints to the specific application. Even though practical cameras that make use of CS are still in their very early days, the concept can be applied to video coding. A lower sampling rate implies that less energy is required for data processing, leading to lower power requirements for the camera. Furthermore, the complexity of the encoder can be further reduced by making use of distributed source coding [21,22]. The distributed approach provides ways to encode video frames without exploiting any redundancy or correlation between video frames captured by the camera. The combined use of CS and distributed source coding can therefore serve as the basis for the development of camera systems where the encoder is less complex than the decoder.
We shall first provide a brief introduction to Compressed Sensing in the next section. This is followed by a review of current research in video coding using CS.

Compressed sensing
Shannon's uniform sampling theorem [7,8] provides a lower bound on the rate at which an analog signal needs to be sampled in order that the sampled signal fully represents the original. If a signal $f(t)$ contains no frequencies higher than $\omega_{\max}$ radians per second, then it can be completely determined by samples that are spaced $T = \pi/\omega_{\max}$ seconds apart. $f(t)$ can be reconstructed perfectly from these samples by
$$f(t) = \sum_{k=-\infty}^{\infty} f(kT)\,\mathrm{sinc}\!\left(\frac{t}{T} - k\right), \qquad \mathrm{sinc}(u) = \frac{\sin(\pi u)}{\pi u}.$$
The uniform samples $f(kT)$ of $f(t)$ may be interpreted as coefficients of basis functions obtained by shifting and scaling the sinc function. For high bandwidth signals such as video, the amount of data generated at a sampling rate of at least twice the bandwidth is very high. Fortunately, most of the raw data can be thrown away with almost no perceptual loss. This is the result of lossy compression techniques based on orthogonal transforms. In image and video compression, the discrete cosine transform (DCT) and the wavelet transform have been found to be most useful. The standard procedure is as follows. The orthogonal transform is applied to the raw image data, giving a set of transform coefficients. Those coefficients whose magnitudes are smaller than a certain threshold are discarded. Only the remaining significant coefficients, typically a small subset of the original, are encoded, reducing the amount of data that represents the image. This means that if there is a way to acquire only the significant transform coefficients directly by sampling, then the sampling rate can be much lower than that required by Shannon's theorem.
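The transform-coding procedure described above (transform, discard small coefficients, reconstruct from the rest) can be sketched in a few lines. This is only an illustrative sketch in one dimension: the test signal, the threshold value `tau` and all function names are hypothetical, and an orthonormal DCT-II is assumed as the transform.

```python
import math


def dct_matrix(n):
    """Rows are the orthonormal DCT-II basis vectors of length n."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def forward(basis, signal):
    """Transform coefficients: inner product of the signal with each basis row."""
    return [sum(b * s for b, s in zip(row, signal)) for row in basis]


def inverse(basis, coeffs):
    """Rebuild the signal as a weighted sum of the (orthonormal) basis rows."""
    n = len(coeffs)
    return [sum(coeffs[k] * basis[k][i] for k in range(n)) for i in range(n)]


def threshold(coeffs, tau):
    """Discard (zero out) coefficients whose magnitude is below tau."""
    return [c if abs(c) >= tau else 0.0 for c in coeffs]


n = 64
# Hypothetical smooth test signal: a constant offset plus one slow cosine.
signal = [1.0 + math.cos(math.pi * (2 * i + 1) * 3 / (2 * n)) for i in range(n)]

basis = dct_matrix(n)
coeffs = forward(basis, signal)
kept = threshold(coeffs, tau=1e-6)

num_kept = sum(1 for c in kept if c != 0.0)
recon = inverse(basis, kept)
max_err = max(abs(a - b) for a, b in zip(signal, recon))
print(num_kept, "of", n, "coefficients kept; max reconstruction error", max_err)
```

For this deliberately simple signal only a couple of the 64 coefficients survive the threshold, yet the reconstruction is essentially exact. Natural images are less extreme, but the same effect is what makes lossy transform coding work.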
Emmanuel Candes, together with Justin Romberg and Terry Tao, developed a theory of Compressed Sensing (CS) [9] that can be applied to signals, such as audio, image and video, that are sparse in some domain. This theory provides a way, at least theoretically, to acquire signals at a rate potentially much lower than the Nyquist rate given by Shannon's sampling theorem. CS inspired more than a thousand papers between 2006 and 2010 [9].

Key Elements of compressed sensing
Compressed Sensing [4-6,10] is applicable to signals that are sparse in some domain. Sparsity is a general concept: it expresses the idea that the information rate, or the significant content, of a signal may be much smaller than what is suggested by its bandwidth. Most natural signals are redundant and therefore compressible in some suitable domain. We shall first define the two principles, sparsity and incoherence, on which the theory of CS depends.

Sparsity
Sparsity is important in Compressed Sensing as it determines how efficiently one can acquire signals non-adaptively. The most common definition of sparsity used in compressed sensing is as follows. Let $f \in \mathbb{R}^N$ be a vector which represents a signal which can be expanded in an orthonormal basis $\Psi = [\psi_1\ \psi_2\ \cdots\ \psi_N]$ as
$$f = \sum_{i=1}^{N} x_i \psi_i. \tag{1.2}$$
Here, the coefficients $x_i = \langle f, \psi_i \rangle$. In matrix form, (1.2) becomes
$$f = \Psi x.$$
When all but a few of the coefficients $x_i$ are zero, we say that $f$ is sparse in a strict sense. If $S$ denotes the number of non-zero coefficients with $S \ll N$, then $f$ is said to be $S$-sparse. In practice, most compressible signals have only a few significant coefficients while the rest have relatively small magnitudes. If we set these small coefficients to zero, in the way that it is done in lossy compression, then we have a sparse signal.

Incoherence
We start by considering two different orthonormal bases, $\Phi$ and $\Psi$, of $\mathbb{R}^N$. The coherence between these two bases is defined in [10] by
$$\mu(\Phi, \Psi) = \sqrt{N} \max_{1 \leq k, j \leq N} |\langle \phi_k, \psi_j \rangle|,$$
which gives us the largest correlation between any two elements of the two bases. It can be shown that
$$1 \leq \mu(\Phi, \Psi) \leq \sqrt{N}.$$
Sparsity and incoherence together quantify the compressibility of a signal. A signal is more compressible if it has higher sparsity in some representation domain $\Psi$ that is less coherent with the sensing (or sampling) domain $\Phi$. Interestingly, random matrices are largely incoherent with any fixed basis [18].
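To make the notion of coherence concrete, the following sketch computes it for two pairs of orthonormal bases, using the usual definition of coherence as $\sqrt{N}$ times the largest absolute inner product between elements of the two bases. The choice of bases (the spike or identity basis and an orthonormal DCT-II basis), the size `n = 16` and the function names are assumptions made for illustration only.

```python
import math


def identity_basis(n):
    """Spike (canonical) basis: the i-th vector is 1 at position i, 0 elsewhere."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def dct_basis(n):
    """Orthonormal DCT-II basis vectors of length n."""
    basis = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        basis.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                      for i in range(n)])
    return basis


def coherence(phi, psi):
    """sqrt(N) times the largest absolute inner product between the two bases."""
    n = len(phi)
    largest = max(abs(sum(a * b for a, b in zip(p, q))) for p in phi for q in psi)
    return math.sqrt(n) * largest


n = 16
spikes = identity_basis(n)
dct = dct_basis(n)

mu_spike_dct = coherence(spikes, dct)       # close to sqrt(2): nearly incoherent
mu_spike_spike = coherence(spikes, spikes)  # sqrt(n): maximally coherent
print(mu_spike_dct, mu_spike_spike)
```

Sampling in the spike domain a signal that is sparse in the DCT domain is therefore close to the most favourable case, whereas sensing and sparsity bases that coincide give the worst case.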

Random sampling
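Let $f \in \mathbb{R}^N$ be a vector of $N$ samples of a discrete-time random process. Consider a general linear measurement process that computes $M < N$ inner products $y_j = \langle f, \phi_j \rangle$, $j = 1, \ldots, M$, between $f$ and a collection of vectors $\{\phi_j\}_{j=1}^{M}$. Let $\Phi$ denote the $M \times N$ matrix with the measurement vectors $\phi_j$ as rows. Then the measurement vector $y \in \mathbb{R}^M$ is given by
$$y = \Phi f = \Phi \Psi x,$$
where $x$ is the coefficient vector of $f$ in the basis $\Psi$. If $\Phi$ is fixed, then the measurements are not adaptive, that is, they do not depend on the structure of the signal [6]. The minimum number of measurements needed to reconstruct the original signal depends on the matrices $\Phi$ and $\Psi$.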
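Theorem 1 [11]. Let $f \in \mathbb{R}^N$ have a discrete coefficient sequence $x$ in the basis $\Psi$. Let $x$ be $S$-sparse. Select $M$ measurements in the $\Phi$ domain uniformly at random. Then if
$$M \geq C \cdot \mu^2(\Phi, \Psi) \cdot S \cdot \log N$$
for some positive constant $C$, where $\mu(\Phi, \Psi)$ is the coherence between the two bases, then with high probability, $f$ can be reconstructed using the following convex optimization program:
$$\min_{\tilde{x} \in \mathbb{R}^N} \|\tilde{x}\|_1 \quad \text{subject to} \quad y_k = \langle \phi_k, \Psi \tilde{x} \rangle \quad \forall k \in J,$$
where $J$ denotes the index set of the randomly chosen measurements.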
This is an important result and provides the requirement for successful reconstruction. It has the following three implications [10]:
i. The role of the coherence in the above condition is transparent: the smaller the coherence between the sensing and basis matrices, the fewer the number of measurements needed.
ii. It provides support that there will be no information loss by measuring any set of $M$ coefficients, which may be far fewer than the original signal size.
iii. The signal can be exactly recovered without assuming any knowledge about the nonzero coordinates of $x$ or their amplitudes.
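As a rough illustration of how the measurement bound of Theorem 1 scales, written in its usual form $M \geq C\,\mu^2(\Phi,\Psi)\,S\,\log N$, consider the hypothetical values $N = 1024$ and $S = 20$, and take the logarithm to be natural (an assumption; the constant $C$ is not specified here, so only the scaling is meaningful):

$$\mu = 1:\quad M \geq C \cdot 1 \cdot 20 \cdot \ln 1024 \approx 139\,C, \qquad \mu = \sqrt{2}:\quad M \geq C \cdot 2 \cdot 20 \cdot \ln 1024 \approx 277\,C.$$

Doubling $\mu^2$ doubles the required number of measurements, while the dependence on the signal length $N$ is only logarithmic.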

CS Reconstruction
The reconstruction problem in CS involves taking the $M$ measurements $y$ to reconstruct the length-$N$ signal $f$ that is $S$-sparse, given the random measurement matrix $\Phi$ and the basis matrix $\Psi$. Since $M < N$, this is an ill-conditioned problem. The classical approach to solving ill-conditioned problems of this kind is to minimize the $\ell_2$ norm. The general problem is given by
$$\hat{x} = \arg\min_{\tilde{x}} \|\tilde{x}\|_2 \quad \text{subject to} \quad \Phi \Psi \tilde{x} = y,$$
where $\tilde{x}$ ranges over candidate coefficient vectors of the signal in the basis $\Psi$. However, this minimization will almost never return an $S$-sparse solution; instead, it typically produces a non-sparse solution [6]. The reason is that the $\ell_2$ norm measures the energy of the signal, and signal sparsity properties cannot be incorporated in this measure.
The $\ell_0$ norm counts the number of non-zero entries and therefore allows us to specify the sparsity requirement. The optimization problem using this norm can be stated as
$$\hat{x} = \arg\min_{\tilde{x}} \|\tilde{x}\|_0 \quad \text{subject to} \quad \Phi \Psi \tilde{x} = y.$$
There is a high probability of obtaining a solution using only $M = S + 1$ i.i.d. Gaussian measurements [10]. However, the solution produced is numerically unstable [6]. It turns out that optimization based on the $\ell_1$ norm is able to exactly recover sparse signals with high probability using only $M \geq cS \log(N/S)$ i.i.d. Gaussian measurements [4,5]. The convex optimization problem is given by
$$\hat{x} = \arg\min_{\tilde{x}} \|\tilde{x}\|_1 \quad \text{subject to} \quad \Phi \Psi \tilde{x} = y,$$
which can be reduced to a linear program. Algorithms based on Basis Pursuit [12] can be used to solve this problem with a computational complexity of $O(N^3)$ [4].
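The recovery of a sparse vector from fewer measurements than unknowns can be demonstrated with a small experiment. Note that the sketch below does not implement Basis Pursuit; it swaps in orthogonal matching pursuit (OMP), a greedy method that fits in a few lines of dependency-free Python. It also assumes that the sparsifying basis is the identity, so that the Gaussian matrix `A` acts directly on the sparse vector. The sizes, the seed, the non-zero positions and all function names are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_linear_system(G, b):
    """Solve the small square system G c = b (Gaussian elimination, partial pivoting)."""
    k = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(G, b)]
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(aug[r][p]))
        aug[p], aug[pivot] = aug[pivot], aug[p]
        for r in range(p + 1, k):
            factor = aug[r][p] / aug[p][p]
            for c in range(p, k + 1):
                aug[r][c] -= factor * aug[p][c]
    coef = [0.0] * k
    for p in reversed(range(k)):
        tail = sum(aug[p][q] * coef[q] for q in range(p + 1, k))
        coef[p] = (aug[p][k] - tail) / aug[p][p]
    return coef


def omp(A, y, max_atoms, tol=1e-9):
    """Orthogonal matching pursuit: greedily build a sparse x with A x close to y."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    support, coef, residual = [], [], list(y)
    while len(support) < max_atoms and math.sqrt(dot(residual, residual)) > tol:
        # Pick the unused column most correlated with the current residual.
        candidates = [j for j in range(n) if j not in support]
        support.append(max(candidates, key=lambda j: abs(dot(cols[j], residual))))
        # Least-squares fit of y on the selected columns (normal equations).
        chosen = [cols[j] for j in support]
        gram = [[dot(u, v) for v in chosen] for u in chosen]
        coef = solve_linear_system(gram, [dot(u, y) for u in chosen])
        residual = [y[i] - sum(c * col[i] for c, col in zip(coef, chosen))
                    for i in range(m)]
    x = [0.0] * n
    for c, j in zip(coef, support):
        x[j] = c
    return x


rng = random.Random(0)
n, m = 128, 64                       # signal length and number of measurements
true_x = [0.0] * n
for index, value in [(5, 3.0), (40, -3.0), (99, 3.0)]:
    true_x[index] = value            # a 3-sparse signal

A = [[rng.gauss(0.0, 1.0) / math.sqrt(m) for _ in range(n)] for _ in range(m)]
y = [dot(row, true_x) for row in A]  # m < n random measurements

x_hat = omp(A, y, max_atoms=6)
max_err = max(abs(a - b) for a, b in zip(true_x, x_hat))
print("largest coefficient error:", max_err)
```

OMP is allowed a few more atoms than the true sparsity so that an occasional wrong pick can be corrected by the least-squares step; it stops early once the residual vanishes.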

Compressed Video Sensing (CVS)
Research into the use of CS in video applications has only started very recently. We shall now briefly review what has been reported in the open literature.
The first use of CS in video processing was proposed in [13]. The approach is based on the single pixel camera [14]. The camera architecture employs a digital micromirror array to optically compute linear projections of an image onto pseudo-random binary patterns, so that random projections are acquired directly. The authors assume that the image changes slowly enough across the sequence of snapshots that constitutes one frame. The video sequence is acquired using a total of M measurements, which are either 2D or 3D random measurements. For 2D frame-by-frame reconstruction, 2D wavelets are used as the sparsity-inducing basis. For 3D joint reconstruction, 3D wavelets are used. The Matching Pursuit algorithm [15] is used for reconstruction.
Another implementation of CS video coding is proposed in [16]. In this implementation, each video frame is classified as a reference or non-reference frame. A reference frame (or key frame) is sampled in the conventional manner while non-reference frames are sampled by CS techniques. The sampled reference frame is divided into non-overlapping blocks, each of size $n \times n = N$ pixels, to which the discrete cosine transform (DCT) is applied. A compressed sensing test is applied to the DCT coefficients of each block to identify the sparse blocks in the non-reference frame. This test basically involves comparing the number of significant DCT coefficients against a threshold $T$. If the number of significant coefficients is small, then the block concerned is a candidate for CS to be applied. The sparse blocks are compressively sampled using an i.i.d. Gaussian measurement matrix and an inverse DCT sensing matrix.
The remaining blocks are sampled in the traditional way. A block diagram of the encoder is shown in Figure 1.
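The block-level sparsity test described above can be sketched as follows. This is a simplified reading of the text, not the exact rule of [16]: it counts the DCT coefficients whose magnitude reaches a significance level and declares the block sparse when that count stays below a limit. The parameter names `significance` and `max_significant`, the comparison directions and the coefficient values are all hypothetical.

```python
def count_significant(dct_coeffs, significance):
    """Number of DCT coefficients whose magnitude is at least `significance`."""
    return sum(1 for c in dct_coeffs if abs(c) >= significance)


def is_sparse_block(dct_coeffs, significance, max_significant):
    """A block is a candidate for compressive sampling if it has few significant coefficients."""
    return count_significant(dct_coeffs, significance) < max_significant


# Hypothetical DCT coefficients of two 4 x 4 blocks, flattened to 16 values.
blocks = {
    "flat": [120.0, 0.4, -0.2, 0.1] + [0.0] * 12,
    "textured": [90.0, 35.0, -28.0, 17.0, 12.0, -9.0, 8.0, 6.5] + [0.0] * 8,
}

decisions = {name: is_sparse_block(coeffs, significance=5.0, max_significant=4)
             for name, coeffs in blocks.items()}
print(decisions)
```

Under these assumed settings the flat block would be compressively sampled in the non-reference frames, while the textured block would be sampled conventionally.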
Signal recovery is performed by the OMP algorithm [17]. In reconstructing compressively sampled blocks, all sampled coefficients with an absolute value less than some constant $C$ are set to zero. Theoretically, if there are $N - S$ non-significant DCT coefficients in a block of $N$ pixels, then at least $M = S + 1$ samples are needed for signal reconstruction [10]. Therefore the threshold $T$ of the sparsity test is set to $T < N - M$. The choice of values for $M$, $T$, and $C$ depends on the video sequence and the size of the blocks. The authors have shown experimentally that savings of up to 50% in video acquisition are possible with good reconstruction quality.
Another technique, which uses motion estimation and compensation at the decoder, is presented in [18]. At the encoder, only random CS measurements are taken, independently from each frame, with no additional compression. A multi-scale framework is proposed for reconstruction which iterates between motion estimation and sparsity-based reconstruction of the frames. It is built around the LIMAT method for standard video compression [19].
LIMAT [19] uses second-generation wavelets to build a fully invertible transform. To incorporate temporal redundancy, LIMAT adaptively applies motion-compensated lifting steps. Let the $k$-th frame of the $n$-frame video sequence be denoted by $x_k$, where $k \in \{1, 2, \ldots, n\}$. The lifting transform partitions the video into even frames $\{x_{2k}\}$ and odd frames $\{x_{2k+1}\}$ and attempts to predict the odd frames from the even ones using a forward motion compensation operator $F$. Suppose $x_{2k}$ and $x_{2k+1}$ differ by a 3-pixel shift that is captured precisely by a motion vector $v_f$; then the odd frame is given by $x_{2k+1} = F(x_{2k}, v_f)$ exactly.
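The 3-pixel-shift example can be checked numerically. The sketch below is a toy one-dimensional stand-in, not the LIMAT transform itself: frames are short lists of pixel values, and the hypothetical function `motion_compensate` plays the role of the forward motion compensation operator by circularly shifting a frame.

```python
def motion_compensate(frame, shift):
    """Toy forward motion compensation: circularly shift a 1-D frame right by `shift` pixels."""
    n = len(frame)
    return [frame[(i - shift) % n] for i in range(n)]


even_frame = [0, 0, 5, 9, 5, 0, 0, 0, 0, 0]
# The next (odd) frame shows the same object moved 3 pixels to the right.
odd_frame = motion_compensate(even_frame, 3)

# Prediction residual with the correct motion vector: nothing left to encode.
residual_with_mc = [o - p for o, p in zip(odd_frame, motion_compensate(even_frame, 3))]

# Plain frame difference without motion compensation: the object appears twice.
residual_without_mc = [o - e for o, e in zip(odd_frame, even_frame)]

print(residual_with_mc)
print(residual_without_mc)
```

With the correct motion vector the residual is identically zero, whereas the plain frame difference has six non-zero entries, which illustrates why motion-compensated prediction yields a sparser signal to reconstruct.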

www.intechopen.com
The algorithm proposed in [18] uses block matching (BM) to estimate motion between a pair of frames. The BM algorithm divides the reference frame into non-overlapping blocks. For each block in the reference frame, the most similar block of equal size in the destination frame is found and its relative location is stored as a motion vector. This improves on previous approaches such as [13], where the reconstruction of a frame depends only on the sparsity of the individual frame without taking into account any temporal motion. It is also better than using the inter-frame difference [20], which is insufficient for removing temporal redundancies.
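A minimal exhaustive-search version of block matching is sketched below. The similarity measure (sum of absolute differences), the search range, the tie-breaking rule and all names are assumptions made for illustration; [18] may use a different criterion or search strategy.

```python
def sad(ref, dst, r0, c0, r1, c1, size):
    """Sum of absolute differences between a block of ref and a block of dst."""
    return sum(abs(ref[r0 + i][c0 + j] - dst[r1 + i][c1 + j])
               for i in range(size) for j in range(size))


def block_matching(ref, dst, size, search):
    """Map each block origin in ref to the (row, col) shift of its best match in dst."""
    rows, cols = len(ref), len(ref[0])
    vectors = {}
    for r0 in range(0, rows - size + 1, size):
        for c0 in range(0, cols - size + 1, size):
            best = None
            for dr in range(-search, search + 1):
                for dc in range(-search, search + 1):
                    r1, c1 = r0 + dr, c0 + dc
                    if not (0 <= r1 <= rows - size and 0 <= c1 <= cols - size):
                        continue  # candidate block would leave the frame
                    # Prefer the lowest cost; break ties with the shortest vector.
                    key = (sad(ref, dst, r0, c0, r1, c1, size), abs(dr) + abs(dc))
                    if best is None or key < best[0]:
                        best = (key, (dr, dc))
            vectors[(r0, c0)] = best[1]
    return vectors


def blank(rows, cols):
    return [[0] * cols for _ in range(rows)]


ref = blank(8, 8)
dst = blank(8, 8)
pattern = {(1, 1): 9, (1, 2): 7, (2, 1): 5, (2, 2): 3}
for (r, c), value in pattern.items():
    ref[r][c] = value
    dst[r + 1][c + 2] = value  # same object, moved 1 row down and 2 columns right

vectors = block_matching(ref, dst, size=4, search=2)
print(vectors[(0, 0)])
```

The block containing the object is assigned the motion vector (1, 2), while an empty block that stays empty is assigned the zero vector.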

Distributed Compressed Video Sensing (DCVS)
Another video coding approach that makes use of CS is based on the distributed source coding theory of Slepian and Wolf [21], and Wyner and Ziv [22]. Source statistics are exploited, partially or totally, only at the decoder, not at the encoder as is done conventionally. Two or more statistically dependent data sources are encoded by independent encoders. Each encoder sends a separate bit stream to a common decoder, which decodes all incoming bit streams jointly, exploiting the statistical dependencies between them.
In [23], a framework called Distributed Compressed Video Sensing (DISCOS) is introduced. Video frames are divided into key frames and non-key frames at the encoder. A video sequence consists of several groups of pictures (GOPs), where a GOP consists of a key frame followed by some non-key frames. Key frames are coded using conventional MPEG intra-coding. Every non-key frame is compressively sampled both block-wise and frame-wise using structurally random matrices [25]. In this way, the more efficient frame-based measurements are supplemented by block-based measurements to take advantage of temporal block motion.
At the decoder, key frames are decoded using a conventional MPEG decoder. For the decoding of non-key frames, the block-based measurements of a CS frame, along with the two neighboring key frames, are used for generating a sparsity-constrained block prediction. The temporal correlation between frames is efficiently exploited through the inter-frame sparsity model, which assumes that a block can be sparsely represented by a linear combination of a few temporally neighboring blocks. This prediction scheme is more powerful than conventional block matching as it enables a block to be adaptively predicted from an optimal number of neighboring blocks, given its compressed measurements. The block-based prediction frame is then used as the side information (SI) to recover the input frame from its measurements. The measurement vector of the prediction frame is subtracted from that of the input frame to form a new measurement vector of the prediction error, which is sparse if the prediction is sufficiently accurate. Thus, the prediction error can be faithfully recovered. The reconstructed frame is then simply the sum of the prediction error and the prediction frame.
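The subtraction step works because the measurement process is linear: the difference between the measurements of the input frame and those of the prediction frame equals the measurements of the prediction error. The sketch below checks this on a toy example. The random ±1 measurement matrix, the frame values and all names are hypothetical stand-ins (the scheme in [23] uses structurally random matrices), and the sparse recovery of the error itself is not shown.

```python
import random


def measure(phi, x):
    """Linear measurements: one inner product per row of the measurement matrix."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]


rng = random.Random(1)
n, m = 16, 8
phi = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(m)]

frame = [float(i % 5) for i in range(n)]   # hypothetical input frame (as a vector)
prediction = frame[:]                      # side information from neighboring frames
prediction[3] += 2.0                       # ...which is wrong at two pixels only
prediction[10] -= 1.0

error = [f - p for f, p in zip(frame, prediction)]
nonzeros = sum(1 for e in error if e != 0.0)

y_frame = measure(phi, frame)              # what the encoder actually transmits
y_error = [a - b for a, b in zip(y_frame, measure(phi, prediction))]

# By linearity, y_error is the measurement vector of the sparse prediction error.
mismatch = max(abs(a - b) for a, b in zip(y_error, measure(phi, error)))
print(nonzeros, mismatch)

# Once a sparse solver has recovered `error` from y_error, the frame follows as:
reconstructed = [p + e for p, e in zip(prediction, error)]
```

The better the prediction, the fewer non-zero entries the error has, and the fewer measurements are needed to recover it.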
Another DCVS scheme is proposed in [24]. The main difference from [23] is that both key and non-key frames are compressively sampled and no conventional MPEG/H.26x codec is required. However, key frames have a higher measurement rate than non-key frames.
The measurement matrix Φ is the scrambled block Hadamard ensemble (SBHE) matrix [28]. SBHE is essentially a partial block Hadamard transform, followed by a random permutation of its columns. It provides near-optimal performance, fast computation, and memory efficiency. It outperforms several existing measurement matrices, including the i.i.d. Gaussian matrix and the binary sparse matrix [28]. The sparsifying matrix used is derived from the discrete wavelet transform (DWT) basis.
At the decoder, the key frames are reconstructed using the standard Gradient Projection for Sparse Reconstruction (GPSR) algorithm. For the non-key frames, in order to compensate for the lower measurement rates, side information is first generated to aid the reconstruction. Side information can be generated by motion-compensated interpolation from neighboring key frames. In order to incorporate side information, GPSR is modified with a special initialization procedure and additional stopping criteria (see Figure 3). The modified GPSR has been shown to converge faster and to give better reconstructed video quality than the original GPSR, two-step iterative shrinkage/thresholding (TwIST) [29], and orthogonal matching pursuit (OMP) [30].

Dictionary based compressed video sensing
In dictionary based techniques, a dictionary (basis) is created at the decoder from neighbouring frames for successful reconstruction of CS frames.
A dictionary based distributed approach to CVS is reported in [32]. Video frames are divided into key frames and non-key frames. Key frames are encoded and decoded using conventional MPEG/H.264 techniques. Non-key frames are divided into non-overlapping blocks of pixels. Each block is then compressively sampled and quantized. At the decoder, key frames are MPEG/H.264 decoded while the non-key frames are dequantized and recovered using a CS reconstruction algorithm with the aid of a dictionary. The dictionary is constructed from the decoded key frame. The architecture of this system is shown in Figure 4.
Two different coding modes are defined. The first one is called the SKIP mode. This mode is used when a block in the current non-key frame does not change much from the co-located block in the decoded key frame. Such a block is skipped for decoding. This comes at the cost of increased complexity at the encoder, since the encoder has to estimate the mean squared error (MSE) between the decoded key frame block and the current CS frame block. If the MSE is smaller than some threshold, the decoded block is simply copied to the current frame and hence the decoding complexity is minimal. The other coding mode is called the SINGLE mode. The CS measurements for a block are compared with the CS measurements in a dictionary using the MSE criterion. If the MSE is below some pre-determined threshold, then the block is marked as a decoded block. The dictionary is created from a set of spatially neighboring blocks of previously decoded neighboring key frames. A feedback channel is used to inform the encoder that this block has been decoded and no more measurements are required. For blocks that are not encoded by either SKIP or SINGLE mode, normal CS reconstruction is performed.
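The SKIP decision amounts to a per-block MSE test at the encoder. A minimal sketch is given below; the threshold value, the block contents and the mode label `"CS"` (standing for "continue with compressive sampling") are hypothetical, and the SINGLE mode with its feedback channel is not modelled.

```python
def mse(block_a, block_b):
    """Mean squared error between two blocks given as flat lists of pixel values."""
    return sum((a - b) ** 2 for a, b in zip(block_a, block_b)) / len(block_a)


def choose_mode(current_block, key_block, skip_threshold):
    """Return "SKIP" if the block barely differs from the co-located key-frame block."""
    return "SKIP" if mse(current_block, key_block) < skip_threshold else "CS"


key_block = [10, 12, 11, 13]        # co-located block of the decoded key frame
static_block = [10, 12, 12, 13]     # almost unchanged: MSE = 0.25
moving_block = [30, 2, 25, 0]       # changed a lot: MSE = 216.25

modes = [choose_mode(b, key_block, skip_threshold=1.0)
         for b in (static_block, moving_block)]
print(modes)
```

A skipped block costs the decoder only a copy operation, which is why the extra MSE computation at the encoder can pay off for static scenes.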
Another dictionary based approach is presented in [33]. The authors propose the idea of using an adaptive dictionary. A dictionary learned from a set of blocks globally extracted from the previously reconstructed neighboring frames, together with the side information generated from them, is used as the basis of each block in a frame. In their encoder, frames are divided into key frames and CS frames. For key frames, frame-based CS measurements are taken, and for CS frames, block-based CS measurements are taken. At the decoder, the reconstruction of a frame or a block can be formulated as an $\ell_1$-minimization problem. It is solved by using the sparse reconstruction by separable approximation (SpaRSA) algorithm [34]. A block diagram of this system is shown in Figure 5.
Adjacent frames in the same scene of a video are similar; therefore a frame can be predicted by its side information, which can be generated by interpolating its neighboring reconstructed frames. At the decoder in [33], for a CS frame $x_t$, its side information $I_t$ can be generated from the motion-compensated interpolation (MCI) of its previous and next reconstructed key frames, $\hat{x}_{t-1}$ and $\hat{x}_{t+1}$, respectively. To learn the dictionary from $\hat{x}_{t-1}$, $I_t$ and $\hat{x}_{t+1}$, training patches are extracted. For each block in the three frames, 9 training patches, comprising the nearest 8 blocks overlapping this block and this block itself, are extracted. After that, the K-SVD algorithm [35] is applied to the training patches to learn the dictionary $D_t \in \mathbb{R}^{n \times K}$, $K > n$. $D_t$ is an overcomplete dictionary containing $K$ atoms. By using the learned dictionary $D_t$, each block in $x_t$ can be sparsely represented by a sparse coefficient vector $\alpha \in \mathbb{R}^{K \times 1}$. This learned dictionary provides a sparser representation for the frame than a fixed basis dictionary. The same authors have extended their work in [36] to dynamic measurement rate allocation by incorporating a feedback channel in their dictionary based distributed video codec.

Summary
CS is a new field and its application to video systems is even more recent. There are many avenues for further research, and thorough quantitative analyses are still lacking. Key encoding strategies adopted so far include:
• Applying CS measurements to all frames (both key frames and non-key frames), as suggested by [24].
• Applying conventional coding schemes (MPEG/H.264) to key frames and acquiring local block-based and global frame-based CS measurements for non-key frames, as suggested in [23,32].
• Splitting frames into non-overlapping blocks of equal size. Reference frames are sampled fully. After sampling, a compressive sampling test is carried out to identify which blocks are sparse [16].
Similarly, key decoding strategies include:
• Reconstructing the key frames by applying CS recovery algorithms such as GPSR, and reconstructing the non-key frames by incorporating side information generated from the recovered key frames [24].
• Decoding key frames using conventional image or video decompression algorithms and performing sparse recovery with decoder side information for prediction error reconstruction. The reconstructed prediction error is added to the block-based prediction frame for final frame reconstruction [23].
• Using a dictionary for decoding [32], where the dictionary is used for comparison and prediction of non-key frames. Similarly, a dictionary can be learned from neighboring frames for reconstruction of non-key frames [33].
These observations suggest that there are many different approaches to encoding videos using CS. In order to achieve a simple encoder design, a conventional MPEG type of encoding process should not be adopted. Otherwise, CS merely adds overhead and there is no point in using it. We believe that the distributed approach, in which every key frame and non-key frame is encoded by CS, is able to utilise CS more effectively. While spatial domain compression is performed by CS, temporal domain compression is not fully exploited since no motion estimation and compensation are performed. Therefore, a simple but effective inter-frame compression scheme will need to be devised. In the distributed approach, this is equivalent to generating effective side information for the non-key frames.