Dual-tree complex wavelet transform and super-resolution based video inpainting: application to object removal and error concealment

Abstract: Video inpainting is a technique that fills in the missing regions or gaps in a video by using its known pixels. The existing video inpainting algorithms are computationally expensive and introduce a seam in the target region that arises due to variation in the brightness or contrast of the patches. To overcome these drawbacks, the authors propose a novel two-stage framework. In the first stage, the wavelet sub-bands of a low-resolution image are obtained using the dual-tree complex wavelet transform. The Criminisi algorithm and an auto-regression technique are then applied to these sub-bands to inpaint the missing regions. Fuzzy logic-based histogram equalisation is used to further enhance the image by preserving the image brightness and improving the local contrast. In the second stage, the image is enhanced using a super-resolution technique. The process of down-sampling, inpainting and subsequently enhancing the video using the super-resolution technique reduces the video inpainting time. The framework is tested on video sequences by comparing and analysing the structural similarity index measure, peak signal-to-noise ratio, visual information fidelity in the pixel domain and execution time against state-of-the-art algorithms. The experimental analysis gives visually pleasing results for object removal and error concealment.


Introduction
Video inpainting is a method to fill in missing regions or gaps and to recover objects in a video. The removal of undesired objects in a video creates missing areas [1][2][3]. The objective of video inpainting is to fill the missing areas so that the results are visually convincing both in time and space. Inpainting techniques are classified into partial differential equation (PDE) based approaches and exemplar-based techniques. PDE-based approaches construct a diffusion PDE that propagates edge information into the required region. Based on the work of Bertalmio et al. [4], Chan and Shen [5,6] proposed PDE-based models and the total variation (TV) model to handle non-texture image inpainting problems. The drawback of these approaches is that blur gets introduced when the target region is large. Granados et al. [7] proposed a homography-registration-based inpainting method that fills large holes in videos. Exemplar-based approaches were proposed by Liu and Caselles [8] and Xia et al. [9] for removing objects from video and filling the gaps by dividing the frame. For the restoration of an entirely damaged object, Ghanbari and Soryani [10] proposed a contour-based method for video inpainting: moving objects are separated, and the damaged regions are then inpainted using a contour-based comparison and a patch-based inpainting method. Tremendous work has been done in the past years on video inpainting, but issues remain when the hole to be filled is large, and video inpainting in general requires high computational time. To address these issues, a super-resolution (SR) based video inpainting framework is proposed. SR is a technique to reconstruct a high-resolution [11][12][13] image from one or more low-resolution (LR) images. Caselles and co-workers [4] developed the first PDE-based inpainting algorithm.
The algorithm iteratively updates the output image by propagating information gained from the surrounding geometric structure. However, this algorithm is not suitable for filling large missing areas. Also, due to the blurring artefacts generated by the diffusion process, the algorithm may cause distractions to the viewers of the inpainting results, and it lacks explicit treatment of the pixels on edges. Later, inspired by Bertalmio's algorithm, Chan and Shen [6] proposed the TV inpainting model, which is closely connected to the classic TV denoising model of Rudin et al. [14]. To overcome this problem, Sun et al. [15] proposed a novel hue-saturation-intensity (HSI) mixed denoising method based on 3D spectral-spatial cross-TV. Amrani et al. [16] proposed a novel PDE-based inpainting algorithm to compress hyperspectral images: the method inpaints the known data in the spatial and spectral dimensions separately and then applies a prediction model to the final inpainting solution to obtain a representation much closer to the original image. Newson et al. [17,18] optimised a global patch-based function and thus made a significant improvement, especially in motion preservation, by incorporating optical flow in several stages of the algorithm. In the paper authored by Huang and Tang [19], when restoring the damaged background, the information is directly replicated into the damaged region from a constructed panoramic background image. When restoring the damaged foreground, the method seeks the best state-matching frame by comparing the running foreground state of the damaged frame and searches for the best matching patch in this frame. The algorithm also improves the way the exemplar patch is searched, the matching-cost principle and the way the confidence term is updated. Newson et al. [20] proposed a method combining the patch-based method and hierarchical pyramid decomposition.
This method uses a multiscale inpainting scheme with the help of k-nearest-neighbour searching and energy-minimisation constraints. However, structural connectivity is not fully preserved. Janardhana Rao et al. [21] recognised that the priority of the patch falls to a low value due to the rapid decrease of the confidence term at lower iterations. To overcome this effect, a regularisation factor is added to the confidence term; the regularisation factor is used to reduce the smoothness of the confidence curve. In recent years, to improve the accuracy of video inpainting, deep learning based image inpainting methods use SR. These methods produce accurate results, but the runtime is higher because these models need to be trained using external training data [22][23][24].
In this paper, a novel video inpainting framework that combines dual-tree complex wavelet transform (DTCWT), auto-regression (AR), fuzzy enhancement and SR is proposed. The process of down-sampling, inpainting and subsequently enhancing the video using SR technique reduces the video inpainting time. Also, the use of AR avoids the discontinuities that arise due to variation in brightness or contrast of patches.
The contributions of this paper are as below:
† Patch priority selection: a framework is proposed to overcome the drawback of the Criminisi et al. [25] algorithm, i.e. its sensitivity to the parameter setting of the inpainting method. To overcome this, the input image is inpainted several times with different configurations.
† Reduction in seam: in the existing video inpainting methods, a seam arises due to changes in the brightness of patches. To overcome this, an AR model is used to select appropriate patches to be copied into the target region.
† Reduction in execution time: existing video inpainting methods are computationally expensive. A modified SR algorithm is proposed to reduce the execution time.

Proposed method
The proposed method is the combination of two sequential operations, as shown in Fig. 1. The first operation fills in the missing regions of the LR image using a Criminisi [25] based inpainting algorithm. The second operation super-resolves the inpainted LR image to get a high-resolution (HR) image. The block diagram is discussed below.

Frame extraction and down-sampling
The frames are extracted from a video, and these frames are down-sampled by a factor of two to get LR images. Inpainting an LR image is not dependent on noise [26], and hence the computational time is low compared to inpainting an HR image. After getting the LR image, the current image is subtracted from the previous image. If the residual is zero or below a threshold, the inpainting process is skipped, as this indicates that the current frame is the same as the previous frame; this step avoids inpainting redundant frames. If the residual is non-zero, the inpainting process is carried out, as discussed in the sections below.
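A minimal sketch of this redundant-frame check, assuming a mean-absolute-difference residual measure and a simple decimating down-sampler (the paper specifies neither the residual measure nor the threshold value; `should_inpaint` and `downsample2` are hypothetical helper names):

```python
import numpy as np

def downsample2(frame):
    """Down-sample by a factor of two by dropping alternate rows and
    columns (a simple stand-in for the paper's unspecified filter)."""
    return frame[::2, ::2]

def should_inpaint(prev_frame, curr_frame, threshold=1e-3):
    """Decide whether the current LR frame needs inpainting.

    If the mean absolute residual between consecutive LR frames falls
    below `threshold` (an assumed cutoff), the frame is treated as
    identical to the previous one and the inpainting step is skipped.
    """
    residual = np.abs(curr_frame.astype(np.float64) -
                      prev_frame.astype(np.float64))
    return residual.mean() > threshold
```

In a full pipeline, a skipped frame would simply reuse the inpainted result of the previous frame.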

Dual-tree complex wavelet transform
Kingsbury [27] proposed that the DTCWT can be used to overcome the disadvantages of the traditional wavelet transform. The complex wavelet transform (CWT) is the complex-valued extension of the standard discrete wavelet transform (DWT). The CWT uses complex-valued filtering that decomposes the real/complex signal into real and imaginary parts in the transform domain; the real and imaginary coefficients are used to compute amplitude and phase information. The DTCWT has separate sub-bands for positive and negative orientations, and it calculates the complex transform of the signal using two separate DWT decompositions.
The DTCWT decomposes the input image into 16 sub-bands. Out of these, four are average sub-bands, which hold the average image content; the other 12 sub-bands contain high-frequency information. Each sub-band contains edge information oriented at various angles; the sub-bands in the DTCWT are oriented at the angles {±15°, ±45°, ±75°}. So, we get the oriented sub-bands A1 to A12. The sub-bands A13 to A16 are discarded as they are the average sub-bands [28] and do not contain any directional detail information.

Inpainting method
Criminisi's inpainting algorithm is applied to the four sub-band images formed after applying the DTCWT. The inpainting approach is as below (Fig. 2):
(i) For every pixel k on the fill front ∂V, compute the patch priority
P(k) = C(k) D(k)
where C(k) is the confidence term, a measure of the amount of reliable information in the neighbourhood of pixel k, and D(k) is the data term.
(ii) The confidence term is computed as
C(k) = ( Σ_{q ∈ ψ_k ∩ (R \ V)} C(q) ) / |ψ_k|
where |ψ_k| is the area of the patch ψ_k centred at pixel k and V is the target region of the image R. This confidence term has high values near the border of the initial mask and decreases towards the centre of V; it thus tends to inpaint first the pixels having the most valid neighbours. The initial conditions of the confidence term are C(q) = 0 for q ∈ V and C(q) = 1 for q ∈ R \ V. After a patch has been filled with pixels, the confidence term of the newly filled pixels is updated as C(q) = C(k′) for all q ∈ ψ_{k′} ∩ V. The term D(k) takes care of the structures arriving at ψ_k. In this method, the structure tensor approach [29] is used to calculate the data term from the colour gradient ∇I_k at k, the unit vector n_k orthogonal to the boundary ∂V at k, and a normalised 2D Gaussian function w centred at k.
(iii) Once these priorities are calculated, the patch ψ_{k′} around the pixel k′ that has maximum priority is considered for filling. Since the pixels on the boundary ∂V of the region to be inpainted get more priority, the selected patch ψ_{k′} will always consist of both known and missing pixels. Hence, we need to use similar patches, called exemplars. To find an exemplar, the patch ψ_{k′} is compared with a patch ψ_l around every pixel l in the image. As the image size increases, the execution time required to find an exemplar also increases. However, similar patches are likely to be found in the nearby region, so instead of searching the whole image, the search is restricted to a large search window W_{k′} of size 90 × 90 (set empirically) around the patch to be filled. This step reduces the number of computations required. The similarity between two patches is measured using the sum of squared differences (SSD); the patch ψ_l that gives the minimum SSD is considered as the exemplar E_{k′}. Due to variation in intensity or contrast, the SSD calculated for E_{k′} may still be high; in such cases, the copied patches will be visible in the inpainted image. To achieve better inpainting, the neighbourhood pixels are estimated using an AR model. The AR parameters express the contribution of the neighbouring pixels towards the centre pixel of every 3 × 3 region in that set.
(iv) The missing pixels in ψ_{k′} are estimated with the help of the exemplar E_{k′} and the AR parameters. The methods of Di Zenzo [29] and Pérez et al. [30] are used to copy the pixels from the source region of an image into the missing region of the same or a different image. Once a patch is processed, its pixels are excluded from the region to be inpainted. The updated missing region is then used in the next iteration, and after each iteration the missing region shrinks. The algorithm terminates when all the missing pixels are filled.
(v) The above steps are repeated for all target pixels in each sub-band at the current level until all the missing pixels are filled.
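The window-restricted SSD exemplar search of step (iii) can be sketched as follows. The 9 × 9 patch and 90 × 90 window sizes follow the paper; `find_exemplar` is a hypothetical helper name, and the AR refinement step is omitted:

```python
import numpy as np

def find_exemplar(image, known, center, patch=9, window=90):
    """Search a window around `center` for the fully known patch that
    minimises the SSD against the known pixels of the target patch.

    `known` is a boolean mask of valid (source) pixels; pixels in the
    target region V are False. Returns the best patch centre and its SSD.
    """
    h = patch // 2
    r, c = center
    tgt = image[r-h:r+h+1, c-h:c+h+1].astype(np.float64)
    m = known[r-h:r+h+1, c-h:c+h+1]        # compare known pixels only
    best, best_ssd = None, np.inf
    r0, r1 = max(h, r - window // 2), min(image.shape[0] - h - 1, r + window // 2)
    c0, c1 = max(h, c - window // 2), min(image.shape[1] - h - 1, c + window // 2)
    for i in range(r0, r1 + 1):
        for j in range(c0, c1 + 1):
            if not known[i-h:i+h+1, j-h:j+h+1].all():
                continue                    # exemplar must be fully known
            cand = image[i-h:i+h+1, j-h:j+h+1].astype(np.float64)
            ssd = np.sum(((cand - tgt) ** 2)[m])
            if ssd < best_ssd:
                best_ssd, best = ssd, (i, j)
    return best, best_ssd
```

Restricting the loop to the window keeps the cost per patch bounded regardless of frame size, which is the point made in step (iii).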

Inverse dual-tree complex wavelet transform (IDTCWT)
To make the proposed method robust, the wavelet images are inpainted with patch sizes of 7 × 7, 9 × 9 and 11 × 11, and also 11 × 11 with the filling order rotated by 180°. By this, four inpainted LR images are obtained. Then the IDTCWT is applied to reconstruct the original image.

Image fusion
The obtained LR images are combined into one LR image using a fusion technique based on the variance calculated in the DCT domain [31]. The DCT coefficients are calculated for image blocks of size 8 × 8 pixels.
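A sketch of this variance-based DCT-domain fusion, assuming a winner-take-all rule per 8 × 8 block (the paper does not detail the combination rule of [31]; `block_activity` and `fuse` are hypothetical names):

```python
import numpy as np
from scipy.fftpack import dct

def block_activity(img, bs=8):
    """Variance of the AC DCT coefficients of each bs x bs block,
    used as the activity measure for fusion."""
    h, w = img.shape
    act = np.zeros((h // bs, w // bs))
    for bi in range(h // bs):
        for bj in range(w // bs):
            blk = img[bi*bs:(bi+1)*bs, bj*bs:(bj+1)*bs].astype(np.float64)
            coeffs = dct(dct(blk.T, norm='ortho').T, norm='ortho')  # 2D DCT
            coeffs[0, 0] = 0.0                # drop the DC term
            act[bi, bj] = coeffs.var()
    return act

def fuse(images, bs=8):
    """For each block position, keep the block from the image whose
    DCT-domain variance is highest."""
    acts = [block_activity(im, bs) for im in images]
    out = np.zeros_like(images[0], dtype=np.float64)
    h, w = images[0].shape
    for bi in range(h // bs):
        for bj in range(w // bs):
            k = int(np.argmax([a[bi, bj] for a in acts]))
            out[bi*bs:(bi+1)*bs, bj*bs:(bj+1)*bs] = \
                images[k][bi*bs:(bi+1)*bs, bj*bs:(bj+1)*bs]
    return out
```

High DCT-domain variance signals blocks with the most detail, so the fused image keeps the best-inpainted block from the four candidate images.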

Fuzzy enhancement
Fuzzy logic-based histogram equalisation [32] is used to enhance the image. This method preserves the image brightness and improves the local contrast of the original image.

Super-resolution
SR is a technique to reconstruct an HR image from LR images. The high-frequency content lost during the image acquisition process has to be recovered by the SR technique. The primary concern of an SR algorithm is to reconstruct HR images from under-sampled LR observations, producing high-quality images from blurred, noisy and degraded inputs.
The proposed iterated back-projection (IBP) based resolution enhancement algorithm is primarily inspired by the recent work of Deshpande et al. [33,34]. The modified IBP algorithm is as below.
(i) The LR fused image L_lr is divided into patches of size 7 × 7.
(ii) Each LR image patch is up-sampled by the up-sampling factor f using the bicubic approach to obtain U_r.
(iii) The iteration index is initialised to i = 1. Lanczos3 low-pass filtering (with σ = 0.5) is applied to remove high-frequency components and avoid the aliasing effect.
(iv) The current HR estimate is blurred and down-sampled to simulate an LR patch, and the reconstruction error E^(i) between the simulated and the observed LR patch is computed. The error is up-sampled and added back (back-projected) into the HR estimate. If E^(i) falls below the maximum allowed error, or the maximum number of iterations is reached, the algorithm stops; otherwise i is incremented and step (iv) is repeated.
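The back-projection loop can be sketched as below; nearest-neighbour up-sampling and block-mean down-sampling stand in for the paper's bicubic and Lanczos3 kernels, and the tolerance value is an assumption:

```python
import numpy as np

def ibp_upscale(lr, f=2, iters=10, tol=1e-4):
    """Iterated back-projection: repeatedly simulate the LR observation
    from the current HR estimate and feed the error back.

    The block mean acts as the low-pass + down-sampling operator and
    np.kron as the up-sampler; a faithful implementation would use the
    bicubic and Lanczos3 kernels named in the text."""
    hr = np.kron(lr, np.ones((f, f)))            # initial up-sampling
    for _ in range(iters):
        sim = hr.reshape(lr.shape[0], f, lr.shape[1], f).mean(axis=(1, 3))
        err = lr - sim                           # LR-domain error E^(i)
        if np.abs(err).max() < tol:              # below maximum allowed error
            break
        hr += np.kron(err, np.ones((f, f)))      # back-project the error
    return hr
```

Each pass forces the down-sampled HR estimate to agree with the observed LR patch, which is the consistency constraint that IBP enforces.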

Results
The proposed work is implemented using MATLAB 2012 on an Intel Core i3 machine with 1.8 GHz processor speed and 4 GB RAM. The performance of the proposed framework is evaluated for two applications: object removal and error concealment. In object removal, the missing area has an arbitrary shape; in error concealment, the missing area is square, rectangular or arbitrary in shape.

Object removal
The performance of the proposed algorithm is tested on 12 video sequences comprising 4700 frames. These video sequences are available in the dataset [35], which is composed of 6 groups of 31 real-world videos with more than 70,000 frames. Reference images, in the form of binary maps, are provided for all these videos to indicate where the changes occur. The input videos used to evaluate the performance of the proposed framework are shown in Table 1. Due to space limitations, five samples of the input and output frames are shown in Fig. 3.
For quality analysis, a new object, as a mask, is inserted into the original frames, and these frames are inpainted using the proposed framework. The quality of the inpainted frames is then analysed using [36][37][38][39] in terms of peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM) and visual information fidelity in the pixel domain (VIFP). The analysis of the influence of the DWT, the non-sub-sampled contourlet transform (NSCT), the non-sub-sampled shearlet transform (NSST) and the DTCWT on the proposed framework is shown in Table 2. From this analysis, the DTCWT performs well compared to the other transforms, so the DTCWT is used in the inpainting framework.
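For reference, the PSNR between a ground-truth frame and an inpainted frame can be computed as follows (this is the standard definition, not code from the paper; SSIM and VIFP require the implementations cited in [36][37][38][39]):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference frame and
    an inpainted frame; `peak` is the maximum pixel value."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')        # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```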

Super-resolution
The SR plays a significant role in reducing the execution time of the framework for inpainting the video sequences. The LR images are divided into small patches of size 7 × 7. This size was decided by analysing the performance of various patch sizes on two video sequences, Lena and Akiyo. The quality analysis for various patch sizes is shown in Fig. 4, and the execution-time analysis, carried out on super-resolved and non-super-resolved images, is shown in Fig. 5. From Figs. 4 and 5, it can be seen that the quality of the inpainted image is slightly better for small patch sizes (3 × 3 and 5 × 5) than for other patches. In this experiment, the 7 × 7 patch size is selected because smaller patches require more execution time while the change in image quality is minimal. Fig. 5 shows that the use of the SR approach reduces the time required to inpaint the frames.

Error concealment
The performance of the proposed framework is further analysed in the context of error concealment. In this test, four videos are considered. The state-of-the-art algorithms of Giryes and Elad [40], Zarif et al. [41], Newson et al. [20], Janardhana Rao et al. [21] and Yang Li et al. [22] are used as the comparison baselines, as shown in Table 3. As mentioned in Table 3, three loss rates are considered, viz. 10, 25 and 45%, for each video sequence; the loss of pixels is introduced manually. From Table 3, it can be seen that as the loss of pixels increases, the quality of the inpainted frame decreases. However, compared to the state-of-the-art algorithms, the proposed framework holds up well as the loss of pixels increases. The average execution time required to inpaint the sample video sequences is given in Table 4.
From Table 4, it can be seen that the execution time depends on the frame resolution and the number of missing pixels: the execution time increases with the number of missing pixels. For the Lena and Akiyo videos, the missing pixels are far fewer than for the Pets and Blizzard videos, and hence the time required to inpaint the Lena and Akiyo videos is lower than for the other video sequences.

Conclusion
A novel video inpainting framework is proposed to inpaint video sequences. The framework consists of inpainting, AR and SR methods. In the proposed work, the inpainting algorithm is applied to sub-bands of images decomposed using the DTCWT. The framework uses the AR model to remove the seam in the target region that arises due to variation in the brightness or contrast of patches. Fuzzy logic-based histogram equalisation is used to further enhance the image by preserving the image brightness and improving the local contrast. The performance of the proposed framework is evaluated for object removal and error concealment applications, and it is compared with existing state-of-the-art algorithms. It can be concluded that by using the proposed approach, the video frames are inpainted without much loss of information and with a minimal amount of execution time.