International Journal of Advanced Robotic Systems
Superresolution Reconstruction of Video Sequence Using a Coarse-to-fine Registration and Optimal Interpolation Strategy
Regular Paper

In this paper, a novel video superresolution (SR) reconstruction approach using a coarse-to-fine registration and optimal interpolation strategy is proposed. Firstly, an initial correspondence between two consecutive frames is identified by scale invariant feature transform (SIFT) features under the homography transformation model. Then, a least squares matching (LSM) method is employed to achieve a high-precision estimation of the homography. Finally, optimal interpolation using an iterative back projection-based deblurring method is designed to obtain a high-resolution grid interpolation and to improve the visual quality of the final SR results. The experimental results demonstrate that the proposed method provides a practical SR approach for engineering applications.


Introduction
In recent years, rapid improvements in video and image processing technologies have taken place. High resolution (HR) images commonly contain more details of the scene, which are significant for analysis and post-processing. However, in practical applications, owing to the limited size of the optical elements and the bandwidth of the communication channel in video surveillance systems, the image resolution often cannot meet our requirements. SR techniques that obtain an HR image from multiple observed low-resolution (LR) images have been studied in various ways. SR generates the HR image by combining multiple LR images with sub-pixel shifts from each other, since each LR image provides similar but different information about the scene.
The major advantage of the SR approach is that flexible, cheaper and existing LR imaging systems can still be utilized, especially in systems where the equipment is already fixed. Some video surveillance systems operate in a 'push-broom' style and provide many overlapped images of the target region; using the SR technique to improve the resolution of the targets can therefore be considered.
Most SR algorithms are restricted to planar scenes under the assumption of a homography or affine transformation model. In other words, image registration with sub-pixel accuracy is a prerequisite for SR algorithms; otherwise, inaccurate registration may severely degrade the reconstructed image quality. The need for high-precision registration raises several difficulties when applying these methods to real video. The first difficulty is the location precision of corresponding points under different views; the second is the lack of an accurate model to describe the transformation between the LR images when the camera moves and the scene varies; the third is restoring the HR image from multiple registered frames that contain a certain amount of error.
In this paper, we aim to extract the valid information from multiple video frames with overlapped regions to the maximum extent, and propose a solution for the non-strict transformation between video frames, such as terrain undulation in satellite video and depth variation in ordinary video. We mainly focus on two specific issues in video sequence SR: on the one hand, we aim to diminish the errors caused by homography estimation; on the other hand, we use an optimal interpolation method to reduce the disturbance of registration errors and noise.
The paper is organized as follows. Section 2 gives a review of the registration, interpolation and restoration algorithms used in the literature on SR. Section 3 presents the proposed method based on coarse-to-fine registration and optimal interpolation. The experiment results and discussions are given in Section 4. Section 5 concludes the paper.

Related work
Surveys of SR methods can be found in much of the literature, such as [1,2]. Most SR methods exploit the high-frequency information of the aliased LR images to reconstruct the HR image. These methods can be classified into two categories: frequency domain approaches and space domain approaches [3]. Tsai and Huang [4] proposed a frequency domain video SR approach, which exploits the shift property of the Fourier transform and develops a set of system equations in the frequency domain that relate the HR video frames to the observed LR video frames. The performance of frequency-domain video SR is usually limited by the global translation model. The space domain approaches can treat complicated motion models and handle the corresponding interpolation, filtering and resampling; their observation model involves global and partial motion, optical blur, intra-frame motion blur, spatially variant point spread functions, etc. Major algorithms of the spatial domain SR include the non-uniform interpolation method [5,6], the probabilistic method [7] and the MAP/POCS combination method [8], etc. Compared with the frequency-domain approaches, the space domain approaches can yield better results because of their high inclusiveness and flexibility.
Precise image registration is a critical procedure of SR. Registration algorithms mainly fall into two groups: region-based and feature-based registration. Region-based registration methods attempt to iteratively estimate the motion by minimizing an error function based on the intensity difference in the overlapping area. Region-based registration methods have the advantage of being able to deal with complex motion with higher accuracy, but the computational cost for the entire image is too high for real-time implementation. Feature-based methods are based on correspondences between points, lines or other geometric primitives. A typical approach is to match extracted Harris corners using a normalized cross-correlation of the local intensity values. Recently there has been great progress in employing invariant features [9,10] for image registration, for instance, Lowe's scale invariant feature transform (SIFT) features [11], which are geometrically invariant under similarity transforms and invariant under affine changes in intensity. These features have proven to be robust and reliable. Feature-based techniques are fast, but fail to provide accurate results when the illumination or viewpoint changes significantly. In remote-sensing image registration, the least squares matching (LSM) [12] method based on an affine transformation is frequently adopted; it conducts calculations using information within an image window and produces satisfactory matching results for approximately planar scenes. Its precision can reach 1/10 pixel or even higher, but this iterative algorithm needs a proper initial estimate to guarantee convergence.
Another major issue regarding SR algorithms is their excessive dependence on accurate registration. If the displacements between images are inaccurately estimated, the SR algorithm may lead to image degradation rather than improvement. Some recent SR algorithms have been devised to improve robustness against such errors. Wang and Qi [13] proposed a robust SR algorithm based on the Kalman filter, in which the registration uncertainties are included in the filter equations through an error estimate of the dynamical model. In [14], Lee and Kang defined a constrained least squares problem with a signal-dependent regularization functional that incorporates the registration errors. Bose and Ahuja [15] used moving least squares to reconstruct the HR image, and proposed a technique to adaptively select the scale and order parameters for estimating the image intensity at each point of the HR grid. Costa [16] made a statistical analysis of the mean square algorithm, which is applied to image sequences with global translation.
Rew [17] used least squares fitting with a polynomial function to enhance illumination uniformity. However, least squares fitting is known to be sensitive to outliers. To deal with this, RANdom SAmple Consensus (RANSAC) was introduced by Fischler and Bolles [18] and remains the most popular robust estimation technique for estimating transformations from datasets containing outliers. This strategy has been widely used in curve and surface fitting, since it is reliable and easy to implement. It operates in a hypothesize-and-verify framework: given a set of coarse correspondences, RANSAC randomly samples a minimal subset to hypothesize a geometric model, and the model is then verified against the remaining correspondences. This process is iterated until a certain criterion is met. The original RANSAC algorithm was proposed for the robust fitting of models with good tolerance to outliers. We extend this method to solve the interpolation problem under the assumption of an intensity distribution model.

The proposed approach
In this paper, we deal with scenes whose distance from the camera is much greater than the motion of the camera between views, so that the parallax effects caused by the three-dimensional character of the scene are negligibly small, as in satellite imagery. We adopt the homography model to describe the transformation between LR frames, and employ an optimal interpolation strategy to tackle the limitations of this unified transformation model. Firstly, an initial correspondence between two consecutive frames is obtained using scale invariant feature transform (SIFT) features under the homography transformation model. Then, a least squares matching (LSM) method is employed to achieve a high-precision estimation of the homography. Finally, optimal interpolation with an iterative back projection-based deblurring method is designed to obtain a high-resolution grid interpolation and to improve the visual quality of the final SR results.
The proposed pipeline contains several processing steps, from registration of the LR images to reconstruction of an HR image. Some pre-processing is also performed to enhance the quality of the video sequence. The basic steps of our algorithm are depicted in Fig.1 and described in more detail in the remaining parts of this section.

Registration
SIFT-based image registration is invariant to translation, rotation, scaling and illumination changes. However, the extrema are detected in a discrete space and may not be the real extrema of the continuous space; this is especially true at the boundaries of objects, where the locations of keypoints shift slightly under different viewpoints. Inspired by region-based image registration techniques, we exploit the local region information around the SIFT points with an iterative method to obtain more accurate match point locations. In real data experiments, we observed that the accuracy of SIFT registration alone is not sufficient for SR; therefore, an LSM-based refinement stage is performed to achieve the required accuracy. Our coarse-to-fine registration procedure is shown in Fig.2.

A. Initial registration and homography estimation
The SIFT algorithm is a popular candidate for extracting interest points that are invariant to translation, rotation, scaling and illumination changes. It first constructs a Gaussian scale-space pyramid from the input image, while also calculating the gradients and difference of Gaussian (DoG) images at these scales. Interest points are detected at the local extrema within the DoG scale space. Once keypoints have been detected at different scales, the image gradients in the local region around each feature point are encoded using orientation histograms and represented as a rotationally invariant feature descriptor. The initial correspondences are obtained by the SIFT detector and matching approach.
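The DoG construction and extremum test described above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the SIFT reference implementation: it builds a single-octave DoG stack with hypothetical sigma values, and omits sub-pixel refinement and the contrast/edge rejection steps.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at about 3 sigma, normalized."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur: convolve rows, then columns."""
    k = gaussian_kernel(sigma)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, tmp)

def dog_pyramid(img, sigmas=(1.0, 1.6, 2.56, 4.1)):
    """Difference-of-Gaussian images at successive (hypothetical) scales."""
    blurred = [blur(img, s) for s in sigmas]
    return [b2 - b1 for b1, b2 in zip(blurred, blurred[1:])]

def is_local_extremum(dog, i, y, x):
    """Test a 3x3x3 neighbourhood across the adjacent DoG scales."""
    val = dog[i][y, x]
    cube = np.stack([d[y-1:y+2, x-1:x+2] for d in dog[i-1:i+2]])
    return val == cube.max() or val == cube.min()
```

In the full SIFT detector, each accepted extremum would then be localized to sub-pixel precision and described by its orientation histograms.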
We assume that the transformation between frames of the image sequence can be approximated by a planar homography. This is appropriate for a freely moving camera viewing a very distant scene. In all cases, it is assumed that the images are obtained by a perspective pin-hole camera.
A point is represented by a homogeneous coordinate, such as x = (x, y, 1)^T. A correspondence between two views is then written as x' = Hx, where the 3x3 matrix H is called the homography matrix. The planar homography has eight degrees of freedom. Each point correspondence generates two linear equations for the elements of H; hence four correspondences are enough to solve for the homography directly. If more than four points are available, a least squares solution can be found by linear methods. RANSAC is used to obtain a robust estimate of H from noisy data. Full details are given in Hartley and Zisserman [19].
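As an illustration of the linear solve mentioned above, a minimal direct linear transform (DLT) sketch in NumPy might look as follows; the RANSAC wrapper and point normalization are omitted, and the function name is our own.

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct linear transform (DLT): estimate the 3x3 homography H
    mapping src[i] -> dst[i] from n >= 4 point correspondences."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # each correspondence contributes two linear equations in the
        # nine entries of H (stacked row-wise into the vector h)
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # h is the right singular vector of A with the smallest singular value
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the free scale so that H[2,2] = 1
```

In practice this solve sits inside a RANSAC loop: minimal samples of four correspondences hypothesize H, and the hypothesis with the most inliers is re-estimated from all its inliers.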

B. Refine homography with least square matching
The least squares matching (LSM) method attempts to match windows of pixels by establishing a correspondence between them and minimizing the difference of their intensity values. Its precision can reach 1/10 pixel or even higher. The LSM method is appropriate for video sequences that have overlapping regions and enough texture information, especially around the SIFT keypoints.
Because the distance of the scene from the camera is relatively large, the transformation between the local regions around the SIFT keypoints can be approximated by an affine transformation, defined as:

x' = a0 + a1*x + a2*y
y' = b0 + b1*x + b2*y

where (x, y), (x', y') are the coordinates in the image pair and a0, a1, a2, b0, b1, b2 are the affine transformation parameters. For satellite video sequences, the time interval and location variation between two frames are small, and the illumination changes slowly and continuously. Considering noise and geometric distortion, the intensity relation between the image pair can be approximately described as:

g1(x, y) + n1(x, y) = g2(x', y') + n2(x', y')

where n1, n2 are random noise and g1, g2 are the intensity values in the image pair.
The LSM method estimates the motion by iteratively modifying the affine transformation and radiometric shift parameters to minimize the intensity difference over the correspondence area. The error equation for each pixel can be described as:

v = g2(a0 + a1*x + a2*y, b0 + b1*x + b2*y) - g1(x, y)

After linearization, the equation becomes:

v = gx*da0 + gx*x*da1 + gx*y*da2 + gy*db0 + gy*x*db1 + gy*y*db2 - Δg

where Δg = g1(x, y) - g2(x, y) is the intensity difference and gx, gy are the partial derivatives of g2.
The parameter vector of the error equation is X = [da0, da1, da2, db0, db1, db2]^T. According to the theory of least squares matching, we minimize the objective function Σ v^2. For all the pixels within the match window, the error equations can be rewritten in the matrix form:

C X = L

where each row of C is [gx, gx*x, gx*y, gy, gy*x, gy*y] and the corresponding element of L is Δg. If n ≥ 6, this is an over-determined system and the solution for X can be computed by conventional linear least squares:

X = (C^T C)^(-1) C^T L
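A single Gauss-Newton update of the six affine parameters, as used in the LSM refinement, could be sketched as below. This is a hedged illustration assuming NumPy: it starts from the identity transform and omits the window resampling, radiometric compensation and iteration loop of a full LSM implementation.

```python
import numpy as np

def lsm_iteration(g1, g2, pts):
    """One least-squares-matching update starting from the identity
    transform. pts is a list of (y, x) pixel coordinates inside the
    match window; returns [da0, da1, da2, db0, db1, db2]."""
    gy, gx = np.gradient(g2)  # image gradients of the search image
    rows, L = [], []
    for y, x in pts:
        # row of C: [gx, gx*x, gx*y, gy, gy*x, gy*y]
        rows.append([gx[y, x], gx[y, x] * x, gx[y, x] * y,
                     gy[y, x], gy[y, x] * x, gy[y, x] * y])
        L.append(g1[y, x] - g2[y, x])  # intensity difference Δg
    C = np.asarray(rows)
    # over-determined system C X = L, solved by linear least squares
    X, *_ = np.linalg.lstsq(C, np.asarray(L), rcond=None)
    return X
```

On a smooth synthetic pair shifted by a known sub-pixel amount, a single update already recovers most of the translation; further iterations refine it.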

RANSAC interpolation
Once the relative motion is estimated with sub-pixel accuracy, interpolation is required because the fractional motions generally do not coincide with the HR grid.
Simple methods, such as zero-order hold, bilinear or least squares fitting interpolation, have been used in the literature under the assumption that the relative shifts are known precisely and registration errors can be ignored. However, in real applications these methods tend to smooth and blur image details, and perform poorly in the presence of noise.
Since we use planar homographies to represent the geometric transformations from each LR image to the reference LR image, the re-projection errors of the pixels in the overlap region differ from point to point. Points with larger re-projection errors are regarded as outliers.
In general, bilinear interpolation and least squares fitting are known to be sensitive to outliers, so it is essential to select the points with higher precision for an accurate and robust interpolation.
Fischler proposed the RANSAC method for robust fitting of models. It is robust in the sense of good tolerance to outliers in the coarse data and capable of interpreting and smoothing data containing a significant percentage of gross errors. The algorithm has been widely applied to parameter estimation problems in computer vision, such as feature extraction and matching. In order to reduce or eliminate the effect of registration errors, now we introduce this idea to our optimal interpolation.
In our application, we can obtain 20 or more frames with overlapped regions. For each location on the HR grid, we can obtain more than one corresponding location in the other frames. For the sake of accuracy, we find the neighbouring integer pixels with the minimum Euclidean distance in each frame. Since the pixel location is tied to the intensity value in the two-dimensional image, a deviation in location can also be regarded as an intensity shift.
We adopt a second order polynomial model to fit the intensity distribution on the HR grid. Some researchers suggest that a higher order polynomial can fit rapid changes more accurately. From the point of view of engineering applications, however, the magnification factor (MF) of the SR technique is less than 2.0. Zhouchen Lin et al. [20] give explicit bounds on the magnification factor and investigate the sufficient number of LR images; they prove that the practical limit is 1.6 if the noise removal and registration are not good enough. For such a small MF, a second order approximation is sufficient for our application. The second order polynomial model can be described as:

f(x, y) = c0 + c1*x + c2*y + c3*x^2 + c4*x*y + c5*y^2

where (x, y) are the pixel coordinates, 1, x, y, x^2, xy, y^2 are the basis functions and f(x, y) is the intensity value. Given m points with m ≥ 6, the equations can be written in matrix form, and the solution for the coefficient vector can be computed by conventional linear least squares.
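The least squares solve for the six polynomial coefficients can be illustrated as follows (a sketch with hypothetical function names, assuming NumPy):

```python
import numpy as np

def fit_quadratic_surface(pts, vals):
    """Least-squares fit of f(x,y) = c0 + c1*x + c2*y + c3*x^2 + c4*x*y + c5*y^2
    to m >= 6 scattered samples: pts[i] = (x, y), vals[i] = intensity."""
    x, y = np.asarray(pts, float).T
    # design matrix over the six basis functions 1, x, y, x^2, xy, y^2
    A = np.column_stack([np.ones_like(x), x, y, x * x, x * y, y * y])
    c, *_ = np.linalg.lstsq(A, np.asarray(vals, float), rcond=None)
    return c

def eval_quadratic(c, x, y):
    """Evaluate the fitted surface at (x, y)."""
    return c[0] + c[1]*x + c[2]*y + c[3]*x*x + c[4]*x*y + c[5]*y*y
```

The fitted surface is then evaluated at the HR grid point to obtain the interpolated intensity.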
We assume that the intensity distribution of the local grid follows this second order polynomial model; the RANSAC method is then employed to optimize the parameters of the model.
The RANSAC interpolation algorithm is summarized in algorithm 1.

Require:
The set of correspondence pixels for current grid point.

Ensure:
The optimal interpolation for the current grid point.
1: for i = 1 to M (M is the required sampling times) do
2:   Select a sample of pixels from the correspondence pixel set;
3:   Estimate the hypothesis parameters based on the sample;
4:   Calculate the error of each correspondence point in all frames with regard to the hypothesis;
5:   Determine the number of inlier pixels consistent with the hypothesis;
6: end for
7: Select the model hypothesis with the largest inlier pixel support, then re-estimate the model parameters using all the inliers.

Algorithm 1 The RANSAC multi-frame interpolation
For the RANSAC multi-frame interpolation, the required number of samples M is related to the hypothesis model and the noise level. In the optimal interpolation procedure, we take four neighbouring points from the reference image, which are treated as control points. Then two or more points are randomly sampled from the other images based on the registration results. This reduces the number of required samples while still leading to robust fitting. Robustness and flexibility in modelling noise characteristics, together with a random sample consensus on the solution, are the major advantages of the optimal interpolation approach.
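Algorithm 1 can be sketched as follows. This is an illustrative NumPy implementation under our own choices of minimal sample size (six points, the minimum that determines the quadratic surface), iteration count and inlier threshold; the control-point weighting described above is omitted for brevity.

```python
import numpy as np

def fit_poly2(pts, vals):
    """Least-squares fit of the second order intensity surface."""
    x, y = np.asarray(pts, float).T
    A = np.column_stack([np.ones_like(x), x, y, x * x, x * y, y * y])
    return np.linalg.lstsq(A, np.asarray(vals, float), rcond=None)[0]

def eval_poly2(c, pts):
    x, y = np.asarray(pts, float).T
    return c[0] + c[1]*x + c[2]*y + c[3]*x*x + c[4]*x*y + c[5]*y*y

def ransac_interpolate(pts, vals, grid_xy, n_iter=50, thresh=2.0, seed=0):
    """Algorithm 1: RANSAC fit of a quadratic intensity surface to the
    correspondence pixels of one HR grid point, then evaluation at that
    point. thresh is the inlier intensity tolerance (grey levels)."""
    rng = np.random.default_rng(seed)
    pts, vals = np.asarray(pts, float), np.asarray(vals, float)
    best_inliers = None
    for _ in range(n_iter):
        idx = rng.choice(len(pts), size=6, replace=False)  # minimal sample
        c = fit_poly2(pts[idx], vals[idx])                 # hypothesis
        err = np.abs(eval_poly2(c, pts) - vals)            # verify
        inliers = err < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # re-estimate the model from all inliers of the best hypothesis
    c = fit_poly2(pts[best_inliers], vals[best_inliers])
    gx, gy = grid_xy
    return eval_poly2(c, np.array([[gx, gy]]))[0]
```

Correspondence pixels whose registration is grossly wrong fall outside the inlier band and therefore never influence the final surface.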

Iterative blind de-convolution
Now we deal with the restoration problem, which aims to remove the blurring and noise from the HR image obtained by the non-uniform interpolation approach.
HR images produced by optimal interpolation on non-uniformly sampled data still contain noise and blurring, so de-convolution restoration is adopted as the final step to estimate the SR image and eliminate the blurring effect for better image quality. The image degradation can be described by convolution with a point-spread function (PSF).
Image restoration is the inverse process of estimating the original image from the degraded image. It is a regular de-convolution process if the PSF is known.
Since the PSF of the video sequence remains unknown, we restore the image by the iterative back projection (IBP) method [21,22] using empirical parameters. This process is repeated iteratively to minimize the energy of the error. The IBP method can be expressed as:

f^(n+1) = f^(n) + h_BP * (g - f^(n) * h_PSF)

where f^(n) is the estimate at iteration n, g is the observed image, * denotes convolution, h_PSF is the blurring kernel and h_BP is the back-projection kernel. It needs to be pointed out that the choice of h_BP plays an important role in the reconstruction. Our algorithm mainly focuses on down-sample degradation, so we take a Gaussian blurring kernel as the PSF.
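A minimal same-resolution IBP deblurring sketch, assuming NumPy, a Gaussian PSF and h_BP set equal to the PSF (one common choice; the paper's empirical parameters are not specified), might look like:

```python
import numpy as np

def gauss_kernel2d(sigma, radius):
    """Truncated 2-D Gaussian kernel, normalized to sum to 1."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def conv2_same(img, k):
    """Plain 2-D convolution with zero padding, 'same' output size."""
    r = k.shape[0] // 2
    pad = np.pad(img, r)
    out = np.zeros(img.shape, float)
    for dy in range(k.shape[0]):
        for dx in range(k.shape[1]):
            out += k[dy, dx] * pad[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def ibp_deblur(g, sigma=1.0, radius=3, n_iter=20):
    """Iterative back projection deblurring:
        f_{n+1} = f_n + h_BP * (g - f_n * h_PSF),
    with h_BP chosen equal to the assumed Gaussian PSF."""
    h = gauss_kernel2d(sigma, radius)
    f = np.asarray(g, float).copy()
    for _ in range(n_iter):
        err = g - conv2_same(f, h)   # residual w.r.t. the observation
        f = f + conv2_same(err, h)   # back-project the residual
    return f
```

Each iteration re-blurs the current estimate, compares it with the observation and back-projects the residual, so the simulated observation converges toward g.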

Experimental results
In order to evaluate the performance of our approach, a series of simulation and real data experiments were carried out.

Simulation experiments
The sub-pixel displacement, shown in Fig.3, is simulated by down-sampling the high-resolution grid into odd and even fields. The registration results obtained by the proposed method are then compared to those obtained by the SIFT matching method, as illustrated in Fig.4. With the proposed method, the motion estimation errors and re-projection errors are reduced significantly, and the registration errors are mainly within 1/10 pixel. Next, the RANSAC interpolation method is compared with the least squares fitting method, as depicted in Fig.5. The difference between the two methods is easily observed under the effect of outliers: the proposed method has good tolerance to the errors caused by registration, and its interpolation is more accurate and robust. Simulation experiments were also used to compare the proposed algorithm with three existing algorithms on four LR Lena images. The peak signal to noise ratio (PSNR) metric is used as the indicator of performance. The visual quality of the HR images and the PSNR metrics are shown in Fig.6 and Table 1.
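For reference, the PSNR metric used in the comparison is conventionally computed as follows (a standard definition, assuming 8-bit images with peak value 255):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between reference and test images."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak**2 / mse)
```

Higher PSNR indicates a reconstruction closer to the ground-truth HR image.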

Real data experiments
The proposed method has been applied to a real scene image sequence and to satellite video (provided by LAPAN TUBSAT), and the HR image is generated from multiple LR frames. Fig. 6 provides a comparison between the SR image and the primitive low resolution image, which was enlarged to the size of the reconstructed image. From Fig. 6(a) and (b) we can see that the depths of the objects are different, so the transformation between two frames cannot strictly be regarded as a homography. Nevertheless, our algorithm achieves a good result over almost the whole image. This verifies that the proposed method can mitigate the shortcomings of the aforementioned homography model, as well as select the most effective information from multiple images to generate an HR image. We then applied the algorithm to a real satellite video; the results are shown in Fig.6(c) and (d). Fig.7(a)-(d) give a comparison of the details in the primitive and reconstructed images: the HR image contains more distinguishable details, such as recognizable objects. Finally, the quality of the reconstructed image and of the interpolation-enlarged image is assessed by entropy, average gradient and edge gradient. The average gradient reflects the minute contrast and texture changes of the image, the information entropy reflects the richness of the image information, and the edge gradient is based on Wang Xing's method [23], which exploits the smoothing effect of blur on edges.

Computational complexity analysis
The procedures of SIFT initial matching, RANSAC interpolation and IBP restoration tend to be computationally costly. The SIFT extraction, initial matching and IBP restoration perform uniform computations across pixels and features, which motivates us to port these algorithms to a massively parallel platform such as graphics processing units (GPU). We adopted Marten Bjorkman's CUDA SIFT, and implemented the IBP algorithm on the GPU to improve computational efficiency.
The LSM refinement and RANSAC interpolation are iterative methods with data-dependent conditional branches. Such branch divergence would sharply decrease GPU efficiency, so we implemented these stages on the CPU.
The overall execution time and the distribution of each stage are shown in Table 3.

Conclusion
This paper proposed a new superresolution method for video sequences. The coarse-to-fine registration gives precise motion estimation for consecutive frames, and the random sample consensus interpolation has good tolerance to the noise caused by sampling as well as to registration errors, especially for sequences with terrain undulation. Experiments on the LAPAN TUBSAT satellite videos demonstrate the promising ability to obtain HR images of the ground scene, with clear boundaries and more details for analysis of the ground surface.
However, this method was designed for scenes that are far from the camera and whose three-dimensional character is negligibly small. The proposed method may suffer from degradation because it cannot inherently handle the perspective distortion caused by depth variation. These problems deserve further investigation.