A Video Deblurring Algorithm Based on Motion Vector and an Encoder-Decoder Network

Camera shake causes motion blur in video. Video deblurring has been studied for years; however, problems such as video frame alignment, frame selection, and frame ambiguity evaluation remain unresolved. We propose a video deblurring algorithm based on the motion vector and an encoder–decoder network. The algorithm consists of four steps. First, the blurry image blocks in a video frame are located using a blurred image quality evaluation algorithm based on a response function of singular values. Second, candidates corresponding to each blurry image block are searched for in the consecutive frames using the motion vector, and the optimal candidate blocks are selected with an objective function. Third, the blurry image block and its optimal candidate blocks are fed as samples to an encoder–decoder network, which repairs the blurry image block. Finally, all blurry image blocks are replaced with the repaired ones and the boundary artifacts are eliminated, so that the entire video frame is repaired. Experiments show that our algorithm yields sharper repair results and that its overall performance is better than that of related algorithms.


I. INTRODUCTION
Taking videos with digital cameras and smartphones is popular in daily life. Distortions such as defocus and motion blur are inevitable when a video is taken with a mobile phone or camera. Defocus mainly occurs with autofocus cameras at the beginning of recording, or when the camera focus changes and refocusing is needed. Motion blur mainly occurs when there is relative motion between the scene and the camera. These distortions motivate the study of video deblurring algorithms.
The associate editor coordinating the review of this manuscript and approving it for publication was Donatella Darsena.
The process of generating a blurry image can be modeled as follows:
$$y = x * k + n \tag{1}$$
where $y$ represents a blurry image, $x$ represents a sharp image, $k$ represents a blur kernel, $*$ represents the convolution operation, and $n$ represents additive noise. The details lost in a blurry region are usually irrecoverable. In a few studies [1]-[5], the image is reconstructed by deconvolution after estimating a uniform or non-uniform blur kernel. However, different regions in an image have different degrees or types of blur, so the problem remains unsolved if the reconstruction is based on a single blurry image only.
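As an illustration of the degradation model described above (a sharp image convolved with a blur kernel, plus additive noise), the following minimal sketch simulates a blurry image in pure Python. The helper names `convolve2d_same` and `blur`, the zero padding at the borders, and the Gaussian noise model are assumptions made for this example and are not taken from the paper.

```python
import random

def convolve2d_same(image, kernel):
    """2-D convolution with zero padding; output has the same size as `image`."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ch, cw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    # Convolution flips the kernel relative to correlation.
                    rr, cc = r + ch - i, c + cw - j
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out

def blur(image, kernel, noise_sigma=0.0, seed=0):
    """Apply y = x * k + n, with n drawn from a zero-mean Gaussian (assumed)."""
    rng = random.Random(seed)
    y = convolve2d_same(image, kernel)
    return [[v + rng.gauss(0.0, noise_sigma) for v in row] for row in y]

# Horizontal motion-blur kernel of length 3 (weights sum to 1).
k = [[1 / 3, 1 / 3, 1 / 3]]
# A single bright pixel on a dark background.
x = [[0, 0, 0, 0, 0],
     [0, 0, 9, 0, 0],
     [0, 0, 0, 0, 0]]
y = blur(x, k)  # noise_sigma=0.0, so only the convolution acts
```

With noise disabled, the bright pixel is smeared evenly over three horizontal neighbors while the total intensity is preserved, which is the behavior expected of a normalized motion-blur kernel.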
Video deblurring differs from image deblurring. The frames of a video are correlated and continuous, which provides a reference for reconstructing a high-quality video. Video deblurring algorithms fall into two main categories: geometry-based algorithms and deep-learning-based algorithms. A geometry-based algorithm uses the traditional block search method: a blurry region in one frame may have a corresponding sharp region in the previous or the next frame, and this sharp counterpart is used as a reference to replace the blurry region. The main task is to search for the optimal sharp counterpart. Matsushita et al. [6] proposed an effective video deblurring algorithm that uses a homography matrix to align a blurry frame with consecutive sharp frames and replaces a blurry region of the target frame with the sharp region in one of the consecutive frames. The repair is done block by block, and motion transfer and interpolation are then used to improve the sharpness of the entire video frame.
Barnes et al. [7] used the sum of squared differences (SSD) as the matching criterion when searching for blurry blocks in a frame. They later extended the block search algorithm to cross-scale and rotational search [8].
Cho et al. [9] proposed a video deblurring algorithm using patch-based synthesis. The blur kernel of a blurry image block in a video frame is estimated, but it is used only to simulate the degradation from a sharp image block to a blurry one, not to obtain the final deblurred frame by deconvolution. The consecutive video frames are searched for a sharp image block corresponding to the blurry block. Each sharp candidate is convolved with the blur kernel, the result is compared with the original blurry image block, and the candidate with the smallest difference is selected as the sharp image block. Wang et al. [10] proposed a patch-based algorithm to reconstruct a video. Their algorithm defines an objective function that combines sharpness and similarity to find a sharp image block with which to reconstruct the blurry image block, and a Markov random field (MRF) model is used to reconstruct each video frame. These geometry-based algorithms often rely on feature point matching, geometric constraints, and other means in the block search process, so their overall time complexity is very high. Moreover, most of them repair the entire video frame instead of only its blurry regions.
Deep learning has been used in computer vision tasks such as object detection [11], image classification [12]-[15], multimedia analysis [16]-[18], and face treatment [19]-[21]. It has also achieved good results in video deblurring. Kim et al. [22] proposed a spatio-temporal recurrent network for video deblurring. The network uses a deep residual network and adds a dynamic network layer, which handles large-scale ambiguity better than traditional deep learning networks without very high time complexity. Su et al. [23] proposed a deep video deblurring network (DBN). The input to the network is five consecutive video frames whose RGB channels are stacked, and the output is the sharp image corresponding to a blurry video frame. The network has an encoder-decoder structure with connection layers added between the encoder and the decoder. These layers pass the features from the encoder side of the network to the corresponding decoder layers, which speeds up network convergence and helps to generate a sharp video frame. These deep-learning-based algorithms tend to repair the video frame as a whole rather than only its blurry regions. Because a video frame is usually only partially blurred, repairing the entire frame is not an effective strategy. Zhang et al. [29] proposed a DeBLuRring Network (DBLRNet) for spatial-temporal learning that applies 3D convolution to both the spatial and temporal domains. DBLRNet jointly captures the spatial and temporal information encoded in neighboring frames, which directly contributes to improved video deblurring performance.
Inspired by these two categories of deblurring algorithms, we propose a video deblurring algorithm based on the motion vector and an encoder-decoder network. The main contributions of this study are as follows.
(1) Unlike current deep-learning-based video deblurring algorithms, our algorithm performs the deep-learning-based image repair on blocks.
(2) Unlike traditional approaches based on feature point matching and geometric constraints, our algorithm has lower time complexity because the motion vector is used to search for sharp image blocks in a video.
(3) Unlike algorithms that deblur the entire video frame, our algorithm uses a blurred image quality evaluation to repair only the blurry regions of a video frame.
(4) A good deblurring effect is achieved with an encoder-decoder network even when the frames of a video are not precisely aligned.
The rest of the paper is organized as follows: our algorithm is proposed in Section II, the experimental results are analyzed in Section III, and conclusions are given in Section IV.

II. THE PROPOSED METHOD
An effective video deblurring algorithm based on the motion vector and an encoder-decoder network is proposed. The flowchart of the algorithm is shown in Figure 1. The algorithm is divided into four steps. In the first step, the blurry image blocks to be repaired are detected in each frame using a blurred image quality evaluation algorithm based on the response function of singular values. In the second step, the candidates corresponding to each blurry image block are found using the motion vector, and the optimal candidate blocks are selected by an objective function. In the third step, the blurry image block and the optimal candidate blocks are input into an encoder-decoder network to obtain the repaired result. In the fourth step, the boundary artifacts are eliminated to complete the repair of the video.

A. LOCATING BLURRY REGIONS IN A VIDEO
In [24], we proposed a blurred image quality evaluation algorithm based on the response function of singular values. The algorithm can measure the degree of blur not only of a whole image but also of any block inside the image, which provides the technical basis for video deblurring.
Each frame of a video is divided into non-overlapping image blocks of size 128 × 128. The degree of blur of each image block is evaluated using the blurred image quality evaluation algorithm [24]. Blurry and sharp image blocks are distinguished by a threshold T. To determine T, 10 images are randomly selected from the LIVE database [25] and divided into equal-sized image blocks. In total, 52 image blocks are obtained, of which 26 are blurry and 26 are sharp. The blur scores of these blocks are calculated with the algorithm of [24], in which the saliency map blocks are denoted by $\{B^S_{ij}\}$ and the block weights by $\{w_{ij}\}$, where $i \in \{1, 2, 3, \ldots, R\}$, $j \in \{1, 2, 3, \ldots, K\}$ is the SIFT number of the block $B^S_{ij}$, $\beta$ is a constant determined by experiments, and $r$ is the scale factor.
The experimental results are shown in Figure 2. The blur scores of all blurry image blocks are small, and those of all sharp image blocks are large. The black line that separates the scores of the sharp image blocks from those of the blurry image blocks marks the threshold T, which is set to 0.5 in this study.
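The block partition and thresholding described in this subsection can be sketched as follows. The blur metric of [24] is not reproduced here: `score_fn` is a hypothetical placeholder for it, and the scores in the usage lines are invented purely to exercise the code. How the paper treats the rows left over when the frame height is not a multiple of 128 is not stated, so this sketch simply ignores them.

```python
def split_into_blocks(height, width, block=128):
    """Return top-left corners (y, x) of non-overlapping block x block tiles.

    Pixels that do not fill a whole tile at the bottom or right edge are
    skipped (an assumption; the source does not describe this case).
    """
    return [(y, x)
            for y in range(0, height - block + 1, block)
            for x in range(0, width - block + 1, block)]

def find_blurry_blocks(corners, score_fn, threshold=0.5):
    """Keep the blocks whose blur score is below the threshold T.

    In the paper, blurry blocks have smaller scores than sharp blocks,
    and T is set to 0.5.
    """
    return [c for c in corners if score_fn(c) < threshold]

# A 1280 x 720 frame, the frame size used in the experiments.
corners = split_into_blocks(720, 1280)

# Invented scores standing in for the metric of [24]: the two leftmost
# block columns are treated as blurry, the rest as sharp.
fake_scores = {c: (0.2 if c[1] < 256 else 0.8) for c in corners}
blurry = find_blurry_blocks(corners, fake_scores.get)
```

Only the blocks returned by `find_blurry_blocks` would be passed on to the later steps, which is what keeps the repair local to the blurry regions.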

B. SEARCHING FOR THE OPTIMAL CANDIDATE BLOCK ON MOTION VECTOR
Motion vectors (MVs) are widely used in video compression and in video encoding and decoding. Because consecutive frames are consistent and correlated, the displacement between any two consecutive frames is small, so only the displacement of the objects or scenes needs to be stored for a video frame to be recovered, encoded, and decoded. The calculation of an MV is shown in Figure 3, where C is a macroblock in the current frame, R is the best matching macroblock found in a reference frame, and R′ is the mapping of R in the current frame. Assuming that the upper left corner of C is at $(x_c, y_c)$ and the upper left corner of R is at $(x_r, y_r)$, the MV is calculated as in equation (3):
$$MV = (h, v) = (x_r - x_c,\; y_r - y_c) \tag{3}$$
where $h$ represents the horizontal component and $v$ represents the vertical component of the MV.
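To make the definition concrete, the sketch below estimates a motion vector by exhaustive block matching and reports its horizontal and vertical components. It rests on several assumptions: in practice MVs are usually read from the compressed bitstream rather than recomputed, the sum of absolute differences (SAD) is used here as the matching cost, and the sign convention (reference position minus current position) is one common choice. The function names are illustrative only.

```python
def sad(cur, ref, yc, xc, yr, xr, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(cur[yc + i][xc + j] - ref[yr + i][xr + j])
               for i in range(size) for j in range(size))

def motion_vector(cur, ref, yc, xc, size, radius):
    """Search `ref` around (yc, xc) and return (h, v) of the best match."""
    best_cost, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yr, xr = yc + dy, xc + dx
            if not (0 <= yr <= len(ref) - size and 0 <= xr <= len(ref[0]) - size):
                continue  # candidate block would leave the frame
            cost = sad(cur, ref, yc, xc, yr, xr, size)
            if best_cost is None or cost < best_cost:
                # h is the horizontal component, v the vertical component.
                best_cost, best_mv = cost, (xr - xc, yr - yc)
    return best_mv

def make_frame(h, w, y0, x0):
    """Dark frame with a distinctive 2 x 2 pattern at (y0, x0)."""
    f = [[0] * w for _ in range(h)]
    f[y0][x0], f[y0][x0 + 1] = 1, 2
    f[y0 + 1][x0], f[y0 + 1][x0 + 1] = 3, 4
    return f

cur = make_frame(8, 8, 2, 2)   # block C at x=2, y=2 in the current frame
ref = make_frame(8, 8, 3, 4)   # best match R at x=4, y=3 in the reference
mv = motion_vector(cur, ref, yc=2, xc=2, size=2, radius=2)
```

Here the pattern moves 2 pixels to the right and 1 pixel down between the two frames, so the recovered vector is `(2, 1)`.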

2) FINDING THE OPTIMAL CANDIDATE BLOCKS
The candidate blocks found by MV may be wrong if the object motion between consecutive frames is large. The candidate blocks found by MV in the reference frames (k − 2, k − 1, k, k + 1, k + 2) are shown in Figure 5; it is visible to the naked eye that the candidate sharp blocks do not match the blurry block. The search range in the reference frames is therefore expanded to locate the optimal candidate blocks. The candidate block found by MV is taken as the center and expanded by N pixels upward, downward, leftward, and rightward. With N set to 5, the size of the search box is 138 × 138. Starting from the upper left corner of the search box, candidate blocks are searched exhaustively, pixel by pixel, so that 11 × 11 = 121 candidate blocks are found in every reference frame. A set of candidate blocks found in one reference frame is shown in Figure 6, where the yellow box is the candidate block found by MV and the red box is the search box.
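The arithmetic behind the search box can be checked with a short sketch: a 128 × 128 block expanded by N = 5 pixels on each side gives a 138 × 138 box, and sliding the block pixel by pixel inside it gives (2N + 1)² positions. The function names and the clipping of candidates at the frame border are assumptions of this example; the paper does not say how border blocks are handled.

```python
def candidate_offsets(n=5):
    """All (dy, dx) shifts of the block inside the expanded search box."""
    return [(dy, dx) for dy in range(-n, n + 1) for dx in range(-n, n + 1)]

def candidate_corners(y_mv, x_mv, n=5, block=128, height=720, width=1280):
    """Top-left corners of candidate blocks around the MV-predicted block.

    Candidates that would extend beyond the frame are dropped (assumed).
    """
    corners = []
    for dy, dx in candidate_offsets(n):
        y, x = y_mv + dy, x_mv + dx
        if 0 <= y <= height - block and 0 <= x <= width - block:
            corners.append((y, x))
    return corners

search_box_side = 128 + 2 * 5                 # 138
interior = candidate_corners(300, 400)        # block well inside the frame
at_corner = candidate_corners(0, 0)           # block at the frame corner
```

For a block in the interior of the frame all 121 candidates are available, while a block at the frame corner keeps only the shifts that stay inside the frame.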
To find the optimal block in each reference frame, an objective function that combines the structural similarity (SSIM) and the peak signal-to-noise ratio (PSNR) is used:
$$F(p, q) = w_1 \, \mathrm{SSIM}(p, q) + w_2 \, \mathrm{PSNR}(p, q)$$
where $w_1$ and $w_2$ are the weights assigned to SSIM and PSNR, respectively, and $p$ and $q$ are a blurry block and a candidate block, respectively. A larger $F$ indicates a candidate closer to the optimal one. $F$ is calculated for each of the 121 candidate blocks in one reference frame, and the candidate block with the largest $F$ is selected as the optimal candidate block in this reference frame. As shown in Figure 7, four optimal candidate blocks are found for a blurry image block by the objective function.
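A minimal sketch of this selection step is given below, assuming the objective is a weighted sum of SSIM and PSNR with the weights 0.6 and 0.4 reported in the experiments. Two simplifications are made and should be read as assumptions: SSIM is computed once over the whole block instead of over sliding windows, and blocks are represented as flat lists of 8-bit gray values. The source does not say whether PSNR is normalized before weighting; it is used unnormalized here, so it dominates the score.

```python
import math

def psnr(p, q, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length blocks."""
    mse = sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

def global_ssim(p, q, peak=255.0):
    """SSIM computed over the whole block (no sliding window)."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    vp = sum((a - mp) ** 2 for a in p) / n
    vq = sum((b - mq) ** 2 for b in q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q)) / n
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    return ((2 * mp * mq + c1) * (2 * cov + c2)) / \
           ((mp ** 2 + mq ** 2 + c1) * (vp + vq + c2))

def objective(p, q, w1=0.6, w2=0.4):
    """F(p, q) = w1 * SSIM(p, q) + w2 * PSNR(p, q)."""
    return w1 * global_ssim(p, q) + w2 * psnr(p, q)

def best_candidate(p, candidates):
    """Index of the candidate block with the largest objective value."""
    return max(range(len(candidates)), key=lambda i: objective(p, candidates[i]))

p = [10, 50, 90, 130]                       # blurry block (toy values)
candidates = [[12, 48, 95, 120],            # moderately similar
              [10, 51, 90, 129],            # nearly identical
              [200, 0, 30, 60]]             # unrelated content
best = best_candidate(p, candidates)
```

In the full algorithm this selection would run over the 121 candidates of each reference frame; a windowed SSIM from an image-processing library could replace `global_ssim` without changing the structure.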

3) REPAIR OF A SINGLE IMAGE BLOCK
Su et al. [23] proposed an encoder-decoder style neural network, a structure that performs well in image reconstruction [26], [27]. The network structure is shown in Figure 8. To pass the features of each layer in the encoding phase to the corresponding layer in the decoding phase, skip connections are added between the corresponding layers of the two phases [28]. They help to speed up network convergence and produce sharper images, and are marked with dotted arrows in Figure 8. The neural network used in this study has 15 layers, including 3 down-convolutional layers (D1, D2, D3), 3 up-convolutional layers (U1, U2, U3), and 6 flat convolutional layers (F0, F1, ..., F5), marked in blue, orange, and gray, respectively. The down-convolutional layers mainly compress image features and reduce spatial resolution; the up-convolutional layers increase spatial resolution; the flat convolutional layers perform nonlinear mapping while maintaining the image size. The first half of the network is the encoding phase and the second half is the decoding phase. The image features in the encoding phase are passed to the corresponding layers in the decoding phase. For example, the input of F4 is the combination of U1 and F2, and the input of F5 is the combination of U2 and F1. This is the advantage of the network and the key to generating sharp, high-quality images in the decoding phase. The input to the network is 5 consecutive image blocks, and the output is the repaired blurry block. The parameters are shown in Table 1. For example, the input layer of size 15 × H × W represents 5 consecutive video frames, each of size 3 × H × W, where 3 is the number of RGB channels and H and W are the height and width of a video frame. The output layer is a repaired block of size 3 × H × W.
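The 15 × H × W input shape follows from concatenating five 3 × H × W blocks along the channel axis. The sketch below shows this stacking with nested lists; the ordering of the blocks (k − 2, k − 1, k, k + 1, k + 2) and the helper names are assumptions for illustration, and a real implementation would use tensors from a deep learning framework.

```python
def make_block(value, channels=3, height=4, width=4):
    """Toy C x H x W block filled with a constant value."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]

def stack_blocks(blocks):
    """Concatenate C x H x W blocks along the channel axis."""
    stacked = []
    for block in blocks:
        stacked.extend(block)  # append this block's channels in order
    return stacked

def shape(tensor):
    """(channels, height, width) of a nested-list tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))

# Four optimal candidate blocks around the blurry block of frame k.
# Values 0..4 just label the source block: k-2, k-1, k, k+1, k+2.
sample = stack_blocks([make_block(v) for v in range(5)])
```

Channels 6 to 8 of the stacked sample then hold the blurry block of frame k, and the network output has the shape of a single block, 3 × H × W.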
The samples input to the network are shown in Figure 9. Each row is a sample. The blurry image block in the k-th frame of a video is in the center column, and the four optimal candidate blocks found in the reference frames (k − 2, k − 1, k, k + 1, k + 2) are shown in the left and right columns.
The samples are fed to the trained network. The training parameters are described in Section III. The five blurry image blocks in Figure 9(c) are repaired, and the results are shown in Figure 10. The PSNR increases notably after the repair.

C. REPAIR OF AN ENTIRE VIDEO FRAME
The repair process for an individual block can be extended to an entire frame. A frame is divided into blocks, and the blurry image blocks are found and repaired one by one. Two randomly selected video frames are repaired, and the results are shown in Figure 11. The improvement in video quality is visible to the naked eye, and the PSNR increases by 5 to 6 dB on average.
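The frame-level procedure amounts to cutting out each blurry block, repairing it, and pasting the result back at the same position. A minimal sketch follows; `repair_fn` is a hypothetical stand-in for the trained encoder-decoder network, and the toy repair used in the example (adding 1 to every pixel) only makes the effect visible.

```python
def crop_block(frame, y0, x0, size):
    """Copy the size x size block whose top-left corner is (y0, x0)."""
    return [row[x0:x0 + size] for row in frame[y0:y0 + size]]

def paste_block(frame, block, y0, x0):
    """Overwrite part of `frame` in place with `block` at (y0, x0)."""
    for dy, row in enumerate(block):
        frame[y0 + dy][x0:x0 + len(row)] = row

def repair_frame(frame, blurry_corners, repair_fn, size):
    """Return a copy of `frame` in which only the blurry blocks are repaired."""
    out = [row[:] for row in frame]
    for y0, x0 in blurry_corners:
        repaired = repair_fn(crop_block(frame, y0, x0, size))
        paste_block(out, repaired, y0, x0)
    return out

frame = [[0] * 4 for _ in range(4)]
toy_repair = lambda block: [[v + 1 for v in row] for row in block]
result = repair_frame(frame, blurry_corners=[(0, 2)], repair_fn=toy_repair, size=2)
```

Because every block is repaired independently, the seams between repaired and untouched blocks can show discontinuities, which is the artifact that the subsequent MRF-based step is meant to remove.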
On close observation, there are many white stripes on the block borders in Figure 12. They are spatially discontinuous artifacts caused by repairing the blurry blocks independently. Because they may not be obvious to the naked eye, two of them are marked with blue arrows and highlighted with red lines. This problem can be resolved as in [10]: an index carrying time and space information is assigned to each image block, an MRF model is constructed based on the index offset between the blurry block and the candidate blocks, and the artifacts are eliminated using the spatial continuity of the image blocks. The elimination of artifacts using the MRF is shown in Figure 13.

III. EXPERIMENTS
A. EXPERIMENT ENVIRONMENT AND PARAMETERS
1) EXPERIMENT ENVIRONMENT
The computer used in our experiments has an Intel(R) Core(TM) i5-8400 processor, a GTX 1060 graphics adapter, and 8 GB of memory. The experiments are performed with the Torch7 deep learning framework on Linux. We use the video database of [23], which includes 71 videos taken by various devices such as the iPhone 6s, GoPro Hero4, and Canon 7D. Each video has two versions: a sharp version and the corresponding blurry version produced by long exposure. Each video is 3-5 seconds long on average and can be divided into about 100 video frames. The size of a video frame is 1280 × 720.

2) PARAMETERS
The video database includes 71 videos, of which 63 are selected as the training set and the remaining 8 are used as the testing set. To obtain enough training samples, each video frame is cut into image blocks of size 128 × 128. The blocks partly overlap, so that at least 712193 samples are obtained from each video frame, which greatly increases the number of training samples. The activation function in the training phase is ReLU. The learning rate starts at 0.004 for the first 30,000 iterations and is then halved every 10,000 iterations until it reaches $10^{-6}$. GPU acceleration is used in the training process, which takes about 60 hours to complete 100,000 iterations. The parameters $w_1$ and $w_2$ in the objective function are set to 0.6 and 0.4, respectively.
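The learning rate schedule used in training (0.004 for the first 30,000 iterations, then halved every 10,000 iterations down to a floor of 10⁻⁶) can be written as a small function. Whether the first halving takes effect exactly at iteration 30,000 is not stated in the text; this sketch assumes that it does, and the function and parameter names are illustrative.

```python
def learning_rate(iteration, base=0.004, hold=30_000, step=10_000, floor=1e-6):
    """Step-decay schedule: constant for `hold` iterations, then halved
    every `step` iterations, never dropping below `floor`."""
    if iteration < hold:
        return base
    # Assumption: the first halving happens at iteration == hold.
    halvings = (iteration - hold) // step + 1
    return max(base * 0.5 ** halvings, floor)

schedule = {it: learning_rate(it) for it in (0, 29_999, 30_000, 40_000, 99_999)}
```

Under this reading the rate at the end of the 100,000 training iterations is 0.004 / 2⁷, about 3.1 × 10⁻⁵, so the floor of 10⁻⁶ would only be reached in a longer run.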

B. COMPARATIVE EXPERIMENT AND ANALYSIS
1) COMPARISON OF THE REPAIRED VIDEO FRAME
A block-based video deblurring method [9], the efficient DBN video deblurring network [23], and our algorithm are applied to randomly selected video frames to verify the performance of our algorithm. The repaired results are shown in Figure 14. Our algorithm achieves the highest PSNR.
To examine the changes closely, a texture-rich region of each video frame is enlarged and shown in Figure 15. The texture-rich blocks are marked in yellow and shown enlarged next to the corresponding video frame. Our algorithm produces a high-quality repair result with good continuity: there are no discontinuous artifacts between the textures.

2) OVERALL COMPARISON ON THE TEST SET
To verify the robustness of our algorithm, an overall test is performed on the 8 videos of the testing set, with PSNR and SSIM as the two evaluation metrics. For each video, the repair result is measured by the average PSNR and SSIM over all frames of the video. The experimental results are shown in Table 2 and Table 3, where the best-performing algorithm for each video is marked in bold. As the two tables show, our algorithm has the best overall performance except for video 7.

3) TESTS ON OTHER TYPES OF BLUR
The motion blur in the training set is generated only by camera shake. Additional tests are performed on another type of blur, namely motion blur caused by the movement of objects. The test results are shown in Figure 16.

IV. CONCLUSIONS
A video deblurring algorithm based on the motion vector and an encoder-decoder network is proposed to remove the motion blur caused by camera shake. The algorithm can also repair another type of blur, namely blur caused by object movement. In our algorithm, the blur of each image block is evaluated, the matching blocks in consecutive frames are found using the motion vector, and the optimal candidate blocks are determined using an objective function. A blurry block and its optimal candidate blocks in the neighboring frames are input to an encoder-decoder network, which repairs the blurry image block. Our algorithm is fast because the candidate block search is performed using the motion vector and only the blurry regions of a video frame are selected and repaired. Precise alignment between frames is not required because of the improved encoder-decoder neural network. Our algorithm achieves good repair results on a standard video database. Although it works well on videos with motion blur caused by camera shake, it should be further improved for videos with object movement blur or defocus blur.

1) FINDING THE SET OF CANDIDATE BLOCKS USING MV
Motion vectors are used to find the best matching image blocks of a blurry image block in its neighboring frames. As shown in Figure 4, three consecutive video frames are shown in the first row, and the blurry image block in the current frame (Frame k) and the matching sharp image blocks found using MV in the reference frames (Frame k + 1, Frame k + 2) are shown in the second row. A candidate set of the sharp image blocks corresponding to a blurry image block is thus obtained.

FIGURE 4. Finding candidate blocks using MV. (a) A blurry block in frame k. (b) A sharp block in frame k + 1. (c) A sharp block in frame k + 2.

FIGURE 5. Candidate blocks that fail to match. (a) A candidate block in frame k − 2. (b) A candidate block in frame k − 1. (c) A candidate block in frame k. (d) A candidate block in frame k + 1. (e) A candidate block in frame k + 2.

FIGURE 6. Finding candidate blocks in a search box.

FIGURE 8. Neural network structure of encoder-decoder style.

FIGURE 9. Input samples. (a) An optimal candidate block of frame k − 2. (b) An optimal candidate block of frame k − 1. (c) A blurry block of frame k. (d) An optimal candidate block of frame k + 1. (e) An optimal candidate block of frame k + 2.

FIGURE 10. Repair of blurry image blocks.

FIGURE 11. Repair of an entire video frame.

FIGURE 13. The elimination of artifacts.

FIGURE 14. Comparison of the repaired video frame.

FIGURE 16. Repair for another type of blur.

TABLE 1. Parameters in each layer of the neural network.

TABLE 2. Average PSNR of different methods.

TABLE 3. Average SSIM of different methods.