ABDGAN: Arbitrary Time Blur Decomposition Using Critic-Guided TripleGAN

Recent studies have proposed methods for extracting latent sharp frames from a single blurred image. However, these methods still suffer from limitations in restoring satisfactory images. In addition, most existing methods are limited to decomposing a blurred image into sharp frames with a fixed frame rate. To address these problems, we present an Arbitrary Time Blur Decomposition Triple Generative Adversarial Network (ABDGAN) that restores sharp frames with flexible frame rates. Our framework plays a min–max game consisting of a generator, a discriminator, and a time-code predictor. The generator serves as a time-conditional deblurring network, while the discriminator and the time-code predictor provide feedback to the generator on producing realistic, sharp images that depend on the given time code. To provide adequate feedback for the generator, we propose a critic-guided (CG) loss through the collaboration of the discriminator and the time-code predictor. We also propose a pairwise order-consistency (POC) loss to ensure that each pixel in a predicted image consistently corresponds to the same ground-truth frame. Extensive experiments show that our method outperforms previously reported methods in both qualitative and quantitative evaluations. Compared to the best competitor, the proposed ABDGAN improves PSNR, SSIM, and LPIPS on the GoPro test set by 16.67%, 9.16%, and 36.61%, respectively. For the B-Aist++ test set, our method shows improvements of 6.99%, 2.38%, and 17.05% in PSNR, SSIM, and LPIPS, respectively, compared to the best competitive method.


Introduction
Single image deblurring is one of the most classic yet challenging research topics in the field of image restoration; it aims to restore a latent sharp image from a blurred image. Recently, deep-learning-based methods have achieved remarkable success in image deblurring by training models on large-scale synthetic deblurring datasets (e.g., GoPro [1], DVD [2], and REDS [3]). These datasets synthesize blurred images by averaging consecutive sharp frames sampled from videos, based on the premise that motion blur can be seen as the accumulation of movements occurring during the camera exposure duration [1]. Motivated by this, many methods have made significant progress in estimating a sequence of sharp frames from an observed blurred image, a problem also known as the blur decomposition task [4].
Due to the complex and ill-posed nature of blur decomposition [5], existing methods [4-6] face significant challenges. There is substantial room for improvement in generating visually pleasing, high-quality frames. In addition, most of them have been designed to restore a fixed number of frames using supervised learning. This limits the flexibility and applicability of these models, as adjusting the network architecture or training procedure is necessary to produce different numbers of frames. One practical approach for restoring a flexible number of frames is to first extract a fixed number of sharp frames using blur decomposition methods and subsequently apply video interpolation to these frames. However, this approach may not be optimal, since inaccurate blur decomposition can degrade the quality of the interpolated frames.
In this work, we propose an Arbitrary Time Blur Decomposition Using Critic-Guided Triple Generative Adversarial Network (ABDGAN) (this article is based on Chapter 5 of the first author's Ph.D. thesis [7]). This approach restores a sharp frame with an arbitrary time index from a single blurred image, a task we refer to as arbitrary time blur decomposition. One of the main challenges for this problem is the lack of ground-truth (GT) images for every continuous time within the exposure duration in existing synthetic datasets [1]. Recent synthetic deblurring datasets (e.g., GoPro [1]) have used a range of {7, 9, 11, 13} sharp frames to synthesize a blurred image. This means that there are only a limited number of timestamps for each blurred image, with no GT images for all continuous time codes over the exposure time. In this circumstance, when models are trained in a supervised fashion using these datasets, they may not be able to effectively restore images at timestamps that are not present in the training set [8].
As a departure from previous blur decomposition methods that rely on supervised learning, we propose a semisupervised framework. To this end, we adopt the TripleGAN framework [9], which consists of three players: a generator, a discriminator, and a classifier. For the blur decomposition task, we modify the roles of the three players and the objective functions for the generator and the label predictor. Specifically, our ABDGAN plays a min-max game of three players consisting of a time-conditional deblurring network G, a discriminator D, and a label predictor C. Our G takes a pair of a time code and a blurred image as input and restores the corresponding sharp moment occurring within the exposure duration. Meanwhile, D estimates the probability of whether given images are real or fake. Concurrently, C predicts the time code when the blurred image and the latent sharp image are jointly given. Since the training (real) data do not include sharp frames for every continuous time, C is trained not only on real images but also on the sharp images generated by G. However, in our framework, a naïve adoption of TripleGAN [9] often results in unstable training of C due to the distribution discrepancy problem [10] between real and fake images. This arises especially in the early training phase, when the restored images from G may not match the real data distribution well enough, making it difficult for C to correctly predict the time codes for both real and fake images.
To mitigate this, unlike the original TripleGAN [9], which directly uses the fake samples obtained by the generator for training the classifier, our D assists C in filtering out unrealistic fake samples. To this end, we propose a critic-guided (CG) loss, allowing C to train on reliable fake images generated by G based on feedback from the critic D.
On the other hand, recovering the temporal ordering of frames from a single blurred image is a highly ill-posed problem, because motion blur is caused by an averaging process that destroys the temporal ordering of the instant frames [5]. To address the challenge of frame-order recovery, most existing methods [5,6,11,12] utilize the pairwise ordering-invariant (POI) loss [5]. The POI loss is invariant to the order by utilizing the average of two temporally symmetric frames and the absolute value of their difference. This approach enables the network to choose which frame among the symmetric frames to generate during training [5]. As a result, this loss function effectively facilitates stable network convergence by preventing temporal shuffling and ensuring temporal consistency among predicted frames. However, it is suboptimal because pixel-level consistency is not guaranteed, potentially resulting in each pixel of a predicted image corresponding to a different GT frame.
To address this problem, we propose a pairwise order-consistency (POC) loss that alleviates the pixel-level inconsistency inherent in the existing POI loss [5]. Our POC loss shares similarities with the POI loss in that it includes temporally symmetric frames in the loss function. However, it differs in that the POI loss implicitly matches pairs of estimated frames and GT frames to define the loss, while our POC loss explicitly determines these pairs. Specifically, the proposed POC loss first determines whether the temporal order of the predicted sharp images aligns with the GT order or its reverse. This preliminary step enables us to determine which specific GT image each predicted image should be matched against. Following this, we ensure that each pixel in a predicted image consistently matches the corresponding pixel in the same ground-truth frame by rigorously enforcing this correspondence across all pixels.
Figure 1 exemplifies the superiority of our model compared with previous methods [5,13]. Unlike existing methods, our model can restore highly accurate dynamic motion from a blurred image. Moreover, our model can extract sharp sequences at any desired frame rate, while competing methods are constrained to restoring a predetermined number of frames. Our main contributions can be summarized as follows.

• We propose Arbitrary Time Blur Decomposition Using Critic-Guided TripleGAN (ABDGAN), a semisupervised learning approach, to extract an arbitrary sharp moment as a function of a blurred image and a continuous time code.
• We introduce a critic-guided (CG) loss, which addresses the issue of training instability, especially in the early stages, by guiding the label predictor to learn from trustworthy fake images with the assistance of the discriminator.
• We introduce a pairwise order-consistency (POC) loss, designed to guarantee that every pixel in a predicted image consistently matches the corresponding pixel in a specific ground-truth frame.
• Our extensive experiments demonstrate that our method surpasses existing methods in restoring high-quality frames at the GT frame rates and consistently produces superior visual quality at arbitrary time codes.
The remainder of this paper is organized as follows. In Section 2, we review previous works on image deblurring. Section 3 presents the details of our proposed ABDGAN. Section 4 analyzes the experimental results of the proposed method. Finally, in Section 6, we discuss the conclusions and future works.

Related Works
In this section, we provide an overview of single image deblurring and blur decomposition methods utilizing deep learning. In Table 1, we briefly categorize the deblurring methods based on their ability to recover a single middle frame, a fixed number of multiple frames, or an arbitrary number of multiple frames. We also introduce TripleGAN, which is closely related to our proposed approach.

Image Deblurring
In general, image deblurring refers to the restoration of a sharp image from an observed blurred image [28]. Recently, numerous studies in this field have achieved remarkable success based on deep learning. Early methods [14-16] utilized deep learning to estimate the blur kernel. On the other hand, Nah et al. [1] proposed to directly restore the sharp image without an additional blur kernel estimation step and developed a multiscale deblurring network that performs restoration in a coarse-to-fine manner. Motivated by the success of [1], numerous methods, such as the multiscale recurrent model [17], the multipatch hierarchical network [18], and the multi-input-multi-output UNet [19], have been proposed and have achieved promising results. Meanwhile, generative adversarial networks (GANs) have been widely used in image deblurring. Kupyn et al. [20] proposed DeblurGAN, which adopts a conditional GAN for motion deblurring. Extending this, DeblurGAN-v2 [21] uses a relativistic discriminator and a feature pyramid deblurring network for motion blur. DBGAN [22] utilizes GANs to learn both the image deblurring and blurring processes. Furthermore, GAN-based methods have also been extended to video deblurring, such as DBLRGAN [23]. In contrast, our approach diverges from these methods by adapting TripleGAN [9] to the blur decomposition task, which comprises three core networks for adversarial learning. Recently, Kong et al. [24] presented a frequency-domain-based self-attention solver and a discriminative frequency-domain-based feedforward network to enhance deblurring performance. Roheda et al. [25] proposed a network architecture based on higher-order Volterra filters for image and video restoration. Mao et al. [26] proposed an adaptive patch-exiting reversible decoder that maximizes image deblurring performance while maintaining memory efficiency.

Blur Decomposition
Blur decomposition [4] is used to extract a sharp sequence from a blurred image. As a pioneer, Jin et al. [5] proposed generating sharp image sequences by cascading multiple deblurring networks. Instead of using multiple networks, Purohit et al. [11] proposed a two-stage framework based on recurrent networks. Unlike [5,11], which require multiple training stages, Argaw et al. [6] developed an end-to-end trainable framework. To obtain a large number of output frames, Zhang et al. [12] proposed a cascaded structure with three GANs, where each generator is trained to extract seven consecutive frames from an input image. Zhang et al. [13] proposed a motion-offset-estimation-based model, in which the motion offset generation module is trained first and then attached to the deblurring network. Zhong et al. [4] tackled the motion ambiguity problem in blur decomposition by directly conditioning on motion guidance. Despite these efforts, most existing methods require changing the network architecture or retraining when the number of frames changes. Recently, ref. [27] mitigated this shortcoming by using a control factor for face image deblurring. However, this method can only be applied to face images, which limits its ability to generalize to the large and complex motion of natural scenes. In this work, our goal is to recover the sharp moment from a given blurred image at arbitrary, continuous time codes. This is achieved without the need for retraining or changing the network architecture.

Triple Generative Adversarial Networks
GANs [29] have gained significant attention in image synthesis. Among the various extensions of GANs, conditional GANs [30] were developed to perform conditional image synthesis. However, most of these methods rely on supervised learning [9], which requires fully labeled images in the dataset. To learn conditional image synthesis with partially labeled data, TripleGAN [9] employs an additional label predictor as a pseudo-label generator and plays a min-max game of three networks. To address the distribution discrepancy issue, ref. [10] ensembled multiple classification networks and utilized a feature matching loss between the generated and real samples. Inspired by previous works, we extend the application scope of TripleGANs [9,10] to the blur decomposition task. However, a simple application of TripleGAN's method leads to unstable network training due to two main issues: (1) the lack of real images for continuous time codes, and (2) the distribution discrepancy between real images and fake images. Our ABDGAN framework is mainly designed to effectively resolve the above problems.

Proposed Method
Let b ∈ R H×W×3 and t ∈ [0, 1] represent a single blurred image and a temporal index within the normalized exposure time, respectively. Our goal is to restore a specific sharp moment ŝt ∈ R H×W×3 conditioned on b and t. One of the major challenges is to predict the sharp moment ŝt for any continuous time code, particularly when the training data contain very few ground-truth images for continuous time codes. To overcome this, the proposed ABDGAN utilizes semisupervised learning by leveraging both labeled and unlabeled data. Here, the labeled dataset {b, t, s t } is sampled from the real-data distribution p d and explicitly contains ground-truth sharp images s t corresponding to each b and t. In contrast, for the unlabeled data, the set {b, t} ∼ p d , there is no ground-truth sharp image s t . By leveraging both labeled and unlabeled data, the proposed method aims to predict an accurate sharp moment for any continuous time code, despite the scarcity of ground-truth sharp images in the training dataset.
Learning Arbitrary Time Blur Decomposition based on TripleGANs. Inspired by TripleGANs [9,10], which achieved successful results in conditional image synthesis in a semisupervised manner, our proposed ABDGAN introduces a new strategy. It plays a min-max game consisting of three players: a time-conditional deblurring network G, a discriminator D, and a time-code predictor C, as depicted in Figure 2. As mentioned earlier, one of our major goals is to train G to predict any sharp moment corresponding to an arbitrary time code and a blurred image. To achieve this, our D plays the role of providing adversarial feedback for G to restore realistic sharp images. Simultaneously, the main role of C is to provide precise feedback so that G generates an accurate temporal sharp moment corresponding to the input time code among the latent sharp motions within the blurred image.
Concretely, given a pair of b and t ∈ [0, 1], the proposed time-conditional deblurring network G outputs a sharp frame ŝt , written as ŝt = G(b, t). As illustrated in Figure 2, D receives a pair (b, s) as input, where s represents either a real sharp image s t or a restored sharp image ŝt from G. During training, D is trained to predict whether the input comes from the real-data distribution p d (b, s) or the fake-data distribution p g (b, s). Structurally, we exploit a UNet discriminator [31] for D's architecture, which involves an encoder that outputs a per-image critic score D e (•) and a decoder that outputs a per-pixel critic score D d (•). Given a pair (b, s) as input, where s again represents either a real sharp image s t or a restored sharp image ŝt , our C is trained to accurately predict the corresponding temporal code, as depicted in Figure 2. Since C is trained on fake images as well as real images, it can provide adequate feedback to G to ensure that the restored sharp moment aligns accurately with an arbitrary time code. Considering that image restoration is a pixel-by-pixel dense prediction task [32-34], we employ a UNet-based architecture [35] for C to provide per-pixel feedback on t for G. Let the temporal code map t m ∈ R H×W denote a 2-dimensional matrix filled with t, i.e., t m(i,j) = t for every pixel coordinate (i, j). Given an input pair (b, s t ), our C fuses b and s t using channel-wise concatenation and outputs a pixel-wise time-code map tm ∈ R H×W , written as tm = C(b, s t ).
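As a concrete illustration of the temporal code map described above, the following NumPy sketch builds t m by broadcasting a scalar time code to every pixel (a minimal sketch; the H × W shape convention follows the text, and the function name is our own, not the authors' code):

```python
import numpy as np

def time_code_map(t, H, W):
    # Broadcast a scalar time code t in [0, 1] to an H x W matrix,
    # i.e., t_m(i, j) = t for every pixel coordinate (i, j).
    return np.full((H, W), t, dtype=np.float32)

# Example: a 4 x 6 temporal code map for t = 0.25.
t_m = time_code_map(0.25, 4, 6)
```

This constant map is the per-pixel supervision target that C is trained to reproduce from (b, s t ).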

Pairwise-order consistency (POC) loss.
For training our G with the labeled data {b, t, s t } ∼ p d more effectively, we propose the POC loss. Unlike the conventional POI loss [5] employed in previous studies [5,6,11,12], our proposed POC loss offers distinct advantages by enforcing stronger constraints on the temporal order of predicted frames, resulting in significant improvements in the accuracy and visual quality of the predicted frames.
Critic-guided (CG) loss. As mentioned earlier, the distribution discrepancy problem is one of the crucial challenges in training a TripleGAN-based framework [10]. The limited amount of labeled data may not be sufficient for G to effectively learn to restore a sharp moment when the input temporal code is absent from the training data. To overcome this, our C is trained not only with labeled sharp images but also with fake sharp images restored by G from unlabeled data. However, especially in the early training phase, a distribution discrepancy can arise between real and fake images. This poses a challenge for C, which is trained to predict correct time codes for both real and fake images. To address this, we propose the CG loss, which optimizes C using realistic fake images by leveraging the decisions made by D.

Table of notation.
To ensure clarity and consistency, Table 2 shows a concise summary of the notations used throughout this paper. Unless stated otherwise, we maintain consistency in notation.
In the following, we provide explanations of our pairwise-order consistency (POC) loss in Section 3.1 and the critic-guided (CG) loss in Section 3.2. The entire training procedure is described in Section 3.3.
Excerpt from Table 2: tm ∈ R H×W denotes the time-code map predicted by the label predictor C from b and s, defined by tm = C(b, s), where s is either a real or a predicted sharp image; ŝt denotes a sharp frame predicted by the generator G, as ŝt = G(b, t).

Pairwise-Order Consistency Loss
In Appendix A, we describe the limitations of the existing POI loss [5]. Based on this analysis, we introduce our POC loss, which is designed to overcome these shortcomings. Let {b, t, t̄, s t , s t̄ } ∼ p d denote the sampled set from the dataset, where t̄ = 1 − t. That is, (s t , s t̄ ) is a pair of GT symmetric frames with respect to the central frame s t=0.5 . Then, we can obtain ŝt = G(b, t) and ŝt̄ = G(b, t̄). Without loss of generality, t and t̄ satisfy t < t̄. Then, the proposed POC loss L POC G is defined as follows:

L POC G = Φ(ŝt , s t ) + Φ(ŝt̄ , s t̄ ), if Φ(s t , ŝt ) < Φ(s t , ŝt̄ ); otherwise, L POC G = Φ(ŝt , s t̄ ) + Φ(ŝt̄ , s t ), (1)

where Φ(•, •) is the L 1 distance between two images. We use Φ(•, •) to assess whether the temporal order of the predicted sharp images aligns with the GT order or its reverse. If Φ(s t , ŝt ) < Φ(s t , ŝt̄ ), this indicates that the GT frame s t is closer to its corresponding predicted frame ŝt than to the opposite time-symmetric predicted frame ŝt̄ . Consequently, this suggests that ŝt and ŝt̄ are correctly aligned with the GT temporal order of the sharp frames s t and s t̄ , respectively. Based on this, we directly minimize the sum of the individual L 1 distances between each predicted frame and its correct GT frame. Conversely, if Φ(s t , ŝt ) ≥ Φ(s t , ŝt̄ ), it implies that the predicted frames ŝt and ŝt̄ are aligned with the reverse order of the GT frames. In such a case, we minimize the sum of the L 1 distances between ŝt and s t̄ , and between ŝt̄ and s t . Our POC loss marks a departure from the existing POI loss [5] (Equation (A1)): it introduces stricter constraints to ensure that every pixel in the predicted image aligns consistently with the same ground-truth (GT) frame, substantially improving the accuracy and reliability of frame prediction.
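The decision rule above can be summarized in a short NumPy sketch (assuming, as in the text, that Φ is the mean L1 distance between images; the function names are illustrative, not the authors' code):

```python
import numpy as np

def phi(a, b):
    # Phi: mean L1 distance between two images.
    return float(np.abs(a - b).mean())

def poc_loss(s_t, s_tbar, shat_t, shat_tbar):
    # First decide whether the predictions follow the GT temporal
    # order or its reverse, then match each prediction to exactly
    # one GT frame and sum the per-frame L1 distances.
    if phi(s_t, shat_t) < phi(s_t, shat_tbar):
        # Predictions aligned with the GT order.
        return phi(shat_t, s_t) + phi(shat_tbar, s_tbar)
    # Predictions aligned with the reversed order.
    return phi(shat_t, s_tbar) + phi(shat_tbar, s_t)
```

Because the branch is chosen once per pair, every pixel of a prediction is compared against the same GT frame, which is the key difference from the order-invariant POI formulation.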

Critic-Guided Loss
The proposed CG loss trains C with trustworthy fake samples selected via the pixel-wise critic score D d (•), which represents the pixel-wise probability map predicted by D.
The D is trained to predict a probability value close to 1 when the input sample is as realistic as a real image, and close to 0 otherwise [29]. If the output probability of D for a fake image generated by G is 0.5, G generates a sharp image that D cannot distinguish from a real one [29]. Based on this, we consider fake data for which the output of D is greater than 0.5 to be trustworthy (realistic) fake data. Accordingly, the sigmoid-based soft threshold function σ(•) is applied pixel-wise to [D d (•)] (i,j) , as

σ([D d (•)] (i,j) ) = 1 / (1 + exp(−k([D d (•)] (i,j) − x 0 ))),

where (i, j) indicates the pixel coordinate. Here, x 0 = 0.5 is the x value of the middle point of the sigmoid curve, and k = 15 is the steepness of the sigmoid curve. From this, we obtain a weighting mask by applying σ(•) to the outputs of D d . For simplicity, we denote σ([D d (ŝ t ′ )] (i,j) ) and σ([D d (ŝ t̄ ′ )] (i,j) ) as σ t ′ (i,j) and σ t̄ ′ (i,j) for each pixel (i, j), respectively. Given {b, t, t̄, s t , s t̄ } ∼ p d sampled from the dataset and the randomly sampled time codes {t ′ , t̄ ′ } ∼ p t , our CG loss L CG C is defined as follows:

L CG C = σ t ′ ⊙ Φ(C(b, ŝ t ′ ), t ′ m ) + σ t̄ ′ ⊙ Φ(C(b, ŝ t̄ ′ ), t̄ ′ m ), (2)

where Φ(•, •) is the L 1 distance metric between two images, ⊙ denotes element-wise multiplication, and the result is averaged over all pixels. As a result of this collaboration with D, our C is naturally liberated from the problem of distribution discrepancy between real samples and fake samples. Since our C learns with realistic fake samples generated using arbitrary values of t ∈ [0, 1], it overcomes the problem of the limited values of t in the training dataset.
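The soft threshold and the resulting per-pixel weighting can be sketched as follows (a NumPy illustration with x 0 = 0.5 and k = 15 as in the text; the exact aggregation of the CG loss is our assumption, not the authors' code):

```python
import numpy as np

def soft_threshold(d, x0=0.5, k=15.0):
    # Sigmoid gate on a (per-pixel) critic score: scores above
    # x0 = 0.5 (judged realistic by D) map to weights near 1,
    # scores below map to weights near 0; k sets the steepness.
    return 1.0 / (1.0 + np.exp(-k * (d - x0)))

def cg_loss(critic_scores, t_map, t_hat_map):
    # ASSUMED aggregation: per-pixel L1 error on the predicted
    # time-code map, down-weighted wherever D judged the fake
    # image unrealistic.
    w = soft_threshold(critic_scores)
    return float((w * np.abs(t_map - t_hat_map)).mean())
```

With this gating, unrealistic regions of a fake image contribute almost nothing to C's update, which is how the critic "filters out" untrustworthy samples.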

Training Objectives of ABDGAN
Similar to [9,10], the proposed ABDGAN plays a min-max game among the three networks D, C, and G. Algorithm 1 briefly outlines the optimization process of our ABDGAN. In Algorithm 1, M denotes the total number of training pairs of (b, s t ), and f indicates the integer frame index among the total number of sharp frames F. By calculating t = (f − 1)/(F − 1), t represents the temporal code within the normalized exposure time.
Training of the discriminator D. As described in Algorithm 2, the objective of D, L D , is to correctly determine whether the given samples are real or fake:
Algorithm 2 Training of the discriminator D (excerpt)
2: ŝ t ′ ← G(b, t ′ )
3: Compute L D using real samples and fake samples by Equation (3)
4: Update the parameters of D using the gradient of L D by Adam [36]
For this purpose, D is optimized to maximize the log probability of real samples and minimize the log probability of fake samples. Accordingly, L D consists of L D R and L D F , which represent the losses for the real samples and the fake samples, respectively. The L D is defined as follows:

L D = L D R + L D F = −E (b,s t )∼p d [log D(b, s t )] − E (b,t ′ )∼p t [log(1 − D(b, ŝ t ′ ))]. (3)
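The normalization t = (f − 1)/(F − 1) maps integer frame indices onto the unit exposure interval; a one-line sketch (the helper name is hypothetical):

```python
def frame_to_time_code(f, F):
    # Map an integer frame index f in {1, ..., F} to a normalized
    # time code t in [0, 1] over the exposure duration.
    return (f - 1) / (F - 1)

# Example: with F = 7 frames, frame 4 is the temporal center (t = 0.5).
```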

Algorithm 1 Entire training procedure of ABDGAN
Training of the label predictor C. For the labeled data, the loss L L C is defined using the time-code maps predicted from real sharp images:

L L C = Φ( tm , t m ) + Φ( t̄m , t̄ m ), (4)

where tm and t̄m denote C(b, s t ) and C(b, s t̄ ), respectively. For unlabeled data, as described in Section 3.2, our C is trained using the proposed CG loss L CG C (Equation (2)). Both L L C and L CG C are defined depending on the estimated order using G and the labeled data {b, t, t̄, s t , s t̄ }. This allows C to predict a consistent temporal order between time codes with ground-truth images and arbitrary symmetric time codes. Overall, the total loss L C is defined as follows:

L C = L L C + L CG C . (5)

Algorithm 3 Training of the time-code predictor C (excerpt)
4: Compute the weighting mask using D d and the sigmoid function
5: Compute L CG C by Equation (2) and L L C by Equation (4)
6: Compute the total loss L C by Equation (5)
7: Update the parameters of C using the gradient of L C by Adam [36]
Training of the generator G. To restore accurate pixel intensities, L POC G (Equation (1)) is utilized for optimizing G. To encourage G to restore a realistic sharp image according to arbitrary time codes, L C G is defined using the time codes estimated by C. Given the symmetric samples {b, t, t̄, s t , s t̄ } ∼ p d , the pair ( tm , t̄m ) can be obtained by tm = C(b, G(b, t)) and t̄m = C(b, G(b, t̄)). For randomly sampled time codes {t ′ , t̄ ′ } ∼ p t , we can also obtain t′ m = C(b, G(b, t ′ )) and t̄′ m = C(b, G(b, t̄ ′ )). Then, our L C G is formulated as follows:

L C G = Φ( tm , t m ) + Φ( t̄m , t̄ m ) + Φ( t′ m , t ′ m ) + Φ( t̄′ m , t̄ ′ m ). (6)

To guarantee that the generated image is as realistic as the real data, the adversarial loss for G can be defined as follows:

L adv (b, t) = −log D(b, G(b, t)). (7)

Then, we define our adversarial loss L D G using the symmetric pair of time codes {t, t̄} ∼ p d and the randomly sampled codes {t ′ , t̄ ′ } ∼ p t as follows:

L D G = L adv (b, t) + L adv (b, t̄) + L adv (b, t ′ ) + L adv (b, t̄ ′ ). (8)

Based on the above procedure, the entire objective of G, L G , is formulated by the weighted sum of Equations (1), (6), and (8):

L G = λ POC L POC G + λ C L C G + λ D L D G , (9)

where λ POC , λ C , and λ D are balancing weight parameters, which are empirically set to λ POC = 1, λ C = 0.01, and λ D = 0.02. Algorithm 4 shows the training scheme of G. As shown in line (2) of Algorithm 4, given the symmetric samples {b, t, s t , t̄, s t̄ } ∼ p d , the predicted deblurring results are obtained by ŝt = G(b, t) and ŝt̄ = G(b, t̄). For unlabeled time codes (t ′ , t̄ ′ ) ∼ p t , we obtain the predicted deblurring results from G (line (3) of Algorithm 4). Then, the predicted time-code maps are obtained using C, as shown in lines (4) and (5) of Algorithm 4. Our proposed POC loss L POC G is then computed to restore more accurate pixel intensities on the labeled data; L C G allows our G to restore a realistic sharp image according to arbitrary time codes; and the adversarial loss L D G is computed to encourage the generated images to be realistic for both labeled and unlabeled data.
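The overall generator objective, with the balancing weights reported above (λ POC = 1, λ C = 0.01, λ D = 0.02), reduces to a simple weighted sum of the three loss terms; a minimal sketch (function name is ours):

```python
def total_generator_loss(l_poc, l_c, l_d,
                         lam_poc=1.0, lam_c=0.01, lam_d=0.02):
    # Weighted sum of the POC, time-code, and adversarial losses,
    # using the balancing weights reported in the paper.
    return lam_poc * l_poc + lam_c * l_c + lam_d * l_d
```

The small weights on the time-code and adversarial terms keep the pixel-level POC loss dominant during optimization.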

Algorithm 4 Training of the generator G
1: Sample a batch of labeled data {b, t, t̄, s t , s t̄ } ∼ p d of size n, and a batch of unlabeled time codes {t ′ , t̄ ′ } ∼ p t of size n
2: Predict the sharp images on labeled data using G as: ŝt ← G(b, t), ŝt̄ ← G(b, t̄)
3: Predict the sharp images on unlabeled data using G as: ŝ t ′ ← G(b, t ′ ), ŝ t̄ ′ ← G(b, t̄ ′ )
4: Predict the time-code maps on labeled data using C as: tm ← C(b, ŝt ), t̄m ← C(b, ŝt̄ )
5: Predict the time-code maps on unlabeled data using C as: t′ m ← C(b, ŝ t ′ ), t̄′ m ← C(b, ŝ t̄ ′ )
6: Compute L POC G by Equation (1), L C G by Equation (6), and L D G by Equation (8)
7: Compute the total loss L G by Equation (9)
8: Update the parameters of G using the gradient of L G by Adam [36]

Experiments
To evaluate and analyze our method, we performed various experiments. In the following subsections, the experimental setup is first explained, describing the implementation details, evaluation metrics, and datasets. Next, quantitative and qualitative comparisons are provided, demonstrating the superiority of the proposed method over previous competitive methods. Finally, an ablation study highlights the importance of the components of the proposed method.

Experimental Setup
Implementation details. The proposed ABDGAN was implemented using PyTorch 1.7.1 [37] and trained on NVIDIA TITAN RTX GPUs. During training, the batch size n in Algorithm 1 was set to eight. For every iteration, the images were randomly cropped to a spatial size of 256 × 256 × 3, and a random horizontal flip was applied. The learning rates of G, D, and C, denoted as ℓ G , ℓ D , and ℓ C , respectively, in Algorithm 1 were initialized identically to 1 × 10 −4 and decayed exponentially by a factor of 0.99 each epoch. Our ABDGAN was trained for 200 epochs. The Adam optimizer [36] was used with β 1 = 0.9 and β 2 = 0.999. In our ABDGAN, the time-conditional deblurring network G is built on the NAFNet-GoPro-width32 base model [38], as detailed in Appendix B. Notably, we trained all the model parameters from scratch. For the discriminator D, we adopted the UNet discriminator [31] to provide per-pixel and per-image feedback to G during training. The time-code predictor C is also based on the UNet architecture [35] used in BigGAN [39].
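The learning-rate schedule described above (initial value 1 × 10−4, exponential decay by a factor of 0.99 per epoch, shared by all three networks) can be written as a small helper (an illustrative sketch, not the authors' training code):

```python
def learning_rate_at_epoch(epoch, base_lr=1e-4, decay=0.99):
    # Exponential per-epoch decay shared by G, D, and C:
    # lr(e) = base_lr * decay ** e, starting from 1e-4 at epoch 0.
    return base_lr * decay ** epoch
```

In PyTorch, the same schedule is typically realized with `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)` stepped once per epoch.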
Evaluation metrics. For quantitative evaluation, we measured the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [40], the most widely used metrics in image restoration tasks [28]. In addition, we used the learned perceptual image patch similarity (LPIPS) [41], which is commonly used to evaluate perceptual quality. To investigate the computational costs of the methods, we measured the number of model parameters, floating-point operations (FLOPs), and inference time. The FLOPs and inference time were measured using an input image of size 256 × 256 × 3, following [19,38,42]. For a fair comparison, the inference time of the models was measured on a PC equipped with a single NVIDIA TITAN RTX GPU and an Intel(R) Xeon(R) Gold 5218 CPU. When comparing our model with existing blur decomposition methods [5,13], we calculated the FLOPs required to obtain a single sharp sequence from a single blurred image for each model. For example, since the official model of Jin et al. [5] is designed to obtain seven sharp frames from a blurred image, we measured the total FLOPs required to obtain seven sharp frames. Similarly, to compare our model with Zhang et al. [13], which restores 15 sharp frames from a single blurred image, we measured the total FLOPs required to obtain 15 sharp frames. The inference time was calculated by averaging the time needed to restore 300 sharp video sequences from 300 input blurred images. When comparing our model with single image deblurring methods, the reported inference time represents the average across 300 images.
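For reference, PSNR, the primary metric above, can be computed as follows (a standard NumPy definition for images scaled to [0, 1]; this is not the paper's evaluation code):

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    # Peak signal-to-noise ratio in dB between a prediction and
    # its ground truth, assuming pixel values in [0, max_val].
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher PSNR is better; in contrast, LPIPS is a learned distance, so lower values indicate better perceptual quality.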
Training and evaluation datasets.We trained and evaluated our ABDGAN separately using GoPro [1] and B-Aist++ [4] datasets.In this section, we denote each model trained with GoPro and B-Aist++ as "ABDGAN-GP" and "ABDGAN-BA", respectively.While the original benchmark GoPro test set [1] contains blurred images synthesized by averaging more than 11 sharp frames per a single image, the models of Jin et al. [5] and Argaw et al. [6] were designed to restore seven frames from a single blurred image.For a fair comparison, we prepared the GoPro7 test set following Jin et al. [5].Each blurred image is synthesized by averaging seven consecutive sharp frames provided by the official GoPro test set [1].Meanwhile, the method of Zhang et al. [13] was proposed to extract 15 frames from a single blurred image.For a fair evaluation, we further prepared the GoPro15 test set by averaging 15 sharp frames of the official GoPro test videos.To evaluate the generalization performance of our proposed method, which was trained on the GoPro training set [1], we also evaluated the performance of our model on recent benchmark datasets such as REDS [3] and RealBlur [43].While the GoPro training set and test set are synthesized using video captured with 240 fps, the REDS dataset [3] includes motion-blurred images synthesized using 120 fps video.Evaluating the model on the REDS dataset allows for a comprehensive assessment of its performance across various types of motion blur.This is significant despite the inherent differences in frame rates between the training data and the REDS dataset.The RealBlur dataset [43] provides the most prevalent scenarios for motion blur, i.e., low-light environments.Unlike the synthesis methodology used in GoPro and REDS, the RealBlur dataset [43] comprises pairs of the blurred and sharp images captured using their proposed dual camera system.Evaluating our model on this dataset allows us to assess its performance in handling common real-world motion blur 
conditions. Note that the REDS and RealBlur datasets are used only for the evaluation of our method trained on the GoPro training set.
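The averaging-based synthesis behind the GoPro7 and GoPro15 test sets can be sketched as follows. This is a simplified illustration of the premise that blur is the temporal average of sharp frames; the official pipelines operate on the GoPro videos and may include additional processing (e.g., signal-space correction) not shown here:

```python
import numpy as np

def synthesize_blur(frames):
    """Average N consecutive sharp frames into one motion-blurred image.
    `frames` is a sequence of HxWx3 images (assumed in linear intensity space)."""
    stacked = np.stack([f.astype(np.float64) for f in frames], axis=0)
    return stacked.mean(axis=0)

# 7 dummy "frames" with constant intensities 0..6 average to 3.
frames = [np.full((4, 4, 3), float(i)) for i in range(7)]
blurred = synthesize_blur(frames)
print(float(blurred[0, 0, 0]))  # → 3.0
```

Replacing 7 with 15 in the frame window yields the GoPro15-style synthesis.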
A brief summary of the datasets is described as follows.
• GoPro dataset: We used the GoPro dataset [1], which is one of the most commonly used datasets for motion deblurring research. This dataset captures videos at 240 fps with a GoPro Hero camera and averages {7,

Quantitative Comparisons
Blur decomposition. To quantitatively evaluate the blur decomposition performance of the proposed method, we compared it with recent methods that generate multiple frames from a single blurred image. As the official codes of [6,11,12] have not been released, we were only able to obtain the results of Jin et al. [5] and Zhang et al. [13] for testing on the benchmark GoPro dataset [1]. The results are shown in Table 3, where F_i, F_m, and F_f indicate the initial, middle, and last frame among the restored frames, respectively. "Avg." denotes the average value over F_i, F_m, and F_f. Following [4,6], the results of all methods are reported as the higher metric between the ground-truth order and the reverse order due to motion ambiguity [4,6]. The results demonstrate that our method outperforms the other methods by a large margin in all metrics. Even though Zhang et al. [13] achieve the best PSNR and SSIM for central frame prediction, their performance is highly biased towards the central frame. Jin et al. [5] show more consistent performance across F_i, F_m, and F_f than Zhang et al. [13]. However, the method of Jin et al.
[5] is limited to small motion due to their architecture and training procedure [11]. Hence, their performance often degrades on images with large and varied degrees of blur, such as those in the GoPro benchmark test set. The quantitative comparisons on the GoPro7 and GoPro15 test sets are reported in Tables 4 and 5, respectively. Since [6] do not provide code and report only PSNR and SSIM results in their paper, excluding LPIPS, we report only these metrics in Table 4. It is noteworthy that our method outperforms existing approaches in all metrics, including in restoring F_i, F_m, and F_f from a blurred image. This demonstrates that the sharp frames predicted by our ABDGAN are more consistent and of better quality than those of existing methods. Furthermore, on the GoPro7 test set, the proposed method surpasses existing approaches specialized in predicting seven sharp frames from a single blurred image, such as Jin et al. [5] and Argaw et al. [6]. Moreover, when compared to Zhang et al. [13], which focuses on restoring 15 sharp frames, our method achieves superior results on the GoPro15 test set. These results highlight the superior generalization capability of our proposed method.
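The order-ambiguity-aware protocol (reporting the better of the ground-truth and reversed orderings) can be sketched as below. The per-frame metric here is a negative L1 distance standing in for PSNR/SSIM/LPIPS; it is an illustration of the protocol, not the authors' evaluation code:

```python
import numpy as np

def frame_metric(pred, gt):
    # Negative L1 distance as a stand-in for any "higher is better" metric.
    return -float(np.mean(np.abs(pred - gt)))

def ambiguity_aware_score(pred_seq, gt_seq):
    """Score a predicted sequence against the GT sequence and its temporal
    reverse, keeping the better of the two (motion-direction ambiguity)."""
    fwd = np.mean([frame_metric(p, g) for p, g in zip(pred_seq, gt_seq)])
    bwd = np.mean([frame_metric(p, g) for p, g in zip(pred_seq, gt_seq[::-1])])
    return max(fwd, bwd)

gt_seq = [np.full((2, 2), float(i)) for i in range(3)]  # frames 0, 1, 2
pred_seq = gt_seq[::-1]                                  # perfect, but reversed
print(ambiguity_aware_score(pred_seq, gt_seq))  # → 0.0 (reverse order matches)
```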
Table 3. Quantitative comparison of video extraction performance on the GoPro test set [1]. F_i, F_m, and F_f indicate the initial, middle, and last frame among the restored frames, respectively. "Avg." denotes the average value over F_i, F_m, and F_f. The symbol ↑ in parentheses indicates that higher is better; similarly, ↓ indicates that lower is better. We highlight the best and second-best results in bold and underline, respectively.

Table 6 shows the comparison of FLOPs, model parameters, and average inference time. Notably, our proposed ABDGAN-GP requires 1.23× fewer FLOPs and runs 1.875× faster than Jin et al. [5]. Meanwhile, compared with Zhang et al. [13], our ABDGAN-GP requires 1.10× more FLOPs and is 2.31× slower. However, considering the deblurring accuracy reported in Table 3, our ABDGAN-GP offers a more favorable trade-off between computational efficiency and deblurring accuracy than Jin et al. [5] and Zhang et al. [13].
Table 5. Quantitative comparison of video extraction performance on the GoPro15 test set. F_i, F_m, and F_f indicate the initial, middle, and last frame among the restored frames, respectively. "Avg." denotes the average value over F_i, F_m, and F_f. The symbol ↑ in parentheses indicates that higher is better; similarly, ↓ indicates that lower is better. The best results are highlighted in bold.

For a fair comparison with the most recent blur decomposition method [4], we also trained and evaluated our ABDGAN on the B-Aist++ dataset [4]. Since existing methods [5,13] released only their test models (trained on the GoPro dataset) without training code, we did not retrain them on the B-Aist++ dataset. The quantitative results are listed in Table 7. The results of Zhong et al. [4] were obtained using their motion predictor with a sampling number of five, which was reported as the best case in their paper. Even though the method of Zhong et al. [4] is effective in removing motion ambiguity by exploiting multiple motion guidances, the results show that our method restores more accurate frames in terms of all metrics. Table 7.
Quantitative comparison of video extraction performance on the B-Aist++ test set [4]. The symbol ↑ in parentheses indicates that higher is better; similarly, ↓ indicates that lower is better. We highlight the best results in bold.

Most recent studies [5,6,13] in blur decomposition, including the proposed ABDGAN, utilize a synthetic motion blur dataset (i.e., the GoPro dataset [1]) for training and evaluation. To compare the generalization abilities of these methods on real motion-blurred images, we conducted quantitative comparisons on the RealBlur-R and RealBlur-J test sets [43]. The results are reported in Table 8. Since the official RealBlur-J and RealBlur-R test sets provide only a single ground-truth image for each blurred image, all metrics are computed based on the center frame prediction results. The results in Table 8 demonstrate that our method yields the best performance in all metrics, indicating that the proposed ABDGAN restores sharp images from real motion-blurred scenes more accurately than existing methods [5,13].

Table 8. Quantitative comparison of the center frame prediction on the RealBlur test dataset [43]. The symbol ↑ in parentheses indicates that higher is better; similarly, ↓ indicates that lower is better. We highlight the best and second-best results in bold and underline, respectively.

Center frame prediction. Since most single image deblurring methods have been developed for predicting the center frame, we also conducted a quantitative comparison on central frames in Table 9. The results of our models were measured with the time code set to t = 0.5 during testing. The slight decrease in PSNR and SSIM of our ABDGAN compared to NafNet [38], the baseline architecture of our generator, can be attributed to two factors. First, single image deblurring models, including NafNet, are trained specifically to predict only a single center frame from a blurred image. In contrast, our approach is a blur decomposition method that learns to extract an arbitrary sharp moment from a blurred image. Therefore, while the performance of the proposed method decreases in center prediction, our model has the ability to extract diverse sharp moments from a blurred image. Second, the integration of GANs in our method tends to synthesize realistic but fake details [45,46]. Consequently, our model exhibits slightly lower PSNR and SSIM than methods [17-19,38] that do not utilize GANs. However, it is noteworthy that our ABDGAN shows the best results in terms of PSNR and LPIPS among GAN-based deblurring models, such as [1,5,12,22]. Due to the incorporation of our proposed temporal attention modules into NafNet, the FLOPs, model parameters, and inference time increase compared to the original NafNet model. However, our proposed ABDGAN-GP remains competitive with other state-of-the-art deblurring models, as reported in Table 9. It achieves a balance between
computational efficiency and deblurring performance.

Table 9. Quantitative comparison of the center frame prediction on the GoPro test dataset [1]. The methods (1st to 6th rows) are single image deblurring models that are trained to restore only center frames. When computing the FLOPs, the image size is set to 3 × 256 × 256. The symbol ↑ in parentheses indicates that higher is better; similarly, ↓ indicates that lower is better. We highlight the best and second-best results in bold and underline, respectively.

Qualitative Comparisons
The visual results on the GoPro test set [1] are shown in Figure 3. The proposed ABDGAN outperforms the other methods [5,13] in extracting sharp and fine details. Moreover, the proposed method can generate plausible dynamic motion, whereas the results of other methods often appear to move globally. In Figure 3, when comparing the local movements of the two cars for input (a), the method of Jin et al. [5] produces overly smooth results. Although the results of Zhang et al. [13] contain images of visually pleasing quality, they fail to restore complex local motions, and all pixels tend to move globally. These observations hold for most test results (e.g., see the fine movements of the walking motion in the results for input (b)). Figure 4 shows the qualitative comparisons on the B-Aist++ test set [4]. While the method of [4] fails to extract the accurate motion of the dancers, our method reconstructs more plausible motions (e.g., see the movements of the right hands in the first row and the movements of the legs in the second and third rows). Above all, the key aspect of our method is that a single model can accurately restore an arbitrary sharp moment from a blurred image. In our ABDGAN, these various outputs can be obtained by adjusting only the input time code t within [0, 1].
Figure 5 shows the results of the proposed method when restoring flexible frame rates for the input image. Remarkably, our model consistently produces high-quality sharp frames for various numbers of frames, such as 7, 11, and 15, without requiring any modification of the network architecture or retraining. In contrast, most existing blur decomposition methods [5,6,11,12] require a change of network architecture and retraining according to the number of outputs. The results on the REDS dataset [3] are provided in Figure 6. In this experiment, we use the validation set officially provided by [3]. The REDS validation set consists of pairs of blurred and central sharp frames. Hence, we only report these central frames as references for visual comparison in this experiment. In Figure 6, it is observed that the proposed method restores sharp frames with more realistic details than the competing methods [5,13] (e.g., see the facial components of the man wearing glasses in the results for input (a), and the pattern of the shirt in the results for input (c)).
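Because outputs are indexed only by the scalar time code t ∈ [0, 1], sampling any frame count amounts to choosing a grid of time codes. A minimal sketch (the `G` call in the comment is a hypothetical stand-in for the trained generator, which is not shown here):

```python
import numpy as np

def time_codes(num_frames):
    """Evenly spaced time codes over the exposure duration [0, 1]."""
    return np.linspace(0.0, 1.0, num_frames)

# The same trained generator is queried once per time code, e.g.:
# frames = [G(blurred, tc) for tc in time_codes(15)]
for n in (7, 11, 15):
    t = time_codes(n)
    # Odd frame counts always include the central moment t = 0.5.
    assert len(t) == n and t[0] == 0.0 and t[-1] == 1.0
print(time_codes(7))
```

No architectural change or retraining is involved; only the grid of time codes differs between the 7-, 11-, and 15-frame settings.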
Figure 7 shows the qualitative comparisons on the RealBlur test set [43]. The official RealBlur test set provides only a single sharp image per blurred image. Hence, we compare only the center frame results of Jin et al. [5], Zhang et al. [13], and our proposed ABDGAN-GP. The results demonstrate that our proposed method restores finer details in the sharp frames than the competing methods [5,13].

Ablation Study
We conducted ablation experiments to evaluate the impact of the components of our proposed approach. We trained four models, termed M1, M2, M3, and M4. All of these models share the same architecture as G but were trained with different loss functions. To isolate the effects of the existing pairwise-order invariant loss (L_POI in Equation (A1)) [5] and our proposed pairwise-order consistency loss (L_POC in Equation (1)), we trained the M1 model using only L_POI and the M2 model using only L_POC. Note that both M1 and M2 were trained in a supervised fashion without utilizing our D and C, whereas the M3 and M4 models were trained by playing the min-max game consisting of our G, D, and C. The training time for M1 and M2 is approximately 2 days each, and for M3 and M4 approximately 3.5 days each. As mentioned in Section 3.2, our CG loss allows our C to learn from trustworthy fake samples based on the critic-guidance weight from D. Through this, our method mitigates the difficulty for C of predicting the correct label even when there is a distribution discrepancy between real and fake images. To evaluate the effect of the critic-guidance weight from D, during training of the M3 model we set both σ_t′ and σ_t̄′ in Equation (2) to a value of 1 for all pixels; that is, our C was trained without the critic-guidance weight from D. In contrast, the M4 model was trained with the critic-guidance weight from D. Note that M4 is our proposed ABDGAN. In Table 10, we provide a brief summary of the configurations of the ablation models and present the quantitative comparisons.
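To illustrate the role of the critic-guidance weight, below is a minimal sketch of a per-pixel weighted regression loss. The weight map here is a hypothetical confidence in [0, 1] standing in for the σ terms derived from D in Equation (2); it is not the authors' exact formulation. Setting the weight to 1 everywhere recovers the unweighted loss used for M3:

```python
import numpy as np

def weighted_timecode_loss(t_pred, t_true, weight):
    """Per-pixel regression loss on predicted time-code maps, where
    `weight` down-weights pixels the critic deems unreliable."""
    return float(np.mean(weight * (t_pred - t_true) ** 2))

t_true = np.full((4, 4), 0.3)    # GT time-code map t_m
t_pred = t_true + 0.1            # imperfect prediction
ones = np.ones((4, 4))           # M3-style: no critic guidance
# M4-style (hypothetical) mask: trust only half of the pixels.
w = np.where(np.arange(16).reshape(4, 4) < 8, 1.0, 0.0)

plain = weighted_timecode_loss(t_pred, t_true, ones)
guided = weighted_timecode_loss(t_pred, t_true, w)
print(round(plain, 4), round(guided, 4))
```

With uniform weights the loss treats every pixel of a fake sample as trustworthy; the guided variant suppresses the contribution of unreliable pixels.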
The visual results in Figure 8 further highlight the effectiveness of the components of our proposed approach. Specifically, (a), (b), (c), and (d) in Figure 8 represent the outputs of M1, M2, M3, and M4, respectively, for the input time code t = 0.5, which exists within the training data. Meanwhile, (e), (f), (g), and (h) in Figure 8 are the outputs of M1, M2, M3, and M4, respectively, when the input time code is set to 0.45; notably, the time code 0.45 does not exist in the training data. When comparing the results of M1 and M2 in Table 10, we observe that our POC loss significantly improves the deblurring performance over the existing loss function of [5] in all metrics. The visual comparison in Figure 8a,b also shows that M2 produces higher-quality images than M1, which can be attributed to the use of our proposed POC loss. However, as shown in Figure 8e,f, both M1 and M2 are unable to restore plausible images for time codes unseen during training. This indicates that models trained solely in a supervised fashion with a limited dataset tend to underperform; specifically, they struggle to generate sharp images when the time indices of the input blurred images are absent from the training set. Benefiting from the use of GANs, the quantitative performance of M3 and M4 is better than that of M2 in terms of LPIPS. This perceptual improvement is also observed in the visual comparison in Figure 8b-d. The visual comparisons in Figure 8g,h show that M4, which is trained with the CG loss, generates more realistic and sharper frames than M3. This indicates that the CG loss effectively guides the generator to produce more realistic frames by aligning the generated frames with the distribution of real sharp frames.
Table 10. Effects of the components of our ABDGAN on the GoPro test set [1]. The reported metrics are averaged over all predicted frames and GT frames (12,221 images). The symbol ↑ in parentheses indicates that higher is better; similarly, ↓ indicates that lower is better.

Limitations
Figure 9 shows failure cases of our approach. The first row presents the results on a test image sampled from the benchmark GoPro test set [1]. The second row shows the outputs when the input image is degraded by defocus blur; the input and ground-truth images are sampled from the recent benchmark defocus blur test set proposed by Lee et al. [47]. Here, the central-frame results are shown for all models. Despite the significant advancements achieved by our proposed ABDGAN in motion-blur decomposition, we recognize two primary limitations. First, when the input image is severely degraded by substantial motion, caused by a large amount of local motion and camera shake, our method encounters challenges in accurately restoring a sharp frame, as shown in the first row of Figure 9. Second, since our method is specifically designed for restoring motion-blurred images, it does not account for the various other types of blur (e.g., defocus blur) that commonly occur in real-world scenarios. As depicted in the second row of Figure 9, our method is limited in such cases. Based on these limitations, we believe that future research should focus on addressing severe motion blur. Additionally, improving the model's robustness to various blur types, including defocus blur, remains a key direction for broader applicability.

Conclusions
In this paper, we proposed ABDGAN, a novel approach for arbitrary time blur decomposition. By incorporating a TripleGAN-based framework, our ABDGAN learns to restore an arbitrary sharp moment latent in a given blurred image even when the training data contain very few ground-truth images over the continuous time code. We also proposed a POC loss that encourages our generator to restore more accurate pixel intensities. Moreover, we proposed a CG loss that stabilizes training by minimizing the distribution discrepancy between generated and real frames. Extensive experiments conducted on diverse benchmark motion blur datasets demonstrate the superior performance of our ABDGAN compared to recent blur decomposition methods in both quantitative and qualitative evaluations.
The proposed ABDGAN outperforms the best competitor, improving PSNR, SSIM, and LPIPS on the GoPro test set by 16.67%, 9.16%, and 36.61%, respectively. On the B-Aist++ test set, our method provides improvements of 6.99% in PSNR, 2.38% in SSIM, and 17.05% in LPIPS over the best competing method. In conclusion, the proposed ABDGAN restores an arbitrary sharp moment from a single motion-blurred image with accurate, realistic, and visually pleasing quality. We believe that the proposed ABDGAN expands the application scope of image deblurring, which has traditionally focused on restoring a single image, to arbitrary time blur decomposition.
We anticipate that extending our ABDGAN to tackle a broader range of blur types, including defocus blur, will result in a more versatile and comprehensive deblurring solution. Future work will focus on enhancing the model's capability to handle diverse blur scenarios, thereby improving its applicability and effectiveness in real-world situations.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/s24154801/s1.

where (i, j) represents the pixel coordinate. Depending on the signs of terms (a) and (b) in Equation (A3b), this can branch into four different cases, as illustrated in Table A1. It is noteworthy that ŝ_t(i,j) can be matched to either s_t(i,j) (1st and 2nd rows in Table A1) or s_t̄(i,j) (3rd and 4th rows in Table A1). Simultaneously, ŝ_t̄(i,j) can correspond to either s_t̄(i,j) (1st and 2nd rows in Table A1) or s_t(i,j) (3rd and 4th rows in Table A1). This condition implies that each pixel in a single predicted image can be influenced by pixels from two different time-symmetric GT frames, rather than solely by pixels from a single GT image. This may impede the network's ability to consistently learn the temporal order of the GT frames.
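The per-pixel branching described above can be made concrete with a small sketch. Assuming a pairwise-order-invariant loss that takes a per-pixel minimum over the two possible pairings (a simplified reading of Equation (A3b), not the authors' exact formulation), different pixels of the same predicted frame can end up matched to different time-symmetric GT frames:

```python
import numpy as np

# Two-pixel "images": predictions at t and t̄, and the two GT frames.
s_hat_t    = np.array([1.0, 9.0])
s_hat_tbar = np.array([9.0, 1.0])
s_t        = np.array([1.0, 1.0])
s_tbar     = np.array([9.0, 9.0])

# Per-pixel cost of the direct pairing (ŝ_t↔s_t, ŝ_t̄↔s_t̄) vs the swap.
direct = np.abs(s_hat_t - s_t) + np.abs(s_hat_tbar - s_tbar)
swapped = np.abs(s_hat_t - s_tbar) + np.abs(s_hat_tbar - s_t)

# A per-pixel minimum lets pixel 0 follow the direct pairing while
# pixel 1 follows the swapped one: two GT frames influence one output.
match_to_s_t = direct <= swapped
print(match_to_s_t)  # → [ True False]
```

This is exactly the inconsistency the POC loss is designed to rule out: it ties all pixels of a predicted frame to a single GT frame.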

Appendix B. Time-Conditional Deblurring Network
The architectural overview of the time-conditional deblurring network G is illustrated in Figure A1. Since our baseline architecture, NafNet [38], is designed as a UNet-like structure [35], our G consists of an encoder and a decoder. In Figure A1, "NafNet Block" indicates the unit block proposed by Chen et al. [38]. Inspired by [48], we design the temporal attention module to find the most salient sharp image corresponding to specific spatiotemporal coordinates within the sharp images entangled in a blurred image. This module allows the deblurring network not only to learn the complex motion of objects but also to selectively focus on different sharp representations in a blurred image according to the input time code. This level of specificity would be difficult to model using 2D or 3D convolutions [49]. The temporal attention module can be flexibly plugged into many single image deblurring networks designed for accurate and efficient restoration. In more detail, given a blurred image b ∈ R^{H×W×3}, the proposed G applies a 3 × 3 convolution layer to extract feature maps z_in ∈ R^{H×W×C}, which is denoted as the "Input Projection" layer in Figure A1. Then, the extracted feature maps z_in and the input temporal code t are fed into the proposed temporal attention module, termed "Temporal Attention" in Figure A2. The proposed temporal attention module consists of a sequence of layers, where each layer includes multihead cross attention (MCA), multihead self attention (MSA), and a feedforward network (FFN); the module is expressed in Equation (A4). Following [51], we normalize the spatial coordinates to the range [−1, 1] in x ∈ R^{H×W×3}. Inspired by [52], which utilized both a Fourier embedding feature map and spatial coordinate information for producing high-frequency components of images with fewer artifacts, we apply the Fourier embedding [52] to x. Specifically, the Fourier-embedded feature map e_f ∈ R^{H×W×C} is obtained by applying a 1 × 1 convolution layer followed by a sine activation function to x, which is defined by e_f = sin[1×1conv(x)]. In Equation (A4), we denote this Fourier embedding process as "Embedding(•)" for simplicity. Then, we obtain z_x by concatenating the spatiotemporal coordinates x, the Fourier features e_f, and the image features z_in, defined as z_x = Concatenation[x, e_f, z_in]. The z_x is projected to a query feature map, and z_in is projected to key and value feature maps for the cross-attention computation [48]. Inspired by [53], who proposed an efficient attention layer that reduces memory costs for image restoration, 3 × 3 depth-wise convolution and 1 × 1 point-wise convolution are
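A minimal sketch of the coordinate construction and Fourier embedding described above, checking shapes only. The 1 × 1 convolution is modeled as a random per-pixel linear projection, and the channel width C is a placeholder, not the paper's actual value:

```python
import numpy as np

H, W, C = 8, 8, 16
rng = np.random.default_rng(0)

# Spatial coordinates normalized to [-1, 1], plus the time code t as a
# third coordinate, giving x ∈ R^{H×W×3}.
ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
t = 0.5
x = np.stack([xs, ys, np.full((H, W), t)], axis=-1)  # (H, W, 3)

# Fourier embedding e_f = sin(1x1conv(x)): a 1×1 conv over 3 channels is a
# per-pixel 3→C linear map, followed by a sine activation.
W_proj = rng.normal(size=(3, C))
e_f = np.sin(x @ W_proj)                              # (H, W, C)

# Concatenate coordinates, Fourier features, and image features z_in.
z_in = rng.normal(size=(H, W, C))
z_x = np.concatenate([x, e_f, z_in], axis=-1)         # (H, W, 3 + 2C)
print(z_x.shape)  # → (8, 8, 35)
```

In the module itself, z_x would then be projected to queries and z_in to keys and values for the cross-attention step.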

InputFigure 1 .
Figure 1. Example of blur decomposition results on the GoPro test set [1]. For clarity, we display magnified parts of the output images. The figure exemplifies the superiority of our model compared with previous methods [5,13]. Horizontal and vertical lines are marked at the center coordinate of each image to provide a clear observation of object movements between consecutive frames. The first, second, and third rows of output images show the initial, central, and final frames of the video sequence predicted from an input image. For quantitative comparison, we calculate the PSNR, SSIM, and LPIPS values by averaging those of the initial, central, and final frames.

Figure 2 .
Figure 2. The pipeline of ABDGAN. During training, ABDGAN plays a min-max game among the three networks G, D, and C. In every iteration, they are optimized alternately with the proposed critic-guided loss. For testing, only G is used to render the sharp image at an arbitrary t ∈ [0, 1] from a blurred image.

Figure 3 .
Figure 3. Qualitative comparison on the official GoPro test set [1] with the single-to-video deblurring methods [5,13]. For clarity, we display magnified parts of the output images. The initial, central, and final frames for each method are displayed. Horizontal and vertical lines are marked at the center coordinate of each image to provide a clear observation of object movements between consecutive frames. Please refer to the Supplementary Materials for comparisons on full video frames and the results of the proposed deblurring method.

Figure 4 .
Figure 4. Qualitative comparison on the B-Aist++ test set [4]. For clarity, we display magnified parts of the output images. Horizontal and vertical lines are marked at the center coordinate of each image to provide a clear observation of object movements between consecutive frames. Please refer to the Supplementary Materials for comparisons on full video frames and the results of the proposed deblurring method.

Figure 5 .
Figure 5. Several examples of the proposed ABDGAN on the official GoPro test set [1] and the B-Aist++ test set [4]. Note that our results are the outputs of the same network. By adjusting the input temporal code value t, our network restores any number of sharp moments from a given blurred image without architectural changes or retraining. To provide a clear observation of input (a), we display magnified parts of the output images. Similarly, output images for input (b) are shown for varying time codes of our method. Horizontal and vertical lines are marked at the center coordinate of each image to provide a clear observation of object movements between consecutive frames. Please refer to the Supplementary Materials for more frames restored by the proposed deblurring method.

Figure 6 .
Figure 6. Qualitative comparison on the REDS dataset [3] with the single-to-video deblurring methods [5,13]. For clarity, we display magnified parts of the output images. Horizontal and vertical lines are marked at the center coordinate of each image to provide a clear observation of object movements between consecutive frames. The initial, central, and final frames for each method are displayed. Please refer to the Supplementary Materials for comparisons on full video frames and the results of the proposed deblurring method.


Figure 8 .
Figure 8. Ablation study of our method on the GoPro test dataset [1]. For clarity, we display magnified parts of the output images.

Figure A2 .
Figure A2. The proposed temporal attention module.

Table 1 .
Comparison of image deblurring and blur decomposition methods.

Table 2 .
A summary of notations for the proposed ABDGAN.
t ∈ [0, 1]: a time code that represents a specific moment within the exposure duration. t̄: the time code symmetric to t with respect to the central moment 0.5, obtained by t̄ = 1 − t. t′: a time code randomly sampled from the uniform distribution U[0, 1]. t_m ∈ R^{H×W}: a 2-dimensional matrix filled with t, i.e., t_m(i,j) = t for every pixel coordinate (i, j).
Set learning rates ℓ_D, ℓ_C, and ℓ_G, a batch size of n, and the balancing parameters between losses λ_POC, λ_C, and λ_D. Let N be the number of total training iterations; for each iteration iter = 1, 2, . . ., N, the three networks are optimized in turn. Training of the discriminator D: sample a batch {b, s_t} ∼ p_d of size n and a batch {t′} ∼ p_t of size n, and generate fake samples using G. Here, D_e(•) is the per-image critic score measured by the encoder of D, and [D_d(•)]_(i,j) represents the per-pixel critic score measured by the decoder of D at pixel coordinate (i, j). Training of the time-code predictor C: the details of the training scheme of C are described in Algorithm 3. One key aspect of our ABDGAN is that C is trained to predict accurate time codes for both labeled and unlabeled data. For this, the loss function for training C, L_C, is defined as the sum of two regression losses, L_C^L and L_C^CG (Equation (2)). The regression loss for labeled data, L_C^L, is defined using the labeled data {b, t, t̄, s_t, s_t̄} ∼ p_d, which can be viewed as real images sampled from the training dataset.

9, 11, 13} frames for synthesizing blurred images. It provides a total of 3214 blurred images, of which 2103 are training images and 1111 are test images. Each image has a resolution of 1280 × 720. Following [5,6], we prepared the GoPro7 test set using the original GoPro test set. Unlike the original GoPro test set, in which blurred images are synthesized by averaging more than 11 frames, the GoPro7 test set is generated by averaging 7 sharp frames to create each corresponding blurred image. This GoPro7 test set comprises 1744 blurred images and the corresponding 12,208 sharp frames. Note that this test set is only used for evaluation purposes, as in [5,6]. Considering that [13] proposed to restore 15 frames from a single blurred image, the GoPro15 test set is prepared by averaging 15 sharp frames of the original GoPro test set. This test set consists of 811 blurry images and the corresponding 12,165 sharp frames.
• B-Aist++ dataset: Following [4], we utilized the B-Aist++ dataset [4], which consists of motion-blurred images synthesized from a human dancing video [44]. The dataset contains 73 and 32 video clips for training and testing, respectively, and the spatial resolution of the images is 960 × 720.
• REDS dataset: The REDS dataset [3] provides realistic images of dynamic scenes for image deblurring. It contains blurred images synthesized by merging sharp frames captured from 120 fps videos of 1280 × 720 resolution.
• RealBlur dataset: The RealBlur dataset [43] comprises blurred images captured in low-light static scenes. These blurred images simulate motion blur induced by camera shake and are captured in various low-light environments, including nighttime street scenes and indoor settings. The RealBlur test set consists of two subsets, RealBlur-J and RealBlur-R. The RealBlur-J test set contains JPEG images, and the RealBlur-R test set contains images captured in the raw camera format. Each test set contains 980 pairs of ground-truth and blurred images.

Table 4 .
Quantitative comparison of video extraction performance on the GoPro7 test set. F_i, F_m, and F_f indicate the initial, middle, and last frame among the restored frames, respectively. "Avg." denotes the average value over F_i, F_m, and F_f. The symbol ↑ in parentheses indicates that higher is better; similarly, ↓ indicates that lower is better. We highlight the best and second-best results in bold and underline, respectively.

Table 6 .
Quantitative comparison on FLOPs, inference time, and model parameters.