Motion Deblurring using Spatiotemporal Phase Aperture Coding

Motion blur is a known issue in photography, as it limits the exposure time while capturing moving objects. Extensive research has been carried out to compensate for it. In this work, a computational imaging approach for motion deblurring is proposed and demonstrated. Using dynamic phase-coding in the lens aperture during the image acquisition, the trajectory of the motion is encoded in an intermediate optical image. This encoding embeds both the motion direction and extent by coloring the spatial blur of each object. The color cues serve as prior information for a blind deblurring process, implemented using a convolutional neural network (CNN) trained to utilize such coding for image restoration. We demonstrate the advantage of the proposed approach over blind deblurring with no coding and over other solutions that use coded acquisition, both in simulation and in real-world experiments.


Introduction
Finding the proper exposure setting is a well-known challenge in photography. In general, one has to balance between aperture size, exposure time and the gain to achieve a good image (the trade-off between these factors is sometimes referred to as the 'exposure triangle'). This balancing process involves many trade-offs, and therefore requires complex skills and rich experience. In many cases, a large exposure is necessary to allow a sufficient amount of light to reach the sensor in order to achieve a good image with respect to the lighting condition, which usually is not controllable. To increase the amount of light in the sensor plane, one may increase the aperture size. However, a large aperture results in a shallow depth-of-field and increased sensitivity to optical aberrations. Increasing the sensor gain can intensify the image signal, at the price of a higher noise level. Increasing the exposure time allows more light to be integrated into the image sensor, but introduces motion blur.
Various efforts have been dedicated to automatically balancing the exposure parameters [29]. Yet, such solutions are either very specific to the scenario or provide moderate performance. A different approach tries to eliminate one (or more) of the exposure triangle vertices, by developing methods that can restore the artifacts introduced by a non-balanced exposure. For example, one may apply a high gain and then perform a denoising operation [22,39,3,26]; increase the aperture size and restore the blurred image using out-of-focus deblurring algorithms [48]; or take long exposures and revert the motion-related blur [21], which is the focus of this work.
Figure 1. Motion deblurring using spatiotemporal phase aperture coding: A moving scene is captured using a camera with spatiotemporal phase aperture coding, which generates a motion-color coded PSF. The PSF coding serves as a prior for the CNN that performs blind spatially varying motion deblurring.
In addition to the pure post-processing methods, solutions based on computational imaging [31] attempt to globally analyze the scenario, and then re-design the whole imaging system. In such an approach, the image acquisition is manipulated in a way that (generally) leads to an intermediate image of low quality. However, this image is distorted in a very specific way such that it encodes information acquired during the exposure. Such information encoding is designed so that it can be employed in the post-processing stage for the final image restoration. Such methods have been demonstrated for various applications, including extended depth-of-field [6,24,46,13,8], hyper/multi-spectral imaging [9,10], depth estimation [24,49,14] and motion deblurring [35,1,25,4,2,41], to name a few.
arXiv:2002.07483v1 [cs.CV] 18 Feb 2020
Computational imaging methods for motion deblurring have been presented before, using either temporal shutter coding [35] or a parabolic motion of the camera [25]. In both methods, some assumption/prior knowledge on the motion direction is needed, which limits their performance, as discussed in Section 2.
Contribution: In this work, a computational imaging approach for motion deblurring is proposed and analyzed. The innovation in the proposed approach is the encoding method that embeds dynamic cues for the motion trajectory and extent in the intermediate image (see Fig. 1), which serve as a strong prior for both shift-variant Point-Spread Function (PSF) estimation and the deblurring operation.
Our encoding is achieved by performing spatiotemporal phase-coding in the lens aperture plane during the image acquisition. The PSF of the coded system induces a specific chromatic-temporal coupling, which (unlike in a conventional camera) results in a color-varying spatial blur (see Fig. 2). Such a PSF encodes the different motion trajectory of each object. A convolutional neural network (CNN) is trained to analyze the embedded cues and use them to reconstruct a deblurred image.
The encoding design is performed for a general case, whereby objects move in different directions/velocities. Such encoding allows blind deblurring of the intermediate image, since the required prior information for both PSF estimation and image restoration is embedded in it. The deblurring CNN is trained to estimate the spatially varying PSF using the encoded cues, and reconstruct a sharp image. An experimental setup of a camera performing the designed spatiotemporal blur is presented. We demonstrate its ability to perform motion deblurring in the presence of both uniform and non-uniform motion.

Related work
Blind motion deblurring: Motion deblurring is a widely studied challenge in image processing. Various image priors have been tested for motion deblurring, e.g. image statistics [23] and sparsity [19], to name a few. In recent years, deep models have been used to implicitly learn the transformation from a blurred to a sharp image, such that the model encapsulates both the non-uniform PSF estimation and the image deblurring operation. This approach was demonstrated using various networks: recurrent and scale-recurrent networks [47,42], adversarial training [34,20] and frame burst to sharp image [45]. For a recent review and comparison of existing methods, see [21].
Computational imaging-based motion deblurring: Various works proposed to manipulate the imaging process during exposure to allow better motion deblurring. Such approaches include the use of hybrid imaging [2], a light field camera [41] or the rolling shutter effect [32]. In this work, our focus is on spatiotemporal schemes. We now detail two such prior frameworks.
Raskar et al. developed a temporal amplitude coding of the aperture to counteract motion blur [35]. The continuous exposure is analyzed as a wide temporal box filter with a narrow frequency response, which limits the motion deblurring performance. Using this analysis, the authors propose a temporal amplitude coding of the aperture (referred to as 'fluttered shutter'), by alternately closing and opening it during exposure according to a pre-determined timing (i.e. a 'temporal code'). Such coding generates a much wider frequency response, which in turn is utilized for improved motion deblurring results. While achieving very good results, it requires prior knowledge of the motion direction and extent. In addition, it suffers from reduced light efficiency, since the aperture is closed during half of the exposure. A follow-up work analyzes the case of fluttered shutter PSF estimation as a part of the deblurring process, thus avoiding the requirement for prior knowledge of the motion parameters [1]. However, such code design requires a compromise in the deblurring performance, to achieve both PSF invertibility and estimation abilities. Moreover, since it also relies on temporal amplitude coding, light efficiency is still decreased. A rigorous model analyzing the design and implementation of a fluttered shutter camera is presented in [43,44]. Jeon et al. extended the method to multi-image photography using complementary fluttering patterns [17].
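The frequency-response argument above can be illustrated numerically. The following is a minimal sketch, not the code design procedure of [35]: it compares the smallest DFT magnitude of a plain box exposure with that of a binary code found by a simple random search. The code length, the number of candidates and the seed are all hypothetical choices made for illustration.

```python
import cmath
import random


def min_spectrum_magnitude(code, n_fft=128):
    """Smallest DFT magnitude of a zero-padded binary exposure code."""
    smallest = float("inf")
    for k in range(n_fft):
        acc = 0j
        for n, chop in enumerate(code):
            acc += chop * cmath.exp(-2j * cmath.pi * k * n / n_fft)
        smallest = min(smallest, abs(acc))
    return smallest


def search_flutter_code(length=32, n_candidates=50, seed=0):
    """Random search (an assumption, not the method of [35]) for a binary
    code whose spectrum has no deep nulls."""
    rng = random.Random(seed)
    best_code, best_score = None, -1.0
    for _ in range(n_candidates):
        code = [1 if rng.random() < 0.5 else 0 for _ in range(length)]
        score = min_spectrum_magnitude(code)
        if score > best_score:
            best_code, best_score = code, score
    return best_code, best_score


# Continuous exposure: the shutter is open for all 32 time slots.
box_code = [1] * 32
box_min = min_spectrum_magnitude(box_code)

# Coded exposure: open/closed pattern with a flatter spectrum.
flutter_code, flutter_min = search_flutter_code()
light_fraction = sum(flutter_code) / len(flutter_code)

print("box exposure, min |DFT|:", box_min)
print("coded exposure, min |DFT|:", flutter_min)
print("fraction of light kept by the code:", light_fraction)
```

The box exposure has exact spectral nulls, so some spatial frequencies of the moving object are lost, while the selected code keeps every sampled frequency above zero at the cost of blocking part of the light.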
Another approach by Levin et al. searches for a sensor motion that leads to a motion invariant PSF [25]. The motivation is that with such a PSF, one may perform non-blind deconvolution using the known kernel on the entire image at once, without the requirement to estimate each object's motion trajectory. After a rigorous analysis, it is shown that parabolic motion of the image sensor during exposure leads to the desired motion invariant PSF. Intuitively, one may think of this image acquisition technique as a process in which every moving object, at least for a fraction of the exposure, moves at the same velocity as the sensor (assuming the velocity is inside a predefined range). Since each object is 'tracked' by the camera for one brief moment, and in the rest of the exposure it is moving relative to the camera, the blur of all objects turns out to be similar (for the full analysis see [25]). This allows applying a conventional deblurring approach that assumes a uniform blur. While this is a major advantage, this approach has one serious limitation: The PSF encoding is limited to the axis in which the parabolic motion took place. If an object moves in a different direction, the motion invariant PSF assumption no longer holds, and the performance degrades. In the case of movement in an orthogonal direction, the deblurring ability is completely lost. To solve this issue, a follow-up work [4] proposed an advanced solution based on two images taken with two orthogonal parabolic motions. Such a solution allows deblurring of motion in all directions. Yet, it requires a more complex setup and the acquisition of two images. The rigorous model presented in [43] for the fluttered shutter camera can also be applied to the parabolic motion camera.
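The motion-invariance property can be checked with a small one-dimensional sketch. It assumes a hypothetical sensor trajectory x = a·t² over the exposure [-T, T] and objects moving at constant speed along the same axis; the blur trace is then the histogram of the relative position r(t) = s·t - a·t², measured from its turning point. This is an illustration of the idea only, not the analysis of [25].

```python
def parabolic_psf_head(speed, accel=1.0, half_exposure=1.0,
                       bin_width=0.05, n_bins=10, n_samples=20000):
    """Histogram of the blur trace, measured from its turning point.

    The object moves at `speed` and the sensor follows x = accel * t**2,
    so the relative position is r(t) = speed * t - accel * t**2.
    Only the first `n_bins` bins next to the turning point are returned.
    """
    r_vertex = speed ** 2 / (4.0 * accel)  # maximum of r(t)
    counts = [0] * n_bins
    for i in range(n_samples):
        t = -half_exposure + 2.0 * half_exposure * (i + 0.5) / n_samples
        r = speed * t - accel * t * t
        u = r_vertex - r  # distance from the turning point (>= 0)
        idx = int(u / bin_width)
        if idx < n_bins:
            counts[idx] += 1
    return [c / n_samples for c in counts]


def static_blur_extent(speed, half_exposure=1.0):
    """Blur length of the same object seen by a static camera."""
    return abs(speed) * 2.0 * half_exposure


speeds = (0.0, 0.5, -0.5)
heads = {s: parabolic_psf_head(s) for s in speeds}
max_gap = max(abs(a - b)
              for s in (0.5, -0.5)
              for a, b in zip(heads[0.0], heads[s]))

print("largest histogram difference between speeds:", max_gap)
print("static-camera blur extents:", [static_blur_extent(s) for s in speeds])
```

Near the turning point the traces for different speeds coincide up to a shift, whereas the static-camera blur extent changes with the speed. The traces differ only in their clipped tails, which is consistent with the remark that a finite parabolic motion is only approximately motion invariant.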
Spatiotemporal coding for other applications: Spatiotemporal coded cameras were also investigated for other tasks.
In [11] the authors suggest a rolling shutter mechanism that can assist computational imaging applications such as optical flow estimation and high-speed photography. It has been shown that using such a modified rolling shutter, one may extract a video sequence just from a single coded exposure photograph [12,27]. In [36], the authors present an approach to convert low resolution and low frame-rate video sequences to higher resolution and higher frame-rate, using a dynamic mask designed as a spatiotemporal shutter. A similar problem is approached in [15] using a fluttered shutter camera. Llull et al. [28] use compressed sensing techniques to extract more than ten frames from a single snapshot, where a moving coded aperture mask is used to generate the required spatiotemporal coding.

Spatiotemporal aperture coding
In order to achieve blind deblurring of motion blurred images, the blur kernel has to be estimated and thereafter inverted (even if both of these operations are jointly performed). In a general scene, objects move in different directions and at different velocities, making the blur kernel shift-dependent. Therefore, linear shift invariant deconvolution operations cannot be used. Yet, one may encode cues in the acquired image to mitigate some of the hurdles. To this end, we aim at encoding in the intermediate image enough information to allow both estimating and inverting the spatially-varying PSF of the acquired image, such that improved motion deblurring of a general scene is achieved.
The spatiotemporal PSF design. The design goal of such a PSF is to encode the object trajectory during the image acquisition. To achieve this task, the PSF has to vary along the trajectory in some way that provides cues for both the motion direction and extent. One may suggest spatial variations of the PSF along the motion trajectory (i.e. during the exposure time); however, such a variation introduces a spatial blur. Since trading motion blur for spatial blur is not desired, the PSF variation has to take place in another dimension.
In our proposed design, the motion variations are projected onto the color space. If the PSF can change in color during the motion, a motion-variant encoding with cues to the motion direction and velocity can be achieved. Generally, color-coding requires color filtering, which results in loss of light and requires some mechanism for filter replacement (either mechanical or electronic); neither of these is desired. Therefore, in order to achieve motion-color coding, a phase-mask is used. In various works [7,5,46,13,8,14,33], phase-masks are used for PSF engineering. The advantage in light throughput of phase over amplitude aperture coding is significant (in many amplitude-coding based systems the light throughput is reduced by ∼50%; see for example [24,40]).
In several previous works [46,13,8,14], phase-masks formed of several concentric rings are used for extended depth-of-field (EDOF) and depth estimation. Such masks act as a circularly-symmetric diffraction grating, and therefore introduce a predesigned and controlled axial chromatic aberration. This controlled aberration engineers the PSF to have a joint defocus-color dependency. As opposed to a conventional corrected lens (in which the response is designed to be the same for all colors), incorporating such a phase-mask in the lens aperture introduces a discrepancy in the lens response to the different colors. For example, a phase-mask can be designed to generate an in-focus PSF, which is narrow for the blue wavelength band, wider for green, and even wider for red. To generate such joint defocus-color dependency, the phase-mask is designed in a way that defocus variations change the width of the PSFs at different colors, such that in another focus plane the narrow PSF color is either green or red, and the 'order' of the PSF widths is interchanged.
This joint dependency can be used for both EDOF [13,8] and depth estimation [14], by focusing the lens to a specific plane in a scene with objects located at various depths. In such a configuration, each object is blurred by a different blur kernel, according to its defocus condition. This color-depth encoding of the blur kernels allows high quality EDOF (which requires blind shift-variant deconvolution in the general case) and single image monocular depth estimation.
We suggest using a similar phase-mask for the spatiotemporal encoding required for motion deblurring. We assume a scene with objects located relatively far from the lens (in relation to the focal length), in a way that all objects can be considered as located practically at infinity (such a setting, known as infinite-conjugate imaging, is commonly used in various applications, for example security cameras and smartphone cameras). By adding a mask containing the proper phase rings, the PSF is modulated to be 'colored' (by a narrow PSF in a certain color band and wide PSFs in the other bands, and not by chromatic filtering). When the focus setting varies, the PSF also changes to a different color.
Therefore, if the focus changes gradually during the exposure, the desired spatiotemporal dependency is achieved: The 'color' of the PSF (i.e. the ratio between the PSF width in the different color channels) varies during the exposure, and as every object moves, its motion is blurred differently (in the chromatic dimension) along the trajectory.
In [8,13], the color differences serve as cues to estimate the correct PSF, and deblurring can be done using the sharp color channel (i.e. the color in which the PSF is narrow) that 'carries' the image information (as in most natural images objects have some color content in all channels, and pure monochromatic objects are rare). Therefore, the color-dependent blur can be deblurred by transferring resolution from channel to channel. In our case of motion blur, the color cues are designed to indicate the motion trajectory for shift-variant PSF estimation, and thereafter the PSF information is used for deblurring. Since a strong defocus-color dependency is desired, a similar mask to the one presented in [14] is used.
As described above, a proper focus variation should be performed during the exposure to achieve the desired colored motion trajectory. The focus/defocus condition is quantified using the ψ measure, defined as:
$$\psi = \frac{\pi R^2}{\lambda}\left(\frac{1}{z_o} + \frac{1}{z_{img}} - \frac{1}{f}\right),$$
where R is the lens exit pupil radius, λ is the illumination wavelength, z_o is the object distance, z_img is the sensor distance and f is the lens' focal length. When the lens is focused properly, ψ = 0. If a focus variation is introduced, then ψ changes. By examining the phase-coded lens response, the defocus variation domain providing the strongest separation appears to be 0 < ψ < 8 (calculated for the blue wavelength), and therefore it is taken as the domain for the gradual focus variation during exposure.
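To make the ψ measure concrete, the sketch below evaluates it for an object at infinity, assuming the common definition ψ = (πR²/λ)(1/z_o + 1/z_img - 1/f). The pupil radius and the blue wavelength are hypothetical values chosen for illustration; only the focal length (12 mm) and the range 0 to 8 appear in the text. The linear ramp in `psi_schedule` is also an assumption, since the text only states that the focus changes gradually, and the focus change is expressed here as an equivalent sensor distance rather than as a setting of an actual focusing element.

```python
import math


def defocus_psi(pupil_radius, wavelength, z_obj, z_img, focal_length):
    """Defocus measure psi (radians) for the assumed definition above."""
    return (math.pi * pupil_radius ** 2 / wavelength) * (
        1.0 / z_obj + 1.0 / z_img - 1.0 / focal_length)


def sensor_distance_for_psi(psi, pupil_radius, wavelength, z_obj, focal_length):
    """Invert the definition: sensor distance that yields a given psi."""
    inv_z_img = (psi * wavelength / (math.pi * pupil_radius ** 2)
                 + 1.0 / focal_length - 1.0 / z_obj)
    return 1.0 / inv_z_img


def psi_schedule(t, exposure, psi_start=0.0, psi_end=8.0):
    """Linear psi ramp during the exposure (the linearity is an assumption)."""
    frac = min(max(t / exposure, 0.0), 1.0)
    return psi_start + frac * (psi_end - psi_start)


FOCAL_LENGTH = 12e-3      # f = 12 mm, as in the table-top setup
PUPIL_RADIUS = 1.5e-3     # hypothetical exit pupil radius
WAVELENGTH_BLUE = 455e-9  # hypothetical blue wavelength
Z_OBJ = math.inf          # infinite-conjugate imaging

z_focus = sensor_distance_for_psi(0.0, PUPIL_RADIUS, WAVELENGTH_BLUE,
                                  Z_OBJ, FOCAL_LENGTH)
z_end = sensor_distance_for_psi(8.0, PUPIL_RADIUS, WAVELENGTH_BLUE,
                                Z_OBJ, FOCAL_LENGTH)

print("equivalent sensor shift for psi = 8 [m]:", z_focus - z_end)
print("psi at mid exposure:", psi_schedule(0.5, 1.0))
```

With these illustrative numbers the equivalent shift is on the order of tens of micrometers, which gives a feeling for how small the required focus variation is.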
The PSF encoding simulation. The proposed encoding is illustrated in Fig. 2. Fig. 2(left) presents the blur of a moving point source captured by a conventional camera. If a gradual focus variation is applied to a clear aperture lens during exposure (Fig. 2(middle)), the PSF gets wider in all the colors simultaneously, and thus introduces a considerable spatial blur in the last parts of the motion. However, if the same focus variation is applied to a lens equipped with a ring phase mask (Fig. 2(right)), the PSF colors change along the motion line, from blue through green to red.
To further illustrate the motion encoding of our method, we simulate imaging of moving point sources using our method, and compare it to a conventional camera, the fluttered-shutter camera [35] and the parabolic motion camera [25]. Fig. 3 presents the PSF encoding performed by the different methods (this is an extension of a similar comparison shown in Fig. 3 of [25]). The original scene is formed of two sets of point sources arranged in two orthogonal lines. While the joint dot stays in place, all the other dots are moving at different velocities, as illustrated by the arrows in Fig. 3(a).
Imaging simulation of this scene is performed using the four methods. In the conventional case, the stationary dot stays 'as-is', and all the other dots are blurred according to their motion trajectory. Using the fluttered shutter camera, parts of the dots' traces are blocked, and the code can be clearly seen. As suggested in [35], such a code generates an easy to invert PSF, assuming the motion direction and extent are known. Indeed, some PSF estimation can be done for blind deblurring, but an inherent invertibility/estimation trade-off exists, as discussed in [1]. In addition, the light throughput loss caused by the fluttered shutter is clearly seen (for the code proposed in [35] the loss is 50%).
Using the parabolic motion camera (with parabolic motion in the horizontal direction), the PSF is roughly motion invariant in the direction of the sensor motion, as clearly seen in all horizontal dots. Yet, in any other direction, and most significantly in the orthogonal (in this case, vertical) one, each dot's linear motion and the sensor's parabolic motion are composed, making the PSFs highly motion-variant.
In the proposed joint phase-mask and focus variation coding, each PSF is colored according to the different motion trajectory. The direction is encoded by the blue-green-red transition, and the extent of the transition indicates the velocity of the motion.
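The color coding of the trace can be mimicked with a toy model. This is not the diffraction simulation used in the paper: each color channel is given a Gaussian PSF whose width is smallest at a hypothetical 'best' ψ value (blue at 1, green at 4, red at 7) and grows linearly away from it. Only the center row of the two-dimensional PSF is accumulated, so a narrow PSF shows up as a brighter trace.

```python
import math

# Hypothetical psi values at which each channel has its narrowest PSF.
CHANNEL_BEST_PSI = {"B": 1.0, "G": 4.0, "R": 7.0}


def channel_sigma(psi, best_psi, sigma_min=0.6, growth=0.8):
    """Toy PSF width (pixels): narrowest at best_psi, wider elsewhere."""
    return sigma_min + growth * abs(psi - best_psi)


def coded_trace(start, velocity, n_pixels=64, n_steps=64, psi_range=(0.0, 8.0)):
    """Center row of the blur trace of a moving point, per color channel.

    The point moves from `start` to `start + velocity` (pixels) while psi
    ramps linearly over `psi_range` during the exposure.
    """
    image = {c: [0.0] * n_pixels for c in CHANNEL_BEST_PSI}
    for step in range(n_steps):
        frac = step / (n_steps - 1)
        psi = psi_range[0] + frac * (psi_range[1] - psi_range[0])
        pos = start + velocity * frac
        for c, best in CHANNEL_BEST_PSI.items():
            sigma = channel_sigma(psi, best)
            peak = 1.0 / (2.0 * math.pi * sigma ** 2)  # 2D Gaussian, unit energy
            for x in range(n_pixels):
                image[c][x] += peak * math.exp(-0.5 * ((x - pos) / sigma) ** 2)
    return image


def dominant_channels(image, x_positions):
    """Brightest channel at each requested pixel, as a string."""
    return "".join(max(image, key=lambda c: image[c][x]) for x in x_positions)


probe = [14, 32, 50]
rightward = dominant_channels(coded_trace(start=12, velocity=40), probe)
leftward = dominant_channels(coded_trace(start=52, velocity=-40), probe)
print("moving right:", rightward)
print("moving left: ", leftward)
```

Reversing the motion direction reverses the color order along the same trace, which is the cue that a conventional 'gray' blur lacks.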
Spectral analysis. To analyze the motion encoding ability of our scheme, a spectral analysis of the PSF is carried out using the spatiotemporal Fourier analysis model proposed in [25]. In this model, a single spatial dimension is examined vs. the temporal dimension, and a 2D Fourier Transform (FT) is carried out on this (x, t) plane (which is a slice of the full (x, y, t) space). In such a setting, different velocities of a point source form lines at different angles in the (x, t) plane. The analysis in [25] included only the spectrum amplitude, but in our analysis we also include its phase, since our encoding is also phase dependent, as we show next.
We compare our method with a conventional camera in Fig. 4 (a full analysis including the fluttered shutter and parabolic motion cameras appears in the supplementary material). For the conventional static camera, the (x, t) slice of the PSF has a sinc spectrum amplitude, which allows good reconstruction of objects at this velocity (represented by the angle of the (x, t) PSF). Since the PSF is 'gray' (i.e. has no chromatic shift along its trajectory), its spectrum phase is also gray. This 'gray phase' feature is also common to the fluttered-shutter and parabolic motion cameras, as can be seen in the full analysis in the supplementary material.
Our proposed PSF can be considered as an infinite sequence of smaller PSFs, each one of a different color. As all PSFs have a similar spatial shape, but each has a different color and a different location in the (x, t) plane, the spectrum amplitude is 'white' and similar to the spectrum amplitude of the conventional PSF. Yet, the phase (which holds the shift information) is colored, according to the shift (i.e. spatiotemporal location) of each color. Our spatiotemporal chromatic coupling can be considered as utilization of the spectrum phase as a degree of freedom for the coding. The color variations in the phase indicate the coupling between the color and the trajectory, as can be seen in Fig. 4.
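The statement that the amplitude stays 'white' while the phase becomes colored follows from the Fourier shift theorem, and a one-dimensional sketch is enough to see it. Two copies of the same small pulse, placed at different locations (standing in for two color components of the trace), are transformed with a plain DFT. The pulse shape and the two positions are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Plain O(n^2) discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [sum(v * cmath.exp(-2j * math.pi * k * i / n)
                for i, v in enumerate(signal))
            for k in range(n)]


def shifted_pulse(n, shift, pulse=(1.0, 2.0, 1.0)):
    """Length-n sequence containing `pulse` starting at index `shift`."""
    out = [0.0] * n
    for i, v in enumerate(pulse):
        out[(shift + i) % n] = v
    return out


N = 32
early = dft(shifted_pulse(N, 4))   # e.g. the component recorded early
late = dft(shifted_pulse(N, 12))   # e.g. the component recorded later

# Same shape at a different location: identical amplitude spectrum ...
amplitude_gap = max(abs(abs(a) - abs(b)) for a, b in zip(early, late))
# ... but a phase difference of -2*pi*k*shift/N at frequency index k.
phase_step = cmath.phase(late[1] / early[1])

print("largest amplitude difference:", amplitude_gap)
print("phase difference at k = 1:", phase_step, "expected:", -math.pi / 2)
```

Since each color component of the coded PSF sits at its own location in the (x, t) plane, the per-color spectra share their amplitude and differ in phase, which is the property visualized in Fig. 4.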

The color-coded motion deblurring network
As described in the previous section, the dynamic phase aperture coding generates color variations in the spatiotemporal blur kernel. These chromatic cues encode the different motion trajectories, without a limitation on the motion direction. These cues serve as prior information both for PSF estimation and image deblurring, thus allowing shift-variant motion deblurring.
Traditionally, spatially varying deblurring is performed in two stages: PSF estimation for the different objects/segments, followed by deblurring of each of them. As presented in [8,34], this task can be solved using a single CNN, trained with a dataset containing the various possibilities of the shift-variant blur. One may treat the CNN operation as an end-to-end process that extracts the cues, which allow the PSF estimation, and then utilizes the acquired PSF information for image deblurring.
Training data. To train such a CNN for our motion deblurring process, images containing moving objects blurred with our spatiotemporally varying blur kernel (and their corresponding sharp images) are needed. Since experimentally acquiring a motion-blurred image and its pixel-wise accurate sharp image is very complex (even without the dynamic aperture coding), an imaging simulation is used. Using the GoPro dataset [34], which contains high frame-rate videos of various scenes, we simulate images with the motion-color coded blur by blurring (using the coded kernel) consecutive frames and then summing them up. Sequences of 9 frames are used, and a dataset containing 2,500 images is created; 80% of it is used for training and the rest for validation and testing. Since our deblurring process is based on local cues encoded by our spatiotemporal kernel and not on the image statistics, a CNN trained on this synthetic data generalizes well to real-world images, as we show hereafter.
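The data synthesis step can be sketched as follows, using one-dimensional 'frames' to keep the example short. Each consecutive frame is blurred with the kernel that corresponds to its time index and color channel, and the results are averaged. The function names, the edge handling and the box kernels in the usage lines are hypothetical; the actual kernels come from the coded PSF, which is not reproduced here.

```python
def blur_1d(signal, kernel):
    """Same-size filtering with edge replication (kernel of odd length)."""
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - half, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out


def synthesize_coded_blur(frames, kernels_per_frame):
    """Average consecutive frames, each blurred with its own kernels.

    frames:            list of {"R": [...], "G": [...], "B": [...]}
    kernels_per_frame: list of {"R": kernel, "G": kernel, "B": kernel},
                       one entry per frame (time-dependent coding).
    """
    n_frames = len(frames)
    channels = list(frames[0].keys())
    length = len(frames[0][channels[0]])
    blurred = {c: [0.0] * length for c in channels}
    for frame, kernels in zip(frames, kernels_per_frame):
        for c in channels:
            for i, v in enumerate(blur_1d(frame[c], kernels[c])):
                blurred[c][i] += v / n_frames
    return blurred


def split_counts(n_images=2500, train_percent=80):
    """Number of training images and of validation/test images."""
    n_train = n_images * train_percent // 100
    return n_train, n_images - n_train


# Usage with placeholder data: 9 flat frames and normalized box kernels.
flat_frames = [{c: [0.5] * 16 for c in "RGB"} for _ in range(9)]
box_kernels = [{c: [1.0 / 3.0] * 3 for c in "RGB"} for _ in range(9)]
flat_result = synthesize_coded_blur(flat_frames, box_kernels)
print(split_counts())
```

Because every kernel is normalized and the frames are averaged, a flat scene stays flat, which is a quick sanity check for such a simulation.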
The deblurring network architecture. Since image restoration is sought, a fully-convolutional network (FCN) architecture is considered. As shown in the work of Nah et al. [34], multiscale processing is an efficient tool to grasp the structure of motion-blurred objects. Therefore, the network architecture we use is based on the known U-Net structure [38], as it is one of the leading multiscale FCN architectures. A skip-connection is added between the output and the input, leaving the 'U' structure to estimate the residual correction for the input image. Empirically, this simplifies the convergence (the full structure and details of the network are presented in the supplementary material).
The U-Net architecture is trained using patches of size 128x128 taken from the dataset described above. Since the final goal is to present the performance on images taken with a real camera, noise augmentation is used, with similar noise to the one observed in real images taken using the target camera (AWGN with σ = 9). The network is trained using the Huber loss [16], and the average reconstruction results on the test set are PSNR = 29.5, SSIM = 0.93. Examples of the reconstruction performance achieved on images from the test set in different cases are presented in the supplementary material.
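For reference, the loss, the noise augmentation and the PSNR measure mentioned above can be written in a few lines. This is a plain-Python sketch on flat lists of pixel values; the Huber threshold `delta` and the clipping to [0, 255] are assumptions, since the text does not state them.

```python
import math
import random


def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)


def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB for values on a [0, peak] scale."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def add_awgn(pixels, sigma=9.0, rng=None, peak=255.0):
    """Additive white Gaussian noise, clipped to the valid range."""
    rng = rng or random.Random()
    return [min(max(v + rng.gauss(0.0, sigma), 0.0), peak) for v in pixels]


clean = [128.0] * 1000
noisy = add_awgn(clean, sigma=9.0, rng=random.Random(0))
print("Huber loss of the noisy patch:", huber_loss(noisy, clean))
print("PSNR of the noisy patch [dB]:", psnr(noisy, clean))
```

The Huber loss behaves like a squared error for residuals below `delta` and like an absolute error above it, which makes it less sensitive to outlier pixels than a pure squared error.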
Ablation study. As an ablation study, we generated a version of the same dataset without our spatiotemporal coding, and trained the same architecture on it. In this case, we get significant over-fitting and poor results on the test set (PSNR = 24.6, SSIM = 0.84).
In another ablation test, we evaluate another network structure, which is similar to the one presented in [8]. Consecutive blocks of Conv-BN-ReLU (no pooling) with a direct skip connection from the input to the output are used. Such an architecture is also designed to learn the residual correction needed for the deblurring operation, but without multiscale operation. Lower performance is achieved with this network structure (PSNR = 27.5, SSIM = 0.9), probably because the multiscale operation is important for this task. However, this architecture is much shallower, with just 2% of the weights of the full U-Net model, and it still achieves comparable results to the model of [34] (see Section 5 for the comparison). This comparison demonstrates the benefit of the aperture coding: the encoded cues provide such strong guidance for the deblurring operation that a very shallow model achieves comparable performance to a much larger one. The full details on this model and test appear in the supplementary material.

Experiments
We start by evaluating our proposed method in simulation. Two different comparisons are presented; the first is to other computational imaging methods: the fluttered shutter camera [35] and the parabolic motion camera [25], demonstrating the advantages of our dynamic aperture phase coding vs. other coding methods. The second comparison is to the deblurring CNN presented by Nah et al. [34], which is designed for conventional cameras. Such a comparison illustrates the benefits of coded aperture. Following that, we present real-world results from our designed prototype.

Comparison to other coding methods
In order to demonstrate our PSF estimation ability in the motion deblurring process vs. the motion direction sensitivity of the other methods, a scene with a rotating spoke resolution target is simulated. Such a scene contains motion in all directions and at various velocities (according to the distance from the center of the spoke target) simultaneously.
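The reason a rotating target covers all directions and a range of velocities can be seen from the velocity field of a rigid rotation, sketched below. The angular velocity, the exposure time and the sample points are hypothetical.

```python
import math


def rotation_velocity(x, y, omega, center=(0.0, 0.0)):
    """Velocity (vx, vy) of a point on a target rotating at `omega` rad/s."""
    dx, dy = x - center[0], y - center[1]
    return (-omega * dy, omega * dx)


def blur_extent(x, y, omega, exposure, center=(0.0, 0.0)):
    """Approximate blur length: speed times exposure (small-angle regime)."""
    vx, vy = rotation_velocity(x, y, omega, center)
    return math.hypot(vx, vy) * exposure


OMEGA = 2.0      # hypothetical angular velocity [rad/s]
EXPOSURE = 0.05  # hypothetical exposure time [s]

for point in [(1.0, 0.0), (0.0, 3.0), (-2.0, 0.0)]:
    print(point, rotation_velocity(*point, OMEGA),
          blur_extent(*point, OMEGA, EXPOSURE))
```

The speed grows linearly with the distance from the center and the direction is always tangential, so a single frame of such a target samples every motion direction at several blur extents.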
The synthetic scene serves as an input to the imaging simulation for the three different methods (fluttered shutter, parabolic motion and ours). The fluttered shutter code being used (both in the imaging and reconstruction) is for motion to the right, in the extent of the linear motion of the outer parts of the spoke target. The parabolic motion takes place in the horizontal direction. Each imaging result is noised using AWGN with σ = 3 to simulate a real imaging scenario in good lighting conditions (since the fluttered shutter coding blocks 50% of the light throughput, the noise level of its image is practically doubled). Fig. 5 presents the deblurring results of the three different techniques (see footnote 1). The fluttered shutter based reconstruction restores the general form of the area with the corresponding motion coding (outer lower part, moving right), and some of the opposite direction (outer upper part, moving left), and fails on all other directions/velocities. This can be partially solved using a different coding that allows both PSF estimation and inversion. Yet, this introduces an estimation-invertibility trade-off. Note also that a rotating target is a challenging case for shift-variant PSF estimation, and thus, as can be seen, a restoration with an incorrect PSF leads to poor results. Moreover, the noise sensitivity of this approach is apparent, as it blocks 50% of the light throughput.
Footnote 1: The imaging simulations for the fluttered shutter and parabolic motion cameras were implemented by us, following the descriptions in [35,25]. The fluttered shutter reconstruction is performed using the code released by the authors. The parabolic motion reconstruction is performed using the Lucy-Richardson deconvolution algorithm [37,30], as suggested by the authors in Section 4.1 of [25]. As the authors stated, slightly better performance can be achieved using the original algorithm used in [25], but its implementation is not available. Moreover, much better results can probably be achieved for both methods using a CNN based reconstruction. However, the main issue in the current comparison is the sensitivity of the other coding methods to the motion direction, which is not related to the reconstruction algorithm used.
The parabolic-motion method achieves good reconstruction for the horizontal motion (both left and right), as can be seen in the upper and lower parts of the spoke (which move horizontally). Yet, notice that its performance is not the same for left/right (as any practical finite parabolic motion cannot generate a true motion invariant PSF). Also, both vertical motions are not coded properly, and therefore are not reconstructed well. Using our method, motion in all directions can be estimated, which allows a shift-variant blind deblurring of the scene.
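The Lucy-Richardson algorithm used for the parabolic motion reconstruction is a non-blind, iterative deconvolution with a known kernel. A minimal one-dimensional sketch is given below; the kernel, the test signal and the iteration count are hypothetical and are not those used in the comparison.

```python
def convolve_same(signal, kernel):
    """Same-size convolution with zero padding (kernel of odd length)."""
    half = len(kernel) // 2
    n = len(signal)
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i - (j - half)
            if 0 <= idx < n:
                acc += w * signal[idx]
        out[i] = acc
    return out


def richardson_lucy(observed, kernel, n_iter=200, eps=1e-12):
    """Multiplicative Lucy-Richardson updates for a known, normalized kernel."""
    flipped = kernel[::-1]
    start = max(sum(observed) / len(observed), eps)
    estimate = [start] * len(observed)
    for _ in range(n_iter):
        reblurred = convolve_same(estimate, kernel)
        ratio = [o / max(r, eps) for o, r in zip(observed, reblurred)]
        correction = convolve_same(ratio, flipped)
        estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two point sources blurred by a known (hypothetical) kernel.
sharp = [0.0] * 32
sharp[12], sharp[20] = 1.0, 0.5
kernel = [0.1, 0.2, 0.4, 0.2, 0.1]
observed = convolve_same(sharp, kernel)
restored = richardson_lucy(observed, kernel)

print("error before deconvolution:", squared_error(observed, sharp))
print("error after deconvolution: ", squared_error(restored, sharp))
```

The update keeps the estimate non-negative, but it relies entirely on the kernel being correct, which is why a motion-variant PSF in the non-coded directions degrades the reconstruction regardless of the algorithm.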

Comparison to blind deblurring
To analyze the advantages of motion-cue coding, our method is compared to the multiscale motion deblurring CNN presented by Nah et al. [34]. The test set of the GoPro dataset is used as the input. Since Nah et al. trained their model on sequences of 7-13 frames, similar scenes were created using both our coding method and simple frame summation (as used in [34], with the proper gamma-related transformations). Note that in our case, a spatial (diffraction related) blur is added to the motion blur, so our model is handling a more challenging task.
The reconstruction results are compared for several noise levels, σ ∈ [0, 3] on a [0, 255] scale (the reference method was trained with σ = 2). The measures for each motion length are averaged over the different noise levels, and the results are displayed in Table 1. As can be clearly seen, our method provides an advantage in the recovery error over the method of Nah et al. [34] in both PSNR and SSIM (visual reconstruction results are presented in the supplementary material). For small motion lengths, both methods provide visually pleasing restorations (though our method is more accurate in terms of PSNR/SSIM). Yet, as the motion length increases our improvement becomes more significant. This can be explained by the fact that the architecture used in [34] is trained using an adversarial loss, and therefore inherent data hallucination occurs in their reconstruction. As the motion length gets larger, such data hallucination is less accurate, and therefore the reduction in their PSNR/SSIM performance is more significant. Our method employs the encoded motion cues for the reconstruction, therefore providing more accurate results.
Note also that our model is trained only on images generated using sequences of 9 frames, and the results for shorter/longer sequences are still better than the model from [34] that is trained on sequences in all this range. This clearly shows that our model has learned to extract the color-motion cues and utilize them for the image deblurring, beyond the specific extent present in the training data. Note that in our dataset an additional diffraction-related spatial blur is added (as mentioned above), so in a case that a similar spatial blur is added to the original GoPro dataset (without the motion-color cues), our advantage over [34] is expected to be even larger. Note also that our network converges well and provides good performance for higher noise levels (as presented in the following); however, for a fair comparison to [34] we limit the noise level here to σ = 3.
Table 1. (Surviving column headers: N_frames, Nah et al.; the remaining table content is missing.)

Figure 6. The table-top experimental setup: the liquid lens and our phase-mask are incorporated in the C-mount lens. The microcontroller synchronizes the focus variation to the frame exposure using the camera flash signal.

Table-top experiment
Following the simulation results, a real-world setup is built (see Fig. 6). A C-mount lens with f = 12 mm is mounted on an 18MP camera with a pixel size of 1.25 µm. A phase-mask similar to the one used in [14] and a liquid focusing lens are incorporated in the aperture plane of the main lens. A signal from the camera indicating the start of the exposure (originally designed for flash activation) is used to trigger the liquid lens to perform the required focus variation (a detailed description of the experimental setup is presented in the supplementary material). The liquid lens is calibrated to introduce a focus variation equivalent to ψ = [0, 8] during exposure.
The first real-world test validates the desired PSF spatiotemporal encoding. Two white LEDs are mounted on a spinning wheel and act as point sources, similar to the point sources simulated in Fig. 3. A motion-blurred image of the spinning LEDs is acquired, with the phase-mask incorporated in the lens and the proper focus variation during exposure. A zoom-in on one of the LEDs is presented in Fig. 7. The gradual color change along the motion trajectory is clearly visible. The full image including both LEDs is presented in the supplementary material.
Following the PSF validation experiment, a deblurring experiment on moving objects is carried out. In order to examine various motion directions and velocities at once, a rotating object is used. For reference, an image of the same object is captured with a conventional camera (i.e. the same camera with a fixed focus and without the phase-mask), and deblurred using the multiscale motion blur CNN (Nah et al. [34]). Results on a rotating photo of a view of Seattle are presented in Fig. 8.
In addition to the rotating target test, a linearly moving object (a toy train) is also captured, in the same configuration as the previous example. The results are presented in Fig. 9. In both cases, our camera provides visibly better results than the conventional camera followed by [34]. The full results of the experiments above, along with additional experiments and demonstrations, are provided in the supplementary material.

Conclusion
A computational imaging approach for motion deblurring is presented. The method is based on spatiotemporal phase coding of the lens aperture, to achieve a motion-variant PSF. The phase coding is achieved using two components: (i) the static/spatial part, a phase-mask designed to code the PSF to have a joint color-defocus dependency; and (ii) the dynamic/temporal part, a gradual variation of the focus setting performed during the image exposure. Jointly, these coding mechanisms achieve a motion-variant PSF, which is exhibited in a gradual color change of the blur along the motion trajectory. Such a PSF encodes cues to the motion extent and velocity in the acquired image. These cues are in turn utilized for the motion deblurring process, implemented using a CNN model. The CNN operation encapsulates both the PSF estimation and the spatially-variant motion deblurring.
Our approach is compared to blind deblurring methods and computational imaging based strategies. Its shift-variant PSF estimation ability and generalization potential to real-world scenes are analyzed and discussed. Our technique achieves better performance compared to the other solutions in various scenarios, without imposing a limitation on the motion direction. An experimental setup implementing the proposed method is presented, and the spatiotemporal PSF color encoding is validated in a real-world experiment. In addition, as our encoding provides cues to the entire motion trajectory, our approach holds potential for video-from-motion and temporal super-resolution applications, similar to [18,12,27,15,28].

A. PSF spectral analysis
As presented in Section 3 of the paper, a PSF spectral analysis is performed to analyze the differences between the different coding methods. The comparison is performed using the spatiotemporal Fourier analysis model proposed in [25]. In this model, a single spatial dimension is examined vs. the temporal dimension, and a 2D Fourier Transform (FT) is computed over this (x, t) plane (which is a slice of the full (x, y, t) space). In such a setting, different velocities of a point source form lines at different angles in the (x, t) plane. The analysis in [25] included only the spectrum amplitude, but in our analysis we also include its phase, since our encoding is also phase dependent, as presented in the paper and further examined next.
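The (x, t) analysis described above can be illustrated with a short NumPy sketch. It is an illustration under stated assumptions, not the analysis code of the paper or of [25]: the 64 × 64 grid, the single-pixel ideal point source, and the helper names `xt_slice` and `spectrum` are all hypothetical. For a color-coded PSF, the same computation would be repeated per color channel.

```python
import numpy as np

def xt_slice(velocity, n_x=64, n_t=64):
    """(x, t) slice of an ideal point source; rows are time, columns are x.

    `velocity` is in pixels per time step, so 0 gives a vertical line and
    other values give lines at other angles.
    """
    psf = np.zeros((n_t, n_x))
    t = np.arange(n_t)
    x = np.round(n_x / 2 + velocity * (t - n_t / 2)).astype(int)
    inside = (x >= 0) & (x < n_x)          # drop samples that leave the grid
    psf[t[inside], x[inside]] = 1.0
    return psf

def spectrum(psf_xt):
    """Centered 2D FT of the slice, returned as (amplitude, phase)."""
    spec = np.fft.fftshift(np.fft.fft2(psf_xt))
    return np.abs(spec), np.angle(spec)

amp_static, phase_static = spectrum(xt_slice(0.0))   # static point
amp_moving, phase_moving = spectrum(xt_slice(0.5))   # slowly moving point
```

Note that the phase returned by `np.angle` is wrapped to (-π, π] and is only meaningful where the amplitude is non-negligible.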
We start by comparing all methods on a static point, represented by a vertical line in the (x, t) plane (see Fig. 10, which is the full version of the comparison presented in the paper). In the three reference methods, the PSF is 'gray' (i.e. has no chromatic shift along its trajectory), and therefore the spectrum phase is also gray.
Figure 10. PSF spectral analysis. PSFs and the corresponding spectra for a static point source captured using (top to bottom) static camera, parabolic motion camera, fluttered-shutter camera and our method. (a) (x, t) slice of PSF and its (b) amplitude and (c) phase in Fourier domain.

As discussed in the paper, our proposed PSF can be considered as an infinite sequence of smaller PSFs, each one of a different color. As all PSFs have a similar spatial shape, but each has a different color and a different location in the (x, t) plane, the spectrum amplitude is 'white' and similar to the spectrum amplitude of the conventional PSF. Yet, the phase (which holds the shift information) is colored, according to the shift (i.e. spatiotemporal location) of each color. Our spatiotemporal chromatic coupling can be considered as a utilization of the spectrum phase as a degree of freedom for the coding. The color variations in the phase indicate the coupling between the color and the trajectory, as can be seen in Fig. 10 (note that vertical artifacts in the phase plots are due to errors of the phase unwrapping method used in the process). Additional comparisons presenting different velocities (which correspond to different angles in the (x, t) space) are presented in Figs. 11-12. In the conventional camera, no information is encoded in the phase (as can be seen, the three phases are almost the same). The parabolic motion camera is designed to generate a motion-invariant PSF, therefore its phase also holds little information (the minor differences are due to the fact that the PSF is not fully motion invariant, due to the finite parabolic motion). Using our method, the different velocities of the source are coded in the differently colored patterns of the spectrum phase. Note that the phase of the fluttered-shutter camera PSF indeed contains some motion estimation information (i.e. the temporal code generates phase variations), but this ability comes with an estimation-invertibility trade-off, as mentioned in the paper and discussed in [1]. In our method, where the color space is utilized for encoding, the motion cues are much stronger and allow improved PSF estimation while preserving PSF invertibility, since in each part of the motion at least part of the spectrum is sharp and can serve as a guide to reconstruct the blurred colors.
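The amplitude/phase behaviour described above follows from the Fourier shift theorem. As a brief sketch (the symbols $h$, $H$, $x_c$, $t_c$ and the indexing of color components by $c$ are notation assumed here for illustration, not taken from the paper), model the PSF of color component $c$ as a common kernel $h$ shifted to its own location $(x_c, t_c)$ in the (x, t) plane:

```latex
\begin{align}
h_c(x, t) &= h(x - x_c,\; t - t_c), \\
H_c(f_x, f_t) &= H(f_x, f_t)\, e^{-j 2\pi (f_x x_c + f_t t_c)}, \\
\lvert H_c(f_x, f_t) \rvert &= \lvert H(f_x, f_t) \rvert, \qquad
\angle H_c(f_x, f_t) = \angle H(f_x, f_t) - 2\pi (f_x x_c + f_t t_c) \pmod{2\pi}.
\end{align}
```

Under this simplified model every color component has the same spectrum amplitude, while its phase carries a linear term set by its spatiotemporal shift, which is consistent with the 'white' amplitude and colored phase discussed above.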

B. CNN structure and details

B.1. U-Net model
As discussed in the paper, our proposed architecture is based on the known U-net architecture [38]. The U-net model includes several downsampling blocks with their corresponding upsampling operations, which concatenate to their output the input of the same-scale downsampling block, thus allowing multiscale processing. We use the U-net model available at https://github.com/milesial/Pytorch-UNet, with several modifications. A skip-connection is added between the input and the output, thus letting the 'U' structure estimate a 'residual' correction to the input blurred image (empirically, this change allows a much faster convergence).
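The input-to-output skip connection described above can be sketched in PyTorch as follows. This is only an illustration of the residual wrapping, not the authors' model: the inner network is a small stand-in (one double Conv-BN-LeakyReLU block, as detailed in the next paragraph, plus a 1 × 1 output convolution) rather than the full four-level U-net, and the class names, channel widths and Leaky-ReLU slope are assumptions.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two Conv(3x3, no bias) -> BatchNorm -> LeakyReLU stages."""

    def __init__(self, in_ch, out_ch, slope=0.2):  # slope is an assumed value
        super().__init__()
        layers = []
        for c_in in (in_ch, out_ch):
            layers += [
                # padding=1 keeps the spatial size for 3x3 filters
                nn.Conv2d(c_in, out_ch, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(slope, inplace=True),
            ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class ResidualWrapper(nn.Module):
    """Adds the blurred input to the body's output, so the body only has
    to estimate a residual correction."""

    def __init__(self, body):
        super().__init__()
        self.body = body

    def forward(self, blurred):
        return blurred + self.body(blurred)

# Stand-in body; a full U-net with 3 input and 3 output channels would go here.
model = ResidualWrapper(
    nn.Sequential(DoubleConv(3, 16), nn.Conv2d(16, 3, kernel_size=1))
)
```

With this wrapping, a body that outputs zeros reproduces the blurred input unchanged, which gives the optimization a reasonable starting point and may explain the faster convergence reported above.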
The net contains four downsampling blocks and their corresponding four upsampling blocks, with additional convolutions at the input and output. Each convolution operation consists of 3 × 3 filters (with proper padding to keep the original input size, and without bias), followed by a BatchNorm layer and a Leaky-ReLU activation. Each CONV block contains a double Conv-BN-ReLU sequence.

controlled motion 'in the wild', only our method is used, without a reference conventional camera. In such scenes, the motion occurs in every direction and at various velocities. One can see (especially in the zoom-ins in Fig. 21) that our CNN is able to reconstruct the objects moving at different velocities and in different directions, while also deblurring the static parts (which also get some blur in the coding process). Notice that we do not re-train the reconstruction network, but rather use the same one used in all other experiments (which was trained on the GoPro dataset with the color-coded motion simulation). Thus, in areas where the motion extent is quite large, the reconstruction performance decreases. The reason is that our model is trained with data that contains a limited velocity range; inside this range, the reconstruction results are good, while in areas with motion beyond this limit, the reconstruction ability is limited. This is an inherent trade-off in our method: as the velocity range in the dataset gets larger, the CNN can handle a longer motion extent. However, the cost is some compromise in the reconstruction performance for the slower motion range.

therefore the motion extent scale in pixel terms is different.

Figure 2. Motion blurred PSF simulation: (left) conventional camera, (middle) gradual focus variation in conventional camera and (right) the proposed camera, i.e. gradual focus variation with phase aperture coding.
Figure 3. Simulation of the different coding methods: (a) first frame (arrows indicate dots path and velocity), (b) last frame, (c) conventional static camera, (d) fluttered shutter camera [35], (e) parabolic motion camera [25] and (f) our proposed camera. (The imaging is performed on single-pixel dots to simulate point sources. For visualization purposes, dilation and gamma correction are applied.)

Figure 4. PSF spectral analysis. PSFs and the corresponding spectra of a (top) static camera and (bottom) our method. (a) (x, t) slice of PSF and its (b) amplitude and (c) phase in Fourier domain.
Figure 5. Simulation results of rotating target: (a) rotating target and the reconstruction results for (b) fluttered-shutter, (c) parabolic motion camera and (d) our method.

Figure 7. Experimental validation of PSF coding: a moving white LED, captured with our camera, validates the required PSF encoding.

Figure 11. PSF spectral analysis. PSFs and the corresponding spectra for a moving point source (the velocity is indicated by the angle in the (x, t) plane) captured using (top to bottom) static camera, parabolic motion camera, fluttered-shutter camera and our method. (a) (x, t) slice of PSF and its (b) amplitude and (c) phase in Fourier domain.

Figure 20. PSF encoding validation experiment: two rotating white LEDs simulating point sources captured using our camera. The color-coded motion trace indicates both direction and velocity.

Figure 21. Outdoor experiment: (left) full outdoor scenes with marks on magnified areas, and zoom-ins on (middle) intermediate image and (right) reconstruction results.