Optical Flow Estimation by Matching Time Surface with Event-Based Cameras

In this work, we propose a novel method of estimating optical flow from event-based cameras by matching the time surface of events. The proposed loss function measures the timestamp consistency between the time surface formed by the latest timestamp of each pixel and the one that is slightly shifted in time. This makes it possible to estimate dense optical flows with high accuracy without restoring luminance or additional sensor information. In the experiment, we show that the gradient was more correct and the loss landscape was more stable than the variance loss in the motion compensation approach. In addition, we show that the optical flow can be estimated with high accuracy by optimization with L1 smoothness regularization using publicly available datasets.


Introduction
Event-based cameras are bio-inspired vision sensors that asynchronously output perpixel brightness changes as the event stream instead of video frames [1]. The asynchronous nature of the event camera offers several advantages over traditional cameras, such as a high temporal resolution, a high dynamic range and low latency, which make event cameras more popular in many domains, such as mosaicing and tracking [2], 3D reconstruction [3], and high frame rate video generation [4]. Event-based cameras are also suitable for optical flow estimation since the precise timestamp at pixel-level intensity changes directly encode fine grain motion information.
However, the novel output of the camera, which is completely different from an image, provides new challenges in algorithm development. Several attempts have been made to apply a technique using spatiotemporal image derivatives and an assumption of brightness constancy [5,6] to event-based vision. Benosman et al. [7] and Brosch et al. [8] proposed that the spatial image derivative was approximated using the integration of events and applied to the optical flow constraint. Bardow et al. [9] estimated optical flow while simultaneously restoring image brightness from events only. It is preferable to utilize the precise timing information than to approximate the image gradient or restore the brightness using events for which absolute brightness information is lost.
Recently, motion compensation approaches that estimate the optical flow by searching for the trajectory that maximizes event alignment [10] have been developed. The best trajectory is obtained by maximizing the objective function that measures the edge sharpness of the image of warped events (IWE). However, Zhu et al. [11] pointed out that the variance loss leads to the overfitting of the networks by predicting the optical flows that make the events overlap in a line.
We also examined this phenomenon by visualizing the landscape and gradient of the loss function in various textures. We found that the gradient became larger in the direction pushing events to a line and the loss landscape was unstable around a line. Furthermore, the gradient is the spatial derivative of the surface formed by the warped event timestamp and may be incorrect where the surface is not continuous (e.g., around the latest event). As in the experiment in [10,12], if the optical flow parameter R 2 is common in the patch, the effect is small because the average is taken, but the effect is large in the pixel-level parameterization.
What we hope to achieve in this research is to estimate pixel-wise optical flow from events only, without additional sensor information, and without restoring the luminance. We focused on the time surface that retains the timestamp of the latest event at each pixel. The time surface is an aggregation of spatio-temporal events in the image form while keeping accurate timestamps that encodes motion information. To estimate the dense optical flow from the time surface, we propose Surface Matching Loss, which evaluates the timestamp consistency of the time surface between two times. By optimizing the proposed loss function with the smoothness regularization, dense optical flows can be estimated. An overview of the proposed method is shown in Figure 1. In the experiment, the landscape of our loss function was shown to be gentle in any direction for any texture. In addition, we evaluated dense optical flow estimated with L1 smoothness regularization. The accuracy of the estimated optical flow was higher than with conventional methods, such as simultaneous estimation with image restoration [9] and the contrast maximization approach [10,12].
Our main contributions are summarized as follows: • We propose the loss function measuring the timestamp consistency of the time surface for optical flow estimation using event-based cameras. This proposed loss function makes it possible to estimate dense optical flows without explicitly reconstructing image intensity or utilizing additional sensor information. • Visualizing the loss landscape, we show that our loss is more stable regardless of the texture than the variance used in the motion compensation framework. Alongside this, we also show that the gradient is calculated in the correct direction in our method even around a line segment. • We evaluate the dense optical flow estimated by optimization with L1 smoothness regularization. Our method recodes with higher accuracy compared with the conventional methods in the various scenes from the two publicly available datasets.

Related Work
There have been several works on estimating optical flows from event data in a novel data format. Benosman et al. [15] show that optical flow can be estimated directly from timing information by fitting a plane in x − y − t space. This approach succeeds in simple scenes with sharp edges, but fails in more complicated textures [16].
Benosman et al. [7] and Brosch et al. [8] applied the luminance gradient, which is approximated using the summation of events in the local spatiotemporal patch to the optical flow constraint [5,6]. Bardow et al. [9] proposed a method for simultaneous optical flow and intensity estimation from only event data. The proposed cost function is defined within a large sliding window containing a large number of images and optical flow parameters, and includes spatiotemporal smoothness regularization. However, the intermediation of the luminance information does not take full advantage of the precise timestamp of event-based cameras.
Several learning-based approaches [11,17,18] also exist. Zhu et al. [17] proposed a method for self-supervised learning for neural networks by photometric loss using grayscale images taken at the same time as events by Dynamic and Active-pixel Vision Sensor (DAVIS) [14]. It seems to be unable to learn in blacked-out or blurred scenes where brightness images can't be taken. Ye et al. [18] also proposed a method for calculating photometric loss by warping event slices using optical flows generated from the depth and ego-motion output on the network.
The most recently studied method is the motion compensation framework [10,11,[19][20][21][22], maximizing event alignment along point trajectories on the image plane. Event alignment is measured by various indices, such as the association probability of events in the EM algorithm [20], the number of overlapping events per pixel [21], the contrast of the IWE [10], and the average timestamp [11,22].
Zhu et al. [11] pointed out that the loss functions evaluating the image contrast [10] lead to the overfitting of the networks by estimating the optical flow that pushes the events into a line. Gallego et al. [12] investigated more than twenty loss functions to analyze event alignment. They reported that the accuracy of the average timestamp [11,22] was low, and the variance, the gradient and magnitude, and the Laplacian magnitudes are the best functions. Stoffregen et al. [23] analyzed the properties of contrast maximization reward functions and showed aperture uncertainty depending on data.
The gradient in the contrast maximization framework for optical flow estimation (refer to Equation (12) in Section 4.2) is calculated like the derivative of the time surface. In this calculation, the surface around the latest timestamp is spatially differentiated at discontinuous points, resulting in an incorrect gradient. In case of patch-based optical flow, the effect is small because the average is taken, but the effect is large in the pixel-level parameterization. In addition, we found that the gradient that makes events overlap in a line may be larger.
The aim of this study is to examine a method for estimating dense optical flow from events only, without additional sensor information and without recovering luminance. We focused on the common data representation of events, the time surface, which keeps the latest timestamp in the form of an image. To estimate the dense optical flow from the time surface, we propose a new Surface Matching Loss that measures the consistency of the timestamps on the time surface between two times. With this method, the gradient is more stable than that of the variance loss in any texture, and the loss landscape is gentle in any direction.

Event Representation
Event-based cameras output events in response to changes in brightness for each pixel independently. An event e k = (x k , t k , p k ) is generated if the change in logarithmic brightness at a certain pixel reaches a pre-defined threshold. x k , t k , p k ∈ {+, −} represent pixel coordinates, the microsecond timestamp and polarity (i.e., sign of the brightness change), respectively.

Time Surface
The data representation commonly used in event-based vision is the surface of active events [15], also referred as time surface [24,25]. The time surface at a pixel x = (x, y) and polarity p is defined as where t is the timestamp of the latest event with polarity p which is generated at pixel x.
The time surface is a summary of the event-stream at each instant, which enables a scalable event-processing.

Surface Matching Loss
The event camera generates a trajectory of events in x − y − t space by the motion of an object. The time surface maintains the latest timestamp of the trajectory for each pixel. Consider the linear motion of an object over a very short time interval ∆t within a rigid scene, represented by an arrow, as in Figure 2a. Using the displacement v(x) in time duration ∆t at pixel x, the following equation is derived from the consistency of the timestamp: The right-hand side can be considered to be the time surface shifted by ∆t as shown in Figure 2b. Since the time surface consists of timestamps, it can be shifted in time at any time interval. The shifted time surface S p is defined as follows: The middle of Figure 1 shows a time surface S p and a shifted time surface S p at each polarity with actual event data [13]. Rewriting Equation (2) with S p (x) gives: This equation means there is timestamp consistency between the time surface S p and the shifted time surface S p . Practically, the time surface is composed of a collection of events for a certain duration τ until a reference time t 0 . Thus, the value range of the time surface is as follows: Equation (4) includes the non-linear term S p (x + v(x)) and it can be linearized around a close approximation v 0 to v using Taylor expansions as with the image-based TV-L1 optical flow estimation [26,27] as follows: ∇S p represents the spatial derivative of the shifted time surface. The Surface Matching Loss is defined as summing for all pixels and each polarity after applying the L1 norm: In addition, since this loss is ill-posed alone, we add the following smoothness regularization term: In general, optical flows are smoothly varying and have discontinuities at object edges. The smoothness regularization term aims to regularize the flow by reducing the difference between neighboring pixels. We chose the L1 norm, which constrains the optical flow smoothly while allowing for discontinuities in the optical flow. The L1 norm, which is generally considered to be robust to an outlier, is adopted to reduce the influence of the noise on the time surface. The estimated optical flow is obtained by minimizing this cost function as follows: Note that the estimated optical flow v is at the 3-D point (x, S p (x)) on the time surface corresponding to the starting point of the arrow in Figure 2. That is, the reference time of the optical flow is not (t 0 − τ), but the value of the time surface S p (x). Figure 3 shows a one-dimensional time surface for the varying motion speed for the duration of τ. As shown in the figure, the constant motion assumption in τ is not necessary because the optical flow is determined on the time surface. The assumption of constant speed is only necessary for the duration ∆t to calculate the optical flow. It is easy to satisfy the assumption because ∆t can be arbitrarily short, unlike image-based methods which have a fixed frame rate.

Comparison with Contrast Maximization
The most relevant to our work is the contrast maximization framework [10], which estimates the optical flow by searching for the trajectory that maximizes event alignment.
In the contrast maximization, the IWE is created by warping events with the candidate velocity v and integrating them in an image as follows: where v is the warped event and N e is the number of the events. The variance of the IWE measures how well events agree with the candidate point trajectories as follows: where is the mean of the IWE and N p is the number of pixels. By maximizing the variance of the IWE, the best trajectory can be obtained. Figure 4 shows an overview of the contrast maximization approach [10] and the proposed surface-matching approach. Both of them measure the alignment of events across time. In the surface-matching approach, alignment of events is evaluated by the timestamp consistency referring to the fixed previous time surface S under the motion smoothness assumption in a rigid scene. On the other hand, there is no reference value in the variance loss, and alignment is implicitly measured by the contrast of the IWE generated from warped events. In the contrast maximization, alignment of events is implicitly measured by the contrast of the image of warped events (IWE) generated from warped events. In our surface-matching approach, alignment is evaluated by the timestamp consistency referring to the fixed previous time surface S. The green dots indicate the events in x − t space.

Experiment
In order to verify the effectiveness of the proposed loss function, experiments about the loss landscape (Section 4.3) and optical flow estimation (Section 4.3) were performed.

ESIM
An Open Event Camera Simulator (ESIM) [28] can accurately and efficiently simulate an event-based camera and output a set of events and the ground truth of optical flows with any camera motion and scene. MVSEC The Multi-Vehicle Stereo Event Camera Dataset (MVSEC) [29] contains an outdoor driving scene-by day and by night-and an indoor flight scene by the drone. The event-based camera used is mDAVIS-346B with a resolution of 346 × 260, which can capture general images simultaneously. The dataset provides the ground truth optical flow generated from depth maps by LiDAR and poses information by the Inertial Measurement Unit (IMU). HACD The HVGA ATIS Corner Dataset (HACD) [25] is built with a recording of planar patterns to evaluate corner detectors. Those sequences were taken by an Asynchronous Time-based Image Sensor (ATIS) [30] with a resolution of 480 × 360. It also contains the position of markers at four corners of the poster, each 10 ms.
With this information, the homography of the plane can be calculated, and the ground truth optical flow at any point on the poster can be obtained.

Loss Landscape
As in the experiment performed in [12], we investigated the shape of the loss function in the parameter space v ∈ R 2 when the optical flow is common at all pixels. In addition, the gradient for each pixel at a certain initial parameter was visualized. The gradients are simply compared, except for regularization.
Variance The variance loss represented by Equation (11). Events are warped by the optical flow v ∈ R 2 which is common at all pixels. The duration of the events used and the time interval of the optical flow were set to ∆t + τ and ∆t in order to match our method with the condition. Surface Matching Loss The proposed loss function represented by Equation (7). This loss is calculated by the difference between the time surface S p and the shifted time surface S p warped by the optical flow. The sign has been inverted to match the variance. Event data is generated by ESIM in the translation on the image plane with various textures such as rectangle, checkerboard, brick and grass. The spatial resolution of the image and the patch size is set to 240 × 180 and 30.

Results and discussions
Consider the situation where a rectangle is translated on the image plane by a movement vector v = (2, −1) (pix/∆t) as shown in Figure 5a. Figure 5b visualizes the landscape of the loss function in the magenta patch. Surface Matching Loss is wider in all directions, whereas the variance has a narrow peak.
Here, we compare the gradients of the two methods. The partial derivative of v with respect to the IWE is described as follows (refer to Equation (39) in [12] or the appendix of [19]): . When the motion model in the contrast maximization is 2-D optical flow, the motion model Jacobians ∂x k (v)/∂v are equivalent to the timestamp. Equation (12) is the operation of the spatial derivative of the time surface after warping the events. Therefore, where the time surface is discontinuous, an erroneous gradient is calculated. As the parameter v approaches the true value, the gradient increase since the time surface of warped events becomes steep and the many events overlap. On the other hand, the main gradient of our Surface Matching Loss is ∂L surface /∂v ∝ ρ p ∇S p , which is the element product of the temporal difference and the spatial derivative of the time surface. In surface matching, as the parameter v approaches the true value, the gradient becomes gentler as the difference between the two surfaces becomes smaller.
The gradients in the patch are shown in Figure 5c,d with magenta arrows in the enlarged IWE and time surface. Figure 5c shows the situation when v = (1, 0) . It should be noted that the gradient in the x direction is further increased by integrating events in the places where many events overlap at the right side of the rectangle. This can make it difficult to reach a global optimal solution. In addition, it can be seen that the gradients of the variance loss are reversed at the right discontinuous edge formed by the new time stamp. On the other hand, in Surface Matching, the gradient is not reverse even at the edge of the surface and has the same length in the x and y direction. Figure 5c shows the situation when v = (2, 0) , which is the optimal solution in x direction. However, there are largely reversed gradients on both sides of the event overlap. They cancel out on average if the image has two parameters, but can be unstable in pixel-wise estimation. On the other hand, Surface Matching Loss has no cost and no gradient at the place where it is properly matched, and the gradient in the y direction is relatively large.
In addition, the landscape of the loss function in scenes of checkerboard, brick, and grass are visualized in the Figure 6. In all scenes, variance has narrow peaks, while surface matching loss shows a gentle and stable landscape in all directions. In the grass scene, the landscape of Variance is almost elliptical, but in checkerboard and brick, the landscape is narrow along with the line texture.

Dense Optical Flow Estimation
We examine the effectiveness of the loss function by estimating dense optical flows. In this experiment, we focused on online optimization methods rather than learning-based methods, simply to compare loss functions excluding other factors. It should be noted that the proposed loss function can be used in combination with learning-based methods. The comparison was performed for the following optical flow estimation methods: Reconstruction The method for estimating optical flow simultaneously with luminance restoration [9]. A large number of optical flows and image parameters were optimized under a temporal smoothness assumption sequentially in the sliding window containing the events with a duration of 128 ∆t. Variance The method of maximizing a variance of the IWE [10]. In [10], the optical flow parameters were common in the patches, but in order to make the conditions uniform, a dense optical flow estimation was performed by adding the L1 smoothness regularization as follows: ; v(x)))). (13) Events were warped by the optical flow at each pixel: The loss function is optimized by the primal-dual algorithm in the same way as the TV-L1 method [31]. The duration of the events used was set to ∆t + τ to match our method.
Surface Matching Our proposed method rewritten as follows: The parts of the time surface where no event occurred for τ were filled with 0. Since S + and S − have common optical flow parameters, optimization was performed for different polarities for simplicity. The loss function was optimized by the primal-dual algorithm in the same way as the images-based TV-L1 optical flow [27]. The given displacement field v 0 must be close to v, so that the approximation error of the Taylor expansions in Equation (6) is small. In practice, v 0 is given by the current estimate of v. For the first iteration, we set the initial value of v 0 to zero. The time surface S p , S p ∈ [t 0 − τ, t 0 ] was normalized to [0, 255] and convolved with a Gaussian kernel of σ = 0.8.
The Graphics Processing Unit (GPU) was used only as a method of Reconstruction. Reconstruction and Variance are our implementation, not the author implementation. The datasets used in this experiment are MVSEC and HACD composed of real data. The scene used are outdoor_day1, 2, outdoor_night1, 2, 3 and indoor_flying1, 2, 3 from MVSEC, and guernica_2, paris_2 and graffiti_2 from HACD. The duration of the optical flow ∆t was set to 7.5 ms for outdoor scenes and 10 ms for indoor scenes from MVSEC, and to 5 ms for all scenes from HACD. The time width τ of the effective time surface was set to 10∆t in all scenes. In the experiments, ∆t and τ were determined empirically for each scene. The regularization parameter λ was set to 0.15.
The average end-point error (AEE) was used as an evaluation metric: v est and v gt indicate the estimated optical flow and the ground truth optical flow, respectively. The AEE was calculated only at the location where the event occurred. Furthermore, in MVSEC, the car reflection part at the bottom of the image was excluded and, in HACD, the outside of the poster was excluded. This is because they do not fit the model of generating the ground truth optical flow. N is the number of the valid pixels.

Qualitative result
The estimated optical flows in MVSEC and HACD are shown as Figure 7 and 8, respectively. In the Variance method in all scenes, the gradient is estimated in different directions at some places because the time surface differentiates at discontinuous points.

Quantitative result
The result on the quantitative evaluation with the AEE is shown as Table 1. Our method is superior in two of the five sequences in MVSEC, all of the scenes in indoor_flying and HACD. Table 1. Quantitative results of optical flow estimation. The evaluation metric is the AEE (pix/∆t).

Day 1 Day 2 Night 1 Night 2 Night 3 Flying 1 Flying 2 Flying 3 Guernica Paris Graffiti
Reconstruction [9] 0 We consider the scene in which surface matching was inferior to reconstruction. Figure 9 shows the optical flow from the outdoor_day2 sequence. In the outdoor sequences of MVSEC, the optical flow is estimated up and down by the vibration of the car due to the unevenness of the road. The ground truth optical flow used the velocity data smoothed by the central moving average filter, so the effects of the vertical vibration were reduced [17]. Reconstruction used temporal smoothness regularization with a long time sliding window of 128∆t, which reduced the effects of vibration and increased the evaluation metric. Figure 10 shows the AEE and the pitch rate from IMU in the part of the outdoor_day2 scene. In the left part of the graph, where there is less vertical vibration on the ground, ours has a smaller error. The accuracy of our method can also be improved by adding temporal smoothness constraints. Conversely, the optical flow with Reconstruction is estimated to be small in HACD scenes where the direction of motion changes drastically.
The runtime was about 3 fps in a CPU-based MATLAB; however, the real-time implementation is possible because the optimization framework of the proposed method is the same as that of the image-based TV-L1 optical flow, which was implemented in real-time on the GPU [26]. Furthermore, in contrast to the image-based optical flow, which solves the problem coarse to fine by constructing a spatial pyramid due to the large frame interval, ∆t can be arbitrarily small in the proposed method, eliminating the need for a spatial pyramid and enabling faster estimation.

Study on Hyperparameters
In our method, the effective time of the time surface τ and the duration of the optical flow ∆t are important hyperparameters. It is preferable that ∆t is as short as possible due to the optical flow constraint while the movement occurs. τ represents the time range of the time surface, and there is no constraint, however, if it is too large, the object will appear to have a long exposure time, and occlusion will occur. For this reason, we investigated the accuracy of optical flow estimation when ∆t and τ were changed. The guernica scene was used for the evaluation.
The result of the study on hyperparameters is shown in Table 2. Looking at the τ = 50 ms column, the shorter ∆t is, the higher the accuracy because the shorter time interval satisfies the constant motion conditions as hypothesized. In the ∆t = 5 ms row, the shorter τ is, the higher the accuracy because the ground truth optical flow is at the reference time (t 0 − τ). The best parameters were the smallest combination, (∆t, τ) = (2.5 ms, 25 ms) in this experiment.

Conclusions
We propose a method of estimating the optical flow from event-based cameras by matching the time surface of the events. By optimizing the loss function that measures the timestamp consistency between the time surface and one that is slightly shifted, it is possible to estimate a dense optical flow from only the events without restoring the luminance. The experiments show that the proposed loss function has a stable landscape for the estimated parameter regardless of texture. It is shown that the gradient of our loss is equally calculated in all directions while that of the variance loss increases in the direction pushing the event to a line. Additionally, in the experiments involving dense optical flow estimation, our method added L1 smoothness regularization recoded with high accuracy.
∆t is an important parameter that must be adjusted for each scene. In future work, we would like to develop a mechanism to adaptively determine this parameter.
Author Contributions: J.N. wrote the manuscript and code, Y.S. and Y.A. reviewed the work and contributed via discussions. All authors have read and agreed to the published version of the manuscript.