Dynamic Vision Sensor Tracking Method Based on Event Correlation Index

The dynamic vision sensor is a bioinspired sensor characterized by fast response, a large dynamic range, and an asynchronous output event stream. These characteristics give it advantages in tracking that traditional image sensors lack. Because the output of the dynamic vision sensor is an asynchronous event stream, object information must be derived from clusters of correlated events. This article proposes a method based on the event correlation index that obtains the object's position, contour, and other information and is compatible with traditional tracking methods. Experiments show that this method can obtain the position of a moving object and its continuous motion trajectory. The influence of the parameters on the tracking effect is also analyzed. This method is expected to have broad application prospects in security, transportation, and similar fields.


Introduction
As an important research direction in the field of computer vision, object tracking is widely used in security, traffic, and unmanned driving [1][2][3]. The image sensors commonly used for object tracking image in frames. Although this imaging method produces output images that are intuitive and pleasing to the eye, its fixed frame rate loses the object information between frames during tracking, or blurs the image when the object moves too fast, which affects tracking accuracy. Moreover, a higher frame rate brings higher power consumption and requires more storage.
To solve these problems, and with the development of bionics, the dynamic vision sensor (DVS) emerged [4][5][6][7][8][9]. It is a new type of image sensor that generates an event stream based on changes in light intensity. The principle of the sensor is similar to that of the retina. Its pixel structure, shown in Figure 1, is composed of three parts [10][11][12]. The first part is a current-voltage logarithmic conversion photoreceptor circuit that senses light intensity and performs photoelectric conversion; this part is similar to the cone cells in the retina. The second part is a variable amplifying circuit similar to the bipolar cells in the retina; it uses a switched-capacitor amplifier structure to perform sampling and amplification. The third part consists mainly of two comparators, which are similar to the ganglion cells in the retina: when the light intensity decreases, it generates an OFF signal, and when the light intensity increases, it emits an ON signal. The sensor images only the positions in the field of view where the light intensity changes, which reduces the amount of data. Since no exposure is needed for charge accumulation, changes in light intensity can be detected continuously, so events have a very high time resolution. Benefiting from its logarithmic photoelectric conversion unit, the sensor has a large dynamic range of 120 dB [10].
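The behaviour of a single pixel described above can be illustrated with an idealized software model. The sketch below is not the circuit of Figure 1; it assumes that a pixel emits one event each time the logarithm of its intensity moves by a fixed contrast threshold away from the level at the previous event. The function name generate_events and the threshold value 0.2 are hypothetical choices made for illustration.

```python
import math


def generate_events(intensities, timestamps, threshold=0.2):
    """Idealized single-pixel DVS model (an assumption, not the real circuit).

    Returns a list of (timestamp, "ON" | "OFF") tuples. An event is emitted
    whenever the log intensity differs from the reference level by at least
    `threshold`; the reference then moves by one threshold step.
    """
    events = []
    reference = math.log(intensities[0])  # log intensity at the last event
    for value, t in zip(intensities[1:], timestamps[1:]):
        diff = math.log(value) - reference
        while abs(diff) >= threshold:
            if diff > 0:
                events.append((t, "ON"))   # light intensity became stronger
                reference += threshold
            else:
                events.append((t, "OFF"))  # light intensity became weaker
                reference -= threshold
            diff = math.log(value) - reference
    return events
```

A constant intensity produces no events in this model, which reflects why the sensor only reports positions where the light intensity changes.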
However, due to the special imaging method and unique data format of dynamic vision sensors, traditional tracking algorithms are not directly applicable. Here, we propose a new method that calculates the event correlation index of each event and extracts the location and contour information of the object from it. This method not only reduces the dimensionality of the three-dimensional event stream but also allows existing tracking algorithms, such as the centroid tracking algorithm, to be used. Compared with simply compressing the three-dimensional event stream into a two-dimensional binary image, this method preserves the spatiotemporal correlation of events, thereby reducing the impact of noise on tracking algorithms. The rest of the paper is organized as follows: Section 2 reviews work related to dynamic vision sensor tracking methods, Section 3 introduces the algorithm of this article, Section 4 describes the experiments and evaluates the method, Section 5 analyzes the experimental results, and Section 6 summarizes the conclusions.

Related Works
With the advancement of semiconductor design and technology, the resolution of dynamic vision sensors has been further improved, and the readout rate of the event stream has also greatly increased. The resolution of the latest dynamic vision sensor from Prophesee in France has reached 1280 × 720 [13], and its maximum event readout rate reaches 1066 Meps.
The Cele-X V DVS of China's CelePixel Technology Co., Ltd. has a resolution of 1280 × 800 and a readout rate of 160 Meps [14]. At such rates, processing each event in the stream in turn to decide whether it belongs to an object poses a great challenge to the computation speed of the tracking system. Moreover, traditional frame-based tracking methods are difficult to apply to event streams. Although each event contains location information, a single event cannot effectively convey the information of the moving object, and it is even impossible to determine whether the event was generated by the object or by noise. The events generated by the object have high temporal and spatial correlation, and only by using these events together can we obtain the object's location and time information.
Most of the existing tracking methods based on dynamic vision sensors use event clustering to extract object location information. Cluster determination depends on the distance between events and on the number of nearby events: a group of events whose mutual distance is below a certain threshold and whose count exceeds a certain threshold is defined as an object. In [15], a cluster-based method inspired by the traditional mean shift method was used to track the arm of a robot football goalkeeper. Schraml and Belbachir [16] also use a cluster-based method to track moving objects. Compared with [15], Schraml's algorithm differs in the way events are allocated to clusters: the allocation of a newly generated event depends on the 3D Manhattan distance in space and time between the event and the cluster. Compared with the traditional Euclidean distance, this clustering method can suppress noise. Because of their low memory usage, cluster-based methods are suitable for embedded vision systems, but the cluster size needs to be adjusted for specific targets, so the above methods are only suitable for specific scenarios. Methods that cluster events based on the Gaussian mixture model (GMM) were subsequently developed [17,18]. Piatkowska et al. call their method the K-Gaussian clustering method [17]; in this algorithm, events are modeled by Gaussian clusters. Later, Lagorce et al. improved the method by modeling the spatial distribution of events with a bivariate Gaussian [18].
This is also inspired by the mean shift algorithm: the algorithm determines which cluster each event in the event stream belongs to and then updates that cluster.
An event coherence detection algorithm was proposed in [19]. This method divides the event stream spatially into 32 or 64 blocks, performs event correlation detection to extract the event clusters, and then matches each newly discovered object with the objects in the tracker. However, when the object is geometrically large and lies on a block boundary, it is mistakenly split into multiple objects, which degrades the tracking effect.
This paper proposes a method that takes the event stream in a fixed period of time and uses Gaussian kernel convolution to compress the three-dimensional event stream into a two-dimensional image that retains the event correlation, so that traditional image processing methods can be used to extract the relevant events. The spatial coordinates are used to determine the location of the object, which is then tracked. The advantage of this method is that it directly obtains the location of the event cluster and, in turn, the event stream at that location. Not only can simple traditional image processing methods be used, but the event stream of the object is also retained for analyzing its continuous motion trajectory and status.

Algorithms
Object tracking is the process of locating the object in subsequent frames according to its position in the first frame of the video sequence. In the traditional method, the object is first located in the first frame, and subsequent frames are then searched for the object that matches the previous frame. The dynamic vision sensor outputs the event stream asynchronously. The event stream contains both the events generated by object movement and the noise of the sensor itself. The correlation between events is not directly reflected in the event stream data, which makes it impossible to obtain object information directly from the event stream. Although each event contains its own location information, a single event cannot express the information of the moving object. According to the above analysis, the object tracking method based on the dynamic vision sensor is divided into two parts: (1) the event stream is sliced into fixed time periods, the object detector is applied to each slice to obtain the object position information, and the objects in the first slice are stored in the tracker; (2) the objects in the tracker are matched with the objects found in the subsequent event stream, and the tracker is updated. We propose the following algorithm.
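The slicing in part (1) can be sketched as follows. This is a minimal illustration, assuming that events are (x, y, p, t) tuples sorted by time with integer timestamps in microseconds; the function name slice_event_stream and the default period of 2000 μs are choices made for this example.

```python
def slice_event_stream(events, period_us=2000):
    """Split time-sorted (x, y, p, t) events into consecutive fixed periods.

    Slice k holds the events with start + k*period_us <= t < start + (k+1)*period_us,
    where start is the timestamp of the first event. Periods without events
    are kept as empty lists so that the slice index still encodes time.
    """
    if not events:
        return []
    start = events[0][3]
    slices = [[]]
    for event in events:
        index = (event[3] - start) // period_us
        while len(slices) <= index:
            slices.append([])
        slices[index].append(event)
    return slices
```

Keeping empty slices lets the tracker count how long an object has gone unmatched simply by counting slices.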

Object Detector
(1) Collect the event stream ES of the time period T, because a certain number of events is required to obtain the correlation of the events. The correlation of events is quantified by a two-dimensional Gaussian kernel and expressed by the event correlation index, which is calculated according to equation (1). In equation (1), σ_1 and σ_2 are the standard deviations of the space distance and time distance, respectively. The coordinates and the time of occurrence of an event are independent of each other, so ρ = 0 in the normal distribution, and e_j(x_j, y_j, t_j) is a supporting event of e_i(x_i, y_i, t_i).
(2) After obtaining the event correlation index of each event in the event stream for this period, sum the event correlation indexes of all events at each pixel and store the sum as the grey value at the corresponding location in a two-dimensional matrix with the same size as the sensor resolution. In this way, the event correlation image ECI is obtained. The pixel value of the image is given by equation (2), where m and n are the horizontal and vertical resolution of the sensor, respectively. Use the Otsu method [20] to adaptively acquire the threshold λ, binarize ECI according to equation (3), and divide the picture into two parts: the object and the background. The acquired threshold must be compared with a minimum threshold. If it is lower than the minimum threshold, there is no event cluster in the event stream whose correlation meets the requirements, that is, there is no moving object.

Complexity 3
where p_WB(x_i, y_i) is the grey value of pixel (x_i, y_i) in the binary image.
(3) So as not to lose object-edge events with a low correlation index, the binary image needs to be dilated. The correlation index of object-edge events is lower because there are more events near the centre of the object than near the edge, so edge events receive less support than events at the centre.
(4) At this point, the object position and contour information can be obtained from the binary image. According to the contour range, use equation (4) to extract the object events OBJ_k from the event stream. Here, O_k(x, y) is the contour curve of object k. These events are the events generated by the object movement. According to equation (5), the object centroid (x, y) is obtained and used to update the tracker, where N is the number of events in OBJ_k.
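The four detector steps can be sketched in plain Python. The equations are not reproduced in this text, so the kernel used here is an assumed form: a two-dimensional normal density with ρ = 0, evaluated at the spatial and temporal distance between two events and summed over all supporting events. Timestamps are assumed to be in milliseconds, the binary image uses 0 and 1, and the dilated mask stands in for the contour O_k of a single object (several objects would additionally need connected-component labelling). All function names are hypothetical.

```python
import math


def correlation_indices(events, sigma1=1.0, sigma2=0.5):
    """Step 1: Gaussian-weighted support from all other (x, y, t) events."""
    norm = 1.0 / (2.0 * math.pi * sigma1 * sigma2)
    result = []
    for i, (xi, yi, ti) in enumerate(events):
        total = 0.0
        for j, (xj, yj, tj) in enumerate(events):
            if i != j:
                space2 = (xi - xj) ** 2 + (yi - yj) ** 2
                time2 = (ti - tj) ** 2
                total += norm * math.exp(
                    -0.5 * (space2 / sigma1 ** 2 + time2 / sigma2 ** 2))
        result.append(total)
    return result


def correlation_image(events, indices, width, height):
    """Step 2: accumulate the index of every event at its pixel."""
    image = [[0.0] * width for _ in range(height)]
    for (x, y, _), c in zip(events, indices):
        image[y][x] += c
    return image


def otsu_threshold(image, bins=256):
    """Otsu threshold on a histogram; pixels >= the result count as object."""
    values = [v for row in image for v in row]
    top = max(values)
    if top <= 0.0:
        return 0.0
    hist = [0] * bins
    for v in values:
        hist[int(v / top * (bins - 1))] += 1
    total, sum_all = len(values), sum(k * h for k, h in enumerate(hist))
    best_k, best_var, w0, sum0 = 0, -1.0, 0, 0
    for k in range(bins):
        w0 += hist[k]
        sum0 += k * hist[k]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        # Between-class variance (up to a constant factor) for a split after bin k.
        between = w0 * w1 * (sum0 / w0 - (sum_all - sum0) / w1) ** 2
        if between > best_var:
            best_var, best_k = between, k
    return (best_k + 1) / (bins - 1) * top


def binarize(image, threshold, min_threshold):
    """Return a 0/1 mask, or None when the threshold is below the minimum."""
    if threshold < min_threshold:
        return None  # no sufficiently correlated cluster: no moving object
    return [[1 if v >= threshold else 0 for v in row] for row in image]


def dilate(mask, size=5):
    """Step 3: binary dilation with a size x size square structuring element."""
    height, width, r = len(mask), len(mask[0]), size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                for yy in range(max(0, y - r), min(height, y + r + 1)):
                    for xx in range(max(0, x - r), min(width, x + r + 1)):
                        out[yy][xx] = 1
    return out


def object_events_and_centroid(events, mask):
    """Step 4: keep events inside the mask and average their coordinates."""
    obj = [e for e in events if mask[e[1]][e[0]]]
    n = len(obj)
    if n == 0:
        return obj, None
    return obj, (sum(e[0] for e in obj) / n, sum(e[1] for e in obj) / n)
```

The pairwise loop is quadratic in the number of events and is written for readability only; because the kernel decays quickly, a practical implementation would restrict the inner loop to events within a few σ of e_i.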

Tracker Update
(1) When an object is detected by the detector for the first time, it is directly stored in the tracker and assigned an ID.
(2) When the tracker already holds previous object information, each newly detected object must be matched with the existing objects. Owing to the high time resolution of DVS events, the event stream generated by an object is highly continuous in time and space, so the objects in two consecutive time periods can be matched by their centre of mass and contour information. In the matching rule, ID(OBJ_k) is the ID of object k, T_1 and T_2 are two adjacent consecutive time periods of duration T, and R_k is the bounding rectangle of the object.
(3) When an object in the tracker fails to match any detected object for a long time, it is deleted and its ID is no longer used.
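The three update rules can be sketched as a small tracker class. The matching equation is not reproduced in this text, so the criterion below is an assumption: a detection in period T_2 inherits the ID of a tracked object from T_1 when its centroid lies inside that object's bounding rectangle R_k. The class name Tracker, the helper inside, and the box layout (x_min, y_min, x_max, y_max) are hypothetical.

```python
def inside(point, box):
    """True when point (x, y) lies in box (x_min, y_min, x_max, y_max)."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


class Tracker:
    def __init__(self, max_unmatched_ms=20.0):
        self.objects = {}      # id -> {"centroid", "box", "last_seen"}
        self.next_id = 0       # IDs only grow, so a deleted ID is never reused
        self.max_unmatched_ms = max_unmatched_ms

    def update(self, detections, now_ms):
        """detections: list of (centroid, box). Returns the ID of each detection."""
        assigned, used = [], set()
        for centroid, box in detections:
            # Rule (2): match against objects already in the tracker.
            match = next(
                (obj_id for obj_id, obj in self.objects.items()
                 if obj_id not in used and inside(centroid, obj["box"])),
                None)
            if match is None:
                # Rule (1): a new object is stored and given a fresh ID.
                match = self.next_id
                self.next_id += 1
            used.add(match)
            self.objects[match] = {
                "centroid": centroid, "box": box, "last_seen": now_ms}
            assigned.append(match)
        # Rule (3): drop objects that have not been matched for too long.
        stale = [k for k, o in self.objects.items()
                 if now_ms - o["last_seen"] > self.max_unmatched_ms]
        for k in stale:
            del self.objects[k]
        return assigned
```

Calling update once per time period with the detector output keeps the stored centroid and rectangle of each object current, so the next period is always matched against the most recent position.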

Experiments
The technical focus of the method in this paper is to extract the position and contour of the event cluster formed by the moving object in the event stream, so as to obtain the object events. Therefore, the experiment tests how well the object contour is obtained from the event cluster, as well as the matching performance of the tracker. The experimental device is a DAVIS 346 with a sensor resolution of 346 × 260 and an event time resolution of 1 μs. The event format is e(x_i, y_i, p_i, t_i), where x_i and y_i are the pixel coordinates of the event, p_i is the polarity of the event, and t_i is the time stamp. When the light intensity changes from dark to bright, the polarity is 1; otherwise, the polarity is 0. The data used in the experiment were acquired indoors under natural light. To simulate a small object moving at high speed, the spot of a laser pointer was moved quickly across a whiteboard, forming a fast-moving point object in the field of view. The parameters of the algorithm are set as follows: time period T = 2 ms, σ_1 = 1, σ_2 = 0.5, and the dilation structuring element is a rectangle with a side length of 5. An object that is not matched for more than 20 ms is deleted. Figure 2 is a spatiotemporal scatter plot of a 100 ms event stream, where the red points are positive events and the blue points are negative events. Although the figure contains a lot of noise and hot pixels, it can be seen intuitively that the object event clusters are continuous. Hot pixels, which appear as blue lines in Figure 2, are pixels that erroneously send events all the time. Visually, the object event cluster shows a strong correlation, whereas the background activity in Figure 2 has no correlation with other events.
Through the algorithm in this paper, the event correlation index image is obtained, the contour position of the object is extracted, and the object is tracked. The correlation index images and object contours at some moments are shown in Figure 3. The images on the left of Figure 3 are event correlation index images of 6 periods, each 2 ms long. The grey scale of the image represents the correlation strength of the events generated by each pixel, and a high value means that there is a target at that location that has generated events. The images on the right show the event stream in a 2 ms time period, viewed along the time axis (which points into the screen) so that only the spatial positions of the events are visible; the target events are marked with a box. It can be seen that the correlation of events within the outline is significantly higher than that outside the outline. Figure 4 shows the tracking result of the tracker. The red dots are the acquired events of the object. It can be seen from the figure that the event stream of the object is completely retained and that the trajectory is clear and coherent in the three-dimensional space-time image, which can support subsequent analysis of the object's motion trajectory. Figure 5 marks the target events extracted by the algorithm in the original event stream. The left image in Figure 5 is the original event stream, in which the position and shape of the target can be seen in the centre of the figure. In the image on the right, the red events are the target events extracted by the algorithm, and the blue events are the original events. Figure 5 shows that the algorithm can accurately obtain the target position and shape.

Analysis
The tracking method based on the event correlation index depends mainly on the choice of the variance of the Gaussian kernel; an appropriate variance should highlight the correlated events. According to the characteristics of the Gaussian kernel, the larger σ is, the larger the range of the kernel and the more events support the central pixel.
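The role of the two width parameters can be made concrete with one plausible form of the kernel. The expression below is an assumption reconstructed from the description in the Object Detector section (a two-dimensional normal density with ρ = 0, spatial standard deviation σ_1, and temporal standard deviation σ_2); the symbols C, d_s, and d_t are introduced here for illustration only.

```latex
C(e_i) = \sum_{j \neq i} \frac{1}{2\pi\sigma_1\sigma_2}
  \exp\!\left(-\frac{d_s^2(e_i, e_j)}{2\sigma_1^2}
              -\frac{d_t^2(e_i, e_j)}{2\sigma_2^2}\right),
\qquad
d_s^2 = (x_i - x_j)^2 + (y_i - y_j)^2, \quad d_t = t_i - t_j .
```

Under this form, a supporting event at temporal distance d_t = σ_2 contributes e^{-1/2} ≈ 0.61 of the weight of a simultaneous event at the same spatial distance, so increasing either σ widens the range over which events support one another.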
When the space width parameter σ_1 in the Gaussian kernel is fixed, changing the time width parameter σ_2 affects both the discovery of the object and the determination of the object contour range. When σ_2 is small, the correlation of events in the time dimension is ignored: a small amount of noise that is close in time also receives a high correlation index, which causes false objects to be discovered. When σ_2 is large, more previous events provide support, but the events on the edge of the object lack such support from previous events, so their correlation index is lower than that of events inside the object and the extracted object contour is smaller than the actual object. Figure 6 shows three-dimensional images of Gaussian kernels with different σ_2 when σ_1 is 1, and the three images in Figure 7 are the results of obtaining the object events in a 2 ms time period when σ_2 is 0.1, 0.5, and 1, respectively. When σ_2 = 0.1, there are false objects caused by noise on the left side of the figure. When σ_2 = 0.5, the obtained object contour contains the events generated by the object. When σ_2 = 1, the obtained object contour is small and fails to include the sparse events generated by the object edge.
When the time width parameter σ_2 in the Gaussian kernel is fixed, changing the space width parameter σ_1 also affects the determination of the object contour range. When the space width is small, the correlation indices provided by supporting events at different spatial distances differ little and are small in value, which is not conducive to calculating the threshold.
Therefore, the space width should be greater than one pixel, so that supporting events at different distances provide different correlations and a meaningful correlation index can be obtained. However, too large a space width increases the interference of noise on object discovery. Figure 8 shows the three-dimensional images of Gaussian kernels with different σ_1 when σ_2 is 0.5. The three images in Figure 9 are the results of obtaining the object events in a 2 ms time period when σ_1 is 0.5, 1, and 2. When σ_1 = 0.5, the correlation index is too small to determine an appropriate threshold, so a lot of noise is obtained. When σ_1 = 1, the obtained object contour includes the events generated by the object. When σ_1 = 2, a false object caused by the hot pixels is obtained at the bottom right of the picture.
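The qualitative effect of the two width parameters can be checked numerically. The sketch below assumes the kernel is a two-dimensional normal density with ρ = 0 (the exact expression of equation (1) is not reproduced in this text) and that distances are expressed in pixels and milliseconds; the function name support_weight is hypothetical.

```python
import math


def support_weight(d_space, d_time, sigma1, sigma2):
    """Assumed kernel: weight one supporting event adds at the given distances."""
    norm = 1.0 / (2.0 * math.pi * sigma1 * sigma2)
    return norm * math.exp(
        -0.5 * ((d_space / sigma1) ** 2 + (d_time / sigma2) ** 2))


if __name__ == "__main__":
    # Relative support of an event 1 ms away versus a simultaneous one.
    for sigma2 in (0.1, 0.5, 1.0):
        ratio = (support_weight(0.0, 1.0, 1.0, sigma2)
                 / support_weight(0.0, 0.0, 1.0, sigma2))
        print(f"sigma2={sigma2}: temporal support ratio {ratio:.3g}")
    # Relative support of an event 2 pixels away versus one at the same pixel.
    for sigma1 in (0.5, 1.0, 2.0):
        ratio = (support_weight(2.0, 0.0, sigma1, 0.5)
                 / support_weight(0.0, 0.0, sigma1, 0.5))
        print(f"sigma1={sigma1}: spatial support ratio {ratio:.3g}")
```

With a small width almost all support comes from events that nearly coincide with the central event, whereas a large width lets distant events, including noise and hot pixels, contribute noticeably, which is consistent with the trends reported for Figures 7 and 9.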

Conclusions
In this work, we propose a new method to obtain the object position in the DVS event stream and track the object, and we analyze the influence of the parameters on this method. A single DVS event cannot contain enough object information; highly correlated events are required to form an event cluster that reflects the object's position and movement status. This method uses the event correlation index to obtain the event cluster of the object, thereby determining the location and shape of the object. It allows traditional image processing and object tracking methods to be applied to dynamic vision imaging systems, making them compatible with other existing systems. Experiments show that this method can obtain the position of the object and the complete event stream of the object movement.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.