Comparison of Tracking Techniques on 360-Degree Videos

Featured Application: Tracking techniques are essential for attaching an AR tag to a physical target in 360-degree videos.

Abstract: With the availability of 360-degree cameras, 360-degree videos have become popular recently. To attach a virtual tag on a physical object in 360-degree videos for augmented reality applications, automatic object tracking is required so the virtual tag can follow its corresponding physical object in 360-degree videos. Relative to ordinary videos, 360-degree videos in an equirectangular format have special characteristics such as viewpoint change, occlusion, deformation, lighting change, scale change, and camera shakiness. Tracking algorithms designed for ordinary videos may not work well on 360-degree videos. Therefore, we thoroughly evaluate the performance of eight modern trackers in terms of accuracy and speed on 360-degree videos. The pros and cons of these trackers on 360-degree videos are discussed. Possible improvements to adapt these trackers to 360-degree videos are also suggested. Finally, we provide a dataset containing nine 360-degree videos with ground truth of target positions as a benchmark for future research.


Introduction
Nowadays, 360-degree videos are becoming more and more popular. Omnidirectional cameras, also called 360-degree cameras, are widely available and increasingly lightweight, and can even be installed on drones [1]. They are useful for recording indoor or outdoor activities because they cover views in all directions. Rendering 360-degree videos on a virtual reality headset can provide an immersive experience for users in education, entertainment, and tourism [2]. For augmented reality applications using 360-degree videos, a common requirement is to register a virtual tag to a physical target. As shown in Figure 1, a virtual billboard marked in red must follow its corresponding physical target over time. For this purpose, automatic tracking of a specific target in 360-degree videos is highly desirable. Therefore, we explore the characteristics of 360-degree videos and compare the performance of existing tracking techniques on 360-degree videos.
A 360-degree video consists of a sequence of 360-degree images with a fixed interval of time. Each 360-degree image is a panorama either captured by an omnidirectional camera or stitched from multiple cameras to cover the complete horizontal field of view (i.e., 360-degree FOV). As shown in Figure 1, a 360-degree image is typically flattened in an equirectangular format, in which longitude lines are projected to vertical straight lines of constant spacing. Similarly, latitude lines are mapped to horizontal straight lines of constant spacing.
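The constant spacing of longitude and latitude lines means that the equirectangular mapping is linear in both angles. The following minimal Python sketch illustrates this under an assumed convention (longitude from -180 to 180 degrees running left to right, latitude from +90 to -90 degrees running top to bottom); the function names and the convention are illustrative choices, not taken from the camera or from any library.

```python
def sphere_to_equirect(lon_deg, lat_deg, width, height):
    """Map longitude/latitude (degrees) to equirectangular pixel coordinates.

    Assumed convention: longitude -180..180 spans the image left to right,
    latitude +90..-90 spans the image top to bottom.
    """
    x = (lon_deg + 180.0) / 360.0 * width
    y = (90.0 - lat_deg) / 180.0 * height
    return x, y


def equirect_to_sphere(x, y, width, height):
    """Inverse mapping from pixel coordinates back to longitude/latitude."""
    lon_deg = x / width * 360.0 - 180.0
    lat_deg = 90.0 - y / height * 180.0
    return lon_deg, lat_deg


if __name__ == "__main__":
    # Equal steps in longitude give equal steps in x.
    for lon in (-90, 0, 90):
        print(lon, sphere_to_equirect(lon, 0, 3840, 2160))
```

Because the horizontal scale is the same at every latitude, objects near the top and bottom of the image are stretched horizontally, which is one source of the deformation discussed in this paper.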
Due to their immersiveness and full field of view, 360-degree videos are well suited to virtual reality applications. However, the ways of interacting with 360-degree videos during playback are quite limited. Users cannot walk within the scene in 360-degree videos, that is, 3D translation is not allowed besides the passive motion caused by a moving camera. Nevertheless, users can freely select their point of view during playback of 360-degree videos (i.e., 3D rotation is possible). Conveniently, a 360-degree video can be viewed via an ordinary web browser, in which a user can pan around by clicking and dragging. Alternatively, a 360-degree video can be observed via a head-mounted display (HMD), with which a user can pan around simply by rotating their head.
In ordinary monoscopic videos, real-time tracking techniques rely on either offline datasets or online learning to train an appearance model, then apply the trained model to track potential targets frame by frame. However, tracking algorithms designed for ordinary videos may not perform well on 360-degree videos because of their unique characteristics [3,4]. For example, occlusion problems are almost unavoidable in panoramic 360-degree videos. Moreover, an object may disappear from the left but reappear on the right border, or disappear from the top then reappear on the bottom border in 360-degree videos. Nonrigid deformation is obvious in 360-degree videos in an equirectangular format. Lighting and scale changes happen frequently in 360-degree videos due to continuous change of viewpoints. Camera shakiness is another common problem in aerial 360-degree videos captured by drones. Trackers designed for ordinary videos tend to confuse the tracking target with other objects, or even lose it, under these circumstances. To address this, Cai et al. [5] adapted the Kernelized Correlation Filter (KCF) tracking algorithm to work on 360-degree videos. Delforouzi et al. [6] modified the Tracking-Learning-Detection (TLD) tracker to fit the needs of 360-degree videos. Nevertheless, a thorough evaluation of state-of-the-art tracking algorithms on 360-degree videos is still missing. For this purpose, we implement eight popular tracking techniques and evaluate their performance on 360-degree videos. To make a fair comparison of both quality and time of tracking, we adopt the default parameters of these trackers in the open-source library OpenCV. The experimental results are analyzed to reveal the pros and cons of these tracking algorithms on 360-degree videos.
The contribution of this paper is a thorough evaluation of eight popular tracking algorithms on 360-degree videos. Both qualitative and quantitative comparisons are made in terms of accuracy and speed. According to the experimental results, we discuss the strengths and weaknesses of these trackers on 360-degree videos, and suggest potential ways to adapt them to 360-degree videos for better tracking performance. As a basis of the comparison, we capture nine 360-degree videos in a variety of scenarios. Three of them are captured on the ground and six of them are captured in the air. Positions of the targets of interest in these 360-degree videos are manually marked in each frame as the ground truth of tracking. The dataset containing these nine 360-degree videos with the ground truth is provided (online link in supplementary materials) to be a benchmark for future research.

Background
With the advance of virtual reality technology in the past few years, 360-degree images and videos have become a blooming research topic. Neng and Chambel [7] designed and evaluated 360-degree hypervideos that allow users to explore and navigate through links. Berning et al. [8] adopted 360-degree interactive video to create evaluation scenarios where users can select their point of view during playback. Rupp et al. [9] used 360-degree videos as a learning tool and analyzed the effects of immersiveness of three devices: a smartphone, a Google Cardboard, and an Oculus Rift. Pakkanen et al. [10] compared three interactive ways for 360-degree video playback: remote control, head orientation, and hand gesture. Huang et al. [11] presented an automatic approach to generate spatial audio for panorama images based on object detection and action recognition.
Given the initial position of an unknown object, the purpose of tracking is to locate the object in successive frames of a video. Mousas [12] proposed a method for controlling motions of a virtual partner character based on a performance-capturing process using multiple inertial measurement units (IMUs). Instead of relying on IMUs for human tracking, this paper focuses on vision-based methods for unknown object tracking. Among all the existing online visual tracking algorithms, we choose eight modern and popular trackers for evaluation and comparison on 360-degree videos.
The Boosting tracker, proposed by Grabner et al. [13], is an online version of the AdaBoost feature selection algorithm. The online boosting algorithm maintains a global classifier pool of weak classifiers with multiple selectors. Each new training sample is used to update each weak classifier in the pool. A cascade system initializes the first selector with the current sample's importance, selects the best weak classifier with the least error, and passes the estimated importance to the next selector until all selectors have been updated. In the end, a strong classifier is chosen from the best weak classifiers, and the worst weak classifier is replaced with a random one. The Boosting tracker utilizes the initial target area in the current frame as a positive example, and exploits other areas with the same size around the target as negative examples. Then, the online-trained classifier searches the neighborhood for potential targets in the next frame. The Boosting tracker can handle temporary occlusions as well as complex backgrounds.
The Multiple Instance Learning (MIL) tracker, proposed by Babenko et al. [14], extends the online boosting algorithm by using a set of image patches (called a bag) instead of a single sample for training. A bag containing at least one positive example is called a positive bag; otherwise, it is called a negative bag. The MIL tracker collects many small image patches centered at the tracking object as potential positive bags, and chooses the best one to be the positive example. This strategy not only prevents the MIL tracker from losing important information but also avoids the mislabeling problem.
The MedianFlow tracker, proposed by Kalal et al. [15], is a bidirectional approach that combines forward and backward tracking. The forward and backward consistency is analyzed as a quality measure to assist the tracking. The MedianFlow tracker constructs both forward and backward trajectories at each time instant, and their corresponding errors are estimated. The trajectory with the minimum forward-backward error is chosen as the candidate for the succeeding tracking. As a result, the MedianFlow tracker is more reliable at following objects with consistent movement.
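The forward-backward consistency check can be sketched in a few lines of Python. This is a minimal illustration rather than the OpenCV implementation: it assumes that the point tracking itself (for example, by optical flow) has already been done, so the three point lists are given as inputs, and it follows the commonly described variant that keeps the half of the points with the lowest forward-backward error and takes the median of their displacements. All function names are hypothetical.

```python
import math
from statistics import median


def forward_backward_errors(points, back_tracked):
    """Distance between each start point and the position it returns to
    after being tracked one frame forward and then backward again."""
    return [math.hypot(p[0] - b[0], p[1] - b[1])
            for p, b in zip(points, back_tracked)]


def median_flow_shift(points, forward, back_tracked):
    """Median displacement of the points whose forward-backward error is
    at or below the median error (the more reliable half)."""
    errors = forward_backward_errors(points, back_tracked)
    cutoff = median(errors)
    kept = [(f[0] - p[0], f[1] - p[1])
            for p, f, e in zip(points, forward, errors) if e <= cutoff]
    dx = median(d[0] for d in kept)
    dy = median(d[1] for d in kept)
    return dx, dy


if __name__ == "__main__":
    pts = [(0, 0), (10, 0), (0, 10), (10, 10)]
    fwd = [(2, 1), (12, 1), (2, 11), (50, 50)]   # last point drifted away
    back = [(0, 0), (10, 0), (0, 10), (30, 30)]  # and fails the round trip
    print(median_flow_shift(pts, fwd, back))
```

In this toy input the fourth point does not return to its start position, so it is discarded and the estimated shift comes only from the three consistent points.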
Appl. Sci. 2019, 9, 3336
The Minimum Output Sum of Squared Error (MOSSE) tracker, proposed by Bolme et al. [16], is a tracker based on correlation filters. It achieves high efficiency by computing correlation in the Fourier domain. The MOSSE filter improves on the ASEF filter to overcome the potential overfitting problem. The MOSSE tracker minimizes the output sum of squared error to find the most probable location of the tracking object. The benefits of using a correlation filter make the MOSSE tracker more robust to the problems of scaling, rotation, deformation, and occlusion compared to traditional approaches. Also, MOSSE is more flexible than other correlation-filter-based trackers because the target is not required to be in the center of the image at the beginning of tracking.
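For reference, the MOSSE objective and its closed-form solution are usually written as follows. The notation is an assumption of this sketch rather than something defined in this paper: F_i is the 2D Fourier transform of the i-th training image, G_i is the Fourier transform of the desired (typically Gaussian-shaped) response, H^* is the complex conjugate of the filter, \odot denotes element-wise multiplication, and the division is element-wise.

```latex
\min_{H^{*}} \sum_{i} \left| F_i \odot H^{*} - G_i \right|^{2},
\qquad
H^{*} = \frac{\sum_{i} G_i \odot F_i^{*}}{\sum_{i} F_i \odot F_i^{*}}
```

Because every operation is element-wise in the Fourier domain, both training the filter and applying it to a new frame cost little more than a pair of fast Fourier transforms.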
The TLD tracker, proposed by Kalal et al. [17], is mainly composed of three parts: a tracker, a learner, and a detector. The job of the tracker is to follow the target through consecutive frames; the learner relies on a P-expert and an N-expert to estimate misdetection and false alarm, respectively, then updates the detector. The detector locates potential targets according to an appearance model, feeds the outputs to the learner, and corrects the tracker if necessary. The TLD tracker is well known for its ability of failure recovery at the expense of instability. Compared to other online trackers struggling with the problem of accumulating errors, the combination of tracking and detecting modules makes the TLD tracker more reliable for long-term tracking.
The KCF tracker, proposed by Henriques et al. [18], extends the MOSSE concept and takes advantage of overlapping regions in multiple positive samples. The abundant data is processed in the Fourier domain to increase the learning speed. The KCF tracker emphasizes the importance of the negative samples and tends to use more samples for better training. To this end, a cyclic shift is applied to generate more samples from each important sample. The characteristic of circulant matrices for regression samples is utilized to speed up the computation. Also, kernel tricks are exploited to deal with the problem of nonlinear regression. Instead of working on raw pixels, the KCF tracker extracts Histogram of Oriented Gradients (HOG) features to improve the accuracy of tracking.
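The idea of generating extra samples by cyclic shifts can be illustrated with a one-dimensional toy example. The sketch below is not the KCF implementation; it only shows that stacking all cyclic shifts of one base sample yields a circulant matrix, which is the structure that makes the Fourier-domain speedup possible. The function name is hypothetical.

```python
def cyclic_shifts(sample):
    """Return all cyclic shifts of a 1D sample.

    Row k is the base sample shifted right by k positions, so the rows
    together form a circulant matrix.
    """
    base = list(sample)
    n = len(base)
    # k = 0 is handled separately because base[-0:] would be the whole list.
    return [base[-k:] + base[:-k] if k else base[:] for k in range(n)]


if __name__ == "__main__":
    for row in cyclic_shifts([1, 2, 3, 4]):
        print(row)
```

A single base patch of n values therefore stands in for n training samples without any of the shifted copies having to be stored explicitly.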
The Generic Object Tracking Using Regression Networks (GOTURN) tracker, proposed by Held et al. [19], adopts an offline dataset to train a Convolutional Neural Network (CNN) model in advance. Then, it relies on the generated model for online tracking. The process of pretraining takes advantage of readily available information in offline datasets to learn both target appearance and motion relationship. Without the requirement to update CNN weights in run-time, the GOTURN tracker has another significant advantage of online tracking speed. Although it is not necessary to include specific tracking targets in the dataset for pretraining, the GOTURN tracker tends to favor objects in the training set over objects that are not in the training set. A potential issue of the GOTURN tracker is the quality of the pretrained model that can seriously affect the performance of the online tracking process.
The Channel and Spatial Reliability Tracker (CSRT), proposed by Lukezic et al. [20], is based on the Discriminative Correlation Filter (DCF) algorithm. It improves the DCF tracker by introducing spatial and channel reliability. The spatial reliability map is used to find the optimal filter size. The ability to adjust filter size makes the CSRT tracker better than the traditional DCF algorithm by excluding unrealistic samples. Another benefit of the spatial reliability map is its ability to handle nonrectangular targets. The channel reliability is measured to weigh the importance of each channel filter, and the weighted channel responses are then combined to get the final response map. Using only the standard HOG and Colornames feature sets, the CSRT tracker can achieve an impressive accuracy with real-time speed.

Experiment Setup
To measure the performance of eight trackers on 360-degree videos in a variety of situations, we prepared a dataset containing nine 360-degree videos captured using a Garmin Virb 360-degree camera. The 360-degree camera can be installed on top of a helmet for ground video capturing as shown in Figure 2a. Alternatively, the 360-degree camera can be attached to a drone for aerial video capturing as shown in Figure 2b. A Garmin Virb 360-degree camera contains two 12-megapixel sub-cameras that are opposite to each other. Each sub-camera has a wide field of view (FOV) of 202 degrees. At each time instant, the hardware inside the camera analyzes the overlap between two images captured by sub-cameras, then aligns and stitches two images together to form a seamless 360-degree image.
One problem of 360-degree videos is the huge file size. Typically, a 360-degree video contains 30 360-degree images per second, and each 360-degree image has a resolution of 3840 × 2160 with three color channels, resulting in a total of 746 MB per second. Video compression techniques can be applied to effectively reduce the size of the captured video file. Smaller video size can increase the frame rate of tracking and make the motion between two consecutive frames smaller, and hence improve the performance of tracking. On the other hand, a high compression ratio and a low bit rate can reduce the video quality, and hence degrade the accuracy of tracking. Wang et al. [21] studied the influence of the choice of video coding parameters on the performance of visual object tracking. In its default setting, the hardware inside the Garmin Virb 360-degree camera applies the most commonly used AVC/H.264 encoding and generates a standard MP4 video file with a maximum bit rate of 120 Mbps. To make a fair comparison of the eight trackers, our experiments use the same video encoding and bit rate in all nine video sequences.
As shown in Table 1, these 360-degree video sequences cover multiple scenarios, each containing a combination of characteristics such as viewpoint change, occlusion, deformation, lighting change, scale change, and camera shakiness. Each 360-degree video sequence lasts 100 to 1000 frames with a resolution of 3840 × 2160. To speed up processing, all video sequences are down-sampled to 1920 × 1080 in our tracking experiments. The benchmark machine is a PC with a 3.2 GHz CPU and 16 GB RAM. The operating system is Microsoft Windows 10.
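The uncompressed data rate of 746 MB per second quoted in this section can be checked with a short calculation. The sketch below assumes one byte (8 bits) per color channel and decimal megabytes (1 MB = 1,000,000 bytes); both are assumptions, since the text does not state them, and the function name is illustrative.

```python
def raw_video_rate_mb(width, height, channels, fps, bytes_per_sample=1):
    """Uncompressed video data rate in decimal megabytes per second.

    Assumes `bytes_per_sample` bytes per color channel per pixel.
    """
    bytes_per_second = width * height * channels * bytes_per_sample * fps
    return bytes_per_second / 1_000_000


if __name__ == "__main__":
    full = raw_video_rate_mb(3840, 2160, 3, 30)   # capture resolution
    half = raw_video_rate_mb(1920, 1080, 3, 30)   # down-sampled resolution
    print(f"3840 x 2160: {full:.3f} MB/s")
    print(f"1920 x 1080: {half:.3f} MB/s")
    # 120 Mbps is 15 MB/s, the camera's maximum compressed rate.
    print(f"raw / compressed at 120 Mbps: {full / (120 / 8):.1f}")
```

Under these assumptions the full-resolution stream comes to 746.496 MB per second and the down-sampled stream to one quarter of that, so the 120 Mbps encoded file corresponds to a compression ratio of roughly 50:1 or higher.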

Experiment Process
Our implementation of the eight tracking algorithms was based on the open-source OpenCV library, version 3.4.2 (Intel Corporation, Santa Clara, CA, USA). All eight trackers were initialized with default parameters. Among these trackers, GOTURN was the only one based on a Convolutional Neural Network (CNN), and it used a standard Caffe model for online tracking. For each frame in a 360-degree video, the centroid of the tracking target was marked manually in advance as the ground truth of tracking. Figure 3 shows the flowchart of the experiment used to evaluate the eight trackers on the nine 360-degree videos. Each tracker was executed on each 360-degree video sequence in turn to measure the tracking speed in terms of Frames Per Second (FPS). A spatial displacement (in pixels) was computed as the absolute distance between the tracker's output position and the ground truth. If the displacement was smaller than a predefined tolerance threshold, the frame was counted as a correct tracking frame. The accuracy was defined as the ratio of the number of correct tracking frames to the number of all frames.
Because only the GOTURN, TLD, CSRT, and MedianFlow trackers could adjust and update the target window size dynamically, an adaptable window size was used in these four trackers for the qualitative evaluation. To make a fair comparison of the eight trackers, the size of the target window was not considered in the quantitative evaluation. For the same reason, the Kalman filter was not applied to any tracker in our experiments.
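The accuracy measure described in this section can be sketched as follows. This is an illustrative reading of the definition, not the authors' code: it assumes that the "absolute distance" is the Euclidean distance between centroids, that a frame in which the tracker reports no position (represented here as `None`) counts as incorrect, and that the threshold value used in the example is arbitrary.

```python
import math


def tracking_accuracy(predicted, ground_truth, threshold):
    """Ratio of correctly tracked frames to all frames.

    A frame is correct when the predicted centroid lies strictly within
    `threshold` pixels of the ground-truth centroid. A prediction of
    None (tracker failure) is counted as incorrect.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have equal length")
    if not ground_truth:
        return 0.0
    correct = 0
    for pred, truth in zip(predicted, ground_truth):
        if pred is None:
            continue
        displacement = math.hypot(pred[0] - truth[0], pred[1] - truth[1])
        if displacement < threshold:
            correct += 1
    return correct / len(ground_truth)


if __name__ == "__main__":
    truth = [(10, 10)] * 4
    preds = [(10, 10), (13, 14), None, (100, 100)]
    # Hypothetical 20-pixel tolerance.
    print(tracking_accuracy(preds, truth, threshold=20))
```

Because only centroids are compared, the measure is independent of the target window size, which matches the decision to ignore window size in the quantitative evaluation.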

Experiment Results
To demonstrate qualitative tracking outputs of eight trackers on nine 360-degree video sequences, six representative snapshots with a fixed interval of time in each 360-degree video sequence are shown in detail in  Figure 4. The video sequence 2 was captured by a moving motorcycle. The tracking target was a stadium on one side of the road. The building was occasionally occluded by trees and streetlamps. All trackers were affected by the occlusion problem and so only produced decent tracking results as shown in Figure 5. Especially, the MedianFlow tracker confused the tracking target with other obstacles. It suffered seriously from temporal occlusion in this sequence. The video sequence 3 was captured by a drone with an overlooking view of a lake. The tracking target was the roof of a green building. The shape of the target deformed dramatically due to the nature of 360-degree videos in an equirectangular format. Luckily, some trackers could still follow the target smoothly for a short period of time.

Video sequence 4 was captured by a drone flying around a lake harbor. The tracking target was a small boat docking at a pier. The scale of the target changed considerably over time, as shown in Figure 7. Although the KCF and MOSSE trackers achieved fair accuracy at the beginning, they lost the target in the middle of the sequence. Even worse, the GOTURN and TLD trackers became confused and tracked the wrong objects early in the sequence.
Video sequence 5 was captured by a drone flying along a lakeshore. The tracking target was a lakeside building with a red roof. The scale of the building changed slowly over time, hence the MedianFlow tracker performed quite well. Sequence 5 also exhibited another characteristic of 360-degree videos: the tracking target disappeared from one side of the panoramic image and reappeared on the other side. Most trackers could not recover from this problem. Interestingly, the TLD tracker correctly recovered the target, as shown in the last snapshot of Figure 8.
Video sequence 6 was captured by a drone flying across a lake. The tracking target was a fast-moving boat with apparent viewpoint change. Given the large motion in this sequence, only the GOTURN tracker obtained great results. In comparison, the KCF and MOSSE trackers lost the tracking target very early, as shown in Figure 9.
Video sequence 7 was captured by a moving biker. The tracking target was a large building with slow viewpoint change and some partial occlusion. Although several trackers received decent scores, they were not really focused on the center of the building. Only the BOOST tracker accurately tracked the whole building throughout this sequence, as shown in Figure 10.
Video sequence 8 was captured by a drone flying on a windy day. The main characteristic of this sequence was camera shakiness, which is a common problem in drone-recorded 360-degree videos. Even with the jittery motion and target deformation caused by the shaking camera, most trackers still performed quite well throughout the sequence, as shown in Figure 11.
Video sequence 9 was captured by a drone flying along a seashore. The tracking target was the summit of a mountain, and the illumination changed dramatically over space and time. In the middle of the sequence, the sun sat just behind the mountain top. The backlighting caused tracking loss for the KCF tracker and tracking errors for the TLD and GOTURN trackers. Surprisingly, the other trackers still followed the target very well, as shown in Figure 12.
In summary, Figure 13 outlines the quantitative results of the eight trackers on the nine sequences. The vertical axis indicates the tracking accuracy, and the horizontal axis represents the predefined value of the tolerated threshold.
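An accuracy-versus-threshold curve of the kind plotted in Figure 13 can be obtained by sweeping the tolerated threshold over the per-frame displacements. The following Python sketch only illustrates the idea. The displacement values and the threshold grid are hypothetical, and representing a lost target by an infinite displacement is an assumption.

```python
import math


def accuracy_curve(displacements, thresholds):
    """For each threshold, return the fraction of frames whose
    displacement (in pixels) is smaller than that threshold."""
    total = len(displacements)
    if total == 0:
        return [0.0 for _ in thresholds]
    return [sum(d < t for d in displacements) / total for t in thresholds]


# Hypothetical per-frame displacements; math.inf marks a lost target.
displacements = [1.0, 12.0, math.inf, 2.0, 30.0]
thresholds = [5, 10, 20, 40]  # hypothetical tolerance values in pixels
curve = accuracy_curve(displacements, thresholds)
print(curve)  # [0.4, 0.4, 0.6, 0.8]
```

Such a curve never decreases as the threshold grows, and frames in which the target is lost are never counted as correct, so the curve saturates below 1.0 for a tracker that loses its target.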

Discussion
In terms of Frames Per Second (FPS), Table 2 summarizes the speed comparison of all eight trackers on the nine video sequences. According to the experimental results, MOSSE is the fastest tracker with an average of 3776 FPS. KCF is the next most efficient tracker with an average of 175 FPS. The MedianFlow tracker can also achieve 63 FPS. Among all eight trackers, TLD is the slowest. In fact, it is about 600 times slower than the MOSSE tracker and is not suitable for real-time applications.
In terms of tracking quality, Table 3 summarizes the accuracy comparison of all eight trackers on the nine video sequences. The strengths and weaknesses of all eight trackers are outlined in Table 4. GOTURN is the only tracker based on deep learning, but it does not perform well in terms of accuracy on 360-degree videos. Interestingly, it works well in some special cases. For example, it is the only tracker that could flawlessly track the fast-moving boat in sequence 6, possibly because the pretraining dataset contained boats. In fact, the performance of GOTURN depends heavily on the appearance model of the tracking target. Hence, we believe the GOTURN tracker can be improved by training specific target models for 360-degree videos in advance.
Generally, the MedianFlow tracker performs well on consistent and slowly changing video sequences. However, occasional occlusion prevents its forward and backward analyses from agreeing, and the tracking fails, as shown in video sequence 2.
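The bidirectional analysis of MedianFlow can be illustrated as a forward-backward consistency check: points are tracked from one frame to the next and then back again, and a point is trusted only if it returns close to where it started. The Python sketch below assumes the point correspondences are already available (the optical-flow step is omitted). The function names, the sample points, and the 10-pixel failure threshold are hypothetical, and the OpenCV implementation may use a different failure criterion.

```python
import math
from statistics import median


def forward_backward_errors(points, back_tracked):
    """Distance between each start point and its back-tracked position."""
    return [math.hypot(p[0] - b[0], p[1] - b[1])
            for p, b in zip(points, back_tracked)]


def median_flow_step(points, forward, back_tracked, max_fb_error=10.0):
    """Return the estimated (dx, dy) shift of the target, or None when the
    forward and backward passes disagree too much (for example, occlusion)."""
    errors = forward_backward_errors(points, back_tracked)
    med = median(errors)
    if med > max_fb_error:
        return None  # tracking judged unreliable
    # Keep only the points whose error is not worse than the median.
    kept = [(p, f) for p, f, e in zip(points, forward, errors) if e <= med]
    dx = median(f[0] - p[0] for p, f in kept)
    dy = median(f[1] - p[1] for p, f in kept)
    return dx, dy


points = [(10, 10), (20, 10), (10, 20), (20, 20)]
forward = [(12, 11), (22, 11), (12, 21), (35, 40)]    # last point is an outlier
back_ok = [(10, 10), (20, 10.5), (10, 20), (28, 31)]  # mostly consistent
back_occluded = [(40, 40), (50, 45), (45, 60), (60, 60)]

shift = median_flow_step(points, forward, back_ok)
lost = median_flow_step(points, forward, back_occluded)
print(shift, lost)
```

In the consistent case the outlier point is discarded and the shift is taken from the remaining points. In the occluded case every point fails to return to its origin, so the step reports a failure instead of a shift.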
TLD is a slow tracker with a high false-detection rate, but it works well for failure recovery. In particular, the TLD tracker is a good choice for tracking a target that disappears from one place and reappears in another place in 360-degree videos.
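The disappear-and-reappear behavior follows from the geometry of the panorama: the left and right edges of an equirectangular frame are adjacent on the sphere, so a target crossing the seam moves only a short distance. The Python sketch below shows a seam-aware horizontal offset. It is an illustration only and not part of the evaluation reported here. The function names, the frame width, and the sample positions are hypothetical, and vertical distortion near the poles is ignored.

```python
import math


def wrapped_dx(x1, x2, width):
    """Shortest signed horizontal offset from x1 to x2 on a panorama
    that is `width` pixels wide and wraps around at the edges."""
    dx = (x2 - x1) % width
    return dx - width if dx > width / 2 else dx


def seam_aware_distance(p, q, width):
    """Pixel distance between two (x, y) positions, allowing the
    horizontal component to cross the left/right seam."""
    return math.hypot(wrapped_dx(p[0], q[0], width), q[1] - p[1])


width = 3840           # hypothetical panorama width in pixels
before = (3830, 900)   # target near the right edge
after = (15, 903)      # same target after crossing to the left edge

naive = math.hypot(after[0] - before[0], after[1] - before[1])
wrapped = seam_aware_distance(before, after, width)
print(round(naive, 1), round(wrapped, 1))
```

A tracker that searches only near the last known pixel position sees the naive jump of several thousand pixels, whereas the seam-aware offset shows that the target moved by only a few tens of pixels.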
The KCF tracker performs well on ordinary videos but not on 360-degree videos due to its fixed-size filters. The characteristics of 360-degree videos, such as scale change, viewpoint change, and deformation, easily lead the KCF tracker to a track loss. Thus, it is only usable for tracking plain targets that do not exhibit these characteristics.
The BOOST tracker achieves fair accuracy on 360-degree videos, though it does not sense tracking failure and continues to track a wrong target, as shown in video sequence 5. The tolerance parameters need to be adjusted accordingly to avoid false tracks for the BOOST tracker.
The CSRT tracker is a good choice for tracking on 360-degree videos because it detects target objects using HOG features instead of raw pixels. It can also adjust the size of the target window dynamically. Nonetheless, it still has a hard time recovering from a temporarily disappearing target, as shown in video sequence 2, or tracking a fast-moving target, as shown in video sequence 6.
The MIL tracker can properly handle most cases on 360-degree videos. Its weakness is occlusion caused by viewpoint change. The MIL tracker tends to fail to recover the tracking target even after the occlusion ends.
For applications with high-speed demand or large motion, the MOSSE tracker is the best choice since its tracking speed is significantly higher than that of the other trackers, though its fixed-size tracking window could be a problem for video sequences with large scale change.
A typical example of a 360-degree video in an equirectangular format is shown in Figure 14. An ideal tracker should be able to tackle the problems in 360-degree videos such as viewpoint change, occlusion, deformation, lighting change, scale change, and camera shakiness. For viewpoint change caused by a moving camera, a motion model is helpful to assist tracking. To handle occlusion problems, the ability to recover a temporarily missing target is essential. To alleviate nonrigid deformation, trackers must learn and update the appearance model at run time. To accommodate lighting change, extracting illumination-robust features is critical. To solve the problem of scale change, an adaptable and dynamic target window size is beneficial. To deal with a shaking camera, trackers should measure and compensate for the global motion.
An inherent problem of 360-degree videos in an equirectangular format is image distortion, especially in the northern and southern polar areas of a panoramic image. The distortion problem affects tracking performance in two respects. First, the motion of the tracking target is distorted after an equirectangular projection [22]. In video sequence 6, a boat moving in a straight line appears to move along a curve in the equirectangular 360-degree video. The distortion of the trajectory degrades the accuracy of all trackers, especially for fast-moving targets. Nevertheless, the GOTURN tracker can handle this problem properly as long as its pretraining data include samples with distorted motion. Second, the tracking target suffers from nonrigid deformation in an equirectangular format. In video sequence 2, the deformation of the target building makes straight lines become curves. Thus, it tends to cause a track loss for trackers using a static target appearance model. Surprisingly, most trackers survive the slow target deformation in this case, except the TLD tracker. The deformed target triggers frequent reinitialization in the TLD modules, and the target appearance is prone to drift in the presence of occasional occlusions. In video sequence 3, the tracking target is affected by both motion distortion and target deformation. As a result, most trackers are unstable and achieve low tracking accuracy in this case.
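The motion distortion can be reproduced numerically by projecting a straight 3D path into equirectangular pixel coordinates. The Python sketch below is a worked illustration under stated assumptions. The axis convention (y up, z forward), the frame size, and the path of the target are all hypothetical and are not taken from video sequence 6.

```python
import math


def project_equirectangular(x, y, z, width, height):
    """Map a 3D point in camera coordinates to equirectangular pixels.

    Assumed convention: y points up, z points forward, longitude 0 is at
    the horizontal center of the image, latitude +90 degrees is the top row.
    """
    lon = math.atan2(x, z)
    lat = math.asin(y / math.sqrt(x * x + y * y + z * z))
    u = (lon / (2 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v


WIDTH, HEIGHT = 3840, 1920  # hypothetical frame size

# A target moving in a straight line, 10 units below the camera and
# 20 units in front of it, from x = -50 to x = +50.
path_3d = [(float(x), -10.0, 20.0) for x in range(-50, 51, 25)]
path_2d = [project_equirectangular(x, y, z, WIDTH, HEIGHT)
           for x, y, z in path_3d]

for u, v in path_2d:
    print(round(u, 1), round(v, 1))

# Vertical offset of the path's midpoint from its endpoints, in pixels.
sag = path_2d[2][1] - path_2d[0][1]
```

Both endpoints project to the same image row, yet the midpoint lies well over a hundred pixels lower, so the straight path appears as a curve in the panorama. A motion model that assumes straight-line motion in pixel space would therefore mispredict the position of such a target.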

Conclusions
The problems of viewpoint change, occlusion, deformation, lighting change, scale change, and camera shakiness occur frequently in 360-degree videos. According to our experiments at the maximum tolerated threshold, CSRT achieves the best overall accuracy and is the most robust tracker on 360-degree videos. In terms of speed, MOSSE is the most efficient tracker. For future work, developing a tracker that can deal with these problems in 360-degree videos is crucial. We believe that a multimodal fusion is beneficial in combining the abilities of failure recovery, robustness, and adaptable target size for online tracking on 360-degree videos. A Kalman filter can also be applied for better prediction and stabilization when tracking unknown objects in 360-degree videos. Our dataset containing nine 360-degree videos with ground truth is accessible through the link at the end of the paper. It can be utilized as a benchmark for future research.
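As an illustration of how a Kalman filter could provide prediction and stabilization, the following Python sketch implements a constant-velocity filter for one image coordinate. No such filter was used in the experiments reported here. The class name, the noise variances, the simplified diagonal process noise, and the sample measurements are all hypothetical. A measurement of None stands for a frame in which the tracker lost the target.

```python
class ConstantVelocityKalman1D:
    """Kalman filter with state (position, velocity) for one coordinate."""

    def __init__(self, position, process_var=1.0, measurement_var=25.0):
        self.x, self.v = float(position), 0.0
        # Covariance matrix entries; velocity starts highly uncertain.
        self.p00, self.p01 = measurement_var, 0.0
        self.p10, self.p11 = 0.0, 100.0
        self.q, self.r = process_var, measurement_var

    def predict(self, dt=1.0):
        """Advance the state: x += v * dt, P = F P F^T + Q."""
        self.x += self.v * dt
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        # Simplified process noise: q added to both diagonal entries.
        self.p00, self.p01 = p00 + self.q, p01
        self.p10, self.p11 = p10, self.p11 + self.q

    def update(self, z):
        """Correct the state with a position measurement; skip if None."""
        if z is None:
            return  # target lost: keep the predicted state
        s = self.p00 + self.r
        k0, k1 = self.p00 / s, self.p10 / s
        residual = z - self.x
        self.x += k0 * residual
        self.v += k1 * residual
        # P = (I - K H) P with H = [1, 0]
        p00, p01 = (1 - k0) * self.p00, (1 - k0) * self.p01
        p10, p11 = self.p10 - k1 * self.p00, self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11


# Hypothetical x-coordinates of a tracked centroid; None = tracker output lost.
measurements = [100, 106, 109, 116, None, None, 130]
kf = ConstantVelocityKalman1D(measurements[0])
estimates = [kf.x]
for z in measurements[1:]:
    kf.predict()
    kf.update(z)
    estimates.append(kf.x)
print([round(e, 1) for e in estimates])
```

During the two frames without a measurement, the filter continues along the estimated velocity, which keeps a virtual tag moving smoothly until the tracker recovers. A 2D centroid would use one such filter per coordinate, and horizontal wraparound at the panorama seam would need separate handling.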