Employing Shadows for Multi-Person Tracking Based on a Single RGB-D Camera

Although there are many algorithms to track people that are walking, existing methods mostly fail to cope with occluded bodies in the setting of multi-person tracking with one camera. In this paper, we propose a method to use people’s shadows as a clue to track them instead of treating shadows as mere noise. We introduce a novel method to track multiple people by fusing shadow data from the RGB image with skeleton data, both of which are captured by a single RGB Depth (RGB-D) camera. Skeletal tracking provides the positions of people that can be captured directly, while their shadows are used to track them when they are no longer visible. Our experiments confirm that this method can efficiently handle full occlusions. It thus has substantial value in resolving the occlusion problem in multi-person tracking, even with other kinds of cameras.


Introduction
In immersive virtual environments, locomotion through the virtual space is among the most crucial forms of interaction. The primary manifestation of human locomotion is walking, and, hence, genuine walking has substantial advantages over both virtual walking and flying as a mode of locomotion, in terms of its simplicity, straightforwardness, and naturalness [1]. Thus, it is not surprising that real walking in the physical space, which can engender greater degrees of flow experience and is preferred over non-moving modalities [2], has emerged as one of the most natural and effective interaction methods in virtual reality systems [3]. Moreover, it also increasingly serves as a natural means of interaction in many multi-player games. In the Interactive Tag Playground, for instance, the movement of players is tracked by four Kinect devices [4].
There are many algorithms seeking to track genuinely walking people, and visual tracking is a common method [5][6][7][8], of which Yilmaz et al. provide a good overview [9]. With regard to the methods used for visual tracking, RGB cameras are popular visual tracking devices relying on cues from color and intensity signals [10]. Recently, RGB Depth (RGB-D) cameras such as Microsoft's Kinect, which is based on vision techniques, have enabled many new applications. They constitute a non-intrusive and appealing tracking technology due to their low cost and ease of deployment [6,11].
Unfortunately, one often faces the challenge of occlusion in multi-person tracking with a single front-view camera [6]. In recent years, many methods have sought to address this, including methods based on multiple cameras [12][13][14][15] and Kinect setups mounted on the ceiling [16]. Recent work has also focused on the single-perspective occlusion problem. An optimal camera placement scheme can aid in avoiding the full occlusion problem [16,29,30]. Wu et al. mounted a Kinect on the ceiling of a room so as to obtain a bird's eye view to detect humans, wherein occlusions between people on the ground do not occur [16]. However, in settings with a high ceiling or without any ceiling, mounting a camera to obtain a bird's eye view is either infeasible or inconvenient.
Another approach to cope with occlusion under a single perspective is to rely on prediction methods based on motion trajectories, such as Kalman filters or particle filters [31,32]. Meshgi et al. proposed a tracker that exposes the particle filter to a probabilistic treatment of occlusions, fusing various features collected from the color and depth channels [33]. A binary flag (the occlusion flag) is tied to each particle in order to express the state of the tracker's belief regarding the particle's occlusion. However, long-duration occlusion from a single camera entails a loss of observation information, and thus these methods may fail to track the occluded person in the presence of long-duration or full occlusions.
Further research has proposed methods to fuse Kinect data with other sensor data [17,18,34]. Li et al. adopted a novel RFID-depth hybrid sensing approach to track both the identity and location of multiple people in groups [17]. As smartphones are ubiquitous, and most modern ones are equipped with an accelerometer and gyroscope, setups combining smartphones and other sensors have been considered [18,35,36]. Meng et al. presented a method that fuses smartphones and a depth-sensing camera to track people [18]. With the aid of a smartphone, when the user ends up out of the Kinect's view, the system can detect that the user is still playing by analyzing the pattern of the smartphone's sensor readings. However, none of the aforementioned methods adequately addresses the long-duration and full occlusion problems.

Shadow Detection
A person's shadow has the same motion as the person. Therefore, we can treat the shadow as a separate moving object to detect and track [37]. The most widely adopted approach for moving object detection with a fixed camera is background subtraction [38,39], in which the moving object is detected based on the difference between the current frame and the current background model [40].
In recent years, deep learning has been widely used for shadow detection. Khan et al. first introduced deep Convolutional Neural Networks (CNNs) to automatically learn features for shadow regions and boundaries [41]. In the training phase, two CNNs are trained, one for labeling shadow regions and the other for labeling shadow boundaries. In the test phase, the predictions from both CNNs are combined into a unary potential for a Conditional Random Field (CRF) to label image pixels as shadow or non-shadow ones. Vicente et al. proposed a multikernel model for shadow region classification, and also embedded the multikernel region classifier into a CRF [42]. The parameters and hyperparameters of the model are efficiently optimized based on least-squares SVM leave-one-out estimates. However, these methods are mainly based on local region classifications. A recent approach for shadow detection introduces a Conditional Generative Adversarial Network (CGAN), wherein the generator of a CGAN has a full view of the entire image and can reason about the global structure and context [43]. Nguyen et al. presented the first application of adversarial training for shadow detection and developed a novel CGAN architecture with a tunable sensitivity parameter [44]. Wang et al. designed a framework based on a novel STacked CGAN (ST-CGAN) and presented a multi-task perspective. Compared with the existing work, it jointly learns both the detection and removal in an end-to-end manner such that the two objectives mutually benefit each other [45].

Multi-Person Tracking Method
In a front-view tracking system based on a single RGB-D camera, occlusion is a significant problem that can easily cause tracking to fail when tracking multiple people. Thus, it is necessary to design a powerful algorithm to avoid such tracking failures. In this section, we explore the basic idea and principle, and provide the details of our algorithm.

Basic Principle
As a popular RGB-D camera, Microsoft Kinect has one RGB camera, one infrared camera, and one infrared projector, to provide color, depth, and predicted skeleton data. The Kinect device with infrared camera and infrared projector modules is adopted for reliable human detection both in bright and in dark environments. The official Kinect software development kit (SDK) provides data in three spaces: the color image space, depth image space, and skeleton space. The depth image and skeleton spaces are in the camera coordinate system. The origin is the center of the infrared camera. The Z axis is the infrared camera's axis and perpendicular to the image plane.
In many applications, it has been shown that skeleton data can be reliably used to track people. Kinect V2 can predict the skeleton data of up to 6 people simultaneously. However, with a single Kinect, the skeleton of a person is lost when that person is occluded by others. Although RGB image and depth data of the Kinect can be used together as clues to resolve partial occlusions, these methods cannot handle complete body occlusions particularly well. In Figure 1, when the person Hb is occluded by person Hf, person Hb cannot be seen from k1. In this case, the skeleton and joints of person Hb cannot be detected for further tracking. Fortunately, in such a case, shadows are cast that can easily be found. Such shadows are regions that are not directly reached by the light due to the obstruction by the human. They may exist in both indoor and outdoor settings. Therefore, shadows are informative in revealing the existence of the occluded person. Since the shadow of a person moves in conjunction with that person's body, it can easily be captured from the RGB image of the Kinect, and thus it is possible to evaluate the position of the occluded person by analyzing their shadow. Furthermore, human shadows can also be expressly created by adding a light source, as a simple and low-cost solution.
To investigate this, we tracked the center of mass (COM) of a person walking along the X axis in the coordinate system of the Kinect at a constant speed as well as at multiple speeds. Kinect V2 provides 25 joints of the human body, of which we selected the SpineBase joint to represent the position of the person. Figure 2 shows the tracking trajectories of the person computed from their skeleton (blue line) and shadow (red line). We further considered a person walking along the Z direction in the coordinate system of the Kinect at a constant speed and at various speeds. Figure 3 shows the resulting tracking trajectories for the person as computed from their skeleton and from their shadow. Finally, Figure 4 shows the tracking trajectories of a person computed from their skeleton and shadow, wherein the person walks at different speeds along different directions in the coordinate system of the Kinect. The results show that the position of the person computed from their shadow is close to that computed from their skeleton. Thus, we can rely on shadows as a valuable clue to assist in capturing a person's position when their skeleton is lost, while continuing to rely on the skeleton data to compute the position of a person whenever such skeleton data is available.
Figure 4. Tracking trajectories of a person computed from their skeleton (blue line) and shadow (red line), respectively, while they walk along arbitrary directions in the coordinate system of the Kinect (a) at various speeds, and (b) at a constant speed.
Hence, the key idea of our algorithm is as follows: If there is no occlusion between people, the Kinect device can detect the person and provide skeleton data, and we can directly use it to track the person; otherwise, i.e., in the case that one person is occluded and cannot be detected by the Kinect, the skeleton data of the person is not available, so we make use of their shadow in the RGB image captured by the Kinect to assess the person's position.
When relying on the shadow of a person to evaluate their location, it is necessary to segment the shadow within the RGB image and compute the position in the skeleton space. This requires a conversion between the image space and the skeleton space. In particular, Figure 1 shows an RGB image, in which Hb is occluded by Hf. Figure 5 shows the skeleton space, where o is the center of the Kinect infrared camera. G is the plane corresponding to the ground, parallel to the xoz plane. Here, o1 is the projection point of o on the ground plane G, and pf as well as pb represent the positions of the feet of Hf and Hb, respectively. Here, pf, pb, o1 are on G.
Since Hb is occluded by Hf, the points o1, pf, and pb are on the same line. In this case, Hb is occluded by Hf, but the shadow of Hb is visible for the RGB camera. In Figure 5, the shadow of Hb is represented by st. Therefore, we can compute the intersection point of o1pf and st on the plane G, which can be seen as pb to represent the position of Hb.
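The intersection described above can be sketched as a plain 2D line intersection on the ground plane. This is a minimal illustration rather than the authors' implementation: it assumes that all points are given as (x, z) coordinates on G, that the shadow line st is described by two points s and t, and the function name `intersect_on_ground` is hypothetical.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]  # (x, z) coordinates on the ground plane G


def intersect_on_ground(o1: Point, pf: Point, s: Point, t: Point,
                        eps: float = 1e-9) -> Optional[Point]:
    """Intersect the line through o1 and pf with the line through s and t.

    Returns the intersection point, or None if the lines are (nearly) parallel.
    """
    # Direction vectors of the two lines.
    dx1, dz1 = pf[0] - o1[0], pf[1] - o1[1]
    dx2, dz2 = t[0] - s[0], t[1] - s[1]

    # 2D cross product of the directions; zero means parallel lines.
    denom = dx1 * dz2 - dz1 * dx2
    if abs(denom) < eps:
        return None

    # Solve o1 + u * d1 = s + v * d2 for u.
    u = ((s[0] - o1[0]) * dz2 - (s[1] - o1[1]) * dx2) / denom
    return (o1[0] + u * dx1, o1[1] + u * dz1)
```

For instance, with o1 at the origin, pf directly in front of the camera, and a shadow line running across the viewing direction, the returned point lies on the viewing line behind pf, which is where the occluded person is expected to stand.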
Additionally, to quickly and efficiently extract the shadow of the occluded person, we need to locate the region in which the occluded person's shadow can be found. In this paper, we assume that the light source and Kinect are placed such that the shadow of each person is on the left or alternatively on the right side of their body from the perspective of the Kinect during the tracking process. We let a = 0 or 1 designate the left or right side, respectively. For example, in Figures 1 and 6, we can observe that the shadows of Hf and Hb are both located to their left. As shown in Figure 6a, the lines l and r divide the RGB image into three parts Rl, Rm, and Rr, where Rl and Rr are on the left and right side of Hf, respectively. Rr has no shadow, Rl only has shadows of Hf and Hb, and the bodies and small parts of shadows of Hf and Hb are in the region Rm. Hence, the region Rl only has shadows of Hf and Hb after subtracting the background image (see Figure 6b). In this case, it is easier to extract shadows from Rl than from the entire RGB image.
During the tracking process, if the shadow of Hb appears in Rl (when a = 0), then it will always exist in Rl. Here, Rl changes along with the position of Hf in each frame (the method to compute these will be introduced in Section 4). Hence, we extract shadows from Rl. Similarly, if the shadow of Hb appears in Rr (when a = 1), then it will always exist in Rr. Here, Rr changes along with the position of Hf in each frame. Hence, we extract shadows from Rr.

Algorithm Overview
In our method, when there is no loss of tracking, we detect the human body and obtain its skeletal model using the standard approach [46]. The skeleton is used to compute the person's position. Otherwise, the shadow is used to track a person whose skeleton is lost in the tracking process.
In the initialization stage, the algorithm first captures the background image. For each person, once they are inside the depth-perceiving area of the Kinect, we begin to obtain their position using their skeleton data. Additionally, to quickly and efficiently extract the shadow of the occluded person when an occlusion occurs, we need to determine the region in which the person's shadow is located, and set the value of a. During the main runtime stage, we attempt to obtain a person's position using their skeleton data in each frame. If the skeleton of a person is found, we compute their position. Otherwise, we invoke our Shadow-based Tracking Algorithm (STA) to evaluate their position.
The details of our proposed tracking method are as follows (see Figure 7):
Initialization. Capture the background image as B, compute the number of people participating in the initialization scene, obtain each person's position using the skeleton data obtained from the Kinect, and compute the value of a that indicates on which side of the person the shadow is located.
During the tracking, determine if tracking loss has occurred. Tracking loss is detected from the difference in the number of users at adjacent time steps. Assume the number of people at time k is N_k. If N_{k+1} < N_k, there is tracking loss, and we use the method of computing the position under tracking loss. Otherwise, we obtain the skeleton data of all target people from the Kinect.
Compute the position under tracking loss. There may be two cases in which a person has no skeleton: (1) The person is occluded by others. (2) The person is not occluded but walks out of the Kinect's field of view. Based on the cause of tracking loss, we use the corresponding method to compute the position of the user. If the person is beyond the Kinect's field of view, an audio feedback signal is played to warn them. At this moment, the user is requested to take steps backwards until the alarm clears, at which point we can get the position from the skeleton data. If, in contrast, the person is occluded by others, we use the proposed STA algorithm to evaluate the position.
Output each person's position. Based on different methods, the position of each person can be computed. These are used as input for the rendering of virtual scenes later.
Determine whether the tracking is over. If so, finish the tracking; otherwise, continue to track users.
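The per-frame decisions in the steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not specify how the cause of a tracking loss is determined, so the `in_field_of_view` flag is a hypothetical input, and the function names and returned labels are illustrative only.

```python
def tracking_loss_occurred(n_prev: int, n_curr: int) -> bool:
    """Flag tracking loss when fewer people are reported than at the previous step."""
    return n_curr < n_prev


def choose_method(has_skeleton: bool, in_field_of_view: bool) -> str:
    """Select how a person's position is obtained in the current frame.

    has_skeleton:     the Kinect currently reports a skeleton for this person.
    in_field_of_view: hypothetical flag; False means the person walked out of view.
    """
    if has_skeleton:
        return "skeleton"          # position taken directly from skeleton data
    if not in_field_of_view:
        return "warn_out_of_view"  # play audio feedback until the person steps back
    return "shadow"                # person is occluded: evaluate position via STA
```

Keeping the choice in a single function mirrors the description: skeleton data takes priority whenever it is available, and the shadow-based path is used only for occlusion.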

STA Algorithm Design and Implementation
For the case of a person being occluded by others, we invoke our novel STA Algorithm (Algorithm 1), which exploits shadows to solve the tracking under occlusion. In the following, we give an overview of our STA algorithm. Assume that Hb is occluded by Hf (see Figure 1). First, we find the region encompassing the shadow of Hb according to the value of a that we obtained in the initialization stage. Second, we segment Hb's shadow. Third, we compute a line representing the shadow and map the line onto the plane G in the skeleton space, designating it as st. The specific details to implement this algorithm are discussed next.

Search Shadow Region
We now consider how to find the region R in accordance with the value of a computed in the initialization stage, which indicates on which side of the person the shadow is located. We first obtain the head joint of Hf from their skeleton data and transform it into the RGB image space using the function MapCameraPointsToColorSpace() provided by the Kinect SDK; the resulting image point is denoted hf (see Figure 6a).
If a = 0, we consider the line l, which is a vertical line (perpendicular to the x axis of the image) through the point (hf.x − dx/2, 0). The region R to the left of l is used to extract the shadow of Hb. Otherwise, we consider the line r, which is a vertical line through the point (hf.x + dx/2, 0). The region R to the right of r is used to extract the shadow of Hb. Here, dx is evaluated according to the maximum width of the bodies of Hf and Hb in the initialization stage.
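The choice of the search region can be illustrated with a short sketch that returns the range of image columns belonging to R. It assumes that the head point has already been mapped into RGB image coordinates and that R spans the full image height; the function name `shadow_region_columns` and the clamping to the image bounds are assumptions made for illustration.

```python
from typing import Tuple


def shadow_region_columns(hf_x: float, dx: float, a: int,
                          image_width: int) -> Tuple[int, int]:
    """Return the half-open column range [start, stop) of the search region R.

    hf_x: x coordinate of the head joint of Hf in the RGB image.
    dx:   body width estimate from the initialization stage.
    a:    0 if shadows fall to the left of the body, 1 if to the right.
    """
    if a not in (0, 1):
        raise ValueError("a must be 0 (shadow on the left) or 1 (shadow on the right)")

    # Line l lies at hf_x - dx/2, line r at hf_x + dx/2.
    offset = -dx / 2 if a == 0 else dx / 2
    boundary = int(round(hf_x + offset))

    # Clamp so that the region never leaves the image.
    boundary = max(0, min(image_width, boundary))
    return (0, boundary) if a == 0 else (boundary, image_width)
```

Since the region depends only on the head position of Hf, it can be recomputed cheaply in every frame as Hf moves.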

Extract Shadow
A pixel in the background subtraction result is considered a possible shadow pixel when it has lower luminosity than the corresponding pixel of the background image. We obtain the difference image C by subtracting the background image B from R: for each pixel R(x, y) ∈ R, C(x, y) = |R(x, y) − B(x, y)|.
If C(x, y) > T for a threshold T, the pixel is considered as belonging to the shadows of Hf or Hb, and we set C(x, y) = 1; otherwise, we set C(x, y) = 0.
Alternatively, each pixel can be modeled using a Gaussian mixture model, and other methods such as texture-based or non-parametric representations may also be deployed [38,47].
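The thresholded background subtraction can be sketched in a few lines. The sketch assumes single-channel (grayscale) images stored as nested lists of integers and follows the absolute-difference formula given in the text; the function name `shadow_mask` and the choice of threshold value are illustrative assumptions.

```python
from typing import List, Sequence

Image = Sequence[Sequence[int]]  # rows of grayscale intensities


def shadow_mask(region: Image, background: Image, threshold: int) -> List[List[int]]:
    """Binarize |R(x, y) - B(x, y)| with threshold T into the mask C."""
    if len(region) != len(background):
        raise ValueError("region and background must have the same height")

    mask: List[List[int]] = []
    for row_r, row_b in zip(region, background):
        if len(row_r) != len(row_b):
            raise ValueError("region and background must have the same width")
        # A pixel is marked 1 when it differs from the background by more than T.
        mask.append([1 if abs(r - b) > threshold else 0
                     for r, b in zip(row_r, row_b)])
    return mask
```

Because the absolute difference is used, pixels that are brighter than the background are also marked; restricting the test to darker pixels would be a straightforward variation if only shadow candidates are wanted.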

Compute the Position of Occluded People via Shadows
Next, we consider how to compute the position of the occluded person. This entails computing the intersection point of o1pf and st on the plane G (see Figure 5).
First, we scan R from left to right. For each scan line, we access the pixels in C from top to bottom (see Figure 8). When we encounter the first pixel with a value of 1, we record it, stop the scan, and initiate the next scan. All such pixels together constitute the upper contour of Hb's shadow (i.e., the red line in Figure 9a). Here, S is used to represent the set of points on this upper contour. Then, we use the least squares method to fit this contour to a line (the blue line in Figure 9b). The corresponding steps of Algorithm 1 read:
    for each j = H to 0 do
        if C(i, j) = 1 then
            Add the point R(i, j) to S;
            Break;
    Use the least squares method to fit the points in S to a line, and map it onto the plane G, designated as st;
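The contour scan and the line fit can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the mask C is a nested list indexed as mask[row][column] with row 0 at the top of the image, the fit uses the closed-form least-squares solution for y = m·x + b, and the subsequent mapping of the fitted line onto the plane G (which needs the Kinect coordinate mapping) is not covered. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def upper_contour(mask: Sequence[Sequence[int]]) -> List[Tuple[int, int]]:
    """Collect, per column, the first pixel equal to 1 when scanning top to bottom."""
    points: List[Tuple[int, int]] = []
    if not mask:
        return points
    height, width = len(mask), len(mask[0])
    for x in range(width):          # scan columns from left to right
        for y in range(height):     # scan each column from top to bottom
            if mask[y][x] == 1:
                points.append((x, y))
                break               # stop at the first hit, move to next column
    return points


def fit_line(points: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of y = m * x + b; returns (m, b)."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are required")
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all points share the same x; the line is vertical")
    m = (n * sxy - sx * sy) / denom
    b = (sy - m * sx) / n
    return m, b
```

Fitting a line to the whole contour, rather than using individual contour pixels, makes the result less sensitive to isolated segmentation errors in C.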

Experimental Results
In order to verify the effectiveness and usability of our method, we conducted experiments using a Kinect sensor to track a set of human participants.

Experiment Design
In our experiments, we used two Kinect devices, one for tracking users and the other for evaluation purposes. We relied on Kinect k1 to track two human participants. In the tracking process, participant Hf was always visible from k1, while for participant Hb, the device may have experienced tracking loss. In order to evaluate the accuracy of our method, we used the secondary Kinect k2 to record the position of Hb. User Hb was always visible from k2, and the trajectories obtained by k2 were used as reference values to test our method.
The Kinects were placed at a height of 0.8 m above the ground, and the size of the tracking area was 5.7 m². The relative positioning of the Kinect with respect to the human shadows is illustrated in Figure 10. We designed two experiments to assess the system. In the first experiment, we specifically evaluated the tracking accuracy when tracking is inhibited due to bodily occlusion. In the second experiment, we evaluated the tracking accuracy when the human participant moved freely, whereby the skeleton may on occasion have been tracked successfully and on occasion may have failed to be tracked.

Figure 10. Example test environment.

Experiment 1.
The first experiment was designed to assess the accuracy of our method when person Hb was occluded. In the experiment, the person was free to walk around, so that X and Z values could change. In order to better verify the accuracy of the algorithm, we considered three different runs along different paths. We first conducted runs leaving the position along one axis unchanged and allowing the position along the other to change, so as to analyze the tracking accuracy with respect to an individual axis. In other words, the Z position changed, while the X position did not change, or vice versa. Subsequently, we allowed for changes along both axes while computing the position of the target. Accordingly, we designed three different motion paths depending on the direction of motion:
Path 1: When the points o1, pf, and pb were approximately collinear, and the line o1pf was parallel to the Z axis, Hb moved back and forth along the Z direction, as in Figure 5. We analyzed the tracking accuracy with regard to the Z value when Hb was in full and long-duration occlusion.
Path 2: When a full-body occlusion occurred, Hf and Hb moved back and forth along the X direction simultaneously. We analyzed the tracking accuracy of Hb with regard to the X value.
We assessed each of these paths 10 times, relying on a pool of 5 human participants to assume the roles of Hb and Hf.

Experiment 2.
In the second experiment, the participant Hb was instructed to move freely within the space. Unlike in the first experiment, the motion path was not designed in advance. Thus, at any given instant, the skeleton of Hb may or may not have been detected by the Kinect's regular tracking algorithm. We verified the effectiveness of our method in the various scenarios that may occur during person tracking, including non-occlusion and occlusion.

Experiment 1
Experiment 1 comprises three motion paths. Figure 11 compares the position obtained from $H_b$'s skeleton with the position computed by our shadow-based algorithm for Path 1, in which $H_b$ moves back and forth along the Z direction. The two trajectories deviate only slightly, even when the person is under long-term and full-body occlusion. Similarly, Figure 12 shows part of the tracking results for Path 2, recording the X values of the occluded person $H_b$. Again, the trajectory tracked directly from $H_b$'s skeleton and the one computed from shadows deviate only slightly under full-body occlusion. When the person stops at a certain location, the tracking deviation is larger than while the person is in motion. However, the deviation between the positions tracked by the Kinect and those computed by our method shrinks again once the user transitions from a stationary position back to motion, which indicates that our algorithm has a small cumulative error.
For Path 3, the participant is free to walk around, entailing changes along both the X and Z axes. Figure 13 shows the resulting changes in both directions. The results indicate that there is only a minor deviation between the positions tracked by the person's skeleton and those computed by our method.
Figure 11. Comparison between trajectories as tracked by a person's skeleton as opposed to computed using shadows for Path 1, where Users 1, 2, 3, and 4 are randomly selected human participants, and their tracking results correspond to (a-d), respectively.
Figure 12. Comparison between trajectories as tracked by a person's skeleton as opposed to computed using shadows for Path 2, where Users 1, 2, 3, and 4 are randomly selected human participants, and their tracking results correspond to (a-d), respectively.
Overall, the results suggest that our algorithm effectively computes the position of people, even when they are completely occluded or occluded for a long time, regardless of whether their position changes along a single axis or along both axes. Moreover, our algorithm is able to compute a person's position effectively regardless of whether they are stationary, in motion, or in either of the two state transitions. This shows that our algorithm is robust in coping with a variety of occlusions.
Figure 13. Comparison between trajectories as tracked by a person's skeleton as opposed to computed using shadows for Path 3, where Users 1, 2, 3, and 4 are randomly selected human participants, and their tracking results correspond to (a-d), respectively.
Experiment 2
We compared the tracking results obtained from the user's skeleton data against the shadow-based tracking results of our algorithm when a person moves freely in the tracking area. At any instant, $H_b$ may or may not be detected by Kinect $k_1$, depending on whether there are occlusions. $H_b$'s trajectory is recorded by a different Kinect $k_2$, positioned such that $H_b$ is always visible from $k_2$. The trajectory of $H_b$ is represented by blue lines in Figure 14.
When there is no track loss, $H_b$ is detected by $k_1$ in $S_1$, so the obtained position is consistent with the position obtained by $k_2$ in Figure 14. When there is track loss, the trajectory of $H_b$ is computed by our algorithm in $S_2$, and it remains close to the position obtained by $k_2$. Overall, across the different tracking states, the trajectory obtained by our algorithm is very close to the actual trajectory of that person, which shows the effectiveness of our algorithm.
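The switch between the two tracking states can be illustrated with a minimal Python sketch. It assumes that a skeleton position and a shadow-based position are each available as an optional (X, Z) pair per frame; the names `Estimate` and `fuse`, and the choice to hold the last estimate when neither cue is available, are hypothetical and serve only as an illustration, not as a description of our implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Position = Tuple[float, float]  # (X, Z) coordinates on the ground plane


@dataclass
class Estimate:
    position: Position
    state: str  # "S1": skeleton visible, "S2": position derived from the shadow


def fuse(skeleton_pos: Optional[Position],
         shadow_pos: Optional[Position],
         last: Optional[Estimate]) -> Optional[Estimate]:
    """Pick the position source for one frame (hypothetical fusion rule)."""
    if skeleton_pos is not None:
        # No track loss: trust the skeleton reported by the camera.
        return Estimate(skeleton_pos, "S1")
    if shadow_pos is not None:
        # Track loss: fall back to the shadow-based estimate.
        return Estimate(shadow_pos, "S2")
    # Assumption: with neither cue available, keep the previous estimate.
    return last
```

For example, `fuse(None, (1.1, 2.1), None)` yields an estimate in state `"S2"`, whereas any frame with a visible skeleton yields state `"S1"`.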
In Experiment 2, the participant can move freely, which means that a variety of situations may occur during tracking. We therefore measured the tracking accuracy in this experiment quantitatively.
In particular, we computed the deviation between the result of our method and the trajectory of the occluded person $H_b$ obtained from $k_2$ as follows:
Here, $\mathrm{Err}_t^i$ refers to the error value of the trajectory of $H_b$ at time $t$, $N_F$ refers to the duration of the entire tracking run, and $p_t^i$ and $q_t^i$ respectively refer to the trajectory of $H_b$ at time $t$ computed by our method and captured by Kinect $k_2$.
Based on this, in order to evaluate the effectiveness of our method, we compute the accuracy as
Hence, one obtains accuracy values in the range [0, 1], such that the smaller the error value, the higher the accuracy.
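An evaluation of this kind can be sketched in Python as follows. The per-frame Euclidean distance, the clipped linear mapping from mean error to accuracy, and the `scale` parameter are all assumptions chosen for illustration; they are not necessarily the exact formulas used in our experiments, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (X, Z) position in one frame


def per_frame_errors(estimated: Sequence[Point],
                     reference: Sequence[Point]) -> List[float]:
    """Distance between the estimated and the reference position per frame."""
    if len(estimated) != len(reference):
        raise ValueError("trajectories must have the same number of frames")
    # Assumption: the error is measured as the Euclidean distance.
    return [math.dist(p, q) for p, q in zip(estimated, reference)]


def accuracy_from_errors(errors: Sequence[float], scale: float = 1.0) -> float:
    """Map the mean error to [0, 1]; a smaller error gives a higher accuracy."""
    if not errors:
        raise ValueError("at least one frame is required")
    mean_error = sum(errors) / len(errors)
    # Assumption: linear mapping, clipped so the result stays within [0, 1].
    return min(1.0, max(0.0, 1.0 - mean_error / scale))
```

Under these assumptions, a trajectory that is off by 0.2 in every frame yields an accuracy of about 0.8, and the accuracy drops to 0 once the mean error reaches `scale`.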
First, we computed the tracking deviation of participants along the X and Z axes. The maximum tracking deviation along the X axis is 0.21 and the minimum is 0.14, while the average is 0.17. Similarly, the maximum and minimum deviations along the Z axis are 0.21 and 0.1, respectively, and the average is 0.16. Parts of the tracking results are shown in Figure 15.
Subsequently, we measured the tracking accuracy based on the deviation (see Figure 16). The mean value of the tracking accuracy is 0.8 based on our method, which demonstrates that shadows can indeed be used to track the positions of people when their skeletons are lost.
Additionally, we measured the computation time of our method. The algorithm was evaluated on a 2.8 GHz Intel Core i5 computer, and parts of the results are shown in Figure 17. The average time cost is 67 ms per frame, which is equivalent to about 15 frames per second (fps). This indicates that our proposed method is a feasible choice for real-time applications on modest hardware.
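The relation between per-frame cost and frame rate is simple arithmetic: a cost of 67 milliseconds per frame corresponds to 1000 / 67 ≈ 14.9 fps. The Python sketch below shows one way such a measurement could be taken. The function names are hypothetical, and `process_frame` is a stand-in for whatever per-frame tracking routine is being timed.

```python
import time
from typing import Any, Callable, Iterable


def mean_frame_time_ms(process_frame: Callable[[Any], Any],
                       frames: Iterable[Any]) -> float:
    """Average wall-clock time per frame, in milliseconds."""
    durations = []
    for frame in frames:
        start = time.perf_counter()
        process_frame(frame)  # placeholder for the tracking step being timed
        durations.append((time.perf_counter() - start) * 1000.0)
    if not durations:
        raise ValueError("no frames were processed")
    return sum(durations) / len(durations)


def fps_from_ms(ms_per_frame: float) -> float:
    """Convert a per-frame cost in milliseconds to frames per second."""
    if ms_per_frame <= 0:
        raise ValueError("per-frame time must be positive")
    return 1000.0 / ms_per_frame
```

Here, `fps_from_ms(67.0)` evaluates to roughly 14.9, in line with the approximate frame rate stated above.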

Discussion
Occlusion has been a persistent problem for multi-person tracking with a single-view camera. Although a variety of tracking algorithms have been proposed, they do not effectively and efficiently solve the challenges presented by long-duration and full-body occlusion. In this paper, we explore the idea of relying on shadows as additional cues in tracking body movement, rather than merely treating such shadows as noise. We found that shadows are informative in revealing the whereabouts of an occluded person. This is consistent with Wang and Yagi, who found that shadows are helpful in pedestrian detection [21].

Based on these considerations, the problem of computing the motion of the occluded person is transformed into that of computing the movement of that person's shadow. Nevertheless, little prior work has evaluated this idea in multi-person tracking.
In our proposed method, we focused on how to compensate for the reduction in the observable data. To this end, our method leveraged the user's shadow as a feature to locate occluded subjects. Theoretically, it is also possible that $H_b$ is occluded by $H_f$ while their shadow is covered by that of $H_c$ (see Figure 18). In this case, our method could still use the overlapping shadows to compute the position of $H_b$. In our experiments, the mean tracking accuracy of our method was 0.8.
In terms of limitations, the success of this method hinges on accurate shadow detection, which implies that if the shadow is overly light, it will likely not be captured accurately. Our method is robust in indoor settings, since this problem can be addressed by adjusting the lighting so as to obtain darker shadows. In an outdoor setting, however, the method cannot deal with occlusion when there is no sunlight. To further improve the detection, we intend to integrate region of interest (ROI) segmentation [48] and Frustum PointNets [49] to design a more accurate multi-user tracking algorithm for a single RGB-D camera setup. Considering that shadows tend to vary according to the relative position between a person and the light source, the ROI segmentation network will be used to learn to segment partial shadows within a given bounding box, which differs from conventional foreground segmentation networks that focus on segmenting the entire object.
Additionally, our experiments confirm that the proposed method can resolve long-duration and full-body occlusion between two people with a single Kinect. Our approach could also be extended to support multiple people. Furthermore, we only considered the case of a single shadow of a person for a given light source. In settings involving more than one shadow of a person, our method would need to adopt a more elaborate shadow tracking mechanism. Hence, the present study constitutes an initial exploration and opens up new avenues in exploring the potential of shadows for tracking purposes.

Conclusions
In this paper, we proposed a method fusing shadow and skeletal data to track two people using just a single Kinect device. Our experiments show that our algorithm can cope with both long-duration occlusion and full-body occlusion. They also demonstrate that one can improve the tracking capabilities for people in motion with a single Kinect, without needing to resort to the use of additional sensor devices.
The system has recently been applied to mobile VR systems such as maze games as well as a firefighter training simulation system for two players in an indoor setting. In such types of VR applications, users wear a head-mounted display and move inside a specific area to explore the scenes and interact with the virtual world. Furthermore, the method could be invoked as a complementary means of tracking people using other RGB-D cameras and video cameras.