Optically lightweight tracking of objects around a corner

The observation of objects located in inaccessible regions is a recurring challenge in a wide variety of important applications. Recent work has shown that indirect diffuse light reflections can be used to reconstruct objects and two-dimensional (2D) patterns around a corner. However, these prior methods always require some specialized setup involving either ultrafast detectors or narrowband light sources. Here we show that occluded objects can be tracked in real time using a standard 2D camera and a laser pointer. Unlike previous methods based on the backprojection approach, we formulate the problem in an analysis-by-synthesis sense. By repeatedly simulating light transport through the scene, we determine the set of object parameters that most closely fits the measured intensity distribution. We experimentally demonstrate that this approach is capable of following the translation of unknown objects, and translation and orientation of a known object, in real time.

The widespread availability of digital image sensors, along with advanced computational methods, has spawned new imaging techniques that enable seemingly impossible tasks. A particularly fascinating result is the use of ultrafast time-of-flight measurements 1,2 to image objects outside the direct line of sight [3][4][5][6]. Being able to use arbitrary walls as though they were mirrors can provide a critical advantage in many sensing scenarios with limited visibility, like endoscopic imaging, automotive safety, industrial inspection and search-and-rescue operations.
Out of the proposed techniques for imaging occluded objects, some require the object to be directly visible to a structured 7 or narrow-band 8-10 light source. Others resort to alternative regions in the electromagnetic spectrum where the occluder is transparent [11][12][13]. We adopt the more challenging assumption that the object is in the direct line of sight of neither light source nor camera (Fig. 1), and that it can only be illuminated or observed indirectly via a diffuse wall [3][4][5][6]14. All the observed light has undergone at least three diffuse reflections (wall, object, wall), and reconstructing the unknown object is an ill-posed inverse problem. Most solution approaches reported so far use a backprojection scheme as in computed tomography 15, where each intensity measurement taken by the imager votes for a manifold of possible scattering locations. This explicit reconstruction scheme is computationally efficient, in principle real-time capable 6, and can be extended with problem-specific filters 3,16. However, it assumes the availability of ultrafast time-resolved optical impulse responses, whose capture still constitutes a significant technical challenge. Techniques proposed in the literature include direct temporal sampling based on holography 1,17,18, streak imagers 2, gated image intensifiers 5, serial time-encoded amplified microscopy 19, single-photon avalanche diodes 20, and indirect computational approaches using multi-frequency lock-in measurements [21][22][23]. In contrast, implicit methods state the reconstruction task in terms of a problem-specific cost function that measures the agreement of a scene hypothesis with the observed data and additional model priors. The solution to the problem is defined as the function argument that minimizes the cost. In the only such method reported so far 4, the authors regularize a least-squares data term with a computationally expensive sparsity prior, which enables the reconstruction of unknown objects around a corner without the need for ultrafast light sources and detectors.

Figure 1: Tracking objects around a corner. a, Our experimental setup follows the most common arrangement reported in prior work, except that it does not use time-of-flight technology. A camera observes a portion of a white wall. To the right of the camera's field of view, a collimated laser illuminates a spot that reflects light toward the unknown object. The light distribution observed by the camera is the result of three diffuse light bounces (wall-object-wall) plus ambient contributions. b, Geometry of three-bounce reflection for a single surface element. c, Flow diagram of our tracking algorithm. Given shape, position and orientation of an object (the "scene hypothesis"), we simulate light transport to predict the distribution that this object would produce on the wall. By comparing this distribution to the one actually observed by the camera, and refining the parameters to minimize the difference, the object's motion is estimated.
Here we introduce an implicit technique for detecting and tracking objects outside the line of sight in real time. Imaged using routinely available hardware (2D camera, laser pointer), the low-frequency intensity distribution reflecting off the object to the wall serves as our main source of information. To this end, we combine a simulator for three-bounce indirect light transport with a reduced formulation of the reconstruction task. Rather than aiming to reconstruct the geometry of an unknown object, we assume that the target object is rigid, and that its shape and material are either known or irrelevant. Translation and rotation, the only remaining degrees of freedom, can now be found by minimizing a least-squares energy functional, forcing the scene hypothesis into agreement with the captured intensity image.
Our main contributions are threefold. We propose to use light transport simulation to tackle an indirect vision task in an analysis-by-synthesis sense. Using synthetic measurements, we quantify the effect of object movement on the observed intensity distribution, and predict under which conditions the effect is significant enough to be detected. Finally, we demonstrate and evaluate a hardware implementation of a tracking system. Our insights are not limited to intensity imaging, and we believe that they will bring non-line-of-sight sensing closer to practical applications.

Results
Light transport simulation (synthesis). At the center of this work is an efficient renderer for three-bounce light transport. Being able to simulate indirect illumination at an extremely fast rate is crucial to the overall system performance, since each object tracking step requires multiple simulation runs. Like all prior work, we assume that the wall is planar and known, and so is the position of the laser spot. The object is represented as a collection of Lambertian surface elements (surfels), each characterized by its position, normal direction and area. As the object is moved or rotated, all its surfels undergo the same rigid transformation. We represent this transformation by the scene parameter p, which is a three-dimensional vector for pure translation, or a six-dimensional vector for translation and rotation. The irradiance received by a given camera pixel is computed by summing the light that reflects off the surfels. The individual contributions, in turn, are obtained independently of each other as detailed in the Methods section, by calculating the radiative transfer from the laser spot via a surfel to the location on the wall observed by a pixel. Note that by following this procedure, like all prior work, we neglect self-occlusion, occlusion of ambient light, and interreflections. To efficiently obtain a full-frame image, represented by the vector of pixel values S (p), we parallelized the simulation to compute each pixel in a separate thread on the graphics card. The rendering time is approximately linear in the number of pixels and the number of surfels. On an NVIDIA GeForce GTX 780 graphics card, the response from a moderately complex object (500 surfels) at a resolution of 160×128 pixels is rendered in 3.57 milliseconds.
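As a minimal sketch of how a scene parameter p might act on the surfels, the following Python snippet applies the same rigid transformation to every surfel. The text does not state how rotation is parameterized; here we assume, purely for illustration, a rotation vector (unit axis times angle in radians) applied about the origin of the object frame, followed by a translation. The function names `rotate` and `transform_surfels` are hypothetical.

```python
import math

def rotate(v, rotvec):
    """Rotate 3-vector v by a rotation vector (axis * angle) using Rodrigues' formula."""
    angle = math.sqrt(sum(c * c for c in rotvec))
    if angle < 1e-12:
        return tuple(v)
    k = tuple(c / angle for c in rotvec)  # unit rotation axis
    k_dot_v = sum(a * b for a, b in zip(k, v))
    k_cross_v = (k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0])
    c, s = math.cos(angle), math.sin(angle)
    return tuple(v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1.0 - c)
                 for i in range(3))

def transform_surfels(surfels, p):
    """Apply scene parameter p to surfels given as (position, normal, area).

    p has 3 entries (translation only) or 6 entries (translation followed by
    an assumed rotation vector).
    """
    translation = p[:3]
    rotvec = p[3:6] if len(p) == 6 else (0.0, 0.0, 0.0)
    moved = []
    for position, normal, area in surfels:
        new_position = tuple(r + t for r, t in zip(rotate(position, rotvec), translation))
        new_normal = rotate(normal, rotvec)  # normals rotate but do not translate
        moved.append((new_position, new_normal, area))  # area is unchanged by rigid motion
    return moved
```

Because every surfel shares the same p, the number of unknowns stays at three or six regardless of how many surfels describe the object.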
To estimate the magnitude of changes in the intensity distribution that are caused by motion or a change in shape, we performed a numerical experiment using this simulation. In this experiment, we used a fronto-parallel view on a 2 m×2 m wall, with a small planar object (a 10 cm×10 cm white square) located at 50 cm from the wall. Object and laser spot were centered on the wall, but not rendered into the image. Fig. 2 shows the simulated response thus obtained. By varying position and orientation of the object, we obtained difference images that can be interpreted as partial derivatives with respect to the components of the scene parameter p. Since the overall light throughput drops with the fourth power of the object-wall distance, translation in Y direction caused the strongest change. Translation in all directions and rotation about the X and Z axes affected the signal more strongly than the other variations. With differences amounting to several percent of the overall intensity, these changes were significant enough to be detected using a standard digital camera with 8- to 12-bit A/D converter.

Figure 2: Intensity difference images. To investigate the effect of changes in object position and orientation on the intensity distribution observed on the wall, we performed a simplified synthetic experiment with an orthographic view of a 2 m×2 m wall, and laser spot and object centered with respect to the wall. The reference distribution (bottom left) was produced by a 10 cm×10 cm square-shaped object, located at 50 cm from the wall. Six difference images (top row), obtained by translating the object along (±2.5 cm) and rotating it about (±7.5°) the X, Y and Z axes, illustrate the distribution and magnitude of the respective change in the signal. The images shown in the bottom row visualize the difference caused by a change in shape. For display, each difference image has been amplified by the indicated factor (2 to 100,000) that also reflects the relative significance of the effect: Translations and rotations (except around the Y axis) caused the signal to change by roughly 1% per centimeter or per angular degree. A change in the object shape led to a peak difference around 1-2%, and rotation around the Y axis had a much smaller effect.
Experimental setup. Our experiment draws inspiration from prior work 3,4,6,14,16 ; the setup is sketched in Fig. 1(a). Here, due to practical constraints, some of the idealizing assumptions made during the synthetic experiment had to be relaxed. In particular, only an off-peak portion of the intensity pattern could be observed, and we had to shield the camera from the laser spot to avoid lens flare. The actual reflectance distribution of the wall and object surfaces was not perfectly Lambertian, and additional light emitters and reflectors, not accounted for by the simulation, were present in the scene. To obtain a measured image M containing only light from the laser, we took the difference of images captured with and without laser illumination. Additionally, we subtracted a calibration measurement B containing light reflected by the background. A specification of the devices used, and a more detailed introduction of the data pre-processing steps, can be found in the Methods section.
Tracking algorithm (analysis). With the light transport simulation at hand, and given a measurement of light scattered from the object to the wall, we formulate the tracking task as a nonlinear minimization problem. Suppose M and S(p) are vectors encoding the pixel values of the measured object term and the one predicted by the simulation under the transformation parameter or scene hypothesis p, respectively. We search for the parameter p that brings M and S(p) into the best possible agreement by minimizing the cost function

$$ f(p) = \lVert M - \gamma(M, S(p)) \cdot S(p) \rVert_2^2, \qquad \gamma(a, b) = \frac{a^\top b}{b^\top b}. \tag{1} $$

The factor γ(a, b) projects b onto a, minimizing the distance $\lVert a - \gamma(a, b) \cdot b \rVert_2^2$. By including this factor into our objective, we decouple the recovery of the scene parameter p from any unknown global scaling between measurement and simulation, caused by parameters such as surface albedos, camera sensitivity and laser power. To solve this non-linear, non-convex, heavily over-determined problem, we use the Levenberg-Marquardt algorithm 24 as implemented in the Ceres library 25.

Tracking result. To evaluate the method, we performed a series of experiments that are analysed in Figs. 4 and 5. The physical object used in all experiments was a car silhouette cut from plywood and coated with white wall paint, shown in Fig. 3(a). While our setup is able to handle arbitrary three-dimensional objects (as long as the convexity assumption is reasonable), this shape was two-dimensional for manufacturing and handling reasons.
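The scale-invariant cost function of Equation (1) can be illustrated with a short Python sketch. It assumes that M and S(p) are given as flat lists of pixel values; the function names `gamma`, `residuals` and `cost` are chosen here for illustration and are not taken from the authors' implementation. The residual vector is the quantity a Levenberg-Marquardt solver would typically be handed.

```python
def gamma(a, b):
    """Scalar g minimizing ||a - g*b||^2, i.e. the projection coefficient (a.b)/(b.b)."""
    bb = sum(x * x for x in b)
    if bb == 0.0:
        return 0.0  # an all-zero simulation cannot explain any signal
    return sum(x * y for x, y in zip(a, b)) / bb

def residuals(measured, simulated):
    """Per-pixel residuals M - gamma(M, S) * S after optimal global scaling."""
    g = gamma(measured, simulated)
    return [m - g * s for m, s in zip(measured, simulated)]

def cost(measured, simulated):
    """Sum of squared residuals, invariant to a global scale of `simulated`."""
    return sum(r * r for r in residuals(measured, simulated))
```

Multiplying the simulated image by any non-zero constant leaves the cost unchanged, which is why unknown albedos, camera sensitivity and laser power do not need to be calibrated.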
For a given input image M and object shape, the cost function f(p) in Equation (1) depends on three to six degrees of freedom that are being tracked. Fig. 3(b) shows a slice of the function for translation in the XY-plane, with all other parameters fixed. Although the global minimum is located in an elongated, curved trough, only four to five iterations of the Levenberg-Marquardt algorithm are required for convergence from a random location in the tracking volume. In real-time applications, since position and rotation can be expected to change slowly over time, the optimization effort can be reduced to two to three iterations per frame by using the latest tracking result to initialize the solution for the next frame.
In Experiment 1, we kept the object's orientation constant. We manually placed the object at various known locations in a 60 cm×50 cm×60 cm working volume, and recorded 100 camera frames at each location. These frames differ in the amount of ambient light (mains flicker) and in the photon noise. For each frame, we initialized the estimated position to a random starting point in a cube of dimensions (30 cm)³ centered in the tracking volume, and refined the position estimate by minimizing the cost function (Equation 1). The results are shown in Fig. 4(a). From this experiment, we found positional tracking to be repeatable and robust to noise, with a sub-cm standard deviation for each position estimate. The root-mean-square distance to ground truth was measured at 4.8 cm, 2.9 cm and 2.4 cm for movement along the X, Y and Z axis, respectively. This small systematic bias was likely caused by a known shortcoming of the image formation model, which does not account for occlusion of ambient light by the object.

Figure 4: Tracking a known object. a, Experiment 1: Position tracking. Result of three tracking sessions where the object was translated along the X, Y and Z axes. We recorded 100 input images at each position and reconstructed the object position for each input image independently. Plots and error bars visualize the mean and standard deviation of the recovered positions. The area shaded in gray is the confidence range for the true position which was determined using a tape measure. b, Experiment 2: Rotation tracking. Result of three tracking sessions where the object was rotated around the X, Y and Z axes. From 100 input images, we jointly reconstructed translation and rotation. Shown are mean and standard deviation of the recovered rotation angle. The higher uncertainty reflects the fact that rotation in general has a smaller effect on the signal, and the ambiguity between translational and rotational motion (see also Fig. 2).
In Experiment 2, we kept the object at a (roughly) fixed location and rotated it by a range of ±30° around the three coordinate axes using a pan-tilt-roll tripod with goniometers on all joints. Again, we recorded 100 frames per setting. We followed the same procedure as in the first experiment, except that this time we jointly optimized for all six degrees of freedom (position and orientation). The results are shown in Fig. 4(b). As expected, the rotation angles were tracked with higher uncertainty than the translational parameters. We identify two main sources for this uncertainty: the increased number of degrees of freedom and the pairwise ambiguity between X translation and Z rotation, and between Z translation and X rotation (Fig. 2). We recall that in the synthetic experiment, the effect of Y rotation was vanishingly small; here, the system tracked rotation around the Y axis about as robustly as the other axes. This unexpectedly positive result was probably owed to the strongly asymmetric shape of the car object.
So far, we assumed that the object's shape was known. Since this requirement cannot always be met, we dropped it in Experiment 3. We used the data already captured using the car object for the first experiment, but performed the light transport simulation using a single oriented surface element instead of the detailed object model. Except for this simplification, we followed the exact same procedure as in Experiment 1 to track the now unknown object's position. The results are shown in Fig. 5(a). Despite a significant systematic shift introduced by the use of the simplified object model, the position recovery remained robust to noise and relative movement was still detected reliably.
The need for a measured background term B can hinder the practical applicability of our approach as pursued so far. In Experiment 4, we lifted this requirement. When omitting the term without any compensation, the tracking performance degraded significantly (Fig. 5(b)). However, we observed that the background image, caused by distant scattering, was typically smooth and well approximated by a linear function g(u, v) = au + bv + c in the image coordinates u and v (Fig. 6). We extended the tracking algorithm to fit such linear models to both input images M and S(p), and subtract the linear portions prior to evaluating the cost function (Eq. 1). This simple pre-processing step greatly reduced the bias in the tracking outcome and enabled robust tracking of object motion (Fig. 5(c)) even in unknown rooms.
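The linear background compensation can be sketched in Python as follows. This is an illustrative least-squares fit of g(u, v) = au + bv + c, assuming the image is stored as a list of rows indexed as image[v][u]; the helper names `fit_plane` and `remove_linear_part` are hypothetical, and the small normal-equation system is solved with Cramer's rule for brevity rather than numerical robustness.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_plane(image):
    """Least-squares fit of g(u, v) = a*u + b*v + c; returns (a, b, c)."""
    suu = suv = svv = su = sv = n = 0.0
    sui = svi = si = 0.0
    for v, row in enumerate(image):
        for u, value in enumerate(row):
            suu += u * u
            suv += u * v
            svv += v * v
            su += u
            sv += v
            n += 1.0
            sui += u * value
            svi += v * value
            si += value
    normal = [[suu, suv, su], [suv, svv, sv], [su, sv, n]]
    rhs = [sui, svi, si]
    d = det3(normal)
    if abs(d) < 1e-12:
        raise ValueError("image too small or degenerate for a plane fit")
    coeffs = []
    for col in range(3):
        replaced = [row[:] for row in normal]
        for r in range(3):
            replaced[r][col] = rhs[r]  # Cramer's rule: swap in the right-hand side
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)

def remove_linear_part(image):
    """Subtract the fitted plane from every pixel."""
    a, b, c = fit_plane(image)
    return [[value - (a * u + b * v + c) for u, value in enumerate(row)]
            for v, row in enumerate(image)]
```

Applying `remove_linear_part` to both the measured and the simulated image before evaluating the cost makes the comparison insensitive to any additive background that is linear in u and v.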
The supplementary material to this paper contains two videos, each showing a real-time tracking session (Session 1: translation only; Session 2: translation and rotation) using the described setup. A live view of the hidden scene is shown alongside screen output from the tracking software. The average reconstruction rate during these tracking sessions was 10.2 frames per second (limited by the maximum capture rate of our camera-laser setup) for Session 1, and 3.7 frames per second (limited by computation) for Session 2. The two-dimensional car model was represented by 502 surfels; the total compute time required for a single tracking step was 72.9 ms for translation only, and 226.1 ms for translation and rotation.

Figure 5: a, Result of Experiment 3: Positional tracking as in Experiment 1, but with no knowledge about the object shape. We used a single oriented surface element for the light transport simulation. b, Result of Experiment 4: Positional tracking as in Experiment 1, but without subtracting the pre-calibrated room response. The estimated absolute position greatly deviated from the ground-truth position (shaded areas). c, Subtraction of a linear fit significantly reduced the tracking error and made the tracking task feasible even in the absence of a background measurement. In all cases, the standard deviation (error bars) remained small, indicating that changes in position could still be robustly detected.
Discussion. In this work, we showed that the popular challenge of tracking an object around a corner does not inherently necessitate the use of time-of-flight technology. Rather, by formulating an optimization problem based on a simplistic image formation model, we demonstrated parametric object tracking in a room-sized scene with sub-cm repeatability, only using 2D images with a laser pointer as the light source. As our technique, in its current form, does not rely on temporally resolved measurements of any kind, it has the unique property of being scalable to very small scenes (down to the diffraction limit) as well as large scenes (sufficient laser power provided). We identify two main limiting factors to the performance of our technique: a systematic bias caused by shortcomings in the scene and light transport models, and high sensitivity to image noise when tracking object rotation. The adoption of advanced light transport models and noise reduction techniques will further improve the tracking quality. We note that the analysis-by-synthesis approach per se is not limited to intensity imaging, but may form a valuable complement to other sensing modalities as well. For instance, a simple extension to the light transport model would enable it to accommodate time-of-flight imaging. Like in all prior work, we assumed knowledge about the geometry and reflectance of a wall that receives light scattered by the unknown object. Thanks to recent progress in mobile mapping 26 , such data is already widely available for many application scenarios. We imagine a potential application of our technique to be in urban traffic safety, where the motion of vehicles and pedestrians is constrained to the ground plane and hence described by a small number of degrees of freedom.

Methods
Light transport simulation. By assuming that all light has undergone exactly three reflections, we can efficiently simulate indirect illumination, with an overall computational complexity that is linear in the number of pixels and the number of surfels n. The geometry of this simulation is provided in Fig. 1(b). Each camera pixel observes a radiance value, L, leaving from a point on the wall, $p_W$, that, in turn, receives light reflected by the object's surfels. The portion contributed by the surfel of index i ∈ {1, …, n} is the product of three reflectance terms, one per reflection event; and the geometric view factors known from radiative transfer 27,28:

$$
\begin{aligned}
L_i = \rho_0 \;&\cdot\; f_S(p_L - p_S,\; p_i - p_S)\,\langle n_S,\; p_i - p_S \rangle \\
&\cdot\; f_i(p_S - p_i,\; p_W - p_i)\,\frac{\langle n_i,\; p_S - p_i \rangle\,\langle n_i,\; p_W - p_i \rangle}{\lVert p_S - p_i \rVert^2}\, A_i \\
&\cdot\; f_W(p_i - p_W,\; p_C - p_W)\,\frac{\langle n_W,\; p_i - p_W \rangle}{\lVert p_i - p_W \rVert^2}
\end{aligned} \tag{2}
$$

where the operator

$$ \langle a, b \rangle = \begin{cases} \dfrac{a \cdot b}{\lVert a \rVert\,\lVert b \rVert} & \text{if } a \cdot b > 0, \\ 0 & \text{otherwise} \end{cases} $$

denotes a normalized and clamped dot product as used in Lambert's cosine law. Each line in Eq. (2) models one of the three surface interactions. $n_S$, $n_i$ and $n_W$ are the normal vectors of laser spot, surfel and observed point on the wall, and $f_{\{S,i,W\}}(\omega_{in}, \omega_{out})$ are the values of the corresponding bidirectional reflectance distribution functions (BRDF). The incident and outgoing direction vectors $\omega_{in}$ and $\omega_{out}$ that form the arguments to the BRDF are given by the scene geometry. In particular, the vectors $p_L$, $p_S$, $p_i$, $p_W$ and $p_C$ represent the positions of, in this order: the laser source, the laser spot on the wall, the i-th surfel, the observed point on the wall, and the camera (center of projection). $A_i$ is the area of the i-th surfel, and $\rho_0$ a constant factor that subsumes laser power and the light efficiencies of lens and sensor. This factor is cancelled out by the projection performed in the cost function Eq. (1), so we set it to $\rho_0 = 1$ in simulation. The total pixel value is simply computed by summing Eq. (2) over all surfels:

$$ L = \sum_{i=1}^{n} L_i $$

This summation neglects mutual shadowing or inter-reflection between surfels, an approximation that is justifiable for flat or mostly convex objects. For lack of measured material BRDFs, we further assume all surfaces to be of diffuse (Lambertian) reflectance such that $f_{\{S,i,W\}} := \text{const} = 1$, again making use of the fact that the cost function Eq. (1) is invariant under such global scaling factors. If available, more accurate BRDF models as well as object and wall textures can be included at a negligible computational cost.
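To make the three-bounce computation concrete, the following Python sketch evaluates the contribution of individual surfels to one wall point and sums them. It is a simplified illustration under stated assumptions, not the authors' GPU renderer: all BRDFs are set to 1 (Lambertian, as in the text), ρ0 defaults to 1, and the exact way the cosine and inverse-square factors are grouped per bounce is our assumption; any global constant is irrelevant because the cost function is scale-invariant. The names `clamped_cos`, `surfel_radiance` and `pixel_value` are hypothetical.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def clamped_cos(a, b):
    """Normalized dot product, clamped to zero for back-facing directions."""
    norm = math.sqrt(dot(a, a) * dot(b, b))
    if norm == 0.0:
        return 0.0
    return max(0.0, dot(a, b) / norm)

def surfel_radiance(p_S, n_S, p_W, n_W, surfel, rho0=1.0):
    """Light reaching wall point p_W from laser spot p_S via a single surfel.

    surfel is a (position, normal, area) tuple; all BRDFs are assumed to be 1.
    """
    p_i, n_i, area = surfel
    spot_to_surfel = sub(p_i, p_S)
    surfel_to_wall = sub(p_W, p_i)
    d1 = dot(spot_to_surfel, spot_to_surfel)  # squared distance spot -> surfel
    d2 = dot(surfel_to_wall, surfel_to_wall)  # squared distance surfel -> wall
    if d1 == 0.0 or d2 == 0.0:
        return 0.0
    bounce_spot = clamped_cos(n_S, spot_to_surfel)
    bounce_surfel = (clamped_cos(n_i, sub(p_S, p_i))
                     * clamped_cos(n_i, surfel_to_wall) * area / d1)
    bounce_wall = clamped_cos(n_W, sub(p_i, p_W)) / d2
    return rho0 * bounce_spot * bounce_surfel * bounce_wall

def pixel_value(p_S, n_S, p_W, n_W, surfels):
    """Sum over surfels; ignores mutual shadowing and interreflection."""
    return sum(surfel_radiance(p_S, n_S, p_W, n_W, s) for s in surfels)
```

For a surfel facing the wall directly in front of a coincident laser spot and observed point, this expression reduces to area divided by the fourth power of the distance, in line with the fourth-power falloff noted in the Results. A full frame would be obtained by evaluating `pixel_value` once per observed wall point, which is the part that lends itself to per-pixel parallelization.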
Capture devices. Our image source was a Xenics Xeva-1.7-320 camera, sensitive in the near-infrared range (900 nm-1,700 nm), with a resolution of 320×256 pixels at 14 bits per pixel. We used an exposure time of 20 ms. The laser source (1 W at 1,550 nm) was a fiber-coupled laser diode of type SemiNex 4PN-108 driven by an Analog Technologies ATLS4A201D laser diode driver and equipped with a USB interface trigger input. On the output side of the fiber, we fed the collimated beam through a narrow tube with absorbing walls to reduce stray light.
A desktop PC with an NVIDIA GeForce GTX 780 GPU, 32GB of RAM and an Intel Core i7-4930K CPU controlled the devices and performed the reconstruction.
Measurement routine and image pre-processing. After calibrating the camera's gain factors and fixed pattern noise using vendor tools, we assumed that all pixels had the same linear response. All images were downsampled to half the resolution (160×128 pixels) prior to further processing. Due to the diffuse reflections, the measurements do not contain any high-frequency information apart from noise, thus moderate downsampling is a safe way to improve the performance of the later reconstruction.
The images measured by the camera are composed of several contributions, each represented by a vector of pixel-wise contributions: ambient light not originating from the laser, A; laser light scattered by static background objects present in the scene, B; and laser light scattered by the dynamic object, O. All measured images are further affected by noise, the main sources being photon counting noise and signal-independent read noise. We assume the scene to remain stationary at least during short time intervals between successive captures. Further assuming the spatial extent of the object to be small, shadowing of A and B by the object, as well as ambient light reflected by the object, can be neglected. By turning the laser on and off, and inserting and removing the object, the described kind of setup can capture the following combinations of these light contributions:

Laser off (0), object absent (0): $I_{00} = A + \text{noise}$
Laser on (1), object absent (0): $I_{10} = A + B + \text{noise}$
Laser off (0), object present (1): $I_{01} = A + \text{noise}$
Laser on (1), object present (1): $I_{11} = A + B + O + \text{noise}$

The input image to the reconstruction algorithm, M, was obtained as the difference of images captured in quick succession with and without laser illumination. Additionally, we subtracted a calibration measurement containing light reflected by the background:

$$ M = I_{11} - I_{01} - B $$

The addition or subtraction of two input images increases the noise magnitude by a factor of about √2. The background estimate B was captured with the object removed by recording difference images with and without laser illumination. We averaged n = 300 such difference images to reduce noise in the background estimate:

$$ B = \frac{1}{n} \sum_{j=1}^{n} \left( I_{10}^{(j)} - I_{00}^{(j)} \right) $$
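The pre-processing steps described above can be summarized in a few lines of Python. This is only a sketch, assuming images are flat lists of linearized pixel values; the function names `estimate_background` and `measured_object_term` are illustrative and not part of the original software.

```python
def subtract(a, b):
    """Pixel-wise difference of two images stored as flat lists."""
    return [x - y for x, y in zip(a, b)]

def estimate_background(laser_on_frames, laser_off_frames):
    """Average of (laser on - laser off) difference images, object absent."""
    n = len(laser_on_frames)
    if n == 0 or n != len(laser_off_frames):
        raise ValueError("need equally many laser-on and laser-off frames")
    total = [0.0] * len(laser_on_frames[0])
    for on, off in zip(laser_on_frames, laser_off_frames):
        for k, diff in enumerate(subtract(on, off)):
            total[k] += diff
    return [t / n for t in total]

def measured_object_term(i11, i01, background):
    """Input image M: laser-on minus laser-off (object present), minus background B."""
    return subtract(subtract(i11, i01), background)
```

In the noise-free case, the ambient term A cancels in each difference and the result equals the object term O; averaging many difference images for B keeps the background estimate from adding noticeably to the noise in M.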