Exploiting distinctive visual landmark maps in pan–tilt–zoom camera networks

https://doi.org/10.1016/j.cviu.2010.01.007

Abstract

Pan–tilt–zoom (PTZ) camera networks play an important role in surveillance systems: they can direct attention to interesting events that occur in the scene. One method to achieve such behavior is a process known as sensor slaving: one (or more) master camera monitors a wide area and tracks moving targets so as to provide positional information to one (or more) slave camera. The slave camera can thus point at the targets and image them at high resolution.

In this paper we describe a novel framework that exploits a PTZ camera network to relate, with high accuracy, the feet position of a person in the image of the master camera to the head position of the same person in the image of the slave camera. Each camera in the network can act as a master or a slave camera, thus allowing the coverage of wide and geometrically complex areas with a relatively small number of sensors.

The proposed framework does not require any known 3D location to be specified, and takes into account both zooming and target uncertainties. Quantitative results show good performance in target head localization, independently of the zoom factor of the slave camera. An example of a cooperative tracking approach that exploits the proposed framework is also presented.

Introduction

In realistic surveillance scenarios, it is impossible for a single camera sensor, either fixed or with pan–tilt–zoom (PTZ) capabilities, to monitor a wide outdoor area entirely, so as to detect and track moving entities and discover interesting events. In fact, small changes of viewpoint can cause large differences in the appearance of the moving entities, due to illumination, cast shadows and (self-)occlusions, and therefore drastically impact the performance of object detection and tracking as well as of recognition. To address this problem, camera networks are employed to acquire multiple views of the entities from different viewing angles and thus recover information that might be missing when observed from a single viewing direction. Fixed cameras are generally adopted, since it is sufficiently simple to compute their relative spatial relationships [24]. Although fixed camera networks have been successfully applied in real application contexts, they still suffer from the inherent problem of sensor quantization: fixed optics and fixed sensor resolution can make the structure of far-away entities similar to the texture of near-field entities. Super-resolution algorithms [30] applied to low-resolution video frames do little to improve video quality.

An effective solution to this problem can instead be obtained by combining a fixed camera with a PTZ camera working cooperatively. The two cameras are typically arranged in a master–slave configuration [49]: the master camera is kept stationary and set to have a global view of the scene, so as to track several entities simultaneously. The slave camera follows the target trajectory and generates close-up imagery of the entities, driven by the transformed trajectory coordinates, moving from target to target and zooming in and out as necessary. In the most general case, this master–slave configuration can be exploited in a PTZ camera network, where several slave PTZ cameras are controlled by one or more master PTZ cameras to follow the trajectories of some entities and generate multi-view close-up imagery at high resolution. In this framework, each master camera operates as if it were a reconfigurable fixed camera. An important capability of PTZ camera networks, particularly useful for biometric recognition in wide areas, is that of focusing on interesting human body parts such as the head [38].

However, a working implementation of PTZ camera networks poses much more complex problems than classical stationary camera networks. Assuming that all the cameras observe a planar scene, the relationship between the camera image planes is a time-variant planar homography. But the background appearance is not stationary, and the camera parameters change over time as the PTZ cameras pan, tilt and zoom, making it difficult to compute their relative spatial positions. Estimating the time-variant image-to-image homography between a fixed master and a slave PTZ camera in real time is also challenging. Occlusions, sensor quantization and foreshortening effects significantly limit the area of the PTZ camera view in which to search for feature matches with the view of the master camera, therefore making it difficult to compute the corresponding homography. In addition, the magnification introduced by zooming can cause a large variation of the image structure, thus limiting matching performance. Fig. 1 exemplifies the principal problems to be solved in this framework.

In this paper, we discuss a novel solution for the effective implementation and real-time operation of PTZ camera networks. The proposed approach exploits a prebuilt map of 2D visual landmarks of the wide area to support multi-view image matching. The landmarks are extracted from a finite number of images taken from an uncalibrated PTZ camera, so as to cover the entire field of regard. Each image in the map also records the camera parameters at which it was taken. At run time, features detected in the current PTZ camera view are matched to those of the base set in the map. The matches are used to localize the camera with respect to the scene and hence estimate the position of the target body parts. Fig. 2 shows the main components of our system. The motivations and basic ideas underlying this approach were presented in some detail in [16].

We provide several new contributions in this research:

  • A novel uncalibrated method to compute the time-variant homography, exploiting the multi-view geometry of PTZ camera networks (see the sketch after this list). Our approach avoids drifting and does not require calibration marks [23] or manually established pan–tilt correspondences [49].

  • The target body part (namely the head) is localized from the background appearance motion of the slave zooming camera. Head or face detection and segmentation are not required. Differently from [38], our solution explicitly takes into account the camera calibration parameters and their uncertainty.

  • Differently from [1], [49], where a PTZ camera and a fixed camera are set with a short baseline so as to ease feature matching between the two fields of view, our solution defines a general framework for arbitrary camera network topologies. In this framework, any pair of nodes sharing a common field of view can exploit the master–slave relationship between cameras.
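As a hedged illustration of the transfer that this time-variant homography enables (a generic plane-induced formulation from multi-view geometry, not the paper's exact derivation): if $G_m$ and $G_s(t)$ denote the homographies mapping a common scene plane into the master and the slave image, respectively, a position $\mathbf{x}_m$ observed in the master view transfers to the slave view as

$\mathbf{x}_s(t) \sim G_s(t)\, G_m^{-1}\, \mathbf{x}_m,$

where $\sim$ denotes equality up to scale in homogeneous coordinates, and $G_s(t)$ varies over time with the pan, tilt and zoom of the slave camera.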

In the following, we first provide an overview of related work in Section 2. Then, in Section 3, PTZ camera networks with a master–slave configuration are defined in terms of their relative geometry and functionality. The details of the map-building process are presented in Section 4. Camera pose tracking and sensor slaving are presented in Section 5. System performance is discussed in Section 6, followed by final remarks.

Section snippets

Related work

Sensor slaving is a relatively simple practice provided that both the master and the slave camera are calibrated with respect to a local 3D terrain model [9]. Camera calibration allows 3D object locations to be transferred onto the camera image plane, and this information can then be used to steer the pan, tilt and zoom of the slave sensor in the appropriate direction.
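As a hedged illustration of this steering step (a minimal sketch under a standard pinhole model, not the method of [9]; the function name, axis conventions and example values are hypothetical), the following Python fragment converts a 3D target location into the pan and tilt angles at which a calibrated slave camera should point:

```python
import numpy as np

def pan_tilt_to_target(C, R, X):
    """Pan/tilt angles (radians) aiming a calibrated PTZ camera at the
    3D point X, given the camera centre C and its base rotation R.

    Assumes the home pose looks along +z with image y pointing down,
    pan rotating about y and tilt about x (a common PTZ convention)."""
    d = R @ (X - C)                                  # target direction in camera frame
    pan = np.arctan2(d[0], d[2])                     # left/right angle in the x-z plane
    tilt = np.arctan2(-d[1], np.hypot(d[0], d[2]))   # up/down elevation angle
    return pan, tilt

# Hypothetical example: camera at the origin, target 10 m ahead, 2 m right, 1 m up.
pan, tilt = pan_tilt_to_target(np.zeros(3), np.eye(3), np.array([2.0, -1.0, 10.0]))
```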

Several methods have been published in the literature for the calibration of PTZ cameras. Early works concentrated on…

Basic geometric relationships

If all the cameras in a network have overlapping fields of view (i.e., they form a fully connected topology), they can be set in a master–slave relationship pairwise. Accordingly, given a network $\mathcal{N} = \{C_i^{s_i}\}_{i=1}^{M}$ of $M$ PTZ cameras $C_i$ viewing a planar scene, at any given time instant each camera can be in one of two states, $s_i \in \{\text{master}, \text{slave}\}$. The network can therefore be in one of $2^M - 2$ possible state configurations, since the configurations with all cameras in the master state or all in the slave state are excluded.
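A minimal sketch of this counting argument: enumerating every master/slave assignment for $M$ cameras and discarding the two degenerate ones leaves exactly $2^M - 2$ valid network configurations.

```python
from itertools import product

def valid_configurations(M):
    """All master/slave state assignments for M cameras, excluding the
    two degenerate cases (all masters or all slaves)."""
    states = product(("master", "slave"), repeat=M)
    return [s for s in states if len(set(s)) > 1]

# For M = 3 cameras there are 2**3 - 2 = 6 valid configurations.
assert len(valid_configurations(3)) == 2**3 - 2
```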

Building a global map of the scene

In our approach, images of the scene taken from an uncalibrated PTZ camera at different values of pan, tilt and zoom are collected off-line to build a global map of the scene under observation, recording the camera parameters at which each image was taken. The construction of the global map of the scene is decoupled from frame-to-frame tracking of the camera pose and focal length, so that data association errors are not integrated into the map and more precise tracking is…
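The paper does not give an implementation, but a landmark-map entry along these lines can be sketched with OpenCV; the field names are illustrative, and SIFT stands in for whichever detector/descriptor the system actually uses:

```python
import cv2

def build_map_entry(image_bgr, pan, tilt, zoom):
    """One entry of the global map: local features of an off-line image
    together with the PTZ parameters recorded at capture time."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return {
        "pan": pan, "tilt": tilt, "zoom": zoom,  # camera parameters at capture time
        "keypoints": keypoints,                  # 2D landmark positions in the image
        "descriptors": descriptors,              # features matched against at run time
    }
```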

Online master–slave relationship estimation

Keypoints of the current frame are matched with those in the global map by nearest-neighbour search in the feature descriptor space. We followed Lowe's technique [31], which treats the first nearest neighbour (1-NN) in some image as a potential correct match, and the second nearest neighbour (2-NN) in the same image as an estimate of an incorrect match. The final image $I_m$ is the one with the highest number of feature matches with respect to the current image $I_t$. Once the image $I_m$ is found, the correct homography $G_t$ relating $I_t$ to $I_m$ is…
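A hedged OpenCV sketch of this matching step, assuming SIFT-like descriptors, Lowe's 0.75 distance ratio and a RANSAC homography fit (the paper's exact thresholds and estimator are not given here): the map image with the most surviving matches is selected as $I_m$, and $G_t$ is then estimated from the matched point pairs.

```python
import cv2
import numpy as np

def match_to_map(desc_t, kps_t, map_entries, ratio=0.75):
    """Select the map image with the most ratio-test matches against the
    current frame I_t, then estimate the homography G_t relating them."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    best_entry, best_matches = None, []
    for entry in map_entries:
        knn = matcher.knnMatch(desc_t, entry["descriptors"], k=2)  # 1-NN and 2-NN
        good = [p[0] for p in knn
                if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        if len(good) > len(best_matches):
            best_entry, best_matches = entry, good
    src = np.float32([kps_t[m.queryIdx].pt for m in best_matches]).reshape(-1, 1, 2)
    dst = np.float32([best_entry["keypoints"][m.trainIdx].pt
                      for m in best_matches]).reshape(-1, 1, 2)
    G_t, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # robust plane fit
    return best_entry, G_t
```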

Experimental results

The validity of the framework has been evaluated on a dedicated test setup in a wide outdoor parking area of 80 × 15 m. Two Sony SNC-RZ30 IP PTZ cameras were placed near the two ends of the long side, about 60 m apart, operating in a master–slave configuration. Images from both cameras were taken at a resolution of 320 × 240 pixels. We used two maps of visual landmarks of the observed scene, one from each camera. Each map was built at three zoom factors (wide view,…

Conclusion

In this paper we have shown how to combine distinctive visual landmark maps and PTZ camera geometry in order to define and compute the basic building blocks of PTZ camera networks. The proposed approach generalizes to networks with an arbitrary number of cameras, each of which can act either as master or as slave. The proposed framework does not require any known 3D location to be specified, and takes into account both zooming camera and target uncertainties. Results are very…

Acknowledgment

This work is partially supported by the EU IST VidiVideo Project (Contract FP6-045547) and by Thales Italia, Florence, Italy.

References (49)

  • H. Bay et al., Speeded-up robust features (SURF), Computer Vision and Image Understanding, 2008.
  • R. Kumar et al., Robust methods for estimating pose and a sensitivity analysis, CVGIP, 1994.
  • J. Badri, C. Tilmant, J. Lavest, Q. Pham, P. Sayd, Camera-to-camera mapping for hybrid pan–tilt–zoom sensors…
  • A. Bartoli et al., Motion panoramas, Computer Animation and Virtual Worlds, 2004.
  • J. Batista, P. Peixoto, H. Araujo, Real-time active visual surveillance by integrating peripheral motion detection with…
  • A.D. Bimbo, F. Dini, A. Grifoni, F. Pernici, Uncalibrated framework for on-line camera cooperation to acquire human…
  • B. Bose, E. Grimson, Ground plane rectification by tracking moving objects, in: Proceedings of the Joint IEEE…
  • D. Chekhlov, M. Pupilli, W. Mayol-Cuevas, A. Calway, Real-time and robust monocular SLAM using predictive…
  • J. Civera et al., Drift-free real-time sequential mosaicing, International Journal of Computer Vision, 2008.
  • Collins, Lipton, Kanade, Fujiyoshi, Duggins, Tsin, Tolliver, Enomoto, Hasegawa, A system for video surveillance and…
  • R. Collins et al., Algorithms for cooperative multisensor surveillance, Proceedings of the IEEE, 2001.
  • C.J. Costello, C.P. Diehl, A. Banerjee, H. Fisher, Scheduling an active camera to observe people, in: Proceedings of…
  • A. Criminisi et al., Single view metrology, International Journal of Computer Vision, 2000.
  • J. Davis, X. Chen, Calibrating pan–tilt cameras in wide-area surveillance networks, in: Proceedings of ICCV 2003, vol.…
  • A.J. Davison et al., MonoSLAM: real-time single camera SLAM, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
  • L. de Agapito et al., Self-calibration of rotating and zooming cameras, International Journal of Computer Vision, 2001.
  • A. Del Bimbo, F. Dini, A. Grifoni, F. Pernici, Exploiting single view geometry in pan–tilt–zoom camera networks, in:…
  • A. del Bimbo, F. Pernici, Distant targets identification as an on-line dynamic vehicle routing problem using an…
  • A. Hampapur, S. Pankanti, A. Senior, Y.-L. Tian, L. Brown, R. Bolle, Face cataloger: multi-scale imaging for relating…
  • R. Hartley, Self-calibration from multiple views with a rotating camera, in: Proceedings of European Conference on…
  • R.I. Hartley et al., Multiple View Geometry in Computer Vision, 2000.
  • R. Horaud et al., Camera cooperation for achieving visual attention, Machine Vision and Applications, 2006.
  • G. Hua, M. Brown, S. Winder, Discriminant embedding for local image descriptors, in: ICCV07, 2007, pp.…
  • A. Jain, D. Kopell, K. Kakligian, Y.-F. Wang, Using stationary-dynamic camera assemblies for wide-area video…