Elsevier

Image and Vision Computing

Volume 23, Issue 11, 1 October 2005, Pages 999-1008
Extraction of visual features with eye tracking for saliency driven 2D/3D registration

https://doi.org/10.1016/j.imavis.2005.07.003

Abstract

This paper presents a new technique for deriving information on visual saliency from experimental eye-tracking data. The strengths and potential pitfalls of the method are demonstrated with feature correspondence for 2D to 3D image registration. For this application, an eye-tracking system is employed to determine which features in endoscopy video images are considered salient by a group of human observers. Using this information, a biologically inspired saliency map is derived by transforming each observed video image into a feature-space representation. Features related to visual attention are determined through a feature normalisation process based on the relative abundance of image features in the background image and in those dwelled upon along visual search scan paths. These features are then back-projected to the image domain to determine spatial areas of interest for each unseen endoscopy video image. The derived saliency map provides an image similarity measure that forms the heart of a new 2D/3D registration method with much reduced rendering overhead, since only selective regions of interest determined by the saliency map are processed. Significant improvements in pose-estimation efficiency are achieved without apparent reduction in registration accuracy when compared with an intensity-based similarity measure.

Introduction

New techniques for minimally invasive surgery [1], [2] have brought significant benefits to patient care, including reduced trauma, shortened hospitalisation, and improved diagnostic accuracy and therapeutic outcome. Endoscopy is the most common procedure in minimal access surgery, but requires a high degree of manual dexterity from the operator owing to the complexity of the instrument controls, restricted vision and lack of tactile perception. Training methods for these skills involve the use of computer simulation, which needs to be as realistic as possible for specialist training to be effective. One important factor is the level of visual realism presented to the trainee, which is made possible by recent advances in image-based modelling and rendering [3]. Genuine surface texture information can be extracted from real endoscopic videos matched to geometry derived from tomographic images of the same patient. This process relies on the accurate registration of 2D video images to 3D tomographic datasets. The method of 2D/3D registration is a much-researched topic, and many of its applications have arisen from medical imaging and surgical planning [4], [5], [6], [7], [8], [9], [10].

In general, methods of medical image registration can be classified into landmark, segmentation, and voxel-based techniques [11], [12]. Voxel-based techniques [13] have been shown to be effective for the registration of both unimodal and multimodal images, depending on the similarity measures and optimisation strategy used. One typical 2D/3D registration problem is the registration of fluoroscope images against 3D CT or MRI datasets, for which a voxel similarity measure is typically used. Registering 2D bronchoscopy video to 3D tomographic datasets, however, poses unique problems. Unlike fluoroscope images, endoscope video of internal organs features textured surfaces whose shading varies greatly with illumination conditions, distance from the light source, surface reflectivity, degree of subsurface scattering, and inter-reflection properties. In practice, these effects are difficult to model accurately. The photo-consistency method [14] assumes a Lambertian illumination model but relies on having multiple cameras with a known rigid relationship between them, which is difficult to achieve in most endoscopy applications. The flexibility of colonoscopes and bronchoscopes also rules out the use of landmark-based systems such as those with an infrared tracker as described in [10], since the bronchoscope tip does not form a rigid relationship with instruments outside the body, as required by that technique. Recent advances in electromagnetic sensor technology have allowed the fabrication of positional trackers small enough to be inserted into the biopsy channel of the bronchoscope [15]. This, however, limits the tracking functionality to purely exploratory interventions in which no other catheters are used.
Although EM tracking of the bronchoscope tip can be highly accurate (sub-millimetre in the best case), positions are given relative to the EM field emitter and not relative to the surrounding parts of the anatomy, which have often been subjected to non-rigid deformation after pre-operative acquisition of the 3D geometry. Furthermore, the current state-of-the-art catheter-based EM tracker still cannot offer full six degrees of freedom. Therefore, image-based pose estimation with 2D/3D registration is essential to patient-specific endoscope simulation.

In practice, two parallel approaches for 2D/3D endoscope registration are currently being explored. Videos captured by a flexible endoscope can be registered against 3D pre-operative data by using naturally occurring anatomical features as landmarks. Alternatively, one can perform registration by using image pixels directly. Normalised cross-correlation [16], [17] and mutual information-based [18], [19] similarity measures are commonly adopted in this situation. With this approach, the 3D dataset is projected onto a 2D image while catering for occlusion, which, in effect, reduces the problem to 2D–2D registration. Similarity measures such as correlation ratio [20], intensity gradients, texture similarity measures [13], and joint intensity distributions [21] have been adopted previously.

In practice, for each video frame with which the 3D dataset is to be registered, a large number of pose parameters must be evaluated. A different image needs to be rendered from the 3D CT dataset for each unique set of pose parameters and then compared with the video frame. This is computationally expensive and usually represents the bottleneck in the entire registration process. Although the rendering process can be accelerated via the use of specialised graphics hardware, the computational burden is still excessive especially when photo-realistic rendering is considered.
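The render-and-compare loop described above can be sketched as follows. This is a toy illustration only: the `dissimilarity` function here is a stand-in quadratic error on the pose parameters rather than a real comparison of a rendered image against a video frame, and the greedy coordinate-descent optimiser is a generic choice, not the paper's method. It shows why each candidate pose implies a fresh render, which is what makes the loop expensive.

```python
# Toy sketch of the iterative pose search. Every call to dissimilarity()
# stands in for "render image I' from the 3D dataset at this pose, then
# compare with the video frame I" -- the costly step in practice.

def dissimilarity(pose, target_pose):
    # Stand-in for d(I, I'): reduced to squared parameter error here.
    return sum((p - t) ** 2 for p, t in zip(pose, target_pose))

def register(initial_pose, target_pose, step=1.0, tol=1e-6):
    """Greedy coordinate descent over the six pose parameters."""
    pose = list(initial_pose)
    cost = dissimilarity(pose, target_pose)
    while step > tol:
        improved = False
        for i in range(len(pose)):
            for delta in (step, -step):
                trial = list(pose)
                trial[i] += delta
                c = dissimilarity(trial, target_pose)  # one "render" per trial
                if c < cost:
                    pose, cost = trial, c
                    improved = True
        if not improved:
            step *= 0.5  # refine the search once no move helps
    return pose, cost
```

Counting the calls to `dissimilarity` in a loop like this makes the rendering bottleneck concrete: even this naive six-parameter search evaluates dozens of candidate poses per refinement level.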

In order to reduce the rendering overhead, this paper proposes the use of visual saliency for selective rendering so that registration efficiency can be maximised. With this approach, criteria determining the portion of the image to be rendered are essential. To this end, a saliency map derived from a model of the human visual system is employed. Since the similarity measure is not applied to the entire image, pixel-oriented rendering methods, such as ray tracing [22], can be effectively exploited. For large datasets, the proposed scheme can outperform hardware z-buffer-based methods and also allows for more sophisticated illumination models.

The following subsections briefly discuss visual search, the function of saliency in the modelling of this human activity, followed by a detailed explanation of the 2D/3D registration problem. Subsequent sections will then describe how the modelling of visual search through the analysis of eye-tracking data can be used to improve the performance of 2D/3D registration algorithms.

Visual search is the act of searching for a target within a scene. If the scene subtends between 2 and 30° of visual angle, the eyes move across it to find the target; for larger scenes, the head moves as well. The number of visual search tasks performed in a single day is so large that visual search has become a reactive rather than deliberative process for most routine tasks. In practice, the eye movements associated with visual search can be detected with high accuracy by using the relative position of the pupil and the corneal reflection from an infra-red light source [26], [27]. During a visual search, a saccade moves the gaze of the eye to the current area of interest. This area normally needs to be dwelled on for longer than 100 ms in order for the brain to register the underlying visual information; such a dwell point is called a fixation. The objective of eye-tracking systems is to determine when and where fixations occur. The spatio-temporal characteristics of human visual search, together with the intrinsic visual features at the fixation points, provide important clues to the salient features upon which visual comparisons are based.
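A common way to recover fixations from raw gaze samples is dispersion-threshold identification: group consecutive samples into a fixation whenever they stay spatially compact for long enough. The sketch below assumes gaze samples as `(t_ms, x, y)` tuples; the 100 ms dwell threshold comes from the text above, while the 30-pixel dispersion limit and the grouping scheme itself are illustrative assumptions, not the authors' exact processing.

```python
def detect_fixations(samples, min_dwell_ms=100, max_dispersion=30):
    """Group consecutive gaze samples (t_ms, x, y) into fixations.

    A fixation is a run of samples whose bounding-box dispersion
    (width + height) stays under max_dispersion pixels for at least
    min_dwell_ms milliseconds. Returns (start_ms, dwell_ms, cx, cy).
    """
    fixations = []
    start = 0
    while start < len(samples):
        end = start
        # Grow the window while the samples stay spatially compact.
        while end + 1 < len(samples):
            window = samples[start:end + 2]
            xs = [x for _, x, _ in window]
            ys = [y for _, _, y in window]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            end += 1
        dwell = samples[end][0] - samples[start][0]
        if dwell >= min_dwell_ms and end > start:
            window = samples[start:end + 1]
            cx = sum(x for _, x, _ in window) / len(window)
            cy = sum(y for _, _, y in window) / len(window)
            fixations.append((samples[start][0], dwell, cx, cy))
            start = end + 1
        else:
            start += 1
    return fixations
```

The centroid of each accepted window gives the fixation location whose underlying image features can then be analysed for saliency.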

The role of saliency in human visual recognition has long been appreciated. It has been shown that the human visual system does not apply processing power to visual information in a uniform manner. The intermediate and higher visual processes of the primate vision system appear to select a subset of the available sensory information before further detailed processing is applied [23]. This selection takes the form of a focus of attention that scans the scene in a saliency-driven, task-oriented manner [24]. While attention can be controlled voluntarily, it is also attracted unconsciously to conspicuous, or salient, visual locations. The detection of visual saliency for recognition has traditionally been approached with pre-defined image features dictated by domain-specific knowledge. This process is hampered by the fact that visual features are difficult to describe explicitly and the assimilation of near-subliminal information is cryptic. Our previous research has shown that it is possible to use eye tracking to extract the intrinsic visual features used by observers without the need for explicit feature definition [25].

The problem of 2D/3D registration can be formulated as a parameter estimation problem. A camera, C, has both intrinsic and extrinsic parameters. The intrinsic parameters include focal length, lens distortion, and optical origin. For a typical endoscope, these parameters are all fixed and can be determined pre-operatively by using techniques such as [28], [29]. The extrinsic parameters determine the pose of the camera in terms of position, given by three Euclidean coordinates, (Vx,Vy,Vz), and orientation, defined by three Euler angles, (θ0,θ1,θ2). These correspond to the six degrees of freedom of rigid body motion. An object, A, is transformed (via perspective projection) into a 2D image, I, by the camera, C. The problem of 2D/3D registration is to determine the extrinsic parameters (V,θ) uniquely from the image, I, and the position, orientation and 3D geometry of A. A closely related problem in computer vision is that of pose estimation, where the goal is to determine the position and orientation of the object, A, given only its 3D geometry, the image, I, and the position and orientation of the camera, C.
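The extrinsic camera model above can be made concrete with a minimal projection sketch. The Z-Y-X Euler rotation order and the unit focal length below are illustrative assumptions; the paper does not state its rotation convention, and a real endoscope would also need the calibrated intrinsics and lens-distortion model mentioned above.

```python
import math

def euler_to_matrix(t0, t1, t2):
    """Rotation matrix from Euler angles (Z-Y-X order, radians) --
    an assumed convention, used only to make the pose (V, theta) concrete."""
    cz, sz = math.cos(t0), math.sin(t0)
    cy, sy = math.cos(t1), math.sin(t1)
    cx, sx = math.cos(t2), math.sin(t2)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy,     cy * sx,                cy * cx],
    ]

def project(point, V, theta, focal=1.0):
    """Perspective-project a 3D point with camera pose (V, theta)."""
    R = euler_to_matrix(*theta)
    # Express the world point in camera coordinates: R^T (p - V).
    d = [point[i] - V[i] for i in range(3)]
    cam = [sum(R[j][i] * d[j] for j in range(3)) for i in range(3)]
    # Pinhole projection onto the image plane at distance `focal`.
    u = focal * cam[0] / cam[2]
    v = focal * cam[1] / cam[2]
    return u, v
```

Applying `project` to every visible vertex of the 3D geometry A, under a candidate pose (V, θ), is what produces the rendered image I′ discussed below.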

2D/3D registration is often approached as an optimisation problem whereby the camera pose is determined through minimisation of a cost function O(V,θ). In practice there is no closed-form solution for finding the minimum of O, so an iterative solution is required. In each iteration of the minimiser, the object, A, is perspectively projected using the current estimate of the extrinsic camera parameters, (V,θ), to produce a 2D image, I′. A specially constructed distance measure, d, compares I with I′ and indicates the amount of dissimilarity as a scalar value; the cost function can thus be defined as O(V,θ)=d(I,I′). There are a variety of ways to construct the function d, and in this paper a weighted cross-correlation measure on image intensities has been used. Typically, a large number of iterations is required to find each pose, so the overall computation time is dominated by the rendering of image I′ and the evaluation of d(I,I′). Both processes can be made more efficient by computing only a small subset of I′ which, as described in the following sections, is determined through the analysis of visual search behaviour when subjects are presented with examples of images I and I′.
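One plausible form of the weighted cross-correlation measure d is sketched below: a standard normalised cross-correlation in which each pixel carries a weight (here imagined as a saliency value). The exact weighting scheme used in the paper is not reproduced here; this is the textbook weighted-NCC formula, shown only to fix ideas.

```python
import math

def weighted_ncc(I, J, w):
    """Weighted normalised cross-correlation between two intensity
    images I and J, with per-pixel weights w (e.g. a saliency map).
    All three are flat sequences of equal length; the result lies in
    [-1, 1], and a dissimilarity can be taken as d = 1 - weighted_ncc.
    """
    W = sum(w)
    # Weighted means of each image.
    mi = sum(wi * a for wi, a in zip(w, I)) / W
    mj = sum(wi * b for wi, b in zip(w, J)) / W
    # Weighted covariance and variances about those means.
    cov = sum(wi * (a - mi) * (b - mj) for wi, a, b in zip(w, I, J))
    vi = sum(wi * (a - mi) ** 2 for wi, a in zip(w, I))
    vj = sum(wi * (b - mj) ** 2 for wi, b in zip(w, J))
    return cov / math.sqrt(vi * vj)
```

Setting a pixel's weight to zero removes it from the measure entirely, which is what lets the renderer skip that pixel of I′ altogether: the saving applies to both the rendering and the evaluation of d.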

Section snippets

Participants

Six participants, three male and three female, were selected from our research group and consented to have their eye movements tracked for this study. All were in their 20s and had experience viewing endoscopy images.

Apparatus

A phantom lung model was made from silicone rubber, and its inner surfaces were textured using acrylics. The diameter of the inner airways ranged from 12 down to 5 cm. The phantom was scanned with a Siemens Somatom Volume Zoom 4-channel multidetector CT with a slice thickness of 3 mm.

Feature selection

Fig. 2 illustrates an example in which the described Gabor filter is applied to a video frame at four different orientations (0, 45, 90, 135°). The resulting four images were summed to produce the orientation-independent Gabor edge-response image shown in Fig. 2(e). This process was repeated for each of the four scales, and the differences between scales were used to create the saliency map.
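The orientation-summing step can be sketched as follows. The Gabor kernel parameters (σ, wavelength, kernel size) are illustrative assumptions, since the paper's snippet does not state them; only the structure (four orientations, responses summed into one orientation-independent edge image) follows the text.

```python
import math

def gabor_kernel(theta, sigma=2.0, wavelength=4.0, size=9):
    """Real part of a Gabor kernel at orientation theta (radians).
    sigma, wavelength and size are assumed values, not the paper's."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            g = math.exp(-(xr * xr + yr * yr) / (2 * sigma * sigma))
            row.append(g * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def gabor_edge_response(img, orientations=(0, 45, 90, 135)):
    """Sum of absolute Gabor responses over the four orientations,
    giving an orientation-independent edge image (valid region only)."""
    h, w = len(img), len(img[0])
    kernels = [gabor_kernel(math.radians(a)) for a in orientations]
    half = len(kernels[0]) // 2
    out = [[0.0] * (w - 2 * half) for _ in range(h - 2 * half)]
    for k in kernels:
        for y in range(half, h - half):
            for x in range(half, w - half):
                s = sum(k[dy + half][dx + half] * img[y + dy][x + dx]
                        for dy in range(-half, half + 1)
                        for dx in range(-half, half + 1))
                out[y - half][x - half] += abs(s)
    return out
```

Running the same bank with the kernel rescaled for each of the four scales, and differencing the per-scale responses, would then yield the multi-scale structure the saliency map is built from.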

Fig. 3 illustrates one representative image that was used in this experiment, where the associated fixation

Discussion

We have demonstrated that saliency-based 2D/3D registration using only 25% of the image content achieves accuracy comparable to traditional correlation-based 2D/3D registration. In this study, salient image features were extracted from the analysis of human visual behaviour without the use of domain-specific knowledge. Eye-tracking data collected from participants with a training data set were used to automatically determine salient features for all seen and unseen video

References (36)

  • D. Wagner et al., Endoview: a phantom study of a tracked virtual bronchoscopy
  • D. Dey et al., Automatic fusion of freehand endoscopic brain images to three-dimensional surfaces: creating stereoscopic panoramas, IEEE Transactions on Medical Imaging (2002)
  • F. Tendick, D. Polly, D. Blezek, J. Burgess, C. Carignan, G. Higgins, C. Lathan, K. Reinig, Final Report of the...
  • D. Dey, P.J. Slomka, D.G. Gobbi, T.M. Peters, Mixed reality merging of endoscopic images and 3-D surfaces, 796-803, in:...
  • L. Joskowicz, Fluoroscopy-based navigation in computer-aided orthopaedic surgery, in: Proceedings of the IFAC...
  • L. Joskowicz, C. Milgrom, A. Simkin, L. Tockus, Z. Yaniv, Fracas: a system for computer-aided image-guided long bone...
  • D.L.G. Hill et al., Medical image registration, Physics in Medicine and Biology (2001)
  • G.P. Penney et al., A comparison of similarity measures for use in 2D–3D medical image registration, Lecture Notes in Computer Science (1998)