Elsevier

Pattern Recognition

Volume 79, July 2018, Pages 65-78
Unsupervised image saliency detection with Gestalt-laws guided optimization and visual attention based refinement

https://doi.org/10.1016/j.patcog.2018.02.004

Highlights

  • Gestalt-laws guided saliency detection that characterizes the HVS and forms whole objects.

  • Smoothing at superpixel and object levels by fusing bottom-up and top-down mechanisms.

  • Background suppression via a background correlation term and a spatial compactness term.

  • Two-stage refinement gives the best results among 10 state-of-the-art methods on 5 datasets.

Abstract

Visual attention is a fundamental cognitive capability that allows human beings to focus on regions of interest (ROIs) in complex natural environments. Which ROIs we attend to depends mainly on two distinct attentional mechanisms. The bottom-up mechanism guides our detection of salient objects and regions through externally driven factors, e.g. color and location, whilst the top-down mechanism biases our attention based on prior knowledge and cognitive strategies provided by the visual cortex. However, how to practically use and fuse both attentional mechanisms for salient object detection has not been sufficiently explored. To this end, we propose in this paper an integrated framework consisting of bottom-up and top-down attention mechanisms that enables attention to be computed at the level of salient objects and/or regions. Within our framework, the bottom-up mechanism is guided by the Gestalt laws of perception. We interpret the Gestalt laws of homogeneity, similarity, proximity, and figure and ground in terms of color and spatial contrast at the level of regions and objects to produce a feature contrast map. The top-down mechanism uses a formal computational model to describe the background connectivity of attention and produce a priority map. Integrating both mechanisms and applying them to salient object detection, our results demonstrate that the proposed method consistently outperforms a number of existing unsupervised approaches on five challenging and complicated datasets in terms of precision and recall rates, AP (average precision) and AUC (area under curve) values.

Introduction

For human beings, the visual attention system is mainly made up of bottom-up and top-down attention mechanisms that enable us to allocate attention to the most salient stimulus, location, or feature, i.e. the one that evokes stronger neural activation than others in a natural scene [5], [6], [7]. Bottom-up attention helps us gather information from separate feature maps, e.g. color or spatial measurements, which is then incorporated into a global contrast map representing the most salient objects/regions that pop out from their surroundings [11]. Top-down attention modulates the bottom-up attentional signals and helps us voluntarily focus on specific targets/objects, e.g. faces and cars [15]. However, due to the high level of subjectivity and the lack of a formal mathematical representation, it remains very challenging for computers to imitate our visual attention mechanisms. In [11], it was found that the two attentional functions have distinct neural mechanisms yet constantly influence each other. To this end, we aim to build a cognitive framework in which a separate model of each attentional mechanism is integrated to determine visual attention for salient object detection.

To extract features at the bottom level, color plays an important role since it is a central component of the human visual system and facilitates scene segmentation and visual memory [22]. Color is particularly useful for object identification as it is invariant under different viewpoints: we can move or even rotate an object, yet its color appears unchanged because the light reflected from the object into the retina remains the same. As a result, salient regions/objects can be intuitively recognized by their high contrast to the surrounding background.

In addition to color features, our visual perception system is also sensitive to spatial signals, as the retinal ganglion cells transmit the spatial information within natural images to the brain [25]. As a result, we humans pay more attention to objects and regions that not only have dominant colors but also have close and compact spatial distributions. Therefore, the main objective of saliency detection is to computationally group perceptual objects in the way our human visual perception system does.
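As a toy illustration of how color contrast and spatial proximity can be combined, the sketch below scores pre-segmented regions by how strongly their mean color differs from other regions, with nearby regions weighted more. This is a generic region-contrast heuristic in the spirit of global-contrast methods [11], not the paper's exact formulation; all names and parameters are illustrative.

```python
import numpy as np

def region_contrast(colors, centroids, sizes, sigma=0.4):
    """Score each region by its color contrast to the other regions,
    weighting spatially closer regions more strongly (proximity)."""
    n = len(colors)
    sal = np.zeros(n)
    for i in range(n):
        dc = np.linalg.norm(colors - colors[i], axis=1)        # color distance
        ds = np.linalg.norm(centroids - centroids[i], axis=1)  # spatial distance
        w = np.exp(-(ds ** 2) / (2 * sigma ** 2))              # proximity weight
        sal[i] = np.sum(sizes * w * dc)
    return (sal - sal.min()) / (np.ptp(sal) + 1e-12)           # normalize to [0, 1]
```

A region whose color stands out from its (spatially close) neighbors receives a high score, matching the intuition that dominant color plus compact spatial distribution attracts attention.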

Although color and spatial features have been widely used for salient object detection, their efficacy can still be fragile, especially when dealing with large objects and/or complicated backgrounds [23]. The salient object often cannot be extracted as a whole (see examples in Fig. 1), even though it is still relatively easy for our HVS to identify the full extent of the salient objects. This reveals a gap between existing approaches and an ideal one that better exploits the potential of our HVS for more accurate salient object detection. To this end, we propose a Gestalt-laws guided cognitive approach to compute bottom-up attention. As Gestalt laws characterize the capability of the HVS to yield whole forms of objects from a group of simple and even unrelated visual elements [27], e.g. edges and regions, we employ these laws to guide and improve salient object detection.

For modelling top-down attention, Al-Aidroos et al. [28] proposed a theory named 'background connectivity' to describe the stimulus-evoked response of our visual cortex. They found that focusing on scenes rather than objects may increase the background connectivity. Inspired by this theory, we employ a robust background detection model to represent the background connectivity of top-down attention in images, as post-processing to further refine the saliency maps detected by the Gestalt-laws guided processing.
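The background-connectivity intuition can be made concrete with the boundary-connectivity measure used in robust background detection (Zhu et al., cited in the references): a region is likely background if it touches the image border extensively relative to its size. Below is a minimal sketch on a binary region mask; the original method operates on superpixel graphs, so this is a simplification for illustration only.

```python
import numpy as np

def boundary_connectivity(region_mask):
    """Overlap of a region with the image border, normalized by the
    square root of its area: high values suggest background, low
    values suggest a salient (figure) region."""
    border = np.zeros_like(region_mask, dtype=bool)
    border[0, :] = border[-1, :] = True    # top and bottom rows
    border[:, 0] = border[:, -1] = True    # left and right columns
    on_border = np.count_nonzero(region_mask & border)
    area = np.count_nonzero(region_mask)
    return on_border / np.sqrt(area) if area else 0.0
```

An interior object scores 0 (no border contact), while a frame-like background region scores well above 1, so thresholding this measure separates background from figure candidates.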

Fig. 1 shows several examples in which the salient objects have poor color and/or spatial contrast. As a result, conventional approaches either fail to detect the object as a whole or produce massive false alarms. Within the proposed cognitive framework, salient objects are successfully detected whilst false alarms are significantly suppressed. Descriptions of the proposed saliency model and its implementation are detailed in Sections 3–4.

The main contributions of this paper can be highlighted as follows:

  • 1)

    We propose a Gestalt-laws guided optimization and visual attention based refinement framework (GLGOV) for unsupervised salient object detection, where bottom-up and top-down mechanisms are combined to fully characterize the HVS and form objects as a whole;

  • 2)

    We introduce a new background suppression model guided by the Gestalt law of figure and ground, where superpixel-level color quantization and adaptive thresholding are applied to determine the object-level foreground and background; these are used to compute a background correlation term and a spatial compactness term that further suppress the background and highlight the salient objects;

  • 3)

    We have carried out comprehensive experiments on five challenging and complex datasets, benchmarking against ten state-of-the-art saliency detection models, from which useful discussions and conclusions are drawn.
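The adaptive thresholding mentioned in the second contribution splits superpixels into foreground and background without a fixed cut-off. A common adaptive rule from the saliency literature (e.g. a multiple of the mean saliency, as in frequency-tuned approaches) can be sketched as follows; the constant `k` and the exact rule used in this paper may differ.

```python
import numpy as np

def adaptive_split(superpixel_saliency, k=2.0):
    """Label superpixels as foreground when their saliency exceeds
    k times the mean saliency: an image-adaptive threshold rather
    than a fixed global one."""
    threshold = k * np.mean(superpixel_saliency)
    foreground = superpixel_saliency >= threshold
    return foreground, ~foreground
```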

The rest of this paper is organized as follows. Section 2 summarizes related work on saliency detection. The proposed framework combining bottom-up and top-down HVS mechanisms is presented in Section 3, with implementation details discussed in Section 4. Section 5 presents the experimental results and performance analysis. Finally, concluding remarks are drawn in Section 6.

Section snippets

Related work

In the past decades, a number of salient object detection methods have been developed to identify salient regions in terms of a saliency map and to capture as much human perceptual attention as possible. In general, saliency detection methods can be categorized into two classes, i.e. supervised and unsupervised approaches. Most supervised methods, including those using deep learning [29], [30], [31], [32], [33], are able to obtain good saliency maps, where high performance computers even with

The proposed GLGOV framework for unsupervised saliency detection

A new saliency detection framework inspired by the Gestalt laws of the HVS is proposed. The framework contains six main modules, i.e. homogeneity; similarity and proximity; figure and ground; background connectivity; two-stage refinement; and performance evaluation. The overall diagram of our saliency detection framework is illustrated in Fig. 2, where the corresponding Gestalt laws and visual psychology used in the different modules are specified and detailed below.

The homogeneity module aims

Implementation detail of the proposed GLGOV framework

In this section, the implementation of the proposed saliency detection framework is detailed in five stages, i.e. homogeneity; similarity and proximity; figure and ground; background connectivity; and two-stage refinement.

Experimental results

For performance evaluation of the proposed saliency detection method, a total of 10 state-of-the-art algorithms are used for benchmarking, as listed below by the first letter of the method name. They are selected for two main reasons: high citation counts and wide acknowledgement in the community, and/or being presented in the last 3–5 years. The datasets and evaluation criteria, as well as relevant results and discussions, are presented in detail in this section.

  • Bayesian
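The precision and recall figures used throughout the evaluation (and from which AP is derived) are typically computed by sweeping a binarization threshold over the saliency map against a binary ground-truth mask. A minimal sketch of that standard protocol, not the paper's own evaluation code:

```python
import numpy as np

def pr_curve(saliency, ground_truth, num_thresholds=256):
    """Precision and recall of a saliency map in [0, 1] at every
    binarization threshold, against a binary ground-truth mask."""
    gt = ground_truth.astype(bool)
    n_gt = max(np.count_nonzero(gt), 1)
    precision, recall = [], []
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred = saliency >= t
        tp = np.count_nonzero(pred & gt)                  # true positives
        precision.append(tp / max(np.count_nonzero(pred), 1))
        recall.append(tp / n_gt)
    return np.array(precision), np.array(recall)
```

AP is then the area under this precision-recall curve, and AUC is computed analogously from the ROC curve of true versus false positive rates.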

Conclusions

Inspired by both Gestalt-laws optimization and background connectivity theory, in this paper we proposed GLGOV, a cognitive framework combining bottom-up and top-down vision mechanisms for unsupervised saliency detection. Experimental results on five publicly available datasets show that our method produces the best overall accuracy and average accuracy when benchmarked against a number of state-of-the-art unsupervised techniques. Additional assessments in terms of the PR

Acknowledgements

This work was supported by the Natural Science Foundation of China (61672008, 61772144), the Fundamental Research Funds for the Central Universities (18CX05030A), the Natural Science Foundation of Guangdong Province (2016A030311013), Guangdong Provincial Application-oriented Technical Research and Development Special fund project (2016B010127006), and International Scientific and Technological Cooperation Projects of Guangdong Province (2017A050501039).

Yijun Yan received the M.E. degree from University of Strathclyde, Glasgow, UK, in 2013. He is currently a PhD student in the Department of Electronic and Electrical Engineering at University of Strathclyde, Glasgow, UK. His research interests include image retrieval, saliency detection and object tracking.

References (80)

  • A. Desolneux et al., Computational gestalts and perception thresholds, J. Physiol.-Paris (2003).
  • N.I. Córdova et al., Attentional modulation of background connectivity between ventral visual cortex and the medial temporal lobe, Neurobiol. Learn. Mem. (2016).
  • N.-C. Yang et al., A fast MPEG-7 dominant color extraction with new similarity measure for image retrieval, J. Visual Commun. Image Represent. (2008).
  • Y. Xie et al., Bayesian saliency via low and mid level cues, IEEE Trans. Image Process. (2013).
  • J. Zhang et al., Minimum barrier salient object detection at 80 fps.
  • M. Cheng et al., Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell. (2015).
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998).
  • J.M. Wolfe, Guided Search 2.0: a revised model of visual search, Psychonom. Bull. Rev. (1994).
  • C. Koch et al., Shifts in selective visual attention: towards the underlying neural circuitry, Matters of Intelligence (1987).
  • R. Desimone et al., Neural mechanisms of selective visual attention, Annu. Rev. Neurosci. (1995).
  • J. Kim et al., Salient region detection via high-dimensional color transform.
  • Y.-F. Ma et al., Contrast-based image attention analysis by using fuzzy growing.
  • J. Harel, C. Koch, P. Perona, Graph-based visual saliency, presented at Advances in Neural Information ...
  • F. Katsuki et al., Bottom-up and top-down attention: different processes and overlapping neural systems, Neuroscientist (2014).
  • C. Yang et al., Saliency detection via graph-based manifold ranking.
  • X. Hou et al., Saliency detection: a spectral residual approach.
  • Z. Liu et al., Superpixel-based saliency detection.
  • R. Achanta et al., Salient region detection and segmentation, Computer Vision Systems (2008).
  • X. Li et al., Saliency detection via dense and sparse reconstruction.
  • R. Achanta et al., Frequency-tuned salient region detection.
  • R. Achanta et al., Saliency detection using maximum symmetric surround.
  • W. Zhu et al., Saliency optimization from robust background detection.
  • E. Rahtu et al., Segmenting salient objects from images and videos, European Conference on Computer Vision (2010).
  • K.R. Gegenfurtner, Cortical mechanisms of colour vision, Nat. Rev. Neurosci. (2003).
  • G. Li et al., Visual saliency detection based on multiscale deep CNN features, IEEE Trans. Image Process. (2016).
  • M.M. Cheng et al., Global contrast based salient region detection.
  • E. Doi et al., Efficient coding of spatial information in the primate retina, J. Neurosci. (2012).
  • G. Li et al., Deep contrast learning for salient object detection.
  • N.R. Carlson et al., Psychology: The Science of Behavior (2010).
  • N. Al-Aidroos et al., Top-down attention switches coupling between low-level and high-level areas of human visual cortex, Proc. Natl. Acad. Sci. (2012).
    Jinchang Ren received his PhD in Electronic Imaging and Media Communication from Bradford University, U.K. Currently he is a Senior Lecturer (Associate Professor) with University of Strathclyde, Glasgow, U.K. His research interests focus mainly on visual computing and multimedia signal processing, especially on semantic content extraction for video analysis and understanding and hyperspectral imaging.

    Genyun Sun received the B.S. degree from Wuhan University, China, in 2003 and PhD in Institute of Remote Sensing Applications, Chinese Academy of Sciences in 2008. He is currently an Associate Professor with China University of Petroleum, Qingdao, China. His research interests include remote sensing image processing, hyperspectral and high resolution remote sensing, and intelligent optimization algorithms.

    Huimin Zhao received the Ph.D. degree in electrical engineering from the Sun Yat-sen University in 2001. At present, he is a professor of the Guangdong Polytechnic Normal University. His research interests include image, video and information security technology.

    Junwei Han received the Ph.D. degree in pattern recognition and intelligent systems from the School of Automation, Northwestern Polytechnical University, Xi'an, China, in 2003. He is currently a Professor with Northwestern Polytechnical University. His current research interests include multimedia processing and brain imaging analysis.

    Xuelong Li is a full professor with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an 710119, Shaanxi, P.R. China. He is a Fellow of the IEEE.

    Stephen Marshall received the BSc degree from the University of Nottingham and the PhD degree from the University of Strathclyde, U.K. He is a Professor with the Department of Electronic and Electrical Engineering in Strathclyde, and a Fellow of the IET. His research focuses in nonlinear image processing and hyperspectral imaging.

    Jin Zhan received B.S. and Ph.D. degrees from Sun Yat-sen University in 2004 and 2015, respectively, and she is currently a Lecturer in the School of Computer Sciences, Guangdong Polytechnic Normal University. Her research interests include image/video analysis, computer vision, machine learning and applications.
