Unsupervised image saliency detection with Gestalt-laws guided optimization and visual attention based refinement
Introduction
For human beings, the visual attention system comprises both bottom-up and top-down attention mechanisms, which enable us to allocate attention to the most salient stimulus, location, or feature, i.e. the one that evokes stronger neural activation than others in a natural scene [5], [6], [7]. Bottom-up attention gathers information from separate feature maps, e.g. color or spatial measurements, which are then combined into a global contrast map representing the most salient objects/regions that pop out from their surroundings [11]. Top-down attention modulates the bottom-up attentional signals and helps us voluntarily focus on specific targets/objects, e.g. faces and cars [15]. However, due to the high level of subjectivity and the lack of a formal mathematical representation, it remains very challenging for computers to imitate the characteristics of our visual attention mechanisms. In [11], it is found that the two attentional functions have distinct neural mechanisms yet constantly influence each other. To this end, we aim to build a cognitive framework in which separate models for the two attentional mechanisms are integrated to determine visual attention for salient object detection.
To extract features at the bottom level, color plays an important role, since it is a central component of the human visual system and facilitates both scene segmentation and visual memory [22]. Color is particularly useful for object identification because it is invariant under different viewpoints: we can move or even rotate an object, yet the color we see remains unchanged, because the light reflected from the object onto the retina stays the same. As a result, salient regions/objects can be intuitively recognized by their high contrast against the surrounding background.
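As a generic illustration of this color-contrast principle (not the specific model proposed in this paper), a histogram-based global color contrast score can be sketched as follows; the quantization level (12 bins per channel) and the Euclidean color distance are illustrative choices only:

```python
import numpy as np

def color_contrast_saliency(image, bins=12):
    """Histogram-based global color contrast (a generic sketch).

    Each pixel's saliency is the sum of its color distance to all other
    quantized colors, weighted by how frequently those colors occur.
    `image` is an H x W x 3 float array with values in [0, 1].
    """
    # Quantize each channel into `bins` levels -> one color index per pixel.
    q = np.clip((image * bins).astype(int), 0, bins - 1)
    idx = q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]

    # Occupancy probability and mean color of each occupied bin.
    labels, counts = np.unique(idx, return_counts=True)
    probs = counts / counts.sum()
    centers = np.stack([image.reshape(-1, 3)[idx.ravel() == l].mean(0)
                        for l in labels])

    # Contrast of bin i = sum_j p(j) * ||c_i - c_j||.
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    contrast = dist @ probs

    # Map bin contrast back to pixels and normalize to [0, 1].
    lut = dict(zip(labels, contrast))
    sal = np.vectorize(lut.get)(idx).astype(float)
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)
```

On a synthetic image with a small red patch on a green background, the rarer, more distinct patch receives the higher score, which matches the pop-out intuition described above.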
In addition to color features, our visual perception system is also sensitive to spatial signals, as the retinal ganglion cells transmit the spatial information within natural images to the brain [25]. Consequently, we pay more attention to objects and regions that not only have dominant colors but also exhibit close and compact spatial distributions. The main objective of saliency detection is therefore to computationally group perceptual objects in the way our visual perception system does.
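This spatial compactness intuition can also be sketched generically: for each color cluster, the spatial variance of its pixel positions is computed, and tightly grouped clusters score higher than scattered ones. The exponential mapping and its bandwidth are illustrative assumptions, not the measure defined in this paper:

```python
import numpy as np

def spatial_compactness(idx):
    """Spatial compactness cue (illustrative sketch).

    `idx` is an H x W integer map assigning each pixel to a color
    cluster (e.g. from coarse color quantization).  A cluster whose
    pixels are tightly grouped gets a high score; widely scattered
    clusters (typically background) get a low one.
    """
    h, w = idx.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # Normalize coordinates so the variance is resolution-independent.
    ys /= h
    xs /= w
    scores = np.zeros((h, w))
    for label in np.unique(idx):
        mask = idx == label
        # Total spatial variance of the cluster's pixel positions.
        var = ys[mask].var() + xs[mask].var()
        scores[mask] = np.exp(-var / 0.05)  # 0.05: illustrative bandwidth
    return scores
```

A compact central blob thus scores higher than a label spread over the whole frame, mirroring the preference for close and compact spatial distributions noted above.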
Although color and spatial features have been widely used for salient object detection, their efficacy can still be fragile, especially when dealing with large objects and/or complicated backgrounds [23]. The salient object often cannot be extracted as a whole (see examples in Fig. 1), even though it is still relatively easy for the human visual system (HVS) to identify the full extent of the salient objects. This reveals a gap between existing approaches and an ideal one that better exploits the potential of the HVS for more accurate salient object detection. To this end, we propose a Gestalt-law guided cognitive approach to compute bottom-up attention. As Gestalt laws characterize the capability of the HVS to form whole objects from groups of simple and even unrelated visual elements [27], e.g. edges and regions, we employ these laws to guide and improve the process of salient object detection.
For modelling top-down attention, Al-Aidroos et al. [28] proposed a theory named 'background connectivity' to describe the stimulus-evoked response of the visual cortex, finding that focusing on scenes rather than objects increases the background connectivity. Inspired by this theory, we employ a robust background detection model to represent the background connectivity of top-down attention as a post-processing step that further refines the saliency maps obtained from the Gestalt-law guided processing.
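A widely used robust background cue of this kind is the boundary-connectivity measure (see 'Saliency optimization from robust background detection' in the reference list). The following is a simplified region-level sketch of that idea, not necessarily the exact model employed here: a region touching a large portion of the image border relative to its size is likely background.

```python
import numpy as np

def boundary_connectivity(labels):
    """Boundary-connectivity style background cue (simplified sketch).

    `labels` is an H x W map of region (e.g. superpixel) indices.
    BndCon(R) = |border pixels of R| / sqrt(|R|): high for regions
    heavily attached to the image boundary (likely background), zero
    for interior regions (likely foreground).
    """
    h, w = labels.shape
    border = np.zeros((h, w), bool)
    border[0, :] = border[-1, :] = True
    border[:, 0] = border[:, -1] = True
    scores = {}
    for r in np.unique(labels):
        mask = labels == r
        scores[r] = (mask & border).sum() / np.sqrt(mask.sum())
    return scores
```

An interior region never touches the border and so scores zero, while a frame-like background region scores highly; the saliency map can then be refined by down-weighting high-scoring regions.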
Fig. 1 shows several examples in which the salient objects exhibit poor color and/or spatial contrast. In such cases, conventional approaches either fail to detect the object as a whole or produce massive false alarms. Within the proposed cognitive framework, the salient objects are successfully detected whilst false alarms are significantly suppressed. The proposed saliency model and its implementation are detailed in Sections 3 and 4.
The main contributions of this paper can be highlighted as follows:
- 1) We propose a Gestalt-laws guided optimization and visual attention based refinement framework (GLGOV) for unsupervised salient object detection, in which bottom-up and top-down mechanisms are combined to fully characterize the HVS and to effectively form objects as a whole;
- 2) We introduce a new background suppression model guided by the Gestalt law of figure and ground, where superpixel-level color quantization and adaptive thresholding determine the object-level foreground and background used to compute a background correlation term and a spatial compactness term, further suppressing the background and highlighting the salient objects;
- 3) We have carried out comprehensive experiments on five challenging and complex datasets, benchmarking against eight state-of-the-art saliency detection models, from which useful discussions and conclusions are drawn.
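The adaptive thresholding step in contribution 2) can be illustrated generically. The rule below (foreground = saliency above a multiple of the mean, with k = 2 as a common heuristic) is an assumption for illustration, not necessarily the exact threshold used in this paper:

```python
import numpy as np

def split_fg_bg(saliency, k=2.0):
    """Adaptive thresholding of a saliency map (generic sketch).

    Pixels whose saliency exceeds k times the map's mean are treated
    as candidate object-level foreground, the rest as background.
    k = 2 is a common heuristic, assumed here for illustration.
    """
    thresh = k * saliency.mean()
    fg = saliency >= thresh
    return fg, ~fg
```

The resulting foreground/background masks are the kind of object-level partition from which background correlation and spatial compactness terms can then be computed.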
The rest of this paper is organized as follows. Section 2 summarizes the related work on saliency detection. The proposed framework combining bottom-up and top-down HVS mechanisms for saliency detection is presented in Section 3, and its implementation details are discussed in Section 4. Section 5 presents the experimental results and performance analysis. Finally, some concluding remarks are drawn in Section 6.
Related work
In the past decades, a number of salient object detection methods have been developed to identify salient regions in terms of a saliency map and to capture as much human perceptual attention as possible. In general, saliency detection methods can be categorized into two classes, i.e. supervised and unsupervised approaches. Most supervised methods, including those using deep learning [29], [30], [31], [32], [33], are able to obtain good saliency maps, where high performance computers even with
The proposed GLGOV framework for unsupervised saliency detection
A new saliency detection framework inspired by the Gestalt laws of the HVS is proposed. The framework contains six main modules, i.e. homogeneity, similarity and proximity, figure and ground, background connectivity, two-stage refinement, and performance evaluation. The overall diagram of our saliency detection framework is illustrated in Fig. 2, where the corresponding Gestalt laws and the visual psychology used in the different modules are specified and detailed below.
The homogeneity module aims
Implementation detail of the proposed GLGOV framework
In this section, the implementation of the proposed saliency detection framework is detailed below in five stages, i.e. homogeneity, similarity and proximity, figure and ground, background connectivity, and two-stage refinement.
Experimental results
For performance evaluation of our proposed saliency detection method, in total 10 state-of-the-art algorithms are used for benchmarking, listed below in alphabetical order by method name. They are selected for two main reasons, i.e. high citation and wide acknowledgement in the community, and/or recent publication within the last 3-5 years. The datasets and evaluation criteria, as well as the relevant results and discussions, are presented in detail in this section.
- Bayesian
Conclusions
Inspired by both Gestalt-laws optimization and the background connectivity theory, in this paper we proposed GLGOV as a cognitive framework that combines bottom-up and top-down vision mechanisms for unsupervised saliency detection. Experimental results over five publicly available datasets have shown that our method produces the best overall accuracy and average accuracy when benchmarked against a number of state-of-the-art unsupervised techniques. Additional assessments in terms of the PR
Acknowledgements
This work was supported by the Natural Science Foundation of China (61672008, 61772144), the Fundamental Research Funds for the Central Universities (18CX05030A), the Natural Science Foundation of Guangdong Province (2016A030311013), Guangdong Provincial Application-oriented Technical Research and Development Special fund project (2016B010127006), and International Scientific and Technological Cooperation Projects of Guangdong Province (2017A050501039).
Yijun Yan received the M.E. degree from University of Strathclyde, Glasgow, UK, in 2013. He is currently a PhD student in the Department of Electronic and Electrical Engineering at University of Strathclyde, Glasgow, UK. His research interests include image retrieval, saliency detection and object tracking.
References (80)
- Visual attention: Bottom-up versus top-down, Curr. Biol., 2004
- How does the brain solve visual object recognition?, Neuron, 2012
- Complex networks driven salient region detection based on superpixel segmentation, Pattern Recognit., 2017
- Bottom-up saliency detection with sparse representation of learnt texture atoms, Pattern Recognit., 2016
- Diversity induced matrix decomposition model for salient object detection, Pattern Recognit., 2017
- Learning feature fusion strategies for various image types to detect salient objects, Pattern Recognit., 2016
- Attention-driven image interpretation with application to image retrieval, Pattern Recognit., 2006
- Visual attention guided bit allocation in video compression, Image Vision Comput., 2011
- Fusing disparate object signatures for salient object detection in video, Pattern Recognit., 2017
- A feature-integration theory of attention, Cognit. Psychol., 1980
- Computational gestalts and perception thresholds, J. Physiol.-Paris
- Attentional modulation of background connectivity between ventral visual cortex and the medial temporal lobe, Neurobiol. Learn. Mem.
- A fast MPEG-7 dominant color extraction with new similarity measure for image retrieval, J. Visual Commun. Image Represent.
- Bayesian saliency via low and mid level cues, IEEE Trans. Image Process.
- Minimum barrier salient object detection at 80 fps
- Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell.
- A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell.
- Guided Search 2.0: A revised model of visual search, Psychonom. Bull. Rev.
- Shifts in selective visual attention: towards the underlying neural circuitry, in: Matters of Intelligence
- Neural mechanisms of selective visual attention, Annu. Rev. Neurosci.
- Salient region detection via high-dimensional color transform
- Contrast-based image attention analysis by using fuzzy growing
- Bottom-up and top-down attention: different processes and overlapping neural systems, Neuroscientist
- Saliency detection via graph-based manifold ranking
- Saliency detection: a spectral residual approach
- Superpixel-based saliency detection
- Salient region detection and segmentation, in: Computer Vision Systems
- Saliency detection via dense and sparse reconstruction
- Frequency-tuned salient region detection
- Saliency detection using maximum symmetric surround
- Saliency optimization from robust background detection
- Segmenting salient objects from images and videos, in: European Conference on Computer Vision
- Cortical mechanisms of colour vision, Nat. Rev. Neurosci.
- Visual saliency detection based on multiscale deep CNN features, IEEE Trans. Image Process.
- Global contrast based salient region detection
- Efficient coding of spatial information in the primate retina, J. Neurosci.
- Deep contrast learning for salient object detection
- Psychology: The Science of Behavior
- Top-down attention switches coupling between low-level and high-level areas of human visual cortex, Proc. Natl. Acad. Sci.
Jinchang Ren received his PhD in Electronic Imaging and Media Communication from Bradford University, U.K. Currently he is a Senior Lecturer (Associate Professor) with University of Strathclyde, Glasgow, U.K. His research interests focus mainly on visual computing and multimedia signal processing, especially on semantic content extraction for video analysis and understanding and hyperspectral imaging.
Genyun Sun received the B.S. degree from Wuhan University, China, in 2003 and PhD in Institute of Remote Sensing Applications, Chinese Academy of Sciences in 2008. He is currently an Associate Professor with China University of Petroleum, Qingdao, China. His research interests include remote sensing image processing, hyperspectral and high resolution remote sensing, and intelligent optimization algorithms.
Huimin Zhao received the Ph.D. degree in electrical engineering from the Sun Yat-sen University in 2001. At present, he is a professor of the Guangdong Polytechnic Normal University. His research interests include image, video and information security technology.
Junwei Han received the Ph.D. degree in pattern recognition and intelligent systems from the School of Automation, Northwestern Polytechnical University, Xi'an, China, in 2003. He is currently a Professor with Northwestern Polytechnical University. His current research interests include multimedia processing and brain imaging analysis.
Xuelong Li is a full professor with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an 710119, Shaanxi, P.R. China. He is a Fellow of the IEEE.
Stephen Marshall received the BSc degree from the University of Nottingham and the PhD degree from the University of Strathclyde, U.K. He is a Professor with the Department of Electronic and Electrical Engineering at Strathclyde, and a Fellow of the IET. His research focuses on nonlinear image processing and hyperspectral imaging.
Jin Zhan received B.S. and Ph.D. degrees from Sun Yat-sen University in 2004 and 2015, respectively, and she is currently a Lecturer in the School of Computer Sciences, Guangdong Polytechnic Normal University. Her research interests include image/video analysis, computer vision, machine learning and applications.