Neurocomputing, Volume 429, 14 March 2021, Pages 199-214

Attention-aware concentrated network for saliency prediction

https://doi.org/10.1016/j.neucom.2020.10.083

Abstract

This paper presents a biologically-inspired saliency prediction method that imitates two main characteristics of the human perception process: focalization and orienting. The proposed network, named ACNet, is composed of two modules. The first is a concentrated module (CM), which helps the network “see” images with appropriate receptive fields by perceiving rich multi-scale, multi-receptive-field contexts of high-level features. The second is a parallel attention module (PAM), which explicitly guides the network to learn “what” and “where” is salient by simultaneously capturing global and local information with channel-wise and spatial attention mechanisms. Together these two modules form the core component of the proposed method, named ACBlock, which is cascaded to progressively refine the inference of saliency estimation, much as humans zoom in to focus on salient content. Experimental results on seven public datasets demonstrate that the proposed ACNet outperforms state-of-the-art models without any prior knowledge or post-processing.

Introduction

Saliency prediction aims to identify the spatial locations of human eye fixations over an image, and it has received increasing attention in recent years. According to visual cognition science [1], humans selectively attend to distinctive salient regions of an image to better capture visual structure, and ignore the rest. Numerous saliency models have been proposed during the past few years and have been used in various applications and other vision tasks such as salient object detection [2], object segmentation [3], [4], image cropping [5] and diagnosis of mental illness [6].

Guided by biological evidence, traditional methods [7], [8], [9], [10] applied hand-crafted features based on low-level cues (e.g., contrast, texture, color) for saliency prediction. However, these features sometimes fail to model the complex responses of the human visual system, particularly in cluttered scenes.

Recently, methods based on convolutional neural networks (CNNs) have emerged for saliency prediction and reported significant improvements. Owing to the complex semantic representations learned by CNNs, learned features have been demonstrated to be superior to hand-crafted ones for saliency prediction. Despite active study, how to devise an effective yet efficient deep neural network for saliency prediction remains an open challenge.

In this paper, a biologically-inspired architecture named attention-aware concentrated network (ACNet) is proposed for saliency prediction. In general, the operation of visual attention is thought of as a two-stage process [11]. In the first stage, attention is distributed uniformly over the external visual scene and information is processed concurrently. Subsequently, attention is concentrated on a specific area of the visual scene. Such a cognitive process reveals two characteristics of the human perception process: focalization and orienting. To imitate these characteristics, a concentrated module is designed along with a parallel attention module, together forming the proposed method. The two aspects of our motivation are described as follows.

First, the model is motivated by the focalization of the human perception process, i.e., the act of concentrating on a discrete aspect of information while ignoring other perceivable information. When encountering complex scenes, in which salient objects have different sizes and locations and structures similar to the background, humans can adaptively zoom in to concentrate on salient regions while the rest of the scene fades out. To imitate such a zoom-in process, a module that can extract features with various receptive fields is needed. Therefore, we design a concentrated module (CM) to perceive various local contexts and explicitly handle the problem of extracting multi-scale saliency features. As shown in Fig. 1(b), features from the CM respond at multiple receptive fields, which suits a variety of cases, e.g., long or short distance (rows 2 and 3) and small or large object size (rows 1 and 3). However, some features from the CM may also be harmful for saliency prediction. As shown in the first and third columns of Fig. 1(b), features with large receptive fields may respond to redundant background details, while features with small receptive fields may lose some salient information. Therefore, these features need to be sifted by another module, which is described in the next paragraph.
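
Although the paper's exact CM design is not reproduced in this snippet, the idea of perceiving multiple receptive fields in parallel can be sketched in PyTorch as follows. The kernel size, dilation rates, group count and channel widths below are our illustrative assumptions, not the authors' specification.

import torch
import torch.nn as nn

class ConcentratedModule(nn.Module):
    """Sketch of a multi-receptive-field module: parallel grouped
    3x3 convolutions with increasing dilation imitate "zooming" at
    several receptive fields; the branches are then fused."""

    def __init__(self, channels: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        branch_ch = channels // len(dilations)
        self.branches = nn.ModuleList([
            nn.Sequential(
                # dilation d enlarges the receptive field of this branch
                nn.Conv2d(channels, branch_ch, kernel_size=3,
                          padding=d, dilation=d, groups=4),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        self.fuse = nn.Conv2d(branch_ch * len(dilations), channels, kernel_size=1)

    def forward(self, x):
        # each branch responds at a different receptive field
        multi_rf = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(multi_rf, dim=1))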

Second, motivated by the orienting of visual attention, which selectively attends to an object or location over others by moving the eyes to point in that direction [12], we design a parallel attention module (PAM) that inherits this feature-enhancing ability and extracts global and local information simultaneously to generate a context-aware attention map. With such a parallel attention module, the network can explicitly learn “what” and “where” should be emphasized. As shown in Fig. 1(c), the first column shows spatial attention maps generated by the PAM: cluttered background that may cause distraction is suppressed while salient regions are highlighted. As for the channel-wise attention shown in Fig. 1(d), the channels that respond to different objects in the input images are assigned weights reflecting their relative importance, i.e., emphasizing the salient object. Similar to humans, who are attracted by salient objects and move their eye fixations to discrete areas, the proposed network is explicitly guided to choose salient objects (by channel-wise attention) and focus on salient regions (through spatial attention). In addition, this module serves as a feature filter for the CM, adaptively selecting the appropriate scale and receptive field for generating salient regions.
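
As a rough illustration of how channel-wise ("what") and spatial ("where") attention can run in parallel over the same features, consider the sketch below. The branch designs (global average pooling with a bottleneck for the channel branch, a single large-kernel convolution for the spatial branch) are common choices assumed for illustration, not the paper's exact PAM.

import torch.nn as nn

class ParallelAttentionModule(nn.Module):
    """Sketch of parallel attention: a channel branch driven by
    global context and a spatial branch driven by local context,
    both rescaling the same input features."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # channel-wise attention: global pooling -> per-channel weights
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # spatial attention: local convolution -> per-location weights
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # both branches see the same input and jointly rescale it
        return x * self.channel_att(x) * self.spatial_att(x)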

The CM and PAM together constitute the ACBlock, which progressively refines the inference of saliency estimation by explicitly guiding the network to “see” images with appropriate receptive fields and to learn “what” and “where” is salient.
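
Combining the two sketches above, an ACBlock-style unit and its cascade could look like the following; the residual connection and the number of cascaded blocks are assumptions made for illustration.

import torch.nn as nn

class ACBlock(nn.Module):
    """Sketch of an ACBlock: a concentrated module whose
    multi-receptive-field features are sifted by the parallel
    attention module (both classes sketched above)."""

    def __init__(self, channels: int):
        super().__init__()
        self.cm = ConcentratedModule(channels)
        self.pam = ParallelAttentionModule(channels)

    def forward(self, x):
        # PAM filters the CM features; the residual is an assumption
        return x + self.pam(self.cm(x))

# cascading blocks progressively refines the saliency features
refiner = nn.Sequential(*[ACBlock(256) for _ in range(3)])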

To summarize, the contributions of this paper are as follows:

  • It provides a deeper insight into saliency prediction through the imitation of the focalization and orienting of the human perception process, and an end-to-end framework that refines saliency estimation in a progressive manner.

  • We design a concentrated module (CM) which perceives various local contexts through a series of group convolutional filters with different receptive fields, thereby explicitly enlarging the receptive field of the saliency model.

  • We present a parallel attention module (PAM) for discriminative saliency representations with integration of both global and local information.

  • We perform extensive experiments on seven widely used challenging datasets, in which the proposed method yields consistent improvements over a number of strong baselines.

Section snippets

Related work

In this section, we briefly review related work on saliency prediction and on attention mechanisms applied in various vision tasks.

The proposed method

In the present paper, we propose a biologically-inspired saliency prediction method, which contains a feature extraction network (Section 3.1) to imitate the bottom-up process of the human visual system, and several cascaded ACBlocks consisting of CMs (Section 3.2) and PAMs (Section 3.3) to imitate the focalization and orienting of the human perception process, respectively. The CM exploits multi-scale and multi-receptive-field semantics, and the PAM explicitly guides the model to learn “what” and “where” is salient.
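
To make the overall pipeline concrete, a hypothetical end-to-end wiring is sketched below, reusing the ACBlock sketch from the introduction. The backbone choice (VGG-16), the number of blocks and the sigmoid read-out head are illustrative assumptions; the paper's actual configuration is described in Sections 3.1-3.3.

import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class ACNetSketch(nn.Module):
    """Hypothetical wiring: backbone features -> cascaded ACBlocks
    -> 1x1 convolution producing a single-channel saliency map."""

    def __init__(self, num_blocks: int = 3):
        super().__init__()
        # bottom-up feature extractor (pretrained weights would be
        # loaded in practice; omitted here for brevity)
        self.backbone = vgg16().features
        self.blocks = nn.Sequential(*[ACBlock(512) for _ in range(num_blocks)])
        self.head = nn.Sequential(
            nn.Conv2d(512, 1, kernel_size=1),
            nn.Sigmoid(),  # saliency values in [0, 1]
        )

    def forward(self, x):
        feats = self.backbone(x)      # high-level features at 1/32 scale
        refined = self.blocks(feats)  # progressive refinement
        sal = self.head(refined)
        # upsample the prediction back to the input resolution
        return F.interpolate(sal, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)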

Experimental setup

In this section, we describe datasets and metrics used for evaluating the proposed method, and provide implementation details.
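
The snippet does not list the metrics, but saliency prediction is conventionally evaluated with measures such as NSS, CC and KL divergence; minimal reference implementations of these standard metrics (not necessarily the exact set used in the paper) are given below.

import torch

def cc(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Linear correlation coefficient between a predicted and a
    ground-truth saliency map."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return (p * g).mean()

def kld(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KL divergence KL(gt || pred); both maps are normalized to
    sum to one so they can be read as distributions."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return (g * torch.log(g / (p + eps) + eps)).sum()

def nss(pred: torch.Tensor, fixations: torch.Tensor) -> torch.Tensor:
    """Normalized Scanpath Saliency: mean of the standardized map
    at binary fixation locations (same shape as the prediction)."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    return p[fixations > 0].mean()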

Experimental evaluation

In this section, we first perform ablation experiments to gain a better understanding of the design of our modules and to validate their effectiveness. Then we demonstrate the contribution of each key module in the network. We also present quantitative and qualitative comparisons with other state-of-the-art models.

Conclusion

In this paper, we propose a biologically-inspired deep saliency model, named ACNet, for saliency prediction. We design a concentrated module and a parallel attention module to imitate two characteristics of the human perception process: focalization and orienting. The former extracts features with various receptive fields and squeezes features efficiently, thereby enhancing the representation ability of the network and simplifying saliency estimation. The latter extends

CRediT authorship contribution statement

Pengqian Li: Conceptualization, Methodology, Software, Validation, Writing - original draft, Visualization. Xiaofen Xing: Writing - review & editing, Supervision. Xiangmin Xu: Funding acquisition, Supervision. Bolun Cai: Writing - review & editing, Investigation. Jun Cheng: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The work is supported in part by the National Natural Science Foundation of China under Grants U1801262, 61702192, U1636218 and 61806210; in part by the Key-Area Research and Development Program of Guangdong Province, China, under Grant 2019B010154003; in part by the Natural Science Foundation of Guangdong Province, China, under Grants 2019A1515012146 and 2020A1515010781; in part by the Fundamental Research Funds for the Central Universities under Grants 2018MS79, 2019PY21 and 2019MS028; and in part by


References (68)

  • G. Li et al., Constrained fixation point based segmentation via deep neural network, Neurocomputing (2019).
  • R.J. Peters et al., Components of bottom-up gaze allocation in natural images, Vision Research (2005).
  • S. Jia et al., EML-NET: an expandable multi-layer network for saliency prediction, Image and Vision Computing (2020).
  • R.A. Rensink, The dynamic representation of scenes, Visual Cognition (2000).
  • W. Wang et al., Salient object detection driven by fixation prediction.
  • R. Shi et al., Gaze-based object segmentation, IEEE Signal Processing Letters (2017).
  • W. Wang et al., Deep cropping via attention box prediction and aesthetics assessment.
  • W. Wei et al., Saliency prediction via multi-level features and deep supervision for children with autism spectrum disorder.
  • J. Harel, C. Koch, P. Perona, Graph-based visual saliency, in: Advances in Neural Information Processing Systems, 2007, ...
  • S. Goferman et al., Context-aware saliency detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (2011).
  • T. Judd, K. Ehinger, F. Durand, A. Torralba, Learning to predict where humans look, in: 2009 IEEE 12th International ...
  • J. Zhang et al., Saliency detection: a boolean map approach.
  • J. Jonides, Further toward a model of the mind’s eye’s movement, Bulletin of the Psychonomic Society (1983).
  • M.I. Posner et al., Effects of parietal injury on covert orienting of attention, Journal of Neuroscience (1984).
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence (1998).
  • A. Borji, Boosting bottom-up and top-down visual features for saliency estimation, in: 2012 IEEE Conference on Computer ...
  • M. Cerf, J. Harel, W. Einhäuser, C. Koch, Predicting human gaze using low-level saliency combined with face detection, ...
  • A. Kroner, M. Senden, K. Driessens, R. Goebel, Contextual encoder-decoder network for visual saliency prediction, ...
  • Z. Che et al., How is gaze influenced by image transformations? Dataset and model, IEEE Transactions on Image Processing (2019).
  • S. Yang, G. Lin, Q. Jiang, W. Lin, A dilated inception network for visual saliency prediction, IEEE Transactions on ...
  • C. Fosco et al., How much time do you have? Modeling multi-duration saliency.
  • R. Cong et al., Review of visual saliency detection with comprehensive information, IEEE Transactions on Circuits and Systems for Video Technology (2018).
  • R. Cong et al., HSCS: hierarchical sparsity based co-saliency detection for RGBD images, IEEE Transactions on Multimedia (2018).
  • R. Cong et al., Video saliency detection via sparsity-based reconstruction and propagation, IEEE Transactions on Image Processing (2019).
  • C. Li et al., Nested network with two-stream pyramid for salient object detection in optical remote sensing images, IEEE Transactions on Geoscience and Remote Sensing (2019).
  • R. Cong et al., Going from RGB to RGBD saliency: a depth-guided transformation model, IEEE Transactions on Cybernetics (2020).
  • E. Vig et al., Large-scale optimization of hierarchical features for saliency prediction in natural images.
  • J. Pan et al., Shallow and deep convolutional networks for saliency prediction.
  • X. Huang et al., SALICON: reducing the semantic gap in saliency prediction by adapting deep neural networks.
  • M. Cornia, L. Baraldi, G. Serra, R. Cucchiara, A deep multi-level network for saliency prediction, in: 2016 23rd ...
  • W. Wang et al., Deep visual attention prediction, IEEE Transactions on Image Processing (2017).
  • J. Pan, C. Canton, K. McGuinness, N.E. O’Connor, J. Torres, E. Sayrol, X. Giro-i-Nieto, SalGAN: visual saliency ...
  • S.S.S. Kruthiventi et al., DeepFix: a fully convolutional neural network for predicting human eye fixations, IEEE Transactions on Image Processing (2017).
  • M. Kümmerer, L. Theis, M. Bethge, Deep Gaze I: boosting saliency prediction with feature maps trained on ImageNet, in: ...

    Pengqian Li is currently pursuing the M.S. degree with the School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China. His research interests include saliency prediction, salient object detection and segmentation.

    Xiaofen Xing received the B.S., M.S., and Ph.D. degrees from the South China University of Technology, China, in 2001, 2004, and 2013, respectively, where she has been an Associate Professor with the School of Electronic and Information Engineering since 2007. Her main research interests include image/video processing, human computer interaction, and video surveillance.

    Xiangmin Xu (M’13) received the Ph.D. degree from the South China University of Technology, Guangzhou, China. He is currently a Full Professor with the School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China. His current research interests include image/video processing, human–computer interaction, computer vision, and machine learning.

    Bolun Cai received the M.S. and Ph.D. degrees from the South China University of Technology, China, in 2016 and 2019, respectively. He is currently a senior researcher with Tencent WeChat AI. His research interests include computer vision, machine learning, and image processing.

    Jun Cheng received the Ph.D. degree from Nanyang Technological University, Singapore. He is now with the UBTech Research Institute, UBTech Robotics Corp, Shenzhen, China, leading the machine vision R&D for robotics. He has authored/co-authored more than 120 papers in prestigious journals and conferences, such as TMI, TIP, TBME, BOE, IOVS, JAMIA, MICCAI, CVPR, ICCV, and ECCV, and has invented more than 30 patents. He is currently an Associate Editor of IEEE Transactions on Medical Imaging.
