TranSalNet: Towards perceptually relevant visual saliency prediction

Convolutional neural networks (CNNs) have significantly advanced computational modelling for saliency prediction. However, accurately simulating the mechanisms of visual attention in the human cortex remains an academic challenge. It is critical to integrate properties of human vision into the design of CNN architectures, leading to perceptually more relevant saliency prediction. Due to the inherent inductive biases of CNN architectures, there is a lack of sufficient long-range contextual encoding capacity. This hinders CNN-based saliency models from capturing properties that emulate the viewing behaviour of humans. Transformers have shown great potential in encoding long-range information by leveraging the self-attention mechanism. In this paper, we propose a novel saliency model that integrates transformer components into CNNs to capture long-range contextual visual information. Experimental results show that the transformers provide added value to saliency prediction, enhancing its perceptual relevance. Our proposed saliency model using transformers has achieved superior results on public benchmarks and competitions for saliency prediction models. The source code of our proposed saliency model TranSalNet is available at: https://github.com/LJOVO/TranSalNet


Introduction
Visual attention represents an important mechanism of the human visual system (HVS), which allows humans to select and interpret the most relevant information in the visual scene [1]. Simulating visual attention in the form of an
Figure 1: Examples of visual saliency prediction. The first row shows the stimulus images that were freely viewed by human observers. The "Ground Truth" in the second row refers to the fixation density maps, also called saliency maps, generated from the human fixation locations. The third and fourth rows show the prediction results of the traditional (GBVS) and deep learning-based (SAM-ResNet) saliency models, respectively. Images (a) and (b) are from the MIT1003 dataset; (c) and (d) are from the SALICON dataset. Both the traditional and the deep learning-based model are capable of capturing human viewing behaviour, but the deep learning-based model provides considerably better results in demanding scenes, such as (b) and (d).
Existing saliency prediction models can be categorized into two types: traditional and deep learning-based models. Traditional models [9,10,11,12] apply low-level visual features such as colour, luminance, texture, and contrast to simulate the visually salient areas in the scene. These models remain rather limited, as higher-level features such as objects are often omitted even though they are significant determinants of visual saliency [13,14]. Although some traditional models [15] have been extended with specific higher-level features, e.g., faces and text, there are still obstacles in combining low-level and higher-level visual features. Rather than designing handcrafted features, deep learning-based saliency models automatically discover representations from images [16,17,18,19,20,21,22,23,24]. These methods typically use convolutional neural networks (CNNs) to construct feature encoders and decoders to generate visual saliency maps. Deep learning-based visual saliency models have achieved remarkable success, mainly due to the availability of well-established deep CNNs [25,26,27,28] and large-scale datasets relevant to human visual attention [29]. Figure 1 illustrates examples of visual saliency prediction using both traditional and deep learning-based models, and the correspondence between the ground truth (i.e., where humans look in an image) and the prediction (i.e., the output of a computational saliency model).
Accurately predicting saliency as perceived by humans remains an academic challenge. One way to improve the reliability of saliency prediction is to incorporate the properties of the HVS in the construction of computational models [9,30]. Despite the significant progress made by deep learning-based models, each convolution kernel in a CNN only receives information from a local subset of pixels in an image, which makes fully CNN-based models deficient in obtaining long-range contextual information. When humans view an image, foveal vision provides the highest-resolution visual information, while peripheral vision still provides the HVS with non-detailed but long-range visual information [31,32,33]. In other words, the HVS uses the long-range information of an image to modulate the local maxima of saliency in the visual field [30,34]. Therefore, the ground truth saliency map represents the perceptual spatial interactions of local and non-local (i.e., long-range) information. Modelling this HVS property could help predict visual saliency in a perceptually more relevant manner, so that the machine-generated saliency map faithfully reflects human perception.
Previous studies have mainly attempted to solve this problem through two approaches. One approach is to capture multi-scale information through CNNs, using multi-scale images or multi-scale representations [35,22,19,36,17], which introduces image representations with different granularities to some extent. This approach may not provide the optimal solution, as it still lacks the ability to model the way visual information is perceived by the HVS, and challenges remain in optimally fusing multi-scale information to mimic the functionality of the HVS. Another approach is to add long-range modelling capabilities to network structures to increase spatial representations. By using a Long Short-Term Memory (LSTM)-based architecture [18,37], this approach has proven effective in handling local and long-range visual information, thus refining the accuracy of saliency prediction. Although these studies have demonstrated promising outcomes, much work is needed to close the gap between saliency prediction and human perception. The transformer [38], which is built on a self-attention mechanism, provides an elegant solution for processing long-range information. By effectively modelling long-range dependencies, the transformer has proven its efficacy in natural language processing [39] and has more recently achieved promising results in computer vision tasks [40,41]. However, the use of transformers in visual saliency prediction has not been fully explored until now.
To address the above-mentioned challenges and to build a human-like saliency model, we propose a novel saliency prediction model called TranSalNet, which integrates transformers into a CNN-based architecture. Transformer encoders can learn spatially long-range dependencies by using a self-attentive mechanism, resulting in a perceptually more relevant saliency representation. To the best of our knowledge, this is the first study to explore the combination of CNNs and transformers to enhance saliency prediction. Also, we demonstrate the benefits of transformer components in saliency prediction. Our model achieves superior performance not only on the MIT300 benchmark (the most widely recognised dataset for saliency benchmark) but also on the SALICON Saliency Prediction Challenge (the largest dataset available for saliency prediction).

Related work
We contribute towards a perceptually more relevant saliency prediction method using deep learning models with transformers. This section reviews deep learning-based saliency prediction models, methods for evaluating saliency models (especially evaluating the perceptual relevance of saliency prediction), transformer applications in vision tasks, and multi-scale and long-range information in visual saliency prediction.

Deep learning-based visual saliency prediction
A number of deep learning-based visual saliency prediction models have been proposed in recent years. The Ensembles of Deep Networks (eDN) [42] represents one of the first models that adopts shallow CNNs to detect the visual saliency of natural images. The saliency features are extracted by CNNs and combined by a linear classifier to create saliency maps. Since then, with the application of deep neural networks and large-scale saliency datasets, deep learning-based saliency prediction has achieved further remarkable successes. DeepGaze and DeepGaze II [23], which are based on AlexNet [25] and VGGNet [26], respectively, successfully build pre-trained networks as feature extractors to train deeper networks for saliency prediction. By comparing VGGNet, AlexNet, and GoogleNet [43], Huang et al. [35] found that VGGNet detects saliency more effectively than the other two models. Many visual saliency prediction models based on VGGNet have since been proposed [16,17,19]. EML-NET [20] focuses on exploring the use of more sophisticated feature extractors (i.e., a parallel two-stream CNN-based encoder) to enhance the performance of saliency prediction. By comparing ResNet-50 with DenseNet [28] and NASNet [44], it is argued that in the field of saliency prediction, the widely used ResNet-50 could still be "shallow" for the large-scale saliency datasets, such as SALICON. Similarly, DeepGaze II-E discusses the contribution of different backbones to saliency prediction. It is found that appropriately concatenating multiple backbones pre-trained on ImageNet [45] is effective in improving the performance of saliency models.
In addition to the efforts mentioned above, there are several studies that adopt multi-scale or long-range information to improve visual saliency prediction. We discuss this issue below in Section 2.4.

Evaluation methods for saliency models
A number of metrics have been proposed to measure the agreement between the predicted saliency map and the ground truth produced by human eye movements. By investigating commonly used metrics, Bylinskii et al. [46] found that, under general assumptions, the Linear Correlation Coefficient (CC) and Normalized Scanpath Saliency (NSS) metrics could be used as representative metrics for benchmarking saliency models. More importantly, they also suggested that different evaluation metrics should be used for different applications. For example, metrics that are more appropriate for evaluating the capability of salient object detection may not necessarily be useful for evaluating saliency prediction in other vision applications [46,47]. Through a large-scale subjective experiment, Li et al. [48] found that only a limited number of evaluation metrics, i.e., NSS, CC, and Similarity (SIM), are in close agreement with human judgements. Similarly, Yang et al. [47] found that CC and SIM are the most in line with human evaluation of saliency maps. Kümmerer et al. [49] also demonstrated that it is difficult for a saliency model to perform equally well on all popular saliency evaluation metrics. They proposed a novel approach that allows a saliency model to generate different "saliency maps" according to the characteristics and behaviours of different metrics; a model that adopts this evaluation method is referred to as a "probabilistic model". In contrast, a saliency model that generates a single saliency map for a given image, without targeting any specific evaluation metric, is referred to as a "classical model". Since our aim is to generate a single saliency map for each image that can faithfully reflect human perception, we evaluate models in the "classical model" framework.
In evaluating models, we apply all commonly used evaluation metrics to quantify model performance, but make a clear distinction between "perception-based metrics (i.e., NSS, CC, and SIM)" and "non-perception-based metrics", as defined in [46]. By doing so, the perceptual relevance of the predicted saliency maps can be appropriately measured.

Transformer in visual tasks
The transformer was first introduced for natural language processing (NLP) tasks [38]. Because of its powerful long-range dependency modelling capabilities, the transformer has achieved remarkable success in the field of NLP. Consequently, a number of studies in the field of computer vision are also exploring the effectiveness of transformers.
The vision transformer is one of the first pure transformer architectures for image processing, which uses a vanilla version of the transformer to form a network that achieves performance comparable to that of state-of-the-art CNN-based models. After this work, several models, such as DeepViT [50] and Swin Transformer [51], have achieved further success in visual tasks by using the transformer.
Currently, the transformer has also demonstrated excellent performance in the field of salient object detection [52], which is related to the current work, even though it is a substantially different task [53]. Salient object detection aims to segment salient objects from an image and generate a binary map [54]. However, in visual saliency prediction, the aim is to predict the density map of human fixations (i.e., the spatial deployment of visual attention).
In summary, the previous studies have shown the powerful representation capabilities of the transformer, particularly for capturing long-range information, which could have potential contributions to predicting gaze. However, the use of transformers in visual saliency prediction has not been fully explored until now. In this paper, we will investigate the benefits as well as application of transformer components in saliency prediction.

Multi-scale and long-range information in visual saliency prediction
By using multi-scale image representations to simulate different perceptual scales, successful results have been achieved in vision tasks such as image segmentation [55], human pose estimation [56], and salient object detection [57]. In the field of visual saliency prediction, Huang et al. [35] and Fan et al. [22] proposed CNN-based models that extract multi-scale features from images of different resolutions separately and concatenate the results to obtain salient semantic objects with different granularities, thereby optimising saliency prediction. In order to obtain multi-scale contextual information, Deep Visual Attention (DVA) [17] constructs three decoders of different granularities to generate multi-scale saliency estimates for saliency prediction. EML-NET [20] also uses multi-scale feature maps from encoder networks to obtain holistic scene features for saliency prediction. MSI-Net [19] adopts convolutional layers with different dilation rates to augment multi-scale information for saliency prediction. GazeGAN [36] is a generative adversarial network for saliency prediction whose generator is a modified U-Net that incorporates multi-scale information through skip-connections. UNISAL [21] adopts skip-connections to provide the decoder network with multi-scale features. These studies have demonstrated that multi-scale information is beneficial to visual saliency prediction.
Similarly to other vision tasks [52,58], visual saliency prediction has also benefitted from neural networks with long-range modelling capabilities to simulate the spatial attentional mechanisms. DSCLSTM [37] first extracts local feature maps by using CNNs, and then incorporates non-local scene contexts into the local feature maps by using LSTM-based components to predict human eye fixation points in natural scenes. Cornia et al. [18] developed visual saliency models that integrate an LSTM module into the CNN-based network to simulate explicit properties of the human attention mechanism. Similarly, Fang et al. [59] used LSTM to obtain pseudo-sequential information to simulate the human visual attention shift. These studies suggest that modelling the dependencies between spatial locations can refine saliency prediction models.
Figure 2: Schematic overview of TranSalNet. Assume that the spatial size of the input is $w \times h$. The input image is first processed by the CNN encoder, which provides three sets of multi-scale feature maps with spatial sizes of $\frac{w}{8} \times \frac{h}{8}$, $\frac{w}{16} \times \frac{h}{16}$, and $\frac{w}{32} \times \frac{h}{32}$, respectively. The contextual information of these feature maps is then enhanced by transformer encoders. The predicted saliency map is generated by the CNN decoder, which uses skip-connections (orange arrows) and element-wise products to fuse the multi-scale context-enhanced feature maps. The transformer encoder, which consists of standard Multi-head Self-Attention (MSA) and Multi-layer Perceptron (MLP) blocks, is illustrated below the architecture diagram.
In this paper, we combine these two strategies. More specifically, we integrate transformer encoders into a CNN-based architecture to provide multi-scale image representations with enhanced long-range contextual information, resulting in perceptually more relevant visual saliency prediction.

The proposed model
The schematic overview of our proposed TranSalNet model is shown in Figure 2. First, a given image is fed into a CNN encoder. In order to obtain multi-scale image representations, three sets of feature maps with different spatial sizes are extracted from the CNN encoder. Due to the inherent inductive biases of CNN encoder architectures, the extracted image representations lack long-range contextual information, which potentially makes a saliency model less human-like (note that the human visual system is proficient in capturing both local and long-range visual information). Therefore, to obtain perceptually more relevant visual saliency prediction, these feature maps are passed through three transformer encoders, yielding long-range context-enhanced feature maps. The CNN decoder then fuses these feature maps for saliency prediction.

The CNN encoder
Previous research has shown that the use of CNN-based networks to extract features for saliency prediction is effective. Likewise, we used a CNN encoder as the feature extractor in this study.
The CNN models used in this study were initially constructed for image classification. In order to provide image feature maps to the downstream networks, the fully connected layer at the end of these CNNs is removed to form a viable CNN encoder. We extract three sets of feature maps with different spatial sizes from the CNN encoder. Given an input image of size $w \times h \times 3$, the spatial dimensions of the extracted feature maps are $\frac{w}{8} \times \frac{h}{8}$, $\frac{w}{16} \times \frac{h}{16}$, and $\frac{w}{32} \times \frac{h}{32}$, respectively. In this study, two feature extraction networks are adopted to construct two versions of the TranSalNet model. One version uses ResNet-50 [27] as the encoder, which is a feature extraction network widely used in saliency prediction. This version of the model is referred to as TranSalNet_Res. The CNN body of ResNet-50 is composed of five convolutional blocks, denoted as conv1 and conv2_x to conv5_x. We extract feature maps from the deeper conv3_x, conv4_x, and conv5_x blocks. However, [20] suggests that ResNet-50 by itself is probably a relatively "shallow" encoder. Therefore, we use DenseNet-161 [28], which has higher performance on the ImageNet benchmark, as the CNN encoder to build another version, referred to as TranSalNet_Dense. DenseNet-161 mainly consists of four "Dense Blocks", denoted as DenseBlock 1 to 4. We extract feature maps from the deeper DenseBlock 2, DenseBlock 3, and DenseBlock 4.
Although previous work [35,16,17,19,36] showed that adopting multi-scale feature maps is beneficial to saliency prediction, our experiments found that using feature maps from shallower network blocks, i.e. the conv1 and conv2_x, may cause undesired artefacts to appear in the saliency maps. Therefore, we exclude feature maps from the shallower network blocks.
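As a quick check of the spatial sizes stated for the CNN encoder, the following sketch computes the three feature-map sizes for a given input size. The helper name `feature_map_sizes` is hypothetical, and the 384×288 input is taken from the setup described later in the paper; this is an illustration of the arithmetic only, not the released implementation.

```python
def feature_map_sizes(width, height, strides=(8, 16, 32)):
    """Return the (width, height) of the feature map at each encoder stride."""
    sizes = []
    for stride in strides:
        # The text assumes sizes of w/8, w/16 and w/32, so the input
        # must be divisible by every stride.
        if width % stride or height % stride:
            raise ValueError(f"{width}x{height} is not divisible by {stride}")
        sizes.append((width // stride, height // stride))
    return sizes


# 384 x 288 is the input size used in the experimental setup.
print(feature_map_sizes(384, 288))  # [(48, 36), (24, 18), (12, 9)]
```

For a 384×288 input, the three transformer encoders would therefore see 12×9, 24×18, and 48×36 spatial positions, respectively.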

The transformer encoder
The three sets of multi-scale feature maps are fed into three separate transformer encoders to enhance their long-range contextual information. The details of the transformer encoder are depicted at the bottom of Figure 2. Let $x_1$, $x_2$, and $x_3$ be the feature maps with spatial dimensions of $\frac{w}{32} \times \frac{h}{32}$, $\frac{w}{16} \times \frac{h}{16}$, and $\frac{w}{8} \times \frac{h}{8}$, respectively. First, a $1 \times 1$ convolution layer ($\mathrm{Conv}_{1\times1}$) is used to reduce the computational cost and to match the input size accepted by the transformer encoder. More specifically, both $x_1$ and $x_2$ are reduced to 768 dimensions, and $x_3$ to 512 dimensions. As the feature maps carry no relative or absolute position information, a position embedding (POS) is needed to enable position-awareness. Therefore, the absolute POS [40] is applied before the input is fed into the transformer encoders: a learnable matrix with the same shape as the input is added element-wise to the input. Each transformer encoder contains two identical layers, each consisting of standard Multi-head Self-Attention (MSA) and Multi-layer Perceptron (MLP) blocks [40]. In our model, we apply 12-head attention in transformer encoders 1 and 2, and 8-head attention in encoder 3. The MLP block contains two layers with a GELU activation function. In addition, Layer Normalization (LN) is applied before each block and a residual connection after each block. The processing in each transformer encoder can be represented as:
$$z_0 = \mathrm{Conv}_{1\times1}(x_i) + \mathrm{POS}$$
$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}$$
$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l$$
where $z_l$ is the output feature maps of the $l$-th layer in the transformer encoder, $z'_l$ is the intermediate output of the MSA block in that layer, and $x_i$ is the input feature maps from the CNN encoder. The feature maps that are passed through transformer encoders 1, 2, and 3 are context-enhanced and denoted as $x^c_1$, $x^c_2$, and $x^c_3$, respectively.
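The layer structure described above (position embedding added to the input, then layers in which LN precedes each block and a residual connection follows it) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: attention is single-head with identity query/key/value projections instead of the 12-head or 8-head MSA, LN has no learnable parameters, the MLP has no biases, and all function names and weights are hypothetical.

```python
import math


def layer_norm(tokens, eps=1e-6):
    """Normalise each token (a list of floats) to zero mean and unit variance."""
    out = []
    for t in tokens:
        mean = sum(t) / len(t)
        var = sum((v - mean) ** 2 for v in t) / len(t)
        out.append([(v - mean) / math.sqrt(var + eps) for v in t])
    return out


def self_attention(tokens):
    """Single-head scaled dot-product attention (identity Q/K/V: a simplification).

    Every output token is a weighted sum over all tokens, which is what
    gives the encoder access to long-range context.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax over all positions
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mlp(tokens, w1, w2):
    """Two linear layers with a GELU in between (no biases, for brevity)."""
    out = []
    for t in tokens:
        hidden = [gelu(sum(t[i] * w1[i][j] for i in range(len(t))))
                  for j in range(len(w1[0]))]
        out.append([sum(hidden[j] * w2[j][k] for j in range(len(hidden)))
                    for k in range(len(w2[0]))])
    return out


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def transformer_encoder(x, pos, layers):
    """x: tokens after the 1x1 convolution; pos: position embedding of the same shape."""
    z = add(x, pos)
    for w1, w2 in layers:
        z_prime = add(self_attention(layer_norm(z)), z)   # LN -> MSA -> residual
        z = add(mlp(layer_norm(z_prime), w1, w2), z_prime)  # LN -> MLP -> residual
    return z


tokens = [[0.1, 0.2, 0.3], [0.4, 0.1, 0.0], [0.2, 0.2, 0.5], [0.9, 0.3, 0.1]]
pos = [[0.01 * (i + j) for j in range(3)] for i in range(4)]  # stand-in for a learned matrix
w1 = [[0.1] * 6 for _ in range(3)]  # 3 -> 6 hidden units
w2 = [[0.1] * 3 for _ in range(6)]  # 6 -> 3 output units
out = transformer_encoder(tokens, pos, layers=[(w1, w2), (w1, w2)])
print(len(out), len(out[0]))  # 4 3
```

The output keeps the shape of the input, so the context-enhanced tokens can be reshaped back into a feature map for the decoder.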

The CNN decoder
A CNN decoder is used to fuse the long-range context-enhanced feature maps from the transformer encoders and restore the original image resolution. The CNN decoder is a fully convolutional network containing block_1 to block_7, which performs pixel-level classification to predict saliency maps. Batch normalization (BN) and an activation function (ReLU for block_1 to block_6; sigmoid for block_7) are applied after each $3 \times 3$ convolution operation ($\mathrm{Conv}_{3\times3}$), where the former promotes convergence and the latter adds non-linearity to the model. Since the input image is downsampled by a factor of 32 by the encoder network, a 2× upsampling with nearest-neighbour interpolation is applied to the feature map in each of block_1 to block_5 to obtain a saliency map of the same size as the input. In order to enhance the long-range and multi-scale context of the feature map during the decoding process, the upsampled feature map and the transformer's output from the corresponding skip-connection are fused by an element-wise product operation. The processes from block_1 to block_6 can be expressed as:
$$\hat{x}^f_i = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3\times3}(x^f_i))), \quad i = 1, \dots, 6$$
$$x^f_i = \begin{cases} x^c_1, & i = 1 \\ x^c_i \odot \mathrm{Upsample}(\hat{x}^f_{i-1}), & i = 2, 3 \\ \mathrm{Upsample}(\hat{x}^f_{i-1}), & i = 4, 5, 6 \end{cases} \tag{6}$$
where $x^f_i$ and $\hat{x}^f_i$ are the input and output features of the $i$-th block, and $\odot$ denotes the element-wise product. The output block, i.e., block_7, reduces the dimensionality of the feature maps to a 2D map for pixel-level classification. Therefore, the sigmoid activation function is applied to the feature map: $\hat{y} = \mathrm{sigmoid}(\mathrm{Conv}_{3\times3}(\hat{x}^f_6))$, where $\hat{y}$ is the predicted saliency map.
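The two decoder operations described above, 2× nearest-neighbour upsampling and fusion with a skip-connection by element-wise product, can be illustrated on small 2D maps. The sketch below uses plain Python lists and hypothetical helper names; it also traces how five 2× upsamplings take a map at 1/32 of the input size back to the full resolution (384×288 is the input size from the setup).

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D map given as a list of rows."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def fuse(upsampled, skip):
    """Element-wise product of an upsampled map and a same-sized skip-connection map."""
    return [[u * s for u, s in zip(u_row, s_row)]
            for u_row, s_row in zip(upsampled, skip)]


def decoder_resolutions(width, height, num_upsamples=5):
    """Spatial size after each 2x upsampling, starting from 1/32 of the input."""
    w, h = width // 32, height // 32
    trace = [(w, h)]
    for _ in range(num_upsamples):
        w, h = 2 * w, 2 * h
        trace.append((w, h))
    return trace


coarse = [[1.0, 2.0], [3.0, 4.0]]
skip = [[0.5] * 4 for _ in range(4)]  # stand-in for a transformer output
print(fuse(upsample_nearest_2x(coarse), skip))
# [[0.5, 0.5, 1.0, 1.0], [0.5, 0.5, 1.0, 1.0], [1.5, 1.5, 2.0, 2.0], [1.5, 1.5, 2.0, 2.0]]
print(decoder_resolutions(384, 288))
# [(12, 9), (24, 18), (48, 36), (96, 72), (192, 144), (384, 288)]
```

Because the product is element-wise, the skip-connection acts as a multiplicative gate on the upsampled features rather than as additional channels.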

Loss function
Recent saliency prediction studies [18,20,36] have shown that taking advantage of the saliency evaluation metrics to define the loss function can significantly improve the performance of saliency prediction models.
Following a similar idea, we adopt a linear combination of four metrics as the loss function to train our model: the Normalized Scanpath Saliency (NSS), Kullback-Leibler divergence (KLD), Linear Correlation Coefficient (CC), and Similarity (SIM). Let $y^s$, $y^f$, and $\hat{y}$ be the ground truth saliency map, fixation map, and predicted saliency map, respectively, and let $i$ index the pixels of $y^s$ and $\hat{y}$. Our loss function is defined as:
$$L = \lambda_1 L_{NSS}(y^f, \hat{y}) + \lambda_2 L_{KLD}(y^s, \hat{y}) + \lambda_3 L_{CC}(y^s, \hat{y}) + \lambda_4 L_{SIM}(y^s, \hat{y})$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the weights of each metric, and
$$L_{NSS}(y^f, \hat{y}) = \frac{1}{\sum_i y^f_i} \sum_i \frac{\hat{y}_i - \mu(\hat{y})}{\sigma(\hat{y})} \, y^f_i$$
where $\sigma(\cdot)$ and $\mu(\cdot)$ stand for standard deviation and mean, respectively;
$$L_{KLD}(y^s, \hat{y}) = \sum_i y^s_i \log\left(\epsilon + \frac{y^s_i}{\epsilon + \hat{y}_i}\right)$$
where $\epsilon$ is a regularization constant set to $2.2204 \times 10^{-16}$;
$$L_{CC}(y^s, \hat{y}) = \frac{\mathrm{cov}(y^s, \hat{y})}{\sigma(y^s)\,\sigma(\hat{y})}$$
where $\mathrm{cov}(\cdot)$ is the covariance and $\sigma(\cdot)$ is the standard deviation;
$$L_{SIM}(y^s, \hat{y}) = \sum_i \min(y^s_i, \hat{y}_i)$$
In $L_{KLD}$, $L_{CC}$, and $L_{SIM}$, $y^s$ and $\hat{y}$ are normalized so that $\sum_i y^s_i = \sum_i \hat{y}_i = 1$. Since higher NSS, SIM, and CC values and a lower KLD value represent better agreement between predicted saliency maps and ground truth, we set $\lambda_1$, $\lambda_3$, and $\lambda_4$ to negative values and $\lambda_2$ to a positive value. In order to balance the impact of the different sub-loss functions on the model outcome, we determine the weights of the individual sub-loss functions based on TranSalNet's performance on the SALICON validation set. In our experiments, the weights are adjusted to ensure that these sub-loss functions (note that their output ranges differ) contribute relatively equally to the model outcome. This is achieved by training and validating TranSalNet on the SALICON training and validation sets with a single sub-loss function at a time. According to the recorded minimal loss values on the validation set, weights are initially assigned to the sub-loss functions so that their contributions to the combined loss are relatively equal. In a second step, these weights in the combined loss are further adjusted to achieve balanced results on all evaluation metrics.
As per our empirical studies, the default weights λ 1 , λ 2 , λ 3 , and λ 4 of the combined loss function are set to −1, 10, −2, and −1, respectively.
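To make the combined loss concrete, the sketch below evaluates the four sub-losses on flattened maps and combines them with the default weights −1, 10, −2, and −1. It relies on the commonly used definitions of NSS, KLD, CC, and SIM, which is an assumption wherever the text does not spell out a formula; the function names, the use of the population standard deviation, and the toy inputs are likewise illustrative choices rather than the released training code.

```python
import math

EPS = 2.2204e-16  # regularization constant quoted in the text


def _mean(values):
    return sum(values) / len(values)


def _std(values):
    # Population standard deviation (an assumption; a sample estimate is also common).
    m = _mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def _normalise(values):
    total = sum(values)
    return [v / total for v in values]


def nss(fixations, pred):
    """Mean standardised prediction value at fixated pixels (higher is better)."""
    m, s = _mean(pred), _std(pred)
    return sum((p - m) / s * f for p, f in zip(pred, fixations)) / sum(fixations)


def kld(saliency, pred):
    """Kullback-Leibler divergence between the two distributions (lower is better)."""
    ys, yp = _normalise(saliency), _normalise(pred)
    return sum(s * math.log(EPS + s / (EPS + p)) for s, p in zip(ys, yp))


def cc(saliency, pred):
    """Pearson linear correlation coefficient (higher is better)."""
    ms, mp = _mean(saliency), _mean(pred)
    cov = sum((s - ms) * (p - mp) for s, p in zip(saliency, pred)) / len(pred)
    return cov / (_std(saliency) * _std(pred))


def sim(saliency, pred):
    """Histogram intersection of the two distributions (higher is better)."""
    ys, yp = _normalise(saliency), _normalise(pred)
    return sum(min(s, p) for s, p in zip(ys, yp))


def combined_loss(saliency, fixations, pred, weights=(-1.0, 10.0, -2.0, -1.0)):
    l1, l2, l3, l4 = weights
    return (l1 * nss(fixations, pred) + l2 * kld(saliency, pred)
            + l3 * cc(saliency, pred) + l4 * sim(saliency, pred))


# Toy 4-pixel example: one fixation on the most salient pixel.
saliency = [0.1, 0.2, 0.3, 0.4]
fixations = [0, 0, 0, 1]
good = [0.1, 0.2, 0.3, 0.4]  # matches the ground truth
bad = [0.4, 0.3, 0.2, 0.1]   # reversed saliency ordering
print(combined_loss(saliency, fixations, good) < combined_loss(saliency, fixations, bad))  # True
```

With negative weights on NSS, CC, and SIM and a positive weight on KLD, minimising the combined value pushes all four metrics in their preferred direction at once.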

Datasets
Four commonly used benchmark saliency datasets are used to train and evaluate our proposed saliency model and variants.

Evaluation Metrics
Various metrics have been proposed to evaluate the agreement between the predicted saliency map and the ground truth. In general, these metrics can be described as location-based or distribution-based metrics depending on how the ground truth is represented [46]; the former adopts the fixation map (i.e., in the form of a binary image) and the latter uses the saliency map (i.e., in the form of a gray-scale image) as the ground truth for visual saliency evaluation. Six popular metrics are widely used to quantify the general performance of saliency models: CC, SIM, KLD, NSS, AUC (Area under ROC Curve), and sAUC (Shuffled AUC). Details of these metrics can be found in [46]. The first three are distribution-based metrics, and the remaining three are location-based metrics. For KLD, the closer the value is to zero, the better the agreement between prediction and ground truth. For the other five metrics, higher values represent higher consistency. In this paper, we aim to evaluate the general performance of our proposed model, while the perceptual relevance of the saliency model remains the focus of our study. To this end, on the basis of the study of [46], we classify the six metrics into two categories based on how closely they agree with human judgements of saliency maps: "perception-based metrics", which include NSS, CC, and SIM; and "non-perception-based metrics", which include sAUC, AUC, and KLD [46]. Note that "non-perception-based metrics" do not necessarily fail to measure gaze behaviour; they may focus on specific properties of viewing behaviour, such as detecting salient objects in the visual field. It is stated in [46] that "AUC, KL are appropriate for detection applications, as they penalize target detection failures. However, where it is important to evaluate the relative importance of different image regions, such as for image-retargeting, compression, and progressive transmission, metrics like NSS or SIM are a better fit." This provides sufficient grounds for building perceptually more relevant saliency prediction models, which is the primary goal of our work.

Setup
By following a similar procedure in the state-of-the-art [23,35,17,19,18], a model should be first initialised by the weights pre-trained on ImageNet [45], then trained on the 10,000 images of the SALICON training set to reduce the risk of overfitting. Consequently, the best model on its validation set should be selected for further testing on the SALICON test set and training on MIT1003 and CAT2000.
To obtain fair results on each dataset, k-fold cross-validation (k = 10) is applied for each model. More specifically, each dataset is divided into 10 non-overlapping subsets. For MIT1003, each subset contains around 100 images; for CAT2000, each subset contains 200 images (10 from each category). Each time, one subset is kept as a test set, one as a validation set, and the remaining eight subsets together are used as the training set. To eliminate randomness, each test set corresponds to a fixed validation set and training set. We report the overall performance over the 10 test runs.
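The splitting procedure above can be sketched as follows. The text fixes a validation subset for each test subset but does not say which one, so pairing each test fold with the next fold (cyclically) is purely an assumption made here for illustration; the helper name `kfold_splits` is hypothetical, and the interleaved partition ignores the per-category stratification used for CAT2000.

```python
def kfold_splits(items, k=10):
    """Yield k deterministic train/val/test splits from non-overlapping folds."""
    folds = [items[i::k] for i in range(k)]  # simple interleaved partition
    splits = []
    for test_idx in range(k):
        val_idx = (test_idx + 1) % k  # assumed pairing; fixed, so splits are reproducible
        train = [x for j, fold in enumerate(folds)
                 if j not in (test_idx, val_idx) for x in fold]
        splits.append({"train": train, "val": folds[val_idx], "test": folds[test_idx]})
    return splits


# 1003 stand-in image indices, matching the size of MIT1003.
splits = kfold_splits(list(range(1003)))
first = splits[0]
print(len(splits), len(first["train"]), len(first["val"]), len(first["test"]))  # 10 801 101 101
```

Every image appears in exactly one test fold, so the 10 test runs together cover the whole dataset.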
To reduce the computational cost while matching the aspect ratio (4:3) of the images in SALICON, all input images are resized and padded to the same size of 384×288 pixels. A consistent standard is followed in all training phases. The Adam optimizer [64] is used to minimize the loss function. The learning rate is set to $1 \times 10^{-5}$ and is multiplied by 0.1 every 3 epochs. Models are trained with a batch size of 4 for 30 epochs, with an early-stopping patience of 5 epochs.
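Two pieces of this setup lend themselves to a small worked example: fitting an arbitrary image into 384×288 by resizing and padding, and the step-wise learning-rate decay. The sketch below is a hedged illustration; preserving the aspect ratio and padding symmetrically are assumptions (the text only says "resized and padded"), and both helper names are hypothetical.

```python
def resize_and_pad_geometry(width, height, target_w=384, target_h=288):
    """Return the resized size and (left, top, right, bottom) padding.

    Assumes the aspect ratio is preserved and the padding is split evenly.
    """
    scale = min(target_w / width, target_h / height)
    new_w = min(target_w, round(width * scale))
    new_h = min(target_h, round(height * scale))
    pad_w, pad_h = target_w - new_w, target_h - new_h
    left, top = pad_w // 2, pad_h // 2
    return {"size": (new_w, new_h), "padding": (left, top, pad_w - left, pad_h - top)}


def learning_rate(epoch, base_lr=1e-5, gamma=0.1, step=3):
    """Learning rate at a zero-based epoch: multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


print(resize_and_pad_geometry(640, 480))  # {'size': (384, 288), 'padding': (0, 0, 0, 0)}
print(resize_and_pad_geometry(400, 400))  # {'size': (288, 288), 'padding': (48, 0, 48, 0)}
```

A 4:3 image such as a 640×480 SALICON frame needs no padding, whereas a square image is scaled to 288×288 and padded by 48 pixels on the left and right.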

Ablation study
Ablation experiments are conducted to investigate the contribution of three key components in our modelling: (1) transformer encoders ($E_1$, $E_2$, and $E_3$ denote transformer encoders 1, 2, and 3 in Figure 2, respectively), (2) skip-connections (SC), and (3) the combined loss function ($L_{CB}$). To this end, nine model variants are constructed to demonstrate the added value of one or more of the above components, as shown in Table 1.
Among them, BaseNet is constructed as a baseline that adopts the widely used ResNet-50 as the CNN encoder, removes all transformer encoders and skip-connections except for the $\mathrm{Conv}_{1\times1}$ layer before transformer encoder 1, and is trained with the BCE loss function. BaseNet+ adds transformer encoder 1 to BaseNet. SkipNet adds skip-connections to BaseNet. TranSalNet_Res_BCE adds transformer encoders 1, 2, and 3 to SkipNet; it utilises ResNet-50 as the CNN encoder and is identical in architecture to the proposed TranSalNet (shown in Figure 2), but is trained with the BCE loss. The model variants that share the architectures of the above four variants but are trained with the combined loss are denoted as BaseNet($L_{CB}$), BaseNet+($L_{CB}$), SkipNet($L_{CB}$), and TranSalNet_Res, respectively. TranSalNet_Dense replaces ResNet-50 with DenseNet-161 as the CNN encoder. The overall performance of these model variants on the MIT1003 and CAT2000 datasets is shown in Table 2. Saliency maps for four images from these two datasets are illustrated in Figure 3.
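The nine variants described above can be summarised as a small configuration table. The sketch below is one reading of the prose description, not a reproduction of Table 1; the `Variant` class and its field names are hypothetical, and the assumption that TranSalNet_Dense is trained with the combined loss follows from it being presented as a version of the proposed model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    backbone: str
    transformer_encoders: tuple  # which of encoders 1, 2, 3 are present
    skip_connections: bool
    loss: str                    # "BCE" or "CB" (combined loss)


VARIANTS = {
    "BaseNet": Variant("ResNet-50", (), False, "BCE"),
    "BaseNet+": Variant("ResNet-50", (1,), False, "BCE"),
    "SkipNet": Variant("ResNet-50", (), True, "BCE"),
    "TranSalNet_Res_BCE": Variant("ResNet-50", (1, 2, 3), True, "BCE"),
    "BaseNet(L_CB)": Variant("ResNet-50", (), False, "CB"),
    "BaseNet+(L_CB)": Variant("ResNet-50", (1,), False, "CB"),
    "SkipNet(L_CB)": Variant("ResNet-50", (), True, "CB"),
    "TranSalNet_Res": Variant("ResNet-50", (1, 2, 3), True, "CB"),
    "TranSalNet_Dense": Variant("DenseNet-161", (1, 2, 3), True, "CB"),  # loss assumed
}

# Pairs that differ only in the loss isolate the effect of the combined loss.
loss_pairs = [(a, b) for a in VARIANTS for b in VARIANTS
              if VARIANTS[a].loss == "BCE" and VARIANTS[b].loss == "CB"
              and VARIANTS[a].backbone == VARIANTS[b].backbone
              and VARIANTS[a].transformer_encoders == VARIANTS[b].transformer_encoders
              and VARIANTS[a].skip_connections == VARIANTS[b].skip_connections]
print(len(VARIANTS), len(loss_pairs))  # 9 4
```

Laying the variants out this way makes it easy to see which pairs differ in exactly one component, which is the comparison each ablation relies on.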
By comparing BaseNet/BaseNet($L_{CB}$) and BaseNet+/BaseNet+($L_{CB}$), it can be found that adding a transformer encoder improves the overall performance, i.e., BaseNet+/BaseNet+($L_{CB}$) outperforms BaseNet/BaseNet($L_{CB}$) in the majority of instances. In particular, on the perception-based metrics, i.e., CC, SIM, and NSS, BaseNet+/BaseNet+($L_{CB}$) give consistently better performance than BaseNet/BaseNet($L_{CB}$), suggesting that the transformer encoder contributes to the perceptual relevance of saliency prediction. In addition, the benefits of enhancing saliency prediction by providing multi-scale image representations through skip-connections have been demonstrated in previous studies. Similarly, by adding skip-connections to BaseNet/BaseNet($L_{CB}$), the performance of the model variants SkipNet/SkipNet($L_{CB}$) improves in most instances in the ablation study.
By uniting transformer encoders and skip-connections, the decoder network can obtain multi-scale feature maps with long-range context enhanced by transformer encoders. As a result, the performance of TranSalNet_Res_BCE/TranSalNet_Res is further boosted on all instances of perception-based metrics as well as most instances of non-perception-based metrics. This provides additional evidence that the transformer is of added value for visual saliency prediction. It also demonstrates the effectiveness of the TranSalNet architecture, which integrates transformer encoders into CNN-based models via skip-connections to obtain multi-scale representations with enhanced long-range visual information. Table 2 also demonstrates the practical plausibility of training the proposed model with the linear combination of sub-loss functions. Compared with the model variants trained with $L_{BCE}$ (i.e., BaseNet, BaseNet+, SkipNet, and TranSalNet_Res_BCE), the model variants trained with $L_{CB}$ (i.e., BaseNet($L_{CB}$), BaseNet+($L_{CB}$), SkipNet($L_{CB}$), and TranSalNet_Res) achieve higher performance on the majority of saliency metrics. In particular, TranSalNet_Res outperforms TranSalNet_Res_BCE on all instances in the ablation study. In summary, this ablation study demonstrates the effectiveness of the transformer encoder, the TranSalNet architecture, and the combined loss function.
In addition, previous research [20,24] has shown that using backbones with greater representational capability could improve saliency prediction. Similarly, simply replacing the widely used but comparatively "shallow" ResNet-50 backbone (used by TranSalNet_Res) with DenseNet-161 [28] yields TranSalNet_Dense, which further improves performance, as shown in Table 2.

Figure 3: Comparison of the saliency prediction performance of nine model variants in our ablation study. The images in the top two rows are from the MIT1003 dataset and those in the bottom two rows are from the CAT2000 dataset. It can be seen that by adopting the transformer encoder, skip-connections to provide multi-scale information, and the combined loss function, the generated saliency maps are significantly refined relative to the ground truth.

Table 3: Performance comparison of state-of-the-art saliency models on MIT1003 and CAT2000, reported separately for perception-based and non-perception-based metrics on each dataset. Red and orange font indicate the best and 2nd best performance, respectively.

On MIT1003 and CAT2000 datasets
Seven state-of-the-art deep learning-based saliency models that adopt multi-scale representations or attention mechanisms, including FastSal [65], UNISAL [21], MSI-Net [19], SAM-VGG [18], SAM-ResNet [18], ML-Net [16], and Deep Visual Attention (DVA) [17], are selected for the general performance comparison on the MIT1003 and CAT2000 datasets. To ensure a fair comparison, the same k-fold cross-validation strategy (k = 10 for MIT1003 and CAT2000) and the same dataset splitting method used for TranSalNet are employed for fine-tuning and testing these models. The corresponding weights pre-trained on the SALICON dataset are loaded for each fine-tuning instance. For the MIT1003 and CAT2000 datasets, the overall performance over the 10 test folds is reported in Table 3.
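The k-fold protocol described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the seed, the placeholder file names, and the interleaved fold assignment are assumptions and do not reproduce the exact splits used in our experiments.

```python
import random

def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists for each of k folds over a shuffled copy of items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th element starting at index i,
    # so fold sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# MIT1003 contains 1003 images; the file names below are placeholders.
images = [f"img_{n:04d}.jpg" for n in range(1003)]
splits = list(k_fold_splits(images, k=10))
```

Each image appears in exactly one test fold, so averaging a metric over the 10 folds uses every image once for testing while the model being tested has never been fine-tuned on it.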
It can be seen that our models (both TranSalNet_Res and TranSalNet_Dense) achieve the best performance on all perception-based metrics on both MIT1003 and CAT2000, while producing competitive results on non-perception-based metrics (i.e., being best or 2nd best in most instances in the comparative study). It should be noted that our TranSalNet_Res and the five state-of-the-art models all use ResNet-50 or VGGNet (representing similar network capacity) as the feature extraction network. TranSalNet_Res achieves the best performance on most instances (except for sAUC and KLD on CAT2000), implying the contribution of enhanced long-range information to saliency prediction using transformers. Moreover, the performance of our TranSalNet_Res could be further enhanced by replacing ResNet-50 with a network of higher capacity, namely DenseNet-161. Figure 4 shows saliency maps generated by our models and other models for images including common contexts such as objects, portraits, natural, indoor, social, and cartoon scenes. Visual assessment of these saliency maps suggests that our models are in closer agreement with the ground truth than the other models.

Figure 4: Comparison of saliency maps generated by our models (TranSalNet_Res and TranSalNet_Dense) and other state-of-the-art saliency models. Images (a) to (d) are from the MIT1003 dataset, and images (e) to (h) are from the CAT2000 dataset.

On MIT300 competition
For the MIT300 competition, we use MIT1003 to train an optimal model: 703 images are randomly selected as the training set and the remaining images serve as the validation set. The optimal model is submitted to and tested by the MIT/Tuebingen Saliency Benchmark [63]. It should be noted that the benchmark evaluates models by different standards, i.e., models must be explicitly declared as either probabilistic or non-probabilistic, so that they can be fairly evaluated within the category they belong to [46]. In this paper, as in the original MIT Saliency Benchmark [46], we "do not assume that our model is probabilistic". Note that for evaluating probabilistic models, metric-specific adaptations are applied using regularization and scaling of saliency values; hence, a probabilistic model generates an optimal saliency map for each individual metric [49], whereas a non-probabilistic model outputs only a single saliency map for all metrics. It is therefore nontrivial to compare a non-probabilistic (i.e., classical) model to a probabilistic model [46]. To avoid unfair model comparison under different assumptions, Table 4 shows only the non-probabilistic classical saliency models on the leaderboard of [63]. It can be seen that our models (both TranSalNet_Res and TranSalNet_Dense) consistently rank 1st or 2nd on the perception-based metrics (the only exception is TranSalNet_Res on NSS, but its score is comparable to the 1st and 2nd scores, as shown in Table 4). On the non-perception-based metrics, our models exhibit competitive performance on sAUC and AUC, with scores comparable to the results in the 1st and 2nd positions. In addition, even when top probabilistic models such as DeepGaze II-E [24], MSI-Net [19], UNISAL [21], SalFBNet [67], and DeepGaze II [23] are included in the performance comparison, our models remain competitive on perception-based metrics (results available on the website of [63]).

On LSUN'17 competition
Although our aim is to predict the spatial distribution of human fixations, the human attention measured by mouse tracking can still reflect eye movement behaviour to a certain extent [29]. SALICON provides the largest-scale saliency dataset to date (via mouse tracking), which offers the opportunity to examine saliency models from the perspective of being "data rich". Moreover, for the LSUN'17 competition (on the SALICON test set), a unified evaluation process is adopted, i.e., the saliency models are not treated differently according to whether they are probabilistic or classical. In the competition, each submitted model should generate one single saliency map for each image. Therefore, in order to provide a complementary comparison of state-of-the-art saliency models, Table 5 reports the results of models submitted to the competition based on the 2017 version (i.e., the latest version). It can be seen that our TranSalNet_Res and TranSalNet_Dense achieve superior performance on the perception-based metrics and promising results on the non-perception-based metrics. This shows that our models are competitive on the LSUN 2017 leaderboard, in particular for predicting saliency in a perceptually relevant manner.

Discussion
It is crucial to note that metric selection for saliency model evaluation should be based on specific modelling assumptions and specific target applications [46]. The study in [46] concludes that "under the assumptions of non-probabilistic modelling, NSS and CC provide the fairest comparison"; "if evaluating probabilistic models, KLD is recommended"; and "specific tasks and applications also call for a different choice of metrics". In [48], researchers have verified that "NSS, CC and SIM best correspond to human perception". A recent study [47] found that CC and SIM are the most appropriate saliency evaluation metrics for image quality assessment applications. Therefore, as the results in Table 3, Table 4, and Table 5 demonstrate, the proposed saliency models (TranSalNet_Res and TranSalNet_Dense) could be the best "human-like" models (i.e., based on the perception-based metrics CC and SIM) for evaluating the relative importance of different image regions in applications such as image re-targeting, image compression and transmission, and visual quality assessment.
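For readers less familiar with the perception-based metrics discussed above, the following minimal sketch computes CC, SIM, and NSS on flattened maps using their commonly used definitions: Pearson correlation, histogram intersection of the sum-normalised maps, and the mean standardised saliency at fixated locations. It is a plain-Python illustration with toy values, not the evaluation code of any benchmark; details such as the use of the population standard deviation are assumptions.

```python
import math

def _mean(xs):
    return sum(xs) / len(xs)

def _std(xs):
    # Population standard deviation (an assumption; some toolkits use the sample form).
    m = _mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def cc(pred, gt):
    """Pearson linear correlation between predicted and ground-truth saliency maps."""
    mp, mg = _mean(pred), _mean(gt)
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gt)) / len(pred)
    return cov / (_std(pred) * _std(gt))

def sim(pred, gt):
    """Histogram intersection after normalising each map to sum to one."""
    sp, sg = sum(pred), sum(gt)
    return sum(min(p / sp, g / sg) for p, g in zip(pred, gt))

def nss(pred, fixations):
    """Mean of the standardised prediction at fixated pixels (fixations is a 0/1 map)."""
    m, s = _mean(pred), _std(pred)
    return _mean([(p - m) / s for p, f in zip(pred, fixations) if f])

# Toy flattened 2x2 maps; the values are made up for illustration.
pred = [0.1, 0.2, 0.6, 0.1]
gt = [0.1, 0.2, 0.6, 0.1]
fixations = [0, 0, 1, 0]
scores = (cc(pred, gt), sim(pred, gt), nss(pred, fixations))
```

CC and SIM compare the prediction against the continuous fixation density map, whereas NSS uses the discrete fixation locations; a prediction identical to the ground truth reaches the maximum of 1 for both CC and SIM.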
Using skip-connections to provide multi-scale features from encoder to decoder has been shown in previous studies to be an effective method for computer vision tasks. For example, the widely used U-Net [55] style networks usually connect feature maps of each spatial size to the decoders from shallow to deep encoder blocks. However, as can be seen in Figure 5, using skip-connections to connect shallow encoder blocks (i.e., the blocks that provide feature maps with spatial sizes of w/4 × h/4 and w/2 × h/2) with decoder blocks (i.e., block_4 and block_5 in the decoder) may cause shapes of objects and text to appear in the predicted saliency maps, which is not consistent with the ground truth. This implies that adding low-level features from the encoder directly to the decoder may interfere with the saliency prediction of TranSalNet.

Figure 5: The right-hand column illustrates the saliency maps with undesired artefacts caused by adding skip-connections to TranSalNet_Res to connect shallow encoder blocks with decoder block_4 and block_5. From left to right, the remaining three columns are: stimuli, ground truth saliency maps, and saliency maps generated from TranSalNet_Res, respectively.

Multi-head Self Attention (MSA) is part of the transformer encoder. Previous research has shown that the number of heads of MSA could affect the model's performance [68]. Following the suggestions from [46], we use CC and SIM as the performance metrics to illustrate the impact of the head number of MSA on our proposed TranSalNet in Figure 6. For each head number combination, the model is trained on the SALICON training set, validated with 2000 images of its validation set, and tested on the rest of the validation set three times. The reported results are the means over these three runs. As can be seen in Figure 6, the CC and SIM scores tend to increase with the number of MSA heads. However, when transformer encoders 1 and 2 (E_1 and E_2) adopt 12 heads each and transformer encoder 3 (E_3) adopts 8 heads, the performance of the model tends to saturate in the CC-SIM performance space. Therefore, considering the trade-off between computational resource consumption and model performance, we choose 12 heads for transformer encoders 1 and 2, and 8 heads for transformer encoder 3 in this study.
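The resource side of this trade-off can be illustrated with a short sketch. In standard multi-head attention the embedding is split evenly across heads, and each head keeps its own token-by-token attention matrix, so the number of attention weights grows linearly with the head count. The head counts below follow the text; the embedding widths and token counts are hypothetical values chosen only for illustration and are not stated in this section.

```python
def head_dim(embed_dim, num_heads):
    """Per-head width when an embedding is split evenly across heads."""
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    return embed_dim // num_heads

def attention_map_entries(num_tokens, num_heads):
    """Each head stores one num_tokens x num_tokens attention matrix."""
    return num_heads * num_tokens * num_tokens

# (embed_dim, num_heads, num_tokens); only num_heads comes from the text,
# the other two values per encoder are assumptions.
config = {
    "E_1": (768, 12, 100),
    "E_2": (768, 12, 400),
    "E_3": (512, 8, 1600),
}

summary = {
    name: (head_dim(dim, heads), attention_map_entries(tokens, heads))
    for name, (dim, heads, tokens) in config.items()
}
```

Under these assumed sizes, doubling the number of heads in an encoder doubles its attention-map storage while halving the per-head width, which is one reason to stop increasing heads once CC and SIM saturate.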

Conclusion
In this paper, we have proposed a novel saliency model for predicting saliency maps that are perceptually in close agreement with the ground truth. By integrating transformers into CNNs, saliency models can significantly benefit from capturing long-range spatial information at multiple perceptual levels. An ablation study has demonstrated the contributions of the transformer encoders to a CNN model, especially the added value of transformers in enhancing the perceptual relevance of saliency prediction. Experimental results show that the proposed models have achieved superior performance on the public benchmarks and competitions for saliency models, particularly having yielded notable results on perception-based saliency evaluation metrics. The perceptually more relevant saliency models have the potential to advance many image processing applications.

Acknowledgments
This work is funded in part by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - Project-ID 251654672 - TRR 161 and the China Scholarship Council - ID 202008220129.