1 Introduction

In the development of powerful neural network architectures for computer vision, small architectural changes have historically resulted in significant gains in accuracy, such as the introduction of skip connections in ResNet [1], batch normalization [2], or the addition of self-attention mechanisms, like Squeeze-and-Excitation (SE) modules [3]. Due to their simplicity, these changes come with relatively small computational overhead, making them excellent additions to networks used in real-world systems. Attention mechanisms have been gaining traction in the computer vision community in recent years, achieving large performance improvements in classification [3,4,5], object detection [6,7,8] and other visual domains, such as tracking [9]. Impressively, the reported accuracy gains hold over many different architectures and tasks [10, 11]. Many of these works focus on designing increasingly complex attention mechanisms to achieve the best possible accuracy, but this comes with additional computational cost. Instead, we investigate how much of the performance improvement of attention mechanisms can be retained using a module that is as simple and computationally efficient as possible. These research objectives are similar to those of the recently published ECA-Net [11].

In this paper, we focus on attention modules that up- or down-weight entire feature maps based on global feature-map information, often called “feature recalibration”, similar to [3, 8, 10, 11]. Our objective is to either find or design the attention module with the best possible inference speed-to-accuracy trade-off. To this end, we propose a novel attention module called the Attention-Bias (AB) module, which recalibrates feature maps using only a single Hadamard product with learned weights and a single tensor addition per module, making it, to the best of our knowledge, the lowest-cost attention module to date. The architecture of the AB block is shown in Fig. 1; it can be used as a drop-in replacement for any residual block in a ResNet-like architecture, or added after convolutional blocks in other architecture types. Finally, we investigate under which conditions existing attention modules outperform ours, and when to apply attention modules in general.

The contributions of our research are as follows.

  • Optimized re-implementations (code available, see footnote 1) of state-of-the-art attention modules are compared on the same data and under identical conditions, on both accuracy and computational complexity, the latter based on computation time in PyTorch and TensorRT.

  • Extensive experiments are performed to determine under which conditions attention modules provide the best speed-to-performance trade-off.

  • A new attention module is designed that has a smaller computational and memory overhead but still performs similarly to more expensive attention modules in multiple situations.

  • We investigate the potential of attention modules to improve the performance of pretrained networks, after only brief fine-tuning.

Fig. 1 Architecture of our Attention Bias (AB) module, compared with more expensive state-of-the-art attention modules. These include, from left to right, Efficient Channel Attention (ECA) [11], the Style-based Recalibration Module (SRM) [10], Squeeze-and-Excitation (SE) [3] and Multi-Squeeze-and-Excitation (Multi-SE) [8]. Abbreviations used: CFC (Hadamard product), FC (fully connected), Conv1D (one-dimensional convolution), BN (batch normalization), W-Sum (weighted sum)

2 Related Work

2.1 Convolutional Neural Networks

Convolutional neural networks have been achieving state-of-the-art performance on computer vision tasks, ever since AlexNet [12] first achieved top performance on the ImageNet [13] classification dataset. The VGG network [14] showed that additional network depth can improve performance further, followed later by ResNet [1] which introduced residual connections to make even deeper networks trainable. Due to their simplicity and despite their age, ResNets are used to this day as a baseline network for comparisons, as a backbone network for tasks such as detection or segmentation, or as a starting point for modifications such as in ResNeXt [15].

2.2 Global Image Information as Means and Variances

Networks trained on ImageNet are known to be highly biased towards style [16], which reduces their performance on unseen data, especially when styles differ. Under the assumption that style is primarily represented by the mean and variance of feature maps [17], it makes sense to optimize classification accuracy by making the network invariant to changes in these values. Both batch normalization [2] and instance normalization [18] do this in some way. IBN-net [19] intelligently combines the two, observing that the impact of style can be reduced effectively by applying instance normalization, which removes the mean and variance, only at the early layers of the network. Instead of compensating for global information with a fixed normalization, Lee et al. design an attention module, the Style-based Recalibration Module (SRM), that lets the network learn style invariance by learning multiplication weights based on feature means and variances [10].

2.3 Attention Mechanisms

Multiple types of attention exist, with channel attention and spatial attention being the most common. Channel attention can be broadly defined as a mechanism by which a network weights feature maps depending on context, to achieve better performance [3, 5, 10, 20]. In contrast, spatial attention focuses the network on specific spatial regions in the input or feature maps [21,22,23]. In this work we limit ourselves to channel attention, because its implementations are generally less computationally expensive. The simplest form of channel attention consists of a network block that performs global pooling of some features, followed by processing layers, and uses the resulting values to weight the original features. Many attention module variants follow this scheme, such as SE [3], SRM [10], ResNeSt [5], GE [20], CBAM [24], AA [25], and ECA [11]. Alternatively, an entire separate network can be used to compute the feature attention weights [26]. In some recent attention block designs, such as GC-net [27] and NA-net [28], the ordering of pooling and processing is inverted, which appears to result in slightly better performance, but at a higher computational cost. ECA-Net [11] has been developed in parallel to our work and also aims at designing a module with an efficient speed-to-performance trade-off. Their “SE-Var2” module is especially similar in design to our proposal.

3 Attention Bias Module

Creating a computationally inexpensive attention module means removing the parts that are the most computationally expensive relative to their performance improvement. The components of the attention blocks in [3, 5, 10, 11, 20] vary, and while these studies contain ablation experiments, those ablations only consider classification performance and not inference speed. Therefore, our objective is to test various attention blocks and measure their relative improvement against their additional computational overhead. The three main components of channel attention modules, “pooling”, “processing” and “integration”, are illustrated in Fig. 1 for our proposed module and four state-of-the-art modules. Each step is detailed in the sections below. As a starting point of our research, we take the design of the Squeeze-and-Excitation (SE) block [3].

3.1 Global Pooling

Global information can be extracted from a feature map in various ways (shown in green in Fig. 1), such as average pooling (“squeeze” in [3]), max pooling, or standard-deviation pooling [10]. More complicated methods may be used, but these are more computationally expensive. Although max pooling is potentially the least computationally expensive, Lee et al. [10] showed that it performs poorly as a global pooling operation, hence we restrict ourselves to modules that use the other two pooling options. The AB module employs only spatial average pooling, which is computed by

$$\begin{aligned} Z = \frac{1}{H \times W} \sum _{i=1}^{H} \sum _{j=1}^W X(i,j), \end{aligned}$$
(1)

where \(X \in \mathbb {R}^{H\times W \times C}\) is the input feature-map tensor, which is reduced to a tensor \(Z\in \mathbb {R}^C\).
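As a minimal sketch (our own illustration, not the authors' code), Eq. (1) amounts to a single mean over the spatial dimensions in PyTorch, assuming the usual (N, C, H, W) tensor layout rather than the \(H\times W \times C\) notation used above:

```python
import torch

# Sketch of Eq. (1): spatial average pooling of a feature map.
# The layout (N, C, H, W) is assumed instead of the H x W x C notation in the text.
X = torch.randn(1, 64, 56, 56)   # example feature map
Z = X.mean(dim=(2, 3))           # shape (N, C): one pooled value per channel
```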

3.2 Processing

The feature-map weight computation of the AB block consists of a single Hadamard product of the tensor Z with learnable weights \(W\in \mathbb {R}^C\), one for each channel, specified by

$$\begin{aligned} Z^{'} = Z \circ W. \end{aligned}$$
(2)

This is a notable simplification compared to the two ReLU-separated fully connected (FC) layers in the SE block. While these FC layers account for only a few FLOPs due to the prior pooling operation and the reduction factor, GPU inefficiencies in computing small inner products give rise to a larger than expected performance overhead for the SE block. Furthermore, the FC layers accumulate relatively many parameters in deeper layers with many channels. Lee et al. [10] perform the same simplification for SRM, by first observing that a single FC layer performs similarly to two layers, and then simplifying further by removing the cross-channel connections. This simplification also results in a single Hadamard product. Removing the cross-channel connections means that attention is now based on only a single channel of the feature tensor at a time. In contrast, Multi-SE [8] expands the cross-channel capacity by learning multiple sets of FC weights. This allows the network to train on multiple, very different datasets simultaneously (e.g. ImageNet and sets of medical images), but it is computationally expensive and may not add value for a dataset with less image variety. Hence, we stay with our proposed approach of using a single Hadamard product for tensor weighting.

3.3 Weight Integration

All commonly used attention blocks combine their computed feature-importance weights with the original features through multiplication [3, 5, 10, 20], after applying a sigmoid to constrain each weight to the unit interval. This essentially means that features can only be reweighted relative to other features, never in an absolute sense. Without the sigmoid, training can become unstable, but applying a sigmoid introduces some computational overhead. Instead, we remove the sigmoid computation and integrate the computed weights with the original features through a summation, which also remains stable during training. This essentially allows the network to learn an input-dependent bias for each feature map instead of a weight. An advantage of summation is that, when the weights of the attention module are initialized to zero, the module can be added to any pretrained network without adverse effects on performance. The weights can then be learned during a very brief fine-tuning phase.
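For concreteness, a minimal PyTorch sketch of the complete AB module is given below, combining the pooling of Eq. (1), the Hadamard product of Eq. (2) and the additive integration described in this section. The class name and broadcasting details are our own illustration; the authors' reference implementation is available via the footnotes.

```python
import torch
import torch.nn as nn

class AttentionBias(nn.Module):
    """Sketch of the AB module: pool, per-channel Hadamard product, add.

    The weights start at zero, so adding the module to a pretrained network
    initially leaves the network output unchanged (see Sect. 3.3).
    """

    def __init__(self, channels: int):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(channels))   # W in Eq. (2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = x.mean(dim=(2, 3))            # Eq. (1): global average pooling
        z = z * self.weight               # Eq. (2): Hadamard product
        return x + z[:, :, None, None]    # additive integration, no sigmoid
```

Because the weights start at zero, the module initially acts as an identity on top of the pretrained features, which is what enables the fine-tuning experiments in Sect. 4.2.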

4 Experiments

In this section we perform an extensive evaluation of the chosen attention modules, both in accuracy and in computational cost. We start by investigating classification networks and extend our experiments to object detection. Computational cost is evaluated both in the number of floating-point operations (FLOPs) and in inference time in two different implementation frameworks. We re-implement all attention modules with more efficient PyTorch operations, identical across all modules for a fair comparison. All reported results are from our own training runs and all timing results are measured on a single RTX 2080 Ti with unity batch size. We limit ourselves to the widely used ResNet architectures [1], since attention modules have already been shown to improve performance for many other architectures [3, 5, 10, 20].

By default, ResNet blocks are implemented differently depending on the network depth. When defining the ResNet, one can use either “basic blocks” or “bottleneck blocks” (see Fig. 5 in [1]). Bottleneck blocks contain four times as many feature channels between blocks than basic blocks, and compress and decompress the channels around their \(3\times 3\) convolutions using \(1\times 1\) convolutions for computational efficiency. Basic blocks use two sets of \(3\times 3\) convolutions without any channel compression. Basic blocks are the default for ResNet-18 and ResNet-34, and bottleneck blocks are the default for ResNet-50 and deeper. Note that the number of channels, which differs between basic and bottleneck blocks, dictates the amount of information available inside attention modules after global pooling. In all experiments, we use the default settings for the attention modules, i.e. a reduction ratio between FC layers of \(r=16\) for SE, a 1D-convolution kernel size of \(k=3\) for ECA and 3 branches for Multi-SE. Attention modules are applied at each residual block (see the sketch below) and “baseline” refers to the network without attention modules. The total number of residual blocks, and thus attention modules, depends on the ResNet depth.
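To illustrate how the modules are attached, the sketch below wraps every residual block of a torchvision ResNet-50 (16 blocks) with an attention module, here the AB module from the sketch in Sect. 3.3. For simplicity the wrapper applies attention to the output of each block, whereas e.g. the original SE design places the recalibration inside the block before the residual addition; the helper names and the way the channel count is read from the block are our own assumptions.

```python
import torch.nn as nn
import torchvision

class BlockWithAttention(nn.Module):
    """Wraps a torchvision ResNet block and applies attention to its output."""

    def __init__(self, block: nn.Module, attention: nn.Module):
        super().__init__()
        self.block, self.attention = block, attention

    def forward(self, x):
        return self.attention(self.block(x))

def add_attention(resnet: nn.Module, make_module) -> nn.Module:
    """Attach one attention module per residual block (e.g. 16 for ResNet-50)."""
    for name in ("layer1", "layer2", "layer3", "layer4"):
        stage = getattr(resnet, name)
        for i, block in enumerate(stage):
            # Output channels: bn3 exists in bottleneck blocks, bn2 in basic blocks.
            out_ch = block.bn3.num_features if hasattr(block, "bn3") else block.bn2.num_features
            stage[i] = BlockWithAttention(block, make_module(out_ch))
    return resnet

model = add_attention(torchvision.models.resnet50(), AttentionBias)
```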

4.1 Classification Accuracy

We perform a general comparison of the classification accuracy of state-of-the-art attention modules. Three different datasets are evaluated: CIFAR-10, CIFAR-100 and ImageNet. All networks are trained from scratch.

4.1.1 CIFAR

We use a typical architecture and training settings for CIFAR. We train a basic-block variant of ResNet-56 for 164 epochs using batches of 128 images, with an initial learning rate of 0.2, which is reduced by a factor of 10 at epochs 81 and 122. This architecture contains 27 basic ResNet blocks and thus also 27 attention modules. The accuracy scores for CIFAR-10 and CIFAR-100 are shown in Table 1. Note that there is no clear “best” attention module, despite the results reported in the cited works.
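A sketch of this training schedule in PyTorch is given below. The momentum and weight-decay values are common defaults and not specified above, and a torchvision ResNet-18 stands in for the CIFAR ResNet-56, which torchvision does not provide.

```python
import torch
import torchvision

# Sketch of the CIFAR schedule: 164 epochs, batch size 128, initial learning
# rate 0.2, divided by 10 at epochs 81 and 122. Momentum and weight decay are
# assumed values.
model = torchvision.models.resnet18(num_classes=10)   # stand-in for ResNet-56
optimizer = torch.optim.SGD(model.parameters(), lr=0.2,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[81, 122], gamma=0.1)

for epoch in range(164):
    # ... one training epoch over the CIFAR-10/100 batches goes here ...
    scheduler.step()
```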

Table 1 Average CIFAR-10 and CIFAR-100 accuracy (%) over 3 runs, using a ResNet-56 and various attention modules

There is a difference in the impact of attention modules depending on whether the ResNet is built from bottleneck or basic blocks, and this difference is larger than the differences between the modules themselves. On CIFAR-100, when using basic blocks, attention modules add only upwards of \(\approx \)1 percentage point to the top-1 accuracy, while for a ResNet architecture using bottleneck blocks, they add upwards of \(\approx \)3 percentage points. The fact that bottleneck blocks perform far better when combined with attention modules is consistent over both CIFAR-10 and CIFAR-100 and occurs for all tested attention modules, although the gains are roughly twice as large for CIFAR-100 as for CIFAR-10. Having access to four times as many feature channels is thus highly beneficial for these modules.

4.1.2 ImageNet

We limit our experimental validation to ResNet-18, with basic blocks, and ResNet-50, with bottleneck blocks, and train them with multiple different attention modules. Again, each block receives a single attention module, which adds up to 8 attention modules for ResNet-18 and 16 for ResNet-50. The parameters and training code are available (see footnote 2), based on the implementation of [10]. The results are shown in Table 2. The performance improvement of adding attention modules is significantly larger for ResNet-50 than for ResNet-18. This observation reinforces the idea that attention modules require sufficient feature channels to function properly. Due to its simplicity, our AB module does not outperform the other attention modules in accuracy, but has a significantly lower computational overhead when measured in execution time.

Table 2 ImageNet classification performance for ResNet-18 and ResNet-50 with different attention modules (from scratch). Inference time is measured for a single image on an RTX 2080 Ti. Best result in bold, close second-best results are underlined. The AB module has the smallest computational overhead, especially in PyTorch, but also provides only a modest performance improvement

4.2 Fine-Tuning Pretrained Models

In practice, it is common to use a pretrained network and fine-tune it for a specific task, since training from scratch is costly in terms of both time and data. In this experiment, we investigate how close the new minimum of the network is to that of the pretrained network when adding newly initialized attention modules. First, we fine-tune an ImageNet-pretrained ResNet-18 with added, newly initialized attention modules using a learning rate of 0.001 (the final learning rate of the pretrained network’s schedule) for up to 30 epochs. Second, under the assumption that the optimization process first needs to escape the current local minimum, we test whether train-from-scratch performance can be reached by fine-tuning for 40 epochs: the first 20 at learning rate 0.01 and the next 20 at learning rate 0.001, to settle in a new local minimum. The results are shown in Table 3.
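The setup can be summarized as in the sketch below, which builds on the AttentionBias and add_attention helpers from the earlier sketches; the momentum value and the torchvision weight identifier are assumptions.

```python
import torch
import torchvision

# Sketch: extend an ImageNet-pretrained ResNet-18 with zero-initialized AB
# modules and fine-tune the whole network (second regime: 20 epochs at 0.01,
# then 20 epochs at 0.001; the first regime uses 0.001 throughout).
model = add_attention(torchvision.models.resnet18(weights="IMAGENET1K_V1"),
                      AttentionBias)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(40):
    if epoch == 20:
        for group in optimizer.param_groups:
            group["lr"] = 0.001   # drop to the final pretraining learning rate
    # ... one ImageNet training epoch goes here ...
```

Because the AB weights start at zero, the extended network reproduces the baseline accuracy before the first update, which explains the initialization scores discussed below.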

Table 3 Top-1 accuracy after fine-tuning an ImageNet-pretrained ResNet-18 with newly initialized attention modules. Baseline refers to the pretrained model without attention modules; the other rows add attention modules to this model. The AB module is the only one that does not deteriorate performance when added to the pretrained model, and it maintains a lead over the other modules during fine-tuning, unless training with large learning rates for many epochs

Regardless of the applied attention module, at the low learning rate the model remains stuck in the local minimum of the pretrained model, even after 30 epochs. However, in the 40-epoch test, the networks significantly outperform the baseline pretrained network. Furthermore, the resulting scores are consistently similar to or higher than their trained-from-scratch equivalents (Table 2, ResNet-18 Top-1). Concluding, when designing novel attention modules, it is beneficial to fine-tune from a pretrained network, which results in lower training time (convergence in 40 vs. 90 epochs) and improved accuracy. We investigate whether this property also holds for the object detection task in Sect. 4.4.

Note that in Table 3, the score of the AB module directly after initialization is much higher than that of the other modules, as explained in Sect. 3.3. Similar functionality can be added to the other attention modules by inserting a learnable scaling layer, with weight zero and bias unity, right after the sigmoid, but this increases their computational complexity further and can impact training stability negatively.

4.3 Computational Complexity

To analyze the computational overhead of adding attention modules, we now consider all commonly used ResNet depths for ImageNet classification: ResNet-18, 34, 50, 101 and 152. Once again, a single attention module is added to each block, which results in the following number of attention modules per depth: 8 for ResNet-18, 16 for ResNet-34, again 16 for ResNet-50 but now with the deeper bottleneck blocks, 33 for ResNet-101 and 50 for ResNet-152.

Fig. 2 Comparison of FLOPs and parameters for several ResNet depths. Regardless of depth, the computational overhead of attention modules appears to be minimal, while the overhead in parameters is only noticeable for bottleneck-based networks with SE-based attention

Attention modules appear extremely efficient when analyzing the number of FLOPs. As can be seen in Fig. 2a, the impact on the total number of FLOPs is negligible compared to using a deeper network architecture, regardless of the attention module used. Similarly, the impact of attention modules on the total number of parameters in a network is small compared to using deeper networks, as shown in Fig. 2b. When comparing the number of parameters, only the SE and Multi-SE modules result in noticeable overhead, especially for deep networks using bottleneck blocks, but this overhead is still smaller than that of using a deeper network. This makes attention modules seem to offer a nearly free performance improvement. However, in practice these networks run on a GPU, which is not efficient at computing the small inner products inside attention modules, especially for batch size unity, hence the overhead in computation time can be significantly larger than the number of FLOPs suggests.
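The parameter overhead can also be computed directly: an SE module with reduction ratio r adds roughly \(2C^2/r\) weights (ignoring biases), whereas an AB module adds only C. The short sketch below applies this to the standard ResNet-50 stage widths from [1]; the resulting totals are approximations for illustration.

```python
# Approximate per-network parameter overhead for ResNet-50 (ignoring biases).
# Stages: (number of bottleneck blocks, output channels per block), as in [1].
stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]
r = 16  # SE reduction ratio

se_params = sum(n * 2 * c * c // r for n, c in stages)   # two FC layers per module
ab_params = sum(n * c for n, c in stages)                # one weight per channel

print(f"SE extra parameters: ~{se_params / 1e6:.2f}M")   # roughly 2.5M
print(f"AB extra parameters: ~{ab_params / 1e6:.3f}M")   # roughly 0.015M
```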

4.3.1 Attention Modules versus Network Depth

To investigate the real impact of attention modules on inference time, we measure the inference time with unity batch size. We perform this experiment for PyTorch and TensorRT. PyTorch is commonly used for experimentation, while TensorRT is a vendor-supplied optimization engine that typically results in the fastest inference. Time is measured without image loading or pre-processing, using a single image from ImageNet (\(224 \times 224\) pixels) on an RTX 2080 Ti at 32-bit precision. We report the lowest time out of 1000 trials, so that outliers caused by other processes cannot interfere with the timing measurement (see the sketch below). For a fair comparison, each attention module has been re-implemented more efficiently with similar operators, typically reducing inference time by 5–30% compared to the publicly available implementations. The results are shown in Fig. 3a and b for PyTorch and TensorRT, respectively. Despite the negligible increase in FLOPs, the attention modules have a comparatively large impact on execution time. For example, ResNet-18 with the Multi-SE module performs inference slower than a baseline ResNet-50 in PyTorch. TensorRT reduces the total inference time by a factor of 2–4 and also reduces the relative impact of attention modules. Lowering the precision to 16 bits results in a similar trend, although the relative overhead of attention modules is higher, as shown in Fig. 3c. Hence, when using 16-bit precision inference, attention modules are less interesting from a speed-to-performance point of view. This effect is expected to remain even when efficient 16-bit implementations become available for all modules, because at this resolution the bottleneck is not in the parallel computation itself. In all cases, even when using TensorRT, the relative growth in inference time (5–50%) remains significantly larger than the FLOPs suggest (0.2–3%).
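A sketch of this timing procedure is shown below; the explicit CUDA synchronization calls and the number of warm-up iterations are our own assumptions, while the single \(224 \times 224\) image, unity batch size and the minimum over 1000 runs follow the description above.

```python
import time
import torch

@torch.no_grad()
def min_inference_time_ms(model: torch.nn.Module, runs: int = 1000) -> float:
    """Return the minimum single-image inference time in milliseconds."""
    model.eval().cuda()
    x = torch.randn(1, 3, 224, 224, device="cuda")   # unity batch size
    for _ in range(10):                               # warm-up (assumed count)
        model(x)
    torch.cuda.synchronize()
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        model(x)
        torch.cuda.synchronize()                      # wait for the GPU to finish
        best = min(best, time.perf_counter() - start)
    return best * 1e3
```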

Fig. 3 Impact of attention modules on inference time of a single \(224 \times 224\) image using an RTX 2080 Ti, using two different frameworks

Fig. 4 shows the relative inference-time increase of using attention modules versus the Top-1 accuracy achieved on ImageNet. To prevent clutter, only ResNet-18 and ResNet-50 are shown with attention modules. Considering the relatively large computational overhead of attention modules, one may wonder whether they are worth using at all. The answer is: sometimes. For bottleneck-based ResNets (50, 101 and 152), adding an attention module generally results in improved accuracy at a lower computation time, compared to a deeper ResNet. This is not the case for basic-block-based ResNets (18, 34), which validates that the number of feature channels is essential for efficient attention module performance. When using TensorRT, this conclusion remains the same, but the overhead is more favorable for attention modules. With TensorRT, ECA-Net has an especially good speed-to-accuracy trade-off, as its computational cost becomes nearly equivalent to that of our proposed AB module after TensorRT optimization. However, note that this graph only shows the ImageNet results and that different attention modules perform best on different datasets.

Fig. 4 Trade-off between accuracy and computation time for ResNets with attention modules, on ImageNet, using the PyTorch or TensorRT framework. Our AB module is an efficient improvement in PyTorch, but starts falling behind other modules when using TensorRT

4.3.2 Resolution Dependence

Different applications employ images of varying sizes. Hence, we investigate the impact of different image resolutions on inference time. Generally, attention modules consist of resolution-dependent components (“pooling” and “integration” in Fig. 1) and resolution-independent components (“processing” in Fig. 1). Intuitively, it may be expected that the relative impact is especially large for small resolutions. Image size versus relative inference-time overhead is shown in Fig. 5a and b for PyTorch and TensorRT, respectively. PyTorch introduces significant overhead for small image resolutions, whereas TensorRT shows a more constant overhead. The relative computation overhead in PyTorch drops drastically for images larger than \(512\times 512\) pixels. The increase in computation time of ResNet-101 relative to ResNet-50 is constant with respect to resolution, for both PyTorch and TensorRT. This suggests that the PyTorch implementation of attention modules is inefficient for low resolutions.

Fig. 5 Relative inference time overhead of various attention modules using a ResNet-50 and single-image inference in PyTorch and TensorRT, for square images of varying size. The inference time increase of a ResNet-101 relative to ResNet-50 is also depicted, to show that PyTorch specifically has trouble with efficiently computing attention modules. In subfigure (b), ECA and AB overlap almost completely

As expected, for both PyTorch and TensorRT, the resolution-dependent components dominate the overhead at large resolutions. The resolution-independent component is clearly visible in the difference between Multi-SE and SE in Fig. 5b. Except for Multi-SE with its expensive resolution-independent component, attention modules are, surprisingly, less attractive at larger image resolutions when using TensorRT. This effect persists for different network depths, as shown in Fig. 5c and d.

Deeper analysis of the computational complexity reveals that at high resolutions, for all modules except SRM, the majority of the computation is spent on the “integration” part of the module (see the bottom of Fig. 1). Hu et al. [3] show that a reduction in computation time for this part can be achieved with a more efficient CUDA implementation. Using this optimization is possibly more efficient than the summation that we use in AB. However, without this optimization, the summation saves the computation of the sigmoid and is hence slightly faster.

As a final experiment, we scale the batch size at a fixed \(224\times 224\) pixel resolution, shown in Fig. 6. Larger batch sizes allow the GPU to execute the small inner products of the resolution-independent components in parallel, enabling more efficient computation, similar to scaling the resolution. However, in contrast to resolution scaling, the resolution-dependent components scale linearly (constant relative overhead) with larger batch sizes, which makes the use of attention modules more attractive. Because of this scaling effect, high resolutions (above \(1024^2\) pixels) and at least medium batch sizes (above 16) are most beneficial for complex attention modules such as Multi-SE, while simple attention modules are preferred for low-resolution single-image inference.

Fig. 6 Relative inference time overhead of various attention modules using a ResNet-18, ResNet-50 or ResNet-152 with TensorRT, for varying batch size at a fixed resolution of \(224\times 224\) pixels. Note that the x-axis ends at different batch sizes for different network depths due to memory constraints

4.4 Object Detection and Domain Generalization

In addition to our classification experiments, we evaluate two different object detection architectures on the BDD100k [29] and COCO [30] datasets.

On the COCO dataset, we have trained a standard Faster-RCNN [31] with a feature pyramid [32] and a ResNet-50 backbone, using the MMDetection framework [33]. The attention modules are only added to the backbone, as before at each ResNet block, resulting in a total of 16 attention modules. First, we have compared all attention modules by initializing the network with our pretrained ImageNet networks, containing pretrained attention modules. Second, we have repeated these experiments, but initialized with pretrained baseline ResNet-50 weights and applied default initialization to the attention modules. This comparison allows us to verify the hypothesis from the ImageNet fine-tuning experiment in Sect. 4.2. The resulting Average Precision (AP) scores are computed on the COCO minival dataset, using 5k images for validation and adding the remaining 35k validation images to the training set, as is common practice. The results of both experiments are shown in Table 4. In contrast to our ImageNet fine-tuning experiment, using pretrained attention-module weights in the backbone appears to be required to achieve a large part of the performance gain, except for ECA. This could potentially be alleviated by increasing the number of training epochs [34], but that negates the training-time advantage, hence this approach is not investigated further.

Table 4 Average precision on the COCO minival dataset using a ResNet-50 Faster-RCNN with attention modules (“Pretrained” refers to the pretrained weights from our ImageNet experiments, “Scratch” refers to training the attention modules from scratch, while still using pretrained weights of the baseline network)

Domain generalization. BDD100k is an object detection dataset for autonomous driving and includes domain labels for time of day and weather. This makes it an excellent dataset for determining the impact of network changes on their domain generalization capabilities. Attention modules are tested for domain generalization because some domain generalization methods are also based on altering means and variances of feature maps, similarly to channel attention modules, and may thus perform a similar role in the network. Two such methods are Instance-Batch Normalization (IBN) [19] and style adversarial training [35].

We have trained a ResNet-34-based \(512\times 512\) SSD [36] detector, following the network modifications and parameters used in ScratchDet [37], so that it can be trained from scratch. Again, the attention modules are added only to the backbone, which results in 16 modules for ResNet-34. First, for each attention module type, we have trained the SSD from scratch for 20 epochs on the 100,000 training images and report the AP@IoU=0.5 on the validation set. We have then repeated this experiment, but trained only on all “day” and “dawn/dusk” images of the dataset and validated on the “night” images of the validation set. The results are shown in Table 5. Surprisingly, our AB module is the only module that obtains a significant improvement, outperforming the more expensive modules. To verify this result, we have retrained with AB two more times; the resulting standard deviation of 0.2 clearly suggests that it is not an outlier.

Table 5 Average precision of a ResNet-34 SSD detector on the BDD100k dataset, trained from scratch (“All” represents evaluation on the full dataset, “Day/night” shows training on “day” and “dawn/dusk”, testing on “night”)

Next, we compare the effect of domain generalization methods to that of attention modules for the day/night generalization experiment. For IBN blocks, we apply instance normalization on the first two residual blocks, according to the IBN-b scheme of Pan et al. [19]. For style adversarial learning, we compute the mean and variance of the final layer of the second residual block, followed by a gradient reversal layer and two FC layers, using the parameters of [35]. For adversarial training, we use both time of day and weather labels, to make the network invariant to these domains. The results are shown in Table 5. The performance improvement of these domain generalization methods is similar to or worse than when using attention modules. Furthermore, combining IBN blocks with AB or SE diminishes the improvement of IBN, suggesting that attention modules by themselves already reach most of the gain that can be achieved from modifying the means and variances of the feature maps.

5 Discussion

This paper provides an objective comparison of several attention modules under identical conditions. Generally, attention-module papers only compare their numerical scores with previously reported results, instead of experimentally validating the different modules in an identical experimental setup. We have re-implemented each attention module, aligning common operators by replacing them with identical, more computationally efficient alternatives. Although this requires more work, it results in a more objective comparison.

Adding attention modules to a network changes the location of its minimum, but not very strongly, as shown by the results in Sects. 4.2 and 4.4. This allows adding randomly initialized attention modules to a pretrained network and fine-tuning the extended network with a lower learning rate than when training from scratch. This fine-tuning consistently matches or outperforms a network that is fully trained from scratch, in half as many epochs (40 epochs of fine-tuning vs. 90 epochs from scratch).

From an execution-time perspective, adding attention modules should be viewed as an alternative to increasing the network depth. Either way, the network becomes larger, but the impact on the trade-off between inference time and accuracy differs. The number of FLOPs suggests that attention modules are always the more efficient option, but measuring the inference time shows that this is highly dependent on other factors, such as architecture, resolution, batch size and framework.

Channel attention modules may be a replacement for certain domain generalization techniques, as they serve a similar role and modify feature maps in a similar way. For example, in one of our experiments we found that using both simultaneously does not improve performance further. As future work, it is interesting to investigate this relation between domain generalization and channel attention further and to discover whether the related techniques are complementary or not.

6 Conclusion

We have performed an extensive comparison of computationally inexpensive attention modules in terms of speed and accuracy, for both classification and object detection. All attention modules have been re-implemented with the same, more efficient operations and evaluated under identical conditions for a fair comparison (code available). With respect to the speed-to-performance trade-off, we propose a novel attention module, the Attention Bias (AB) module, which achieves the lowest overhead when used for inference. This module can be easily integrated into pretrained networks, achieving higher performance after minimal fine-tuning. However, none of the tested modules, including our AB module, is consistently the best over different datasets and network architectures, and every tested module performed best in at least one experiment. This means that selecting the best attention module for a practical use case cannot be simplified to choosing the one with, e.g., the best performance on ImageNet.

We extend the theoretical complexity measurement in FLOPs with inference-time measurements in the commonly used PyTorch and TensorRT frameworks. Our measurements show that inference time increases far more (5–50%) than the increase in FLOPs suggests (0.2–3%). This means that attention modules are not always a computationally efficient addition to an architecture; their effectiveness depends on the framework, network architecture, batch size and input resolution. For a ResNet with bottleneck blocks, using attention modules is generally preferred over using a deeper architecture. In contrast, ResNets built from basic blocks benefit far less, suggesting that a sufficient number of feature channels is essential. Complex attention modules (Multi-SE) are preferred at larger resolutions and batch sizes, because their computational bottleneck lies in the resolution-independent components, whose relative cost decreases in those settings. It can be concluded that simple modules (ECA, AB) are preferred for low-resolution single-image inference.

Finally, we have found that the differences in accuracy gain between modules are often relatively smaller than the differences in inference time. This simplifies the selection of a suitable module for a specific use case: merely benchmarking the inference time of several modules, without any training, can already provide a suitable shortlist of modules for further evaluation in the complete training pipeline. These modules can then be integrated by fine-tuning a pretrained network, reducing training time by a factor of two compared to training from scratch.