Efficient Contextformer: Spatio-Channel Window Attention for Fast Context Modeling in Learned Image Compression

Entropy estimation is essential for the performance of learned image compression. It has been demonstrated that a transformer-based entropy model is of critical importance for achieving a high compression ratio, albeit at the expense of significant computational effort. In this work, we introduce the Efficient Contextformer (eContextformer) - a computationally efficient transformer-based autoregressive context model for learned image compression. The eContextformer efficiently fuses the patch-wise, checkered, and channel-wise grouping techniques for parallel context modeling, and introduces a shifted-window spatio-channel attention mechanism. We explore better training strategies and architectural designs and introduce additional complexity optimizations. During decoding, the proposed optimization techniques dynamically scale the attention span and cache previous attention computations, drastically reducing model and runtime complexity. Compared to the non-parallel approach, our proposal has ~145x lower model complexity and ~210x faster decoding speed, and achieves higher average bit savings on the Kodak, CLIC2020, and Tecnick datasets. Additionally, the low complexity of our context model enables online rate-distortion algorithms, which further improve the compression performance. We achieve up to 17% bitrate savings over the intra coding of the Versatile Video Coding (VVC) Test Model (VTM) 16.2 and surpass various learning-based compression models.

The best performing LIC frameworks [3]-[17], [22]-[24] use an autoencoder which applies a non-linear, energy-compacting transform [26] to an input image. Such frameworks learn a so-called forward-backward adaptive entropy model in order to create a low-entropy latent space representation of the image. The model estimates the probability distribution by conditioning the distribution on two types of information - one is a signalled prior (a.k.a. forward adaptation [4]), and the other is implicit contextual information extracted by an autoregressive context model (a.k.a. backward adaptation [5]). The performance of the backward adaptation is a major factor for the efficiency of the LIC framework, and methods to improve its performance have been actively investigated. Various context model architectures have been proposed - e.g., 2D masked convolutions using local dependencies [5]-[9], [11]; channel-wise autoregressive mechanisms exploiting channel dependencies [10], [16], [24]; non-local simplified attention-based models capturing long-range spatial dependencies [12], [23]; and sophisticated transformer-based models leveraging a great degree of content-adaptive context modeling [17], [24].
Using an autoregressive model requires recursive access to previously processed data, which results in slow decoding and inefficient utilization of the NPU/GPU. To remedy this, some researchers [11], [13], [24] optimized the decoding process by using wavefront parallel processing (WPP) similar to the one used in classical codecs [27]. The WPP significantly increases parallelism, but still requires a large number of autoregressive steps for the processing of large images. A more efficient approach is to split the latent elements into groups and code each group separately [14]-[17], [22]. The latent elements might be split into, e.g., spatial patches [14] or channel segments [16], or grouped using a checkered pattern [15]. The researchers also investigated a combination of the channel-wise and the checkered grouping [22]. Such parallelization approaches can reduce decoding time by 2-3 orders of magnitude, at the cost of higher model complexity and up to 3% performance drop. For instance, the patch-wise model [14] uses a recurrent neural network to share information between patches. Other works [16], [22] use a channel-wise model and implement additional convolutional layers to combine decoded channel segments.
Our previous work [24] proposed the Contextformer utilizing spatio-channel attention, which outperformed contemporary LIC frameworks [3]-[12], [16], [17], [23]. However, the decoding time and model complexity of the Contextformer make it unsuitable for real-time operation. In this work, we propose a fast and low-complexity version of the Contextformer, which we call the Efficient Contextformer (eContextformer). We extend the window attention of [28] to spatio-channel window attention in order to achieve a high-performance and low-complexity context model. Additionally, we use checkered grouping to increase parallelization and reduce the number of autoregressive steps in context modeling. By exploiting the properties of the eContextformer, we also propose algorithmic optimizations to reduce complexity and runtime even further.

Fig. 1. Illustrated context models: (a) the model with 2D masked convolutions [5], [11], (b) the model with 3D masked convolutions [6], [10], (c) the channel-wise autoregressive model [16], (d) the simple attention-based model [23], and (e-f) the Contextformer with sfo and cfo coding mode [24], respectively. The previously decoded latent elements not contributing to context modeling and the elements yet to be coded are depicted as ( ) and ( ), respectively.
In terms of PSNR, our model outperforms VTM 16.2 [29] on the Kodak [30], CLIC2020 [31] (Professional and Mobile) and Tecnick [32] datasets, providing average bitrate savings of 10.9%, 12.4%, 6.9% and 12.6%, respectively. Our optimized implementation requires 145x fewer multiply-accumulate (MAC) operations and takes 210x less decoding time compared to the earlier Contextformer. It also contains significantly fewer model parameters compared to other channel-wise autoregressive and transformer-based prior-art context models [16], [17]. Furthermore, due to its manageable complexity, the eContextformer can be used with an online rate-distortion optimization (oRDO) similar to the one in [33]. With oRDO enabled, the performance of our model reaches up to 17.1% average bitrate saving in terms of PSNR over VTM 16.2 [29].

A. Transformers in Computer Vision
The underlying principle of the transformer network [34] can be summarized as a learned projection of sequential vector embeddings x ∈ ℝ^{S×d_e} into sub-representations of query Q ∈ ℝ^{S×d_e}, key K ∈ ℝ^{S×d_e} and value V ∈ ℝ^{S×d_e}, where S and d_e denote the sequence length and the embedding size, respectively. Subsequently, a scaled dot-product calculates the attention, which weighs the interaction between the query and key-value pairs. The Q, K and V are split into h groups, known as heads, which enables parallel computation and greater attention granularity. The separate attention of each head is finally combined by a learned projection W. After the attention module, a point-wise multi-layer perceptron (MLP) is independently applied to all positions in S, in order to provide non-linearity. For autoregressive tasks, masking of the attention is required to ensure the causality of the interactions within the input sequence. The multi-head attention with masking can be described as

$$\mathrm{MHA}(Q,K,V)=\mathrm{concat}(A_1,\dots,A_h)\,W,\qquad A_i=\mathrm{softmax}\!\left(\frac{M\odot(Q_iK_i^{\top})}{\sqrt{d_e/h}}\right)V_i,$$

where the mask M ∈ ℝ^{S×S} contains weights of −∞ in the positions corresponding to interactions with elements which are not processed yet, and ones for the rest. The operator ⊙ denotes the Hadamard product.
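As an illustration, the masked scaled dot-product attention above can be sketched in a few lines of NumPy. This is a single-head sketch; the causal mask is applied here by setting disallowed scores to −∞ before the softmax, which has the same effect as the Hadamard formulation in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Scores of positions that are not yet coded are set to -inf before
    the softmax, which zeroes their attention weights."""
    S, d_e = Q.shape
    scores = Q @ K.T / np.sqrt(d_e)
    allowed = np.tril(np.ones((S, S), dtype=bool))  # element i sees j <= i
    scores = np.where(allowed, scores, -np.inf)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
S, d_e = 6, 8
Q, K, V = (rng.standard_normal((S, d_e)) for _ in range(3))
out = masked_attention(Q, K, V)  # row 0 can only attend to position 0
```

Because of the causal mask, the first output row is exactly the first value vector, since the first element may attend only to itself.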
The vanilla implementation of a convolutional neural network (CNN) [35] exploits local correlations with weights that are static after training. The attention mechanism inherent to transformers uses content-adaptive weighting, which captures long-distance relations [36]. This property makes transformers potentially more suitable for various computer vision (CV) tasks [37]-[40]. As the computational complexity of a vanilla transformer grows quadratically with the sequence length S, it is hardly applicable to high-resolution images. As one potential solution, Dosovitskiy et al. [38] proposed the Vision Transformer (ViT). The ViT decomposes large images into non-overlapping 2D patches and computes global attention between the flattened patches. The ViT outperforms its CNN-based counterparts, but it has a trade-off between patch size and complexity - tasks depending on pixel-level decisions, such as image segmentation, require smaller patch sizes, while the complexity remains quadratic in the number of patches. As an alternative to global attention, [24], [40], [41] propose to use sliding window attention to limit the attention span. The attention computation within a window has relatively lower complexity than global attention. However, the window attention is re-computed for every pixel position, which results in high latency. Another solution is the Hierarchical Vision Transformer using Shifted Windows, also called the Swin transformer, proposed by Liu et al. in [28]. The attention is computed in non-overlapping windows, and information is shared across windows by cyclically shifting them between layers. The Swin transformer alternatingly employs non-shifted and shifted window attention along the transformer layers to approximate global attention.
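The non-overlapping window partitioning and the cyclic shift of the Swin scheme can be sketched as follows. This is a NumPy sketch of the tensor manipulation only; `window_partition` and `shifted_window_partition` are illustrative names.

```python
import numpy as np

def window_partition(x, K):
    # Split an (H, W, C) feature map into non-overlapping K x K windows,
    # each flattened into a sequence of K*K tokens.
    H, W, C = x.shape
    x = x.reshape(H // K, K, W // K, K, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, K * K, C)

def shifted_window_partition(x, K):
    # Cyclic shift by K // 2 before partitioning, so that the next
    # layer's windows straddle the borders of the previous windows.
    return window_partition(np.roll(x, shift=(-(K // 2), -(K // 2)), axis=(0, 1)), K)

x = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
wins = window_partition(x, 4)             # 4 windows of 16 tokens each
swins = shifted_window_partition(x, 4)
```

Attention is then computed independently inside each window, so the cost grows linearly in the number of windows rather than quadratically in the full sequence length.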

B. Learned Image Compression
The recent lossy LIC frameworks apply learned non-linear transform coding based on [26]. They use an autoencoder with an entropy-constrained bottleneck, where the autoencoder learns to achieve high fidelity of the transform while reducing the symbol entropy. The encoder side uses an analysis transform g_a, which projects the input image x to a latent variable y with lower dimensionality. The latent variable is quantized to ŷ by a quantization function Q and encoded into a bitstream with a lossless codec such as [42]. During training, the quantization is approximated by an additive uniform noise U; during inference, the actual quantization is applied. The decoder reads the bitstream of ŷ and applies a synthesis transform g_s to output the reconstructed image x̂.

Fig. 2. Parallelization by grouping of latent elements: (a) patch-wise grouping [14], (b) checkered grouping [15], (c) channel-wise grouping [16], and (d-e) combination of checkered and channel-wise grouping with sfo and cfo coding, respectively. All latent elements within the same group (depicted with the same color) are coded simultaneously, while the context model aggregates the information from the previously coded groups. For instance, [14], [15] use 2D masked convolutions in the context model, and [16] applies multiple CNNs to channel-wise concatenated groups. The context model of [22] combines the techniques of [15], [16] and can be illustrated as in (d). Our proposed model (eContextformer), as well as the experimental model (pContextformer), uses the parallelization techniques depicted in (d-e). However, our models employ spatio-channel attention in context modeling and do not require additional networks for channel-wise concatenation.
The coding redundancy of the latent variable ŷ is minimized while learning its probability distribution with the lowest possible entropy. Entropy modeling builds on two methods, backward and forward adaptation [26]. The forward adaptation utilizes a hyperprior estimator, i.e., a second autoencoder with its own analysis transform h_a and synthesis transform h_s. The hyperprior creates a separately encoded side channel ẑ with dimensionality lower than the one of ŷ. A factorized density model [3] learns local histograms to estimate the probability mass p_ẑ(ẑ|ψ_f) with model parameters ψ_f. The backward adaptation utilizes a context model g_cm, which autoregressively conditions the entropy estimation of the current latent element ŷ_i on the previously coded elements ŷ_{<i}. The entropy parameters network g_e uses the outputs of the context model and the hyperprior to parameterize the conditional distribution p_ŷ(ŷ|ẑ). The lossy LIC framework can be formulated as

$$y = g_a(x;\phi),\qquad \hat{y} = Q(y),\qquad \hat{x} = g_s(\hat{y};\theta),$$

with the loss function L of end-to-end training

$$\mathcal{L}(\phi,\theta,\psi) = R(\hat{y}) + R(\hat{z}) + \lambda\,D(x,\hat{x}),$$

where ϕ, θ and ψ are the optimization parameters of their corresponding transforms. The Lagrange multiplier λ regulates the trade-off between distortion D(·) and rate R(·).
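A toy numeric sketch of the rate-distortion Lagrangian may clarify the trade-off; `rate_bits` and `rd_loss` are hypothetical helpers, with the rate taken as the ideal code length under the estimated symbol probabilities.

```python
import numpy as np

def rate_bits(probs):
    # Ideal code length of the quantized symbols under the entropy
    # model: R = -sum(log2 p), the quantity the arithmetic codec
    # approaches in practice.
    return float(-np.log2(probs).sum())

def rd_loss(probs_y, probs_z, x, x_hat, lam):
    # L = R(y_hat) + R(z_hat) + lambda * D(x, x_hat), with MSE distortion.
    D = float(np.mean((x - x_hat) ** 2))
    return rate_bits(probs_y) + rate_bits(probs_z) + lam * D

x = np.array([0.2, 0.5, 0.7])
x_hat = np.array([0.25, 0.45, 0.7])
loss = rd_loss(np.array([0.5, 0.25]), np.array([0.5]), x, x_hat, lam=0.01)
```

A larger λ weights the distortion term more heavily, pushing the trained model toward higher-bitrate, higher-fidelity operating points.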

C. Efficient Context Modeling
As an autoregressive context model yields a significant increase of the coding efficiency, multiple architectures have been proposed (see Fig. 1). In our previous study [24], we categorized those into three groups: (1) exploiting spatial dependencies; (2) exploiting cross-channel dependencies; (3) increasing content-adaptive behavior in the entropy estimation.
As an example for the first group, Minnen et al. [5] proposed to use a 2D masked convolution to exploit spatially local relations in the latent space. In order to capture a variety of spatial dependencies, [7], [8] implemented a multi-scale context model using multiple masked convolutions with different kernel sizes. The second group encompasses context models such as the 3D context model [6], [10] and the channel-wise autoregressive context model [16]. In those proposals, the models are autoregressive in the channel dimension, thus they can use information derived from previously coded channels. In [6], [10] the authors use 3D masked convolutions which exploit both spatial and cross-channel dependencies, while the proposal in [16] uses cross-channel dependencies only.
Attention-based models provide better content-adaptivity than the CNN-based ones [12], [17], [23], [24]. Early proposals [12], [23] combine 2D masked convolutions and simplified attention to capture local and global correlations of latent elements. Additionally, Guo et al. [23] propose to split the latent variable into two channel segments to exploit cross-channel relations. Qian et al. [12] capture long-range spatial dependencies with a transformer-based hyperprior and context model providing adaptive global attention. In a previous work, we proposed a transformer-based context model (a.k.a. the Contextformer [24]), which splits the latent variable into N_cs channel segments and applies spatio-channel attention. We proposed two different coding modes - spatial-first-order (sfo) and channel-first-order (cfo) - which differ in prioritizing spatial and cross-channel correlations.
On the encoder side, all ŷ_i are already available prior to context modeling. Therefore, entropy estimation can be efficiently performed on an NPU/GPU by masking out element permutations which would break coding causality in the sequence. In the decoder, however, entropy modeling requires an autoregressive processing of the latent elements. Such serial processing results in slower decoding and lower NPU/GPU efficiency. In Fig. 2, we compare techniques which increase parallelization by grouping the elements according to various heuristics [14]-[16]. The authors of [14] propose a patch-wise context model processing the latent variable within spatially non-overlapping patches. A multi-scale context model simultaneously exploits the local relations inside each patch. The inter-patch relations are aggregated with an information-sharing block which employs a recurrent neural network. The authors of [15] proposed grouping of the latent variables according to a checkered pattern. They use a 2D masked convolution in context modeling, in which the mask has a checkered pattern. Both [14] and [15] outperform their serial baselines at high bitrates but are inferior to them at low bitrates. In [12], the authors use a transformer-based model with checkered grouping similar to the one used in [15]. However, the context model in [12] has high complexity due to the computation of global attention. The model in [16] is another example of a parallel context model, which uses channel-wise grouping. However, it uses separate CNNs to aggregate the decoded channel segments, which requires a significant number of model parameters. He et al. [22] combined checkered grouping with channel-wise context modeling and reached a rate-distortion performance similar to the one of the Contextformer, albeit with drastically reduced complexity. However, their model uses separate networks to aggregate channels similar to [16] and does not utilize an attention mechanism.
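The checkered grouping of [15] can be generated with a one-liner; `checkered_groups` is an illustrative helper, not code from any of the cited works.

```python
import numpy as np

def checkered_groups(H, W):
    # Checkerboard grouping: latent positions where (i + j) is even form
    # the first coding group (anchors), the rest form the second group,
    # so each group can be coded in a single parallel step.
    ij = np.add.outer(np.arange(H), np.arange(W))
    return ij % 2  # 0 = first coding group, 1 = second coding group

groups = checkered_groups(4, 6)
```

Because every element of one group has all four spatial neighbors in the other group, the second step can condition on a dense local context decoded in the first step.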

D. Online Rate-Distortion Optimization
Training a LIC encoder on large datasets gives a set of globally optimal but locally sub-optimal network parameters (ϕ, θ, ψ). For a given input image x, a better set of parameters might exist at inference time. This phenomenon is referred to as the amortization gap [33], [43], [44]. The optimization of network parameters per image is not feasible during encoding due to the additional signaling and computational complexity involved in the process. Therefore, [33], [43] propose an online rate-distortion optimization (oRDO), which adapts only y and z during encoding. The oRDO algorithms first initialize the latent variables by passing the input image x through the encoder, without tracking the gradients of the transforms. Then, only the latent variables are iteratively optimized using (3a), while the network parameters are kept frozen. Finally, the optimized variables (ŷ_opt, ẑ_opt) are encoded into the bitstream. The oRDO does not influence the decoder complexity or its runtime. However, the iterative process significantly increases the encoder complexity. Therefore, it has primarily been implemented on LIC frameworks with a low-complexity entropy model [4], [5], [9], [11].
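The oRDO loop described above can be sketched as follows. This is a minimal sketch with hypothetical stand-ins for the trained transforms, and a finite-difference gradient in place of the backpropagation a real LIC implementation would use.

```python
import numpy as np

def ordo(x, encoder, decoder, rate, lam, steps=20, lr=0.5):
    # Initialize the latent with one forward pass (the transforms stay
    # frozen), then descend on L(y) = R(y) + lam * D(x, decoder(y))
    # with respect to y only.
    y = encoder(x).astype(float)
    loss = lambda y: rate(y) + lam * float(np.mean((x - decoder(y)) ** 2))
    for _ in range(steps):
        # finite-difference gradient, purely for the sketch
        eps, grad = 1e-5, np.zeros_like(y)
        for i in range(y.size):
            yp, ym = y.copy(), y.copy()
            yp.flat[i] += eps
            ym.flat[i] -= eps
            grad.flat[i] = (loss(yp) - loss(ym)) / (2 * eps)
        y = y - lr * grad
    return y

# Hypothetical, linear stand-ins for the trained transforms
enc = lambda v: 2.0 * v
dec = lambda y: y / 2.0
rate = lambda y: 0.01 * float(np.sum(y ** 2))
x = np.array([1.0, -0.5])
y_opt = ordo(x, enc, dec, rate, lam=1.0)
```

The encoder pays the iteration cost; the decoder sees only the final bitstream, which is why oRDO leaves decoding complexity untouched.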

A. Exploring Parallelization Techniques
We analyzed the effect of prior parallelization techniques on the Contextformer and observed a few areas where the architecture design and the training strategy can be improved. The spatial kernel size K and the number of channel segments N_cs define the sequence length N = K²N_cs inside a spatio-channel window. The window operation is performed N_win times with a stride of s=1 over ŷ.
Our model is trained with a large spatial kernel (K=16) on 256×256 image crops. Therefore, the model learns global attention during training. During inference, it uses sliding window attention, which reduces the computational complexity. Compared to the global attention used in [12], our sliding window attention provides lower latency. However, our model cannot learn the sliding behavior, which creates a discrepancy between training and inference results. This problem could be fixed by training the model with sliding window attention, either on larger image crops with the same kernel size, or with a smaller kernel size and an unchanged crop size. Training on larger image crops does not affect the complexity and yet increases the compression performance; in this case, however, a direct comparison with prior art trained on small crops would be unfair and would make the effects of parallelization less noticeable. Training with a smaller kernel provides a biquadratic complexity reduction for each window. Moreover, the Contextformer requires N_win autoregressive steps, which cannot be efficiently implemented on a GPU/NPU. Based on those observations, we built a set of parallel Contextformers (pContextformer), which adopt the patch-wise context model [14] and the checkerboard context model [15] and need eight autoregressive steps for N_cs=4.
We trained the set of models with 256×256 image crops for ∼65% of the number of iterations used for training in [24]. We varied the coding method (sfo or cfo), the kernel size K, and the stride during training s_t and inference s_i. The stride defines whether the windows are processed overlapping or separately. To achieve overlapping window processing, we set the stride to K/2. For K=16, the model could learn only global attention, which could be replaced with a local one during the inference. Following the methodology of JPEG-AI [45], we measured the performance of the codec in Bjøntegaard Delta rate (BD-Rate) [46] over VTM 16.2 [29], and the complexity of each context model in kilo MAC operations per pixel (kMAC/px). For simplicity, we measured the single-pass complexity on the encoder side, where the whole latent variable ŷ is processed at once with the given causality mask. We present the results in Fig. 3 and Tables I and II and make the following observations: a) When trained with global attention (s_t=K=16): the spatial-first coding is better than channel-first coding at high bitrates but worse at low bitrates. At low bitrates, the spatial-first coding cannot efficiently exploit spatial dependencies due to the increased sparsity of the latent space, and due to the application of non-overlapping windows during the inference. Also, spatial-first coding benefits more from the overlapping window attention applied at inference time (s_i<K). b) When trained with overlapping attention windows (s_t<K): the spatial-first coding outperforms channel-first coding. Moreover, using overlapping windows at inference time helps the small-kernel models (K=8) to reach performance close to the one of the larger-kernel models (K=16).

Fig. 3. Experimental study on the Kodak dataset [30], comparing the rate-distortion performance of different model configurations.
c) In general: simple pContextformer models can provide more than 100x complexity reduction for a 3% performance drop. Theoretically, more efficient models are possible with the use of overlapping window attention at training and inference time. However, the vanilla version of the overlapping window attention is still sub-optimal, since it increases the complexity fourfold.

Based on our investigations on vision transformers and our initial experiments (see Sections II-A and III-A), we propose the Efficient Contextformer (eContextformer) as an improvement to our previous architecture [24]. Built upon the same compression framework as [8], [24], the analysis transform g_a comprises four 3×3 convolutional layers with a stride of 2, the GDN activation function [47], and a single residual attention module without the non-local mechanism [8], [11], [48], the same as in our previous work. The synthesis transform g_s closely resembles g_a and implements deconvolutional layers with inverse GDN. In order to expand the receptive fields and reduce the quantization error, g_s utilizes residual blocks [6] in its first layer, and an attention module. The entropy model integrates a hyperprior network and universal quantization, and estimates p_ŷᵢ(ŷ_i|ẑ) with a Gaussian Mixture Model [11] with k=3 mixtures. In contrast to [24], we use 5×5 convolutions in lieu of the 3×3 ones in the hyperprior in order to align our architecture with the recent state-of-the-art [16], [22].
Our experiments showed that the Contextformer can support the parallelization techniques of [14], [15], but requires an efficient implementation of overlapping window attention.
To achieve this, we replaced the ViT-based transformer of the Contextformer with a Swin-based one and extended it with spatio-channel attention. Fig. 4 illustrates our compression framework. For context modeling, the segment generator rearranges the latent ŷ ∈ ℝ^{H×W×C} into a spatio-channel segmented representation ŝ ∈ ℝ^{N_cs×H×W×p_cs} with the number of channel segments N_cs = C/p_cs. A linear layer converts ŝ to a tensor embedding with the last dimension of d_e. The embedding representation passes through L Swin transformer layers, where window and shifted-window spatio-channel attention (W-SCA and SW-SCA) are applied alternatingly. Window attention is computed on intermediate outputs of shape ℝ^{N_w×(N_cs K²)×d_e}, where the first and second dimensions represent the number of windows N_w = HW/K² processed in parallel and the sequential data within a window of K×K, respectively. We mask each window according to the group coding order (see Fig. 2) to ensure coding causality, where each group uses all the previously coded groups for context modeling. Following [28], we replaced absolute positional encodings with relative ones to introduce more efficient permutation variance. We employ multi-head attention with h heads to yield better parallelization and build independent relations between different parts of the spatio-channel segments.
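The segment generator and the window reshaping can be sketched in NumPy. This is an illustrative sketch of the tensor shapes only; the real model applies learned projections and attention between these steps, and `segment_and_window` is an assumed name.

```python
import numpy as np

def segment_and_window(y, n_cs, K):
    H, W, C = y.shape
    p_cs = C // n_cs
    # segment generator: (H, W, C) -> (n_cs, H, W, p_cs)
    s = y.reshape(H, W, n_cs, p_cs).transpose(2, 0, 1, 3)
    # gather non-overlapping K x K windows; each window becomes a
    # sequence of n_cs * K * K spatio-channel tokens
    s = s.reshape(n_cs, H // K, K, W // K, K, p_cs)
    s = s.transpose(1, 3, 0, 2, 4, 5)
    n_w = (H // K) * (W // K)
    return s.reshape(n_w, n_cs * K * K, p_cs)

y = np.random.default_rng(1).standard_normal((16, 16, 192))
seq = segment_and_window(y, n_cs=4, K=8)  # 4 windows of 4*64 tokens each
```

Attention over the second axis then mixes tokens across both spatial positions and channel segments within a window, which is the spatio-channel attention the text describes.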

C. Additional Complexity Optimizations
To generate the windows and shifted windows, the transformations ℝ^{N_cs×H×W×d_e} ↔ ℝ^{N_w×(N_cs K²)×d_e} are applied to the intermediate outputs before and after the attention layers. Since the eContextformer is autoregressive in the spatial and channel dimensions, one can iterate over the channel segments with a checkered mask during decoding, i.e., by passing previous segments ŝ_{<i} through the context model to decode ŝ_i (with the channel segment index i). For instance, in the spatial-first-order (sfo) setting, two autoregressive steps are required per iteration, while the computation of half of the attention map is unnecessary in each first step. To remedy this, we rearranged the latent tensors into a form r̂ ∈ ℝ^{2N_cs×H×(W/2)×p_cs} that is more suitable for efficient processing. We set the window size to K×(K/2) to preserve the spatial attention span, and applied the window and shifted-window attention as illustrated in Fig. 5. We refer to the proposed rearrangement operation as Efficient coding Group Rearrangement (EGR). Furthermore, we code the first group r̂_1 using only the hyperprior instead of using a start token for context modeling of r̂_1. This reduces the total number of autoregressive steps using the context model to 2N_cs − 1, i.e., a total complexity reduction of 13% for N_cs=4. We refer to this optimization as Skipping First coding Group (SFG).
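A small helper makes the step-count arithmetic of SFG explicit (illustrative only): checkered grouping contributes two coding steps per channel segment, and skipping the first group saves one context-model pass, i.e., 1/8 ≈ 13% of the steps for N_cs=4.

```python
def autoregressive_steps(n_cs, skip_first_group=True):
    # sfo coding with checkered grouping: two coding steps (anchor and
    # non-anchor positions) per channel segment; SFG codes the very
    # first group with the hyperprior only, saving one pass.
    steps = 2 * n_cs
    return steps - 1 if skip_first_group else steps

saving = 1 - autoregressive_steps(4) / autoregressive_steps(4, skip_first_group=False)
# saving == 0.125, i.e. roughly the 13% reduction reported for N_cs = 4
```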
Note that transformers are sequence-to-sequence models and compute attention between every query and key-value pair.
Let Q(1 ≤ n), K(1 ≤ n) and V(1 ≤ n) be all queries, keys and values computed up to coding step n. Since context modeling is an autoregressive task, during decoding one can omit the attention between the previously coded queries and key-value pairs, and compute only the attention between the current query and the cached key-value pairs. For our efficient implementation, we also adopted key-value caching according to the white paper [49] and the published work [50].

Fig. 6. The rate-distortion performance in terms of (a) PSNR and (b) MS-SSIM on the Kodak dataset [30], showing the performance of our model compared to various learning-based and classical codecs. We also include the performance of our model combined with oRDO.
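The caching scheme can be sketched as follows: a single-head NumPy sketch in which, at each coding step, only the new query's attention row is computed against the cached keys and values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class CachedAttention:
    """Keeps previously seen key-value pairs; each decoding step only
    computes the attention row of the newly arriving query."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        self.K.append(k)
        self.V.append(v)
        Kc, Vc = np.stack(self.K), np.stack(self.V)
        w = softmax(q @ Kc.T / np.sqrt(q.shape[-1]))  # one attention row
        return w @ Vc

rng = np.random.default_rng(2)
att, d = CachedAttention(), 8
outs = [att.step(*(rng.standard_normal(d) for _ in range(3))) for _ in range(5)]
```

With caching, step n costs O(n·d) instead of the O(n²·d) of recomputing the full causal attention map, which is where the decoding speed-up comes from.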
embedding size. Since our initial experiments showed deteriorating performance of the cfo coding method with window attention, we continued with the sfo version. K defines the size of the attention span in the spatial dimension, which results in a K×(K/2) window for the proposed optimization algorithms (see Section III-C). Following [24], [51], we trained the eContextformer for 120 epochs (∼1.2M iterations) on 256×256 random image crops with a batch size of 16 from the Vimeo-90K dataset [52], and used the ADAM optimizer [53] with an initial learning rate of 10⁻⁴. In order to cover a range of bitrates, we trained various models with λ ∈ {0.002, 0.004, 0.007, 0.014, 0.026, 0.034, 0.058}, with mean squared error (MSE) as the distortion metric D(·). To evaluate the perceptual quality of our models, we finetuned them with MS-SSIM [54] as the distortion metric for ∼500K iterations. For the high-bitrate models (λ_{5,6,7}), we increased the bottleneck size to 312 according to common practice [5], [11], [16]. Empirically, we observed better results with a lower number of heads for the highest-rate model, so we reduced it to h=6 for this model. Additionally, we investigated the effect of the training dataset and crop size by finetuning the models with 256×256 and 384×384 image crops from the COCO 2017 dataset [55] for ∼600K iterations.
b) Evaluation: We analyzed the performance of the eContextformer on the Kodak [30], CLIC2020 [31] (Professional and Mobile), and Tecnick [32] datasets. We compared its performance to various serial and parallel context models: the 2D context models (Minnen et al. [5] and Cheng et al. [11]), the multi-scale 2D context model (Cui et al. [8]), the 3D context model (Chen et al. [6]), the transformer-based context model with spatio-channel attention (Contextformer [24]), the channel-wise autoregressive context model (Minnen&Singh [16]), the checkerboard context model (He et al. [15]), the transformer-based checkerboard context model (Qian et al. [17]), and the channel-wise and checkerboard context model (He et al. [22]). Where source code was available, we executed the inference algorithms of those methods; otherwise, we obtained the results from the relevant publications. Additionally, we also used classical image coding frameworks such as BPG [20] and VTM 16.2 [29] for comparison. We measured the number of parameters of the entropy models with the summary functions of PyTorch [56] or TensorFlow [57] (depending on the published implementation). Following the recent standardization activity JPEG-AI [45], we computed the model complexity in kMAC/px with the ptflops package [58]. In case of missing hooks for the attention calculation, we integrated them with the help of the official code repository of the Swin transformer [28]. For the runtime measurements, we used DeepSpeed [59] and excluded the arithmetic coding time for a fair comparison, since each framework uses a different implementation of the arithmetic codec. All the tests, including the runtime measurements, were done on a machine with a single NVIDIA Titan RTX and an Intel Core i9-10980XE.

B. Model Performance
Fig. 6 shows the rate-distortion performance of the eContextformer trained on Vimeo-90K with 256×256 image crops. In terms of PSNR, our model outperforms VTM 16.2 [29] for all rate points under test and achieves 5.24% bitrate savings compared to it (see Fig. 6a). Our compression framework shares the same analysis and synthesis transforms and hyperprior with the model with the multi-scale 2D context model [8] and the Contextformer [24]. Compared to [8], our model saves 8.5% more bits, while it provides 1.7% lower performance than the Contextformer due to the parallelization of context modeling. The eContextformer achieves competitive performance to parallelized models employing channel-wise autoregression with an increased number of model parameters, such as Minnen&Singh [16] and He et al. [22]. Compared to VTM 16.2 [29], the former model gives a 1.6% loss in BD-Rate performance, and the latter a 6.3% gain.
In Fig. 6b, we also evaluate the perceptually optimized models compared to the prior art. In terms of MS-SSIM [54], the eContextformer saves on average 48.3% bitrate compared to VTM 16.2 [29], which is performance-wise 0.7% better than He et al. [22].

Fig. 7. Experimental study of the effects of different training datasets and crop sizes on the performance, showing the rate-distortion performance on (a) Kodak [30] and (b) Tecnick [32].

Fig. 8. The rate-distortion performance in terms of PSNR on (a) Kodak [30], (b) CLIC2020 [31] (Professional and Mobile), and (c) Tecnick [32] datasets, showing the performance of our model w/ and w/o finetuning compared to various learning-based and classical codecs. We also include the performance of our models combined with oRDO.

C. Effect of training with larger image crops
Although the eContextformer yields a performance close to the Contextformer on the Kodak dataset [30], it underperforms on larger-resolution test datasets such as Tecnick [32] (see Fig. 7). The Vimeo-90K dataset [52] we initially used for training has a resolution of 448×256. In order to avoid a bias towards horizontal images, we used 256×256 image crops, i.e., a 16×16 latent variable, for training, similar to the state-of-the-art [51]. However, the attention window size of K=8 combined with the low latent resolution limits learning an efficient context model. In order to achieve high rate-distortion gain, recent studies [14], [16], [17], [22] experimented with higher-resolution datasets, such as ImageNet [60], COCO 2017 [55], and DIV2K [61], and larger crop sizes of up to 512×512. Following those studies, we finetuned the eContextformer with 256×256 and 384×384 image crops from COCO 2017 [55]. As one can see in Fig. 7, the models finetuned with different crop sizes achieve similar performance on the Kodak dataset [30], which has about 0.4M pixels per image. On the contrary, larger crop sizes help the models to reach more than twice the performance gain on the Tecnick dataset [32], which has ∼1.5M pixels per image. Fig. 8 shows the rate-distortion performance of the finetuned eContextformer on the Kodak [30], CLIC2020 [31] (Professional and Mobile), and Tecnick [32] datasets. The finetuned models provide 5-11% bitrate savings over the initial training, achieving average savings of 10.9%, 12.4%, 6.9% and 12.6% over VTM 16.2 [29] on those datasets, respectively.

D. Model and Runtime Complexity
Table III shows the complexity of the eContextformer with respect to the different optimizations proposed in Section III-C. During encoding and decoding, each of the EGR and SFG methods decreases the complexity of context modeling by 10-13%, whereas their combination with the caching of key-value pairs provides an 84% complexity reduction in total. We also compared the efficiency of caching to the single pass, where the whole latent variable ŷ is processed on the encoder side at once with the given causality mask. Caching is more efficient than the single pass since only the required parts of the attention map are computed.
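The effect of key-value caching can be illustrated in isolation with plain single-head causal attention in NumPy (a simplified sketch, not the eContextformer's spatio-channel window attention; all function names are illustrative). The cached variant stores the keys and values of previously coded elements and computes only the attention row of the current element, yet reproduces the single-pass masked result exactly:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_pass_attention(Q, K, V):
    """Encoder-style single pass: process all T steps at once with a causal mask."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # mask future
    return softmax(scores) @ V

def cached_attention(Q, K, V):
    """Decoder-style pass: cache keys/values of already-coded elements and
    compute only one new attention row per autoregressive step."""
    T, d = Q.shape
    k_cache, v_cache, out = [], [], []
    for t in range(T):
        k_cache.append(K[t])          # append this step's key/value to the cache
        v_cache.append(V[t])
        w = softmax(Q[t] @ np.stack(k_cache).T / np.sqrt(d))
        out.append(w @ np.stack(v_cache))
    return np.stack(out)
```

This also illustrates why caching can beat the single pass on the encoder side: the single pass computes the full T×T score matrix and then masks half of it away, while the cached loop only ever computes the entries it needs.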
In Table IV, we compare the number of model parameters and the complexity of the entropy model (including g_c, h_a, h_s, and g_e) of our model and some of the prior art. Compared to the other transformer-based methods, the proposed window attention requires less computation than the sliding-window attention of the Contextformer and the global attention of [17]. For instance, with the proposed optimizations enabled, the eContextformer has 145x lower complexity than the Contextformer for a similar number of model parameters. The spatio-channel window attention can efficiently aggregate information from channel segments without concatenation. Therefore, our model requires a smaller and shallower entropy parameter network compared to [8], and a significantly lower total number of parameters in the entropy model compared to channel-wise autoregressive models [16].
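The aggregation over channel segments can be pictured as a reordering of the latent into spatio-channel tokens. The NumPy sketch below is an illustrative reconstruction (the function name and the exact cfo-like token ordering are assumptions, not the paper's implementation): it splits a (C, H, W) latent into channel segments and flattens them into a single token sequence that one attention mechanism can process, with no concatenation of per-segment outputs.

```python
import numpy as np

def to_spatio_channel_tokens(y, n_cs):
    """Split a (C, H, W) latent into n_cs channel segments and flatten it
    into a sequence of H*W*n_cs tokens of size C // n_cs each, so a single
    attention mechanism can attend across space and channel segments."""
    C, H, W = y.shape
    cs = C // n_cs
    # (n_cs, cs, H, W) -> (H, W, n_cs, cs): channel segment varies fastest
    segs = y.reshape(n_cs, cs, H, W).transpose(2, 3, 0, 1)
    return segs.reshape(H * W * n_cs, cs)
```

For a toy 2-channel 2×2 latent split into two segments, the resulting sequence interleaves the two segments at every spatial position.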
Table V presents the encoding and decoding runtime complexity of our model, some of the prior learning-based models, and VTM 16.2 [29]. The proposed optimizations speed up the encoding and decoding by up to 3x, providing a 210x improvement over the optimized version of the Contextformer [24]. Furthermore, we observed that the coding time scales better for the optimized eContextformer than for the one without the proposed optimizations. The 4K images have 21x more pixels than the Kodak images, while the relative encoding and decoding times of the optimized models on those images increase only 14x w.r.t. the ones on the Kodak dataset. Moreover, our optimized model provides runtime performance competitive with Minnen&Singh [16] and VTM 16.2 [29].

E. Online Rate-Distortion Optimization
Since the optimized eContextformer has a significantly low model and runtime complexity on the encoder side, we incorporated an oRDO technique to further improve the compression efficiency. As described in [62], we took the algorithm of [33], replaced the noisy quantization with the straight-through estimator [63], and used an exponential decay for scheduling the learning rate, which results in a simplified oRDO with faster convergence. The learning rate at the n-th iteration can be defined in closed form as αn = α0·γ^n, where α0 and γ are the initial learning rate and the decay rate, respectively. Fig. 9a illustrates the instantaneous learning rate αn w.r.t. the oRDO steps for different combinations of (α0, γ).
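The exponential decay schedule, together with the αn > 10^-7 constraint used in the parameter search, can be sketched as follows (a minimal illustration; the function name and the placement of the stopping rule are assumptions, not the paper's implementation):

```python
def lr_schedule(alpha0, gamma, min_lr=1e-7):
    """Enumerate the closed-form decay alpha_n = alpha0 * gamma**n,
    stopping once the rate would violate the alpha_n > min_lr constraint."""
    rates = []
    alpha = alpha0
    while alpha > min_lr:
        rates.append(alpha)
        alpha *= gamma  # exponential decay per oRDO iteration
    return rates
```

Under these assumptions, the schedule length (i.e., the number of oRDO iterations) is fully determined by the pair (α0, γ) and the constraint.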
We searched for the optimal parameters using the Tree-structured Parzen Estimator (TPE) [64] with the objective function (3a) and the constraints αn > 10^-7, α0 ∈ [0.02, 0.08], and γ ∈ [0.5, 0.7]. To avoid over-fitting, we used 20 full-resolution images randomly sampled from the COCO 2017 dataset [55] for the search. Figs. 9b and 9c show the results of the TPE [64] for the models trained with λ1 and λ4, respectively. We observed that the higher-bitrate models (λn, n>3) generally perform better with a higher initial learning rate than the ones trained for lower bitrates (λn, n<4). This suggests that the optimization of a less sparse latent space requires a larger step size at each iteration. We set α0=0.062 and γ=0.72, which results in 26 iteration steps for all models. Figs. 6 and 8 show the rate-distortion performance of the eContextformer with oRDO using this parameter setting. The oRDO increases the rate-distortion performance of the finetuned models by up to 4%, providing 14.7%, 14.1%, 10.3%, and 17.1% average bitrate savings over VTM 16.2 [29] on the Kodak [30], CLIC2020 [31] (Professional and Mobile), and Tecnick [32] datasets, respectively. The encoding runtime complexity is proportional to the number of optimization steps used. For the selected number of steps, the total encoding runtime of our model is still lower than that of VTM 16.2 [29]. Furthermore, we observed that the oRDO increases the performance of our initial models (without the finetuning) by up to 7%, which indicates that those models were trained sub-optimally.

V. CONCLUSION
This work introduces the eContextformer - an efficient and fast upgrade to the Contextformer. We conducted extensive experiments to arrive at a fast and low-complexity context model while presenting state-of-the-art results. Notably, the algorithmic optimizations we provide further reduce the complexity by 84%. Aiming to close the amortization gap, we also experimented with an encoder-side iterative algorithm. It further improves the rate-distortion performance while still having lower complexity than the state-of-the-art video compression standard. Undoubtedly, there are more advanced compression algorithms yet to be discovered that employ better non-linear transforms and provide a more energy-compacted latent space. This work focuses on providing an efficient context model architecture and defers such improved transforms to future work.

Fig. 1 .
Fig. 1. Illustration of the context modeling process, where the symbol probability of the current latent variable ( ) is estimated by aggregating the information of the latent variables ( ). The previously decoded latent elements not participating in context modeling and the yet-to-be-coded elements are depicted as ( ) and ( ), respectively. The illustrated context models are (a) the model with 2D masked convolutions [5], [11], (b) the model with 3D masked convolutions [6], [10], (c) the channel-wise autoregressive model [16], (d) the simple attention-based model [23], and (e-f) the Contextformer with sfo and cfo coding modes [24], respectively.

Fig. 2 .
Fig. 2. Illustration of different parallelization techniques for context modeling: (a) patch-wise grouping [14], (b) checkered grouping [15], (c) channel-wise grouping [16], and (d-e) the combination of checkered and channel-wise grouping with sfo and cfo coding, respectively. All latent elements within the same group (depicted with the same color) are coded simultaneously, while the context model aggregates the information from the previously coded groups. For instance, [14], [15] use 2D masked convolutions in the context model, and [16] applies multiple CNNs to channel-wise concatenated groups. The context model of [22] combines the techniques of [15], [16] and can be illustrated as in (d). Our proposed model (eContextformer), as well as the experimental model (pContextformer), use the parallelization techniques depicted in (d-e). However, our models employ spatio-channel attention in context modeling and do not require additional networks for channel-wise concatenation.

Fig. 4.
Fig. 4. Illustration of our compression framework utilizing the eContextformer with window and shifted-window spatio-channel attention. The segment generator splits the latent into Ncs channel segments for further processing. Following our previous work [24], the output of the hyperdecoder is not segmented but repeated along the channel dimension to include more channel-wise local neighbors for entropy modeling.

Fig. 9 .
Fig. 9. Illustration of (a) the learning rate decay used for the oRDO w.r.t. the optimization iteration for different initial learning rates α0 and decay rates γ, and (b-c) the results of the TPE for our model trained with λ1 and λ4, respectively, for different combinations of (α0, γ). The size of each square symbolizes the required number of oRDO iteration steps.
The high complexity of the Contextformer stems from (1) the fairly large kernel size of the attention mechanism; (2) a discrepancy between train and test time behavior; and (3) a large number of autoregressive steps. For an input image x ∈ R^(H×W×3) with a latent representation ŷ ∈ R^(H×W×C) (with height H, width W, and number of channels C), the complexity of the Contextformer's attention mechanism can be expressed in the number of MAC operations.

TABLE III. ABLATION STUDY OF THE PROPOSED OPTIMIZATION METHODS APPLIED

TABLE IV. NUMBER OF PARAMETERS AND ENTROPY MODEL COMPLEXITY OF OUR MODEL. *The complexity is measured on 4K images with 3840×2160 resolution.

TABLE V. ENCODING AND DECODING TIME OF OUR MODEL COMPARED TO VARIOUS LEARNING-BASED AND CLASSICAL CODECS