Learned Wavelet Video Coding using Motion Compensated Temporal Filtering

We present an end-to-end trainable wavelet video coder based on motion-compensated temporal filtering (MCTF). Thereby, we introduce a different coding scheme for learned video compression, which is currently dominated by residual and conditional coding approaches. By performing discrete wavelet transforms in the temporal, horizontal, and vertical dimensions, we obtain an explainable framework with spatial and temporal scalability. We focus on investigating a novel trainable MCTF module that is implemented using the lifting scheme. We show how multiple temporal decomposition levels in MCTF can be considered during training and how larger temporal displacements due to the MCTF coding order can be handled. Further, we present a content-adaptive extension to MCTF which adapts to different motion strengths during inference. In our experiments, we compare our MCTF-based approach to learning-based conditional coders and traditional hybrid video coding. Especially at high rates, our approach has promising rate-distortion performance. Our method achieves average Bjøntegaard Delta savings of up to 21% over HEVC on the UVG data set and thereby outperforms state-of-the-art learned video coders.


INTRODUCTION
Following the progress of learned image compression, there have been significant advances in learned video compression. Built on learned image coders, video coding approaches exploit temporal redundancies by following two main paradigms: residual and conditional coding. Residual coders [1][2][3][4][5][6][7][8][9][10] largely take over the structure of known hybrid video coders such as VVC [11]. Using motion-compensated inter prediction, the residual between the predicted and current frame is compressed and then transmitted. Instead of transmitting a difference signal, conditional coders compress the current frame directly under the condition that both the encoder and decoder know the prediction. Since the introduction of conditional coding in learned video compression by a framework called DCVC [12], there have been several improvements of DCVC [13,14] as well as other conditional coding schemes based on a generative model [15] or transformers [16,17]. With these developments, conditional coding currently outperforms residual coders and represents the state of the art in learned video coding.

The authors gratefully acknowledge that this work has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under project number 461649014.

This paper investigates a different coding scheme visualized in Fig. 1: learned wavelet video coding. It performs a discrete wavelet transform (DWT) in temporal, horizontal, and vertical dimensions. Specifically, an end-to-end trainable wavelet video coder based on Motion-Compensated Temporal Filtering (MCTF) [18] is introduced. Traditional MCTF, as proposed by Ohm [18] and improved by Choi and Woods [19], incorporates motion compensation into the temporal wavelet transform. Until the early 2000s, MCTF-based wavelet video coding was an active research topic as a scalable alternative to predictive transform coders. With the success of the video coding standard H.264/AVC [20], hybrid video coding approaches have dominated the field. Transform coding has emerged as the predominant principle in learned image and video compression. Here, the foundation of most popular coders [21][22][23] is based on nonlinear transform coding [24].
Recently, employing a learned spatial wavelet transform for end-to-end image compression has shown great potential by achieving state-of-the-art performance [25]. Motivated by this emerging topic of trainable wavelet transforms for compression, the novel learned MCTF video coding approach is built on top of the wavelet image coder called iWave++ [25]. The MCTF video coder provides a flexible framework that supports lossless compression. In addition, MCTF enables a fully scalable video coder. Compared to other learned approaches, which usually do not support input in YUV 4:2:0 format, wavelet video coding allows arbitrary input formats.
The focus of this paper is on investigating a novel trainable MCTF module and compressing the obtained temporal subbands with the state-of-the-art wavelet image coder iWave++ [25]. The contributions of this paper are as follows:
• Introduction of the first end-to-end trainable wavelet video coding scheme. To date, there have been no learned video compression approaches based on MCTF.
• Presentation of a training strategy for multiple temporal decomposition levels in MCTF.
• Investigation of large temporal displacements due to the MCTF coding structure and a first solution for handling these cases more efficiently.
• Proposal of a content-adaptive MCTF approach that adapts to different types of motion during inference.

Learned Video Compression
DVC [1,2] was the first learning-based deep video compression framework. It follows the structure of a traditional hybrid P-frame codec but replaces its modules for motion estimation, motion vector compression, and residual compression by neural networks. With the feature-space video coding network FVC [4], the DVC framework was significantly improved by performing these operations in the feature space. The coarse-to-fine framework C2F [5] further advanced residual coding using two-stage motion compensation at different resolutions and mode prediction networks. Conditional coding can offer theoretical benefits over residual coding [26], and learning-based frameworks allow for its straightforward implementation via conditional autoencoders [27,28]. The DCVC [12] framework has attracted greater attention to conditional coding for learned video compression. Conditioning on the temporal context in the feature space [13] and an extended entropy model with a latent prior in addition to quantization at different granularities [14] made the DCVC framework reach state-of-the-art performance. Another conditional coding approach [17] follows the structure of DCVC but uses a transformer-based entropy model. There are also frameworks based on augmented normalizing flows [15] or without an explicit motion model, such as the video compression transformer VCT [16].

Wavelets for Learned Image Compression
The traditional discrete wavelet transform has desirable properties for image and video coding. Its compromise between spatial and frequency resolution fits the correlation structure of image data: edges can be coded more efficiently in the spatial domain, whereas smooth shades and regular textures can be better modeled in the frequency domain. Hence, the image compression standard JPEG2000 [29] and the Dirac video coder [30] employ a DWT as an alternative to the discrete cosine transform. The coders rely on the lifting scheme [31] for a fast and efficient implementation of the DWT. With the lifting structure, the DWT can be performed in place by factoring its calculation into multiple lifting steps. At the same time, the lifting structure allows the construction of new wavelet filters, so-called second-generation wavelets. Moreover, the lifting scheme is a reversible structure and is thus well suited for realizing lossless transforms that can be incorporated into learning-based frameworks. Convolutional Neural Networks (CNNs) allow the optimization of wavelet transforms based on a set of training images [32]. Such a learned wavelet transform implemented via the lifting scheme has been shown to outperform the wavelet filters used by JPEG2000. The learned wavelet transform forms the basis of the end-to-end trainable wavelet image coder iWave++ [25]. An overview of iWave++ is shown in Fig. 2.
First, the encoder performs a CNN-based DWT with four decomposition levels. The obtained tree-structured subbands constitute a hierarchical representation at different resolutions. For transmission, the wavelet coefficients are quantized using scalar quantization with a trainable parameter. Subsequently, a CNN-based context model estimates the entropy parameters of a Gaussian mixture model employed for adaptive arithmetic coding. The context model exploits correlations within the current subband to be coded and across subbands from different decomposition levels. After an inverse discrete wavelet transform (IDWT) is performed by the decoder, a post-processing module compensates for quantization artifacts. Learned wavelet image compression provides a flexible framework. A 3D version of iWave++ [33] has been employed for lossless and lossy medical image compression, that is, for coding 3D volume data without temporal information. An extension through an affine wavelet transform module further improved volumetric image compression performance [34]. The low- and highpass subbands obtained from the lifting scheme are re-scaled by an affine map computed based on the output of the prediction and update filters.
Dong et al. [35] proposed a partly trainable wavelet video coder that follows a "t+2D" decomposition structure. First, they perform a temporal wavelet transform taken from [36]. Afterwards, they code the obtained temporal subbands using a trainable entropy parameter estimation module that largely takes over the structure of iWave++. In addition, Dong et al. enabled quality scalability via bitplane coding. This paper instead focuses on a trainable temporal wavelet transform to obtain a fully CNN-based wavelet video coder.

LEARNED WAVELET VIDEO CODING
In the following section, the end-to-end trainable wavelet video coding scheme is introduced. Fig. 3 provides an overview of the proposed approach. The temporal wavelet transform realized via MCTF provides temporal scalability.
The obtained temporal low- and highpass subbands are coded using dedicated iWave++ [25] image compression models. Their spatial 2D wavelet transform yields spatial scalability.
First, the concept of wavelet video coding for one temporal decomposition level, that is, for coding two frames, is explained. Subsequently, multiple temporal decomposition levels are discussed in Section 4.

Lifting Scheme
The lifting structure [31] provides a flexible and efficient implementation of the DWT. The temporal lifting scheme illustrated in Fig. 3 consists of the three steps split, predict, and update. In the first step, the input video sequence $f$ is split into even- and odd-indexed frames $f_{2t}$ and $f_{2t+1}$. In the next step, the odd frames are predicted from the even frames with the prediction operator $P$. A temporal highpass subband $\mathrm{HP}_t$ is obtained as $\mathrm{HP}_t = f_{2t+1} - P(f_{2t})$. Subsequently, an update step is performed according to $\mathrm{LP}_t = f_{2t} + U(\mathrm{HP}_t)$, resulting in a temporal lowpass subband $\mathrm{LP}_t$. The inverse lifting scheme is obtained by reversing the order of the operations and inverting the signs. Rounding the output of the prediction and update operators yields an integer-to-integer temporal DWT required for lossless reconstruction [37].
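The split, predict, and update steps and their inversion can be sketched in a few lines of Python. This is a minimal illustration rather than the trained module: frames are flat lists of integers, and the Haar-type operators `haar_predict` (identity) and `haar_update` (half of the highpass) are assumptions chosen only to make the example concrete. All function names are hypothetical.

```python
import math

def _round(values):
    # Round operator outputs to integers (integer-to-integer lifting).
    return [math.floor(v + 0.5) for v in values]

def lift_forward(f_even, f_odd, predict, update):
    # Predict step: HP_t = f_{2t+1} - round(P(f_{2t}))
    hp = [o - p for o, p in zip(f_odd, _round(predict(f_even)))]
    # Update step: LP_t = f_{2t} + round(U(HP_t))
    lp = [e + u for e, u in zip(f_even, _round(update(hp)))]
    return lp, hp

def lift_inverse(lp, hp, predict, update):
    # Reverse the order of the steps and invert the signs.
    f_even = [l - u for l, u in zip(lp, _round(update(hp)))]
    f_odd = [h + p for h, p in zip(hp, _round(predict(f_even)))]
    return f_even, f_odd

# Assumed Haar-type operators, not the learned filters of the paper.
haar_predict = lambda frame: list(frame)
haar_update = lambda hp: [0.5 * v for v in hp]

f_even = [10, 12, 200, 7]
f_odd = [11, 12, 190, 9]
lp, hp = lift_forward(f_even, f_odd, haar_predict, haar_update)
rec_even, rec_odd = lift_inverse(lp, hp, haar_predict, haar_update)
```

Because the inverse recomputes exactly the same rounded operator outputs as the forward transform, the reconstruction is lossless for any choice of `predict` and `update`, which is the property that makes the lifting structure attractive for trainable filters.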

Prediction and Update Filters
Fig. 4 illustrates the detailed structure of the prediction and update filters. For the prediction step, motion estimation between the even and odd frames $f_{2t}$ and $f_{2t+1}$ is performed to obtain the motion vectors at time instance $t$. The motion vectors are employed for motion compensation, followed by a denoising filtering module (DN). In the update step, the motion vectors are inverted to perform inverse motion compensation ($\mathrm{MC}^{-1}$) followed by another denoising module. Due to the update step, the even frame is effectively lowpass filtered along the motion trajectory. The temporal lowpass filtering separates noise from content over time.
Applying a denoising filter after forward and inverse motion compensation has been shown to improve compression efficiency in scalable lossless wavelet coding of dynamic CT data [38]. This paper follows the same processing order and structure but uses trainable denoising filters, allowing for flexibility during training. The denoising filters have the same residual filter structure as the prediction and update filters of the CNN-based spatial DWT in iWave++ [25].
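To illustrate how motion compensation enters the predict step and inverse motion compensation enters the update step, the following toy sketch uses a 1-D "frame" and a single global integer displacement. The circular `shift` is only a stand-in for warping with a dense optical flow field, the denoising modules are omitted, and the Haar-type update weight of one half is an assumption. All names are hypothetical.

```python
def shift(frame, dx):
    # Circularly shift a 1-D frame by dx samples (toy motion compensation).
    n = len(frame)
    return [frame[(i - dx) % n] for i in range(n)]

def mctf_forward(f_even, f_odd, mv):
    pred = shift(f_even, mv)                       # MC: warp even frame towards odd frame
    hp = [o - p for o, p in zip(f_odd, pred)]      # temporal highpass
    upd = shift(hp, -mv)                           # MC^{-1}: warp highpass back with inverted motion
    lp = [e + u // 2 for e, u in zip(f_even, upd)] # temporal lowpass with floor(u / 2)
    return lp, hp

def mctf_inverse(lp, hp, mv):
    upd = shift(hp, -mv)
    f_even = [l - u // 2 for l, u in zip(lp, upd)]
    pred = shift(f_even, mv)
    f_odd = [h + p for h, p in zip(hp, pred)]
    return f_even, f_odd

f_even = [0, 0, 5, 9, 5, 0, 0, 0]
f_odd = shift(f_even, 2)                           # same content, moved by two samples

lp_good, hp_good = mctf_forward(f_even, f_odd, mv=2)  # correct motion
lp_bad, hp_bad = mctf_forward(f_even, f_odd, mv=0)    # no motion compensation
```

With the correct displacement the highpass subband is zero everywhere, whereas without motion compensation it carries the full frame difference. In both cases the inverse transform reconstructs the input exactly, because the decoder applies the same motion vectors.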

Motion Estimation and Motion Vector Compression
The approaches for motion estimation and motion vector coding follow the state-of-the-art learned video coder DCVC-HEM [14]. During motion estimation, a dense optical flow field is estimated using a Spatial Pyramid Network (SPyNet) [39]. With six pyramid levels, the input of SPyNet is 6× downsampled. At every pyramid level, a network computes the residual flow based on the upsampled flow from the preceding level, and thus deals with relatively small motion.
To code the motion vectors obtained from SPyNet, a motion vector encoder computes a 64-channel latent representation with a 16× downscaled spatial resolution. The latents are discretized using multi-granularity quantization. The entropy model uses a hyper prior and a dual spatial prior. The latter is a two-step coding approach that exploits channel redundancies, which allows parallelization, in contrast to an autoregressive prior. The latent prior employed by DCVC-HEM conditions the entropy model on previously coded motion vector latents and is omitted for the MCTF coder. Because training is performed using only two frames, as detailed in Section 4.2, only one motion vector latent is available. For more details on motion vector compression, please refer to [14].

DYADIC TEMPORAL DECOMPOSITION
A dyadic decomposition [29] recursively applies a wavelet transform in the temporal direction to the lowpass of the previous decomposition stage. Thus, different temporal resolutions are obtained at each decomposition level for temporal scalability. With the dyadic decomposition structure, the number of frames contained in a group of pictures (GOP) is a power of two. This paper investigates GOPs containing up to eight frames.

Coding Order and Temporal Scalability
Fig. 5. Coding order for a GOP size of 8. The temporal lowpass and highpass subbands are denoted as $l_{j,t}$ and $h_{j,t}$. The gray frames are coded from temporal decomposition level 1 to 3.

Motion estimation is performed on the original frames instead of the decoded frames. In the first temporal decomposition level, the operator $P_1$ predicts all odd-indexed frames from the respective preceding frame. The resulting four temporal highpass frames $h_{1,t}$ and their corresponding motion vectors can be directly coded. Next, the temporal lowpass frames are obtained from the update operation $U_1$, which receives the highpass frames as input. After the first temporal decomposition level, there are four temporal lowpass frames. MCTF repeats this decomposition in the temporal direction until only the single temporal lowpass frame $l_{3,0}$ in decomposition level $j = 3$ is left. Overall, the highpass frames $h_{1,t}$ from the first decomposition level can be coded first, followed by the highpass frames from the deeper temporal levels. Finally, the lowpass frame $l_{3,0}$ from the deepest temporal decomposition level is transmitted.
Note that the distance $d$ between frames in the temporal direction increases with every temporal decomposition level $j$ according to $d = 2^{j-1}$. Hence, the frame distance $d$ is equal to 4 in temporal decomposition level $j = 3$. This is disadvantageous in terms of rate-distortion performance compared with regular P frame coding with a frame distance of $d = 1$ for every P frame. However, MCTF has the benefit of providing temporal scalability: the lowpass subbands are similar to the original sequence and therefore correspond to a Base Layer (BL). The highpass subbands contain residual information that serves as an Enhancement Layer (EL). The further the input video sequence is decomposed in the temporal direction, the more ELs are available. For a GOP size of 8, there are three ELs as indicated in Fig. 5. Owing to the different temporal decomposition levels, dedicated MCTF filtering, motion estimation, and motion vector compression networks for each temporal decomposition level are beneficial. The benefits of the different MCTF stages are evaluated in Section 5.2.2. On the decoder side, the inverse MCTF is performed by reversing the order of the prediction and update filters.
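The dyadic structure described above can be enumerated programmatically. The sketch below lists, for each temporal decomposition level $j$, the frame distance $d = 2^{j-1}$ and the (reference, predicted) frame index pairs, under the assumption that the second index of $h_{j,t}$ refers to the position of the frame in the original GOP. The function name `mctf_schedule` is hypothetical.

```python
def mctf_schedule(gop_size):
    """Return a list of (level j, frame distance d, [(reference, predicted), ...])."""
    if gop_size < 2 or gop_size & (gop_size - 1):
        raise ValueError("GOP size must be a power of two >= 2")
    levels = []
    j, d = 1, 1
    while d < gop_size:
        # At level j, every second remaining lowpass frame is predicted
        # from the lowpass frame located d positions earlier.
        pairs = [(t, t + d) for t in range(0, gop_size, 2 * d)]
        levels.append((j, d, pairs))
        j += 1
        d *= 2
    return levels

for j, d, pairs in mctf_schedule(8):
    print(f"level {j}: distance {d}, highpass frames at {[p for _, p in pairs]}")
```

For a GOP size of 8 this yields four, two, and one highpass frames at levels 1, 2, and 3, that is, seven highpass frames plus the single remaining lowpass frame.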

Training Strategy and Loss
This paper adopts a multi-stage training strategy, of which Table 1 provides an overview. During the entire training procedure, each training sample consists of two frames. In the first part, a single MCTF stage is trained (training stages 1-3), and more stages are added in the second training part (training stages 4-5) depending on the GOP size. Thus, dedicated models for different GOP sizes are trained to consider the varying number of temporal decomposition levels. The two iWave++ models employed for coding the temporal lowpass and highpass subbands are initialized with models pretrained on image data.

First training part: Single MCTF stage
During the first two training stages, only the network components for MCTF are trainable. They consist of motion estimation, motion vector compression, and DN modules. In the first stage, the loss is the distortion $D_{\mathrm{ME}}$ between the frame to be predicted $f_{2t+1}$ and the prediction $P_1(f_{2t})$. The second stage additionally considers the rate $R_{\mathrm{MV}}$ required for motion vector coding. In the next stage, the entire network is trainable and the loss is the regular rate-distortion loss.
The full rate-distortion loss for two frames combines, for each frame, a rate term $R_{\mathrm{all},i}$ and a distortion term, where $i$ denotes the frame number and the distortion term corresponds to the Mean Squared Error (MSE) between the original frame $f_i$ and the reconstructed frame $\hat{f}_i$. $R_{\mathrm{all},i}$ consists of the rate required to code the temporal subbands using an iWave++ model. If the corresponding frame $i$ is coded as a temporal highpass subband, $R_{\mathrm{all},i}$ also includes $R_{\mathrm{MV}}$. This paper considers lossy compression, where the only information loss stems from the scalar quantization operation of the iWave++ models.
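One way to write this loss that is consistent with the definitions above is sketched below. The summation over the two frames $i \in \{0, 1\}$, the placement of the tradeoff parameter $\lambda$ in front of the distortion (suggested by the smallest $\lambda$ corresponding to the lowest rate point), and the symbol $\mathcal{L}$ are assumptions rather than the exact expression used for training.

```latex
\mathcal{L} = \sum_{i=0}^{1} \Big( R_{\mathrm{all},i} + \lambda \cdot \mathrm{MSE}\big(f_i, \hat{f}_i\big) \Big)
```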

Second training part: Multiple MCTF stages
To account for multiple temporal decomposition levels, multiple MCTF networks are used, where the additional MCTF stages are initialized with the parameters of the already available MCTF stage. For a GOP size of 4 with two temporal decomposition levels, two MCTF stages and a maximum frame distance $d_{\max}$ of two are used. For every batch element, a random frame distance between one and $d_{\max}$ is selected. Depending on the frame distance, a different MCTF stage with different networks is chosen. Thus, for a GOP size of 4, training randomly alternates between optimizing the first MCTF stage with a frame distance of one and the second MCTF stage with a frame distance of two. Thereby, the different MCTF stages share the iWave++ models employed for coding the temporal lowpass and highpass subbands.
In the last two training stages, again only the MCTF components are trained first and then all network modules are jointly optimized, as can be seen in Table 1. To consider inverse MCTF for multiple decomposition levels during training, experiments were conducted using four frames per batch element. However, they showed that training becomes unstable, and the final rate-distortion performance is significantly worse than training with two frames and one temporal level.
For a GOP size of 8, the number of MCTF stages is increased from two to three. The maximum frame distance $d_{\max}$ in the last two stages (see Table 1) is set to 4 to account for the GOP structure shown in Fig. 5.
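The random choice of frame distance and MCTF stage per batch element might look as follows. This is a sketch under the assumption that only the dyadic distances 1, 2, 4, ... up to $d_{\max}$ are drawn, so that each distance maps to exactly one stage. The text only states that the distance lies between one and $d_{\max}$, and the function name is hypothetical.

```python
import random

def sample_training_stage(d_max, rng=random):
    """Draw a frame distance and the index of the MCTF stage trained with it."""
    distances = []
    d = 1
    while d <= d_max:
        distances.append(d)   # dyadic distances 1, 2, 4, ...
        d *= 2
    distance = rng.choice(distances)
    stage = distance.bit_length()  # 1 -> stage 1, 2 -> stage 2, 4 -> stage 3
    return distance, stage

rng = random.Random(0)
drawn = {sample_training_stage(4, rng) for _ in range(200)}
```

For `d_max = 4` (GOP size 8) the draws cover the three pairs (1, 1), (2, 2), and (4, 3). For `d_max = 2` (GOP size 4) only the first two stages are selected.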

Downsampling Strategy for Temporal Displacements in MCTF
The larger the temporal decomposition level, the larger the temporal distance between the frames in the original sequence (see Section 4.1). Therefore, considerably larger temporal displacements are possible. If the motion is too strong for the motion estimation network to predict accurately, prediction errors can lead to ghosting and error propagation across decomposition levels.
The reconstructed motion vectors $\mathrm{MV}_t$ obtained from motion vector compression are upscaled before being used for forward and inverse motion compensation.
To address larger motion, computing and transmitting motion vectors at a lower spatial resolution for temporal decomposition levels larger than one is proposed. Specifically, the current frame and reference frame are downscaled by a factor of two before motion estimation for every temporal decomposition level $j > 1$, as illustrated in Fig. 6. Hence, the motion vectors are coded at lower resolution and upscaled after the motion vector decoder. Bilinear interpolation is used for both down- and upsampling. The upscaled motion vectors are then used for the forward and inverse MCTF. The proposed downsampling strategy (MCTF-DS) does not require additional training, and its benefits are evaluated in Section 5.2.2.
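The data flow of the downsampling strategy can be sketched in plain Python. The bilinear resampling uses a half-pixel sampling convention, which is an assumption. Multiplying the upscaled motion vectors by the scale factor, so that they are expressed in full-resolution pixels, is also an assumption that the text does not state. `estimate_flow` is a hypothetical stand-in for motion estimation followed by motion vector coding and decoding.

```python
def resize_bilinear(img, out_h, out_w):
    """Bilinearly resample a 2-D list of floats to out_h x out_w."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for y in range(out_h):
        sy = min(max((y + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for x in range(out_w):
            sx = min(max((x + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bot = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bot * wy)
        out.append(row)
    return out

def motion_for_level(cur, ref, estimate_flow, level):
    """Return full-resolution flow (flow_x, flow_y); levels > 1 use half resolution."""
    h, w = len(cur), len(cur[0])
    if level == 1:
        return estimate_flow(cur, ref)
    cur_s = resize_bilinear(cur, h // 2, w // 2)
    ref_s = resize_bilinear(ref, h // 2, w // 2)
    flow_x, flow_y = estimate_flow(cur_s, ref_s)
    # Upscale each component and rescale vector lengths by 2 (assumption).
    up = lambda comp: [[2.0 * v for v in row] for row in resize_bilinear(comp, h, w)]
    return up(flow_x), up(flow_y)

# Dummy estimator returning a constant displacement of (1.5, -0.5) pixels.
dummy = lambda a, b: ([[1.5] * len(a[0]) for _ in a], [[-0.5] * len(a[0]) for _ in a])
frame = [[0.0] * 4 for _ in range(4)]
fx, fy = motion_for_level(frame, frame, dummy, level=2)
```

The motion vector field that has to be coded at level 2 has a quarter of the samples of the full-resolution field, which is where the rate reduction comes from.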

Content-Adaptive MCTF (MCTF-CA)
The coding efficiency of MCTF is highly dependent on the motion-compensated prediction quality, as motion estimation errors propagate to higher temporal decomposition levels. Even with the downsampling strategy, the motion present in a scene can be too strong for the motion estimation network, or occluded regions can limit the prediction quality. Therefore, adaptive temporal scaling for each video sequence can lead to improved coding efficiency compared with uniform dyadic temporal decomposition by mitigating ghosting and thus error propagation. Lanz et al. [40] investigated content-adaptive wavelet lifting for scalable lossless coding of medical data by choosing the number of temporal decomposition levels based on the sequence content. This paper proposes the adoption of a content-adaptive wavelet lifting approach for the lossy wavelet video coder, which is referred to as MCTF-CA. This approach does not require additional training.
In the following, the concept of content-adaptive MCTF is explained for a GOP size of 8. During inference, the coding costs for a GOP consisting of 8 frames are optimized. As a cost criterion, the rate-distortion cost for $N = 8$ frames is evaluated, where the tradeoff parameter $\lambda$ is chosen according to the value employed for training the MCTF model. With one MCTF model trained for a GOP size of 8, different options for coding the current GOP are evaluated. Subsequently, the variant with the minimum coding cost is chosen. For example, the notation $C^{\mathrm{DS}}_{8,\mathrm{GOP4}}$ denotes the cost of coding 8 frames in smaller GOPs of size 4 with the downsampling strategy. Either a GOP of size 8 is coded, or it is split into several smaller GOPs. Here, two GOPs of size 4 or four GOPs of size 2 are possible. In addition, it is decided whether to use the downsampling strategy or not.
In total, five options are considered for coding a GOP with 8 frames. The choice for each GOP needs to be transmitted to the decoder side. However, the overhead of transmitting three bits per eight frames is negligible. Hence, binary encoding is used to signal the content-adaptive choice for a coding unit with eight frames.
For a GOP size of 4, there are three options: two GOPs of size 2 and one GOP of size 4, with or without downsampling.
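The selection among the coding options can be sketched as a small exhaustive search. The cost is assumed to be the sum of rate plus $\lambda$ times distortion over the frames of the coding unit. The option names, their order, and the resulting 3-bit codes are hypothetical, as are the toy rate and distortion values. The five options are read from the description above as GOP 8 and two GOPs of 4, each with or without downsampling, plus four GOPs of 2.

```python
OPTIONS_GOP8 = ["GOP8", "GOP8-DS", "2xGOP4", "2xGOP4-DS", "4xGOP2"]

def rd_cost(rates, distortions, lam):
    """Rate-distortion cost summed over the frames of one coding unit (assumed form)."""
    return sum(r + lam * d for r, d in zip(rates, distortions))

def choose_option(results, lam):
    """results maps an option name to (per-frame rates, per-frame distortions)."""
    costs = {name: rd_cost(r, d, lam) for name, (r, d) in results.items()}
    best = min(OPTIONS_GOP8, key=lambda name: costs[name])
    flag = format(OPTIONS_GOP8.index(best), "03b")  # three bits per eight frames
    return best, flag, costs

# Toy numbers for eight frames per option (not measured values).
toy = {
    "GOP8":      ([0.30] * 8, [2.00] * 8),
    "GOP8-DS":   ([0.28] * 8, [2.10] * 8),
    "2xGOP4":    ([0.33] * 8, [1.90] * 8),
    "2xGOP4-DS": ([0.32] * 8, [1.95] * 8),
    "4xGOP2":    ([0.40] * 8, [1.80] * 8),
}
best, flag, costs = choose_option(toy, lam=0.05)
```

Each candidate requires a full encoding pass over the eight frames, so the encoder-side effort grows with the number of options while the decoder only reads the signalled choice.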

Training details
The networks described above are implemented using the PyTorch framework. The Vimeo90K data set [41] is used for training and the batch size is set to 8. During training, patches of size 128 × 128 are cropped from the luma channel of the respective training sample, whereas no cropping is performed during inference. By choosing the rate-distortion trade-off parameter from $\lambda \in \{0.007, 0.01, 0.03, 0.05, 0.08\}$, five models are obtained for each GOP size. AdamW [42] is used as optimizer. Furthermore, the iWave++ models pretrained on luma data from [43] are used for temporal subband coding. SPyNet [39] is initialized with the "sintel-final" model trained on a synthetic data set.
As described in Section 4.2, separate models with multiple MCTF stages are trained for GOP sizes of 4 and 8, because the seven frames available in sequences from the Vimeo90K data set allow considering up to three temporal decomposition levels, that is, a maximum GOP size of 8. In line with the MCTF evaluation setup from Dong et al. [35] with a GOP size of 8, it is shown that in this setting, MCTF performs competitively with state-of-the-art coders.

Test conditions
The UVG [44] and MCL-JCV [45] data sets are used for testing. The sequences in both data sets have a resolution of 1920×1080 and are in YUV 4:2:0 format. UVG consists of 7 sequences and MCL-JCV of 30. To consider a different resolution of 1280 × 720, the JCT-VC class E data set (HEVC E) containing three YUV 4:2:0 sequences is used. The test conditions in [14] are followed by evaluating on the first 96 frames of each sequence. In addition, the evaluation includes three sequences from the UVG 4K [44] data set (CityAlley, FlowerFocus, FlowerKids) with a resolution of 3840×2160, and testing is performed on the first 24 frames. DCVC-HEM [14] (see footnote 2) and DCVC [12] (see footnote 3) are evaluated with GOP sizes of 4 and 8 for a fair comparison with the MCTF approach. Thereby, publicly available models from the authors are used, which were trained on Vimeo90K. As a traditional hybrid video coder, HM 16.25 is included. HM is used in the Lowdelay P (LD-P) configuration because the learned video coders only support unidirectional motion estimation. HM is evaluated in its default main profile with an intra period and GOP sizes of 4 and 8 as well.
The evaluation is performed in terms of RGB-PSNR, as this is common in learned video compression, and the aim of this paper is to provide comparable measurements. The MCTF approach and HM receive the input video sequence in YUV 4:2:0 format, whereas the input is converted to RGB 4:4:4, as required by DCVC-HEM and its predecessor DCVC. The wavelet video coder supports input data in YUV 4:2:0 format as well as in 4:4:4 format, because the color channels are coded independently by iWave++. The motion vectors are computed based on the luma channel. They are re-used for the chroma channels, and bilinear downsampling is performed if necessary.
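For reference, PSNR follows from the mean squared error as $10 \log_{10}(\mathrm{peak}^2 / \mathrm{MSE})$. The sketch below computes it for RGB data under the assumption that the MSE is pooled over all samples of the three color channels and that the peak value is 255 for 8-bit content. The paper does not spell out either choice, and the function names are hypothetical.

```python
import math

def psnr(mse, peak=255.0):
    """PSNR in dB for a given mean squared error and peak sample value."""
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)

def rgb_psnr(orig, recon, peak=255.0):
    """orig and recon are lists of (R, G, B) tuples; MSE is pooled over all samples."""
    n = 3 * len(orig)
    sq_err = sum((a - b) ** 2 for p, q in zip(orig, recon) for a, b in zip(p, q))
    return psnr(sq_err / n, peak)

print(round(rgb_psnr([(10, 20, 30), (40, 50, 60)], [(11, 19, 31), (41, 51, 59)]), 2))
```

An alternative convention averages the per-channel PSNR values, which generally gives a slightly different number, so the convention should match the one used by the coders being compared.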

Rate-distortion curves
The novel approach is compared to HM, the state-of-the-art learned video coder DCVC-HEM [14], and its predecessor DCVC [12]. Figs. 7-10 show the rate-distortion curves for the UVG, MCL-JCV, HEVC E, and UVG 4K data sets, respectively. The dashed lines correspond to a GOP size of 4, whereas solid lines indicate a GOP size of 8.
Clearly, the conditional coder DCVC (gray) is not competitive with the remaining video coders. The approach performs better for a smaller GOP size of 4 compared to a GOP size of 8 on two data sets, which implies an error propagation issue. Its successor, DCVC-HEM (green), on the other hand, can effectively exploit temporal redundancies for the larger GOP size of 8. DCVC-HEM outperforms HM in lower bitrate ranges on UVG and MCL-JCV, whereas HM always performs better at higher rates. The rate-distortion performance of the best-performing model, MCTF-CA (red), behaves in the opposite way: the higher the rate, the better the approach performs relative to HM. At higher rates, the model clearly outperforms HM for all data sets and GOP sizes. The performance degrades only for the MCTF-CA model at the lowest rate point ($\lambda = 0.007$) compared with the other rate points. The MCTF-CA model performs particularly well at high rates, owing to its invertible wavelet transforms. The perfect reconstruction property allows lossless compression without quantization, and therefore provides the capacity for high coding efficiency at high quality.

2 https://github.com/microsoft/DCVC/tree/main/DCVC-HEM
3 https://github.com/microsoft/DCVC/tree/main/DCVC

Bjøntegaard Delta rate
For a quantitative evaluation of the rate-distortion performance, the Bjøntegaard Delta (BD) rate savings of the learned video coders are measured using HM LD-P as an anchor. Note that the BD values need to be handled with caution because the available supporting points of DCVC-HEM and DCVC cover a limited bitrate and quality range. Thus, comparisons in terms of the BD metric can be less reliable [46], and rate-distortion curves should be considered to obtain a complete picture. Therefore, using HM as an anchor avoids comparing the two conditional coders with the proposed method directly, but still permits comparisons over different bitrate and quality ranges. To cover the entire bitrate-distortion range of the learned video coders, HM is evaluated with the Quantization Parameter (QP) values QP = {32, 27, 22, 19, 17, 15, 13}. The integration area for BD rate calculation is determined by the respective learned video coder, that is, by the minimum and maximum RGB-PSNR values obtained with the learned coder. Compared with the entire rate-distortion curve of HM, the overlap of the rate-distortion curve of DCVC-HEM with respect to the bitrate lies in the range of 14-29%. The overlap in terms of RGB-PSNR is between 42 and 46%, depending on the data set and GOP size. Comparing the overlap of the rate-distortion curves of HM and MCTF-CA, the rate overlap is between 36 and 66%, whereas the distortion overlap ranges from 70 to 94%. Hence, the MCTF models cover a larger rate-distortion range, as shown in Figs. 7-10.
Table 2 contains the BD measurements for all four data sets and for both GOP sizes. Over the entire bitrate range, DCVC-HEM performs best on the MCL-JCV data set for a GOP size of 8, achieving a BD rate reduction of approximately -4% compared to HM. In the remaining cases, MCTF-CA performs the best. It achieves BD rate savings of up to -21% and -9% on the UVG data set for GOP sizes of 4 and 8, respectively. On MCL-JCV, BD rate savings of -12% are obtained for a GOP size of 4. Furthermore, MCTF-CA achieves coding gains of -26% and -11% for GOP sizes of 4 and 8, respectively, on HEVC E. Overall, the high-resolution sequences from the UVG 4K data set are the most challenging for all learned video coders. MCTF-CA only achieves coding gains over HM for a GOP size of 4, but nevertheless performs favorably in comparison to the remaining learned coders.
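The idea behind the BD rate, namely averaging the difference in log-rate between two curves over a common quality interval, can be sketched as follows. Note that this illustration swaps in piecewise-linear interpolation and trapezoidal integration for the polynomial fit of the usual Bjøntegaard calculation, so its numbers will differ somewhat from standard tools. The integration interval is the PSNR overlap of both curves, which equals the range of the learned coder whenever the anchor covers it. All names are hypothetical.

```python
import math

def _interp(x, xs, ys):
    """Piecewise-linear interpolation of ys over strictly increasing xs."""
    for k in range(len(xs) - 1):
        if xs[k] <= x <= xs[k + 1]:
            w = (x - xs[k]) / (xs[k + 1] - xs[k])
            return ys[k] * (1.0 - w) + ys[k + 1] * w
    raise ValueError("x lies outside the interpolation range")

def bd_rate(anchor, test, samples=200):
    """anchor/test: lists of (rate, psnr) points. Returns the average rate change in percent."""
    a = sorted(anchor, key=lambda p: p[1])
    t = sorted(test, key=lambda p: p[1])
    a_q, a_r = [p[1] for p in a], [math.log10(p[0]) for p in a]
    t_q, t_r = [p[1] for p in t], [math.log10(p[0]) for p in t]
    lo, hi = max(a_q[0], t_q[0]), min(a_q[-1], t_q[-1])
    if hi <= lo:
        raise ValueError("curves do not overlap in PSNR")
    step = (hi - lo) / samples
    diffs = []
    for k in range(samples + 1):
        q = min(lo + k * step, hi)  # guard against floating-point overshoot
        diffs.append(_interp(q, t_q, t_r) - _interp(q, a_q, a_r))
    # Trapezoidal average of the log-rate difference over [lo, hi].
    mean_diff = (sum(diffs) - 0.5 * (diffs[0] + diffs[-1])) / samples
    return (10.0 ** mean_diff - 1.0) * 100.0

anchor = [(1.0, 30.0), (2.0, 35.0), (4.0, 40.0), (8.0, 45.0)]
test = [(1.0, 35.0), (2.0, 40.0)]  # half the anchor rate at equal PSNR
result = bd_rate(anchor, test)
```

In this toy case the test curve needs half the anchor rate at every quality in the overlap, so the result is a rate change of about -50%. The example also shows why a narrow overlap makes the figure less representative: only the interval from 35 to 40 dB of the anchor curve contributes.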

Per-sequence evaluation on the UVG data set
The coding performance of DCVC-HEM and MCTF-CA is assessed for each of the seven sequences in the UVG data set. Table 3 provides BD rate savings relative to HM as an anchor. Independent of the GOP size, DCVC-HEM achieves larger BD rate savings relative to HM than the proposed approach for sequences with stronger motion, namely, Jockey, ReadySteady, and YachtRide. These sequences mostly contain relatively large translational motion. In contrast, the MCTF-CA approach performs best for sequences with high spatial detail and more irregular motion. For example, the approach achieves BD rate savings of over -63% for the Beauty scene, which is challenging because of moving hair. Here, DCVC-HEM struggles and is the least efficient compared to HM. Overall, the per-sequence evaluation shows that MCTF leads to superior coding performance compared to an "IPPP..." coding order for specific scene contents.
The following example of the ShakeNDry sequence illustrates the benefits of the temporal update operation. The scene has a static background, but contains challenging motion with flying water drops. With the MCTF-CA model, the first GOP of the sequence is coded with a GOP size of 8, that is, three temporal decomposition levels. The temporal updates help improve the coding efficiency of the temporal highpass frames at higher temporal decomposition levels: the highpass frames in the first, second, and third level require 0.38 bpp, 0.27 bpp, and 0.20 bpp at approximately 42.2 dB. As shown in Fig. 11(d)-(f), the highpass $h_{3,4}$ from temporal decomposition level three contains fewer prediction errors compared to the other levels, which leads to better coding efficiency. The application of two temporal update operations (see Fig. 11(c)) creates a better representation for the prediction compared to the original frame in Fig. 11(b) through lowpass filtering along the motion trajectory.
When comparing the rate-distortion curves of MCTF-CA for every sequence of the UVG data set (cf. Fig. 12), the ShakeNDry sequence is one of the most challenging sequences next to the Beauty sequence. For this comparison, the motion-compensated prediction quality and the maximum motion vector length in pixels are averaged over all 96 evaluated frames for each sequence. These values are computed using a SPyNet model trained on Vimeo90K without considering motion vector compression. These measurements show that a high prediction quality of over 48 dB and relatively small motion (HoneyBee, Bosphorus) are associated with the best rate-distortion performance of MCTF-CA. However, a lower prediction quality and larger motion do not necessarily lead to poor rate-distortion performance; for example, MCTF-CA performs better on the Jockey sequence than on the Beauty sequence because factors such as high spatial detail contained in a sequence influence the coding efficiency as well.

Fig. 10. Rate-distortion evaluation on 3 sequences (CityAlley, FlowerFocus, FlowerKids) from the UVG 4K data set. Solid lines correspond to a GOP size of 8 and dashed lines to a GOP size of 4. Vertical axis: RGB-PSNR in dB. Curves: HM LD-P, DCVC [12], DCVC-HEM [14], MCTF-CA (proposed).

Complexity
The computational complexity of the MCTF-based approach is assessed in terms of model size and kilo multiply-accumulate operations per pixel (kMAC/px). As shown in Table 4, the MCTF-CA approach is more complex with respect to both model size and kMAC/px. Note that most of the model complexity of MCTF-CA is attributed to the temporal subband coder iWave++. For a GOP size of 8, the MCTF modules only account for 29% of the model size and 12% of the required kMAC/px. Because of the dedicated MCTF stages for every temporal decomposition level, the MCTF modules have a larger influence on the model size than on the number of MACs.
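As a rough illustration of the kMAC/px measure, the following sketch counts the multiply-accumulate operations of a single 2D convolution per input pixel. The layer sizes in the usage note are hypothetical and are not taken from the proposed model or from Table 4.

```python
def conv2d_kmac_per_pixel(c_in, c_out, kernel, stride=1):
    """kMAC per input pixel of one 2D convolution (bias and padding effects ignored)."""
    macs_per_output_position = c_in * c_out * kernel * kernel
    # A stride of s produces one output position per s*s input pixels.
    output_positions_per_input_pixel = 1.0 / (stride * stride)
    return macs_per_output_position * output_positions_per_input_pixel / 1e3

def share_percent(part, total):
    """Share of a module in the total complexity, in percent."""
    return 100.0 * part / total
```

For example, a hypothetical 3×3 convolution from 3 to 64 channels costs `conv2d_kmac_per_pixel(3, 64, 3)` kMAC/px. Normalizing by the number of pixels makes the figure independent of the input resolution, which is why it is convenient for comparing coders.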

Ablation Study: MCTF configuration
In the following section, several MCTF coder configurations are examined. In doing so, the benefits of the proposed downsampling strategy and content-adaptive MCTF approach are evaluated. The MCTF-Single configuration uses the same MCTF stage for all temporal decomposition levels. It is included in the evaluation because it corresponds to the standard approach commonly used in traditional MCTF. On both data sets, MCTF-Single results in a BD rate increase of more than 16% and 29% for GOP sizes of 4 and 8, respectively. Therefore, multiple MCTF stages are necessary to achieve improved rate-distortion performance for higher temporal decomposition levels with larger frame distances.
The impact of multiple MCTF stages on the rate-distortion curves for the UVG data set is illustrated in Fig. 13. The models with multiple MCTF stages (blue) clearly outperform a single stage (orange), independent of the GOP size.

Downsampling Strategy (MCTF-DS)
Next, the MCTF-DS approach introduced in Section 4.3 is evaluated. On average, the MCTF-DS models (gray) lead to a reduced bitrate at approximately the same quality as the baseline models (blue), as shown in Fig. 13. The bitrate savings are due to the smaller spatial resolution of the motion vectors, which requires a lower rate. At the same time, there is no significant quality degradation, and for some rate points, the quality is even slightly improved. On average, MCTF-DS leads to coding gains between 2% and 3%, measured in terms of BD rate, compared to the MCTF model with multiple stages as an anchor (cf. Table 5). However, MCTF-DS degrades the performance on the MCL-JCV data set for a GOP size of 8. Moreover, for the HoneyBee sequence with a small moving object and high spatial detail, the downsampling strategy leads to BD rate increases of 0.5% and 10% for GOP sizes of 4 and 8, respectively. This shows that although the downsampling strategy leads to improved performance for most sequences, a content-adaptive mechanism is required.
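The effect of a lower motion vector resolution can be sketched as follows. The details of the downsampling strategy are given in Section 4.3 and are not reproduced here; this example only assumes that a motion field estimated at reduced resolution is brought back to full resolution before motion compensation. The nearest-neighbor interpolation and the function name are assumptions made for illustration.

```python
import numpy as np

def upsample_flow(flow_lr, factor=2):
    """Upsample a motion field of shape (2, H, W) to (2, factor*H, factor*W).

    Displacements are measured in pixels, so the vectors themselves must be
    multiplied by the same factor as the spatial resolution.
    """
    flow_lr = np.asarray(flow_lr, dtype=np.float64)
    flow_hr = np.repeat(np.repeat(flow_lr, factor, axis=1), factor, axis=2)
    return flow_hr * factor
```

With a factor of 2, the low-resolution field contains only a quarter of the vectors of the full-resolution field, which is consistent with the lower motion vector rate observed for MCTF-DS. It also suggests why small moving objects with fine detail, as in HoneyBee, may be represented less accurately.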

Content-Adaptive MCTF (MCTF-CA)
The MCTF-CA approach explained in Section 4.4 overcomes the disadvantages of MCTF-DS for some motion types and scene contents. As can be seen in Table 5, MCTF-CA performs best on all data sets and GOP sizes. In particular, for a GOP size of 8, MCTF-CA provides average BD rate savings of at least 10% compared to the MCTF model with multiple MCTF stages as an anchor.
A detailed evaluation on every sequence of the UVG data set provided in Table 6 shows that for a GOP size of 4, MCTF-CA improves over MCTF-DS for 5 out of 7 sequences. For the remaining two sequences, MCTF-DS is already optimal. However, for a GOP size of 8, more options for MCTF-CA are available, and MCTF-DS is only optimal for the Bosphorus sequence, which contains relatively easy translational motion. For the remaining sequences, a content-adaptive approach leads to considerable improvements in terms of BD rate; for example, MCTF-CA achieves BD rate savings of 12% and 25% on the YachtRide and Jockey sequences, respectively. Furthermore, MCTF-CA prevents the use of the downsampling strategy for sequences where it degrades rate-distortion performance, for example, for the HoneyBee and Beauty sequences containing high spatial detail. Therefore, content-adaptive temporal scaling is clearly advantageous in terms of rate-distortion performance, because the motion types are highly dependent on the scene content.
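For readers unfamiliar with the BD rate figures used throughout this evaluation, the following sketch shows the classic Bjøntegaard Delta rate calculation with a cubic fit of the logarithmic rate over PSNR. It is an assumption that this variant matches the one used for the reported numbers, since the exact interpolation method is not stated in this section; the rate-distortion points in the usage note are hypothetical.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average rate difference of test vs. anchor in percent (negative = savings)."""
    psnr_anchor = np.asarray(psnr_anchor, dtype=np.float64)
    psnr_test = np.asarray(psnr_test, dtype=np.float64)
    log_rate_anchor = np.log10(np.asarray(rate_anchor, dtype=np.float64))
    log_rate_test = np.log10(np.asarray(rate_test, dtype=np.float64))

    # Cubic fit of log-rate as a function of quality for each coder.
    poly_anchor = np.polyfit(psnr_anchor, log_rate_anchor, 3)
    poly_test = np.polyfit(psnr_test, log_rate_test, 3)

    # Integrate both fits over the overlapping quality interval.
    lo = max(psnr_anchor.min(), psnr_test.min())
    hi = min(psnr_anchor.max(), psnr_test.max())
    int_anchor = np.polyval(np.polyint(poly_anchor), [lo, hi])
    int_test = np.polyval(np.polyint(poly_test), [lo, hi])
    area_anchor = int_anchor[1] - int_anchor[0]
    area_test = int_test[1] - int_test[0]

    # Mean log-rate difference, converted back to a percentage.
    avg_log_diff = (area_test - area_anchor) / (hi - lo)
    return float((10.0 ** avg_log_diff - 1.0) * 100.0)
```

If a test coder needs 20% less rate than the anchor at every quality level, `bd_rate` returns approximately -20, which corresponds to BD rate savings of 20% in the wording used here.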
Fig. 14 provides an example of the benefit of MCTF-CA: the Jockey sequence from the UVG data set contains strong motion, which leads to ghosting for some GOPs (cf. Fig. 14(c)) when processing the sequence with a uniform temporal decomposition, that is, a constant GOP size of 8 with the MCTF model. MCTF-CA adaptively chooses a smaller GOP size if ghosting increases the coding cost. As can be seen in Fig. 14(d), MCTF-CA prevents ghosting by selecting a GOP size of 2, which can be coded most efficiently.
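The selection step can be sketched as a minimum-cost decision over the available options. The definition of the coding cost is given in Section 4.4 and is not reproduced here, so the costs are treated as given numbers; the tie-breaking rule and the function name are assumptions made for this illustration.

```python
def choose_gop_size(costs):
    """Return the GOP size with the smallest coding cost.

    costs: dict mapping a candidate GOP size to its coding cost for the
    same group of frames. Ties are resolved in favor of the larger GOP
    size (an assumption, not taken from the paper).
    """
    if not costs:
        raise ValueError("at least one candidate is required")
    return min(costs, key=lambda gop: (costs[gop], -gop))
```

With the two cost values reported in the caption of Fig. 14, `choose_gop_size({8: 1.74, 2: 1.57})` selects a GOP size of 2, in line with the decision shown in Fig. 14(d).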

CONCLUSION
This paper introduced the first end-to-end trainable wavelet video coder based on MCTF. It presented a training strategy that considers multiple temporal decomposition levels during training. Moreover, a downsampling strategy was proposed as a first solution for handling larger temporal displacements in MCTF. The novel content-adaptive MCTF enables the proposed method to adapt to different motion types in each sequence. The experimental results show that the learned MCTF video coder exhibits promising rate-distortion performance, especially for higher bitrates. On the UVG data set, the MCTF-CA method achieves average BD rate savings of 21% and 9% for GOP sizes of 4 and 8, respectively, compared to HM. Thereby, it clearly outperforms the state-of-the-art video coder DCVC-HEM [14].
As an initial version of a learned wavelet video coder, the proposed approach offers various possibilities for improvement. First, one could examine a different temporal subband coder, which is required for practical usage because the autoregressive context model of iWave++ prohibits parallelization. Second, the MCTF structure requires extensions to handle more diverse motion types and GOP sizes of 16 and higher. Because the maximum frame distance doubles with every additional temporal decomposition level, motion estimation becomes considerably more challenging, for example, for a GOP size of 16 with a frame distance of 8. Therefore, bidirectional motion estimation and methods for overcoming the limitations of short-sequence training sets for larger GOP-size compression could be investigated. To mitigate ghosting for larger GOP sizes, an adaptive choice of a truncated DWT without temporal update [47] could be beneficial. Furthermore, the complexity of content-adaptive MCTF can be limited by using a predictor for choosing the adaptive MCTF option.
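The relation between GOP size, number of temporal decomposition levels, and frame distance mentioned above can be made explicit with a short sketch. It assumes a dyadic decomposition with power-of-two GOP sizes; the function names are hypothetical.

```python
def temporal_levels(gop_size):
    """Number of temporal decomposition levels for a power-of-two GOP size."""
    if gop_size < 2 or gop_size & (gop_size - 1):
        raise ValueError("GOP size must be a power of two and at least 2")
    return gop_size.bit_length() - 1

def frame_distances(gop_size):
    """Frame distance between the filtered frames at each level, lowest level first."""
    return [2 ** level for level in range(temporal_levels(gop_size))]
```

A GOP size of 8 gives three levels with frame distances 1, 2, and 4, while a GOP size of 16 adds a fourth level with a frame distance of 8.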
The MCTF-based approach provides an explainable and scalable alternative to common autoencoder-based video coders.This paper made the first steps to enable further development of this important direction of research.

Fig. 1. Schematic overview of the wavelet video coding scheme. This paper introduces a novel trainable version of the coding scheme. A temporal wavelet transform is followed by a 2D wavelet transform in horizontal and vertical dimensions. By incorporating motion compensation into the temporal wavelet transform, Motion-Compensated Temporal Filtering (MCTF) is performed.

Fig. 2. Overview of the end-to-end image compression method iWave++ [25]. x is a single luma or chroma channel of an image in the YCbCr color space. The red arrows indicate the coding order of the subbands y. Trainable modules are colored in blue. For visualization, the subbands of two decomposition levels are shown.

Fig. 3. Overview of the proposed wavelet video coding scheme for one temporal decomposition level with two frames. f denotes the input video sequence.

Fig. 4. Details on prediction and update filters. f denotes the input video sequence. The "ME" module contains motion estimation and motion vector coding. Its output $\mathrm{MV}_t$ corresponds to the decoded motion vectors at time instance $t$. MC stands for motion compensation and $\mathrm{MC}^{-1}$ for inverse motion compensation. The "DN" modules represent residual CNN-based filter operations.

Fig. 5 illustrates the coding order of MCTF for a GOP consisting of 8 frames. Because MCTF is an open-loop structure,

Fig. 8. Rate-distortion evaluation on the MCL-JCV data set. Solid lines correspond to a GOP size of 8 and dashed lines to a GOP size of 4.

Fig. 9. Rate-distortion evaluation on the HEVC E data set. Solid lines correspond to a GOP size of 8 and dashed lines to a GOP size of 4.

Fig. 11. Impact of the temporal update operation. Subfig. (a) shows the first frame of the ShakeNDry sequence from the UVG data set. (d)-(f) depict temporal highpass frames coded in different temporal decomposition levels by an MCTF-CA model (λ = 0.08, GOP size 8). The highpass frame $h_{3,4}$ from the highest temporal decomposition level has fewer prediction errors than $h_{2,2}$ and $h_{1,1}$ (black corresponds to zeros). This is because $h_{3,4}$ is predicted from $l_{2,0}$ shown in (c). Here, the application of temporal updates to $l_{2,0}$ improves the prediction and thus coding efficiency.

Fig. 12. Comparison of the rate-distortion curves of MCTF-CA (GOP size of 8) for every sequence of the UVG data set. For each sequence, the motion strength in pixels (px) and motion-compensated prediction quality in dB averaged over all frames are provided. For these measurements, the motion vectors between successive frames required for motion compensation are estimated using a SPyNet model. Thereby, the motion strength for a single frame is measured as the maximum motion vector length in horizontal or vertical direction.

Fig. 13. Rate-distortion evaluation on the UVG data set. Solid lines correspond to a GOP size of 8 and dashed lines to a GOP size of 4. MCTF-Single: Same MCTF stage for all temporal decomposition levels. MCTF: Different MCTF stages for each level. MCTF-DS: Different MCTF stages with downsampling strategy during inference. MCTF-CA: Content-adaptive MCTF. Best viewed enlarged on a screen.

Fig. 14. Content-adaptive MCTF prevents ghosting. Subfig. (a) shows the first frame of a GOP of size 8 from the Jockey sequence. The MCTF model in (c) codes the first frame as $l_{3,0}$ in the third temporal decomposition level, which leads to ghosting due to the large motion in the scene ($C_{8,\mathrm{GOP8}} = 1.74$). MCTF-CA in (d) mitigates ghosting by choosing a GOP size of 2 and transmitting the first frame as $l_{1,0}$ in the first temporal decomposition level ($C_{8,\mathrm{GOP2}} = 1.57$).

Table 1. Training schedule for a GOP size of 4/8. A training sample consists of two frames in each training stage. LR denotes the learning rate, $d_{\max}$ the maximum frame distance between two frames in a training sample, and "parts" refers to the trainable components of the network. "All" parts include the MCTF stages and the iWave++ models.

https://vcgit.hhi.fraunhofer.de/jvet/HM/-/releases/HM-16.25

Table 2. Rate-distortion evaluation on the UVG, MCL-JCV, HEVC E, and UVG 4K data sets for different GOP sizes. Average BD rate savings are provided relative to HM in LD-P configuration as an anchor.

Table 3. BD rate savings for each of the 7 UVG sequences over HM in LD-P configuration.

Table 4. Complexity comparison of learned video coders for an input size of 1920×1080 in terms of model size and kilo multiply-accumulate operations per pixel (kMAC/px).

Table 5. Rate-distortion evaluation on the UVG and MCL-JCV data sets for different GOP sizes. Average BD rate savings are provided relative to the baseline MCTF model as an anchor.

Table 6. BD rate savings for each of the 7 UVG sequences over the baseline MCTF model as an anchor.