Spatio-Temporal Convolutional Neural Network for Enhanced Inter Prediction in Video Coding

This paper presents a convolutional neural network (CNN)-based enhancement to inter prediction in Versatile Video Coding (VVC). Our approach aims at improving the prediction signal of inter blocks with a residual CNN that incorporates spatial and temporal reference samples. It is motivated by the theoretical consideration that neural network-based methods have a higher degree of signal adaptivity than conventional signal processing methods and that spatially neighboring reference samples have the potential to improve the prediction signal by adapting it to the reconstructed signal in its immediate vicinity. We show that adding a polyphase decomposition stage to the CNN results in a significantly better trade-off between computational complexity and coding performance. Incorporating spatial reference samples in the inter prediction process is challenging: The fact that the input of the CNN for one block may depend on the output of the CNN for preceding blocks prohibits parallel processing. We solve this by introducing a novel signal plane that contains specifically constrained reference samples, enabling parallel decoding while maintaining a high compression efficiency. Overall, experimental results show average bit rate savings of 4.07% and 3.47% for the random access (RA) and low-delay B (LB) configurations of the JVET common test conditions, respectively.

Temporal redundancy is exploited by motion-compensated (or inter) prediction, which is a core component of all video coding standards. In the evolution of these standards, ranging from H.261 [1], first ratified in November 1988, to Versatile Video Coding (VVC) [2], [3], inter prediction has been enhanced in many ways. Typically, these enhancements aimed at improving the motion-compensated prediction signal and thus increasing the overall coding performance. For example, it is a well-established finding that superimposing two individual prediction signals reduces the variance of the resulting prediction error [4]. Thus, a simple averaging of the two predictors has been used since the introduction of the MPEG-1 standard [5] in 1991. The H.264/AVC video coding standard [6] introduced so-called weighted prediction, where a weighting factor can be transmitted at slice level for each reference picture. A significant improvement of inter prediction is achieved by variable block size motion compensation, which was first standardized in the Advanced Prediction mode of H.263 [7] and later refined in subsequent video coding standards.
In the current state-of-the-art standard VVC, several further enhancements to inter prediction have been integrated. The simple averaging in bi-prediction can be replaced by Bi-prediction with CU Weights (BCW) [8]. For block-based bi-prediction, there is a sample-wise refinement called Bi-Directional Optical Flow (BDOF) [9], [10]. Furthermore, there is subblock-based inter prediction, where an individual motion vector is derived for each subblock. This includes Subblock-based Temporal Motion Vector Prediction (SbTMVP) [10], Decoder-side Motion Vector Refinement (DMVR) [10], and Affine Motion Compensation (AMC) [10]. For the latter, there is also a sample-wise refinement called Prediction Refinement with Optical Flow (PROF) [10]. Moreover, the Geometric Partitioning Mode (GPM) [8] adds support for non-rectangular partitions. In order to jointly exploit temporal and spatial redundancies, VVC introduces Combined Inter/Intra Prediction (CIIP) [8], which additionally uses adjacent samples from neighboring blocks.
During the development of VVC, another method that incorporates spatially neighboring samples into a temporally predicted block has been studied in detail. With this method, known as Local Illumination Compensation (LIC) [11], [12], a scale and an offset value are derived at the decoder to adjust the luminance of an inter prediction block to that of the top and left neighboring reconstructed samples. However, LIC has not become part of VVC.
In this paper, based on our previous work [13], a spatio-temporal residual network (STRN) is proposed. It is integrated into the VVC test model (VTM), the reference software of the VVC standard. The main idea of STRN is to refine the inter prediction signal without any additional signaling, by using a convolutional neural network (CNN) that incorporates information from spatially neighboring blocks. This is motivated by the following considerations. Firstly, incorporating spatially neighboring reference samples has the potential to improve the prediction signal by adapting it to the reconstructed signal in its immediate vicinity. It is a well-established finding that for motion-compensated prediction, the prediction error tends to be higher at the border than within the center region of the prediction block [14]. By exploiting the statistical dependencies between neighboring reconstructed samples and the current samples, the prediction error can be reduced for the top and left border areas of the block. Secondly, neural network-based methods offer a higher degree of signal adaptivity than conventional signal processing methods. This adaptivity can be attributed to the fact that multilayer feedforward networks are universal approximators [15], which has recently been shown to also hold for deep CNNs [16]. Given this background, the main contributions of our work are summarized as follows:
• We add a polyphase decomposition stage to the CNN. It is shown that this decomposition of the input video representation results in a significantly better trade-off between computational complexity and coding performance.
• We introduce an additional signal plane for decoupling the CNN refinement from the intra decoding loop. This novel signal plane contains constrained spatial reference samples, allowing parallel application of the CNN for all blocks within one picture at the decoder, independent of intra predicted blocks. Our solution has the advantage that it enables parallel instead of sequential processing without significantly impairing the coding performance.
• We propose a constraint that largely prevents gradual signal degradation for low-delay prediction structures. It is found that for long prediction chains, repeated application of the CNN can in some cases have a negative impact on the compression efficiency. Our solution mitigates this problem without degrading the random access (RA) coding performance.
Beyond these contributions, STRN stands out from most related work by being based on VVC and by using a single CNN model for all block shapes and quantization parameter (QP) values.
The remainder of this paper is organized as follows. Related work is reviewed in Section II. The proposed network is presented and detailed in Section III. The integration of STRN into VVC is described in Section IV. Experimental results are shown and discussed in Section V. Finally, Section VI concludes the paper.

II. RELATED WORK
For many image processing tasks, in particular those commonly subsumed under the term computer vision, approaches based on deep learning have been successfully applied in recent years. A particularly important class of such approaches are convolutional neural networks (CNNs). One of the earliest CNNs was the so-called LeNet, initially proposed by Y. LeCun in 1989 for the automated recognition of ZIP code numbers [17]. In the following decades, CNNs have been applied to various image processing tasks, such as object recognition, picture classification and segmentation, image restoration and denoising, and many others.
In recent years, CNNs have also been proposed for video coding. Here, two different categories have to be distinguished. The first category comprises so-called end-to-end optimized compression methods like [18], [19], and [20], where the classical architecture of a hybrid video codec is replaced by a combination of encoder and decoder networks that are jointly optimized according to a common rate-distortion loss function. In the second category, the basic framework of a conventional hybrid video codec is kept, but a neural network is used for specific coding tools, like interpolation filtering [21], [22], intra prediction [23], [24], quantization [25], or loop filtering [26], [27], [28]. Our proposed method belongs to the second category, and related work is discussed in more detail below, with a focus on inter prediction. An overview of various approaches to neural network-based video compression can be found in [29] and [30].
In [31], Huo et al. propose a CNN-based motion compensation refinement (CNNMCR) scheme. There are two variants of CNNMCR: In the simple variant, the inter prediction signal is fed into a CNN, and the output of the CNN is the refined prediction signal. In the extended variant, an enlarged block, which also includes already reconstructed neighboring samples, is used as the input of the CNN. For each QP, a distinct model is trained.
In [32], Wang et al. describe a neural network-based inter prediction (NNIP) algorithm employing a combination of a fully connected network (FCN) and a CNN. Similar to [31], the output of the networks is the refined inter prediction signal, and reconstructed neighboring samples are incorporated into the input of the networks. However, [32] additionally uses neighboring samples of the temporal reference block for the input. An improved version of NNIP is presented in [33]. Here, the network architecture is changed such that three instead of two neural networks are used in combination. In [32] and [33], a distinct model is trained for each combination of QP and block shape.
In [34], Zhao et al. propose a CNN-based fusion scheme. It is applied only for bi-prediction and replaces the averaging of the two predictors. Inputs to the network are the two motion-compensated prediction signals, and its output is the combined inter prediction signal. For each QP, a distinct model is trained.
In [35], Mao et al. present a CNN-based bi-prediction utilizing spatial information, called SICNN. Conceptually, SICNN can be viewed as a combination of ideas originating from [32], [33], and [34]: Like [34], it uses the two constituent prediction signals of bi-prediction as the input of the CNN. Like [32], [33], the corresponding blocks are enlarged to also include top/left spatially neighboring samples. The output of the CNN is the refined bi-prediction signal. Again, for each QP, a distinct model is trained for SICNN. In [36], Mao and Yu extend their work of [35] to also include temporal distance information in the input of the CNN.
In [37], Zhang et al. describe a CNN-based inter prediction refinement method for the AVS3 standard [38]. This work builds on [32], but uses a CNN instead of the FCN in order to allow application of the network to all block shapes. Furthermore, no spatially neighboring samples are used in [37]. Still, for each QP, a distinct model is trained.
In [39], Jin et al. propose a deep affine motion compensation network (DAMC-Net), which is based on the AMC method of VVC. Inputs to the network are the AMC prediction, the initial motion vector field, and the reference block. The output of the network is the refined AMC prediction signal. Like in [31], [32], [33], [35], and [36], the input block is enlarged to also include top/left neighboring samples. For each combination of block shape and QP, a distinct network model is trained.
Neural network-based methods for video coding are also studied by the Joint Video Experts Team (JVET). For this purpose, a series of exploration experiments has been conducted over the course of several JVET meeting cycles, starting with [40] in 2020. As an outcome of this exploration activity, a new software called neural network-based video coding (NNVC) is being developed based on VTM. At the time of writing, the current version 7.1 of the NNVC software includes the following neural network-based tools: an in-loop filter in the two variants low-complexity operating point (LOP) [41] and high-complexity operating point (HOP) [42], intra prediction [43], super resolution [44], and a content-adaptive post filter [45].
CNN-based in-loop filters are an interesting alternative to STRN for improving the inter coding performance. While STRN enhances the quality of the prediction signal, in-loop filters enhance the quality of the reconstructed signal. Given our solution for decoupling STRN from the intra decoding loop, both approaches allow the same level of parallelism at the decoder. However, applying the CNN to the prediction signal has advantages: On the one hand, it has the potential to directly reduce the energy of the transmitted residual signal. On the other hand, it is less prone to visual artifacts, as possible distortions in the prediction signal can either be compensated by the residual signal or averted by selecting a different coding mode or block partitioning during the rate-distortion (RD) optimization at the encoder.
In our previous work [13], an intra-inter prediction residual convolutional neural network (IPRN) is presented. The architecture of IPRN is based on [35] and [36]. Accordingly, the input to the network includes the inter prediction signal together with the two constituent prediction signals of bi-prediction and is likewise extended by top/left neighboring samples. Unlike most related work, IPRN is based on VVC and uses a single network model for all block shapes and QP values. In addition, different training loss functions are studied in [13]. It is found that the sum of absolute transformed differences (SATD), i.e., the ℓ1-norm in the DCT domain, results in a better coding performance than the commonly used sum of squared differences (SSD) and sum of absolute differences (SAD), which operate in the spatial domain.
Most of the methods discussed above, namely [31], [32], [33], [35], [36], and [39], as well as our previous work [13], use reconstructed top/left neighboring samples as input to the neural network. This has significant implications for practical decoder implementations. Firstly, and most importantly, the network cannot be applied in parallel to the affected blocks of one picture. Instead, the blocks have to be fed through the network sequentially, because the input of the network depends on the reconstructed neighboring samples, and therefore on the output of the network for those blocks. Secondly, by referring to reconstructed neighboring samples, a CNN-refined inter block may depend on the output of intra prediction if at least one of its top/left neighboring blocks happens to be intra predicted. Both aspects have the effect that the CNN-based inter prediction becomes part of the so-called intra decoding loop. This is a complete break with existing video codec design principles: In all video coding standards, including VVC, inter prediction can be performed in parallel at the decoder for all inter blocks of one picture once the corresponding motion vectors have been determined. This becomes impossible with such a change.
In this paper, we propose STRN, a spatio-temporal residual CNN for enhanced inter prediction. As a distinct feature, we introduce an additional signal plane with constrained spatial reference samples, which enables decoupling the network from the intra decoding loop. Therefore, our solution allows parallel processing of the CNN for all affected blocks of one picture at the decoder. To the best of our knowledge, this aspect, which has a significant impact on practical implementation, has not been addressed before in the literature on CNN-based inter prediction. Moreover, most of the previously discussed methods employ a separate CNN model for each block shape and/or QP value. Separate models for each QP value are only suitable for showing results with a small predefined set of QPs, while real applications are usually expected to support a large range of QP values. STRN uses a single CNN model for all block sizes and QP values.

III. DESCRIPTION OF THE NEURAL NETWORK FOR ENHANCED INTER PREDICTION

This section introduces the proposed neural network, including the CNN architecture with polyphase decomposition in Section III-A and the specifics of the training process in Section III-B.

A. Architecture
Fig. 1 shows an overview of the proposed STRN network architecture, which is based on the architecture in [13]. Our approach aims at improving the prediction signal of inter blocks in VVC, so that the input and output of the network (as depicted in the top-left and top-right of Fig. 1, respectively) represent the interface between the video codec and the STRN domain.
Given an inter block of size W × H, bi-prediction in VVC uses the motion-compensated reference blocks of size W × H of the two reference pictures, i.e., the L0 and L1 prediction signals, to compute the prediction of the current block. Accordingly, the STRN input is composed of the sample arrays of the current picture (highlighted in blue in Fig. 1) and the L0 and L1 prediction signals (highlighted in green and yellow in Fig. 1). For STRN, however, the block size is extended by an L-shaped, B samples wide area along the top/left border, resulting in input arrays of size W_B × H_B, with W_B = W + B and H_B = H + B. Regarding the L0 and L1 prediction signals, the extended motion-compensated reference blocks are derived using the same motion vectors as for regular bi-prediction, so that input arrays C_1 and C_3 contain additional (interpolated) prediction samples along the top/left border, i.e., spatio-temporal reference samples. For the current picture, the input array C_2 contains the regular bi-prediction P of the current block in the corresponding W × H area (light blue in Fig. 1) and additional reconstructed samples in the L-shaped, B samples wide area along the top/left border (dark blue in Fig. 1), i.e., spatial reference samples. Unlike in [13], these reconstructed samples are subject to certain constraints that allow STRN and intra blocks to be decoded independently and in parallel (see Section IV-B for details).
Together, the three input arrays form a tensor C of size 3 × W_B × H_B, which is used to derive the actual input tensor of the network via polyphase decomposition [46], [47]. Given C with elements C(c, x, y), the polyphase components are obtained by splitting it into even and odd samples along the x and y directions as

C_ij(c, x, y) = C(c, 2x + i, 2y + j),  i, j ∈ {0, 1},   (1)

which requires W_B and H_B to be multiples of 2. The input tensor C* of size 12 × (W_B/2) × (H_B/2) is then formed by stacking the four polyphase components C_00, C_01, C_10, and C_11 along the channel dimension. Please note that this decomposition only rearranges the tensor elements; it does not change the number of elements or their values. In the context of deep learning, this operation is sometimes also referred to as pixel (un)shuffling [48].
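The rearrangement in (1) and its inverse can be sketched in a few lines of NumPy; the function names are ours, and the channel ordering of the four components is an illustrative assumption:

```python
import numpy as np

def polyphase_decompose(C):
    """Split a (3, W_B, H_B) array into four polyphase components
    (even/odd along x and y) and stack them along the channel axis,
    yielding a (12, W_B//2, H_B//2) tensor. W_B and H_B must be even."""
    assert C.shape[1] % 2 == 0 and C.shape[2] % 2 == 0
    parts = [C[:, i::2, j::2] for i in (0, 1) for j in (0, 1)]
    return np.concatenate(parts, axis=0)

def polyphase_recompose(Cstar):
    """Inverse operation: merge a (12, W_B//2, H_B//2) tensor back
    into a (3, W_B, H_B) array without changing any sample values."""
    c, w2, h2 = Cstar.shape
    C = np.empty((c // 4, 2 * w2, 2 * h2), dtype=Cstar.dtype)
    for k, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
        C[:, i::2, j::2] = Cstar[k * (c // 4):(k + 1) * (c // 4)]
    return C
```

Since both functions only index with strides, decomposition followed by recomposition reproduces the input exactly, illustrating that no information is lost.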
Regarding the neural network structure, the proposed STRN basically consists of N convolution layers. As illustrated in Fig. 1, all layers perform convolutions with a kernel size of 3 × 3, followed by a rectified linear unit (ReLU) activation function, except for the last layer. The operation of these convolution layers can be defined as

L_k = ReLU(W_k * L_{k−1} + b_k),  k ∈ [1..N],   (2)

with weight matrices W_k and bias vectors b_k, where the ReLU is omitted for the last layer k = N. The input to the first layer, L_0 = C*, is processed as 12 × 3 × 3 subtensors of C* at positions (x, y). For each layer, the operations of (2) are applied to all positions (x, y), using zero padding to preserve the block shape, such that the output of the last layer L_N has the same spatial dimensions as the input. Each convolution layer has c_in × 3 × 3 × c_out weights (size of W_k) and c_out biases (size of b_k), with c_in and c_out being the number of input and output channels of the respective layer. The first layer has 12 input channels and the last layer has 4 output channels, while all intermediate layers have F feature channels. This results in a total of n_w = 608256 weights and n_b = 644 biases for STRN with N = 6 layers and F = 128 feature channels.
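The stated parameter counts follow directly from the channel layout; a minimal sketch that reproduces them (function name is ours):

```python
def strn_param_count(N=6, F=128, c_in=12, c_out=4, k=3):
    """Count weights and biases of an N-layer CNN with k x k kernels:
    first layer c_in -> F, intermediate layers F -> F, last layer
    F -> c_out, as in the STRN channel layout described in the text."""
    channels = [c_in] + [F] * (N - 1) + [c_out]
    weights = sum(channels[i] * k * k * channels[i + 1] for i in range(N))
    biases = sum(channels[1:])  # one bias per output channel per layer
    return weights, biases
```

For N = 6 and F = 128 this yields 608256 weights and 644 biases, matching the figures given above.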
As illustrated in Fig. 1, the network has a skip connection, where the input array C_2 is added to the output. However, L_N is a polyphase representation of the output, where the four output channels correspond to the four polyphase components as in (1). Therefore, the elements of L_N are rearranged into a W_B × H_B output array C′ before the skip connection, i.e., the four polyphase components are merged into one array by inverse polyphase decomposition. The final output is then obtained by adding the input array C_2 to the output C′ and cropping the L-shaped, B samples wide area along the top/left border as

P* = crop_{W×H}(C_2 + C′),

with P* being the refined prediction of the current W × H block. The skip connection makes STRN a residual network and has the effect that the convolution layers learn a residual correction of the prediction signal with the help of spatial and temporal reference samples.
The reason for including the polyphase decomposition in the STRN architecture is that it allows for a considerable complexity reduction. Table I shows a comparison between the IPRN architecture [13] without and the STRN architecture with polyphase decomposition. The computational complexity of deep learning approaches for video coding is often evaluated in terms of multiply-accumulate (MAC) operations per output luma sample, commonly referred to as MAC per pixel (MAC/pxl). For both IPRN and STRN, this value depends on the number of weights and the block shape: Each layer performs its convolutions at every spatial position of its input, i.e., at W_B × H_B positions for IPRN and at (W_B/2) × (H_B/2) positions for STRN, and the resulting number of MAC operations is normalized by the W × H output luma samples. Thus, polyphase decomposition reduces the complexity by reducing the tensor shape of the input and all subsequent layers. The resulting minimum and maximum values in Table I highlight that STRN has about a quarter of the complexity of IPRN for the same number of feature channels or, alternatively, about the same complexity for twice the number of feature channels.
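The roughly fourfold complexity reduction can be sketched numerically. The STRN channel layout (12 input, 4 output channels) is taken from the text; the IPRN layout without polyphase decomposition (3 input, 1 output channel) is our assumption, so the exact ratio is illustrative:

```python
def mac_per_pixel(layer_channels, spatial_positions, out_samples, k=3):
    """MAC/pxl for a chain of k x k convolution layers, each applied at
    `spatial_positions` positions, normalized by the number of output
    luma samples of the block."""
    macs = sum(cin * k * k * cout * spatial_positions
               for cin, cout in zip(layer_channels, layer_channels[1:]))
    return macs / out_samples

def compare_mac(W, H, B, N=6, F=128):
    """Compare MAC/pxl of a network without polyphase decomposition
    (IPRN-style, assumed 3 -> 1 channels) and with it (STRN, 12 -> 4)."""
    WB, HB = W + B, H + B
    iprn = mac_per_pixel([3] + [F] * (N - 1) + [1], WB * HB, W * H)
    strn = mac_per_pixel([12] + [F] * (N - 1) + [4], (WB // 2) * (HB // 2), W * H)
    return iprn, strn
```

For a 16 × 16 block with B = 4, the ratio comes out near 4, consistent with the "about a quarter of the complexity" observation above.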

B. Training
The training dataset for STRN consists of a collection of so-called training samples. Like in [13], these samples are derived from decoded VVC bitstreams by generating the three input arrays C_i for inter blocks and storing them together with the corresponding original signal array O of the block. The values of these arrays are of integer type in the range [0..2^b − 1], with bit depth b, and are converted to floating-point values in the range [−0.5, 0.5) in the training process. IPRN in [13] and STRN have in common that the architecture is fully independent of the block shape W × H, which means that the same model can be trained and applied for all VVC inter block shapes. Consequently, the dataset contains training samples for various block shapes. During training, each forward and backward propagation cycle processes a batch of training samples at once. While all training samples within a batch are required to have the same shape, each batch can have a different shape, so that one single model can be trained with all the block shapes contained in the dataset.
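The integer-to-float conversion can be sketched as follows; the text only specifies the source and target ranges, so the exact mapping (division by 2^b, then offset) is our assumption:

```python
def to_float(sample, bit_depth=10):
    """Map an integer sample in [0, 2^b - 1] to a float in [-0.5, 0.5),
    as done when preparing STRN training samples. Works element-wise on
    scalars as well as NumPy arrays."""
    return sample / (1 << bit_depth) - 0.5
```

Note that the maximum input 2^b − 1 maps to a value strictly below 0.5, so the half-open target interval [−0.5, 0.5) is respected.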
The core of the training process is a gradient descent algorithm with a loss function and a backward propagation of the loss, based on a learning rate and an optimizer. Regarding the loss function, the differences between the commonly used SSD, SAD, and SATD have been studied in [13], with the conclusion that SATD performs better than the other loss functions, which are computed in the spatial domain. Therefore, the SATD loss function is also used for STRN: Given the output P* and the corresponding original signal O of a W × H block, the loss l equates to the ℓ1-norm of the two-dimensional DCT-II of the residual as

l = ‖DCT(P* − O)‖_1.

For backward propagation of the loss, the widely used Adam optimizer [49] is employed together with a learning rate that is decayed exponentially by a factor of 0.8 every two epochs, using an initial value of 10^−4.
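A minimal NumPy sketch of the SATD loss for a single block; the orthonormal DCT normalization is our choice, as the text does not specify a scaling:

```python
import numpy as np

def dct2_matrix(n):
    """Orthonormal DCT-II transform matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index
    m = np.arange(n)[None, :]  # sample index
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    D[0, :] = 1.0 / np.sqrt(n)  # DC row
    return D

def satd_loss(P_star, O):
    """SATD loss: l1-norm of the 2-D DCT-II of the residual P* - O,
    computed as a separable transform along rows and columns."""
    R = P_star - O
    H, W = R.shape
    T = dct2_matrix(H) @ R @ dct2_matrix(W).T
    return np.abs(T).sum()
```

With an orthonormal transform, a constant residual concentrates all energy in the DC coefficient, so the loss penalizes a uniform offset no more than the corresponding SSD would, while spreading high-frequency errors across many penalized coefficients.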

IV. INTEGRATION WITH VVC INTER CODING
This section describes how STRN is integrated into VVC inter coding, including the interaction with other inter coding tools in the prediction process in Section IV-A, the integration into the decoding process with special attention to the intra loop in Section IV-B, the compilation of the input arrays for application and training sample collection in Section IV-C, and an efficient integration into the encoding process in Section IV-D.

A. Prediction Process
Both IPRN in [13] and STRN are designed as residual networks for improving the prediction signal of inter blocks and are therefore integrated as a refinement module after the VVC inter prediction process. Given an inter block with prediction P, the network yields the refined prediction samples P*, which are clipped to the valid sample range [0..2^b − 1] after output and then used as the final prediction signal of the block in VVC.
In the proposed solution, STRN is applied only to the luma component of a block and only to uni- and bi-predicted inter blocks that
• are not coded with CIIP, BCW, GPM, or SbTMVP, and
• do not have a motion vector equal to zero, unless they are coded with AMC (referred to as the zero-MV constraint in the following).
For all these cases STRN is mandatory, i.e., it cannot be switched off for individual blocks. Accordingly, no tool flag or other mode data is signaled in the bitstream. Note that by selecting one of the coding modes for which STRN is not applicable, the encoder can implicitly disable STRN for a particular block. Moreover, a single model is used for all applicable inter blocks, including all block shapes and all QP values. The zero-MV constraint is motivated by the observation that repeated application of the CNN can lead to gradual signal degradation. This is particularly relevant for low-delay prediction structures and will be discussed in more detail in Section V-E.
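The applicability rules above can be condensed into a single predicate; the data layout (a set of mode flags and a list of motion vectors) is a hypothetical simplification of the actual codec data structures:

```python
def strn_applicable(mode_flags, motion_vectors, is_affine):
    """Return True if STRN refinement applies to a luma inter block.
    Sketch of the rules in the text: skip CIIP/BCW/GPM/SbTMVP blocks,
    and enforce the zero-MV constraint except for affine (AMC) blocks."""
    # Excluded coding modes: STRN is never applied to these.
    if mode_flags & {"CIIP", "BCW", "GPM", "SbTMVP"}:
        return False
    # Zero-MV constraint: a zero motion vector disables STRN,
    # unless the block is coded with affine motion compensation.
    if not is_affine and any(mv == (0, 0) for mv in motion_vectors):
        return False
    return True
```

Because the rules depend only on data already signaled for the block, the decoder can evaluate them without any additional bitstream syntax, matching the "no tool flag" design above.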

B. Decoding Process
The general process of generating and compiling the input tensor C is the same for both the collection of training data and the application of STRN as a coding tool. For the latter, however, the design of the VVC decoding process for inter pictures (or slices) needs to be considered: Achieving real-time decoding for applications with high frame rates and/or high resolutions is very challenging. For intra blocks, on the one hand, the prediction signal is a function of the top/left spatial reference samples, which means that intra blocks can only be decoded after all the respective neighboring blocks in the current picture have been decoded. For inter blocks, on the other hand, the prediction signal is a function of the L0 and L1 temporal reference samples, which means that inter blocks can be decoded independently of other blocks in the current picture. This design enables parallel decoding of inter blocks and, thus, a significant reduction in implementation complexity for decoding inter slices. For applications with higher frame rates and/or higher resolutions, all inter blocks can be processed in parallel first and then the remaining intra blocks successively. In this context, CIIP blocks are considered part of the intra decoding loop, since their final prediction is a weighted combination of planar intra prediction (spatial reference samples) and inter prediction (temporal reference samples).
Fig. 2 illustrates the decoding process of an inter slice with inter, intra, and STRN blocks. For STRN, the prediction signal P* is a function of both spatial and temporal reference samples. Consequently, without appropriate modifications, STRN would be part of the intra decoding loop, as shown in Fig. 2 (b): Both STRN and intra blocks depend on reconstructed samples of neighboring STRN and intra blocks. This would be a problem for decoder implementations that rely on parallel processing of inter blocks, since the STRN prediction refinement is mandatory for most of the inter coding modes and the computational complexity of forward propagating the input C through the network is quite high. A straightforward solution would be to remove the spatial reference samples in C by setting B = 0, but the corresponding results in Fig. 4 and Table IV show that the potential for improving the prediction signal is limited if it cannot be adapted to the reconstructed signal of adjacent blocks.
Our solution for B > 0 is illustrated in Fig. 2 (c). The STRN refinement is decoupled from the intra decoding loop by imposing the following constraints:
• For spatial reference samples of STRN and intra blocks that lie inside adjacent STRN blocks, an intermediate reconstructed signal without STRN refinement (P + R) is used instead of the actual reconstructed signal (P* + R).
• For spatial reference samples of STRN blocks that lie inside adjacent intra blocks, the extended inter prediction signal of the STRN block is used instead of the reconstructed signal of the intra block.
Fig. 2. (a) Inter slice with inter, intra, and STRN blocks, (b) decoding process without additional signal plane, i.e., STRN is part of the intra loop, and (c) proposed decoding process using the additional signal plane with constrained spatial reference samples, i.e., STRN is not part of the intra loop.

For this purpose, an additional signal plane is introduced that contains regular reconstructed samples for inter blocks and constrained spatial reference samples for STRN and intra blocks: Due to the first constraint, the signal plane does not contain the reconstructed signal (P* + R) of STRN blocks, but an intermediate signal (P + R), where P is the inter prediction signal before application of the STRN refinement and R is the residual transmitted in the bitstream. Due to the second constraint, the signal plane does not contain the reconstructed samples of intra blocks; for the STRN refinement of inter blocks, the required samples are replaced with simple bi-prediction of the extended L0 and L1 signals (see Section IV-C). Regarding the coding performance, the corresponding results in Table IV show that using constrained spatial reference samples is significantly better than the straightforward solution with B = 0. Now, with the additional signal plane, the inter decoding process of Fig. 2 (c) can be implemented as follows: (1) reconstruct all inter blocks without STRN refinement in parallel, (2) reconstruct the remaining intra blocks successively, and (3) apply the STRN refinement for applicable inter blocks in parallel. Steps (1) and (2) are the same as in the regular VVC decoding process without STRN, and steps (2) and (3) are independent of each other and may be executed in reverse order or even simultaneously.
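The three-step ordering can be illustrated with a toy decoder over scalar "blocks"; the dictionaries and the `refine` callback are hypothetical stand-ins for the actual codec data structures, not the VTM implementation:

```python
def decode_inter_slice(blocks, refine):
    """Sketch of the proposed decoding order. Step 1 and step 3 are
    independent per block and could run in parallel; step 2 is
    sequential but never reads any STRN-refined sample."""
    plane = {}  # additional signal plane (constrained reference samples)
    # Step 1: intermediate reconstruction P + R of all non-intra blocks.
    for b in blocks:
        if b["type"] != "intra":
            plane[b["id"]] = b["P"] + b["R"]
    # Step 2: intra blocks; independent of step 3 by construction.
    for b in blocks:
        if b["type"] == "intra":
            b["rec"] = b["P"] + b["R"]
    # Step 3: STRN refinement reads only the plane, never step-2 output.
    for b in blocks:
        if b["type"] == "strn":
            b["rec"] = refine(plane, b) + b["R"]
        elif b["type"] == "inter":
            b["rec"] = plane[b["id"]]
    return {b["id"]: b["rec"] for b in blocks}
```

Since step 3 only ever reads the intermediate plane filled in step 1, swapping steps 2 and 3, or running them simultaneously, produces the same reconstruction, which is the decoupling property described above.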

C. Input Compilation
Given a W × H inter block with prediction P for which STRN is applicable according to Section IV-A, the process of generating the input arrays C_i depends on the coding mode. The resulting input arrays C_i are required to be identical to the respective regular inter prediction signals in the corresponding W × H area. While the prediction process of BDOF, DMVR, and AMC operates on subblocks, STRN is applied to the whole W × H block, which means that the input arrays C_i are derived for the whole W_B × H_B area, including the reference and prediction signals of all subblocks.
For input arrays C_1 and C_3, the L0 and L1 motion vectors available from regular inter prediction are used to obtain the extended W_B × H_B area from the respective reference picture, including additional spatio-temporal reference samples in the L-shaped, B samples wide area along the top/left border. Except for AMC and DMVR, this step is straightforward, since for each reference picture only one motion vector is used for the whole block. For uni-predicted blocks, only either L0 or L1 reference data is available, which means that either C_1 or C_3 would be empty. In this case, the available reference data is copied to the otherwise empty input array, i.e., C_1 and C_3 are identical. AMC and DMVR use individually refined motion vectors for each subblock and, thus, deriving the additional spatio-temporal reference samples requires extending the process accordingly. For AMC, the input arrays C_1 and C_3 are generated without the PROF refinement by applying the motion vectors of the 4 × 4 subblocks along the top/left border to extended subblocks that include the adjacent B wide areas, using the same interpolation filters as for the regular AMC subblocks. For DMVR, the W × H area is divided into subblocks of up to 16 × 16 samples, and the L-shaped, B samples wide area along the top/left border of C_1 and C_3 is derived by introducing additional subblocks that inherit the refined motion vectors and the horizontal and/or vertical dimensions of the subblocks along the top/left border, using the same sample padding process as for the regular DMVR subblocks.
For input array C_2, the regular prediction P is copied to the corresponding W × H area and the remaining L-shaped top/left area is filled with spatial reference samples. According to Section IV-B, constrained spatial reference samples are used here. For this purpose, the additional signal plane is continuously filled during the encoding and decoding process, collecting the required data of already processed blocks. In some cases, spatial reference samples are (partially) unavailable: for blocks located along the top or left border of the picture, and for intra reference blocks due to the second constraint in Section IV-B. These areas of C_2 are then filled with simple bi-prediction without BDOF refinement, i.e., the average of the corresponding sample values in C_1 and C_3.
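The assembly of C_2 described above can be sketched as follows. This is a simplified NumPy sketch under our own assumptions, not the VTM implementation: the function name, the argument layout, and the convention of marking unavailable reference samples in the signal plane with NaN are all hypothetical.

```python
import numpy as np

def assemble_c2(pred, ref_plane, x0, y0, B, c1, c3):
    """Illustrative assembly of input array C_2 for a W x H block at
    picture position (x0, y0), assuming x0 >= B and y0 >= B.

    pred      : (H, W) regular inter prediction P
    ref_plane : additional signal plane with constrained spatial
                reference samples; unavailable samples marked as NaN
    B         : width of the L-shaped spatial reference area
    c1, c3    : (H+B, W+B) input arrays C_1 and C_3, used for the
                bi-prediction fallback where references are missing
    """
    H, W = pred.shape
    c2 = np.empty((H + B, W + B), dtype=np.float64)

    # Interior W x H area: the regular prediction P itself.
    c2[B:, B:] = pred

    # L-shaped top/left area: constrained spatial reference samples.
    c2[:B, :] = ref_plane[y0 - B:y0, x0 - B:x0 + W]   # top stripe
    c2[B:, :B] = ref_plane[y0:y0 + H, x0 - B:x0]      # left stripe

    # Unavailable samples (picture border, intra neighbors): fall back
    # to simple bi-prediction, i.e., the average of C_1 and C_3.
    missing = np.isnan(c2)
    c2[missing] = 0.5 * (c1[missing] + c3[missing])
    return c2
```

In the actual codec the fallback is computed without BDOF refinement; here C_1 and C_3 are simply averaged sample-wise, which matches the description above.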

D. Encoder Optimization
The VVC encoding process tries to minimize the RD cost by testing different combinations of block partitioning and coding modes against the original, uncompressed picture. For a given W × H block in an inter picture, a number of coding mode candidates are tested, including both inter and intra prediction modes. Eventually, the coding mode with the lowest RD cost is selected and later used for deciding the block partitioning.
Since STRN is integrated as a refinement module and the computational complexity of forward propagating the input tensor C through the network is quite high, it is not used for all coding mode candidates, but only for the most promising ones. For this purpose, the best coding mode is first determined using RD costs without STRN refinement. During this step, a list of up to K coding modes for which STRN is applicable (see Section IV-A) is compiled. If STRN is applicable for fewer than K coding modes, the list is not entirely filled; if it is applicable for more than K coding modes, the list contains the ones with the lowest RD costs without STRN refinement. In a second step, the RD costs of the up to K coding modes in the list are updated with STRN refinement, and the final coding mode of the block is selected between the best coding mode in the list and the best coding mode for which STRN is not applicable.
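The two-pass mode decision can be sketched in a few lines. The mode objects, cost callbacks, and the refined-cost function below are hypothetical placeholders standing in for the actual VTM encoder search.

```python
def select_coding_mode(modes, rd_cost, rd_cost_strn, strn_applicable, K):
    """Two-pass sketch: first pass ranks modes by RD cost without STRN,
    second pass re-evaluates the up-to-K most promising STRN-applicable
    modes with refinement and compares against the best non-applicable
    mode."""
    base = {m: rd_cost(m) for m in modes}

    # Up to K STRN-applicable modes with the lowest first-pass RD cost.
    shortlist = sorted((m for m in modes if strn_applicable(m)),
                       key=base.get)[:K]

    # Best candidate among modes for which STRN is not applicable.
    best_mode, best_cost = None, float("inf")
    for m in modes:
        if not strn_applicable(m) and base[m] < best_cost:
            best_mode, best_cost = m, base[m]

    # Update the shortlisted modes with STRN refinement and compare.
    for m in shortlist:
        cost = rd_cost_strn(m)
        if cost < best_cost:
            best_mode, best_cost = m, cost
    return best_mode, best_cost
```

Increasing K only grows the second loop, which is why larger K raises encoder runtime without affecting decoder complexity.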

V. EXPERIMENTAL RESULTS AND EVALUATION
For evaluating the impact of STRN on the VVC coding efficiency, we have used the VVC test model 15 reference software (VTM-15.0) [52] under the JVET common test conditions (CTC) [54]. Unless stated otherwise, the STRN model has been trained with the configuration and the dataset specified in Table II. This results in a model file that contains the layer structure together with weight and bias values. For application in VTM, the STRN refinement has been integrated into the software according to Section IV, using the LibTorch 1.10 API, which provides the functionality to load the model file and forward propagate the input tensor C through the network. All VTM coding experiments have been performed without GPU support, which means that the runtimes presented in this section have been obtained by running both VTM and STRN single-threaded on the CPU. Apart from the fast encoder search described in Section IV-D, our implementation has not been optimized in terms of runtimes. In particular, the decoder operates without the parallel processing described in Section IV-B, which is intended for hardware implementations and real-time applications.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

A. Overall Performance
Table III shows the coding gains as the Bjøntegaard delta (BD) [55], [56] rate of the CTC sequences and the overall coding performance for the RA, low-delay B (LB), and low-delay P (LP) configurations. While the training dataset only contains samples for certain block shapes and QPs under the RA configuration, the results in Table III demonstrate that STRN achieves substantial coding gains for different coding structures and when applied to all block shapes (using the default QP range 22..37). The additional results in Table V show that STRN performs equally well for the high QP range 27..42, with an overall luma BD-rate of more than 4%.
Table IV shows the coding performance of IPRN and STRN together with important intermediate steps during the development of the proposed solution. For assessing the effect of polyphase decomposition and decoupling from the intra decoding loop in more detail, the table additionally contains an analysis of MAC operations and sample usage. According to Section III-A, the number of MAC operations depends on the network configuration (number of weights) and on the block shape (if B > 0), so that the theoretical minimum and maximum values are achieved for the largest and smallest block shapes, respectively. Both the average MAC per pixel and the average sample usage are measured for all blocks and all pictures of decoded CTC bitstreams, with sample usage indicating the portion of luma samples covered by blocks for which IPRN or STRN is applied. Comparing the overall results of IPRN with configurations (a) and (b) highlights that polyphase decomposition either leads to a dramatic complexity reduction with a slightly lower coding gain (for the same number of feature channels), or to a significant increase in coding gain with about the same complexity (for twice the number of feature channels). Note that configuration (b) uses the same trained model as STRN, but is still part of the intra decoding loop in VVC, since input array C_2 contains unconstrained spatial reference samples. According to Section IV-B, configuration (c) is a straightforward solution for decoupling STRN from the intra decoding loop by completely omitting spatial reference samples, i.e., B = 0. Compared to configuration (b), however, this results in a substantial decrease in coding gain. In our proposed solution, STRN uses constrained spatial reference samples instead, which leads to slightly less coding gain than configuration (b), but has the advantage that the STRN refinement is decoupled from the intra decoding loop.
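The dependence of the MAC count on the network configuration can be made concrete with a simple bookkeeping sketch. This is illustrative only and ignores boundary effects (B > 0) and the exact layer layout of IPRN/STRN; it assumes a plain CNN of N layers with 3 × 3 kernels, F feature channels, and a 2 × 2 polyphase decomposition that multiplies the first and last layer's external channel count by 4 while quartering the number of spatial positions.

```python
def mac_per_pixel(n_layers, f, c_in=3, c_out=1, k=3, polyphase=False):
    """Rough MAC-per-output-pixel estimate for a plain CNN with
    n_layers k x k convolutions and f feature channels.

    With a 2x2 polyphase decomposition, the input/output channels are
    multiplied by 4, but every convolution operates on a quarter of
    the spatial positions, so the per-pixel cost drops by roughly 4x.
    """
    scale = 4 if polyphase else 1
    macs = k * k * (c_in * scale) * f          # first layer
    macs += k * k * f * f * (n_layers - 2)     # hidden layers
    macs += k * k * f * (c_out * scale)        # last layer
    return macs / scale                        # per full-resolution pixel
```

Under these assumptions, the polyphase variant needs almost four times fewer MACs per pixel for the same N and F, consistent with the factor reported for Table VI.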
All the remaining results in Sections V-C to V-F are based on STRN and consequently use constrained spatial reference samples.

B. Analysis of the Prediction Error
It is well-known from the literature that for motion-compensated prediction, the spatial distribution of the prediction error is nonuniform within the prediction block. In particular, the prediction error tends to be higher in the border areas than at the center of the block, as shown in [14, Fig. 9] and confirmed by our own investigation illustrated in Fig. 3 (a). With this in mind, we now address the effects of IPRN and STRN on the spatial distribution of the prediction error. Fig. 3 (b) shows the resulting prediction error after application of the STRN refinement. Compared to Fig. 3 (a), the spatial distribution has been notably altered: By incorporating neighboring reconstructed samples in the STRN refinement, the prediction error is particularly reduced in the top/left border area of the block. While the MSE values in this area are above average in Fig. 3 (a), they are below average after application of STRN in Fig. 3 (b). The prediction error is also reduced in the remaining area of the block, albeit to a smaller extent.
One consequence of the architecture being independent of the block shape is that the influence of the spatial reference samples in input C on output P* is limited: Given a simple CNN like IPRN with N layers and a kernel size of 3 × 3, the value of an output element only depends on the values of input elements in a (2N + 1) × (2N + 1) area centered around the position of the output element. Consequently, the spatial reference samples in the L-shaped, B wide area along the top/left border of the input only affect the output values in the L-shaped, N wide top/left area of P*. For STRN, the area of P* affected by the spatial reference samples in C is actually 2N wide due to the polyphase decomposition. Fig. 4 shows the results of an experimental evaluation of the described effect, comparing IPRN and STRN with and without spatial reference samples. For each of the three trained models, the position-wise MSE reduction has been evaluated during inference as r(x, y) = [P(x, y) − O(x, y)]² − [P*(x, y) − O(x, y)]² for each position (x, y) of the output P*. The value of r corresponds to the amount of improvement at the respective position, and the diagrams in Fig. 4 show the average value over the validation dataset. Comparing the results in Fig. 4 (a) and (b) with Fig. 4 (c) reveals that the improvement of the prediction signal is considerably higher when spatial reference samples are included in the input. Moreover, the results in Fig. 4 (a) and (b) confirm that the influence of spatial reference samples is limited to the L-shaped top/left area of the block and that this area is twice as wide for STRN as for IPRN.
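The position-wise MSE reduction r(x, y) averaged over a dataset is straightforward to compute; the following NumPy sketch (function name and array layout are our own) evaluates it for a batch of blocks of identical shape:

```python
import numpy as np

def mse_reduction_map(preds, refined, originals):
    """Position-wise MSE reduction averaged over a set of blocks:
    r(x, y) = (P - O)^2 - (P* - O)^2, positive where the refinement
    improves the prediction.
    preds, refined, originals: arrays of shape (num_blocks, H, W)."""
    r = (preds - originals) ** 2 - (refined - originals) ** 2
    return r.mean(axis=0)
```

Positive values in the resulting (H, W) map indicate positions where the refinement helps, matching the reading of the diagrams in Fig. 4.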

TABLE V OVERALL CODING PERFORMANCE OF STRN VARIANTS (BD-RATES AND RELATIVE RUNTIMES)
Regarding Fig. 3 and 4, the cross-shaped structure in the center of the block should not go unmentioned. Further investigation showed that this effect stems from the DMVR coding tool of VVC, namely the 16 × 16 subblocks that use sample padding instead of reconstructed samples along the border of the L0 and L1 prediction signals. As illustrated by Fig. 3 (a) and (b), these areas tend to have a higher MSE in prediction P, which is attenuated by the CNN-based refinement. Consequently, a higher MSE reduction (improvement) can be observed for these areas in Fig. 4.

C. Variations of STRN
Table V and Fig. 5 compare the coding performance of the default STRN configuration with variants that have a different number of layers N, number of feature channels F, spatial reference size B, or encoder list length K. Table V presents BD-rates and relative runtimes for both the default and the high QP range, while the diagram in Fig. 5 shows luma BD-rates versus average MAC per pixel and contains additional variants for N and F. Even though decoder runtimes and average MAC per pixel are strongly correlated, only the latter can be exactly reproduced for a given set of bitstreams, as it is independent of the simulation environment. Hence, MAC per pixel is more suitable for assessing the complexity overhead of STRN. The results for varying N, F, and B confirm that our default STRN configuration is a good trade-off between coding gain and complexity. When targeting a configuration with lower complexity, reducing the number of feature channels F shows a better trade-off than reducing the number of layers N or the spatial reference size B. When targeting a configuration with higher coding gain, however, increasing the encoder list length K shows an interesting trade-off: Additional coding gains are achieved without an appreciable increase in decoding complexity. Increasing K corresponds to testing additional coding modes with STRN refinement at the encoder, which results in noticeably higher encoder runtimes. For example, the variant with K = 2 has about the same coding gain and encoder runtime as the variant with F = 192, but only half the decoding complexity.

D. Polyphase Decomposition
For IPRN, the results in Table IV indicate that polyphase decomposition basically reduces the complexity at the expense of accuracy by performing a kind of downsampling. In addition, the analysis in Section V-B and Fig. 4 shows that the area of the block that benefits from the spatial reference samples in the input is increased by polyphase decomposition.
Table VI and Fig. 6 study the impact of polyphase decomposition in more detail, comparing the coding performance and computational complexity of STRN with and without polyphase decomposition while varying the number of feature channels F. The results in Fig. 6 show that with polyphase decomposition, an additional coding gain of more than 0.5% is achieved for the same complexity over the entire range of tested configurations. For example, the default STRN configuration with polyphase decomposition and F = 128 (labeled STRN) and the one without polyphase decomposition and F = 64 (labeled F64 on the blue line) have about the same complexity, but a BD-rate difference of 0.8%. Conversely, this means that with polyphase decomposition the same coding gain can be achieved with significantly less complexity.
Table VI evaluates the effect of adding the polyphase decomposition stage to models with an otherwise identical configuration. Overall, this consistently results in almost four times fewer MAC operations per pixel and some coding loss for most configurations, confirming that polyphase decomposition basically reduces the complexity at the expense of accuracy. However, for individual configurations, even some coding gain is achieved. Upon closer inspection, the per-class results reveal that the difference in coding performance seems to depend on the resolution of the sequences (from the highest resolution in Class A to the lowest resolution in Class D). The two opposing effects of polyphase decomposition play a role here: For higher resolutions, the advantage of increasing the area that benefits from the spatial reference samples seems to outweigh the disadvantage of decreasing the accuracy, while for lower resolutions the opposite seems to be the case.
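The 2 × 2 polyphase decomposition itself is a simple space-to-depth rearrangement. The sketch below (our own helper names; the real network additionally stacks the components of all input arrays into one tensor) shows the operation and its inverse; because a 3 × 3 convolution on the half-resolution components reaches ±1 positions there, i.e., ±2 samples on the original grid, N layers after decomposition propagate boundary information about 2N samples into the block instead of N.

```python
import numpy as np

def polyphase_decompose(x):
    """Split an (H, W) array into its four 2x2 polyphase components,
    returning shape (4, H//2, W//2). H and W are assumed even."""
    return np.stack([x[0::2, 0::2], x[0::2, 1::2],
                     x[1::2, 0::2], x[1::2, 1::2]])

def polyphase_recompose(c):
    """Inverse operation (depth-to-space): interleave the four
    components back into a (2h, 2w) array."""
    _, h, w = c.shape
    x = np.empty((2 * h, 2 * w), dtype=c.dtype)
    x[0::2, 0::2], x[0::2, 1::2] = c[0], c[1]
    x[1::2, 0::2], x[1::2, 1::2] = c[2], c[3]
    return x
```

The rearrangement is lossless, so no information is discarded; only the spatial resolution at which the convolutions operate is reduced.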

E. Zero-MV Constraint
Table VII and Fig. 7 study the effect of the zero-MV constraint introduced in Section IV-A, focusing on the low-delay configurations LB and LP. Note that the zero-MV constraint is also considered in the training process by using a dataset that either includes or excludes training samples of blocks that meet the zero-MV condition. Before adding the zero-MV constraint to the conditions for applicable blocks, we observed that STRN leads to a considerable coding loss for the class E sequences when using the LP configuration. Further investigation revealed that the STRN refinement can result in a gradual signal degradation. This effect is illustrated by the diagram in Fig. 7, which shows how the average luma BD-rate changes over the length of the sequence: Both curves without the zero-MV constraint (dashed lines) feature a gradual decrease in coding efficiency that accumulates to considerable losses. Because class E sequences are very static, with large areas of constant background, and the LP configuration is limited to uni-prediction, situations arise where the STRN refinement is applied to the exact same signal over and over. The examples in Fig. 7 show that this effect is almost completely eliminated by adding the zero-MV constraint, i.e., omitting the STRN refinement for blocks that have a motion vector equal to zero. The results in Table VII further confirm that the zero-MV constraint improves the coding performance of the challenging sequences for the low-delay configurations without affecting the coding performance of the other sequences or the RA configuration. Note that the zero-MV constraint is applied to RA in exactly the same way as to the low-delay configurations.
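The constraint itself amounts to one extra applicability check. The following sketch is our own reading of the condition: we assume refinement is omitted when all available motion vectors of the block are zero (for LP uni-prediction this is the single MV), and that the remaining applicability conditions of Section IV-A are checked elsewhere.

```python
def passes_zero_mv_constraint(mvs):
    """Return True if STRN refinement may be applied under the zero-MV
    constraint, i.e., if at least one motion vector component of the
    block is nonzero. mvs: list of (mvx, mvy) tuples for the available
    L0/L1 motion vectors."""
    return any(mvx != 0 or mvy != 0 for mvx, mvy in mvs)
```

Skipping the refinement for zero-motion blocks prevents the network from being re-applied to an identical signal frame after frame in static areas, which is the degradation mechanism described above.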

F. Additional NNVC Results
Tables VIII and IX show additional results for STRN relative to NNVC, the experimental software being developed by JVET for studying neural network-based video coding technologies. For this evaluation, STRN has been integrated into the current version 7.1 of the NNVC software. Note that the same model as for the VTM simulations has been used, i.e., STRN has not been retrained for NNVC. All the results shown in Table VIII are relative to NNVC configured as VTM, using the RA configuration and the default QP range 22..37. In its default configuration, NNVC comes with two additional neural network-based coding tools: intra prediction and the in-loop filter in the low-complexity operating point variant, which are referred to as NN-intra and LOP, respectively. Table VIII shows the coding performance of all possible combinations of the three coding tools NN-intra, LOP, and STRN, namely individual results for each tool in the first three columns, combinations of two tools in the next three columns, and finally the combination of all three tools in the last column. Note that the combination NN-intra + LOP in the fourth column corresponds to the coding performance of NNVC. Thus, the last column shows the overall coding gain for the combination of NNVC with STRN. Table IX shows the coding performance of STRN relative to different NNVC configurations in terms of luma BD-rate. Here, each column corresponds to a different encoder setting for STRN, with list length K determining the number of coding modes to be tested, and each row shows the results for a different tool configuration of the NNVC software.
The following observations can be made: Firstly, the coding performance of STRN for NNVC configured as VTM is very similar to the results for actual VTM, as reported in Tables III and V. The small deviations can be explained by differences in the code bases (VTM-15.0 versus VTM-11.0 for NNVC) as well as minor differences in the default encoder settings. Regarding the combination of STRN with NNVC tools, the results show that the coding gains of NN-intra and STRN are almost additive. Obviously, these tools target different aspects of video coding and are therefore complementary. For LOP and STRN, however, there apparently is an overlap, as the coding gain of their combination is significantly smaller than the sum of their individual coding gains. This can be explained by the fact that both LOP and STRN have an impact on the inter prediction signal, STRN directly and LOP indirectly through the filtering of the reference pictures. Two additional notes aid the assessment: As an alternative to LOP, the NNVC software can also be operated using HOP, i.e., a high-complexity variant of the in-loop filter. By using HOP instead of LOP, the coding gain of NNVC can be further increased by −7.26% luma BD-rate (resulting in an overall luma coding gain of −13.44% relative to NNVC configured as VTM) [57]. However, the computational complexity of HOP is also considerably higher, with about 28 times more MAC/pxl and a 24 times higher decoder runtime than LOP. As previously mentioned, the NNVC coding tools and STRN have been trained independently. In the case of NN-intra, the separate training seems to be appropriate. For the in-loop filters, though, a joint training might be preferable, but this is beyond the scope of this paper.
In summary, it can be stated that even without retraining the model, STRN is able to achieve average coding gains of -1.26% to -1.76% luma BD-rate over NNVC, depending on the number of RD checks at the encoder.

VI. CONCLUSION
In this paper, we presented an approach for refining the prediction signal of inter blocks in state-of-the-art video coding via a spatio-temporal residual CNN (STRN).
With our previous work in [13] as a starting point, the architecture has been improved by adding a polyphase decomposition of the input tensor prior to the first convolution layer. It has been shown theoretically and experimentally that polyphase decomposition increases the area of a block that can benefit from the spatial reference samples while reducing the computational complexity (worst-case and effective MAC per pixel). Compared to IPRN without polyphase decomposition, this results either in almost four times less complexity and a slightly lower coding gain or, by doubling the number of feature channels, in about the same complexity and a significantly higher coding gain.
STRN has been integrated into the inter coding process of the VVC standard, using the inter prediction signal of a block together with spatial and temporal reference samples to compile the input tensor, which is then forward propagated through the trained network, resulting in the refined prediction signal. The same model is used for all coding modes, block shapes, and QPs. Moreover, STRN is supported for most of the inter prediction modes and mandatory for all applicable blocks, with an average sample usage of about 68%.
Including spatial reference samples in the inter prediction process is challenging: The additional dependency on reconstructed blocks in the same picture would make parallel decoding of STRN blocks impossible. They would become part of the intra decoding loop, which contradicts the fundamental design of the VVC decoding process and is not feasible for real-time decoder implementations. In our solution, we introduce an additional signal plane consisting of constrained spatial reference samples. Using this new signal plane instead of the actual reconstructed neighboring samples decouples STRN from the intra decoding loop. This comes at the cost of a slightly lower coding gain, but enables decoding STRN blocks independently of the intra blocks and in parallel.
STRN has been implemented in the VTM reference software, and under CTC, an average coding gain of −4.07% luma BD-rate is achieved for the RA configuration with about 3 times the encoder and 70 times the decoder runtime. However, our implementation has not been optimized in terms of runtimes, and the coding experiments have been performed single-threaded on the CPU. An experimental evaluation confirmed that the default STRN configuration (N = 6, F = 128, and B = 4) is a good trade-off between coding gain and complexity, and that additional coding gains can be achieved for K > 1 without an appreciable increase in decoding complexity.
For low-delay prediction structures, a gradual signal degradation effect has been observed with STRN.We have shown that this effect can be mitigated successfully by adding the zero-MV constraint to the conditions for blocks to which STRN is applicable.As a result, STRN achieves consistent and substantial coding gains for all configurations.

Fig. 1. Residual CNN architecture with polyphase decomposition using spatial and temporal information for improving the inter prediction signal.
Given the regular inter prediction P of a block and a trained model with fixed weights W_k and biases b_k, the input tensor C is compiled and forward propagated through the network according to Section III-A, resulting in the refined prediction P*. As for the training in Section III-B, the values of C are converted from integer values in the range [0..2^b − 1] to floating-point values in the range [−0.5, 0.5) before input. Accordingly, the values of P* are converted back to integers in the range [0..2^b − 1].

Fig. 2. Inter decoding process (prediction and reconstruction with residual R): (a) inter slice with inter, intra, and STRN blocks, (b) decoding process without the additional signal plane, i.e., STRN is part of the intra loop, and (c) proposed decoding process using the additional signal plane with constrained spatial reference samples, i.e., STRN is not part of the intra loop.

Fig. 3. Prediction error before and after application of the STRN refinement: average position-wise mean squared error (MSE) for all 32 × 32 blocks of the validation dataset. (a) regular inter prediction P, (b) STRN refined inter prediction P*, and (c) colormap (using a linear scale).


Fig. 5. Coding performance of STRN variants as average MAC per pixel and luma BD-rate for VTM-15.0 CTC RA (variations of N, F, B, and K are each connected by a line, with labels for variants included in Table V).

Fig. 6. Coding performance for STRN with and without polyphase decomposition (PD) as average MAC per pixel and luma BD-rate for VTM-15.0 CTC RA.

TABLE I COMPUTATIONAL COMPLEXITY OF IPRN WITHOUT AND STRN WITH POLYPHASE DECOMPOSITION FOR B = 4 AND N = 6
…in residual or offset values for improving the regular prediction P of the current block (included in input array C_2).

TABLE III CODING PERFORMANCE OF STRN USING CONSTRAINED SPATIAL REF. SAMPLES AND K = 1 (BD-RATES AND RELATIVE RUNTIMES FOR VTM-15.0 CTC)

TABLE IV CODING PERFORMANCE AND COMPLEXITY ANALYSIS OF DIFFERENT IPRN AND STRN CONFIGURATIONS (LUMA BD-RATES AND RELATIVE RUNTIMES FOR VTM-15.0 CTC RA)

TABLE VI CODING PERFORMANCE DIFFERENCE AND RELATIVE COMPLEXITY OF STRN WITHOUT AND WITH POLYPHASE DECOMPOSITION USING N = 6, B = 4, AND K = 1 (LUMA BD-RATES AND RELATIVE RUNTIMES FOR VTM-15.0 CTC RA)

TABLE VII CODING PERFORMANCE DIFFERENCE FOR STRN WITHOUT AND WITH THE ZERO-MV CONSTRAINT (LUMA BD-RATES AND RELATIVE RUNTIMES FOR VTM-15.0 CTC)

TABLE VIII CODING PERFORMANCE OF STRN AND NNVC TOOL COMBINATIONS (REFERENCE IS NNVC CONFIGURED AS VTM, BD-RATES AND RELATIVE RUNTIMES FOR NNVC-7.1 CTC RA)

TABLE IX OVERALL CODING PERFORMANCE OF STRN WITH DIFFERENT ENCODER SETTINGS RELATIVE TO DIFFERENT NNVC CONFIGURATIONS (LUMA BD-RATES FOR NNVC-7.1 CTC RA)