Quantization and Entropy Coding in the Versatile Video Coding (VVC) Standard


Abstract—The paper provides an overview of the quantization and entropy coding methods in the Versatile Video Coding (VVC) standard. Special focus is laid on techniques that improve coding efficiency relative to the methods included in the High Efficiency Video Coding (HEVC) standard: the inclusion of trellis-coded quantization, the advanced context modeling for entropy coding transform coefficient levels, the arithmetic coding engine with multi-hypothesis probability estimation, and the joint coding of chroma residuals. Besides a description of the design concepts, the paper also discusses motivations and implementation aspects. The effectiveness of the quantization and entropy coding methods specified in VVC is validated by experimental results.

I. INTRODUCTION
T HE Versatile Video Coding (VVC) standard [1], [2] is the most recent joint video coding standard of the ITU-T and ISO/IEC standardization organizations. It was developed by the Joint Video Experts Team (JVET), a partnership between the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). VVC was technically finalized in July 2020 and will be published as ITU-T Rec. H.266 and ISO/IEC 23090-3 (MPEG-I Part 3).
The primary objective of the new VVC standard is to provide a significant increase in compression capability compared to its predecessor, the High Efficiency Video Coding (HEVC) standard [3]. At the same time, VVC includes design features that make it suitable for a broad range of video applications. In addition to conventional video applications, it particularly addresses the coding of video with high dynamic range and wide color gamut, computer-generated video (e. g., for remote screen sharing or gaming), and omnidirectional video and it supports adaptive streaming with resolution switching, scalable coding, and tile-based streaming for immersive applications. Despite the rich set of coding tools and functionalities, particular care was taken to enable decoder implementations with reasonable complexity in both hardware and software.
Similar to all previous video coding standards of the ITU-T and ISO/IEC since H.261 [4], the VVC design follows the general concept of block-based hybrid video coding. The video pictures are partitioned into rectangular blocks and each block is predicted by intra-or inter-picture prediction. The resulting prediction error blocks are coded using transform coding, which consists of an orthogonal transform, quantization of the transform coefficients, and entropy coding of the resulting quantization indexes. Quantization artifacts are attenuated by applying so-called in-loop filters to reconstructed pictures before they are output or used as references for inter-picture prediction of following pictures.
Although VVC uses the same coding framework as its predecessors, it includes various improvements that eventually result in a substantially improved compression performance. One of the most prominent changes in comparison to HEVC is the very flexible block partitioning concept [5] that supports non-square blocks for coding mode selection, intra-picture prediction, inter-picture prediction, and transform coding and, thus, impacts the design of many other aspects. In the present paper, we describe modifications to quantization and entropy coding. The coding efficiency improvements in this area can be mainly attributed to the following four features:
• the support of trellis-coded quantization (TCQ);
• the advanced entropy coding of quantization indexes suitable for both TCQ and scalar quantization;
• the binary arithmetic coding engine with multi-hypothesis probability estimation;
• the support of joint chroma residual coding.
These changes in quantization and entropy coding, together with a block-adaptive transform selection [6], eventually led to a substantially increased efficiency of the transform coding design in VVC compared to that of HEVC.
The paper is organized as follows. Section II describes the quantization in VVC with special focus on the TCQ design. The entropy coding of quantization indexes including context modeling is presented in Section III. Section IV discusses the improvements of the core binary arithmetic coding engine. The joint coding of chroma prediction errors is described in Section V. Experimental results validating the effectiveness of the quantization and entropy coding tools are provided in Section VI, and Section VII concludes the paper.

II. QUANTIZATION
Quantization is an irreversible mapping of input values to output values. For the specification in image and video coding standards, it is split into a non-normative encoder mapping of input samples to integer quantization indexes, which are also referred to as levels and are transmitted using entropy coding, and a normative decoder mapping of the quantization indexes to reconstructed samples. The aim of quantization is to decrease the bit rate required for transmitting the quantization indexes while maintaining a low reconstruction error.
In hybrid video coding, quantization is generally applied to transform coefficients that are obtained by transforming prediction error blocks (also referred to as residual blocks) using an approximately orthogonal transform. The transforms used have the property that, for typical residual blocks, the signal energy is concentrated into a small number of transform coefficients. This has the effect that simple scalar quantizers are more effective in the transform domain than in the original sample space [7]. In particular for improving the coding efficiency for screen content [8], where residual blocks often have different properties, VVC also provides a transform skip (TS) mode, in which no transform is applied, but the residual samples are quantized directly.
Similarly as in AVC (Advanced Video Coding) [9] and HEVC, the quantizer design in VVC is based on scalar quantization with uniform reconstruction quantizers. But VVC also includes two extensions that can improve coding efficiency at the cost of an increased encoder complexity.

A. Basic Design: Uniform Reconstruction Quantizers
In scalar quantization, the reconstructed value t'_k of each input coefficient (or sample) t_k depends only on the associated quantization index q_k. Uniform reconstruction quantizers (URQs) are a simple variant, in which the set of admissible reconstruction values is specified by a single parameter, called the quantization step size ∆_k. The decoder operation is given by a simple scaling, t'_k = ∆_k · q_k. Similar to previous ITU-T and ISO/IEC video coding standards, VVC supports quantization weighting matrices by which the quantization step size can be varied across the transform coefficients of a block. Conceptually, the step size for a coefficient t_k is given by ∆_k = α_k ∆, where α_k is a weighting factor that depends on the location of the coefficient t_k inside the transform block and ∆ is a quantization step size, which can be selected on a block basis among a pre-defined set of candidates. The chosen ∆ is indicated by a non-negative integer value referred to as quantization parameter (QP). VVC uses an exponential relationship between ∆ and QP, which was originally introduced in AVC. When neglecting rounding operations, the reconstruction of transform coefficients can be written as

t'_k = ∆_k · q_k = α_k · 2^((QP − 4)/6) · q_k.    (1)

An increase of the quantization parameter by one corresponds to an increase of about 12% for the quantization step size. For avoiding reconstruction mismatches, the entire VVC decoding process is specified using exact integer operations (similar to AVC and HEVC). In comparison to the idealized case with orthogonal transforms, the inverse transform for a W×H block includes an additional scaling by √(WH) · 2^(B−15), where B represents the bit depth of the color component in bits per sample. Consequently, the scaling in the decoder has to approximately generate the reconstructed coefficients

t''_k = (2^(15−B) / √(WH)) · α_k · 2^((QP − 4)/6) · q_k,    (2)

which are then used as input values to the inverse transform.
With β = ⌈(log2 WH)/2⌉, γ = 2β − log2 WH, p = ⌊QP/6⌋, and m = QP % 6, where ⌈·⌉ and ⌊·⌋ denote the ceiling and floor functions, respectively, and % denotes the modulus operator, the mapping q_k → t''_k can be rewritten according to

t''_k = 2^(5−β−B) · (2^6 · 2^((m−4)/6) · 2^(γ/2)) · (16 α_k) · q_k · 2^p.    (3)

Since both the width W and the height H of a transform block are integer powers of two, γ ∈ {0, 1} is a binary parameter.
For obtaining a realization with integer operations, the two terms in parentheses are rounded to integer values and the multiplication with 2^(5−β−B) is approximated by a bit shift. The VVC standard specifies the reconstruction according to

t''_k = ( ( ls[γ][m] · w_k · q_k ) << p + 2^(b−1) ) >> b,    (4)

where << and >> denote bit shifts to the left and right (in two's complement arithmetic), respectively, and b = B + β − 5. The factors ls[γ][m] represent the integer-rounded values of 2^6 · 2^((m−4)/6) · 2^(γ/2), given by ls[0][m] ∈ {40, 45, 51, 57, 64, 72} and ls[1][m] ∈ {57, 64, 72, 80, 90, 102} for m = 0, …, 5. The weighting factors w_k = round(16 α_k), which are integers in the range [1; 255], are called a scaling list. As further detailed in Section II-F, scaling lists for different block types can be specified in a corresponding high-level data structure. If scaling lists are not used, the values w_k are inferred to be equal to 16, which corresponds to ∆_k = ∆.
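The integer scaling of this subsection can be sketched in a few lines of Python. The level-scale constants are the rounded values from the VVC specification; the function name and argument order are illustrative, not taken from the standard:

```python
import math

# Rounded constants 64 * 2^((m-4)/6) * sqrt(2)^gamma from the VVC spec
LEVEL_SCALE = [[40, 45, 51, 57, 64, 72],
               [57, 64, 72, 80, 90, 102]]

def reconstruct_coeff(q_k, w_k, qp, W, H, B=10):
    """Sketch of the integer reconstruction of one transform coefficient."""
    log2wh = int(math.log2(W * H))
    beta = (log2wh + 1) // 2           # ceil(log2(WH) / 2)
    gamma = 2 * beta - log2wh          # 0 or 1
    p, m = qp // 6, qp % 6
    b = B + beta - 5
    # product, left shift by p, rounding offset, right shift by b
    return (((w_k * LEVEL_SCALE[gamma][m] * q_k) << p) + (1 << (b - 1))) >> b
```

For example, with q_k = 3, w_k = 16, QP = 28 and an 8×8 block at 10-bit depth, the ideal value 2^(15−B)/√(WH) · ∆ · q_k = 4 · 16 · 3 = 192 is reproduced exactly by the integer arithmetic.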
In transform skip mode, no inverse transform is applied and, hence, no additional scaling factor has to be included in the reconstruction process of residual samples r_k. Furthermore, the concept of scaling lists is not applicable. An integer realization of the reconstruction r'_k = ∆ · q_k is obtained by using (4) with w_k = 16, γ = 0, and b = 10, which yields

r'_k = ( ( ls[0][m] << (p + 4) ) · q_k + 512 ) >> 10,

where ls[0][m] ∈ {40, 45, 51, 57, 64, 72} denotes the level scale values introduced above.

B. Quantization Improvements
If one only considers scalar quantization, the restriction to URQs has no negative impact on coding efficiency. When combined with a suitable entropy coding and encoder decision, URQs can achieve virtually the same rate-distortion efficiency as optimal scalar quantizers for typical distributions of transform coefficients [10], [11]. However, even for statistically independent transform coefficients, the usage of scalar quantization results in an unavoidable loss in coding efficiency relative to the fundamental rate-distortion bound. This gap can only be reduced by using vector quantizers (VQs) [12].
VVC includes two advanced techniques for quantizing transform coefficients that are referred to as sign data hiding and trellis-coded quantization. Both have properties of VQs, but also represent simple extensions of URQs and require only minimal changes of the decoding process. Since these approaches cannot utilize statistical dependencies of the input data, the basic concept of transform coding is not modified. The dependencies between residual samples are exploited by applying the quantization in the transform domain and by using an appropriate entropy coding method.

Fig. 1. Scalar quantizers Q0 and Q1. The circles indicate the reconstruction levels and the labels represent the associated quantization indexes.

C. Sign Data Hiding
Sign data hiding (SDH) [13]-[15] is a technique that is already included in HEVC and has not been modified in the context of VVC. Consider a block of reconstructed transform coefficients {t'_k} that is represented by a corresponding block of quantization indexes {q_k}, with t'_k = ∆_k · q_k. The basic idea of SDH is to omit the coding of the sign for one nonzero index in {q_k} and instead derive it from the parity of the sum of absolute values |q_k|. In comparison to scalar quantization with the same step sizes ∆_k, SDH saves about 1 bit per block, which for suitably large blocks outweighs the average increase in distortion. Note, however, that an encoder has to carefully select quantization indexes {q_k} that obey the sign hiding condition in order to achieve coding efficiency improvements.
In HEVC and VVC, SDH is applied on the basis of so-called coefficient groups (CGs), which represent groups of successive levels q_k in coding order (see Section III); in most cases, they include 16 levels. If the difference between the scan indexes of the last and first nonzero level (in coding order) inside a CG is greater than 3, the sign for the last nonzero level of the CG is not coded but derived based on the sum of absolute values, Σ_{k∈CG} |q_k|, where odd sums indicate negative values. At the decoder side, SDH does not require any change of the scaling in (4); only the entropy coding of sign flags is modified.
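The decoder-side sign derivation is a one-line parity check; a minimal sketch with an illustrative function name:

```python
def decode_hidden_sign(cg_levels):
    """Derive the hidden sign of the last nonzero level of a coefficient
    group from the parity of the sum of absolute levels: an odd sum
    indicates a negative value, an even sum a positive one."""
    return -1 if sum(abs(q) for q in cg_levels) & 1 else 1
```

For example, levels [2, 0, 1, 3] sum to 6 (even), so the hidden sign is positive.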

D. Trellis-Coded Quantization
The second improvement [16], [17] employs the concept of trellis-coded quantization (TCQ), first described in [18]. Since the reconstruction process specified in the standard does not use trellis structures, the TCQ design included in VVC is also referred to as dependent quantization. TCQ was well studied in the 1990s and it was demonstrated that it can significantly outperform the best scalar quantizers [18]- [21]. Due to its simple structure, it can be applied for quantizing vectors of arbitrary dimensions.
From a decoder perspective, TCQ specifies two scalar quantizers and a procedure for switching between these quantizers. The two scalar quantizers Q 0 and Q 1 used in VVC are illustrated in Fig. 1. Similar to URQs, the reconstruction levels of both quantizers represent integer multiples of a quantization step size ∆ k . The quantizer Q 0 includes the even multiples of ∆ k and the quantizer Q 1 includes the odd multiples of ∆ k and, in addition, the value of zero. Note that both quantizers are symmetric and include the reconstruction level equal to zero. This deviation from conventional TCQ designs improves the coding efficiency at low and medium rates without requiring significant changes of the entropy coding (in comparison to using URQs). For both quantizers Q 0 and Q 1 , the selected reconstruction levels t k are indicated by integer quantization indexes q k , as illustrated by the labels in Fig. 1.
In contrast to scalar quantization, the transform coefficients of a block have to be reconstructed in a pre-defined order, which shall be indicated by the index k. The reconstruction order is chosen to be equal to the coding order of quantization indexes q_k, which additionally enables the exploitation of certain TCQ properties in the entropy coding (see Section III). Given the reconstruction order, the procedure for switching between the two quantizers Q0 and Q1 can be specified by a state machine with 2^K states (K ≥ 2), where the state s_k for a current coefficient t_k uniquely determines the quantizer used. The state s_{k+1} for the next coefficient t_{k+1} is determined by the current state s_k and the parity p_k = (q_k & 1) of the current quantization index q_k (the operator & represents a bitwise "and" in two's complement arithmetic). Even though the achievable coding efficiency increases with the number of states [18], [22], the TCQ design in VVC uses the minimal number of 4 states for limiting the required encoder complexity. The state transition and quantizer selection are given in Table I. The initial state s_0 is always set equal to zero.
The reconstruction of transform coefficients t'_k is specified as follows: First, the quantization indexes q_k for a block are mapped to integer multiplication factors q*_k for the corresponding quantization step sizes ∆_k. Then, the multiplication t'_k = q*_k · ∆_k is approximated as in the conventional URQ case. For a block with N transform coefficients, the calculation of the factors q*_k can be specified by the following algorithm, where stateTransTable represents the 4×2 state transition table given in Table I and sgn(·) denotes the signum function:

    state = 0
    for k = 0 to N − 1 (in reconstruction order):
        if state < 2:  q*_k = 2 · q_k                  (quantizer Q0)
        else:          q*_k = 2 · q_k − sgn(q_k)       (quantizer Q1)
        state = stateTransTable[ state ][ q_k & 1 ]

The distance between two neighboring reconstruction levels in the quantizers Q0 and Q1 is in most cases 2∆_k, and not ∆_k as for URQs. For obtaining approximately the same reconstruction quality for a given QP, regardless of whether TCQ is enabled, the quantization step sizes ∆_k have to be scaled for TCQ. As verified experimentally, a scaling factor of 2^(−5/6), corresponding to a QP decrease of 5, represents a suitable choice. Hence, when TCQ is enabled, the scaling t'_k = q*_k · ∆_k is specified by re-using (4), but with q_k being replaced by q*_k and the modified parameters b = B + β − 4, p = ⌊(QP + 1)/6⌋, and m = (QP + 1) % 6.
For supporting TCQ at the decoder side, only three changes are required: (1) An additional mapping from levels q k to multiplication factors q * k ; (2) a modification of the scaling parameters b, p, and m; and (3) a state-dependent context selection, which will be described in Section III.
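The mapping from levels q_k to multiplication factors q*_k can be sketched as follows. The 4×2 state transition table is the one used in VVC; the function name and the list-based interface are illustrative:

```python
# State transition table: next state = STATE_TRANS[state][parity of q_k]
STATE_TRANS = [[0, 2], [2, 0], [1, 3], [3, 1]]

def dequant_factors(levels):
    """Map quantization indexes to multiplication factors q*_k,
    walking the 4-state machine in reconstruction order."""
    state, out = 0, []
    for q in levels:
        if state < 2:                        # states 0, 1 use Q0 (even multiples)
            out.append(2 * q)
        else:                                # states 2, 3 use Q1 (odd multiples and 0)
            out.append(2 * q - (1 if q > 0 else -1 if q < 0 else 0))
        state = STATE_TRANS[state][q & 1]    # advance by parity of the index
    return out
```

For the index sequence [1, −1, 2, 0] this yields the factors [2, −1, 3, 0]: the first index is quantized with Q0, the odd parity then switches the state machine to Q1 for the remaining coefficients.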

E. Encoder Operation
Even though the quantization process at the encoder, i. e., the algorithm for selecting the levels q_k, is outside the scope of the standard, it has a significant impact on coding efficiency. State-of-the-art video encoders often use algorithms that select the levels q = {q_k} for a block by minimizing a Lagrangian function J(q) = D(q) + λR(q) of the MSE distortion D(q) and the number of bits R(q) required for transmitting the levels [23]-[25]. The Lagrange multiplier λ determines the operating point and is typically set depending on a base QP. These approaches take into account dependencies between levels q_k that are introduced in the entropy coding and are referred to as rate-distortion optimized quantization (RDOQ). An RDOQ algorithm suitable for URQs and the entropy coding design in HEVC and VVC is described in [26]. This algorithm is also implemented in the reference encoders HM [27] and VTM [28] for HEVC and VVC, respectively.
1) Sign Data Hiding: When SDH is enabled, an encoder has to ensure that the sign hiding condition (the parity of the sum of absolute levels correctly indicates the sign of the last nonzero index) is met for all CGs. This is typically achieved as follows [15]. First the RDOQ algorithm for URQs is applied. Then, in a second step, for all CGs for which the sign condition is violated, one of the levels q k is increased or decreased by one. The corresponding level as well as the direction of the change are selected by minimizing the Lagrange cost J(q).
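The parity fix-up step can be sketched with a distortion-only cost (a real encoder minimizes the full Lagrangian cost J(q) including rate terms; the function name and interface are illustrative):

```python
def enforce_sdh(levels, coeffs, delta):
    """If the parity of sum(|q|) does not match the sign of the last
    nonzero level, change one level by +/-1, choosing the change with
    the smallest distortion increase (rate terms omitted in this sketch)."""
    nz = [i for i, q in enumerate(levels) if q != 0]
    last = nz[-1]
    want_odd = levels[last] < 0              # odd sum must indicate a negative sign
    if (sum(abs(q) for q in levels) % 2 == 1) == want_odd:
        return levels                        # condition already satisfied
    best, best_cost = None, float("inf")
    for i in range(len(levels)):
        for d in (-1, 1):
            q = levels[i] + d
            if i == last and q == 0:
                continue                     # the last level must stay nonzero
            # increase in squared error caused by the candidate change
            c = (coeffs[i] - q * delta) ** 2 - (coeffs[i] - levels[i] * delta) ** 2
            if c < best_cost:
                best, best_cost = (i, d), c
    out = list(levels)
    out[best[0]] += best[1]
    return out
```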
2) Trellis-Coded Quantization: The quantizer switching in TCQ introduces dependencies, which have to be taken into account for achieving a good coding efficiency. The potential transitions between the quantizers Q 0 and Q 1 can be elegantly represented by a trellis with 4 states per coefficient [18]. The selection of indexes q for a block is then equivalent to finding the path with minimum J(q) through the trellis.
For a better consideration of certain entropy coding aspects (coding of the last position), the algorithm [17] implemented in the VTM software uses a trellis with 5 states, as shown in Fig. 2. In addition to the states 0-3, it includes an "uncoded" state, which represents levels equal to 0 that precede the first nonzero level in coding order. Note that the start and end states s_{k−1} and s_k, respectively, of a connection between two nodes uniquely determine the quantizer used and the parity of the associated level q_k. For each connection, the candidate level q_k that minimizes the difference |t_k − t'_k(q_k)| between the original and reconstructed coefficients is determined first. Then, the final levels q = {q_k} for a block are selected among these candidates by applying the Viterbi algorithm [29]. The cost assigned to a connection represents the contribution D_k(q_k) + λR_k(q_k | ·) of the associated candidate q_k to the overall cost J(q). Preceding levels in the trellis paths are taken into account for calculating the rate terms R_k(q_k | ·).
There are several possibilities for speeding up the encoding process, for example, by approximating the Lagrange costs or pruning unlikely connections. The VTM reference encoder uses a simple method that significantly reduces the encoder run time for typical video bit rates, at which most high-frequency coefficients are small compared to the quantization step size. In an initial step, the first original coefficient t i in coding order with |t i | > ∆ i /2 is determined. The levels for all coefficients that precede this coefficient in coding order are set equal to zero and the Viterbi algorithm starts at k = i.
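A distortion-only Viterbi search over the basic 4-state trellis can be sketched as follows (the VTM encoder additionally uses the 5-state trellis of Fig. 2 and adds λ·rate terms to the connection costs; names and the candidate search are illustrative):

```python
STATE_TRANS = [[0, 2], [2, 0], [1, 3], [3, 1]]   # Table I: next state per parity

def _recon(q, quant):
    """Multiplication factor q*: Q0 -> even multiples, Q1 -> odd multiples and 0."""
    if quant == 0 or q == 0:
        return 2 * q
    return 2 * q - (1 if q > 0 else -1)

def _best_q(t, delta, quant, parity):
    # cheapest index with the requested parity; a small window around the
    # scalar estimate is sufficient for this sketch
    est = int(round(t / (2 * delta)))
    cands = [q for q in range(est - 3, est + 4) if (q & 1) == parity]
    return min(cands, key=lambda q: abs(t - _recon(q, quant) * delta))

def tcq_quantize(coeffs, delta):
    """Viterbi search for the index path with minimum squared error."""
    INF = float("inf")
    cost, path = [0.0, INF, INF, INF], [[], [], [], []]   # initial state s0 = 0
    for t in coeffs:
        ncost, npath = [INF] * 4, [None] * 4
        for s in range(4):
            if cost[s] == INF:
                continue
            quant = 0 if s < 2 else 1        # states 0,1 -> Q0; states 2,3 -> Q1
            for parity in (0, 1):            # each parity is one trellis connection
                q = _best_q(t, delta, quant, parity)
                d = (t - _recon(q, quant) * delta) ** 2
                nxt = STATE_TRANS[s][parity]
                if cost[s] + d < ncost[nxt]:
                    ncost[nxt], npath[nxt] = cost[s] + d, path[s] + [q]
        cost, path = ncost, npath
    return path[min(range(4), key=lambda s: cost[s])]
```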

F. Quantization Control
As described above, VVC supports three quantizer designs (URQs, SDH, and TCQ) with different trade-offs between achievable coding efficiency and encoder complexity. An encoder can select the one that best suits the application requirements. The choice is indicated in the slice header.
For enabling both block-based rate control algorithms and perceptually optimized encoding approaches (e. g., [30]), the QPs can be selected on a block basis. The corresponding blocks are called quantization groups (QGs); their sizes are indicated in the picture header. The QPs for the luma component are coded differentially. For each QG that contains nonzero levels, the difference between the QP used and a prediction derived from spatially neighboring QGs is transmitted. For the chroma components, the QPs are derived from the luma QP of the co-located block via look-up tables. There are three different tables, one for the Cb component, one for the Cr component, and another one that is explicitly used for the JCCR modes with |m| = 2 (see Section V). For supporting a wide range of transfer functions and color formats, an encoder has the freedom to choose suitable look-up tables. They are defined by piece-wise linear mapping functions that are coded in the sequence parameter set. VVC supports QP values in the range from 0 to 63 + 6(B − 8), inclusive, where B denotes the bit depth of the corresponding color component.
As noted above, the quantization of transform coefficients can be additionally controlled by weighting matrices, which are specified using scaling lists. The main motivation is that the usage of frequency-dependent quantization step sizes can help an encoder to better take the contrast sensitivity behavior of human vision into account. In total, VVC includes 28 scaling lists, each defining weighting factors for a 2×2, 4×4, or 8×8 array of coefficients. The scaling lists can be transmitted in a high-level syntax structure referred to as adaptation parameter set; similarities between the different lists are exploited using predictive coding. The list that is used for a transform block is determined by the color component, the prediction mode, and the maximum of the width and height of the block. For block sizes not equal to 2×2, 4×4, or 8×8, the weighting matrices are resampled using nearest neighbor interpolation.
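The nearest-neighbor resampling of a base weighting matrix to the actual transform block size can be sketched as follows; this illustrates the idea only and does not reproduce the exact shift-based index derivation of the specification:

```python
def resample_scaling_list(base, W, H):
    """Resample a square base weighting matrix (e.g. 8x8) to a W x H
    block by nearest-neighbor mapping of coefficient positions."""
    n = len(base)                            # base matrix is n x n
    return [[base[(y * n) // H][(x * n) // W] for x in range(W)]
            for y in range(H)]
```

For example, upsampling a 2×2 matrix to 4×4 simply repeats each weight in a 2×2 neighborhood.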
III. TRANSFORM COEFFICIENT CODING

Similarly as HEVC, VVC employs context-based adaptive binary arithmetic coding (CABAC) for entropy coding of all low-level syntax elements. Non-binary syntax elements are mapped to binary codewords. The bijective mapping between symbols and codewords, for which typically simply structured codes are used, is called binarization. The binary symbols, also called bins, of both binary syntax elements and codewords for non-binary data are coded using binary arithmetic coding. The core coding engine, which is further discussed in Section IV, supports two operating modes: a regular mode, in which the bins are coded with adaptive probability models, and a less complex bypass mode that uses fixed probabilities of 1/2. The adaptive probability models are also called contexts and the assignment of probability models to individual bins is referred to as context modeling. Note that both the binarization and the context modeling used have a significant impact on coding efficiency. The required encoder and decoder complexities primarily increase with the number of context-coded bins (i. e., bins coded in regular mode). But they are also affected by other aspects such as the degree of dependencies between successive bins, the complexity of the context modeling used, or the frequency with which a switching between the regular and bypass modes of the arithmetic coding engine occurs.
The entropy coding of quantization indexes for transform blocks is commonly referred to as transform coefficient coding. Since, at typical video bit rates, transform coefficient levels consume the major part of the total bit rate, it is important to find a reasonable trade-off between coding efficiency and implementation complexity. The basic concept of the transform coefficient coding in VVC is similar to the coefficient coding specified in HEVC [15]:
1) A coded block flag (CBF) indicates whether a transform block includes any nonzero levels;
2) For blocks with CBF equal to 1, the x and y coordinates of the last nonzero level in forward scan order are transmitted;
3) Starting from the indicated last position, the levels are transmitted in reverse scan order, organized into so-called coefficient groups (CGs).
The bins for a CG are coded in multiple passes, where all bypass-coded bins are grouped together in order to enable efficient implementations. Since VVC supports a larger range of transform sizes than HEVC, some aspects of the transform coefficient coding were generalized. In contrast to HEVC, the scan order does not depend on the intra prediction mode, as such a mode-dependent scan was found to provide only negligible improvements and would unnecessarily complicate the design. Moreover, the context modeling for the bins representing levels is independent of the block size; there are no exceptions for certain block shapes. Instead, the context dependency restrictions found in HEVC are relaxed and local statistical dependencies between levels are utilized for increasing coding efficiency.
For enabling the exploitation of certain TCQ properties, the binarization for levels includes a parity bin and all context-coded bins of a CG are coded in a single pass. VVC uses a transform-block-based restriction on the number of context-coded bins to keep a similar worst-case complexity as HEVC.

A. Coded Block Flag
The coded block flag (CBF) is coded in the regular mode of the coding engine. In total, 9 contexts are used (4 for luma, 2 for Cb, and 3 for Cr). One context per component is reserved for blocks coded in BDPCM mode (a special variant of the transform skip mode, see [8]). For luma, two contexts are used only for transform blocks coded in the intra sub-partitioning mode (see [31]); here, the chosen context depends on the CBF of the preceding luma transform block inside the same coding unit. In order to exploit statistical dependencies between the CBFs of the chroma components, the context for Cr blocks not coded in BDPCM mode is selected depending on the CBF of the co-located Cb block.

B. Coefficient Groups and Scan Order
The transform coefficient levels {q} of a W ×H transform block are arranged in a W ×H matrix. For enabling a harmonized processing across all block sizes (see also [15]), but also for increasing coding efficiency for transform blocks, in which the signal energy is concentrated into transform coefficients that correspond to low horizontal or low vertical frequencies, transform blocks are partitioned into coefficient groups (CGs). As further detailed in Section III-D, the levels for each CG are coded in a unified manner using multiple scan passes. Since VVC also supports block sizes with widths and heights less than 4, the shape of CGs depends on the transform block size as shown in Table II. For transform blocks with at least 16 coefficients, the CGs always include 16 levels; for smaller blocks, CGs of 2×2 levels are used.
The coding order of CGs is given by the reverse diagonal scan illustrated in Fig. 3. Independent of the CG size, the CG diagonals are processed from the bottom right to the top left of a transform block, where each diagonal is scanned in down-left direction. For limiting the worst-case decoder complexity, high-frequency coefficients of large transforms are forced to be equal to zero [6]. Nonzero quantization indexes can only be present in a min(W, 32)×min(H, 32) region at the top-left of a transform block. Hence, CGs outside this region are not coded and thus excluded from the scan, as is illustrated in Fig. 3d. The coding order of levels inside CGs is specified by the same reverse diagonal scan.
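The reverse diagonal scan described above can be generated with a few lines of Python; the function name is illustrative:

```python
def reverse_diag_scan(w, h):
    """Coding order of (x, y) positions in a w x h array: diagonals are
    processed from bottom-right to top-left, each diagonal scanned in
    down-left direction (decreasing x, increasing y)."""
    order = []
    for d in range(w + h - 2, -1, -1):            # d = x + y, largest first
        for x in range(min(d, w - 1), max(0, d - h + 1) - 1, -1):
            order.append((x, d - x))
    return order
```

For a 2×2 group this yields the order (1,1), (1,0), (0,1), (0,0), i.e., the scan starts at the bottom-right position.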

C. Last Significant Coefficient Position
Similar as in HEVC, the explicit coding of zero quantization indexes for coefficients related to high-frequency components is eliminated by transmitting the position of the last nonzero level in forward scan order (which is the first nonzero level in coding order). This not only increases coding efficiency, but also reduces the number of context-coded bins.
The x and y coordinates corresponding to the column and row number, respectively, in the matrix of coefficient levels are transmitted independently of each other. As shown in Table III, each coordinate is represented by a combination of a prefix codeword and a (possibly empty) suffix codeword. The prefix part specifies an interval of values. It is binarized using a truncated unary (TU) binarization and the bins are coded in regular mode. The prefix part indicating the last interval of the non-zero-out region of a transform block is truncated. That means, the zero bins in parentheses shown in Table III are not coded if min(W, 32), for the x coordinate, or min(H, 32), for the y coordinate, is equal to the number in the last table column. In particular, the coding of a coordinate is completely skipped if the corresponding block width or height is equal to 1. The suffix part represents the offset inside the interval specified by the prefix; it consists of a fixed-length codeword whose bins are coded in bypass mode.

Let v_pre be the number of bins equal to 1 in the prefix codeword. Then, the number n_suf of suffix bins to be decoded is derived by

n_suf = max(0, ⌊v_pre / 2⌋ − 1).

With v_suf being the value specified by the suffix codeword (in binary representation), the decoded coordinate value last is calculated according to

last = v_pre,  if v_pre ≤ 3;
last = ((2 + (v_pre & 1)) << n_suf) + v_suf,  otherwise.

The prefix part for the x coordinate is signaled first, followed by that for the y coordinate. For grouping bypass-coded bins, the suffix parts are coded after the prefix codewords. The prefix bins of the x and y coordinates are coded using separate sets of context models. Table IV lists the context offsets that indicate the probability model used inside a set. The model chosen depends on whether a luma or chroma block is coded, the width or height of the transform block, and the bin number inside the prefix codeword.

TABLE IV
CONTEXT INDICES FOR PREFIX BINS OF LAST COEFFICIENT COORDINATES

        transform                  bin index
        dimension    0   1   2   3   4   5   6   7   8
  luma      2        0   1
            4        0   1   2
            8        3   3   4   4   5
           16        6   6   7   7   8   8   9
           32       10  10  11  11  12  12  13  13  14
           64       15  15  16  16  17  17  18  18
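The decoder-side mapping from prefix and suffix values to the coordinate can be sketched as follows; the function name is illustrative and the formulas are the ones implied by the TU-prefix binarization:

```python
def decode_last_coordinate(v_pre, v_suf=0):
    """Reconstruct a last-position coordinate from the number of 1-bins
    in the prefix (v_pre) and the fixed-length suffix value (v_suf)."""
    n_suf = max(0, v_pre // 2 - 1)          # number of suffix bins
    if v_pre <= 3:
        return v_pre                        # short prefixes code the value directly
    # interval start (2 + parity of prefix) << n_suf, plus offset in interval
    return ((2 + (v_pre & 1)) << n_suf) + v_suf
```

For example, a prefix of 6 with suffix value 3 decodes to 11, the last value of the interval [8; 11].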
Note that for large transform blocks, where zero-out is present, the transform dimension (and not the dimension of the non-zero-out region) is used to derive the context offset. In total, 46 contexts (40 for luma and 6 for chroma) are used for coding the last coefficient position.

D. Binarization and Coding Order
Starting with the CG containing the last nonzero level (as indicated by the x and y coordinates), the CGs are transmitted in coding order (given by the reverse diagonal scan). The first syntax element coded for a CG is the sb_coded_flag. If this flag is equal to 0, it indicates that the CG contains only zero levels. For the first CG (which contains the last nonzero level) and the last CG (which contains the DC level), this flag is not coded, but inferred to be equal to 1. The sb_coded_flag is coded in regular mode. The chosen context depends on whether the CG to the right or the CG below contains any nonzero levels, where separate context sets are specified for luma and chroma. In total, 4 contexts (2 for luma and 2 for chroma) are used.

For CGs with sb_coded_flag equal to 1, the level values are coded as described in the following. The binarization of coefficient levels and the coding order of bins were chosen to support an efficient entropy coding for both TCQ and conventional quantization. Due to the different structures of the two scalar quantizers Q0 and Q1 used in TCQ (see Fig. 1), the probability that a level is equal to 0 highly depends on the quantizer used. For exploiting this effect in context modeling (Section III-E) and, at the same time, grouping the context- and bypass-coded bins, the binarization includes a dedicated parity flag that is used for determining the TCQ state during entropy coding [32]. By additionally taking into account the number of context-coded bins required for achieving a good coding efficiency [33], [34] as well as the dependencies between successive bins [35], the binarization shown in Table V was chosen.

TABLE V
BINARIZATION OF ABSOLUTE VALUES OF QUANTIZATION INDEXES

  |q|    0  1  2  3  4  5  6  7  8  9 10 11 ···
  sig    0  1  1  1  1  1  1  1  1  1  1  1 ···
  gt1    -  0  1  1  1  1  1  1  1  1  1  1 ···
  par    -  -  0  1  0  1  0  1  0  1  0  1 ···
  gt3    -  -  0  0  1  1  1  1  1  1  1  1 ···
  rem    -  -  -  -  0  0  1  1  2  2  3  3 ···
The absolute values |q| of the quantization indexes are mapped to the bins sig (significance), gt1 (greater than 1), par (parity), gt3 (greater than 3), and the non-binary remainder rem.
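The mapping of Table V can be sketched as follows (an illustrative sketch, not the normative specification text; `None` marks bins that are not coded):

```python
# Map an absolute quantization index |q| to the bins of Table V and back.
# Bin names follow the paper: sig, gt1, par, gt3 and the remainder rem.

def binarize_abs_level(q_abs):
    """Return the Table V bins for |q| (None = bin not coded)."""
    bins = {"sig": 1 if q_abs > 0 else 0, "gt1": None, "par": None,
            "gt3": None, "rem": None}
    if q_abs > 0:
        bins["gt1"] = 1 if q_abs > 1 else 0
    if q_abs > 1:
        bins["par"] = (q_abs - 2) & 1          # parity of |q| - 2
        bins["gt3"] = 1 if q_abs > 3 else 0
    if q_abs > 3:
        bins["rem"] = (q_abs - 4) >> 1         # remainder rem = (|q| - 4) / 2
    return bins

def reconstruct_abs_level(bins):
    """Inverse mapping: |q| = sig + gt1 + par + 2*gt3 + 2*rem."""
    q = (bins["sig"] or 0) + (bins["gt1"] or 0)
    q += (bins["par"] or 0) + 2 * (bins["gt3"] or 0)
    q += 2 * (bins["rem"] or 0)
    return q
```

Note how the parity flag par makes the reconstruction additive, which is what allows the TCQ state machine to be driven by context-coded bins of the first pass.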
The syntax elements for a CG are coded in multiple passes over the scan positions. Unlike HEVC, where a single syntax element per coefficient is coded per scan pass, VVC codes up to 4 syntax elements per coefficient in a single pass. In the first pass, the context-coded bins sig, gt1, par, and gt3 are coded in an interleaved manner (i.e., all bins for a scan position are coded before proceeding to the next scan position). Note that the parity bin driving the TCQ state machine is included in the first pass for enabling an efficient coding of the sig bin for the TCQ case. For scan positions for which the sig bin can be inferred to be equal to 1 (e. g., for the last significant position), it is not signaled. The presence of the gt1, par, and gt3 bins is controlled as specified in Table V. The non-binary remainders rem are coded in a second scan pass. They are binarized using similar parametric codes as in HEVC and the resulting bins are coded in the bypass mode of the coding engine.
In order to increase the worst-case throughput, the number of context-coded bins that can be coded in the first pass is restricted [33], [35]. For allowing a suitable distribution of context-coded bins across CGs, the limit is specified on a transform block basis. With N being the number of transform coefficients in the non-zero-out region of a transform block, the maximum allowed number of context-coded bins is set to 1.75×N. This would correspond to 28 bins per CG if the bin budget was distributed equally among CGs, which is only slightly higher than the limit specified in HEVC (25 bins). The limit on context-coded bins is enforced as follows: If, at the start of a scan position, the total number of already coded sig, gt1, par, and gt3 bins for the transform block exceeds 1.75×N − 4, i. e., fewer than 4 bins are remaining in the budget, the first coding pass is terminated. In that case, the absolute values |q| for the remaining scan positions are coded in a third scan pass. They are represented by syntax elements decAbsLevel, which are completely coded in bypass mode. Finally, in the fourth and last pass, the signs for all nonzero levels of a CG are coded in bypass mode. If SDH is enabled and the difference between the scan indexes of the last and first nonzero level inside the CG is greater than 3, the sign for the last nonzero level is not signaled. Fig. 4 illustrates the organization of level data into the different scan passes.
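The budget rule can be sketched as follows (a simplified illustration: it ignores inferred sig bins, e.g. at the last significant position, and simply reports where the first pass would terminate):

```python
# Hedged sketch of the context-coded-bin budget described above: the first
# pass stops once fewer than 4 of the 1.75*N context-coded bins remain;
# later positions fall back to fully bypass-coded decAbsLevel values.

def split_first_pass(levels):
    """Return the index of the first scan position coded as decAbsLevel."""
    n = len(levels)
    budget = (7 * n) >> 2              # 1.75 * N context-coded bins
    used = 0
    for i, q_abs in enumerate(levels):
        if used > budget - 4:          # fewer than 4 bins remain: stop pass 1
            return i
        used += 1                      # sig bin
        if q_abs > 0:
            used += 1                  # gt1 bin
        if q_abs > 1:
            used += 2                  # par and gt3 bins
    return n
```

For a CG of 16 zero levels, only 16 sig bins are used and the pass covers all positions; for 16 large levels (4 context-coded bins each), the budget of 28 bins is exhausted after 7 positions.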

E. Context Modeling
In order to efficiently utilize conditional statistics for arithmetic coding, VVC uses a rather large set of context models for coding the bins sig, gt1, par, and gt3. Beside the TCQ state, the context modeling also exploits statistical dependencies between spatially neighboring quantization indexes, similar to the approaches described in [36]-[38].
The context for the sig bin depends on the associated TCQ state s_k, the diagonal position d = x + y of the coefficient in the transform block, and the sum of partially reconstructed absolute levels q* inside the local template T illustrated in Fig. 5a. The partially reconstructed absolute levels are given by already coded bins for neighboring scan positions and can be calculated according to

q* = sig + gt1 + par + 2·gt3 .    (8)

For luma blocks, the context index c_sig^lum indicating the adaptive probability model used is derived according to

c_sig^lum = 12·max(0, s_k − 1) + f_sig(T) + (8 if d < 2; 4 if 2 ≤ d < 5; 0 otherwise),

with

f_sig(T) = min( 3, ⌊( 1 + Σ_T q* ) / 2⌋ )

being a function of the partially reconstructed levels q* inside the local template T. For chroma blocks, only two classes of diagonal positions (d < 2 and d ≥ 2) are distinguished and the context index is given by

c_sig^chr = 8·max(0, s_k − 1) + f_sig(T) + (4 if d < 2; 0 otherwise).

When TCQ is not enabled, the value of the TCQ state s_k is set equal to 0. In total, 60 context models are used for coding the sig bin (36 for luma and 24 for chroma).
The probability models chosen for gt1, par, and gt3 do not depend on the TCQ state, as this was found to provide only a very minor benefit. A single shared context offset is computed to select the probability model for these syntax elements. It is chosen based on the diagonal position d of the coefficient (4 classes for luma and 2 for chroma) and the sum of the values max(0, q* − 1) inside the local template T. With

f_gt(T) = min( 4, Σ_T max(0, q* − 1) )

being another function of the partially reconstructed levels q* inside the local template T, the context indexes c_lum and c_chr for luma and chroma blocks, respectively, are given by

c_lum = 1 + f_gt(T) + (15 if d = 0; 10 if 1 ≤ d < 3; 5 if 3 ≤ d < 10; 0 otherwise),
c_chr = 1 + f_gt(T) + (5 if d = 0; 0 otherwise).

In addition, for the last significant coefficient position, a separate context (given by c_lum = 0 and c_chr = 0) is used. For each of the gt1, par, and gt3 bins, 32 probability models (21 for luma and 11 for chroma) are used in total.
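A sketch of the luma context derivation for the sig bin is given below. The constants (12 contexts per TCQ-state class, diagonal offsets 8/4/0, the cap of the template function at 3) follow the H.266 design as the authors understand it and should be read as an illustration rather than normative text:

```python
# Context index for the sig bin of a luma coefficient.

def sig_ctx_luma(state, d, template_sum):
    """state: TCQ state s_k (0 when TCQ is disabled);
    d: diagonal position x + y inside the transform block;
    template_sum: sum of partially reconstructed levels q* in template T."""
    f_sig = min(3, (template_sum + 1) >> 1)        # template class, 0..3
    diag = 8 if d < 2 else (4 if d < 5 else 0)     # diagonal class offset
    return 12 * max(0, state - 1) + f_sig + diag   # 36 luma contexts in total
```

The three factors (3 state classes × 3 diagonal classes × 4 template classes) multiply out to the 36 luma contexts mentioned above.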

F. Binarization of Bypass-Coded Level Data
The syntax elements rem coded in the second pass represent remainders for absolute levels. They are only transmitted for a scan position if the associated gt3 bin is equal to 1. With q* being a partially reconstructed level according to (8), the absolute value |q| of the level is given by

|q| = q* + 2·rem .

The remainders rem and the syntax elements decAbsLevel, which represent absolute levels coded in the third pass, are binarized using a combination of truncated Rice (TR) and Exp-Golomb (EG) codes, similar to remainder values in HEVC. The resulting bins are coded in the bypass mode of the coding engine. Unlike HEVC, the Rice parameter m for the TR codes is derived, via a table look-up, based on the sum of absolute level values |q| inside a local template, with

s_T = Σ_T |q| − 5·z_0 ,    (16)

where z_0 is set equal to 4 for coding the remainders rem, and it is set equal to 0 for coding decAbsLevel. The reason for this difference is that the values of decAbsLevel specify complete absolute levels, while the remainders rem represent differences rem = (|q| − q*)/2, which have smaller values.
For each Rice parameter m, values less than v_max = 2^m · 6 are coded using only TR codes of order m (TR_m); this corresponds to codes with a unary prefix of length 6. For values greater than or equal to v_max, the TR_m codes are concatenated with Exp-Golomb codes of order m+1 (EG_{m+1}). Table VI shows the binarization for Rice parameters m = 0 and m = 3 with a concatenation of TR_m and EG_{m+1} codes. Bold bins in the table correspond to the TR_m portion of the binarization. When the combined code length would exceed 32 bins, the binarization is slightly modified [39]. In this case, the length of the Exp-Golomb prefix is limited to 11 bins (see underlined entry for m = 0 in Table VI) and the remaining 15 bins of the 32 bit budget are used to represent the suffix part.
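The concatenated TR/EG binarization can be sketched as follows (a hedged illustration following the description above; the 32-bin length limit and the associated prefix modification are omitted):

```python
# Binarize a value v with Rice parameter m: a TR_m code for v < 6 << m,
# otherwise a unary prefix of six 1s followed by an EG code of order m+1
# for the excess v - v_max.

def tr_eg_binarize(v, m):
    """Return the bin string for value v and Rice parameter m."""
    v_max = 6 << m
    if v < v_max:
        prefix = "1" * (v >> m) + "0"                       # TR unary part
        suffix = format(v, "b").zfill(m)[-m:] if m else ""  # m-bit remainder
        return prefix + suffix
    bins = "111111"                                         # full TR prefix
    x, k = v - v_max, m + 1                                 # EG_{m+1} escape
    while x >= (1 << k):
        bins += "1"
        x -= 1 << k
        k += 1
    return bins + "0" + format(x, "b").zfill(k)
```

For m = 3, the value 23 yields the prefix "110" (quotient 2) and suffix "111" (remainder 7); values of 48 and above use the Exp-Golomb escape.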
For increasing the coding efficiency for completely bypass-coded levels [33], the values of decAbsLevel do not represent the absolute level values |q| directly, but are derived as

decAbsLevel = ( pos_0, if |q| = 0;  |q| − 1, if 0 < |q| ≤ pos_0;  |q|, otherwise ).    (17)

These values are coded using the same binarization as for the remainders rem. Note that the parameter pos_0 basically specifies the position of the codeword for |q| = 0 in a reordered codeword table. It is derived based on the Rice parameter m and the TCQ state s_k according to

pos_0 = 2^m, for s_k < 2,  and  pos_0 = 2^{m+1}, for s_k ≥ 2 .    (18)
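The codeword reordering is a simple bijection, sketched below (function names are illustrative):

```python
# Map |q| to the transmitted decAbsLevel value and back, moving the
# codeword for |q| = 0 to position pos_0 of the reordered table.

def pos0(m, state):
    """pos_0 = 2^m for TCQ states 0, 1 and 2^(m+1) for states 2, 3."""
    return (1 if state < 2 else 2) << m

def to_dec_abs_level(q_abs, m, state):
    p0 = pos0(m, state)
    if q_abs == 0:
        return p0                       # |q| = 0 gets codeword position pos_0
    return q_abs - 1 if q_abs <= p0 else q_abs

def from_dec_abs_level(v, m, state):
    p0 = pos0(m, state)
    if v == p0:
        return 0
    return v + 1 if v < p0 else v
```

The reordering reflects that, in TCQ states 2 and 3, a zero level is less probable, so its codeword is moved further back in the table.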

G. Transform Skip Residual Coding
In addition to the regular residual coding (RRC) for transform coefficients described above, VVC also includes a dedicated entropy coding for quantization indexes in transform skip mode, which is referred to as transform skip residual coding (TSRC). It was mainly designed for improving coding efficiency for screen content and can be enabled on a slice level. When enabled, the TSRC scheme is used for coding quantization indexes of transform skip blocks; when not enabled, the quantization indexes of transform skip blocks are coded with the regular residual coding.
In contrast to the regular residual coding, the position of the last nonzero level is not transmitted and the levels are coded in forward scan order, i. e., starting from the top-left coefficient and proceeding to the bottom-right coefficient. Similar to RRC, the syntax elements for a CG are coded in multiple passes over the scan positions, and the same limit for the number of context-coded bins is applied. As long as this limit is not reached, the levels are coded using three passes, as shown in Fig. 6. In the first pass, the bins of sig, sign, gt1, and par are interleaved and context-coded using adaptive probability models. A local template, as shown in Fig. 5b, is also applied in TSRC for deriving the context indexes, but it only includes two neighboring coefficient positions. Since, in transform skip blocks, successive signs often have similar values, the sign flags are included in the first pass and are coded in the regular mode of the coding engine. If the limit of context-coded bins is still not reached after the first pass for a CG, up to four greater-than-x flags (gt3, gt5, gt7, and gt9) per coefficient are coded in a second pass. These bins are also context-coded. Finally, in a third pass, the remainders for absolute levels (rem) are coded in bypass mode. Note that the remainders can have different meanings, depending on whether the bin limit was reached for a scan position during the second pass (and, thus, no gt3 bin could be coded). For all scan positions for which no data were transmitted in the first pass, the complete absolute values (decAbsLevel) as well as the associated sign flags are coded in bypass mode in a fourth pass. The Rice parameter m for both rem and decAbsLevel is always set equal to 1. For more details on the design of TSRC, the reader is referred to [8].

IV. BINARY ARITHMETIC CODING
Context-based adaptive binary arithmetic coding (CABAC) [40] was originally introduced in AVC as one of two supported entropy coding methods. Due to its superior coding efficiency compared to conventional variable-length coding, it is the only entropy coding method supported in both HEVC and VVC. But while AVC and HEVC share the same core coding engine, VVC introduces a new engine for the regular coding mode that is designed to be more flexible and efficient.
In binary arithmetic coding, the coding engine consists of two elements: probability estimation and codeword mapping. The purpose of probability estimation is to determine the likelihood of the next binary symbol having the value 1. This estimation is based on the history of symbol values coded using the same context and typically uses an exponential decay window [41]. Given a sequence of binary symbols x(t), with t ∈ {1, ..., N}, the estimated probability p(t+1) of x(t+1) being equal to 1 is given by

p(t+1) = (1 − α)^t · p(1) + α · Σ_{k=1}^{t} (1 − α)^{t−k} · x(k) ,    (19)

where p(1) is an initial probability estimate and α is a base determining the rate of adaptation. Alternatively, this can be expressed in a recursive manner as

p(t+1) = p(t) + α · ( x(t) − p(t) ) .

The engine of AVC and HEVC implements such an exponential smoothing estimator using a single finite state machine with 128 states. VVC also uses such an estimator, but with some key differences:
• VVC maintains two estimates for each context, where each estimate uses its own base α. The probability that is actually used for coding is the average of the two estimates. The reason for using multiple estimates is to improve compression performance;
• VVC defines a different pair of bases for each context to improve compression performance;
• VVC does not use a state machine but arithmetically derives the probability estimates using the recursive function described above.
More details on the rationale for using two estimates and per-context customized bases are provided in Section IV-A.
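The recursive update and the two-hypothesis averaging can be illustrated as follows (the base values α_0 = 1/32 and α_1 = 1/8 are illustrative, not the trained per-context VVC parameters):

```python
# Two exponential smoothing estimates with their own bases; the coding
# probability for each symbol is the average of the two estimates.

def update(p, x, alpha):
    """One step of the recursive estimator: p <- p + alpha * (x - p)."""
    return p + alpha * (x - p)

def estimate_sequence(bits, p_init=0.5, alpha0=1 / 32, alpha1=1 / 8):
    """Return the probability used for coding each symbol in bits."""
    p0, p1 = p_init, p_init
    probs = []
    for x in bits:
        probs.append(0.5 * (p0 + p1))   # probability used for coding x
        p0 = update(p0, x, alpha0)      # slowly adapting hypothesis
        p1 = update(p1, x, alpha1)      # quickly adapting hypothesis
    return probs
```

Feeding a run of 1-symbols shows the averaged estimate rising from the initial 0.5 towards 1, with the fast hypothesis reacting first and the slow one stabilizing the estimate.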
In VVC, the initial estimate p(1) is derived for each context using a linear function of the quantization parameter QP, as is also done in AVC/HEVC. The main difference lies in the fact that, in VVC, the value derived in this way represents an actual probability (linear space), whereas in AVC/HEVC, it represents a state of the state machine (logarithmic space).
For codeword mapping, a current interval is split into two subintervals, each corresponding to one of the possible values of a binary symbol. The range of each subinterval is obtained by multiplying the range r of the current interval with the corresponding probability estimate. In AVC/HEVC, the multiplication is approximated using a lookup table, which determines the range r_LPS of the subinterval associated with the least probable symbol (LPS). In VVC, a direct multiplication is used instead, while keeping the same LPS convention. Once r_LPS is determined, the AVC/HEVC and VVC engines operate in identical manners.

A. Multi-Hypothesis Probability Estimation
Consider a binary source x(t) and let p_0 be the marginal probability of a symbol being equal to 1. When using the initial estimate p(1) = p_0, the expected value of the exponential smoothing estimator is given by

E[ p(t) ] = p_0 ,

which means that the estimator is unbiased. Assuming that the source is uncorrelated, i. e., E[ (x(t) − p_0)·(x(u) − p_0) ] = 0 for t ≠ u, the variance of the estimate converges, for large t, to

Var[ p(t) ] = ( α / (2 − α) ) · p_0 (1 − p_0) ,

which vanishes as α approaches 0. This implies that the optimal value of α should be 0. However, this is not observed in practice, where larger values of α are found to be optimal. The assumption that the source is uncorrelated is therefore incorrect. Fig. 7 shows the distribution of first-order auto-correlation coefficients for data collected from a set of VVC bitstreams, where the correlation coefficients were estimated on chunks of 4096 symbols for each context. The distribution, essentially confined to the range [−0.05, 0.45], is clearly biased towards positive auto-correlation coefficients.
For the two-parameter estimator used in VVC, which averages two exponential smoothing estimates with bases α_0 and α_1, the relationship between the auto-correlation coefficient of the source and the estimation error was investigated in [43]. In particular, considering a first-order auto-regressive source model with correlation coefficient ρ, 0 < ρ < 1, optimal values for α_0 and α_1 were derived as a function of ρ. As shown in Fig. 8, the first parameter α_0 should be equal to 0, and the second parameter α_1 should be chosen as a function of ρ. Using these optimal parameters, it was further shown in [43] that the two-parameter estimator outperforms the traditional one-parameter estimator for a wide range of correlation coefficients ρ. The above assumes that the initial probability estimate is set equal to p_0. In practice, this is not achievable, as p_0 may depend on the actual content of a slice. α_0 should therefore be set to a value larger than 0 so as to gradually disregard the initial estimate; the larger the value of α_0, the smaller the impact of the initial estimate. In VVC, the parameters α_0 and α_1 were selected for each context using a training algorithm that jointly optimizes these parameters and the initial probability estimates [44].

B. Implementation Considerations
Arithmetic coding is an inherently serial process: Each symbol must be processed in sequence. Throughput, measured in the number of symbols processed per second, is a key complexity metric to be considered in the design of a coding engine. Another key complexity metric is the memory requirement. A combination of hardware and software considerations has been used to design the VVC coding engine.
For probability estimation, to simplify implementation and avoid multiplications, the bases α are limited to negative integer powers of 2, i. e., α = 2^−β with β ∈ N^+. This enables implementations with bit shifting operations [45],

q(t+1) = q(t) + ( ( x(t) << b ) − q(t) ) >> β ,

where q(t) is an integer representation of p(t) with b bits. The relationship between p(t) and q(t) is given by

p(t) = q(t) / 2^b .

The value of b for each estimator is selected based on coding efficiency and memory considerations. Memory requirements are driven by the product of two numbers: the number of contexts n and the number of bits m required to capture the state for each context. n depends on context modeling, as discussed in Section III, while m is equal to the sum b_1 + b_2 of the number of bits used for each estimator. As VVC uses two estimators with different adaptation rates, a smaller number of bits is typically required for the estimator with the faster adaptation rate. Hence, b_2 = 10 and b_1 = 14 bits are used for the faster and slower estimator, respectively, yielding a total of m = 24 bits per context. This amount is significantly higher than for HEVC (7 bits) but nevertheless remains reasonable.

Fig. 9. Rotational transform of vectors (r_Cb, r_Cr) to (r_C1, r_C2) by an angle α.
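The bit-shift update and the averaging of the two estimates can be sketched in integer arithmetic as follows (the shift parameters are illustrative; VVC selects per-context values by training, and the exact combination step may differ in detail):

```python
# Integer probability estimation with the bit-shift update described above:
# q is a b-bit representation of p, and alpha = 2**-beta.

def update_q(q, x, b, beta):
    """q <- q + ((x << b) - q) >> beta (q stays within b bits)."""
    return q + (((x << b) - q) >> beta)

def combined_probability(q_slow, q_fast, b_slow=14, b_fast=10):
    """Average the two estimates, expressed at b_slow-bit precision."""
    return (q_slow + (q_fast << (b_slow - b_fast))) >> 1
```

Python's `>>` performs an arithmetic (sign-preserving) shift, matching the behavior needed when the innovation `(x << b) - q` is negative.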
Multiplications are thus avoided for probability estimation but not for subinterval range computation. While the bit width of the multiplier typically has no impact in software (latency and throughput are the same for 8-, 16-, and 32-bit multiplications), it does matter in hardware, where smaller multipliers are preferred. The size of the multiplier in VVC is thus limited to 5 by 4 bits, where 5 is the number of bits representing the probability estimate and 4 the number of bits representing the range of the current interval. With q denoting the combined probability state of a context, r_LPS is computed as

q_LPS = q ⊕ ( 32767 · (q >> 14) ) ,
r_LPS = ( ( (q_LPS >> 9) · (r >> 5) ) >> 1 ) + 4 ,

where ⊕ specifies the bit-wise "exclusive or" operator. During the development of VVC, the throughput of the coding engine was measured for optimized software implementations (see experiment 2 in [46]). In that experiment, the throughput of the VVC engine was determined to be about 7% lower than that of the AVC/HEVC engine (128.5 million symbols per second versus 137.8 million symbols per second).
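A sketch of the subinterval computation consistent with the 5-by-4-bit multiplier described above; the exact shift amounts and the +4 offset follow one possible reading of the design and should be treated as illustrative assumptions:

```python
# Subinterval range for the least probable symbol (LPS): the 15-bit combined
# state q is reduced to a 5-bit LPS probability (XOR with its sign extension
# inverts states above one half), then multiplied by the 4 leading bits of
# the 9-bit range r.

def lps_range(q, r):
    """q: combined probability state in [0, 32767]; r: range in [256, 510]."""
    q_lps = q ^ (32767 * (q >> 14))      # 32767 - q when q represents p > 1/2
    return (((q_lps >> 9) * (r >> 5)) >> 1) + 4
```

As a sanity check, a state of exactly one half (q = 16384) with r = 256 yields r_LPS = 128, i.e., the interval is split in half.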

V. JOINT CODING OF CHROMA RESIDUALS
The previous sections focused on the actual quantization and entropy coding of individual blocks of transform coefficients in the VVC standard. An efficient joint representation of multi-component residual block signals, however, was also addressed during the development of VVC. In addition to the cross-component linear model (CCLM) prediction of chroma samples from collocated luma samples [47], VVC provides a means for the joint coding of chroma residuals (JCCR), which is described in the following.
Digital images and video pictures are generally composed of multiple color components (for example, red, green, blue in RGB color formats and Y, Cb, Cr in the YCbCr color format). In natural pictures acquired via image sensors, a signal correlation can be observed between these color components, causing some redundancy to remain in the quantized residuals. JCCR [48], [49] exploits the correlation between chroma components (particularly, the Cb and Cr components in YCbCr coding) by allowing an encoder to transmit, on a transform unit (TU) basis, only one instead of two quantized residual signals, along with compact correlation angle and sign information. In the decoder, the transmitted downmixed joint residual block signal is then upmixed to the original color components, scaled according to the angle and sign information.
The following two subsections describe the JCCR mode in VVC in more detail. For further information on the fundamental concept behind the JCCR tool, the inter-component transformation (ICT), the reader is referred to [50].

A. Forward and Inverse Rotational Transform
The JCCR processing can be regarded as a switchable inter-component rotational transform applied in addition to conventional intra-component spatial transforms like the DCT-II [50], with the purpose of achieving increased compaction of residual energy into a single component on a block basis. As illustrated in Fig. 9, this rotational transform on two residual blocks r_Cb and r_Cr is controlled by an angle α. Conceptually, the forward and inverse transforms are given by

( r_C1, r_C2 )^T = T_α · ( r_Cb, r_Cr )^T  and  ( r_Cb, r_Cr )^T = T_α^{−1} · ( r_C1, r_C2 )^T ,

with the forward transform matrix corresponding to a rotation by the angle α,

T_α = (  cos α   sin α ;  −sin α   cos α ) .

In the JCCR modes supported in VVC, the samples of the second component r_C2 are enforced to be equal to zero. Hence, at the decoder side, both color components r_Cb and r_Cr are reconstructed from the transmitted downmix block signal r_C1. At the encoder side, the rotation angle α can be selected from a predefined set of values. Typically, an encoder would select the angle α_opt that yields the lowest rate-distortion cost (when considering the reconstruction of the Cb and Cr components). Naturally, an encoder can also disable JCCR on a TU basis (for example, if it would decrease coding efficiency); then, the residual blocks for Cb and Cr are coded separately.
VVC supports, in total, 6 different rotation angles α, which can be indicated by an angular mode m. The rotation angles α and the corresponding weights of T_α^{−1} for all supported modes m are shown in Table VII. Note that multiplications by 1/2 can be realized efficiently using bit shifts to the right. Thus, JCCR upmixing, which follows the inverse spatial transforms in the decoder, is given by

r_Cb = r_C1 ,  r_Cr = ( c_sign · r_C1 ) >> 1 ,  for |m| = 1,    (34)
r_Cb = r_C1 ,  r_Cr = c_sign · r_C1 ,           for |m| = 2,    (35)
r_Cr = r_C1 ,  r_Cb = ( c_sign · r_C1 ) >> 1 ,  for |m| = 3,    (36)

where c_sign ∈ {−1, 1} represents the sign of the mode m or, equivalently, the sign of the rotation angle α.
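The per-sample upmixing of (34)-(36) can be sketched as follows (rounding of negative half values via the right shift follows Python's floor semantics here and may differ in detail from the normative integer arithmetic):

```python
# Reconstruct the chroma residual pair from the transmitted joint residual
# r_c1 for the three JCCR mode magnitudes; c_sign is +1 or -1.

def jccr_upmix(r_c1, mode_abs, c_sign):
    """Return (r_cb, r_cr) reconstructed from the downmix sample r_c1."""
    if mode_abs == 1:
        return r_c1, (c_sign * r_c1) >> 1   # Cr is a halved copy of Cb
    if mode_abs == 2:
        return r_c1, c_sign * r_c1          # Cr is a full-weight copy of Cb
    if mode_abs == 3:
        return (c_sign * r_c1) >> 1, r_c1   # Cb is a halved copy of Cr
    raise ValueError("JCCR mode magnitude must be 1, 2, or 3")
```

The right shift by 1 realizes the multiplication by 1/2, as noted above.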

B. Signaling of JCCR Usage and Rotation Parameters
The general usage of JCCR can be enabled on a picture level. When enabled, a flag tu_joint_cbcr_residual_flag is transmitted for every TU for which either or both chroma coded block flags, CBF_Cb and CBF_Cr, are equal to 1. If tu_joint_cbcr_residual_flag is equal to 0, JCCR is not used for the TU and the chroma residual blocks are reconstructed in a conventional manner. Otherwise, if the flag is equal to 1, the absolute value of the JCCR mode m is derived by

|m| = ( 1, if CBF_Cb = 1 and CBF_Cr = 0;  2, if CBF_Cb = 1 and CBF_Cr = 1;  3, if CBF_Cb = 0 and CBF_Cr = 1 ).
The value of c_sign, allowing the decoder to distinguish between m < 0 and m > 0, is conveyed via the picture header syntax element ph_joint_cbcr_sign_flag. This flag is transmitted on a picture level, since it was observed that its optimal value usually varies very little within a video frame. Note that, when JCCR is enabled for a TU and both chroma CBFs are equal to 1, no quantization indexes are sent for the second chroma component. Hence, in all cases, either the Cb residual r_Cb or the Cr residual r_Cr is replaced by the downmix component r_C1.
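The mode derivation can be summarized in a few lines (a hedged sketch: the mapping of the flag value to the sign of m is an assumption for illustration):

```python
# Derive the signed JCCR mode m from the chroma coded block flags (CBFs)
# and the picture-level sign flag, for a TU with JCCR enabled.

def jccr_mode(cbf_cb, cbf_cr, ph_joint_cbcr_sign_flag):
    """Return the signed JCCR mode m."""
    table = {(1, 0): 1, (1, 1): 2, (0, 1): 3}   # |m| from the CBF pair
    m_abs = table[(cbf_cb, cbf_cr)]
    # Assumed convention: a set sign flag selects the negative rotation angle.
    return -m_abs if ph_joint_cbcr_sign_flag else m_abs
```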
The support of JCCR does not require any modifications of the transform coding for residual blocks. But as noted in Section II-F, a separate look-up table can be specified for deriving the QP for JCCR modes with |m| = 2. For |m| = 1 and |m| = 3, the QPs for the Cb and Cr components, respectively, are used.

VI. EXPERIMENTAL RESULTS
In the following, we provide experimental results evaluating the coding efficiency impact of the quantization and entropy coding modifications in VVC relative to HEVC. All results were obtained by running coding experiments according to the JVET Common Test Conditions (CTC) [51]. In addition to the test sequences specified in JVET's CTC, we also generated results for two other well-known test sets: The UHD-1 test set of the EBU [52] with 12 sequences in 2160p resolution and the 5 publicly available 1080p sequences of the SVT [53] test set. All sequences are given in the YCbCr 4:2:0 format with 8 or 10 bits per sample and frame rates of 24 Hz to 60 Hz.
Since a video coding standard like VVC specifies a combination of multiple coding tools and design concepts, it is difficult to assess the benefit of individual aspects. For example, the design of many tools is affected by the block partitioning, improvements in intra- and inter-picture prediction influence the effectiveness of all transform coding tools, and the non-normative encoding algorithm has a significant impact on all coding efficiency comparisons. In our coding experiments, we ran simulations with the VTM-9.1 reference software [28] and compared the following five versions:
1) VTM-9.1 configured according to the CTC (enabling all tools that contribute to coding efficiency);
2) Version 1 with JCCR disabled;
3) Version 2 with TCQ additionally disabled, but with SDH (already supported in HEVC) enabled instead;
4) Version 3 with the arithmetic coder of VVC additionally replaced by that of AVC/HEVC (the same initialization tables are used, but with a mapping to initial states);
5) Version 4 with the VVC coefficient coding additionally replaced by the HEVC coefficient coding (for supporting all block shapes, the definition of CGs and the scan is not modified).
By comparing bitstreams generated with versions 1 and 5, we can estimate the coding efficiency benefit of the newly added features for quantization and entropy coding. The contribution of individual tools is assessed by comparing two successive versions in the list above. As a measure of coding efficiency differences, we use the Bjøntegaard delta (BD) rate [54] with base QP values of 37, 32, 27, and 22, as specified in the JVET CTC. Note that negative numbers indicate corresponding average savings in bit rate for the same quality, measured as peak signal-to-noise ratio (PSNR). The BD rates are measured for the three test scenarios all intra (AI), random access (RA), and low delay (LD) specified in the CTC. The average results for the CTC sequence classes (A1 to F) and the two additional test sets (EBU and SVT) are summarized in Table VIII. For each scenario, the table lists two BD-rate averages: An average over the sequences of classes A1, A2, B, C, and E as defined in JVET's CTC and an average over all tested HD and UHD sequences. It also reports increases in encoder and decoder run times (measured via geometric averages, see [51]), which give an indication of the impact on encoder and decoder complexity, respectively.
The simulation results indicate that the improvements of quantization and entropy coding in VVC relative to HEVC yield bit-rate savings of roughly 4% at reasonably small increases of encoding and decoding times, where somewhat higher gains (but also higher encoding times) are observed for intra-only coding. The contributions of the individual tools lie in a range of about 0.5-2%. The larger improvements for class F, which comprises sequences with screen content, can be attributed to the newly included transform skip residual coding (see also [8]). The decreased decoding times for enabling TCQ are caused by a slight shift of the quantizer's operating point towards lower bit rates. For the results shown in Table VIII, TCQ was compared to SDH, since the latter is already included in HEVC. When one compares TCQ to conventional quantization with URQs, the bit-rate savings increase by about 0.5-0.7%, which represents the gain of SDH. Due to their VQ properties, both quantization tools show larger gains for higher video qualities. In contrast to that, JCCR is more effective for lower bit rates. JCCR also yields larger benefits for non-4:2:0 color sampling formats, as these include more chroma samples. This is confirmed by the results in Table IX, which were obtained by running simulations for eight sequences in YCbCr 4:4:4 and RGB 4:4:4 formats according to the JVET Common Test Conditions for non-4:2:0 color formats [55].

VII. CONCLUSION
Transform coding of prediction error blocks is one of the key components in hybrid video coding. This paper described the fundamental principles and implementation considerations behind the quantization and entropy coding design in the recently finalized Versatile Video Coding (VVC) standard. It introduced the trellis-coded quantization feature of VVC and highlighted the improvements, relative to the High Efficiency Video Coding (HEVC) standard, made in both the entropy coding scheme for quantized transform coefficients and the binary arithmetic coding engine. In addition, a newly integrated method for a block-wise joint coding of chroma residuals in color images and videos was discussed. A comprehensive performance evaluation, conducted by means of a large set of video sequences of varying resolution, confirmed the increased coding efficiency (measured in bit-rate reduction at the same peak signal-to-noise ratio) achieved by each of the aforementioned improvements, as well as by the combination of these coding tools. Beside the technology described in this paper, VVC includes a variety of other improvements such as the flexible block partitioning [5], block-adaptive transforms [6], various improvements of the intra-and inter-picture prediction, and new adaptive in-loop filters. By combining all these coding tools, VVC is able to outperform its predecessors, HEVC and AVC, in compression efficiency by a considerable margin and, thus, represents the new state-of-the-art in video coding.