A Decoding-Complexity and Rate-Controlled Video-Coding Algorithm for HEVC

Video playback on mobile consumer electronic (CE) devices is plagued by fluctuations in the network bandwidth and by limitations in processing and energy availability at the individual devices. Seen as a potential solution, the state-of-the-art adaptive streaming mechanisms address the first aspect, yet the efficient control of the decoding-complexity and the energy use when decoding the video remain unaddressed. The end-users' quality of experience (QoE), however, depends on the capability to adapt the bit streams to both these constraints (i


Introduction
Increasingly mobile consumption of video content, advancements in consumer electronics, and the popularity of video on-demand services, have immensely contributed towards the dramatic increase in video data traffic (expected to exceed 75% of the overall data traffic [1]) in the Internet. However, fluctuations in the network bandwidth, together with the limited processing and energy resources in mobile hand-held devices affect the quality of the video streams and the viewing experiences of the end users. As such, jointly adapting the video content to meet the network bandwidth and the device's energy supply becomes a crucial element to enhance the quality of experience (QoE) of video streaming services and applications.
This manuscript focuses on a content-adaptive video-encoding solution for HEVC; thus, the following section briefly elaborates on the state of the art with respect to the application-layer class of solutions.
Certain application-layer approaches propose alterations to the decoder implementations (both software and hardware decoders) such as data and task-level parallelization techniques [30,31]. The more advanced variants of this concept, such as Green-MPEG, use metadata to specify the decoding-complexity requirements to the decoder [32,33], which can then skip certain decoding operations in order to reduce the decoding energy consumption. The approach by Nogues et al. [34] is one such technique where two of the most complex decoding operations in the HEVC decoder (the in-loop filtering and the interpolation filters) are altered during the decoding process. However, such alterations in the interpolation filter at the decoder during motion compensation severely compromise the video quality. This is due to the use of a modified filter at the decoder on prediction unit (PU) residuals that are computed at an encoder which is unaware of the changes in the decoding process [3,24].
The use of DVFS algorithms is seen as another common approach to reduce the energy consumption of a video decoder. In this case, the video quality and energy usage are balanced [11,35-37] by controlling the idle time of the video decoder in real time, adjusting the CPU frequency and operational voltage [12]. However, this has been shown to have drawbacks such as frame drops and an impact on the overall system performance, which adversely affect the user's QoE, especially in the case of high frame rate, high quality video content [36]. The poor estimation of the complexity of the subsequent frame or video segment is largely to blame in many cases, as it leads to the sub-optimal selection of CPU frequencies and voltages. A recent Green-MPEG specification suggests the inclusion of codec-dynamic voltage frequency scaling (C-DVFS) metadata into the bit stream [13] to aid the frequency selection process in DVFS. However, estimating frame complexity in order to predict the operational CPU frequency still remains a challenging task. Such operations can greatly benefit from using an encoder (such as that proposed in this manuscript) which is capable of generating HEVC bit streams for given decoding-complexity and bit rate constraints.
The encoder-side content adaptation offers another alternative to reduce the decoding energy consumption during video playback. For example, scalable video coding (SVC) using proxy servers, media transcoding solutions [27], and dynamic adaptive streaming technologies such as MPEG-DASH [20] facilitate dynamic video content adaptation in order to meet the constraints of video playback devices. However, in general, these and other similar solutions, such as device-oriented [23], battery-aware [19], adaptive multimedia delivery and rate adaptation [38] schemes, are limited to manipulating basic video coding parameters such as the quantization parameter (QP), spatial resolution, and frame rate to adapt the video content and achieve energy savings.
Following a similar concept, decoder-friendly bit stream generation at the encoder has been attempted for H.264/AVC in [39][40][41]. For instance, the algorithm proposed in [39] constrains the encoder to select sub-pel motion vectors to control the decoding-complexity. A power consumption model for the deblocking filter is proposed in [40] to support the encoder in preparing decoder friendly H.264/AVC bit streams. A complexity analysis of H.264/AVC entropy decoder is presented in [41] to facilitate the encoder to effectively trade off bit rate, distortion, and decoding-complexity during the encoding process. However, mechanisms to dynamically allocate and control the decoding-complexity levels across video frames and macroblock units have been overlooked in these state-of-the-art algorithms.
The MPEG-DASH-based energy-aware HEVC streaming solutions [20] that exist in the literature only consider the decoding energy in PU mode decision and motion vector selection (i.e., integer-pel vs. fractional-pel). Hence, the reduction in energy consumption is marginal with respect to the similar approaches. In addition, the encoding techniques targeting energy-efficient HEVC decoding proposed in [42] consider the inverse transform and inverse quantization operations to reduce the decoding energy consumption. In this regard, our previous work in [24] goes a step further by introducing a decoding-complexity-rate-distortion model which is capable of determining the optimal combination of Lagrangian multipliers that minimizes a cost function comprising all three parameters (i.e., rate, distortion, and decoding-complexity). The solution presented in [24] is capable of determining the coding modes for given content that minimize the decoding-complexity, and by extension the decoding energy, while having minimal impact on the coding efficiency in a fixed QP encoding scenario. However, this solution lacks the capability to arbitrarily allocate rate and decoding-complexity levels to frames or CTUs, and control them in order to generate bit streams with multiple bit rate and decoding-complexity levels, which is crucial for video streaming applications that target resource-constrained video decoding devices with high quality HD/UHD video contents.
Furthermore, the network-aware and receiver-aware adaptation algorithms and the complexity-rate-distortion models [14][15][16] which have been introduced based on the previous coding standards typically focus on creating spatial, temporal, and quality scalable video bit streams. These approaches lack a comprehensive analysis of the decoding-complexity, rate, and distortion trade-offs with respect to the features available to the modern coding standards such as HEVC.
Finally, it has been shown that the increasingly popular HTTP-based video streaming solutions are sufficiently flexible to incorporate decoding-energy in their content prefetch logic [20]. This is typically achieved by utilizing an algorithm that monitors the device's remaining energy level to determine the next most appropriate video segment to meet that energy level. However, as discussed above, the content creation algorithms used in these solutions consider bit rate, QP, and spatial resolution changes as means for generating bit streams at different energy levels [19,21-23]. This is primarily due to the lack of a direct approach to generate rate-controlled bit streams at specified decoding-complexities, a crucial missing element in the state-of-the-art. Therefore, it is clear that a need exists for a mechanism to generate decoding-complexity and rate-controlled bit streams at the encoder to fully realize the energy efficiency goals of standards such as Green-MPEG [32,33] that can also co-exist with current streaming solutions and decoder-side energy efficiency initiatives such as DVFS.

The Decoding-Complexity, Rate, and Distortion Relationship
In order to develop an algorithm to control the bit rate and decoding-complexity of a video sequence, it is necessary to first determine the relationship between these two parameters and the distortion produced when a particular coding parameter combination is selected by the encoder for a given content. As such, a content-adaptive decoding-complexity-rate-distortion model is necessary, where the decoding-complexity can be determined for various decoding operations based on the coding modes and features selected by the encoder. To achieve this, the decoding-complexity estimation models developed in [24,43-45] for both inter-predicted and intra-predicted coding units (CU) are used as a basis for this work, which will equip the encoder to compute the relative complexity of each decoding operation. This section describes the approach used to analyze the behavior of these three parameters and a content-dependent model that can be generated for the decoding-complexity-rate-distortion space.

The Decoding-Complexity, Rate, and Distortion Space
In HEVC, the optimum coding parameter combination of a CU is derived by using a Lagrangian optimization approach with the rate-distortion (RD) cost function

$$\min_{p \in P} \left\{ D(p) + \lambda R(p) \right\}, \tag{1}$$

where $\lambda \geq 0$ is the empirically determined Lagrangian multiplier, $p$ is a coding structure in the set of combinations $P$, and $D(p)$ and $R(p)$ represent the distortion (squared error per pixel) and bit rate (bits per pixel), respectively. Each $p$ in (1) results in different decoding-complexities at the decoder [24,44-46] that remain unknown to the encoder. In order to assess the impact of each $p$ on the decoding-complexity, we first redefine the optimization function in (1) as

$$\min_{p \in P} \left\{ D(p) + \lambda_r R(p) + \lambda_c C(p) \right\}, \tag{2}$$

where $C(p)$ is the relative decoding-complexity (cycles per pixel) of $p$ obtained from [43-45]. Here, $\lambda_r$ and $\lambda_c$ are the bit rate and decoding-complexity trade-off parameters, respectively, analogous to $\lambda$ in (1). The ranges of $\lambda_r$ and $\lambda_c$ define the decoding-complexity-rate-distortion space spanned by the coding parameter combinations in $P$. Next, we analyze this relationship and derive a model of these parameters for use in a joint decoding-complexity and rate-controlled encoding algorithm.
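To make the role of the two trade-off parameters concrete, the following minimal sketch evaluates the three-term cost in (2) over a small candidate set and returns the cheapest entry. The `Candidate` type, the candidate names, and all numeric values are hypothetical and chosen only for illustration; a real encoder would obtain D, R, and C from its own mode evaluation and complexity model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One coding structure p with its (hypothetical) measured costs."""
    name: str
    distortion: float   # D(p), squared error per pixel
    rate: float         # R(p), bits per pixel
    complexity: float   # C(p), relative decoding cycles per pixel


def select_mode(candidates, lambda_r, lambda_c):
    """Return the candidate minimising D + lambda_r * R + lambda_c * C."""
    if lambda_r < 0 or lambda_c < 0:
        raise ValueError("trade-off parameters must be non-negative")
    return min(
        candidates,
        key=lambda p: p.distortion + lambda_r * p.rate + lambda_c * p.complexity,
    )


# Illustrative values only (not measurements from the paper).
CANDIDATES = [
    Candidate("fractional-pel", distortion=4.0, rate=0.10, complexity=40.0),
    Candidate("integer-pel", distortion=5.0, rate=0.12, complexity=25.0),
]
```

With `lambda_c = 0` the sketch reduces to the RD cost in (1); raising `lambda_c` shifts the choice towards candidates that are cheaper to decode.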

The Decoding-Complexity, Rate and Distortion Behaviour
In order to determine the behavior of decoding-complexity, rate, and distortion, the parameter space created by (2) must first be determined. To achieve this, an experimental sweep of the space created by $\lambda_r \in [0, \infty)$ and $\lambda_c \in [0, \infty)$ was performed on six different test sequences (with representative and varying spatial and temporal characteristics). Empirical data were collected from both inter- and intra-predicted frames of these test sequences for QPs ranging from 0 to 51 [24]. The resulting decoding-complexity, rate, and distortion can thereafter be expressed for further analysis in terms of cycles per pixel (cpp), bits per pixel (bpp), and the mean squared error (MSE), respectively, as follows:

$$\mathrm{MSE}(p, \lambda_r, \lambda_c, q) = \frac{1}{N} \sum_{i=1}^{N} \left( s_i - \hat{s}_i \right)^2, \tag{3}$$

where $s_i$ and $\hat{s}_i$ correspond to the $i$th original and reconstructed pixel, respectively, $q$ is the QP, and $N$ is the number of pixels in the frame. Similarly, the bpp and cpp are defined as

$$\mathrm{bpp}(p, \lambda_r, \lambda_c, q) = \frac{R(p, \lambda_r, \lambda_c, q)}{W \times H} \tag{4}$$

and

$$\mathrm{cpp}(p, \lambda_r, \lambda_c, q) = \frac{C(p, \lambda_r, \lambda_c, q)}{W \times H}, \tag{5}$$

where $W$ and $H$ correspond to the frame width and frame height, respectively. Further, $R$ and $C$ represent the total number of bits required to encode the frame and the estimated decoding-complexity of that frame once encoded. In the model used in this work, the relative decoding-complexity is expressed in terms of the number of CPU cycles used by the HM 16.0 reference decoder for each operation when executed on an Intel x86 CPU architecture platform.

Figure 1 graphically illustrates the behaviors of decoding-complexity, rate, and distortion in the parameter space spanned by $p$, $\lambda_r$, and $\lambda_c$ for the "kimono 1080p" sequence, including the discrete operating points that can be achieved by the encoder (i.e., each data point in Figure 1 represents the resultant bit rate, decoding-complexity, and distortion for a particular combination of $\lambda_r$ and $\lambda_c$). The behaviors of decoding-complexity, rate, and distortion remain similar across QPs and sequences, albeit with different model parameters. It was observed that this general behavior can be modeled in a content-dependent manner using the 2-dimensional nth-power model in (6), where $a(q)$, $b(q)$, $c(q)$, and $d(q)$ are QP- and content-dependent model parameters. Naturally, this implies that a content-adaptive approach is necessary to compute the appropriate model parameters dynamically, in order to determine the optimum coding structure $p$ for particular content. The next section describes an approach to build upon the model in (6) and to adaptively compute its model parameters (as described in Section 4.3), which leads to the novel joint decoding-complexity and rate-controlled encoding algorithm proposed in this manuscript.
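The three per-frame quantities used throughout the analysis (MSE, bpp, and cpp) are simple normalizations. The sketch below computes them for flat pixel sequences; the function names are illustrative and the cycle count is assumed to come from an external decoding-complexity estimate.

```python
def mse(original, reconstructed):
    """Mean squared error between two equally sized pixel sequences."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((s - r) ** 2 for s, r in zip(original, reconstructed)) / len(original)


def per_pixel(total, width, height):
    """Normalise a frame-level total (bits or cycles) by the frame area.

    Passing the frame's bit count gives bpp; passing its estimated
    decoding cycles gives cpp.
    """
    if width <= 0 or height <= 0:
        raise ValueError("frame dimensions must be positive")
    return total / (width * height)
```

For a 1920 × 1080 frame, `per_pixel(frame_bits, 1920, 1080)` yields the bpp and `per_pixel(frame_cycles, 1920, 1080)` yields the cpp.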

Joint Decoding-Complexity and Rate Control
As with any rate control algorithm, the joint control of both decoding-complexity and rate also requires target decoding-complexities and bit rates to be defined. In this context, this section first describes how these can be allocated at the CTU-level. This is followed by the derivation of an appropriate QP and content-adaptive decoding-complexity and rate trade-off factors in (6), and finally by an update algorithm to dynamically adapt the model parameters in (6).

CTU-Level Rate and Decoding-Complexity Allocation
Adopting a similar approach to the rate controller in the HM reference encoder [47,48], the group of pictures (GOP)-level and frame-level bit rate and decoding-complexity targets can be used to derive their CTU-level allocations. In this case, the target number of bits for the GOP is expressed in (7) in terms of $\psi$, $B_l$, $B_a$, $M_l$, and $W$, which are the GOP size, bits remaining, average bits per frame, frames remaining, and window size, respectively [48]. (It is assumed here that the bit rate and the number of frames to be encoded in the sequence are known ahead of the encoding process, e.g., in an on-demand video streaming scenario.) The average bits per frame is given by $B_a = B_T / M$, where $B_T$ is the total number of bits assigned to the sequence and $M$ is the total number of frames in the sequence. Similarly, the target decoding-complexity for the GOP is expressed in (8), in which $C_l$ and $C_a$ are the total decoding-complexity budget left over and the average complexity per frame, respectively, with $C_a = C_T / M$. Here, $C_T$ represents the total decoding-complexity assigned to the sequence. (The available decoding-complexity budget in this case is defined as the total number of CPU cycles that can be spent for the purpose of decoding a particular sequence with a known number of frames. This can be determined based on the remaining energy capacity of the battery-powered decoding device, and is considered outside the scope of this work.) Throughout the remainder of the manuscript, $W = 40$, as in the default configuration used in the HM 16.0 [49] encoder implementation. Next, the bits and decoding-complexities allocated for the GOP in (7) and (8), respectively, must be distributed across the individual frames in the GOP. In this case, a similar approach is adopted for both quantities (i.e., bits and decoding-complexity), leading to a weighted distribution of the GOP-level allocation to each frame as shown in (9).
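As a rough illustration of the GOP-level step, the sketch below assumes a smoothing-window form modeled on the HM rate controller, applied identically to bits and to decoding cycles. It is not a transcription of (7) or (8): the function name, the shortening of the window near the end of the sequence, and the clamping to zero are all assumptions made for the example.

```python
def gop_target(budget_left, avg_per_frame, frames_left, gop_size, window=40):
    """Assumed smoothing-window GOP target for bits or decoding cycles.

    budget_left   -- remaining budget for the sequence (B_l or C_l)
    avg_per_frame -- sequence-level average per frame (B_a or C_a)
    frames_left   -- frames still to be encoded (M_l)
    gop_size      -- number of frames in the GOP (psi)
    window        -- smoothing window size (W = 40 in the text)
    """
    if frames_left <= 0 or gop_size <= 0 or window <= 0:
        raise ValueError("frames_left, gop_size and window must be positive")
    w = min(window, frames_left)  # assumption: shrink the window near the end
    # Spend whatever would be left after giving every frame beyond the
    # window its average share, spread evenly over the window.
    per_frame = (budget_left - avg_per_frame * (frames_left - w)) / w
    return max(per_frame, 0.0) * gop_size  # assumption: never go negative
```

When the encoder is exactly on budget the target equals the average share times the GOP size; any overspend so far is recovered gradually over the window rather than in a single GOP.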
For the $j$th frame in the GOP, the bit and decoding-complexity allocations follow the weighted distribution in (9) for $X \in \{R, C\}$. The frame weights are defined in terms of two factors: the default weighting factor defined in the HM 16.0 reference encoder implementation [48,50] for the $j$th frame in the GOP, and a weighting factor $\eta^{m}_{X}$ that is experimentally determined based on the ratio of bits and decoding-complexities consumed by intra-predicted and inter-predicted frames in a typical video sequence. To determine $\eta^{m}_{X}$, the numbers of bits and the decoding-complexities utilized within intra-predicted and inter-predicted frames were averaged across 50 frames for the six test sequences (Section 3.2).

Finally, bit rate and decoding-complexity targets are allocated to the individual CTUs based on the MSE of the previous co-located CTU, as is often done in traditional rate control [51,52]. The decoding-complexity-rate-distortion model in (6) is used to this end. As the model parameters therein are functions of QP and content, the MSE of the $k$th CTU in the $j$th frame of the GOP, $\mathrm{MSE}^{CTU}_{k,j}$, is first predicted for each QP $q$ using $a_{k,j}(q)$, $b_{k,j}(q)$, $c_{k,j}(q)$, and $d_{k,j}(q)$, the appropriate model parameters for that CTU. Here, $\mathrm{bpp}_{avg}$ and $\mathrm{cpp}_{avg}$ are the bpp and cpp calculated as the averages of the minimum and maximum of each respective parameter observed so far within the encoded sequence for the kth QP for a given frame type. Next, the actual MSE of the co-located CTU, $\mathrm{MSE}^{CTU}_{k,j,co}$, is compared with $\mathrm{MSE}^{CTU}_{k,j}$ for all QPs to obtain a QP $q_0$. The $\mathrm{bpp}_{avg}$ and $\mathrm{cpp}_{avg}$ for QP $q_0$ are then used as the bit rate and decoding-complexity weights of the CTU, respectively.

Thus, the final target bit rate and decoding-complexity of the $k$th CTU in the $j$th frame in the GOP, $X^{CTU}_{T}$, is expressed in terms of the remaining bits or decoding-complexity available to the remaining CTUs in the frame, the total number of CTUs in the frame, $\Phi$, and the bit or decoding-complexity weight for the CTU, $\omega^{CTU}_{X}(k, j)$. Note that these bit rates and decoding-complexities can be expressed in terms of bpp or cpp by simply dividing $X^{CTU}_{T}$ by the number of pixels in the CTU.
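One common way to realize such a CTU-level split is to hand each CTU a share of the remaining frame budget in proportion to its weight relative to the weights of all CTUs not yet coded. The sketch below assumes exactly that proportional form, which may differ in detail from the expression used in the paper; the function name and the example weights are hypothetical.

```python
def ctu_target(remaining_budget, weights, k):
    """Assumed proportional share of the remaining frame budget for CTU k.

    remaining_budget -- bits or cycles left for CTUs k .. Phi-1 in the frame
    weights          -- per-CTU weights for the whole frame (length Phi)
    k                -- zero-based index of the CTU about to be coded
    """
    if not 0 <= k < len(weights):
        raise IndexError("CTU index out of range")
    remaining_weight = sum(weights[k:])
    if remaining_weight <= 0:
        raise ValueError("remaining weights must sum to a positive value")
    return remaining_budget * weights[k] / remaining_weight
```

Because the denominator only covers the CTUs still to be coded, any over- or under-spend on earlier CTUs is absorbed by the later ones, and the last CTU receives whatever budget is left.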

Determining the Model Parameters and Trade-Off Factors
Having established the target bit rates and decoding-complexities at the CTU-level, the remaining modeling parameters in (6), QP and the trade-off factors for the bit rate and decoding-complexity must be determined at the CTU-level in order to apply the optimization function in (2) to determine the most appropriate coding structure for that CTU.

Determining QP
Once the CTU-level decoding-complexity and bit allocations are made, the QP selection is first performed using a similar MSE-based approach. In this case, the MSE of the co-located CTU, $\mathrm{MSE}^{CTU}_{k,j,co}$, is now compared with $\mathrm{MSE}^{CTU}_{k,j}$ for all QPs to obtain a QP $q_0$. Here, the MSE of the $k$th CTU in the $j$th frame of the GOP is estimated for each QP $q$, where $N$ is the total number of pixels in the CTU and $R^{CTU}_{T}$ and $C^{CTU}_{T}$ are the bit and decoding-complexity levels allocated for the CTU, respectively.
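The comparison step can be sketched as a nearest-match search over the candidate QPs. The selection criterion used here (smallest absolute difference between the co-located CTU's actual MSE and the per-QP predicted MSE) is an assumption made for illustration, and the predicted values are taken as given rather than computed from the model.

```python
def select_qp(mse_colocated, predicted_mse_by_qp):
    """Pick the QP whose predicted MSE is closest to the co-located CTU's MSE.

    predicted_mse_by_qp -- mapping {qp: predicted MSE for the current CTU}
    """
    if not predicted_mse_by_qp:
        raise ValueError("at least one candidate QP is required")
    return min(
        predicted_mse_by_qp,
        key=lambda q: abs(predicted_mse_by_qp[q] - mse_colocated),
    )
```

For example, with predictions `{22: 3.0, 27: 6.5, 32: 12.0}` and a co-located MSE of `7.0`, the sketch selects QP 27.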

Determining $\lambda_r$ and $\lambda_c$
Next, (16) and (17) follow from (6) for the $k$th CTU in the $j$th frame. Equations (16) and (17) imply that the CTU-level model parameters, together with the CTU-level bit rate and decoding-complexity allocations described in the previous subsection, completely define the optimization function in (2) needed to determine the optimum coding structure. Thus, from (14)-(17), the bit rate and decoding-complexity trade-off parameters, $\lambda_r$ and $\lambda_c$, can be expressed for the $k$th CTU in the $j$th frame in terms of these quantities and $N$, the number of pixels in the CTU. It now becomes apparent that the two trade-off parameters are both content- and QP-dependent via the four modeling parameters in (6), (16), and (17). However, a content-independent generic set of parameters can also be obtained from the data collected in Section 3 and [24], to be used as initial values in the adaptive model parameter computation process described in the following subsection (Section 4.3). In this case, MSE, bpp, and cpp values from 50 inter-coded and intra-coded frames of six different test sequences (three HD and three CIF) [24] have been considered to derive these generic model parameters, which are given in (20) and (21) for the two frame types. Naturally, the bit rate and decoding-complexity achieved using (20) and (21) will be inaccurate and not adaptive to the content. Hence, a mechanism to dynamically update the model parameters is necessary for joint decoding-complexity and rate-controlled encoding.

Dynamic Model Parameter Adaptation
In order to derive content-dependent model parameters, the generic parameter set in (20) and (21) can be adapted using a least mean square (LMS)-based approach [3,53]. To that end, the error between the assigned and achieved bit rate and decoding-complexity must be minimized. To do so, a joint error function for the two quantities is first defined. Note that the following derivations omit the $k$, $j$, and $q_0$ subscripts for notational simplicity, but the adaptation process must be applied independently to each CTU to compute its unique model parameters.

Let the differences between the assigned and achieved bit rate and decoding-complexity per pixel be $\Delta R$ and $\Delta C$, respectively. Similarly, let the difference between the predicted distortion and actual distortion in terms of MSE be $\Delta D$. The total derivative of distortion in terms of MSE can be expressed as the sum of partial derivatives of the dependent variables in the model in (6) and the definitions in (16) and (17), as in (23). Squaring (23) and rearranging the terms leads to (25), the right-hand side of which can be simplified further as in (26). The objective of minimizing $\Delta C$ and $\Delta R$ simultaneously is now made possible by multiplying (26), and therefore (25), by $\Delta D^2$. Hence, by combining (25) and (26), the joint error function $F$ to be minimized is defined in (27).

Thus, using (27) and an LMS adaptive filter, the updated model parameters are given by (28), where $\alpha_n$, $\beta_n$, $\rho_n$, $\tau_n$ and $\alpha_{n-1}$, $\beta_{n-1}$, $\rho_{n-1}$, $\tau_{n-1}$ are the newly computed and previous model parameters, respectively, for the QP $q_0$ being considered. Further, $\vartheta_\alpha$, $\vartheta_\beta$, $\vartheta_\rho$, and $\vartheta_\tau$ are the LMS filter's step sizes controlling the adaptation speed, and are empirically determined as $10^{-4}$, $10^{-5}$, $10^{-5}$, and $10^{-6}$, respectively. Finally, the partial derivatives of $F$ in (28) with respect to the model parameters are expressed in terms of $R^{CTU}_{T}$ and $C^{CTU}_{T}$, the target bit rate and decoding-complexity, respectively.
Once the model parameters are updated as per (28) in this manner, the new parameters can be used to determine the $\lambda_r$ and $\lambda_c$ trade-off factors for the mode selection in the cost function in (2).
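The parameter update can be pictured as one gradient-descent step per parameter, each with its own step size. The sketch below uses the step sizes quoted in the text but leaves the gradient of the joint error function to the caller; the descent sign convention, the dictionary keys, and the function name are assumptions for illustration rather than a reproduction of (28).

```python
# Step sizes quoted in the text for the four model parameters.
STEP_SIZES = {"alpha": 1e-4, "beta": 1e-5, "rho": 1e-5, "tau": 1e-6}


def lms_update(params, gradients, step_sizes=STEP_SIZES):
    """One assumed LMS step: new = previous - step * dF/d(parameter).

    params    -- previous parameters, e.g. {"alpha": ..., "beta": ...}
    gradients -- partial derivatives of the joint error F, same keys
    """
    missing = set(params) - set(gradients) | set(params) - set(step_sizes)
    if missing:
        raise KeyError(f"missing gradient or step size for: {sorted(missing)}")
    return {
        name: value - step_sizes[name] * gradients[name]
        for name, value in params.items()
    }
```

The smaller step sizes for the later parameters make them adapt more slowly, which in a scheme like this would typically be chosen to keep the more sensitive parameters from oscillating between CTUs.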

Experimental Results and Discussion
This section presents the performance of the proposed CTU-level decoding-complexity and rate control algorithm. In this case, the rate and complexity-controlling capabilities of the proposed algorithm are first compared with two state-of-the-art decoding-complexity-aware encoding algorithms in the literature. Thereafter, experimental results for the power consumption characteristics of the decoder during a video streaming session are compared for two different CPU frequency governing methods. Finally, the experimental results and observations are discussed in detail for different use cases.

Simulation Environment
The proposed encoding algorithm is implemented in the HM 16.0 reference encoder. The decoding-complexity estimation models presented in [24,43-45], the Lagrangian cost function that determines the coding modes, and the proposed decoding-complexity and rate-controlling algorithm are integrated into the HEVC encoding tool chain. The resultant bit streams are decoded using the openHEVC [54] software decoder. The decoding was performed on an Intel x86 Core i7-6500U system running Ubuntu 16.04 to measure the decoding-complexity performance of the bit streams. The proposed algorithm's performance is compared with three state-of-the-art approaches: the power-aware encoding algorithm proposed by He et al. [20]; the rate, distortion, and decoder energy optimized encoding algorithm proposed by Herglotz et al. [46]; and the tunable HEVC decoder proposed by Nogues et al. [34]. The video sequences used in the experiments reported in this section are of HD (1920 × 1080) resolution. In this case, the "Kimono" and "Parkscene" sequences are defined in the HEVC common test configurations [55] and the rest are a collection of proprietary sequences. Their sequence categories (i.e., motion and texture complexity levels) are defined in Table 1. Each video sequence is encoded at 900 kbps, 1 Mbps, 2 Mbps, and 4 Mbps video bit rates using the random access configuration with rate control enabled. Moreover, the bit streams corresponding to the proposed algorithm were generated with two decoding-complexity levels, which are referred to as L1 and L2. The bit streams corresponding to these two levels are encoded such that their decoding-complexities are 30% and 40% less (in terms of CPU cycles) than those of HM encoded bit streams, respectively. The complexity of the decoding process was measured using the instruction-level analysis tools callgrind/valgrind [56].
The numbers of CPU cycles identified were assigned as the available decoding-complexity budget for the sequence, when performing the decoding-complexity allocation calculations described in Section 4.1.
Finally, the decoder's energy consumption was determined by measuring the energy dissipated by the system during the video playback. In this context, a test bed that implements an online video streaming scenario where the openHEVC decoder is used as the playback client was used for this assessment. The encoded bit streams were streamed for a duration of 15 min and the energy capacity reduction of the playback device's battery was measured using the Linux power measurement tools [57]. It should be noted that the measured battery capacity reduction corresponds to the overall energy consumption by the device that includes the energy consumed for the wireless transmission, video decoding, and video presentation. Furthermore, the relative energy consumption performances when using an application-specific DVFS algorithm [10] and the Linux ondemand frequency governor were also analyzed and are compared in the experimental results.

Evaluation Metrics
The performance of the proposed algorithm is evaluated in multiple stages. The evaluation metrics are described below.

Decoding-Complexity and Rate Control Performance
First, the decoding-complexity and rate-controlling capabilities of the proposed algorithm are evaluated by measuring the percentage error in achieving the target decoding-complexity and rate. In this case, the percentage error in bit rate is calculated from $R_r$ and $R_T$, the achieved and target bit rates, respectively. Similarly, the overall decoding-complexity controlling performance of the proposed algorithm is measured from $C_T$ and $C_r$, the target and achieved decoding-complexity levels in terms of CPU cycles for a particular number of frames. Moreover, the frame-wise rate and decoding-complexity control performances are measured using the percentage error between the allocated and actual number of bits and decoding-complexity per frame, respectively. (The decoding-complexity controlling performance is presented only for the proposed algorithm, as the state-of-the-art algorithms do not support a mechanism to achieve a specified decoding-complexity.)
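A percentage error of this kind is conventionally the absolute deviation from the target, relative to the target. The sketch below assumes that conventional form (including the absolute value, which the text does not state explicitly) and applies equally to bit rates and to CPU-cycle counts; the function name is illustrative.

```python
def percentage_error(achieved, target):
    """Assumed form: |achieved - target| / target * 100."""
    if target <= 0:
        raise ValueError("target must be positive")
    return abs(achieved - target) / target * 100.0
```

For instance, a stream that reaches 1.0175 Mbps against a 1 Mbps target has a rate error of about 1.75%; calling the same function with per-frame allocations and per-frame actuals gives the frame-wise errors.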

Decoding-Complexity, Energy Reduction Performance, and Video Quality Impact
Next, the impact on video quality, the decoding-complexity, and the respective energy reduction achieved at a particular decoding-complexity by the proposed algorithm (complexity level L2 is considered in this case) are compared against the state-of-the-art algorithms while keeping HM 16.0 as the reference. The impact on video quality for a given bit rate is assessed using the impact on PSNR, $\Delta\mathrm{PSNR}$, computed from $\mathrm{PSNR}_{HM}$ and $\mathrm{PSNR}_{\kappa}$, the resultant average PSNRs for the reconstructed video sequences when using HM 16.0 and when using the proposed or another state-of-the-art algorithm, respectively. Similarly, the reduction in decoding-complexity, $\Delta\Gamma$, and the corresponding energy reduction, $\Delta E$, are assessed using (36) and (37), respectively. Finally, for normalized comparison purposes, the proposed and state-of-the-art algorithms are assessed on the decoding-complexity and energy reduction achieved for a 1 dB PSNR loss in the video quality, i.e., $\Delta\Gamma$(%) per $\Delta\mathrm{PSNR}$(dB) and $\Delta E$(%) per $\Delta\mathrm{PSNR}$(dB), where $\Delta\Gamma$ and $\Delta E$ are calculated as per (36) and (37), respectively.
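The comparison metrics can be sketched as follows, assuming that the PSNR impact is the HM value minus the tested algorithm's value, that the reductions are expressed as a percentage of the HM reference, and that the normalized figures are plain ratios. These forms and the function names are assumptions for illustration and are not taken from (36) or (37).

```python
def delta_psnr(psnr_hm, psnr_k):
    """Quality loss in dB relative to the HM 16.0 reference (assumed sign)."""
    return psnr_hm - psnr_k


def reduction_percent(reference, measured):
    """Reduction of cycles or energy relative to the HM reference, in %."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return (reference - measured) / reference * 100.0


def reduction_per_db(reduction_pct, psnr_loss_db):
    """Reduction (%) achieved per 1 dB of PSNR loss."""
    if psnr_loss_db <= 0:
        raise ValueError("normalisation requires a positive PSNR loss")
    return reduction_pct / psnr_loss_db
```

Normalizing by the PSNR loss lets algorithms that land at different quality levels be compared on a common footing: a method that saves 40% of the cycles at 0.5 dB loss scores 80% per dB.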

Performance Evaluation and Analysis
This section presents and analyzes the experimental results. In this context, the decoding-complexity and rate-controlling performances are analyzed first. Thereafter, the decoding-complexity reductions achieved by the proposed as well as state-of-the-art algorithms and their quality impacts are discussed with respect to the experimental setup discussed in Section 5.1. Here, the proposed algorithm considers generating bit streams with a 40% decoding-complexity reduction target over HM 16.0 (i.e., complexity level L2).

Rate Controlling Performance
The percentage deviations of the final bit rate achieved after encoding using the proposed (with joint rate and decoding-complexity controlling) and state-of-the-art algorithms (with rate controlling enabled) are presented in Table 1. Here, the video sequences are encoded at four different bit rates (described in Section 5.1) and the averaged percentage error is presented for comparison.
It can be observed that the rate-controlling algorithm implemented in the HM 16.0 reference encoder shows a 1.75% average error, which is less than the 3.08% deviation from the target bit rate experienced by He et al. [20]. The latter is mainly due to the content- and QP-agnostic nature of the algorithm in [20], despite its use of PU-level prediction modes, integer-pel vs. fractional-pel motion vectors, and in-loop filtering decisions. In contrast, ref. [46] uses QP-dependent trade-off factors for both rate and decoding-complexity; thus, the impact on the rate controller is significantly reduced compared to He et al. [20].
The proposed algorithm, on the other hand, uses a content-adaptive decoding-complexity, rate, and distortion model to derive the QP as well as the rate and decoding-complexity trade-off factors to determine the set of coding modes and structures that minimize the distortion while achieving a given bit and decoding-complexity budget. Therefore, as illustrated in Table 1, the proposed algorithm achieves the allocated bit rate targets with <1% error, indicating that both the CTU-level bit allocation and the coding parameter selection are more accurate and content-adaptive compared to the state-of-the-art approaches.
In addition, the frame-wise rate-controlling performances of the encoding algorithms were analyzed using the percentage error between the allocated bits and actual bits per frame. A graphical illustration of this frame-wise percentage error is presented in Figure 2. It can be observed that the rate-controlling algorithms implemented in HM 16.0 and the other state-of-the-art encoding algorithms suffer from large percentage errors throughout the video sequence. The incorporation of a third parameter within the mode selection cost function in He et al. [20] and Herglotz et al. [46] crucially affects the rate controller in achieving the allocated number of bits for a given block. For example, both these algorithms use an RD-optimization-based bit allocation, QP, and Lagrangian parameter determination approach [48] for the rate control while utilizing three parameters in the cost function (rate, distortion, and decoding-complexity) for the coding mode selection. The correlation that exists between the three parameters is, however, ignored when performing the rate control, which results in large average rate-controlling errors, as illustrated in Table 1. The rate-controlling algorithm in HM 16.0, which follows an R-λ-based bit allocation and coding parameter selection approach, also shows some deficiency in achieving the allocated bit budget for each frame; as illustrated in Table 1, the HM 16.0 encoder still demonstrated a 1.75% error in its rate-controlling function. In contrast, the proposed algorithm enables the encoder to effectively utilize the correlation between the three parameters to perform rate and decoding-complexity allocation and appropriate coding mode selection, resulting in a smaller percentage bit error (illustrated in the bottom row of Figure 2 and in Table 1). Moreover, the parameter update process in Section 4.3 keeps the algorithm content-adaptive, further minimizing the rate control error.

Decoding-Complexity Controlling Performance
The experimental results summarized in Table 2 show the percentage error in the decoding-complexity controlling function of the proposed encoding algorithm (none of the state-of-the-art algorithms is able to achieve a specified decoding-complexity). The proposed algorithm shows an average decoding-complexity controlling error of ≈1.78% for both complexity levels considered (30% and 40% reductions over HM 16.0). The results suggest that the proposed algorithm is capable of generating a bit stream that adheres to a given bit rate and a given decoding-complexity level. Furthermore, the frame-wise decoding-complexity error illustrated in Figure 3 reveals that the proposed encoding algorithm maintains a very low error despite the dynamic nature of the video content. In summary, the numerical and graphical results for simultaneous rate and decoding-complexity control indicate that the proposed method is content-adaptive and capable of achieving specified bit and decoding-complexity targets.

Decoding-Complexity Reduction and the Impact on Video Quality
Table 3 presents the average decoding-complexity reductions and the corresponding quality impact, in PSNR, for the proposed and state-of-the-art algorithms.
The algorithms proposed by He et al. [20] and Herglotz et al. [46] achieved decoding-complexity reductions in the range of 10% and 20%, respectively. However, these reductions were achieved at the expense of a significant reduction in PSNR for a given bit rate. For example, although Herglotz et al. [46] use a decoding-complexity estimation model [58][59][60], the bit rate and decoding-complexity trade-off factors are selected independently; thus, their impact on each other is overlooked during the coding mode selection. Furthermore, only the bit rate trade-off factor [47,49] is content-adaptive, and the decoding-complexity trade-off factor remains agnostic to the dynamics of the video sequence, which ultimately results in a higher quality loss. Similarly, the method proposed in [20] uses predefined trade-off factors and decoding-complexity-aware coding mode selection only at the PU level. Thus, the sacrifice made in video quality to maintain the bit rate requirement is greater, and its performance is inferior to that of Herglotz et al. [46]. These results are illustrated graphically in the ∆PSNR vs. decoding-complexity graphs presented in Figure 4. Here, it can be observed that both these algorithms would experience a higher quality impact in a rate-controlled scenario if they were to achieve a particular decoding-complexity. However, it should be noted that the encoding algorithm proposed by Herglotz et al. has shown improvements for very low-complexity video sequences, such as "band," "cafe," "poznan st.," etc.
Table notes: ∆ refers to the reduction in quality measured using ∆PSNR (dB). Υ (dB) is the BD-PSNR that represents the drop in video quality by the proposed algorithm when encoded using a similar bit rate to that of the HM 16.0 encoder. ‡ ∆Γ% achieved using the openHEVC decoder. * Here, the bit streams for complexity level 2 (L2) are subjected to the LF algorithm.
The approach by Nogues et al. [11] modifies the decoding operations to reduce the decoding-complexity. For example, skipping the in-loop filtering and simplifying the motion compensation operations within the decoder result in a significant complexity reduction. (It should be noted that the presented results correspond to the highest complexity reduction that can be achieved by applying the decoder modifications proposed in [11] to all frames in the bit stream.) However, changing the motion compensation filters, and thereby applying the decoded residuals to a predicted PU that differs from the encoder's, causes additional distortion in the reconstructed block. Although the intra-frames that appear at the given intervals prevent these errors from propagating further, the algorithm results in a much larger PSNR reduction (cf. Figure 4).
In contrast, the proposed algorithm uses a more comprehensive and dynamic approach to simultaneously control both decoding-complexity and bit rate. First, the use of more accurate and detailed decoding-complexity estimation models enables the encoder to estimate the decoding-complexity requirements for a given coding mode. Next, the proposed decoding-complexity-rate-distortion model allows the encoder to determine the impact of a coding mode on all three parameters. Finally, the continuous update of the decoding-complexity-rate-distortion model allows the encoder to pick the most content-relevant trade-off factors when selecting the coding modes that minimize the distortion while meeting the given rate and decoding-complexity constraints. As observed from Figure 4, the proposed algorithm allows the encoder to generate bit streams that incur the least quality impact for a given decoding-complexity. Moreover, the proposed algorithm is highly scalable and can generate bit streams with multiple bit rate and decoding-complexity levels, a crucial benefit for adaptive video streaming services that target mobile devices. For instance, in this case, the decoding-complexity level L2 (i.e., a 40% decoding-complexity reduction with respect to HM 16.0) results in, on average, a −12.71% decoding-complexity reduction when using the openHEVC decoder. Finally, if the bit streams generated by the proposed algorithm are decoded with a decoder that skips the in-loop filter operations (e.g., openHEVC), the decoding-complexity can be further reduced by ≈7%, with only a minor impact on the video quality. Thus, it is evident that the bit streams generated by the proposed algorithm can be subjected to decoder modifications such as [34] to attain further complexity reductions. Figure 4 also demonstrates the decoding-complexity reduction that can be achieved for a 1 dB quality loss in PSNR.
It can be observed that the proposed algorithm achieves, on average, a greater ∆Γ(%/dB) across all bit rates. This gain is much larger at lower bit rates, owing to the reduced quality impact with respect to the HM encoded bit stream. Thus, it is apparent that the proposed algorithm can produce a larger decoding-complexity reduction than the state-of-the-art algorithms for each 1 dB quality loss. Finally, Figure 5 illustrates the visual quality impact on the video sequences reconstructed from the bit streams encoded by HM 16.0, the proposed algorithm, and the other state-of-the-art methods. It can be observed that, despite the PSNR drops listed in Table 3, the bit streams generated by the proposed algorithm retain a visual quality level similar to that of the bit streams prepared by the HM 16.0 encoder.
Figure 5 (caption, partially recovered): ... [20], (d) Herglotz et al. [46], and (e) Nogues et al. [34]. The figures correspond to frame number 36 of the "Kimono HD" sequence encoded at 2 Mbps.
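The ∆Γ(%/dB) metric used in the discussion above expresses the decoding-complexity reduction obtained per 1 dB of PSNR loss. The sketch below shows one way to compute it, assuming the complexity is measured against a reference bit stream and the quality loss is a plain PSNR difference; the function names and all numeric values are hypothetical and do not reproduce any entry of the tables.

```python
def percent_reduction(reference: float, test: float) -> float:
    """Reduction of `test` relative to `reference`, in percent."""
    if reference <= 0:
        raise ValueError("reference value must be positive")
    return (reference - test) / reference * 100.0


def reduction_per_db(reduction_percent: float, psnr_loss_db: float) -> float:
    """Reduction (%) achieved per 1 dB of PSNR quality loss."""
    if psnr_loss_db <= 0:
        raise ValueError("PSNR loss must be positive to normalize by it")
    return reduction_percent / psnr_loss_db


# Hypothetical measurements: decoding complexity in cycles, quality in dB.
reference_cycles, test_cycles = 200.0, 150.0
reference_psnr_db, test_psnr_db = 38.0, 36.0

delta_gamma = percent_reduction(reference_cycles, test_cycles)   # % complexity reduction
psnr_loss = reference_psnr_db - test_psnr_db                     # dB quality loss
delta_gamma_per_db = reduction_per_db(delta_gamma, psnr_loss)    # %/dB
```

Normalizing by the quality loss makes algorithms comparable even when they operate at different complexity and quality points; the same helper applies to decoding energy if the cycle counts are replaced by energy measurements.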

Decoding Energy Reduction Performance
Next, the actual energy consumption of the bit streams generated by the proposed and state-of-the-art algorithms is compared for a video streaming use case. First, the generated bit streams are decoded using the openHEVC video decoder with Linux ondemand as the frequency scaling governor [61]. It can be observed in Table 4 that the proposed and state-of-the-art algorithms demonstrate an energy-consumption reduction in the range of ≈4% compared to HM 16.0 encoded video bit streams. Moreover, forcing the decoder to skip the in-loop filters enables the proposed algorithm to increase the energy-consumption reduction up to 5.65%.
Changing the Linux ondemand governor to a more application-specific DVFS algorithm [10], which alters the CPU's operational frequency based on the estimated complexity of the next video frame, improves the energy-consumption reduction of all the algorithms. In this case, the proposed algorithm achieves 7.77% and 9.10% decoding energy-consumption reductions compared to the HM encoded bit streams, a non-trivial result with only −1.44 dB and −2.03 dB quality impacts with and without in-loop filter operations, respectively. The decoding energy reduction achieved per 1 dB PSNR video quality loss for the proposed and state-of-the-art algorithms is presented in Table 5, and the ∆E(%/dB) achieved for a 1 dB quality loss is graphically demonstrated for three different test sequences in Figure 6. These results further corroborate that the energy reductions achieved by the bit streams generated with the proposed algorithm result in smaller impacts on quality compared to the state-of-the-art approaches. Thus, the decoding energy consumption reduction achieved for each 1 dB PSNR loss is also relatively large for the proposed encoding algorithm.
Table notes: The metrics ∆Γ (%/dB) and ∆E (%/dB) are both measured in terms of the ∆Γ(%) and ∆E(%) achieved per 1 dB PSNR quality loss for the proposed and state-of-the-art algorithms. † ∆E (%/dB) achieved when using the Linux ondemand frequency governor. ‡ ∆E (%/dB) achieved when using an application-specific DVFS algorithm as the frequency governor.
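The idea of scaling the CPU frequency according to the estimated complexity of the next frame can be illustrated with a minimal sketch. This is not the DVFS algorithm of [10]; it is a generic deadline-based selection, and the function name, the frequency steps, and the cycle estimates are all assumptions made for illustration.

```python
from typing import Sequence


def pick_frequency(estimated_cycles: float,
                   frame_deadline_s: float,
                   available_freqs_hz: Sequence[float]) -> float:
    """Pick the lowest frequency that can decode the next frame in time.

    Falls back to the highest available frequency when no step is fast
    enough to meet the deadline.
    """
    if frame_deadline_s <= 0:
        raise ValueError("frame deadline must be positive")
    if not available_freqs_hz:
        raise ValueError("at least one frequency step is required")
    required_hz = estimated_cycles / frame_deadline_s
    for freq in sorted(available_freqs_hz):
        if freq >= required_hz:
            return freq
    return max(available_freqs_hz)


# Hypothetical frequency steps and a 25 fps playback deadline (40 ms per frame).
freq_steps_hz = [600e6, 1.2e9, 1.8e9]
deadline_s = 0.04

light_frame = pick_frequency(20e6, deadline_s, freq_steps_hz)    # needs 0.5 GHz
typical_frame = pick_frequency(30e6, deadline_s, freq_steps_hz)  # needs 0.75 GHz
heavy_frame = pick_frequency(100e6, deadline_s, freq_steps_hz)   # needs 2.5 GHz
```

Under these assumptions, a bit stream with lower per-frame decoding-complexity lets the governor remain at lower frequency steps more often, which is consistent with the larger energy reductions reported when such a governor replaces ondemand.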

Impact of the Proposed Encoding Framework on Different Decoders and CPU Architectures
The proposed encoding algorithm presented in this manuscript is based on the HM 16.0 reference encoder and decoder implementations on an Intel x86 CPU architecture. For instance, the bpp and MSE parameters defined in Section 3 are based on the corresponding values generated by the HM 16.0 encoder. Furthermore, the cpp values utilized throughout the modeling phase in Section 3 correspond to the decoding-complexity levels profiled for the HM 16.0 decoder implementation.
The decoding-complexity level is tightly coupled with the implementation details, the CPU architecture, and hardware-level optimizations. Therefore, complete decoder profiling should be carried out for each decoder implementation on each CPU architecture to achieve an optimal decoding-complexity/energy reduction. However, the focus of this work is to present a framework that can be used to achieve decoding-complexity/energy reduction by generating joint decoding-complexity and rate-controlled bit streams. Therefore, decoder profiling for individual implementations and architectures is considered outside the scope of this work.
Nevertheless, the experimental results presented in Tables 3-5 correspond to the decoding-complexity and associated energy reductions obtained when decoding the bit streams using the openHEVC decoder implementation on an Intel x86 CPU. These results correspond to the decoding-complexity/energy reductions achieved when the bit streams are encoded with 40% less decoding-complexity than that of the HM 16.0 decoder. Similarly, the experimental results in Table 6 present the decoding energy reduction achieved when decoding the bit streams using MXplayer [62] running on a Samsung Galaxy Tab A device that consists of an Exynos 8890 processor with the ARMv8 Instruction Set Architecture [63]. It can be observed that the percentage energy reduction differs from that in the Intel x86 results. Overall, however, the bit streams generated by the proposed algorithm achieve a larger energy reduction per 1 dB PSNR quality loss than the state-of-the-art methods. Thus, the energy reductions obtained with the proposed encoding framework on different CPU architectures and decoder implementations, even though sub-optimal, are still significant, despite the framework being optimized for a different encoding and decoding architecture.

Conclusions
Fluctuations in network bandwidth and the limited availability of processing and energy resources in consumer electronic devices demand video streaming solutions that adapt to changing network and device constraints. In this context, although solutions such as HTTP adaptive streaming address the network bandwidth problem, adapting the transmitted video content by considering both the network bandwidth and an individual device's energy constraints remains a compelling challenge.
To this end, this paper presents an encoding algorithm that can generate HEVC-compliant bit streams with multiple arbitrary bit rate and decoding-complexity levels. The experimental results with respect to the simultaneous bit rate and decoding-complexity control suggest that the proposed algorithm achieves a target bit rate and a decoding-complexity level with 0.47% and 1.78% average errors, respectively. Furthermore, the proposed algorithm demonstrates an average 10.11 (%/dB) decoding-complexity reduction and up to 10.52 (%/dB) decoding energy reduction for 1 dB PSNR quality loss compared to HM 16.0 encoded bit streams on an Intel x86 CPU architecture, a significant improvement compared to the state-of-the-art techniques. In addition, the bit streams generated by the proposed algorithm demonstrate a 20.36 (%/dB) average energy reduction per 1 dB quality loss for a RISC-based ARM CPU architecture. Finally, future work will focus on developing the proposed model into an adaptive video streaming solution that considers the end-to-end network and device resource availability to determine the coding parameters used to encode the streaming video.