Adaptive Video Encoding for Different Video Codecs

By 2022, video traffic is expected to reach 82% of total internet traffic. Undoubtedly, the abundance of video-driven applications will likely push the share of internet video traffic even higher in the near future, enabled by associated advances in video devices' capabilities. In response to this ever-growing demand, the Alliance for Open Media (AOM) and the Joint Video Experts Team (JVET) have demonstrated strong and renewed interest in developing new video codecs. In the fast-changing video codec landscape, there is thus a genuine need to develop adaptive methods that can be universally applied to different codecs. In this study, we formulate video encoding as a multi-objective optimization process where video quality (as a function of VMAF and PSNR), bitrate demands, and encoding rate (in encoded frames per second) are jointly optimized, going beyond standard video encoding approaches that focus on rate control targeting specific bandwidths. More specifically, we create a dense video encoding space (offline) and then employ regression to generate forward prediction models for each of the afore-described optimization objectives, using only Pareto-optimal points. We demonstrate our adaptive video encoding approach, which leverages the generated forward prediction models that qualify for real-time adaptation, using different codecs (e.g., SVT-AV1 and x265) for a variety of video datasets and resolutions. To motivate our approach and establish the promise for future fast VVC encoders, we also perform a comparative performance evaluation using both subjective and objective metrics and report on bitrate savings among all possible pairs between the VVC, SVT-AV1, x265, and VP9 codecs.


I. INTRODUCTION
Video streaming applications are steadily growing, driven by associated advances in video compression technologies. Video codecs support the widespread availability of digital video content, which dominates internet traffic compared to any other communicated data [1]. Traditional video sources such as video-on-demand (i.e., movies), teleconferencing, and live streaming events, including sports, concerts, and news, claim a significant share amongst the most popular applications, leveraging open-source protocols such as Dynamic Adaptive Streaming over HTTP (MPEG-DASH) [2]. (The associate editor coordinating the review of this manuscript and approving it for publication was Dian Tjondronegoro.) Intriguingly,
user-generated content facilitated by social media platforms has driven video communications to unprecedented levels [3]. Beyond that, synthesized video content, augmented and virtual reality applications including 360° video content and point-cloud technologies [4]-[6], as well as internet gaming and emerging medical applications [7], [8], necessitate efficient solutions that will alleviate known bottlenecks, especially over resource-constrained wireless networks.
Toward this direction, Versatile Video Coding (VVC)/H.266 [9], the successor of the High Efficiency Video Coding (HEVC)/H.265 standard [10], was officially released in July 2020 to reclaim the compression-efficiency lead from the AV1 codec, which was released in 2018 [11]. Both codecs target ultra-high-definition video coding, with a clear direction towards accommodating AR and VR applications, 360° video, and multi-view video coding. VVC is the new compression standard of the Joint Video Experts Team (JVET), materialized by the long-lasting collaboration between the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG). On the other hand, AV1 is the product of the Alliance for Open Media (AOM) [11], a new industry-driven initiative that has generated real competition, in terms of the potential market penetration a codec can achieve, from outside the two entities that have dominated the market over the past two decades.
Naturally, the afore-described developments generated a large body of literature in an attempt to document the coding capabilities of each codec and to demonstrate their advantages over legacy coders. These efforts produced mixed results, mainly attributed to the fact that new and refined coding tools were gradually made available in the respective reference software implementations before their official releases. Recent studies appear to converge on the fact that VVC is the best codec available today, followed by AV1, which in turn outperforms HEVC [12], [13]. Here, it is worth highlighting that the reference implementations of HEVC/H.265 and H.264/AVC, namely HM and JM, respectively, perform better in terms of compression efficiency than the popular open-source implementations x265 and x264. On the other hand, the latter facilitate encoding optimizations that provide real-time encoding capabilities, being orders of magnitude faster than the former and thus closer to industry adoption.
A key contribution of the paper is to provide a fair comparison of the recently released VVC standard [14] and AOM's debut AV1 encoder (using Intel's Scalable Video Technology SVT-AV1 implementation) [15], [16], along with their respective predecessors, namely HEVC (x265 implementation [17]) and VP9 [18]. More specifically, the paper provides bitrate comparisons in terms of subjective video quality assessment (VQA) [19], [20], the weighted Peak Signal-to-Noise Ratio (PSNR611), and the Video Multi-Method Assessment Fusion (VMAF) metrics.
To support adaptive video encoding, we refer to the use of dynamic adaptive HTTP streaming (i.e., MPEG-DASH), also known as Hypertext Transfer Protocol (HTTP) adaptive streaming (HAS), which has proven instrumental in the success of popular industry video streaming platforms such as YouTube, Netflix, Apple, and Amazon. For this paper, we adopt the current practice of allowing the client to select video segment encodings that adapt to network conditions. To select a particular video segment encoding, the streaming server must have video segments pre-encoded at different quality levels and video resolutions. Per-title optimization (i.e., per video sequence) is a popular approach introduced by Netflix for adapting to different bitrate demands [21]. The challenge is then to make an informed decision that will maximize the overall quality of the perceived video and thus the user experience (QoE) [22]. Given the paradigm shift from RTP/RTCP pull-based approaches to HTTP/Transmission Control Protocol (TCP) push-based ones in MPEG-DASH/HAS, the effects of erroneously received video packets are mitigated. Hence, the quality of service (QoS) parameters that now need to be optimized mostly concern minimizing (i) the initial time required to commence video rendering (i.e., the time taken to initialize the video buffer) and (ii) buffering events (the buffer becomes empty because limited available bandwidth prohibits the download of the next video segment), which result in a video freeze [22]. A significant body of literature exists studying efficient adaptation algorithms, which can be broadly categorized with respect to the dominant adaptation parameter considered. Rate-based techniques optimize the utilization of the available bandwidth, while buffer-based ones are concerned with maintaining the best possible buffer fullness ratio. Recently, collaborative rate- and buffer-based as well as quality-based methods (provided the video chunk's quality is available) have been widely used and studied [23]-[25].
The challenge in implementing such algorithms lies in the time-varying nature of (wireless) transmission media, which results in significant variations of the available bandwidth. The latter is especially true in view of the established user demands, also cultivated by the industry, for anywhere, anytime, any-device, robust access to video services. Moreover, the optimization objectives compete in the sense that increased video quality translates to increased bitrate and complexity (i.e., for higher video resolution decoding). In addition, video applications are resource-hungry and can cause battery drainage in mobile devices.
In this study, we propose the use of a multi-objective optimization approach that jointly maximizes video quality while minimizing bitrate and encoding time demands. The approach is codec-agnostic and can thus be trained and used over different codecs in a multi-codec adaptive video encoding setting. The proposed methodology, subject to minor modifications, can be used for both server-side and/or client-side adaptive video streaming applications. In the particular setup presented in this manuscript, server-side adaptive encoding for real-time video streaming scenarios is examined, utilizing client-side feedback concerning the instantaneous throughput to guide the adaptation process. In the typical MPEG-DASH setting, the proposed approach can be fine-tuned using the decoding time instead of the encoding time. The proposed approach significantly extends prior work by our group involving single-resolution medical videos encoded using the x265 video coding standard alone [26].
Key contributions of the present study include:
• Comprehensive video codec performance evaluation using objective metrics (e.g., PSNR, VMAF) and subjective evaluations,
• Adaptive video encoding leveraging multi-objective optimization of key video parameters including quality, bitrate demands, and encoding rate,
• Cross-codec applicability across different datasets and video characteristics, and
• VMAF-driven adaptive video encoding.
The rest of the manuscript is organized as follows. Section II details the methodology for the video-codecs performance comparative evaluation as well as the multi-objective optimization setup for adaptive video encoding. Then, Section III discusses the results obtained for both scenarios. Finally, Section IV provides the concluding remarks and highlights ongoing and future work.
VOLUME 9, 2021

II. METHODOLOGY
In the present section, we detail the methodology used to compare the video codecs' compression efficiency. This involves the video datasets considered in this study, followed by the experimental setup and the video quality assessment (VQA) methods. Then, we describe the multi-objective optimization approach used to achieve adaptive video encoding that matches the time-varying available bandwidth using different compression standards, based on the same video encoding and assessment methodology.

A. MATERIAL-VIDEO DATASETS
For the purposes of this study, we use two established datasets: the HEVC test sequences [27] and the UT LIVE video dataset [28]. The former spans video resolutions ranging from 240p to 1600p with corresponding frame rates between 24 and 60 frames per second (fps), totaling 21 videos of different content. The latter consists of ten videos of 768 × 432 resolution at 25 fps (7 videos) and 50 fps (3 videos). Overall, 31 video sequences were used in the present study to assess the compression capabilities of the examined video compression standards, as summarized in Table 1. A representative sample of the incorporated video content appears in Fig. 1. A subset of these videos was further used to train and validate the multi-objective video adaptation framework.

B. VIDEO CODECS PERFORMANCE EVALUATION 1) VIDEO ENCODING EXPERIMENTAL SETUP
The selected encoding parameters per video codec appear in Table 2. Comparable encoding setups were selected based on state-of-the-art literature to allow for a fair comparison, as further detailed in previous studies by our group [12], [13]. As such, the comparative performance evaluation relied on constant-quality encoding, with the selected quantization parameters comprising {27, 35, …}. An intra frame was inserted every 32 frames for 25 fps and 40 fps videos and every 48 frames for 50 fps and 60 fps videos. For the remaining codecs, random access intervals were set at the video's frame rate. Bi-prediction (B/b frames in VVC and x265, and alternate reference frames in the AV1 and VP9 codecs), enabling both preceding and subsequent pictures to be used for spatial and temporal prediction and compensation of the currently encoded frame, was employed. Additionally, codec-specific filters were enabled towards enhancing video fidelity. The remaining parameters were tuned according to predefined presets available in every codec, as highlighted in Table 2, namely random access for VVC, the default preset for SVT-AV1, ultrafast for x265, and rt for VP9.

2) VIDEO QUALITY ASSESSMENT
Video quality assessment of the encoded sequences relied on popular objective video quality metrics as well as perceptual (subjective) assessment of a representative subset of the compressed video instances, as described next. Here, we use the popular Peak Signal-to-Noise Ratio (PSNR) as the benchmark metric that allows direct comparison to similar studies in the literature. However, due to the rather weak correlation of PSNR to perceptual quality, we further employ the state-of-the-art Video Multi-Method Assessment Fusion (VMAF) metric [29], [30]. VMAF has shown consistently high correlations to subjective (user) ratings across video resolutions and datasets, and hence VMAF scores provide a more realistic representation of a video's quality [13], [29]. We also consider perceptual (subjective) video quality assessment to (i) verify the afore-described hypothesis that VMAF achieves better correlation to perceptual scores than PSNR, and (ii) further validate the performance of the video codecs.

3) OBJECTIVE VIDEO QUALITY ASSESSMENT
Peak Signal-to-Noise Ratio (PSNR) is considered the de facto benchmark metric, having been used for decades in image and video quality assessment. To compensate for its limited correlation to the actual perceived quality, a weighted alternative is employed nowadays:

PSNR611 = (6 × PSNR_Y + PSNR_U + PSNR_V) / 8,

where PSNR_Y refers to the luma component, whereas PSNR_U and PSNR_V correspond to the chrominance ones. PSNR611 considers the brightness intensity (luma) as the most decisive contributing factor of the raw YUV space, driven by the observation that the human eye is more susceptible to brightness variations than to colors. PSNR611 was introduced during HEVC development [31] and was found to correlate better with subjective quality (with respect to traditional PSNR) while providing a combined score for luma and chrominance assessment, ideally suited for computing Bjontegaard Delta (BD-rate) differences [32]. Video Multi-Method Assessment Fusion (VMAF) [29], [30], as its name suggests, relies on the fusion of known image quality assessment metrics applied to VQA (Visual Information Fidelity (VIF) [33]; Detail Loss Metric (DLM) [34]) in conjunction with the computation of the motion offset between subsequent frames. Importantly, VMAF uses machine learning to train on a specific dataset and thus increase its correlation to subjective ratings. Since its introduction in 2016, it has become one of the most widely used VQA methods in both industry and academia. Here, we use VMAF as the primary VQA algorithm (using the default, pre-trained machine learning coefficients, i.e., no training on the examined datasets has been performed), while PSNR611 is employed for benchmarking purposes.
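As a minimal sketch, the 6:1:1 weighting above can be computed directly from the per-component PSNR values (the function name is ours):

```python
def psnr_611(psnr_y: float, psnr_u: float, psnr_v: float) -> float:
    """Weighted PSNR combining luma and chroma with 6:1:1 weights,
    as introduced during HEVC development."""
    return (6.0 * psnr_y + psnr_u + psnr_v) / 8.0
```

For example, a sequence with PSNR_Y = 40 dB and both chroma components at 36 dB yields a combined score of 39 dB, dominated by the luma term.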
Finally, to compute the bitrate gains and/or reductions of one video codec over another, the Bjontegaard Delta (BD) metric [32] was employed (see [12], [13] for more details).
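The BD computation itself is not reproduced in the text; the following is a hedged sketch of the commonly used BD-rate procedure (a cubic fit of log-rate as a function of quality, integrated over the overlapping quality interval), not necessarily the exact implementation used in [12], [13]:

```python
import numpy as np

def bd_rate(rates_ref, q_ref, rates_test, q_test):
    """Bjontegaard-Delta rate: average bitrate difference (%) of the
    test codec relative to the reference at equal quality.
    Fits a cubic polynomial of log-rate vs. quality for each codec and
    integrates the difference over the overlapping quality interval."""
    p_ref = np.polyfit(q_ref, np.log(rates_ref), 3)
    p_test = np.polyfit(q_test, np.log(rates_test), 3)
    lo = max(min(q_ref), min(q_test))          # overlap of quality ranges
    hi = min(max(q_ref), max(q_test))
    int_ref, int_test = np.polyint(p_ref), np.polyint(p_test)
    avg_diff = (np.polyval(int_test, hi) - np.polyval(int_test, lo)
                - np.polyval(int_ref, hi) + np.polyval(int_ref, lo)) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0    # percent bitrate change
```

A test codec that needs exactly half the bitrate at every quality point reports a BD-rate of −50%.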

4) SUBJECTIVE VIDEO QUALITY ASSESSMENT

C. SELECTED VIDEO INSTANCES
To validate the perceptual quality of the compressed video instances, a balanced subset in terms of video quality, video content, video resolution, and frame rate was extracted from the total number of videos investigated in this study. More specifically, 12 videos originating from both the HEVC and the UT LIVE datasets, spanning 416 × 240, 768 × 432, 832 × 480, 1280 × 720, 1920 × 1080, and 2560 × 1600 video resolutions, were selected. For each of the aforementioned videos, three different rate points (quantization values) were used, categorized with respect to their objective VMAF scores as follows:
• High Quality: VMAF score ≥ 85
• Medium Quality: 60 ≤ VMAF score ≤ 70
• Low Quality: VMAF score ≤ 50

FIGURE 2. Adaptive video encoding abstract system architecture. Raw video files are segmented into chunks of 3 seconds in duration to capture differences in content and characteristics within the same video sequence. A dense encoding space is then used to produce a sufficient sample of video instances per examined video codec. The approach relies on the Pareto front to apply curve fitting using regression, in order to generate forward prediction models per optimization objective, namely video quality, bitrate, and encoding rate. During real-time video streaming, given a set of constraints and a mode of operation, the Newton method is employed to predict an encoding configuration that satisfies these constraints. In case the encoding parameter prediction set is empty, the constraints are relaxed until a matching configuration is provided.
This categorization allowed the collection of three different video quality categories, namely low, medium, and high quality, across the investigated video resolutions. The objective was to capture a balanced and adequate sample of the compressed video instances that would allow a safe extrapolation of the perceptual scores over the whole dataset; in other words, a representative video sample that would closely approximate the outcome of the perceptual evaluation had it been performed over the entire set of investigated videos. Here, it is important to highlight that subjectively evaluating the entire set of compressed video sequences was not feasible due to time constraints. Overall, a total of 108 video instances were assessed (12 videos × 3 QPs × 3 codecs) vs. a total of 496 encoded video instances used in this series of experiments (31 videos × 4 QPs × 4 codecs).

D. TRAINING AND TESTING SESSIONS
The encoded videos were evaluated over three distinct sessions, each corresponding to one of the examined encoders, namely SVT-AV1, x265, and VP9. Reliable VVC-encoded video rendering was not possible at the time of the experiments. The study involved 32 volunteers. Before initiating the actual scoring session, participants were given a scoring sheet with the rating categories, and each scoring entry was explained using a video example of matching quality. The videos used for this purpose were not part of the perceptual VQA dataset. Then, the assessment procedure was described, and participants were given as much time as required to familiarize themselves with the involved processes before the examined test sequences were displayed in randomized order.
All evaluations were performed using a SAMSUNG U28E690D LCD TV with a spatial resolution of 3840 × 2160 pixels and maximum screen brightness (peak luminance of 370 lux) in a brightly lit room. The optimal 3H viewing distance was secured for all participants [22], [35].

E. ADAPTIVE VIDEO ENCODING
An abstract system diagram of the proposed adaptive video encoding framework appears in Fig. 2. The method relies on a two-step process: the first step is performed offline and generates forward prediction models, while the second capitalizes on these models to perform real-time adaptive video encoding. We start by producing a dense encoding space of the investigated video segments to capture the different content characteristics found within the same video sequence. Then, Pareto front sampling is performed to identify non-dominated encoding setups, followed by curve fitting using regression to produce the forward prediction models. Forward prediction models are generated per optimization objective, namely video quality, bitrate, and encoding rate (in frames per second, FPS). These models allow us to predict encoding configurations that match the time-varying network characteristics during real-time video streaming. In other words, a significant variation in available bandwidth (an increase or a decrease) will trigger the proposed adaptive video control mechanism (described below) to generate a new encoding configuration. For our proposed maximum video quality mode, the new encoding configuration will have to meet constraints on the currently available bitrate while securing the best possible video quality and an encoding rate suitable for real-time streaming services.
Our proposed adaptive video coding is more responsive than typical encoding approaches used with MPEG-DASH, where pre-encoded video segments cover a sparser sampling of available bandwidths and associated video resolutions [21], [36]. As a result, we would expect the proposed methods to translate into fewer buffering and video stalling incidents. To this end, we propose the minimum bitrate mode that minimizes bitrate requirements while meeting or exceeding constraints on video quality and the encoding rate. For the proposed maximum performance mode, we emphasize real-time performance while meeting constraints on available bandwidth and required video quality. Overall, our proposed models and optimization methods provide strong adaptation to different scenarios.

1) VIDEO MODELING AND OPTIMIZATION
In this section, we will provide a detailed description of how we model the different objectives, compute the relevant regression models, and compute the optimal encoding parameters. We then demonstrate how to apply the models to optimize for maximum video quality, minimum bitrate, and maximum performance.
In order to model strong variations of video content, we break the video into short video segments of three seconds. Then, over each segment, we model the objectives as functions of the encoding configuration and the quantization parameter. We thus use a large number of models as functions of the video segment i and the encoding configuration c.
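As a small illustration (the function name is ours), partitioning a video into such three-second segments amounts to grouping frames by index:

```python
def segment_bounds(total_frames: int, fps: float, seg_seconds: float = 3.0):
    """Split a video of total_frames at fps into consecutive segments of
    seg_seconds each; returns (start, end) frame indices, end exclusive.
    The last segment may be shorter than seg_seconds."""
    step = int(round(seg_seconds * fps))
    return [(s, min(s + step, total_frames))
            for s in range(0, total_frames, step)]
```

For a 10-second clip at 25 fps (250 frames), this yields three full 75-frame segments plus one final 25-frame segment.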
In what follows, we summarize the mathematical notation in Table 3.
For each video segment i and encoding configuration c, we model each objective as a quadratic function of the quantization parameter (see Table 3 for the sub-indices of β_{i,c,j,k}):

f_{i,c,j}(QP) = β_{i,c,j,0} + β_{i,c,j,1} · QP + β_{i,c,j,2} · QP²,

where j indexes the optimization objective (VMAF, PSNR, bitrate B, or encoding rate FPS). The regression models are estimated using least squares. To demonstrate the approach, let us consider the model for VMAF. In this example, we will be estimating the values of β_{i,c,1,0}, β_{i,c,1,1}, and β_{i,c,1,2}. First, we need to encode the video varying the quantization parameter and measure the corresponding VMAF values for the encoded videos. We can also measure PSNR, B, and FPS from the same video encodings. The procedure for estimating the regression models for PSNR, B, and FPS is similar.
For each video segment i and encoding configuration c, let the video be encoded using n values of the quantization parameter given by:

QP = [QP_1, QP_2, …, QP_n]^T.   (1)

Suppose that the corresponding VMAF values for each QP are given by:

y = [VMAF_1, VMAF_2, …, VMAF_n]^T.   (2)

We then formulate the regression model using:

y = Xβ + ε,   (3)

where X is the n × 3 design matrix whose k-th row is [1, QP_k, QP_k²], β = [β_{i,c,1,0}, β_{i,c,1,1}, β_{i,c,1,2}]^T, and ε denotes an n × 1 independent, identically distributed random noise vector. From equation (3), we obtain the least-squares estimates using:

β̂ = (X^T X)^{-1} X^T y,

where we note that (X^T X)^{-1} X^T only depends on QP.

To assess the accuracy of any given model, we compute the adjusted R² for each model. To define the adjusted R², we first need to define the residual sum of squares (RSS) using:

RSS = Σ_{k=1}^{n} (y_k − ŷ_k)², with ŷ = Xβ̂.

We then compute the total sum of squares (TSS) using:

TSS = Σ_{k=1}^{n} (y_k − ȳ)²,

where the average output is given by:

ȳ = (1/n) Σ_{k=1}^{n} y_k.

We are now ready to define the adjusted R² using:

adjusted R² = 1 − (RSS / (n − d)) / (TSS / (n − 1)),

where n denotes the number of QP values and d = 3 represents the number of non-zero β values used in our model. For a perfect model fit, RSS = 0, and the adjusted R² is 1. It is also important to note that the definition of the adjusted R² takes into account the degree of the polynomial fit. For higher-order polynomials, the same value of RSS will produce a lower value of the adjusted R². Hence, by maximizing the adjusted R², we can also determine whether higher-order polynomials are justified. In general, stepwise regression refers to the process of selecting the polynomial order that maximizes the adjusted R², balancing the use of higher-degree polynomials against the degree of fit.

Inverting the regression model is easy to do. For example, for any given target VMAF value, we solve

β̂_{i,c,1,0} + β̂_{i,c,1,1} · QP + β̂_{i,c,1,2} · QP² = VMAF

to determine the QP value that would deliver this video quality. In general, the determined QP value will not be an integer; hence, we round up the real-valued QP to obtain an integer solution.
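The fitting, adjusted-R², and inversion steps above can be sketched in a few lines. This is our own minimal illustration of the described procedure, not the paper's implementation; the adjusted-R² form assumes d counts all three coefficients:

```python
import numpy as np

def fit_quadratic(qps, values):
    """Least-squares fit of value = b0 + b1*QP + b2*QP^2 (equation (3));
    returns the coefficients and the adjusted R^2 with d = 3."""
    X = np.vander(qps, 3, increasing=True)        # columns: 1, QP, QP^2
    beta, *_ = np.linalg.lstsq(X, values, rcond=None)
    resid = values - X @ beta
    rss = float(np.sum(resid ** 2))
    tss = float(np.sum((values - np.mean(values)) ** 2))
    n, d = len(qps), 3
    adj_r2 = 1.0 - (rss / (n - d)) / (tss / (n - 1))
    return beta, adj_r2

def invert_for_qp(beta, target):
    """Solve b0 + b1*QP + b2*QP^2 = target for QP and round up to an
    integer, as described in the text. Picks the smallest non-negative
    real root (appropriate when quality decreases monotonically over
    the valid QP range); returns None if no such root exists."""
    b0, b1, b2 = beta
    roots = np.roots([b2, b1, b0 - target])
    real = sorted(r.real for r in roots if abs(r.imag) < 1e-9 and r.real >= 0)
    return int(np.ceil(real[0])) if real else None
```

For synthetic data generated from a known quadratic, the fit recovers the coefficients and an adjusted R² of essentially 1.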

Multi-objective Optimization
An exhaustive evaluation of all QP values for all possible configurations is not needed. Instead, we only need to consider values that are optimal in the multi-objective sense. In this section, we explain how this is accomplished.
Formally, we write that we are interested in solving the multi-objective optimization problem expressed as:

max VQ(c, QP), min B(c, QP), max FPS(c, QP) over all (c, QP).   (4)

To solve equation (4), let the collection of points generated by all possible encodings be expressed as:

P = {(VQ_m, B_m, FPS_m), m = 1, …, M}.   (5)

Then, we eliminate an encoding m = p if there is at least one other encoding m = k (k ≠ p) that is at least as good in all objectives and strictly better in at least one, as given by:

VQ_k ≥ VQ_p, B_k ≤ B_p, and FPS_k ≥ FPS_p.   (6)

To generate the Pareto front, we compare all points against each other using a double for-loop through P and eliminate points based on equation (6). We use the term Pareto front to describe the remaining encodings that do not get eliminated through this process.
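The double-loop elimination can be sketched as follows (the tuple layout (VQ, B, FPS) and the function names are ours):

```python
def pareto_front(points):
    """points: list of (vq, bitrate, fps) tuples, one per encoding.
    A point is eliminated if some other point is at least as good in
    every objective (higher vq, lower bitrate, higher fps) and differs
    in at least one, per the elimination rule of the text."""
    def dominates(a, b):
        return (a[0] >= b[0] and a[1] <= b[1] and a[2] >= b[2]) and a != b

    return [p for p in points
            if not any(dominates(q, p) for q in points)]
```

The double loop is O(M²) in the number of encodings, which is acceptable here because the front is computed offline.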

Formulating and Solving Constrained Optimization Problems
We summarize the constrained optimization modes that can be solved using our derived models. We note that all of the solutions lie on the Pareto front; if a solution did not lie on the Pareto front, we would be able to select another encoding that is better in every respect. We begin by establishing notation (see also Table 3). Let VQ denote the video quality metric of the selected mode (e.g., VMAF). We would like to impose a minimum video quality bound given by VQ ≥ VQ_min. Similarly, for maximizing performance, we require that FPS, the achieved encoding rate in frames per second, be above a minimum performance rate, as given by FPS ≥ FPS_min. Also, we want to bound the encoded bitrate by the available bitrate using B ≤ B_max. In what follows, we define the constrained optimization problems based on this notation.
We next formulate and solve the constrained optimization problems that provide optimal video encoding delivery. We formulate the minimum bitrate mode using:

min B subject to (VQ ≥ VQ_min) and (FPS ≥ FPS_min).   (7)

To solve equation (7), we need to consider encoding configurations that satisfy the constraints (VQ ≥ VQ_min) and (FPS ≥ FPS_min). Among these encodings, we then select the one that gives the minimum bitrate.
Similarly, we define the maximum video quality mode using:

max VQ subject to (B ≤ B_max) and (FPS ≥ FPS_min).   (8)
Then, the maximum encoding performance mode is defined using:

max FPS subject to (B ≤ B_max) and (VQ ≥ VQ_min).   (9)

We solve (8) and (9) by maximizing VQ or FPS, respectively, over the encodings that meet the corresponding constraints.
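All three modes reduce to filtering feasible points and optimizing a single objective over them. A sketch (the tuple layout (VQ, B, FPS) and the function names are ours):

```python
def min_bitrate_mode(front, vq_min, fps_min):
    """Minimum bitrate mode: among points meeting the quality and
    speed constraints, return the cheapest one (None if infeasible)."""
    feasible = [p for p in front if p[0] >= vq_min and p[2] >= fps_min]
    return min(feasible, key=lambda p: p[1]) if feasible else None

def max_quality_mode(front, b_max, fps_min):
    """Maximum video quality mode under bitrate and speed constraints."""
    feasible = [p for p in front if p[1] <= b_max and p[2] >= fps_min]
    return max(feasible, key=lambda p: p[0]) if feasible else None

def max_performance_mode(front, b_max, vq_min):
    """Maximum encoding performance mode under bitrate and quality
    constraints."""
    feasible = [p for p in front if p[1] <= b_max and p[0] >= vq_min]
    return max(feasible, key=lambda p: p[2]) if feasible else None
```

Returning None when the feasible set is empty is what triggers the constraint relaxation described later in the text.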

2) ENCODING CONFIGURATIONS FOR FORWARD PREDICTION MODEL GENERATION
The encoding configurations for the forward prediction models are summarized in Tables 4.A-4.C. More specifically, for each of the examined codecs, namely x265, VP9, and SVT-AV1, a total of 200, 200, and 252 video instances are generated, respectively, per investigated video segment. A similar configuration setup, tuned towards real-time performance, is used for all three encoders. Different encoding structures and the use of deblocking filters to enhance quality are further considered. The objective here is to demonstrate the universality of the proposed codec-agnostic adaptation framework.

3) FORWARD PREDICTION MODELS USING LINEAR REGRESSION
As described earlier, we compute forward prediction models to determine the mapping from the encoding parameters to the objectives of video quality, bitrate, and encoding time in a live video streaming session, without having to actually encode the video. This task requires the knowledge emanating from the above-described offline process used to generate a dense encoding space. Linear and logistic regression with up to third-order polynomials were used to determine the most suitable models as functions of the encoding configurations given in Table 4, per examined video codec. Furthermore, stepwise regression was employed to optimize the trade-off between the cross-validated residual error and model complexity, hence limiting overfitting with more complex models.

4) REAL-TIME ENCODING USING MULTI-OBJECTIVE OPTIMIZATION
Adaptive video encoding for real-time video delivery applications leverages the codec-agnostic algorithm described in Fig. 3 that implements the abstract system architecture highlighted in Fig. 2. The basic algorithm is broken into two parts. First, the forward prediction models are computed offline for each video segment. Second, the forward prediction models are used to adaptively encode each video segment for a pre-processed video.
For computing the forward prediction models, we apply stepwise regression for modeling the Pareto front of the VMAF, PSNR, bitrate, and encoding rate of each video segment of each video. Here, we note that the goal is to summarize all of the encoding options with simple regression models that are fast to process during real-time video delivery. Thus, instead of storing the encoded video segments, we store the parameters of the forward regression models.
After storing the forward regression models for the first video segment, we also consider the possibility of reusing the regression models from the previous video segment. Here, we reuse the forward regression models from the previous segment if they can accurately predict the encoding for the current video segment (see the second if statement in Fig. 3). To reuse the models, we require that the fitted regression model error does not exceed the maximum values of 5% for video quality, 10% for bitrate, and 10% for encoding time for any one of the encodings. Hence, for real-time applications, our approach can reduce the overall computational complexity while sacrificing some model accuracy.
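The reuse decision above can be sketched as a simple threshold check. The text does not specify whether the error is taken relative to the predicted or the measured values; the sketch below assumes relative error with respect to the measured values, and the names are ours:

```python
def can_reuse_models(predicted, measured):
    """Decide whether the previous segment's forward models can be
    reused: the relative prediction error must stay within 5% for
    video quality, 10% for bitrate, and 10% for encoding time over
    every probed encoding. predicted/measured are lists of
    (vq, bitrate, time) triples."""
    limits = (0.05, 0.10, 0.10)
    for pred, meas in zip(predicted, measured):
        for pv, mv, lim in zip(pred, meas, limits):
            if abs(pv - mv) > lim * abs(mv):
                return False            # one encoding exceeds its bound
    return True
```

A single encoding exceeding any one bound forces the models to be refitted for the current segment.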
Then, for real-time video delivery, each video is broken into the same video segments as the ones used for computing the forward models. We then retrieve the forward models for each segment and use them to determine points that satisfy the optimization modes. We use the procedure outlined in Section II.C 1) to compute the QP_opt parameter for each chosen encoding configuration C_opt. Here, we note that the selected QP_opt parameter is real-valued; thus, we quantize QP_opt. In the event that the pair QP_opt and C_opt is not valid, we relax the constraints to obtain a valid solution. In this case, we allow QP to vary from −4 to +4 around QP_opt and consider alternative encodings until we identify valid encoding parameters.
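The ±4 relaxation step can be sketched as a nearest-first search around the quantized QP_opt (the function names are ours, and `is_valid` stands in for whatever validity check the system applies to a candidate pair):

```python
def relax_qp(qp_opt, is_valid, max_delta=4):
    """If the rounded QP_opt is invalid for the chosen configuration,
    probe QP values at increasing distance from it (up to +/-4, as in
    the text) and return the first valid one, or None if the search
    fails for this configuration."""
    if is_valid(qp_opt):
        return qp_opt
    for delta in range(1, max_delta + 1):
        for candidate in (qp_opt - delta, qp_opt + delta):
            if is_valid(candidate):
                return candidate
    return None
```

Probing nearer values first keeps the relaxed encoding as close as possible to the originally predicted operating point.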

III. RESULTS
In what follows, we compare the compression effectiveness of different video codecs using both objective and subjective video quality assessment. We present and discuss results using PSNR and VMAF for a variety of video resolutions.
Then, we describe adaptive video encoding results to demonstrate the advantages of our adaptive approach over nonadaptive approaches.
We provide a summary of our software and hardware platforms to support the reproducibility of our results. We implemented our methods on a Windows 10 Dell Precision Tower 7910 Server 64-bit platform with Intel(R) Xeon(R) Processor E5-2630 v3 (8 cores, 2.4GHz). In terms of software, we used SVT-AV1 version 0.7, VP9 version 1.8, and x265 version 2.0.

A. COMPARATIVE PERFORMANCE EVALUATION OF VIDEO COMPRESSION STANDARDS 1) OBJECTIVE VIDEO QUALITY ASSESSMENT
We provide comprehensive results comparing all video codecs against each other using BD-PSNR in Table 5 and using BD-VMAF in Table 6. For better visualization, the results are also plotted in Fig. 4 for both BD-PSNR and BD-VMAF.
Clearly, VVC achieves the best compression efficiency of all rival standards, followed by SVT-AV1. x265 improves compression efficiency over VP9 in all but the lowest examined video resolution, namely 240p, when VMAF is used to compute the BD-rate gains.
More specifically, VVC reduces bitrate demands compared to SVT-AV1 by ∼51.7% on average based on PSNR scores, climbing to ∼61.6% when VMAF ratings are used. There is approximately a 10% difference in favor of VVC when VMAF is employed, which calls for further validation in conjunction with large-scale subjective evaluation studies (see also subjective VQA, below). Significant bitrate gains are observed when VVC is compared against x265 and VP9. With respect to x265, average bitrate gains are comparable and extend to ∼66.6% and ∼69.9% for PSNR and VMAF, respectively. When compared against VP9, results from both algorithms are also in agreement, documenting reduced bitrate requirements between ∼73% and ∼74%.
SVT-AV1 achieves significant compression performance gains over x265, as depicted in Tables 5 and 6. Bitrate gains using BD-PSNR are in the order of ∼32% and slightly reduced to 25.6% when BD-VMAF is used. Likewise, increased performance over its predecessor, VP9, is documented, reaching ∼44% and 34%, for PSNR and VMAF computed BD-rate gains, respectively. Again, there is a noticeable difference between the two objective scores. As discussed in the subjective evaluation sub-section below, VMAF's high correlation to perceptual quality ratings for both AV1 and VP9 suggests that the bitrate gains using VMAF are, in fact, more realistic.
In the last set of codec comparison pairs, x265 surpasses VP9 in compression efficiency, recording average gains between 19% and 20% for both VQA algorithms used. However, it is important to highlight that the results show considerable variation across the video resolution ladder. Especially when PSNR is used, the measured standard deviation is ∼13%. Moreover, at the lower 240p end, VP9 is found to be more efficient in terms of VMAF scores.
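For reference, the BD-rate figures reported above follow the standard Bjontegaard methodology: fit log-bitrate as a cubic polynomial of quality for each codec, integrate both fits over the overlapping quality interval, and convert the average log-rate difference into a percentage. The following is a minimal sketch, assuming four rate-quality points per codec as in the classical formulation; it is illustrative, not the exact tool used for Tables 5 and 6.

```python
import numpy as np

def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Bjontegaard-delta average bitrate difference (%) of a test codec
    versus an anchor, from four (rate, quality) points each. The quality
    axis can be PSNR (BD-PSNR rates) or VMAF (BD-VMAF rates).
    Negative values mean the test codec needs less bitrate."""
    log_rate_a = np.log(rate_anchor)
    log_rate_t = np.log(rate_test)
    # Fit log-rate as a cubic polynomial of quality for each codec.
    poly_a = np.polyfit(quality_anchor, log_rate_a, 3)
    poly_t = np.polyfit(quality_test, log_rate_t, 3)
    # Integrate both fits over the overlapping quality interval.
    lo = max(min(quality_anchor), min(quality_test))
    hi = min(max(quality_anchor), max(quality_test))
    int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
    int_t = np.polyval(np.polyint(poly_t), hi) - np.polyval(np.polyint(poly_t), lo)
    # Average log-rate difference -> percentage bitrate change.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0
```

As a sanity check, a test codec that halves the bitrate at every quality level yields a BD-rate of −50%.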

2) SUBJECTIVE VIDEO QUALITY ASSESSMENT
We validate our results using subjective video quality assessment. In previously published studies, the HEVC/H.265 standard matched the targeted bitrate gains of ∼50% over its predecessor, H.264/AVC, based on subjective ratings, whereas objective assessment documented smaller gains [31]. For our purposes, we note that it is impractical to conduct subjective video quality studies on very large datasets. Instead, we conducted our study on a representative sample of videos that capture different video content, compression levels, and video characteristics (i.e., resolution and framerate). This sample is then evaluated and used to compute the correlation between subjective and objective ratings.
The subjective VQA results' correlation to the objective scores is summarized in Table 7. The subset of the investigated datasets totaled 108 video instances that were assessed by 32 human subjects in three different sessions. Each session corresponded to videos encoded with the SVT-AV1, x265, and VP9 compression standards. Unfortunately, assessing VVC videos was not possible at the time that the evaluation sessions took place due to the unavailability of a reliable VVC player. As a result, we only report the correlation between subjective and objective scores of codecs belonging to the aforementioned groups. Table 7 demonstrates the correlation of the subjective evaluation scores to both the PSNR and VMAF objective VQA metrics per investigated video codec. Two widely used correlation indices were employed for this purpose, namely the Spearman rank-order correlation coefficient (SROCC) and the Pearson linear correlation coefficient (PLCC). The VMAF algorithm achieved a significantly higher correlation to the subjective ratings than PSNR for all three examined codecs and both correlation coefficients. The latter observation strongly suggests that the documented bitrate gains using BD-VMAF, highlighted in Table 6, are in fact more trustworthy and better reflect the performance comparison of the examined video codecs. Moreover, VMAF's correlation results justify its wide adoption over the past few years amongst the research community and the industry. On the other hand, PSNR failed to adequately capture the user-perceived quality across all codecs and correlation indices. In particular, PLCC recorded values between 0.58 and 0.623, while SROCC recorded slightly better correlation ratings, from 0.61 to 0.64, for the three investigated video compression standards.
We obtained the highest correlations for the SVT-AV1 codec. More specifically, AV1 achieved a 0.78 PLCC and a 0.75 SROCC correlation to the VMAF scores. Interestingly, VP9 performed significantly better than the x265 codec, reaching a 0.74 correlation compared to 0.635 for the SROCC and 0.7 against 0.633 for the PLCC, both for VMAF scores. Given the gap from 0.6-0.7 to the ideal correlation of 1.0, and despite the great progress achieved by VMAF, it is clear that there is a need for continued research on developing reliable VQA metrics that better correlate to human interpretation. Accordingly, such metrics should achieve correlation to user ratings over 0.95 before they can be used as the sole criterion to measure users' perceived QoE. Clearly, larger studies are also needed to support such efforts, securing the objectivity of such algorithms across video codecs, video content, and video characteristics.
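The two correlation indices used in Table 7 can be computed directly from paired subjective and objective scores. As an illustrative sketch (plain Python, hypothetical helper names): PLCC is the Pearson coefficient on the raw scores, while SROCC is the same coefficient computed on the ranks.

```python
from statistics import mean

def plcc(x, y):
    """Pearson linear correlation coefficient between objective scores
    and subjective ratings (e.g., mean opinion scores)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def srocc(x, y):
    """Spearman rank-order correlation coefficient: PLCC computed on the
    ranks of the data (ties receive their average rank)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0.0] * len(values)
        i = 0
        while i < len(values):
            j = i
            while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1     # average rank for tied values
            for k in range(i, j + 1):
                result[order[k]] = avg_rank
            i = j + 1
        return result
    return plcc(ranks(x), ranks(y))
```

Note that SROCC only measures monotonic agreement, which is why a metric can rank videos correctly (high SROCC) while still being a nonlinear predictor of subjective quality (lower PLCC).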

B. ADAPTIVE VIDEO ENCODING VALIDATION
A representative subset of the proposed adaptive video encoding framework validation space is presented in the current section. We start by providing the computed forward prediction model equations. Then, we demonstrate specific use-case scenarios involving the x265 and SVT-AV1 video codecs and different video sequences. For each codec, the precise forward models' equations per examined video and selected mode of operation are presented, followed by the advantages of the proposed methods.

1) FORWARD PREDICTION MODELS EQUATIONS
In this section, we provide a summary of our regression models. We consider a total of 5 forward prediction models per objective for x265, 5 for VP9, and 6 for SVT-AV1.

a: FORWARD PREDICTION MODELS VALIDATION
To demonstrate the efficiency of the generated forward prediction models, we provide results documenting the adjusted R² of the fitted models. As detailed in the methodology, a perfect fit results in an adjusted R² value of 1. Hence, the closer the adjusted R² is to 1, the higher the model's accuracy. Table 8 tabulates the results for linear and quadratic logistic models by video, video codec, and optimization objective. The minimum, maximum, and median adjusted R² values are given, abstracted from the entire set of considered models per segment and employed video coding structure (3 segments × 5 models for x265 and VP9, and 6 models for AV1). Only encoding structures with the best-performing filters option were considered, as described above. In this fashion, we can demonstrate how well the models fit a certain video before being used for adaptive, real-time video streaming purposes.
Due to space constraints, four selected videos per video codec, for a total of 12 videos, are summarized in Table 8. The key message here is that the proposed methodology achieves robust model fitting that can hence be confidently used for adaptive video encoding using the approach described by the pseudocode algorithm of Fig. 3. In particular, the majority of the median adjusted R² values of linear models are significantly higher than 0.9, indicating strong fits for all objectives except for VMAF. For VMAF, quadratic models provide the best fits and are used instead. The same holds for certain videos with respect to the FPS objective, such as the Kristin and Sara video AV1 models and the lower-resolution VP9 models of the BlowingBubbles and SunFlower videos. In all cases, however, the adjusted R² median values, even for linear models, are higher than 0.7. Clearly, based on the depicted results, the proposed methods can be used to derive robust models over video datasets with diverse characteristics and content. Fig. 5 summarizes the median adjusted R² values of Table 8; here, the median values are grouped only based on the encoding structure, contrary to Table 8, where grouping per both segment and encoding structure was performed. The results reiterate the robust fit of the generated models, which are able to capture a video's unique characteristics.
As is evident, for the PSNR and bitrate objectives, the adjusted R² median values are virtually indistinguishable, hence favoring the use of linear models during adaptive video encoding. A strong fit is also depicted for the encoding rate (i.e., FPS) objective, despite the slightly lower adjusted R² median values and higher variation, especially for the SVT-AV1 models. On the other hand, quadratic models are needed for the VMAF objective.
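The linear-versus-quadratic comparison above rests on the adjusted R², which penalizes the extra quadratic coefficient so that a more complex model must earn its keep. The following sketch illustrates the computation on synthetic data; `adjusted_r2` is a hypothetical helper, not our actual fitting code, and the saturating VMAF-like curve is made up for illustration.

```python
import numpy as np

def adjusted_r2(x, y, degree):
    """Fit a polynomial forward model of the given degree and return its
    adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where p is the number of predictors (excluding the intercept)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    n, p = len(x), degree
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
```

On a curved quality-versus-QP relationship, the quadratic model's adjusted R² exceeds the linear one despite the complexity penalty, mirroring the behavior observed for the VMAF objective.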

2) ADAPTIVE VIDEO ENCODING
We present two examples to demonstrate adaptive video coding. We selected the Cactus 1080p video of the HEVC dataset for demonstrating adaptive encoding using x265. For SVT-AV1, we selected the Pedestrian 432p video. For both videos, we present results for the minimum bitrate optimization mode, extending our results for bitrate savings across different codecs.
For both examples, we break each video into three-second segments. Then, we compute forward prediction models per segment as depicted in Tables 9 and 13. In addition to the model coefficients, we also report the adjusted R² value for each video segment. Based on the high adjusted R² values, we can see that the forward regression models provide excellent fits for all video segments. As described in Figs. 2 and 3, we consider reusing the forward models from the first video segment provided the model errors remain below some maximum values. In the present examples, the close agreement of the forward regression model coefficients between segments, combined with the high adjusted R² values, justifies the selection of the first segment's models for use throughout the video's remaining segments. Hence, for both scenarios described next, the forward regression models from the videos' first segment are used for subsequent segments. Here, it is also worth noting the very low values of the β_{i,c,j,2} coefficient, which essentially translates to the use of linear models over quadratic models for the specific examples (see also Table 8 and Fig. 5).

a: MINIMUM BITRATE DEMANDS USING x265
To compare against static approaches, we consider encoding the Cactus 1080p video of the HEVC dataset based on the YouTube recommendations described in [36]. We encode each video segment using constant quality encoding and x265 default encoding parameters. We only vary the selected QP parameter to approximate the bitrate recommendations tabulated in Table 10. We then average quality (PSNR and VMAF), bitrate, and FPS over the first three segments composing the Cactus video sequence and use these values as constraints to our multi-objective optimization approach described next. We choose to only use the first three segments to facilitate simpler results reporting, given that the Cactus video sequence duration is 10 seconds and each segment corresponds to 3 seconds. Note that the proposed models can be trained over any segment duration. Table 11 demonstrates the benefits of using the proposed adaptive video encoding approach. For the minimum bitrate optimization mode depicted, the objective is to minimize bitrate subject to the quality and encoding rate constraints given in equation (3). We use PSNR as the quality constraint and require 37.15 dB, matching the average value achieved by the YouTube recommendations given in Table 10. We note that the quality constraint can be either PSNR or VMAF, and the appropriate model can be selected and invoked accordingly. However, we report output values for both quality metrics to facilitate consistency between experiments. In terms of encoding rate, the constraint is set to match the real-time encoding requirement (i.e., the video's framerate) and hence is 50 frames per second. An encoding mode switch is considered at every segment, while the results of every segment, as well as the averages over the entire video sequence, are displayed.
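The minimum bitrate optimization mode reduces to a constrained search over the forward prediction models: among all (configuration, QP) pairs, select the one with the lowest predicted bitrate whose predicted quality and encoding rate satisfy the constraints. The sketch below uses toy linear models; the names, model shapes, and numbers are illustrative placeholders, not our trained models.

```python
def min_bitrate_mode(models, qp_range, quality_min, fps_min):
    """Minimum bitrate optimization mode: over every configuration's
    forward models (callables mapping QP -> predicted value), return the
    (config, qp, predicted_bitrate) triple with the lowest predicted
    bitrate that still meets the quality and real-time FPS constraints."""
    best = None
    for config, model in models.items():
        for qp in qp_range:
            if model["quality"](qp) < quality_min:
                continue                      # violates quality constraint
            if model["fps"](qp) < fps_min:
                continue                      # violates real-time constraint
            rate = model["bitrate"](qp)
            if best is None or rate < best[2]:
                best = (config, qp, rate)
    return best

# Toy forward models for two hypothetical encoding configurations.
toy_models = {
    "fast": {"quality": lambda qp: 45.0 - 0.3 * qp,
             "bitrate": lambda qp: 9000.0 - 200.0 * qp,
             "fps":     lambda qp: 40.0 + qp},
    "slow": {"quality": lambda qp: 47.0 - 0.3 * qp,
             "bitrate": lambda qp: 8000.0 - 180.0 * qp,
             "fps":     lambda qp: 20.0 + qp},
}
```

With a 37.15 dB quality floor and a 50 FPS real-time requirement, the search trades configurations and QP values against each other exactly as the adaptation step does with the fitted models.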
The results of Table 11 show that the minimum bitrate mode produced savings of 11% over static encoding, while the average PSNR also increased. From the results, we can see that the encoding rate was reduced to 49.33 frames per second, suggesting that our optimization took slightly longer than the static encoding. Still, the achieved encoding rate in FPS was very close to the imposed requirement. Bitrate savings in this particular scenario are attributed to the dense encoding space used to generate the forward prediction models, which allows considering different encoding setups during the adaptation decision.
The next optimization scenario targeted two interlinked objectives while leveraging VMAF to monitor quality. First, to investigate whether substantial bitrate savings were possible while maintaining a VMAF level that is virtually indistinguishable from the static example of Table 10. Second, to demonstrate the ability of our proposed approach to deliver high-quality video streams under extreme bandwidth fluctuations that can cause a dramatic bandwidth decrease and/or scenarios where the available bandwidth is shared amongst users with equivalent quality of service plans. In such events, adaptive mechanisms such as the one proposed in this study need to be in place to secure the continuity and quality of the video streaming service.
In that context, for the example in Table 12, we reduced the VMAF constraint by three points, so that 92.58 − 3 = 89.58. More generally, the 3-point VMAF reduction is also supported by the work on establishing just noticeable differences (JND) [37], [38]. The JND study described in [39] arrived at the conclusion that a 6-point VMAF difference is the threshold after which the perceived difference in quality becomes noticeable between two compressed videos.
The results depicted in Table 12 achieved both goals. In particular, a 32.8% reduction in the original bitrate was achieved at only a slight reduction of the VMAF constraint (or the original PSNR). As expected, the final videos were indistinguishable from the original. These substantial savings were achieved at a frame rate of 58.63 frames per second, well above the requirement of 50 frames per second.

b: MINIMUM BITRATE DEMANDS USING SVT-AV1
To investigate the universality of the proposed methods over different video codecs, we also considered SVT-AV1 for the minimum bitrate mode on a different video. Table 13 presents the average quality, bitrate, and encoding rate values for the Pedestrian video segments that serve as the baseline, static encoding constraints. We then followed the same approach as before, with the results summarized in Tables 15 and 16. For PSNR, the minimum bitrate optimization mode reduced bitrate requirements by 7% compared to the static approach of Table 14. The gains primarily come from increasing the QP and involving a different encoding structure in the encoding setup of Seg. 2. On the other hand, the desired quality constraints are matched neither by the PSNR nor by the VMAF values beyond the 1st segment. As already described, the latter is attributed to the tolerable percentage error that is introduced in order to avoid the ping-pong effect of switching between prediction models and thus compromising real-time performance by adding complexity. Nonetheless, the documented drop is within acceptable limits. Moreover, the encoding rate in terms of FPS is significantly higher than the minimum values required to achieve real-time performance.
In terms of the 3-point VMAF scenario leveraging the JND approach, results appear in Table 16. The minimum bitrate optimization mode reduced bitrate requirements by approximately 50% while matching real-time encoding requirements. The VMAF score was reduced by 3.27; however, it remained well within the JND requirements and did not compromise perceptual quality at the displayed resolution.

IV. DISCUSSION AND CONCLUDING REMARKS
An adaptive video encoding methodology that is applicable to different video codecs for optimizing the utilization of available resources is proposed. The approach uses multi-objective optimization to meet the dominant, time-varying constraints in a video streaming session, namely video quality, bandwidth availability, and encoding rate.
Results demonstrate that the proposed methods can effectively mitigate time-varying bandwidth fluctuations that are likely to result in buffering incidents and thus a degraded end user's QoE. The applicability across the range of examined video codecs holds great promise, especially in view of the fast-changing landscape of the video compression industry highlighted in this study. In that context, following VVC/H.266 standardization, MPEG has subsequently released two new video codecs, termed Essential Video Coding (EVC) [40] and Low-complexity Enhancement Video Coding (LCEVC) [40]. The goal is to alleviate patent/licensing schemes on the one hand and complexity concerns on the other. As a result, approaches such as the ones described in this study are now, more than ever, timely and central to the efficient utilization of individual codecs to benefit the user experience.
Our ongoing research is focused on interfacing with a wireless network simulator to validate the proposed methods under realistic unicast and multicast video transmission scenarios for both live and on-demand sessions.
Furthermore, we are also working on extending the multi-objective optimization framework to include decoding time and decoding power consumption, and on adopting blind VQA metrics at the receiver to drive the adaptive video encoding process.