Fast Screen Content Coding in HEVC using Machine Learning

Screen Content (SC) videos require dedicated tools to handle their special characteristics, since they include repeated regions, sharp edges, and a limited number of colors. Therefore, the SC extension to the High Efficiency Video Coding (HEVC) standard has been released for this purpose. The SC extension has new tools such as Intra Block Copy (IBC) and Palette (PLT) mode. These tools improve the coding efficiency, but they come with a high computational complexity. In this paper, we propose a scheme to reduce the encoding time of the SC encoder. It has two algorithms based on the Decision Rule (DR) machine learning technique. The first algorithm, called Mode Skipping (MS), skips unnecessary SC mode checking. The second algorithm, called Early Pruning Termination (EPT), stops the partitioning process early. Only a small number of features has to be calculated for the trained models. The proposed scheme was implemented and simulated using the standard software test model SCM-6. The experimental results show that the encoding time is reduced by 37.84% on average while the Bjøntegaard delta bit-rate (BD-R) increases by only 1.34%.


I. INTRODUCTION
Due to the popularity of screen sharing and computer graphics applications, Screen Content Coding (SCC) has become important. Therefore, the SCC extension to High Efficiency Video Coding (HEVC) [1]-[3] has been released by the Joint Collaborative Team on Video Coding (JCT-VC) to enhance the coding of SC sequences over conventional HEVC [4], which is suited to Natural Content (NC) sequences.
In SCC, new tools have been incorporated on top of the conventional Intra mode [5], such as Intra Block Copy (IBC) [6]-[8] and Palette (PLT) mode [9]-[11]. These tools exploit the special characteristics of SC videos to improve the coding efficiency, since SC sequences contain repeated patterns, complex structure, and a limited number of colors. In IBC, a search is carried out to find the block that matches the current one, in a similar manner to Inter motion estimation [12] in conventional HEVC, but the search range is restricted to the current frame. Although these additional tools improve SC coding, the enhancement comes at the expense of a higher complexity.
To reduce the encoding complexity introduced by the additional SC tools, numerous approaches have been proposed. These efforts can be categorized into heuristic and machine learning-based works. Among the heuristic approaches, the authors in [13] proposed to speed up the IBC mode according to the Rate-Distortion (RD) cost of Intra mode and the Coding Unit (CU) activity value. The RD cost was used in [14] to terminate the mode testing process and the CU partitioning process early. In [15], a hash-based scheme over the full frame was suggested, in which the current CU finds the matched block among the CUs that have the same hash key. The authors in [16] speed up the local search of IBC by using suggested hash keys for the different Prediction Units (PUs). In [17], zero-activity and low-gradient CUs are not checked in IBC mode. In [18], the temporal correlation between the current CU and its corresponding CU in the previous frame is utilized to reduce the encoding time. Furthermore, an adaptive step size method was suggested to increase the searching speed of the IBC mode. In [19], the entropy and coding bits were used to achieve a fast CU decision. In [20], the CUs were classified into NC or SC in order to apply fast mode decision methods. Furthermore, the bit-per-pixel value is utilized to stop the splitting process early. A fast SC encoding scheme was proposed in [21] by using the pixel exactness value, the RD cost of the PLT mode, and a suggested hash-based search method.
Among the machine learning-based approaches, we proposed in [22] to use the Decision Tree (DT) technique to achieve fast mode decision. In addition, the luminance contrast inside the CU was used to perform fast CU size decision. DT was also used in [23] to speed up SC encoding by inserting a decision block before the run of each mode. It introduced dynamic features and the intermediate RD costs for the training. In [24], the Random Forest (RF) technique was utilized to reduce the encoder complexity. A new hyperparameter tuning approach was introduced that considers the encoding time and bitrate. The approaches in [25] and [26] used the Bayesian rule classification method. In [25], the authors applied a corner point detection method in order to categorize the frame into textual or pictorial regions. After that, the color number was employed for mode classification. In [26], an online-learning approach was proposed for the Bayesian rule to accelerate the mode decision and CU size decision. The modes of neighboring CUs were exploited to avoid unnecessary mode checking. The authors in [27] proposed a scheme based on neural networks. The CUs were categorized as NC or SC type in order to make a fast mode decision. To achieve fast CU size determination, the correlation between the current CU and adjacent CUs was utilized. In [28], the complexity of intra prediction was reduced by using a Convolutional Neural Network (CNN). The global features were analyzed to decide whether the mode would be checked or not. Then, local features guide the encoder to decide which modes should be tested.
In this paper, we present an approach for fast SC encoding. The reduction in encoding time is achieved by SC mode skipping and early pruning termination for the CUs. We propose to use a small number of features for the training in order to alleviate the burden on the SC encoder. We select the Decision Rule (DR) machine learning technique because it can be easily implemented in SC encoders, unlike complex techniques such as the one in [28]. The proposed encoder does not have to extract features from neighboring CUs as in [24] or from the previous frame as in [18], which saves memory. In order to preserve the coding performance, we apply the mode skipping to the IBC and PLT modes without skipping the conventional Intra mode. The rest of the paper is organized as follows. The Intra mode decision of SC encoders is described in Section II. Section III explains the proposed decision rule-based scheme. Section IV discusses the simulation results. Section V concludes the paper.

II. INTRA MODE DECISION OF SC ENCODER
The SC encoder utilizes the same partitioning method as the conventional HEVC encoder. The input frame is partitioned into square blocks. The main partitioning unit is called a Coding Tree Unit (CTU), which is 64x64 pixels in size. The CTU is recursively divided into four equal smaller parts called CUs. Each CU can be 2Nx2N, where N can be 32, 16, 8, or 4. To achieve better quality, the CU can be divided into smaller parts called Prediction Units (PUs), which can be symmetric or asymmetric [4]. In order to determine the best structure of the current CTU, the encoder compares the RD cost of the current CU with the sum of the costs of its sub-CUs. It then terminates the partitioning process if the RD cost of the current CU is the smaller one. The optimum RD cost for each CU is determined after testing all prediction modes.
To find the optimum RD cost in intra coding, the encoder starts encoding a CU with a mode called fast IBC, which is executed for the 2Nx2N CUs that are smaller than 64x64. Block matching is conducted in fast IBC through a number of Block Vectors (BVs). These BVs come from adjacent CUs and the last checked CUs. After that, the conventional Intra mode is examined for all CU sizes. This mode is skipped if the zero-distortion condition is satisfied for CUs that are tested by fast IBC. After finishing the test of Intra mode, the encoder runs the IBC Skip/Merge mode, which is similar to the Skip/Merge mode in conventional HEVC. If Skip mode is the best one so far, the run of the other SC tools is bypassed. Normal IBC is conducted for 16x16 and 8x8 CUs. For 16x16 CUs, the encoder searches for the best matched block of the current 2Nx2N PU over the whole frame in 1-D. For 8x8 CUs, the best matched block is located by searching in a limited search region or by using a hash-based search. The limited search region consists of the current and left CTUs, where the search is 2-D for 2Nx2N and 2NxN PUs and 1-D for Nx2N PUs. In the case of the hash-based search, each 2Nx2N PU has a 16-bit hash key, which is computed by the SC encoder. The current PU searches for the matched block among the PUs that have the same key by examining the hash table, which contains the hash keys and the BVs for each hash key. The hash key is formed from the DC values of the four 4x4 blocks of each 8x8 PU and the gradient of the whole PU. PLT mode is activated for the CUs that are smaller than 64x64.
After determining the optimum mode and RD cost, the encoder checks the CU size and goes to the next depth if the CU size is larger than 8x8. The encoder repeats this process for higher depths. For 8x8 CUs, the encoder stops the splitting process, since 8x8 is the smallest size.
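The recursive comparison between the RD cost of a CU and the summed cost of its four sub-CUs can be sketched as follows. This is only an illustrative sketch, not SCM-6 code: `best_partition` and the `mode_cost` callback (which stands in for "the best RD cost after testing all prediction modes") are hypothetical names introduced here.

```python
MIN_CU_SIZE = 8  # smallest CU size named in the text


def best_partition(x, y, size, mode_cost):
    """Return (rd_cost, tree) for the CU whose top-left corner is (x, y).

    mode_cost(x, y, size) is a hypothetical callback returning the best RD
    cost of the unsplit CU. The tree is either a leaf tuple (x, y, size)
    or a list of four sub-trees.
    """
    cost_here = mode_cost(x, y, size)
    if size == MIN_CU_SIZE:
        return cost_here, (x, y, size)  # 8x8 CUs are never split further

    half = size // 2
    children = [
        best_partition(x + dx, y + dy, half, mode_cost)
        for dy in (0, half)
        for dx in (0, half)
    ]
    cost_split = sum(cost for cost, _ in children)

    # Keep the CU unsplit when its own cost is not larger than the children's sum.
    if cost_here <= cost_split:
        return cost_here, (x, y, size)
    return cost_split, [tree for _, tree in children]
```

Note that this exhaustive form evaluates every depth before deciding, which is exactly the cost that an early termination rule tries to avoid.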
According to the analysis we reported in [22] using the software test model HM16.7+SCM-6, hereafter SCM-6 for simplicity, the introduction of SC tools on top of the conventional Intra mode increases the encoding time: when the SC tools are disabled, the encoding time decreases by 58.27% on average, but the coding performance degrades clearly. Therefore, it is important to skip the testing of unnecessary modes in order to save time while preserving the quality. The next section discusses the proposed scheme.

III. PROPOSED DECISION RULE BASED FAST INTRA CODING
In this research, we aim to accelerate the SC encoding process by combining a fast mode decision algorithm with an early partitioning termination algorithm in order to avoid unnecessary mode testing. The skipping of SC tools is achieved by placing decision blocks before the run of the normal IBC and PLT modes to decide whether these modes are bypassed or not. To decide whether the current CU is partitioned or not, we insert a decision block after the mode testing, which decides whether the encoder goes to the higher depth or not for CUs of size 16x16 or larger. The algorithms that represent the decision blocks for fast mode decision and fast CU decision are called Mode Skipping (MS) and Early Pruning Termination (EPT), respectively. The feature selection, the training methodology, and the mode decision of the proposed scheme are discussed in the following subsections.

A. FEATURES SELECTION
For an efficient fast mode decision in the MS algorithm, the candidate features should be suitable to differentiate between the different modes in the Intra profile. Since our scheme requires the deactivation of SC tools, the Classification Parameter CP, as introduced in [29], was chosen as a training feature because it has a high selectivity, as reported in our previous work [30]. CP is computed as defined in [29] from two quantities: the Color Number CN, which is the number of distinct colors inside the CU, and L_r, which is the range between the maximum L_max and minimum L_min luma values and is determined as $L_r = L_{max} - L_{min}$. Figure 1 illustrates the percentages of NC and SC CUs versus CP. The data were extracted from the first 10 frames of the sequences that are recommended in [31]. The details of the recommended sequences are shown in Table 1, including the resolution, the number of encoded frames, and the number of frames per second (fps). From Figure 1, we can conclude that at lower CP values most of the CUs are of NC type, while at higher CP values almost all CUs are encoded using SC tools, especially for CP greater than or equal to 220. PLT mode is more sensitive to the number of colors inside the CU, since a CU that has a high number of colors requires more indices, which increases the coding cost [32]. Consequently, CN is considered as a training feature. Furthermore, the best RD costs before the run of the IBC and PLT modes are taken as features, because a small RD cost indicates that the previous modes are sufficient for the mode decision process.
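As a minimal sketch of the two quantities that feed CP, the snippet below computes CN and L_r for one CU. The function names and the pixel layout (a 2-D list of `(Y, U, V)` tuples) are assumptions made for illustration, and L_r is taken here as the plain difference between the largest and smallest luma value. CP itself is not computed, since its formula is given in [29] rather than in this text.

```python
def color_number(cu_pixels):
    """CN: number of distinct colors inside the CU.

    cu_pixels is assumed to be a 2-D list of (Y, U, V) tuples.
    """
    return len({pixel for row in cu_pixels for pixel in row})


def luma_range(cu_pixels):
    """L_r: difference between the maximum and minimum luma (Y) value."""
    luma = [pixel[0] for row in cu_pixels for pixel in row]
    return max(luma) - min(luma)


# Toy 2x2 "CU" with two distinct colors (values are made up).
toy_cu = [
    [(10, 128, 128), (10, 128, 128)],
    [(200, 90, 90), (10, 128, 128)],
]
cn = color_number(toy_cu)
l_r = luma_range(toy_cu)
```

Both features need only a single pass over the CU samples, which is in line with the goal of keeping the feature extraction cheap.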
In the case of EPT, we should select features that discriminate between the CUs that should be split and the CUs that can be left unpartitioned without affecting the performance, in order to save time. CU smoothness can characterize this behavior, since smooth blocks tend to remain unpartitioned without affecting performance. The CU variance Var can be used to represent the CU structure complexity, and it is estimated as $Var = \frac{1}{M \times N} \sum_{y=1}^{M} \sum_{x=1}^{N} (L_{x,y} - L_{mean})^2$, where M and N are the height and the width of the CU, L_mean is the mean luminance value of the CU, and L_{x,y} is the luminance value at location (x,y). To study the relation between the CU variance and the normal splitting process, the first 10 frames of eight selected sequences were encoded, and the data were then extracted for analysis. The chosen sequences are "sc_SlideShow", "sc_programming", "sc_map", "ChineseEditing", "sc_desktop", "MissionControlClip3", "Basketball_Screen", and "Kimono1". The extracted data were partitioned into two groups. In the first group, called "Not split", the RD cost of each CU is smaller than or equal to the sum of the RD costs of its sub-CUs. The second group, denoted "Split", is the opposite case. Figure 2 shows the histograms of the number of coded CUs versus the Var value for the 16x16, 32x32, and 64x64 CU sizes in each group: the "Not split" group is shown in Figures 2a to 2c and the "Split" group is shown in Figures 2d to 2f. Figure 2 shows that the majority of the CUs in the "Not split" group span a small range of Var values, in contrast to the "Split" group. Thus, Var is a good candidate feature for the EPT algorithm. Figure 3 depicts similar histograms for the best RD cost rather than the variance value. Figure 3 shows that the CUs that belong to the "Split" group extend further along the best RD cost range than the "Not split" CUs. Therefore, the best RD cost was chosen as a candidate feature to train the decision models.
In conclusion, in the case of the MS algorithm, CP and the best RD cost before IBC were selected for the decision block that is located before IBC. CN and the best RD cost before PLT mode were chosen for the decision block that is inserted before the PLT mode. For the EPT algorithm, Var and the best RD cost were chosen.
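The CU variance can be sketched in a few lines. This is an illustrative helper, not the encoder's implementation: the name `cu_variance` and the input layout (a 2-D list of luma samples with M rows and N columns) are assumptions, and the population variance (division by M×N) is used.

```python
def cu_variance(luma):
    """Population variance of the luma samples of one CU.

    luma is assumed to be a 2-D list with M rows (height) and N columns (width).
    """
    m = len(luma)
    n = len(luma[0])
    mean = sum(sum(row) for row in luma) / (m * n)
    return sum((value - mean) ** 2 for row in luma for value in row) / (m * n)


# A flat block has zero variance; a block with a sharp edge does not.
flat_block = [[128] * 4 for _ in range(4)]
edge_block = [[0, 0, 255, 255] for _ in range(4)]
var_flat = cu_variance(flat_block)
var_edge = cu_variance(edge_block)
```

A flat region yields a variance of zero, which matches the observation that smooth CUs tend to fall in the "Not split" group.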

B. TRAINING METHODOLOGY
To train the predictive models of the decision blocks, the fast and popular machine learning classifier JRip [33] was used, which is a type of Decision Rule (DR) classifier. It finds a set of rules that fully covers the members of the classes. The generated rules are expressed as decision models. Compared with other machine learning techniques such as Neural Networks (NN) and Support Vector Machines (SVM), which require intensive mathematical computations, the decision models of DR are simple and can easily be converted to logic statements (IF-AND-OR). To extract data samples for the training, eight of the recommended sequences were chosen. Table 2 lists information about the sequences that are used for training, including the number of training frames. The Waikato Environment for Knowledge Analysis (WEKA) [34], version 3.9.4, was used to build the decision rules. WEKA is open-source software that includes several machine learning algorithms. At the beginning of the training phase, the attributes and the classes are prepared in an Attribute-Relation File Format (ARFF) file, which is the input file to the WEKA tool. An ARFF file contains the header, the attribute declarations, and the raw data, which represent the attributes and classes. Figure 4 shows a part of an ARFF file. The training was done using 10-fold cross-validation. Finally, the decision rules were generated to be incorporated into the SCM-6 reference software. Table 3 shows the prediction accuracy of the training for each algorithm at a Quantization Parameter (QP) of 22. In the case of the MS algorithm, the models that are used to skip the SC tools, including the IBC mode and the PLT mode, are called Screen Content Skipping (SCS), while the models that are located before the PLT mode are called PLT Skipping (PLTS). From Table 3, we can note that the prediction accuracy is high, especially for the SCS models. SCS models are not trained for 64x64, since normal IBC and PLT are not carried out for this size. Similarly, PLTS models are not trained for the 64x64 and 32x32 sizes. Because 8x8 is the smallest CU size, the EPT algorithm is not conducted for 8x8 CUs.
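To illustrate why decision rules map easily onto IF-AND-OR logic, the sketch below evaluates an ordered rule list of the kind a RIPPER-style learner such as JRip produces: each rule is a conjunction of threshold tests, the first matching rule wins, and a default class applies otherwise. The thresholds, feature names, and class labels here are purely hypothetical and are not the trained rules of this work.

```python
def make_rule_list(rules, default_label):
    """Build a classifier from an ordered list of (conditions, label) rules.

    Each condition is a function of the feature dictionary; all conditions of
    a rule must hold (AND), and rules are tried in order (OR between rules).
    """
    def classify(features):
        for conditions, label in rules:
            if all(condition(features) for condition in conditions):
                return label
        return default_label

    return classify


# Hypothetical rule set for a decision block placed before IBC.
# The numbers below are placeholders, not values from the trained models.
scs_rules = [
    ([lambda f: f["cp"] <= 40, lambda f: f["rd_cost"] <= 1500.0], "skip"),
    ([lambda f: f["cp"] == 0], "skip"),
]
classify_scs = make_rule_list(scs_rules, default_label="check")

decision_smooth = classify_scs({"cp": 10, "rd_cost": 900.0})
decision_text = classify_scs({"cp": 230, "rd_cost": 900.0})
```

Once trained, such a rule list reduces to a handful of comparisons per CU, which is what makes it inexpensive to embed in an encoder.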

C. MODE DECISION OF THE PROPOSED SCHEME
The mode decision flow of the SC encoder with the proposed modifications is shown in Figure 5. At the beginning, three flags are initialized to zero: skip_SC, skip_PLT, and skip_split. The first two flags are used in the MS decision blocks, while the last one is used in the EPT decision block. Then, the Var attribute is calculated for the CUs larger than 8x8 pixels. The CP attribute is estimated for the CUs smaller than 64x64 to be used in the mode skipping, as mentioned before. Before testing the IBC mode, a DR-based decision block decides whether to skip the SC modes or not. If it decides to skip the IBC and PLT modes, skip_SC is set to one. This model gives decisions for the 16x16 and 8x8 CU sizes. Similarly, a decision block placed before the execution of PLT mode decides whether to activate or bypass PLT mode, and skip_PLT is set to one if the block decides to skip PLT mode. It is used at CU sizes of 8x8, 16x16, and 32x32. After the modes have been executed and the best mode so far has been determined, a decision block decides whether to stop the pruning process or not; skip_split is set to one if early splitting termination is chosen. If skip_split equals one or if the CU size is 8x8, the encoder terminates the CU partitioning. Otherwise, it goes to the higher depth level and repeats the mode testing.
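The flow described above can be summarized in a simplified sketch. It is not the SCM-6 implementation: `plan_cu` and the three rule callbacks are hypothetical names, the mode labels are placeholders, and the zero-distortion and Skip-mode bypass conditions of the original encoder are left out for brevity.

```python
def plan_cu(size, features, skip_sc_rule, skip_plt_rule, stop_split_rule):
    """Return (modes_to_test, go_deeper) for one CU of the given size.

    The three *_rule arguments are hypothetical callables that take the
    feature dictionary and return True when the trained rules fire.
    """
    skip_sc = skip_plt = skip_split = 0  # all flags start at zero
    modes = []

    if size < 64:
        modes.append("fast_ibc")
    modes.append("intra")            # conventional Intra is never skipped
    modes.append("ibc_skip_merge")

    # Decision block before normal IBC (16x16 and 8x8 CUs only).
    if size in (8, 16):
        skip_sc = int(skip_sc_rule(features))
        if not skip_sc:
            modes.append("ibc")

    # Decision block before PLT (CUs smaller than 64x64), unless both
    # SC modes were already skipped.
    if size < 64 and not skip_sc:
        skip_plt = int(skip_plt_rule(features))
        if not skip_plt:
            modes.append("plt")

    # EPT decision block after mode testing (CUs larger than 8x8).
    if size > 8:
        skip_split = int(stop_split_rule(features))

    go_deeper = size > 8 and not skip_split
    return modes, go_deeper
```

For example, calling `plan_cu(16, {}, lambda f: True, lambda f: False, lambda f: True)` yields only the fast IBC, Intra, and IBC Skip/Merge modes and stops the partitioning at this depth.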

IV. RESULTS AND DISCUSSION
In order to evaluate the proposed scheme, we implemented it in the test model software SCM-6. The implemented version was simulated under the All Intra (AI) configuration with QPs of 22, 27, 32, and 37. Each proposed algorithm was evaluated separately, in addition to evaluating their integration, and the proposed scheme was then compared with existing machine learning-based approaches. The evaluation metrics are the Bjøntegaard Delta bit-rate (BD-R) [35] and the Time Saving TS, both with respect to the anchor SCM-6. TS is defined as $TS = \frac{T_{anc} - T_{mod}}{T_{anc}} \times 100\%$, where T_anc is the encoding time of the anchor version of SCM-6 and T_mod is the encoding time of SCM-6 with the modifications. A larger positive value of TS means a higher time saving. Table 4 lists the evaluation results of the proposed algorithms EPT and MS as well as the combination EPT+MS. In addition, it includes comparisons with other machine learning-based approaches. From this table, EPT achieves a 16.41% encoding time reduction with a BD-R increment of 0.52% on average. The sequences that have the highest encoding time reduction with the EPT algorithm are "sc_SlideShow" and "MissionControlClip2", where TS values are recorded as 43.6% and [...].
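The time-saving metric can be computed with a one-line helper. The function name is hypothetical, and the timings in the example are made-up numbers chosen only to show the arithmetic, not measurements from the experiments.

```python
def time_saving(t_anchor, t_modified):
    """TS in percent: positive when the modified encoder is faster."""
    return (t_anchor - t_modified) / t_anchor * 100.0


# Made-up timings in seconds, for illustration only.
ts_example = time_saving(200.0, 100.0)   # half the anchor time
ts_slower = time_saving(100.0, 125.0)    # negative: slower than the anchor
```

A negative TS would indicate that the modified encoder is slower than the anchor.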
In the case of MS, the algorithm reduces the encoding time by 29.27% on average while the BD-R increases by 0.89%. The "sc_robot" and "Kimono1" sequences show the highest encoding time reductions of 54.96% and 49.71%, respectively, with BD-R increments of 1.01% and 0%, respectively. Most of the CUs in "sc_robot" and "Kimono1" have a small CP value and can be efficiently encoded by Intra mode without needing to be checked by the SC modes. In addition, "sc_SlideShow" records a high TS value of 43.85% while the BD-R increases by 0.46%, because it is rich in flat regions, which have a CP value of zero. These flat regions can be handled well by Intra mode. By combining the MS and EPT algorithms, the scheme decreases the encoding time by 37.84% on average and by up to 63.63%, while the average BD-R increment is only 1.34%. The proposed scheme is compared with other existing ML-based approaches. The schemes in [24] and [23], which are based on the RF and DT techniques, respectively, are used for the comparisons. To achieve fair comparisons, the approaches in [24] and [23] were re-implemented in SCM-6, as was our scheme, and simulated under the same conditions. In [24] and [23], trained decision blocks are inserted before the execution of the different coding modes. The TS and BD-R results are given in Table 4. From Table 4, it can be seen that our scheme has the lowest BD-R increment among the compared schemes, while its TS value is very close to that of [23]. The approach in [24] is the slowest of the three, with a TS value of 29.88% and a 1.37% BD-R increment. The "sc_robot" sequence achieves the best TS value in all approaches, since most of its CUs can be characterized as NC type, which can be encoded by Intra mode with a small RD cost, as mentioned before, and can therefore be handled by the trained blocks located before the SC modes.
In terms of the number of estimated features, EPT+MS has to estimate only 3 features, which is the smallest number, while 9 features have to be estimated in [24] and [23]. These counts exclude the RD costs and flags, because these features are available in SCM-6 by default. Furthermore, [24] has to access neighboring CUs to get coding information. This advantage makes our scheme more applicable to real video coding systems.

V. CONCLUSION
In this paper, the DR machine learning technique has been used to increase the speed of SC encoders. Two algorithms have been proposed. The first algorithm, MS, skips the IBC mode or both the IBC and PLT modes according to the decisions taken by trained models located before these modes. The second algorithm, EPT, makes a fast CU size determination. The encoder has to estimate only 3 features for the trained models. The simulation results show that the presented scheme gives a 37.84% encoding time reduction while the BD-R increases by 1.34%, which outperforms other machine learning-based approaches.