Non-Intrusive Parametric Audio Quality Estimation Models for Broadcasting Systems and Web-Casting Applications Based on Random Forest

Objective quality assessment models have increasingly been used in recent years to assess or monitor speech and audio quality in multimedia and audio processing systems. These methods offer a clear and repeatable way to evaluate the customer experience by estimating perceived quality on an easily understood subjective scale, such as a quality rating scale ranging from excellent quality to low quality. Service providers, in turn, aim to offer reliable services that give the end user the best possible quality under the current network conditions in order to avoid customer churn. This paper presents the design and performance evaluation of parametric models estimating the audio quality experienced by the end user of broadcasting systems and web-casting applications. The Random Forest (RF) algorithm is used to design non-intrusive parametric models that establish the relationship between the feature description and the perceived quality scores. For this purpose, broadcast and web-cast sub-databases were created, where the web-cast sub-database includes 17,280 degraded samples and the broadcast sub-database contains 1,080 degraded samples obtained from the Slovak Radio. The results reported for the proposed parametric audio quality models validate Random Forest as a powerful technique that provides good performance in terms of the Pearson Correlation Coefficient (PCC) and the Root Mean Squared Error (RMSE).


Introduction
The substantial development of broadcasting and networking technologies in recent decades is reflected in the range of digital audio transmission systems, such as Digital Audio Broadcasting (DAB) [1] and [2], Digital Radio Mondiale (DRM) [3] and [4], etc., and of web-casting applications using TCP as a transport protocol [5]. On-demand streaming media such as Spotify, Deezer, iTunes, Amazon Music or YouTube Music are among the most common audio streaming services [6]. Streaming services, as well as broadcasting systems, use audio codecs to minimize the bandwidth needed for the corresponding data transmission. Due to the robustness and reliability of the channel coding of today's digital audio transmission systems, the audio codec remains the key source of quality degradation in this case. The same is true for web-casting applications, as they predominantly deploy TCP on the transport layer [7]. Given that most of today's web-casting applications are based on HTTP streaming [8], the additional quality-influencing factors in this case are initial delay and stalling [5]. The creation of effective quality monitoring tools that operate in real time and can measure the audio quality experienced by the end user is therefore important for the success of any audio service or application.
In order to get a clear picture of the deployment of codecs in digital audio broadcasting systems and web-casting applications, we compiled a list of the currently most widely used codecs and their bit rates [9], which is presented in Tab. 1. As can be seen from this table, the MP2 and HE-AACv2 codecs are mostly used in broadcasting systems. On the other hand, the most widely used codecs for web-casting applications include MP3 and Ogg Vorbis.
The evaluation of audio quality can be performed from either a subjective or an objective perspective [10]. A subjective listening test is a common way of determining the quality of the audio. Subjective methods that are based on empirical listening tests defined in international guidelines, e.g. [11] and [12], are more effective but also more arduous and time-consuming. On the other hand, methods for objective assessment of the audio quality are more convenient. To be more precise, objective perceptual audio models are used to reliably and rapidly predict MOS (Mean Opinion Score) values, i.e., scores representing the quality perceived by the end user.
The basic concept of intrusive quality metrics is to calculate the perceptually weighted difference between the reference and degraded signals. Intrusive methods are considered more accurate as they provide a higher correlation with subjective evaluations. Intrusive models for audio quality, e.g., PEAQ [13], PEMO-Q [14], ViSQOLAudio [15] or POLQA Music [7], evaluate the distortion of the degraded signal by comparing it with the original/reference signal. Intrusive measures therefore require the presence of the original signal, which is typically not available in continuous quality monitoring.
Moreover, although intrusive methods rely on a time synchronization of the two signals that is very hard to realise, they remain more reliable than their non-intrusive counterparts for objective audio quality evaluation [16]. Non-intrusive approaches process only the signal available on the receiving side, without access to the original audio signal on the broadcast/web-cast side, which makes it difficult to detect distortions caused by communication networks, e.g., packet loss. Non-intrusive approaches may be categorized as either signal-based or parametric algorithms [16].
Nevertheless, to the best of our knowledge, there is currently no non-intrusive parametric model for audio quality estimation that focuses on broadcasting systems and web-casting applications, despite the fact that such models are already available for speech transmission, e.g., the E-model [17], for broadcast audio contribution over IP [18] and [19], and for audio-visual media streaming [20]. Hence, in this paper, we propose non-intrusive parametric audio quality estimation models based on machine learning for broadcasting systems and web-casting applications. The proposed models formulate the estimation of audio quality as a regression problem and use the RF approach to find a mapping between the audio features and the quality score.
The paper is organized as follows. In Sec. 2, we explain the experimental methodology and the model input parameters for broadcasting systems and web-casting applications. Section 3 specifies how the database used to train and validate the proposed models was built. We then present our proposed non-intrusive parametric quality estimation models, together with the details of the training and testing phases, in Sec. 4. Section 5 describes the performance evaluation results of the proposed models. Finally, in Sec. 6, we conclude the paper and discuss possible future research.

Methodology
In this work, we used Random Forest (RF) as the machine learning technique. Random Forest belongs to the family of supervised learning methods [21]. It builds an ensemble of decision trees [22] that is usually trained with the bagging method. The fundamental idea of bagging is that a combination of learning models improves the overall result; in particular, averaging over many trees reduces the variance of the individual trees' outputs. The forest's prediction is the average of the predictions of the individual trees [23]. The model is fitted on a broad training sample and is then applied to new, unseen data. The main aim of the training process is to capture the numerical or logical relationship between the input parameters and the output, i.e., the quality score.

Models Input Parameters
We concentrated here on input parameters that could reduce the audio quality of broadcasting systems. As mentioned above, the principal degradation parameter in this case is the audio codec. Based on [7], [9] and [24], we also took into account the type of signal and the bit rate. The corresponding parameters have been used along with the MOS values for training by the Random Forest approach, as illustrated in Fig. 1. The same degradation parameters are used for web-casting applications, see Fig. 2 for more detail, but certain parameters of the application layer also have to be taken into consideration [25], [26], [27], [28] and [29]. The first of them is the initial delay in audio reproduction. The initial delay occurs because a certain amount of data must be transferred to the receiving side before decoding and playback can start. The minimum value of the initial delay depends on the bit rate and encoder settings [5]. Another input degradation parameter that has a significant effect on the perceived audio quality in web-casting applications is stalling [5]. It occurs when the actual network throughput is lower than the audio bit rate required by the corresponding streaming service, so that the buffer is drained. The effect of the initial delay and stalling on the perceived quality generally depends only on their duration. To sum up, the following input parameters have been considered in the design phase of the non-intrusive parametric audio quality estimation models for broadcasting systems and web-casting applications:
• Broadcast - the type of audio codec, type of signal, bit rate.
• Web-cast - the type of signal, type of audio codec, bit rate, stalling, initial delay.
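The two parameter sets listed above can be turned into numeric feature vectors before they are passed to a regressor. The following is only a minimal sketch of one possible encoding: the names `Condition`, `one_hot` and `encode`, the one-hot representation, and the exact category lists are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical category lists. The codec names appear in the paper, but the
# complete set and its ordering are assumptions made for this sketch.
CODECS = ["MP2", "MP3", "AAC-LC", "HE-AACv2", "Ogg Vorbis", "Opus"]
SIGNAL_TYPES = ["music", "speech"]


@dataclass
class Condition:
    signal_type: str
    codec: str
    bit_rate_kbps: float
    initial_delay_s: float = 0.0  # used by the web-cast model only
    stalling_s: float = 0.0       # used by the web-cast model only


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode(cond, webcast=False):
    """Map a condition to a numeric feature vector for a regressor."""
    features = one_hot(cond.signal_type, SIGNAL_TYPES)
    features += one_hot(cond.codec, CODECS)
    features.append(float(cond.bit_rate_kbps))
    if webcast:
        features += [float(cond.initial_delay_s), float(cond.stalling_s)]
    return features


# Broadcast model: signal type, codec, bit rate (9 features in this sketch).
broadcast_vec = encode(Condition("speech", "MP2", 128))
# Web-cast model: additionally initial delay and stalling (11 features).
webcast_vec = encode(Condition("music", "Opus", 64, 2.0, 4.0), webcast=True)
```

Tree-based models do not require feature scaling, so the bit rate and the durations can be kept in their natural units in such an encoding.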

Database
Before creating the database, it was necessary to specify all the types of sounds on which to focus. We sought to cover as broad a range of audio signals as possible. The goal was to create the largest possible archive of recordings representing the variety of audio transmitted to the listener over broadcasting systems and web-casting applications. To do so, we obtained a 3-hour uncompressed studio recording from Slovak Radio to create a dataset for training and testing the designed estimation models. However, it did not contain all the pre-selected audio signals, such as classical music, a spoken word in a foreign language, a track featuring only one musical instrument, etc. For that reason, we added a part of the well-established EBU SQAM database, which includes lossless recordings for subjective audio quality testing. The version published in EBU Tech 3253 [30] was used. The resulting set includes 27 types of audio signals typically deployed in audio broadcasting systems and web-casting applications. It can be divided into two principal categories, music and speech, with 12 music and 15 speech signals. Each recording is roughly 10-15 seconds long and sampled at 48 kHz.
For further work with the database, it was necessary to evaluate the audio quality of the selected signals, which characterize the typical content of current audio broadcasting systems and web-casting applications, after degradation by the codecs currently deployed in the corresponding systems and applications. The default codec settings were used, except for the bit rate, which was varied according to the values presented in Tab. 1. We chose the POLQA Music [7] model to assess the quality perceived by the end user with respect to the degradations caused by the audio codecs, as this model, according to [7], was fairly reliable in this regard. It should be noted that the POLQA Music V2 model was used in this case, see [7] for more information. The total number of combinations resulting from this process was 1,080. These 1,080 samples, together with the corresponding MOS values predicted by the POLQA Music model, represent the broadcast sub-database and were later used to build the web-cast sub-database covering the additional degradations, i.e., initial delay and stalling. As these two types of degradations are different in nature from the codec degradations, i.e., frequency domain versus time domain, we can apply the additivity concept that comes from the E-model [17]. Since POLQA Music was not trained for degradations induced by initial delay and stalling, we used the equations published in [27], which focus on the effect of initial delay and stalling in audio transmission, to cover the impact of these two parameters on the quality experienced by the end user; in these equations, T_Init represents the duration of the initial delay and T_Stall represents the duration of the stalling. In order to reflect real-world conditions, we used the real-world measurements presented in [25] and [26] to select the values of initial delay and stalling used in the design process of the proposed models. The values, together with the corresponding quality degradations incurred by the initial delay and stalling, are listed in Tab. 2.
To create the web-cast sub-database, we extended the broadcast sub-database with the MOS values that cover the impact of initial delay and stalling on the quality perceived by the end user. It is necessary to note that the combined impact of the initial delay and stalling was also considered while creating the web-cast sub-database. So, in total, we applied 15 cases, i.e., three initial delay cases (see Tab. 2 for more detail), three stalling cases (see Tab. 2 for more detail) and their nine combinations, to each of the previous 1,080 MOS samples representing the broadcast sub-database. Together, we obtained 17,280 MOS values defining the web-cast sub-database. The database is available upon request to allow reproducibility of this research and to support new research and development activities in this context.
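The enumeration of the web-cast cases can be sketched as follows. This is only an illustration: the durations are placeholders rather than the values of Tab. 2, the function names are hypothetical, and the plain subtraction on the MOS scale in `combine_mos` is merely one possible reading of the additivity concept, not the equations of [27]. The sketch also shows that the reported total of 17,280 equals 16 × 1,080, which would be consistent with keeping the unimpaired baseline alongside the 15 cases; this is our inference from the numbers, not a statement from the text.

```python
from itertools import product

# Placeholder durations in seconds. The real values are given in Tab. 2 and
# are NOT reproduced here; these numbers are purely hypothetical.
INITIAL_DELAYS_S = [1.0, 2.0, 3.0]
STALLINGS_S = [1.0, 2.0, 3.0]


def build_cases(delays, stalls):
    """Enumerate (T_Init, T_Stall) pairs; 0.0 means the degradation is absent."""
    only_delay = [(d, 0.0) for d in delays]
    only_stall = [(0.0, s) for s in stalls]
    combined = list(product(delays, stalls))
    return only_delay + only_stall + combined


def combine_mos(codec_mos, delay_degradation, stall_degradation):
    """Subtract both degradations from the codec MOS (assumed additive on the
    MOS scale) and clamp the result to the 1..5 range."""
    mos = codec_mos - delay_degradation - stall_degradation
    return max(1.0, min(5.0, mos))


cases = build_cases(INITIAL_DELAYS_S, STALLINGS_S)
n_broadcast = 1080
n_degraded_only = len(cases) * n_broadcast           # 15 cases per sample
n_with_baseline = (len(cases) + 1) * n_broadcast     # plus the unimpaired case

print(len(cases), n_degraded_only, n_with_baseline)  # 15 16200 17280
```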

Proposed Technique
In this work, we used a machine learning method, namely Random Forest (RF) [21], to solve the regression problem of predicting audio quality for broadcasting systems and web-casting applications. Random Forest can be applied to a wide range of regression problems. Random regression forests are built from regression trees whose splits are chosen to minimize the variance within the nodes [21]. A decision tree is the building block of a random forest and is an intuitive model [21]. Rather than a toy problem, we use a real-world dataset split into a training and a testing set. In a random forest, a large number of trees is built, each from a random selection of a small number of variables and a random selection of the observations. The forest prediction is the average of the predictions of the individual trees. Each individual tree predicts a step function, but the average over a large number of trees can approximate almost any functional form and automatically accounts for interactions between regressors. The most widely used variable importance metric for regression forests is the permutation-based MSE (Mean Squared Error) reduction [23].
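The statements that each tree predicts a step function and that the forest averages the trees can be illustrated with a toy example in plain Python. This is a didactic sketch only, not the implementation used in this work: it uses one-split trees (stumps) on a single feature, so the random selection of variables is omitted, and the bit rate and MOS values are invented for illustration.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a step function) by choosing the
    threshold that minimises the summed squared error of both sides."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        l_mean, r_mean = mean(left), mean(right)
        sse = (sum((y - l_mean) ** 2 for y in left)
               + sum((y - r_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, l_mean, r_mean)
    if best is None:  # all x values identical: predict the mean everywhere
        constant = mean(ys)
        return lambda x: constant
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x < threshold else r_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Bagging: fit each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_forest(trees, x):
    """The forest prediction is the average of the individual trees."""
    return mean(tree(x) for tree in trees)


# Invented toy data: bit rate in kbit/s versus a MOS-like score.
bit_rates = [32, 48, 64, 96, 128, 192, 256, 320]
scores = [2.1, 2.8, 3.3, 3.9, 4.2, 4.5, 4.6, 4.7]

forest = fit_forest(bit_rates, scores)
low = predict_forest(forest, 32)
high = predict_forest(forest, 320)
```

A single stump takes at most two distinct values, whereas the averaged forest output changes in many smaller steps, which is what allows it to follow the saturating shape of the toy data more closely.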

Training and Prediction Details
The sub-databases were divided into two parts, i.e., the training and testing data, at a ratio of 80:20. This ratio is commonly used in machine learning. We pseudo-randomly selected the test cases/samples, attempting to cover all the types of signals, all the codecs and their bit rates in the same proportion. The testing samples were not included in the training data. The 240 MOS values for broadcast and 3,456 MOS values for web-cast were used in the testing phase to verify the performance of the designed parametric estimation models. The MOS values provided by the designed models were compared to the ground-truth MOS values. The performance of the constructed parametric estimation models was quantified in terms of the Pearson Correlation Coefficient (PCC) and the Root Mean Square Error (RMSE), both widely used in this context.
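For reference, the two performance measures can be computed as in the following sketch, which uses the standard definitions of PCC and RMSE. The function names and the example values are ours and serve only as an illustration.

```python
import math


def pcc(y_true, y_pred):
    """Pearson Correlation Coefficient between true and predicted scores."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0.0 or var_p == 0.0:
        raise ValueError("PCC is undefined for constant inputs")
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root Mean Square Error between true and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Invented example: predictions with a constant offset of 0.5 MOS.
true_mos = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_mos = [1.5, 2.5, 3.5, 4.5, 5.5]
example_pcc = pcc(true_mos, pred_mos)
example_rmse = rmse(true_mos, pred_mos)
```

The example shows why both measures are reported together: a constant offset leaves the PCC at its maximum while the RMSE still reflects the error, so neither measure alone characterizes the accuracy of a model.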

Results
To solve the regression problem, we used the existing random forest algorithm included in the Scikit-Learn Python library, namely "sklearn.ensemble.RandomForestRegressor", with the following parameters: the number of trees was set to 100, the random state was set to 50, the maximum depth of the tree was left at its default, i.e., "None", the minimum number of samples required to split a node was set to 5, and the minimum number of samples required at a leaf node was set to 4. We ran 15 simulations for each sub-database with different initial seeds. The best results were noted and are presented in this section. We compared the MOS values provided by the parametric prediction models with the true MOS values of the test sets. The scatter plot of the true MOS values versus the predicted MOS values of the test samples obtained from the proposed models is shown in Fig. 3 for the broadcast sub-database and in Fig. 4 for the web-cast sub-database. Table 3 shows the best outcomes obtained from all the simulations in terms of the above-mentioned performance measures. The results show that the Random Forest method performs well in both cases, i.e., under the broadcast and web-cast conditions. To be more precise, for the web-cast conditions, we reached a PCC of 0.9854 and an RMSE of 0.2192; for the broadcast conditions, the PCC is lower at 0.9411, while the RMSE is slightly lower at 0.1951. We also measured the computational load of both model types. The estimation for the broadcast and web-cast sub-databases takes 0.0655 s and 0.2855 s, respectively. All the simulations were carried out on a 64-bit quad-core processor based on the Kaby Lake H architecture, Intel i7-7700HQ 2.8 GHz.
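The configuration described above corresponds to the following call. The hyperparameters are those stated in the text; everything else in this sketch is an assumption for illustration, in particular the synthetic stand-in data, the number of features, and the simple 80:20 cut, since the actual sub-databases are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 200 conditions with 3 numeric
# features and a noisy MOS-like target. It only makes the sketch runnable.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 1.0 + 4.0 * X[:, 0] + 0.1 * rng.normal(size=200)

# Hyperparameters as stated in the text.
model = RandomForestRegressor(
    n_estimators=100,      # number of trees
    random_state=50,       # random state
    max_depth=None,        # default: nodes are expanded without a depth limit
    min_samples_split=5,   # minimum number of samples to split a node
    min_samples_leaf=4,    # minimum number of samples at a leaf node
)

# Illustrative 80:20 split of the synthetic data.
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

Repeating the fit with different `random_state` values would mirror the repeated simulations with different seeds mentioned above.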
As can be seen from Fig. 3 and Fig. 4, there are some outliers for both types of environments. In the broadcast scenario, these outliers are mostly speech signals encoded with the MP2, Opus, AAC-LC and HE-AACv2 codecs at the lower bit rates. In the web-cast scenario, on the other hand, the individual types of signals are evenly represented among the outliers. The codecs that account for the largest share of outliers are MP2, AAC-LC, Ogg Vorbis, Opus and HE-AACv2, again at the lower bit rates.

Conclusion
In this work, we designed non-intrusive parametric audio quality estimation models for the broadcasting and web-casting scenarios, respectively, which are based on the Random Forest approach. The proposed models use a set of audio features to estimate the overall audio quality for broadcasting systems and web-casting applications. When comparing the estimates provided by the proposed parametric models with the actual MOS values, we can conclude that the proposed approach estimates the perceived audio quality with good accuracy. The developed parametric models can be implemented in monitoring systems, so that the quality of sound transmission can be calculated simultaneously for a large number of connections. Future work will focus on the design of non-intrusive parametric audio quality estimation models based on other types of machine learning approaches, e.g., shallow and deep neural networks.