Multiple Time-Instances Features of Degraded Speech for Single Ended Quality Measurement

The use of single time-instance features, where the entire speech utterance is used for feature computation, is neither accurate nor adequate for capturing the time-localized information of short-time transient distortions and distinguishing them from the plosive sounds of speech, particularly when the speech is degraded by impulsive noise. Hence, features are estimated at multiple time-instances. Only active speech segments of the degraded speech are used for feature computation at multiple time-instances on a per-frame basis. Here, active speech means both voiced and unvoiced frames, excluding silence. The features of different combinations of multiple contiguous active speech segments are computed and called multiple time-instances features. Joint GMM training is carried out using these features along with the subjective MOS of the corresponding speech utterance to obtain the GMM parameters. These GMM parameters and the multiple time-instances features of the test speech are used to compute the objective MOS values of the different combinations of multiple contiguous active speech segments. The overall objective MOS of the test speech utterance is obtained by assigning equal weights to the objective MOS values of the different combinations of multiple contiguous active speech segments. This algorithm outperforms Recommendation ITU-T P.563 and recently published algorithms.


Introduction
Speech processing algorithms and codecs are used in modern telecommunication systems, and monitoring and maintaining the quality of speech is therefore important for customer satisfaction and for maintaining and improving the quality of service. One aspect of this requirement for an automated system is to evaluate the speech quality objectively and continuously. If the quality of speech is not up to the mark, proper bandwidth allocation or speech enhancement techniques can be utilized to improve the quality of speech and thus the quality of service. There are two methods for signal-based speech quality measurement: double ended (intrusive) and single ended (non-intrusive). The double ended (intrusive) technique requires the original clean speech signal along with the received degraded speech signal to compute the quality rating, called the objective MOS, while the single ended (non-intrusive) technique uses only the received degraded speech signal to compute the quality rating [1]. The non-intrusive method of speech quality measurement is suitable for system automation and real-time applications where the original clean speech signal is practically impossible to obtain, such as mobile communications, telephonic communication, Direct-to-Home (DTH) television (TV) signals, Voice over Internet Protocol (VoIP) signals, etc. Recommendation ITU-T P.563 (May 2004) is the standard for single ended (non-intrusive) speech quality measurement [2]. Subjective measurement is the ideal way to obtain the speech quality rating of a degraded speech signal: the speech signal is played, and the average of the opinions of about 16-20 listeners is treated as the quality rating for a particular speech utterance and called the subjective MOS, as per Recommendation ITU-T P.800 (Aug. 1996) [3]. In [4], speech quality has been measured using different types of features obtained from a speech encoder and GMM mapping, without considering any degradation model. The human auditory system, modelled explicitly or implicitly as Lyon's cochlear model [5], is used in this work; the model takes into account the critical bands and different auditory phenomena, such as the masking effect, of the human auditory system. The functional role of the human auditory system and the characteristics of the articulatory system, in the form of a temporal envelope representation of speech, have been utilized in the Auditory Non-Intrusive Quality Estimation (ANIQUE) model [6]. Lyon's auditory features computed for the entire speech utterance, which we call "single time-instance features", and their mapping to the speech quality score by GMM have been given in [7]. A combination of different single time-instance speech features, including auditory features and features related to vocal-tract resonances, is used for GMM mapping and speech quality evaluation in [8]. The method given in [9] assesses dimensions of the perceptual quality space using linear regression; the dimension used is the loudness of speech, which describes a non-optimal sound level. Estimating the quality and intelligibility of speech degraded by additive noise and distortions associated with telecommunication networks, based on a data-driven framework of feature extraction and tree-based regression, is given in [10].
The limitation of the current research in the literature is that the features used for speech quality measurement are single time-instance features: the entire speech utterance is used for the computation of the features, and these features are mapped to the objective quality rating score. Single time-instance features are neither accurate nor adequate for capturing the time-localized information of short-time transient distortions and distinguishing them from the plosive sounds of speech, particularly when the speech is degraded by impulsive noise. In this work, the features are computed at multiple time-instances, which captures the presence of noise at different locations of the speech utterance instead of averaging its effect over the entire speech utterance. A Voice Activity Detection (VAD) algorithm is employed to obtain the active speech segments of the different speech utterances [11]. Here, active speech means both voiced and unvoiced frames, excluding silence. The combinations of multiple contiguous active speech segments of a speech utterance are then formed in increasing order until all the active speech segments are accounted for. These combinations of active segments are divided into frames, and features are computed on a per-frame basis using Lyon's auditory model. The per-frame features are combined over the frames to give the features of the different combinations of multiple contiguous active speech segments. In a similar manner, Mel-Frequency Cepstral Coefficients (MFCC) [12], [13] and Line Spectral Frequencies (LSF) features [14] are computed at multiple time-instances and concatenated to obtain the feature vector. During GMM training, the subjective MOS of the speech utterance is taken as the subjective MOS for each of the multiple time-scale estimates (the combinations of multiple contiguous active speech segments). The objective MOS values for each of the multiple time-scale estimates are computed using the GMM parameters and the multiple time-scale features of the test speech utterance. The overall objective MOS of the test speech utterance is computed by assigning equal weights to the objective MOS values of the different multiple time-scale estimates. The results are compared with Recommendation ITU-T P.563, the standard for non-intrusive speech quality measurement, and with recently published state-of-the-art works [13], [15], [16], [17] and [18], which use the single time-instance features approach, in terms of Pearson's correlation coefficient and RMSE between the subjective MOS and the overall objective MOS of the speech utterances. The proposed algorithm, using the combination of Lyon's auditory features, MFCC and LSF features, all computed at multiple time-instances, outperforms these recent state-of-the-art works.

Multiple Time-Instances Auditory Features
More detailed statistical information about local features, particularly for contiguous speech segments, can be captured in multiple time-instances estimates when non-stationary noise is present in the speech utterance. Thus, the correlation between the subjective and the objective MOS in the speech quality measurement problem is expected to improve with the multiple time-instances features approach. The degraded speech is the input to the multiple time-instances auditory feature computation modules. At the very first stage, it passes through the VAD algorithm, which removes the silence regions and finds the different active speech regions present in the speech utterance. For a speech utterance having three active speech segments, the output of the VAD algorithm is schematically shown in Fig. 1.
The active speech segments at the output of the VAD algorithm are used in increasing order to form the different combinations of multiple time duration active speech segments until all the active speech segments are accounted for. The method of concatenation used to obtain the different multiple time-instances estimates, as combinations of active speech segments, is shown in Fig. 2 for a speech utterance having three active speech segments.
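The cumulative concatenation described above can be sketched as follows. This is a minimal illustration, assuming the VAD returns a time-ordered list of (start, end) sample indices with the end exclusive; the function name `cumulative_segments` and the toy signal are hypothetical and not taken from the paper.

```python
import numpy as np

def cumulative_segments(signal, active_regions):
    """Return K arrays; the k-th concatenates the first k active regions.

    active_regions: list of (start, end) sample indices (end exclusive),
    sorted in time, as an assumed VAD output format.
    """
    combos = []
    pieces = []
    for start, end in active_regions:
        pieces.append(signal[start:end])
        # Combination k = regions 1..k joined together, silence dropped.
        combos.append(np.concatenate(pieces))
    return combos

# Toy example: a ramp signal with three "active" regions.
signal = np.arange(100, dtype=float)
regions = [(5, 20), (30, 50), (70, 95)]
segs = cumulative_segments(signal, regions)
lengths = [len(s) for s in segs]
```

With K active regions the function returns K arrays of increasing length, and the last one contains all the active speech of the utterance.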
The first active segment is called SEG1. The combination of the first and second active speech segments is called SEG2. The combination of the first, second and third active speech segments is called SEG3, and so on. In a similar manner, for K active speech segments in a speech utterance, there will be K different combinations of multiple contiguous active speech segments, namely SEG1, SEG2, ..., SEGK. These combinations of multiple contiguous active speech segments, such as SEG2, SEG3, ..., SEGK, are divided into frames of a fixed duration of 16 ms and passed through the 64-channel Lyon's auditory model to compute 64 auditory features on a frame-by-frame basis, after windowing with a Hamming window of 16 ms duration with 50 % overlap. The mean, variance, skewness and kurtosis of the 64 auditory features over the frames are computed and concatenated to obtain a 256-dimensional Lyon's feature vector. The dimensionality of the feature vector is reduced from 256 to 30 using Principal Component Analysis (PCA), preserving more than 98 % of the energy. In the multiple time-instances features approach, the duration of the active speech segments varies over time.
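The framing and the four per-channel statistics can be sketched as below. Several choices here are assumptions made only for illustration: an 8 kHz sampling rate, non-excess (Pearson) kurtosis, and a log power spectrum used as a stand-in for the 64-channel Lyon's auditory model, which is not implemented here. The function names are hypothetical.

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=16.0, overlap=0.5):
    """Split x into Hamming-windowed frames (16 ms, 50 % overlap by default)."""
    n = int(round(fs * frame_ms / 1000.0))   # samples per frame
    hop = int(n * (1.0 - overlap))           # frame shift in samples
    if len(x) < n:
        raise ValueError("signal shorter than one frame")
    count = 1 + (len(x) - n) // hop
    idx = np.arange(n)[None, :] + hop * np.arange(count)[:, None]
    return x[idx] * np.hamming(n)

def moment_features(per_frame):
    """Concatenate mean, variance, skewness and kurtosis over the frames.

    per_frame: (n_frames, n_channels) -> vector of length 4 * n_channels.
    """
    mu = per_frame.mean(axis=0)
    centered = per_frame - mu
    var = (centered ** 2).mean(axis=0)
    std = np.sqrt(var) + 1e-12               # guard against division by zero
    skew = (centered ** 3).mean(axis=0) / std ** 3
    kurt = (centered ** 4).mean(axis=0) / std ** 4
    return np.concatenate([mu, var, skew, kurt])

# Placeholder input: 1 s of noise standing in for one segment combination.
rng = np.random.default_rng(0)
seg = rng.standard_normal(8000)
frames = frame_signal(seg)
# Stand-in for the auditory model: 64 log power-spectrum bins per frame.
per_frame = np.log(np.abs(np.fft.rfft(frames, axis=1))[:, 1:] ** 2 + 1e-12)
vec = moment_features(per_frame)             # 4 x 64 = 256 values
toy = moment_features(np.array([[0.0], [2.0]]))
```

Any per-frame feature matrix with 64 columns can be passed to `moment_features`; the resulting 256-dimensional vector is what PCA would subsequently reduce.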
In a similar manner, 13-dimensional multiple time-instances MFCC and 10-dimensional multiple time-instances LSF feature vectors are also computed on a per-frame basis. All these feature vectors are then concatenated to obtain a 53-dimensional feature vector. In a similar manner, 53-dimensional feature vectors are computed for all multiple time-instances estimates, such as SEG2, SEG3 and so on up to SEGK. For the training of the joint GMM according to the Expectation Maximization (EM) algorithm [19], the 53-dimensional feature vectors are appended with the subjective MOS values of the corresponding speech utterance. The subjective MOS for each of the multiple time-instances estimates is taken as the subjective MOS of the speech utterance, as shown in Fig. 3, because no separate subjective MOS is available for the multiple time-instances estimates in any database. The objective MOS of each of the multiple time-instances estimates is computed using the GMM parameters, namely the means, mixture weights and covariance matrices, and the 53-dimensional feature vector of the corresponding multiple time-instances estimate. The objective MOS value of the i-th multiple time-scale estimate, $\hat{\theta}_i$, as a function of the 53-dimensional multiple time-instances feature vector $\psi$, is obtained using the Minimum Mean Square Error (MMSE) criterion [4]:
$$\hat{\theta}_i(\psi) = E[\theta \mid \psi],$$
where $\theta$ is the subjective MOS of the corresponding speech utterance. The three databases are randomized to use a leave-one-out 10-fold cross validation process for training and testing. That is, 90 % of the data are used for training and 10 % of the data are used for testing. The process is repeated 10 times to obtain the objective MOS values for all the multiple time-instances estimates. In this work, a GMM with 12 mixture components is used, and all the GMM training parameters are computed offline and stored in a library. In real-time monitoring, only the test speech is used, but there is an algorithmic buffering delay corresponding to one sentence of speech before the multiple time-instances speech quality evaluation algorithms are applied.
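For a joint GMM trained on the feature vector stacked with the subjective MOS, the MMSE estimate is the conditional mean of the MOS given the features, which has a closed form: a responsibility-weighted sum of per-component linear regressions. The sketch below assumes full covariance matrices and that the MOS is stored as the last coordinate of the joint vector; the function name and array layout are illustrative choices, not the authors' implementation.

```python
import numpy as np

def gmm_mmse_estimate(psi, weights, means, covs):
    """E[theta | psi] under a joint GMM over z = [psi, theta].

    weights: (M,), means: (M, D+1), covs: (M, D+1, D+1); theta is last.
    """
    d = psi.shape[0]
    log_resp = np.empty(len(weights))
    cond_means = np.empty(len(weights))
    for m, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        mu_psi, mu_theta = mu[:d], mu[d]
        cov_pp = cov[:d, :d]                 # feature block
        cov_tp = cov[d, :d]                  # MOS-feature cross-covariance
        diff = psi - mu_psi
        solved = np.linalg.solve(cov_pp, diff)
        _, logdet = np.linalg.slogdet(cov_pp)
        # Log of w_m * N(psi; mu_psi, cov_pp), the unnormalised responsibility.
        log_resp[m] = np.log(w) - 0.5 * (diff @ solved + logdet
                                         + d * np.log(2.0 * np.pi))
        # Conditional mean of theta for component m.
        cond_means[m] = mu_theta + cov_tp @ solved
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()
    return float(resp @ cond_means)

# One component, one feature: estimate = 3 + 0.5 * (2 - 0).
est_single = gmm_mmse_estimate(
    np.array([2.0]),
    np.array([1.0]),
    np.array([[0.0, 3.0]]),
    np.array([[[1.0, 0.5], [0.5, 1.0]]]),
)
# Two symmetric components: equal responsibilities at psi = 0.
est_mix = gmm_mmse_estimate(
    np.array([0.0]),
    np.array([0.5, 0.5]),
    np.array([[-1.0, 2.0], [1.0, 4.0]]),
    np.array([np.eye(2), np.eye(2)]),
)
```

In the setting of the text, `psi` would be a 53-dimensional vector and the model would have 12 components; the toy calls above use one feature only so that the arithmetic can be checked by hand.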
The objective MOS values of the multiple time-instances estimates are averaged, i.e. equal weights are assigned to the objective MOS values of the different multiple time-instances estimates, to compute the overall objective MOS of the corresponding speech utterance. If $\hat{\theta}$ is the overall objective MOS of the speech utterance, it is computed as the average of the objective MOS values $\hat{\theta}_i$ of the K SEGs:
$$\hat{\theta} = \frac{1}{K} \sum_{i=1}^{K} \hat{\theta}_i,$$
where K is the number of active speech segments, which is equal to the number of combinations of multiple contiguous active speech segments.

Results and Analysis
Pearson's correlation coefficient and the RMSE between the subjective MOS score $\theta$ and the estimated overall objective MOS score $\hat{\theta}$, both computed as condition averaged values, are used as the figures of merit in most of the literature on single ended speech quality measurement algorithms. In this work, unconditioned values of the subjective and objective MOS, where the MOS values of the speech are used sentence-by-sentence, are also used for the computation of Pearson's correlation coefficient and RMSE [8], because this is more realistic.
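The two figures of merit, and the difference between the unconditioned and condition averaged cases, can be illustrated with a small sketch. It assumes that "condition averaged" means averaging the MOS values of all utterances that share the same degradation condition before computing the metrics; the function names and the toy scores are hypothetical.

```python
import numpy as np

def pearson_and_rmse(subj, obj):
    """Pearson's correlation coefficient and RMSE between two MOS arrays."""
    subj = np.asarray(subj, dtype=float)
    obj = np.asarray(obj, dtype=float)
    r = np.corrcoef(subj, obj)[0, 1]
    rmse = np.sqrt(np.mean((subj - obj) ** 2))
    return r, rmse

def condition_average(values, condition_ids):
    """Average the values of all utterances belonging to each condition."""
    values = np.asarray(values, dtype=float)
    condition_ids = np.asarray(condition_ids)
    return np.array([values[condition_ids == c].mean()
                     for c in np.unique(condition_ids)])

# Toy scores: four utterances, two degradation conditions.
subj = [1.0, 2.0, 3.0, 4.0]
obj = [1.5, 1.5, 3.5, 3.5]
cond = [0, 0, 1, 1]

# Unconditioned case: metrics on sentence-by-sentence values.
r_u, rmse_u = pearson_and_rmse(subj, obj)
# Condition averaged case: average per condition first, then compare.
r_c, rmse_c = pearson_and_rmse(condition_average(subj, cond),
                               condition_average(obj, cond))
```

In this toy case the per-sentence errors cancel within each condition, so the condition averaged correlation is higher than the unconditioned one, which is why the unconditioned case is the stricter test.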
Results are given and compared in Tab. 1 for condition averaged MOS values and in Tab. 3 for unconditioned MOS values using three databases. The comparison of results between the single time-instance [8] and multiple time-instances approaches is presented along with Recommendation ITU-T P.563. The overall weighted average of the correlation using multiple time-instances estimates is 0.980, against 0.960 for the single time-instance features approach [8] and 0.934 for the ITU-T Rec. P.563 algorithm, over the three databases for the condition averaged MOS case, as given in Tab. 1.
Tab. 1: Correlation coefficients and RMSE between the subjective and the estimated overall objective MOS for the condition averaged MOS case.
In [8], on the same databases, 37-dimensional feature vectors formed by combining 14-dimensional reduced size Lyon's auditory model features, 13-dimensional MFCC features, and 10-dimensional LSF features, all computed at a single time-instance for entire speech utterances, are used. In this work, 53-dimensional feature vectors formed by combining 30-dimensional reduced size Lyon's auditory features, 13-dimensional MFCC features, and 10-dimensional LSF features, all computed at multiple time-instances, are used. The basis for the dimensionality reduction of Lyon's auditory features using PCA from 256 to 14 in the single time-instance case is the preservation of 98 % of the energy. According to the same criterion, the dimensionality of the multiple time-instances Lyon's auditory features is reduced from 256 to 30 using PCA. Moreover, the MFCC features are 13-dimensional and the LSF features are 10-dimensional. Thus, the single time-instance feature vectors are 37-dimensional and the multiple time-instances feature vectors are 53-dimensional.
The results in terms of Pearson's correlation coefficient and RMSE for condition averaged MOS are also compared in Tab. 5 with the published results of the recent works in [13], [15] and [16], which used a database of 1792 speech utterances that is a subset of the NOIZEUS-2240 database of 2240 speech utterances used in this work. The comparison is also shown as a bar graph in Fig. 4. Here, we have conducted subjective listening tests to obtain the subjective MOS for the 2240 speech utterances, while [13], [15] and [16] used their own respective subjective scores. The correlation reported in [13] for the condition averaged case is 0.9002 and the RMSE is 0.33, whereas in the proposed work the correlation obtained is 0.986 and the RMSE is 0.068 for the NOIZEUS-2240 database. In [15], the maximum value of Pearson's correlation coefficient obtained is 0.910 in test-1, which uses an 8-fold cross validation process, whereas a 10-fold cross validation process has been used in this work.
Tab. 5: Comparison of results in terms of Pearson's correlation coefficient and RMSE with recently published works [11], [13] and [14] on NOIZEUS-2240 database.
The results in terms of Pearson's correlation coefficient for the NOIZEUS-960 database have also been compared with [17] for condition averaged MOS, which uses the same speech database as this work. Here, we have conducted subjective listening tests to obtain the subjective MOS for the 960 speech utterances, while [17] used its own subjective scores. The Pearson's correlation coefficient obtained in [17] was 0.933, against 0.995 in the proposed work. In [17], 70 % of the data has been used for training and 30 % for testing. The results in terms of Pearson's correlation coefficient for the ITU-T P. Supplement-23 database have also been compared with the recent work [18] in Tab. 6 for seven sub-databases for condition averaged MOS values. In these comparisons, it is observed that the proposed work performs better than these recently published works.
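The 98 % energy criterion that fixes the PCA output dimension of the Lyon's auditory features (14 in the single time-instance case, 30 in the multiple time-instances case) can be sketched as follows. Here "energy" is interpreted as the fraction of total variance retained, which is an assumption, and the function names and toy data are hypothetical.

```python
import numpy as np

def pca_fit_energy(X, energy=0.98):
    """Fit PCA and keep the fewest components retaining the given energy.

    X: (n_samples, n_features). Returns (mean, basis) with basis of shape
    (n_features, k).
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Squared singular values of the centred data are proportional to the
    # eigenvalues of the covariance matrix.
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First index whose cumulative ratio reaches the target, plus one.
    k = int(np.searchsorted(ratio, energy) + 1)
    return mean, vt[:k].T

def pca_transform(X, mean, basis):
    """Project data onto the retained principal components."""
    return (X - mean) @ basis

# Toy data: one dominant direction, one weak direction, one empty one.
X = np.array([[10.0, 0.0, 0.0],
              [-10.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, -1.0, 0.0]])
mean, basis = pca_fit_energy(X, energy=0.98)
reduced = pca_transform(X, mean, basis)
_, basis_strict = pca_fit_energy(X, energy=0.995)
```

In the toy data the dominant direction carries 200/202 of the variance, so one component suffices at 98 % while a 99.5 % target needs two. The same mechanism explains why different feature sets end up with different reduced dimensions under a single criterion.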

Inferences Drawn from Results
From the overall results expressed in tabular form and the different comparisons, the following inferences are made:
• Using the multiple time-instances estimates to compute the objective MOS score of the overall speech utterance gives a higher correlation than the single time-instance features approach.
• For both the condition averaged MOS case and the unconditioned MOS case, the correlation coefficients and RMSE are significantly better for the multiple time-instances estimates than for the single time-instance estimates over the different databases.
• In this algorithm, the combination of reduced size Lyon's auditory features with MFCC and LSF features is used as the feature vector. Although there will be some duplication of information in the features, the combination of features gives better results in terms of correlation and RMSE between the subjective MOS and the estimated overall objective MOS for speech on a sentence-by-sentence basis. By combining these feature vectors, the correlation coefficient increases significantly in both the unconditioned and the condition averaged MOS cases.

Fig. 2: Combinations of three active speech segments for different time-instances estimates for illustration.

Fig. 3: Computation of 53-dimensional feature vector, and appending with the subjective MOS for GMM training.
Tab. 2: Comparison of single time-instance and multiple time-instances features approach using equal weights with ITU-T Rec. P.563 taking condition averaged subjective MOS and estimated objective MOS.
Tab. 3: Correlation coefficients and RMSE between the unconditioned subjective MOS and the unconditioned estimated overall objective MOS.