Multi-Level Audio Classification Architecture

This paper proposes a multi-level classification architecture for solving binary discrimination problems. The main idea is derived from the fact that solving one binary discrimination problem multiple times can reduce the overall misclassification error. We focused on building a classification architecture that combines multiple binary SVM (Support Vector Machine) classifiers to solve a two-class discrimination problem. We therefore developed a binary discrimination architecture employing the SVM classifier (BDASVM), intended for the classification of broadcast news (BN) audio data. The fundamental element of BDASVM is the binary decision (BD) algorithm, which discriminates between each pair of acoustic classes using a decision function modeled by a separating hyperplane. The overall classification accuracy depends on finding the optimal parameters of the discrimination function, which results in higher computational complexity. The final form of the proposed BDASVM is created by combining four BDSVM discriminators supplemented by a decision table. Experimental results show that the proposed classification architecture can decrease the overall classification error in comparison with the binary decision tree SVM (BDTSVM) architecture.


Introduction
An enormous amount of audio-visual data is currently available on the internet and in various audio-visual databases, and this audio and video content needs to be managed. Various content-based techniques have been implemented in order to process these data automatically. Efficient processing of audio data is essential for applications such as automatic speech recognition, classification and retrieval. Considerable effort is directed towards content-based analysis of audio data containing various acoustic classes that are hard to discriminate. This paper therefore focuses on content-based classification of BN audio data utilizing an efficient classification architecture. Moreover, we built the classification system with the intention of using it to refine the acoustic models for each particular audio class and to lower the word error rate of an automatic speech recognition (ASR) system.
In general, there are six acoustic classes that occur frequently in a BN audio stream, namely pure speech, speech with environment sound, environment sound, speech with music, music and silence [1]. Each individual class is characterized by unique acoustic properties and random occurrence in the audio stream. So far, various approaches to the discrimination of multiple classes have been investigated and compared. Some of them focus on the automatic selection of the most efficient features and the optimal thresholds, utilizing rule-based classification algorithms [2], [3]. Typically, long-term statistics such as the mean or the variance, rather than the features themselves, are used for the discrimination. Other works point to the importance of using an appropriate classification architecture employing robust machine learning classifiers [4], [5]. Other authors have also reported the superior performance of SVM in many classification tasks, especially when classifying audio streams with various acoustic events [6], [7], [8]. That was the main reason why we decided to use the SVM classifier as the core element of our binary discrimination architecture, supplemented by a sufficient set of features. The aim was to minimize the misclassification error and increase the overall classification accuracy for broadcast news audio data.
The overall system for classification of BN audio data is illustrated in Fig. 1. It utilizes the fundamental principles of audio processing techniques applied in the BDT classification strategy. The input audio stream is first divided into segments with a duration of 200 ms and an overlap of 100 ms, using a simple rectangular window. Each segment is further divided into overlapping frames, using a Hamming window with a length of 50 ms and a 25 ms frame shift, in order to avoid spectral distortions. The segmented audio signal is pre-emphasized by an FIR filter in order to emphasize the higher frequencies in speech signals. All the features are calculated within each individual frame in the time, frequency or cepstral domain. The variance of the feature values is then calculated within each individual segment. Such long-term statistics reduce computational complexity and the influence of the signal's variability. Each extracted parameter is then smoothed by averaging its values using a successive floating window with a length of 1 s. Smoothing helps to alleviate the influence of abrupt changes between adjacent coefficients within feature vectors that represent only one audio class and, as a consequence, reduces the misclassification error.
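The segmentation and long-term statistics described above can be sketched roughly as follows. This is a minimal illustration, not the authors' feature extractor: the frame-level feature `short_time_energy` is only a placeholder, the trailing form of the 1 s smoothing window is an assumption, pre-emphasis is omitted, and the sample rate is assumed to be a multiple of 1000 Hz. All function names are hypothetical.

```python
import math

def frame_starts(n_samples, win, hop):
    """Start indices of all full windows of length `win` taken every `hop` samples."""
    if n_samples < win:
        return []
    return list(range(0, n_samples - win + 1, hop))

def hamming(n):
    """Hamming window coefficients for a frame of n samples."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def short_time_energy(frame, window):
    """Placeholder frame-level feature (an assumption, not a feature from the paper)."""
    return sum((s * w) ** 2 for s, w in zip(frame, window)) / len(frame)

def variance(values):
    """Population variance of the frame-level values inside one segment."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def moving_average(values, length):
    """Smooth with a trailing window of up to `length` values (assumed window shape)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - length + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def segment_features(signal, sample_rate):
    """One smoothed variance value per 200 ms segment (100 ms segment shift)."""
    ms = sample_rate // 1000                  # samples per millisecond
    seg_len, seg_hop = 200 * ms, 100 * ms     # rectangular segment window
    frm_len, frm_hop = 50 * ms, 25 * ms       # Hamming-windowed frames
    window = hamming(frm_len)
    per_segment = []
    for s in frame_starts(len(signal), seg_len, seg_hop):
        segment = signal[s:s + seg_len]
        feats = [short_time_energy(segment[f:f + frm_len], window)
                 for f in frame_starts(seg_len, frm_len, frm_hop)]
        per_segment.append(variance(feats))
    # 1 s of context corresponds to 10 segment values at a 100 ms segment shift
    return moving_average(per_segment, 1000 // 100)
```

With these settings each 200 ms segment contains seven 50 ms frames, so every segment-level value summarizes seven frame-level values before smoothing.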
Each block of classification is represented by one binary discriminator that discriminates the input feature vectors using the corresponding decision function. The most general and easy-to-separate classes, speech and non-speech, are classified on the first level of the topology. The other classes, namely music/environment sound and pure speech/non-pure speech, are classified in the next step, which processes the audio data from the previous level. The last level classifies the two classes that are most difficult to discriminate: speech with environment sound/speech with music. The output value of the decision function assigns the final class label to the current vector.
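The level-by-level routing described above can be sketched as a small cascade. This is an illustrative sketch only: the `discriminators` mapping and the polarity of each level (which class receives +1) are assumptions made for the example, and each discriminator is any callable that returns +1 or -1.

```python
def classify(x, discriminators):
    """Route one feature vector through the four-level cascade.

    `discriminators` maps a level number (1 to 4) to a callable returning +1 or -1.
    The +1/-1 polarity per level is assumed for illustration.
    """
    if discriminators[1](x) > 0:              # level 1: speech vs. non-speech
        if discriminators[2](x) > 0:          # level 2: pure vs. non-pure speech
            return "pure speech"
        if discriminators[4](x) > 0:          # level 4: the hardest pair
            return "speech with environment sound"
        return "speech with music"
    if discriminators[3](x) > 0:              # level 3: music vs. environment sound
        return "music"
    return "environment sound"
```

A vector therefore passes through at most three discriminators, and an error on the first level cannot be corrected by the later ones.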
Section 2 describes the binary decision algorithm and the main discrimination principles applied in BDASVM. Section 3 discusses the experimental setup and, finally, Section 4 gives our conclusions and shows future directions.

Binary Discrimination Architecture
The core element of the proposed BDA is the SVM classifier [9]. We decided to use it in our classification topology for its generalization ability and superior performance in various pattern classification tasks.
The decision function of the SVM is modeled by a separating hyperplane:

\[ D(\vec{x}) = \vec{w} \cdot \vec{x} + b \]

where the input audio data are represented in the form of a feature matrix, and the weights w and the bias b are obtained in the process of learning (training).

Algorithm 1 BDSVM algorithm
Require: Input feature matrices X_train, X_test and label vector Y_train.
Ensure: Output label vector Y_predict.
1: for all x ∈ {X_train, X_test} do
...
svm_predict(x, model);
13: end for

The proposed binary decision (BD) algorithm utilizing SVM is stated in Alg. 1. It follows the basic principles applied in the training process. The main task is to find the optimal parameters of the binary decision function on each level of classification, using the training data. The optimal setting is understood as the process of finding the parameters of the kernel function and the penalty parameter during training. The input of the BDSVM algorithm is a feature matrix corresponding to the training data X_train, with dimension M × N. The function scale scales the values to the range 0 to 1. This helps to eliminate large differences between coefficients and can be considered a kind of smoothing. The feature vectors are then reordered so that the labels alternate, y_m = +1 and y_{m+1} = −1, keeping the same number of vectors for both classes. The reordered feature vectors and labels are assigned as X_train, Y_train. This step helped us to optimize the process of cross-validation and alleviate overfitting of the classifier. A cross-validation technique [10] (of which leave-one-out cross-validation is a special case) is then applied in order to find the optimal parameter of the kernel function (g) and the penalty parameter (C). After several initial experiments, we decided to use 5-fold cross-validation and the RBF kernel function as the best choice. The parameters C and g were adjusted exponentially, taking the values 2^0, 2^2, 2^4 and 2^6. The AUC (Area Under the Curve) [11] was used as the main evaluation criterion during the cross-validation. The highest value of AUC indicates the best parameters C and g. The optimal parameters are then used to generate the model with the svm_train function. The acoustic model is subsequently applied to predict the class labels for the testing data (Y_predict).
The proposed BDASVM topology is depicted in Fig. 2. After several initial experiments, we found that the number of input vectors on the fourth level was too small for training the SVM classifier. This unwanted effect was caused by the high discrimination power on the first level of classification.
To suppress this effect, it was necessary to decrease the discrimination ability on the first level and increase it on the fourth level. A partial solution was to divide the training set at the input of BDASVM into two parts of equal size, X_train_train and X_train_test. The feature matrix X_train_train was used for training the SVM and X_train_test for testing on the first level of discrimination. The feature vectors classified into class +1, X_train_test+1, enter the second level of discrimination. Conversely, the feature vectors classified into class −1 on the first level, X_train_test−1, enter the third level. The whole training set of feature vectors X_train was used for training the SVM on the fourth level, regardless of the testing vectors that enter the classifier from levels two and three. In the testing phase, the whole set of testing data X_test enters the BDASVM and the values of the decision functions are written to the decision table.
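The data flow in this training scheme can be sketched as follows. It is a simplified illustration under stated assumptions: the split into contiguous halves is assumed (the text only says the two parts have equal size), `train_fn` and `predict_fn` are hypothetical stand-ins for the SVM training and prediction calls, and the task-specific labels for levels two to four are left out for brevity.

```python
def split_in_half(vectors, labels):
    """Split the training set into X_train_train and X_train_test of equal size.

    Contiguous halves are an assumption; with alternating +1/-1 labels both
    halves stay balanced.
    """
    half = len(vectors) // 2
    return (vectors[:half], labels[:half]), (vectors[half:], labels[half:])

def route(vectors, predictions):
    """Separate vectors by their level-1 prediction (+1 or -1)."""
    to_level2 = [x for x, p in zip(vectors, predictions) if p > 0]
    to_level3 = [x for x, p in zip(vectors, predictions) if p <= 0]
    return to_level2, to_level3

def level_inputs(x_train, y_train, train_fn, predict_fn):
    """Return the feature vectors that reach each level during training."""
    (x_tt, y_tt), (x_te, _) = split_in_half(x_train, y_train)
    model = train_fn(x_tt, y_tt)                    # level 1 is trained on X_train_train
    preds = [predict_fn(model, x) for x in x_te]    # and tested on X_train_test
    plus, minus = route(x_te, preds)
    # Level 4 uses the whole training set, independent of levels 2 and 3.
    return {1: x_tt, 2: plus, 3: minus, 4: x_train}
```

Training the fourth level on the full set is what compensates for the small number of vectors that would otherwise reach it through the cascade.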
The maximum classification accuracy was achieved by adding a weighting factor w_C to each discrimination level. We defined it as the value of the penalty parameter C divided by the number of training vectors belonging to the particular class. Thus, for class +1: w_C+ = C / num_+1, and for class −1: w_C− = C / num_−1. The weighting factor helps to decrease the influence of overfitting by giving a higher weight to the vectors belonging to the minority class and a lower weight to the vectors belonging to the majority class. A more detailed description of its implementation in the SVM training algorithm can be found in [12].
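The parameter search and weighting described in this section can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: `cv_auc` is a hypothetical callback that would run the 5-fold cross-validation of an RBF-kernel SVM and return its AUC, and the pairwise AUC formula is one standard way to compute the area under the ROC curve.

```python
from itertools import product

# Exponential grid used for both C and g: 2^0, 2^2, 2^4, 2^6
GRID = [2 ** k for k in (0, 2, 4, 6)]

def min_max_scale(column):
    """Scale one feature column to the range 0 to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def auc(scores, labels):
    """Probability that a random +1 sample scores above a random -1 sample.

    Ties count as one half; this equals the area under the ROC curve.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def class_weights(C, labels):
    """Weighting factors w_C+ = C / num_+1 and w_C- = C / num_-1."""
    num_pos = sum(1 for y in labels if y == 1)
    num_neg = sum(1 for y in labels if y == -1)
    return {1: C / num_pos, -1: C / num_neg}

def select_parameters(cv_auc):
    """Return the (C, g) pair with the highest cross-validated AUC.

    `cv_auc(C, g)` is a hypothetical callback, not a LIBSVM function.
    """
    return max(product(GRID, GRID), key=lambda cg: cv_auc(*cg))
```

The grid contains 16 (C, g) pairs, so each level requires 16 cross-validation runs of 5 folds each, which accounts for the higher computational cost of training.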

Experiments
The classification performance of the basic BDSVM (BDTSVM) and the proposed BDASVM topology was evaluated on the KEMT-BN1 database, which contains about 65 hours of Slovak TV broadcast audio [13]. Only part of the database was used in our experiments, namely 49 min for training (PS: 10.19 min, MS: 9.26 min, SES: 9.41 min, M: 11.7 min, B: 9.06 min) and 46.2 min for testing (PS: 9.16 min, MS: 9.44 min, SES: 9.25 min, M: 9.04 min, B: 9.31 min). Our aim was to extract a smaller but more accurate amount of audio data, with equal size for each audio class, in order to avoid overfitting of the classifier. Silent parts and all other audio events were extracted manually, using only the available word-level transcription.
The last step in our experimental work was to implement the proposed BDASVM architecture into the overall classification system (Fig. 1). The architecture is depicted in Fig. 3. Each block of classification is represented by one BDA module: S-NS discrimination on the first level, PS-NPS on the second level, M-B on the third level and finally MS-SES on the fourth level. The input audio data are represented by feature vectors x with dimension 1 × N, as part of the input feature matrix X. The output gives information about the particular audio class in the audio stream. During the testing phase, depending on the output values of the decision functions D_i(x), i ∈ [1 : 4], the label +1 or −1 is assigned to each input feature vector on each level of discrimination and saved into the decision table. The final class label is assigned according to the following criteria:
• PS: sum(D1(x), D2(...
• SES: sum(D1(x), D2(x), D4(x)) = 1, if D1(x) = 1 and also D2(x) = −1.

Results and Discussion
The classification performance of both evaluated architectures is given in Tab. 1. We used the classification accuracy Acc as the main evaluation criterion. It is defined as the ratio of correctly predicted frames to all tested frames within the particular audio class. Each individual value of Acc, except for Avg Acc, represents the average over parameterization on the frame level, on the segment level with smoothing and on the segment level without smoothing. The value of Avg Acc was obtained by averaging the overall classification accuracy over all classes. Such an interpretation of the results minimizes redundant information about the parameterization technique and keeps the substantial information about the system's performance. PT corresponds to the processing time needed to classify each testing feature vector into the particular class. The implementation of the proposed BDASVM topology resulted in higher classification performance. A 0.48 % increase of Avg Acc and a 4.24 s growth of PT were achieved in comparison with BDSVM. The relatively small increase in processing time was caused by loading all the testing data at once in the first step of classification and processing them on each level of discrimination. We assume that the main cause of the relatively low improvement in classification performance was the high influence of the level-dependent misclassification error in the case of the MS-BS discrimination problem. The level-dependent error propagates only within one BDASVM component. A possible remedy for these drawbacks is the implementation of a feature selection algorithm for each level of BDASVM. The aim of that algorithm is to reduce the amount of training data by selecting the optimal data set on each level of discrimination. A comparison with other classification architectures, such as one-against-one and one-against-all, will be investigated in future work as well.
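The accuracy measures defined above can be written down directly. The sketch below assumes that the true and predicted class labels are available as two equally long lists with one entry per frame; the function names are hypothetical.

```python
def class_accuracy(true_labels, predicted_labels):
    """Acc per class: correctly predicted frames / all tested frames of that class, in %."""
    totals, correct = {}, {}
    for t, p in zip(true_labels, predicted_labels):
        totals[t] = totals.get(t, 0) + 1
        if t == p:
            correct[t] = correct.get(t, 0) + 1
    return {c: 100.0 * correct.get(c, 0) / totals[c] for c in totals}

def average_accuracy(per_class):
    """Avg Acc: unweighted mean of the per-class accuracies."""
    return sum(per_class.values()) / len(per_class)
```

Because Avg Acc is an unweighted mean over classes, a class with few test frames influences it as much as a class with many.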
Training and testing of the SVM were performed with the LIBSVM software (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). Features were extracted using our own software implementation. The classification algorithms were run on an HPC system with 24 nodes. Each node contains an IBM Blade System x HS22 computing server with two six-core Intel Xeon L5640 processors (2.27 GHz) and 48 GB of RAM.
are input vectors with dimension N and class label y_m = ±1. N defines the number of coefficients per frame (or segment) and M gives the overall number of frames (segments). After a successful training stage, the learning machine produces the output D(x_m), given as:
The fundamental principles are based on successive discrimination amendment, where each block tries to correct what the previous one misclassified. The total number of binary classifiers needed to discriminate N classes is n × N, where n refers to the number of discriminators for one classification level (in our proposal n = 4). It follows a basic stepwise classification condition, in which the combination of n classifiers is arranged in a configuration that outputs only two decision values, +1 and −1. It is considered an empirical approach.
Tab. 1: The classification performance of basic BDSVM and proposed BDASVM architectures.