A Novel Element Detection Method in Audio Sensor Networks

Audio element detection in wireless sensor networks (WSNs) has great significance in our lives (e.g., in detecting traffic jam and accident, gun shots and explosion, and hurricane). It is particularly useful when video cameras cannot be used effectively (e.g., in darkness, with a wide range to cover); audio sensors are also much cheaper. However, most previous works on audio element detection require a large number of training examples to obtain satisfactory results. This becomes even more infeasible for audio sensors in WSNs where small energy consumption is required. In this paper, we propose a novel approach to solve this difficult problem. We first break down audio clips into a collection of simple “audio elements,” and train these audio elements offline using statistical learning. Then, we train a weighted association graph using the trained audio element models online. This greatly reduces the amount of online training without sacrificing accuracy. We deploy our approach in an audio sensor network for traffic monitoring and venue monitoring to evaluate its performance. The experiments demonstrate that our proposed method achieves better results compared to the state-of-the-art methods while using smaller online training sets.


Introduction
Wireless sensor networks (WSNs) which combine the technology of sensor, embedded system, and wireless communications have a bright application prospect.At present, WSNs have been widely used in some specific area sensing and acquisition of scalar information, such as temperature and humidity.However, with the complexity and variation of monitoring demand, such scalar information cannot meet full demand of users.It is urgent to introduce rich media (i.e., audio, image, and video) into environment monitoring activities on the basis of sensor networks, thus performing a complete and accurate environment monitoring.Recently, multimedia sensor networks (MSNs) have been paid wide attention, and audio event detection in WSNs has great signification in our lives (e.g., in detecting traffic flow, jam, and accident) [1,2].
Most smart home or smart traffic systems usually utilize video cameras to detect abnormal situations [3].However, in some special situations, video cameras cannot work well.For example, the cameras are susceptible to poor weather conditions if outdoors, and they cannot work when black out.
In addition to the video cameras, the use of audio sensors in surveillance and monitoring applications is becoming increasingly important [2,4].The audio sensors are particularly useful when video cameras cannot be used effectively (e.g., in darkness, with a wide range to cover).In some abnormal situations, audio conveys more significant information than video (e.g., human shouting/crying, car-crashing, etc.).Audio sensors are also much cheaper.
Audio element detection is important and helpful in many applications, such as audio scene analysis and audio summarization [5][6][7].However, most relevant research focuses on detecting the audio elements in TV programs.In these works, most researchers collected enough training samples to train each audio element model.In practice, multiple audio elements may occur simultaneously in one audio clip.In certain controlled environment (such as movies), it is relatively easy to separate multiple audios by ICA (e.g., [3]).However, in real-world situations (such as on a noisy street), it is hard to separate multiple audios.Furthermore, building a model using training data is also difficult, due to a large number of training data required.Little previous work has solved this problem satisfactorily.

International Journal of Distributed Sensor Networks
In this paper, we propose a novel approach to solve this difficult problem.We first break down audio clips into a collection of simple "audio elements." In this paper, audio elements are defined as short audio clips, and each represents a basic audio type (such as speech and music).We train these audio elements offline using statistical learning.Then, we train a weighted association graph using the trained audio elements online.This greatly reduces the amount of online training without sacrificing accuracy.
To further reduce the energy consumption of our detection method, we introduce a feature extraction method based on discriminating principle component analysis, DPCA.By using DPCA, we can reduce the dimension of the feature vector and distinguish different audio elements accurately.
The rest of this paper is organized as follows.In Section 2, we introduce some related works.In Section 3 we describe the system architecture briefly.Section 4 presents the feature extraction method.In Section 5, we present the weighted association graph-based element detection method.In Section 6, we show the experimental results.We conclude the paper and discuss the future works in Section 7.

Feature Extraction.
Feature extraction is one of the most fundamental and important issues in audio element detection.The general methodology of audio classification involves extracting discriminatory features from the audio data and feeding them to a pattern classifier [8].
Pfeiffer [9] presented a theoretical framework and application of automatic audio content analysis using some audio features in frequency domain.Nitanda et al. [10] presented a speech/music classifier based on simple features, such as zero-crossing rate and short-time energy for radio broadcast.Tzanetakis and Cook [11] proposed a novel method to classified music files into different musical genres by using lowlevel audio features.More specifically, the authors proposed three feature sets for representing timbral texture, rhythmic content, and pitch content.However, these simple featuresbased method focused more on music and speech signals and also cannot offer satisfactory results particularly when more audio classes are taken into consideration.
In recent years, a lot of audio features are introduced to detect more types of audio elements which include zero crossing rate [3], mel-frequency cepstral coefficients (MFCCs) [12], spectral similarity [8], and linear prediction coefficientderived cepstral coefficients (LPCCs) [9].However, we cannot use all the features to detect the audio events.On the one hand, the increase of the feature dimension may lead the increase of the computational complexity.On the other hand, the redundancy of the features may affect the detecting accuracy.
Some researchers reduce the feature dimension through the subspace analysis approaches.In pattern recognition, principal component analysis (PCA) [13] and linear discriminant analysis (LDA) [14] are the most popular subspace analysis approaches to learn the low-dimensional structure of high dimensional data, and they are widely used in the field of audio feature selection.
Principal component analysis, which is also known as Karhunen-Loeve (KL) transform, is a classical statistic technique that has been applied to many fields, such as knowledge representation, pattern recognition, and image compression.The objectives of PCA are to reduce the dimensionality of the dataset and identify new meaningful underlying variables.The key idea is to project the objects to an orthogonal subspace for their compact representations.However, the PCA does not pay any particular attention to the underlying class structure.
Linear discriminant analysis method searches for those vectors in the underlying space that best discriminate among classes (rather than those that best describe the data).More formally, given a number of independent features relative to which the data is described, LDA creates a linear combination of these which yields the largest mean differences between the classes.LDA is usually used in the fields of face recognition and auditory scene analysis for TV programs.However, in the audio sensor networks, the audio elements often happen in more complex situations and the samples of some special elements are difficultly acquired (e.g., car crashing in the streets).Moreover, the characteristic distributions of the testing data may have some differences with the training data.
Taking all of the data into account, PCA will compute a vector that has the largest variance associated with it.On the other hand, LDA will compute a vector which best discriminates between the classes.The PCA method is obviously of advantage to feature extraction but does not consider the separability of various classes.LDA can just make up for the deficiency of PCA.In this paper, we proposed a discriminating principle component analysis method which combines PCA and LDA together.The proposed method possesses the advantages of PCA and LDA while compensating the drawbacks of each individual; as a result, it can make good effect for feature extraction.

Audio Element Modeling and Detection.
The audio element detection method can be divided into threshold-based method and statistical learning-based method.
Lu and Hankinson [15] used a threshold-based heuristic classification method to classify an audio signal into speech, music, and noise.For each feature, a threshold is set to determine the segment type and the feature set includes silence ratio, centroid, harmonicity, and pitch.However, since the feature threshold must change for different types of audio, the threshold-based classifier is tedious and not ideal.
Some statistical learning methods such as neural networks (NN) and support vector machines (SVMs) have been used to detect the audio elements.Mitra and Wang [7] extracted 26-D features including LPC and MFCCs to separate the audio clips into different audio types such as speech, music, and background noise.Artificial neural networks (ANNs), specifically multilayered perceptrons (MLPs), are implemented to perform the classification task.Lu et al. [16] proposed a content-based audio classification and segmentation method by using support vector machines.However, these classifiers are static classifiers, and they cannot deal with the continuous signal (e.g., audio signal) satisfactorily.
In [14,15], several audio highlight events in the movies, such as applause, laughter, and cheer, are modeled by Hidden Markov Models (HMMs).In the detecting stage, audio features of each audio segment are extracted to be the inputs of these three highlight event models, and the highlight events in an audio clip are detected via a decision algorithm.However, these works were designed for TV programs and cannot be deployed on the audio sensor networks directly.
These methods had some common characteristics.
(1) In the above works, the audio stream is presegmented into some short audio clips with a given length.Then each audio clip will be recognized as a basic audio type.For example, in [16], the input audio stream is first segmented into units of 0.5 s with 0.125 s overlapping, then each unit is classified into applause, cheer, music, speech, and speech with music.However, in practice, multiple audio events may occur simultaneously, and a unit may contain several types of audio elements.(2) The work using HMMs to detected audio elements usually classified the unit into the class which has the maximum log-likelihood score.However, in the monitoring systems, the target audio elements are usually sparsely distributed, and there are many nontarget sounds (e.g., the sound of speech in the traffic monitoring systems) which should be rejected in detection.Furthermore, the detection errors for different audio elements have different risk, for example, the missed detection risk for the sound of explosion is much higher than that of speech.We should find a method to minimize the detection risk as much as possible.
To solve the first problem, we propose a weighted association graph-based element detection method to detect the audio element in audio sensor networks.By using this method, one audio clip is considered as a series of audio elements, rather than as only one audio element.
To solve the second problem, we introduce the minimum risk Bayesian decision to solve the second problem.By this way, we can exclude the nontarget sounds from the target key element sequence and minimum the decision risk.Moreover, since it is easy for a single audio senor to make detection errors, we combine the sensory information of the same area by using a data fusion algorithm to increase the detection accuracy.

System Description
As shown in Figure 1, the audio stream can be divided into several audio clips according to the time duration.In practice, multiple audio elements may occur simultaneously in one audio clip.In certain controlled environment (such as movies), it is relatively easy to separate multiple audios by ICA (e.g., [3]).However, in real-world situations (such as on a noisy street), it is hard to separate multiple audios.Furthermore, building a model using training data is also difficult, due to a large number of training data required.Little previous work has solved this problem satisfactorily.Thus, we propose a weighted association graph-based element detection method to detect the audio element in the audio sensor networks.In this paper, audio elements are defined as short audio clips, and each represents a basic audio type (such as speech and music).In this method, we introduce the weighted association graphs to detect the audio element in audio sensor networks.First, we train the commonly occurred audio elements by HMM, respectively.Then, the transition probabilities between the basic audio elements were set by some specific rules.It actually constructs a higher-level probabilistic model, weighted association graph model, for audio event detection.The advantage of the weighted association graph is that we can effectively train each component separately.Moreover, this scheme keeps the advantage of framework flexibility in various applications.That is, when new basic audio effects are added in or removed from the association graphs, only the graphs need to be redefined, without any extra system retraining.
Feature extraction is one of the basic problems in audio element detection.The features should provide sufficient representations of the audio elements, as well as adequate discrimination among various basic audio elements.In the weighted association graph-based audio element detection method, a discriminating principal component analysis feature extraction method, DPCA, is proposed to improve the description of each basic audio element and to distinguish different audio elements.
As show in Figure 2, to detect the audio element, the audio sensors capture the audio signal in the environment, and then the feature vector of each frame, extracted by using the DPCA method, is passed through the above hierarchical structure.The log-likelihood scores of the audio clip with respect to the weighted association graphs and the optimal sequence of the basic audio elements are then calculated by the Viterbi algorithm.Then, the cluster head fuses the information collected by the sensor nodes in the cluster and make the final decision by using the minimum risk Bayesian decision algorithm.

Discriminating Principal Component Analysis Feature Extraction Method
Feature selection is one of the most fundamental and important issues in audio element detection.In the audio sensor networks, we should limit the dimension of the feature vector and reduce the redundancy of the features to save the posterior computing energy.In this section, we introduce a discriminating principal component analysis feature extraction method to further reduce the energy consumption of our audio element detection method.[1,3,4] and have been proved effective in detecting various key audio elements.Based on these works, in this paper, both temporal features and spectral features are extracted for each audio frame.The temporal features consist of short-time energy (STE) and zero-crossing rate (ZCR), and the spectral features consist of band energy ratios (BERs), frequency centroid, bandwidth, and Mel-frequency cepstral coefficients (MFCCs).In this paper, the spectral domain is equally divided into four frequency intervals, and the energy in each sub-band is then normalized by the whole spectrum energy.After Fourier transformation, frequency centroid and bandwidth are related to the first and second-order statistics of the spectrogram, respectively.MFCCs are subband energy features in Mel-scale, which give a more accurate simulation of the human auditory system.In this paper, 16-order MFCCs and its first-order differential coefficient (ΔMFCC) are extracted for the audio element detection.

Discriminating Principal Component Analysis.
Given a dataset that consists of  audio samples,  = { 1 ,  2 , . . .,   }, the samples can be divided into  different audio classes.
The total scatter matrix of the projected samples,   , is defined as the following: where  represents the mean feature vector of all samples in the training set and   is the th sample's feature vector.
The between-class scatter matrix,   , is defined as the following: where   is the mean feature vector of the samples in class , and   is the number of samples in class .
The within-class scatter matrix,   , is defined as the following: where,   is the th sample in class .
It follows from the definition that   =   +   .
(1) Objective function The objective function of PCA is defined as follows: The objective function of LDA is defined as follows: In this paper, we introduce the discriminating principal component analysis to extract the pivotal and independent audio features.The objective function of DPCA is defined as follows: where  is an adaptive parameter.
When  = 0, then and it is the objective function of LDA.
When  is large enough, has no infection with the solution of  opt , and tr[(  ) −1     ] can be obtained by substituting   =   into the objective function of PCA, and the recognition result by using   is similar to using   .This method is the combination of PCA and LDA, which extracts the best representative audio features and enhance the determining ability through analyzing the categories of training samples.By using this method, we can extract the pivotal and independent features.
(2) Solution of  opt The crux of DPCA is finding the  opt to maximize the value of   ().

That is,
From ( 6) and ( 10), we have That is, Then, we diagonalize   (  + ) and      through the following linear transformations: We have From ( 12) and ( 14), we have Let  = , and substitute equation  =  −1 into (15): It is a typical eigenvector-eigenvalue problem.Since, Let  = (  + ) −1   , then we can obtain  opt as follows: where International Journal of Distributed Sensor Networks Then the original -dimensional feature space is mapped to a -dimensional feature space ( < ) through the linear transformation  opt : (3) Determining the value of  The value of the adaptive parameter  is a crucial issue in DPCA.According to extensive experiments, we obtain that the effects on experimental results caused by variation of  can be neglected when the value of  is in a proper range.

Audio Event Modeling and Detection
Model establishing is extremely critical to the detection performance.In practice, multiple audio elements may occur simultaneously.In certain controlled environment (such as movies), it is relatively easy to separate multiple audios by ICA (e.g., [3]).However, in complex real-world situations (such as on a noisy street), it is hard to separate multiple audios.Furthermore, building a huge model using training data is also difficult, due to a large number of training data required.Little previous work has solved this problem satisfactorily.Thus, we propose a weighted association graph-based method to detect the element in audio sensor networks, in which we train the commonly occurred audio elements, respectively.Then, the transition probabilities between the basic audio elements were set by some specific rules.The advantage of this method is that we can effectively train each component separately.Moreover, this scheme keeps the advantage of framework flexibility in various applications.That is, when new basic audio effects are added in or removed from the event groups, only the graphs need to be redefined, without any extra system retraining.

Basic Audio Element Modeling. Since Hidden Markov
Model has proved effective in many previous works for audio classification [4,11], in this paper, we choose HMM for audio elements modeling and select the left-to-right structure as the topology of each HMM.
The HMM model for the th basic audio element (  ) is characterized as follows: is the number of states in the model of   ,   is the number of distinct observation symbols per state,   is the state transition probability distribution matrix,   is the probability distribution matrix of the observation symbol, and Π  is the initial state distribution matrix.Model size is a crucial issue in modeling HMMs.The state number should be large enough to characterize the variations of features.On the other hand, we need to simplify the computational complexity of model training and detection processes.We collect enough samples to estimate a reasonable model size of each audio element.
In this paper, we build the HMM for 12 basic audio elements, they are: car-engine, bus-engine, klaxon, car-braking, crashing, siren, speech, music, applause, laughter,s and cheer.The state number is set as shown in Table 1.These results make sense because we elaborately collect various kinds of samples for each audio element, and the experiments show the state number setting can achieve satisfied detection accuracy.
For each basic audio element, some short audio clips with the length of 3-10 seconds are selected as the training data.
During training, the parameters for each state of an audio model are estimated by parsing the feature vectors of the training set.The Baum-Welch algorithm is then applied to estimate the transition probabilities between states and the observation probabilities in each state.

Weighted Association Graph.
We notice that some basic audio elements usually occur together, while others seldom happen simultaneously.For instance, the sound of carcrashing often happens with car-braking, but rarely takes place with the sound of laugher.In some previous studies, the researchers have use the audio elements to indicate some special audio scene.For example, in [4,5], car-engine, busengine, and klaxon are associated with traffic scene, and crashing and siren are used to indicate the abnormal events.Based on the above analysis, we introduce some audioweighted association graphs to store the basic elements that always happen together.
According to [4,5,8], we divide the audio elements into two types: key audio element and background audio element.The key audio elements and the background audio element are set according to the monitor environment.We label the audio element which we focus on as the key audio element, and label the element we do not pay much attention to as the background elements.
A weighted association graph is composed of several key audio elements and background audio elements.
For a given graph , let  be the set of the basic elements which always happen together: where   is the th basic audio element in the set.Then, we define the weighted association graph model as follows: where  = {⟨  ,   ⟩ |   ,   ∈  and   }, where   is the transition probability from   to   .Next, we will device the value   in detail.
In this paper, we define 9 weighted association graphs based on 11 basic elements.The weighted association graphs and the related basic audio elements are listed in Table 2.
We assume that the following.
(1) A key audio element can only transfer to the basic audio elements in the same graph.
A background element can transfer to the key audio elements in the same graph and other background elements.
(3) One basic audio element can belong to several weighted association graphs.
Given a basic audio element   , we define its subsequent set, Φ(  ), as a set of all the basic audio elements which   can transit to, that is, where   is the th weighted association graph that   belongs to.
To avoid the training problem and enhance the detection flexibility, in this paper, the transition probabilities from   to the audio elements in its subsequent set are set to be the same.
Thus, for a given basic audio element   , for all   ∈ Φ(  ), the transition probability from   to   can be set as follows: To this end, we have built the model for each audio weighted association graph by connecting the audio effect models through some specific rules.Figure 3 gives an example of the model for the weighted association graph T2.
In the testing process, the audio frames are estimated by each weighted association graph model, the Viterbi algorithm is used to compute the most likely state sequence for each  model and the log-likelihood value with respect to each weighted association graph will be calculated for each frame.
The vector composed of the log-likelihood values is used as the input of the data fusion stage.

Data Fusion and Decision
Making.After collecting information from the sensor nodes, the cluster head should fuse the information collected in the same monitoring area.Consider a cluster with  ( > 1) audio sensors.Let   be the vector of  log-likelihood collected by the th audio sensor node: where  is the number of the weighted association graphs we paid attention to,   is the log-likelihood score under the model of the th weighted association graph.After collecting information from all the sensor nodes of the monitoring area, the cluster head should fuse the sensory information as follows: where   is the weight of the th sensor node and the weights satisfy In this paper, the weight values are measured by experiment data.
After data fusion, the fused log-likelihood value vector will be calculated for each audio frame.Then, the most important issue is how to make decision based on the loglikelihood vector.In the monitoring systems, the target audio elements are usually sparsely distributed, and there are many nontarget sounds (e.g., the sound of speech in the traffic monitoring systems) which should be rejected in detection.Furthermore, the missed detection for different audio elements have different risk.For example, the missed detection for the sound of explosion has the higher risk than that of speech.Based on the above analysis, we cannot simply classify the audio clip into the weighted association graph which has the maximum log-likelihood score.In this paper, an optimal decision is made based on the minimum risk Bayesian decision theory.
Let  be observed audio clip and  fused be its log-likelihood value vector after data fusion.
We define the following.
2 :  does not belong to   .
Let (, ) be the risk factor for making the decision of   when the fact is   .
Then, the risk for the decision  1 can be described as the following: The risk for the decision  2 can be described as the following: We define   as the decision factor of the audio clip  under the weighted association graph   , It is obvious that, if   < 1, the risk of  1 is larger than the risk of  2 , and the larger   is the more probability of  1 .Then, we will introduce how to calculate   in detail.
Since (, ) = 0, (30) can be rewritten as the following: So (31) can be rewrite as the followings: graph model in audio element detection, 9 different kinds of weighted association graphs are selected for analysis.The weighted association graphs and the related basic audio element are listed in Table 2.The traffic data set (95.4 minutes) is collected on a busy road outside.The celebration data set (73.6 minutes) was collected in university hall.The experimental results are shown in Figure 6.
From Figure 6, we can see that because the real-world situations are more complex, and multiple audio events always happen together, so both the HMM-based method and SVM-based method need a large number of training data to build the audio event models.By using the weighted association graph based algorithm, we combine the statistical learning and human knowledge together and reduce the training complexity for complex real-world situations.When the number is the same, the weighted association graphbased algorithm can achieve higher detection accuracy than the other two methods.Moreover, this scheme keeps the advantage of framework flexibility in various applications.That is, when new basic audio elements are added in or removed from the association graphs, only the graphs need to be redefined, without any extra system retraining.

Evaluation of the Bayesian Decision Algorithm.
In this experiment, we compare the Bayesian decision basedalgorithm with the max-log-likelihood-based algorithm.
As shown in Table 6, for traffic environment monitor, the detection precision of using Bayesian decision is much better than that of max log-likelihood.For celebration, the detection precision of various audio elements is better than of max loglikelihood.Because there exists some non-predefined audio elements in real-world audio monitor, Bayesian decision can remove these non-predefined audio elements, which

Figure 1 :
Figure 1: The segmenting method for an audio stream.

Figure 2 :
Figure 2: The process of the weighted association graph-based audio element detection method.It is mainly composed of three parts: (I) the working process of the audio sensor node; (II) the working process of the cluster head; (III) the off-line training process.

Figure 3 :
Figure 3: The model of the weighted association graph T2.

InternationalFigure 4 :
Figure 4: The structure of the selected cluster.

Figure 6 :
Figure 6: The detection results for the three audio element detection methods.

Table 1 :
The number of HMM states of the basic audio elements.

Table 2 :
The weighted association graphs and the related basic audio elements.

Table 4 :
The average recall by using three different feature extraction methods (PCA, LDA, and DPCA).

Table 5 :
The average precision by using three different feature extraction methods (PCA, LDA, and DPCA).

Table 6 :
The audio element detection results by using Bayesian decision based algorithm and the max-log-likelihood-based algorithm.