Nonnegative Matrix Factorization Based Adaptive Noise Sensing over Wireless Sensor Networks

An adaptive noise sensing method is proposed to improve the speech sensing performance of speech-based applications operated over wireless sensor networks. The proposed method is based on nonnegative matrix factorization (NMF) and consists of adaptive noise sensing and noise reduction. Specifically, adaptive noise sensing is performed by adapting the a priori noise basis matrix of the NMF, which is estimated from the noise signal, resulting in an adapted noise basis matrix. Subsequently, the adapted noise basis matrix is used for the NMF decomposition of noisy speech into clean speech and background noise. The estimated clean speech signal is then applied to a front-end of the speech-based applications. The performance of the proposed NMF-based noise sensing and reduction method is first evaluated by measuring the source to distortion ratio (SDR), the source to interferences ratio (SIR), and the source to artifacts ratio (SAR). In addition, the proposed method is applied to an automatic speech recognition (ASR) system, a typical speech-based application, and the average word error rate (WER) of the ASR is compared with that obtained when employing either a Wiener filter or a conventional NMF-based noise reduction method that uses only the a priori noise basis matrix.


Introduction
Speech-based user applications operated over wireless sensor networks are increasingly being utilized in various environments, for example, smart homes, smart TVs, and cars, since they have become a key feature of smart user interfaces [1-4]. However, as the number of speech-based application fields has increased, various types of background noise can negatively affect the speech sensing performance of the applications deployed over wireless sensor networks. These background noises can be classified into two types, stationary and nonstationary, depending on the variability of their characteristics over time. Many conventional methods, including spectral subtraction [5], minimum mean square error log-spectral amplitude (MMSE-LSA) [6, 7], and Wiener filtering [8], have been reported to effectively reduce stationary noise recorded with speech signals. Consequently, they have been successfully applied to the front-end of an automatic speech recognition (ASR) system, a typical speech-based application over wireless sensor networks [9]. However, since these conventional methods were developed under the stationary noise assumption, their performance can degrade under nonstationary noise conditions [10, 11]. Thus, the reduction of nonstationary noise is important for reliable noise-robust speech-based applications over wireless sensor networks under various kinds of environmental noise conditions.
As an alternative, nonnegative matrix factorization (NMF)-based noise reduction methods have been proposed to estimate the noise spectrum effectively under nonstationary noise conditions [12-16]. In particular, recent research has reported that NMF-based noise reduction methods have been successfully applied to the front-end of an ASR system under various nonstationary noise environments [14-16]. However, the performance of NMF-based noise reduction methods degrades substantially when there is a mismatch in noise type between the noise basis training and estimation using NMF [17, 18]. Several approaches have been proposed to improve the noise reduction performance of NMF under a mismatch between the training and estimation of the speech and/or noise basis [19-21]. In particular, the real-time semisupervised source separation methods in [19, 20] assumed that the noise basis was prepared at the training stage, while the speech basis was learned online under nonstationary noisy conditions. These methods dealt with the mismatch in speech basis training and estimation, but they did not consider the mismatch between the noise basis training and estimation. In [21], a universal model was introduced to overcome the mismatch between the noise basis training and estimation, where each noise basis was trained to represent a certain type of noise source. Consequently, the performance of the universal model would be limited because noise can be represented by multiple overlapping sources in a real-world environment [18].
In this paper, an NMF-based adaptive noise sensing and reduction method is proposed to improve the performance of an ASR system under a mismatch in noise type between the noise basis training and estimation. The proposed method adaptively updates the a priori noise basis matrix of the NMF on the fly by estimating the noise signal prior to the actual speech signal. Next, NMF decomposition is carried out with the adapted noise basis matrix in order to estimate clean speech and background noise from the noisy speech. Finally, the estimated clean speech signal is applied to the front-end of an ASR system in order to improve the performance of ASR under various noise types.
The rest of this paper is organized as follows. Following this introduction, Section 2 briefly reviews a conventional NMF-based noise reduction method. Section 3 proposes an NMF-based adaptive noise sensing and reduction method. Section 4 evaluates the performance of the proposed method and compares it with those of the conventional methods. Finally, Section 5 concludes the paper.

Conventional NMF-Based Noise Reduction Method
Figure 1 shows the procedure of a conventional NMF-based noise reduction method applied to an ASR system. As shown in the figure, noisy speech is captured by a microphone; then a block of the speech signal, called a speech frame, is transformed into the frequency domain by applying a short-time Fourier transform (STFT). Next, an NMF technique is applied to estimate the clean speech spectrum by reducing the noise spectrum. The estimated clean spectrum is then transformed back into the time domain by applying an inverse STFT. Finally, an ASR system, which is typically based on hidden Markov models [22], operates on the feature parameters extracted from this estimated clean speech. As illustrated in Figure 1, an NMF-based noise reduction method attempts to decompose a noisy speech signal into separate speech and noise signals by exploiting the sparseness of the noisy speech [18]. To explain how speech and noise are estimated with NMF, noisy speech at the τ-th speech frame, y_τ(n), is represented as

y_τ(n) = s_τ(n) + d_τ(n),  (1)

where s_τ(n) and d_τ(n) are clean speech and additive noise at the τ-th speech frame, respectively, and d_τ(n) is assumed to be uncorrelated with s_τ(n). By applying an STFT to (1), y_τ(n) can be represented by its spectral components as

Y_k(τ) = S_k(τ) + D_k(τ),  (2)

where Y_k(τ), S_k(τ), and D_k(τ) denote the k-th spectral components of y_τ(n), s_τ(n), and d_τ(n), respectively [23]. Collecting the spectral magnitudes into matrices and factorizing them as in (3), the basis and activation matrices are obtained through the multiplicative update rules for the Kullback-Leibler (KL) divergence [25]:

B_S^{i+1} = B_S^i ⊗ [(S ⊘ (B_S^i A_S^i)) (A_S^i)^T] ⊘ [1 (A_S^i)^T],  (4)
A_S^{i+1} = A_S^i ⊗ [(B_S^{i+1})^T (S ⊘ (B_S^{i+1} A_S^i))] ⊘ [(B_S^{i+1})^T 1],  (5)
B_D^{i+1} = B_D^i ⊗ [(D ⊘ (B_D^i A_D^i)) (A_D^i)^T] ⊘ [1 (A_D^i)^T],  (6)
A_D^{i+1} = A_D^i ⊗ [(B_D^{i+1})^T (D ⊘ (B_D^{i+1} A_D^i))] ⊘ [(B_D^{i+1})^T 1],  (7)

where i is an iteration index and both the multiplication, ⊗, and the division, ⊘, are applied on an element-by-element basis. In addition, 1 is a K × T matrix in which all elements are equal to unity. Note that all elements of B_S^0, A_S^0, B_D^0, and A_D^0 can be set initially to random values between 0 and 1.
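The multiplicative updates of (4)-(7) can be sketched in a few lines of NumPy. This is a minimal illustration under the paper's KL-divergence objective; the function name, matrix sizes, and iteration count are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def kl_nmf(V, R, n_iter=200, eps=1e-9, seed=0):
    """Factorize a nonnegative (K x T) matrix V as B @ A using the
    KL-divergence multiplicative updates of (4)-(7); B is K x R, A is R x T."""
    rng = np.random.default_rng(seed)
    K, T = V.shape
    # All elements initialized to random values between 0 and 1, as in the text
    B = rng.uniform(0.0, 1.0, (K, R)) + eps
    A = rng.uniform(0.0, 1.0, (R, T)) + eps
    ones = np.ones((K, T))  # the K x T all-ones matrix "1"
    for _ in range(n_iter):
        B *= ((V / (B @ A + eps)) @ A.T) / (ones @ A.T + eps)   # basis update
        A *= (B.T @ (V / (B @ A + eps))) / (B.T @ ones + eps)   # activation update
    return B, A
```

The same routine serves for both the speech pair (4)-(5) and the noise pair (6)-(7), with V set to S or D.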
In the NMF training, B_S is obtained by repeating (4) and (5) until the relative reduction of the NMF objective function between iterations falls below a predefined threshold. In this paper, the Kullback-Leibler (KL) divergence is employed as the NMF objective function [25]. Similarly, B_D is obtained by repeating (6) and (7), and the estimation process is terminated based on the same KL divergence criterion.
As described above, the conventional noise reduction methods are performed using the trained bases B_S = B_S^{i_S} and B_D = B_D^{i_D}, where i_S and i_D are the final iterations of the NMF training. Then, the activation matrices, A_S and A_D, are calculated by the multiplicative NMF update rule

A^{i+1} = A^i ⊗ [B^T (Y ⊘ (B A^i))] ⊘ [B^T 1],  (8)

where B = [B_S B_D] is held fixed, A = [A_S^T A_D^T]^T, and all elements of A_S^0 and A_D^0 are also set to random values between 0 and 1. Similar to the NMF training, (8) is repeated until the NMF objective function converges. Subsequently, Ŝ is set to B_S A_S^{i_f}, where i_f is the last iteration of (8). Finally, an estimate of clean speech, ŝ_τ(n), is obtained by applying an inverse STFT to Ŝ, and it is fed to the front-end of an ASR system.
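Under the same assumptions, the decomposition step of (8), where activations are updated while the trained bases stay fixed, can be sketched as follows (the function name and sizes are illustrative):

```python
import numpy as np

def nmf_decompose(Y, B_s, B_d, n_iter=200, eps=1e-9, seed=0):
    """Estimate activations for the noisy magnitude spectrogram Y with the
    trained bases held fixed, as in (8), then split the model into speech
    and noise estimates."""
    rng = np.random.default_rng(seed)
    B = np.hstack([B_s, B_d])                  # fixed concatenated basis
    A = rng.uniform(0.0, 1.0, (B.shape[1], Y.shape[1])) + eps
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        A *= (B.T @ (Y / (B @ A + eps))) / (B.T @ ones + eps)
    R_s = B_s.shape[1]
    return B_s @ A[:R_s], B_d @ A[R_s:]        # (S_hat, D_hat)
```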
However, the main drawback of the conventional NMF-based noise reduction methods is that the noise reduction performance is not reliable when there is a mismatch in noise types between the noise basis training and estimation using NMF. In other words, the trained noise basis matrix, B_D, is inappropriate for the ASR test. To overcome this problem, B_D should be updated during the ASR test.

Proposed NMF-Based Adaptive Noise Sensing and Reduction
In this section, an NMF-based adaptive noise sensing and reduction method is proposed to mitigate the degradation of noise reduction when there is a mismatch in noise types between noise basis training and estimation using NMF.
Figure 2 shows the procedure of the proposed NMF-based adaptive noise sensing and reduction method. As shown in the figure, the procedure is divided into three processing stages: a priori NMF basis modeling, NMF-based adaptive noise sensing, and noise reduction. The first processing stage of the proposed method is the same as that of the conventional method described in Section 2. In other words, clean speech signals and noise signals are separately applied to the NMF training in order to obtain the a priori basis matrices. In the second processing stage, adaptive noise sensing is performed to decompose the noisy input spectrum into speech and noise spectra using the a priori speech basis matrix estimated in the first processing stage. That is, the noise basis and activation matrices are obtained by adapting the a priori noise basis to the instantaneous noise frames of the noisy input signal. Finally, the third processing stage estimates the noise-reduced speech signal by constructing a Wiener filter [8] using the adaptively estimated noise spectrum. The following subsections describe the a priori NMF basis modeling, the NMF-based adaptive noise sensing, and the noise reduction in detail.

Modeling of A Priori NMF Basis Matrices. Given the previously stored speech and noise databases, S and D, the basis and activation matrices for speech and noise, B_S, A_S, B_D, and A_D, are obtained by iterating equations (4)-(7). Notice that the dimensions of B_S, A_S, B_D, and A_D are K × R_S, R_S × T, K × R_D, and R_D × T, respectively, where R_S and R_D are the numbers of bases for S and D. In this paper, K, T, R_S, and R_D are set to 257, 2000, 80, and 40, respectively. In particular, as a termination condition for the NMF iteration, the divergence cost function [25] for (4) and (5) is defined as

Div(S; A_S^i, B_S^i) = Σ [S ⊗ log(S ⊘ (B_S^i A_S^i)) − S + B_S^i A_S^i],  (9)

where the summation notation, Σ, means adding all the elements of a matrix together. Accordingly, the iteration of (4) and (5) is terminated if |Div(S; A_S^i, B_S^i) − Div(S; A_S^{i−1}, B_S^{i−1})| / Div(S; A_S^{i−1}, B_S^{i−1}) < ε, where ε is set to 0.001. Finally, B_S and B_D are applied to the NMF-based adaptive noise sensing stage, which is explained in the next subsection.
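The divergence cost of (9) and the relative-change stopping rule can be written down directly; a small sketch, where the function names and tolerance handling are illustrative assumptions:

```python
import numpy as np

def kl_div(V, B, A, eps=1e-9):
    """KL divergence cost of (9): summed over all matrix elements."""
    V_hat = B @ A + eps
    return float(np.sum(V * np.log((V + eps) / V_hat) - V + V_hat))

def converged(div_prev, div_curr, tol=1e-3):
    """Relative-change stopping rule: |Div_i - Div_{i-1}| / Div_{i-1} < tol."""
    return abs(div_curr - div_prev) / div_prev < tol
```

With tol = 0.001 this reproduces the ε used at the training stage; the looser adaptation threshold discussed later simply raises tol.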

NMF-Based Adaptive Noise Sensing. This subsection describes how the noise spectrum is adapted within the NMF framework. First, noise frames are detected from the noisy input speech. Then, the detected noise frames are concatenated to construct a noise matrix, D̃, which is used to adapt B_D, the noise basis matrix estimated a priori as described in Section 3.1.
Specifically, the ratio of the speech and noise magnitudes for the k-th frequency bin at the τ-th frame, R_k(τ), is first calculated as in (10), and a frame-level measure is obtained from it as in (11). Using (11), a set of noise frames, I_D, is selected which satisfies

I_D = {τ | R̄(τ) < θ},  (12)

where R̄(τ) denotes the frame-level measure of (11) and θ is a threshold for detecting noise frames. In this paper, θ is determined by considering the mean and variance of R̄(τ) such that

θ = μ_R + λ σ_R,  (13)

where μ_R and σ_R are the mean and standard deviation of R̄(τ) over the initial frames. For the mean and variance calculation, the first N frames of noisy speech are assumed to be noise frames, and N = 20 in this paper. In addition, λ in (13) is set so that approximately 80% of the initial N frames are included in I_D. As a result, a noise binary mask for the k-th frequency bin at the τ-th frame, M_k(τ), can be defined as

M_k(τ) = 1 if τ ∈ I_D, and M_k(τ) = 0 otherwise,  (14)

and a noise matrix is estimated as D̃ = M ⊗ Y, where M is the (K × T) noise mask matrix constructed by (14).
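The noise-frame detection described above can be sketched as follows. The frame-averaged form of (11) and the quantile-based choice of λ are assumptions consistent with the "approximately 80% of the initial N frames" rule, not the authors' exact procedure.

```python
import numpy as np

def sense_noise_matrix(S_mag, D_mag, Y_mag, n_init=20, keep_ratio=0.8):
    """Sketch of noise sensing: the speech-to-noise ratio R_k(tau) of (10)
    is averaged per frame, thresholded at mean + lambda*std of the first
    n_init frames, and the binary mask of (14) yields D_tilde = M (x) Y."""
    R = S_mag / (D_mag + 1e-9)                 # per-bin ratio, as in (10)
    r = R.mean(axis=0)                         # frame-level measure (assumed form of (11))
    mu, sigma = r[:n_init].mean(), r[:n_init].std()
    # Choose lambda so ~keep_ratio of the initial frames fall below theta
    lam = np.quantile((r[:n_init] - mu) / (sigma + 1e-9), keep_ratio)
    theta = mu + lam * sigma                   # threshold of (13)
    mask = r < theta                           # noise-frame set I_D
    M = np.tile(mask, (Y_mag.shape[0], 1))     # binary mask of (14)
    return M * Y_mag                           # sensed noise matrix D_tilde
```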
Next, B_D is adapted using D̃ by the following iterative equations:

B_D^{j+1} = B_D^j ⊗ [(D̃ ⊘ (B_D^j A_D^j)) (A_D^j)^T] ⊘ [1 (A_D^j)^T],
A_D^{j+1} = A_D^j ⊗ [(B_D^{j+1})^T (D̃ ⊘ (B_D^{j+1} A_D^j))] ⊘ [(B_D^{j+1})^T 1],  (15)

where B_D^j and A_D^j are the adapted basis and activation matrices of noise at the j-th iteration. As an initial condition for (15), all elements of A_D^0 can be set to random values between 0 and 1, whereas B_D^0 = B_D, the a priori noise basis matrix. Similar to the NMF training described in (4)-(7), the procedure of (15) is terminated when the relative change in the KL divergence falls below a threshold, ε_Adapt. As a final processing step for the adaptation, NMF decomposition is performed in order to calculate A_S and A_D. To this end, a multiplicative update rule with B_S and the adapted B_D is applied, where B_S is the basis matrix obtained in Section 3.1. That is, the NMF decomposition iterates

A^{j+1} = A^j ⊗ [B^T (Y ⊘ (B A^j))] ⊘ [B^T 1],  (16)

where B = [B_S B_D], A = [A_S^T A_D^T]^T, and all elements of A_S^0 and A_D^0 are set to random values between 0 and 1. The termination condition is also defined in terms of Div(Y; ·). Thus, A_S and A_D are set to A_S^j and A_D^j, respectively, when the procedure of (16) is terminated at the j-th iteration. Finally, Ŝ and D̂ are obtained by Ŝ = B_S A_S and D̂ = B_D A_D, respectively, and they are used for noise reduction, which is explained in the next subsection.
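The adaptation step of (15), warm-started from the a priori basis, can be sketched as below. The small iteration count stands in for the looser ε_Adapt threshold; the function name and count are illustrative assumptions.

```python
import numpy as np

def adapt_noise_basis(D_tilde, B_d_prior, n_iter=10, eps=1e-9, seed=0):
    """Adapt the a priori noise basis to the sensed noise matrix, as in (15).
    B_D^0 = B_D (warm start); only a few iterations are run so the basis
    does not end up representing only D_tilde."""
    rng = np.random.default_rng(seed)
    B = B_d_prior.copy()                       # warm start from the prior basis
    A = rng.uniform(0.0, 1.0, (B.shape[1], D_tilde.shape[1])) + eps
    ones = np.ones_like(D_tilde)
    for _ in range(n_iter):
        B *= ((D_tilde / (B @ A + eps)) @ A.T) / (ones @ A.T + eps)
        A *= (B.T @ (D_tilde / (B @ A + eps))) / (B.T @ ones + eps)
    return B, A
```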

Noise Reduction. This subsection describes how noise is reduced from the noisy input speech using the adapted noise basis of NMF, which is the third processing stage of Figure 2. First, a (K × T) noise attenuation gain, G, is calculated in a Wiener-like, element-by-element manner as

G = Ŝ ⊘ (Ŝ + Δ ⊗ D̂),

where Δ is a (K × T) noise reduction control matrix with all elements equal to a constant, δ, in order to control the degree of noise reduction by scaling D̂. In this paper, δ is set to 3 since this value provides the best noise reduction performance. Next, each column of G is applied as a transfer function of the Wiener filter to the corresponding τ-th frame of the noisy input speech, y_τ(n), resulting in an estimate of clean speech, ŝ_τ(n) [8].
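The gain stage can be sketched as follows. The magnitude-domain Wiener-type gain used here is a reconstruction consistent with the paper's additive-magnitude model, not a verbatim copy of the authors' expression; delta corresponds to the constant filling Δ.

```python
import numpy as np

def wiener_gain(S_hat, D_hat, delta=3.0, eps=1e-9):
    """Element-wise noise attenuation gain built from the NMF estimates.
    delta scales the noise estimate to control the degree of reduction."""
    return S_hat / (S_hat + delta * D_hat + eps)

# Usage: S_est = wiener_gain(S_hat, D_hat) * Y_mag, followed by an inverse
# STFT with the noisy phase to obtain the time-domain clean speech estimate.
```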

Performance Evaluation
The performance of the proposed method was first evaluated by measuring the source to distortion ratio (SDR), the source to interferences ratio (SIR), and the source to artifacts ratio (SAR) [26]. Next, the average word error rate (WER) of an ASR system employing the proposed method was measured. Finally, the performance of the proposed method was compared with those of the two-stage mel-warped Wiener filter method (Mel-WF) [8] and the NMF-based noise reduction method without noise basis adaptation (NMF-Conv) [14].
For the evaluation, 10 male and 10 female speakers each spoke 20 sentences, resulting in 400 sentences. The recording was performed in a quiet room without any reverberation. Next, each sentence was mixed with four different kinds of background noise recorded at bus stops, in restaurants, in subways, and in a living room with a TV on, where the signal-to-noise ratio (SNR) was varied from 0 to 20 dB in steps of 5 dB. The bus stop, restaurant, and subway noises were used to simulate highly stationary noise environments, while the living room noise was used to simulate a highly nonstationary noise environment in which a person was speaking while watching different genres of TV programs, such as drama, news, sports, and movies. It should be noted that the restaurant and living room noise signals were recorded in a nonreverberant room. The speech and noise signals used in the evaluation were sampled at 16 kHz with a 16-bit resolution. The a priori basis matrices for the evaluation were prepared as follows. First, the a priori basis matrix for speech, B_S, was trained for each individual speaker with a 20-second-long clean sentence. Next, the a priori basis matrix for noise, B_D, was trained with 60 seconds of cafeteria noise, where the cafeteria noise was different from the four types of background noise used for the performance evaluation.
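Test material of the kind described above can be generated by scaling the noise to a target SNR before mixing; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the mixture speech + noise has the requested SNR (dB)."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scaled = noise * np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + scaled, scaled
```

Sweeping snr_db over 0, 5, 10, 15, and 20 reproduces the evaluation grid.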

Noise Reduction Performance.
In this subsection, the noise reduction performance of the proposed method was evaluated under both nonstationary and stationary noise conditions by measuring the SDR, SIR, and SAR. As shown in (1), a noisy speech signal was composed of clean speech and noise as y_τ(n) = s_τ(n) + d_τ(n), and the estimates of s_τ(n) and d_τ(n) were obtained by using the proposed method described in Section 3. Then, the true clean signal and its estimate were related by ŝ_τ(n) = s_target(n) + e_interf(n) + e_noise(n) + e_artif(n), where e_interf(n), e_noise(n), and e_artif(n) were the errors associated with the interference, noise, and artifacts, respectively, and they were obtained through a least-squares projection [26]. By using those errors, the SDR, SIR, and SAR were defined as [26]

SDR = 10 log10 (‖s_target‖² / ‖e_interf + e_noise + e_artif‖²),
SIR = 10 log10 (‖s_target‖² / ‖e_interf‖²),
SAR = 10 log10 (‖s_target + e_interf + e_noise‖² / ‖e_artif‖²),

where ‖·‖ is the norm operator. First, Table 1 compares the SDRs, SIRs, and SARs of the proposed method and those of the conventional methods under a nonstationary noise condition, namely, the living room condition. As shown in the table, the proposed method significantly increased the average SDR, SIR, and SAR values compared to both the Mel-WF and the NMF-Conv. In particular, the proposed method achieved a dramatically higher SIR under this condition. This implies that the proposed method could provide a speech signal with significantly lower interference than the conventional methods under the nonstationary noise condition.
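Given the error components of the decomposition above, the three ratios can be computed directly; a sketch assuming the components have already been obtained by the least-squares projection of [26]:

```python
import numpy as np

def sdr_sir_sar(s_target, e_interf, e_noise, e_artif):
    """SDR, SIR, and SAR (in dB) from the BSS_EVAL error decomposition,
    given the already-projected components of the estimate."""
    def db(num, den):
        return 10 * np.log10(np.sum(num ** 2) / (np.sum(den ** 2) + 1e-12))
    sdr = db(s_target, e_interf + e_noise + e_artif)
    sir = db(s_target, e_interf)
    sar = db(s_target + e_interf + e_noise, e_artif)
    return sdr, sir, sar
```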
The performance evaluation was then repeated under three different stationary noise conditions: bus stop, restaurant, and subway noise. Table 2 shows the SDRs, SIRs, and SARs of the noise-reduced signals processed by the proposed and conventional methods under the stationary noise conditions. Similar to the results under the living room noise condition, the proposed method achieved substantially higher average SDR, SIR, and SAR values than either the Mel-WF or the NMF-Conv under all stationary noise conditions. It can be concluded that the NMF method employing the proposed noise basis adaptation performed noise reduction more effectively than the conventional methods under both the stationary and nonstationary noise conditions.
Next, the spectrograms obtained by the proposed method were compared with those by the conventional methods.
It was shown by a pairwise comparison between Figures 3(e)-3(h) and Figures 3(m)-3(p) that the noise reduction performance of the proposed method was comparable to that of the Mel-WF under stationary noise conditions including the bus stop and restaurant noise. On the other hand, the proposed method successfully reduced nonstationary noise under the living room noise condition, whereas the Mel-WF failed to handle the nonstationary noise. Furthermore, it was demonstrated by comparing Figures 3(i)-3(l) and Figures 3(m)-3(p) that the proposed method provided more distinctive speech signals than the NMF-Conv under all the noise conditions.

ASR Performance.
To evaluate the recognition performance of the proposed noise reduction method in an ASR system, a hidden Markov model (HMM)-based speech recognition system was constructed. To this end, acoustic models based on three-state left-to-right HMMs were first built from 170,000 phonetically balanced words, which were recorded in quiet rooms by 1,800 speakers. Every recorded speech signal was also sampled at 16 kHz with a 16-bit resolution. As speech recognition features, 12 mel-frequency cepstral coefficients (MFCCs) with logarithmic energy were extracted, and their delta and acceleration coefficients were concatenated, resulting in a 39-dimensional feature vector [27].
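The 39-dimensional feature construction (13 static coefficients plus their deltas and accelerations) can be sketched with the standard regression formula; the window width of 2 and the function names are illustrative assumptions.

```python
import numpy as np

def deltas(feats, width=2):
    """Regression-based delta coefficients over +/- width frames."""
    T = feats.shape[0]
    pad = np.pad(feats, ((width, width), (0, 0)), mode='edge')
    denom = 2 * sum(i * i for i in range(1, width + 1))
    d = np.zeros_like(feats, dtype=float)
    for i in range(1, width + 1):
        d += i * (pad[width + i:width + i + T] - pad[width - i:width - i + T])
    return d / denom

def make_39d(static_13):
    """Concatenate static, delta, and acceleration features -> (T x 39)."""
    d1 = deltas(static_13)
    d2 = deltas(d1)          # acceleration = delta of delta
    return np.hstack([static_13, d1, d2])
```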
Table 3 compares the average WERs of an ASR system employing the proposed method as a front-end with those of ASR systems employing the conventional methods under the nonstationary noise condition. As shown in the table, the proposed method achieved a significantly lower average WER than the conventional methods. Specifically, the proposed method reduced the average WER by a relative 65.22% and 24.21% compared to the Mel-WF and the NMF-Conv, respectively.
Second, Table 4 compares the average WERs of an ASR system employing the proposed method as a front-end with those of ASR systems employing the conventional methods under stationary noise conditions. As shown in the table, the proposed method reduced the average WER under the bus stop, restaurant, and subway noise conditions by a relative 0.93%, 11.34%, and 6.56% compared to the Mel-WF and by 12.13%, 13.10%, and 11.50% compared to the NMF-Conv, respectively. Consequently, it was concluded that the proposed method provided better ASR performance than the conventional methods under both the stationary and nonstationary noise conditions.
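The relative reductions quoted above follow the usual definition, sketched here with illustrative numbers:

```python
def relative_wer_reduction(wer_base, wer_prop):
    """Relative WER reduction (%) of the proposed front-end over a baseline:
    100 * (WER_base - WER_proposed) / WER_base."""
    return 100.0 * (wer_base - wer_prop) / wer_base
```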

Conclusion
In this paper, an NMF-based noise sensing method has been proposed to reduce stationary and nonstationary noise for speech-based applications over wireless sensor networks. The proposed method adapted the initially estimated noise basis matrix on the fly when the noisy input spectrum was applied to a front-end of a speech-based application. After constructing a Wiener filter using the estimated clean speech and noise spectra in the NMF framework, a clean speech signal was estimated and used for speech recognition. The performance of the proposed method was evaluated by measuring the SDR, SIR, and SAR. In addition, the proposed method was applied to an ASR system, and the average WER of the ASR system was evaluated. The performance of the proposed method was also compared with those of conventional methods such as the two-stage mel-warped Wiener filter method and the NMF-based noise reduction method without noise basis adaptation. As a result, it was shown that the proposed method provided better performance in terms of SDR, SIR, SAR, and WER than the conventional methods under both nonstationary and stationary noise conditions.

Figure 2: Procedure of the proposed NMF-based adaptive noise sensing and reduction method.

so that B_S = B_S^i and A_S = A_S^i, where, in this paper, ε is set to 0.001 from preliminary experiments at the training stage. Similarly, B_D = B_D^j and A_D = A_D^j are obtained from (6) and (7) if the change in divergence falls within the same predefined threshold, that is, |Div(D; A_D^j, B_D^j) − Div(D; A_D^{j−1}, B_D^{j−1})| / Div(D; A_D^{j−1}, B_D^{j−1}) < ε.

R_k(τ) = |Ŝ_k(τ)| / |D̂_k(τ)|,  (10)

where |Ŝ_k(τ)| and |D̂_k(τ)| are the clean and noise spectral components estimated from Ŝ = B_S A_S and D̂ = B_D A_D, respectively. Then, a speech mask at the τ-th frame is calculated from R_k(τ), as given in (11). The iteration of (15) is terminated if the condition |Div(D̃; A_D^j, B_D^j) − Div(D̃; A_D^{j−1}, B_D^{j−1})| / Div(D̃; A_D^{j−1}, B_D^{j−1}) < ε_Adapt is satisfied. It should be noted that the number of iterations for the noise basis adaptation should be smaller than that of the NMF training to prevent B_D from representing only the basis of D̃. For this reason, ε_Adapt is set to 0.01, which is 10 times greater than the ε used in Section 3.1. Consequently, B_D = B_D^j and A_D = A_D^j are obtained when the procedure terminates at the j-th iteration.

Figure 3 shows the spectrograms of the noise signals, the noisy speech signals at 5 dB SNR, and the signals estimated by the different noise reduction methods under the four background noise conditions.

The spectral magnitudes of several speech frames are concatenated so that Y = S + D is obtained. Note that it is assumed here that |Y_k(τ)| ≈ |S_k(τ)| + |D_k(τ)|, because this assumption has provided satisfactory results for NMF-based noise reduction [23, 24]. Thus, the matrices Y, S, and D are all K × T matrices, where K and T are the number of frequency bins and the number of concatenated frames, respectively. In the NMF framework, Y, S, and D are represented as Y = B_Y A_Y, S = B_S A_S, and D = B_D A_D, respectively, where B_Y, B_S, and B_D are the basis matrices of Y, S, and D, and A_Y, A_S, and A_D are the activation matrices corresponding to B_Y, B_S, and B_D. If it is assumed that S and D are fully separable from Y, Y can be rewritten as [23]

Y = [B_S B_D] [A_S^T A_D^T]^T = B_S A_S + B_D A_D.  (3)

Here, the superscript T is the transpose operator. Since R_S and R_D (R = R_S + R_D) are the ranks of the basis matrices for S and D, the dimensions of B_Y, B_S, and B_D are K × R, K × R_S, and K × R_D, respectively, and the dimensions of A_Y, A_S, and A_D are R × T, R_S × T, and R_D × T, respectively. As described in (3), S and D can be obtained if B_S, B_D, A_S, and A_D are known a priori. Therefore, most conventional NMF-based noise reduction methods estimate B_S and B_D from previously stored speech and noise databases, S and D [14, 15]. For given S and D, B_S and B_D can be obtained through a multiplicative update rule.
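The block structure of (3) can be checked numerically; a small NumPy sketch with illustrative sizes:

```python
import numpy as np

# Verify that the concatenated factorization of (3) equals the sum of the
# speech and noise parts: [B_S B_D][A_S; A_D] = B_S A_S + B_D A_D.
rng = np.random.default_rng(0)
K, R_s, R_d, T = 6, 3, 2, 5
B_s, B_d = rng.random((K, R_s)), rng.random((K, R_d))
A_s, A_d = rng.random((R_s, T)), rng.random((R_d, T))
Y = np.hstack([B_s, B_d]) @ np.vstack([A_s, A_d])
assert np.allclose(Y, B_s @ A_s + B_d @ A_d)
```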

Table 1: Comparison of the average SDRs, SIRs, and SARs (in dB) of the proposed method and the conventional methods under a living room noise condition.

Table 2: Comparison of the average SDRs, SIRs, and SARs (in dB) between the proposed method and the conventional methods under stationary noise conditions: (a) bus stop, (b) restaurant, and (c) subway noise.

Table 3: Comparison of the average word error rates (WERs) (%) of an ASR system employing the proposed method and the conventional methods under a living room noise condition.

Table 4: Comparison of the average word error rates (WERs) (%) of an ASR system employing the proposed method and the conventional methods under stationary noise conditions: (a) bus stop, (b) restaurant, and (c) subway noise.