Speech enhancement on smartphone voice recording

Speech enhancement is a challenging task in audio signal processing: it aims to improve the quality of a target speech signal while suppressing other noises. Speech enhancement algorithms have developed rapidly, from spectral subtraction, Wiener filtering, and the spectral amplitude MMSE estimator to Non-negative Matrix Factorization (NMF). The smartphone, as a revolutionary device, is now used in all aspects of life, personal and professional, including journalism. Although many smartphones have two microphones (main and rear), only the main microphone is widely used for voice recording. This is why the NMF algorithm is widely used for speech enhancement in this setting. This paper evaluates speech enhancement on smartphone voice recordings using the algorithms mentioned above. We also extend the NMF algorithm to Kullback-Leibler NMF with supervised separation. This last algorithm shows improved results compared to the others in both spectrogram and PESQ score evaluation.


Introduction
Signal enhancement can be viewed from many perspectives; one that is currently widely used is source separation. Source separation is the process of decoupling two or more sources with little or no prior information. By this definition, speech enhancement is a natural application of source separation, because the goal is to separate the target speech from the noise.
The use of the smartphone as a multimedia device has increased rapidly in recent years. As can be seen in the media, most journalists now use smartphones instead of traditional recorders to record speech. At other speech events, the traditional recorder has almost disappeared, replaced by the smartphone voice recording feature. The voice recording feature of a smartphone can also be used for voice biometrics, voice tapping, video recording, voice commands and voice calling. The use of the smartphone as a voice recording tool should therefore be evaluated by measurement, and its recordings can be enhanced by signal processing techniques that address background noise.
This paper evaluates signal enhancement on smartphone voice recordings, from traditional methods to advanced ones. From the traditional side, the spectral subtraction method of Boll [1] was used in three variants: power spectral subtraction, magnitude spectral subtraction and over-subtraction [2]. The second approach to enhancing speech in smartphone voice recordings is Wiener filtering, which gives a minimum mean square error (MMSE) estimate, as proposed by Lim and Oppenheim [3]. In the third approach, the MMSE estimate was obtained from the short-time spectral amplitude (STSA), as suggested by Ephraim and Malah [4].
Moving to more recent speech enhancement algorithms, Non-negative Matrix Factorization (NMF) was used as the fourth approach, based on [5]. Finally, in the fifth approach, NMF was modified to use supervised separation [6]. All five approaches were evaluated by spectrogram and PESQ score to assess the performance of speech enhancement on a smartphone.

Speech Enhancement Algorithms
In this section we briefly describe the algorithms used in this research: spectral subtraction, Wiener filtering, MMSE-STSA and NMF.

Spectral subtraction
A speech signal recorded by a smartphone can be modelled as a clean signal plus additive noise. In the frequency domain this is written as
    Y(ω) = X(ω) + D(ω),
where Y(ω) is the recorded noisy signal, X(ω) is the clean target signal and D(ω) is the noise. The goal is to recover the target signal from the noisy signal. One simple solution is to subtract an estimate D̂(ω) of the noise from Y(ω) to obtain an estimate X̂(ω) of the target,
    X̂(ω) = Y(ω) − D̂(ω).
Working with spectral magnitudes gives magnitude spectral subtraction,
    |X̂(ω)| = |Y(ω)| − |D̂(ω)|.
Instead of the magnitudes, the power of each component can be used, which gives power spectral subtraction,
    |X̂(ω)|^α = |Y(ω)|^α − |D̂(ω)|^α,
in which α = 2. From this power spectral subtraction rule, the gain G(ω) = |X̂(ω)| / |Y(ω)| can be expressed as
    G(ω) = (1 − |D̂(ω)|^α / |Y(ω)|^α)^(1/α).
Different gains are obtained for different values of α.
The final spectral subtraction method is over-subtraction, proposed by Berouti et al. [2], which adds weighting coefficients α and β to improve the separation result:
    |X̂(ω)|² = |Y(ω)|² − α|D̂(ω)|²   if |Y(ω)|² − α|D̂(ω)|² > β|D̂(ω)|²,
    |X̂(ω)|² = β|D̂(ω)|²            otherwise.
The weighting coefficients α and β are used to reduce noise peaks. Here α is the over-subtraction factor and β sets the spectral floor. The factor α should depend on the segmental SNR of the frame (γ): less attenuation (small α) for high SNR and more attenuation (large α) for low SNR.
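The three spectral subtraction variants described above can be sketched per frequency bin as follows. This is a minimal illustration in plain Python, not the Matlab/Octave code used for the experiments. The function names are hypothetical, the argument `exponent` plays the role of the exponent α (1 for magnitude, 2 for power subtraction), and the SNR-dependent rule for the over-subtraction factor (slope 3/20 per dB, starting value 4) is the commonly quoted choice from Berouti et al. [2], assumed here rather than taken from this paper.

```python
def subtraction_gain(noisy_mag, noise_mag, exponent=2.0):
    """Gain G = |X_hat| / |Y| of spectral subtraction for one frequency bin.

    exponent=1.0 gives magnitude subtraction, exponent=2.0 power subtraction.
    """
    if noisy_mag <= 0.0:
        return 0.0
    ratio = (noise_mag / noisy_mag) ** exponent
    # Negative estimates are clipped to zero (half-wave rectification).
    return max(1.0 - ratio, 0.0) ** (1.0 / exponent)


def oversubtraction_factor(snr_db, alpha0=4.0):
    """Assumed rule: alpha falls linearly with segmental SNR in [-5, 20] dB."""
    snr_db = min(max(snr_db, -5.0), 20.0)
    return alpha0 - 3.0 * snr_db / 20.0


def over_subtraction_power(noisy_mag, noise_mag, alpha, beta):
    """Enhanced power |X_hat|^2 with over-subtraction alpha and floor beta."""
    noisy_pow = noisy_mag ** 2
    noise_pow = noise_mag ** 2
    estimate = noisy_pow - alpha * noise_pow
    floor = beta * noise_pow
    # Keep the subtracted power only if it stays above the spectral floor.
    return estimate if estimate > floor else floor
```

In a complete system these functions would be applied to every bin of every STFT frame, and the enhanced magnitude would be combined with the phase of the noisy signal before the inverse STFT.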
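
Wiener filtering
The Wiener filter [3] is the linear estimator that minimizes the mean square error between the clean and the estimated spectrum. If ξ(ω) denotes the a priori SNR, the ratio of clean speech power to noise power, then the Wiener filter is
    G(ω) = ξ(ω) / (1 + ξ(ω)),
where the enhanced spectrum is obtained as X̂(ω) = G(ω) Y(ω).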
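As a small illustration, the standard Wiener gain ξ/(1 + ξ), with ξ the a priori SNR (clean speech power over noise power), can be written in a few lines of plain Python. The function names are hypothetical, and the speech and noise powers are assumed to be already estimated; how they are estimated is outside this sketch.

```python
def a_priori_snr(speech_power, noise_power):
    """A priori SNR xi: ratio of clean speech power to noise power."""
    if noise_power <= 0.0:
        raise ValueError("noise_power must be positive")
    return speech_power / noise_power


def wiener_gain(xi):
    """Wiener gain G = xi / (1 + xi) for a non-negative a priori SNR."""
    if xi < 0.0:
        raise ValueError("xi must be non-negative")
    return xi / (1.0 + xi)


def apply_gain(noisy_spectrum, gains):
    """Scale each complex bin Y by its real gain G; the phase is unchanged."""
    return [g * y for g, y in zip(gains, noisy_spectrum)]
```

For example, a bin with equal speech and noise power (ξ = 1) is attenuated by a factor of 0.5, while a bin dominated by speech is passed almost unchanged.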

MMSE-STSA
The minimum mean square error short-time spectral amplitude (MMSE-STSA) estimator is the estimator that minimizes the mean square error of the spectral magnitude. It minimizes
    E[(|X(ω)| − |X̂(ω)|)²].
With γ(ω) = |Y(ω)|² / E[|D(ω)|²] defined as the a posteriori SNR and ξ(ω) as the a priori SNR, the MMSE-STSA estimator is a gain function of ξ(ω) and γ(ω) that is applied to the noisy magnitude |Y(ω)|. The full derivation can be found in [4].
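For reference, the gain derived in [4] has the closed form G = (√π / 2) (√v / γ) exp(−v/2) [(1 + v) I0(v/2) + v I1(v/2)], with v = ξγ / (1 + ξ), where ξ is the a priori SNR, γ the a posteriori SNR, and I0, I1 the modified Bessel functions of order zero and one. The sketch below evaluates this expression in plain Python. The function names are hypothetical, and the Bessel functions are computed by their power series, which is an implementation choice made here to avoid external libraries; it is adequate for moderate v but overflows for very large v.

```python
import math


def bessel_i0_i1(x, max_terms=500):
    """Modified Bessel functions I0(x) and I1(x) for x >= 0, by power series."""
    q = (x / 2.0) ** 2
    term0 = 1.0  # q**k / (k!)**2
    term1 = 1.0  # q**k / (k! * (k + 1)!)
    sum0, sum1 = term0, term1
    for k in range(1, max_terms):
        term0 *= q / (k * k)
        term1 *= q / (k * (k + 1))
        sum0 += term0
        sum1 += term1
        # Stop once further terms no longer change the sums.
        if term0 < 1e-17 * sum0 and term1 < 1e-17 * sum1:
            break
    return sum0, (x / 2.0) * sum1


def mmse_stsa_gain(xi, gamma):
    """MMSE-STSA gain for a priori SNR xi >= 0 and a posteriori SNR gamma > 0."""
    if xi < 0.0 or gamma <= 0.0:
        raise ValueError("need xi >= 0 and gamma > 0")
    v = xi * gamma / (1.0 + xi)
    i0, i1 = bessel_i0_i1(v / 2.0)
    bracket = (1.0 + v) * i0 + v * i1
    scale = math.sqrt(math.pi) / 2.0 * math.sqrt(v) / gamma
    return scale * math.exp(-v / 2.0) * bracket
```

At high SNR this gain approaches the Wiener gain ξ / (1 + ξ), which gives a simple sanity check for an implementation.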

NMF
NMF is a matrix factorization in which all elements are non-negative. It can be used for signal enhancement through source separation, which is the core of this work once the sound mixture has been modelled. The steps of signal enhancement by source separation are organized in the block diagram in Figure 1. As seen in the diagram, the source signal can be decomposed into its components, for example x₁, x₂ and x₃. The target signal is the only desired signal, while the others can be neglected or treated as noise.
(i) Unsupervised NMF
In NMF, the magnitude matrix |X| produced by the STFT is decomposed into two matrices, the basis vectors W and the weights (activations) H, so that |X| ≈ WH. A subset of basis vectors W_s and activations H_s is then chosen to reconstruct source s by estimating its magnitude, |X̂_s| = W_s H_s.
To solve for W and H given a known V (here the magnitude matrix |X|), the following optimization problem is posed:
    minimize D(V ‖ WH) subject to W, H ≥ 0,
in which D is a measure of divergence. Here the Kullback-Leibler divergence is used, so the quantity to minimize is
    D(V ‖ WH) = Σ_{f,t} ( V_ft log(V_ft / (WH)_ft) − V_ft + (WH)_ft ).
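One common way to minimize the Kullback-Leibler divergence under the non-negativity constraints is the multiplicative update rule of Lee and Seung, sketched below in plain Python for small matrices stored as nested lists. This is an illustrative assumption about the solver, not the Matlab/Octave code used for the experiments; the function names, the random initialization and the small constant `eps` that guards against division by zero are all choices made here.

```python
import math
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def kl_divergence(v, approx, eps=1e-12):
    """Generalized KL divergence D(V || WH) summed over all entries."""
    total = 0.0
    for v_row, a_row in zip(v, approx):
        for x, y in zip(v_row, a_row):
            if x > 0.0:
                total += x * math.log(x / (y + eps))
            total += y - x
    return total


def kl_nmf(v, rank, iterations=100, seed=0, eps=1e-12):
    """Factor non-negative V (freq x frames) into W (freq x rank), H (rank x frames)."""
    rng = random.Random(seed)
    n_freq, n_frames = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_freq)]
    h = [[rng.random() + eps for _ in range(n_frames)] for _ in range(rank)]
    history = []
    for _ in range(iterations):
        # H <- H * (W^T (V / WH)) / (column sums of W)
        wh = matmul(w, h)
        ratio = [[x / (y + eps) for x, y in zip(vr, ar)] for vr, ar in zip(v, wh)]
        w_t = transpose(w)
        numer = matmul(w_t, ratio)
        w_sums = [sum(col) for col in w_t]
        h = [[h[k][t] * numer[k][t] / (w_sums[k] + eps) for t in range(n_frames)]
             for k in range(rank)]
        # W <- W * ((V / WH) H^T) / (row sums of H)
        wh = matmul(w, h)
        ratio = [[x / (y + eps) for x, y in zip(vr, ar)] for vr, ar in zip(v, wh)]
        numer = matmul(ratio, transpose(h))
        h_sums = [sum(row) for row in h]
        w = [[w[f][k] * numer[f][k] / (h_sums[k] + eps) for k in range(rank)]
             for f in range(n_freq)]
        history.append(kl_divergence(v, matmul(w, h)))
    return w, h, history


def reconstruct_source(w, h, components):
    """Magnitude estimate W_s H_s using only the chosen basis indices."""
    w_s = [[row[k] for k in components] for row in w]
    h_s = [h[k] for k in components]
    return matmul(w_s, h_s)
```

In the unsupervised case the components that belong to the target speech must be selected after factorization, for example by passing their indices to `reconstruct_source`; the recorded divergence values in `history` should not increase from one iteration to the next.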

Experiment
The experiment was conducted using a Motorola Razr IS12M smartphone to record the target speech and fan noise with the voice recording application Easy Voice Recorder™. Recordings were made at a sampling rate of 8000 Hz in a semi-anechoic chamber. The sound source was a laptop speaker playing an Indonesian speech database, with background noise from an electric fan. The fan noise imitates the wind noise encountered when making a call in a vehicle on the road. The experimental set-up is shown in Figure 3. The sound data and Matlab/Octave code, along with the .tex files of this paper, are openly available at http://bitbucket.org/bagustris/icopia2016 in the spirit of open science.

Result and Discussion
The recorded sounds described in the previous section were analyzed offline on a PC. Each .wav file was processed with the seven algorithms described in Section 2 and enhanced [7]. Figure 2 shows the spectrogram of the noisy signal (a) together with the results of the seven speech enhancement algorithms (b to h).
The spectrograms show that both NMF algorithms, unsupervised and supervised, suppress noise much more than the other speech enhancement algorithms. Unsupervised NMF even shows almost no noise in the lower part of the spectrogram. However, the PESQ scores clearly show that supervised NMF has the highest perceived speech quality of all the algorithms.
The PESQ scores also show that unsupervised NMF has the lowest score, although its spectrogram shows that it suppresses noise more than the other algorithms. On listening, the output of unsupervised NMF has degraded voice quality, which lowers its PESQ score, even though its spectrogram shows more noise removed at lower frequencies than with the other methods. We chose the PESQ score for objective evaluation because it is the current standard in telecommunication. Informal listening is also consistent with the PESQ scores.

Conclusion
This paper reviews several speech enhancement algorithms, from classical to modern methods, by evaluating them directly on smartphone voice recordings. From objective evaluation using spectrograms and PESQ scores, it can be concluded that the modern method, supervised NMF, has the highest PESQ score and the spectrogram most similar to the original clean signal. In future research, computation time should be studied and reduced for real-time implementation.