Sound Event Detection Based on Convolutional Neural Networks with Overlapping Pooling Structure

In this paper, a sound event detection method is proposed, based on convolutional neural networks (CNNs) with an overlapping pooling structure. Different from the traditional GMM-HMM and DNN-HMM models, the CNN model uses convolutional layers, which speed up training by reducing the number of trainable parameters. The extracted sound feature is the mel-frequency cepstrum coefficient (MFCC). A dropout layer is added after each convolutional layer: over-fitting can decrease detection accuracy, and dropout helps prevent the model from over-fitting. Moreover, an overlapping pooling structure is used in the CNN, in which the stride is smaller than the pooling kernel size, so the outputs of adjacent pooling windows overlap, which increases the richness of the features. The final experimental results show that the proposed CNN model is more accurate and robust than the GMM-HMM model and the baseline model.


Introduction
Sound event detection is widely used in healthcare monitoring [1], multimedia indexing and retrieval [2], urban analysis, etc. Traditional sound event detection models, such as the Gaussian mixture model-hidden Markov model (GMM-HMM) [3][4], are effective and can achieve sound event detection under certain conditions. Commonly used features include the mel-frequency cepstrum coefficient (MFCC) and the zero-crossing rate (ZCR). Deep learning has been applied to many areas of sound detection; in particular, the well-known back-propagation (BP) algorithm is used to train neural networks. The BP algorithm optimizes the model parameters and improves the accuracy of the deep neural network (DNN). This article adopts a CNN to detect sound event signals and improve the robustness of the detection system.

Framework of sound event detection system
The network model must be trained before it can be used for detection. The multi-sound event detection system is generally divided into three steps: data pre-handling, model building and output evaluation.

Data pre-handling
According to the requirements of this task, the source signal is first divided into short time frames. A time-frequency transformation performed on each short frame removes redundant information, so that the sound feature vector of the corresponding time frame can be obtained. By comparing the output values against the label values, a reliable network model is obtained. The specific network model structure is described in detail in the third part. Figure 1 shows the block diagram of the system:
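As a rough illustration of the framing step above, the sketch below splits a signal into short frames. The 1024-sample frame length follows the data preparation section of this paper, while the 512-sample hop size is an assumed value for illustration.

```python
import numpy as np

def frame_signal(signal, frame_len=1024, hop=512):
    """Split a 1-D signal into short frames of frame_len samples each."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

# 1 second of audio at 44.1 kHz -> ~23 ms frames of 1024 samples
sig = np.arange(44100.0)
frames = frame_signal(sig)
print(frames.shape)  # (85, 1024) with a hop of 512 samples
```

Each row of `frames` would then be passed through a time-frequency transformation (here, MFCC extraction) to produce one feature vector per frame.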

Baseline system
In recent decades, DNNs have played an important role in solving sound-related problems such as natural language processing, speech enhancement and speech recognition. The baseline system is a detection system based on a DNN. However, the DNN system has many parameters, which take a long time to train, and if the features are not suitable for the task, the final result will be poor.
The baseline system is suitable for simple sound event detection tasks [5]. The baseline network consists of one input layer, two hidden layers (128 neurons per layer) and one output layer (17 classes). Compared with other neural networks, the baseline system is easy to build, and it shows good robustness when the number of event types is small.
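A minimal NumPy sketch of the forward pass of such a baseline network is shown below. The two hidden layers of 128 neurons and the 17-class output follow the description above; the 13-dimensional MFCC input matches the feature dimension stated later in the paper, while the ReLU hidden activations and random weight initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Layer sizes: 13 MFCC inputs, two hidden layers of 128, 17 output classes
sizes = [13, 128, 128, 17]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Fully connected forward pass: two hidden layers, softmax output."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return softmax(x @ weights[-1] + biases[-1])

probs = forward(rng.standard_normal((4, 13)))  # batch of 4 feature vectors
print(probs.shape)  # (4, 17); each row is a probability distribution
```

In the real system these weights would of course be learned with back-propagation rather than sampled at random.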

Proposed Convolutional Neural Network
The convolutional neural network consists of 3 convolutional layers, 3 pooling layers, a flatten layer and two fully connected layers. Figure 2 shows the proposed CNN architecture. The network includes the following parts:
1) Convolutional layer: Low-level features such as edges, lines and angles may be extracted in the first convolutional layer, while higher-level features are extracted from the lower-level ones by deeper layers. In this task, the convolutional layers extract the information hidden in the MFCC features. In order to reduce over-fitting, the dropout rate was set to 0.2.
Each convolutional layer has 64 filters, and the convolutional kernel size is 3*3. The activation function of both the fully connected layers and the convolutional layers is the ELU activation function, defined as ELU(x) = x for x > 0 and ELU(x) = α(e^x − 1) for x ≤ 0, where α is a positive constant.
2) Pooling layer: The pooling layer efficiently reduces the size of the parameter matrix, which speeds up the calculation and reduces over-fitting [6][7][8]. In this work, the stride is smaller than the pooling kernel size; this overlapping structure improves the richness of the features. In the pooling layer, the stride is 1*1 and the pooling size is 2*2.
3) Flatten layer: The flatten layer is added after the convolutional layers. It reduces the parameters to one dimension so that they can be fed into the fully connected layers for detection and classification. Moreover, the flatten layer does not affect the batch size.
The flatten layer is shown in the Figure 3.
Let N be the number of sound event categories and let c_i denote the i-th sound event. The test sample is detected through the maximum posterior probability, ŷ = argmax_{1≤i≤N} P(c_i | x), where ŷ is the predicted label of the sound event.
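The overlapping pooling described in part 2 above can be sketched as follows: with a 2*2 pooling window, a stride of 1 yields a larger output map whose windows overlap, compared with the usual non-overlapping stride of 2. The input values here are arbitrary.

```python
import numpy as np

def max_pool2d(x, pool=2, stride=1):
    """2-D max pooling; stride < pool gives overlapping windows."""
    h, w = x.shape
    out_h = (h - pool) // stride + 1
    out_w = (w - pool) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+pool, j*stride:j*stride+pool].max()
    return out

x = np.arange(16.0).reshape(4, 4)
print(max_pool2d(x, pool=2, stride=1).shape)  # (3, 3): overlapping windows
print(max_pool2d(x, pool=2, stride=2).shape)  # (2, 2): non-overlapping
```

Because neighbouring windows share inputs when the stride is 1, the pooled feature map retains more of the local structure, which is the richness effect the paper relies on.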

Data preparation
Daily-Sound-Event is an open-source sound signal database [9]. The sound event signals were collected by a quaternion (four-channel) microphone array. The sampling rate was 44.1 kHz. Since 20-30 ms of sound can be regarded as short-time stationary, the frame length was set to 1024 sampling points (about 23 ms).
The whole data set contains 2040 samples, divided into a training set, a validation set and a test set in the proportions 72%, 8% and 20%, respectively.
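A simple sketch of this split is shown below; assuming integer truncation of the subset sizes, it reproduces the sample counts reported in the detection results section (1468 training and 163 validation samples).

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for this sketch
n = 2040
idx = rng.permutation(n)

n_train = int(n * 0.72)  # 1468 samples
n_val = int(n * 0.08)    # 163 samples
train, val, test = np.split(idx, [n_train, n_train + n_val])
print(len(train), len(val), len(test))  # 1468 163 409
```

Shuffling before splitting keeps the class distribution of the three subsets roughly similar.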
The spectrogram of the flushing sample is shown in Figure 4, and the spectrogram of the noise sample in Figure 5.

Get Feature vectors
Feature extraction is divided into two steps: pre-processing the sound events and computing the MFCC features. In this experiment, a smoothing operation was carried out. Samples with a short signal duration were zero-padded, and samples with a long duration were clipped. The MFCC dimension of each frame is 13.
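The zero-padding and clipping step can be sketched as below. The 13 coefficients per frame follow the paper; the target frame count of 100 is a hypothetical value, since the paper does not state the fixed length used.

```python
import numpy as np

def fix_length(mfcc, target_frames=100):
    """Zero-pad short samples and clip long ones to a fixed frame count."""
    n_frames, n_coeffs = mfcc.shape
    if n_frames < target_frames:
        pad = np.zeros((target_frames - n_frames, n_coeffs))
        return np.vstack([mfcc, pad])
    return mfcc[:target_frames]

short = np.ones((60, 13))   # 13 MFCC coefficients per frame, as in the paper
long_ = np.ones((150, 13))
print(fix_length(short).shape, fix_length(long_).shape)  # (100, 13) (100, 13)
```

Fixing the number of frames in this way gives every sample the same input shape, which the CNN requires.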

Training and Evaluation
During the experiment, the MFCC features are used for learning. The loss function is the cross-entropy loss, and the model is trained with the Adam gradient-based optimizer [10]. The cross-entropy is calculated as H(y, ŷ) = −Σ_{i=1}^{N} y_i log ŷ_i, where y is the one-hot label vector and ŷ is the predicted probability vector. The hyperparameters are set as follows: the batch size, number of epochs and learning rate are 8, 500 and 1e-3, respectively. Meanwhile, in order to avoid over-fitting in the neural network system, the dropout rate is set to 0.2.
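A minimal implementation of the categorical cross-entropy loss described above might look as follows; the class count of 17 follows the baseline description, and the probability values are illustrative.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy averaged over a batch of one-hot labels."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.eye(17)[[3, 7]]               # two one-hot labels (classes 3 and 7)
y_pred = np.full((2, 17), 0.05)           # flat predictions...
y_pred[0, 3], y_pred[1, 7] = 0.2, 0.2     # ...with some mass on the true classes
print(cross_entropy(y_true, y_pred))      # = -log(0.2)
```

Only the predicted probability assigned to the true class contributes to the loss, so the optimizer is pushed to raise exactly those entries.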

Detection Results
This experiment has 1468 training samples and 163 validation samples, and the number of epochs is set to 500. The network training result is shown in Table 1. Table 1 shows that the training loss declines steadily. The validation loss also declines and then remains stable, indicating that the system continues to fit the data without over-fitting. The detection rate on the validation set remains around 90%. Figure 6 shows the loss and accuracy curves of the training and validation samples. As shown in Table 2, the accuracy of the baseline detection system is around 75%, which is not very good, although it is still rising. Finally, on the test data, the detection rate of the CNN-based system reaches 88.23%, while the detection rate of the baseline system is only 71.32%. Compared with the traditional GMM-HMM system [4], the accuracy of the neural network system with the overlapping pooling structure increases by about 17%. The experimental results indicate that the detection precision of this model is higher than that of traditional methods, and that it is suitable for sound event detection.

Conclusion
In this paper, two systems were compared: a baseline system using a fully connected neural network, and a convolutional neural network with an overlapping pooling structure. The experimental results suggest that, on the same database, the CNN model is more reliable and achieves a better detection rate than the baseline system. The overlapping pooling structure is used to enhance the richness of the features, and the final recognition rate is improved by about 17% compared with the baseline system. The experiment also indicates that the dropout layer can improve the precision of sound event detection.