Feedback Module Based Convolution Neural Networks for Sound Event Classification

Sound event classification has received much attention in recent years in the field of audio processing because of open datasets recorded in various conditions and the introduction of challenges. To use a sound event classification model in the wild, it must be independent of recording conditions. Therefore, a more generalized model, which can be trained and tested in various recording conditions, must be researched. This paper presents a deep neural network with a dual-path frequency residual network and feedback modules for sound event classification. Most deep neural network based approaches for sound event classification use feed-forward models and train with a single classification result. Although these methods are simple to implement and deliver reasonable results, the integration of recurrent inference based methods has shown potential for improvements in classification and generalization performance. We propose a weighted recurrent inference based model employing cascading feedback modules for sound event classification. Our experiments show that the proposed method outperforms traditional approaches in indoor and outdoor conditions by 1.94% and 3.26%, respectively.


I. INTRODUCTION
While vision provides a wealth of information about a scene, much of scene understanding can also be achieved through sound. As such, Sound Event Classification (SEC) may play a crucial role in automatic scene understanding. An acoustic recording of a scene may reveal sound-producing objects, environmental conditions such as weather, or background settings including traffic noise. Thus, exploiting sound as a descriptive feature of a scene may enhance scene recognition or understanding. SEC also affects a wide range of applications, such as sound event detection [1], audio tagging [2], audio monitoring in smart cities [3], life assistance and healthcare [4], etc.
The goal of SEC is to recognize audio recordings of various sound-producing events which may occur indoors or outdoors. In indoor conditions, reverberation is typically the main obstacle to successful classification. Outdoor conditions, on the other hand, are challenging due to background noise, e.g. conversations, machine noise, vehicle noise, etc. There have been some successful efforts in the past to deal with these issues; however, as the severity of these challenges grows, a novel alternative SEC model structure is needed. In particular, it is necessary to study a more generalized model that works well in both indoor and outdoor conditions. SEC is similar to Acoustic Scene Classification (ASC) and Environmental Sound Classification (ESC). The Detection and Classification of Acoustic Scenes and Events (DCASE) challenge [5] prompted many innovations and novel concepts to solve the challenges associated with the ASC task [6]-[8]. The concepts proposed in the challenge are useful for designing a more generalized model for SEC. For ESC, Piczak [9] and Salamon et al. [10] developed new ESC datasets and proposed novel neural network architectures. Their proposed ESC concept, however, focused only on outdoor sound classification. Since an SEC model should be designed to be effective in both indoor and outdoor environments, the evaluation of the SEC model includes both types of environments. SECL_UMONS [11], an SEC dataset, is suitable for evaluating SEC models in indoor environments. Training and developing modern systems for the SEC task requires large-scale labeled SEC datasets, which contain {audio recording, sound event label} paired samples. Many previous models used these datasets with single-inference supervised learning.
However, there are no models with multi-inference supervised learning and inference search.
Over the last few decades, several approaches have been developed for SEC problems, such as the Gaussian Mixture Model (GMM) [12] and the DNN-GMM-ergodic Hidden Markov Model (HMM) combination model [13]. Ref. [14] presented a combination of visual and acoustic features for audio classification tasks. While these earlier methods have shown successes and have been utilized in various applications, their performance has been well surpassed in recent years by Convolutional Neural Network (CNN) based methods.
CNN based models [15]-[18] have shown significant performance improvements in SEC. Piczak [15] designed a CNN architecture to solve the environmental sound classification task, which is a subset of the SEC task. Salamon and Bello [16] improved the classification performance by changing the model size and applying data augmentation. Mcdonnell and Gao [17] proposed dual-path residual neural networks to solve the acoustic scene classification task, which is similar to the SEC task. Kim et al. [19] also presented a multi-band CNN architecture to solve the SEC task. Ref. [18] proposed a self-teaching method for improving generalization in sound event recognition. Recently, a Transformer [20] based model [21] was proposed to classify audio signals. The model successfully applied the Transformer structure to the audio classification task with ImageNet [22] pre-training. Although previous studies with single-inference supervised learning have yielded good results, a more generalized model structure is needed.
Others employed some of the latest image classification methods using 2D features [23]-[25] to improve SEC performance. Recognizing the effectiveness of the methodologies developed in computer vision, we adopted some of these techniques, deliberately selecting them based on their efficacy for SEC. As shown by their effectiveness in the computer vision field, we found recurrent inference [26] and feedback structures [27], [28] particularly effective in improving the SEC performance of our neural network structure.
In cognition theory, feedback connections which link the cortical visual areas can transmit responses from higher-order areas to lower-order areas [29], [30]. Motivated by this phenomenon, recent studies have applied the feedback mechanism [31], [32] to network architectures. The feedback module [27], [28] is a neural network based module with a recurrent structure. The module consists of neural network layers of arbitrary shape and feedback loops. In a feed-forward module, input features pass through the structure of the network only once. Unlike the feed-forward module, input features pass through the network several times in the feedback module: the output features of the previous feedback loop are fed into the input of the next feedback loop. The feedback module not only has the advantage of a feedback mechanism, but can also be employed to design deeper networks without increasing the number of network parameters. The model predicts progressively improved results as multiples of these feedback loops can be stacked effectively, thus employing a recurrent inference paradigm. Our model integrates the idea of stacked feedback loops in solving SEC, training our proposed parallel network structure in a recurrent inference fashion.
Recurrent inference is a prediction manner that predicts labels several times with a recurrent structure in one training step. It is inspired by the recurrent inference machine [26], which was proposed to solve inverse problems. The machine applies a Recurrent Neural Network (RNN) structure, which is Turing complete [33], to establish a general framework for any kind of inverse problem. According to the results of the recurrent inference machine, it has merits in generalization performance on several tasks. In [34], recurrent inference of a time-frequency mask was proposed for monaural singing voice separation. In [27], a feedback module with recurrent inference was proposed for image super-resolution. In [35], recurrent inference layers were proposed for visual question answering. From the results of these previous studies, we hypothesize that multi-inference supervised learning and inference search can improve generalization and robustness, like a collaborative learning manner [36], compared to single-inference supervised learning. Also, the method of deriving the final answer from the values predicted by multi-inference has the same effect as an ensemble model.
Although neural networks with RNN structures have successfully addressed many important tasks, they have some problems when used for recurrent inference. First, the hidden units in the RNN are limited to a fixed-length vector, which can cause the loss of spatial information. Second, the shape of the RNN cell is fixed, which can limit the number of parameters and network shapes that can be learned. To solve these problems, we design a feedback module with weight sharing. With the feedback module, input and hidden units do not lose spatial information. Moreover, arbitrary cell shapes can be designed.
In this paper, we propose a novel neural network model that aggregates dual-path residual neural networks with feedback structures to recognize indoor and outdoor sound events. We also propose a multi-inference supervised learning model and its inference search method. The block diagram describing the proposed train/test pipeline is shown in Fig. 1. First, features are extracted from audio data recorded in various environments. The extracted features pass through a feedback module based sound event classification model, and N predictions are produced. The model is trained on the N predictions with weighted recurrent inference training and tested on the N predictions with inference search. Our contributions are:
• We propose a neural network structure to improve the performance of the sound event classification task for recordings made in indoor and outdoor conditions.
• We present a multi-inference supervised learning model with weighted recurrent inference-based module and an inference search module to test the model.
• We evaluate the proposed training procedure using indoor and outdoor environments datasets and achieve state-of-the-art performance.
The remainder of the paper is organized as follows. The relations to prior work and the proposed method of this research are described in Section II and Section III, respectively. To evaluate our model, the experimental process and results are presented in Section IV. Conclusions are drawn in Section V.

II. FEATURE EXTRACTION
In this section, the feature extraction method is explained. In this paper, the event sound recording is converted into a Mel-Spectrogram, its delta, and delta-delta features, which are concatenated along the channel axis. To acquire these features, we first transform the fixed-length time-domain signal into the frequency domain using the Short Time Fourier Transform (STFT) to extract the Spectrogram. Second, the Spectrogram is filtered by the Mel-filter bank to generate the Mel-Spectrogram. Third, the delta and delta-delta features are calculated from the Mel-Spectrogram.
The Spectrogram is one of the most frequently used 2-D representations for time series data analysis. It contains the magnitude and phase information of local sections of the time series signal as it changes over time. The STFT of the time-series signal x(t) is

X(τ, f) = ∫ x(t) g(t − τ) e^{−j2πft} dt,    (1)

where τ and f denote the time and frequency axes of the Spectrogram, and g(·) is a window function. The Mel-Spectrogram can be extracted by filtering the Spectrogram with the Mel-filter bank. The Mel-scale aims to mimic the non-linear human ear perception of sound, being more discriminative at lower frequencies and less discriminative at higher frequencies. The conversion equation from frequency to Mel is:

m = 2595 log₁₀(1 + f / 700).    (2)

For practical use, the Mel-filter bank is designed by Equation (2) and applied to the Spectrogram.
The Mel-Spectrogram describes the instantaneous frequency information of the audio signal. However, audio signals are time-variant. Therefore, features that can express changes over time can have a good influence on classification. The delta and delta-delta features extract information about temporal changes as the first and second differences of the Mel-Spectrogram, respectively.
After concatenating the Mel-Spectrogram, its delta, and delta-delta features, the shape of the extracted feature is [B×F×T×3], where B is the batch size, F is the number of frequency bins, and T is the number of time bins.
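As an illustration, the stacking above can be sketched in NumPy. The simple first difference used here for the deltas is only one possible choice (libraries such as librosa use a smoother regression filter); the sketch mainly shows the shape bookkeeping:

```python
import numpy as np

def delta(feat):
    """First difference along the time axis of an (F, T) feature,
    padded by repeating the last frame so the shape is preserved."""
    d = np.diff(feat, axis=1)
    return np.concatenate([d, d[:, -1:]], axis=1)

def stack_features(mel):
    """Stack Mel-Spectrogram, delta, and delta-delta along a channel axis."""
    d1 = delta(mel)   # delta: change over time
    d2 = delta(d1)    # delta-delta: acceleration
    return np.stack([mel, d1, d2], axis=-1)  # shape (F, T, 3)

mel = np.random.rand(64, 128)      # F=64 frequency bins, T=128 time bins
feat = stack_features(mel)
assert feat.shape == (64, 128, 3)  # per-sample [F x T x 3], batched to [B x F x T x 3]
```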

III. SOUND EVENT CLASSIFICATION MODEL
The proposed deep neural network architecture is shown in Fig. 2. The network consists of three parts: an encoder, a feedback module, and a decoder.

A. ENCODER
The encoder is based on low/high frequency path residual networks. In the encoder, the input features are split in half along the frequency axis, so the feature size becomes [B×(F/2)×T×3]. Next, the features pass through the low/high frequency residual networks, described in Fig. 3(a), and the network outputs are concatenated along the frequency axis and passed through one convolution layer so that the output feature size becomes [B×F×(T/16)×128].
As shown in Fig. 4, the residual path of the residual block contains two pre-activation convolution layers, each consisting of BatchNormalization (BN)-ReLU-Convolution. One pre-activation convolution layer computes

F_out = W ∗ φ(BN(F)),

where F is the input features, φ is the ReLU activation function, and W is the weights of the convolution layer. This layer order has an advantage for generalization because it normalizes the very first input features and the output features are not constrained to be non-negative by the ReLU activation function. BN is a feature normalization method for neural networks that normalizes the mean and variance of the input features to 0 and 1, respectively. BN is calculated as

BN(x_i) = γ_B · (x_i − µ_B) / √(σ_B² + ε) + β_B,

where x_i is the i-th sample of the mini-batch, µ_B is the mini-batch mean, σ_B² is the mini-batch variance, ε is a small constant, and γ_B and β_B are trainable parameters. However, the γ_B and β_B terms are not trained in the BN layers, except for the first layer, because of the regularization effect [37]. The identity path of the residual block has an average pooling layer and a channel-padding layer. The input features are squeezed by average pooling and zero-padded to the shape of the residual path. If the stride of the residual block is (1,1), the average pooling and channel padding layers are omitted.
Finally, the residual path and identity path are summed at the end of the residual block. The output features of the residual block are

F_out = R(F_in) + I(F_in),

where F_in is the input features, R(·) is the residual path, and I(·) is the identity path. The spatial and temporal features of the low/high frequency components of the input data are extracted by the encoding networks.
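The identity path (average pooling followed by zero channel padding) can be sketched as below; the pooling stride and channel counts here are illustrative choices, not the paper's exact configuration:

```python
import numpy as np

def identity_path(x, stride=(2, 2), out_channels=64):
    """Identity path of the residual block: average-pool spatially with
    the given stride, then zero-pad the channel axis to match the
    residual path's output shape."""
    f, t, c = x.shape
    sf, st = stride
    # Non-overlapping average pooling (crop any remainder first).
    pooled = x[:f - f % sf, :t - t % st].reshape(
        f // sf, sf, t // st, st, c).mean(axis=(1, 3))
    # Zero-pad channels up to out_channels.
    return np.pad(pooled, ((0, 0), (0, 0), (0, out_channels - c)))

x = np.ones((32, 32, 16))                       # (F, T, C) input features
y = identity_path(x, stride=(2, 2), out_channels=64)
assert y.shape == (16, 16, 64)                  # pooled and channel-padded
assert np.allclose(y[..., :16], 1.0)            # pooled values preserved
assert np.allclose(y[..., 16:], 0.0)            # padded channels are zero
```

The block output is then the element-wise sum of this identity output and the residual path's output, which have matching shapes by construction.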

B. FEEDBACK MODULE
We design the feedback module to train our model using a recurrent inference strategy. The feedback module consists of neural network layers and feedback loops. In the first loop of the feedback module, the output features of the encoder are replicated once and concatenated along the channel axis. After that, the concatenated features are passed through the neural network layers. The layers consist of one pre-activation convolution layer and three residual blocks, as described in Fig. 3(b). The output features of the i-th feedback loop are

F_out^i = f({F_enc, F_out^{i−1}}),

where f(·) denotes the neural network layers of the feedback module, {·} denotes the concatenation operation, F_enc is the encoder output, and F_out^0 = F_enc. After some iterations of experiments, we found N = 4 feedback loops to be the most suitable in terms of network performance. The explorations of N are summarized in the experiment section. As the feedback module passes through the feedback loops, it learns more relevant features.
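A minimal sketch of the loop structure, with a toy linear map standing in for the shared layers f(·) (the paper uses a pre-activation convolution layer and three residual blocks):

```python
import numpy as np

def feedback_module(f_enc, layers, n_loops=4):
    """Run the feedback loops: each loop concatenates the encoder output
    with the previous loop's output along the channel axis and applies
    the shared layers f(.). Returns the output of every loop."""
    outputs = []
    prev = f_enc  # first loop: the encoder features are replicated
    for _ in range(n_loops):
        concat = np.concatenate([f_enc, prev], axis=-1)  # {F_enc, F_out^(i-1)}
        prev = layers(concat)                            # shared weights f(.)
        outputs.append(prev)
    return outputs

# Toy "layers": a fixed linear map taking 2C channels back down to C.
C = 8
W = np.random.rand(2 * C, C)
layers = lambda z: z @ W

f_enc = np.random.rand(4, 10, C)   # (F, T, C) encoder output
outs = feedback_module(f_enc, layers, n_loops=4)
assert len(outs) == 4
assert all(o.shape == (4, 10, C) for o in outs)
```

Because the same `layers` object is reused in every loop, the module deepens the effective network without adding parameters, as described above.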

C. DECODER
The output features of each loop of the feedback module are passed through the decoder, which is shown in Fig. 3(c). The decoder consists of a BN-convolution layer and three bidirectional Gated Recurrent Units (GRUs) [38].
In the BN-convolution layer, the number of channels is reduced to 8, and the features are reshaped to [B×T×8F]. In this case, since the second dimension of the tensor represents time and the third represents features, the RNN structure can be used to extract temporal features.
The GRU is a gating mechanism in recurrent neural networks. The inputs of the GRU cell at time t are the previous state h_{t−1} and the input features x_t. First, the update gate vector and reset gate vector are calculated in order to compute the current state h_t. The update gate vector z_t is

z_t = σ(W_u x_t + U_u h_{t−1} + b_u),

where W_u, U_u, and b_u are trainable weights and σ(·) is the sigmoid function. The reset gate vector r_t is

r_t = σ(W_r x_t + U_r h_{t−1} + b_r),

where W_r, U_r, and b_r are trainable weights. Next, the candidate activation vector ĥ_t is calculated by

ĥ_t = τ(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h),

where W_h, U_h, and b_h are trainable weights, ⊙ denotes the Hadamard product, and τ(·) is the tanh function. Finally, the current state h_t is

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ ĥ_t.

The current state h_t is used as the current output vector. The input vectors x_1, x_2, …, x_T are used to calculate the output state vectors h_1, h_2, …, h_T by one GRU model. A bidirectional GRU consists of a forward GRU and a backward GRU. Since features can be extracted considering both time directions, bidirectional GRU layers are used.
In the first and second bidirectional GRU layers, the output feature size is 256 and every GRU cell output h_1, h_2, …, h_T is used for the output features. In the third GRU layer, the output feature size is the number of classes and only the last GRU cell output h_T is used. Finally, the output features h_T are used for classification with the softmax activation function. Repeating the decoder step N times using the N feedback module outputs, N × [B × #classes] classification predictions are estimated.
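The GRU recurrence above can be written directly in NumPy; the dimensions and random weights below are purely illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x_t, h_prev, params):
    """One GRU step following the update/reset/candidate equations."""
    Wu, Uu, bu, Wr, Ur, br, Wh, Uh, bh = params
    z_t = sigmoid(x_t @ Wu + h_prev @ Uu + bu)            # update gate
    r_t = sigmoid(x_t @ Wr + h_prev @ Ur + br)            # reset gate
    h_hat = np.tanh(x_t @ Wh + (r_t * h_prev) @ Uh + bh)  # candidate activation
    return (1.0 - z_t) * h_prev + z_t * h_hat             # current state

D, H = 6, 4  # toy input and hidden sizes
rng = np.random.default_rng(0)
params = tuple(p for _ in range(3)
               for p in (rng.standard_normal((D, H)),
                         rng.standard_normal((H, H)),
                         np.zeros(H)))
h = np.zeros(H)
for t in range(5):                   # unroll over T=5 time steps
    h = gru_cell(rng.standard_normal(D), h, params)
# The state is a convex combination of h_prev and tanh output, so |h| < 1.
assert h.shape == (H,) and np.all(np.abs(h) < 1.0)
```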

D. WEIGHTED FOCAL LOSS FOR RECURRENT INFERENCE
To train the model with the N output predictions, the loss functions for each feedback loop should be defined. We develop the weighted recurrent inference strategy ordered from easy to hard tasks, as well as a novel loss function based on the focal loss [39], to train our model:

L_i = −(1 − p_i)^{γ_i} y_i log(p_i),

where p_i is the predicted value of the i-th feedback loop and y_i is the ground truth. According to [39], if γ_i is high, the loss function only affects hard-classified samples. In contrast, if γ_i is low, the loss function affects both hard-classified and well-classified samples. Thus, by adjusting γ_1, γ_2, …, γ_N, we can define the loss functions for the multiple predictions individually. The final loss is the summation of the losses of each feedback loop.
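A sketch of the per-loop focal loss and its summation over the N predictions; the small epsilon inside the log and the uniform γ values are illustrative choices:

```python
import numpy as np

def focal_loss(p, y, gamma):
    """Focal loss for one feedback loop: well-classified samples are
    down-weighted by (1 - p)^gamma on the true-class probability."""
    p_true = np.sum(p * y, axis=-1)      # probability assigned to the true class
    return -np.mean((1.0 - p_true) ** gamma * np.log(p_true + 1e-12))

def recurrent_inference_loss(preds, y, gammas):
    """Final loss: sum of per-loop focal losses over the N predictions."""
    return sum(focal_loss(p, y, g) for p, g in zip(preds, gammas))

y = np.array([[1.0, 0.0], [0.0, 1.0]])          # one-hot ground truth
preds = [np.array([[0.9, 0.1], [0.2, 0.8]]),    # loop 1 softmax output
         np.array([[0.95, 0.05], [0.1, 0.9]])]  # loop 2 softmax output
loss = recurrent_inference_loss(preds, y, gammas=[2.0, 2.0])
assert loss > 0
# gamma = 0 reduces to plain cross-entropy, which upper-bounds the focal loss.
assert recurrent_inference_loss(preds, y, [0.0, 0.0]) >= loss
```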

E. INFERENCE SEARCH
In the test phase, it is necessary to determine the final prediction using the predictions from the feedback loops. We present three methods depending on how the output is chosen: maximum training accuracy, softmax summing, and voting. These methods are inspired by ensemble methods. The maximum training accuracy method selects the model and feedback loop that yield the best validation accuracy. Let p_i^n be the predictions of the validation data at the i-th epoch and the n-th feedback loop; then the predictions at the i-th epoch are

p_i^n = f(x; θ_i),

where x is the validation data samples, f(·) is the classification model, and θ_i is the model weights at the i-th epoch. Then, the optimal model weights θ* using the maximum training accuracy method are

θ* = argmax_{i,n} (1/X) Σ_x (p_i^n[x] == y[x]),

where X is the number of validation data samples, y is the ground truth labels, and the '==' operator determines whether the prediction is correct or not: if the prediction is correct the value is 1, and 0 otherwise. Note that i ranges from 1 to max_epoch and n from 1 to N. The softmax summing method adds the output probabilities of every feedback loop and selects the highest value. The optimal model weights θ* using the softmax summing method are

θ* = argmax_i (1/X) Σ_x (argmax_c Σ_n p_i^n[x][c] == y[x]).

The voting method counts the predictions of every feedback loop and selects the most frequent label as the final prediction. The optimal model weights θ* using the voting method are

θ* = argmax_i (1/X) Σ_x (p̄_i[x] == y[x]),

where p̄_i[x] is the most frequent label among the predictions p_i^1[x], …, p_i^N[x].
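The softmax summing and voting rules can be sketched as follows (the maximum training accuracy method additionally requires the per-epoch validation loop, omitted here). Note that the two rules can disagree on the same set of predictions:

```python
import numpy as np

def softmax_summing(preds):
    """Sum the softmax outputs over the feedback loops, pick the argmax."""
    return np.argmax(np.sum(preds, axis=0), axis=-1)

def voting(preds):
    """Take each loop's argmax label and pick the most frequent one."""
    labels = np.argmax(preds, axis=-1)               # shape (N, batch)
    return np.array([np.bincount(labels[:, b]).argmax()
                     for b in range(labels.shape[1])])

# Predictions of N=3 feedback loops for a batch of 2 samples, 3 classes.
preds = np.array([[[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
                  [[0.5, 0.4, 0.1], [0.1, 0.6, 0.3]],
                  [[0.2, 0.7, 0.1], [0.3, 0.4, 0.3]]])
# For sample 0 the summed probabilities favor class 1 (1.4 vs 1.3),
# while two of the three loops' argmax labels vote for class 0.
assert softmax_summing(preds).tolist() == [1, 1]
assert voting(preds).tolist() == [0, 1]
```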

F. ALGORITHM
The algorithms of the proposed training/test procedure are shown below. In the training phase, features are extracted and parameters are initialized. Then, the model is trained with the weighted focal loss. Next, a validation set is selected from the training set. The model is evaluated by inference search and the best model is saved. In the test phase, features are extracted and the best model is loaded. Then, the test result is predicted by the model and inference search.
precision = TP / (TP + FP),  recall = TP / (TP + FN),  F1 = 2 · precision · recall / (precision + recall),

where TP is the number of true positive, FP false positive, and FN false negative samples.

2) Urbansound8K
Our outdoor dataset, UrbanSound8K, consists of 10 classes of urban sound recordings: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The dataset contains 8732 labeled urban sound recordings. To compare with other state-of-the-art methods, 10-fold cross validation was used for accuracy calculation. The accuracy is defined as

accuracy = (TP + TN) / (TP + TN + FP + FN),

where TN is the number of true negative samples.

3) ESC-50
ESC-50 [9] is used to evaluate our model on a dataset with a larger number of classes. The dataset consists of 5-second long recordings organized into 50 semantic classes loosely arranged into 5 major categories: animals, natural soundscapes and water sounds, human non-speech sounds, interior/domestic sounds, and exterior/urban noises.

4) CONFUSION MATRIX
We assess our model with the confusion matrix, a specific table layout that allows visualization of classification results. TP, FP, FN, and TN can be counted using the confusion matrix. TP is the number of samples of a class correctly predicted as that class. FP is the number of samples not of a class that are predicted as that class. FN is the number of samples of a class that are predicted as not of that class. TN is the number of samples not of a class correctly predicted as not of that class. When there are more than two class categories, we count TP, FP, FN, and TN, calculate the precision, recall, accuracy, and F1-score for each class, and average them to get the final accuracy and F1-score. For the normalized confusion matrix, each column of the confusion matrix is normalized by the number of samples of each class.
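The macro-averaged metrics described above can be computed from a confusion matrix as follows (rows are taken as true classes and columns as predicted classes; accuracy is omitted for brevity):

```python
import numpy as np

def per_class_metrics(cm):
    """Macro-averaged precision, recall, and F1 from a confusion matrix
    whose rows are true classes and columns are predicted classes."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as class c but actually another class
    fn = cm.sum(axis=1) - tp   # class c predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision.mean(), recall.mean(), f1.mean()

# Toy binary confusion matrix: 8 and 9 correct, 2 and 1 misclassified.
cm = np.array([[8, 2],
               [1, 9]])
p, r, f1 = per_class_metrics(cm)
assert abs(p - (8/9 + 9/11) / 2) < 1e-9
assert abs(r - (8/10 + 9/10) / 2) < 1e-9
```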

B. EXPERIMENT SETTINGS
We used the Specmix [40] data augmentation strategy to increase the diversity of the data distribution. Stochastic gradient descent was applied with a batch size of 16, momentum of 0.9, and weight decay of 1 × 10⁻³ for regularization. Each network was trained for 310 epochs using a learning rate schedule with warm restarts that resets the learning rate to its maximum value of 0.1 after 10, 30, 70, and 150 epochs and then decays it following a cosine pattern to 1 × 10⁻⁵. This method helps improve performance, as verified by [17].
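The warm-restart schedule can be sketched as below. The restart epochs and learning rate bounds are taken from the text; applying the cosine decay uniformly within each segment is an assumption about the exact shape:

```python
import math

def cosine_warm_restart_lr(epoch, restarts=(0, 10, 30, 70, 150),
                           total=310, lr_max=0.1, lr_min=1e-5):
    """Learning rate at a given epoch: reset to lr_max at each restart,
    then decay toward lr_min along a cosine curve within the segment."""
    bounds = list(restarts) + [total]
    for start, end in zip(bounds[:-1], bounds[1:]):
        if start <= epoch < end:
            frac = (epoch - start) / (end - start)
            return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * frac))
    return lr_min  # past the final segment

assert abs(cosine_warm_restart_lr(0) - 0.1) < 1e-12    # fresh start at lr_max
assert abs(cosine_warm_restart_lr(10) - 0.1) < 1e-12   # first warm restart
assert cosine_warm_restart_lr(9) < cosine_warm_restart_lr(10)  # decays, then resets
```

PyTorch's `CosineAnnealingWarmRestarts` scheduler implements a similar policy with geometrically growing segment lengths.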
To find the optimal parameters and model, we used a validation set for the SECL_UMONS dataset and cross-validation for the UrbanSound8K and ESC-50 datasets. For SECL_UMONS, we trained the model with varying parameters, assessed them using the validation set, which is randomly selected from the training set at every epoch, and chose the best model. For UrbanSound8K, we separated the dataset into 10 folds and selected one fold as the test set and the rest as the training set; each of the 10 folds was used as the test set in turn for cross validation. For ESC-50, we separated the dataset into 5 folds and repeated the cross validation process.

C. COMPARISON OF THE PREVIOUS MODELS 1) INDOOR CONDITIONS
The results are shown in Table 1. In terms of F1-score, the proposed model outperforms the baseline model and the dual-path residual networks [17] by 1.94% and 3.01%, respectively. To evaluate the effectiveness of the feedback loop, we compared the proposed model with and without the feedback loop. The F1-score of the model with the feedback loop is 1.32% better than that of the model without it. Therefore, the feedback implemented in our proposed model clearly improves sound event classification in indoor conditions. In Fig. 5(a), we provide the normalized confusion matrix yielded by the proposed model. All classes except the keyboard class are classified with over 95% accuracy. As a result, the proposed encoder module and feedback module are suitable for classifying sounds recorded in indoor conditions.

2) OUTDOOR CONDITIONS
The results are shown in Table 2. Our model achieves 81.40% mean accuracy, which is the best result compared to state-of-the-art models. We also tested the effectiveness of the feedback loop in outdoor conditions, in the same manner as for indoor conditions. The mean accuracy of the model with the feedback loop is 1.93% better than that of the model without it. The proposed feedback loop, therefore, is equally effective in outdoor conditions. Fig. 5(b) shows the normalized confusion matrix yielded by the proposed model, which is the best result of the 10-fold validations. The air_conditioner, dog_bark, drilling, and engine_idling classes achieve under 90% accuracy, and the other classes achieve over 90% accuracy. As a result, the proposed encoder module and feedback module are suitable for classifying sounds recorded in outdoor conditions.

3) THE LARGE NUMBER OF CLASSES
The results are shown in Table 3. Our model achieves 71.10% mean accuracy, which is not a competitive result compared to state-of-the-art models. We assume that the cause of this result is the encoder structure. To verify this assumption and the effect of the feedback module on the ESC-50 dataset, we designed a new model that consists of the AST [21] encoder and a feedback decoder. The feedback decoder, which consists of 5 fully connected layers with 4 feedback loops, is connected to the output layer of the AST model. As a result, the AST model with the feedback module achieved 96.2% accuracy, 0.5% higher than the original AST model. Therefore, the feedback loops help to improve performance. However, the encoder structure must be chosen carefully because performance is greatly affected by it.

D. VISUALIZATION
To justify the effectiveness of the feedback module, the output features of each feedback loop are visualized. We first extract the output features of each feedback loop for every validation sample. After applying dimensionality reduction to the extracted features, we checked how well the features were separated by comparing the reduced features with the class labels. The visualization results are shown in Fig. 6. The feature distributions of the second, third, and fourth loops show better separation than that of the first loop. In particular, the orange and turquoise classes are confused in the first loop but appear well separated in the third loop. Also, the distances between samples of different classes tend to increase.

E. ABLATION STUDY
The proposed method contains several prominent technical components: the input features, data augmentation, γ, the feedback loop, and the inference search strategy. We test variations of the proposed model in ablation studies using the SECL_UMONS dataset.

1) EFFECT OF INPUT FEATURES
We test the model with varying nfft and combinations of the Mel-Spectrogram, delta, and delta-delta features to find a suitable input feature for the model. The results are shown in Table 3. The effect of nfft was not great, and we confirmed that the performance improved incrementally each time the delta and delta-delta features were added.

2) EFFECT OF DATA AUGMENTATION
Since the proposed model should be trained well on a generalized training data distribution, we test the model with various data augmentation strategies. We compare Mixup [44], Cutmix [45], Specaugment [46], and Specmix [40], which are data augmentation strategies for Spectrogram data. As a result, Specmix outperforms the other strategies. In addition, the model was successfully trained with all data augmentation methods.

3) EFFECT OF γ s
To evaluate the effect of the γs, we test the model with different orders of the γs. The sort order of γ changes the way the model learns by changing the loss function. For example, the case where the γs are sorted in ascending order means that, as the mid-level features pass through the feedback loops, the model sequentially solves tasks from easy to difficult, because of the characteristics of the focal loss. We compare three cases: descending order, equal values, and ascending order. As a result, the case where the γs have equal values is better than the other cases.

4) EFFECT OF THE NUMBER OF FEEDBACK LOOP N
To find the optimal number of feedback loops N, we test models with N ranging from 1 to 6. From the results, the performance gradually improves from N = 1 to 4 and then decreases from N = 5.

5) EFFECT OF THE INFERENCE SEARCH
We compare the three different inference search algorithms. As a result, the maximum training accuracy method outperforms the other methods. The best training model matches the best test model, demonstrating good generalization performance. However, contrary to the expectation that better features will be learned as the signal goes through more feedback loops, there are cases where an early-stage loop performed better and was selected as the best training model. This instability is a future challenge.

V. CONCLUSION
In this study, we proposed a novel deep neural network model using dual-path residual networks and feedback modules for sound event classification. Our method adopted weighted recurrent inference of sound event classification via the proposed linked feedback modules. Through the experiments, we demonstrated that the proposed model shows significant improvements in sound event classification compared to state-of-the-art methods in both indoor and outdoor conditions. Additionally, the effectiveness of the feedback module was demonstrated with ablation studies. We expect that the proposed sound event classification model can be used for anomaly detection, sound event detection, audio monitoring, healthcare, social robots, and scene understanding. Furthermore, the feedback module and recurrent inference methods can be applied to several tasks, such as classification, denoising, and generation. Moreover, the feedback module is easily adapted as a sub-module of other neural networks. A potential limitation of the proposed model, however, is that performance may vary greatly depending on the encoder structure. Future work will explore the efficacy of the feedback structure in solving other learning based tasks that may benefit from the weighted recurrent inference manner.