Fully Deep Neural Networks Incorporating Unsupervised Feature Learning for Audio Tagging

In this paper we make contributions to audio tagging in two areas: acoustic modeling and feature learning. We propose a fully deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classification task in a regression way. Because only chunk-level rather than frame-level labels are available, all or almost all frames of the chunk are fed into the DNN to perform a multi-label regression for the expected tags. The fully DNN, which is regarded as an encoding function, maps the sequence of audio features to a multi-tag vector. For the unsupervised feature learning, we propose to use a deep auto-encoder (AE) to generate new features with a non-negative representation from the basic features. The new features further improve the performance of audio tagging. A deep pyramid structure is also designed to extract more robust high-level features related to the target tags. Further methods, such as dropout and background noise aware training, are adopted to enhance the generalization capability of DNNs for new audio recordings in mismatched environments. Compared with the conventional Gaussian mixture model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method is able to utilize long-term temporal information with the whole chunk as the input. The results show that our approach obtains a 19.1% relative improvement compared with the official GMM-based baseline method of the DCASE 2016 audio tagging task.

sounds from a wide variety of sources. The need for analyzing these sounds is increasing, e.g. for automatic audio tagging [1], audio segmentation [2] and audio context classification [3], [4]. Driven by technology and customer needs, audio processing has been applied in different scenarios, such as urban monitoring, surveillance, health care, music retrieval, and consumer video.
In urban monitoring, audio scene analysis, as an effective complement to video scene recognition, can provide road traffic estimation, and road traffic event detection and localization [5]. Relying on relatively cheap and scalable sensors, Gubbi et al. developed a system to present visualization of real-time data via the internet [6]. In surveillance of suspicious events accompanied by banging sounds and screaming, audio analysis can take us closer to semantics than video analysis and is also computationally more efficient [7]. For health care, breath sounds have been shown to be very valuable for the diagnosis of obstructive sleep apnea [8], [9]. Content-based music information retrieval can now process different queries well by searching for the relevant contents in a million-song database [10]. For consumer video analysis, Zhang et al. [11] proposed a system to segment and classify audio from movies or TV programs into multiple classes such as speech, music and environmental sound.
These applications have motivated the development of new audio processing methods, but have also introduced new challenges, such as the lack of accurately annotated data and acoustic feature learning, the two problems discussed in this paper. In many previous methods, annotated data are required to train a model, with the desired audio events clearly labeled. Although supervised approaches have been widely used, their effectiveness relies heavily on the quantity and quality of the training data. Nowadays, it is easy to collect a large amount of audio data online, e.g. from YouTube or Freesound; however, most of them are either not labeled at all or only labeled very weakly using a small amount of metadata or simple conceptual tagging. Processing such audio data in a supervised manner is hard because manually labeling data is very time-consuming. Moreover, the quality of audio data is often unsatisfactory because of the variations in the recording devices used and the environments where the audio data are recorded. Accordingly, learning robust audio features to handle acoustic variation becomes highly desirable.
To address these problems, two types of methods have been developed. One is to convert low-level acoustic features into "bag of audio words" using unsupervised learning methods [12], [13], [14], [15], [16]. The second type of methods is based on only weakly labeled data [17], e.g. audio tagging. In the first type of methods, K-means, as an unsupervised clustering method, has been widely used in audio analysis [12] and music retrieval [13], [14]. In [15], Cai et al. replaced K-means with a spectral clustering-based scheme to segment and cluster the input stream into audio elements. Sainath et al. [16] derived an unsupervised audio segmentation method using Extended Baum-Welch (EBW) transformations for estimating parameters of Gaussian mixtures. Shao et al. [14] proposed to use a measure of similarity derived by hidden Markov models to cluster segments of audio streams in an unsupervised way. Xia et al. [18] used Eigenmusic and Adaboost to separate rehearsal recordings into segments, and an unsupervised clustering and alignment process to organize segments. The Gaussian mixture model (GMM), as a common model, was also used as the official baseline method in DCASE 2016 for audio tagging. More details can be found in [19].
A closely related line of work on weakly labeled training data is Multiple Instance Learning (MIL). It was proposed in [20] as a variation of supervised learning for problems with incomplete knowledge about the labels of training examples. It aims at classifying bags of instances instead of single instances. Following this work, Andrews et al. [21] proposed a new formulation of MIL as a maximum margin problem, which has led to some further work [22], [23], [24], [25], [26] in audio and video processing using weakly labeled data. Mandel and Ellis [22] used clip-level tags to derive tags at the track, album, and artist granularities by formulating a number of music information related multiple-instance learning tasks, and evaluated two MIL based algorithms on them. Briggs [23] transformed an input audio signal into a bag-of-instances representation by a 2D time-frequency segmentation. The trained MIL classifiers can separate bird sounds that overlap in time. In [27], Phan et al. used event-driven MIL to learn the key evidence for event detection. Recently, [17] also presented an SVM based MIL system for audio tagging and event detection. Tagging audio chunks takes much less time than precisely locating event boundaries within recordings, which improves the tractability of obtaining manual annotations for large databases.
Mel-frequency cepstral coefficients (MFCCs) have been used in environmental sound source classification [7], [28]; however, some previous work [29], [30] showed that MFCCs are not the best choice as they are sensitive to background noise. Recent studies in audio classification have shown that accuracy can be boosted by using features that are learned in an unsupervised manner, with examples in the areas of bioacoustics [31] and music [32]. In [31], [33], spherical K-means [34] works as a more effective feature learning method than the original K-means by using a unit L2 norm as a constraint on the centroids. To further improve feature learning, pre-processing methods, e.g. principal component analysis (PCA) whitening [34], which decorrelates the input dimensions, and post-processing methods, e.g. pooling the features across wide windows [29], are also utilized.
Although the methods mentioned above have led to some useful results in the detection and analysis of audio data, most of them ignored possible relationships among contextual information and only focused on training the model for each single event class independently. To handle the two problems mentioned above, we use deep neural networks (DNNs) because of their flexible structure, which is made up of multiple hidden layers and many units per layer, enabling them to learn robust high-level acoustic features and to model very complex and highly non-linear relationships between inputs and outputs. Recently, deep learning technologies have achieved great success in the speech, image and video fields [35], [36], [37], [38] since Hinton and Salakhutdinov showed in 2006 how a greedy layer-wise unsupervised learning procedure can be used to train a deep model [39]. Deep learning methods were also investigated for related tasks, such as unsupervised feature learning [40], acoustic scene classification [41] and acoustic event detection [42], and better performance was obtained in these tasks. For the music tagging task, [43], [44] have also demonstrated the superiority of deep learning methods. However, to the best of our knowledge, deep learning based methods have not been used for environmental audio tagging (EAT), a newly proposed task in the DCASE 2016 challenge based on the CHiME-home dataset [45]. For the audio tagging task, only chunk-level instead of frame-level labels are available. Furthermore, multiple instances could occur simultaneously; for example, child speech could exist with TV sound for several seconds. Hence, a good way is to feed the DNN with all the frames of the chunk to predict the multiple tags in the output, as we propose here.
Deep learning has also been widely explored in feature learning. These works have demonstrated that data-driven learned features can give better performance than expert-designed features. The neural network based bottle-neck feature [46] in the speech recognition task is one successful type of supervised learned feature, where the bottle-neck feature is extracted from the middle layer of a DNN classifier. In [47], four unsupervised learning algorithms, K-means clustering, restricted Boltzmann machine (RBM), Gaussian mixtures and auto-encoder, are explored in image classification. Compared with the RBM, the auto-encoder is a non-probabilistic feature learning paradigm [48].
In this paper, we propose a deep learning framework for the audio tagging task, with contributions mainly in two parts: acoustic modeling and feature learning. For the acoustic modeling, we propose a fully DNN-based method, which can utilize the long-term temporal information, to map the whole sequence of audio features into a multi-tag vector. We utilize all or almost all frames of the observed chunk as the input of a deep neural network. The fully neural network structure was also successfully used in image segmentation [49]. For the feature learning, we propose a deep auto-encoder based unsupervised feature learning method to generate a new feature with non-negative representations from the basic MFCC feature. To get a better prediction of the tags, a deep pyramid structure is designed with gradually shrinking layer sizes. This deep pyramid structure can reduce the non-correlated interferences in the whole audio features while focusing on extracting the robust high-level features related to the target tags. Dropout [50] and background noise aware training [51] are adopted to further improve the tagging performance in the DNN-based framework.
The rest of the paper is organized as follows. In Section II, we introduce the related work using GMM and SVM based MIL in detail, which will be used as baselines for performance comparison in our study. We present our fully DNN based framework in Section III. The deep auto-encoder based unsupervised feature learning is presented in Section IV. The data description and experimental setup are given in Section V. We show the related results and discussions in Section VI, and finally draw a conclusion in Section VII.

II. RELATED WORK
Two baseline methods compared in our work are briefly summarized below.

A. Audio Tagging using Gaussian Mixture Models
GMMs are a commonly used generative classifier. A GMM is parametrized by $\Theta = \{w_m, \mu_m, \Sigma_m\}$, $m = 1, \dots, M$, where $M$ is the number of mixtures, $w_m$ is the weight of the $m$-th mixture component, and $\mu_m$ and $\Sigma_m$ are its mean vector and covariance matrix.
To implement multi-label classification with simple event tags, a binary classifier is built for each audio event class in the training step. For a specific event class, all audio frames in an audio chunk labeled with this event are categorized into a positive class, whereas the remaining features are categorized into a negative class. At the classification stage, given an audio chunk $C_i$, the likelihoods of each audio frame $x_{ij}$, $j \in \{1, \dots, L_{C_i}\}$, are calculated under the two class models. Given audio event class $k$ and chunk $C_i$, the classification score $S_{C_i k}$ is obtained as the log-likelihood ratio between the positive and negative class models.
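The scoring step can be illustrated with a minimal sketch. It assumes scalar features, hand-set mixture parameters, and a score formed by summing per-frame log-likelihood differences; this form is an assumption, since the exact expression is not given here. The names `gmm_loglik` and `chunk_score` are hypothetical.

```python
import math

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of a scalar frame x under a 1-D Gaussian mixture."""
    terms = []
    for w, mu, var in zip(weights, means, variances):
        log_norm = -0.5 * math.log(2.0 * math.pi * var)
        terms.append(math.log(w) + log_norm - 0.5 * (x - mu) ** 2 / var)
    # log-sum-exp over mixture components for numerical stability
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def chunk_score(frames, pos_gmm, neg_gmm):
    """Assumed score: sum over frames of log p(x | positive) - log p(x | negative)."""
    return sum(gmm_loglik(x, *pos_gmm) - gmm_loglik(x, *neg_gmm) for x in frames)

# Toy models: (weights, means, variances) for the positive and negative class.
pos_gmm = ([0.5, 0.5], [1.0, 2.0], [0.25, 0.25])
neg_gmm = ([0.5, 0.5], [-1.0, -2.0], [0.25, 0.25])
frames_near_pos = [0.9, 1.5, 2.1]
frames_near_neg = [-1.1, -1.8, -0.7]
print(chunk_score(frames_near_pos, pos_gmm, neg_gmm) > 0)
print(chunk_score(frames_near_neg, pos_gmm, neg_gmm) < 0)
```

A positive score favors the presence of the event class in the chunk, a negative one its absence; in practice one such pair of models would be trained per tag.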

B. Audio Tagging using Multiple Instance SVM
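Multiple instance learning is described in terms of bags $B$. The $j$-th instance in the $i$-th bag, $B_i$, is defined as $x_{ij}$, where $j = 1, \dots, l_i$ and $l_i$ is the number of instances in $B_i$. Each bag carries a label $Y_i \in \{-1, 1\}$. If $Y_i = -1$, no instance in $B_i$ is a positive example; if $Y_i = 1$, then at least one instance $x_{ij} \in B_i$ is a positive example of the underlying concept [21].
As MI-SVM is the bag-level MIL support vector machine that maximizes the bag margin, we define the functional margin of a bag with respect to a hyper-plane as:
$$\gamma_i = Y_i \max_{j} \left( \langle w, x_{ij} \rangle + b \right)$$
Using the above notion, MI-SVM can be defined as:
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + A \sum_i \xi_i$$
subject to:
$$Y_i \max_{j} \left( \langle w, x_{ij} \rangle + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i$$
where $w$ is the weight vector, $b$ is the bias, $\xi$ is the margin violation, and $A$ is a regularization parameter.
Classification with MI-SVM proceeds in two steps. In the first step, $x_I$ is initialized as the centroid of every positive bag $B_i$ as follows:
$$x_I = \frac{1}{l_i} \sum_{j=1}^{l_i} x_{ij}$$
The second step is an iterative procedure to optimize the parameters.
Firstly, $w$ and $b$ are computed for the data set with positive samples $\{x_I : Y_i = 1\}$.
Secondly, for every positive bag we compute
$$f_{ij} = \langle w, x_{ij} \rangle + b, \qquad x_I = x_{ij^*}, \quad j^* = \arg\max_{j} f_{ij}$$
The iteration stops when there is no change of $x_I$. The optimized parameters are then used for testing.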
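The two bag-level operations of this procedure, the centroid initialization and the re-selection of a representative instance, can be sketched as follows. This is only an illustration under the assumption of a linear decision function; the SVM training step itself is omitted, and the names `centroid` and `select_witness` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def centroid(bag):
    """Mean of all instances in a bag, used as the initial representative."""
    n = len(bag)
    return [sum(col) / n for col in zip(*bag)]

def select_witness(bag, w, b):
    """Instance with the largest decision value <w, x> + b."""
    return max(bag, key=lambda x: dot(w, x) + b)

# One positive bag with three 2-D instances.
bag = [[0.0, 0.0], [2.0, 1.0], [1.0, 3.0]]
initial = centroid(bag)

# Hand-set hyper-plane standing in for the output of an SVM solver.
w, b = [1.0, 0.0], -0.5
witness = select_witness(bag, w, b)
print(initial, witness)
```

In a full implementation these two functions would be alternated with SVM training until the selected instances stop changing.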

III. PROPOSED FULLY DNN-BASED AUDIO TAGGING
A DNN is a non-linear multi-layer model for extracting robust features related to a specific classification [37] or regression [36] task. The objective of the audio tagging task is to perform multi-label classification on audio chunks (i.e. assign zero or more labels to each audio chunk, which is four seconds long in our experiments). Each chunk has only utterance-level labels, without frame-level labels. Multiple events happen at many particular frames. Hence, the common frame-level cross-entropy based loss function cannot be adopted. We propose a method to encode the whole or almost whole chunk.
A. Fully DNN-based multi-label regression using sequence to sequence mapping
Fig. 1 shows the proposed fully DNN-based audio tagging framework using the deep pyramid structure. With the proposed framework, the whole or almost whole audio features of the chunk are encoded into a vector with values {0, 1} in a regression way. Sigmoid was used as the activation function of the output layer to learn the presence probability of certain events. Minimum mean squared error (MMSE) was adopted as the objective function. A stochastic gradient descent algorithm is performed in mini-batches with multiple epochs to improve learning convergence as follows,
$$E = \frac{1}{N} \sum_{n=1}^{N} \left\| \hat{T}_n(X_{n-\tau}^{n+\tau}, W, b) - T_n \right\|_2^2 \qquad (5)$$
where $E$ is the mean squared error, $\hat{T}_n(X_{n-\tau}^{n+\tau}, W, b)$ and $T_n$ denote the estimated and reference tag vectors at sample index $n$, respectively, $N$ is the mini-batch size, and $X_{n-\tau}^{n+\tau}$ is the input audio feature vector with a context window of $2\tau + 1$ frames. It should be noted that the input window should cover the whole or almost whole chunk, considering that the reference tags are chunk-level rather than frame-level labels. However, slightly relaxing the window size so that it does not cover all of the chunk frames increases the total number of training samples for the DNN, which improved the performance in our experiments. $(W, b)$ denote the weight and bias parameters to be learned. The updated estimate of $W$ and $b$ in the $\ell$-th layer, with a learning rate $\lambda$, can be computed iteratively as follows:
$$(W^{\ell}, b^{\ell}) \leftarrow (W^{\ell}, b^{\ell}) - \lambda \frac{\partial E}{\partial (W^{\ell}, b^{\ell})}, \quad 1 \leq \ell \leq L + 1$$
where $L$ denotes the total number of hidden layers and $L + 1$ represents the output layer.
During the learning process, where the DNN can be regarded as an encoding function, the audio tags are automatically predicted. Hence multi-label regression rather than classification can be conducted. Two additional methods are given below to improve the DNN-based audio tagging performance.
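The regression objective described in this subsection can be illustrated with a minimal sketch: sigmoid outputs are compared with binary reference tag vectors through a mini-batch mean squared error. The toy pre-activations and the names `sigmoid` and `minibatch_mse` are illustrative assumptions, not part of the proposed system.

```python
import math

def sigmoid(z):
    """Logistic function used at the output layer."""
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_mse(predicted, reference):
    """Mean over the mini-batch of the squared distance between tag vectors."""
    n = len(predicted)
    return sum(
        sum((p - t) ** 2 for p, t in zip(pred, ref))
        for pred, ref in zip(predicted, reference)
    ) / n

# Two samples, three tags each; the values stand in for output pre-activations.
logits = [[2.0, -2.0, 0.0], [-1.0, 3.0, 1.0]]
reference = [[1, 0, 0], [0, 1, 1]]
predicted = [[sigmoid(z) for z in row] for row in logits]
error = minibatch_mse(predicted, reference)
print(error)
```

Because each tag has its own sigmoid output and its own squared-error term, several tags can be active for the same chunk, which is what the multi-label setting requires.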

B. Dropout for the over-fitting problem
Deep learning architectures have a natural tendency towards over-fitting, especially when there is little training data. This audio tagging task has only about four hours of training data, with an imbalanced training data distribution for each type of tag. Dropout is a simple but effective way to alleviate this problem [50]. In each training iteration, the feature value of every input unit and the activation of every hidden unit are randomly removed with a predefined probability (e.g., ρ). These random perturbations effectively prevent the DNN from learning spurious dependencies. At the decoding stage, the DNN discounts all of the weights involved in the dropout training by (1 − ρ), which can be regarded as a model averaging process [52].
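The training-time removal and the decoding-time discount can be sketched as below. This is a minimal illustration, assuming that scaling a layer's inputs by (1 − ρ) stands in for discounting the outgoing weights, which is equivalent for a linear map; the function names are hypothetical.

```python
import random

def dropout_train(values, rho, rng):
    """Training: zero each unit independently with probability rho."""
    return [0.0 if rng.random() < rho else v for v in values]

def dropout_test(values, rho):
    """Decoding: keep every unit and discount by (1 - rho)."""
    return [(1.0 - rho) * v for v in values]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
units = [1.0] * 10000
dropped = dropout_train(units, 0.2, rng)
fraction_removed = dropped.count(0.0) / len(dropped)
scaled = dropout_test([1.0, 2.0], 0.2)
print(fraction_removed, scaled)
```

The discount keeps the expected input to the next layer at decoding time equal to its expectation during training.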
A mismatch problem may also exist in this task: the audio segments in the testing set could be totally different from the audio segments in the training set due to the presence of background noise. Dropout is therefore adopted to improve the robustness of the DNN to the variation in the segments of the test set.

Background noise aware training [51] (a background noise adaptation method) is the second method. To enable this noise awareness, the DNN is fed with the primary audio features augmented with an estimate of the background noise. In this way, the DNN can use additional on-line background noise information to better predict the expected tags. The background noise is estimated as follows:
$$\hat{Z}_n = \frac{1}{T} \sum_{t=1}^{T} X_t \qquad (8)$$
where the background noise $\hat{Z}_n$ is fixed over the utterance and estimated using the first $T$ frames, with $X_t$ denoting the feature vector of frame $t$. Although this noise estimator is simple, a similar idea was shown to be effective in DNN-based speech enhancement [36], [51].
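A minimal sketch of the noise-aware input is given below. It assumes that the noise estimate is the per-dimension average of the first T frames and that it is appended once to the flattened context window; the function names are hypothetical, and the sizes are those reported in the experimental setup (24-D MFCCs, 91 context frames, T = 6).

```python
def noise_estimate(frames, T):
    """Average the first T frames, dimension by dimension (assumed estimator)."""
    first = frames[:T]
    return [sum(col) / len(first) for col in zip(*first)]

def noise_aware_input(frames, center, tau, T):
    """Flatten frames center-tau .. center+tau and append the noise estimate."""
    window = frames[center - tau:center + tau + 1]
    flat = [v for frame in window for v in frame]
    return flat + noise_estimate(frames, T)

# A dummy chunk: 99 frames of 24-D features.
chunk = [[float(t)] * 24 for t in range(99)]
dnn_input = noise_aware_input(chunk, center=45, tau=45, T=6)
print(len(dnn_input))
```

With these sizes the input has 91 × 24 + 24 = 2208 values, which matches the DNN input size stated in the experimental setup.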

IV. PROPOSED DEEP AUTO-ENCODER BASED UNSUPERVISED FEATURE LEARNING
MFCCs are used as the basic feature for the training of the DNN-based predictor in this work. The MFCC is a well-designed feature derived by experts based on the human hearing perception scheme [53]. Recently, more supervised or unsupervised feature learning works have demonstrated that data-driven learned features can offer better performance than expert-designed features. The neural network based bottle-neck feature [46] in the speech recognition task is one type of supervised learned feature, where the bottle-neck feature is extracted from the middle layer of a DNN classifier. Significant improvement over the MFCC feature can be obtained after it is fed into a subsequent GMM-HMM (hidden Markov model) system. However, for the audio tagging task, the tags are weakly labeled and, being derived through a multiple voting scheme, are not accurate, because it is more difficult to label environmental events than speech text along an audio file. Furthermore, there are many related audio files without labels on the web. Hence we propose a deep auto-encoder based unsupervised feature learning method.
An unsupervised feature learning algorithm is used to discover features from the unlabeled data. For this purpose, the unsupervised feature learning algorithm takes the dataset X and outputs a new feature vector. In [47], four unsupervised learning algorithms, K-means clustering, restricted Boltzmann machine (RBM), Gaussian mixtures and auto-encoder, are explored in image classification. Among them, the RBM and the auto-encoder are widely adopted to get new features or pre-train a deep model. Compared with the RBM, the auto-encoder is a non-probabilistic feature learning paradigm [48]. The auto-encoder explicitly defines a feature extracting function, called the encoder, in a specific parameterized closed form. It also has another closed-form parametrized function, called the decoder. The encoder and decoder functions are denoted as $f_\theta$ and $g_\theta$, respectively. Fig. 2 shows a typical one-hidden-layer auto-encoder structure with an encoder and a decoder. The encoder generates a new feature vector $h$ from an input $x = \{x^{(1)}, \dots, x^{(T)}\}$. It is defined as,
$$h^{(t)} = f_\theta(x^{(t)}) = s_f(W x^{(t)} + b)$$
where $h^{(t)}$ is the new feature vector, or new representation or code [48], of the input data, $s_f$ is the non-linear activation function, and $W$ and $b$ denote the weights and bias of the encoder, respectively. On the other hand, the decoder $g_\theta$ maps the new representation back to the original feature space, producing a reconstruction $\hat{x} = g_\theta(h)$:
$$\hat{x}^{(t)} = g_\theta(h^{(t)}) = s_g(W' h^{(t)} + b')$$
Here, $\hat{x}$ is the reconstructed feature, which is the approximation of the input feature, $s_g$ is the non-linear activation function of the decoder, and $W'$ and $b'$ denote the weights and bias of the decoder. The set of parameters $\theta = \{W, b, W', b'\}$ of the auto-encoder is updated to incur the lowest reconstruction error $L(x, \hat{x})$, which is a measure of the distance between the input $x$ and the output $\hat{x}$. The general loss function of the auto-encoder training can be defined as,
$$J_{AE}(\theta) = \sum_{t} L\left(x^{(t)}, g_\theta(f_\theta(x^{(t)}))\right)$$
Furthermore, auto-encoders can be stacked to get a deep auto-encoder. With non-linear activation functions, the auto-encoder is effectively a non-linear PCA [39]. In Fig. 3, the framework of deep auto-encoder based unsupervised feature learning for audio tagging is presented. It is a deep auto-encoder built by stacking simple auto-encoders with random initialization. To utilize the contextual information, seven frames of MFCCs are fed into the deep auto-encoder. A typical auto-encoder is a symmetric structure whose output has the same size as the input. Here, however, the deep auto-encoder is designed to predict only the middle-frame MFCC, namely the current frame feature, because more predictions in the output require more memory in the bottle-neck layer. In our practice, the deep auto-encoder generated a larger reconstruction error when the seven frames of MFCCs were used as the output with a narrow bottle-neck layer, which leads to an inaccurate representation of the original feature in the new space. With only the middle-frame MFCC in the output, the reconstruction error of the deep auto-encoder is nearly zero, which means that the activations of the bottle-neck layer can represent the original features well.
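The encoder, decoder and reconstruction error can be illustrated with a toy one-hidden-layer auto-encoder. The sketch assumes a ReLU encoder, a linear decoder, a squared-error distance and hand-set weights (no training); it is not the deep auto-encoder of Fig. 3, and all names are hypothetical.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]

def affine(W, b, x):
    """Compute W x + b for a matrix given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def encode(x, W, b):
    """Encoder with ReLU activation: the code is non-negative by construction."""
    return relu(affine(W, b, x))

def decode(h, W_dec, b_dec):
    """Decoder, assumed linear in this sketch."""
    return affine(W_dec, b_dec, h)

def reconstruction_error(x, x_hat):
    """Squared Euclidean distance between input and reconstruction."""
    return sum((a - c) ** 2 for a, c in zip(x, x_hat))

# Hand-set weights that split each input dimension into positive and negative parts.
W = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [2.0, -3.0]
h = encode(x, W, b)
x_hat = decode(h, W_dec, b_dec)
print(h, x_hat, reconstruction_error(x, x_hat))
```

The example shows that a code with only non-negative entries can still reproduce an input containing negative values once the decoder weights are applied.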
MMSE was also adopted as the objective function to fine-tune the whole deep auto-encoder model. A stochastic gradient descent algorithm is performed in mini-batches with multiple epochs to improve learning convergence as follows,
$$E_r = \frac{1}{N} \sum_{n=1}^{N} \left\| \hat{X}_n(X_{n-\tau}^{n+\tau}, W, b) - X_n \right\|_2^2$$
where $E_r$ is the mean squared error, $\hat{X}_n(X_{n-\tau}^{n+\tau}, W, b)$ and $X_n$ denote the reconstructed and input feature vectors at sample index $n$, respectively, $N$ is the mini-batch size, and $X_{n-\tau}^{n+\tau}$ is the input audio feature vector with a context window of $2\tau + 1$ frames. $(W, b)$ denote the weight and bias parameters to be learned.
The activation function of the bottle-neck layer is another key point in the proposed auto-encoder based unsupervised feature learning framework. Sigmoid cannot be used as the activation function of the code layer, because it compresses the values of the new feature into the range [0, 1], which reduces its representation capability. Hence, a linear or ReLU activation function is a more suitable choice. In [39], the activation function of the units of the bottle-neck layer (the code layer) of the deep auto-encoder is linear, and a perfect reconstruction of the image can be obtained. In this work, both ReLU and linear activation functions of the bottle-neck layer are tested for reconstructing the audio features in the deep auto-encoder framework. Fig. 4 shows the mean squared error over the CV set with the linear or ReLU activation function in the bottle-neck layer units from epoch 20 to epoch 50. ReLU is slightly better than the linear function at the final epoch. Hence, ReLU was used as the activation function of the bottle-neck layer.
Note that all of the other layer units also adopt ReLU as the activation function.
In summary, the new feature derived from the bottle-neck layer of the deep auto-encoder can be regarded as an optimized feature due to two factors. The first is that the auto-encoder learned feature is generated from contextual input frames, so it can capture the temporal structure information better than the original feature. The second is that the deep auto-encoder based unsupervised feature learning can utilize the huge amount of unlabeled data on the web, so more statistical knowledge in the feature space can be learned by this framework.

V. DATA DESCRIPTION AND EXPERIMENTAL SETUP

A. DCASE2016 data set for audio tagging
The data that we used for evaluation is the dataset of Task 4 of DCASE 2016 [19], which is built on the CHiME-home dataset [45]. The audio recordings were made in a domestic environment [54]. Prominent sound sources in the acoustic environment are two adults and two children, television and electronic gadgets, kitchen appliances, footsteps and knocks produced by human activity, in addition to sound originating from outside the house [54]. The audio data are provided as 4-second chunks at two sampling rates (48 kHz and 16 kHz), with the 48 kHz data in stereo and the 16 kHz data in mono. The 16 kHz recordings were obtained by downsampling the right channel of the 48 kHz recordings. Note that Task 4 of the DCASE 2016 challenge is based on using only the 16 kHz recordings.
For each chunk, multi-label annotations were first obtained from each of 3 annotators. There are 4378 such chunks available, referred to as CHiME-Home-raw [45]; discrepancies between annotators are resolved by conducting a majority vote for each label. The annotations are based on a set of 7 label classes as shown in Table I. A detailed description of the annotation procedure is provided in [45]. To reduce uncertainty …

We pre-process each audio chunk by segmenting it using an 80 ms sliding window with a 40 ms hop size, and converting each segment into 24-D MFCCs. For each 4-second chunk, 99 frames of MFCCs are obtained. Using a 91-frame expansion as the input instead of all the frames was found to be better, because this relaxed input scheme increases the total number of training samples. Hence the input size of the DNN was 2208, comprising the 91-frame MFCCs and the appended noise vector. The first hidden layer with 1000 units and the second with 500 units were used to construct a pyramid structure. Seven sigmoid outputs were adopted to predict the seven tags. The learning rate was 0.005. The momentum was set to 0.9. The dropout rates for the input layer and the hidden layers were 0.1 and 0.2, respectively. The mini-batch size was 3. T in Equation (8) was 6. It should be noted that the remaining 2432 chunks without 'strong agreement' labels in the development dataset were also added to the DNN training, considering that the DNN has a better fault-tolerant capability. These 2432 chunks without 'strong agreement' labels were also added to the training data for GMM and SVM training.

The deep auto-encoder has 5 layers, 3 of which are hidden layers. The input is the 7-frame MFCCs, and the output is the middle-frame MFCC. The first and third hidden layers both have 100 hidden units, while the middle layer is the bottle-neck layer with 30 units. For comparison, we also ran two baselines using GMMs and the MI-SVM described in Section II. For the GMM based method, the number of mixture components is 8.
Since the GMM based baseline focuses on computing frame-level likelihoods and MI-SVM relies on instance-level scores, the sliding window and hop size set for the two baselines are different. The GMM based baseline uses a 20 ms sliding window with a 10 ms hop size, while the sliding window and hop size for MI-SVM are set to 400 ms and 200 ms, respectively. To handle audio tagging with MI-SVM, each audio recording is viewed as a bag and each of its shorter segments obtained by a sliding window is treated as an instance. To accelerate computation, we use a linear kernel in our experiments.
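The sizes quoted in this setup can be cross-checked with a few lines of arithmetic. The sketch below assumes that only full analysis windows are counted and that all layers are fully connected with one bias per unit; the function names and the parameter counts derived from them are illustrative and are not figures reported in the paper.

```python
def num_frames(chunk_ms, window_ms, hop_ms):
    """Number of full analysis windows that fit in the chunk."""
    return (chunk_ms - window_ms) // hop_ms + 1

def dense_parameters(layer_sizes):
    """Weights plus biases of a fully connected stack (assumed layout)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

mfcc_dim = 24
frames_per_chunk = num_frames(4000, 80, 40)      # 4 s chunk, 80 ms window, 40 ms hop
dnn_input = 91 * mfcc_dim + mfcc_dim             # 91 context frames plus one noise vector
dnn_layers = [dnn_input, 1000, 500, 7]           # pyramid: input, two hidden layers, tags
ae_layers = [7 * mfcc_dim, 100, 30, 100, mfcc_dim]  # 7-frame input, middle-frame output
print(frames_per_chunk, dnn_input)
print(dense_parameters(dnn_layers), dense_parameters(ae_layers))
```

The first two printed values reproduce the 99 frames per chunk and the DNN input size of 2208 stated above.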
To evaluate the effectiveness of our approach, as compared with the two baselines, we use the equal error rate (EER) as a metric. EER is defined as the point on the graph of false negative rate (FNR) versus false positive rate (FPR) at which the two rates are equal [55], where
$$\mathrm{FNR} = \frac{\#\text{false negative}}{\#\text{positive}}, \qquad \mathrm{FPR} = \frac{\#\text{false positive}}{\#\text{negative}}$$
EERs are computed individually for each evaluation fold, and we then average the obtained EERs across the five folds to get the final performance.
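A simple way to approximate the EER from a list of scores is sketched below. It assumes binary labels, a decision rule that predicts positive when the score is at or above a threshold, and that the EER is read off at the candidate threshold where FNR and FPR are closest; the official evaluation code may interpolate differently, and the function name `eer` is hypothetical.

```python
def eer(scores, labels):
    """Approximate EER: average of FNR and FPR where they are closest.

    Requires at least one positive and one negative label.
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    best = None
    for thr in sorted(set(scores)):
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        fnr, fpr = fn / num_pos, fp / num_neg
        gap = abs(fnr - fpr)
        if best is None or gap < best[0]:
            best = (gap, (fnr + fpr) / 2.0)
    return best[1]

separable = eer([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
overlapping = eer([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(separable, overlapping)
```

In the evaluation protocol described above, such a value would be computed per tag and per fold and then averaged.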

VI. RESULTS AND DISCUSSIONS

A. Overall evaluations
Figure 5 shows the results obtained using the proposed fully DNN approach, the fully DNN improved by the proposed deep auto-encoder feature (denoted as DNN AE), and the two baselines, namely GMM and MI-SVM, across the five evaluation folds. The fully DNN-based approaches outperform the two baselines across the five-fold evaluations. This is because of the following two main reasons: First, our proposed approach can utilize the long-term temporal information …

The GMM based method yields a performance close to that of the proposed fully DNN method only in the third evaluation fold. We found that two of the audio event classes, namely adult male's speech (label 'm') and other identifiable sounds (label 'o'), are well identified in the evaluation of this fold. This is probably because the acoustic characteristics of the two event classes and their variations in the evaluation data match the trained models. The use of MI-SVM does not yield competitive performance in comparison with our proposed approach and the GMM-based baseline. Furthermore, the use of a linear kernel function instead of a non-linear one to accelerate computation probably also has a negative impact on the performance. MI-SVM does not handle these complex conditions well, while our proposed approaches do, due to the use of the long contextual information.
For a further comparison, Table III shows the detailed performances obtained using the proposed fully DNN approach, the fully DNN improved by the proposed deep auto-encoder

B. Evaluations for the number of contextual frames in the input of the fully DNN classifier
The reference label information for this audio tagging task is at the utterance level rather than the frame level, and the order and frequency of occurrence of the tags are unknown. Hence, the fully DNN is a direct choice to map the whole audio features to the multi-tag labels. However, the number of training samples would be reduced to the number of utterances if all the frames of the utterance were fed into the DNN classifier each time. Fewer training samples make the training process of the DNN unstable, considering that the parameters are updated using a stochastic gradient descent algorithm performed in mini-batches. We propose to relax the number of contextual frames to 91, while the total number of frames for each utterance is 99. The frame expansion is then conducted with a one-frame slide along the utterance. The same tag labels are assigned to each training sample belonging to a given utterance. This increases the number of training samples and also ensures that most of the tag-related audio features are covered in the input. Fig. 7 shows the EERs for Fold1 evaluated using different numbers of contextual frames in the input of the fully DNN classifier. Here the MFCCs are used as the input features. Using the 91-frame MFCC as the input obtains the lowest EER. Using the whole utterance, namely the 99-frame MFCC, as the input leads to the worst performance due to the smallest number of training samples. An interesting phenomenon is that using the 11-frame MFCC as the input still gives comparable performance, although in this case the window is relatively small and may not contain any audio features corresponding to certain tags in the output. This might indicate that most of the tags overlap heavily with each other within a given utterance.
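The frame expansion with a one-frame slide can be made concrete with a short sketch. It assumes that every window inherits the tag vector of its utterance, as described above; the function name `expand_chunk` and the dummy data are hypothetical.

```python
def expand_chunk(frames, tags, window):
    """Slide a window one frame at a time; every window inherits the chunk tags."""
    samples = []
    for start in range(len(frames) - window + 1):
        samples.append((frames[start:start + window], tags))
    return samples

frames = list(range(99))                 # stand-in for 99 MFCC frames of one chunk
tags = [1, 0, 0, 1, 0, 0, 0]             # stand-in for the seven chunk-level tags
samples_91 = expand_chunk(frames, tags, 91)
samples_99 = expand_chunk(frames, tags, 99)
samples_11 = expand_chunk(frames, tags, 11)
print(len(samples_91), len(samples_99), len(samples_11))
```

Under this scheme a 99-frame chunk gives 99 − 91 + 1 = 9 training samples with a 91-frame window, against a single sample when the whole chunk is used.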

C. Evaluations for the size of the training dataset
In the preceding experiments, the 'CHiME-Home-raw' dataset was used to train the DNN, GMM and SVM models. Here, to evaluate the performance using different training data sizes, DNNs were trained on either 'CHiME-Home-raw' or 'CHiME-Home-refine' while keeping the same testing set. MFCCs were used as the input features for the fully DNN classifier.
Table IV shows the EERs for Fold1 across seven tags with the fully DNNs trained on the 'CHiME-Home-raw' set and 'CHiME-Home-refine' set.It can be easily found that the fully DNN trained on the 'CHiME-Home-raw' set is better than the fully DNN trained on the 'CHiME-Home-refine' set, although part of the labels of the 'CHiME-Home-raw' set are not accurate.This indicates that DNN has strong fault-tolerant capability which suggests that the labels for the tags can not be refined with much annotators' effort.And the large training set is crucial for the DNN training.Nonetheless the GMM method is sensitive to the inaccurate labels.The increased training data with inaccurate tag labels does not help to improve the performance of GMMs. in Sec.IV.Hence, the value of the learned feature are all non-negative, leading to a non-negative representation of the original MFCC.Such a non-negative representation can then be multiplied with the weights in the decoding part of the autoencoder to obtain the reconstructed MFCCs.It is also adopted to replace the MFCCs as the input to the fully DNN to make a better prediction for the tags.The pure blue area at about the 18th dimension in Fig. 8 indicates the zero values due to the ReLU activation function.
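The effect of the ReLU bottle-neck on the learned features can be seen in a minimal sketch. It uses a single hidden layer rather than the deep auto-encoder of the paper, and the layer sizes, the random (untrained) weights, and the function names `encode` and `decode` are all assumptions made for illustration.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: negative pre-activations are set to zero."""
    return np.maximum(z, 0.0)

def encode(x, w_enc, b_enc):
    """Bottle-neck code; the ReLU makes every learned feature >= 0."""
    return relu(x @ w_enc + b_enc)

def decode(code, w_dec, b_dec):
    """Linear decoder: the non-negative code times the decoder weights."""
    return code @ w_dec + b_dec

rng = np.random.default_rng(0)
feat_dim, code_dim = 24, 50  # assumed sizes, not taken from the paper

# Untrained random weights, used only to show shapes and signs.
w_enc = rng.standard_normal((feat_dim, code_dim)) * 0.1
b_enc = np.zeros(code_dim)
w_dec = rng.standard_normal((code_dim, feat_dim)) * 0.1
b_dec = np.zeros(feat_dim)

mfcc = rng.standard_normal((99, feat_dim))  # stand-in for real MFCC frames
code = encode(mfcc, w_enc, b_enc)           # non-negative representation
recon = decode(code, w_dec, b_dec)          # reconstructed features

print(bool((code >= 0).all()), recon.shape)
```

Units whose pre-activation is negative are clamped to exactly zero, which is the kind of behaviour that would produce a constant zero-valued band in a spectrogram of the bottle-neck features.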

VII. CONCLUSIONS
In this paper we have studied acoustic modeling and feature learning for audio tagging. We have proposed a fully DNN-based approach incorporating unsupervised feature learning to handle audio tagging with weak labels, in the sense that only chunk-level rather than frame-level labels are available. The fully DNN is regarded as an encoding function that maps the sequence of audio features to a multi-tag vector in a regression way. Deep auto-encoder based unsupervised feature learning was also proposed to generate new features with a non-negative representation. To extract robust high-level features, a deep pyramid structure was designed to reduce irrelevant interfering features while keeping the highly related ones. Dropout and background noise aware training were adopted to further improve the generalization capacity for new recordings in unseen environments. We tested our approach on the dataset of Task4 of the DCASE 2016 challenge and obtained significant improvements over the two baselines, namely GMM and MI-SVM. Compared with the official GMM-based baseline system of the DCASE 2016 challenge, the proposed DNN system reduces the average EER from 0.21 to 0.1699. In future work, we will use a fully convolutional neural network (CNN) to extract more robust high-level features for the audio tagging task.
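The average EERs quoted above correspond to the 19.1% relative improvement stated in the abstract. The arithmetic can be checked with a short sketch, where the helper name `relative_reduction` is hypothetical:

```python
def relative_reduction(baseline, proposed):
    """Relative error reduction of `proposed` with respect to `baseline`."""
    return (baseline - proposed) / baseline

# Average EERs from the text: GMM baseline 0.21, proposed DNN system 0.1699.
print(f"{relative_reduction(0.21, 0.1699):.1%}")  # 19.1%
```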

Fig. 1. Fully DNN-based audio tagging framework using the deep pyramid structure.

Fig. 2. A typical one-hidden-layer auto-encoder structure with an encoder and a decoder.

Fig. 3. The framework of deep auto-encoder based unsupervised feature learning for audio tagging.

Fig. 4. Mean squared error over the CV set with the Linear or ReLU as the activation function of the bottle-neck layer units from epoch 20 to epoch 50.

Fig. 5. Equal error rates obtained using the proposed fully DNN approach, the fully DNN improved by the proposed deep auto-encoder feature (denoted as DNN AE), and the two baselines, namely GMM and MI-SVM, across five evaluation folds.

Fig. 6. Spectrograms of the original MFCCs and the reconstructed MFCCs by the deep auto-encoder.

Fig. 8 presents the spectrogram of the deep auto-encoder features, which can be regarded as a new non-negative representation, or optimized feature, of the original MFCCs. The units of the bottle-neck layer in the deep auto-encoder are all activated by ReLU functions, as mentioned in Sec. IV.

TABLE I. LABELS USED IN ANNOTATIONS.

TABLE II. THE NUMBER OF AUDIO CHUNKS FOR TRAINING AND TEST.
In our experiments, following the original specification of Task4 of DCASE 2016 [19], we use the same five folds from the given development dataset as the evaluation set, and use the remaining audio recordings for training. Table II lists the number of chunks of training and test data used for each evaluation fold.

TABLE III. AVERAGE EER COMPARISON AMONG THE PROPOSED FULLY DNN METHODS, THE GMM AND MI-SVM METHODS, EVALUATED FOR EACH EVENT ACROSS THE FIVE FOLDS.

TABLE IV. EERS FOR FOLD1 ACROSS SEVEN TAGS USING DNNS AND GMMS TRAINED ON THE 'CHIME-HOME-RAW' SET AND 'CHIME-HOME-REFINE' SET.