Knowledge Distillation in Acoustic Scene Classification

Common acoustic properties that different classes share degrade the performance of acoustic scene classification systems. This results in a phenomenon where a few confusing pairs of acoustic scenes dominate a significant proportion of all misclassified audio segments. In this article, we propose adopting a knowledge distillation framework that trains deep neural networks using soft labels. Soft labels, extracted from another pre-trained deep neural network, are used to reflect the similarity between different classes that share similar acoustic properties. We also propose utilizing specialist models to provide additional soft labels. Each specialist model in this study refers to a deep neural network that concentrates on discriminating a single pair of acoustic scenes that are frequently misclassified. Self multi-head attention is explored for training specialist deep neural networks to further concentrate on target pairs of classes. The goal of this article is to train a single deep neural network that demonstrates performance equivalent to, or higher than, an ensemble of multiple models, by distilling the knowledge from several models. Diverse experiments conducted using the detection and classification of acoustic scenes and events 2019 task 1-a dataset demonstrate that the knowledge distillation framework is effective in acoustic scene classification. Specialist models successfully decrease the number of misclassified audio segments in the target classes. The final single model, trained with the proposed knowledge distillation from several models including specialists trained using an attention mechanism, shows a classification accuracy of 77.63 %, higher than an ensemble of the baseline and multiple specialists.


I. INTRODUCTION
Acoustic scene classification (ASC) is a multi-class classification task that classifies an input audio segment into one of the pre-defined acoustic scenes. With the recent prosperity of deep neural networks (DNNs), research interest in ASC with DNNs is increasing, and recent state-of-the-art systems comprise DNNs [1]-[4]. For studies on the ASC task, the detection and classification of acoustic scenes and events (DCASE) community provides a common platform, including datasets released annually, for researchers to study and report results [5]-[8]. Through previous challenges, a variety of studies have been conducted on feature exploration [9]-[12], DNN architectures [13]-[18], data augmentation [19]-[21], and ensembles [22], [23].
Common acoustic properties that reside among different acoustic scenes cause performance degradation in the ASC task [24]. It is hypothesized that the learning of a DNN can be disturbed if audio segments are labelled differently even though they share acoustic properties. For example, a babbling sound can exist in both 'airport' and 'shopping_mall' audio segments (both scenes are pre-defined by the DCASE community). However, audio segments that contain the same babbling sound are labelled as airport or shopping_mall depending on the location of the recording. One phenomenon that these common acoustic properties evoke is that a few pairs of acoustic scenes occupy the majority of misclassified audio segments. Fig. 1 demonstrates the impact of a few dominantly misclassified pairs of acoustic scenes by illustrating a cumulative bar graph. It shows that among the 45 pairs formed by combinations of 10 acoustic scenes, the top five most confused pairs account for more than half of the total errors, and the top ten pairs account for more than 80 % of the total misclassifications. The confusion matrix of the baseline DNN in Fig. 2 further confirms this phenomenon. It shows that pairs such as 'airport-shopping_mall', 'public_square-street_pedestrian', and 'metro-metro_station', which share similar acoustic configurations, dominate the misclassifications. In this study, we adopt a knowledge distillation (KD) framework [25], [26] from the perspective of extracting soft labels that reflect the similarity between defined acoustic scenes, which was first introduced to the ASC task in the authors' previous work [24]. The KD framework sequentially trains two DNNs, where the first is trained using conventional ground truth labels (i.e. one-hot hard labels) combined with a categorical cross-entropy (CCE) loss function, and the second is trained with soft labels that are the output of the first DNN. Refer to Section III for further details regarding the KD framework.
The experimental results show that the adoption of the KD framework successfully lowers the misclassification of pairs of acoustic scenes that are frequently misclassified.
However, these pairs still occupy the majority of misclassified audio segments. To further mitigate this phenomenon, we take the approach of training several specialist DNNs, where each specialist model concentrates on discriminating a specific target pair of acoustic scenes that are frequently misclassified [27]. We propose training with biased mini-batches to train the specialist DNNs. We also explore the attention mechanism to further concentrate on the target pairs of acoustic scenes. The purpose of using the specialist DNNs in this study is to develop a single DNN that successfully distills the knowledge of multiple models needed to identify pairs of acoustic scenes that are frequently confused. Specifically, we aim for a distilled single DNN to outperform an ensemble of several specialist models including the baseline DNN. Fig. 2 illustrates three confusion matrices from the baseline and two proposed systems: KD and specialist-based KD (SKD). It demonstrates that the proposed methods reduce the number of misclassified audio segments among frequently confused pairs of acoustic scenes.
In this study, we present an overall adaptation of the KD framework for the ASC task, together with our hypothesis and the problematic phenomena that it addresses [24], [27]. The novelty of this article can be summarized as follows: 1) A modified three-phase training framework that combines previous studies. 2) A comparison of the effects of the various loss function combinations used in the proposed framework. 3) A criterion for model selection and the application of a multi-head attention mechanism for specialist DNNs. 4) A method to provide superiority to the teacher DNN's input. 5) An analysis of the effectiveness of using different numbers of specialist DNNs for SKD. The rest of this article is organized as follows. Section II describes the overall system used in this study. Section III describes the KD framework. The proposed specialist models are introduced in Section IV. Section V addresses specialist knowledge distillation. Experiments and results are presented in Section VI, and the article is concluded in Section VII.

II. SYSTEM DESCRIPTION
In this section, we describe two components of the ASC system that we use throughout this article. The system comprises a front-end DNN for feature extraction and a support vector machine (SVM) for back-end classification. Table 1 shows the classification accuracy of our implementation of the baseline DNN in this study compared to a previous system [27], which demonstrated the second highest accuracy in the DCASE 2019 challenge among those using raw waveform as input.

A. FRONT-END FEATURE EXTRACTION
Recent studies on the ASC task mainly utilize DNNs for feature extraction combined with other back-end classifiers, or use them in an end-to-end fashion. With the recent prosperity of deep learning, these systems succeeded in extracting fixed-dimensional representations (also referred to as features or codes) with minimal pre-processing of input audio segments. In this study, we use the DNN architecture from [27], with stereo raw waveform inputs and representation vector outputs. Table 2 describes the DNN architecture used in this article. The DNN is directly fed by raw waveforms and comprises convolutional residual blocks [29], [30] followed by a global max pooling layer, a global average pooling layer, a fully-connected layer, and an output layer. The input shape is 479999 x 2, where 479999 refers to the number of samples and 2 refers to the two channels of an input audio segment. Convolutional blocks with residual connections extract a frame-level representation, which is then aggregated using a global max pooling layer and a global average pooling layer. Each residual block is followed by a max pooling layer of size 3, as done in [28]. The concatenation of the two global pooling layers is connected to the fully-connected layer, whose output is used as the representation vector.
The system is trained using a categorical cross-entropy (CCE) loss with the ground truth (hard label), which can be described as:

L_{CCE} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} y_{n,c} \log P(o_c|x_n), (1)

where N refers to the size of the mini-batch, C is the number of classes, y_n refers to a C-dimensional one-hot vector (if audio segment x_n belongs to class c, only y_{n,c} is 1 and all other values are 0), and P(o_c|x_n) is the output of the DNN output layer's c-th node with softmax activation applied when audio segment x_n is input. Note that after training is complete, the output layer is removed. Other details are dealt with in Section VI.
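As a concrete illustration, the CCE loss in (1) can be sketched in a few lines of NumPy; the function names and the use of integer class indices are choices of this sketch, not part of the original implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cce_loss(logits, labels, num_classes):
    """Mini-batch categorical cross-entropy, mirroring Eq. (1).

    logits: (N, C) raw DNN outputs; labels: (N,) integer class indices.
    """
    probs = softmax(logits)                # P(o_c | x_n)
    one_hot = np.eye(num_classes)[labels]  # y_n as C-dimensional one-hot vectors
    return -np.mean(np.sum(one_hot * np.log(probs + 1e-12), axis=1))
```

In a framework such as PyTorch, this corresponds to the standard cross-entropy loss applied to the output layer's logits.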

B. BACK-END CLASSIFICATION
Conducting the ASC task in an end-to-end fashion, by combining front-end feature extraction and back-end classification in a single DNN, is known to perform well. However, when conducting an ensemble of multiple DNNs, the softmax-activated outputs of a DNN may not be appropriate to convey probabilities or the concept of confidence, leading to poor calibration [31]. Therefore, we train a support vector machine (SVM) for back-end classification. Two kernels, Gaussian and sigmoid, are explored. Unless mentioned explicitly, all system performance in this article is reported using an SVM for back-end classification.

FIGURE 3. Overall process pipeline of the proposed KD framework for the ASC task. In phase 1, we train the teacher DNN (i.e. baseline). In phase 2, KD is conducted using soft labels generated by the baseline (T: teacher, S: student, single DNN).
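A minimal sketch of the back-end classification step using scikit-learn's SVC, assuming the 64-dimensional code representations have already been extracted by the front-end DNN; the helper name and toy data are illustrative only:

```python
import numpy as np
from sklearn.svm import SVC

def train_backend_svm(codes, labels, kernel="rbf"):
    """Fit the back-end SVM on code representations from the front-end DNN.

    kernel="rbf" corresponds to the Gaussian kernel; "sigmoid" is the other
    kernel explored in this article.
    """
    clf = SVC(kernel=kernel)
    clf.fit(codes, labels)
    return clf
```

At test time, codes extracted from evaluation segments are passed to `clf.predict`, or to `clf.decision_function` when scores are needed for a score-level ensemble.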

III. KNOWLEDGE DISTILLATION
A. OVERVIEW
Knowledge distillation was first devised in two studies conducted by Hinton et al. [25] and Li et al. [26] and was introduced for the ASC task in [24]. Although the two papers convey similar methodologies, Hinton et al. focused on the transfer of knowledge between DNNs, and the approach is therefore referred to as knowledge distillation. Li et al., on the other hand, focused on the fact that one DNN is used to train the other DNN and therefore referred to the approach as teacher-student learning.
Throughout this study, we use the terminology knowledge distillation (KD) framework and refer to the two DNNs as the teacher DNN and the student DNN. Fig. 3 illustrates the process pipeline of the KD framework. In the KD framework, the teacher DNN (i.e. baseline) is trained in advance using the conventional scheme of hard labels (one-hot vectors) combined with the CCE loss function described in (1). After training the teacher DNN, its weight parameters are frozen. Then, the student DNN is trained using the Kullback-Leibler (KL) divergence loss function with the soft labels generated by the teacher DNN. The training of the student DNN can be described as:

L_{KD} = \frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} P^T(o_c|x_n) \log \frac{P^T(o_c|x_n)}{P^S(o_c|x_n)}, (2)

where P^T(o_c|x_n) and P^S(o_c|x_n) refer to the teacher DNN's and the student DNN's output layer's c-th node with softmax activation applied when audio segment x_n is input, and other notations follow those in (1).
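The KL-divergence loss in (2) can be sketched as follows in NumPy; taking raw logits as input is a convention of this sketch (the loss itself operates on the softmax outputs, as in the paper):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits):
    """KL(P_T || P_S) averaged over the mini-batch, mirroring Eq. (2)."""
    p_t = softmax(teacher_logits)  # soft labels from the frozen teacher
    p_s = softmax(student_logits)
    return np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=1))
```

The loss is zero when the student reproduces the teacher's soft labels exactly and grows as the two distributions diverge.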

B. FURTHER OBJECTIVE FUNCTIONS
In this sub-section, we further hypothesize two possible phenomena that can arise when applying the KD framework, and we propose two additional objective functions to supplement it. The first problematic scenario is that, because the student DNN is trained only using the output of the teacher DNN, the model may collapse if the teacher DNN is overfitted and provides wrong soft labels. To address this, we conduct comparative experiments by adding the CCE loss in (1) using hard labels when applying the KD framework. The second problem is that the distillation of knowledge is indirect, because the representation of the last hidden layer, rather than the output layer of the teacher, is fed to the back-end SVM. Therefore, we hypothesize that comparing the representations of the last hidden layer between the teacher DNN and the student DNN can be a more direct approach. We use two distance metrics, as done in [32], mean squared error (MSE) and cosine distance, described as:

L_{MSE} = \frac{1}{M} \sum_{m=1}^{M} (e_m^T - e_m^S)^2, (3)

L_{COS} = 1 - \frac{e^T \cdot e^S}{\|e^T\| \|e^S\|}, (4)

where M refers to the dimensionality of the last hidden layer, and e_m is the m-th node output of the last hidden layer of the teacher (e_m^T) or the student (e_m^S). Note that in practice, we use -1 x cosine similarity (also referred to as the cosine proximity loss) for the cosine distance. Application of the two additional objective functions can be described as:

L = L_{KD} + L_{CCE} + L_{CODE}, (5)

where L_{CODE} is either L_{MSE} or L_{COS} defined in (3) and (4), respectively. We investigate the effect of each component, expanding [33], which considered only L_{CODE} and L_{KD}. The experimental results in Table 3 demonstrate that L_{KD} + L_{MSE} shows the best performance.
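The two code-level distances in (3) and (4) can be sketched as follows; the batched shape and function names are choices of this sketch:

```python
import numpy as np

def mse_code_loss(code_t, code_s):
    """Eq. (3): mean squared error between teacher and student codes."""
    return np.mean((code_t - code_s) ** 2)

def cos_code_loss(code_t, code_s):
    """Cosine proximity (-1 x cosine similarity), the form used in practice
    for the cosine distance of Eq. (4); averaged over the mini-batch."""
    num = np.sum(code_t * code_s, axis=1)
    den = np.linalg.norm(code_t, axis=1) * np.linalg.norm(code_s, axis=1)
    return -np.mean(num / (den + 1e-12))
```

Identical codes give an MSE of 0 and a cosine proximity of -1, the minima of the two losses.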

C. VARIOUS DURATIONS AND TEMPERATURES
To leverage the KD framework successfully, the superiority of the teacher DNN needs to be defined depending on the objective. For instance, in [26], where the KD framework was first proposed, the task was model compression, which aims to demonstrate similar performance using fewer weight parameters. In this case, the superiority of the teacher DNN is a larger model capacity (more weight parameters). In [34], which dealt with far-field compensation for speech recognition, the close-talking speech utterances that contain less reverberation and noise constituted the superiority. In this study, we propose to utilize multiple audio segments within an acoustic scene, recorded in different locations, to extract soft labels. The ASC task defines rather abstract acoustic scenes such as a park or an airport; thus, the acoustic properties can vary widely depending on the recording locations. For example, a recording of a park where kids are playing tennis and one where birds are chirping will contain different acoustic properties, but both will be labelled park. Therefore, we hypothesize that using multiple audio segments recorded at different locations will enable the extraction of soft labels that are less dependent on the recording location. Concretely, we utilize two approaches to set the superiority of the teacher DNN using multiple segments recorded at different locations. First, we concatenate multiple audio segments, as proposed in [33]. In this case, x_n used in P^T(·) in (2) to (4) refers to a concatenation of multiple audio segments, resulting in a longer duration. Second, we propose to measure the loss with each audio segment fed to the teacher DNN and average the results. Comparison results of the two approaches can be found in Table 4, denoted as 'Concat' and 'Each', where results demonstrate that the proposed 'Each' outperforms 'Concat'.
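The two ways of deriving a soft label from multiple same-class segments can be sketched as follows; `teacher_fn` stands in for a forward pass of the teacher DNN and is purely illustrative:

```python
import numpy as np

def extract_soft_label(teacher_fn, segments, mode="each"):
    """Derive one soft label from multiple same-class segments.

    'concat': a single forward pass on the concatenated waveform;
    'each'  : one pass per segment, soft labels averaged afterwards.
    """
    if mode == "concat":
        return teacher_fn(np.concatenate(segments))
    return np.mean([teacher_fn(seg) for seg in segments], axis=0)
```

The 'each' variant averages distributions rather than waveforms, which is why it can smooth out location-specific acoustic properties.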
Temperature [25] is the parameter that decides the extent of smoothness of the output after applying the softmax function. It can be interpreted as a form of label smoothing, and it is widely used in tasks of other domains that utilize the KD framework. The DNN's softmax output with temperature applied can be described as:

o_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}, (6)

where o_i refers to the output node for class i with softmax applied, z_i refers to the logit node for class i before softmax is applied, and T is the temperature. Table 4 shows the results of comparative experiments with varying temperatures.
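Temperature-scaled softmax, as in (6), can be sketched as follows; the function name is a choice of this sketch:

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Eq. (6): softmax over logits z with temperature T; higher T yields
    a smoother (softer) output distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()
```

With T = 1 this reduces to the plain softmax; raising T pushes the distribution toward uniform, which is the smoothing effect exploited in the temperature experiments.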

IV. SPECIALIST MODELS
A. OVERVIEW
In this research, we show that adapting the KD framework to the ASC task is effective not only in improving the overall classification accuracy, but also in lowering the number of misclassified audio segments in pairs of acoustic scenes that are frequently misclassified. The middle confusion matrix in Fig. 2 illustrates the confusion matrix of the model described in the previous section with KD training. It shows that the misclassification in the majority of pairs, including the top three most confusing pairs, decreased with a few exceptions (e.g. 'park'-'public_square'). Despite the improvement, the confusion matrix still exhibits the same problem of a few pairs being frequently misclassified. To further mitigate the confusion between the acoustic scenes, we introduce specialist models for the confusing pairs [27].
In [25], a specialist DNN was proposed to classify more specific categories with fewer classes, because there were too many categories; for example, where the general model classifies 15000 classes, a specialist was set to classify 150 classes among the 15000. We adopted the concept of specialist models for the ASC task with two modifications [27]. First, instead of making a specialist DNN classify a subset of the classes, we train specialist DNNs to classify all classes. This approach enables both a direct score-level ensemble and knowledge distillation from multiple specialist models to a single DNN, introduced in Section V. Second, we construct biased mini-batches focusing on a pair of two confusing classes. Specifically, we compose half of each biased mini-batch with recordings from the two target confusing classes, and the other half with recordings from the rest of the classes. All the specialist DNNs in this article are initialized using the weight parameters of the baseline system.
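The biased mini-batch construction can be sketched as follows; the index-pool representation and function name are choices of this sketch:

```python
import numpy as np

def biased_minibatch(indices_by_class, target_pair, batch_size, rng):
    """Compose half the mini-batch from the two target confusing classes
    and the other half from the remaining classes."""
    half = batch_size // 2
    pair_pool = np.concatenate([indices_by_class[c] for c in target_pair])
    rest_pool = np.concatenate([idx for c, idx in indices_by_class.items()
                                if c not in target_pair])
    batch = np.concatenate([rng.choice(pair_pool, half),
                            rng.choice(rest_pool, batch_size - half)])
    rng.shuffle(batch)
    return batch
```

Keeping half of every batch from the rest of the classes is what lets each specialist still classify all classes rather than only the target pair.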
We also propose a criterion for model selection for the specialist models. Various methods have been proposed for general model selection, such as early stopping. Overall accuracy, which is widely used for model selection, may not be an appropriate metric for the specialist models. Specialist models are not designed to classify all classes well; rather, they are trained to give additional discriminative power for the target pairs of acoustic scenes. To achieve this goal, we propose to use the number of misclassified segments among the target pair of two classes as the criterion. The number of misclassified segments by different specialist DNNs is described in Table 5, and performances based on using different numbers of specialist DNNs, shown in Table 6, demonstrate that the proposed criterion is valid for the proposed SKD framework for the ASC task.

TABLE 5. Comparison of the number of misclassified segments by 10 specialist DNNs that concentrate on one of the top ten pairs each, according to different training schemes for the specialist DNNs. The term 'w/o Attention' refers to the specialist DNNs trained only using biased mini-batch construction, whereas 'w/ Attention' refers to the specialist DNNs trained with a multi-head attention mechanism. Bold depicts the best in each row.

TABLE 6. Experimental results from using different numbers of specialist DNNs for specialist knowledge distillation using biased mini-batch composition and the multi-head attention scheme. The term 'w/o Attention' refers to the specialist DNNs trained only using biased mini-batch construction, whereas 'w/ Attention' refers to the specialist DNNs trained with a multi-head attention mechanism. Bold depicts the top four performances per scheme. Performance is reported as overall accuracy (%).
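The proposed model selection criterion can be sketched as follows; counting confusions in both directions between the two target classes is our interpretation of "the number of misclassified segments among the target pair":

```python
def pair_confusions(y_true, y_pred, pair):
    """Model selection criterion for a specialist DNN: the number of
    validation segments confused between the two target classes
    (lower is better)."""
    a, b = pair
    return sum((t == a and p == b) or (t == b and p == a)
               for t, p in zip(y_true, y_pred))
```

During training, the checkpoint with the smallest `pair_confusions` value on the validation set would be kept, instead of the checkpoint with the highest overall accuracy.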

B. MULTI-HEAD ATTENTION-BASED SPECIALIST MODELS
Attention mechanisms demonstrate successful performance across a number of tasks [15], [35]-[38]. They emphasize and de-emphasize parts of a given representation by re-scaling hidden representations with attention weights. Attention weights sum to 1 through a softmax activation function, which ensures that some parts are emphasized while other parts are de-emphasized. Additional information can be used to calculate attention weights; for example, phonetic information can be used to derive attention weights for speaker recognition. Weights for the attention mechanism can also be derived from the given representation itself, which is referred to as a self-attention mechanism.
Multi-head attention separates a representation space into multiple subspaces and concatenates them after applying softmax activation to each subspace. Each subspace is referred to as a head. Multi-head attention assigns an attention weight for each head; therefore, it is expected that each subspace can be adequately emphasized. In this study, we propose adopting self multi-head attention [36] to further let a specialist DNN focus on discriminating a target pair of confusing acoustic scenes. Biased mini-batches make each specialist DNN focus on a target pair of acoustic scenes; however, the DNN architecture remains the same. We assume that adding self multi-head attention to the DNN would help a specialist DNN focus on a target pair of scenes with a few additional parameters, by emphasizing relevant information and de-emphasizing the rest.
Specifically, we apply this mechanism to the feature map of the last convolutional layer ('Res7' in Table 2), before global pooling, similar to the method proposed in [38]. Given a feature map of a 1-dimensional CNN, h = [h_1, h_2, \cdots, h_T], h_t \in R^F, the representation space is first separated into multiple subspaces according to the number of heads:

h_t = [h_t^1, h_t^2, \cdots, h_t^K], \quad h_t^k \in R^{F/K}, (7)

where F is the number of filters, t is the time index, k is the head index, and K is the number of heads. We derive attention weights for each subspace by first averaging the given representation along the time axis (\bar{h}^k), conducting a dot product with the weight u_l, and then applying the softmax activation:

w_{kl} = \frac{\exp(\bar{h}^k \cdot u_l)}{\sum_{l'=1}^{F/K} \exp(\bar{h}^k \cdot u_{l'})}, (8)

where w_{kl} refers to the attention weight for the l-th filter in head k, u_l is a trainable vector, u_l \in R^{F/K}, and w_k = [w_{k1}, w_{k2}, \cdots, w_{k,F/K}]. Finally, we derive the multi-head attention applied feature map by re-scaling each subspace with its attention weights and concatenating the heads:

\hat{h}_t = [h_t^1 \odot w_1, h_t^2 \odot w_2, \cdots, h_t^K \odot w_K]. (9)
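A NumPy sketch of this self multi-head attention over a 1-d feature map follows; sharing one matrix U (whose l-th row plays the role of u_l) across heads is an assumption of this sketch:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mha_feature_map(h, U, K):
    """Self multi-head attention over a 1-d CNN feature map.

    h: (T, F) feature map; U: (F//K, F//K) matrix whose l-th row is the
    trainable vector u_l (assumed shared across heads); K: number of heads.
    """
    T, F = h.shape
    d = F // K
    out = np.empty_like(h)
    for k in range(K):
        sub = h[:, k * d:(k + 1) * d]          # head-k subspace, shape (T, d)
        h_bar = sub.mean(axis=0)               # time-averaged representation
        w_k = softmax(U @ h_bar)               # attention weights over the d filters
        out[:, k * d:(k + 1) * d] = sub * w_k  # re-scale each filter
    return out
```

The output keeps the feature map's shape, so it can be inserted before the global pooling layers without touching the rest of the architecture, which is why only a few parameters are added.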

V. SPECIALIST KNOWLEDGE DISTILLATION
The knowledge distillation framework enables training a student DNN with the soft labels extracted from a teacher DNN.
With specialist models, we can make an ensemble of multiple specialist models and a baseline model. Although an ensemble brings additional performance improvement, it requires multiplicative computational costs and is thus impractical for real-world scenarios. In this study, we propose to combine the two frameworks and distill the knowledge from multiple models, the specialist models and a baseline model, into a single DNN, which we refer to as the specialist knowledge distillation (SKD) framework. Our objective is to develop a single DNN that shows performance equivalent to, or better than, an ensemble of multiple specialist DNNs and a baseline. Fig. 4 illustrates the overall process pipeline of the proposed SKD framework for the ASC task. Distillation in the proposed framework (phase 3 in Fig. 4) is addressed as a summation of the knowledge distillation in (2) with multiple soft labels:

L_{SKD} = \frac{1}{N} \sum_{n=1}^{N} \left[ \sum_{c=1}^{C} P^T(o_c|x_n) \log \frac{P^T(o_c|x_n)}{P^S(o_c|x_n)} + \sum_{k \in Sp} \sum_{c=1}^{C} P^{sp_k}(o_c|x_n) \log \frac{P^{sp_k}(o_c|x_n)}{P^S(o_c|x_n)} \right], (10)

where Sp is the group of DNNs used to extract soft labels, P^T is the teacher DNN (baseline), P^{sp_k} is the specialist DNN, and k refers to the specialist concentrating on the k-th most confused pair of acoustic scenes. Experimental results in Table 7 show that the SKD framework successfully incorporates the teacher DNN and multiple specialist DNNs into a single DNN.

TABLE 7. Comparison of accuracy when conducting a score-sum ensemble and the proposed specialist knowledge distillation using the baseline and multiple specialist DNNs. Best performing configurations from Table 6 are reported.
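The SKD objective, a sum of KD terms from the baseline teacher and each specialist, can be sketched as follows; logits-in interfaces and function names are choices of this sketch:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(teacher_logits, student_logits):
    """KL divergence between teacher soft labels and student outputs."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    return np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=1))

def skd_loss(student_logits, teacher_logits, specialist_logits):
    """Sum of KD terms from the baseline teacher and each specialist DNN,
    distilled into a single student."""
    total = kl(teacher_logits, student_logits)
    for sp in specialist_logits:
        total += kl(sp, student_logits)
    return total
```

Each specialist contributes one additional KL term, so the student receives one gradient signal per soft-label source without any run-time ensemble cost.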

VI. EXPERIMENTS AND RESULTS
A. DATASET
All experiments in this article were conducted using PyTorch [39], a deep learning library for Python. Code for reproduction is available. The DCASE 2019 task 1-a dataset was used for all experiments. It comprises 14400 segments, totaling about 40 hours, from one of the 10 defined acoustic scenes, recorded in 12 European cities. All segments are recorded at a sampling rate of 48 kHz with 24-bit resolution, in stereo. Each segment has a duration of 10 seconds, split by the organizers from original recordings of five to six minutes. We use the train set in the fold-1 configuration for model training and the test set in the fold-1 configuration for model evaluation, following the literature (performance on the competition evaluation set cannot be reported because labels are not publicly available).

B. EXPERIMENTAL CONFIGURATIONS
We directly input raw waveforms with pre-emphasis applied to both channels. All 479999 samples, comprising 10 seconds at a 48 kHz sampling rate, are input to the DNN. Residual blocks comprise a 1-d convolution layer with a kernel length of 3, batch normalization (BN) [40], leaky rectified linear unit (ReLU) activation [41], and a max pooling of size 3, following the work in [27]. We conduct both global max pooling and global average pooling on the output of the last residual block, and then concatenate the two outputs. The code representation has a dimensionality of 64. The rest of the configuration follows the description in Table 2. In our research, all DNNs, the baseline, specialist DNNs, and the final single model, have identical architectures, resulting in the same number of weight parameters, with the exception of the multi-head attention specialist DNNs. We use the AMSGrad optimizer [42], [43] with a learning rate of 0.001. We adopt mix-up [44] with α = 0.1 after the first five epochs. Weight decay with λ = 0.001 is applied to all DNNs. We use a mini-batch size of 24 for all experiments. Center loss is additionally applied for training all DNNs [45]. Note that we fix all DNNs throughout this article to use raw waveform as input to measure the sole effect of the proposed methods. The KD framework itself has been validated to be effective for different input features (e.g. Mel-spectrogram) in previous works [24], [27]. Table 3 describes comparison results with different compositions of loss functions for knowledge distillation. Here, columns depict whether a specific component is included in the loss function, and rows depict the performance for each combination. Soft label refers to the original knowledge distillation loss function, which uses the KL-divergence between the student DNN's output and the teacher DNN's output (i.e. L_KD). Hard label refers to a conventional CCE loss function, whose purpose is to act as a complement in case the soft label is wrong (i.e. L_CCE).
Code refers to comparing the teacher DNN's and the student DNN's code representations with a defined loss function (i.e. L_MSE or L_COS). In this study, we conduct experiments using cosine similarity and MSE, following the work in [32].

C. RESULTS
Results show that in all cases, knowledge distillation improves performance regardless of distillation points and the kind of objective functions used. Usage of hard labels is not necessary for successful distillation. For the objective function used to compare the code representation of the teacher and the student DNN, MSE demonstrates higher accuracy than cosine similarity. The best result is achieved by using soft label and code together, without hard labels, demonstrating an accuracy of 75.89 %. Hereafter, we conduct all knowledge distillation, including specialist knowledge distillation, using both soft label and MSE for code distillation. Table 4 shows the results of knowledge distillation under different durations of teacher DNN input to extract the soft label, the types of methods to extract soft labels from longer durations, and various temperatures for the teacher DNN and the student DNN. The first row shows the result of the conventional knowledge distillation framework, and the second to fifth rows show the results of providing additional superiority to the soft labels extracted from the teacher DNN by using multiple segments from the same class. Under 'Type of superiority', 'Concat' refers to inputting concatenated segments to the teacher DNN, and 'Each' refers to inputting each segment to the teacher DNN and averaging soft labels from different segments to derive the final soft label. Different columns show the effect of using temperature for knowledge distillation where a higher temperature leads to a smoother soft label.
Results show that regardless of different temperatures and types of methods of superiority, the proposed scheme of using multiple segments for extracting soft labels improves performance. Using three segments showed a slightly further improvement compared to using two segments in most cases. When using multiple segments to extract a soft label, averaging the soft labels showed higher performance than inputting concatenated segments to the teacher DNN in most cases. Adopting the concept of temperature was effective with single segment-based soft labels. However, when using soft labels extracted from multiple segments, setting both the teacher DNN's and the student DNN's temperature to 1 showed the best performance. The best accuracy, 76.27 %, is achieved by using three segments to extract a soft label and averaging the soft labels, without giving explicit temperatures for knowledge distillation. Hereafter, we use this configuration for all knowledge distillation from multiple teacher DNNs, including specialist DNNs, to a single student DNN. Table 5 describes the number of misclassified segments by the baseline, the specialist DNNs with biased mini-batch construction, and the specialist DNNs that additionally used multi-head attention. We use both the biased mini-batch and multi-head attention schemes for 'Attention' because biased mini-batch construction is needed to make a specialist DNN focus on a specific pair of classes. Results show that six of the 10 multi-head attention-based specialist DNNs outperform the baseline and the biased mini-batch based specialist DNNs.
We conclude that specialist DNNs can effectively reduce the number of misclassifications of the target confusing pairs, and that using an attention mechanism for specialist DNNs shows further improvement for the most confusing pairs. Table 6 shows the results from conducting specialist knowledge distillation using the two specialist frameworks with differing numbers of specialist DNNs. Specialist DNNs trained with biased mini-batch composition and attention show different tendencies overall. When using biased mini-batch composition, using more specialist DNNs demonstrates the best performance, where an accuracy of 76.37 % is achieved with seven specialist DNNs. On the other hand, using only a few specialist DNNs demonstrated the best performance when both biased mini-batch composition and multi-head attention are used to train the specialist DNNs. An accuracy of 77.63 % is achieved with only one specialist DNN.
Additionally, the results in Table 6 show a tendency consistent with the results in Table 5. Specialists trained with the multi-head attention framework are relatively more effective for the top confusing pairs. Accordingly, specialist knowledge distillation using attention-based specialist DNNs requires only one, two, or three specialist DNNs for the best performance. From these results, we conclude that using the proposed multi-head attention framework for training specialist DNNs, in addition to biased mini-batch construction, is effective. It not only shows better performance, but also requires fewer specialist DNNs, minimizing the overhead in the overall process pipeline. Table 7 describes comparison results between a score-sum ensemble and the proposed SKD. For the ensemble, we report the best performance found empirically among different numbers of specialists. Results show that the combination of training specialists with multi-head attention and conducting SKD outperforms all other configurations. Based on this result, we conclude that SKD successfully distills the knowledge of each specialist DNN into a single DNN.
Lastly, Table 8 addresses the comparison with state-of-the-art systems in the ASC task. Three systems from DCASE 2019 task 1-a, which ranked 1st, 3rd, and 5th, respectively, are used for comparison; we report the best single-system performance among those that use Mel-spectrograms extracted from both channels as input [2]-[4]. Note that we could not find the single-system performance of the 2nd and 4th ranking systems on the task 1-a fold-1 test set. The result demonstrates the effectiveness of the proposed approach. The proposed system using SKD shows the second best performance, 0.24 % lower than that of Huang et al.'s work, but requires almost forty times fewer parameters.

TABLE 8. Comparison with state-of-the-art systems on the DCASE 2019 task 1-a challenge. Three systems selected for comparison adopt two-channel Mel-spectrograms as the input feature, without any ensemble technique (-: Not reported).

VII. CONCLUSION
In this research, we adapted a knowledge distillation framework for the ASC task with two purposes. One was to mitigate the misclassification of a few pairs of confusing acoustic scenes using soft labels that reflect the similarity of acoustic properties between different classes. The other was to develop a single DNN that demonstrates performance comparable to a score-level ensemble of multiple DNNs by conducting distillation from multiple DNNs into a single DNN. We proposed various techniques, such as configuring superiority for soft labels with multiple segments and temperatures, to make knowledge distillation successful in the ASC task. We also adapted a scheme of applying specialist DNNs to the ASC task by configuring each specialist DNN to concentrate on discriminating a target pair of acoustic scenes, and proposed techniques using biased mini-batch construction and multi-head attention for the specialists. Utilizing the various proposed methods, a single DNN fed by raw waveforms demonstrated an accuracy of 77.63 % on the test set of the fold-1 configuration of the DCASE 2019 task 1-a dataset.
JEE-WEON JUNG (Member, IEEE) received the B.S. degree in computer science and the B.A. degree in business administration from the University of Seoul, South Korea, in 2017. He is currently pursuing the Ph.D. degree. His research interests include speaker recognition, acoustic scene classification, audio spoofing detection, and deep learning.
HEE-SOO HEO (Member, IEEE) received the B.S. and Ph.D. degrees in computer science from the University of Seoul, South Korea, in 2013 and 2019, respectively. Since 2020, he has been an AI Researcher at Naver Corporation. His research interests include machine learning, speaker recognition, acoustic scene classification, and image recognition.