Drum Sound Detection in Polyphonic Music with Hidden Markov Models

This paper proposes a method for transcribing drums from polyphonic music using a network of connected hidden Markov models (HMMs). The task is to detect the temporal locations of unpitched percussive sounds (such as bass drum or hi-hat) and recognise the instruments played. Contrary to many earlier methods, a separate sound event segmentation is not done, but connected HMMs are used to perform the segmentation and recognition jointly. Two ways of using HMMs are studied: modelling combinations of the target drums and a detector-like modelling of each target drum. Acoustic feature parametrisation is done with mel-frequency cepstral coefficients and their first-order temporal derivatives. The effect of lowering the feature dimensionality with principal component analysis and linear discriminant analysis is evaluated. Unsupervised acoustic model parameter adaptation with maximum likelihood linear regression is evaluated for compensating the differences between the training and target signals. The performance of the proposed method is evaluated on a publicly available data set containing signals with and without accompaniment, and compared with two reference methods. The results suggest that the transcription is possible using connected HMMs, and that using detector-like models for each target drum provides a better performance than modelling drum combinations.


Introduction
This paper applies connected hidden Markov models (HMMs) to the transcription of drums from polyphonic musical audio. For brevity, the word "drum" is here used to refer to all the unpitched percussions met in Western pop/rock music, such as bass drum, snare drum, and cymbals. The word "transcription" is used to refer to the process of locating drum sound onset instants and recognising the drums played. The analysis result enables several applications, such as using the transcription to assist beat tracking [1], drum track modification in the audio [2], reusing the drum patterns from existing audio, or musical studies on the played patterns.
Several methods have been proposed in the literature to solve the drum transcription problem. Following the categorisation made in [3,4], the majority of the methods can be viewed as either "segment and classify" or "separate and detect" approaches. The methods in the first category operate by segmenting the input audio into meaningful events, and then attempt to recognise the content of the segments. The segmentation can be done by detecting candidate sound onsets or by creating an isochronous temporal grid coinciding with most of the onsets. After the segmentation, a set of features is extracted from each segment, and a classifier is employed to recognise the contents. The classification method varies from a naive Bayes classifier with Gaussian mixture models (GMMs) [5] to support vector machines (SVMs) [4,6] and decision trees [7].
The methods in the second category aim to segregate each target drum into a separate stream and to detect sound onsets within the streams. The separation can be done with unsupervised methods like sparse coding [8] or independent subspace analysis (ISA) [9], but these require recognising the instruments from the resulting streams. The recognition step can be avoided by utilising prior knowledge of the target drums in the form of templates, and applying a supervised source separation method. Combining ISA with drum templates produces a method called prior subspace analysis (PSA) [10]. PSA represents the templates as magnitude spectrograms and estimates the gains of each template over time. The possible negative values in the gains do not have a physical interpretation and require heuristic post-processing. This problem was solved using nonnegative matrix factorisation (NMF), which restricts the component spectra and gains to be nonnegative. This approach was shown to perform well when the target signal matches the model (signals containing only target drums) [11].
EURASIP Journal on Audio, Speech, and Music Processing
Some methods cannot be assigned to either of the categories above. These include template matching and adaptation methods operating with time-domain signals [12], or with a spectrogram representation [13].
The main weakness of the "segment and classify" methods is the segmentation. The classification phase is not able to recover any events missed in the segmentation without an explicit error correction scheme, such as the one in [14]. If a temporal grid is used instead of onset detection, most of the events will be found, but the expressivity lying in the small temporal deviations from the grid is lost, and problems with the grid generation will be propagated to subsequent analysis stages.
To avoid making any decisions in the segmentation, this paper proposes to use a network of connected HMMs in the transcription in order to locate sound onsets and recognise the contents jointly. The target classes for recognition can be either combinations of drums or detectors for each drum. In the first approach, the recognition dictionary consists of combinations of target drums with one model to serve as the background model when no combination is played, and the task is to cover the input signal with these models. In the detector approach, each individual target drum is associated with two models: a "sound" model and a "silence" model, and the input signal is covered with these two models for each target drum independently from the others.
In addition to the HMM baseline system, the use of model adaptation with maximum likelihood linear regression (MLLR) will be evaluated. MLLR adapts the acoustic models from training to better match the specific input.
The rest of this article is organised as follows: Section 2 describes the proposed HMM-based transcription method, Section 3 details the evaluation setup and presents the obtained results, and Section 4 presents the conclusions of the paper. Parts of this work have been published earlier in [15,16].

Proposed Method
Figure 1 shows an overview of the proposed method. The input audio is subjected to sinusoids-plus-residual modelling to suppress the effect of nondrum instruments by using only the residual. Then the signal is subdivided into short frames from which a set of features is extracted. The features serve as observations in HMMs that have been constructed in the training phase. The trained models are adapted with unsupervised maximum likelihood linear regression [17] to match the transcribed signal more closely. Finally, the transcription is done by searching for an optimal path through the HMMs with the Viterbi algorithm. The steps are described in more detail in the following.

Feature Extraction and Transformation.
It has been noted, for example, in [13,18], that suppressing tonal spectral components improves the accuracy of drum transcription. This is no surprise, as the common drums in a pop/rock drum kit contain a notable stochastic component and relatively little tonal energy. The idiophones (e.g., cymbals) in particular produce a mostly noise-like signal, while the membranophones (skinned drums) may also contain tonal components [19]. The harmonic suppression is done here with simple sinusoids-plus-residual modelling [20,21]. The signal is subdivided into 92.9 ms frames, the spectrum is calculated with the discrete Fourier transform, and the 30 sinusoids with the largest magnitudes are selected by locating the 30 largest local maxima in the magnitude spectrum. The sinusoids are then synthesised and the resulting signal is subtracted from the original signal. The residual serves as the input to the following analysis stages. Even though the processing may remove some of the tonal components of the membranophones, the remaining ones and the stochastic components are enough for the recognition. Preliminary experiments also suggest that the exact number of removed components is not important: even doubling the number to 60 caused only an insignificant drop in performance.
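The peak-picking step of the sinusoids-plus-residual stage can be illustrated with a minimal sketch. It only covers locating the largest local maxima of one frame's magnitude spectrum; the synthesis and subtraction of the sinusoids are not shown. The 44.1 kHz sample rate (which makes 4096 samples roughly 92.9 ms), the Hann window, and the function name `largest_spectral_peaks` are assumptions made for illustration and are not stated in the text.

```python
import numpy as np

def largest_spectral_peaks(frame, num_peaks=30):
    """Return bin indices of the strongest local maxima of the magnitude spectrum."""
    window = np.hanning(len(frame))  # assumed window; the text does not name one
    magnitude = np.abs(np.fft.rfft(frame * window))
    # A bin is a local maximum if it exceeds both of its neighbours.
    interior = magnitude[1:-1]
    is_peak = (interior > magnitude[:-2]) & (interior > magnitude[2:])
    peak_bins = np.flatnonzero(is_peak) + 1
    # Keep the num_peaks strongest peaks and return them in ascending bin order.
    strongest = peak_bins[np.argsort(magnitude[peak_bins])[::-1][:num_peaks]]
    return np.sort(strongest)

# Toy frame: two sinusoids plus weak noise (assumed 44.1 kHz sample rate).
sample_rate = 44100
frame_length = 4096  # about 92.9 ms at 44.1 kHz
t = np.arange(frame_length) / sample_rate
rng = np.random.default_rng(0)
frame = (np.sin(2 * np.pi * 440 * t)
         + 0.5 * np.sin(2 * np.pi * 1000 * t)
         + 0.01 * rng.standard_normal(frame_length))

peaks = largest_spectral_peaks(frame, num_peaks=30)
peak_freqs = peaks * sample_rate / frame_length  # bin index to Hz
```

In a full implementation, each selected peak would parametrise one sinusoid to be synthesised and subtracted from the frame, leaving the residual used by the later stages.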
The feature extraction calculates 13 mel-frequency cepstral coefficients (MFCCs) in 46.4 ms frames with 75% overlap [22]. In addition to the MFCCs, their first-order temporal derivatives are estimated. The zeroth coefficient, which is often discarded, is also used. MFCCs have proven to work well in a variety of acoustic signal content analysis tasks including instrument recognition [23]. In addition to the MFCCs and their temporal derivatives, other spectral features, such as band energy ratios, spectral kurtosis, skewness, flatness, and slope, used, for example, in [6], were considered for the feature set. However, preliminary experiments suggested that their inclusion reduces the overall performance slightly, and they are not used in the presented results. The reason for this degradation is an open question to be addressed in future work, but it is assumed that the features do not contain enough additional information compared to the original set to compensate for the increased modelling requirements.
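As a rough illustration of how 13 MFCCs and their first-order temporal derivatives form 26-dimensional observations, the sketch below appends a delta estimate to each frame. The text does not specify the derivative estimator, so the central difference with edge replication used here, and the helper name `add_deltas`, are assumptions.

```python
def add_deltas(mfcc_frames):
    """Append first-order temporal derivatives to each frame of coefficients.

    mfcc_frames: list of frames, each a list of coefficients.
    The derivative is a central difference; the first and last frames reuse
    themselves as the missing neighbour (an assumed edge treatment).
    """
    num_frames = len(mfcc_frames)
    features = []
    for t in range(num_frames):
        previous = mfcc_frames[max(t - 1, 0)]
        following = mfcc_frames[min(t + 1, num_frames - 1)]
        delta = [(b - a) / 2.0 for a, b in zip(previous, following)]
        features.append(list(mfcc_frames[t]) + delta)
    return features

# Toy input: 5 frames of 13 coefficients that grow by 1 per frame.
toy_mfcc = [[float(t)] * 13 for t in range(5)]
features = add_deltas(toy_mfcc)
```

Each output frame holds the 13 static coefficients followed by their 13 deltas.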
The resulting 26-dimensional feature vectors are normalised to have zero mean and unit variance in each feature dimension over the training data. Then the feature matrix is subjected to dimensionality reduction. Though unsupervised transformation with principal component analysis (PCA) has been successfully used in some earlier publications, for example, [24], it did not perform well in our experiments. It is assumed that this is because PCA attempts only to describe the variance of the data without class information, and it may be distracted by the amount of noise present in the data. The feature transformation used here is calculated with linear discriminant analysis (LDA). LDA is a class-aware transformation attempting to minimise intra-class scatter while maximising inter-class separation. If there are N different classes, LDA produces a transformation to N − 1 feature dimensions.
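The normalisation and LDA steps can be sketched with NumPy as follows. This is a textbook scatter-matrix formulation, not necessarily the implementation used for the reported results; the function names and the synthetic three-class data are hypothetical.

```python
import numpy as np

def fit_lda(features, labels):
    """Return (mean, std, projection) for z-score normalisation followed by LDA."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # guard against constant dimensions
    x = (features - mean) / std
    classes = np.unique(labels)
    dim = x.shape[1]
    within = np.zeros((dim, dim))
    between = np.zeros((dim, dim))
    for c in classes:
        xc = x[labels == c]
        mc = xc.mean(axis=0)
        within += (xc - mc).T @ (xc - mc)
        # The global mean is zero after normalisation over the same data.
        between += len(xc) * np.outer(mc, mc)
    # Directions that maximise between-class relative to within-class scatter.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(within) @ between)
    order = np.argsort(eigvals.real)[::-1][:len(classes) - 1]
    return mean, std, eigvecs[:, order].real

def apply_lda(features, mean, std, projection):
    """Normalise with the training statistics and project to N - 1 dimensions."""
    return ((np.asarray(features, dtype=float) - mean) / std) @ projection

# Synthetic example: 3 classes in 26 dimensions reduce to 2 dimensions.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 50)
class_offsets = 2.0 * rng.standard_normal((3, 26))
train = rng.standard_normal((150, 26)) + class_offsets[labels]
mean, std, projection = fit_lda(train, labels)
reduced = apply_lda(train, mean, std, projection)
```

The normalisation statistics and the projection are estimated on training data only and then reused unchanged for the test signals.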

HMM Topologies.
Two different ways to utilise connected HMMs for drum transcription are considered: drum sound combination modelling and detector models for each target drum. In the first case, each of the $2^M$ combinations of $M$ target drums is modelled with a separate HMM. In the latter case, each target drum has two separate models: a "sound" model and a "silence" model. In both approaches the recognition aims to find a sequence of the models providing the optimal description of the input signal. Figure 2 illustrates the decoding with combination modelling, while Figure 3 illustrates the decoding with drumwise detectors.
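The difference in model count between the two topologies can be made concrete with a short sketch. It enumerates all subsets of a drum set, treating the empty subset as the background model (an interpretation assumed here), and lists the two models per drum needed by the detector approach. The drum labels are illustrative.

```python
from itertools import combinations

def drum_combinations(drums):
    """Enumerate every subset of the target drums, including the empty one."""
    combos = []
    for size in range(len(drums) + 1):
        combos.extend(combinations(drums, size))
    return combos

drums = ["BD", "SD", "HH"]

# Combination modelling: 2^M models; the empty tuple stands for the background.
combos = drum_combinations(drums)

# Detector modelling: a "sound" and a "silence" model for each drum, 2M in total.
detector_models = [(drum, kind) for drum in drums for kind in ("sound", "silence")]
```

With three target drums this gives 8 combination models against 6 detector models; with ten drums the counts would be 1024 against 20, which shows how quickly the combination approach runs out of training data per model.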
The main motivation for the combination modelling is that in popular music multiple drums are often hit simultaneously. However, the main weakness is that as the number of target drums increases, the number of combinations to be modelled increases rapidly. Since only the few most frequent combinations cover most of the occurrences, as illustrated in Figure 4, there is very little training data for the rarer combinations. Furthermore, it may be difficult to determine whether or not some softer sound is present in a combination (e.g., when kick and snare drums are played, the presence of hi-hat may be difficult to detect from the acoustic information), and a wrong combination may be recognised.
With detector models, the training data can be utilised more efficiently than with combination models, because all combinations containing the target drum can be used to train the model. Another difference in the training phase is that each drum has a separate silence (or background) model.
As will be shown in Section 3, the detector topology generally outperforms the combination modelling, which was found to have problems with overfitting the limited amount of training data. This was indicated by the following observations: the performance degraded when the number of HMM training iterations was increased and when acoustic adaptation was applied, and it improved slightly with simpler models and reduced feature dimensions. Because of this, the results on acoustic model adaptation and feature transformations are presented only for the detector topology (a similar choice was made, e.g., in [4]). For the sake of comparison, however, results are also reported for the combination modelling baseline.
The sound models consist of a four-state left-to-right HMM where a transition is allowed to the state itself and to the following state. The observation likelihoods are modelled with single Gaussian distributions. The silence model is a single-state HMM with a 5-component GMM for the observation likelihoods. This topology was chosen because the background sound does not have a clear sequential form. The number of states and GMM components were empirically determined.
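The left-to-right constraint of the "sound" model can be expressed as a transition matrix in which each state may only stay in place or move to the next state. The sketch below builds such a matrix with an extra column for leaving the model; the initial self-transition probability of 0.5 is an arbitrary assumption, since the actual values are learned in training.

```python
def left_to_right_transitions(num_states=4, self_prob=0.5):
    """Build a left-to-right transition matrix.

    Rows are source states; columns are target states, with one extra final
    column representing the exit from the model. Only self-transitions and
    transitions to the following state are nonzero.
    """
    matrix = [[0.0] * (num_states + 1) for _ in range(num_states)]
    for state in range(num_states):
        matrix[state][state] = self_prob
        matrix[state][state + 1] = 1.0 - self_prob
    return matrix

sound_transitions = left_to_right_transitions(num_states=4, self_prob=0.5)
```

Entries that start at zero stay at zero during expectation-maximisation re-estimation, so the topology is preserved by training.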
The models are trained with the expectation-maximisation algorithm [26] using segmented training examples. The segments are extracted starting from the annotated event onsets using a maximum duration of 10 frames. If another onset occurs within this limit, the segment is truncated accordingly. In detector modelling, the training instances for the "sound" model are generated from the segments containing the target drum, and the remaining frames are used to train the "silence" model. In combination modelling, the training instances for each combination are collected from the data, and the remaining frames are used to train the background model.
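The segment extraction rule can be sketched as follows, with onsets expressed as frame indices and segments as half-open frame ranges. Whether the truncating onset may belong to any drum or only to the target drum is not spelled out in the text; this sketch simply truncates at the next onset in the list it is given. The function name is hypothetical.

```python
def training_segments(onset_frames, max_len=10, num_frames=None):
    """Return (start, end) frame ranges, end exclusive, one per annotated onset.

    Each segment starts at an onset and lasts at most max_len frames; it is
    cut short at the next onset and at the end of the signal.
    """
    onsets = sorted(set(onset_frames))
    segments = []
    for k, start in enumerate(onsets):
        end = start + max_len
        if k + 1 < len(onsets):
            end = min(end, onsets[k + 1])
        if num_frames is not None:
            end = min(end, num_frames)
        segments.append((start, end))
    return segments

segments = training_segments([0, 4, 30], max_len=10, num_frames=35)
```

Frames not covered by any segment of the relevant class would then be assigned to the "silence" (or background) model.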

Acoustic Adaptation.
Unsupervised acoustic adaptation with maximum likelihood linear regression (MLLR) [17] has been successfully used to adapt the HMM observation density parameters, for example, in adapting speaker-independent models to speaker-dependent models in speech recognition [17], in language adaptation from Spanish to Valencian [27], and in utilising a recognition database trained for phone speech to recognise speech in car conditions [28]. The motivation for using MLLR here is the assumption that the acoustic properties of the target signal always differ from those of the training data, and that the match between the model and the observations can be improved with adaptation. The adaptation is done for each target signal independently to provide models that fit the specific signal better. The adaptation is evaluated only for the detector topology, because for drum combinations the adaptation was not successful, most likely due to the limited number of observations.

Figure 3: Illustration of the basic idea of drum transcription with HMM-based drum detectors. Each target drum is associated with two models, "sound" and "silence", and the decoding is done for each drum separately.
In single-variable MLLR for the mean parameter, a transformation matrix $W$ is used to apply a linear transformation to the GMM mean vector $\mu$ so that the likelihood of the adaptation data is maximised. The mean vector $\mu$ of length $n$ is transformed by

$$\hat{\mu} = W \, [\omega, \mu^{T}]^{T},$$

where the transformation matrix $W$ has the dimensions $n \times (n + 1)$, and $\omega = 1$ is a bias parameter. The nonzero elements of $W$ can be organised into a vector $w = [w_{1,1}, \ldots, w_{n,1}, w_{1,2}, \ldots, w_{n,n+1}]^{T}$.
The value of the vector can be calculated by

$$w = \left[ \sum_{s} \sum_{t} \gamma_s(t) \, D_s^{T} C_s^{-1} D_s \right]^{-1} \left[ \sum_{s} \sum_{t} \gamma_s(t) \, D_s^{T} C_s^{-1} o(t) \right],$$

where $t$ is the frame index, $o(t)$ is the observation vector from frame $t$, $s$ is an index of GMM components in the HMM, $C_s$ is the covariance matrix of GMM component $s$, $\gamma_s(t)$ is the occupation probability of the $s$th component in frame $t$ (calculated, e.g., with the forward-backward algorithm), and the matrix $D_s$ is defined as a concatenation of two diagonal matrices

$$D_s = [\omega I, \operatorname{diag}(\mu_s)],$$

where $\mu_s$ is the mean vector of the $s$th component and $I$ is an $n \times n$ identity matrix [17]. In addition to the single-variable mean transformation, full matrix mean transformation [17] and variance transformation [29] were also tested. In the evaluations, the single-variable adaptation performed better than the full matrix mean transformation, and therefore the results are presented only for it. Variance transformation reduced performance in all cases.
The adaptation is done so that the signal is first analysed with the original models. Then it is segmented into examples of either class ("sound"/"silence") based on the recognition result, and the segments are used to adapt the corresponding models. The adaptation can be repeated using the models from the previous adaptation iteration for segmentation. It was found in the evaluations that applying the adaptation three times produced the best result, even though the improvement obtained after the first adaptation was usually very small. Increasing the number of adaptation iterations beyond three started to degrade the results.
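The single-variable mean transformation can be sketched in NumPy by solving the weighted least-squares problem that the description above implies: each adapted mean is a per-dimension scaled and shifted copy of the original mean, and the scale and shift are shared by all components. The function name, the argument layout, and the toy data are assumptions made for illustration; the occupation probabilities are taken as given rather than computed with the forward-backward algorithm.

```python
import numpy as np

def mllr_single_variable(means, covariances, observations, gamma, omega=1.0):
    """Estimate w = [bias terms, scale terms] and return (w, adapted means).

    means:        (S, n) component mean vectors
    covariances:  (S, n, n) component covariance matrices
    observations: (T, n) adaptation frames
    gamma:        (T, S) occupation probabilities of each component per frame
    """
    means = np.asarray(means, dtype=float)
    observations = np.asarray(observations, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    num_components, n = means.shape
    lhs = np.zeros((2 * n, 2 * n))
    rhs = np.zeros(2 * n)
    for s in range(num_components):
        # D_s concatenates two diagonal matrices: omega * I and diag(mu_s).
        d_s = np.hstack([omega * np.eye(n), np.diag(means[s])])
        dt_cinv = d_s.T @ np.linalg.inv(covariances[s])
        occupancy = gamma[:, s].sum()
        weighted_obs = gamma[:, s] @ observations  # sum over t of gamma * o(t)
        lhs += occupancy * (dt_cinv @ d_s)
        rhs += dt_cinv @ weighted_obs
    w = np.linalg.solve(lhs, rhs)
    adapted = np.array(
        [np.hstack([omega * np.eye(n), np.diag(m)]) @ w for m in means]
    )
    return w, adapted

# Toy check: observations are exactly scale * mean + bias for two components.
means = np.array([[1.0, 2.0], [3.0, -1.0]])
covariances = np.stack([np.eye(2), np.eye(2)])
true_bias = np.array([0.5, -0.2])
true_scale = np.array([2.0, 0.5])
observations = np.vstack([
    np.tile(true_scale * means[0] + true_bias, (3, 1)),
    np.tile(true_scale * means[1] + true_bias, (3, 1)),
])
gamma = np.vstack([np.tile([1.0, 0.0], (3, 1)), np.tile([0.0, 1.0], (3, 1))])
w, adapted = mllr_single_variable(means, covariances, observations, gamma)
```

The linear system is only solvable when the adapted components have at least two distinct mean values in every dimension; with a single component the scale and bias cannot be separated.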

Recognition.
In the recognition phase, the (adapted) HMMs are combined into a larger compound model; see Figures 2 and 3. This is done by concatenating the state transition matrices of the individual HMMs and incorporating the intermodel transition probabilities in the same matrix. The transition probabilities between the models are estimated from the same material that is used for training the acoustic models, and the bigram probabilities are smoothed with Witten-Bell smoothing [30]. The compound model is then used to decode the sequence with the Viterbi algorithm. An alternative would be to use the token passing algorithm [31], but since the model satisfies the first-order Markov assumption (only bigrams are used), Viterbi decoding is sufficient.
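The decoding step can be illustrated with a compact log-domain Viterbi implementation. For brevity the toy compound model below has only two states, one "silence" state and one "sound" state, rather than the four-state sound model described earlier; all probabilities are made-up values, not estimates from the data.

```python
import math

def viterbi(log_init, log_trans, log_obs):
    """Return the most likely state path.

    log_init[j]:     log probability of starting in state j
    log_trans[i][j]: log probability of moving from state i to state j
    log_obs[t][j]:   log-likelihood of frame t under state j
    """
    num_states = len(log_init)
    score = [log_init[j] + log_obs[0][j] for j in range(num_states)]
    back_pointers = []
    for t in range(1, len(log_obs)):
        new_score, pointers = [], []
        for j in range(num_states):
            best_i = max(range(num_states), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + log_obs[t][j])
            pointers.append(best_i)
        score = new_score
        back_pointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def logs(values):
    return [math.log(v) for v in values]

# State 0 = "silence", state 1 = "sound" (illustrative probabilities).
log_init = logs([0.9, 0.1])
log_trans = [logs([0.9, 0.1]), logs([0.5, 0.5])]
log_obs = [logs(frame) for frame in
           [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.9, 0.1]]]
path = viterbi(log_init, log_trans, log_obs)
```

In the detector setting, an onset would be reported at each frame where the decoded path enters the first state of the "sound" model.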

Results
The performance of the proposed method is evaluated using the publicly available data set "ENST drums" [25]. The data set allows adjusting the accompaniment (everything else but the drums) level in relation to the drum signal, and two different levels are used in the evaluations: a balanced mix and a drums-only signal. The performance of the proposed method is compared with two reference systems: a "segment and classify" method by Tanghe et al. [6], and a supervised "separate and detect" method using nonnegative matrix factorisation [11].

Acoustic Data.
The data set "ENST drums" contains multichannel recordings of three drummers playing with different drum kits. In addition to the original multichannel recordings, two downmixes are provided: "dry" with minimal effects, mainly having only the levels of different drums balanced, and "wet" resembling the drum tracks on commercial recordings, containing some effects and compression. The material in the data set ranges from individual hits to stereotypical phrases, and finally to longer tracks played along with an accompaniment. These "minus one" tracks have the synchronised accompaniment available as a separate signal, making it possible to create polyphonic signals with custom mixing levels. The ground truth for the data set contains the onset times for the different drums, and was provided with the data set.
The "minus one" tracks are used as the evaluation data. They are naturally split into three subsets based on the player and kit, each having approximately the same number of tracks (two with 21 tracks and one with 22). The lengths of the tracks range from 30 s to 75 s with a mean duration of 55 s. The mixing ratios of drums and accompaniment used in the evaluations are drums-only and a "balanced" mix. The former is used to obtain a baseline result for the system with no accompaniment. The latter, corresponding to applying scaling factors of 2/3 for the drum signal and 1/3 for the accompaniment, is then used to evaluate the system performance in realistic conditions met in polyphonic music. (The mixing levels are based on personal communication with Gillet, and result in an average drums-to-accompaniment ratio of −1.25 dB over the whole data set.)

Evaluation Setup.
Evaluations are run using a three-fold cross-validation scheme. Data from two drummers are used to train the system and the data from the third are used for testing, and the division is repeated three times. This setup guarantees that the acoustic models have not seen the test data and their generalisation capability will be tested. In fact, the sounds of the corresponding drums in different kits may differ considerably (e.g., depending on the tension of the skin, the use of muffling in case of kick drum, or the instrument used to hit the drum that can be a mallet, a stick, rods, or brushes) and using only two examples of a certain drum category to recognise a third one is a difficult problem. Hence, in real applications the training should be done with as diverse data as possible.
The target drums in the evaluations are bass drum (BD), snare drum (SD), and hi-hat (HH). The target set is limited to these three for two main reasons. Firstly, they are found in practically every track in the evaluation data and they cover a large portion of all the drum sound events, as can be seen from Figure 5. Secondly, and more importantly, these three instruments convey the main rhythmic feel of most popular music songs, and occur in a relatively similar way in all the kits.
In the evaluation of the transcription result, the found target drum onset locations are compared with the locations given in the ground truth annotation. The hits are matched to the closest hit in the other set so that each hit has at most one hit associated with it. A transcribed onset is accepted as correct if the absolute time difference to the ground truth onset is less than 30 ms. (When comparing with the results obtained with the same data set in [4], it should be noted that there the allowed deviation was 50 ms.) When the number of events is $G$ in the ground truth and $E$ in the transcription result, and the numbers of missed ground truth events and inserted events are $m$ and $i$, respectively, the transcription performance can be described with the precision rate

$$P = \frac{E - i}{E}$$

and the recall rate

$$R = \frac{G - m}{G}.$$

These two metrics can be further summarised by their harmonic mean, the F-measure

$$F = \frac{2PR}{P + R}.$$
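The evaluation procedure can be sketched as below. The one-to-one matching is implemented greedily, pairing the closest onsets first; the text does not specify the matching algorithm beyond the closest-hit rule, so this choice and the function name are assumptions. Onset times are in seconds.

```python
def evaluate_onsets(reference, estimated, tolerance=0.030):
    """Return (precision, recall, f_measure) for one target drum.

    An estimated onset is correct if it is paired with a reference onset that
    is less than `tolerance` seconds away; each onset is used at most once.
    """
    candidate_pairs = sorted(
        (abs(ref - est), ref_idx, est_idx)
        for ref_idx, ref in enumerate(reference)
        for est_idx, est in enumerate(estimated)
        if abs(ref - est) < tolerance
    )
    used_ref, used_est = set(), set()
    for _, ref_idx, est_idx in candidate_pairs:
        if ref_idx not in used_ref and est_idx not in used_est:
            used_ref.add(ref_idx)
            used_est.add(est_idx)
    correct = len(used_ref)  # equals E - i and G - m
    precision = correct / len(estimated) if estimated else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

reference = [0.50, 1.00, 1.50, 2.00]
estimated = [0.51, 1.04, 1.49, 2.50, 2.01]
precision, recall, f_measure = evaluate_onsets(reference, estimated)
```

In this example three of the five transcribed onsets fall within 30 ms of a ground truth onset, so the precision is 0.6 and the recall is 0.75; the onset at 1.04 s is rejected because it deviates by 40 ms.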

Reference Methods.
The system performance is compared with two earlier methods: a "segment and classify" method by Tanghe et al. [6] and a "separate and detect" method by Paulus and Virtanen [11]. The former, referred to as SVM in the results, was designed for transcribing drums from polyphonic music by detecting sound onsets and then classifying the sounds with binary SVMs for each target drum. An implementation by the original authors is used [32]. The latter, referred to as NMF-PSA, was designed for transcribing drums from a signal without accompaniment. The method uses spectral templates for each target drum and estimates their time-varying gains using NMF. Onsets are detected from the recovered gains. Here, too, the original implementation is used. The models for the SVM method are not trained specifically for the data used; instead, the generic models provided with the implementation are used. The spectral templates for NMF-PSA are calculated from the individual drum hits in the data set used here. In the original publication, the mid-level representation had a spectral resolution of five bands. Here these are replaced with 24 Bark bands for improved frequency resolution.

Results.
The evaluation results are given in Tables 1 and 2. The former contains the evaluation results in the case of the "balanced" mixture as the input, while the latter contains the results for signals without accompaniment. The methods are referred to as follows:
(i) HMM: the proposed HMM method with detectors for each target drum, without acoustic adaptation,
(ii) HMM + MLLR: the proposed detector-like HMM method including the acoustic model adaptation with MLLR,
(iii) HMM comb: the proposed HMM method with drum combinations, without acoustic adaptation,
(iv) NMF-PSA: a "separate and detect" method using NMF for the source separation, proposed in [11],
(v) SVM: a "segment and classify" method proposed in [6], using SVMs for detecting the presence of each target drum in the located segments.
The results show that the proposed method performs best among the evaluated methods. In addition, it can be seen that the acoustic adaptation slightly improves the recognition result. All the evaluated methods seem to have problems in transcribing the snare drum (SD), even without the presence of accompaniment. One reason for this is that the snare drum is often played in more diverse ways than, for example, the bass drum. Examples of these include producing the excitation with sticks or brushes, playing with and without the snare belt, and producing barely audible "ghost hits".
When analysing the results of "segment and classify" methods, it is possible to distinguish between errors in segmentation and classification. However, since the proposed method aims to perform these tasks jointly, acting as a specialised onset detection method for each target drum, this distinction cannot be made.
An earlier evaluation with the same data set was presented in [4, Table II]. The section "Accompaniment +0 dB" of that table corresponds to the results presented in Table 1, and the section "Accompaniment −∞ dB" corresponds to the results in Table 2. In both cases, the proposed method clearly outperforms the earlier method in bass drum and hi-hat transcription accuracy. However, the performance of the proposed method on snare drum is slightly worse.
The improvement obtained using the acoustic model adaptation is relatively small. Measuring the statistical significance with a two-tailed unequal-variance Welch's t-test [33] on the F-measures for individual test signals produces a P-value of approximately 0.64 for the balanced mix test data and 0.18 for the data without accompaniment, suggesting that the difference in the results is not statistically significant. However, the adaptation seems to provide a better balance between precision and recall rates. The performance differences between the proposed detector-like HMMs and the other methods are clearly in favour of the proposed method.
Table 3 provides the evaluation results with different feature transformation methods while using detector-like HMMs without acoustic adaptation. The results show that PCA has a very small effect on the overall performance while LDA provides a considerable improvement.
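For reference, the statistic behind the unequal-variance Welch's t-test used in the significance comparison can be computed with the standard library alone. The sketch below returns the t statistic and the Welch-Satterthwaite degrees of freedom; it does not compute the P-value, which requires the CDF of the t distribution (for example via `scipy.stats.ttest_ind` with `equal_var=False`). The sample values are made up and are not the F-measures from the evaluations.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof

# Hypothetical per-signal scores for two systems.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```

A small absolute t statistic relative to the spread of the per-signal scores corresponds to a large P-value, as reported for the adaptation comparison.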

Conclusions
This paper has studied and evaluated different ways of using connected HMMs for transcribing drums from polyphonic music. The proposed detector-type approach is relatively simple with only two models for each target drum: a "sound" and a "silence" model. In addition, modelling of drum combinations instead of detectors for individual drums was investigated, but found not to work very well. It is likely that the problems with the combination models are caused by overfitting the training data. The acoustic front end extracts mel-frequency cepstral coefficients (MFCCs) and their first-order derivatives to be used as the acoustic feature. Comparison of feature transformations suggests that LDA provides a considerable performance increase with the proposed method. Acoustic model adaptation with MLLR is tested, but the obtained improvement is relatively small. The proposed method produces a relatively good transcription of bass drum and hi-hat, but snare drum recognition has some problems that need to be addressed in future work. The main finding is that it is not necessary to have a separate segmentation step in a drum transcriber, but the segmentation and recognition can be performed jointly with an HMM even in the presence of accompaniment and with bad signal-to-noise ratios.