Elsevier

Signal Processing

Volume 92, Issue 11, November 2012, Pages 2723-2737

Music genre classification using LBP textural features

https://doi.org/10.1016/j.sigpro.2012.04.023

Abstract

In this paper we present an approach to music genre classification that converts an audio signal into spectrograms and extracts texture features from these time-frequency images, which are then used to model music genres in a classification system. The texture features are based on the Local Binary Pattern (LBP), a structural texture operator that has been successful in recent image classification research. Experiments are performed on two well-known datasets: the Latin Music Database (LMD) and the ISMIR 2004 dataset. The proposed approach considers different zoning mechanisms to perform local feature extraction, and results obtained with and without local feature extraction are compared. We compare the performance of the texture features with that of commonly used audio content-based features (i.e., from the MARSYAS framework) and show that the texture features always outperform the audio content-based features. We also compare our results with results from the literature. On the LMD, the performance of our approach reaches about 82.33%, above the best result obtained in the MIREX 2010 competition on that dataset. On the ISMIR 2004 database, the best result obtained is about 80.65%, i.e., below the best result on that dataset found in the literature.

Highlights

► Music genre classification using LBP texture descriptors extracted from spectrograms.
► Evaluate performance with local feature extraction and with global feature extraction.
► Compare results with the state of the art on the Latin Music Database and on the ISMIR 2004 database.

Introduction

With the rapid expansion of the Internet, a huge amount of data from different sources has become available online. Studies indicate that in 2007 the amount of digital data scattered around the world amounted to about 281 exabytes. In 2011, the amount of digital information produced in that year was expected to reach nearly 1800 exabytes, or 10 times that produced in 2006 [1].

Among all the different sources of information, music is certainly one that can benefit from this impressive growth, since it can be shared by users with different backgrounds and education, easily crossing cultural and language barriers [2]. In general, indexing and retrieving music is based on meta-information tags such as ID3 tags. This metadata includes information such as song title, artist, album, year, and musical genre [3]. Among all these descriptors, musical genre is probably the most obvious one that comes to mind, and it is probably the most widely used to organize and manage large digital music databases [4].

Previous works point to several reasons that motivate research on automatic music genre classification. McKay and Fujinaga [5] pointed out that individuals differ in how they classify a given recording, but they can also differ in terms of the pool of genre labels from which they choose. On the other hand, Gjerdingen and Perrot [6] claimed that people are consistent in their genre categorization, even when these categorizations are wrong or are made from very short segments. Pachet and Cazaly [7] showed that some traditional music taxonomies, such as those of the music industry and Internet taxonomies, are very inconsistent. Finally, Pampalk [8] argues that genre classification-based evaluations can be used as a proxy for listening tests of music similarity.

According to Lidy et al. [9] there are different approaches to describe the contents of a given piece of music. The most commonly used is the content-based approach, which extracts representative features from the digital audio signal. Other approaches, such as semantic analysis and community metadata, have proved to perform well for traditional Western music; however, their use for other kinds of music is compromised because both community metadata and lyrics-based approaches depend on natural language processing (NLP) tools, which are typically more developed for English than for other languages.

In the case of the content-based approach, one of the earliest works was introduced by Tzanetakis and Cook [10], who represented a music piece using timbral texture, beat-related, and pitch-related features. The employed feature set has become publicly available as part of the MARSYAS framework (Music Analysis, Retrieval and SYnthesis for Audio Signals) and has been widely used for music genre recognition [3], [9], [11]. Other features, such as Inter-Onset Interval Histogram Coefficients, Rhythm Patterns and their derivatives Statistical Spectrum Descriptors and Rhythm Histograms, have been proposed in the literature more recently [12], [13], [14].

Despite all the efforts of recent years, automatic music genre classification still remains an open problem. McKay and Fujinaga [5] pointed out some problematic aspects of genre and referred to experiments in which human beings were not able to correctly classify more than 76% of the music pieces [15]. Although more experimental evidence is needed, these experiments give some insight into the upper bounds on software performance. McKay and Fujinaga also suggest that different approaches should be proposed to achieve further improvements.

In light of this, in this paper we propose an alternative approach for music genre classification which converts the audio signal into a spectrogram [16] (short-time Fourier representation) and then extracts features from this visual representation. The rationale is that, by treating the time-frequency representation as a texture image, we can extract features that are expected to be suitable for building a robust music genre classification system, even if there is no direct relation between the musical dimensions and the extracted features. Furthermore, these image-based features may capture information different from that of approaches that work directly on the audio signal. Fig. 1 illustrates two examples of spectrograms taken from music pieces of different genres. Fig. 1(a) shows a spectrogram taken from a classical music piece; in this case there is a very clear presence of almost horizontal lines, related to harmonic structures, while in Fig. 1(b) one can observe the intense beats typical of electronic music, depicted as clear vertical lines. The features used in this work are provided by the Local Binary Pattern (LBP), a structural texture operator presented by Ojala et al. [17].
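As a purely illustrative sketch (the library choice and the 8-point, radius-1 neighborhood are assumptions for the example, not necessarily the configuration used in this work), a texture descriptor of this kind can be computed from a grayscale spectrogram image with an off-the-shelf LBP implementation:

```python
# Hedged sketch: uniform LBP histogram of a grayscale spectrogram image.
# The neighborhood (8 points, radius 1) is illustrative; the LBP
# configuration actually used is given in Section 3.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(image, n_points=8, radius=1):
    """Normalized histogram of uniform LBP codes for a 2-D grayscale image."""
    codes = local_binary_pattern(image, n_points, radius, method="uniform")
    n_bins = n_points + 2  # the "uniform" mapping yields n_points + 2 codes
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist
```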

By analyzing the spectrogram images, we have noticed that the textures are not uniform, so we decided to consider local feature extraction in addition to global feature extraction. Furthermore, our previous results [18] have shown that, with Gray Level Co-occurrence Matrix (GLCM) descriptors, local feature extraction can help to improve music genre classification using spectrograms. With this in mind, we have studied different zoning techniques to obtain local information about the given pattern beyond the global feature extraction. We also demonstrate through experimentation that certain zones of the spectrogram perform better than others.
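The sketch below illustrates the idea of local feature extraction under simplified assumptions (the number of zones is arbitrary and does not correspond to the zoning schemes evaluated later): it splits the spectrogram image into horizontal zones, i.e. frequency bands, and computes one LBP histogram per zone, reusing the lbp_histogram helper sketched above.

```python
# Hedged sketch: per-zone LBP features from horizontal frequency bands.
# The number of zones is a placeholder; the zoning mechanisms actually
# evaluated (including perception-based bands) are described later.
def zone_features(image, n_zones=5, n_points=8, radius=1):
    zones = np.array_split(image, n_zones, axis=0)  # split along the frequency axis
    return [lbp_histogram(zone, n_points, radius) for zone in zones]
```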

The use of spectrograms in music genre classification has already been proposed in other works [18], [19], [20]. However, some important issues remain overlooked. Thus, some innovations are presented here, such as: the use of the LBP structural approach to obtain texture descriptors from the spectrogram; a zoning mechanism that takes human perception into account when setting up frequency bands; the creation of one individual classifier for each zone, with their outputs combined to obtain the final decision; and a comparison of results with and without the zoning mechanism using a structural texture descriptor.

Through a set of comprehensive experiments on the Latin Music Database [21] and on the ISMIR 2004 database [22], we demonstrate that in most cases the proposed approach compares favorably to the traditional approaches reported in the literature. The results obtained with the LMD in this work can be directly compared with those obtained by Lopes et al. [23] and Costa et al. [18], since all of them used the artist filter restriction and folds with exactly the same music pieces for classifier training and testing. The overall recognition rate improvement was about 22.66% when compared with [23], and about 15.13% when compared with the best result obtained in [18]. Taking into account the best results obtained with the LMD in the Music Information Retrieval Evaluation eXchange (MIREX) 2009 and MIREX 2010 [24] competitions, the improvement was about 7.67% and 2.47%, respectively. Concerning the ISMIR 2004 database, the obtained results are comparable to those described in the literature. In addition, these results corroborate the versatility of the proposed approach.

The remainder of this paper is organized as follows: Section 2 describes the music databases used in the experiments. Section 3 presents the LBP texture operator used to extract features in this work. Section 4 introduces the methodology used for classification, while Section 5 reports all the experiments that have been carried out on music genre classification. Finally, the last section presents the conclusions of this work and opens up some perspectives for future work.

Section snippets

Music databases

The LMD and the ISMIR 2004 database are among the most widely used music databases for research in Music Information Retrieval. These two databases were chosen because, taking into account the signal segmentation strategy described in Section 3, they are among the databases that could be used.

Feature extraction

Since our approach is based on the visual representation of the audio signal, the first step of the feature extraction process consists of converting the audio signal into a spectrogram. In the LMD, the spectrograms were created from audio files with the following technical characteristics: bit rate of 352 kbps, audio sample size of 16 bits, one channel, and audio sample rate of 22.05 kHz. In the ISMIR 2004 database, the audio files used had the following technical characteristics: bit rate of 706 kbps, audio
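As an illustration of this first step only (the SciPy-based implementation, window length, and overlap are assumptions for the example, not the settings used in this work), a mono 22.05 kHz audio file can be converted into a grayscale spectrogram image as follows:

```python
# Hedged sketch: audio file -> log-magnitude spectrogram -> 8-bit grayscale image.
import numpy as np
from scipy import signal
from scipy.io import wavfile

def to_spectrogram_image(wav_path):
    sr, audio = wavfile.read(wav_path)                    # 16-bit mono WAV assumed
    f, t, sxx = signal.spectrogram(audio.astype(float), fs=sr,
                                   nperseg=1024, noverlap=512)
    sxx_db = 10.0 * np.log10(sxx + 1e-10)                 # log magnitude (dB)
    span = sxx_db.max() - sxx_db.min()
    # Scale to an 8-bit grayscale image suitable for texture analysis.
    return (255 * (sxx_db - sxx_db.min()) / (span + 1e-10)).astype(np.uint8)
```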

Methodology used for classification

The classifier used in this work was the Support Vector Machine (SVM) introduced by Vapnik [28]. Normalization was performed by linearly scaling each attribute to the range [−1, +1]. Different parameters and kernels for the SVM were tried out, but the best results were achieved using a Gaussian kernel. The cost and gamma parameters were tuned using a grid search.
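This setup can be sketched with scikit-learn as follows; the grid values are placeholders and not necessarily those explored in this work:

```python
# Hedged sketch: scaling to [-1, +1], RBF (Gaussian) kernel SVM, and a grid
# search over the cost (C) and gamma parameters.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipeline = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), SVC(kernel="rbf"))
param_grid = {"svc__C": [2 ** k for k in range(-5, 16, 2)],
              "svc__gamma": [2 ** k for k in range(-15, 4, 2)]}
search = GridSearchCV(pipeline, param_grid, cv=3)
# search.fit(X_train, y_train); predictions = search.predict(X_test)
```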

The classification process is carried out as follows: the three 10-s segments of the music are converted into spectrograms (Ῡ_beg, Ῡ_mid,
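The snippet above is truncated. Purely as a hedged illustration (not necessarily the combination rule adopted in this work), one simple way to fuse the decisions obtained for the three segment spectrograms is a majority vote:

```python
# Hedged sketch: majority vote over per-segment genre predictions.
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among per-segment predictions."""
    return Counter(labels).most_common(1)[0][0]

# Example: majority_vote(["salsa", "salsa", "tango"]) -> "salsa"
```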

Experimental results and discussion

The following subsections present the experiments carried out with the global feature extraction and with the three different zoning mechanisms proposed in the previous section. Additional experiments carried out using acoustic features are also presented. The experimental results reported on the LMD refer to the average classification rates and standard deviations over the three aforementioned folds.

Conclusion

In this paper we have presented an alternative approach for music genre classification which is based on texture images. Such visual representations are created by converting the audio signal into spectrogram images, which can be divided into zones so that features can be extracted locally. We have demonstrated that, with LBP, there is a slight difference in terms of recognition rate when different zoning mechanisms are used and when a global feature extraction is performed,

Acknowledgments

This research has been partly supported by The National Council for Scientific and Technological Development (CNPq) grant 301653/2011-9, CAPES grant BEX 5779/11-1 and 223/09-FCT595-2009, Araucária Foundation grant 16767-424/2009, European Commission, FP7 (Seventh Framework Programme), ICT-2011.1.5 Networked Media and Search Systems, grant agreement No 287711; and the European Regional Development Fund through the Programme COMPETE and by National Funds through the Portuguese Foundation for

References (36)

  • G. Tzanetakis et al., Musical genre classification of audio signals, IEEE Transactions on Speech and Audio Processing (2002)
  • T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, in: 26th Annual...
  • F. Gouyon, S. Dixon, E. Pampalk, G. Widmer, Evaluating rhythmic descriptors for musical genre classification, in: 25th...
  • A. Rauber, E. Pampalk, D. Merkl, Using psycho-acoustic models and self-organizing maps to create a hierarchical...
  • T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre...
  • S. Lippens, J.P. Martens, M. Leman, B. Baets, H. Meyer, G. Tzanetakis, A comparison of human and automatic musical...
  • M.R. French et al., Spectrograms: turning signals into pictures, Journal of Engineering Technology (2007)
  • T. Ojala et al., Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)