Abstract
A source-filter model, originally devised to represent the speech production process, has been widely used to estimate, from a speech signal, both the source signal, which carries pitch information, and the synthesis filter, which carries vowel information. We use this model to identify instruments from their sound signals. However, the model suffers from an indeterminacy problem. To resolve it, we employ three elements of sound: loudness, pitch, and timbre. Our assumption is that the source signal is represented by a time-varying pitch and amplitude, and the synthesis filter by time-invariant line spectral frequency parameters. We construct a probabilistic model that embodies this assumption as an extension of the source-filter model. To learn the model parameters, we employ an EM-like algorithm that minimizes a cost function called the free energy. To evaluate our approach, we reconstruct the spectrum from the estimated source signal and synthesis filter, and identify instruments using the model parameters of the estimated synthesis filter; the results show that this learning scheme can estimate the source signal and the synthesis filter simultaneously.
Notes
The Gaussian function is defined as \( {\text{Gauss}}(x;\mu ,\sigma^{2} ) = \frac{1}{{\sqrt {2\pi } \sigma }}\exp \left( { - \frac{{(x - \mu )^{2} }}{{2\sigma^{2} }}} \right). \)
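As an illustration (not part of the original note), the density defined above can be evaluated numerically; this minimal Python sketch uses only the standard library, and the coarse Riemann sum is simply a sanity check that the density integrates to one:

```python
import math

def gauss(x, mu, sigma2):
    """Gaussian density Gauss(x; mu, sigma^2) as defined in the note."""
    sigma = math.sqrt(sigma2)
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / (math.sqrt(2.0 * math.pi) * sigma)

# The density integrates to 1; a coarse Riemann sum over [-10, 10]
# with step 0.01 approximates this for the standard normal case.
approx = sum(gauss(-10.0 + 0.01 * i, 0.0, 1.0) * 0.01 for i in range(2001))
```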
References
Fant G (1970) Acoustical theory of speech production: with calculations based on X-ray studies of Russian articulations. The Hague, Mouton
Itakura F, Saito S (1971) Speech information compression based on the maximum likelihood spectral estimation. J Acoust Soc Jpn 27(9):463–472 (in Japanese)
Klapuri A (2007) Analysis of musical instrument sounds by source-filter model. In: IEEE international conference on acoustics, speech, and signal processing (ICASSP), vol 1, pp I53–I56
Virtanen T (2000) Audio signal modeling with sinusoids plus noise, Master’s thesis, Tampere University of Technology
Itakura F, Saito S (1970) A statistical method for estimation of speech spectral density and formant frequencies. Trans IEICE Jpn 53-A:36–43 (in Japanese)
Yuan Z (2003) The weighted sum of the line spectrum pair for noisy speech, Master’s thesis, Helsinki University of Technology
Itakura F (1975) Line spectrum representation of linear predictive coefficients of speech signals. J Acoust Soc Jpn 57:S35 (in Japanese)
Sugamura N, Itakura F (1981) Line spectrum representation of linear predictor coefficients of speech signal and its statistical properties. Trans IEICE Jpn J64-A(4):323–330 (in Japanese)
Krishna A, Sreenivas T (2004) Music instrument recognition: from isolated notes to solo phrases. In: IEEE international conference on acoustics, speech, and signal processing (ICASSP), vol 4, pp 265–268
Ahmadi S, Spanias AS (2001) Low bit-rate speech coding based on an improved sinusoidal model. Speech Commun 34(4)
Neal RM, Hinton GE (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in graphical models, pp 355–368
Bishop C (2006) Pattern recognition and machine learning. Springer, New York
Lagarias JC, Reeds JA, Wright MH, Wright PE (1998) Convergence properties of the Nelder-Mead simplex method in low dimensions. SIAM J Optim 9(1):112–147
Teukolsky SA, Vetterling WT, Flannery BP, Press WH (1994) Numerical recipes in C. Cambridge University Press, London
Agostini G, Longari M, Pollastri E (2003) Musical instrument timbres classification with spectral features. In: European conference on signal processing (EUSIPCO), vol 1
Marques J, Moreno PJ (1999) A study of musical instrument classification using Gaussian mixture models and support vector machines. Technical report, Compaq computer corporation
Vapnik V (1995) The nature of statistical learning theory. Springer, New York
Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm
Campbell P, Tremain T (1986) Voiced/unvoiced classification of speech with applications to the U.S. Government LPC-10E algorithm. In: IEEE international conference on acoustics, speech, and signal processing (ICASSP)
Atal B, Hanauer S (1971) Speech analysis and synthesis by linear prediction of the speech wave. J Acoust Soc Am 50(2):637–655
Livshin A, Rodet X (2004) Musical instrument identification in continuous recordings. In: International conference on digital audio effects (DAFx)
Essid S, Richard G, David B (2006) Musical instrument recognition by pairwise classification strategies. In: IEEE transactions on audio, speech and language processing, vol 14, no 4, pp 1401–1412
Kitahara T (2007) Computational musical instrument recognition and its application to content-based music information retrieval. Ph.D. Thesis, Kyoto University
Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic Press, Boston
Martinez A, Kak A (2001) PCA versus LDA. In: IEEE transactions on pattern analysis and machine intelligence, vol 23, no 2
Eggink J, Brown GJ (2003) Application of missing feature theory to the recognition of musical instruments in polyphonic audio. In: International conference on music information retrieval (ISMIR)
Jinachitra P (2004) Polyphonic instrument identification using independent subspace analysis. In: International conference on multimedia and expo (ICME)
Essid S, Richard G, David B (2004) Musical instrument recognition based on class pairwise feature selection. In: International conference on music information retrieval (ISMIR)
Appendices
Appendix A: Relation between Itakura-Saito distortion measure and the likelihood of a Chi-square distribution
We consider the noise distribution in the frequency domain rather than in the time domain. In this Appendix, we show the equivalence between the Itakura-Saito distortion measure and the log-likelihood of a Chi-square distribution.
When the observation spectrum is represented along the continuous frequency axis, as in the Itakura-Saito distortion, the summation is replaced by an integral, i.e.,
\( D_{\text{IS}} = \frac{1}{2\pi }\int_{ - \pi }^{\pi } {\left( {\frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}} - \log \frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}} - 1} \right)} \,{\text{d}}\tilde{\omega }, \) (A.1)
where \( \hat{s}(\tilde{\omega }) \) is the estimated power spectrum, \( s(\tilde{\omega }) \) is the true short-term power spectrum, \( \tilde{\omega } = \frac{2\pi \omega }{Fs} \) is the normalized angular frequency \( \left( { - \pi \le \tilde{\omega } \le \pi } \right), \) and ω and Fs are the frequency and the sampling frequency, respectively.
On the other hand, a Chi-square distribution with three degrees of freedom is given by
\( p(n) = \frac{\sqrt{n}}{{\sqrt {2\pi } }}\exp \left( { - \frac{n}{2}} \right). \) (A.2)
Taking the logarithm, we have
\( \log p(n) = \frac{1}{2}\log n - \frac{n}{2} + c, \) (A.3)
where c is a constant. Substituting \( n = \frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}} \) into (A.3), we obtain
\( \log p = - \frac{1}{2}\left( {\frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}} - \log \frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}}} \right) + c. \) (A.4)
Since we have assumed that the noise is generated independently from a Chi-square distribution at each frequency, the joint log-probability of the observation noise is the summation of (A.4) over frequencies, which coincides with the Itakura-Saito distortion (A.1) up to a scale factor and an additive constant.
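The correspondence between the per-frequency Itakura-Saito term and the Chi-square (three degrees of freedom) log-density can be checked numerically. The following Python sketch is an illustration, not the authors' code: it draws arbitrary positive spectrum pairs and verifies that the summed log-likelihood is an affine function of the summed Itakura-Saito terms, as (A.4) asserts.

```python
import math
import random

def is_term(s, s_hat):
    """One frequency bin of the Itakura-Saito distortion, without the -1 constant."""
    r = s / s_hat
    return r - math.log(r)

def chi2_3_logpdf(n):
    """Log-density of a Chi-square distribution with three degrees of freedom."""
    return 0.5 * math.log(n) - 0.5 * n - 0.5 * math.log(2.0 * math.pi)

random.seed(0)
# Arbitrary positive (true, estimated) power-spectrum pairs for 100 bins.
bins = [(random.uniform(0.1, 5.0), random.uniform(0.1, 5.0)) for _ in range(100)]

is_sum = sum(is_term(s, sh) for s, sh in bins)
ll_sum = sum(chi2_3_logpdf(s / sh) for s, sh in bins)

# Per (A.4): log p = -0.5 * (s/s_hat - log(s/s_hat)) + c with c = -0.5*log(2*pi),
# so the summed log-likelihood equals -0.5 * is_sum plus 100 copies of c.
c = -0.5 * math.log(2.0 * math.pi)
diff = ll_sum - (-0.5 * is_sum + 100 * c)
```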
Appendix B: Free energy calculation of the sound production model
In the calculation of the free energy, we assume the trial distribution \( q(X_{1:T} |\kappa ) \) is a single Gaussian distribution: \( q(X_{1:T} |\kappa ) = {\text{Gauss}}(X_{1:T} ;\mu ,\Sigma ), \) where κ = {μ, Σ}. The free energy then becomes
1. The term for the initial state distribution
2. The term for the observation process
where
3. The term for the state transition
where
Note that \( w_{a} ,w_{b} ,w_{c} \) are constants, and \( * \in \{ a,f\} . \)
4. The term for the entropy
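Because the trial distribution q is assumed Gaussian, its entropy term has the standard closed form \( H[q] = \frac{1}{2}\log \det (2\pi e\Sigma ). \) The following Python sketch (an illustration only; the diagonal covariance is an assumption made here for simplicity, not part of the paper's model) evaluates this term and checks it against the one-dimensional textbook formula \( \frac{1}{2}\log (2\pi e\sigma^{2} ) \):

```python
import math

def gaussian_entropy_diag(variances):
    """Differential entropy of a Gaussian with diagonal covariance:
    H = 0.5 * log det(2*pi*e*Sigma) = sum_i 0.5 * log(2*pi*e*sigma_i^2)."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * v) for v in variances)

# In one dimension this reduces to the textbook formula 0.5*log(2*pi*e*sigma^2).
h1 = gaussian_entropy_diag([2.0])
expected = 0.5 * math.log(2.0 * math.pi * math.e * 2.0)
```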
Ihara, M., Maeda, Si. & Ishii, S. Solo instrumental music analysis using the source-filter model as a sound production model considering temporal dynamics. Neural Comput & Applic 18, 3–14 (2009). https://doi.org/10.1007/s00521-008-0201-7