Solo instrumental music analysis using the source-filter model as a sound production model considering temporal dynamics

  • IJCNN 2007
  • Published in Neural Computing and Applications

Abstract

A source-filter model, originally devised to represent the sound production process, has been widely used to estimate, from a speech signal, both the source signal, which carries pitch information, and the synthesis filter, which carries vowel information. We use this model to identify instruments from their sound signals. The model, however, suffers from an indeterminacy problem. To resolve it, we exploit three elements of sound: loudness, pitch, and timbre. Our assumption is that the source signal is represented by time-varying pitch and amplitude, and the synthesis filter by time-invariant line spectral frequency parameters. We construct a probabilistic model that embodies this assumption as an extension of the source-filter model. To learn the model parameters, we employ an EM-like algorithm that minimizes a cost function called the free energy. We evaluate our approach by reconstructing the spectrum from the estimated source signal and synthesis filter, and by identifying instruments using the model parameters of the estimated synthesis filter; the results show that this learning scheme achieves simultaneous estimation of the source signal and the synthesis filter.


Notes

  1. Gaussian function is defined as \( {\text{Gauss}}(x;\mu ,\sigma^{2} ) = \frac{1}{{\sqrt {2\pi } \sigma }}\exp \left( { - \frac{{(x - \mu )^{2} }}{{2\sigma^{2} }}} \right). \)
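As a quick sanity check of this definition (our own illustration, not part of the paper), the density can be evaluated numerically; its peak value at \( x = \mu \) is \( 1/(\sqrt{2\pi}\sigma) \):

```python
import math

def gauss(x, mu, var):
    # Gaussian density as defined in Note 1 (var = sigma^2).
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Peak at x = mu equals 1/(sqrt(2*pi)*sigma); here sigma = 2.
print(abs(gauss(0.0, 0.0, 4.0) - 1.0 / (math.sqrt(2.0 * math.pi) * 2.0)) < 1e-15)  # -> True
```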


Author information

Corresponding author

Correspondence to Mizuki Ihara.

Appendices

Appendix A: Relation between Itakura-Saito distortion measure and the likelihood of a Chi-square distribution

We consider the noise distribution in the frequency domain rather than in the time domain. This Appendix shows the equivalence between the Itakura-Saito distortion measure and the log-likelihood of a Chi-square distribution.

When the observation spectrum is represented along the continuous frequency axis, the summation in the Itakura-Saito distortion is replaced by an integral, i.e.,

$$ E\left[ {\frac{{\hat{s}}}{s}} \right] = 2\int\limits_{ - \pi }^{\pi } {\left\{ {\log \frac{{\hat{s}(\tilde{\omega })}}{{s(\tilde{\omega })}} + \frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}} - 1} \right\}{\text{d}}\tilde{\omega },} $$
(A.1)

where \( \hat{s}(\omega ) \) is the estimated power spectrum, \( s(\omega ) \) is the true short-term power spectrum, \( \tilde{\omega } = \frac{\omega Fs}{2\pi } \) is the normalized angular frequency \( \left( { - \pi \le \tilde{\omega } \le \pi } \right), \) and ω and Fs are frequency and sampling frequency, respectively.
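For a discrete spectrum, the integral in (A.1) becomes a sum over frequency bins. The following sketch (our own illustration; the function name and example spectra are hypothetical, not from the paper) computes this discrete Itakura-Saito distortion and checks its basic properties:

```python
import math

def itakura_saito(s_hat, s):
    """Discrete approximation of the Itakura-Saito distortion (A.1):
    the integral over normalized frequency is replaced by a sum over bins.
    `s_hat` is the estimated power spectrum, `s` the observed one."""
    return sum(math.log(sh / si) + si / sh - 1.0 for sh, si in zip(s_hat, s))

# Each term log(r) + 1/r - 1 (with r = s_hat/s > 0) is zero iff r = 1,
# so the distortion vanishes exactly when the spectra coincide.
spec = [1.0, 2.5, 0.3, 4.0]
print(itakura_saito(spec, spec))                    # -> 0.0
print(itakura_saito([2.0, 2.0], [1.0, 1.0]) > 0.0)  # -> True
```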

On the other hand, a Chi-square distribution (degree of freedom: 3) is given by

$$ f(n) = \frac{1}{2\Upgamma (1.5)}\left( {\frac{n}{2}} \right)^{{\frac{1}{2}}} \exp \left( { - \frac{n}{2}} \right). $$
(A.2)

Taking the logarithm, we have

$$ \begin{aligned} \log f(n) & = - \log \Upgamma (1.5) - \frac{1}{2}\log \frac{1}{n} - \frac{3}{2}\log 2 - \frac{n}{2} \\ & \equiv - \frac{1}{2}\left( {c + \log \frac{1}{n} + n} \right), \\ \end{aligned} $$
(A.3)

where c is a constant. Substituting \( n = \frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}} \) into (A.3), we obtain

$$ \log f\left( {\frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}}} \right) \equiv - \frac{1}{2}\left( {\log \frac{{\hat{s}(\tilde{\omega })}}{{s(\tilde{\omega })}} + \frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}} + c} \right). $$
(A.4)

Since we have assumed that the noise is generated independently from a Chi-square distribution at each frequency, the joint log-probability of the observation noise is a summation of (A.4) over frequencies, which coincides, up to a constant factor and offset, with the Itakura-Saito distortion (A.1).
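This per-frequency correspondence can be checked numerically. The sketch below (our own illustration, not the paper's code) confirms that \( -2\log f(s/\hat{s}) \) differs from the Itakura-Saito integrand only by a constant, for arbitrary spectrum values:

```python
import math

def chi2_3_logpdf(n):
    # Log of the Chi-square density with 3 degrees of freedom, as in (A.2).
    return -math.log(2.0 * math.gamma(1.5)) + 0.5 * math.log(n / 2.0) - n / 2.0

def is_term(s, s_hat):
    # Per-frequency Itakura-Saito term from (A.1), without the constant -1.
    return math.log(s_hat / s) + s / s_hat

# For any ratio n = s/s_hat, -2*log f(n) differs from the IS term only by
# one and the same constant c, as claimed in (A.3)-(A.4).
pairs = [(1.0, 2.0), (0.3, 0.7), (5.0, 1.5)]
consts = [-2.0 * chi2_3_logpdf(s / sh) - is_term(s, sh) for s, sh in pairs]
print(all(abs(c - consts[0]) < 1e-12 for c in consts))  # -> True
```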

Appendix B: Free energy calculation of the sound production model

In the calculation of the free energy, we assume that the trial distribution \( q(X_{1:T} |\kappa ) \) is a single Gaussian distribution: \( q(X_{1:T} |\kappa ) = {\text{Gauss}}(X_{1:T} ;\mu ,\Sigma ) \), where \( \kappa = \{ \mu ,\Sigma \} \). The free energy then becomes

$$ \begin{aligned} F(q(X_{1:T} |\kappa ),\theta ) & = - \int { \cdots \int {q(X_{1:T} |\kappa )\log p(X_{1:T} ,S_{1:T} |\theta ){\text{d}}X_{1:T} } } + \int { \cdots \int {q(X_{1:T} |\kappa )\log q(X_{1:T} |\kappa ){\text{d}}X_{1:T} } } \\ & = - \int { \cdots \int {q(X_{1:T} |\kappa )\log \left( {p(s_{1} |x_{1} ,\theta )p(x_{1} |\theta )\prod\limits_{t = 2}^{T} {p(s_{t} |x_{t} ,\theta )p(x_{t} |x_{t - 1} ,\theta )} } \right){\text{d}}X_{1:T} } } + \int { \cdots \int {q(X_{1:T} |\kappa )\log q(X_{1:T} |\kappa ){\text{d}}X_{1:T} } } \\ & = - \int {q(x_{1} |\kappa )\log p(x_{1} |\theta ){\text{d}}x_{1} } - \sum\limits_{t = 1}^{T} {\int {q(x_{t} |\kappa )\log p(s_{t} |x_{t} ,\theta ){\text{d}}x_{t} } } \\ & \quad - \sum\limits_{t = 2}^{T} {\int {q(x_{t} ,x_{t - 1} |\kappa )\log p(x_{t} |x_{t - 1} ,\theta ){\text{d}}x_{t} {\text{d}}x_{t - 1} } } - {\text{H}}(q(X_{1:T} |\kappa )) \\ & = F_{1} + F_{2} + F_{3} + F_{4} . \\ \end{aligned} $$
(B.1)
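The role of the free energy can be illustrated on a toy model (our own one-dimensional example, not the paper's sound production model): for prior \( x \sim {\text{Gauss}}(0,1) \) and observation \( s|x \sim {\text{Gauss}}(x,1) \), \( F(q,\theta ) \) upper-bounds the negative log-evidence for any Gaussian trial distribution \( q \), with equality when \( q \) is the exact posterior, which is what the EM-like minimization exploits:

```python
import math

def free_energy(mu, var, s):
    """F(q) = E_q[log q(x)] - E_q[log p(x)] - E_q[log p(s|x)]
    for prior x ~ N(0,1), likelihood s|x ~ N(x,1), trial q = N(mu, var).
    All expectations are available in closed form for Gaussians."""
    e_log_q = -0.5 * (1.0 + math.log(2.0 * math.pi * var))         # = -entropy of q
    e_neglog_prior = 0.5 * (math.log(2.0 * math.pi) + mu ** 2 + var)
    e_neglog_lik = 0.5 * (math.log(2.0 * math.pi) + (s - mu) ** 2 + var)
    return e_log_q + e_neglog_prior + e_neglog_lik

s = 1.3
# Marginally, s ~ N(0, 2), so the negative log-evidence is known exactly.
neg_log_evidence = 0.5 * math.log(2.0 * math.pi * 2.0) + s ** 2 / 4.0

# F >= -log p(s) for any q, with equality at the exact posterior N(s/2, 1/2).
print(free_energy(0.0, 1.0, s) >= neg_log_evidence)                  # -> True
print(abs(free_energy(s / 2.0, 0.5, s) - neg_log_evidence) < 1e-9)   # -> True
```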

1. The term for the initial state distribution

$$ \begin{aligned} F_{1} & = - \int {q(x_{1} |\kappa )\log p(x_{1} |\theta ){\text{d}}x_{1} } \\ & = - \int {{\text{Gauss}}(x_{1} ;\mu_{1} ,\Sigma_{1} )} \log \left( {{\text{Gauss}}(x_{1} ;m_{1} ,(\sigma_{1} )^{2} )} \right){\text{d}}x_{1} \\ & = \frac{1}{2}\left( {{\text{Tr}}\left( {(\sigma_{1} )^{ - 2} \Sigma_{1} } \right) + (\mu_{1} - m_{1} )^{T} (\sigma_{1} )^{ - 2} (\mu_{1} - m_{1} )} \right) + \frac{1}{2}\log \left| {(\sigma_{1} )^{2} } \right| + \log (2\pi ). \\ \end{aligned} $$
(B.2)

2. The term for the observation process

$$ \begin{aligned} F_{2} & = - \int {q(x_{t} |\kappa )\log p(s_{t} |x_{t} ,\theta ){\text{d}}x_{t} } \\ & = - \int {{\text{Gauss}}(x_{t} ;\mu_{t} ,\Sigma_{t} )} \sum\limits_{i = 1}^{N} {\log \frac{1}{{\sqrt {2\pi } \sigma_{o} s_{t} (i)}}\exp \left( { - \frac{1}{{2\sigma_{o}^{2} }}\left( {\log s_{t} (i) - \log \hat{s}_{t} (i)} \right)^{2} } \right){\text{d}}x_{t} } . \\ \end{aligned} $$

Substituting \( \hat{s}_{t} (i) = H(i)G_{t} (i) \) and \( q = {\text{Gauss}}(x_{t} ;\mu_{t} ,\Sigma_{t} ) \), we obtain

$$ \begin{aligned} F_{2} & = - \int {{\text{Gauss}}(x_{t} ;\mu_{t} ,\Sigma_{t} )} \sum\limits_{i = 1}^{N} {\left( { - \frac{1}{2}\log (2\pi ) - \log \sigma_{o} - \log s_{t} (i) - \frac{1}{{2\sigma_{o}^{2} }}\left( {\log s_{t} (i) - \log H(i) - \log G_{t} (i)} \right)^{2} } \right)} {\text{d}}x_{t} \\ & = \frac{N}{2}\log (2\pi ) + \sum\limits_{i = 1}^{N} {\log s_{t} (i)} + N\log \sigma_{o} + \frac{1}{{2\sigma_{o}^{2} }}\sum\limits_{i = 1}^{N} {\left( {\log s_{t} (i) - \log H(i)} \right)^{2} } \\ & \quad - \frac{1}{{2\sigma_{o}^{2} }}\sum\limits_{i = 1}^{N} {\left( {\log \frac{{s_{t} (i)}}{{H(i)}}\left( {\mu_{a_{t}} + \frac{{A(\omega_{i} )}}{{\sqrt {2\pi } }}K_{\exp } (i)} \right)} \right),} \\ \end{aligned} $$
(B.3)

where

$$ \begin{aligned} A(\omega ) & = A\exp \left( { - \frac{\omega }{\tau }} \right), \\ K_{\exp } (i) & = \sum\limits_{k = 1}^{K} {k_{\exp } (i),} \\ k_{\exp } (i) & = \frac{1}{{\sqrt {k^{2} \Sigma_{f_{t}} + \sigma_{p}^{2} } }}\exp \left( { - \frac{1}{2}\left( {\Sigma_{f_{t}} + \frac{{\sigma_{p}^{2} }}{{k^{2} }}} \right)^{ - 1} \left( {\mu_{f_{t}} - \frac{{\omega_{i} }}{k}} \right)^{2} } \right), \\ KL_{\exp } (i) & = \sum\limits_{k,l} {kl_{\exp } (i)}, \\ kl_{\exp } (i) & = \frac{1}{{\sqrt {\left( {k^{2} + l^{2} } \right)\Sigma_{f_{t}} + \sigma_{p}^{2} } }}\exp \left( { - \frac{1}{2}\left( {\left( {\Sigma_{f_{t}} + \frac{{\sigma_{p}^{2} }}{{k^{2} + l^{2} }}} \right)^{ - 1} \left( {\mu_{f_{t}} - \frac{k + l}{{k^{2} + l^{2} }}\omega_{i} } \right)^{2} + \frac{{(k - l)^{2} }}{{k^{2} + l^{2} }}\frac{{\omega_{i}^{2} }}{{\sigma_{p}^{2} }}} \right)} \right), \\ H(\tilde{\omega}, b_1, \ldots, b_p) & = 2^{1-p}\left( \sin^2 \frac{\tilde{\omega}}{2}\prod_{n=2,4,\ldots,p} \left( \cos \tilde{\omega} - \cos b_n \right)^2 + \cos^2 \frac{\tilde{\omega}}{2} \prod_{n=1,3,\ldots,p-1}\left( \cos \tilde{\omega} - \cos b_n \right)^2 \right)^{-2}, \\ \tilde{\omega } & = \frac{{{\text{Fs}}\,\omega }}{2\pi }. \end{aligned} $$
(B.4)

3. The term for the state transition

$$ \begin{aligned} F_{3} & = - \int {q(x_{t} ,x_{t - 1} |\kappa )\log p(x_{t} |x_{t - 1} ,\theta ){\text{d}}x_{t} {\text{d}}x_{t - 1} } \\ & \approx \frac{1}{2}\log \upsilon_{a} + \log \Upgamma \left( {\frac{{\upsilon_{a} }}{2}} \right) - \log \Upgamma \left( {\frac{{\upsilon_{a} + 1}}{2}} \right) + \frac{1}{2}\log \upsilon_{f} + \log \Upgamma \left( {\frac{{\upsilon_{f} }}{2}} \right) \\ & + \frac{{\upsilon_{a} + 1}}{2}\left( {\frac{{\beta_{a}^{4} U_{a} }}{{\upsilon_{a}^{2} }} + \frac{{\beta_{a}^{4} V_{a} }}{{\upsilon_{a} }} + abc_{a,t} } \right) + \frac{{\upsilon_{f} + 1}}{2}\left( {\frac{{\beta_{f}^{4} U_{f} }}{{\upsilon_{f}^{2} }} + \frac{{\beta_{f}^{4} V_{f} }}{{\upsilon_{f} }} + abc_{f,t} } \right) \\ & - \log \Upgamma \left( {\frac{{\upsilon_{f} + 1}}{2}} \right) + \log \beta_{a} + \log \beta_{f} + \log \pi , \\ \end{aligned} $$
(B.5)

where

$$ \begin{aligned} U_{a} & = a_{a,t} \left( {3\Sigma_{u,a}^{2} + e_{u,a}^{4} + 6\Sigma_{u,a} e_{u,a}^{2} } \right), \\ U_{f} & = a_{f,t} \left( {3\Sigma_{u,f}^{2} + \mu_{u,f}^{4} + 6\Sigma_{u,f} \mu_{u,f}^{2} } \right), \\ V_{a} & = \left( {2a_{a,t} + b_{a,t} } \right)\left( {\Sigma_{u,a} + e_{u,a}^{2} } \right), \\ V_{f} & = \left( {2a_{f,t} + b_{f,t} } \right)\left( {\Sigma_{u,f} + \mu_{u,f}^{2} } \right), \\ e_{u,a} & = \mu_{u,a} - \log \rho , \\ e_{u,f} & = \mu_{u,f} , \\ \mu_{u,*} & = \frac{1}{\sqrt 2 }\left( {\mu_{*,t} - \mu_{*,t - 1} } \right), \\ abc_{*,t} & = a_{*,t} + b_{*,t} + c_{*,t} , \\ a_{*,t} & = \frac{{w_{a1} }}{{x_{\max ,*,t} - 1}} + w_{a2} , \\ b_{*,t} & = - \frac{{w_{b1} }}{{\sqrt {x_{\max ,*,t} - 1} + w_{b3} }} + w_{b2} , \\ c_{*,t} & = w_{c1} \log \left( {x_{\max ,*,t} - 1} \right) + w_{c2} , \\ x_{\max ,*,t} & = 1 + \frac{{\left( {e_{u,*} + k_{*} \sqrt {\Sigma_{u,*} } } \right)^{2} }}{{\upsilon_{*} }}, \\ x_{\min } & = 1, \\ \Sigma_{u,*} & = \frac{1}{2}\left( {\Sigma_{*,t} + \Sigma_{*,t - 1} + 2\Sigma_{*,t,t - 1} } \right). \\ \end{aligned} $$
(B.6)

Note that \( w_{a} \), \( w_{b} \), and \( w_{c} \) are constants, and \( * \in \{ a,f\} \).

4. The term for the entropy

$$ \begin{aligned} F_{4} & = - H\left( {q\left( {X_{1:T} |\kappa } \right)} \right) \\ & = - H\left( {{\text{Gauss}}\left( {X_{1:T} |\mu ,\Sigma } \right)} \right) \\ & \approx - \frac{1}{2}\left( {n + n\log (2\pi ) + \log \left| \Sigma \right|} \right). \\ \end{aligned} $$
(B.7)
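The closed-form Gaussian entropy used in \( F_{4} \) can be verified by Monte Carlo sampling (our own illustration, with a hypothetical diagonal covariance in place of the full \( \Sigma \)):

```python
import math
import random

def gauss_entropy(cov_diag):
    """Closed-form entropy of an n-dim Gaussian with diagonal covariance,
    matching (B.7): H = (1/2)(n + n*log(2*pi) + log|Sigma|)."""
    n = len(cov_diag)
    log_det = sum(math.log(v) for v in cov_diag)
    return 0.5 * (n + n * math.log(2.0 * math.pi) + log_det)

# Monte Carlo check: H = E[-log q(x)] under x ~ q, for a diagonal Gaussian.
random.seed(0)
cov = [0.5, 2.0, 1.0]

def neg_log_q(x):
    return sum(0.5 * (math.log(2.0 * math.pi * v) + xi * xi / v)
               for xi, v in zip(x, cov))

samples = [[random.gauss(0.0, math.sqrt(v)) for v in cov] for _ in range(100000)]
mc = sum(neg_log_q(x) for x in samples) / len(samples)
print(abs(mc - gauss_entropy(cov)) < 0.03)  # Monte Carlo estimate matches (B.7)
```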


Cite this article

Ihara, M., Maeda, Si. & Ishii, S. Solo instrumental music analysis using the source-filter model as a sound production model considering temporal dynamics. Neural Comput & Applic 18, 3–14 (2009). https://doi.org/10.1007/s00521-008-0201-7

