Abstract
A source-filter model, originally devised to represent the speech production process, has been widely used to estimate, from a speech signal, both the source signal, which carries pitch information, and the synthesis filter, which carries vowel information. We use this model to identify instruments from their sound signals. However, the model suffers from an indeterminacy problem. To resolve it, we employ three elements of sound: loudness, pitch, and timbre. Our assumption is that the source signal is represented by a time-varying pitch and amplitude, and the synthesis filter by time-invariant line spectral frequency parameters. We construct a probabilistic model that embodies this assumption as an extension of the source-filter model. To learn the model parameters, we employ an EM-like algorithm that minimizes a cost function called the free energy. To evaluate our approach, we reconstruct the spectrum from the estimated source signal and synthesis filter, and identify instruments using the model parameters of the estimated synthesis filter; the results show that this learning scheme can estimate the source signal and the synthesis filter simultaneously.
Notes
The Gaussian function is defined as \( {\text{Gauss}}(x;\mu ,\sigma^{2} ) = \frac{1}{{\sqrt {2\pi } \sigma }}\exp \left( { - \frac{{(x - \mu )^{2} }}{{2\sigma^{2} }}} \right). \)
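As an illustration (not part of the original note), the density defined above can be evaluated numerically; this minimal Python sketch uses only the standard library, and the coarse Riemann sum is simply a sanity check that the density integrates to one:

```python
import math

def gauss(x, mu, sigma2):
    """Gaussian density Gauss(x; mu, sigma^2) as defined in the note."""
    sigma = math.sqrt(sigma2)
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / (math.sqrt(2.0 * math.pi) * sigma)

# The density integrates to 1; a coarse Riemann sum over [-10, 10]
# with step 0.01 approximates this for the standard normal case.
approx = sum(gauss(-10.0 + 0.01 * i, 0.0, 1.0) * 0.01 for i in range(2001))
```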
References
Fant G (1970) Acoustical theory of speech production: with calculations based on X-ray studies of Russian articulations. The Hague, Mouton
Itakura F, Saito S (1971) Speech information compression based on the maximum likelihood spectral estimation. J Acoust Soc Jpn 27(9):463–472 (in Japanese)
Klapuri A (2007) Analysis of musical instrument sounds by source-filter model. In: IEEE international conference on acoustics, speech, and signal processing (ICASSP), vol 1, pp I53–I56
Virtanen T (2000) Audio signal modeling with sinusoids plus noise, Master’s thesis, Tampere University of Technology
Itakura F, Saito S (1970) A statistical method for estimation of speech spectral density and formant frequencies. Trans IEICE Jpn 53-A:36–43 (in Japanese)
Yuan Z (2003) The weighted sum of the line spectrum pair for noisy speech, Master’s thesis, Helsinki University of Technology
Itakura F (1975) Line spectrum representation of linear predictive coefficients of speech signals. J Acoust Soc Jpn 57:S35 (in Japanese)
Sugamura N, Itakura F (1981) Line spectrum representation of linear predictor coefficients of speech signal and its statistical properties. Trans IEICE Jpn J64-A(4):323–330 (in Japanese)
Krishna A, Sreenivas T (2004) Music instrument recognition: from isolated notes to solo phrases. In: IEEE international conference on acoustics, speech, and signal processing (ICASSP), vol 4, pp 265–268
Ahmadi S, Spanias AS (2001) Low bit-rate speech coding based on an improved sinusoidal model. Speech Commun 34(4)
Neal RM, Hinton GE (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in graphical models, pp 355–368
Bishop C (2006) Pattern recognition and machine learning. Springer, New York
Lagarias JC, Reeds JA, Wright MH, Wright PE (1998) Convergence properties of the Nelder-Mead simplex method in low dimensions. SIAM J Optim 9(1):112–147
Teukolsky SA, Vetterling WT, Flannery BP, Press WH (1994) Numerical recipes in C. Cambridge University Press, London
Agostini G, Longari M, Pollastri E (2003) Musical instrument timbres classification with spectral features. In: European conference on signal processing (EUSIPCO), vol 1
Marques J, Moreno PJ (1999) A study of musical instrument classification using Gaussian mixture models and support vector machines. Technical report, Compaq computer corporation
Vapnik V (1995) The nature of statistical learning theory. Springer, New York
Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm
Campbell P, Tremain T (1986) Voiced/unvoiced classification of speech with applications to the U.S. Government LPC-10E algorithm. In: IEEE international conference on acoustics, speech, and signal processing (ICASSP)
Atal B, Hanauer S (1971) Speech analysis and synthesis by linear prediction of the speech wave. J Acoust Soc Am 50(2):637–655
Livshin A, Rodet X (2004) Musical instrument identification in continuous recordings. In: International conference on digital audio effects (DAFx)
Essid S, Richard G, David B (2006) Musical instrument recognition by pairwise classification strategies. In: IEEE transactions on audio, speech and language processing, vol 14, no 4, pp 1401–1412
Kitahara T (2007) Computational musical instrument recognition and its application to content-based music information retrieval. Ph.D. Thesis, Kyoto University
Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic Press, Boston
Martinez A, Kak A (2001) PCA versus LDA. In: IEEE transactions on pattern analysis and machine intelligence, vol 23, no 2
Eggink J, Brown GJ (2003) Application of missing feature theory to the recognition of musical instruments in polyphonic audio. In: International conference on music information retrieval (ISMIR)
Jinachitra P (2004) Polyphonic instrument identification using independent subspace analysis. In: International conference on multimedia and expo (ICME)
Essid S, Richard G, David B (2004) Musical instrument recognition based on class pairwise feature selection. In: International conference on music information retrieval (ISMIR)
Appendices
Appendix A: Relation between Itakura-Saito distortion measure and the likelihood of a Chi-square distribution
We consider the noise distribution in the frequency domain rather than in the time domain. In this Appendix, we show the equivalence between the Itakura-Saito distortion measure and the log-likelihood of a Chi-square distribution.
When the observation spectrum is represented along the continuous frequency axis, as in the Itakura-Saito distortion, the summation is replaced by an integral, i.e.,
\( D_{\text{IS}} = \frac{1}{2\pi }\int_{ - \pi }^{\pi } {\left( {\frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}} - \log \frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}} - 1} \right)} \,{\text{d}}\tilde{\omega }, \) (A.1)
where \( \hat{s}(\tilde{\omega }) \) is the estimated power spectrum, \( s(\tilde{\omega }) \) is the true short-term power spectrum, \( \tilde{\omega } = \frac{2\pi \omega }{Fs} \) is the normalized angular frequency \( \left( { - \pi \le \tilde{\omega } \le \pi } \right), \) and ω and Fs are the frequency and the sampling frequency, respectively.
On the other hand, a Chi-square distribution with three degrees of freedom is given by
\( p(n) = \frac{\sqrt{n}}{{\sqrt {2\pi } }}\exp \left( { - \frac{n}{2}} \right). \) (A.2)
Taking the logarithm, we have
\( \log p(n) = \frac{1}{2}\log n - \frac{n}{2} + c, \) (A.3)
where c is a constant. Substituting \( n = \frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}} \) into (A.3), we obtain
\( \log p = - \frac{1}{2}\left( {\frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}} - \log \frac{{s(\tilde{\omega })}}{{\hat{s}(\tilde{\omega })}}} \right) + c. \) (A.4)
Since we have assumed that the noise is generated independently from a Chi-square distribution at each frequency, the joint log-probability of the observation noise is the summation of (A.4) over frequencies, which coincides with the Itakura-Saito distortion (A.1) up to a scale factor and an additive constant.
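The correspondence between the per-frequency Itakura-Saito term and the Chi-square (three degrees of freedom) log-density can be checked numerically. The following Python sketch is an illustration, not the authors' code: it draws arbitrary positive spectrum pairs and verifies that the summed log-likelihood is an affine function of the summed Itakura-Saito terms, as (A.4) asserts.

```python
import math
import random

def is_term(s, s_hat):
    """One frequency bin of the Itakura-Saito distortion, without the -1 constant."""
    r = s / s_hat
    return r - math.log(r)

def chi2_3_logpdf(n):
    """Log-density of a Chi-square distribution with three degrees of freedom."""
    return 0.5 * math.log(n) - 0.5 * n - 0.5 * math.log(2.0 * math.pi)

random.seed(0)
# Arbitrary positive (true, estimated) power-spectrum pairs for 100 bins.
bins = [(random.uniform(0.1, 5.0), random.uniform(0.1, 5.0)) for _ in range(100)]

is_sum = sum(is_term(s, sh) for s, sh in bins)
ll_sum = sum(chi2_3_logpdf(s / sh) for s, sh in bins)

# Per (A.4): log p = -0.5 * (s/s_hat - log(s/s_hat)) + c with c = -0.5*log(2*pi),
# so the summed log-likelihood equals -0.5 * is_sum plus 100 copies of c.
c = -0.5 * math.log(2.0 * math.pi)
diff = ll_sum - (-0.5 * is_sum + 100 * c)
```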
Appendix B: Free energy calculation of the sound production model
In the calculation of the free energy, we assume the trial distribution \( q(X_{1:T} |\kappa ) \) is a single Gaussian distribution: \( q(X_{1:T} |\kappa ) = {\text{Gauss}}(X_{1:T} ;\mu ,\Sigma ), \) where κ = {μ, Σ}. The free energy then becomes
1. The term for the initial state distribution
2. The term for the observation process
where
3. The term for the state transition
where
Note that \( w_{a} ,w_{b} ,w_{c} \) are constants, and \( * \in \{ a,f\} . \)
4. The term for the entropy
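Because the trial distribution q is assumed Gaussian, its entropy term has the standard closed form \( H[q] = \frac{1}{2}\log \det (2\pi e\Sigma ). \) The following Python sketch (an illustration only; the diagonal covariance is an assumption made here for simplicity, not part of the paper's model) evaluates this term and checks it against the one-dimensional textbook formula \( \frac{1}{2}\log (2\pi e\sigma^{2} ) \):

```python
import math

def gaussian_entropy_diag(variances):
    """Differential entropy of a Gaussian with diagonal covariance:
    H = 0.5 * log det(2*pi*e*Sigma) = sum_i 0.5 * log(2*pi*e*sigma_i^2)."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * v) for v in variances)

# In one dimension this reduces to the textbook formula 0.5*log(2*pi*e*sigma^2).
h1 = gaussian_entropy_diag([2.0])
expected = 0.5 * math.log(2.0 * math.pi * math.e * 2.0)
```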
Ihara, M., Maeda, Si. & Ishii, S. Solo instrumental music analysis using the source-filter model as a sound production model considering temporal dynamics. Neural Comput & Applic 18, 3–14 (2009). https://doi.org/10.1007/s00521-008-0201-7