A Novel Zero-Resource Spoken Term Detection Using Affinity Kernel Propagation with Acoustic Feature Map

  • Original Research
  • SN Computer Science

Abstract

Spoken term detection (STD) without linguistic cues is a challenging retrieval task. Despite numerous studies addressing these challenges, there remains scope for improvement. Dynamic time warping (DTW) based techniques have been extensively employed to accomplish the STD task in the absence of linguistic resources. A drawback of this approach is its difficulty in handling the speaker, language, acoustic, and spoken query variabilities that exist in natural speech. Our approach introduces a novel acoustic feature representation combined with affinity kernel propagation to overcome these challenges. First, a Self-Organising Map (SOM) based feature vector representation is employed to address speaker variability. Next, affinity kernel propagation captures the best alignment between the spoken query and the utterances in the similarity-matching task without constraining the nature of the query. With the acoustic feature mapping and similarity matching through affinity kernel propagation, a 6% gain in Maximum Term Weighted Value (MTWV) and a 5% reduction in cross-entropy cost were achieved in an evaluation on the QUESST-14 speech corpus across multiple languages.
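
For orientation, the DTW-based matching that the abstract describes as the common zero-resource baseline can be sketched as a subsequence search. This is a minimal illustration and not the affinity kernel propagation method of the article, whose details are not given here. The function names `frame_distance` and `subsequence_dtw`, the Euclidean frame distance, and the length normalisation are all assumptions made for the example.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequence_dtw(query, utterance):
    """Find the utterance segment that best aligns with the whole query.

    Both arguments are sequences of feature frames (lists of floats).
    Returns (normalised_cost, start_frame, end_frame).
    """
    n, m = len(query), len(utterance)
    inf = float("inf")
    # acc[i][j]: minimal accumulated cost of aligning query[0..i]
    # with some utterance segment that ends at frame j.
    acc = [[inf] * m for _ in range(n)]
    # start[i][j]: utterance frame where that best segment begins.
    start = [[0] * m for _ in range(n)]

    # The match may begin at any utterance frame at no extra cost.
    for j in range(m):
        acc[0][j] = frame_distance(query[0], utterance[j])
        start[0][j] = j

    for i in range(1, n):
        for j in range(m):
            d = frame_distance(query[i], utterance[j])
            # Allowed predecessors: vertical, horizontal, diagonal.
            candidates = [(acc[i - 1][j], start[i - 1][j])]
            if j > 0:
                candidates.append((acc[i][j - 1], start[i][j - 1]))
                candidates.append((acc[i - 1][j - 1], start[i - 1][j - 1]))
            best_cost, best_start = min(candidates)
            acc[i][j] = best_cost + d
            start[i][j] = best_start

    # The match may end at any utterance frame; pick the cheapest.
    end = min(range(m), key=lambda j: acc[n - 1][j])
    return acc[n - 1][end] / n, start[n - 1][end], end


# Toy one-dimensional "features": the query occurs at frames 2..4.
query = [[0.0], [1.0], [2.0]]
utterance = [[5.0], [5.0], [0.0], [1.0], [2.0], [5.0]]
cost, begin, finish = subsequence_dtw(query, utterance)
```

In this toy case the query is embedded verbatim, so the search recovers frames 2 to 4 with zero cost. With real speech, the frame distance is computed on raw acoustic features, which is where the speaker and acoustic variabilities mentioned above degrade the match.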

Data availability

The dataset and evaluation scripts used in this study are available at https://speech.fit.vutbr.cz/software.

Notes

  1. https://speech.fit.vutbr.cz/software/quesst-2014-multilingual-database-query-by-example-keyword-spotting.

  2. https://github.com/sudhakar-pandiarajan/KWS.

  3. https://github.com/iiscleap/FeatureExtractionUsingFDLP.

Funding

Not applicable.

Author information

Contributions

All authors contributed to the conceptualization, methodology, implementation, and writing of the article.

Corresponding author

Correspondence to P. Sudhakar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Code availability

The implementation is available at https://github.com/sudhakar-pandiarajan/KWS.

Ethics approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sudhakar, P., Rao, K.S. & Mitra, P. A Novel Zero-Resource Spoken Term Detection Using Affinity Kernel Propagation with Acoustic Feature Map. SN COMPUT. SCI. 4, 310 (2023). https://doi.org/10.1007/s42979-023-01754-9

  • DOI: https://doi.org/10.1007/s42979-023-01754-9
