Abstract
Spoken term detection (STD) without linguistic resources is a challenging retrieval task, and despite numerous studies there remains scope for improvement. Dynamic time warping (DTW) based techniques have been used extensively for STD when linguistic resources are unavailable. A drawback of these techniques is that they cope poorly with the speaker, language, acoustic, and spoken-query variabilities present in natural speech. Our approach combines a novel acoustic feature representation with affinity kernel propagation to address these challenges. First, a Self-Organising Map (SOM) based feature vector representation is employed to reduce speaker variability. Next, affinity kernel propagation captures the best alignment between the spoken query and the utterances during similarity matching, without constraining the nature of the query. Together, the acoustic feature mapping and the similarity matching through affinity kernel propagation yielded a 6% gain in Maximum Term Weighted Value (MTWV) and a 5% reduction in cross-entropy cost in an evaluation on the QUESST-14 speech corpus across multiple languages.
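For context, the DTW baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration of classic full-sequence DTW between two feature sequences, not the affinity kernel propagation proposed in the paper; the function name `dtw_cost`, the Euclidean frame distance, and the length normalisation are assumptions made for the example.

```python
import numpy as np

def dtw_cost(query, utterance):
    """Length-normalised DTW alignment cost between two feature sequences.

    query: array of shape (n, d); utterance: array of shape (m, d).
    Hypothetical helper for illustration; not taken from the paper.
    """
    query = np.asarray(query, dtype=float)
    utterance = np.asarray(utterance, dtype=float)
    n, m = len(query), len(utterance)

    # Pairwise Euclidean distance between every query frame and utterance frame.
    dist = np.linalg.norm(query[:, None, :] - utterance[None, :, :], axis=2)

    # acc[i, j] holds the cheapest cost of aligning the first i query frames
    # with the first j utterance frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # query frame advances alone
                acc[i, j - 1],      # utterance frame advances alone
                acc[i - 1, j - 1],  # both advance together
            )

    # Normalise by path-length bound so costs are comparable across lengths.
    return acc[n, m] / (n + m)
```

A time-stretched copy of a sequence aligns at zero cost, which is the property that makes DTW attractive for matching a spoken query against utterances of different speaking rates. It does nothing, however, to normalise speaker or acoustic differences in the features themselves, which is the gap the abstract targets.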
Data availability
The dataset and evaluation scripts used in this study are available at https://speech.fit.vutbr.cz/software.
Funding
Not applicable.
Author information
Contributions
All authors contributed to the conceptualization, methodology, implementation, and writing of the article.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Code availability
The implementation is available at https://github.com/sudhakar-pandiarajan/KWS.
Ethics approval
Not applicable.
About this article
Cite this article
Sudhakar, P., Rao, K.S. & Mitra, P. A Novel Zero-Resource Spoken Term Detection Using Affinity Kernel Propagation with Acoustic Feature Map. SN COMPUT. SCI. 4, 310 (2023). https://doi.org/10.1007/s42979-023-01754-9