Speech refinement using Bi-LSTM and improved spectral clustering in speaker diarization

Gupta, Aishwarya; Purwar, Archana

doi:10.1007/s11042-023-17017-x

Published: 05 December 2023

Multimedia Tools and Applications

Abstract

In today's digitally driven culture, the demand for diarizing online meetings, classes, conferences, and medical diagnoses has increased considerably. Speaker diarization, a sub-domain of speaker recognition, has grown with the advent of neural networks over the last decade. Diarization generally refers to obtaining the speaking durations of the individual speakers in an event. Researchers have suggested various approaches for multiple-speaker diarization. However, these approaches still suffer from environmental noise and from non-speech sounds such as laughter, murmuring, and clapping in the datasets. Hence, this paper proposes an improved speaker diarization pipeline to deal with the noise present in a multi-speaker dataset. The pipeline uses a Bi-directional Long Short-Term Memory (Bi-LSTM) based speech refinement pre-processing module and Modified Spectral Clustering with Symmetrized Singular Value Decomposition (MSC-SSVD). MSC-SSVD is used to address the problems that spectral clustering faces on large datasets. The proposed pipeline is evaluated on the publicly available VoxConverse dataset. The Diarization Error Rates (DER) obtained are 37.2%, 37.1%, and 43.3%, respectively, for the three batches of the dataset under study. The results are also compared with the baseline system, and a significant change in DER of 6.1%, 4.7%, and 7%, respectively, is observed for the three batches.
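To make the reported metric concrete, the following is a minimal sketch of how a frame-level DER can be computed. It is not the authors' evaluation code: it assumes fixed-length frames, at most one speaker per frame (no overlapped speech), and no scoring collar, and the function name `frame_der` and the label values are hypothetical. DER is taken in its standard form, (missed speech + false alarm + speaker confusion) divided by the total reference speech, with hypothesis labels mapped one-to-one onto reference speakers so as to minimise confusion.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Frame-level DER for non-overlapping speech; None marks a non-speech frame."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    total_speech = sum(r is not None for r in reference)
    if total_speech == 0:
        raise ValueError("reference contains no speech frames")

    pairs = list(zip(reference, hypothesis))
    # Reference speech that the system labelled as non-speech.
    missed = sum(r is not None and h is None for r, h in pairs)
    # Reference non-speech that the system labelled as speech.
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    # Frames where both sides contain speech; only these can be confused.
    both = [(r, h) for r, h in pairs if r is not None and h is not None]

    ref_speakers = sorted({r for r in reference if r is not None})
    hyp_speakers = sorted({h for h in hypothesis if h is not None})

    # Pad with None so every hypothesis label can be assigned a distinct slot,
    # even when the system found more speakers than the reference contains.
    n = max(len(ref_speakers), len(hyp_speakers))
    slots = ref_speakers + [None] * (n - len(ref_speakers))

    # Brute-force search over one-to-one mappings (fine for a handful of speakers).
    best_correct = 0
    for perm in permutations(slots, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    return (missed + false_alarm + confusion) / total_speech


# Toy example: 6 reference speech frames, 1 missed, 1 false alarm, 1 confused.
reference = ["A", "A", "A", "B", "B", None, None, "B"]
hypothesis = ["x", "x", "y", "y", "y", "y", None, None]
print(frame_der(reference, hypothesis))  # 0.5
```

Under this definition, a DER of 37.2% means that the missed, falsely detected, and misattributed speech together amount to 37.2% of the reference speech time. The brute-force mapping search is used here only to keep the sketch dependency-free; it grows factorially with the number of speakers.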

Data availability

The datasets analyzed during the current study are publicly available at https://www.robots.ox.ac.uk/~vgg/data/voxconverse/ and can currently be downloaded from the https://github.com/joonson/voxconverse repository on GitHub.

Author information

Authors and Affiliations

Computer Science & Engineering and Information Technology, Jaypee Institute of Information Technology, Noida, Uttar Pradesh, India
Aishwarya Gupta & Archana Purwar

Corresponding author

Correspondence to Aishwarya Gupta.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gupta, A., Purwar, A. Speech refinement using Bi-LSTM and improved spectral clustering in speaker diarization. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-17017-x

Received: 11 November 2022
Revised: 13 August 2023
Accepted: 08 September 2023
Published: 05 December 2023
DOI: https://doi.org/10.1007/s11042-023-17017-x
