Speech signal analysis and enhancement using combined wavelet Fourier transform with stacked deep learning architecture

Published in: International Journal of Speech Technology

Abstract

Speech enhancement is the process of improving the quality and intelligibility of a degraded speech signal. Various regression-based techniques that subtract noise or predict the clean signal have been demonstrated in the literature. This research proposes a novel technique for speech signal analysis and enhancement based on wavelet denoising. The input is a noisy speech signal; noise is removed using a combined wavelet Fourier transform, and a stacked encoder with non-negative matrix factorization is trained and integrated with a convolutional ResNet architecture. Pairs of noisy and clean speech are then used as the input and target, respectively, to train the convolutional ResNet. Finally, the network is retrained end-to-end on pairs of wavelet-denoised and clean speech, with the outputs of the wavelet denoising serving as its inputs. Segmental Signal-to-Noise Ratio (SegSNR) and Perceptual Evaluation of Speech Quality (PESQ) are the metrics considered in the experimental study.
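To make the pipeline concrete, below is a minimal sketch of a wavelet-denoising front end of the kind described above. It assumes the NumPy and PyWavelets (pywt) libraries and uses standard soft thresholding with the universal (VisuShrink) threshold; it is an illustrative stand-in under those assumptions, not the paper's exact combined wavelet Fourier transform.

```python
import numpy as np
import pywt  # PyWavelets


def wavelet_denoise(noisy: np.ndarray, wavelet: str = "db8", level: int = 4) -> np.ndarray:
    """Soft-threshold wavelet denoising of a 1-D speech signal (illustrative sketch)."""
    coeffs = pywt.wavedec(noisy, wavelet, level=level)
    # Robust noise estimate: median absolute deviation of the finest detail band.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    # Universal (VisuShrink) threshold.
    thresh = sigma * np.sqrt(2.0 * np.log(len(noisy)))
    # Keep the approximation band; soft-threshold every detail band.
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(noisy)]
```

Likewise, the Segmental Signal-to-Noise Ratio reported in the evaluation can be computed frame by frame, as in the sketch below. The frame length and the clamping of per-frame SNR to [-10, 35] dB are common conventions assumed here, since the paper's exact settings are not stated in the abstract; PESQ (ITU-T P.862) is normally computed with an existing reference implementation rather than reimplemented.

```python
def segmental_snr(clean: np.ndarray, enhanced: np.ndarray,
                  frame_len: int = 256, eps: float = 1e-10) -> float:
    """Mean per-frame SNR in dB between a clean reference and an enhanced signal."""
    n = min(len(clean), len(enhanced))
    snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        d = s - enhanced[start:start + frame_len]  # residual noise in this frame
        snr = 10.0 * np.log10((np.sum(s ** 2) + eps) / (np.sum(d ** 2) + eps))
        snrs.append(np.clip(snr, -10.0, 35.0))  # conventional SegSNR limits
    return float(np.mean(snrs))
```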

Data availability

All the data generated or analysed during this study are included in the published article.

Author information

Corresponding author

Correspondence to V. Srinivasarao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Srinivasarao, V. Speech signal analysis and enhancement using combined wavelet Fourier transform with stacked deep learning architecture. Int J Speech Technol 26, 735–742 (2023). https://doi.org/10.1007/s10772-023-10044-x
