Elsevier

Speech Communication

Volume 132, September 2021, Pages 114-122
An adaptive transmission line cochlear model based front-end for replay attack detection

https://doi.org/10.1016/j.specom.2021.06.004

Highlights

  • We propose an adaptive transmission line cochlear model for use in speech front-ends.

  • The proposed adaptive elements of the cochlear model lead to improved frequency selectivity and dynamic range compression.

  • The model helps capture low amplitude channel characteristics which aid in replay detection.

Abstract

The cochlea is a remarkable spectrum analyser with desirable properties including sharp frequency tuning and level-dependent compression. This paper investigates the potential advantages of incorporating these characteristics in a speech processing front-end and develops a framework for an active transmission line cochlear model employing adaptive notch and resonant filters. The proposed model reproduces the observed asymmetric auditory filter shape with a sharp high-frequency roll-off and level-dependent nonlinear dynamic range compression characteristics. Experimental analysis demonstrates that the sharp frequency tuning and dynamic range compression of the proposed model lead to an enhanced spectral representation compared with other spectral analysis methods. The proposed model was employed in the front-end of replay spoofing attack detection systems, and experiments on the ASVspoof 2017 version 2.0 and ASVspoof 2019 databases demonstrate that it outperforms linear and nonlinear level-dependent parallel filter bank auditory models and classical spectro-temporal front-ends. On the evaluation datasets, the proposed model yields relative improvements of 45.6% over the baseline CQCC features of ASVspoof 2017 version 2.0, and of 51.9% and 60.8% over the baseline CQCC and LFCC features of ASVspoof 2019, respectively.

Introduction

Auditory model front-ends are integrated into a vast majority of speech processing systems and have been shown to outperform conventional speech processing techniques (Kim et al., 1999; Tchorz and Kollmeier, 1999). Multiple approaches to computational auditory modeling have been reported in the literature. For example, conventional auditory filters have been implemented using a set of overlapping parallel filter banks (Hohmann, 2002; Irino and Patterson, 2006). Alternatively, transmission line auditory models (Lyon, 1997; Kates, 1991), which use a cascade of digital filters that closely mimics the underlying cochlear physiology, have also been developed. These transmission line models reproduce auditory responses more realistically than parallel filter bank models (Lyon, 2011b; Hemmert et al., 2004).
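The structural difference between the two approaches can be illustrated with a minimal sketch. It assumes hypothetical first-order low-pass sections and made-up cutoff frequencies purely for illustration; these are stand-ins and not the filters used in any of the cited models. In the parallel bank each channel sees the input directly, whereas in the cascade each tap inherits the attenuation of all preceding sections, which steepens the high-frequency roll-off.

```python
import numpy as np

def lowpass_response(freqs_hz, cutoff_hz):
    """Complex frequency response of a first-order low-pass section (illustrative stand-in)."""
    return 1.0 / (1.0 + 1j * freqs_hz / cutoff_hz)

def parallel_bank(freqs_hz, cutoffs_hz):
    """Parallel filter bank: every channel filters the input independently."""
    return np.array([lowpass_response(freqs_hz, fc) for fc in cutoffs_hz])

def cascade_bank(freqs_hz, cutoffs_hz):
    """Cascade: channel n is tapped after sections 0..n, so their responses multiply."""
    sections = parallel_bank(freqs_hz, cutoffs_hz)
    return np.cumprod(sections, axis=0)

def to_db(response):
    """Magnitude of a complex response in decibels."""
    return 20.0 * np.log10(np.abs(response))

# Hypothetical layout: cutoffs decrease along the cascade (base to apex).
cutoffs_hz = np.array([4000.0, 2000.0, 1000.0])
freqs_hz = np.array([500.0, 1000.0, 2000.0, 4000.0])

parallel_db = to_db(parallel_bank(freqs_hz, cutoffs_hz))
cascade_db = to_db(cascade_bank(freqs_hz, cutoffs_hz))

# Compare the last channel: the cascade tap is attenuated more above its cutoff.
print("parallel, last channel (dB):", parallel_db[-1].round(1))
print("cascade,  last channel (dB):", cascade_db[-1].round(1))
```

The first channel is identical in both layouts; the difference grows with each later tap because every upstream section contributes additional attenuation.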

Sharp frequency tuning and nonlinear level-dependent dynamic range compression are known to be two prominent phenomena responsible for the sensitivity and selectivity of the auditory system over a broad intensity and frequency range (Moore, 1985; Robles and Ruggero, 2001). Measurements of the mammalian cochlea demonstrate that it has remarkable frequency selectivity with a steep high-frequency roll-off (Moore, 1978). This improved frequency selectivity could in turn lead to noise robustness (Li, 2009).

The level-dependent nonlinear dynamic range compression is achieved via an active feedback mechanism that modifies the auditory response such that low amplitude input signals are boosted. This contributes to increased speech intelligibility (French and Steinberg, 1947; Villchur, 1989). Auditory models that include level-dependent nonlinearity have been shown to improve the generalisability of speech enhancement systems (Baby and Verhulst, 2018) and have been successful in analysing, classifying and recognizing sounds in applications such as audio content categorization and music recommendation (Lyon, 2011b).
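One way to picture level-dependent compression is a resonator whose quality factor Q falls as the input level rises, so that quiet inputs receive more gain than loud ones. The sketch below is only an illustration under stated assumptions: the function names, the Q range (1 to 10) and the level range (20 to 80 dB) are hypothetical, and the mapping is not taken from the paper. It relies on the fact that a second-order low-pass resonator of the form 1 / (s²/ω0² + s/(Q·ω0) + 1) has a gain of Q at its resonant frequency ω0.

```python
import numpy as np

def adaptive_q(level_db, q_min=1.0, q_max=10.0, low_db=20.0, high_db=80.0):
    """Hypothetical level-to-Q map: high Q for quiet input, low Q for loud input."""
    frac = np.clip((level_db - low_db) / (high_db - low_db), 0.0, 1.0)
    # Interpolate geometrically so the gain in dB falls linearly with level.
    return q_max * (q_min / q_max) ** frac

def output_level_db(level_db, **kwargs):
    """Output level at resonance, where the resonator gain equals Q."""
    return level_db + 20.0 * np.log10(adaptive_q(level_db, **kwargs))

levels_in = np.array([20.0, 50.0, 80.0])
levels_out = output_level_db(levels_in)

print("input range (dB): ", np.ptp(levels_in))
print("output range (dB):", np.ptp(levels_out))
```

With these assumed parameters the quietest input is boosted the most and the loudest is left unchanged, so the output spans a narrower range of levels than the input.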

A number of active auditory models that include level-dependent nonlinearities have been validated by comparing their response characteristics with the available experimental measurements of the cochlea (Walters, 2011; Kates, 1993). However, their application in different speech processing systems has not been extensively investigated thus far.

In this paper, an active cochlear model is developed that focuses on reproducing the sharp frequency tuning and level-dependent nonlinear characteristics of the cochlea in a way that closely matches physiological observations. A front-end based on this model is then developed for replay spoofing attack detection in automatic speaker verification systems. Channel and environmental acoustic distortions are the key discriminative cues used to identify a replay attack (Wu et al., 2015; Singh and Pati, 2019). It is anticipated that the proposed model will effectively capture these discriminative cues from regions of silence, pauses or low speech amplitude. The proposed model extends earlier work published by the authors (Gunendradasan et al., 2019a; Gunendradasan et al., 2019b) to incorporate level-dependent nonlinear dynamic range compression.

Section snippets

Related work

This section discusses the literature on auditory models that incorporate sharp frequency tuning and nonlinear level-dependent cochlear characteristics, as well as some background on replay spoofing attack detection.

Proposed adaptive transmission line (ATL) cochlear model

This section presents the implementation details of the proposed active transmission line cochlear model developed from the analytical electrical representation of the cochlea. It introduces relevant background on the passive transmission line cochlear models before the proposed adaptive transmission line cochlear model is detailed.

Proposed ATL cochlear model characteristics

The proposed ATL model produces an auditory filter shape similar to the one shown in Fig. 1, in close agreement with the physiological tuning curves of the mammalian cochlea. The auditory responses of the proposed model at different frequency positions are illustrated in Fig. 5. The model exhibits the desired characteristics of broader tuning on the low-frequency side and narrower tuning on the high-frequency side (Robles and Ruggero, 2001). A comparison of the high-frequency side

Experimental setup

Experiments were conducted to investigate the potential benefits of the proposed ATL cochlear model as a front-end for replay spoofing attack detection. This section details the feature extraction process from the ATL model for replay attack detection. Further, the database used for the experiments, the experimental settings and the baseline model used for the comparison are discussed.

The amplitude modulation (AM) feature that tracks the amplitude envelope of the speech signal was investigated
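As a rough illustration of what tracking an amplitude envelope means, the sketch below computes the envelope of a synthetic amplitude-modulated tone as the magnitude of its analytic signal, built with an FFT-based Hilbert transform. This is a generic textbook technique and an assumption made here for illustration; the snippet does not say how the paper extracts its AM feature, and the function names, sampling rate and test signal are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT: zero the negative frequencies, double the positive ones."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep DC as is
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep the Nyquist bin as is
        weights[1:n // 2] = 2.0           # double positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def amplitude_envelope(x):
    """Instantaneous amplitude: magnitude of the analytic signal."""
    return np.abs(analytic_signal(x))

# Synthetic test signal: 1 kHz carrier with a 5 Hz amplitude modulation.
fs = 16000                                # assumed sampling rate in Hz
t = np.arange(fs) / fs                    # one second of samples
modulator = 1.0 + 0.5 * np.cos(2 * np.pi * 5 * t)
signal = modulator * np.cos(2 * np.pi * 1000 * t)

envelope = amplitude_envelope(signal)
print("max envelope error:", np.max(np.abs(envelope - modulator)))
```

In an auditory front-end, an envelope of this kind would typically be computed per filter-bank channel rather than on the full-band signal, so that each channel contributes its own AM trajectory.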

Results and discussion on replay spoofing attack detection

In this section, comparisons of the proposed ATL model with other auditory models and spectral feature extraction techniques are presented, based on the ASVspoof 2017 version 2.0 and ASVspoof 2019 databases. AM and short-term spectral energy based features are among the most widely used features for distinguishing genuine speech from replayed speech. The ASVspoof 2017 challenge baseline feature, constant-Q cepstral coefficients (CQCC), uses the constant-Q transform (CQT) for spectral decomposition. There are

Conclusion

This paper presents an adaptive transmission line (ATL) cochlear model that includes novel adaptive notch and resonant filters to mimic the feedback provided by outer hair cells in the cochlea. This in turn leads to a cochlear model with auditory filter shapes, frequency selectivity, and nonlinear level-dependent dynamic range compression characteristics in close agreement with experimental measurements of the human cochlea. Our results show that the high selectivity achieved by the proposed

CRediT authorship contribution statement

Tharshini Gunendradasan: Conceptualization, Methodology, Software, Validation, Visualization, Writing – original draft. Eliathamby Ambikairajah: Conceptualization, Methodology, Writing – review & editing, Supervision, Project administration, Funding acquisition. Julien Epps: Investigation, Writing – review & editing, Funding acquisition. Vidhyasaharan Sethu: Methodology, Investigation, Writing – review & editing, Funding acquisition. Haizhou Li: Methodology, Writing – review & editing, Funding

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was funded by ARC Discovery Grant DP190102479. The authors would also like to thank the reviewers for the invaluable feedback which helped improve this paper.

References (57)

  • D. Baby et al.

    Biophysically-inspired features improve the generalizability of neural network-based speech enhancement systems

  • H. Delgado et al.

    ASVspoof 2017 Version 2.0: meta-data analysis and baseline enhancements

  • R. Font et al.

    Experimental analysis of features for replay attack detection–results on the ASVspoof 2017 Challenge

    Proc. Interspeech

    (2017)
  • N.R. French et al.

    Factors governing the intelligibility of speech sounds

    J. Acoust. Soc. Am.

    (1947)
  • C. Giguere et al.

    A computational model of the auditory periphery for speech and hearing research. I. Ascending path

    J. Acoust. Soc. Am.

    (1994)
  • T. Gunendradasan et al.

    An adaptive-Q cochlear model for replay spoofing detection

  • T. Gunendradasan et al.

    Transmission line cochlear model based AM-FM features for replay attack detection

  • T. Gunendradasan et al.

    Detection of replay-spoofing attacks using frequency modulation features

    Proc. Interspeech

    (2018)
  • J. Hall

    Spatial differentiation as an auditory “second filter”: assessment on a nonlinear model of the basilar membrane

    J. Acoust. Soc. Am.

    (1977)
  • W. Hemmert et al.

    Auditory-based automatic speech recognition

    ISCA Tutorial and Research Workshop (ITRW) On Statistical and Perceptual Audio Processing

    (2004)
  • T. Hirahara et al.

    A computational cochlear nonlinear preprocessing model with adaptive Q circuits

  • V. Hohmann

    Frequency analysis and synthesis using a Gammatone filterbank

    Acta Acustica united with Acustica

    (2002)
  • T. Irino et al.

    A compressive gammachirp auditory filter for both physiological and psychophysical data

    J. Acoust. Soc. Am.

    (2001)
  • T. Irino et al.

    A dynamic compressive gammachirp auditory filterbank

    IEEE Trans. Audio Speech Lang. Process.

    (2006)
  • P. Johannesma

    The pre-response stimulus ensemble of neurons in the cochlear nucleus

  • M. Kamble et al.

    Effectiveness of Speech Demodulation-Based Features for Replay Detection

  • M.R. Kamble et al.

    Combination of amplitude and frequency modulation features for presentation attack detection

    J. Signal Process Syst.

    (2020)
  • J.M. Kates

    A time-domain digital cochlear model

    IEEE Trans. Signal Process.

    (1991)