An adaptive transmission line cochlear model based front-end for replay attack detection

doi:10.1016/j.specom.2021.06.004

Speech Communication

Volume 132, September 2021, Pages 114-122

https://doi.org/10.1016/j.specom.2021.06.004

Highlights

• We propose an adaptive transmission line cochlear model for use in speech front-ends.
• The proposed adaptive elements of the cochlear model lead to improved frequency selectivity and dynamic range compression.
• The model helps capture low amplitude channel characteristics which aid in replay detection.

Abstract

The cochlea is a remarkable spectrum analyser with desirable properties including sharp frequency tuning and level-dependent compression. This paper investigates the potential advantages of incorporating these characteristics in a speech processing front-end and develops a framework for an active transmission line cochlear model employing adaptive notch and resonant filters. The proposed model reproduces the observed asymmetric auditory filter shape with a sharp high-frequency roll-off and level-dependent nonlinear dynamic range compression characteristics. Experimental analysis demonstrates that the sharp frequency tuning and dynamic range compression of the proposed model lead to an enhanced spectral representation compared with other spectral analysis methods. The proposed model was employed in the front-end of replay spoofing attack detection systems, and experiments on the ASVspoof 2017 version 2.0 and ASVspoof 2019 databases demonstrate that it outperforms linear and nonlinear level-dependent parallel filter bank auditory models and classical spectro-temporal front-ends. On the evaluation datasets, the use of the proposed model leads to relative improvements of 45.6%, 51.9% and 60.8% over the CQCC baseline of ASVspoof 2017 version 2.0 and the CQCC and LFCC baselines of ASVspoof 2019, respectively.

Introduction

Auditory model front-ends are integrated into a vast majority of speech processing systems and have been shown to outperform conventional speech processing techniques (Kim et al., 1999; Tchorz and Kollmeier, 1999). Multiple approaches to computational auditory modeling have been reported in the literature. For example, conventional auditory filters have been implemented using a set of overlapping parallel filter banks (Hohmann, 2002; Irino and Patterson, 2006). Alternatively, transmission line auditory models (Lyon, 1997; Kates, 1991), which use a cascade of digital filters that closely mimics the underlying cochlear physiology, have also been developed. These transmission line models reproduce auditory responses more realistically than parallel filter bank models (Lyon, 2011b; Hemmert et al., 2004).
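
The structural difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the model of any cited work: the function names (`resonator`, `lowpass`, `parallel_bank`, `cascade_bank`), the one-pole low-pass sections and the pole radius `r` are all hypothetical choices made for the example. In the parallel bank every channel filters the same input, whereas in the cascade each section passes its output to the next and a resonator taps the signal at each stage.

```python
import math

def resonator(x, fc, fs, r=0.9):
    """Two-pole digital resonator centred near fc (Hz); r sets the bandwidth."""
    theta = 2.0 * math.pi * fc / fs
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = s + 2.0 * r * math.cos(theta) * y1 - r * r * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def lowpass(x, fc, fs):
    """One-pole low-pass section with unit DC gain and cutoff near fc (Hz)."""
    a = math.exp(-2.0 * math.pi * fc / fs)
    y = 0.0
    out = []
    for s in x:
        y = (1.0 - a) * s + a * y
        out.append(y)
    return out

def parallel_bank(x, centres, fs):
    # Parallel structure: each channel sees the unmodified input.
    return [resonator(x, fc, fs) for fc in centres]

def cascade_bank(x, centres, fs):
    # Cascade structure: centres are ordered from high to low frequency.
    # Each section filters the output of the previous one, and a resonator
    # taps the signal at that point to form one output channel.
    taps = []
    signal = x
    for fc in centres:
        signal = lowpass(signal, fc, fs)
        taps.append(resonator(signal, fc, fs))
    return taps
```

In the cascade, a high-frequency component is attenuated by every preceding section before it reaches a low-frequency tap, which is one simple way to obtain a steeper high-frequency roll-off than an isolated channel of the same order.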

Sharp frequency tuning and nonlinear level-dependent dynamic range compression are known to be two prominent phenomena responsible for the sensitivity and selectivity of the auditory system over a broad intensity and frequency range (Moore, 1985; Robles and Ruggero, 2001). Measurements of the mammalian cochlea demonstrate that it has remarkable frequency selectivity with a steep high-frequency roll-off (Moore, 1978). This sharp frequency selectivity could in turn lead to noise robustness (Li, 2009).

The level-dependent nonlinear dynamic range compression is achieved via an active feedback mechanism that modifies the auditory response such that low amplitude input signals are boosted. This contributes to increased speech intelligibility (French and Steinberg, 1947; Villchur, 1989). Auditory models that include level-dependent nonlinearity have been shown to improve the generalisability of speech enhancement systems (Baby and Verhulst, 2018) and have been successful in analysing, classifying and recognizing sounds in applications such as audio content categorization and music recommendation (Lyon, 2011b).
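
The effect of such compression on signal levels can be illustrated with a simple static input-output function. The sketch below is a generic "broken-stick" compressor, not the feedback mechanism of the proposed model, and the knee point, maximum gain and compression slope are illustrative values chosen for the example rather than values from the paper. Inputs below the knee receive the full gain, while the gain shrinks progressively for louder inputs.

```python
def compressed_level_db(level_db, knee_db=30.0, max_gain_db=40.0, slope=0.3):
    """Output level (dB) of a broken-stick compressive input-output function.

    Below the knee the response is linear with the full gain; above it the
    output grows by only `slope` dB per dB of input. All parameter values
    are illustrative assumptions.
    """
    if level_db <= knee_db:
        return level_db + max_gain_db
    return knee_db + max_gain_db + slope * (level_db - knee_db)

def gain_db(level_db, **params):
    """Gain (dB) applied to an input at the given level."""
    return compressed_level_db(level_db, **params) - level_db

if __name__ == "__main__":
    for level in (0, 20, 40, 60, 80, 100):
        print(f"input {level:3d} dB -> gain {gain_db(level):5.1f} dB")
```

With these example parameters a 100 dB input range is mapped onto an output range of roughly 51 dB, with most of the gain spent on the low amplitude inputs.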

A number of active auditory models that include level-dependent nonlinearities have been validated by comparing their response characteristics with the available experimental measurements of the cochlea (Walters, 2011; Kates, 1993). However, their application in different speech processing systems has not been extensively investigated thus far.

In this paper, an active cochlear model is developed that focuses on reproducing the sharp frequency tuning and level-dependent nonlinear characteristics of the cochlea in a way that closely matches physiological observations. A front-end based on this model is then developed for replay spoofing attack detection in automatic speaker verification systems. Channel and environmental acoustic distortions are the key discriminative cues used to identify a replay attack (Wu et al., 2015; Singh and Pati, 2019). It is anticipated that the proposed model will effectively capture these discriminative cues from regions of silence, pauses or low speech amplitude. The proposed model extends earlier work published by the authors (Gunendradasan et al., 2019a; Gunendradasan et al., 2019b) to incorporate level-dependent nonlinear dynamic range compression.

Section snippets

Related work

This section discusses the literature on the auditory models that incorporate sharp frequency tuning and nonlinear level-dependent cochlea characteristics as well as some background on replay spoofing attack detection.

Proposed adaptive transmission line (ATL) cochlear model

This section presents the implementation details of the proposed active transmission line cochlear model developed from the analytical electrical representation of the cochlea. It introduces relevant background on the passive transmission line cochlear models before the proposed adaptive transmission line cochlear model is detailed.
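
As a rough illustration of what an adaptive resonant filter can look like, the sketch below implements a two-pole resonator whose pole radius, and hence its sharpness and gain, is lowered as a running estimate of the output level rises. This is a hypothetical construction for intuition only and should not be read as the filter design used in the proposed ATL model: the adaptation rule, the function name `adaptive_resonator` and every parameter (`r_max`, `r_min`, `level_ref`, `smooth`) are assumptions made for the example.

```python
import math

def adaptive_resonator(x, fc, fs, r_max=0.98, r_min=0.80,
                       level_ref=1.0, smooth=0.99):
    """Two-pole resonator with a level-dependent pole radius (illustrative).

    Quiet outputs keep the pole radius near r_max (sharp tuning, high gain);
    loud outputs pull it towards r_min (broader tuning, lower gain).
    """
    theta = 2.0 * math.pi * fc / fs
    y1 = y2 = 0.0
    level = 0.0  # smoothed estimate of the output magnitude
    out = []
    for s in x:
        # Map the level estimate to a weight in [0, 1), then to a radius.
        weight = level / (level + level_ref)
        r = r_max - (r_max - r_min) * weight
        y = s + 2.0 * r * math.cos(theta) * y1 - r * r * y2
        out.append(y)
        y2, y1 = y1, y
        # One-pole smoothing of the rectified output.
        level = smooth * level + (1.0 - smooth) * abs(y)
    return out
```

Driving this filter with a tone at its centre frequency gives a much larger gain for a low amplitude input than for a high amplitude one, which is the qualitative behaviour expected of level-dependent compression.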

Proposed ATL cochlear model characteristics

The proposed ATL model produces an auditory filter shape similar to the one shown in Fig. 1, in close agreement with the physiological tuning curves of the mammalian cochlea. The auditory responses of the proposed model at different frequency positions are illustrated in Fig. 5. The model exhibits the desired characteristics of broader tuning on the low-frequency side and narrower tuning on the high-frequency side (Robles and Ruggero, 2001). A comparison of the high-frequency side

Experimental setup

Experiments were conducted to investigate the potential benefits of the proposed ATL cochlear model as a front-end for replay spoofing attack detection. This section details the feature extraction process from the ATL model for replay attack detection. Further, the database used for the experiments, the experimental settings and the baseline model used for the comparison are discussed.

The amplitude modulation (AM) feature that tracks the amplitude envelope of the speech signal was investigated
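
One common way to track an amplitude envelope is full-wave rectification followed by low-pass smoothing. The sketch below shows that generic approach; it is not necessarily how the AM feature is computed in the paper, and the function name `amplitude_envelope`, the one-pole smoother and the 50 Hz cutoff are assumptions made for the example.

```python
import math

def amplitude_envelope(x, fs, cutoff_hz=50.0):
    """Estimate the amplitude envelope of x by rectify-and-smooth.

    The pi/2 factor rescales the mean of a rectified sinusoid (2A/pi) back
    to its amplitude A; for other waveforms it is only approximate.
    """
    a = math.exp(-2.0 * math.pi * cutoff_hz / fs)
    y = 0.0
    env = []
    for s in x:
        y = (1.0 - a) * abs(s) + a * y  # one-pole low-pass of |x|
        env.append(y * math.pi / 2.0)
    return env

if __name__ == "__main__":
    fs = 16000
    tone = [0.5 * math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(4000)]
    print(round(amplitude_envelope(tone, fs)[-1], 2))
```

In a filter bank front-end, an envelope of this kind would typically be computed per channel before any further feature processing.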

Results and discussion on replay spoofing attack detection

In this section, comparisons of the proposed ATL model with other auditory models and spectral feature extraction techniques are presented, based on the ASVspoof 2017 version 2.0 and ASVspoof 2019 databases. AM and short-term spectral energy based features are among the most widely used features for distinguishing genuine speech from replayed speech. The ASVspoof 2017 challenge baseline feature, constant-Q cepstral coefficients (CQCC), uses the constant-Q transform (CQT) for spectral decomposition. There are

Conclusion

This paper presents an adaptive transmission line (ATL) cochlear model that includes novel adaptive notch and resonant filters to mimic the feedback provided by outer hair cells in the cochlea. This in turn leads to a cochlear model with auditory filter shapes, frequency selectivity, and nonlinear level-dependent dynamic range compression characteristics in close agreement with experimental measurements of the human cochlea. Our results show that the high selectivity achieved by the proposed

CRediT authorship contribution statement

Tharshini Gunendradasan: Conceptualization, Methodology, Software, Validation, Visualization, Writing – original draft. Eliathamby Ambikairajah: Conceptualization, Methodology, Writing – review & editing, Supervision, Project administration, Funding acquisition. Julien Epps: Investigation, Writing – review & editing, Funding acquisition. Vidhyasaharan Sethu: Methodology, Investigation, Writing – review & editing, Funding acquisition. Haizhou Li: Methodology, Writing – review & editing, Funding

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was funded by ARC Discovery Grant DP190102479. The authors would also like to thank the reviewers for the invaluable feedback which helped improve this paper.

References (57)

E. Ambikairajah et al. Digital filter simulation of the basilar membrane. Comput. Speech Lang. (1989)
J. Gałka et al. Playback attack detection for text-dependent speaker verification over telephone channels. Speech Commun. (2015)
J.L. Goldstein. Modeling rapid waveform compression on the basilar membrane as multiple-bandpass-nonlinearity filtering. Hear. Res. (1990)
C. Hanilci et al. Spoofing detection goes noisy: an analysis of synthetic speech detection in the presence of additive noise. Speech Commun. (2016)
B. Johnstone et al. Basilar membrane measurements and the travelling wave. Hear. Res. (1986)
M.R. Kamble et al. Amplitude and frequency modulation-based features for detection of replay spoof speech. Speech Commun. (2020)
E.S. Olson et al. Von Békésy and cochlear mechanics. Hear. Res. (2012)
Z. Wu et al. Spoofing and countermeasures for speaker verification: a survey. Speech Commun. (2015)
H. Yin et al. Acoustic features for speech recognition based on Gammatone filterbank and instantaneous frequency. Speech Commun. (2011)
J. Allen. Nonlinear cochlear signal processing. Physiol. Ear (2001)
D. Baby et al. Biophysically-inspired features improve the generalizability of neural network-based speech enhancement systems.
H. Delgado et al. ASVspoof 2017 Version 2.0: meta-data analysis and baseline enhancements.
R. Font et al. Experimental analysis of features for replay attack detection–results on the ASVspoof 2017 Challenge. Proc. Interspeech (2017)
N.R. French et al. Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am. (1947)
C. Giguere et al. A computational model of the auditory periphery for speech and hearing research. I. Ascending path. J. Acoust. Soc. Am. (1994)
T. Gunendradasan et al. An adaptive-Q cochlear model for replay spoofing detection.
T. Gunendradasan et al. Transmission line cochlear model based AM-FM features for replay attack detection.
T. Gunendradasan et al. Detection of replay-spoofing attacks using frequency modulation features. Proc. Interspeech (2018)
J. Hall. Spatial differentiation as an auditory “second filter”: assessment on a nonlinear model of the basilar membrane. J. Acoust. Soc. Am. (1977)
W. Hemmert et al. Auditory-based automatic speech recognition. ISCA Tutorial and Research Workshop (ITRW) On Statistical and Perceptual Audio Processing (2004)
T. Hirahara et al. A computational cochlear nonlinear preprocessing model with adaptive Q circuits.
V. Hohmann. Frequency analysis and synthesis using a Gammatone filterbank. Acta Acustica united with Acustica (2002)
T. Irino et al. A compressive gammachirp auditory filter for both physiological and psychophysical data. J. Acoust. Soc. Am. (2001)
T. Irino et al. A dynamic compressive gammachirp auditory filterbank. IEEE Trans. Audio Speech Lang. Process. (2006)
P. Johannesma. The pre-response stimulus ensemble of neurons in the cochlear nucleus.
M. Kamble et al. Effectiveness of Speech Demodulation-Based Features for Replay Detection.
M.R. Kamble et al. Combination of amplitude and frequency modulation features for presentation attack detection. J. Signal Process Syst. (2020)
J.M. Kates. A time-domain digital cochlear model. IEEE Trans. Signal Process. (1991)
