An adaptive transmission line cochlear model based front-end for replay attack detection
Introduction
Auditory model front-ends are integrated into the vast majority of speech processing systems and have been shown to outperform conventional speech processing techniques (Kim et al., 1999; Tchorz and Kollmeier, 1999). Multiple approaches to computational auditory modeling have been reported in the literature. For example, conventional auditory filters have been implemented using a set of overlapping parallel filter banks (Hohmann, 2002; Irino and Patterson, 2006). Alternatively, transmission line auditory models (Kates, 1991; Lyon, 1997), which use a cascade of digital filters that closely mimics the underlying cochlear physiology, have also been developed. These transmission line models reproduce auditory responses more realistically than parallel filter bank models (Hemmert et al., 2004; Lyon, 2011b).
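The structural difference between the two families can be sketched in a few lines. This is a minimal illustration only, assuming simple one-pole low-pass sections as stand-ins for the auditory filters; the function names and coefficients are hypothetical and are not those of the cited models. In the parallel bank every filter sees the input signal directly, whereas in the cascade each section filters the output of the previous one and a channel output is tapped after every section.

```python
def one_pole_lowpass(signal, coeff):
    """Placeholder filter section (assumed for illustration, not an auditory filter)."""
    out, state = [], 0.0
    for x in signal:
        state += coeff * (x - state)  # first-order recursive smoothing
        out.append(state)
    return out


def parallel_bank(signal, coeffs):
    """Parallel structure: every section filters the same input signal."""
    return [one_pole_lowpass(signal, c) for c in coeffs]


def cascade_bank(signal, coeffs):
    """Cascade structure: each section filters the previous section's output."""
    taps, current = [], list(signal)
    for c in coeffs:
        current = one_pole_lowpass(current, c)
        taps.append(current)  # tap one channel output after each section
    return taps


impulse = [1.0, 0.0, 0.0, 0.0]
coeffs = [0.5, 0.25]  # hypothetical section coefficients
parallel_out = parallel_bank(impulse, coeffs)
cascade_out = cascade_bank(impulse, coeffs)
```

The first channel is identical in both structures, but from the second channel onwards the cascade output carries the accumulated filtering of all earlier sections, which is the property that transmission line models exploit.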
Sharp frequency tuning and nonlinear level-dependent dynamic range compression are known to be two prominent phenomena responsible for the sensitivity and selectivity of the auditory system over a broad intensity and frequency range (Moore, 1985; Robles and Ruggero, 2001). Measurements of the mammalian cochlea demonstrate that it has remarkable frequency selectivity with a steep high-frequency roll-off (Moore, 1978). This frequency selectivity could in turn lead to noise robustness (Li, 2009).
The level-dependent nonlinear dynamic range compression is achieved via an active feedback mechanism that modifies the auditory response such that low-amplitude input signals are boosted. This contributes to increased speech intelligibility (French and Steinberg, 1947; Villchur, 1989). Auditory models that include level-dependent nonlinearity have been shown to improve the generalisability of speech enhancement systems (Baby and Verhulst, 2018) and have been successful in analysing, classifying and recognizing sounds in applications such as audio content categorization and music recommendation (Lyon, 2011b).
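As a rough illustration of level-dependent compression, the sketch below uses a hypothetical broken-stick gain rule: a fixed gain below a knee level and a gain that shrinks as the input level rises above it. The knee, maximum gain and compression slope are assumed values chosen for illustration, not parameters of the proposed model.

```python
def level_dependent_gain_db(level_db, knee_db=30.0, max_gain_db=40.0, slope=0.25):
    """Gain in dB applied to an input at level_db (all defaults are assumptions).

    Below the knee the full gain is applied. Above it, the output level grows
    by only `slope` dB per input dB, so the gain shrinks and is floored at 0 dB.
    """
    excess_db = max(level_db - knee_db, 0.0)
    gain_db = max_gain_db - (1.0 - slope) * excess_db
    return max(gain_db, 0.0)


def output_level_db(level_db, **kwargs):
    """Output level in dB after applying the level-dependent gain."""
    return level_db + level_dependent_gain_db(level_db, **kwargs)


quiet_gain = level_dependent_gain_db(20.0)  # input below the knee
loud_gain = level_dependent_gain_db(70.0)   # input well above the knee
```

With these assumed settings a 20 dB input receives the full 40 dB gain while a 70 dB input receives 10 dB, so a 20 dB change in input level above the knee maps to a 5 dB change in output level.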
A number of active auditory models that include level-dependent nonlinearities have been validated by comparing their response characteristics with the available experimental measurements of the cochlea (Kates, 1993; Walters, 2011). However, their application in different speech processing systems has not been extensively investigated thus far.
In this paper, an active cochlear model is developed that focuses on reproducing the sharp frequency tuning and level-dependent nonlinear characteristics of the cochlea in a way that closely matches physiological observations. A front-end based on this model is then developed for replay spoofing attack detection in automatic speaker verification systems. Channel and environmental acoustic distortions are the key discriminative cues used to identify replay attacks (Wu et al., 2015; Singh and Pati, 2019). It is anticipated that the proposed model will effectively capture these discriminative cues from regions of silence, pauses or low speech amplitude. The proposed model extends earlier work published by the authors (Gunendradasan et al., 2019a; Gunendradasan et al., 2019b) to incorporate level-dependent nonlinear dynamic range compression.
Section snippets
Related work
This section discusses the literature on the auditory models that incorporate sharp frequency tuning and nonlinear level-dependent cochlea characteristics as well as some background on replay spoofing attack detection.
Proposed adaptive transmission line (ATL) cochlear model
This section presents the implementation details of the proposed active transmission line cochlear model developed from the analytical electrical representation of the cochlea. It introduces relevant background on the passive transmission line cochlear models before the proposed adaptive transmission line cochlear model is detailed.
Proposed ATL cochlear model characteristics
The proposed ATL model produces an auditory filter shape similar to the one shown in Fig. 1, in close agreement with the physiological tuning curves of the mammalian cochlea. The auditory responses of the proposed model at different frequency positions are illustrated in Fig. 5. The model exhibits the desired characteristics of broader tuning on the low-frequency side and narrower tuning on the high-frequency side (Robles and Ruggero, 2001). A comparison of the high-frequency side
Experimental setup
Experiments were conducted to investigate the potential benefits of the proposed ATL cochlear model as a front-end for replay spoofing attack detection. This section details the feature extraction process from the ATL model for replay attack detection. Further, the database used for the experiments, the experimental settings and the baseline model used for the comparison are discussed.
The amplitude modulation (AM) feature that tracks the amplitude envelope of the speech signal was investigated
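To make the notion of tracking an amplitude envelope concrete, the sketch below uses a generic envelope detector: full-wave rectification followed by a one-pole low-pass smoother. This is an assumed, textbook-style stand-in and not the AM feature extraction used in the paper; the function name, the 50 Hz cutoff and the test signal are hypothetical.

```python
import math


def amplitude_envelope(signal, sample_rate, cutoff_hz=50.0):
    """Rectify-and-smooth envelope estimate (illustrative assumption only)."""
    # One-pole low-pass coefficient for the chosen cutoff frequency.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    envelope, state = [], 0.0
    for x in signal:
        state += alpha * (abs(x) - state)  # smooth the rectified sample
        envelope.append(state)
    return envelope


# Hypothetical test signal: a 1 kHz carrier with 5 Hz amplitude modulation.
sample_rate = 16000
times = [n / sample_rate for n in range(6400)]
am_tone = [(1.0 + 0.8 * math.sin(2.0 * math.pi * 5.0 * t))
           * math.sin(2.0 * math.pi * 1000.0 * t) for t in times]
envelope = amplitude_envelope(am_tone, sample_rate)
```

The resulting envelope rises and falls with the 5 Hz modulation while the 1 kHz carrier is smoothed away. In a filter-bank front-end, an envelope of this kind would be computed per channel.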
Results and discussion on replay spoofing attack detection
In this section, comparisons of the proposed ATL model with other auditory models and spectral feature extraction techniques are presented, based on the ASVspoof 2017 version 2.0 and ASVspoof 2019 databases. AM and short-term spectral energy based features are among the most widely used features for distinguishing genuine speech from replayed speech. The ASVspoof 2017 challenge baseline feature, constant-Q cepstral coefficients (CQCC), uses the constant-Q transform (CQT) for spectral decomposition. There are
Conclusion
This paper presents an adaptive transmission line (ATL) cochlear model that includes novel adaptive notch and resonant filters to mimic the feedback provided by outer hair cells in the cochlea. This in turn leads to a cochlear model with auditory filter shapes, frequency selectivity, and nonlinear level-dependent dynamic range compression characteristics in close agreement with experimental measurements of the human cochlea. Our results show that the high selectivity achieved by the proposed
CRediT authorship contribution statement
Tharshini Gunendradasan: Conceptualization, Methodology, Software, Validation, Visualization, Writing – original draft. Eliathamby Ambikairajah: Conceptualization, Methodology, Writing – review & editing, Supervision, Project administration, Funding acquisition. Julien Epps: Investigation, Writing – review & editing, Funding acquisition. Vidhyasaharan Sethu: Methodology, Investigation, Writing – review & editing, Funding acquisition. Haizhou Li: Methodology, Writing – review & editing, Funding
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was funded by ARC Discovery Grant DP190102479. The authors would also like to thank the reviewers for their invaluable feedback, which helped improve this paper.
References (57)
- et al. Digital filter simulation of the basilar membrane. Comput. Speech Lang. (1989)
- et al. Playback attack detection for text-dependent speaker verification over telephone channels. Speech Commun. (2015)
- Modeling rapid waveform compression on the basilar membrane as multiple-bandpass-nonlinearity filtering. Hear. Res. (1990)
- et al. Spoofing detection goes noisy: an analysis of synthetic speech detection in the presence of additive noise. Speech Commun. (2016)
- et al. Basilar membrane measurements and the travelling wave. Hear. Res. (1986)
- et al. Amplitude and frequency modulation-based features for detection of replay spoof speech. Speech Commun. (2020)
- et al. Von Békésy and cochlear mechanics. Hear. Res. (2012)
- et al. Spoofing and countermeasures for speaker verification: a survey. Speech Commun. (2015)
- et al. Acoustic features for speech recognition based on Gammatone filterbank and instantaneous frequency. Speech Commun. (2011)
- Nonlinear cochlear signal processing. Physiol. Ear (2001)
- Biophysically-inspired features improve the generalizability of neural network-based speech enhancement systems
- ASVspoof 2017 Version 2.0: meta-data analysis and baseline enhancements
- Experimental analysis of features for replay attack detection: results on the ASVspoof 2017 Challenge. Proc. Interspeech
- Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am.
- A computational model of the auditory periphery for speech and hearing research. I. Ascending path. J. Acoust. Soc. Am.
- An adaptive-Q cochlear model for replay spoofing detection
- Transmission line cochlear model based AM-FM features for replay attack detection
- Detection of replay-spoofing attacks using frequency modulation features. Proc. Interspeech
- Spatial differentiation as an auditory “second filter”: assessment on a nonlinear model of the basilar membrane. J. Acoust. Soc. Am.
- Auditory-based automatic speech recognition. ISCA Tutorial and Research Workshop (ITRW) on Statistical and Perceptual Audio Processing
- A computational cochlear nonlinear preprocessing model with adaptive Q circuits
- Frequency analysis and synthesis using a Gammatone filterbank. Acta Acustica united with Acustica
- A compressive gammachirp auditory filter for both physiological and psychophysical data. J. Acoust. Soc. Am.
- A dynamic compressive gammachirp auditory filterbank. IEEE Trans. Audio Speech Lang. Process.
- The pre-response stimulus ensemble of neurons in the cochlear nucleus
- Effectiveness of speech demodulation-based features for replay detection
- Combination of amplitude and frequency modulation features for presentation attack detection. J. Signal Process Syst.
- A time-domain digital cochlear model. IEEE Trans. Signal Process.