
Speech Communication

Volume 70, June 2015, Pages 28-41

Mismatched distances from speakers to telephone in a forensic-voice-comparison case

https://doi.org/10.1016/j.specom.2015.03.001

Highlights

  • Illustration of methodology for implementing FVC based on conditions of a real case.

  • Use of relevant data, quantitative measurements, statistical models to calculate LRs.

  • Procedure for testing of validity and reliability under the conditions of the case.

  • Investigation of bias due to mismatched distances of speakers to the microphone.

  • Demonstration of three methods for mismatch compensation.

Abstract

In a forensic-voice-comparison case, one speaker (A) was standing a short distance away from another speaker (B) who was talking on a mobile telephone. Later, speaker A moved closer to the telephone. Shortly thereafter, there was a section of speech where the identity of the speaker was in question – the prosecution claiming that it was speaker A and the defense claiming it was speaker B. All material for training a forensic-voice-comparison system could be extracted from this single recording, but there was a near-far mismatch: Training data for speaker A were mostly far, training data for speaker B were near, and the disputed speech was near. Based on the conditions of this case we demonstrate a methodology for handling forensic casework using relevant data, quantitative measurements, and statistical models to calculate likelihood ratios. A procedure is described for addressing the degree of validity and reliability of a forensic-voice-comparison system under such conditions. Using a set of development speakers we investigate the effect of mismatched distances to the microphone and demonstrate and assess three methods for compensation.

Introduction

Although there remain some dissenting voices, there is wide support for the position that the logically correct way for a forensic scientist to evaluate the strength of forensic evidence is using a likelihood ratio (Evett et al., 2011, Berger et al., 2011, Redmayne et al., 2011, Robertson et al., 2011). A likelihood ratio is the probability of the observed evidence if the prosecution hypothesis were true versus if the defense hypothesis were true (Robertson and Vignaux, 1995, Aitken et al., 2010). Over the last half century there have also been calls for forensic-analysis methodologies to be empirically tested under conditions reflecting those found in casework (see Morrison, 2014 for a review). Morrison and Stoel (2014) have also argued in favor of the calculation of forensic likelihood ratios on the basis of relevant data, quantitative measurements, and statistical models. Morrison (2014) has described a paradigm for the evaluation of the strength of forensic evidence consisting of the following components:

  1. use of the likelihood-ratio framework,

  2. use of approaches based on data representative of the relevant population, quantitative measurements, and statistical models, and

  3. testing of validity and reliability under conditions reflecting those of the case under investigation.
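In standard notation (our summary, not a formula reproduced from the paper), writing $E$ for the observed evidence, $H_p$ for the prosecution hypothesis, and $H_d$ for the defense hypothesis, the likelihood ratio of component 1 is

$$\mathrm{LR} = \frac{p(E \mid H_p)}{p(E \mid H_d)},$$

which the trier of fact combines with the prior odds via the odds form of Bayes' theorem:

$$\frac{p(H_p \mid E)}{p(H_d \mid E)} = \mathrm{LR} \times \frac{p(H_p)}{p(H_d)}.$$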

In this paper we illustrate a methodology for implementing this paradigm based on the conditions of a particular forensic-voice-comparison case: One speaker (speaker A) was standing a short distance away from another (speaker B) who was holding a mobile telephone through which a call had been established to an emergency call center. Both speakers spoke in a loud voice, and their speech was recorded off the telephone system at the emergency call center.1 At a particular point in time speaker A moved closer to the telephone. Shortly thereafter, there was a short section of the recording where the identity of the speaker was in question – the prosecution claimed that it was speaker A and the defense claimed it was speaker B (henceforth this section of the recording is referred to as the “questioned utterance”).2 Based on the circumstances of the case, it was determined that the hypotheses to be considered are

  • the questioned utterance was spoken by speaker A (prosecution hypothesis)

  • the questioned utterance was spoken by speaker B (defense hypothesis)

and that this is an exhaustive list of hypotheses, i.e., a priori the probability that the speaker of questioned origin could be a speaker other than one of these two is zero.

All material for creating models representing these hypotheses, and thus for training a forensic-voice-comparison system, could be extracted from the recording of the conversation; however, there was a mismatch in the distance from the speakers to the microphone. Data from undisputed utterances produced by speaker A that were used for speaker model training were mostly far, while those of speaker B were near, and the questioned utterance was near.

Our purpose here is to illustrate how a forensic voice comparison may be conducted under the conditions of this particular case; however, nothing we say should be taken as an explicit or implicit comment about the strength of evidence in the actual case. For this illustration, we used recordings from a research database. We did not use the recording from the actual case. We picked recordings of a pair of speakers from the research database to stand in place of the speakers on the actual casework recording, then processed these recordings to reflect the recording conditions of the case.3

We describe how we calculated a likelihood ratio using data from a single pair of speakers, and how we assessed the validity and reliability of the system we used to make this calculation. An initial baseline analysis is conducted without applying any compensation for the mismatch in recording conditions (distance to microphone) between the training data from the two speakers. We then use additional pairs of speakers to investigate the effect of this mismatch and to test the effectiveness of three compensation strategies:

  • adjustment for bias in the likelihood ratio output of the system by shifting log likelihood ratios using an offset estimated from likelihood-ratio values calculated in matched and mismatched conditions,4

  • mapping feature vectors in the far condition to more closely resemble the distribution of those in the near condition, and

  • transforming features using canonical linear discriminant functions (CLDF), discarding dimensions that are believed to mostly capture variability due to mismatched distances while retaining those believed to mostly capture speaker-specific information.
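As an illustration, the first strategy amounts to a simple additive correction in the log-likelihood-ratio domain. The sketch below assumes the offset is estimated as a mean difference over development comparisons scored under both matched and mismatched conditions; the function names, and the use of a plain mean difference as the estimator, are our assumptions for illustration rather than the paper's exact procedure.

```python
import numpy as np

def estimate_offset(llr_matched, llr_mismatched):
    """Estimate the bias introduced by the near-far mismatch as the mean
    difference between log likelihood ratios computed for the same
    development comparisons under matched and mismatched conditions."""
    return np.mean(np.asarray(llr_mismatched) - np.asarray(llr_matched))

def compensate(llr, offset):
    """Remove the estimated mismatch bias from a log likelihood ratio."""
    return llr - offset
```

A casework log LR computed under the mismatched condition would then be shifted by the offset estimated on the development speakers before being reported.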

We then select the most promising of these methods and recalculate the likelihood ratio for the recording of the first pair of speakers.

Copies of the data and the Matlab (MathWorks Inc., 2013) scripts used to perform the calculations in this paper are available from http://ewaldenzinger.entn.at/nearfar/.


Database

Recordings of pairs of male speakers were taken from a database of Australian English voice recordings designed and collected for the purpose of conducting forensic research and casework (Morrison et al., 2015). See Morrison et al. (2012) for details of the data collection protocol. The recordings used were of telephone conversations between pairs of speakers. Each speaker sat in a separate sound booth (IAC 250 Series Mini Sound Shelter) and talked to the other speaker over a telephone.

Amplitude normalization

All else being equal, the signal from a speaker who is further from the microphone will be of lower amplitude. Since the extracted features would be based on signal amplitude, we normalized the amplitudes of the recordings made in the near and far conditions. The root-mean-square (RMS) amplitude of the signal in each section of each recording was set to the same level across all sections. Care was taken so that the normalization level did not incur clipping on any of the recording sections.

Feature extraction

Mel…

Testing validity and reliability

Given the specific set of hypotheses and conditions to be considered, we tested the validity and reliability of our forensic-voice-comparison analysis on the basis of relevant data, prior to applying it to the questioned utterance.

Bias due to near-far mismatch

Even though the amplitude was normalized and the zeroth cepstral coefficient discarded, there may still have been differences between the training data of speakers A and B due to near-far mismatch. If so, our system may respond primarily to near-far differences rather than to speaker differences, and we would expect a bias towards likelihood ratios favoring the defense hypothesis (speaker B) over the prosecution hypothesis (speaker A), i.e., favoring…

Compensation for near-far mismatch

Given the results in the previous section, we would like to have a procedure to compensate for the mismatch in distance to the microphone for the training data of speaker A. In general, amplitude of acoustic radiation depends on properties including the area of the radiator, distance from the radiator, and frequency (Kinsler et al., 2000, ch. 7). Vermeulen (2009) investigated the impact of distance to microphone on the normalized average magnitude spectra of synthetic vowels. Distances between…
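The feature-mapping idea (method 2 in the Introduction) can be sketched as a per-dimension affine map that shifts and scales far-condition feature vectors so their mean and standard deviation match those observed in the near condition. The affine form is our assumption for illustration; the mapping actually fitted in the paper may differ.

```python
import numpy as np

def map_far_to_near(far_feats, near_feats):
    """far_feats, near_feats: arrays of shape (n_frames, n_dims).

    Returns far_feats transformed so that each feature dimension has the
    mean and standard deviation of the corresponding near-condition
    dimension."""
    mu_far, sd_far = far_feats.mean(axis=0), far_feats.std(axis=0)
    mu_near, sd_near = near_feats.mean(axis=0), near_feats.std(axis=0)
    # Standardize against the far statistics, then rescale and shift to
    # the near statistics.
    return (far_feats - mu_far) / sd_far * sd_near + mu_near
```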

Evaluation of the likelihood ratio for the questioned utterance

Mimicking how we would apply the compensation method in a real case for which we only had the 90% far and 10% near training data for speaker A, we take the results reported in Sections 3 (Forensic-voice-comparison system) and 4 (Testing validity and reliability) and apply the feature-mapping mismatch compensation (method 2) from Section 6 to the far-condition training data. Using these mapped feature vectors and the remaining 10% of the training data already in the near condition we test the validity…

Conclusion

We have demonstrated a methodology for performing a forensic voice comparison under conditions reflecting those of an actual case. The methodology used relevant data, quantitative measurements, and statistical models to calculate likelihood ratios. There was concern about a potential bias towards one of the hypotheses due to the mismatch in distance to the telephone in the training data from one speaker. Presence of a bias was confirmed via experiments conducted using pairs of speakers in a…

Acknowledgments

This research was supported by the Australian Research Council, Australian Federal Police, New South Wales Police, Queensland Police, National Institute of Forensic Science, Australasian Speech Science and Technology Association, and the Guardia Civil through Linkage Project LP100200142.

NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.

The opinions expressed are those of the authors…

References (34)

  • Dehak, N., et al., 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio, Speech Lang. Proc.

  • Ellis, D.P.W., 2005. PLP and RASTA (and MFCC, and inversion) in Matlab. …

  • Evett, I.W., et al., 2011. Expressing evaluative opinions: a position statement. Sci. Justice.

  • Furui, S., 1981. Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoustics, Speech Signal Proc.

  • Jin, Q., Li, R., Yang, Q., Laskowski, K., Schultz, T., 2010. Speaker identification with distant microphone speech. In: …

  • Jin, Q., et al., 2007. Far-field speaker recognition. IEEE Trans. Audio, Speech Lang. Proc.

  • Kinsler, L., et al., 2000. Fundamentals of Acoustics, fourth ed.
A preliminary version of portions of this paper was published as Enzinger, E., “Mismatched distances from speakers to telephone in a forensic-voice-comparison case,” Proceedings of the 21st International Congress on Acoustics (ICA), June 2–7, Montréal, Canada (POMA Volume 19, pp. 060039). https://doi.org/10.1121/1.4805425.
