Computer Speech & Language

Volume 20, Issues 2–3, April–July 2006, Pages 128–158

NIST and NFI-TNO evaluations of automatic speaker recognition

https://doi.org/10.1016/j.csl.2005.07.001

Abstract

In recent years, several text-independent speaker recognition evaluation campaigns have taken place. This paper reports on results of the NIST evaluation of 2004 and the NFI-TNO forensic speaker recognition evaluation held in 2003, and reflects on the history of the evaluation campaigns. The effects of speech duration, training handsets, transmission type, and gender mix show expected behaviour on the DET curves. New results on the influence of language show an interesting dependence of the DET curves on the accent of speakers. We also report on a number of statistical analysis techniques that have recently been introduced in the speaker recognition community, as well as a new application of analysis of deviance. These techniques are used to determine that the two evaluations held in 2003, by NIST and NFI-TNO, are of statistically different difficulty for the speaker recognition systems.

Introduction

Evaluations of text-independent speaker recognition systems have been held regularly in the past decade (Przybocki and Martin, 1999, Martin and Przybocki, 2000, Doddington et al., 2000, Martin and Przybocki, 2001, Przybocki and Martin, 2002, Przybocki and Martin, 2004, Van Leeuwen and Bouten, 2004). The evaluations give system developers an opportunity to assess the quality of their system and inspire them to try out new approaches to the problem of speaker recognition. A leading role in the methodology and focus of the evaluations has been played by NIST and its sponsors. Co-operation with the Linguistic Data Consortium (LDC) has guaranteed regular new challenges with regard to the application domain, while the LDC has ensured a consistent quality of the evaluation databases.

Around 2002, two independent efforts resulted in the availability of completely new types of speech database for speaker recognition. The first database was collected through a co-operation between two Dutch parties, the Netherlands Forensic Institute (NFI) and TNO. It consists of wire-tapped telephone recordings made by the Dutch police forces during police investigations. The second database is the MIXER corpus, collected by the LDC, which implements a multi-dimensional design of controlled recordings of telephone conversations. Parameters that have proven to be important in earlier speaker recognition evaluations are systematically varied, so that the database now consists of data recorded with several microphones, in five languages, from different handsets and over several transmission lines. Both databases have been used in an evaluation: the former in what has been coined the ‘NFI-TNO forensic speaker recognition evaluation’ and the latter in the regular NIST evaluation of 2004.

The two evaluations differ on many points, such as size, language, design, and collection method. The most important difference is the type of data. On the one hand, the NFI-TNO evaluation consists of genuine field data, collected in exactly the same way as it would be used in an application for police investigations, with speech uttered by people suspected of criminal activity who in no way realized that their speech was being used for this kind of technology evaluation. The database is uncontrolled, several conditions are unbalanced, and the amount of material useful for a proper evaluation is limited. On the other hand, the NIST evaluations consist of well-controlled and well-balanced conditions, and large numbers of speakers and large amounts of speech. Every subject is keenly aware that their conversation is being recorded (although they only know it is for speech research purposes), so in a sense they can be viewed as co-operative subjects. Despite these apparently large differences, it is possible to analyze and compare the two evaluations both qualitatively and quantitatively.

Meaningful evaluations are carefully planned. By providing explicit evaluation specifications, common test sets, standard measurements of error, and a forum for participants to openly discuss algorithmic successes and failures, the NIST and NFI-TNO evaluations have provided a means for recording the progress of text-independent speaker recognition performance.

Several relevant papers were presented at Odyssey 2004, The Speaker and Language Recognition Workshop, in Toledo, Spain, including a paper on past NIST speaker recognition evaluations (Przybocki and Martin, 2004). The basic results of the NFI-TNO evaluation (Van Leeuwen and Bouten, 2004) and the design of the NIST 2004 evaluation (Przybocki and Martin, 2004) were also presented at Odyssey 2004, but in this paper we have the unique opportunity to present the results of both evaluations together in greater depth, making the advances in evaluation methodology and speaker recognition performance apparent.

The layout of this paper is as follows. First, the evaluation paradigm is recapitulated and some notes on statistical analysis are given. Then the results of the NFI-TNO 2003 and NIST 2004 evaluations are presented and various performance factors are analyzed. Finally, an attempt is made to compare the results of the NIST 2003 and NFI-TNO 2003 evaluations.

Section snippets

Evaluation paradigm

There are many similarities between the various evaluations held, despite the aforementioned differences. We will summarize the more important ingredients of the benchmark evaluations in general, showing the common ground and the specific differences.

  • Task. The speaker recognition system is evaluated in terms of a detection task. The question here is whether or not a given speech segment is uttered by a given speaker (a minimal scoring sketch for this task is given below). Several variants of this task are defined: the (basic) one-speaker
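
To make the detection task concrete, the following is a minimal sketch, in Python with synthetic placeholder scores, of how per-trial detector scores for target and non-target trials are turned into miss and false-alarm probabilities, an equal error rate (EER), and the probit-warped coordinates used for a DET plot. It only illustrates the general evaluation paradigm and is not the scoring software used in either evaluation.

    # Sketch: DET points and EER from raw detection scores (synthetic data only).
    import numpy as np
    from scipy.stats import norm

    def det_points(target_scores, nontarget_scores):
        """Return (p_miss, p_fa) as the decision threshold sweeps over all scores."""
        scores = np.concatenate([target_scores, nontarget_scores])
        labels = np.concatenate([np.ones(len(target_scores)),
                                 np.zeros(len(nontarget_scores))])
        order = np.argsort(scores)            # ascending candidate thresholds
        labels = labels[order]
        # At a threshold just above the i-th sorted score: targets up to i are misses,
        # non-targets above i are false alarms.
        p_miss = np.cumsum(labels) / len(target_scores)
        p_fa = 1.0 - np.cumsum(1.0 - labels) / len(nontarget_scores)
        return p_miss, p_fa

    def eer(p_miss, p_fa):
        """Equal error rate: the operating point where miss and false-alarm rates cross."""
        i = np.argmin(np.abs(p_miss - p_fa))
        return 0.5 * (p_miss[i] + p_fa[i])

    # Example with synthetic scores (illustrative only, not evaluation data).
    rng = np.random.default_rng(0)
    tgt = rng.normal(1.0, 1.0, 500)      # target trials tend to score higher
    non = rng.normal(-1.0, 1.0, 5000)
    p_miss, p_fa = det_points(tgt, non)
    print("EER approx. %.1f%%" % (100.0 * eer(p_miss, p_fa)))

    # DET-curve coordinates: normal-deviate (probit) warping of both error rates,
    # clipped away from 0 and 1 to avoid infinities at the extreme thresholds.
    det_x = norm.ppf(np.clip(p_fa, 1e-6, 1 - 1e-6))
    det_y = norm.ppf(np.clip(p_miss, 1e-6, 1 - 1e-6))

On the normal-deviate axes, error rates arising from normally distributed scores trace a straight line, which is why DET curves are preferred over ROC curves for reporting detection performance.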

Statistics

In order to be able to compare the performance of different systems within an evaluation, or different conditions for one system, or even different evaluations, it is necessary to perform statistical tests that assess the significance of an observed difference. In this section we will discuss the statistical techniques that are commonly used in the speaker recognition community, some of which are used in the remainder of the paper.
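One test that is frequently used for comparing two systems evaluated on the same trial list is McNemar's test, which considers only the trials on which the systems' decisions differ. Below is a minimal sketch with made-up decisions for two hypothetical systems; it illustrates the idea and is not the exact procedure applied later in this paper.

    # Sketch: exact McNemar's test on paired per-trial correctness (made-up data).
    import numpy as np
    from scipy.stats import binom

    def mcnemar_exact(correct_a, correct_b):
        """Exact two-sided McNemar p-value from per-trial correctness (boolean arrays)."""
        correct_a = np.asarray(correct_a, dtype=bool)
        correct_b = np.asarray(correct_b, dtype=bool)
        n01 = np.sum(correct_a & ~correct_b)   # A right, B wrong
        n10 = np.sum(~correct_a & correct_b)   # A wrong, B right
        n = n01 + n10
        if n == 0:
            return 1.0                         # no discordant trials: no evidence of a difference
        # Under H0 the discordant trials split 50/50; two-sided exact binomial p-value.
        return min(1.0, 2.0 * binom.cdf(min(n01, n10), n, 0.5))

    # Example with invented decisions for two hypothetical systems on 2000 trials.
    rng = np.random.default_rng(1)
    truth = rng.integers(0, 2, 2000).astype(bool)      # target / non-target labels
    sys_a = truth ^ (rng.random(2000) < 0.08)          # system A: ~8% decision errors
    sys_b = truth ^ (rng.random(2000) < 0.10)          # system B: ~10% decision errors
    print("McNemar p-value:", mcnemar_exact(sys_a == truth, sys_b == truth))

Because the test conditions on the concordant trials, it remains valid even when the two systems are evaluated on exactly the same, possibly correlated, trial list.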

Designs of the NFI-TNO and NIST 2004 evaluations

In the Odyssey articles (Przybocki and Martin, 2004, Van Leeuwen and Bouten, 2004), the design and data collection paradigms of the two evaluations have been reported in considerable detail. For completeness, we reproduce the most important issues here.

Results and analysis of the NFI-TNO and NIST evaluations

Although the basic results of the NFI-TNO evaluation have been reported in Van Leeuwen and Bouten (2004), we will extend the results with additional statistical analyses here. The results of NIST 2004 have not been published before, and we will integrate the NFI-TNO results and analysis with the NIST results where applicable.

Twelve partners submitted correct system results to the NFI-TNO evaluation, and 24 sites participated in NIST 2004. The systems are identified anonymously here by a number; there

Summary and conclusions

We have given an overview of the evaluation paradigm of the yearly text-independent speaker recognition evaluations held by NIST and of the NFI-TNO evaluation in 2003. We have presented and analyzed the results of two recent evaluations. We have introduced an analysis of deviance for studying various factors affecting the equal error rate in the NFI-TNO evaluation, and studied various performance factors affecting the DET curve in the NIST 2004 evaluation. Important factors are training segment duration
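
As an indication of what such an analysis of deviance may look like in practice, the following sketch fits nested binomial GLMs to hypothetical per-condition error counts for two factors named here 'duration' and 'handset' (the counts, factor levels and model structure are invented for illustration and do not reproduce the analysis in the paper), and tests the drop in deviance against a chi-square distribution.

    # Sketch: analysis of deviance with nested binomial GLMs (hypothetical counts).
    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    # Hypothetical error counts for four duration x handset cells of 1000 trials each.
    errors = np.array([120, 260, 60, 150])
    trials = np.array([1000, 1000, 1000, 1000])
    endog = np.column_stack([errors, trials - errors])   # (errors, correct) per cell

    # Dummy-coded factors: duration (0 = short, 1 = long), handset (0 = same, 1 = different).
    duration = np.array([0, 0, 1, 1], dtype=float)
    handset = np.array([0, 1, 0, 1], dtype=float)

    X0 = sm.add_constant(duration)                                  # duration only
    X1 = sm.add_constant(np.column_stack([duration, handset]))      # duration + handset

    m0 = sm.GLM(endog, X0, family=sm.families.Binomial()).fit()
    m1 = sm.GLM(endog, X1, family=sm.families.Binomial()).fit()

    # Analysis of deviance: under H0 (no handset effect) the deviance drop is chi-square.
    drop = m0.deviance - m1.deviance
    df = m0.df_resid - m1.df_resid
    print("deviance drop %.1f on %d df, p = %.3g" % (drop, df, chi2.sf(drop, df)))

A significant drop in deviance indicates that the added factor explains a real part of the variation in error rate, which is how the contribution of individual performance factors can be assessed.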

Acknowledgements

We want to thank Roland Auckenthaler, Claude Barras, Todor Ganchev and Doug Reynolds for supplying us with additional results, and Niko Brümmer for the many discussions involving decision theory.
