Forensic Speaker Comparison Using Evidence Interval in Full Bayesian Significance Test

Graduate Program in Electrical Engineering, Universidade Federal de Minas Gerais, Av. Antônio Carlos 6627, 31270-901, Belo Horizonte, MG, Brazil
Institute of Criminalistics of Minas Gerais, Av. Augusto de Lima 1833, 30110-017, Belo Horizonte, MG, Brazil
Centro Universitário Newton Paiva, Rua José Cláudio Resende 420, 30494-230, Belo Horizonte, MG, Brazil
Department of Electronic Engineering, Universidade Federal de Minas Gerais, Av. Antônio Carlos 6627, 31270-901, Belo Horizonte, MG, Brazil


Introduction
The main task in forensic speaker comparison (FSC) is to analyze two or more voice records to infer whether they come from the same speaker. FSC differs from biometric voice recognition in the hypothesis test approach and in the nature of the voice samples. In the FSC scenario, a questioned-voice is compared to a known-voice, whereas in biometric recognition, the comparison is made among multiple speakers [1,2]. The questioned-voice (or voice evidence) is an audio recording accepted as a vestige or evidence in a criminal investigation.
The questioned-voice may be recorded in different situations, such as lawful phone interception (wiretapping), recordings of face-to-face conversation, or audio broadcasting.
In FSC, the hypothesis $H_0$ considers that the questioned- and known-voices come from different speakers, whereas $H_1$ assumes that they come from the same speaker. However, the "individualization" that these hypotheses propose has been considered a fallacy. This individualization assumes that the result of the confrontation between the questioned-voice and the known-voice is unique, without an a priori probability and without repeating the test for the entire population [3,4]. Saks and Koehler [3] discuss which hypotheses would be the most reasonable instead.
Punctual inference in FSC is based on a score of (dis)similarity [5-7]. Interval inference is a tradeoff between precision and confidence: it sacrifices some precision of the estimate by moving from a point to a range but results in greater confidence that the statement is correct (that is, that the parameter lies inside the interval) [8] (p. 418).
Reports on interval inference in automatic speaker recognition (ASR) began with Bisani and Ney [9], who used the bootstrap [10] to compute confidence intervals. Subsequently, Campbell et al. [11] computed confidence intervals using a multilayer perceptron (MLP) based on statistical entropy. Later, Koval and Lokhanova [12] used a sigmoid function to approximate the a posteriori probability $P(H_0 \mid \vec{x})$, where $\vec{x}$ is the voice data and $H_0$ is the null hypothesis, using Platt scaling [13], and estimated credibility intervals. The credibility interval can also be computed by empirical methods (Morrison et al. [14]). The present work proposes the application of the full Bayesian significance test (FBST) to compute evidence intervals in FSC.
This proposal aims to obtain the same confidence of capturing the parameter of interest in FSC and to reduce type I errors, reinforcing the legal aphorism Absolvere nocentem satius est, quam condemnare innocentem (it is better to acquit a guilty person than to condemn an innocent one). One of the motivations of this work is to establish a confidence limit for automatic speaker comparison techniques, primarily when they are used as a support to quantify an FSC [15].
No applications of the FBST to FSC were found during the bibliographic survey carried out for this research. Thus, the main contribution of this work is that it proposes an application of the FBST to FSC and develops a method to calculate the FBST for the distribution of the expected value (mean) with unknown variance without using Markov chain Monte Carlo (MCMC).
The results indicate that the application of the FBST to FSC can improve the evaluation of results by the LR framework, reducing the occurrence of type I errors.
The FBST also supports decisions on multispeaker comparisons. The paper is organized as follows. Section 2 presents the FBST and our proposed improvements and proposes adaptations for FSC. The GMM-UBM method was chosen because it presented more satisfactory results in previous experiments than the i-vector- and x-vector-based methods with deep neural networks (DNN). These experiments were performed with the Portuguese database cited in this article and with voices provided by the forensic sector of the Civil Police of Minas Gerais (Brazil); their results are in the process of being published. Section 3 compares the evidence interval to other methods. Section 4 presents the conclusion and future research directions.

Evidence FSC Interval with the FBST
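In the GMM-UBM methodology, the comparison score is the likelihood ratio

$$\mathrm{LR}(\vec{x}_Q) = \frac{p(\vec{x}_Q \mid \lambda_K)}{p(\vec{x}_Q \mid \lambda_{UBM})}, \tag{1}$$

which, taken in the log domain for the frame $x_Q[t]$, gives

$$\mathrm{LLR}(x_Q[t]) = \log p(x_Q[t] \mid \lambda_K) - \log p(x_Q[t] \mid \lambda_{UBM}), \tag{2}$$

where $p(\cdot \mid \lambda_K)$ and $p(\cdot \mid \lambda_{UBM})$ are, respectively, the evaluation of the questioned-voice data $\vec{x}_Q$ under the GMM of the known-voice, $\lambda_K$, and under the UBM, $\lambda_{UBM}$. The GMM-UBM is a methodology applied to voice comparison [7,16,17]. In the first studies [5,18], the GMM-UBM methodology was applied using Mel-frequency cepstral coefficients (MFCC). The first step in the GMM-UBM procedure is to compute the GMM of the known-voice, $\lambda_K$, and the UBM, $\lambda_{UBM}$, which can be done using the expectation-maximization (EM) algorithm [5]. In the second step, the score of the comparison, $\mathrm{LR}(\vec{x}_Q)$, is obtained as a ratio between two likelihoods: that of the questioned-voice ($\vec{x}_Q$) under the known-voice model ($\lambda_K$) and that of the questioned-voice under the UBM ($\lambda_{UBM}$).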
The score proposed by Reynolds et al. [5] is the sample mean of the log-likelihood ratio (LLR) over $T$ speech frames:

$$\mathrm{LLR}(\vec{x}_Q) = \frac{1}{T}\sum_{t=0}^{T-1}\left[\log p(x_Q[t] \mid \lambda_K) - \log p(x_Q[t] \mid \lambda_{UBM})\right]. \tag{3}$$

Because the features $\{x_Q[0], \ldots, x_Q[T-1]\}$ are not independent and identically distributed (i.i.d.), the resulting values are not, technically, a likelihood ratio. Normalization by the number of frames, $T$, also removes the duration effects from the log-likelihood value. However, the $\mathrm{LLR}(\vec{x}_Q)$ of equation (3) allows us to include an interval-based inference.
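The frame-averaged score described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`gmm_logpdf`, `mean_llr`), the diagonal-covariance models, and the one-dimensional toy data are hypothetical and are not the models or features used in the experiments.

```python
import math

def gmm_logpdf(x, weights, means, variances):
    """Log-density of one feature vector x under a diagonal-covariance GMM."""
    terms = []
    for w, mu, var in zip(weights, means, variances):
        log_n = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                    for xi, m, v in zip(x, mu, var))
        terms.append(math.log(w) + log_n)
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def mean_llr(frames, known_gmm, ubm):
    """Sample mean over T frames of log p(x|known) - log p(x|UBM), in nepers."""
    llr = [gmm_logpdf(x, *known_gmm) - gmm_logpdf(x, *ubm) for x in frames]
    return sum(llr) / len(llr), llr

# Toy models given as (weights, means, variances); features are one-dimensional.
known = ([0.5, 0.5], [[0.0], [2.0]], [[1.0], [1.0]])
ubm = ([1.0], [[1.0]], [[4.0]])
frames = [[0.1], [1.9], [0.3], [2.2]]

score, per_frame = mean_llr(frames, known, ubm)
print(f"mean LLR = {score:.3f} Np over {len(per_frame)} frames")
```

The per-frame list is returned alongside the mean because the interval methods discussed later operate on the whole series of frame-level values, not only on their average.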
Interval inference can be computed empirically or analytically over the sample space. Widespread empirical approaches include the bootstrap [10], the jackknife [19], and the method proposed by Morrison et al. [14]. One possible analytical method uses the Student's t distribution of Gosset [20,21]:

$$\mathrm{LLR}(\vec{x}_Q) - t_{(\alpha/2,\,T-1)}\frac{\sigma}{\sqrt{T}} \le \mu \le \mathrm{LLR}(\vec{x}_Q) + t_{(\alpha/2,\,T-1)}\frac{\sigma}{\sqrt{T}}, \tag{4}$$

where $\sigma$ is the sample standard deviation, $\mu$ is the expected value of $\mathrm{LLR}(\vec{x}_Q)$, and $t_{(\alpha/2,\,T-1)}$ is the critical value of the Student's t distribution with significance $\alpha$ and $T-1$ degrees of freedom.
In Section 3, we compare our evidence interval, computed using the FBST, with Morrison's credibility interval and with the analytical method of equation (4).
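As a hedged illustration of the analytical interval of equation (4), the sketch below computes a two-sided Student's t interval for the mean of a series of frame-level LLR values using only the Python standard library. The helper names (`t_pdf`, `t_cdf`, `t_quantile`, `t_interval`) and the toy data are assumptions made for this example; the quantile is obtained by numerical integration and bisection rather than from a statistics package. The interval treats the frames as if they were i.i.d., which, as noted above, does not strictly hold for speech features.

```python
import math
from statistics import mean, stdev

def t_pdf(x, dof):
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi))
    return math.exp(log_c - (dof + 1) / 2 * math.log1p(x * x / dof))

def t_cdf(x, dof, steps=2000):
    """CDF by Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, dof) + t_pdf(a, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    half = total * h / 3
    return 0.5 + half if x >= 0 else 0.5 - half

def t_quantile(p, dof, upper=50.0):
    """Quantile for 0.5 < p < 1 by bisection; assumes the answer is below `upper`."""
    lo, hi = 0.0, upper
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if t_cdf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def t_interval(samples, alpha=0.05):
    """Two-sided (1 - alpha) interval for the mean: mean +/- t * s / sqrt(T)."""
    T = len(samples)
    half = t_quantile(1 - alpha / 2, T - 1) * stdev(samples) / math.sqrt(T)
    return mean(samples) - half, mean(samples) + half

llr = [0.2, -0.5, 1.1, 0.7, -0.1, 0.4, 0.9, -0.3]  # toy frame-level LLRs, not real data
low, high = t_interval(llr)
print(f"95% t interval for the mean LLR: [{low:.3f}, {high:.3f}] Np")
```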
Morrison's approach [14,22] uses two voice samples per speaker and measures the LLR from the vowel formants. In these works, the credibility intervals were computed from raw data rather than from a statistic such as the mean. We propose a small modification to Morrison's approach such that the computation is based on the sample mean instead of the raw data.

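Full Bayesian Significance Test.
The FBST can be used to compute evidence against a precise hypothesis $\mathrm{LLR}(\vec{x}_Q) = \eta$, where $\eta$ is a value in the parametric space of the LLR of equation (2). The FBST [23,24] is a coherent Bayesian significance test for sharp hypotheses. The test is based on an evidence value, whose original definition was motivated by practical, juridical, and epistemological requirements. Consider the parametric space $\Theta \subseteq \mathbb{R}^n$, a parameter $\theta \in \Theta$, and a precise (null) hypothesis $H_0$ stating that the parameter lies in the null set, defined by the inequality ($g(\theta)$) and equality ($h(\theta)$) constraints given by the vector functions $g$ and $h$ in the parameter space:

$$H_0: \theta \in \Theta_0, \qquad \Theta_0 = \{\theta \in \Theta \mid g(\theta) \le 0 \wedge h(\theta) = 0\}. \tag{5}$$

For the experimental data $\vec{x}$, the a posteriori density is proportional to the product of the likelihood and the a priori density [25]:

$$f_n(\theta \mid \vec{x}) \propto L(\theta \mid \vec{x})\, f(\theta), \tag{6}$$

where $f(\theta)$ is the a priori density and $L(\theta \mid \vec{x})$ is the likelihood. The points of the parameter space with the highest "surprise" in the null set $H_0$ are

$$\theta^* = \arg\max_{\theta \in \Theta_0} f_n(\theta \mid \vec{x}), \tag{7}$$

while the highest relative surprise set (HRSS), $T^*$, is

$$T^* = \{\theta \in \Theta \mid f_n(\theta \mid \vec{x}) > f_n(\theta^* \mid \vec{x})\}. \tag{8}$$

The Bayesian evidence value against $H_0$ is the a posteriori probability of the "tangent" set; that is,

$$\mathrm{ev} = \Pr(\theta \in T^* \mid \vec{x}) = \int_{T^*} f_n(\theta \mid \vec{x})\, d\theta, \tag{9}$$

where $\Pr(\theta \in T^* \mid \vec{x})$ is the probability that the parameter $\theta$ is inside $T^*$. The e-value associated with the FBST is

$$\text{e-value} = 1 - \mathrm{ev}. \tag{10}$$

The e-value is a probability in the parameter space ($\mu$ and $\rho$), whereas the p value is a probability in the sample space [26]. In the Proposed Method subsection, we use the e-value and ev (the Bayesian evidence value against $H_0$) to compute the evidence interval in FSC using the hypothesis $H: \mathrm{LLR}(\vec{x}_Q) = \eta$.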
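To make the tangent-set idea concrete, the following sketch evaluates the FBST in the simplest possible setting: a single parameter whose posterior is assumed to be normal. This setting is an assumption chosen for illustration only (it is not the two-parameter case treated in the paper), and the function name `fbst_normal` is hypothetical. Because a normal posterior is unimodal and symmetric, the tangent set for a sharp hypothesis θ = η is the open interval centered at the posterior mean with half-width |η - mean|, so its posterior mass has a closed form in terms of the error function.

```python
import math

def fbst_normal(eta, post_mean, post_sd):
    """Return (ev, e_value) for H0: theta = eta under a N(post_mean, post_sd^2) posterior."""
    # T* = {theta : f(theta) > f(eta)} is the interval around post_mean
    # with half-width |eta - post_mean|.
    z = abs(eta - post_mean) / post_sd
    ev = math.erf(z / math.sqrt(2))  # posterior mass of T* (evidence against H0)
    return ev, 1.0 - ev

for eta in (0.0, 1.0, 2.0, 3.0):
    ev, e_value = fbst_normal(eta, post_mean=0.0, post_sd=1.0)
    print(f"eta = {eta:.1f}: ev = {ev:.4f}, e-value = {e_value:.4f}")
```

The further the hypothesized value lies from the posterior mode, the larger the tangent set becomes and the smaller the e-value, which is the behavior the evidence interval relies on.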

Improvement of the FBST over the Mean with an Unknown Variance.
This section describes a method to compute the FBST for a distribution of the mean (expected value) with an unknown variance. To lower the computational cost, we focus on a mostly analytical development.
This is important in order to limit the computation time of the e-value over the η-space.
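Consider a normally distributed sample $x$ with $n$ i.i.d. observations, $X \sim \mathcal{N}(\mu, 1/\rho)$, where $\mu$ is the expected value and $\rho = 1/\sigma^2$ is the precision. The minimal sufficient statistics can be taken as the sample mean $\bar{x}$ and the total sum of squares $Q$:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad Q = \sum_{i=1}^{n} (x_i - \bar{x})^2. \tag{11}$$

Taking the a priori noninformative distribution $p(\mu, \rho) = d\mu\, d\rho/\rho$ [27], the a posteriori probability density function (PDF) is [26]

$$P_n(\mu, \rho \mid n, \bar{x}, Q) = c\, \rho^{(n-2)/2} \exp\left(-\frac{\rho}{2}\left[Q + n(\mu - \bar{x})^2\right]\right), \tag{12}$$

where $c$ is calculated such that the integral of equation (12) is 1. Henceforth, we write the PDF $P_n(\mu, \rho \mid n, \bar{x}, Q)$ as $P_n(\mu, \rho)$. Setting the gradient, given by the partial derivatives of $P_n(\mu, \rho)$, to zero,

$$\frac{\partial P_n(\mu, \rho)}{\partial \mu} = 0, \qquad \frac{\partial P_n(\mu, \rho)}{\partial \rho} = 0, \tag{13}$$

leads to the maximum $P_n(\mu^*, \rho^*)$:

$$\mu^* = \bar{x}, \qquad \rho^* = \frac{n-2}{Q}. \tag{14}$$

Figure 1 shows an example of the FBST evaluation over $H_0: \mu = 0$. The bell-shaped surface is $P_n(\mu, \rho)$, and the solid black line is the restriction of the null hypothesis ($\mu = 0$). The maximum value of the black line delimits the "tangent" set $T^*$, represented as a dash-dot line. The dotted line is the restriction $P_n(\mu, \rho = \rho^*)$.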
The evidence against the null hypothesis ($H_0: \mu = \eta = 0$) is evaluated by equation (9). The main works on the FBST over the distribution of a mean with an unknown variance [26,28,29] use MCMC to solve the integral of $f_n$ in equation (9). However, specifically for equation (12), it can be shown that the "tangent" set $T^*$ has extreme points $\rho_A$, $\rho_B$, $\rho_C$, and $\rho_D$ (as in Figure 2). Setting $P_n(\bar{x}, \rho) = P_n(\eta, \rho_A)$ in equation (12) results in equation (16); grouping variables and taking the natural logarithm of both sides yields an equation whose roots are expressed through $W_n(\cdot)$, the Lambert-W function [30]. By the symmetry of $T^*$ over the μ-axis, we can compute the evidence ev by equation (19), in which $\mu(\rho)$ is the contour function (from either boundary) on the μ-axis of the "tangent" set $T^*$ (Figure 2). The contour of $T^*$ is defined by equation (20) in terms of the level $P^*$. The roots of equation (20) in $\mu$ define the left and right sides of the contour (see Figure 2). Note that $\mu(\rho)$ is a contour for values greater and less than $\bar{x}$. By symmetry, equation (19) can be computed as equation (22), which involves the error function $\mathrm{erf}(\cdot)$. The argument of this function can be simplified to an expression in $\rho_A$, the inferior limit of $\rho$, and $\eta$, the value tested in the hypothesis $H_0: \mu = \eta$. Thus, we can rewrite equation (19) as the one-dimensional integral of equation (24). This integral does not need MCMC techniques, thus demanding less computational effort than equation (9) does.
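The reduction to a one-dimensional integral can be illustrated numerically. The sketch below is not the closed form of equation (24): it assumes the posterior implied by the normal likelihood and the prior dμ dρ/ρ, namely a density proportional to ρ^((n-2)/2) exp(-(ρ/2)[Q + n(μ - x̄)²]) with Q the sum of squares about the sample mean, locates the limits of the tangent set in ρ by bisection instead of the Lambert-W function, and integrates over ρ with Simpson's rule. The inner integral over μ at fixed ρ is an error function, and the marginal posterior of ρ is a Gamma density with shape (n-1)/2 and rate Q/2. The function name `fbst_mean_unknown_var` and the toy data are hypothetical.

```python
import math

def fbst_mean_unknown_var(samples, eta, steps=4000):
    """Return (ev, e_value) for H0: mu = eta, normal data, prior dmu drho / rho.

    Assumes len(samples) >= 3, non-constant samples, and an even `steps`.
    """
    n = len(samples)
    xbar = sum(samples) / n
    q = sum((x - xbar) ** 2 for x in samples)  # sum of squares about the mean
    a = (n - 2) / 2.0                          # exponent of rho in the posterior
    d2 = (eta - xbar) ** 2
    if d2 == 0.0:
        return 0.0, 1.0                        # H0 passes through the posterior mode

    def h(rho):                                # log-posterior on mu = xbar, up to log c
        return a * math.log(rho) - 0.5 * rho * q

    rho_star = 2.0 * a / q                     # unconstrained maximiser in rho
    rho_eta = 2.0 * a / (q + n * d2)           # maximiser restricted to mu = eta
    level = a * math.log(rho_eta) - a          # log-height of the contour bounding T*

    def crossing(outside):                     # bisection between rho_star and a point below level
        inside = rho_star
        for _ in range(200):
            mid = 0.5 * (inside + outside)
            if h(mid) > level:
                inside = mid
            else:
                outside = mid
        return 0.5 * (inside + outside)

    lo = rho_star
    while h(lo) > level:
        lo *= 0.5
    hi = rho_star
    while h(hi) > level:
        hi *= 2.0
    rho_lo, rho_hi = crossing(lo), crossing(hi)

    k, rate = (n - 1) / 2.0, q / 2.0           # marginal posterior of rho: Gamma(k, rate)
    log_norm = k * math.log(rate) - math.lgamma(k)

    def integrand(rho):
        gamma_pdf = math.exp(log_norm + (k - 1.0) * math.log(rho) - rate * rho)
        inner = math.erf(math.sqrt(max(h(rho) - level, 0.0)))  # mass of T* in mu at this rho
        return gamma_pdf * inner

    width = (rho_hi - rho_lo) / steps
    total = integrand(rho_lo) + integrand(rho_hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(rho_lo + i * width)
    ev = min(1.0, max(0.0, total * width / 3.0))  # Simpson's rule, clipped to [0, 1]
    return ev, 1.0 - ev

llr = [0.2, -0.5, 1.1, 0.7, -0.1, 0.4, 0.9, -0.3]  # toy frame-level values, not real data
for eta in (0.5, 1.0, 2.0):
    ev, e_value = fbst_mean_unknown_var(llr, eta)
    print(f"eta = {eta:.1f}: ev = {ev:.4f}, e-value = {e_value:.4f}")
```

Each evaluation costs a few thousand function calls and needs no random sampling, which is what makes sweeping η over a grid affordable.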

Proposed Method.
This section proposes a method to compute the evidence interval with a Bayesian evidence level α, which can be computed using equation (24). The result in the GMM-UBM scenario is the sample mean $\mathrm{LLR}(\vec{x}_Q)$ of the time series $\mathrm{LLR}(x_Q[t])$, as equation (3) shows, on the parametric space η.
Consider the time series $\mathrm{LLR}(x_Q[t])$ with a parametric mean (expected value) of $\mu$, precision $\rho$, and sample mean $\mathrm{LLR}(\vec{x}_Q)$. From this, it is possible to define the evidence interval of $\mu$ as the subspace $\eta_L \le \mu \le \eta_H$, where $\eta_L$ and $\eta_H$ are values below and above $\mathrm{LLR}(\vec{x}_Q)$, respectively. The Bayesian evidence ev against each of the precise hypotheses $H: \mu = \eta_L$ and $H: \mu = \eta_H$ is $1 - \alpha$ (see equation (24)).
Outside this range of the LLR, η L ≤ LLR( x → Q ) ≤ η H , the evidence (e-value computed by the FBST) that the parametric mean (μ) is higher than η H or lower than η L is less than α.
We are aware that the definition above does not fit the traditional confidence (or credibility) interval as defined in [31]. However, it is an analytical method based on the parameter space, and it represents the limits within which the sample provides Bayesian evidence ("significance") of $1 - \alpha$.
For example, consider that the comparison between a questioned-voice and a known-voice generates a time series $\mathrm{LLR}(x_Q[t])$ computed with equation (2). Figure 3 shows the statistical distribution of these LLR values as a normalized histogram (Norm. Hist.) in the left panel. In this panel, the solid light gray line is the empirical PDF (emp. PDF), and the small circle over this curve indicates the sample mean ($\mathrm{LLR}(\vec{x}_Q)$). The dash-dotted rectangle on the left graph is the region shown on the right graph. The sample mean of the $\mathrm{LLR}(x_Q[t])$ series is $\mathrm{LLR}(\vec{x}_Q) \approx -0.8$ Np (nepers; the neper is the natural-logarithm unit for ratios, named after John Napier). Evaluating the hypothesis $H: \mathrm{LLR}(\vec{x}_Q) = \eta$ along the variable η in the LLR space with the FBST (equation (24)) yields the e-value curve (ev-curve, solid dark gray) shown in the right graph of Figure 3. This curve is computed by sampling the η space and solving equation (24) for each sample. On this graph, the horizontal dash-dotted line (ev = 0.05) indicates the Bayesian evidence (significance) α = 0.05 (evidence value against the hypothesis ev = 95%, or e-value = 0.05). The horizontal solid black error bar (ev > 0.05) indicates the evidence interval and the sample mean.
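The search for the interval limits can be sketched as a root-finding problem on the e-value curve. The routine below accepts any e-value function of η and finds, on each side of the sample mean, the point where the curve crosses α. Its name (`evidence_interval`), its parameters, and the stand-in e-value function are assumptions for illustration: the stand-in uses a normal posterior centered at -0.8 Np with a hypothetical standard deviation of 0.1, not equation (24).

```python
import math

def evidence_interval(e_value, center, alpha=0.05, step=1.0, tol=1e-9):
    """Find eta_L <= center <= eta_H where e_value(eta) crosses alpha.

    Assumes e_value(center) > alpha and that e_value decreases monotonically
    as eta moves away from center in either direction.
    """
    def edge(direction):
        inside, outside = center, center + direction * step
        while e_value(outside) > alpha:       # expand until the bracket straddles alpha
            inside, outside = outside, center + 2 * (outside - center)
        while abs(outside - inside) > tol:    # bisection on the crossing point
            mid = 0.5 * (inside + outside)
            if e_value(mid) > alpha:
                inside = mid
            else:
                outside = mid
        return 0.5 * (inside + outside)
    return edge(-1.0), edge(+1.0)

def toy_e_value(eta, m=-0.8, s=0.1):
    """Stand-in e-value curve from a normal posterior (illustrative only)."""
    return 1.0 - math.erf(abs(eta - m) / (s * math.sqrt(2)))

eta_l, eta_h = evidence_interval(toy_e_value, center=-0.8, alpha=0.05)
print(f"evidence interval: [{eta_l:.3f}, {eta_h:.3f}] Np")
```

Bisection needs only a few dozen e-value evaluations per limit, which is an alternative to sampling the whole η space when only the interval end points are required.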

Comparison with Other Methods
This section presents an experiment and a case study involving the evidence interval. We conducted the training and testing stages with the CEFALA-1 voice data set [32], which contains 104 speakers (55 men and 49 women) recorded with five microphones (generating 520 recordings). The validation step used 50 recordings that do not belong to the CEFALA-1 corpus. This validation emulates an open-set database in speaker comparison.
We designed an experiment to compare the proposed interval inference method with other methods used in FSC. The experiment used 104 voices, narrowband filtered (4th-order Butterworth) in the 300-3500 Hz range and resampled to 8 kHz, compatible with the Brazilian mobile phone system.
In order to compare the various interval inference methods, we need to use the speech database to define the known-voice and questioned-voice sets. We do this as follows. For each subject, 50% of the voice content was used as known-voice and 50% as questioned-voice, both in the CEFALA-1 corpus and in the validation recordings.
In order to emulate forensic conditions, both the known-voice and questioned-voice data are subject to three types of degradation. First, the data are contaminated with pink noise at the following SNR levels: 25 dB, 23 dB, 20 dB, 17 dB, 15 dB, and 12 dB. Next, the data are encoded and then decoded by a GSM 06.60 codec [33]. Finally, the data are run through a narrowband filter (300-3500 Hz). The features were extracted with MFCC ($c[n]$) using 13 critical bands (filters), a frame length of 25 ms, and a frame step of 10 ms. The features include delta ($\Delta c[n]$) and delta-delta ($\Delta^2 c[n]$) coefficients. We used Sohn's [34] method for voice activity detection (VAD) to identify the voiced frames. The methods used to compute interval inference (significance α = 0.05) were as follows:
Gosset: confidence interval computed by equation (4).
Morrison: empirical credibility interval computed by combining k-nearest neighbors (KNN) with linear regression, as described by Morrison [14].
FBST: the proposed method, which computes the evidence interval as a subspace of the parametric space bounded by the points where the e-value is α.
The method proposed by Morrison et al. [14] computes the credibility interval over the data themselves, not over the mean (expected value) of the data. Morrison's method was therefore adapted to compute the mean of 50 subsamples drawn with replacement (similar to the bootstrap [10]).
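The first degradation step, mixing noise at a prescribed SNR, reduces to scaling the noise so that the power ratio matches the target. The sketch below shows that scaling with the standard library only. It is an illustration under stated assumptions: the function names are hypothetical, a sine tone stands in for speech, and white Gaussian noise stands in for the pink noise used in the experiment (generating pink noise requires an additional spectral-shaping filter that is not shown).

```python
import math
import random

def power(signal):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so that 10*log10(P_signal / P_noise) equals snr_db, then add it."""
    gain = math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))
    scaled = [gain * v for v in noise]
    return [s + v for s, v in zip(signal, scaled)], scaled

rng = random.Random(0)
speech = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(8000)]  # 1 s stand-in tone at 8 kHz
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]                      # white noise, not pink

for snr in (25, 23, 20, 17, 15, 12):
    noisy, scaled = add_noise_at_snr(speech, noise, snr)
    measured = 10 * math.log10(power(speech) / power(scaled))
    print(f"target {snr} dB, measured {measured:.2f} dB")
```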
We evaluated the performance of each interval inference method based on the results presented in Figure 4. We expected that a comparison between the GMM of a given speaker and a set of features coming from that speaker (same-speaker comparison hereafter) results in a higher LLR value than a comparison between that same GMM and a set of features coming from a different speaker (different-speaker comparison hereafter). The training and testing stage, using only samples from the CEFALA-1 corpus with contaminations between 12 and 25 dB, presented an equal error rate (EER) of 8.1% with the threshold at LLR = 0.25 Np. The results presented below cover the test and validation steps. Figure 5 shows the number of correct classifications in scenario (a). The number of correct classifications for the evidence interval (vertical light gray bar) is smaller than that of the other methods (interval and punctual). The comparisons of the best interval methods yield values of 84.0% against 84.4% for SNR 12 dB, 84.4% against 88.6% for 15 dB, and differences of less than 1% for the other SNR values. These values represent a loss of accuracy of less than 0.5% compared to interval inference. Compared to the punctual inference, the loss in accuracy is less than 1.6% for the other SNR values. The intermediate results, in which the intervals overlap, are exemplified in Figure 4 by comparisons (b). These scenarios are deemed inconclusive and represent an in dubio pro reo condition, meaning that a defendant should not be convicted when doubts remain about his or her guilt (the association between questioned- and known-voices).

Mathematical Problems in Engineering
In the punctual inference, scenario (b) does not occur, and there is no transition region.
Thus, in the interval inference, scenarios (a) and (c) are decisive, and the intermediate scenario, (b), indicates that the results have some equivalence; that is, there is a chance that the score of the comparison between different speakers will be larger (or smaller) than that of the comparison between the same speakers. Figure 6 shows the comparison results for the various interval inference methods (Gosset, Morrison's method, and FBST). The results are grouped by the SNR level. The panel indicates the percentage of inconclusive interval inferences (b), wrong interval inferences (c), and punctual error inferences (dashed vertical line).
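The three scenarios can be expressed as a simple comparison of two intervals. The sketch below assumes, following the description in the text, that scenario (a) means the same-speaker interval lies entirely above the different-speaker interval, scenario (c) means the reverse, and scenario (b) covers any overlap. The function name `classify` and the example intervals are hypothetical.

```python
def classify(same_interval, diff_interval):
    """Label a pair of (low, high) intervals as scenario 'a', 'b', or 'c'."""
    same_lo, same_hi = same_interval
    diff_lo, diff_hi = diff_interval
    if same_lo > diff_hi:
        return "a"  # same-speaker interval entirely above: correct inference
    if diff_lo > same_hi:
        return "c"  # different-speaker interval entirely above: wrong inference
    return "b"      # intervals overlap: inconclusive (in dubio pro reo)

# Hypothetical LLR intervals in nepers.
print(classify((0.4, 0.9), (-1.2, -0.6)))  # separated, same-speaker higher
print(classify((0.1, 0.6), (0.3, 0.8)))    # overlapping
print(classify((-0.9, -0.4), (0.0, 0.5)))  # separated, different-speaker higher
```

A wider interval makes overlap more likely, so a more conservative interval method moves some comparisons from the decisive scenarios into the inconclusive one.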
Compared to the punctual inference (dashed vertical line), the evidence interval computed by the FBST (horizontal light gray bar) reduces the number of wrong inferences by 1.6%, 1.1%, 0.9%, 0.7%, 0.6%, and 0.4%, respectively, for SNRs from 12 dB to 25 dB (see Figure 6). Compared to the other methods of interval inference (horizontal bars), the evidence interval (horizontal light gray bar) presents a number of incorrect inferences (c) that is less than or equal to theirs.
These results can be explained by checking the size of the intervals for each method in Figure 7. In this figure, points represent the raw data (jittered horizontally), the horizontal line shows the sample mean, and the lateral lines represent a smoothed density. Table 1 summarizes the values contained in Figures 5, 6, and 7.
On average, the length of the evidence interval (computed by the FBST) is 24% larger than the interval calculated by the Gosset method and 15% larger than the interval calculated by Morrison's method (see Table 1). The evidence intervals also present a higher dispersion than those of the other methods.
Another attempt to measure the influence of interval inference is to exclude from the confusion matrix the comparisons that result in scenario (b) of Figure 4. In this way, a fifth category, "In dubio pro reo," may be included.
Table 2 shows how the interval inference, when the "In dubio pro reo" category is included, changes the percentage of true positives, true negatives, false positives, and false negatives. The table shows the EER calibration of 8.1%. However, for open-set validation, the GMM-UBM methodology presents a false positive rate of 9.4%, which is reduced to 8.4% using the evidence interval calculated from the FBST.
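A tally with the additional inconclusive category can be sketched as follows. The function name `tally`, the decision labels, and the toy trials are assumptions made for illustration; a "same" decision is treated as the positive outcome, and the numbers do not reproduce Table 2.

```python
from collections import Counter

def tally(trials):
    """Percentages per outcome for (is_same_speaker, decision) pairs.

    `decision` must be 'same', 'different', or 'inconclusive'.
    """
    counts = Counter()
    for is_same, decision in trials:
        if decision == "inconclusive":
            counts["in dubio pro reo"] += 1
        elif decision == "same":
            counts["true positive" if is_same else "false positive"] += 1
        elif decision == "different":
            counts["false negative" if is_same else "true negative"] += 1
        else:
            raise ValueError(f"unknown decision: {decision!r}")
    total = sum(counts.values())
    return {name: 100.0 * value / total for name, value in counts.items()}

# Toy trials, not experimental data.
trials = ([(True, "same")] * 4 + [(False, "different")] * 3
          + [(False, "same"), (True, "different"), (False, "inconclusive")])
rates = tally(trials)
for name, pct in sorted(rates.items()):
    print(f"{name}: {pct:.1f}%")
```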

Conclusion and Future Work
This paper presented an improvement to the FBST calculation for the distribution of a mean with an unknown variance. This improvement obviates the need for MCMC techniques to calculate the FBST integral. Compared with other methods, the evidence interval was more conservative, incrementally reducing Type I and Type II errors in low-SNR scenarios.
Although the results do not present a significant improvement in the reduction of the false positive rate for open sets, the present work helps to understand the limits of the GMM-UBM methodology applied to FSC. The contribution of the evidence interval may seem insignificant. However, in the case of sex crimes, especially against children, understanding the limits of each tool in FSC helps the forensic expert to make more informed decisions.
Possible developments of the present work include improving the FBST for the Behrens-Fisher problem, combining the evidence interval with background database calibration, and tests with different features, such as Power Normalized Cepstral Coefficients (PNCC) and Perceptual Linear Predictive (PLP), and with noise. The application of the interval inference to speaker verification techniques, such as i-vector and x-vector, is under development and should be discussed in future work.
Data Availability
The audio files (corpus) used in the experiments can be found at http://www.cefala.org. It is the intention of the authors to make available the processed data and the algorithms as soon as the work is published. Basically, the data are acoustic features (Mel-frequency cepstrum) and Gaussian mixture models.

Conflicts of Interest
The authors declare that they have no conflicts of interest.