Speaker identification in courtroom contexts – Part I: Individual listeners compared to forensic voice comparison based on automatic-speaker-recognition technology

Expert testimony is only admissible in common law if it will potentially assist the trier of fact to make a decision that they would not be able to make unaided. The present paper addresses the question of whether speaker identification by an individual lay listener (such as a judge) would be more or less accurate than the output of a forensic-voice-comparison system that is based on state-of-the-art automatic-speaker-recognition technology. Listeners listen to and make probabilistic judgements on pairs of recordings reflecting the conditions of the questioned- and known-speaker recordings in an actual case. Reflecting different courtroom contexts, listeners with different language backgrounds are tested: Some are familiar with the language and accent spoken, some are familiar with the language but less familiar with the accent, and others are less familiar with the language. Also reflecting different courtroom contexts: In one condition listeners make judgements based only on listening, and in another condition listeners make judgements based on both listening to the recordings and considering the likelihood-ratio values output by the forensic-voice-comparison system. © 2022 The Author(


Research question
The present paper addresses the question of whether speaker identification 8 by a judge listening alone would be more or less accurate than the output of a forensic-voice-comparison system that is based on state-of-the-art automatic-speaker-recognition technology. This question is important because expert testimony is only admissible in common law if it will potentially assist the trier of fact to make a decision that they would not be able to make unaided. If the trier of fact's speaker identification were equally accurate or more accurate than a forensic-voice-comparison system, then testimony based on the output of that system would not be admissible.
The introduction to the present paper also considers the question of whether speaker identification by a jury listening as a collaborative group would be more or less accurate than the output of a forensic-voice-comparison system that is based on state-of-the-art automatic-speaker-recognition technology, but the present paper does not include empirical research addressing this question. The present paper describes experiments using individual listeners. A future paper will describe experiments using groups of twelve collaborating listeners.
In the remaining sections of the introduction of the present paper:
• we discuss the legal context related to forensic voice comparison conducted by experts and to speaker identification performed by triers of fact (§1.2);
• we describe prior research on speaker identification by lay listeners (§1.3);
• and we outline the empirical research we have conducted in order to address the research question (§1.4).

Legal context
Identifying a speaker by comparing voices is fairly common in modern criminal proceedings. Indeed, the identification of a speaker is often the main, and sometimes the ultimate, issue confronting the trier of fact – whether jurors or judge(s). In most common law (but also inquisitorial or civil law) systems, resolving the identity of a speaker is left to the trier of fact. Sometimes the trier of fact will be responsible for resolving the question of identity based on their own listening to speech recordings – often comparing lawfully intercepted recordings with recorded police interviews, but occasionally comparing intercepted recordings with a defendant's live speech in court. Sometimes, the trier of fact is presented with opinions about a speaker's identity by police officers or interpreters who have no specialized training in forensic voice comparison. 9 Sometimes, the trier of fact is presented with expert testimony from a forensic practitioner who does have specialized training in forensic voice comparison. 10 Even when expert testimony is presented, the trier of fact may still listen to the recordings and attempt to perform their own speaker identification. 11 The present paper is concerned with whether speaker identification by individual lay listeners (like a judge), or by groups of lay listeners acting collaboratively (such as in the deliberations of a jury or appellate court), would be more or less accurate than testimony presented by an expert witness who has conducted a forensic voice comparison using state-of-the-art automatic-speaker-recognition technology. This is an important question because expert evidence is only admissible if it is capable of rationally assisting the trier of fact. Unless a forensic-voice-comparison system is more accurate than the trier of fact, testimony based on that system should not be admitted (and in some jurisdictions it is considered irrelevant).
In recent years, substantial advances have been made in automatic-speaker-recognition technology, leading to improved performance. Advances have also been made in the application of automatic-speaker-recognition technology as part of forensic-voice-comparison systems. Given these advances, some conventional legal assumptions may require revision.
Historically, courts in common-law jurisdictions have not concerned themselves with the actual performance of either speaker identification by lay listeners or forensic voice comparison by forensic practitioners. In most cases little empirical data, such as the results of tests of human listeners' abilities or the results of validations of forensic-voice-comparison systems, have been introduced to inform admissibility determinations, or cross-examination and judicial directions. Even with the gradual rise of reliability standards for expert opinion evidence following Daubert, 12 with respect to forensic voice comparison and speaker identification, most courts have promoted approaches based on what might be characterised as "common sense" and the experience of judges. We present some examples:
In Bulejcik v The Queen, 13 the High Court of Australia characterised speaker identification as routine, such that where samples of recorded speech are available, jurors should be entitled to make their own comparisons:
Recognition of a speaker by the sound of the speaker's voice is a commonplace of human experience. … A person who is not familiar with the voice of a putative speaker may be able ... to recognise the speaker's voice by comparison with an established example of that voice ... There was no reason why, subject to a satisfactory warning, the jury should not have had regard to the sound of the appellant's voice in determining whether the appellant's voice had been recorded on Exhibit D.
Expert testimony based on forensic voice comparison was not proffered in this case.
8 In the well-established terminology of the research literature on speaker identification and speaker recognition by human listeners, "speaker identification" refers to a situation where a listener who is unfamiliar with the speaker or speakers compares a voice they hear on one occasion (e.g., while a crime is being committed) with a voice that they hear on another occasion (e.g., during a voice lineup) and, based on listening, attempts to determine whether the same speaker was speaking on both occasions. "Speaker identification" also refers to a situation where a listener who is unfamiliar with the speaker or speakers listens to two (or more) voice recordings and, based on listening, attempts to determine whether the same speaker is speaking on both recordings. The latter is the focus of the present paper. "Speaker identification" also refers to a situation in which one voice is recorded (e.g., a recording of a crime being committed) and the other is live (e.g., a defendant speaking in court). "Speaker identification" contrasts with "speaker recognition", which refers to the situation where a listener hears a voice (live or recorded) and states that they recognize the voice as that of a person who is familiar to them (and usually names that person). The present paper reports on speaker-identification research, not on speaker-recognition research.
9 For a criticism of this practice, see Edmond et al. [1]. Most  Ready admission of such experience-based testimony – characterised as lay opinion and as "ad hoc expert" opinion – has made recourse to expert forensic-voice-comparison testimony relatively uncommon. In Australia, and elsewhere, liberal admissibility practice relies on cross-examination and judicial directions (and the increasingly remote possibility of rebuttal expert testimony) to identify and effectively convey problems to jurors (and appellate judges) in the context of accusatorial proceedings.
10 For reviews of the admissibility of forensic voice comparison in US jurisdictions (under both Daubert and Frye) and in UK jurisdictions (England & Wales and Northern Ireland), see Morrison & Thompson [2] and Morrison [3] respectively. Briefer reviews of admissibility in Australia and in Canada are included in Morrison & Enzinger [4].
11 For criticism of this practice, see Edmond [5].
12
In R v Flynn, 14 the England & Wales Court of Appeal stated:
The appellant submits that the judge misdirected the jury by instructing them that they should not attempt to compare the voices heard on the covert recording with the voices of the appellants which they had heard when they gave evidence in the trial. Apart from the decision in R. v Chenia [2003] 2 Cr. App. R. 6 (p.83) there is no decision which supports the direction given by the judge. On the contrary, there are passages in other authorities, ..., which suggest that the jury should be permitted to make such a comparison providing the judge directs the jury to listen to the tapes guided by the evidence of the voice recognition witnesses, whether expert or lay listeners.
In this case, practitioners of the auditory-acoustic-phonetic approach to forensic voice comparison had stated that the quality of the recordings was too poor for them to be able to conduct forensic voice comparisons. The poor quality of the recordings was a factor in the Court of Appeal ruling that speaker-recognition/speaker-identification testimony by police officers should not have been admitted at trial. Given such poor-quality recordings, it is curious that the Court of Appeal thought that it was appropriate for the jury to attempt to perform their own speaker identifications.
In United States v Arce-Lopez, 15 the defendant sought to have expert testimony based on forensic voice comparison admitted. The court found that:
the jury is "perfectly well-equipped" to listen to the witnesses testify in court, compare their voices to the voice in the audio recordings, and draw conclusions about whose voice is in the audio recordings. ... Accordingly, this is "not an area in which expert testimony would be helpful to the jury." See United States v. Salimonu, 182 F.3d 63, 74 (1st Cir.1999) ... the Court finds that this expert testimony will not "help the trier of fact to understand the evidence or to determine a fact in issue," Fed.R.Evid. 702(a)
The published ruling stated that the proffered forensic-voice-comparison testimony was based on "biometric analysis", but it is unclear from the ruling what approach to forensic voice comparison was actually used or whether any validation results were provided.
More recently, in R v Dunstan 16 (an appeal hearing in Ontario, Canada), Morrison appeared as an expert witness and presented the likelihood-ratio output of a forensic-voice-comparison system that was based on automatic-speaker-recognition technology. Morrison's report included the results of an empirical validation of the forensic-voice-comparison system under conditions reflecting those of the case under investigation. Although admissibility per se was not an issue in this hearing, during cross-examination, Morrison was asked why the judge could not simply listen to the recordings and make a decision. The cross-examining lawyer relied upon a more-than-a-decade-old study to suggest that the performance of automatic-speaker-recognition systems was not better than that of human listeners.
In the next subsection, we review prior published research comparing the performance of lay listeners with that of automatic-speaker-recognition systems, and in the remainder of the paper we report new empirical research comparing the performance of individual lay listeners with that of a forensic-voice-comparison system which is based on state-of-the-art automatic-speaker-recognition technology.

Prior research on speaker identification by lay listeners compared to automatic-speaker-recognition systems
There are a number of published studies that have directly compared speaker identification by lay listeners with the output of automatic-speaker-recognition systems. Many of these studies, however, are outdated: Over the last two decades, there have been substantial advances in automatic-speaker-recognition technology (GMM-UBM-based systems have been replaced by i-vector-based systems, which in turn have been replaced by x-vector-based systems), and each new generation of technology has resulted in substantial improvements in performance. 17 Also, the conditions of the voice recordings used in these studies have seldom reflected the sorts of relatively poor-quality recording conditions or the sorts of mismatched conditions between questioned-speaker and known-speaker recordings that are commonly encountered in forensic casework (the studies were not necessarily intended to address questions of forensic interest). Also, the studies have usually had each listener listen independently, have then applied a function (a simple function such as mean or mode, or a more complex function such as a calibration model) to the pooled responses from all the listeners, and then compared the output of that function with the output of an automatic-speaker-recognition system. This does not reflect the situation where a judge alone listens to the questioned- and known-speaker recordings, nor the situation where a group of people constituting a jury listen and collaboratively come to a decision. In addition, equal-error rate (EER) has often been used to compare listeners' pooled responses with the output of automatic-speaker-recognition systems. EER obscures potentially poorly calibrated responses: To calculate EER, the classification threshold is shifted to the point where the miss rate equals the false-alarm rate, whereas a system with a pre-determined classification threshold may be biased and produce a higher miss rate than false-alarm rate or vice versa.
Finally, the use of a categorical-decision framework in these studies is suboptimal for assessing the performance of a forensic-voice-comparison system that outputs a likelihood ratio or of a human listener who expresses degree of confidence in their speaker-identification decision – treating a likelihood ratio of 2 the same as a likelihood ratio of 2000, or treating a listener's "maybe" the same as their "very sure", ignores the fact that, in a legal-decision-making context, different likelihood-ratio values or different degrees of confidence would be expected to have different magnitudes of impact on downstream decision making (especially, for listeners, if the decision maker is the listener).
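The contrast drawn above between EER and error rates at a pre-determined threshold can be illustrated with a minimal sketch. The scores below are invented toy values (not data from any of the cited studies), the function names are hypothetical, and the threshold sweep is a simple approximation of EER rather than the procedure used in any particular study.

```python
import numpy as np

def miss_and_false_alarm(same_scores, diff_scores, threshold):
    """Miss rate and false-alarm rate when scores >= threshold are classified as same-speaker."""
    same_scores = np.asarray(same_scores, dtype=float)
    diff_scores = np.asarray(diff_scores, dtype=float)
    miss = np.mean(same_scores < threshold)          # same-speaker pairs rejected
    false_alarm = np.mean(diff_scores >= threshold)  # different-speaker pairs accepted
    return float(miss), float(false_alarm)

def equal_error_rate(same_scores, diff_scores):
    """Approximate EER: sweep the threshold and keep the point where the two rates are closest."""
    candidates = np.unique(np.concatenate([same_scores, diff_scores]))
    # Also consider a threshold above every score, i.e., "reject everything".
    candidates = np.append(candidates, candidates[-1] + 1.0)
    best = None
    for t in candidates:
        miss, fa = miss_and_false_alarm(same_scores, diff_scores, t)
        gap = abs(miss - fa)
        if best is None or gap < best[0]:
            best = (gap, (miss + fa) / 2.0, float(t))
    return best[1], best[2]

# Toy scores (assumption): a system biased toward same-speaker responses.
same = [1.0, 2.0, 3.0, 4.0]
diff = [0.5, 1.5, -1.0, -2.0]

fixed_miss, fixed_fa = miss_and_false_alarm(same, diff, threshold=0.0)
eer, eer_threshold = equal_error_rate(same, diff)
print("fixed threshold 0.0: miss =", fixed_miss, "false alarm =", fixed_fa)
print("EER =", eer, "at shifted threshold", eer_threshold)
```

In this toy case the pre-determined threshold gives no misses but a false-alarm rate of 0.5, whereas the EER of 0.25 is only reached after the threshold is moved, which is the sense in which EER can hide a biased operating point.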
Human Assisted Speaker Recognition (HASR) evaluations were run by the National Institute of Standards and Technology (NIST) in 2010 and 2012 (Greenberg et al. [8]). The HASR evaluations were not intended to reflect forensic casework conditions. Whereas automatic-speaker-recognition systems are routinely tested on tens or hundreds of thousands of test pairs, most participants in the HASR 2010 evaluation only provided responses to a set of 15 test pairs, and not to a larger set of 150 test pairs that was also available. The HASR 2010 recordings were high quality, but the different-speaker test pairs were selected to be challenging: Multiple earlier automatic-speaker-recognition systems had made errors on these pairs, and, in pilot tests, listeners judged them difficult to distinguish (see Greenberg et al. [8] for details). HASR 2012 test pairs were also selected to be challenging.
On the HASR test sets, automatic-speaker-recognition systems outperformed systems based on pooled responses from groups of lay listeners [9][10][11][12]. In Ramos et al. [10], after it was calibrated, a system based on pooled listener responses achieved a log-likelihood-ratio-cost (Cllr) value of 1, i.e., on average the human-listener system provided no useful information (see §2.7.2 below for an explanation of the Cllr metric). In Matějka et al. [13], for each trial, in addition to being able to listen to the pair of recordings, listeners were provided with the score output by an automatic-speaker-recognition system in response to that pair of recordings. The listeners were familiar with this automatic-speaker-recognition system, and they could take its output into consideration while making their judgement. For only one of the ten listeners was the classification-error rate (CER) 18 better than that of the stand-alone automatic-speaker-recognition system.
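To make concrete why a log-likelihood-ratio cost (Cllr) of 1 corresponds to a system that provides no useful information, the following minimal sketch computes Cllr using the widely used definition: half the sum of the mean of log2(1 + 1/LR) over same-speaker pairs and the mean of log2(1 + LR) over different-speaker pairs. We assume here that this is the definition explained in §2.7.2; the function name and the likelihood-ratio values are invented for illustration.

```python
import numpy as np

def cllr(same_lrs, diff_lrs):
    """Log-likelihood-ratio cost for likelihood ratios from same- and different-speaker pairs."""
    same_lrs = np.asarray(same_lrs, dtype=float)
    diff_lrs = np.asarray(diff_lrs, dtype=float)
    # Same-speaker pairs are penalized more heavily the smaller the likelihood ratio.
    same_penalty = np.mean(np.log2(1.0 + 1.0 / same_lrs))
    # Different-speaker pairs are penalized more heavily the larger the likelihood ratio.
    diff_penalty = np.mean(np.log2(1.0 + diff_lrs))
    return float(0.5 * (same_penalty + diff_penalty))

# A system that always answers LR = 1 (no information either way).
uninformative = cllr([1.0, 1.0, 1.0], [1.0, 1.0])

# Toy values (assumption): strong support in the correct direction.
informative = cllr([100.0, 1000.0], [0.01, 0.001])

# Toy values (assumption): strong support in the wrong direction.
misleading = cllr([0.01], [100.0])

print(uninformative, informative, misleading)
```

A system that always outputs a likelihood ratio of 1 scores exactly 1, values approach 0 as likelihood ratios give stronger support in the correct direction, and values exceed 1 when the likelihood ratios are misleading. Unlike a categorical error rate, the metric therefore distinguishes a likelihood ratio of 2 from one of 2000.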
In contrast to the performance of lay listeners, forensic practitioners employing auditory-acoustic-phonetic methods had the same CER as an automatic-speaker-recognition system (Schwartz et al. [14]), or a better EER than an automatic-speaker-recognition system (Saeidi & van Leeuwen [15]).
Kahn et al. [9] noted that there was high inter-listener variability for lay listeners: Individual listeners' CERs ranged from 34 % to 56 %. Miss rates ranged from 13 % to 90 %, and false-alarm rates ranged from 10 % to 97 %. Some listeners were biased toward giving same-speaker responses (resulting in fewer misses but more false alarms), and others were biased toward giving different-speaker responses (resulting in fewer false alarms but more misses). Similarly, individual lay listeners' EERs in Ramos et al. [10] ranged from 22 % to 60 %. Large inter-listener variability has often been observed in the broader research literature on speaker identification and speaker recognition by lay listeners. 19
Similar results have been obtained in studies using sets of voice recordings other than the HASR sets. Not all of the other sets were deliberately selected to be challenging. With few exceptions, automatic-speaker-recognition systems outperformed lay listeners [19][20][21][22][23][24][25]. The exceptions primarily occurred in earlier studies using older automatic-speaker-recognition technology (e.g., GMM-UBM systems as opposed to i-vector systems). Even then, in the oldest study (Schmidt-Nielsen & Crystal [19]), although the detailed results needed to make all the necessary comparisons were not presented, it appears that most individual listeners' EERs (as opposed to EERs based on pooled responses) would have been worse than the EERs of the automatic-speaker-recognition systems tested. Park et al. [26] noted that i-vector-based systems outperformed lay listeners for voice recordings of longer durations, but that lay listeners outperformed i-vector-based systems for voice recordings of shorter durations, e.g., less than 10 s. Short questioned-speaker recordings are common in forensic casework. In contrast, in van Dijk et al. [24], although listeners could listen for longer, their average listening time was ∼18 s, and the EER for their pooled responses was 27 %, but, using 20 s from each recording (close to the human listeners' average listening time), an i-vector-based system's EER was only 7 %. Using only 5 s from each recording, the i-vector-based system's EER was 23 %, i.e., even using short recordings it performed better than the pooled responses of listeners who not only listened for longer but listened for as long as they wanted.
The current state of the art in automatic speaker recognition is based on deep-neural-network embeddings (DNN embeddings) called x-vectors [6][26][27][28][29][30][31]. Compared to i-vector-based systems, newer x-vector-based systems have been found to have better performance, especially on mismatched conditions and on short voice recordings. A recently published study, Hughes et al. [32], appears to be the only study so far that has compared speaker identification by lay listeners with an x-vector-based forensic-voice-comparison system. 20 That study elicited (as a number between 0 and 100) listeners' judgements as to: typicality of the questioned-speaker recording, similarity of the questioned- and known-speaker recordings, and posterior probability for same-speaker. Several functions were applied to pooled-listener responses. Some of these functions included divisions of similarity responses by typicality responses, others used the posterior-probability responses, and all included cross-validated calibration using logistic regression. The x-vector-based system outperformed the human listeners: The best Cllr for a function applied to pooled-listener responses was 0.69, but the Cllr value for the x-vector-based system was 0.26. Individual listeners' EERs ranged from 13 % to 67 %, but the EER for the x-vector-based system was 4 %. The paper reported that there was no correlation between the listeners' EERs and their self-reported familiarity with the accent spoken by the speakers (the listeners had reported familiarity as a number between 0 and 100). The language and accent was "Standard Southern British English", and the listeners were all from the UK.

The current research
In the empirical research reported in the present paper, we conduct a series of experiments in which lay listeners are asked to make same-speaker/different-speaker judgements on pairs of recordings that reflect the conditions of an actual forensic case. The pairs of recordings are a subset of those from the forensic_eval_01 dataset [33], which has previously been used to perform benchmark validations of multiple forensic-voice-comparison systems [34][35][36][37][38][39][40]. The language and accent spoken on these recordings is Australian English.
Individual listeners provide probabilistic judgements in response to pairs of recordings consisting of one questioned-speaker-condition recording and one known-speaker-condition recording. The individual-listener experiments are intended to reflect a context where an individual judge listens and makes a judgement. 21 We compare the individual-listener responses with the likelihood-ratio values output by the E³ Forensic Speech Science System (E³FS³) in response to the same pairs of recordings. E³FS³ is a forensic-voice-comparison system that is based on state-of-the-art automatic-speaker-recognition technology [6,40,41]. 22 Reflecting different courtroom scenarios, 23 we conduct experiments with:
1. listeners who are familiar with the language and accent spoken on the recordings
2. listeners who are familiar with the language but less familiar with the accent
3. listeners who are less familiar with the language
The three different language backgrounds are:
1. Australian-English listeners
2. North-American-English listeners
3. Spanish-language listeners
In the broader research literature on speaker identification by lay listeners, listeners have been found to perform more poorly when the speakers spoke with accents that were less familiar for the listeners and even more poorly when the speakers spoke languages that were less familiar for the listeners. Recent reviews of the broader literature, including review of the effect of language and accent familiarity, appear in Sherrin [16] and Morrison et al. [17], and a recent review focusing on the effect of language familiarity appears in Perrachione [42].
18 To calculate CER, the pre-determined classification threshold of the system is used and the miss rate and the false-alarm rate are obtained. CER is the weighted mean of the miss rate and the false-alarm rate. Weighting may be equal for the miss rate and the false-alarm rate, or may be according to the number of same-source and different-source inputs respectively.
19 Recent reviews of the broader literature appear in Sherrin [16] and Morrison et al. [17], and a recent study that found large between-listener variability in speaker recognition in a legally relevant context is Rosas et al. [18].
20 Hughes et al. [32] was published after we began our data collection.
21 In a future paper, we will present the results of experiments in which groups of twelve listeners collaboratively make judgements. The group-of-listeners experiments are intended to reflect the situation where a group of jury members listen and collaboratively make a judgement.
22 More information about E³FS³ is available from http://forensic-voice-compar-
In addition to the experiments outlined above, in order to assess the effect of participants receiving expert testimony and being able to listen to voice recordings, we conduct an additional experiment. In that experiment, we provide participants with information about the forensic-voice-comparison system, including validation results, and for each recording pair we provide participants with the likelihood-ratio value output by the forensic-voice-comparison system in response to that pair.

Ethical approval
Ethical approval for this research was obtained from both the University of New South Wales Human Research Ethics Advisory Panel C: Psychology, and from the Aston Institute for Forensic Linguistics Research Ethics Subcommittee.

Source
Stimuli were taken from the recordings in the forensic_eval_01 dataset [33]. 24 The forensic_eval_01 recordings reflect the conditions of the questioned-speaker recording and the known-speaker recording from an actual forensic case. The speakers on the recordings are adult male speakers of Australian English. The questioned-speaker condition reflects a landline-telephone call, with background babble noise, saved using lossy compression. The known-speaker condition reflects an interview recorded in a reverberant room, with background ventilation-system noise. Prior to publication, the recordings were manually diarized, i.e., interlocutor speech, transient noises, and long periods during which the speaker of interest was not speaking were removed. Including remaining short pauses between utterances, the questioned-speaker-condition recordings were 46 s long, and the known-speaker-condition recordings were 126 s long.
The forensic_eval_01 dataset includes a training set and a validation set. Each speaker was recorded on multiple occasions separated from each other by approximately a week or more. The validation set consists of a total of 223 recordings from 61 speakers, 61 questioned-speaker-condition recordings (which always came from the first available recording session) and 162 known-speaker-condition recordings, allowing for the construction of 111 same-speaker pairs of recordings and 9720 different-speaker pairs of recordings (from 3660 pairs of speakers). The forensic_eval_01 validation protocol in [33] requires a forensic-voice-comparison system to provide a likelihood-ratio value in response to each of these 9831 pairs of recordings.
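The counts given above can be checked with a few lines of arithmetic. The sketch below rests on our reading that each known-speaker-condition recording is paired with the single questioned-speaker-condition recording of every other speaker, and that the 3660 pairs of speakers are ordered pairs; the number of same-speaker pairs (111) is taken as reported, since it depends on recording-session information that is not stated here. The variable names are ours.

```python
# Counts reported for the forensic_eval_01 validation set.
n_speakers = 61
n_questioned = 61            # one questioned-speaker-condition recording per speaker
n_known = 162                # known-speaker-condition recordings
n_same_speaker_pairs = 111   # as reported; not derivable from the counts above alone

total_recordings = n_questioned + n_known

# Assumption: each known-speaker recording is compared with the questioned-speaker
# recording of each of the other 60 speakers.
n_diff_speaker_pairs = n_known * (n_speakers - 1)

# Assumption: ordered pairs (questioned speaker, known speaker) of distinct speakers.
n_speaker_pairs = n_speakers * (n_speakers - 1)

total_pairs = n_same_speaker_pairs + n_diff_speaker_pairs

print(total_recordings, n_diff_speaker_pairs, n_speaker_pairs, total_pairs)
```

Under these assumptions the derived values agree with the 223 recordings, 9720 different-speaker pairs, 3660 pairs of speakers, and 9831 pairs in total stated in the text.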

Subset selection
It is not reasonable to ask human listeners to respond to thousands of pairs of stimuli. For the present research, we therefore selected a subset of 61 pairs of recordings from the forensic_eval_01 validation set. We initially considered using 122 pairs of stimuli, but pilot tests indicated that this number took too long and was too fatiguing for listeners, so we reduced the number to 61. To shorten the time participants would potentially take to complete each comparison trial, we also reduced the duration of each of the recordings to approximately 15 s (listeners were, however, able to listen to each recording multiple times).
Each speaker in the validation set had one questioned-speaker-condition recording. From each questioned-speaker-condition recording we randomly selected an ∼15 s long section of speech. Each speaker had at least two known-speaker-condition recordings. We randomly selected one known-speaker-condition recording from each speaker, and from that recording randomly selected an ∼15 s long section of speech. Intervals of 15 s within each recording were randomly selected from a uniform distribution, with the condition that they did not extend beyond the beginning or end of the recording. A researcher then manually extracted sections of speech that began and ended near the beginning and end of the randomly selected intervals, but made cuts at natural pauses rather than in the middle of words.
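The random selection of a 15 s interval that stays within a recording can be sketched as follows. This is an illustration only: the function name and the seed are hypothetical, the 46 s duration is the questioned-speaker-condition duration stated earlier, and the subsequent manual adjustment of cut points to natural pauses is not modelled.

```python
import random

def random_interval(recording_duration_s, interval_duration_s=15.0, rng=random):
    """Draw an interval uniformly at random so that it lies entirely within the recording."""
    if recording_duration_s < interval_duration_s:
        raise ValueError("recording is shorter than the requested interval")
    # Drawing the start time uniformly from [0, duration - interval] guarantees that
    # the interval does not extend beyond the beginning or end of the recording.
    latest_start = recording_duration_s - interval_duration_s
    start = rng.uniform(0.0, latest_start)
    return start, start + interval_duration_s

rng = random.Random(0)  # fixed seed so that the illustration is repeatable
start, end = random_interval(46.0, rng=rng)
print(f"selected interval: {start:.2f} s to {end:.2f} s")
```

Restricting the start time, rather than drawing any start time and discarding intervals that overrun, is one simple way to satisfy the stated condition; the source does not say which was used.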

Construction of pairs of stimuli
For the individual-listener experiment, we selected:
• 31 same-speaker pairs
  • For each speaker, the same-speaker pair was constructed as the ∼15 s long section from that speaker's questioned-speaker-condition recording plus the ∼15 s long section from their randomly selected known-speaker-condition recording.
  • A constraint was imposed so that the questioned- and known-speaker-condition recordings did not come from the same recording session.
• 30 different-speaker pairs
  • For each speaker, a different-speaker pair was constructed as the ∼15 s long section from that speaker's questioned-speaker-condition recording plus the ∼15 s long section from a randomly selected different speaker's randomly selected known-speaker-condition recording.
  • A constraint was imposed so that if a pair consisted of a questioned-speaker-condition recording of speaker A and a known-speaker-condition recording of speaker B, another pair could not consist of a questioned-speaker-condition recording of speaker B and a known-speaker-condition recording of speaker A.
If recordings of a speaker were used to construct a same-speaker pair, recordings of that speaker were not also used to construct different-speaker pairs.
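One way in which pairs satisfying the constraints above could be drawn is sketched below. The randomization procedure actually used is not described in detail, so the rejection-sampling approach, the function name, the integer speaker labels, and the seed are all assumptions made for illustration. The recording-session constraint on same-speaker pairs is not modelled because no session information is given here.

```python
import random

def build_pairs(speakers, n_same, rng):
    """Return (same_pairs, diff_pairs) as lists of (questioned_speaker, known_speaker)."""
    shuffled = list(speakers)
    rng.shuffle(shuffled)
    same_speakers = shuffled[:n_same]   # used only for same-speaker pairs
    diff_speakers = shuffled[n_same:]   # used only for different-speaker pairs
    same_pairs = [(s, s) for s in same_speakers]
    while True:
        # For each questioned speaker, draw a different known speaker at random.
        diff_pairs = [
            (q, rng.choice([s for s in diff_speakers if s != q]))
            for q in diff_speakers
        ]
        pair_set = set(diff_pairs)
        # Redraw if any pair (A, B) is accompanied by its reverse (B, A).
        if not any((k, q) in pair_set for q, k in diff_pairs):
            return same_pairs, diff_pairs

same_pairs, diff_pairs = build_pairs(range(61), n_same=31, rng=random.Random(1))
print(len(same_pairs), "same-speaker pairs;", len(diff_pairs), "different-speaker pairs")
```

With 61 speakers and 31 same-speaker pairs, the remaining 30 speakers each contribute one questioned-speaker-condition recording, giving the 30 different-speaker pairs and the total of 61 pairs.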
A copy of the stimuli used to conduct the experiments is available from https://forensic-voice-comparison.net/speaker-recognition-byhumans/
24 The database is available from https://forensic-voice-comparison.net/databases/#forensic_eval_01
N. Basu

Participants (listeners)
Participants were recruited using an online recruitment platform, Prolific. 25 The experiment was advertised as taking up to 2 h to complete, and participants who completed the experiment were paid GBP 21 (the amount recommended by Prolific for 2 h of participant time).
There were three sets of individual listeners defined by language background:

Australian-English listeners
These listeners were familiar with both the accent and language spoken by the speakers on the stimulus recordings.

North-American-English listeners
These listeners were familiar with the language spoken by the speakers on the stimulus recordings but less familiar with the accent. 26

Spanish-language listeners
These listeners were less familiar with both the language and the accent spoken by the speakers on the stimulus recordings. 27
The target number of listeners to recruit for each language background was 60.
To be eligible, each participant had to self-report that they:
1. were 18 years of age or older
2. were a fluent speaker of English (for language backgrounds 1 and 2) or Spanish (for language background 3)
3. were currently a resident of one of: 3.
The sub-criterion for eligibility criteria 3 and 4 had to correspond with the language background. Criteria 2-4 did not require a participant to be a first-language and first-accent speaker of the particular language and accent background, but did require them to be familiar with that language and accent background.
The education criterion was included because the individual-listener experiments are intended to inform us about how an individual judge might perform with respect to speaker identification. We considered recruiting judges per se to be impractical. Judges would be expected to have a relatively high level of education. We therefore recruited participants who had completed at least an undergraduate degree. 29
Potential participants were directed from Prolific to bespoke experiment software that we developed. Participants accessed the experiment software using a web browser.
Potential participants were first asked questions to determine whether they were eligible. If they were eligible, they were provided with a copy of the informed-consent information. If a participant gave consent, they were asked several demographic questions.
We asked participants their age. They could enter a number or "prefer not to say". We asked participants their gender. They could enter "male", "female", "other", or "prefer not to say".
We asked participants what their first language was and what other languages they spoke fluently.
We also asked participants how familiar they were with English in general, and with Australian English in particular. To answer the first question, participants could choose from: …
We also asked participants:
• In general, how good do you think you are at identifying speakers, i.e., if you hear two voice recordings, how good do you think you are at correctly deciding whether they are recordings of the same speaker or of two different speakers?
• How good do you think you are at identifying adult male Australian-English speakers, i.e., if you hear two voice recordings, how good do you think you are at correctly deciding whether they are recordings of the same adult male Australian-English speaker or of two different adult male Australian-English speakers?
• If you heard a large number of pairs of recordings of adult male Australian-English speakers, what percentage of the pairs do you think you would get "right", i.e., if they were recordings of the same speaker you would say they were recordings of the same speaker and if they were recordings of different speakers you would say they were recordings of different speakers? Count saying "can't decide" as incorrect.
To answer each of the first two questions, participants chose a value on a five-point Likert scale which had labels: "very poor", "poor", "neutral", "good", "very good". To answer the third question, participants typed a number between 0 and 100 in a box. The first two questions (but not the third) were repeated at the very end of the experiment after the participants had responded to all the stimulus pairs. Information about the experiment, including informed-consent text, demographic questions, instruction text, and the text on the experiment screens, was provided in either English or Spanish, depending on the language background of the participant.
25 https://prolific.co/ Since recruitment and payment of participants was handled by Prolific, we did not have access to participants' personal identifying information.
26 We chose North-American-English listeners rather than European-English listeners because there are greater cultural links between Australia and the British Isles than between Australia and the US & Canada. By recruiting North-American-English listeners, we were therefore less likely to recruit listeners who happened to be familiar with Australian English.
27 The online recruitment platform, Prolific, is entirely in English, so the Spanish-language participants had some degree of familiarity with English.
28 These particular Spanish-speaking countries were chosen because they happened to be the only Spanish-speaking countries from which Prolific recruits participants.
29 Requiring an even higher level of education would have made the recruitment pools available on Prolific smaller, and impractically small for the Australian-English pool, which was already by far the smallest pool. Also, unlike in the US & Canada where a professional law degree is a graduate degree, in Australia a professional law degree is an undergraduate degree.

Experiment procedures
A demonstration of the bespoke software used to run the individual-listener experiment is available at https://forensic-voice-comparison.net/speaker-recognition-by-humans/. The software was designed to run on any modern web browser running on any modern operating system on any device, but participants were advised that the software display was optimized for larger screens, e.g., desktops, laptops, and tablets, rather than smartphones, and they were strongly advised not to run the experiment on a smartphone.
After completing eligibility questions, providing informed consent, and answering demographic questions, each participant was presented with written instructions explaining the task, 30 plus a sound check to make sure they could hear audio playing on their device. They were instructed to perform the experiment in a quiet place, and were asked whether they were listening using headphones or loudspeakers. They were then presented with a warmup trial. The warmup trial was a different-speaker trial that was identical in form to the experiment trials. Participants were not told that this was a warmup. Their responses to this trial were not included in subsequent analysis. Each participant was then presented with the 61 experiment trials in random order, a different random order for each participant. Randomly mixed with the experiment trials were 4 attention-check trials. We describe the experiment trials and the attention-check trials in the paragraphs below. Each trial screen included a counter showing the number of trials completed out of the total (66 including warmup and attention-check trials). A participant could take a break whenever they wanted. If they closed their browser, they could later resume using the link originally provided by Prolific. On resuming after having closed the browser, a participant had to repeat the sound check, after which the experiment resumed where they had left off. The experiment could not be resumed if more than 7 days had passed since the participant first started the experiment. 31 A screenshot of an experiment trial is shown in Fig. 1. The participant was presented with two sets of audio-playback controls, one labelled "questioned-speaker recording" and the other labelled "known-speaker recording". Using each set of controls, the participant could start and stop playing the recording, and navigate to any point between the beginning and end of a recording. Only one recording would play at a time.
The participant was also presented with two response boxes. The first response box was embedded in the following sentence:
• I think the properties of the voices on the recordings are ______ times more likely if they are both recordings of the same adult male Australian-English speaker than if they are recordings of two different adult male Australian-English speakers.
The second response box was embedded in the following sentence:
• I think the properties of the voices on the recordings are ______ times more likely if they are recordings of two different adult male Australian-English speakers than if they are both recordings of the same adult male Australian-English speaker.
Participants were instructed to enter a number that was 1 or greater in one of the boxes. Participants were instructed that if they thought the properties of the voices on the recordings were a little more likely if they were recordings of the same speaker than if they were recordings of different speakers they should enter a number in the first box that is a little larger than 1, and if they thought the properties of the voices on the recordings were a lot more likely if they were recordings of the same speaker than if they were recordings of different speakers they should enter a number in the first box that is a lot larger than 1; and mutatis mutandis for the second box if they thought the properties of the voices on the recordings were more likely if they were recordings of different speakers than if they were recordings of the same speaker. The instructions (deliberately) did not suggest any particular numbers to use. Participants were instructed that if they thought the properties of the voices on the recordings were exactly equally likely irrespective of whether they were recordings of the same speaker or recordings of different speakers, they should enter 1 in either one of the boxes. 32 The software checked that the participant had listened to at least 5 s of each recording, and that they had entered a number 1 or greater in one, but only one, of the boxes. If these criteria were met, when the participant pressed the "next" button, they moved to the next trial. If not all criteria were met, the participant received a message indicating the criterion or criteria which had not been met. Once a participant had moved to the next trial, they could not return to an earlier trial.
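The per-trial checks described above can be sketched as follows. This is a minimal illustration and not the authors' software: the function name `check_trial`, its parameters, and the wording of the messages are hypothetical; only the two criteria (at least 5 s of listening per recording, and a number of 1 or greater in exactly one box) come from the text.

```python
from typing import List, Optional

MIN_LISTEN_SECONDS = 5.0  # minimum listening time per recording stated in the text


def check_trial(
    listened_questioned_s: float,
    listened_known_s: float,
    same_speaker_box: Optional[float],
    different_speaker_box: Optional[float],
) -> List[str]:
    """Return the unmet criteria; an empty list means the participant may advance.

    All names and message strings are hypothetical (illustration only).
    """
    problems = []
    if listened_questioned_s < MIN_LISTEN_SECONDS:
        problems.append("listen to at least 5 s of the questioned-speaker recording")
    if listened_known_s < MIN_LISTEN_SECONDS:
        problems.append("listen to at least 5 s of the known-speaker recording")

    # Exactly one of the two boxes must contain a number.
    filled = [v for v in (same_speaker_box, different_speaker_box) if v is not None]
    if len(filled) != 1:
        problems.append("enter a number in one, and only one, of the boxes")
    elif filled[0] < 1:
        problems.append("the number entered must be 1 or greater")
    return problems
```

For example, `check_trial(6.0, 7.5, 20.0, None)` returns an empty list, whereas a trial with only 3 s of listening on one recording and numbers in both boxes returns two messages.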
In addition to saving the responses entered into the response boxes, for each recording, the software saved the total listening time.
The screen for an attention-check trial looked the same as the screen for an experiment trial, but instead of hearing a pair of questioned-speaker-condition and known-speaker-condition recordings, the participant heard a recording (the same recording on both players) that told them to enter a particular number in one of the boxes. 33 For the English-language listeners, the instructions were spoken in English by a synthetic voice with an Australian accent, and for the Spanish-language listeners they were spoken in Spanish by a synthetic voice with a European-Spanish accent.
After the last pair of recordings, the questions about how good the participant thought they were at speaker identification were repeated, and the participant was presented with a "submit" button. The participant could withdraw from the study at any point before pressing the "submit" button by simply not proceeding with the experiment. If they did not press the "submit" button within 7 days of starting the experiment, their temporarily saved responses were deleted.
30 The participant could also access the instructions whenever they wanted during the experiment.
31 Prolific's display of the link timed out after 24 h, and, if participants completed the experiment more than 24 h after they began, Prolific issued a warning. Using Prolific's messaging service, we sent each participant their link, and informed them that they could ignore the warning issued by Prolific.
32 The intent was to elicit subjectively assigned likelihood-ratio values. The logically correct output for a forensic-evaluation system (including a forensic-voice-comparison system) is a likelihood ratio. In order to compare like with like, we therefore had to attempt to elicit likelihood-ratio values from listeners. It may be that some (or many) listeners did not fully understand the implied request to provide a ratio of likelihoods, and they may instead have provided numbers that represented their "certainty" as to whether the recordings were of the same speaker or of different speakers, but this still provided an unconstrained number (theoretically, on a logarithmic scale, between minus infinity and plus infinity, rather than being constrained to a range such as 0-1 or 0-100) that was a subjectively assigned quantification of the listener's assessment of the strength of the evidence.
33 For the attention-check trials, the software did not check whether the participant had listened to at least 5 s of each recording.
If the participant pressed the "submit" button within 7 days of starting the experiment, their responses were permanently saved.
Since the responses were submitted anonymously, once the "submit" button was pressed, the participant could no longer withdraw their responses.
After each participant submitted their responses, a researcher checked their responses to the attention-check trials and authorized payment if at least two of the four were answered correctly.

Forensic-voice-comparison system
E³FS³ is a forensic-voice-comparison system which is based on state-of-the-art automatic-speaker-recognition technology. It extracts x-vectors using a Residual Network (ResNet). Backend models include linear discriminant analysis (LDA) for mismatch compensation and dimension reduction, probabilistic linear discriminant analysis (PLDA) to calculate uncalibrated likelihood ratios (scores), and logistic regression for calibration. 34 For more detailed descriptions of this system, see Morrison et al. [31] and Weber et al. [41]. For a previous report on the validation of this system, see Weber et al. [40].
Sections of recordings from the forensic_eval_01 training set were used to train LDA and PLDA. From each recording in the training set, a 15 s long section was randomly selected, and an x-vector was extracted from that section. In addition to the in-domain forensic_eval_01 data, out-of-domain data from the SRE2018 Test dataset [45] were adapted to the forensic_eval_01 conditions using the correlation alignment (CORAL) algorithm [46,47], and the in-domain and adapted data were together used to train the LDA and PLDA.
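The basic idea of correlation alignment can be sketched as follows: the out-of-domain vectors are whitened using their own covariance matrix and then re-coloured using the covariance matrix of the in-domain vectors. This is a minimal sketch of the basic algorithm only; whether the system described here uses this exact form, and the function names and the regularization parameter `reg`, are assumptions made for illustration.

```python
import numpy as np


def _sym_power(matrix, power):
    """Raise a symmetric positive-definite matrix to a real power via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    return (eigvecs * eigvals**power) @ eigvecs.T


def coral_adapt(out_domain, in_domain, reg=1.0):
    """Align the covariance of out-of-domain vectors with that of in-domain vectors.

    Rows are vectors (e.g., x-vectors). `reg` (hypothetical parameter name) is
    added to the diagonal of both covariance matrices for numerical stability.
    """
    dim = out_domain.shape[1]
    cov_out = np.cov(out_domain, rowvar=False) + reg * np.eye(dim)
    cov_in = np.cov(in_domain, rowvar=False) + reg * np.eye(dim)
    whitened = out_domain @ _sym_power(cov_out, -0.5)  # remove out-of-domain correlations
    return whitened @ _sym_power(cov_in, 0.5)          # re-colour with in-domain correlations
```

After adaptation, the covariance of the adapted vectors approximately matches that of the in-domain vectors (exactly so in the limit of `reg` approaching 0), and the adapted vectors can be pooled with the in-domain vectors for backend training.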
The validation data consisted of the same ∼15 s long recording-sections as had been used with the human listeners. An x-vector was extracted from each section.
For calibration, all recordings in the forensic_eval_01 validation set were used. From each recording in the forensic_eval_01 validation set, three 15 s long non-overlapping sections were randomly selected, and an x-vector was extracted from each section. 35 All possible questioned-speaker-condition versus known-speaker-condition pairs of recording-sections were constructed, excluding same-speaker pairs constructed from different recordings made during the same recording session. Hereinafter, these will be referred to as the calibration data.
Leave-one-speaker-out / leave-two-speakers-out cross validation was employed: In a cross-validation loop in which the score to be calibrated was a same-speaker score, e.g., a recording of speaker A compared to another recording of speaker A in the validation data, all scores in the calibration data that resulted from comparisons in which one or both members of the pair was a recording of speaker A were excluded and the remaining calibration data were used to train the calibration model (leave-one-speaker-out). In a cross-validation loop in which the score to be calibrated was a different-speaker score, e.g., a recording of speaker A compared to a recording of speaker B in the validation data, all scores in the calibration data that resulted from comparisons in which one or both members of the pair was a recording of speaker A or a recording of speaker B were excluded and the remaining calibration data were used to train the calibration model (leave-two-speakers-out).
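The selection of calibration data within each cross-validation loop can be sketched as follows. This is a minimal illustration under assumed data structures: the `Score` type and the function `calibration_subset` are hypothetical names, and the training of the (regularized logistic-regression) calibration model on the returned subset is not shown.

```python
from typing import List, NamedTuple


class Score(NamedTuple):
    """One comparison score (hypothetical container, for illustration only)."""
    questioned_speaker: str
    known_speaker: str
    value: float  # uncalibrated score


def calibration_subset(calibration: List[Score], target: Score) -> List[Score]:
    """Drop every calibration score that involves a speaker in the target comparison.

    For a same-speaker target this holds out one speaker (leave-one-speaker-out);
    for a different-speaker target it holds out two (leave-two-speakers-out).
    """
    held_out = {target.questioned_speaker, target.known_speaker}
    return [
        s for s in calibration
        if s.questioned_speaker not in held_out and s.known_speaker not in held_out
    ]
```

Because the held-out set is built from both members of the target pair, the same function covers both cases: for a same-speaker comparison the set contains one speaker, and for a different-speaker comparison it contains two.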

Experiment in which forensic-voice-comparison results are provided
In order to assess the effect of providing participants with expert testimony on forensic voice comparison and also allowing them to listen to the recordings and perform their own speaker identification, we ran another version of the individual-listener experiment with a new set of North-American-English listeners. 36 In that version, along with the instructions, we provided participants with the information about the forensic-voice-comparison system given in Text Box 1 and in Figure 2.
34 A regularized version of logistic regression was used with a regularization weight equivalent to 1 pseudo-speaker relative to the number of speakers used for training the logistic-regression model (see Morrison & Poh [44] for details).
35 These sections were automatically extracted and were not manually adjusted to not begin or end in the middle of words. Although beginning or ending in the middle of words might be disturbing for human listeners, it is not an issue for the forensic-voice-comparison system.
36 We used North-American-English listeners because, of the three language backgrounds, they constituted the largest pool of potential participants available on Prolific. Listeners who had participated in the earlier experiment were excluded from participating in this experiment.
The text that appeared on the experimental screens had the form of one of the following, as applicable:
• Output of forensic-voice-comparison system: The acoustic properties of the questioned-speaker and known-speaker recordings are X times more likely if they were both produced by the same adult male Australian-English speaker than if they were produced by two different adult male Australian-English speakers.
• Output of forensic-voice-comparison system: The acoustic properties of the questioned-speaker and known-speaker recordings are X times more likely if they were produced by two different adult male Australian-English speakers than if they were both produced by the same adult male Australian-English speaker.
If the likelihood-ratio value X was greater than 10, it was rounded to the nearest integer. If it was less than 10, it was rounded to 1 decimal place. No value happened to be rounded to 1.0.
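The rounding rule and the choice between the two sentence forms can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the treatment of a value of exactly 10 (shown here to one decimal place) and of ties (halves rounded up for integers) are assumptions since the text does not specify them, and the mapping of likelihood ratios below 1 to the second sentence form via the reciprocal is inferred from the two sentence forms above.

```python
def format_lr(x: float) -> str:
    """Format the displayed value X (a likelihood-ratio magnitude of 1 or greater)."""
    if x > 10:
        return str(int(x + 0.5))  # nearest integer (assumption: halves round up)
    return f"{x:.1f}"             # one decimal place (assumption: also used for exactly 10)


def describe_lr(lr: float) -> tuple:
    """Return which sentence form applies and the value of X to display."""
    if lr >= 1:
        return ("same-speaker", format_lr(lr))
    # A likelihood ratio below 1 is reported as its reciprocal in favour of
    # the different-speaker hypothesis.
    return ("different-speaker", format_lr(1 / lr))
```

For example, a system output of 0.25 would be displayed with the second sentence form and X = 4.0.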
For each attention-check trial, an arbitrary number was given as the likelihood-ratio value output by the forensic-voice-comparison system. This number was different from the number that the recording told participants to enter into one of the boxes. The recording told the participants to ignore the number that was given as the output by the forensic-voice-comparison system.

Introduction
For each response by an individual listener: if a number was entered into the first box, it was treated as a likelihood-ratio value;
Text Box 1. Information that was provided to participants about the forensic-voice-comparison system.
To help you make your decision, for each pair of recordings, we provide the likelihood-ratio output by a forensic-voice-comparison system in response to the same pair of recordings. The output appears to the left of the screen, below the audio players. When deciding the value you think is appropriate to enter into either the first box or the second box, you can take into consideration your own listening of the recordings and you can take into consideration the likelihood-ratio value output by the forensic-voice-comparison system.
The overwhelming majority of experts in forensic inference and statistics agree that the likelihood-ratio framework is the logically correct way for a forensic practitioner to evaluate strength of evidence. Its use is also recommended by key organizations including the American Statistical Association, European Network of Forensic Science Institutes, Forensic Science Regulator for England & Wales, and National Institute of Forensic Science of the Australia New Zealand Policing Advisory Agency.
In the context of forensic voice comparison, a likelihood ratio quantifies: how much more likely the acoustic properties of the questioned-speaker and known-speaker recordings would be if they were both produced by the same speaker compared to if they were each produced by a different speaker from the relevant population; or how much more likely the acoustic properties of the questioned-speaker and known-speaker recordings would be if they were each produced by a different speaker from the relevant population compared to if they were both produced by the same speaker.
In this case, the relevant population is adult male speakers of Australian English.
The forensic-voice-comparison system used was the E³ Forensic Speech Science System (E³FS³). This system makes use of state-of-the-art automatic-speaker-recognition technology, which includes the use of deep neural networks. It has been developed by the Forensic Data Science Laboratory at Aston University, in collaboration with the Audio, Digital Intelligence and Speech (AUDIAS) Laboratory at the Autonomous University of Madrid, and in partnership with operational forensic laboratories in several organizations including the FBI, Netherlands Forensic Institute, Swedish National Forensic Center, German Federal Police Office, and Chilean Investigative Police.
The forensic-voice-comparison system has been calibrated and validated under the same conditions as those of the pairs of recordings that you will be asked to make judgments on. Calibration and validation was performed in accordance with the recommendations in the 2021 Consensus on validation of forensic voice comparison. To perform the validation, the system was presented with a large number of pairs of recordings that were same-speaker pairs and a large number of pairs of recordings that were different-speaker pairs (999 same-speaker pairs and 87,480 different-speaker pairs), and it gave a likelihood-ratio output in response to each pair. Each pair consisted of one recording in questioned-speaker condition and one recording in known-speaker condition. None of the pairs were the same as those that you will be asked to make judgments on.
Given a same-speaker pair, a good output would be a large likelihood-ratio value in favor of the same-speaker hypothesis, a less good output would be a smaller likelihood-ratio value in favor of the same-speaker hypothesis, a worse output would be a small likelihood-ratio value in favor of the different-speaker hypothesis, and a bad output would be a large likelihood-ratio value in favor of the different-speaker hypothesis.
Given a different-speaker pair, a good output would be a large likelihood-ratio value in favor of the different-speaker hypothesis, a less good output would be a smaller likelihood-ratio value in favor of the different-speaker hypothesis, a worse output would be a small likelihood-ratio value in favor of the same-speaker hypothesis, and a bad output would be a large likelihood-ratio value in favor of the same-speaker hypothesis.
The image below shows the validation results in a Tippett plot. The blue curve rising to the right shows the proportion of same-speaker pairs that had likelihood-ratio values equal to or less than the value on the x axis. The red curve rising to the left shows the proportion of different-speaker pairs that had likelihood-ratio values equal to or greater than the value on the x axis. The better the performance of the forensic-voice-comparison system the greater the separation between the same-speaker and different-speaker curves: the further to the right the same-speaker curve will be and the further to the left the different-speaker curve will be. The x axis of the Tippett plot shows values greater than 1 and values less than 1.
A value greater than 1 favors the same-speaker hypothesis, e.g., a likelihood ratio of 100 means that the acoustic properties of the questioned-speaker and known-speaker recordings are 100 times more likely if they were both produced by the same speaker than if they were each produced by a different speaker from the relevant population.
A value less than 1 favors the different-speaker hypothesis, e.g., a likelihood ratio of 1/100 means that the acoustic properties of the questioned-speaker and known-speaker recordings are 100 times more likely if they were each produced by a different speaker from the relevant population than if they were both produced by the same speaker.
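The two curves of a Tippett plot, as described in Text Box 1, can be computed as in the following sketch. It is an illustration only: the function name `tippett_curves` and the choice of evaluating the curves on a grid of log10 likelihood-ratio values are assumptions, and plotting is omitted.

```python
import numpy as np


def tippett_curves(same_lrs, diff_lrs, grid_log10):
    """Compute the two cumulative-proportion curves of a Tippett plot.

    same_lrs, diff_lrs: likelihood ratios from same- and different-speaker pairs.
    grid_log10: x-axis positions, as log10 likelihood-ratio values.
    """
    same_log = np.log10(np.asarray(same_lrs, dtype=float))
    diff_log = np.log10(np.asarray(diff_lrs, dtype=float))
    grid = np.asarray(grid_log10, dtype=float)
    # Proportion of same-speaker values equal to or less than x (rises to the right).
    same_curve = (same_log[None, :] <= grid[:, None]).mean(axis=1)
    # Proportion of different-speaker values equal to or greater than x (rises to the left).
    diff_curve = (diff_log[None, :] >= grid[:, None]).mean(axis=1)
    return same_curve, diff_curve
```

The better the system, the lower the same-speaker curve stays on the left-hand side of the plot and the lower the different-speaker curve stays on the right-hand side.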

C llr
For each listener, and for the forensic-voice-comparison system, the responses to the stimulus pairs were used to calculate a C llr value [48]. C llr was calculated using Equation (1), in which Λ_s and Λ_d are likelihood-ratio responses corresponding to same-speaker and different-speaker stimulus pairs respectively, and N_s and N_d are the number of same-speaker and different-speaker stimulus pairs respectively.

$$C_{\mathrm{llr}} = \frac{1}{2}\left(\frac{1}{N_s}\sum_{i=1}^{N_s}\log_2\left(1+\frac{1}{\Lambda_{s_i}}\right)+\frac{1}{N_d}\sum_{j=1}^{N_d}\log_2\left(1+\Lambda_{d_j}\right)\right) \qquad (1)$$
C llr is a standard metric of the performance of forensic-evaluation systems. It measures the accuracy of systems that output likelihood ratios. Its use is recommended in the Consensus on validation of forensic voice comparison [49]. For a system that always responded with a likelihood ratio of 1 irrespective of the input, the posterior odds would always equal the prior odds, and the system would therefore provide no useful information. Such a system would have a C llr value of 1. If the C llr value is less than 1, the system is providing useful information, and the better the performance of the system the lower the C llr value will be. C llr values cannot be less than or equal to 0. Uncalibrated or miscalibrated systems can have C llr values that are greater than 1.
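The properties of C llr stated above can be checked numerically with the following sketch, which implements the standard definition of the log-likelihood-ratio cost [48]: half the sum of the mean of log2(1 + 1/Λ) over same-speaker responses and the mean of log2(1 + Λ) over different-speaker responses. The function name `cllr` and the example values are illustrative assumptions.

```python
import numpy as np


def cllr(same_lrs, diff_lrs):
    """Log-likelihood-ratio cost for likelihood ratios from same- and different-speaker pairs."""
    same = np.asarray(same_lrs, dtype=float)
    diff = np.asarray(diff_lrs, dtype=float)
    same_cost = np.mean(np.log2(1.0 + 1.0 / same))  # penalizes small values for same-speaker pairs
    diff_cost = np.mean(np.log2(1.0 + diff))        # penalizes large values for different-speaker pairs
    return 0.5 * (same_cost + diff_cost)
```

A set of responses that are all equal to 1 gives a value of exactly 1; responses that strongly support the correct hypothesis give values close to 0; and responses that strongly support the wrong hypothesis give values greater than 1.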

D llr
In order to compare an individual-listener's responses with the forensic-voice-comparison system's responses, we also calculated a pairwise difference metric, D llr , see Equation (2), in which subscript h represents a human-listener's response and subscript f represents a response by the forensic-voice-comparison system. If the D llr value is greater than 0, the human listener is, on average, better at distinguishing between speakers than is the forensic-voice-comparison system (on average, their likelihood-ratio responses to same-speaker pairs and their likelihood-ratio responses to different-speaker pairs are further apart), and if the D llr value is less than 0, the human listener is, on average, worse at distinguishing between speakers than is the forensic-voice-comparison system (on average, their likelihood-ratio responses to same-speaker pairs and their likelihood-ratio responses to different-speaker pairs are closer together). A D llr of +1 would indicate that, on average, a listener's likelihood-ratio responses to same-speaker pairs and their responses to different-speaker pairs are twice as far apart as those of the forensic-voice-comparison system, a D llr of +2 that they are four times as far apart, a D llr of +3 that they are eight times as far apart, etc. A D llr of −1 would indicate that, on average, a listener's likelihood-ratio responses to same-speaker pairs and their responses to different-speaker pairs are half as far apart as those of the forensic-voice-comparison system, a D llr of −2 that they are a quarter as far apart, a D llr of −3 that they are an eighth as far apart, etc.

B llr
In order to compare an individual-listener's responses with the forensic-voice-comparison system's responses, we also calculated a pairwise relative-bias metric, B llr . B llr is calculated using Equation (3). 38 If the B llr value is greater than 0, then, relative to the forensic-voice-comparison system, the human-listener's responses are biased toward giving larger likelihood-ratio response values (biased in favour of the same-speaker hypothesis), and if the B llr value is less than 0, then, relative to the forensic-voice-comparison system, the human-listener's responses are biased toward giving smaller likelihood-ratio response values (biased in favour of the different-speaker hypothesis). A B llr value of +1 would indicate that, on average, the listener's likelihood-ratio responses are twice as large as those of the forensic-voice-comparison system, a B llr value of +2 that they are four times as large, a B llr value of +3 that they are eight times as large, etc. A B llr value of −1 would indicate that, on average, the listener's likelihood-ratio responses are half as large as those of the forensic-voice-comparison system, a B llr value of −2 that they are a quarter as large, a B llr value of −3 that they are an eighth as large, etc.
Fig. 2. Tippett plot presented to participants as part of the instructions in the experimental condition in which, in addition to listening to each pair of recordings, participants were provided with the likelihood-ratio output by a forensic-voice-comparison system in response to the same pair of recordings.
37 D llr and B llr are named by analogy with C llr . All three have a base-two logarithmic scale, but they do not have the same range: C llr values are greater than 0, with 1 being a reference value, whereas D llr and B llr values are less than or greater than 0, with 0 being a reference value. D llr and B llr are not costs measured in bits.
38 Note that Equation (2) and Equation (3)

Forensic-voice-comparison system
When previously validated on the full set of full-length forensic_eval_01 validation recordings, the C llr value for E³FS³ was 0.21 [40]. Given the smaller number of shorter validation recordings used in the current research, the C llr value was 0.42. Poorer performance on shorter recordings is what would be expected, see examples in Weber et al. [40].
A Tippett plot of the validation results from E³FS³ is provided in Fig. 3. For an explanation of how to interpret Tippett plots, see Appendix C.1 of the Consensus on validation of forensic voice comparison [49] and the references cited therein. Likelihood-ratio values resulting from same-speaker comparisons ranged up to approximately 750, and likelihood-ratio values resulting from different-speaker comparisons ranged down to approximately 1/1000.

Demographics
We excluded from analysis the submissions from listeners who did not answer all of the attention-check trials correctly, 39 and the submissions from listeners who, despite indicating that they were eligible at the informed-consent stage, gave answers to demographic questions about language and accent familiarity which indicated that they did not satisfy eligibility criterion 4. 40
Note that all this information was self-reported. We are sceptical about the high proportions of North-American-English participants and Spanish-language participants who stated that they were somewhat familiar with Australian English. On the Likert scale, "Somewhat familiar" with Australian English was glossed as "For example, I frequently watch Australian TV programmes, have multiple Australian friends, and/or I have visited Australia".

C llr values
A C llr value was calculated separately for each individual listener's responses. Fig. 4 shows violin plots of the resulting C llr values grouped by the listeners' language backgrounds. The horizontal line indicates the C llr value for the forensic-voice-comparison system.
In terms of C llr , there was large inter-listener variability. All of the listeners, however, performed worse than the forensic-voice-comparison system. The lowest C llr from a listener was 0.51, compared to 0.42 for the forensic-voice-comparison system. Just over half of the English-language listeners (30 of the 53 Australian-English listeners, 33 of the 61 North-American-English listeners) and three quarters of the Spanish-language listeners (41 of 55) had C llr ≥ 1, i.e., they performed worse than a system that provided no useful information.
In terms of C llr , the North-American-English listeners' median and quartile values were somewhat higher than those of the Australian-English listeners, and 9 of the Australian-English listeners performed better than the best-performing North-American-English listener. The Spanish-language listeners' quartile values were somewhat higher and their median was substantially higher than those of the North-American-English listeners. This suggests that greater language familiarity contributes to better speaker-identification performance.
39 We did not exclude submissions for which the failure to answer all the attention-check questions correctly was obviously the result of transposition errors, e.g., entering the correct number in the wrong box or writing "16" for "61".
40 To be eligible, Australian-English listeners had to answer "extremely familiar" for both English and Australian English, and North-American-English listeners had to answer "extremely familiar" for English. Although a first-accent Australian-English listener who had lived in the US or Canada for more than 4 years would have satisfied eligibility criterion 4, we excluded from analysis submissions from North-American-English listeners who stated that they were "extremely familiar" with Australian English (which required that they be first-accent Australian-English speakers, or that they be resident in Australia). Although a first-language English speaker who had been resident in Chile, Mexico, or Spain for more than 4 years would have satisfied eligibility criterion 4, we excluded from analysis submissions from Spanish-language listeners who stated that they were first-language English speakers or that they were "extremely familiar" with English (which required that they be first-language English speakers, or that they be resident in a predominantly English-speaking country).
Judges, like other lay listeners, would be expected to exhibit inter-listener variability in speaker-identification performance. A limitation of online recruiting and unsupervised participation in the individual-listener experiment is that many listeners are unlikely to have approached the task as conscientiously as would be expected of a judge listening to questioned- and known-speaker recordings in the context of a legal case. Many listeners in the individual-listener experiment may, therefore, have performed worse than would be expected for judges in the context of a case. We expect, however, that the best-performing listeners in the individual-listener experiment approached the task conscientiously and were intrinsically good at speaker identification. We would not, therefore, expect judges in general to be better at speaker identification than the best-performing listeners in the individual-listener experiment. 41
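The reference point of C llr = 1 used above can be illustrated with a short calculation. The following is a minimal sketch that assumes the standard definition of the log-likelihood-ratio cost (the mean of two base-2 logarithmic penalty terms, one for same-speaker pairs and one for different-speaker pairs); the function name and the example responses are hypothetical and are not data from the experiment.

```python
import math


def cllr(same_speaker_lrs, different_speaker_lrs):
    """Log-likelihood-ratio cost for likelihood ratios (not log likelihood ratios).

    Assumes the standard definition: same-speaker pairs are penalised for
    low likelihood ratios, different-speaker pairs for high likelihood ratios.
    """
    penalty_same = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs) / len(same_speaker_lrs)
    penalty_diff = sum(math.log2(1 + lr) for lr in different_speaker_lrs) / len(different_speaker_lrs)
    return 0.5 * (penalty_same + penalty_diff)


# Hypothetical responses, for illustration only.
uninformative = cllr([1, 1, 1], [1, 1, 1])          # always responds with a likelihood ratio of 1
informative = cllr([10, 100, 5], [0.1, 0.01, 0.2])  # responses point in the correct direction
misleading = cllr([0.1, 0.5, 1], [2, 10, 1])        # responses mostly point in the wrong direction
```

Under this definition, a respondent who always gives a likelihood ratio of 1 obtains a C llr of exactly 1, responses that tend to point in the correct direction give values below 1, and responses that tend to point in the wrong direction give values above 1.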

D llr values
A D llr value was calculated separately for each individual listener's responses. Fig. 5 shows violin plots of the resulting D llr values grouped by the listeners' language backgrounds.
Apart from a few outliers, across all language backgrounds, all the listeners' D llr values were negative, i.e., compared to the forensic-voice-comparison system, their scaling of log-likelihood-ratio values was narrower: on average, their likelihood-ratio responses to same-speaker pairs and their likelihood-ratio responses to different-speaker pairs were closer to each other than those of the forensic-voice-comparison system. The median scaling for Australian-English listeners' responses was about a fifth that of the forensic-voice-comparison system, for North-American-English listeners it was about a sixth, and for Spanish-language listeners it was about a seventh. Within each language background, there was substantial inter-listener variability.

B llr values
A B llr value was calculated separately for each individual listener's responses. Fig. 6 shows violin plots of the resulting B llr values grouped by the listeners' language backgrounds.
The B llr values indicate that, relative to the forensic-voice-comparison system, the listeners were predominantly biased toward giving responses that favoured the different-speaker hypothesis. More than 90% of the listeners (48 of the 53 Australian-English listeners, 57 of the 61 North-American-English listeners, and 50 of the 55 Spanish-language listeners) exhibited relative bias that favoured the different-speaker hypothesis. The median and quartile values across the different language backgrounds were similar. Across language backgrounds, the median relative bias was such that likelihood-ratio values were, on average, a little above half those of the forensic-voice-comparison system. There was, however, substantial inter-listener variability.
The likelihood-ratio output of the forensic-voice-comparison system in response to the validation data may have had a slight absolute bias in favour of the same-speaker hypothesis. This is likely due to sampling variability between the data used to train the calibration model and the data used to validate the system (see [50] for a discussion of this issue). The data used to train the calibration model were deliberately different from those used to validate the system. Calibrating and validating on the same data would result in better calibrated output for those particular data. What matters, however, is how well the system performs on previously unseen data, such as the questioned-speaker and known-speaker recordings in a case. Even taking into account that, for the particular validation data, the forensic-voice-comparison system may have had a slight absolute bias in favour of the same-speaker hypothesis, the magnitudes of the negative B llr values indicate that the majority of the listeners had absolute biases in favour of the different-speaker hypothesis. These biases could potentially be due to the poor recording conditions and the mismatches in recording conditions between the questioned-speaker-condition and the known-speaker-condition recordings. These would have made the voices on the two recordings in each pair sound more different from one another than had they both been high-quality recordings.
The individual-listener experiment did not include any contextual information that would be expected to bias the listeners. There is concern in the earwitness literature that context could influence a listener to expect to hear a particular individual and bias them toward identifying a speaker whom they hear as that individual, e.g., if a listener is asked to identify a speaker in a showup scenario rather than in a well-designed voice lineup [18,51]. Similar concerns apply when a trier of fact is asked to compare a voice on a recording with the voice of the defendant [5]. Abundant psychology research indicates that judges would not be immune from the potential effects of contextual bias [52]. Given such contexts, different speakers who sound at least somewhat similar could be incorrectly identified as the same speaker, a situation that would usually favour the prosecution. The observed relative bias in the responses to the individual-listener experiment favoured the different-speaker hypothesis rather than the same-speaker hypothesis. We assume that, since it was observed in a neutral context, this is due to an intrinsic bias. This should still be of concern, however, as bias that usually favours the defence would not be in the interest of victims. It may also be unwise to assume that this intrinsic bias would counteract the potential effect of a contextual bias in favour of a same-speaker response.

41 One of the reviewers suggested that the overall poor performance of the listeners may have been due to a lack of opportunity for training and to listeners not fully understanding the task of assigning a likelihood ratio. The reviewer proposed that this could be addressed, and a fairer comparison obtained, by calibrating listeners' responses. In a courtroom context, however, judges are not trained in speaker identification and their speaker-identification judgements are not calibrated. Lack of training and lack of calibration may be a cause of listeners' poor performance, but it is the untrained uncalibrated performance of individual listeners that is relevant for addressing the research question of whether speaker identification by a judge listening alone would be more or less accurate than the output of a forensic-voice-comparison system that is based on state-of-the-art automatic-speaker-recognition technology.

Tippett plots
There was substantial inter-listener variability, but several patterns were discernible in Tippett plots of the listeners' responses. These different patterns may reflect different conscious or unconscious strategies employed by the listeners. In this subsection, we show example Tippett plots of the patterns which we discerned (excluding those that only occurred in a few outliers). In the caption of each figure, we provide the C llr, D llr, and B llr values corresponding to the Tippett plots shown.

Fig. 7 shows Tippett plots of the results from the three best-performing listeners, i.e., those with the lowest C llr values. Compared to the results from the forensic-voice-comparison system (see Fig. 3), the number-one best-performing listener's responses to same-speaker pairs were too low (too close to a log-likelihood ratio of 0 / too close to a likelihood ratio of 1), i.e., they were too conservative. This resulted in both a negative D llr value and a negative B llr value.

Fig. 8 shows example Tippett plots of the results from listeners who used narrow ranges of likelihood-ratio values: the values of their likelihood-ratio responses to same-speaker pairs and their likelihood-ratio responses to different-speaker pairs were too close to each other, i.e., they were too conservative. This resulted in large negative D llr values. Although the same trend can be observed in the responses from the best-performing listeners (Fig. 7), in the examples given in Fig. 8, the pattern is more extreme. It also resulted in higher C llr values.

Fig. 9 shows example Tippett plots of the results from listeners who gave lots of likelihood-ratio-equal-to-one responses. This could be considered an extreme version of the conservative pattern just shown in Fig. 8. It resulted in C llr values close to 1, large negative D llr values, and B llr values close to 0. If these listeners were conscientiously engaged with the task, 42 then this would suggest that, under the conditions tested, they found it difficult to perform speaker identification.

Fig. 10 shows example Tippett plots of the results from listeners who mostly used one response magnitude, i.e., they almost always entered the same number but entered it into the first box or the second box depending on whether they thought the pair of recordings was a same-speaker pair or a different-speaker pair. In panel (a) the listener almost always entered the number 100, and in panel (b) the listener always entered the number 2. A variant of this pattern was mostly using only two or three different numbers, e.g., in panel (c) the listener almost always entered 100 or 200. If these listeners were conscientiously engaged in the task, these results may reflect a strategy whereby they made a categorical decision on same-speaker versus different-speaker, picked a single value to represent that categorical decision, and potentially, if they were more or less certain about their decision, chose other values anchored on that first value. This pattern did not result in consistency in terms of C llr, D llr, or B llr values.
Finally, Fig. 11 shows example Tippett plots of results that were strongly biased toward the different-speaker hypothesis. As discussed in §3.2.4, this was a common pattern. It resulted in C llr values greater than 1, and large negative B llr values. Results similar to panel (a) were particularly common.
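For readers unfamiliar with Tippett plots, the curves can be computed from a listener's responses as empirical cumulative proportions. The sketch below assumes one common plotting convention in the forensic-voice-comparison literature (the same-speaker curve gives the proportion of same-speaker log likelihood ratios less than or equal to the value on the horizontal axis, and the different-speaker curve gives the proportion of different-speaker log likelihood ratios greater than or equal to that value); the function name and example responses are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right


def tippett_curves(same_speaker_lrs, different_speaker_lrs):
    """Return (same_curve, diff_curve) as lists of (log10 LR, cumulative proportion)."""
    same_logs = sorted(math.log10(lr) for lr in same_speaker_lrs)
    diff_logs = sorted(math.log10(lr) for lr in different_speaker_lrs)
    n_same, n_diff = len(same_logs), len(diff_logs)
    # Same-speaker curve: proportion of same-speaker values <= x (rises to the right).
    same_curve = [(x, bisect_right(same_logs, x) / n_same) for x in same_logs]
    # Different-speaker curve: proportion of different-speaker values >= x (rises to the left).
    diff_curve = [(x, (n_diff - bisect_left(diff_logs, x)) / n_diff) for x in diff_logs]
    return same_curve, diff_curve


# Hypothetical listener who often responds with a likelihood ratio of 1.
same_curve, diff_curve = tippett_curves([1, 1, 10, 100], [0.01, 0.1, 1, 1])
```

In this hypothetical example, half of the responses in each category sit at a log likelihood ratio of 0, so both curves reach a proportion of 0.5 at the centre of the plot, which is the kind of pattern described for Fig. 9.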

Listeners' beliefs about their own speaker-identification abilities
Both before and after the experiment, each participant was asked to indicate on a 5-point Likert scale how good they thought they were at identifying speakers in general and how good they thought they were at identifying adult male Australian-English speakers in particular. The levels on the scale were: 1. "very poor", 2. "poor", 3. "neutral", 4. "good", 5. "very good". Fig. 12, Fig. 13, and Fig. 14 show the responses from Australian-English, North-American-English, and Spanish-language listeners respectively. In each figure, the left panels show the responses for speakers in general, and the right panels show the responses for adult male Australian-English speakers in particular. In each figure, the top panels show the Likert-scale responses from before the experiment, the middle panels show the Likert-scale responses from after the experiment, and the bottom panels show the pairwise differences between the listeners' Likert-scale responses from before and after the experiment.
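The pairwise differences shown in the bottom panels, and the counts of listeners who rated themselves worse or better after the experiment, can be obtained with a simple tally. This is a minimal sketch: the sign convention (after minus before), the function name, and the example ratings are assumptions made for illustration.

```python
def tally_changes(before, after):
    """Tally per-listener changes in 5-point Likert self-ratings.

    before and after are equal-length lists in which position i holds the
    ratings given by listener i. Differences are computed as after minus
    before (an assumed convention), so a negative difference means the
    listener rated their ability lower after the experiment.
    """
    differences = [a - b for b, a in zip(before, after)]
    return {
        "differences": differences,
        "worse": sum(d < 0 for d in differences),
        "same": sum(d == 0 for d in differences),
        "better": sum(d > 0 for d in differences),
    }


# Hypothetical ratings from five listeners (1 = "very poor", 5 = "very good").
summary = tally_changes(before=[4, 3, 5, 2, 4], after=[3, 3, 4, 3, 2])
```

For these hypothetical ratings, three listeners rated themselves lower after the experiment, one gave the same rating, and one rated themselves higher.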
For Australian-English listeners, as would be expected, their responses were similar for speakers in general and for adult male Australian-English speakers in particular. For both types of speakers, approximately half the listeners indicated that they thought they were worse at speaker identification after the experiment than they thought they were before the experiment (26 of 53 listeners for speakers in general and 29 of 53 listeners for adult male Australian-English speakers in particular). For each type of speaker, only 2 listeners indicated that they thought they were better after the experiment than before. This suggests that, even without feedback on the correctness of their answers, the experience of performing the task made many of the Australian-English listeners think that they had initially overestimated their speaker-identification abilities.
For North-American-English listeners, their initial responses indicated that they were less confident in their ability to identify adult male Australian-English speakers in particular than in their ability to identify speakers in general. This suggests that the listeners were aware of the potential impact of accent familiarity on speaker identification. For speakers in general, approximately half the listeners (32 of 61 listeners) indicated that they thought they were worse at speaker identification after the experiment than they thought they were before the experiment. For adult male Australian-English speakers in particular, this was the case for approximately two-thirds of listeners (39 of 61 listeners). For both types of speakers, only a few listeners indicated that they thought they were better after the experiment than before (7 of 61 listeners for speakers in general and 9 of 61 listeners for adult male Australian-English speakers in particular). This suggests that, even without feedback on the correctness of their answers, the experience of performing the task made many of the North-American-English listeners think that they had initially overestimated their speaker-identification abilities.
For Spanish-language listeners, their initial responses indicated that they were less confident in their ability to identify adult male Australian-English speakers in particular than in their ability to identify speakers in general. Also, they were initially less confident in their ability to identify adult male Australian-English speakers than were the North-American-English listeners. This suggests that the Spanish-language listeners were aware of the potential impact of language familiarity on speaker identification. For speakers in general, approximately one-third of the listeners (19 of 55 listeners) indicated that they thought they were worse at speaker identification after the experiment than they thought they were before the experiment. For adult male Australian-English speakers in particular, this was the case for more than two-fifths of listeners (23 of 55 listeners). For both types of speakers, only a few listeners indicated that they thought they were better after the experiment than before (2 of 55 listeners for speakers in general and 6 of 55 listeners for adult male Australian-English speakers in particular). This suggests that, even without feedback on the correctness of their answers, the experience of performing the task made many of the Spanish-language listeners think that they had initially overestimated their speaker-identification abilities for adult male Australian-English speakers. This experience, however, made fewer of the Spanish-language listeners think that they had overestimated their speaker-identification abilities for speakers in general. A similar differential was observed for North-American-English listeners.
This suggests that these listeners attributed the greater-than-expected difficulty of the task in part to language unfamiliarity or accent unfamiliarity, with language unfamiliarity diminishing their confidence in their speaker-identification abilities in general less than accent unfamiliarity did.
Before the experiment, each individual listener was asked to respond to the question: Australian-English speakers, what percentage of the pairs do you think you would get "right", i.e., if they were recordings of the same speaker you would say they were recordings of the same speaker and if they were recordings of different speakers you would say they were recordings of different speakers? Count saying "can't decide" as incorrect.
For each listener, we ignored the magnitudes of their responses and calculated their actual correct-classification rate as the proportion of recording-pairs for which they entered a value greater than one into the correct box (the first box if the recording pair was a same-speaker pair, and the second box if the recording pair was a different-speaker pair). Responses equal to one were counted as errors. This approximates a situation in which listeners had ternary response options: "same speaker", "different speaker", and "don't know". 43 Fig. 15 plots each listener's own initial estimate of their correct-classification rate against their actual correct-classification rate. If a data point is above the diagonal line, the listener overestimated their speaker-identification ability. If a data point is below the diagonal line, the listener underestimated their speaker-identification ability. The vertical distance to the diagonal line indicates the amount by which they overestimated or underestimated their ability. The heavy vertical line represents the correct-classification rate for the forensic-voice-comparison system, 87 %, which was calculated using equal priors and a posterior-odds threshold of 1. 44 If a data point is to the left of the vertical line, the listener's correct-classification rate was worse than that of the forensic-voice-comparison system.
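The scoring rule described above can be sketched in a few lines. The data layout (tuples recording the pair type, the box used, and the value entered), the function names, and the example values are hypothetical. For the system, equal priors mean that the posterior odds equal the likelihood ratio, so a threshold of 1 on the posterior odds is a threshold of 1 on the likelihood ratio; treating a likelihood ratio of exactly 1 as an error is an assumption made here to mirror the rule used for listeners.

```python
def listener_correct_classification_rate(responses):
    """responses: list of (is_same_speaker_pair, box, value) tuples.

    box is "same" (first box) or "different" (second box). A response is
    correct only if a value greater than one was entered in the box that
    matches the true pair type; a value of one counts as an error.
    """
    correct = 0
    for is_same_pair, box, value in responses:
        correct_box = "same" if is_same_pair else "different"
        if box == correct_box and value > 1:
            correct += 1
    return correct / len(responses)


def system_correct_classification_rate(trials):
    """trials: list of (is_same_speaker_pair, likelihood_ratio) tuples.

    With equal priors the posterior odds equal the likelihood ratio, so the
    decision is "same speaker" if LR > 1 and "different speaker" if LR < 1.
    """
    correct = 0
    for is_same_pair, lr in trials:
        if (is_same_pair and lr > 1) or (not is_same_pair and lr < 1):
            correct += 1
    return correct / len(trials)


# Hypothetical data, for illustration only.
listener_rate = listener_correct_classification_rate([
    (True, "same", 100),      # correct
    (True, "different", 2),   # wrong box
    (False, "different", 10), # correct
    (False, "same", 1),       # value of one counts as an error
    (True, "same", 1),        # value of one counts as an error
])
system_rate = system_correct_classification_rate([
    (True, 50.0), (True, 0.5), (False, 0.02), (False, 3.0),
])
```

Because magnitudes are ignored, a listener who enters 2 in the correct box is scored the same as one who enters 1000, which is why this rate is used only for comparison with listeners' self-estimates.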
All the individual listeners' correct-classification rates were worse than the correct-classification rate for the forensic-voice-comparison system. In terms of correct-classification rates, some listeners' estimates of their speaker-identification abilities were close to their actual abilities, but others substantially under- or overestimated their abilities. 45 There was substantial inter-listener variation.
Most Australian-English listeners overestimated their speaker-identification abilities, some by large amounts. None underestimated their abilities by a large amount. Overconfidence due to familiarity with the accent of the speakers is a potential explanation for this pattern of results.

43 Actual behaviour given those response options could differ.

44 We do this only for the purpose of being able to make a comparison with the responses given by participants to a question that could be asked without requiring a lot of explanation. We would not present correct-classification rates (or classification-error rates) in the context of a legal case.

45 The question asked before the experiment did not specify what the recording conditions would be or that there would be a mismatch in recording conditions, so some of the overconfidence could have been due to participants' assuming high-quality recording conditions, and they might have actually performed better on high-quality recordings. One listener (who took part in a slightly modified version of the experiment not otherwise reported in the present paper) sent us a comment stating: "I am 100 % sure I overestimated my ability to discern the voices SOLELY because I did not know the conditions of the sound recordings yet. Had I had a sample of what they would sound like FIRST, then I would have estimated 15 % or lower. were this in a USA court of law and I were a juror (I have sat on two juries in my lifetime.) I would immediately throw out this evidence and 100 % discard it." It turned out, however, that this listener did not substantially overestimate their correct-classification rate: their estimated correct-classification rate was 80 % and their actual correct-classification rate was 72 % (their C llr was 0.84).
In contrast, some North-American-English listeners and some Spanish-language listeners overestimated their speaker-identification abilities by large amounts and some underestimated their speaker-identification abilities by large amounts. The apparent underconfidence of the latter listeners suggests that they were aware of the potential impact of accent or language familiarity on speaker identification.

• 32 identified as females and 23 as males
• 55 identified as first-language English speakers
• 3 stated that they were "very familiar", 34 that they were "somewhat familiar", and 20 that they were "not familiar" with Australian English

Fig. 16, Fig. 17, and Fig. 18 provide violin plots of the C llr, D llr, and B llr values resulting from individual North-American-English listeners' responses in the original condition and (other) North-American-English listeners' responses in the condition in which they were provided with the likelihood-ratio values output by the forensic-voice-comparison system.

Performance metrics
The performance of the participants who (in addition to being able to listen to the recordings) were provided with the likelihood-ratio output of the forensic-voice-comparison system was better than the performance of the participants who only listened to the recordings. The distribution of their C llr values was substantially lower. Their D llr values and B llr values were also better.
In terms of C llr , no participant outperformed the stand-alone forensic-voice-comparison system. The best performance was from participants who always responded with values that were close to the likelihood-ratio values output by the forensic-voice-comparison system. No participant entered the exact likelihood-ratio values output by the forensic-voice-comparison system. The lowest C llr value for a participant's responses was 0.43 (the C llr value for the stand-alone forensic-voice-comparison system was 0.42). In terms of correct-classification rate, one participant equalled the performance of the forensic-voice-comparison system at 87 % and one exceeded it at 89 %. All others performed worse than the forensic-voice-comparison system.
The latter results replicate the pattern observed in Matějka et al. [13] (described in §1.3) in which participants could both listen to recordings and consider the output of an automatic-speaker-recognition system. In that study, only 1 of 10 participants outperformed the stand-alone automatic-speaker-recognition system. The full results also replicate the pattern observed in other domains in which an algorithm alone outperforms humans who can adjust the output of the algorithm using their own subjective judgment, who in turn outperform humans who are not exposed to the algorithm and rely only on their own subjective judgment, e.g., Dietvorst et al. [53].

General discussion and conclusion
Expert testimony is only admissible in common law if it will potentially assist the trier of fact to make a decision that they would not be able to make unaided. If the trier of fact's speaker identification were equally accurate or more accurate than a forensic-voice-comparison system, then testimony based on the output of the forensic-voice-comparison system would not be admissible.
We tested the accuracy of speaker identification by individual lay listeners. This was intended to be informative with respect to a context in which a judge attempts to identify a speaker. The pairs of recordings that we used for testing reflected the conditions of the questioned-speaker and known-speaker recordings in an actual case. The accuracy of individual listeners' responses was compared with the accuracy of likelihood-ratio values output by E³FS³, a forensic-voice-comparison system that is based on state-of-the-art automatic-speaker-recognition technology. There was large inter-listener variation, but all listeners performed worse than the forensic-voice-comparison system. The lowest C llr for a listener's responses was 0.51, which was substantially worse than the C llr of 0.42 for likelihood-ratio values output by the forensic-voice-comparison system. In addition, more than half of the listeners' responses resulted in C llr ≥ 1, i.e., they performed worse than a system that provided no useful information.
Based on these results, at least under the particular case conditions tested, we infer that the forensic-voice-comparison system would satisfy the admissibility criterion of being more accurate than speaker identification performed by a judge. Also taking into consideration the results of previous research (which was summarized in §1.3), we think it is reasonable to extrapolate this inference to other recording conditions.
Given that forensic voice comparison based on state-of-the-art automatic-speaker-recognition technology outperforms speaker identification by individual listeners, we argue that judges should not attempt to perform speaker identification and should instead rely on expert testimony that is based on a validated forensic-voice-comparison system. We tested a condition in which individual listeners could attempt speaker identification based on listening to the pairs of recordings and could also consider the likelihood-ratio values output by the forensic-voice-comparison system in response to the same pairs of recordings. In terms of C llr, no participant outperformed the stand-alone forensic-voice-comparison system (in terms of correct-classification rate, only 1 of 55 participants outperformed the stand-alone forensic-voice-comparison system). We therefore argue that judges should rely exclusively on expert testimony that is based on a validated forensic-voice-comparison system; they should not attempt to supplement it by performing their own speaker identification, as this will almost certainly lead to a less accurate result.
Based on the results of the present research (and the results of past research), we also infer that forensic voice comparison based on state-of-the-art automatic-speaker-recognition technology would be more accurate than speaker identification performed by a police officer or an interpreter. 46 We therefore argue that, when both questioned-speaker and known-speaker recordings are available or obtainable, a trier of fact (judge or jury) should not be presented with lay speaker-identification testimony or "ad hoc expert" speaker-identification testimony, and should instead be presented with expert testimony based on a validated forensic-voice-comparison system.
Edmond [5] provides additional arguments for why judges and juries should not attempt to perform their own speaker identification and why they should not be presented with and should not consider lay or "ad hoc expert" speaker-identification testimony. Judges and juries are invited to perform their own speaker identification in the suggestive context of the accusatorial trial. In many cases, they also hear from police officers and interpreters who do not use validated methods and do not manage their own exposure to task-irrelevant information. Other evidence in the case contaminates the trier of fact's speaker-identification judgement and contaminates the speaker-identification testimony of lay and "ad hoc expert" witnesses, but the speaker-identification judgement and testimony are then treated as independent support for the evidence that contaminated them, and the speaker-identification judgement by the trier of fact is treated as independent support for the speaker-identification testimony that contaminated it.
The experiments conducted for the present study were decontextualized in that they were not embedded in case contexts that could potentially bias the listeners. Listeners' responses were biased in favour of the different-speaker hypothesis. This may have been because of the poor recording conditions, including the mismatch in conditions between the questioned-speaker-condition and known-speaker-condition recordings. These would have made the members of each pair of recordings sound more different from one another than had they both been high-quality recordings. In a future paper, we plan to report on experiments in which we provide contextual information that could potentially debias or differently bias the results, and on experiments in which we present high-quality versions of the recordings.
We tested individual listeners with different language and accent backgrounds: listeners who were familiar with both the language and accent spoken by the speakers (Australian-English listeners), listeners who were familiar with the language but less familiar with the accent (North-American-English listeners), and listeners who were less familiar with the language (Spanish-language listeners). The results were in accord with expectations based on previous research: the Australian-English listeners performed better than the North-American-English listeners, who in turn performed better than the Spanish-language listeners. Based on these results, speaker identification by judges who are unfamiliar with the language or accent spoken should be of even greater concern than when they are familiar with the language and accent spoken. We have, however, already argued that judges (and juries) should not attempt to perform speaker identification, even when the language and accent spoken are familiar to them, and that they should instead rely on expert testimony that is based on a validated forensic-voice-comparison system. For forensic voice comparison, the language and accent spoken are part of the specification of the relevant population adopted for the case. Assuming that data representative of the relevant population are available or obtainable, a forensic-voice-comparison system can be trained, calibrated, and validated for speakers speaking the language and accent of interest in the case; see the Consensus on validation of forensic voice comparison [49].

46 We remind the reader of the definitions provided in footnote 1. "Speaker identification" by humans refers to a situation in which the listener hears the voice of an unfamiliar speaker of questioned identity and hears the voice of an unfamiliar speaker of known identity, and makes a judgement as to whether the two voices belong to the same speaker or to different speakers. This is not the same as "speaker recognition" by humans, which refers to a situation where a listener hears the voice of a speaker and makes a judgement as to whether it is the voice of a speaker who is familiar to them or not, and if the listener states that the voice is that of a speaker who is familiar to them they usually also state the name of the speaker. The research presented in the present paper relates to "speaker identification", not to "speaker recognition". The present discussion also relates to "speaker identification", not to "speaker recognition".
Previous research has suggested that listeners overestimate their own speaker-identification abilities (and overestimate other listeners' speaker-identification abilities). Both before and after the experiment, we asked listeners to indicate how good they thought they were at speaker identification in general and speaker identification of adult male Australian-English speakers (the type of speakers in the experiment). About half the listeners indicated that they thought their ability to identify speakers was worse after the experiment than they thought it was before the experiment, and few indicated that they thought it was better. Even without feedback on the correctness of their responses, the experience of taking part in the experiment appears to have made the former listeners realize that the task is more difficult than they initially believed it to be. For Australian-English listeners, the magnitude of this effect was about the same when they were asked about speakers in general and about adult male Australian-English speakers in particular, but for North-American-English listeners and Spanish-language listeners, the magnitude of the effect was less for speakers in general than for adult male Australian-English speakers. This suggests that the latter listeners tended to attribute the difficulty of the task, at least in part, to the unfamiliar accent or to the unfamiliar language, and thus they remained relatively confident about their speaker-identification abilities in general. Even before the experiment, North-American-English listeners and Spanish-language listeners tended to indicate that they thought they were worse at identifying adult male Australian-English speakers than identifying speakers in general, with the magnitude of the difference being greater for the Spanish-language listeners. This suggests that these listeners were already aware of the difficulty due to accent unfamiliarity and the greater difficulty due to language unfamiliarity.
Before the experiment, we asked listeners to estimate what their correct-classification rate would be for identifying adult male Australian English speakers, and we later compared their estimates with their actual correct-classification rates. Some listeners' estimates were close to their actual correct-classification rates, but others substantially overestimated or underestimated. The interlistener variability (including within-language-background interlistener variability) was such that listeners' estimated correct-classification rates could not be used as reliable indicators of actual correct-classification rates. Some of the overestimation may have been due to listeners expecting to hear high-quality recordings rather than the poor-quality and mismatched-condition recordings that they did hear, and listeners may actually have performed better on high-quality recordings. The recordings presented to them did, however, reflect the conditions of recordings in an actual case. Until they experience attempting to perform speaker identification with case-condition recordings, listeners may not appreciate the difficulty due to the conditions, and listening to a single pair of recordings, as may occur in the context of a case, might not provide sufficient experience. Australian-English listeners tended to overestimate what their correct-classification rates would be, but North-American-English listeners and Spanish-language listeners both overestimated and underestimated.
The general warning that listeners often substantially overestimate their speaker-identification abilities should therefore be modified to a warning that listeners often substantially overestimate their speaker-identification abilities when listening to speakers of a language and accent with which they are familiar, but could substantially overestimate or substantially underestimate their speaker-identification abilities when listening to speakers of a language or accent with which they are less familiar or not familiar. In general, listeners' estimates of their own accuracy should not be taken as indicative of their actual accuracy.
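The comparison between estimated and actual correct-classification rates described above can be illustrated with a minimal Python sketch. It is not the analysis code used for the present paper. It assumes that each listener's responses have already been reduced to binary same-speaker or different-speaker decisions, and the names `ListenerRecord`, `estimation_error`, and `describe`, as well as the 0.05 tolerance for treating an estimate as "close", are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ListenerRecord:
    """Hypothetical container for one listener's data (illustrative only)."""
    listener_id: str
    estimated_rate: float        # self-estimated correct-classification rate, 0 to 1
    decisions: Sequence[bool]    # True = listener decided "same speaker"
    truths: Sequence[bool]       # True = the pair really was same-speaker


def correct_classification_rate(decisions: Sequence[bool],
                                truths: Sequence[bool]) -> float:
    """Proportion of pairs for which the decision matches the truth."""
    if len(decisions) != len(truths) or len(truths) == 0:
        raise ValueError("decisions and truths must be non-empty and equal in length")
    n_correct = sum(d == t for d, t in zip(decisions, truths))
    return n_correct / len(truths)


def estimation_error(record: ListenerRecord) -> float:
    """Estimated rate minus actual rate; positive values mean overestimation."""
    actual = correct_classification_rate(record.decisions, record.truths)
    return record.estimated_rate - actual


def describe(record: ListenerRecord, tolerance: float = 0.05) -> str:
    """Label a listener's self-estimate; the tolerance is an assumed value."""
    error = estimation_error(record)
    if error > tolerance:
        return "overestimated"
    if error < -tolerance:
        return "underestimated"
    return "close"


if __name__ == "__main__":
    # Made-up example data, not results from the experiment.
    example = ListenerRecord(
        listener_id="example-listener",
        estimated_rate=0.9,
        decisions=[True, True, False, False],
        truths=[True, False, False, True],
    )
    print(example.listener_id, describe(example))
```

Applied per listener and grouped by language background, a summary of this kind would show whether estimation errors fall consistently on one side of zero or on both sides.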
In conclusion:
• Is forensic voice comparison based on state-of-the-art automatic-speaker-recognition technology more accurate than speaker identification by individual lay listeners?
  • Yes.
• Can individual lay listeners outperform forensic voice comparison based on state-of-the-art automatic-speaker-recognition technology by considering the likelihood-ratio output of the forensic-voice-comparison system and also performing their own speaker identification?
  • No.
• Should judges attempt to perform their own speaker identifications in addition to considering likelihood ratios output by validated forensic-voice-comparison systems?
  • No. They should rely exclusively on expert testimony based on validated forensic-voice-comparison systems.
• Should judges rely on speaker identification performed by lay or "ad hoc expert" listeners?
  • No. They should rely exclusively on expert testimony based on validated forensic-voice-comparison systems.
The experiments reported in the present paper were conducted with individual lay listeners. They were intended to be informative of a context in which an individual judge attempts to perform speaker identification. In a future paper, we will report on experiments conducted with groups of twelve listeners acting collaboratively. Those experiments are intended to be informative of a context in which a group of jury members collaboratively attempt to perform speaker identification.

Disclaimer
All opinions expressed in the present paper are those of the authors, and, unless explicitly stated otherwise, should not be construed as representing the policies or positions of any organizations with which the authors are associated.

Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Dr Morrison is Director and Forensic Consultant for Forensic Evaluation Ltd. Dr Weber has worked as a contractor for Forensic Evaluation Ltd. Forensic Evaluation Ltd charges clients fees to perform forensic-voice-comparison evaluations, and to submit reports and testify in court regarding forensic voice comparison, and regarding speaker recognition and speaker identification by laypersons.