
Measuring the Accuracy of Automatic Speech Recognition Solutions

Published: 09 January 2024


Abstract

For d/Deaf and hard of hearing (DHH) people, captioning is an essential accessibility tool. Significant developments in artificial intelligence mean that automatic speech recognition (ASR) is now a part of many popular applications. This makes creating captions easy and broadly available—but transcription needs high levels of accuracy to be accessible. Scientific publications and industry report very low error rates, claiming that artificial intelligence has reached human parity or even outperforms manual transcription. At the same time, the DHH community reports serious issues with the accuracy and reliability of ASR. There seems to be a mismatch between technical innovations and the real-life experience of people who depend on transcription. Independent and comprehensive data is needed to capture the state of ASR. We measured the performance of 11 common ASR services with recordings of Higher Education lectures. We evaluated the influence of technical conditions like streaming, the use of vocabularies, and differences between languages. Our results show that accuracy ranges widely between vendors and for the individual audio samples. We also measured a significantly lower quality for streaming ASR, which is used for live events. Our study shows that despite the recent improvements in ASR, common services lack reliability in accuracy.


1 INTRODUCTION

Transcription is important to make spoken language accessible to d/Deaf and Hard of Hearing (DHH) individuals. With the rapid development of Artificial Intelligence (AI), astounding results in the field of Automatic Speech Recognition (ASR) are reported, particularly in terms of accuracy [Radford et al. 2022]. Fully automated captioning is now included in online meeting tools, video streaming platforms, and presentation software. ASR is scalable, cheap and offers a solution for the increasing amount of digital content and the resulting demand for accessibility. But transcripts and captions generated using ASR are only useful for those who depend on them if they have very high levels of accuracy. This requirement is seemingly fulfilled—scientific publications report extremely low error rates on transcription tasks [Baevski et al. 2020], and vendors promise “state-of-the-art accuracy” [Google 2023]. At the same time, the National Association of the Deaf (NAD) has filed a petition in the United States to improve ASR-based captioning because people have reported serious issues with the correctness, timing and completeness of captions [Blake 2019]. There seems to be a mismatch between reports of AI research and the lived experience of the DHH community [Butler 2019; Kafle and Huenerfauth 2016; Kawas et al. 2016; National Deaf Center on Postsecondary Outcomes 2020]. The NAD proposes to regulate the use of ASR and the applied metrics to specify the accuracy of captions and transcripts. We therefore need independent, descriptive data about the accuracy of ASR and whether the achieved accuracy level is sufficient to make captions accessible.

The accuracy of ASR is commonly measured using the Word Error Rate (WER), which quantifies the number of errors in a transcript relative to a reference solution. One main aim of research and development in ASR is to reduce the WER on benchmark datasets through the application of new technological innovations, such as End-to-End (E2E) models or self-supervised learning. Some of these tests produce very low error rates that even outperform those achieved through manual transcription [Zhang et al. 2022]. But these high reported accuracy rates might not be a reliable indicator of the general quality of ASR. Training and optimisation for specific datasets bear the risk of overfitting, and the obtained results might not be a good indicator of performance on novel datasets [Geirhos et al. 2020]. Many sources show ASR’s volatility with different datasets: ASR performs differently for male or female speakers, and shows racial biases or varying accuracy rates for speakers with different accents [Cumbal et al. 2021; Koenecke et al. 2020; Speechmatics 2023; Tadimeti et al. 2022; Tatman and Kasten 2017]. These studies show that ASR has specific biases, which often affect speakers of under-represented groups, but they are not concerned with the overall accuracy of the engines. This baseline data is still missing.

Whereas research refers to the WER, most commercial cloud providers that offer ASR services avoid providing specific accuracy rates. Instead they use indeterminate claims like “high-quality transcription” [Microsoft 2023], “produce accurate transcripts” [Amazon 2023], or “state-of-the-art accuracy” [Google 2023]. Government institutions and public organisations also remain vague in their requirements for the accuracy levels for captions. The Federal Communications Commission (FCC) requires captions to match the spoken words to the fullest extent possible, and additionally states that their rules on captioning distinguish between pre-recorded media and live transcriptions [Federal Communications Commission 2014]. The Web Content Accessibility Guidelines (WCAG) only specify that captions are provided, without clearly defined requirements for accuracy [Adams et al. 2022]. However, the W3C Web Accessibility Initiative (WAI) is a bit more specific by stating that ASR-generated captions do not meet accessibility requirements unless they can be confirmed to be fully accurate [Lawton Henry et al. 2022].

To progress in establishing accessibility standards, we need to make the accuracy of ASR more tangible. This article provides a comprehensive and independent overview of the WER for a variety of common ASR services. The aim is to capture the accuracy of the current generation of ASR models. We do not promote specific vendors, as their individual models are subject to change. We created a novel dataset that attempts to reflect the performance of ASR with minimal bias and without targeting specific weaknesses. To ensure a high comparability of results, the transcription process was fully automated, using equivalent configurations across different vendors. The WER was calculated after extensive text normalisation to reduce errors caused by non-semantic differences. Our results may support future research in choosing realistic error rates, particularly for qualitative evaluations and the development of accessibility metrics.


2 BACKGROUND

The accuracy of ASR rapidly increased with the development of machine learning, particularly through deep neural networks [Meta AI 2023]. At the time of this writing, most commercial services use hybrid systems that consist of multiple individual models (e.g., acoustic, language and lexical) [Li 2022]. In these hybrid models, the input of a model depends on the output of the previous model in the pipeline. Recently, a new generation of ASR systems has reported a further increase in accuracy through the use of end-to-end (E2E) models [Baevski et al. 2020]. In contrast to hybrid models, a single objective function is used to optimise the whole network. E2E training is especially interesting for academic research, as no expert knowledge of specific components like the acoustic model is required. This approach also offers some benefits for industry, as it simplifies the whole ASR pipeline and makes it easier to generate larger training datasets through self- or semi-supervised learning [Chung et al. 2021; Xu et al. 2021; Zhang et al. 2022].

2.1 Transcription with ASR

Manually creating accurate transcripts for spoken language is a laborious task, as speech is up to 10 times faster than typing [Wald 2006]. Transcription of real-time events is even more challenging, as there is no time to replay audio, and there is only a small window of time to correct texts. Thus, trained professionals are needed to accurately transcribe live events. A common method is Communication Access Realtime Translation (CART), where the transcript of a professional typist is presented live on a screen. To keep up with the speed of speech, special phonetic keyboards or stenography are used. An alternative method is “respeaking”, in which professionals speak the speaker’s words into a trained ASR system and correct the resulting text using a keyboard. As both methods are very demanding, multiple professionals are required for longer events to give the operators rest periods. Besides the complexity of this task and high operational costs, a big issue is the limited availability of these professionals. At the same time, as more digital content is produced in the form of online presentations and remote meetings, the demand for transcription of events is increasing.

Fully automated transcription through ASR appears to be an effective solution. YouTube creates automatic subtitles for videos, Zoom offers real-time captioning in meetings, and PowerPoint enables automatic transcription during presentations. ASR is rather cheap, has few limitations regarding availability or logistics, and requires no working breaks. Popular cloud providers and specialised companies offer ASR as an on-demand service, and multiple open source models are publicly available. They support a wide range of languages, batch and real-time transcription, or even adding custom vocabularies for a presentation, to detect infrequent words, acronyms or technical terms.

It is important to note that the results of ASR and professional transcribers are quite different due to their inherent strengths and weaknesses [Romero-Fresco 2009]. AI is good at keeping up with the speed of speech and creates almost verbatim transcripts. The number of words in an ASR-generated transcript is almost identical to the number of words spoken. But AI can produce confusing words or sentences by mistaking homonyms or hallucinating text. This can severely impair understanding, particularly for people who have to rely on captions. Humans, in contrast, create less verbatim transcripts that concentrate on conveying the gist of the text, and even professional respeakers lag around 20 to 40 words per minute behind the actual speakers [Romero-Fresco 2009]. Even though such a transcript might be easier to read than results from ASR, it can also miss important information.

Although more summarised captions may be selected in certain situations, such as children’s programs or when the speed of the caption presentation is very high, the majority of DHH individuals tend to favour the comprehensive access offered by verbatim texts [National Institute on Deafness and Other Communication Disorders 2017]. The FCC also requires that offline captions must be verbatim, and paraphrasing should be minimised for real-time transcription [Federal Communications Commission 2014]. Besides text accuracy, accessible transcription incorporates other issues like the placement of captions, correct speaker identification, or the presence of relevant non-speech information [Federal Communications Commission 2014; Lawton Henry et al. 2022].

2.2 Benchmarking ASR

Evaluation of an ASR system heavily depends on the use case, and various factors have to be considered [Aksënova et al. 2021]. A voice control application might focus on keyword spotting and correctly transcribing addresses or phone numbers. A transcription service for conversations and meetings, however, targets a high overall text accuracy. The speed of an ASR system is another critical factor. Real-time transcription aims for a short latency between the input audio stream and the text output. It is typically measured as the real-time factor [Liu 2000]. Another factor is the stability of partial results that can be transmitted before the final result in streaming ASR. Volatile intermediate transcripts result in many text changes, which creates a negative user experience. There is no universal metric yet, but Shangguan et al. [2020] propose to measure the stability as an unstable partial word ratio (UPWR) and Baumann et al. [2009] suggest an Incremental WER.
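As a simple illustration of the speed measure, the real-time factor is the ratio of processing time to audio duration; the function below is our own sketch, not taken from the cited work.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor (RTF): values below 1 mean the system transcribes faster than real time."""
    return processing_seconds / audio_seconds

# Example: a 180-second sample processed in 45 seconds yields an RTF of 0.25.
print(real_time_factor(45.0, 180.0))
```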

The quality of ASR in poor acoustic environments, for example noisy recordings, has greatly improved [Kinoshita et al. 2020]. But besides technical improvements in specific areas and the evolution of machine learning in general, the quality and amount of training data is the major factor influencing the performance of an ASR system. Using more high-quality training data increases the general accuracy of an ASR system. Although public datasets like the Switchboard corpus [Godfrey et al. 1992] have been available since the 1990s, development is driven by additional and new datasets like LibriSpeech [Panayotov et al. 2015], and large multilingual audio recordings like Common Voice [Ardila et al. 2020].

The LibriSpeech corpus [Panayotov et al. 2015] is a widely used benchmark in science and industry. At its release in 2015, the first models reported a WER of 13.25%. Within only 6 years, the most accurate models could decrease the WER down to 2.5% [Meta AI 2023]. However, Geirhos et al. [2020] show that fine-tuning a model for a specific dataset carries the danger of shortcut learning, where a deep neural network exploits weaknesses in the training data. A model that reports excellent results on one speech corpus might perform worse on other corpora. Chan et al. [2021] show that training a model on multiple public speech datasets results in comparable accuracy but higher robustness than models trained on a single source. Correspondingly, Gandhi et al. [2022] propose to evaluate ASR systems on multiple public speech datasets, resulting in an average End-to-End Speech Benchmark (ESB) score.

2.3 Measuring Transcription Accuracy

Research in speech recognition typically uses the WER to measure the accuracy of ASR systems. It is based on the minimum edit distance between a transcript and the reference solution (a.k.a. ground truth), and quantifies the number of errors relative to the total number of words of a text. But the WER has been criticised, as it does not reflect text understanding and only weakly correlates with human judgement of a transcript’s quality [Favre et al. 2013; Mishra et al. 2011; Wang et al. 2003]. Additional measures like the Character Error Rate (CER), the match error rate [Morris et al. 2004], or the weighted word error rate (WWER) [Apone et al. 2010] have the same underlying problem, as they also quantify the editing distance between two texts. Recent approaches try to measure the accuracy with AI, as Kafle and Huenerfauth [2017] initially proposed with the Automated-Caption Evaluation (ACE) and extended in a second study [Kafle and Huenerfauth 2019]. But Wells et al. [2022] could not find a significant difference between ACE and WER. Although AI-based metrics have the potential to better align with human judgement than edit-distance metrics, they also bear the risk of a bias related to their training data. ACE, for example, is trained on conversational speech and thus might not be suited for other scenarios like Higher Education or Entertainment.
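In its standard form (our notation, following the edit-distance definition above rather than any of the cited works), the WER is

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

where S, D and I are the numbers of substituted, deleted and inserted words with respect to the reference, and N is the number of words in the reference.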

Another metric, the NER model, was developed by Romero-Fresco and Pérez [2015] to measure the accuracy of respeaking. It requires a manual analysis of the transcript that classifies errors by their severity. As it focuses on respeaking, it also distinguishes between recognition and editing errors, to provide feedback for the respeaker. A drawback of the NER model is that qualitative measures are subjective and hard to obtain on large datasets (e.g., training data for ASR).
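The underlying score is a simple ratio, reproduced here in its commonly published form rather than taken from this article, so the details should be checked against Romero-Fresco and Pérez [2015]:

```latex
\mathrm{NER} = \frac{N - E - R}{N} \times 100\%
```

where N is the number of words in the respoken text, E the severity-weighted edition errors and R the severity-weighted recognition errors.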

Despite the criticism, the WER is still useful to evaluate and compare ASR systems. As these systems produce verbatim transcripts, a word-by-word comparison is suitable and an analysis in terms of the content becomes less relevant. The lower the WER is, the closer the produced transcript is to the actual text—fewer errors should therefore result in high levels of understandability. However, the WER is very vulnerable to non-semantic differences due to text formatting, such as errors resulting from incorrect capitalisation or punctuation. Table 1 shows an example sentence and the resulting WER for different text normalisations: None, Common (removal of capitalisation and punctuation) and the Whisper normaliser.

Normalisation | Transcript | Sentence | WER
None | Ground Truth | Hactar analysed, that the answer is not forty two! | -
None | Hypothesis | hactar analyzed that the answer isn't 42. | 67%
Common | Ground Truth | hactar analysed that the answer is not forty two | -
Common | Hypothesis | hactar analyzed that the answer isnt 42 | 56%
Whisper | Ground Truth | hactar analyzed that the answer is not 42 | -
Whisper | Hypothesis | hactar analyzed that the answer is not 42 | 0%

Table 1. WER for Different Degrees of Text Normalisation

To reduce the impact of these formatting differences, it is common to pre-process transcripts before the calculation of the WER. Typically, the text is transformed to lowercase and all punctuation is removed. Koenecke et al. [2020] introduce additional replacements for abbreviations and colloquial speech specific to the dataset used in their study. Radford et al. [2022] developed an extensive normaliser for English that unifies UK and U.S. spelling differences, replaces common contractions, and converts written numbers. Errors from capitalisation or grammatical cases are more common in languages other than English, but it is unclear whether these have an impact on the understandability of the text. The FCC requires that capitalisation and punctuation are included in accurate captions, as these influence an individual’s understanding of the text [Federal Communications Commission 2014].
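To illustrate the effect, the following sketch recomputes the example from Table 1 with the jiwer package and the Whisper normaliser; it mirrors the idea of our evaluation but is not the actual evaluation code.

```python
import jiwer
from whisper.normalizers import EnglishTextNormalizer  # shipped with the openai-whisper package

reference = "Hactar analysed, that the answer is not forty two!"
hypothesis = "hactar analyzed that the answer isn't 42."

# No normalisation: capitalisation, punctuation and spelling variants all count as errors.
print(jiwer.wer(reference, hypothesis))

# Common normalisation: lowercase the text and strip punctuation before comparing.
common = jiwer.Compose([jiwer.ToLowerCase(), jiwer.RemovePunctuation()])
print(jiwer.wer(common(reference), common(hypothesis)))

# Whisper normalisation: additionally unifies spelling variants, contractions and numbers,
# which reduces the WER of this example to 0% (cf. Table 1).
normalise = EnglishTextNormalizer()
print(jiwer.wer(normalise(reference), normalise(hypothesis)))
```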

2.4 Comparison of ASR Services

There is a lot of research that compares different ASR services with a focus on bias regarding gender, race or age. Koenecke et al. [2020] report a WER for Black American speakers that is almost twice as high as for White American speakers. Regardless of the individual accuracy, these racial disparities exist across all tested vendors. They see the main problem in the performance gap of the acoustic model, due to insufficient training data featuring Black American speakers. Tatman and Kasten [2017] also note that the accuracy of ASR systems depends on sociolinguistic factors and is worse for Black Americans compared to White Americans.

Obviously, ASR does not distinguish between skin colours but reflects biases in the training data. ASR therefore performs worse for speakers of groups that are under-represented in the training data. As a result, the accuracy decreases for speakers with regional accents or second language learners. Tadimeti et al. [2022] report a performance gap between general American and non-American accents. This does not only apply to English, as Cumbal et al. [2021] show by comparing native and non-native speakers of Swedish. Bias in ASR applies to all kinds of dimensions. Catania et al. [2019] show that emotional speech decreases accuracy compared to neutral speech. The company Speechmatics [2023] reports that the age of the speakers is also a factor: ASR shows the highest accuracy for speakers between 28 and 36 years of age, and the highest error rates for the oldest group, aged 60 to 81 years.

Other studies compare ASR systems in general, without focusing on a specific bias. Addlesee et al. [2020] tested three common cloud providers with the Switchboard corpus and reported results between 5.1% and 6.8% WER. Ballenger [2022] transcribed lecture recordings of adult educational content on five common services and reports a WER ranging between 12% and 31%. Researchers from Microsoft report a WER of 5.1% on the Switchboard corpus [Xiong et al. 2018]. Testing the same dataset, a research group from IBM reported a WER of 5.6% [Saon et al. 2017]. These values are identical to Addlesee et al.’s [2020] results for Microsoft and IBM on the same corpus. Researchers from Google report a WER of 5.6% on an internal dataset of 12,500 hours of Google Voice Search utterances [Chiu et al. 2018]. In 2022, OpenAI released the open source ASR model Whisper. In the corresponding paper, they show that this state-of-the-art E2E model outperforms many commercial vendors on various public datasets [Radford et al. 2022].

We also found data published as part of advertising on company websites or in whitepapers. Whereas some vendors used public datasets to test the accuracy of their ASR engines (SpeechText.AI [2023], Rev AI [Jette 2020], Speechmatics [Hughes 2023], and AssemblyAI [2023]), others referred to internal datasets (Deepgram [Stephenson 2022]). When companies publish data on the accuracy of their ASR services, it is not always clear whether these figures primarily serve the purpose of advertising, such as a favourable comparison to other vendors.

Table 2 shows an overview of different studies, whitepapers and company websites that provide data on ASR services. The lowest WER value (best result) of each row is highlighted in bold. Results where study authors and lowest WER are from the same company are underlined. The values of each study are not comparable to others, as the error rates depend on multiple factors: (1) the tested dataset, (2) the date of the test, and (3) the degree of text normalisation to calculate the WER.

Source | peer-reviewed | public dataset | WER (%) of vendors included in this study (Amazon, AssemblyAI, Deepgram, Google, IBM, Microsoft, Rev AI, Speechmatics, SpeechText.AI, Tencent, Whisper)
[Addlesee et al. 2020] | yes | yes | 7, 6, 5
[Ballenger 2022] | yes | no | 22, 31, 28, 14
[Catania et al. 2019] | yes | yes | 20, 20
[Koenecke et al. 2020] | yes | yes | 13, 14, 16, 10
[Tadimeti et al. 2022] | yes | no | 18, 11, 27, 16
[Tatman and Kasten 2017] | yes | yes | 31, 45
[Chiu et al. 2018] | yes | no | 6
[Saon et al. 2017] | yes | yes | 6
[Xiong et al. 2018] | yes | yes | 5
[AssemblyAI 2023] | no | yes | 6, 7
[Stephenson 2022] | no | no | 11, 13
[Jette 2020] | no | no | 18, 16, 17, 14, 15
[Hughes 2023] | no | yes | 17, 19, 15, 12, 16
[SpeechText.AI 2023] | no | yes | 16, 9, 14, 4
[Radford et al. 2022] | no | yes | 13

Table 2. Summary of WER Measurements of Different Services from Related Research

In 15 different tests, nine different providers delivered the best result. At most, five different vendors are compared within a single study or test. Whereas the common cloud providers Amazon, Google, IBM and Microsoft appear in multiple studies, specialised ASR providers are less represented. The error rates have a very large range from 3.8% to 45.0%, which most likely results from the different datasets used and the degree of text normalisation.


3 METHODOLOGY

The methodology is illustrated in Figure 1 and explained in detail in the following sections. The dataset contains lecture recordings obtained through a systematic YouTube search (1). The search results were filtered according to a set of selection criteria (2) to ensure conformity with the dataset. A randomly selected segment of the entire audio clip was extracted and pre-processed into a standardised audio format (3). After an initial transcription using Whisper (4), the transcript was manually corrected (5) to create an error-free reference and a vocabulary list.


Fig. 1. Methodology.

A NodeJS script automates transcription across different ASR service providers (6). The dataset and scenario settings (batch or stream transcription; with or without vocabulary) are used as input for execution. The script unifies the Application Programming Interface (API) settings (7), measures the duration of the transcription process (8), and calls the vendor APIs (9). The response from the API and additional metadata are stored as the result.

For the analysis, the different API response formats were consolidated (10). Prior to calculating the WER, the ASR transcript and the reference transcript were normalised. The duration and ASR confidence were calculated from the stored metadata.

We evaluated 11 different ASR services with our dataset, which contains 120 audio samples of 3 minutes each (90 YouTube recordings; 30 control samples from the LibriSpeech test-other corpus). Each service processed between 180 and 480 transcription jobs, depending on supported languages and availability of streaming ASR (see Table 3). In total, 221 hours of audio were transcribed, resulting in 3,840 individual transcriptions.

Vendor | Batch | Streaming | Batch with vocabulary | Streaming with vocabulary
Amazon | en, de | en, de | en, de | en, de
AssemblyAI | en, de | en | en, de | en
Deepgram | en, de | en, de | en, de | en, de
Google | en, de | en, de | en, de | en, de
IBM | en, de | en, de | - | -
Microsoft | en, de | en, de | en, de | en, de
Rev AI | en, de | en, de | en, de | en
Speechmatics | en, de | en | en, de | en
SpeechText.AI | en, de | - | - | -
Tencent | en | - | en | -
Whisper | en, de | - | - | -

Table 3. Supported Features by Service Provider (en = English, de = German)

3.1 Dataset

The goal of this study was to capture the average performance of ASR under fairly ideal but realistic conditions, rather than to identify specific weaknesses or biases. To achieve this objective, a new dataset was created to minimise the risk of samples being part of an ASR model’s training data, which could lead to a particularly good performance.

3.1.1 Recording Scenario.

Recordings were made in a Higher Education setting and consisted of (undergraduate) lectures or invited talks. We chose Higher Education as our use case because there is a high demand for affordable and scalable transcription of lectures, talks and other learning formats. Higher Education institutions are required to offer equal access to all students, and the transcription of lectures is an essential part of that, benefiting not only DHH students but also (second) language learners and students taking part remotely. The scenario of Higher Education is also suitable, as it contains specialist vocabulary and lecturers are used to presenting without being professional speakers. We only chose parts of recordings where only the presenter speaks, to avoid additional issues like speaker diarisation.

3.1.2 Selection Criteria.

We selected only videos published under a Creative Commons licence from YouTube, which reduced the possible sample size significantly. We used keywords like “university lecture”, “invited talk”, and corresponding search terms for German videos. We excluded any videos that covered unsuitable topics (e.g., lectures not recorded in the context of Higher Education, talks by religious groups, interest groups, or political parties). We selected only videos with an average audio quality, which (we think) is representative of a Higher Education setting (e.g., speakers using a clip-on microphone). We avoided recordings with either very professional setups or bad audio quality (e.g., strong reverberation, noise interferences, unintelligible or barely audible speech). We additionally measured the quality of all speech recordings with MOSNet and selected only recordings with a predicted score of 2 or higher [Lo et al. 2019]. We also excluded all videos that had manually created or corrected captions, as these might be part of the training data of Whisper and other services.

This initial selection of videos was then rated in a two-step review process by three members of the team with regard to the following criteria:

Recording quality

For English samples, perceived difference from RP (Received Pronunciation) in the UK or GenAm (General American English) in the United States

For English as a Second Language (ESL) samples, perceived intelligibility of speech

For German samples, High German without strong accents.

We used these criteria because heavily accented speech is challenging for ASR systems, as it contains acoustic and linguistic markers of both the source and target language, and could thus be a source of transcription errors. Similarly, ASR engines are often trained on datasets with Received Pronunciation (RP) or General American English (GenAm). Samples with a large difference from this training set could produce more errors. It is a general problem that ASR engines are only trained on limited datasets that do not include other variants of English, as this could create or reinforce existing biases.

3.1.3 Bias.

The scenario of Higher Education leads to a bias in the dataset. For example, there is a higher number of recordings with White male speakers compared to Black female speakers. There are also many videos on some topics like Computer Science and few on other topics like Linguistics. We tried to limit this bias by selecting content that covers a wide range of topics, from Computer Science, Electrical Engineering and Chemistry to History, Sociology, Politics, Linguistics, Religion and Neuroscience. We selected samples with English as a second language (ESL) speakers who had different accents (European, South American, Asian). However, we did not select speakers for the ESL samples from countries or regions that might have English as an official government language (e.g., India or Nigeria), as they might be more fluent or have completed their education in English. The dataset has an even distribution of male and female speakers but is not specifically designed to examine gender differences in ASR accuracy.

3.1.4 Data Preparation.

We randomly selected a part of each video that had continuous and fluent speech (e.g., no interactions with the audience or writing on a board). The target length of the sample was 3 minutes. We avoided cutting the recording sample during a sentence or a word. The samples have an average duration of 184 seconds (min 174, max 209). The number of spoken words is 455 on average (min 287, max 671). The average words per minute is 148 (min 94, max 222). The audio was converted to a standardised format that all vendors supported (44,100 Hz, mono, PCM 16-bit, LE). Additionally, the volume of the audio was normalised with sox [Bagwell 2023].
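The conversion step can be reproduced with a sox call along these lines; the file names are placeholders and the article does not list the exact command, so this is only a sketch.

```python
import subprocess

def prepare_sample(src: str, dst: str) -> None:
    """Convert to 44,100 Hz, mono, signed 16-bit PCM and normalise the volume with sox."""
    subprocess.run(
        ["sox", src, "-r", "44100", "-c", "1", "-b", "16", "-e", "signed-integer",
         dst, "gain", "-n"],  # "gain -n" normalises the level to 0 dBFS
        check=True,
    )

prepare_sample("lecture_clip.wav", "sample_001.wav")
```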

3.1.5 Availability.

Even though we consider our study to meet the Fair Use1 principle, and all recordings are publicly available videos under the Creative Commons licence, we do not want to publish our dataset without the individual speakers’ consent. We provide a detailed table on GitHub with the used recordings, time ranges, speaker origins and topic of the presentations.2 To contextualise the results of our recordings, we included a public dataset that many vendors or research studies use for training, model validation and comparison to other ASR models. We randomly selected 30 samples from the most common English corpus LibriSpeech test-other [Panayotov et al. 2015] as a control dataset. The results also help to identify whether ASR services perform particularly well on LibriSpeech. The recordings fulfil the same criteria as the English videos with regard to speaker accents and audio length.

3.2 Additional Variables

Besides each sample’s individual speaker and content, this study uses additional variables to represent a wider range of material and to cover potential sources of bias in ASR systems. We chose 15 male and 15 female speakers per language group. We have four language groups: English, ESL, German and LibriSpeech (English).

Additionally, we have four different technical scenarios: batch or stream transcription and with or without an additional custom vocabulary. Some vendors did not provide all technical requirements, such as custom vocabularies, streaming or specific combinations (e.g., German streaming with a custom vocabulary). Table 3 lists the supported features of the vendors at the date of the test.

3.3 Transcription

Previous research suggests using ASR as a basis for manual corrections to create more accurate transcripts compared to a fully manual approach [Che et al. 2017]. We could not rule out the possibility that a repeated transcription of the same audio file influences the result of the second run (e.g., faster processing time through caching), except for Whisper. We therefore initially transcribed all samples with Whisper. Two independent coders manually corrected each transcript. We focused on creating a verbatim transcript, and besides correcting mistakes, we carefully looked out for Whisper-specific errors like the removal of duplicate words. The manual transcription included additional research to find the correct spelling of names and technical terms.

We encountered several problems during transcription, such as differences between UK and U.S. spelling, common contractions (e.g., “it’s”, “can’t”), or currency notations (e.g., $12,000). For German, we also found colloquial spellings (e.g., “en” for “ein”, “gsagt” for “gesagt”). Additionally, transcribing correct punctuation is difficult for spoken language and influences capitalisation, for example, by ending a sentence after a break or separating it with a comma. We therefore used the extensive text normalisation from Whisper and added replacements for specific words occurring in our German transcripts.

3.4 Vendors

We used the Google Search Engine to find the most relevant vendors that provide an API for fully automated transcription of audio or video files. We excluded services that focus on specific tasks like keyword spotting or telephone transcription. We selected 10 commercial vendors (Amazon AWS, AssemblyAI, Deepgram, Google Cloud Platform, IBM Watson, Microsoft Azure, Rev AI, Speechmatics, SpeechText.AI, and Tencent Cloud) and OpenAI Whisper, as a self-hosted open source alternative. To minimise the effects of audio processing that vendors might use, we converted all samples to a standard audio format and normalised the volume.

We checked all API parameters of each vendor to ensure comparability and to avoid unintentional default settings. If vendors provided multiple models, we selected the model that promised the highest accuracy. These enhanced models might influence the transcription speed and costs (e.g., higher processing power). We did not use models for specific use cases like transcription of medical recordings. To receive verbatim transcripts, we disabled all filters (e.g., profanity). We enabled detailed results if possible, including punctuation, text formatting, word-level timestamps and confidences. With regard to Whisper, we used the large-v2 model and disabled “condition_on_previous_text”, as the model otherwise hallucinated and created transcripts longer than the actual audio.
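For Whisper, the corresponding call looks roughly as follows; only the model name large-v2 and the disabled condition_on_previous_text option are taken from our setup, the remaining parameters are library defaults and the file name is a placeholder.

```python
import whisper

# Load the large-v2 checkpoint used in this study.
model = whisper.load_model("large-v2")

# Disabling condition_on_previous_text stops the decoder from conditioning on its own
# earlier output, which otherwise caused hallucinated, overly long transcripts.
result = model.transcribe("sample_001.wav", condition_on_previous_text=False)

print(result["text"])
```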

Some APIs can return multiple results and provide alternative transcriptions, sorted by the AI’s confidence in the text correctness. We limited the results to one to avoid additional processing time and analysed the text that the AI categorises as best fit. For stream processing, we enabled partial results to receive intermediate and final transcripts.

3.5 Automation

We used NodeJS to automate all transcription jobs. If possible, we used the official software development kit (SDK) of a vendor. Otherwise, we implemented the HTTP- or WebSocket-API according to the vendors’ documentation. The streaming APIs utilise WebSocket connections that expect raw audio data either as an integer buffer or base64-encoded. A file stream splits the audio into 10-KB chunks, and these parts are sent sequentially. The results might differ from an actual real-time stream, as the data are transmitted faster. However, in our test, we did not find any indications that an API buffered all data and processed the audio as whole. In contrast to all other vendors’ streaming APIs, Tencent does not provide a WebSocket-API. We therefore excluded Tencent from streaming transcription, as we see a possible issue in comparability due to the implementation.
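The chunked upload follows a pattern like the Python sketch below; our implementation is a NodeJS script, and apart from the 10-KB chunk size and the raw-audio WebSocket transport, all names here (endpoint URL, message handling) are illustrative.

```python
import asyncio
import websockets  # third-party "websockets" package

CHUNK_SIZE = 10 * 1024  # 10-KB chunks, as used in our file stream

async def stream_audio(url: str, path: str) -> None:
    async with websockets.connect(url) as ws:
        async def send() -> None:
            with open(path, "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.send(chunk)  # raw PCM bytes; some vendors expect base64 instead
            # Most vendors also expect a vendor-specific end-of-stream message here.

        async def receive() -> None:
            # Partial and final transcripts arrive as messages until the server closes the socket.
            async for message in ws:
                print(message)

        await asyncio.gather(send(), receive())

# asyncio.run(stream_audio("wss://asr.example.com/stream", "sample_001.wav"))
```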

In addition to the API responses themselves, we measured timings locally to calculate specific durations (e.g., upload, vocabulary creation, transcription, download). Some vendors require additional steps like uploading files to a cloud bucket, whereas others handle multiple steps within a single HTTP call (e.g., transcription with a custom vocabulary). The different approaches might affect the measured timings and were considered in the comparative analysis. If cloud vendors offered multiple regions, we always used the region closest to our location (e.g., eu-central).

The cloud transcriptions were run on 22 June 2023. In a few cases, we had to restart a transcription job because an HTTP or API error occurred. We had no issues after a second try. The Whisper transcription was executed on the same day on a MacBook Pro (2020) with 16 GB of RAM and an Apple M1 ARM processor.

3.6 Vocabularies

Some APIs offer the functionality to upload custom vocabularies before the actual transcription. These vocabularies are lists of words or phrase sets. These words will be weighted higher by the AI and prioritised during transcription, especially if the AI has similar probability values for multiple words. We created a vocabulary for each transcript consisting of abbreviations and names of persons, countries, regions, cities, rivers and buildings. We focused on an approach that is feasible in a Higher Education setting (e.g., automatically analysing the content of slides and filtering rare words). Therefore, we did not add “sounds alike” terms that provide additional information to the AI (e.g., how the phrases are pronounced). In the case of Tencent, we had to exclude all words that exceeded the limit of 10 characters.
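One way to build such a list automatically, in the spirit of the slide-filtering idea above, is to keep only words that are rare in general language. The sketch below uses the wordfreq package and an arbitrarily chosen threshold; it is an illustration, not the procedure we actually used.

```python
import re
from wordfreq import zipf_frequency  # third-party "wordfreq" package

def candidate_vocabulary(slide_text: str, max_zipf: float = 2.5) -> list[str]:
    """Return words from slide text that are rare in general English as vocabulary candidates."""
    words = set(re.findall(r"[A-Za-zÄÖÜäöüß'-]+", slide_text))
    return sorted(w for w in words if zipf_frequency(w.lower(), "en") <= max_zipf)

print(candidate_vocabulary("Hactar computes the Krikkit trajectory using Bistromathics."))
```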

3.7 Analysis

We used the jiwer [Vaessen 2023] Python implementation to calculate the WER, and applied extensive text transformation to all transcripts to reduce the impact of formatting differences on the error rate. We used the Whisper text normaliser [Radford et al. 2022], which transforms text to lowercase, removes all punctuation, and unifies common contractions, spelling differences and spoken numbers. We extended the normalisation for German according to common contractions and numbers appearing in the transcripts of our dataset. Consequently, as punctuation was removed for the calculation of the WER, we only used confidence outputs from an API on words and excluded punctuation marks.

To examine the effect of custom vocabularies, we counted the number of vocabulary hits. Every occurrence of a word from the vocabulary in a transcript is defined as a hit. The text normalisation was also applied to these words. We also measured the time taken for various processing steps like uploading, vocabulary creation, transcription and downloading of the results as described in Section 3.5.
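Counting hits then reduces to a token match after normalisation, roughly as in this sketch; the function name and example strings are our own, and multi-word phrases would need extra handling.

```python
from whisper.normalizers import EnglishTextNormalizer

normalise = EnglishTextNormalizer()

def vocabulary_hits(transcript: str, vocabulary: list[str]) -> int:
    """Count how often any normalised vocabulary word occurs in the normalised transcript."""
    tokens = normalise(transcript).split()
    vocab = {normalise(word) for word in vocabulary}
    return sum(1 for token in tokens if token in vocab)

print(vocabulary_hits("Hactar was built on Krikkit.", ["Hactar", "Krikkit", "Magrathea"]))  # 2
```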


4 RESULTS

4.1 Accuracy

Table 4 shows the vendors’ average WER on the four different datasets for batch transcription without vocabulary. On average, native English speakers achieved the lowest error rates, and ESL speakers still performed better than German native speakers. Comparing the results of English and the LibriSpeech dataset, most vendors show a comparable quality. Deepgram, Google, Rev AI and SpeechText.AI had considerably higher error rates for LibriSpeech. The lower error rates of AssemblyAI and Tencent could indicate an optimisation towards the LibriSpeech corpus. Interestingly, Google was the only vendor for which German performed best across all datasets, most likely due to the comparatively poor performance of its English ASR model.

Vendor | English | LibriSpeech | ESL | German
Amazon | 4.4 | 6.1 | 7.1 | 18.0
AssemblyAI | 4.5 | 4.2 | 5.9 | 13.2
Deepgram | 8.3 | 12.8 | 11.4 | 19.3
Google | 20.1 | 23.6 | 28.1 | 18.1
IBM | 11.2 | 13.2 | 17.3 | 20.6
Microsoft | 4.4 | 5.9 | 6.7 | 10.1
Rev AI | 4.4 | 7.0 | 6.7 | 19.2
SpeechText.AI | 8.4 | 11.3 | 14.1 | 21.4
Speechmatics | 3.3 | 3.6 | 4.6 | 8.0
Tencent | 4.6 | 4.1 | 7.3 | -
Whisper (large-v2) | 2.9 | 3.3 | 3.3 | 5.0
Average | 7.0 | 8.6 | 10.2 | 15.3

Table 4. Average WER in Percentage by Service Provider and Dataset

Figure 2 shows the WER by vendor for all English datasets transcribed as batch and without vocabulary. The average WER across all vendors was 7.0%. The lowest calculated value was 0% and the highest 53.8%. The results show that both the average accuracy and the standard deviation of error rates vary between the vendors. Service providers with a high average accuracy tend to have a smaller standard deviation compared to providers with a low average accuracy.


Fig. 2. WER by vendor for the English datasets.

Figure 3 shows the average WER per audio sample. The results show that mean and standard deviation vary between the individual samples. Samples with a high average accuracy tend to have a smaller standard deviation compared to samples with a low average accuracy. Across the 30 different samples, seven different vendors achieved the highest accuracy for at least one sample, and three different vendors the lowest.


Fig. 3. WER by file for the English datasets.

4.2 Vocabulary

Eight vendors offered the functionality to provide a custom vocabulary for words or phrases that might appear in the audio. The average WER of these providers across the English datasets was 8.61%. By providing a vocabulary, it was reduced to 8.10%. We performed a Welch’s t-test to investigate whether the WER differed between transcriptions with and without an additional vocabulary. There was no significant difference in the WER of the transcripts without vocabulary (M = 0.086, SE = 0.003) and with vocabulary (M = 0.081, SE = 0.003) (t(1438) = –1.646, p = 0.640). The effect size was moderate (r = 0.580).
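Such a comparison can be reproduced with SciPy's unequal-variance t-test; the arrays below are placeholders, not our measured WER values.

```python
import numpy as np
from scipy import stats

# Placeholder WER values for the same samples transcribed without and with a vocabulary.
wer_without = np.array([0.05, 0.09, 0.12, 0.07, 0.10])
wer_with = np.array([0.05, 0.08, 0.11, 0.07, 0.09])

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(wer_without, wer_with, equal_var=False)
print(t_stat, p_value)
```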

Figure 4 displays the average hits of these vocabulary words within the transcripts for each service provider. For all vendors, the majority of hits was already counted without a vocabulary. The hits increased for all vendors except Tencent when providing a vocabulary. No vendor reached the maximum hits according to the reference transcript.


Fig. 4. Average hits of vocabulary words by vendor for the English datasets.

4.3 Streaming

On the English datasets, streaming transcription had a higher WER (10.9%) compared to batch transcription (9.37%). We performed a Welch’s t-test to investigate whether the WER differed between batch and streaming transcription. There was a significant difference in the WER of the transcripts for batch (M = 0.094, SE = 0.003) and streaming (M = 0.109, SE = 0.003) transcription (t (1438) = –1.646, p < 0.01). The effect size was large (r = 0.995).

4.4 Confidence

Figure 5 shows the average confidence by WER for each vendor. Tencent did not provide confidence values. The results from batch transcription for all datasets are used. Ideally, the confidence aligns with the WER. If an AI is 80% confident, the measured WER should be 20%. The blue diagonal line represents this ideal relation between confidence and WER. Data points in the upper left half represent overconfidence and data points in the lower right half underconfidence. Google and SpeechText.AI tend to have higher confidences than the measured accuracy, whereas Microsoft and Whisper report lower confidences. The other services show a higher alignment between the measured WER and confidence output. Only AssemblyAI shows an ideal relation between WER and confidence on average. Considering the varying deviations of the other providers, this could be a mere coincidence.


Fig. 5. Average WER in percentage to confidence in percentage by vendor for all datasets.

4.5 Duration

The duration an ASR service requires to process a file depends on multiple factors, like the complexity of the ASR pipeline, the model size, and mainly the computing power that is provided by the vendor. It can also be reduced by splitting the audio into small chunks to process them in parallel. Figure 6 displays the WER of the transcript to the processing duration for each transcription job of the English dataset. Most vendors show a fixed duration independent of the resulting accuracy. This is as expected, as all audio samples are around 3 minutes long. Vendors that have a longer processing duration do not show lower error rates. Whisper has the highest accuracy and longest processing duration, which can either result from the model complexity or the fact that it was executed on consumer hardware.


Fig. 6. WER in percentage to processing duration in seconds by vendor for the English dataset.

4.6 Text Normalisation

The degree of text normalisation strongly impacts the calculated WER. Without any transformation, the average WER of all datasets was 21.83%. Transforming the text to lowercase reduced the WER to 18.15% and removing punctuation to 15.48%. Both normalisations combined resulted in an average WER of 11.29%. The additional replacements introduced by the Whisper normaliser and our extension for German reduced the average WER to 10.16%.

We performed a Welch’s t-test to investigate whether the WER differed between the simple normalisation (lowercase and no punctuation) and the extended normalisation (Whisper normaliser and our German additions). There was a significant difference in the WER of the transcripts for simple (M = 0.113, SE = 0.002) and extended (M = 0.102, SE = 0.002) normalisation (t (2578) = –1.645, p < 0.01). The effect size was large (r = 0.977).

4.7 Alternative Error Metrics

The match error rate (M = 0.099, SE = 0.080) shows similar results compared to the WER (M = 0.102, SE = 0.083). Whereas the average and standard deviation of the CER (M = 0.051, SE = 0.050) are around 50% lower, they are around 30% higher for the Word Information Lost (M = 0.152, SE = 0.118).
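All of these metrics are available in jiwer alongside the WER and can be computed from the same normalised text pairs; the sentences in the sketch are made up.

```python
import jiwer

reference = "the answer is not 42"
hypothesis = "the answer is 42"

print(jiwer.wer(reference, hypothesis))  # word error rate
print(jiwer.mer(reference, hypothesis))  # match error rate
print(jiwer.cer(reference, hypothesis))  # character error rate
print(jiwer.wil(reference, hypothesis))  # word information lost
```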


5 DISCUSSION

Our study evaluated 11 common ASR services by transcribing recordings from Higher Education lectures. Results show that the accuracy varied strongly between the different vendors. Even providers that achieve a relatively low average WER can show a high error rate for an individual audio sample. Beyond the quality differences between vendors, this volatility also shows in the accuracy distribution for each recording. Even though the samples do not contain strong accents, the performance of ASR heavily depends on the individual speaker and acoustic environment. Different vendors provided the highest or lowest scores for each recording. There was not one vendor that consistently reached the lowest WER across all samples. The variance in WER between the individual recordings shows that it is hard to come up with one reliable number to score the accuracy of an ASR provider, even for a quite homogeneous dataset like ours. For English, the state-of-the-art average value seems to be around 5%, which aligns with the reports of related studies [Addlesee et al. 2020; AssemblyAI 2023; Chiu et al. 2018; Saon et al. 2017; Xiong et al. 2018]. However, our findings do not necessarily disprove other studies that report higher error rates (see Table 2), as the WER is heavily dependent on the degree of text normalisation.

As our samples are recordings in the context of Higher Education, results might be different for more spontaneous, conversational and colloquial speech. However, insufficient accuracy of ASR is the main issue reported by DHH individuals in related studies [Butler 2019; Kawas et al. 2016] and by the DHH community [Blake 2019; National Deaf Center on Postsecondary Outcomes 2020]. This makes it difficult to assess ASR’s ability to provide accessible transcripts, as it has to be judged on a case-by-case basis.

Apart from the different services and individual speakers, language itself has the biggest impact on the transcription’s accuracy. ASR achieves the lowest error rates for native English speakers. Even though the samples of non-native speakers show higher error rates for English, they still perform better compared to our German dataset. The quality most likely decreases for languages with a smaller corpus of training data, and more complex languages, such as languages with a number of grammatical cases, articles or capitalisation of nouns. Whisper, which performed best in our study, reports an average multilingual WER that is three times higher compared to the English WER [Radford et al. 2022]. This discrepancy in performance is further highlighted considering the spread reported by Whisper on the Common Voice 15 dataset, which ranges from 4.3% to 55.7% WER. Even though ASR might meet the accuracy requirements for caption users for English, it might fail to meet them for other languages and outside of narrowly defined scenarios, which is a real problem for the accessibility of events.

Our English video recordings had a similar WER as the samples taken from the LibriSpeech corpus. Most of the tested ASR services seem to have a robust general performance and do not only achieve particularly good results on a common dataset. An evaluation on LibriSpeech can reflect the general accuracy of an ASR model, if it is not explicitly optimised for this dataset. We still see the danger that extremely low error rates reported on common datasets cause us to misjudge general ASR performance. This might also be a challenge for the regulation of ASR as requested by the NAD [Blake 2019]: setting a threshold on a publicly available dataset could cause a model to be overly adapted to that dataset, to meet legal requirements.

Whisper as an open source model outperforms commercial services in many cases and reports quite low error rates in general. These findings are also reported by other studies [AssemblyAI 2023; Hughes 2023; Radford et al. 2022; Stephenson 2022]. This is a strong indicator that E2E models can further increase the accuracy of ASR. However, accuracy is not the only relevant factor for ASR, and E2E models have shown weaknesses in streaming capabilities, latency and computational efficiency [Li 2022]. They also tend to be inaccurate in predicting word-level timestamps [Bain et al. 2023]. The strength of an open source model like Whisper is its ability to support further scientific research, and to be adapted to specific use cases such as Higher Education.

Adding vocabularies did not significantly improve the accuracy of the different ASR services. However, some technical terms, abbreviations and names did only appear in the transcript when a vocabulary was provided. Even though adding the vocabulary did not significantly decrease the overall WER, these words might be fundamental for understanding the content and might consequently make the transcript more accessible. Presenters or speakers should consider the use of vocabularies if their text contains abbreviations or specialist vocabulary, even though there is no guarantee that any of these words will be correctly identified and transcribed by an ASR model. Ideally, such vocabularies can be created with little effort, such as by automatically extracting rare words from presentation slides.

We also found no indication that the processing duration affects the resulting accuracy. Longer processing time does not necessarily result in a higher accuracy and most likely results from the available computing resources. One method to increase the processing speed is to split the recording into short clips, transcribe these segments concurrently, and then merge the transcribed texts. A related study reported a speed increase of up to 12 times without sacrificing transcription accuracy [Bain et al. 2023]. It is very likely that Deepgram uses such an approach, given its very fast performance in our evaluation.

Streaming APIs showed a significantly higher error rate compared to batch transcription. To achieve near real-time transcription, a low real-time factor is required to reduce the latency between audio and transcription results. Most probably, vendors have a separate ASR system for streaming, which sacrifices some accuracy to increase speed. Hybrid ASR models could use a different processing pipeline and E2E models could use a smaller model size, as larger models generally take longer to process. However, as ASR is particularly useful for real-time events, accuracy should also be evaluated with the corresponding models as long as there is a noticeable performance gap between streaming and non-streaming models.

The confidence outputs of ASR services were not transferable to the accuracy of a transcript. The confidence values merely seem to represent the probabilities within an ASR model, not the actual correctness. Visually highlighting words regarding their confidence, as explored by Seita et al. [2018], might cause more confusion for the readers than it actually helps. It is also questionable if the confidence outputs can be used as part of an error prediction of ASR, as suggested by Kafle and Huenerfauth [2016].

Alternative error metrics showed similar results compared to the WER and seem to be interchangeable. The Character Error Rate (CER) may be useful for evaluating languages that do not separate words with spaces, such as Chinese [Radford et al. 2022]. It is common in speech recognition research to apply text normalisation to make the WER more robust to formatting differences. Without these modifications, the average error rate in our study was around twice as high. The extended normalisation that replaces common contractions and unifies numbers showed a significant reduction of the WER, compared to only lowercase transformation and the removal of punctuation. These normalisations should therefore be used for metrics that are based on string edit distances to better reflect the amount of content errors. However, institutions like the FCC demand correct capitalisation and punctuation of captions to aid comprehension [Federal Communications Commission 2014]. As the WER is less meaningful without text normalisation, we need additional metrics that reflect the accuracy of punctuation, capitalisation and other aspects such as formatting.

Complementary to large-scale automated evaluations and the search for additional metrics, it is crucial to evaluate ASR qualitatively, especially with DHH individuals. User studies can identify additional issues with ASR, such as the readability of ASR-generated text [Butler 2019] or the impact of latency in interactive conversations [McDonnell et al. 2021]. Focus groups can also provide ideas on how to improve the use of ASR [Seita et al. 2022], or help assess the acceptance of new features like automatic content summarisation [Alonzo et al. 2020]. Acceptance of ASR as an accessible accommodation is an important factor, as it can be perceived as a second-rate service by users [National Deaf Center on Postsecondary Outcomes 2020]. Human involvement, for example, by correcting or monitoring ASR [McDonnell et al. 2021; Seita et al. 2022], can potentially increase the acceptability of ASR independent of its actual accuracy.


6 CONCLUSION

We provided a comprehensive overview of the state-of-the-art quality of many common ASR services that puts extremely low error rates reported by scientific research and general claims by vendors into perspective. We could confirm very high accuracy rates in some cases, but found a wide range in quality across vendors, speakers and languages. The range of error rates in our dataset shows that average accuracy rates can be misleading. This rather weak reliability must be considered if ASR is used as an accessibility tool without human supervision.

We also found a significantly higher error rate for streaming ASR. As automatic transcription is particularly useful for real-time events, this performance gap must be closed. Technical enhancements like an additional vocabulary showed only little improvement. In addition, the usefulness of ASR metadata like word-level confidence for indicating errors is doubtful.

With the increasing availability of ASR in commercial tools, it is already part of our social and work life. But just because ASR offers a solution for the complex task of transcription does not mean that it is accessible and enables DHH individuals to participate in various events. As demanded by the DHH community, we need a declaratory ruling on the use of ASR for transcription, which largely depends on a binding metric and quality threshold. Despite criticism that the WER does not necessarily reflect the usefulness of a text [Favre et al. 2013; Mishra et al. 2011; Wang et al. 2003], it still holds potential as a complementary accessibility metric, primarily because it is objective, unbiased and commonly used as an evaluation metric in ASR research. The WER is particularly useful for verbatim transcription, and extensive text normalisation can reduce errors caused by non-semantic differences. As ASR continues to improve and error rates decrease, metrics based on a string edit distance become more robust. However, the WER cannot be used exclusively, as text normalisation removes important aspects of punctuation or capitalisation, and other factors such as speaker diarisation or caption formatting are not covered.

In many situations, speakers may not be able to choose a particular or even pre-trained ASR system, as it will be determined by the meeting or presentation software. However, speakers can be more inclusive when presenting to diverse audiences by speaking more clearly and slowly to give readers time to adjust to transcription errors. A prior study indicates that speakers adjust their behaviour in meetings with DHH peers when ASR is used [Seita et al. 2018].

7 LIMITATIONS

Our objective was to provide a realistic picture of the accuracy of ASR, but the generalisability of our results is limited. The setting we chose, Higher Education, provided rather ideal conditions and avoided particular weaknesses of ASR. However, DHH individuals are confronted with less ideal scenarios in real life. Results may differ in other environments or for more conversational speech. Furthermore, our dataset reflects biases inherent in the material available online and the over-representation of certain groups in Higher Education. The findings are also restricted to the date of the evaluation, as AI is a rapidly changing area of research. Due to the large amount of data, the analysis relies on the WER as the most common automated metric, which is a limited representation of ASR quality.

8 FUTURE WORK

Although this study provides only a snapshot of ASR quality, and performance will certainly improve in the future, many of the problems that caption users currently face are likely to remain. Our chosen scenario avoided many of the challenges of ASR, such as speaker diarisation, overlapping speech and mixed language. Furthermore, ASR performs notably better for English and German compared to many other languages. Evaluations of ASR in more challenging environments and for under-resourced languages may find considerably higher error rates and less useful transcription results. Increasing the amount of training data can effectively improve the overall performance of an ASR model and its robustness against specific biases, for example towards accented speech or the speech of elderly people and children. For ASR to be truly inclusive, it is also important to address the challenges of transcribing speech from people with speech impairments such as fluency disorders or dysarthria.

Due to the variety of applications of ASR, a single metric appears insufficient to comprehensively represent the quality of an ASR model. Currently, the WER serves as the standard metric for reporting accuracy, but because text normalisation is used to reduce non-semantic differences, information about the correctness of punctuation and capitalisation is lost. A future metric could provide a more detailed analysis that distinguishes between error types and reports accuracy along several dimensions. In addition to punctuation and capitalisation, word errors (e.g., technical terms, numbers, abbreviations and homonyms) could be classified according to their impact on text comprehension. Ideally, empirical evaluations could establish a correlation between such an automated metric and subjective ratings.
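The following sketch indicates what such an error classification could look like: substitution errors are assigned to coarse categories (punctuation, capitalisation, number, lexical). The categories and rules are purely illustrative assumptions, not a validated taxonomy.

    import string

    def classify_error(ref_word: str, hyp_word: str) -> str:
        """Assign a coarse category to a single substitution error.
        Illustrative rules only; a real metric would need a validated taxonomy."""
        def strip(w: str) -> str:
            return w.strip(string.punctuation)

        if strip(ref_word).lower() == strip(hyp_word).lower():
            if strip(ref_word) == strip(hyp_word):
                return "punctuation"      # only punctuation differs
            return "capitalisation"       # only casing differs
        if any(c.isdigit() for c in ref_word) or any(c.isdigit() for c in hyp_word):
            return "number"
        return "lexical"                  # a genuine word substitution

    pairs = [("Fourier", "foyer"), ("2024,", "2024"), ("Deaf", "deaf"), ("15", "fifteen")]
    for ref, hyp in pairs:
        print(f"{ref!r} -> {hyp!r}: {classify_error(ref, hyp)}")
    # 'Fourier' -> 'foyer': lexical
    # '2024,' -> '2024': punctuation
    # 'Deaf' -> 'deaf': capitalisation
    # '15' -> 'fifteen': number

Each category could then be weighted by its estimated impact on comprehension before being aggregated into an overall score.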

ACKNOWLEDGMENTS

We thank Andreas Burkard for his efforts in making this article accessible and Kathy-Ann Heitmeier for her contributions to English language refinement.

REFERENCES

  1. Adams Chuck, Campbell Alastair, Cooper Michael, and Kirkpatrick Andrew. 2022. Web Content Accessibility Guidelines (WCAG) 2.2: W3C Recommendation. Retrieved April 15, 2023 from https://www.w3.org/TR/WCAG22/
  2. Addlesee Angus, Yu Yanchao, and Eshghi Arash. 2020. A comprehensive evaluation of incremental speech recognition and diarization for conversational AI. In Proceedings of the 28th International Conference on Computational Linguistics. 3492–3503.
  3. Aksënova Alëna, Esch Daan van, Flynn James, and Golik Pavel. 2021. How might we create better benchmarks for speech recognition? In Proceedings of the 1st Workshop on Benchmarking: Past, Present, and Future. 22–34.
  4. Alonzo Oliver, Seita Matthew, Glasser Abraham, and Huenerfauth Matt. 2020. Automatic text simplification tools for deaf and hard of hearing adults: Benefits of lexical simplification and providing users with autonomy. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI’20). ACM, New York, NY, 1–13.
  5. Amazon. 2023. AWS: Amazon Transcribe Features. Retrieved April 15, 2023 from https://aws.amazon.com/transcribe/features/
  6. Apone Tom, Brooks Marcia, and O'Connell Trisha. 2010. Caption accuracy metrics project. The WGBH National Center for Accessible Media, Boston. Retrieved from http://ncamftp.wgbh.org/ncam-old-site/file_download/CC_Metrics_research_paper_final.pdf
  7. Ardila Rosana, Branson Megan, Davis Kelly, Kohler Michael, Meyer Josh, Henretty Michael, Morais Reuben, Saunders Lindsay, Tyers Francis, and Weber Gregor. 2020. Common Voice: A massively-multilingual speech corpus. In Proceedings of the 12th Language Resources and Evaluation Conference. 4218–4222. https://aclanthology.org/2020.lrec-1.520
  8. AssemblyAI. 2023. Conformer-1: A Robust Speech Recognition Model. Retrieved April 15, 2023 from https://www.assemblyai.com/blog/conformer-1/
  9. Baevski Alexei, Zhou Henry, Mohamed Abdelrahman, and Auli Michael. 2020. Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS’20). Article 1044, 12 pages.
  10. Bagwell Chris. 2023. SoX - Sound eXchange. Retrieved April 15, 2023 from https://sourceforge.net/projects/sox/
  11. Bain Max, Huh Jaesung, Han Tengda, and Zisserman Andrew. 2023. WhisperX: Time-accurate speech transcription of long-form audio. In Proceedings of INTERSPEECH 2023. 4489–4493.
  12. Ballenger Sheryl. 2022. Access for deaf and hard of hearing individuals in informational and educational remote sessions. Assistive Technology Outcomes & Benefits 16, 2 (2022), 45–55.
  13. Baumann Timo, Atterer Michaela, and Schlangen David. 2009. Assessing and improving the performance of speech recognition for incremental systems. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 380–388.
  14. Blake Reid. 2019. Petition for Declaratory Ruling and/or Rulemaking on Live Closed Captioning Quality Metrics and the Use of Automatic Speech Recognition Technologies. National Association of the Deaf. https://www.fcc.gov/ecfs/search/search-filings/filing/10801131063733
  15. Butler Janine. 2019. Perspectives of deaf and hard of hearing viewers of captions. American Annals of the Deaf 163, 5 (2019), 534–553.
  16. Catania Fabio, Crovari Pietro, Spitale Micol, and Garzotto Franca. 2019. Automatic speech recognition: Do emotions matter? In Proceedings of the 2019 IEEE International Conference on Conversational Data and Knowledge Engineering (CDKE’19). IEEE, Los Alamitos, CA, 9–16.
  17. Chan William, Park Daniel, Lee Chris, Zhang Yu, Le Quoc, and Norouzi Mohammad. 2021. SpeechStew: Simply mix all available speech recognition data to train one large neural network. arXiv:2104.02133 [cs.CL] (2021).
  18. Che Xiaoyin, Luo Sheng, Yang Haojin, and Meinel Christoph. 2017. Automatic lecture subtitle generation and how it helps. In Proceedings of the 2017 IEEE 17th International Conference on Advanced Learning Technologies (ICALT’17). IEEE, Los Alamitos, CA, 34–38.
  19. Chiu Chung-Cheng, Sainath Tara N., Wu Yonghui, Prabhavalkar Rohit, Nguyen Patrick, Chen Zhifeng, Kannan Anjuli, Weiss Ron J., Rao Kanishka, Gonina Ekaterina, Jaitly Navdeep, Li Bo, Chorowski Jan, and Bacchiani Michiel. 2018. State-of-the-art speech recognition with sequence-to-sequence models. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’18). IEEE, Los Alamitos, CA, 4774–4778.
  20. Chung Yu-An, Zhang Yu, Han Wei, Chiu Chung-Cheng, Qin James, Pang Ruoming, and Wu Yonghui. 2021. w2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU’21). IEEE, Los Alamitos, CA, 244–250.
  21. Cumbal Ronald, Moell Birger, Lopes José, and Engwall Olov. 2021. “You don’t understand me!”: Comparing ASR results for L1 and L2 speakers of Swedish. In Proceedings of INTERSPEECH 2021. 4463–4467.
  22. Favre Benoit, Cheung Kyla, Kazemian Siavash, Lee Adam, Liu Yang, Munteanu Cosmin, Nenkova Ani, Ochei Dennis, Penn Gerald, Tratz Stephen, Voss Clare, and Zeller Frauke. 2013. Automatic human utility evaluation of ASR systems: Does WER really predict performance? In Proceedings of INTERSPEECH 2013, the 14th Annual Conference of the International Speech Communication Association. 3463–3467.
  23. Federal Communications Commission. 2014. Closed Captioning Quality Report and Order, Declaratory Ruling, FNPRM. Federal Communications Commission. https://www.fcc.gov/document/closed-captioning-quality-report-and-order-declaratory-ruling-fnprm
  24. Gandhi Sanchit, Platen Patrick von, and Rush Alexander M.. 2022. ESB: A benchmark for multi-domain end-to-end speech recognition. arXiv:2210.13352 [cs.CL] (2022).
  25. Geirhos Robert, Jacobsen Jörn-Henrik, Michaelis Claudio, Zemel Richard, Brendel Wieland, Bethge Matthias, and Wichmann Felix A.. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence 2, 11 (2020), 665–673.
  26. Godfrey John J., Holliman Edward C., and McDaniel Jane. 1992. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1. IEEE, Los Alamitos, CA, 517–520.
  27. Google. 2023. Google Cloud: Speech-to-Text. Retrieved April 15, 2023 from https://cloud.google.com/speech-to-text/
  28. Hughes John. 2023. Introducing Ursa from Speechmatics. Speechmatics. https://www.speechmatics.com/ursa
  29. Jette Miguel. 2020. The Podcast Challenge: Testing Rev.ai’s Speech Recognition Accuracy. Rev AI. https://www.rev.com/blog/speech-to-text-technology/the-podcast-challenge-testing-rev-ais-speech-recognition-accuracy
  30. Kafle Sushant and Huenerfauth Matt. 2016. Effect of speech recognition errors on text understandability for people who are deaf or hard of hearing. In Proceedings of the 2016 Workshop on Speech and Language Processing for Assistive Technologies (SPLAT’16). 20–25.
  31. Kafle Sushant and Huenerfauth Matt. 2017. Evaluating the usability of automatically generated captions for people who are deaf or hard of hearing. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS’17). ACM, New York, NY, 165–174.
  32. Kafle Sushant and Huenerfauth Matt. 2019. Predicting the understandability of imperfect English captions for people who are deaf or hard of hearing. ACM Transactions on Accessible Computing 12, 2 (June 2019), Article 7, 32 pages.
  33. Kawas Saba, Karalis George, Wen Tzu, and Ladner Richard E.. 2016. Improving real-time captioning experiences for deaf and hard of hearing students. In Proceedings of the 18th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS’16). ACM, New York, NY, 15–23.
  34. Kinoshita Keisuke, Ochiai Tsubasa, Delcroix Marc, and Nakatani Tomohiro. 2020. Improving noise robust automatic speech recognition with single-channel time-domain enhancement network. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’20). IEEE, Los Alamitos, CA, 7009–7013.
  35. Koenecke Allison, Nam Andrew, Lake Emily, Nudell Joe, Quartey Minnie, Mengesha Zion, Toups Connor, Rickford John R., Jurafsky Dan, and Goel Sharad. 2020. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences 117, 14 (2020), 7684–7689.
  36. Henry Shawn Lawton, Freed Geoff, and Brewer Judy. 2022. Making Audio and Video Media Accessible - Captions/Subtitles. Web Accessibility Initiative. https://www.w3.org/WAI/media/av/captions/
  37. Li Jinyu. 2022. Recent advances in end-to-end automatic speech recognition. APSIPA Transactions on Signal and Information Processing 11, 1 (2022), e8.
  38. Liu Jane W. S.. 2000. Real-Time Systems. Prentice Hall PTR.
  39. Lo Chen-Chou, Fu Szu-Wei, Huang Wen-Chin, Wang Xin, Yamagishi Junichi, Tsao Yu, and Wang Hsin-Min. 2019. MOSNet: Deep learning-based objective assessment for voice conversion. In Proceedings of INTERSPEECH 2019. 1541–1545.
  40. McDonnell Emma J., Liu Ping, Goodman Steven M., Kushalnagar Raja, Froehlich Jon E., and Findlater Leah. 2021. Social, environmental, and technical: Factors at play in the current use and future design of small-group captioning. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (Oct. 2021), Article 434, 25 pages.
  41. Meta AI. 2023. Speech Recognition on LibriSpeech Test-Other. Retrieved April 15, 2023 from https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-other
  42. Microsoft. 2023. Azure: Speech to Text. Retrieved April 15, 2023 from https://azure.microsoft.com/en-us/products/cognitive-services/speech-to-text/
  43. Mishra Taniya, Ljolje Andrej, and Gilbert Mazin. 2011. Predicting human perceived accuracy of ASR systems. In Proceedings of the 12th Annual Conference of the International Speech Communication Association. 1945–1948.
  44. Morris Andrew, Maier Viktoria, and Green Phil. 2004. From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition. In Proceedings of the 8th International Conference on Spoken Language Processing. 2765–2768.
  45. National Deaf Center on Postsecondary Outcomes. 2020. Auto Captions and Deaf Students: Why Automatic Speech Recognition Technology Is Not the Answer (Yet). Retrieved April 15, 2023 from https://nationaldeafcenter.org/news-items/auto-captions-and-deaf-students-why-automatic-speech-recognition-technology-not-answer-yet/
  46. National Institute on Deafness and Other Communication Disorders. 2017. Captions for Deaf and Hard-of-Hearing Viewers. Retrieved April 15, 2023 from https://www.nidcd.nih.gov/health/captions-deaf-and-hard-hearing-viewers
  47. Panayotov Vassil, Chen Guoguo, Povey Daniel, and Khudanpur Sanjeev. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’15). IEEE, Los Alamitos, CA, 5206–5210.
  48. Radford Alec, Kim Jong Wook, Xu Tao, Brockman Greg, McLeavey Christine, and Sutskever Ilya. 2022. Robust speech recognition via large-scale weak supervision. arXiv:2212.04356 [eess.AS] (2022).
  49. Romero-Fresco Pablo. 2009. More haste less speed: Edited versus verbatim respoken subtitles. Vigo International Journal of Applied Linguistics 6 (2009), 109–133.
  50. Romero-Fresco Pablo and Pérez Juan Martínez. 2015. Accuracy rate in live subtitling: The NER model. In Audiovisual Translation in a Global Context. Palgrave Studies in Translating and Interpreting Book Series. Palgrave Macmillan UK, London, UK, 28–50.
  51. Saon George, Kurata Gakuto, Sercu Tom, Audhkhasi Kartik, Thomas Samuel, Dimitriadis Dimitrios, Cui Xiaodong, Ramabhadran Bhuvana, Picheny Michael, Lim Lynn-Li, Roomi Bergul, and Hall Phil. 2017. English conversational telephone speech recognition by humans and machines. arXiv:1703.02136 [cs.CL] (2017).
  52. Seita Matthew, Albusays Khaled, Kafle Sushant, Stinson Michael, and Huenerfauth Matt. 2018. Behavioral changes in speakers who are automatically captioned in meetings with deaf or hard-of-hearing peers. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS’18). ACM, New York, NY, 68–80.
  53. Seita Matthew, Lee Sooyeon, Andrew Sarah, Shinohara Kristen, and Huenerfauth Matt. 2022. Remotely co-designing features for communication applications using automatic captioning with deaf and hearing pairs. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI’22). ACM, New York, NY, Article 460, 13 pages.
  54. Shangguan Yuan, Knister Kate, He Yanzhang, McGraw Ian, and Beaufays Françoise. 2020. Analyzing the quality and stability of a streaming end-to-end on-device speech recognizer. In Proceedings of INTERSPEECH 2020. 591–595.
  55. Speechmatics. 2023. Do You Have a VIP (Very Inclusive Product) for Unified Comms? Retrieved April 15, 2023 from https://page.speechmatics.com/Unified-Comms-Inclusivity-Guide.html
  56. SpeechText.AI. 2023. SpeechText.AI Home Page. Retrieved April 15, 2023 from https://speechtext.ai
  57. Stephenson Scott. 2022. A Note to Our Customers: OpenAI Whisper’s Entrance into Voice. Deepgram. https://blog.deepgram.com/a-note-to-our-customers-openai-whispers-entrance-into-voice/
  58. Tadimeti Divya, Georgila Kallirroi, and Traum David. 2022. Evaluation of off-the-shelf speech recognizers on different accents in a dialogue domain. In Proceedings of the 13th Language Resources and Evaluation Conference. 6001–6008. https://aclanthology.org/2022.lrec-1.645
  59. Tatman Rachael and Kasten Conner. 2017. Effects of talker dialect, gender & race on accuracy of Bing speech and YouTube automatic captions. In Proceedings of INTERSPEECH 2017. 934–938.
  60. Vaessen Nik. 2023. JiWER: Similarity Measures for Automatic Speech Recognition Evaluation. Retrieved April 15, 2022 from https://pypi.org/project/jiwer/
  61. Wald Mike. 2006. Creating accessible educational multimedia through editing automatic speech recognition captioning in real time. Interactive Technology and Smart Education 3 (2006), 131–141.
  62. Wang Ye-Yi, Acero A., and Chelba C.. 2003. Is word error rate a good indicator for spoken language understanding accuracy. In Proceedings of the 2003 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, Los Alamitos, CA, 577–582.
  63. Wells Tian, Christoffels Dylan, Vogler Christian, and Kushalnagar Raja. 2022. Comparing the accuracy of ACE and WER caption metrics when applied to live television captioning. In Computers Helping People with Special Needs. Lecture Notes in Computer Science, Vol. 13341. Springer, 522–528.
  64. Xiong W., Wu L., Alleva F., Droppo J., Huang X., and Stolcke A.. 2018. The Microsoft 2017 conversational speech recognition system. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’18). IEEE, Los Alamitos, CA, 5934–5938.
  65. Xu Qiantong, Baevski Alexei, Likhomanenko Tatiana, Tomasello Paden, Conneau Alexis, Collobert Ronan, Synnaeve Gabriel, and Auli Michael. 2021. Self-training and pre-training are complementary for speech recognition. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’21). IEEE, Los Alamitos, CA, 3030–3034.
  66. Zhang Yu, Park Daniel S., Han Wei, Qin James, Gulati Anmol, Shor Joel, Jansen Aren, Xu Yuanzhong, Huang Yanping, Wang Shibo, Zhou Zongwei, Li Bo, Ma Min, Chan William, Yu Jiahui, Wang Yongqiang, Cao Liangliang, Sim Khe Chai, Ramabhadran Bhuvana, Sainath Tara N., Beaufays Françoise, Chen Zhifeng, Le Quoc V., Chiu Chung-Cheng, Pang Ruoming, and Wu Yonghui. 2022. BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. IEEE Journal of Selected Topics in Signal Processing 16, 6 (2022), 1519–1532.

Published in ACM Transactions on Accessible Computing, Volume 16, Issue 4 (December 2023), 46 pages. ISSN: 1936-7228, EISSN: 1936-7236. DOI: 10.1145/3639855. Editors: Tiago Guerreiro and Stephanie Ludi.

Copyright © 2024 held by the owner/author(s). This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher: Association for Computing Machinery, New York, NY, United States.

Publication History: Received 19 July 2023; Revised 29 November 2023; Accepted 5 December 2023; Online AM 8 December 2023; Published 9 January 2024.
