Automatic speech recognition: can you understand me?

Automatic Speech Recognition (ASR) is a digital communication method that transforms spoken discourse into written text. This rapidly evolving technology is used in email, text messaging, and live video captioning. Current ASR systems operate in conjunction with Natural Language Processing (NLP) technology to transform speech into text that both people and machines can read. NLP refers to the methodologies and computational tools that analyze data produced in a natural language, such as English.

Self-study is the most frequent pedagogical approach to integrating ASR into language education, because the technology usually mediates learner-device interactions rather than learner-learner exchanges.

Examples
iSpraak.com (Nickolai, 2015), a cloud-based ASR tool, 'listens' to how a student pronounces a text provided by the teacher and returns a similarity score based on native speech patterns. The auto-scoring feature encourages independent study: learners keep practicing until they reach a certain score, but the teacher does not need to listen to every file produced.
Auto-generated transcripts from speech-to-text engines such as Microsoft Stream can also support independent language development (Liakin et al., 2015). As learners compare what the tool 'understood' to what they were trying to say, they improve their performance. Some of these tools pair ASR with automated translation, which can further help learners self-assess their accuracy.
An emerging ASR application is the use of Virtual Assistants (VAs) such as Alexa or Siri (Istrate, 2019; see also Underwood, this volume). The communicative functions that VAs motivate include uttering commands ("Alexa, play some music!") or asking factual questions ("Siri, what is the weather like in Tokyo today?"). Successfully getting a VA to perform the desired action or to provide the needed information requires not only pronunciation accuracy, but also some knowledge of L2 vocabulary and sentence structure, because the learners are not reading or repeating model sentences. If the task involves asking questions and using the information obtained, listening comprehension is an additional skill practiced.
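The comparison between what a learner meant to say and what the tool 'understood' can be illustrated with a small script. The following is only a minimal sketch, assuming a word-level comparison of a target sentence against an ASR transcript using Python's standard difflib module. It is not how iSpraak, Microsoft Stream, or any other tool mentioned here computes its scores (those typically work on the audio signal), and the function names and sample sentences are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_similarity(target, transcript):
    """Return a 0-1 ratio of how closely the transcript's words match the target's."""
    return SequenceMatcher(None, normalize(target), normalize(transcript)).ratio()


def mismatches(target, transcript):
    """Return (expected, heard) pairs for every stretch where the transcript differs."""
    expected, heard = normalize(target), normalize(transcript)
    opcodes = SequenceMatcher(None, expected, heard).get_opcodes()
    return [
        (" ".join(expected[i1:i2]), " ".join(heard[j1:j2]))
        for tag, i1, i2, j1, j2 in opcodes
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Hypothetical example: the learner aimed for the target, the ASR heard something else.
    target = "What is the weather like in Tokyo today?"
    transcript = "what is the leather like in Tokyo today"
    print(word_similarity(target, transcript))
    print(mismatches(target, transcript))
```

A learner or teacher could use the list of mismatched words to decide which items to practice again, while the ratio gives a rough single-number indication of progress across attempts.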

Benefits
Using ASR for pronunciation training may encourage learner autonomy: the immediate feedback provided by the software, in the form of a transcript or an accuracy score, makes learners more aware of their progress, and the ability to carry out the exercises without the teacher gives them more control over their practice.
Speaking tasks with VAs also increase speaking opportunities beyond the classroom. VAs are not yet suitable for conversational practice, but producing the short action-oriented or information-seeking utterances typical of these tasks is still a good proficiency-building exercise that can prepare learners for more involved oral discourse. In fact, frequent use of VAs for independent practice has been linked to significant improvements in L2 speaking proficiency (Dizon, 2020).

Potential issues
An important issue in ASR's pedagogical application is data privacy. As with other web-based interactions, exchanges with VAs produce personal data that could be commercially exploited. Thus, it is important for educators to be mindful of the data privacy policies for the technologies they use.
A second concern is robustness. ASR accuracy depends heavily on the acoustic conditions (performance suffers in noisy environments) and, most importantly for language educators, on the speaker's experience with the language. Users often complain that the ASR tool 'detected the wrong thing', even though they know they were saying it right.
Although 'comprehension' of accented speech keeps improving, ASR performance is still not ideal when transcribing speech produced by low-proficiency learners. This issue may be resolved as more data from this type of learner becomes available. ASR accuracy with non-native speech has improved due to increased computing power and data availability from commercial sources (telephone-based transactions, for example). These sources of data, however, do not include low-proficiency speakers: who dares to complete a phone transaction in a language they are not fluent in?
EdTech companies offering data-based learning solutions hold the key to improving ASR's robustness: tools such as Extempore are using a wide range of de-identified non-native speech data on their servers for research and development (Figure 1).
Auto-generated transcripts that remain highly accurate with novice learners will be a welcome grading aid for teachers. Reading is faster than listening, particularly if the audio file is plagued with the long pauses typical of low-proficiency speech. While auto-generated fluency scores can indicate progress on the temporal aspects of speech (frequency and mean duration of pauses, percentage of speaking time), transcripts can help teachers provide feedback on lexical and syntactic accuracy faster.

Looking to the future
The pedagogical examples described above show that ASR technology can have an important impact on language teaching and learning: automated comparison with native speech patterns encourages pronunciation accuracy, self-access speaking tasks promote learner autonomy, and independent oral practice with VAs builds proficiency.
There is a need for increased speaking practice outside the classroom targeting skills beyond pronunciation.
Through robust ASR-enabled applications, this supplemental oral practice can be completed without necessarily turning into additional grading for the teacher. Thus, as ASR with low-proficiency speakers becomes more reliable, this technology will be more widely adopted for independent and classroom-based language learning.