Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need

The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect pronunciation errors with high accuracy (only 60% precision at 40%-80% recall). One of the key problems is the low availability of mispronounced speech that is needed for the reliable training of pronunciation error detection models. If we had a generative model that could mimic non-native speech and produce any amount of training data, then the task of detecting pronunciation errors would be much easier. We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion to generate correctly pronounced and mispronounced synthetic speech. We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field. Earlier studies have used simple speech generation techniques such as P2P conversion, but only as an additional mechanism to improve the accuracy of pronunciation error detection. We, on the other hand, consider speech generation to be the first-class method of detecting pronunciation errors. The effectiveness of these techniques is assessed in the tasks of detecting pronunciation and lexical stress errors. Non-native English speech corpora of German, Italian, and Polish speakers are used in the evaluations. The best proposed S2S technique improves the accuracy of detecting pronunciation errors in AUC metric by 41% from 0.528 to 0.749 compared to the state-of-the-art approach.


Introduction
Language plays a key role in online education, giving people access to large amounts of information contained in articles, books, and video lectures. Thanks to spoken language and other forms of communication, such as sign language, people can participate in interactive discussions with teachers and take part in lively brainstorming with other people. Unfortunately, education is not available to everybody. According to the UNESCO report, 40% of the global population do not have access to education in the language they understand [1]. 'If you don't understand, how can you learn?' the report says. English is the leading language on the Internet, representing 25.9% of the world's population [2]. Regrettably, research by EF (Education First) [3] shows a large disproportion in English proficiency across countries and continents. People from regions of 'very low' language proficiency, such as the Middle East, are unable to navigate through English-based websites or communicate with people from an English-speaking country.
Computer-Assisted Language Learning (CALL) helps to improve the English language proficiency of people in different regions [4]. CALL relies on computerized self-service tools that are used by students to practice a language, usually a foreign language, also known as a non-native (L2) language. Students can practice multiple aspects of the language, including grammar, vocabulary, writing, reading, and speaking. Computer-based tools can also be used to measure students' language skills and their learning potential by using the Computerized Dynamic Assessment (C-DA) test [5]. CALL can complement traditional language learning provided by teachers. It also has a chance to make second language learning more accessible in scenarios where traditional ways of learning languages are not possible due to the cost of learning or the lack of access to foreign language teachers.
Computer-Assisted Pronunciation Training (CAPT) is a part of CALL responsible for learning pronunciation skills. It has been shown to help people practice and improve their pronunciation skills [6][7][8]. CAPT consists of two components: an automated pronunciation evaluation component [9][10][11] and a feedback component [12]. The automated pronunciation evaluation component is responsible for detecting pronunciation errors in spoken speech, for example, for detecting words pronounced incorrectly by the speaker. The feedback component informs the speaker about mispronounced words and advises how to pronounce them correctly. This article is devoted to the topic of automated detection of pronunciation errors in non-native speech. This area of CAPT can take advantage of technological advances in machine learning and bring us closer to creating a fully automated assistant based on artificial intelligence for language learning.
The research community has long studied the automated detection of pronunciation errors in non-native speech. Existing work has focused on various tasks such as detecting mispronounced phonemes [9] and lexical stress errors [13]. Researchers have given most attention to studying various machine learning models such as Bayesian networks [14,15] and deep learning methods [9,10], as well as analyzing different representations of the speech signal such as prosodic features (duration, energy and pitch) [16], and cepstral/spectral features [9,13,17]. Despite significant progress in recent years, existing CAPT methods detect pronunciation errors with relatively low accuracy of 60% precision at 40%-80% recall [9][10][11]. Highlighting correctly pronounced words as pronunciation errors by the CAPT tool can demotivate students and lower the confidence in the tool. Likewise, missing pronunciation errors can slow down the learning process.
One of the main challenges with the existing CAPT methods is poor availability of mispronounced speech, which is required for the reliable training of pronunciation error detection models. We propose a reformulation of the problem of pronunciation error detection as a task of synthetic speech generation. Intuitively, if we had a generative model that could mimic mispronounced speech and produce any amount of training data, then the task of detecting pronunciation errors would be much easier. The probability of pronunciation errors for all the words in a sentence can then be calculated using the Bayes rule [18]. In this new formulation, we move the complexity to learning the speech generation process that is well suited to the problem of limited speech availability [19][20][21]. The proposed method outperforms the state-of-the-art model [9] in detecting pronunciation errors in AUC metric by 41% from 0.528 to 0.749 on the GUT Isle Corpus of L2 Polish speakers.
To put the new formulation of the problem into action, we propose three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion to generate correctly pronounced and mispronounced synthetic speech. We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field. The effectiveness of these techniques is assessed in two tasks: detecting mispronounced words (replacing, adding, removing phonemes, or pronouncing an unknown speech sound) and detecting lexical stress errors. The results presented in this study are the culmination of our recent work on speech generation in pronunciation error detection task [11,22,23], including a new S2S technique.
In short, the contributions of the paper are as follows:
• A new paradigm for the automated detection of pronunciation errors is proposed, reformulating the problem as a task of generating synthetic speech.
• A unified probabilistic view on P2P, T2S, and S2S techniques is presented in the context of detecting pronunciation errors.
• A new S2S method to generate synthetic speech is proposed, which outperforms the state-of-the-art model [9] in detecting pronunciation errors.
• Comprehensive experiments are described to demonstrate the effectiveness of speech generation in the tasks of pronunciation and lexical stress error detection.
The outline of the rest of this paper is: Section 2 presents related work. Section 3 describes the proposed methods of generating synthetic speech for automatic detection of pronunciation errors. Section 4 describes the human speech corpora used to train the pronunciation error detection models in the experiments. Section 5 presents experiments demonstrating the effectiveness of various synthetic speech generation methods in improving the accuracy of the detection of pronunciation and lexical stress errors. Finally, conclusions and future work are presented in Section 6.

Phoneme recognition approaches
Most existing CAPT methods are designed to recognize the phonemes pronounced by the speaker and compare them with the expected (canonical) pronunciation of correctly pronounced speech [9,14,24,25]. Any discrepancy between the recognized and canonical phonemes results in a pronunciation error at the phoneme level. Phoneme recognition approaches generally fall into two categories: methods that align a speech signal with phonemes (forced-alignment techniques) and methods that first recognize the phonemes in the speech signal and then align the recognized and canonical phoneme sequences. Aside from these two categories, CAPT methods can be split into multiple other categories.
Forced-alignment techniques [15,24,25,26] are based on the work of Franco et al. [27] and the Goodness of Pronunciation (GoP) method [14]. In the first step, GoP uses Bayesian inference to find the most likely alignment between canonical phonemes and the corresponding audio signal (forced alignment). In the next step, GoP calculates the ratio between the likelihoods of the canonical and the most likely pronounced phonemes. Finally, it detects mispronunciation if the ratio drops below a certain threshold. GoP has been further extended with Deep Neural Networks (DNNs), replacing the Hidden Markov Model (HMM) and Gaussian Mixture Model (GMM) techniques for acoustic modeling [24,25]. Cheng et al. [26] improve GoP performance with the hidden representation of speech extracted in an unsupervised way. This model can detect pronunciation errors based on the input speech signal and the reference canonical speech signal, without using any linguistic information such as text and phonemes.
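The last two GoP steps described above (likelihood ratio, then thresholding) can be illustrated with a minimal sketch. It assumes that forced alignment has already produced one row of phoneme log-likelihoods per aligned segment; the function names, the toy values, and the threshold of -1.0 are assumptions made for this illustration, and the duration normalization used in practical GoP variants is omitted.

```python
import numpy as np

def gop_scores(log_likelihoods, canonical_ids):
    """Log-ratio between the canonical phoneme and the most likely phoneme.

    log_likelihoods: array of shape (num_segments, num_phonemes), one row per
        force-aligned segment (hypothetical input, assumed precomputed).
    canonical_ids: index of the canonical phoneme for each segment.
    """
    ll = np.asarray(log_likelihoods, dtype=float)
    ids = np.asarray(canonical_ids)
    canonical_ll = ll[np.arange(len(ids)), ids]
    best_ll = ll.max(axis=1)
    # The score is 0 when the canonical phoneme is also the most likely one,
    # and negative otherwise.
    return canonical_ll - best_ll

def detect_mispronunciations(log_likelihoods, canonical_ids, threshold=-1.0):
    """Flag a segment as mispronounced when its score drops below the threshold."""
    return gop_scores(log_likelihoods, canonical_ids) < threshold

# Toy example with three phonemes and two aligned segments.
toy_ll = np.log([[0.7, 0.2, 0.1],
                 [0.1, 0.1, 0.8]])
flags = detect_mispronunciations(toy_ll, canonical_ids=[0, 0])
print(flags.tolist())
```

In this toy case the first segment matches its canonical phoneme, while the second is dominated by a different phoneme and falls below the threshold.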
The methods that do not use forced-alignment recognize the phonemes pronounced by the speaker purely from the speech signal and only then align them with the canonical phonemes [28][29][30][31][32][33]. Leung et al. [9] use a phoneme recognizer that recognizes phonemes only from the speech signal. The phoneme recognizer is based on a Convolutional Neural Network (CNN), a Gated Recurrent Unit (GRU), and the Connectionist Temporal Classification (CTC) loss. Leung et al. report that it outperforms other forced-alignment [24] and forced-alignment-free [29] techniques in the task of detecting mispronunciations at the phoneme-level in L2 English.
There are two challenges with the presented approaches for pronunciation error detection. First, phonemes pronounced by the speaker must be recognized accurately, which has proven difficult [10,34,35,36]. Phoneme recognition is difficult, especially in non-native speech, as different languages have different phoneme spaces. Second, standard approaches assume only one canonical pronunciation of a given text, but this assumption is not always true due to the phonetic variability of speech, e.g., differences between regional accents. For example, the word 'enough' can be pronounced by native speakers in multiple ways: /ih n ah f/ or /ax n ah f/ (short 'i' or 'schwa' phoneme at the beginning). In our previous work, we solve these problems by creating a native speech pronunciation model that returns the probability of the sentence to be spoken by a native speaker [11].
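The second step of forced-alignment-free methods, aligning recognized phonemes with canonical ones, can be sketched as follows. This is only an illustration: it uses Python's standard `difflib.SequenceMatcher` as a stand-in for the alignment algorithms used in the cited works, and the function name and the example phoneme sequences are assumptions.

```python
from difflib import SequenceMatcher

def phoneme_discrepancies(canonical, recognized):
    """Align two phoneme sequences and list the places where they differ.

    Returns tuples (operation, canonical_segment, recognized_segment), where
    operation is 'replace', 'delete' (phoneme missing in the recognized
    sequence), or 'insert' (extra phoneme in the recognized sequence).
    """
    matcher = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    errors = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            errors.append((op, canonical[i1:i2], recognized[j1:j2]))
    return errors

canonical = "r ih m ay n d".split()
recognized = "r iy m ay n".split()  # hypothetical recognizer output
print(phoneme_discrepancies(canonical, recognized))
```

Each reported discrepancy corresponds to a candidate pronunciation error at the phoneme level.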
Techniques based on phoneme recognition can be supplemented by a reference speech signal obtained from the speech database [37][38][39] or generated from the phonetic representation [11,40]. Xiao et al. [37] use a pair of speech signals from a student and a native speaker to classify native and non-native speech. Mauro et al. [38] use the speech of the reference speaker to detect mispronunciation errors at the phoneme level. Wang et al. [39] use Siamese networks to model the discrepancy between normal and distorted children's speech. Qian et al. [40] propose a statistical model of pronunciation in which they build a model that generates hypotheses of mispronounced speech.
In this work, we use the end-to-end method to detect pronunciation errors directly, without having to recognize phonemes as an intermediate step. The end-to-end approach is discussed in more detail in the next section.

End-to-end methods
The phoneme recognition approaches presented so far rely on phonetically transcribed speech labeled by human listeners. Phonetic transcriptions are needed to train a phoneme recognition model. Human-based transcription is a time-consuming task, especially with L2 speech, where listeners need to recognize mispronunciation errors. Sometimes L2 speech transcription may even be impossible because different languages have different phoneme sets, and it is unclear which phonemes were pronounced by the speaker. In our recent work, we have introduced a novel model (known as WEAKLY-S, i.e., weakly supervised) for detecting pronunciation errors at the word level that does not require phonetically transcribed L2 speech [22]. During training, the model is weakly supervised, in the sense that in L2 speech, only mispronounced words are marked, and the data do not need to be phonetically transcribed. In addition to the primary task of detecting mispronunciation errors at the word level, the second task uses a phoneme recognizer trained on automatically transcribed L1 speech.
Zhang et al. [10] employ a multi-task model with two tasks: phoneme-recognition and pronunciation error detection tasks. Unlike our WEAKLY-S model, they use the Needleman-Wunsch algorithm [41] from bioinformatics to align the canonical and recognized phoneme sequences, but this algorithm cannot be tuned to detect pronunciation errors. The WEAKLY-S model automatically learns the alignment, thus eliminating a potential source of inaccuracy. The alignment is learned through an attention mechanism that automatically maps the speech signal to a sequence of pronunciation errors at the word level. Tong et al. [39] propose to use a multi-task framework in which a neural network model is used to learn the joint space between the acoustic characteristics of adults and children. Additionally, Duan et al. [42] propose a multitask model for acoustical modeling with two tasks for native and non-native speech respectively.
The work of Zhang et al. [10] and our recent work [22] are end-to-end methods of direct estimation of pronunciation errors, setting up a new trend in the field of automated pronunciation assessment. In this article, we use the end-to-end method as well, but we extend it by the S2S method of generating mispronounced speech.

Other trends
All the works presented so far treat pronunciation errors as discrete categories, at best producing the probability of mispronunciation. In contrast, Bi-Cheng et al. [43] propose a model capable of identifying phoneme distortions, giving the user more detailed feedback on mispronunciation. In our recent work, we provide more fine-grained feedback by indicating the severity level of mispronunciation [22].
Active research is conducted not only on modelling techniques but also on speech representation. Xu et al. [44] and Peng et al. [45] use the Wav2vec 2.0 speech representation that is created in an unsupervised way. They report that it outperforms existing methods and requires three times less speech training data. Lin et al. [46] use transfer learning by taking advantage of deep latent features extracted from the Automated Speech Recognition (ASR) acoustic model and report improvements over the classic GoP-based method.
In this work, we use a mel-spectrogram as a speech representation in the pronunciation error detection model. We also use a mel-spectrogram to represent the speech signal in the T2S and S2S methods of generating mispronounced speech.

Lexical stress error detection
CAPT usually focuses on practicing the pronunciation of phonemes [9,11,14]. However, there is evidence that practicing lexical stress improves the intelligibility of non-native English speech [47,48]. Lexical stress is a phonological feature of a syllable. It is part of the phonological rules that govern how words should be pronounced in a given language. Stressed syllables are usually longer, louder, and expressed with a higher pitch than their unstressed counterparts [49]. The lexical stress is related to the phonemic representation. For example, placing lexical stress on a different syllable of a word can lead to various phonemic realizations known as 'vowel reduction' [50]. Students should be able to practice both pronunciation and lexical stress in spoken language. We study both topics to better understand the potential of using speech generation methods in CAPT.
The existing works focus on the supervised classification of lexical stress using Neural Networks [17,51], Support Vector Machines [16,52], and Fisher's linear discriminant [53]. There are two popular variants: a) discriminating syllables between primary stress/no stress [13], and b) classifying between primary stress/secondary stress/no stress [51,54]. Ramanathi et al. [55] have followed an alternative unsupervised way of classifying lexical stress, which is based on computing the likelihood of an acoustic signal for a number of possible lexical stress representations of a word.
Accuracy is the most commonly used performance metric, and it indicates the ratio of correctly classified stress patterns on a syllable [54] or word level [16]. In contrast, Ferrer et al. [13] analyzed the precision and recall metrics to detect lexical stress errors and not just classify them.
Most existing approaches for the classification and detection of lexical stress errors are based on carefully designed features. They start with aligning a speech signal with phonetic transcription, performed via forced-alignment [16,17]. Alternatively, ASR can provide both phonetic transcription and its alignment with a speech signal [54]. Then, prosodic features such as duration, energy and pitch [16] and cepstral features such as Mel Frequency Cepstral Coefficients (MFCC) and Mel-Spectrogram [13,17] are extracted. These features can be extracted on the syllable [17] or syllable nucleus [13,16] level. Shahin et al. [17] compute features of neighboring vowels, and Li et al. [54] include the features for two preceding and two following syllables in the model. The features are often preprocessed and normalized to avoid potential confounding variables [13], and to achieve better model generalization by normalizing the duration and pitch on a word level [13,53]. Li et al. [51] add canonical lexical stress to input features, which improves the accuracy of the model.
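The difference between the two evaluation views can be made concrete with a small computation. The sketch below assumes binary labels in which 1 marks a lexical stress error; the function name and the toy labels are assumptions made for this illustration and do not come from the cited works.

```python
def detection_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary error detection (1 = error)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Precision and recall are undefined when their denominators are zero;
    # 0.0 is returned in that case as a convention for this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Errors are rare: only two of six words carry a stress error.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
accuracy, precision, recall = detection_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Here accuracy is about 0.67 while precision and recall are both 0.5, which shows how accuracy can look favorable when errors are rare.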
In our recent work, we use attention mechanisms to automatically derive areas of the audio signal that are important for the detection of lexical stress errors [23]. In this work, we use the T2S method to generate synthetic lexical stress errors to improve the accuracy of detecting lexical stress errors.

Synthetic speech generation for pronunciation error detection
Existing synthetic speech generation techniques for detecting pronunciation errors can be divided into two categories: data augmentation and data generation.
Data augmentation techniques are designed to generate new training examples for existing mispronunciation labels. Badenhorst et al. [56] simulate new speakers by adjusting the speed of raw audio signals. Eklund [57] generates additional training data by adding background noise and convolving the audio signal with the impulse responses of the microphone of a mobile device and a room.
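The augmentation operations mentioned above (speed change, additive noise, and convolution with an impulse response) can be sketched with NumPy. This is a simplified illustration and not the implementation used in the cited works: the function names are assumptions, linear interpolation is used as a crude resampler, and white Gaussian noise stands in for recorded background noise.

```python
import numpy as np

def change_speed(signal, factor):
    """Resample by linear interpolation; factor > 1 gives faster, shorter audio."""
    n_out = int(round(len(signal) / factor))
    old_positions = np.arange(len(signal))
    new_positions = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_positions, old_positions, signal)

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def apply_impulse_response(signal, impulse_response):
    """Convolve with a room or microphone impulse response, keeping the length."""
    return np.convolve(signal, impulse_response, mode="full")[: len(signal)]

rng = np.random.default_rng(0)
time = np.arange(16000) / 16000.0           # one second at 16 kHz
clean = np.sin(2 * np.pi * 220.0 * time)    # toy stand-in for a speech signal
faster = change_speed(clean, factor=1.25)
noisy = add_noise(clean, snr_db=20.0, rng=rng)
reverberant = apply_impulse_response(clean, np.array([1.0, 0.0, 0.3]))
```

All three operations change the acoustic realization only, so the mispronunciation labels of the original recording are kept unchanged.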
Data generation techniques are designed to generate new training data with new labels of both correctly pronounced and mispronounced speech. Most existing works are based on the P2P technique to generate mispronounced speech by perturbing the phoneme sequence of the corresponding audio using a variety of strategies [11,58,59,60,61]. In addition to P2P techniques, in our recent work, we use T2S to generate synthetic lexical stress errors [22]. Qian et al. [40] introduce a generative model to create hypotheses of mispronounced speech and use it as a reference speech signal to detect pronunciation errors. Recently, we proposed a similar technique to create a pronunciation model of native speech to account for many ways of correctly pronouncing a sentence by a native speaker [11].
Synthetic speech generation techniques have recently gained attention in other related fields. Fazel et al. [21] use synthetic speech generated with T2S to improve accuracy in ASR. Huang et al. [62] use a machine translation technique to generate text to train an ASR language model in a low-resource language. At the same time, Shah et al. [20] and Huybrechts et al. [19] employ S2S voice conversion to improve the quality of speech synthesis in the data reduction scenario.
All the presented works on the detection of pronunciation errors treat synthetic speech generation as a secondary contribution. In this article, we present a unified perspective of synthetic speech generation methods for detecting pronunciation errors. This article extends our previous work [11,22,23] and introduces a new S2S method to detect pronunciation errors. To the best of our knowledge, there are no papers devoted to generating pronunciation errors with the S2S technique and using it in the detection of pronunciation errors.

Methods of generating pronunciation errors
To detect pronunciation errors, first, the spoken language must be separated from other factors in the signal and then incorrectly pronounced speech sounds have to be identified. Separating speech into multiple factors is difficult, as speech is a complex signal. It consists of prosody (F0, duration, energy), timbre of the voice, and the representation of the spoken language. Spoken language is defined by the sounds (phones) perceived by people. Phones are the realizations of phonemes, a human abstract representation of how to pronounce a word or sentence. Speech may also present variability due to the recording channel and environmental effects such as noise and reverberation. Detecting pronunciation errors is also very challenging because of the limited number of recordings with mispronounced speech. To address these challenges, we reformulate the problem of pronunciation error detection as the task of synthetic speech generation.
Let s be the speech signal, r be the sequence of phonemes that the user is trying to pronounce (canonical pronunciation), and e be the sequence of probabilities of mispronunciation at the phoneme or word level. The original task of detecting pronunciation errors is defined by:
$$e \sim p(e \mid s, r) \qquad (1)$$
whereas the formulation of the problem as the task of synthetic speech generation is defined as follows:
$$s \sim p(s \mid e, r) \qquad (2)$$
The probability of pronunciation errors for all the words in a sentence can then be calculated using the Bayes rule [18]:
$$p(e \mid s, r) = \frac{p(s \mid e, r)\, p(e \mid r)}{p(s \mid r)} \qquad (3)$$
From Equation 3, one can see that there is no need to directly learn the probability of pronunciation errors p(e|s, r), since the complexity of the problem has now been transferred to learning the speech generation process p(s|e, r). Such a formulation of the problem opens the way to the inclusion of additional prior knowledge into the model:
(1) Replacing the phoneme in a word while preserving the original speech signal results in a pronunciation error (P2P method).
(2) Changing the speech signal while retaining the original pronunciation results in a pronunciation error (T2S method).
(3) There are many variations of mispronounced speech that differ in terms of the voice timbre and the prosodic aspects of speech (S2S method).
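To make the Bayes rule concrete, consider a toy case with a single word whose error variable is binary. The sketch below is illustrative only: the function name and the numerical values are assumptions, and the likelihoods stand in for the output of a speech generation model p(s|e, r).

```python
def posterior_error_probability(likelihood_err, likelihood_ok, prior_err):
    """Bayes rule for a binary error variable e.

    likelihood_err: p(s | e = 1, r), likelihood of the speech given an error
    likelihood_ok:  p(s | e = 0, r), likelihood of the speech given no error
    prior_err:      p(e = 1 | r), prior probability of an error
    """
    # The evidence p(s | r) is obtained by summing over both values of e.
    evidence = likelihood_err * prior_err + likelihood_ok * (1.0 - prior_err)
    return likelihood_err * prior_err / evidence

# Hypothetical values: the observed speech is six times more likely under
# the "mispronounced" hypothesis, but errors are a priori rare (20%).
posterior = posterior_error_probability(0.6, 0.1, 0.2)
print(posterior)
```

With these numbers the posterior is 0.12 / (0.12 + 0.08) = 0.6, so a strong likelihood under the error hypothesis outweighs the low prior.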
To solve Equation 3, we use Markov Chain Monte Carlo Sampling (MCMC) [63]. In this way, the prior knowledge can be incorporated by generating N training examples {e_i, s_i, r_i} for i = 1..N with the use of P2P (prior knowledge 1), T2S (prior knowledge 2), and S2S (prior knowledge 3) methods. Accounting for the prior knowledge intuitively corresponds to an increase in the amount of training data, which contributes to outperforming state-of-the-art models for detecting pronunciation errors, as presented in Section 5. Equation 3 can then be optimized with standard gradient-based optimization techniques. In the following subsections, we present the P2P conversion, T2S, and S2S methods of generating correctly and incorrectly pronounced speech in detail.

P2P method
To generate synthetic mispronounced speech, it is enough to start with correctly pronounced speech and modify the corresponding sequence of phonemes. This simple idea does not even require generating the speech signal itself. It can be observed that the probability of mispronunciations depends on the discrepancy between the speech signal and the corresponding canonical pronunciation. This leads to the P2P conversion model shown in Figure 1a.
Let {e_noerr, s, r} be a single training example containing: the sequence of 0s denoting correctly pronounced phonemes, the speech signal, and the sequence of phonemes representing the canonical pronunciation. Let r' be the sequence of phonemes with injected mispronunciations such as phoneme replacements, insertions, and deletions:
$$r' \sim p(r' \mid r) \qquad (4)$$
then the probability of mispronunciation for the j-th phoneme is defined by:
$$e_j = \begin{cases} 1 & \text{if } r'_j \neq r_j \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
The probabilities of mispronunciation can be projected from the level of phonemes to the level of words. A word is treated as mispronounced if at least one pair of phonemes in the word {r'_j, r_j} does not match. At the end of this process, a new training example is created with artificially introduced pronunciation errors: {e_err, s, r'}. Note that the speech signal s in the new training example is unchanged from the original training example, and only phoneme transcription is manipulated.
Figure 1. Probabilistic graphical models for three methods to generate pronunciation errors: P2P, T2S and S2S. Empty circles represent hidden (latent) variables, while filled (blue) circles represent observed variables. s: the speech signal, r: the sequence of phonemes that the user is trying to pronounce (canonical pronunciation), the superscript ' represents a variable with generated mispronunciations.

Implementation
To generate synthetic pronunciation errors, we use a simple approach of perturbing the phonetic transcription of the corresponding speech audio. First, we sample utterances with replacement from the input corpora of human speech. Then, for each utterance, we replace the phonemes with random phonemes with a given probability.
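The two steps above, together with the projection of errors to the word level, can be sketched as follows. This is a minimal illustration under stated assumptions: the toy phoneme inventory, the dictionary layout of an utterance, and all function names are hypothetical, and only phoneme replacements are modeled.

```python
import random

# Toy phoneme inventory (assumption); a real system would use the full set.
PHONEMES = ["aa", "ae", "ah", "ay", "d", "ih", "iy", "m", "n", "r"]

def perturb_word(phonemes, replace_prob, rng):
    """Replace each phoneme with a different random phoneme with a given probability."""
    perturbed = []
    for phoneme in phonemes:
        if rng.random() < replace_prob:
            perturbed.append(rng.choice([p for p in PHONEMES if p != phoneme]))
        else:
            perturbed.append(phoneme)
    return perturbed

def generate_p2p_examples(corpus, n_examples, replace_prob, seed=0):
    """Sample utterances with replacement and perturb their transcriptions.

    corpus: list of dicts {"speech": ..., "words": [[phoneme, ...], ...]}.
    The speech signal is copied unchanged; only the phonemes are modified.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        utterance = rng.choice(corpus)  # sampling with replacement
        perturbed = [perturb_word(w, replace_prob, rng) for w in utterance["words"]]
        # A word is marked as mispronounced if any of its phonemes changed.
        errors = [int(orig != new) for orig, new in zip(utterance["words"], perturbed)]
        examples.append({"speech": utterance["speech"],
                         "phonemes": perturbed,
                         "errors": errors})
    return examples

corpus = [{"speech": "utt-0.wav",
           "words": [["r", "ih", "m", "ay", "n", "d"], ["m", "iy"]]}]
examples = generate_p2p_examples(corpus, n_examples=3, replace_prob=0.2)
```

The replacement probability controls the share of mispronounced words in the generated data and would need to be tuned for a given corpus.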

T2S method
The T2S method expands on P2P by making it possible to create speech signals that match the synthetic mispronunciations. The T2S method for generating mispronounced speech is a generalization of the P2P method, as can be seen by the comparison of the two methods shown in Figures 1a and 1b. One problem with the P2P method is that it cannot generate a speech signal for the newly created sequence of phonemes r'. As a result, pronunciation errors will dominate in the training data containing new sequences of phonemes r'. Therefore, it would be possible to detect pronunciation errors from the canonical representation r' alone, ignoring information contained in the speech signal. To mitigate this issue, there should be two training examples for the phonemes r', one representing mispronounced speech: {e_err, s, r'}, and the second one for correct pronunciation: {e_noerr, s', r'}, where:
$$s' \sim p(s' \mid e_{noerr}, r') \qquad (6)$$
Because we now have the speech signal s', another training example can be created as: {e_err, s', r}. In summary, the T2S method extends a single training example of correctly pronounced speech to four combinations of correct and incorrect pronunciations:
• {e_noerr, s, r}: correctly pronounced input speech
• {e_err, s, r'}: mispronounced speech generated by the P2P method
• {e_noerr, s', r'}: correctly pronounced speech generated by the T2S method
• {e_err, s', r}: mispronounced speech generated by the T2S method
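The expansion of one training example into four can be written down directly. In the sketch below, `synthesize` is a hypothetical callable standing in for a text-to-speech model that maps a phoneme sequence to a speech signal; the function name and the dictionary layout are assumptions made for this illustration.

```python
def expand_with_t2s(speech, canonical, perturbed, synthesize):
    """Build the four training examples from one correctly pronounced utterance.

    speech:     original speech signal s
    canonical:  canonical phonemes r
    perturbed:  phonemes with injected mispronunciations r'
    synthesize: hypothetical T2S function returning speech s' for r'
    """
    synthetic_speech = synthesize(perturbed)
    return [
        {"error": 0, "speech": speech, "phonemes": canonical},            # input speech
        {"error": 1, "speech": speech, "phonemes": perturbed},            # P2P
        {"error": 0, "speech": synthetic_speech, "phonemes": perturbed},  # T2S, correct
        {"error": 1, "speech": synthetic_speech, "phonemes": canonical},  # T2S, error
    ]

# A fake synthesizer is used here only to keep the sketch self-contained.
fake_synthesize = lambda phonemes: "audio(" + " ".join(phonemes) + ")"
quadruple = expand_with_t2s("utt-0.wav",
                            ["m", "iy"], ["m", "ay"],
                            fake_synthesize)
```

Each phoneme sequence now appears with both labels, so a detector cannot succeed by looking at the phonemes alone.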

Implementation
The synthetic speech is generated with the Neural TTS described by Latorre et al. [64]. The Neural TTS consists of two modules. The context-generation module is an attention-based encoder-decoder neural network that generates a mel-spectrogram from a sequence of phonemes. The Neural Vocoder then converts it into a speech signal. The Neural Vocoder is a neural network of architecture similar to Parallel Wavenet [65]. The Neural TTS is trained using the speech of a single native speaker. To generate words with different lexical stress patterns, we modify the lexical stress markers associated with the vowels in the phonetic transcription of the word. For example, with the input of /r iy1 m ay0 n d/ we can place lexical stress on the first syllable of the word 'remind'.
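The manipulation of lexical stress markers can be sketched as follows. The sketch assumes a phonetic transcription in which vowels carry a trailing stress digit (1 for primary stress, 0 for no stress), as in the 'remind' example; the function name is an assumption, and secondary stress is not handled.

```python
import re

def place_primary_stress(phonemes, stressed_syllable):
    """Put primary stress on the chosen vowel and remove stress from the others.

    phonemes: list of phoneme symbols where vowels end with a stress digit,
        e.g. ["r", "iy0", "m", "ay1", "n", "d"].
    stressed_syllable: zero-based index of the vowel that receives stress.
    """
    result = []
    vowel_index = 0
    for phoneme in phonemes:
        match = re.fullmatch(r"([a-z]+)([0-2])", phoneme)
        if match:
            digit = "1" if vowel_index == stressed_syllable else "0"
            result.append(match.group(1) + digit)
            vowel_index += 1
        else:
            result.append(phoneme)  # consonants carry no stress marker
    return result

correct = "r iy0 m ay1 n d".split()          # 'remind', stress on the second syllable
stress_error = place_primary_stress(correct, stressed_syllable=0)
print(" ".join(stress_error))
```

The resulting transcription can then be passed to the speech synthesizer to obtain a word with incorrectly placed lexical stress.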

S2S method
The S2S method is designed to simulate the diverse nature of speech, as there are many ways to correctly pronounce a sentence. The prosodic aspects of speech, such as pitch, duration, and energy, can vary. Similarly, phonemes can be pronounced differently. To mimic human speech, speech generation techniques should allow a similar level of variability. The T2S method outlined in the previous section always produces the same output for the same phoneme input sequence. The S2S method is designed to overcome this limitation.
S2S converts the input speech signal s in a way to change the pronounced phonemes (phoneme replacements, insertions, and deletions) from the input phonemes r to target phonemes r' while preserving other aspects of speech, including voice timbre and prosody (Equation 7 and Figure 1c). In this way, the natural variability of human speech is preserved, resulting in generating many variations of incorrectly pronounced speech. The prosody will differ in various versions of the sentence of the same speaker, while the same sentence spoken by many speakers will differ in the voice timbre.
$$s' \sim p(s' \mid e_{noerr}, r', s) \qquad (7)$$
Similarly to the T2S method, the S2S method outputs four types of speech pronounced correctly and incorrectly: {e_noerr, s, r}, {e_err, s, r'}, {e_noerr, s', r'}, and {e_err, s', r}.

Implementation
Synthetic speech is generated by introducing mispronunciations into the input speech, while preserving the duration of the phonemes and timbre of the voice. The architecture of the S2S model is shown in Figure 2. The mel-spectrogram of the input speech signal s is forced-aligned with the corresponding canonical phonemes r to get the duration of the phonemes. The speaker id has to be provided together with the input speech to enable the source speaker's voice to be maintained. Mispronunciations are introduced into the canonical phonemes r according to the P2P method described in Section 3.1. Mispronounced phonemes r' along with phoneme durations and speaker id are processed by the encoder-decoder, which generates the mel-spectrogram s'. The encoder-decoder transforms the phoneme-level representation into frame-level features and then generates all mel-spectrogram frames in parallel. The mel-spectrogram is converted to an audio signal with Universal Vocoder [66]. Without the Universal Vocoder, it would not be possible to generate the raw audio signal for hundreds of speakers included in the LibriTTS corpus. Details of the S2S method are shown in the works of Shah et al. [20] and Jiao et al. [66]. The main difference between these two models and our S2S model is the use of the P2P mapping to introduce pronunciation errors.
Figure 2. Architecture of the S2S model to generate mispronounced synthetic speech while maintaining prosody and voice timbre of the input speech. The black rectangles represent the data (tensors) and the orange boxes represent processing blocks. This color notation is used in all machine learning model diagrams throughout the article.
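The step that turns a phoneme-level representation into frame-level features can be illustrated with a simple upsampling sketch. The text does not specify how this transformation is implemented inside the encoder-decoder; repeating each phoneme vector for as many frames as its forced-aligned duration is an assumption made here for illustration, as are the function name and the toy values.

```python
import numpy as np

def expand_to_frames(phoneme_features, durations):
    """Repeat each phoneme-level feature vector for its duration in frames.

    phoneme_features: array of shape (num_phonemes, feature_dim)
    durations: number of mel-spectrogram frames per phoneme, taken from the
        forced alignment of the input speech
    """
    features = np.asarray(phoneme_features)
    frames_per_phoneme = np.asarray(durations, dtype=int)
    if len(features) != len(frames_per_phoneme):
        raise ValueError("one duration is required per phoneme")
    return np.repeat(features, frames_per_phoneme, axis=0)

# Three phonemes with 2-dimensional toy encodings and their durations.
encodings = np.array([[0.1, 0.2],
                      [0.3, 0.4],
                      [0.5, 0.6]])
frame_features = expand_to_frames(encodings, durations=[2, 1, 3])
print(frame_features.shape)
```

Because the durations come from the input speech, a replaced phoneme occupies the same number of frames as the phoneme it replaces, which is one way to preserve the timing of the original utterance.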

Summary of mispronounced speech generation
Both the generation of synthetic mispronounced speech and the detection of pronunciation errors were presented from the probabilistic perspective of Bayes' rule. With this formulation, we can better understand the relationship between the P2P, T2S and S2S methods, and see that the S2S method generalizes the two simpler methods. Following this reasoning, Bayes' rule gives us a mathematical framework in which the S2S method could be generalized further, e.g. by adding a language variable to the model to support multilingual pronunciation error detection. Modelling pronunciation error detection from the probabilistic perspective has another advantage: it paves the way for joint training of mispronounced speech generation and pronunciation error detection models. In the present work, we train separate machine learning models for the two tasks, but it should be possible to train both models jointly using the framework of Variational Inference [67] instead of MCMC to infer the probability of mispronunciation in Equation 3.

Corpora of continuous speech
The corpus of recorded sentences combines L1 and L2 English speech. L1 speech is obtained from the TIMIT [68] and LibriTTS [69] corpora. L2 speech comes from the Isle corpus [70] (German and Italian speakers) and the GUT Isle corpus [71] (Polish speakers). In total, we used 125.28 hours of L1 and L2 English speech from 983 speakers, segmented into 102,812 sentences. A summary of the speech corpora is presented in Table 1, and the details are given in our recent work [22].
The speech data are used in all the pronunciation error detection experiments presented in Section 5. From the collected speech, we held out 28 L2 speakers and used them only to assess the performance of the systems in the mispronunciation detection task. The held-out set includes 11 Italian and 11 German speakers from the Isle corpus [70], and 6 Polish speakers from the GUT Isle corpus [71]. The human speech training data is extended with synthetic pronunciation errors generated by the methods presented in Section 3.
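Holding out whole speakers, rather than individual utterances, guarantees that no test speaker is heard during training. A minimal sketch of such a split is shown below; the record layout (`speaker` and `text` keys) and the speaker ids are hypothetical and do not reflect the actual corpus format.

```python
def split_by_speaker(utterances, held_out_speakers):
    """Split utterances so that held-out speakers appear only in the test set."""
    held_out = set(held_out_speakers)
    train = [u for u in utterances if u["speaker"] not in held_out]
    test = [u for u in utterances if u["speaker"] in held_out]
    return train, test


# Toy records with made-up speaker ids.
utterances = [
    {"speaker": "l1_001", "text": "she had your dark suit"},
    {"speaker": "pl_01", "text": "i said white not bait"},
    {"speaker": "pl_02", "text": "i said white not bait"},
    {"speaker": "pl_01", "text": "could i have chicken soup"},
]
train_set, test_set = split_by_speaker(utterances, held_out_speakers=["pl_01"])
```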

Corpora of isolated words
The speech corpora consist of human and synthetic speech. The data were divided into training and testing sets, with separate speakers assigned to each set. Human speech includes native (L1) and non-native (L2) English speech. L1 speech corpora are made of TIMIT [68] and Arctic [72]. L2 corpora contain speech from L2-Arctic [32], Porzuczek [73], and our own recordings of 25 speakers (23 Polish, 1 Ukrainian and 1 Lithuanian). The synthetic data were generated using the T2S method and are only included in the training set. The data are summarized in Table 2. For a more detailed description of speech corpora, see Section 4 of our recent work [23]. The speech corpora of isolated words are used in the lexical stress error detection experiment presented in Section 5.3.

Experimental setup
The effect of using synthetic pronunciation errors based on the P2P, T2S and S2S methods is evaluated in the task of detecting pronunciation errors in spoken sentences at the word level. First, we analyze the P2P method by comparing it with state-of-the-art techniques and measure the effect of adding synthetic pronunciation errors to the training data. We then compare P2P with T2S and S2S to assess the benefits of using more complex methods of generating pronunciation errors. The accuracy of detecting pronunciation errors is reported using the standard Area Under the Curve (AUC), precision and recall metrics.

Overview of our WEAKLY-S model
We use the pronunciation error detection model (WEAKLY-S) recently proposed by us [22]. To train the model, the human speech training set is extended with 292,242 utterances of L1 speech with synthetically generated pronunciation errors. To generate pronunciation errors, the P2P, T2S, and S2S methods described in Section 3 are used.
The WEAKLY-S model produces probabilities of mispronunciation for all words, conditioned on the spoken sentence and canonical phonemes. Mispronunciation errors include phoneme replacement, addition, deletion, or an unknown speech sound. During training, the model is weakly supervised, in the sense that only mispronounced words in L2 speech are marked by listeners and the data do not have to be phonetically transcribed. Due to the limited availability of L2 speech and the fact that it is not phonetically transcribed, the model is prone to overfitting. To solve this problem, the model is trained in a multi-task setup. In addition to the primary task of detecting mispronunciation errors at the word level, the secondary task uses a phoneme recognizer that is trained on automatically transcribed L1 speech. Both tasks share components of the model, which makes the primary task less likely to overfit.
The architecture of the pronunciation error detection model is shown in Figure 3. The model consists of two sub-networks. The Mispronunciation Detection Network (MDN) detects word-level pronunciation errors $e$ from the audio signal $s$ and canonical phonemes $r$, while the Phoneme Recognition Network (PRN) recognizes the phonemes $r_o$ pronounced by a speaker from the audio signal $s$. The detailed model architecture is presented in Section 2 of our recent work [22].

Results - P2P method
We conducted an ablation study to measure the effect of removing synthetic pronunciation errors from the training data. We trained four variants of the WEAKLY-S model to measure the effect of using synthetic data against other elements of the model. WEAKLY-S is a complete model that also includes synthetic data during training. In the NO-SYNTH-ERR model, we exclude synthetic samples of mispronounced L1 speech, significantly reducing the number of mispronounced words seen during training from 1,129,839 to just 5,273 L2 words. The NO-L2-ADAPT variant does not fine-tune the model on L2 speech, although it is still exposed to L2 speech while being trained on a combined corpus of L1 and L2 speech. The NO-L1L2-TRAIN model is not trained on L1/L2 speech, and fine-tuning on L2 speech starts from scratch. This means that this model will not use a large amount of phonetically transcribed L1 speech data and ultimately no secondary phoneme recognition task will be used.
Fine-tuning on L2 speech is the most important factor influencing the performance of the model (Fig. 4 and Table 3): without it (NO-L2-ADAPT), the AUC drops to 0.517, compared to 0.686 for the full model. Training the model on both L2 and L1 human speech together is not enough. This is because L2 speech accounts for less than 1% of the training data and the model naturally leans towards L1 speech. The second most important factor is training the model on a combined set of L1 and L2 speech: removing this step (NO-L1L2-TRAIN) yields an AUC of 0.565. L1 speech accounts for over 99% of the training data. These data are also phonetically transcribed, and therefore can be used for the phoneme recognition task. The phoneme recognition task acts as a 'backbone' and reduces the effect of overfitting in the main task of detecting errors in the pronunciation of words. Finally, excluding synthetically generated pronunciation errors (NO-SYNTH-ERR) reduces the AUC from 0.686 to 0.615. Although the synthetic data provide the smallest improvement to the model, they still increase the accuracy of the model by 11.5% in AUC, contributing to setting a new state-of-the-art.

We compare the WEAKLY-S model with two state-of-the-art baselines. The Phoneme Recognizer (PR) model by Leung et al. [9] is our first baseline. The PR is based on the CTC loss [74] and outperforms multiple alternative approaches to pronunciation assessment. The original CTC-based model uses a hard likelihood threshold applied to the recognized phonemes. To compare it with the two other models, following our recent work [11], we have replaced the hard likelihood threshold with a soft threshold. The second baseline is the PR extended by the pronunciation model (PR-PM model [11]). The pronunciation model takes into account the phonetic variability of speech spoken by native speakers, which results in greater precision in detecting pronunciation errors. The results are shown in Table 4.
The WEAKLY-S model outperforms the second-best model in terms of AUC by 30%, from 0.528 to 0.686, and in terms of precision by 23%, from 0.612 to 0.752, on the GUT Isle corpus of Polish speakers. We see similar improvements on the Isle corpus of German and Italian speakers. The use of synthetic data is an important contribution to the performance of the WEAKLY-S model.
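The percentage gains quoted in this section are relative improvements over the weaker score. The short sketch below recomputes them from the AUC and precision values given in the text; the helper name `relative_gain` and the dictionary labels are ours and are used only for illustration.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


gains = {
    "AUC, second-best model vs. WEAKLY-S": relative_gain(0.528, 0.686),
    "precision, second-best model vs. WEAKLY-S": relative_gain(0.612, 0.752),
    "AUC, NO-SYNTH-ERR vs. WEAKLY-S": relative_gain(0.615, 0.686),
}
for name, value in gains.items():
    print(f"{name}: {value:.1f}%")
```

The three values come out at roughly 29.9%, 22.9% and 11.5%, in line with the rounded figures of 30%, 23% and 11.5% reported above.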

Results - T2S and S2S methods
The main limitation of the P2P method is that it does not generate a new speech signal. The method introduces mispronunciations by operating only on the sequence of phonemes for the corresponding speech. In this experiment, we demonstrate the T2S and S2S methods, which can directly generate a speech signal to overcome this limitation. The S2S method introduces mispronunciations into the input native speech while preserving the prosody (phoneme durations) and timbre of the voice. Preserving speech attributes other than pronunciation increases speech variability during training and makes the pronunciation error detection model more reliable during testing. The T2S method can be considered a simplified variant of the S2S method, in which there is only text as input.

The T2S and S2S methods are compared with the P2P method. Three WEAKLY-S models are trained, differing in the technique used to generate the mispronounced speech contained in the training data. The S2S method outperforms the P2P method, increasing the AUC by 9%, from 0.686 to 0.749, on the GUT Isle corpus of Polish speakers (Table 5). Additionally, the AUC increases from 0.815 to 0.834 for major pronunciation errors (Table 6), according to a similar experiment presented in Section 3.4 of [22]. Interestingly, the T2S method is only slightly better than the P2P method, which suggests that the variability of the generated mispronounced speech provided by the S2S method is important. The presented experiments show the potential of the S2S method in improving the accuracy of detecting pronunciation errors. The S2S method is able to control voice timbre, phoneme duration, and pronunciation, opening the door to transplanting all three properties from non-native speech and potentially further improving the accuracy of the model.
One downside of the S2S method is its complexity. Compared to the straightforward P2P method, the 9% improvement in AUC comes at a high cost. The method involves training a complex multi-speaker S2S model to convert between input and output mel-spectrograms, and it requires training a Universal Vocoder model to convert a mel-spectrogram into a raw speech signal.
To better understand what prevents the model from achieving higher accuracy, we measure the performance of the model on synthetic pronunciation errors. We divide all synthetic pronunciation errors into four categories to reflect the severity of pronunciation errors. The 'low' category includes mispronounced words with only one mismatched phoneme between the canonical and pronounced phonemes of the word. The 'medium' category includes words with two mispronounced phonemes, the 'high' category three, and the 'very high' category four. The AUC across different severity levels varies from 0.928 (low severity) to 1.00 (very high severity), as shown in Table 7. These AUC values are significantly higher than the results for non-native human speech, suggesting that making synthetic speech errors more similar to non-native speech may improve the accuracy of detecting pronunciation errors.
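The severity categories can be illustrated by counting mismatched phonemes between the canonical and pronounced sequences of a word. The sketch below assumes that the "phoneme distance" is the Levenshtein (edit) distance over phoneme symbols; the text does not state which distance was actually used, and the function names are hypothetical.

```python
def phoneme_distance(canonical, pronounced):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(pronounced) + 1))
    for i, c in enumerate(canonical, start=1):
        curr = [i]
        for j, p in enumerate(pronounced, start=1):
            cost = 0 if c == p else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or replacement
        prev = curr
    return prev[-1]


# Severity labels for distances 1 to 4, as described in the text.
SEVERITY = {1: "low", 2: "medium", 3: "high", 4: "very high"}


def severity(canonical, pronounced):
    """Return the severity label, or None if the distance is 0 or above 4."""
    return SEVERITY.get(phoneme_distance(canonical, pronounced))


print(severity(["ih", "n", "ah", "f"], ["ih", "n", "ah", "t"]))
```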

Experimental setup
The P2P, T2S, and S2S are generative models that provide the probability of generating a particular output sequence. This probability can be used directly to detect pronunciation errors without generating the mispronounced speech and adding it to the training data. In this experiment, we show how to apply this approach in practice. One of the challenges in detecting pronunciation errors is that a native speaker can pronounce a sentence correctly in many ways. The classic approach for detecting pronunciation errors is based on identifying the difference between pronounced and canonical phonemes. All pronunciations that do not correspond precisely to the canonical pronunciation will result in false pronunciation errors. One way to solve this problem is to use the P2P technique to create a native speech Pronunciation Model (PM) that determines the probability that a sentence is pronounced by a native speaker. A low likelihood value indicates a high probability of mispronunciation.
To evaluate the performance of the PM model, the pronunciation error detection model has been designed such that the PM model can be turned on and off. To disable the PM, we are modifying it so that it only takes into account one way of correctly pronouncing a sentence. In an ablation study, we measure whether the PM model improves the accuracy in detecting pronunciation errors at the word level. Note that in this experiment, synthetically generated pronunciation errors are not used explicitly. Instead, the native speech pronunciation model is used to implicitly represent the generative speech process.

Table 7. Accuracy (AUC) in detecting pronunciation errors assessed in synthetic speech at different severity levels of mispronunciation for the best S2S method.

Severity                          AUC
Low (phoneme distance=1)          0.928
Medium (phoneme distance=2)       0.974
High (phoneme distance=3)         0.993
Very High (phoneme distance=4)    1.00

Figure 5. Architecture of the system for detecting mispronounced words in a spoken sentence based on the native speech pronunciation model.

Overview of the pronunciation error detection model
The design of the pronunciation error detection model consists of three subsystems: a Phoneme Recognizer (PR), a Pronunciation Model (PM), and a Pronunciation Error Detector (PED), shown in Figure 5. First, the PR model estimates a belief over the phonemes produced by the student, intuitively representing the uncertainty in the student's pronunciation. The PM model transforms this belief into a probability that a native speaker would pronounce the sentence this way, given the phonetic variability. Finally, the PED model decides which words were mispronounced in the sentence by processing three pieces of information: a) what the student pronounced, b) how likely it is that the native speaker would pronounce it that way, and c) what the student was supposed to pronounce. Details of the entire model of pronunciation error detection are presented in Section 3 of our recent work [11]. We will now only show the details of the PM model that are relevant to this experiment.

Overview of the native speech pronunciation model
The PM is an encoder-decoder neural network, following Sutskever et al. [75]. Instead of building a text-to-text translation system between two languages, we use it for P2P conversion. The sequence of phonemes $r$ that the native speaker was supposed to pronounce is converted to the sequence of phonemes $r'$ that they actually pronounced, denoted as $r' \sim p(r' \mid r)$. Once trained, the PM acts as a probability mass function, computing the probability sequence $\pi$ of the recognized phonemes $r_o$ pronounced by the student, conditioned on the expected (canonical) phonemes $r$. The PM is denoted as in Eq. 8.
The PM model is trained on P2P speech data generated automatically by passing the speech of the native speakers through the PR. By using PR to annotate the data, we can make the PM model more robust against possible phoneme recognition inaccuracies in PR at the time of testing.

Results
The complete model with the PM enabled is called PR-PM, which stands for Phoneme Recognizer + Pronunciation Model. The model with the PM turned off is called PR-LIK, which stands for Phoneme Recognizer outputting the likelihoods of recognized phonemes. PR-LIK is an extension of the PR-NOLIK model, the mispronunciation detection model proposed by Leung et al. [9], which returns only the most likely recognized phonemes and does not use phoneme likelihoods to detect pronunciation errors. PR-NOLIK detects mispronounced words based on the difference between the canonical and recognized phonemes. Therefore, this system does not offer any flexibility in optimizing the model for higher precision by fine-tuning the threshold applied to the phoneme recognition probabilities. Turning off the PM reduces the precision by between 11% and 18% over the recall range of 20% to 40%, as shown in Figure 6. One example where the PM helps is the word 'enough', which can be pronounced in two similar ways: /ih n ah f/ or /ax n ah f/ (with a short 'i' or a 'schwa' phoneme at the beginning). The PM can take into account this phonetic variability and recognize both versions as correctly pronounced. Another example is coarticulation [76]. Native speakers tend to merge phonemes of adjacent words. For example, in the text 'her arrange' /hh er - er ey n jh/, the two adjacent /er/ phonemes can be pronounced as one phoneme: /hh er ey n jh/. The PM model can correctly recognize multiple variations of such pronunciations.
Complementary to the precision-recall curve shown in Figure 6, we present in Table 8 one configuration of the precision and recall scores for the PR-LIK and PR-PM systems. This configuration is chosen to: a) make the recall of both systems close to the same value, and b) illustrate that the PR-PM model has a much greater potential to increase precision than the PR-LIK system. A similar conclusion can be drawn by checking different precision and recall configurations in the precision and recall plots for both the Isle and GUT Isle corpora.
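The difference between enabling and disabling the PM can be illustrated with the 'enough' example. In the sketch below, a lookup table stands in for the encoder-decoder PM; the table `TOY_PM`, its probabilities, the threshold, and the function names are all made up for illustration and are not taken from the actual model. With the PM disabled, only the single canonical pronunciation is accepted, as described in the experimental setup.

```python
# Toy stand-in for the native speech pronunciation model: for each word,
# the probability that a native speaker uses a given pronunciation.
# The probabilities are invented for this example.
TOY_PM = {
    "enough": {
        ("ih", "n", "ah", "f"): 0.6,
        ("ax", "n", "ah", "f"): 0.4,
    },
}


def native_likelihood(word, recognized, canonical, use_pm=True):
    """Likelihood that a native speaker would pronounce the word this way."""
    if use_pm:
        return TOY_PM.get(word, {}).get(tuple(recognized), 0.0)
    # PM disabled: only the canonical pronunciation counts as correct.
    return 1.0 if tuple(recognized) == tuple(canonical) else 0.0


def is_mispronounced(word, recognized, canonical, use_pm=True, threshold=0.1):
    """Flag the word when the native-speech likelihood falls below a threshold."""
    return native_likelihood(word, recognized, canonical, use_pm) < threshold


canonical = ["ih", "n", "ah", "f"]
schwa_variant = ["ax", "n", "ah", "f"]
print(is_mispronounced("enough", schwa_variant, canonical, use_pm=True))
print(is_mispronounced("enough", schwa_variant, canonical, use_pm=False))
```

With the PM enabled the schwa variant is accepted, whereas with the PM disabled the same pronunciation is flagged as an error, which is the kind of false alarm that lowers precision.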

Experimental setup
The full CAPT learning experience includes both the detection of pronunciation and lexical stress errors. To investigate the potential of speech generation in the lexical stress error detection task, we evaluate the T2S method, which is a simpler version of the S2S method evaluated in Section 5.1.4.
Two variants of the lexical stress error detection model are trained to measure the benefits of employing synthetic mispronounced speech. The first model, denoted as Att TTS, is based on an attention mechanism and is trained on both human and synthetic speech with pronunciation errors. For this model, the 1980 most popular English words [77] were synthesized with correct and incorrect stress patterns using the method outlined in Section 3.2, and added to the speech corpora of isolated words presented in Section 4.2. The second model, Att NoTTS, is trained only on human speech. Each of the two models has a simpler version without the attention mechanism, marked as NoAtt TTS and NoAtt NoTTS. These simpler models help to understand whether the benefits of using synthetic pronunciation errors depend on the model capacity.
The accuracy of detecting lexical stress errors is measured in terms of the AUC metric. To be comparable to the study by Ferrer et al. [13], we use precision as an additional metric, while setting recall to 50%.
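Reporting precision at a fixed recall means lowering the decision threshold until the desired fraction of true errors is detected and then reading off the precision at that point. A minimal sketch is given below; the function name, the toy scores, and the handling of tied scores (ties are processed in input order) are assumptions of this example and not a description of the evaluation code used in the experiments.

```python
def precision_at_recall(scores, labels, target_recall=0.5):
    """Precision at the highest threshold whose recall reaches target_recall.

    scores: predicted error probabilities; labels: 1 for a true error, else 0.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive example is required")
    tp = fp = 0
    # Lower the threshold one example at a time, highest score first.
    for _, label in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if label:
            tp += 1
        else:
            fp += 1
        if tp / total_pos >= target_recall:
            return tp / (tp + fp)
    raise ValueError("target_recall must not exceed 1.0")


scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
print(precision_at_recall(scores, labels, target_recall=0.5))
```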

Overview of the lexical stress detection model
As shown in Figure 7, the lexical stress error detection model consists of three subsystems: Feature Extractor, Attention-based Classification Model, and Lexical Stress Error Detector. The Feature Extractor extracts prosodic features and phonemes from the speech signal $s$ and the force-aligned canonical phonemes $r$. Prosodic features include F0, intensity [dB SPL] and duration of phonemes. The F0 and intensity features are computed at the frame level. The Attention-based Classification Model uses the attention mechanism [78] to map frame-level and phoneme-level features to a syllable-level representation. It then produces lexical stress error probabilities at the syllable level. The Lexical Stress Error Detector reports a lexical stress error if the expected (canonical) and estimated lexical stress for a given syllable do not match and the corresponding probability is higher than the specified threshold. The detailed architecture of the model is presented in Section 3 of our recent work [23].
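The decision rule of the Lexical Stress Error Detector can be sketched as follows. The sketch makes an assumption that the text leaves open: each syllable comes with a probability of being stressed, the estimated stress is the more probable class, and the "corresponding probability" is the probability of that class. The function name and the default threshold are hypothetical.

```python
def detect_stress_errors(expected_stress, stressed_probs, threshold=0.8):
    """Return one boolean per syllable: True where a stress error is reported.

    expected_stress: canonical stress per syllable (True = stressed).
    stressed_probs: assumed model output, the probability that each syllable
    is stressed.
    """
    errors = []
    for expected, p_stressed in zip(expected_stress, stressed_probs):
        estimated = p_stressed >= 0.5
        confidence = p_stressed if estimated else 1.0 - p_stressed
        # Report an error only for a confident mismatch with the canonical stress.
        errors.append(estimated != expected and confidence > threshold)
    return errors


# Canonical stress on the first syllable; the speaker stresses the second one.
print(detect_stress_errors([True, False], [0.1, 0.95]))
```

Raising the threshold trades recall for precision, since fewer low-confidence mismatches are reported as errors.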
The NoAtt TTS and NoAtt NoTTS models do not have the attention mechanism. Instead, as a representation at the syllable level, they use the average acoustic feature values for the corresponding syllable nucleus. The hypothesis is that synthetic data will not be beneficial to a simpler model due to its limited capacity.
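The syllable-level representation of the models without attention can be sketched as a plain average of frame-level features over the syllable nucleus. In the example below, the feature layout (one `[F0, intensity]` pair per frame), the frame indices of the nucleus, and the function name are assumptions made for illustration.

```python
def mean_pool_nucleus(frame_features, nucleus_start, nucleus_end):
    """Average frame-level features over frames [nucleus_start, nucleus_end)."""
    frames = frame_features[nucleus_start:nucleus_end]
    if not frames:
        raise ValueError("the syllable nucleus must span at least one frame")
    n = len(frames)
    # zip(*frames) groups the values of each feature across frames.
    return [sum(values) / n for values in zip(*frames)]


# Toy frames: [F0 in Hz, intensity in dB SPL]; the nucleus covers frames 1 and 2.
frame_features = [[100.0, 60.0], [120.0, 62.0], [140.0, 64.0], [110.0, 58.0]]
syllable_vector = mean_pool_nucleus(frame_features, 1, 3)
print(syllable_vector)
```

An attention-based model would replace the uniform average with learned per-frame weights, which is the extra capacity the Att models have.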

Results
Enriching the training set with incorrectly stressed words increases the AUC from 0.54 to 0.62 (Att NoTTS vs. Att TTS in Figure 8 and Table 9). Data augmentation helps because it increases the number of words with incorrect stress patterns in the training set. This prevents the model from relying on the strong correlation between phonemes and lexical stress in correctly stressed words. Using data augmentation in the simpler model without the attention mechanism slightly reduced the AUC from 0.45 to 0.44 (NoAtt NoTTS vs. NoAtt TTS). The NoAtt TTS model has limited capacity because it does not use the attention mechanism to model prosodic features, and thus it is unable to benefit from synthetic speech.

We compare our results with the work of Ferrer et al. [13]. In their corpus, 46.4% (191 out of 411) of words were incorrectly stressed, well above the 9.4% (189 out of 2109) in our experiment. The fewer lexical stress errors that users make, the more difficult it is to detect them. Under these conditions, we can state that our lexical stress detection model based on T2S-generated synthetic speech achieves higher scores in precision and recall compared to the work of Ferrer et al. [13].

Conclusions
We propose a new paradigm for detecting pronunciation errors in non-native speech. Rather than focusing on detecting pronunciation errors directly, we reformulate the detection problem as a speech generation task. This approach is based on the assumption that it is easier to generate speech with specific characteristics than to detect those characteristics in speech with limited availability. In this way, we address one of the main problems of the existing CAPT methods, which is the low availability of mispronounced speech for reliable training of pronunciation error detection models.
We present a unified look at three different speech generation techniques for detecting pronunciation errors, based on P2P, T2S and S2S conversion. The P2P, T2S, and S2S methods improve the accuracy of detecting pronunciation and lexical stress errors. The methods outperform strong baseline models and establish a new state-of-the-art. The best S2S method outperforms the baseline method [9] by improving the accuracy of detecting pronunciation errors in terms of AUC by 41%, from 0.528 to 0.749. The S2S method has the ability to control many properties of speech, such as voice timbre, prosody (duration), and pronunciation. This opens the door to the generation of mispronounced speech that can mimic certain aspects of non-native speech, such as voice timbre. The S2S method can be seen as a generalization of the simpler methods, T2S and P2P, providing a general framework for building first-class models of pronunciation assessment. For better reproducibility, in addition to using publicly available speech corpora, we recorded the GUT Isle corpus of non-native English speech [71]. The corpus is available to other researchers in the field.
In the future, we plan to extend the S2S method in order to generate synthetic speech as close as possible to non-native speech: a) we will extract the voice timbre from the speech of non-native speakers and transfer it to native speech, following the paper of Merritt et al. on text-free voice conversion [79], and b) we will mimic the distribution of pronunciation errors in non-native speech. We expect both changes to increase the accuracy of detecting pronunciation errors in non-native speech. In the long run, we hope to demonstrate that "synthetic speech is all you need" by training the model with synthetic speech only and achieving state-of-the-art results in the pronunciation error detection task. This may revolutionize computer-assisted English L2 learning and CAPT. Moreover, such a paradigm may be transferred to the whole domain of computer-assisted foreign language learning.