Optimizing Automatic Speech Recognition for Low-Proficient Non-Native Speakers

Computer-Assisted Language Learning (CALL) applications for improving the oral skills of low-proficient learners have to cope with non-native speech that is particularly challenging. Since unconstrained non-native ASR is still problematic, a possible solution is to elicit constrained responses from the learners. In this paper, we describe experiments aimed at selecting utterances from lists of responses. The first experiment on utterance selection indicates that the decoding process can be improved by optimizing the language model and the acoustic models, thus reducing the utterance error rate from 29–26% to 10–8%. Since giving feedback on incorrectly recognized utterances is confusing, we verify the correctness of the utterance before providing feedback. The results of the second experiment on utterance verification indicate that combining duration-related features with a likelihood ratio (LR) yields an equal error rate (EER) of 10.3%, which is significantly better than the EER for the other measures in isolation.


Introduction
The increasing demand for innovative applications that support language learning has led to a growing interest in Computer-Assisted Language Learning (CALL) systems that make use of ASR technology. Such systems can address oral proficiency, one of the most problematic skills in terms of time investments and costs, and are seriously being considered as a viable alternative to teacher-fronted lessons. However, developing ASR-based CALL systems that can provide training and feedback for second language (L2) speaking is not trivial.
First of all, non-native speech is atypical in many respects and, as such, poses serious problems to ASR systems [1][2][3][4]. Non-native speech may deviate from native speech with respect to pronunciation, morphology, syntax, and the lexicon. Pronunciation is considered a difficult skill to learn in a second language (L2), and even highly proficient non-native speakers often maintain a foreign accent [5]. An important limiting factor in acquiring the pronunciation of an L2 is considered to be interference from the first language (L1). As a consequence, the pronunciation of non-native speakers may deviate in various respects and to different degrees from that of native speakers. Deviations may concern prosodic or segmental aspects of speech or both. At the segmental level, the deviations may be limited to phonetic properties without really compromising phonemic distinctions, or they may blur phonemic distinctions and thus have more serious consequences for intelligibility. For instance, non-native speakers may use phonemes from their L1 when speaking the target language [5] or they may have difficulties in perceiving and/or realizing phonetic contrasts that are not distinctive in their mother tongue. Illustrations of this phenomenon are provided by Italian speakers of English who realize English /p/, /t/, /k/, /b/, /d/, and /g/ with voice onset time (VOT) values that differ from those employed by native speakers [5]. Such deviations might cause misunderstandings in certain cases, but do not necessarily hamper communication because the distinction between separate phonemes in the target language, for instance, /p/ versus /b/, is preserved, albeit differently realized. Native speakers will probably perceive the difference and consider it a foreign accent.
More problematic deviations may arise when the difficulty in perceiving and realizing phonetic features of the target language that are not distinctive in the mother tongue leads non-native speakers to blur the proficiency. While for practicing pronunciation it may suffice to read sentences aloud, to practice grammar learners need to have some freedom in formulating answers in order to show whether they are able to produce correct forms. Less constrained output is not only problematic because it is more difficult to predict but also because, in general, it is accompanied by a higher incidence of disfluencies and hesitations. In a study on read and spontaneous speech produced by non-native speakers of Dutch [12], we found that extemporaneous speech contains many more filled pauses and disfluencies than read speech. The more freedom is allowed to the learner, the more complex the recognition task will be. In addition, tasks with more freedom will in general be characterized by a higher cognitive load, which, in turn, is likely to lead to more disfluencies being produced [17], thus making the recognition task even more difficult.
The second category of techniques for dealing with non-native speech, that is, those that are aimed at improving decoding, comprises methods for optimizing the acoustic models, the lexicon, and the language model in order to compensate for the deviations in pronunciation, morphology, and syntax.
All the factors mentioned above make it clear that to develop ASR-based CALL systems for oral proficiency it is necessary to take measures at different levels. A first important measure consists in designing exercises that allow some freedom to the learners in producing answers, but that are predictable enough to be handled by ASR. How much freedom can be allowed is of course dependent on the quality of decoding.
These are exactly the problems we face in the DISCO project, which is aimed at developing a prototype of an ASR-based CALL application for practicing oral skills in Dutch as a second language (DL2) and providing intelligent feedback on important aspects of speaking performance such as pronunciation, morphology, and syntax. The application should be able to detect and give feedback on errors that are made by learners of DL2 at the A2 level of the Common European Framework (CEF). This is achieved by generating a predefined list of possible (correct and incorrect) responses for each exercise.
In this project we intend to use a two-step procedure in which first the content of the utterance is determined (what was said), and subsequently the form of the utterance is analysed (how it was said). In the first (recognition) step the system should tolerate deviations in the way utterances are spoken, while in the second (error detection) step, strictness is required (see also [18,19]). In the first step of the two-step procedure, two phases can be distinguished: (a) utterance selection and (b) utterance verification (UV). When learners are allowed some freedom in formulating their responses, there is always the possibility that the learner's response is not present in the predefined list and is recognized incorrectly in phase (a) as one of the utterances of the predefined list. Moreover, even if the utterance is present in the list, it can still be recognized incorrectly. Giving feedback on the basis of an incorrectly recognized utterance is confusing and thus should be avoided. Therefore, utterance verification (UV) is carried out in phase (b).
EURASIP Journal on Audio, Speech, and Music Processing
In this paper we present two experiments we carried out in order to test both utterance selection and utterance verification for our system using state-of-the-art techniques. In the utterance selection phase one of the utterances from the predefined list is selected, and in the utterance verification phase it is determined whether this utterance should be passed on to the following stages of the CALL system (error detection, feedback, etc.). While in the final system both phases should work in tandem, we studied (optimized, evaluated, etc.) the two phases in isolation, for diagnostic purposes, to acquire a better understanding, and thus, finally, to obtain a better functioning system.
In Section 2 we discuss related work on non-native speech recognition and utterance verification. In Section 3, we introduce our system architecture and relate the choices for the experimental settings to previous work. In Sections 4 and 5, we present two experiments that are aimed at optimizing and evaluating utterance selection and utterance verification using realistic test material. In Section 6, we discuss the results of the two experiments in combination and consider the implications for our CALL application.

Related Work
In automatic speech recognition (ASR) the recognition result is often obtained using the maximum a posteriori (MAP) decision rule:
$$\hat{w} = \arg\max_{w \in W} p(w \mid x), \tag{1}$$
where $p(w \mid x)$ is the posterior probability of a word sequence $w$ in a set of word sequences $W$ given a sequence of acoustic observations $x$, and $\hat{w}$ is the recognition result that maximizes the posterior probability. By using Bayes' rule, (1) can be reformulated as (2), and given that $x$ is the same for all word sequences in $W$, it can be rewritten as (3):
$$\hat{w} = \arg\max_{w \in W} \frac{p(x \mid w)\, p(w)}{p(x)}, \tag{2}$$
$$\hat{w} = \arg\max_{w \in W} p(x \mid w)\, p(w). \tag{3}$$
By implementing (3), we can still find the optimal sequence of words $\hat{w}$ in $W$. However, it is generally important not only to find the best sequence of words $\hat{w}$ relative to the other sequences (see (3)) but also to quantitatively assess the confidence in the recognition result in an absolute sense. This number is called the confidence measure (CM) of the recognition result, and the problem of accepting or rejecting a recognition result is called utterance verification (UV).
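The decision rule can be illustrated with a minimal Python sketch for the constrained case in which W is a short list of candidate responses. The function name `map_select`, the candidate labels, and all scores are hypothetical and are not taken from the experiments; the sketch only shows that the evidence term p(x) can be dropped because it is the same for every candidate.

```python
import math

def map_select(acoustic_loglik, log_prior):
    """Return the candidate w that maximizes log p(x | w) + log p(w).

    The evidence p(x) is identical for all candidates, so leaving it
    out does not change which candidate wins.
    """
    return max(acoustic_loglik, key=lambda w: acoustic_loglik[w] + log_prior[w])

# Hypothetical acoustic log-likelihoods for three candidate responses.
acoustic_loglik = {
    "response A": -410.2,
    "response B": -432.7,
    "response C": -428.9,
}

# Equal prior probability for every candidate (an assumption of this sketch).
uniform = math.log(1.0 / len(acoustic_loglik))
log_prior = {w: uniform for w in acoustic_loglik}

best = map_select(acoustic_loglik, log_prior)
```

With equal priors the selection reduces to picking the candidate with the highest acoustic likelihood; a non-uniform prior would shift the decision towards responses that are expected more often.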
Both (non-native) speech decoding and utterance verification are key aspects of this research. We will now relate our research on both problems to other recent work.

Non-Native Speech Decoding.
In the ASR community, it has long been known that the differences between native and non-native speech are so pervasive as to degrade ASR performance considerably (e.g., [1, 20, 21]). These differences affect essentially all three components of an ASR system. As explained in Section 1, non-natives often use different words and word orders (language model), produce sounds differently (acoustic models), pronounce words differently (lexicon) (see, e.g., [2]), and generally have a lower speech rate and produce more disfluencies [10][11][12]. A short overview of research on the three components of the ASR is provided in this section.
In attempts aimed at improving ASR performance on non-native speech, the acoustic models have received most attention. Various kinds of acoustic models can be and have been used. First of all, it is possible to train acoustic models on speech material of the target language (L2). However, the recognition performance obtained with such models is usually not sufficient or at any rate considerably lower than the performance on native speech, because of the various deviations in the speech of non-natives [20,21]. Models can also be obtained by training exclusively on non-native (L2) speech [22,23], or on combinations of L1 and L2 speech. Regarding the latter, two different approaches can be adopted: "model merging" and "parallel models." In the "parallel models" approach, acoustic models for both languages are stored, and during decoding the recognizer determines which models fit the data better [24][25][26][27]. In the "model merging" (or model interpolation) approach, acoustic models of both languages are combined, in order to obtain a new set of acoustic models [26]. The obvious disadvantage of these L1-L2 approaches is that they can only be applied to fixed L1-L2 pairs. An alternative approach consists in employing adaptation techniques, such as the common Maximum Likelihood Linear Regression (MLLR) and MAP techniques, which have been shown to improve recognition performance [20,21,23,26,28].
Improving ASR performance on non-native speech can also be carried out at the level of the lexicon. An obvious way to model pronunciation variation at the level of the lexicon is by adding pronunciation variants to the lexicon [29,30]. In the case of non-native speech these variants should reflect possible L1-induced mispronunciations of words L2 learners may produce [18,31,32]. These variants can be generated by means of rules obtained from studying non-native speech [18,32]. Another possibility to generate non-native variants for an L2 lexicon is to apply an L1 phoneme recognizer to L2 speech [31]. The advantage of the latter approach is that no learner data are needed, but a disadvantage is that phoneme recognizers for all source languages (L1s) are needed. The work in [31] also carried out speaker adaptation, and the improvements they obtained with speaker adaptation were much larger than those obtained with lexicon adaptation.
The choices regarding the language model depend to a large extent on the design of the CALL system, in particular on the type of items present in the CALL system. In spoken CALL systems, use could be made of closed or open items. For instance, the learner could be asked to repeat an utterance that is spoken by the system, or read an utterance presented on the screen. In these cases, the required responses are known, which in turn makes it possible to derive specific language models for every item. Alternatively, in some cases, a language model might not be used at all, depending on the approach that is chosen. For more open items in a CALL system (e.g., a question, or a turn in the dialogue), a possibility is to try to elicit constrained responses. This makes it possible to activate a specific language model for every item containing only those utterances that are expected in that given context. In these cases, a "stricter" language model can be used [33][34][35]. In this way, recognition performance can again be maximized without affecting the face validity of the application. This is done, for instance, in the Auralog programs [36]. In spite of the constraints that are introduced to improve ASR performance, the students can still have the feeling that they are interacting with the system and that they have control over the conversation [36].

Utterance Verification.
In the literature roughly three approaches for tackling the UV problem can be distinguished: (1) posterior probability estimation, (2) statistical hypothesis testing, and (3) confidence predictors. We will now give a short overview of these approaches (see [37] for a more detailed overview).
(1) One approach to CM is to directly estimate the posterior probability of the recognition result $\hat{w}$ given the acoustic observations $x$:
$$p(\hat{w} \mid x) = \frac{p(x \mid \hat{w})\, p(\hat{w})}{p(x)}, \tag{4}$$
and reject the recognition result $\hat{w}$ when it is below a given threshold $\theta$. The greatest challenge with respect to this approach is accurately estimating the denominator $p(x)$. One solution is to estimate it from a word lattice [38], and this generally provides a good result when the lattice contains enough word hypotheses. The lattice-based approach can be viewed as approximating the posterior probability where $p(x)$ is written as $\sum_i p(x \mid w_i)\, p(w_i)$ and $i$ ranges over all sequences of words in a pruned search space. Another approach to estimating $\sum_i p(x \mid w_i)\, p(w_i)$ is to use a free phone recognizer (FPR) [39,40] and to approximate it on the basis of $u_{\mathrm{FPR}}$, the optimal phone string found using the free phone recognizer.
(2) Another popular method for UV is statistical hypothesis testing, in which the null hypothesis $H_0$ states that the recognition result is a correct representation of the speech signal and the alternative hypothesis $H_a$ states that the recognition result is not a correct representation. Then the criterion for accepting the null hypothesis becomes:
$$\frac{p(x \mid \hat{w})}{p(x \mid \bar{w})} \geq \theta, \tag{6}$$
in which the numerator equals the acoustic likelihood of $\hat{w}$, the denominator equals the acoustic likelihood of all sequences of words other than $\hat{w}$ (usually called the antimodel, denoted here by $\bar{w}$), and $\theta$ is a predefined threshold. The main difficulty with this approach is defining and training the antimodel.
(3) Apart from estimating the posterior probability or statistical hypothesis testing, another method for UV is to use predictors such as (1) acoustic stability, (2) hypothesis density, and (3) duration information, and to combine these using a machine learning model. Some machine learning techniques that have been used in the past are artificial neural networks (ANN), linear discriminant analysis (LDA) classifiers, and binary decision trees.
Acoustic stability [38] refers to the stability of the recognition result given different weightings of the acoustic model and language model scores. When the recognition result remains stable given fluctuations in these weightings, we can be more confident that it is correctly recognized. Hypothesis density [41] refers to the average density of the word lattice generated during decoding. When there are many competing hypotheses in a pruned search space at each point in time, we can be less confident that the recognition result is correct. Duration modelling for UV usually comes down to capturing the amount of deviation of the phoneme durations in the recognition result from normal phone durations [42]. Deviating durations in the recognition result decrease the confidence that it is recognized correctly.
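Two of the ingredients above, the likelihood ratio test and a duration-based predictor, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the per-frame normalization of the log ratio, the use of a mean absolute z-score as the duration measure, and all names and numbers are hypothetical and are not the formulation used in the experiments.

```python
def log_likelihood_ratio(loglik_hyp, loglik_alt, n_frames):
    """Log of the likelihood ratio, divided by the number of frames.

    Normalizing by n_frames is an assumption of this sketch; it makes
    scores of utterances with different lengths comparable.
    """
    return (loglik_hyp - loglik_alt) / n_frames

def accept(loglik_hyp, loglik_alt, n_frames, threshold):
    """Accept the recognition result when the normalized log ratio reaches the threshold."""
    return log_likelihood_ratio(loglik_hyp, loglik_alt, n_frames) >= threshold

def mean_duration_deviation(durations, norms):
    """Mean absolute z-score of observed phone durations.

    durations: list of (phone, seconds) pairs from the recognition result.
    norms: dict mapping phone -> (mean_seconds, std_seconds), e.g. estimated
           from reference speech (hypothetical values in the test below).
    """
    z_scores = [abs(sec - norms[ph][0]) / norms[ph][1] for ph, sec in durations]
    return sum(z_scores) / len(z_scores)
```

A higher duration deviation would lower the confidence in the recognition result, while a higher log ratio would raise it; how such predictors are weighted against each other is left open here.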

Experimental System
In Figure 1, the architecture of our CALL system is shown. The input of the system is the learner's speech and a list of predicted responses in the form of transcriptions of sequences of words. Utterance selection is then performed to choose the best fitting (1-Best) response from this list. In the next phase the 1-Best response is verified. If the response is accepted, error detection on this response is carried out. Errors are detected on multiple levels, that is, syntax, morphology, and pronunciation. If the response is not accepted, the user is prompted to try again.
It is difficult for general Hidden Markov modelling methods to discriminate between utterances that are acoustically very similar [43]. Therefore, in the final CALL system we will probably use the following procedure: the output of the first step is a cluster of similar responses (e.g., according to a phonetically-based distance measure), and a more detailed analysis is carried out in the second (error detection) step to determine what was actually uttered and where to give feedback.
We will now explain the main choices we made for our system regarding utterance selection and utterance verification procedures.

Utterance Selection.
In the literature many approaches have already been proposed to improve the performance of speech recognition for non-natives. A great deal of the research concerned one or a small number of fixed (L1-L2) language pairs. In these approaches material of the source language (L1) or material for specific L1-L2 pairs was employed to enhance ASR for these language pairs. However, since our system is intended for learners of Dutch with different mother tongues, approaches that require material of L1 or specific L1-L2 pairs are not feasible in our case for any of the three components of an ASR system (acoustic models, lexicon, and language model). Consequently, we made the following choices.
Figure 1: System architecture.
For the acoustic models we decided to start with training the acoustic models on Dutch native speech. Next, we used read speech of language learners (DL2 speech) to retrain the acoustic models (see Section 4.1.4). Such retraining of the acoustic models is also possible in a realistic CALL application, albeit not online, after a so-called enrolment phase, as used in dictation systems. Especially if the system has to be used extensively by a learner, it is possible to make it as suitable as possible for that specific learner. At the level of the lexicon we could not make use of L1 phoneme recognizers, as was done by [31], and thus we added pronunciation variants to the lexicon that were generated by means of data-derived rules (for further details, see Section 4.1.5). Finally, we decided to use specific language models for every item in the CALL system that are based on a list of predicted (correct and incorrect) responses (see Section 4.1.3).

Utterance Verification.
In Section 2.2, we have given a short overview of the three key approaches to UV, that is, (1) posterior probability estimation, (2) statistical hypothesis testing, and (3) predictor combination. Most of these approaches are aimed at UV in large vocabulary tasks, that is, posterior probability estimation using word lattices and predictor features like acoustic stability and hypothesis density. Furthermore, training explicit antimodels for statistical hypothesis testing is conceptually and practically difficult for speakers with a large variety of L1 backgrounds [44]. For these reasons, we have chosen a form of predictor combination in which a likelihood ratio similar to (6) in statistical hypothesis testing is combined with phone durations. The rationale behind this choice is explained in detail in Section 5.1.2.

Experiment 1: Utterance Selection
The goal of this experiment is to develop a procedure for selecting utterances from a list of predicted responses and to evaluate the effects of different language models, pronunciation lexicons, and acoustic models.
The speech material for the present experiments was taken from the JASMIN speech corpus [45], which contains speech of children, non-natives, and elderly people. Since the non-native component of the JASMIN corpus was collected with the aim of facilitating the development of ASR-based language learning applications, it is particularly suited for our purpose. Speech from speakers with different mother tongues was collected, because this realistically reflects the situation in Dutch L2 classes. These speakers have relatively low proficiency levels, namely, A1, A2, and B1 of the Common European Framework (CEF), because it is for these levels that ASR-based CALL applications appear to be most needed.
The JASMIN corpus contains speech collected in two different modalities: read speech and human-machine dialogues. The latter were used for our experiments because they more closely resemble the situation we will encounter in our CALL application. The JASMIN dialogues were collected through a Wizard-of-Oz-based platform and were designed such that the wizard was in control of the dialogue and could intervene when necessary. In addition, recognition errors were simulated and difficult questions were asked to elicit some typical phenomena of human-machine interaction that are known to be problematic in the development of spoken dialogue systems, such as hyperarticulation, restarts, filled pauses, self-talk, and repetitions.
The material we used for the present experiments consists of speech from 45 speakers, 40% male and 60% female, with 25 different L1 backgrounds. Ages range from 19 to 55, with a mean of 33. The speakers each give answers to 39 questions about a journey. We first deleted the utterances that contain crosstalk, background noise, and whispering from the corpus. After deletion of these utterances the material consists of 1325 utterances. The mean signal-to-noise ratio (SNR) of the material is 24.9 with a standard deviation of 5.1.
Considering all these characteristics, we can state that the JASMIN non-native dialogues are similar to the speech we will encounter in our CALL application for various reasons: (1) they contain answers to relatively constrained questions, (2) they contain semispontaneous speech, (3) of non-natives with different L1s, (4) which features spontaneous phenomena such as filled pauses and disfluencies. However, since hesitation phenomena were purposefully induced in the JASMIN dialogues, their incidence is probably higher than in typical non-native dialogues. To simulate the ASR task in our CALL application, we generated lists of the answers given by each speaker to each of the 39 questions. These lists mimic the predicted responses in our CALL application task because they contain (a) responses to relatively closed questions and (b) morphologically and syntactically correct and incorrect responses.

Language Modelling.
Our approach is to use a constrained language model (LM) to restrict the search space. In total 39 LMs were generated based on the responses to each of the 39 questions. These responses were manually transcribed at the orthographic level. Filled pauses, restarts, and repetitions were also annotated.
Filled pauses are common in everyday spontaneous speech and generally do not hamper communication. It seems therefore that students using a CALL application should be allowed to produce a limited number of filled pauses. In our material 46% of the utterances contain one or more filled pauses and almost 13% of all transcribed units are filled pauses.
However, 11% of the utterances contain one or more other disfluencies such as restarts, repairs, and repetitions. While these also occur in normal speech, albeit less frequently, we think that in a CALL application for training oral proficiency students should be stimulated to produce fluent speech. On these grounds, we decided not to tolerate restarts, repetitions, and repairs and to ask the students to try again when one of these phenomena is produced. Therefore, in our research we did not focus on restarts, repairs, and repetitions; we only included their orthographic transcriptions in the LM and their manual phonetic transcriptions in the lexicon.

The LMs are implemented as FSMs with parallel paths of orthographic transcriptions of every unique answer to the question. A priori each path is equally likely. An example of such a question is "Hoe wilt u naar deze stad reizen?" (How do you want to travel to this city?) and a small part of the responses is
(1) /ik gaat met de vliegtuig/ (/I am going by plane/*)
The baseline LM that is generated from this list is depicted in Figure 2. Each of the parallel paths with words on the arcs represents a unique answer to a question. Silence is possible before and after each word (not shown).
To be able to decode possible filled pauses between words, we generated another LM with self-loops added in every node. Filled pauses are represented in the pronunciation lexicon as /@/ or /@m/, phonetic representations of the two most common filled pauses in Dutch. The filled pause loop penalty was empirically optimized. An example of this language model is depicted in Figure 3.
To examine whether filled pause loops are an adequate way of modelling filled pauses, we also experimented with an oracle LM. This is an LM containing the reference orthographic transcriptions, which include the manually annotated filled pauses without filled pause loops.
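The structure of these language models can be sketched in Python as a small nondeterministic finite-state machine. This is an illustrative sketch only: the token `<fp>`, the function names, and the example responses are hypothetical; path probabilities, the filled pause loop penalty, and optional silence are left out.

```python
FILLED_PAUSE = "<fp>"  # hypothetical token standing for any filled pause

def build_lm(responses, fp_loops=False):
    """Build an FSM with one parallel path per unique response.

    Returns (arcs, finals): arcs is a list of (source, label, target)
    triples, finals is the set of accepting states. State 0 is the start.
    """
    arcs, finals, nodes = [], set(), {0}
    next_id = 1
    for response in sorted(set(responses)):
        src = 0
        for word in response.split():
            arcs.append((src, word, next_id))
            nodes.add(next_id)
            src = next_id
            next_id += 1
        finals.add(src)
    if fp_loops:
        # A self-loop on every node allows filled pauses between words.
        arcs.extend((n, FILLED_PAUSE, n) for n in sorted(nodes))
    return arcs, finals

def accepts(lm, words):
    """Return True when the word sequence follows one of the paths."""
    arcs, finals = lm
    states = {0}
    for word in words:
        states = {dst for src, label, dst in arcs if src in states and label == word}
        if not states:
            return False
    return bool(states & finals)

responses = ["by plane", "by train"]
baseline_lm = build_lm(responses)
loop_lm = build_lm(responses, fp_loops=True)
```

In this sketch the baseline model rejects any sequence containing a filled pause, whereas the model with loops accepts the same responses with filled pauses at arbitrary positions; an oracle model would instead list each response with its annotated filled pauses as ordinary words on the path.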

Acoustic Modelling.
We trained three-state tied Gaussian Mixture Models (GMM). Baseline triphone models were trained on 42 hours of native read speech from the CGN corpus [47]. In total 11 660 triphones were created, using 32 738 Gaussians.
As discussed in Section 2.1, it has been observed in several studies that by adapting or retraining native acoustic models (AM) with non-native speech, decoding performance can be increased. To investigate whether this is also the case in a constrained task as described in this paper, we retrained the baseline acoustic models with non-native speech.
New AMs were obtained by doing a one-pass Viterbi training based on the native AMs with 6 hours of non-native read speech from the JASMIN corpus. These utterances were spoken by the same speakers as those in our test material (comparable to an enrolment phase).
Triphone AMs are the de facto choice for most researchers in speech technology. However, the expected performance gain from modelling context dependency by using triphones over monophones might be minimal in a constrained task. Therefore, we also experimented with non-native monophone AMs trained on the same non-native read speech.

As explained in Section 2.1 non-native pronunciation generally deviates from native pronunciation, both at the phonetic and the phonemic level. To model pronunciation variation at the phonemic level, we added pronunciation variants to the lexicon.
To derive pronunciation variants, we extracted context-dependent rewrite rules from an alignment of canonical and realized phonemic representations of non-native speech from the JASMIN corpus (the test material was excluded).
Prior probabilities of these rules were estimated by taking the relative frequency of rule applications in their context. We generated pronunciation variants by successively applying the derived rewrite rules to the canonical representations in the baseline lexicon. Variant probabilities were calculated by multiplying the applied rule probabilities. Canonical representations have a standard probability of one. Afterwards, probabilities of pronunciation variants per word were normalized so that these probabilities sum to one.
By introducing a cutoff probability, pronunciation lexicons were created that contain only variants above this cutoff. In this way lexicons with on average 2, 3, 4, and 5 variants per word were created.
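The variant generation procedure can be sketched in Python as follows. This is a minimal sketch under several assumptions that are not stated in the text: a rule is a tuple (left context, focus phone, right context, replacement, probability), a rule is applied at all matching positions of a variant at once, `#` marks a word boundary, and the cutoff is applied after normalization. The phone symbols and rule probabilities in the example are made up.

```python
def apply_rule(phones, rule):
    """Apply one context-dependent rewrite rule; return None if it does not match."""
    left, focus, right, replacement, _prob = rule
    padded = ("#",) + tuple(phones) + ("#",)  # "#" marks the word boundaries
    out, changed = [], False
    for i in range(1, len(padded) - 1):
        if padded[i] == focus and padded[i - 1] == left and padded[i + 1] == right:
            out.append(replacement)
            changed = True
        else:
            out.append(padded[i])
    return tuple(out) if changed else None

def generate_variants(canonical, rules, cutoff=0.0):
    """Generate pronunciation variants with normalized probabilities."""
    variants = {tuple(canonical): 1.0}  # the canonical form starts at probability one
    for rule in rules:
        for phones, prob in list(variants.items()):
            new = apply_rule(phones, rule)
            if new is not None:
                # Variant probability: product of the applied rule probabilities.
                variants[new] = max(variants.get(new, 0.0), prob * rule[4])
    total = sum(variants.values())
    normalized = {v: p / total for v, p in variants.items()}
    # Keep only the variants whose normalized probability is above the cutoff.
    return {v: p for v, p in normalized.items() if p > cutoff}

# Hypothetical rules for a hypothetical word with canonical form /h u s/.
rules = [("h", "u", "s", "o", 0.25), ("u", "s", "#", "z", 0.5)]
all_variants = generate_variants(("h", "u", "s"), rules)
pruned = generate_variants(("h", "u", "s"), rules, cutoff=0.2)
```

Raising the cutoff shrinks the lexicon, which is how lexicons with different average numbers of variants per word could be obtained from the same rule set.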

Evaluation. We evaluated the speech decoding setups using the utterance error rate (UER), which is the percentage of utterances where the 1-Best decoding result deviates from the transcription. Filled pauses are not taken into account during evaluation. That is, decoding results and reference transcriptions were compared after deletion of filled pauses.
For each UER the 95% confidence interval was calculated to evaluate whether UERs between conditions were significantly different.
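The evaluation measure can be sketched in Python as follows. The text does not say how the confidence intervals were computed; the normal approximation to the binomial used here is an assumption, and the token `<fp>` and the function names are hypothetical.

```python
import math

FILLED_PAUSES = {"<fp>"}  # hypothetical filled pause token(s)

def strip_filled_pauses(words):
    """Remove filled pauses so that they do not count as errors."""
    return [w for w in words if w not in FILLED_PAUSES]

def utterance_error_rate(hypotheses, references):
    """Fraction of utterances whose 1-Best result deviates from the transcription."""
    errors = sum(
        strip_filled_pauses(hyp) != strip_filled_pauses(ref)
        for hyp, ref in zip(hypotheses, references)
    )
    return errors / len(references)

def confidence_interval(rate, n, z=1.96):
    """95% interval for an error rate over n utterances (normal approximation)."""
    half_width = z * math.sqrt(rate * (1.0 - rate) / n)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)

hyps = [["a", "<fp>", "b"], ["a", "c"], ["a"], ["b"]]
refs = [["a", "b"], ["a", "b"], ["a"], ["<fp>", "b"]]
uer = utterance_error_rate(hyps, refs)

# Illustrative interval for a 10% UER over the 1325 utterances of the test set.
low, high = confidence_interval(0.10, 1325)
```

Under this approximation, two conditions could be regarded as significantly different when their intervals do not overlap, which is one possible reading of the comparison described above.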
As explained in the introduction, we do not expect our method to carry out a detailed phonetic analysis in the first phase. Since it is not necessary to discriminate between phonetically close responses at this stage, a decoding result can be classified as correct when its phonetic distance to the corresponding transcription is below a threshold. The phonetic distance was calculated through an alignment.

Results.
In Table 1, the UERs for the different language models and acoustic models can be observed. In all cases, the LM with filled pause loops performed significantly better than the LM without loops. Furthermore, the oracle LM with manually annotated filled pauses (with positions) did not perform significantly better than the LM with loops. Decoding setups with AMs retrained on non-native speech performed significantly better than those with AMs trained on native speech. The performance difference between monophone and triphone AMs was not significant.
As expected, error rates are lower when evaluating using clusters of phonetically similar responses. To better appreciate the results in Table 1 it is important to get an idea of the meaning of these distances. The distances between the example responses in Section 4.1.3 are shown in Table 2. The density of the phonetic distances between all response pairs to all questions is depicted in Figure 5. Since there are only a few responses with a phonetic distance smaller than 5, differences between 0 and 5 are marginal. Performance differences between 0 (equal to transcription) and 10 (one of the answers with a phonetic distance of 10 or smaller to the 1-Best equals the transcription) and between 5 and 15 were significant.
As can be seen in Table 3, performance decreased using lexicons with pronunciation variants generated using data-driven methods. The more variants are added, the worse the performance. Furthermore, there is no significant difference between using equal priors and estimated priors.

Discussion.
The results presented in the previous section indicate that large and significant improvements could be obtained by optimizing the language model and the acoustic models. On the other hand, pronunciation modelling at the level of the lexicon did not produce significant improvements. On the contrary, adding variants to the lexicon caused a decrease in performance. Adding estimated prior probabilities to the variants improved the results somewhat, but still the error rates remain higher than those for the canonical lexicon. These results might be surprising because, in general, adding a limited number of carefully selected pronunciation variants to the lexicon helps improve performance to a certain extent [29,30]. However, in the case of non-native speech this strategy is not always successful [31]. Possible explanations might be sought in the nature of the variation that characterizes non-native speech. Non-native speakers are likely to replace target language phonemes by phonemes from their mother tongue [3,5]. When the non-native speech is heterogeneous in the sense that it is produced by speakers with different mother tongues, as in our case, it may be extremely difficult to capture the rather diffuse pattern of variation by including variants in the lexicon (see also [4]).
The findings that better results are obtained with non-native acoustic models and with a language model with filled pause loops are not surprising: after all, the utterances are spoken by non-natives, recorded in the same environment, and contain a lot of filled pauses. In fact, these results do not differ significantly from the results obtained with an oracle language model, in which the exact position of the filled pauses is copied from the manual transcriptions. This is an important result because non-natives are known to produce numerous filled pauses in unprepared, extemporaneous speech [12]. From these results we can conclude that external filled pause detection, for which better results were found for a large vocabulary task [49], is not necessary in this case.
Another reassuring result is that performance improved using non-native acoustic models. These were obtained by retraining native models on a relatively small amount (around 8 minutes per speaker) of non-native read speech material. It appears that this was sufficient to obtain significantly better results. In the final application we might then use a relatively short enrolment phase and do acoustic model retraining (and/or online speaker adaptation), to obtain better recognition results.
While in this experiment the correct transcription of the response was always in the language model, our system must also be able to reject utterances when they are not present in the language model, while still accepting correctly recognized utterances. This is the topic of the experiment presented in the following section.

Experiment 2: Utterance Verification
The goal of this experiment is to develop a procedure for utterance verification. Our approach consists of combining an acoustic likelihood ratio with duration-related predictors into one confidence measure.

Material.
We used the same material as in the first experiment, but to simulate the case in which the spoken utterance is not present in the list, we also generated language models in which the correct utterance is left out. In this way, each of the 1325 utterances in our dataset is decoded twice: once when its representation is present in the language model and once when it is not.

Confidence Predictors.
As mentioned in Section 4.2, posterior probability estimation using rich word lattices is often used in large vocabulary applications, where it usually provides accurate confidence measures, although it is computationally expensive. Since in our case the search space only contains a limited set of sequences of words, the decoding lattice is not rich enough to estimate $p(x)$ (see (4)). Estimating $p(x)$ on the basis of a free phone recognizer (FPR) is a simpler and faster approach that generally gives reasonably good results. For these reasons, we have used the ratio

$$\frac{p(x \mid w)\,p(w)}{p(x \mid w_{\mathrm{FPR}})\,p(w_{\mathrm{FPR}})} \tag{7}$$

as our baseline confidence measure. However, because we have equal prior probabilities for all language model paths and we do not use a language model during free phone recognition, the priors $p(w)$ and $p(w_{\mathrm{FPR}})$ can be discarded and (7) boils down to

$$\mathrm{LR} = \frac{p(x \mid w)}{p(x \mid w_{\mathrm{FPR}})} \tag{8}$$

This ratio bears a close relation to (6) used in the statistical hypothesis testing approach to UV. The main difference is that in the denominator in (8) all paths are used, while in (6) only the alternative paths are used to compare with the recognition result to be verified. Modelling the alternative paths in an antimodel is especially difficult in our task because it is hard to determine what exactly it should represent if the utterance is produced by language learners with generally low levels of proficiency and very diverse L1 backgrounds (see also [44]). Furthermore, training such an antimodel requires a large amount of non-native speech data that is not available for Dutch.
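In practice, a ratio of acoustic likelihoods such as the one above is usually handled in the log domain, where it becomes a difference of log-likelihoods. The following is a minimal sketch under stated assumptions: the function name, the two input scores, and the optional per-frame normalization are hypothetical and are not taken from the decoder used in the experiments.

```python
def log_likelihood_ratio(loglik_lm_path, loglik_fpr, n_frames=None):
    """Log of p(x|w) / p(x|w_FPR), computed from two acoustic log-likelihoods.

    loglik_lm_path: log p(x|w) of the word sequence selected from the language model.
    loglik_fpr:     log p(x|w_FPR) of the free phone recognition of the same audio.
    n_frames:       optional number of frames; dividing by it (an assumption of
                    this sketch) makes scores comparable across utterance lengths.
    """
    if n_frames is not None and n_frames <= 0:
        raise ValueError("n_frames must be positive")
    llr = loglik_lm_path - loglik_fpr  # log of a ratio = difference of logs
    return llr if n_frames is None else llr / n_frames


# Hypothetical scores for one utterance of 100 frames.
raw_llr = log_likelihood_ratio(-1200.0, -1150.0)
per_frame_llr = log_likelihood_ratio(-1200.0, -1150.0, n_frames=100)
```

Because the free phone recognizer is less constrained than the language model path, the log ratio is typically at or below zero, and values further below zero suggest a poorer match between the audio and the selected word sequence.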
We hypothesize that combining our baseline CM (LR) with other predictors that contain additional information about the quality of the recognition result will give better results than using LR alone. However, using the average hypothesis density in the word lattice as a predictor is probably not informative because in our task the word lattice is very small and contains very few competing hypotheses. Furthermore, a predictor like acoustic stability is difficult to define because different weightings of the language model have no effect on the combination score (because a priori each sequence of words in the language model is equally likely).
We expect that phone durations might contain additional information, because the phone segmentation of an incorrectly decoded sequence of words will generally be characterized by deviations in phone durations and this is not directly coded in the acoustic likelihoods in LR. Therefore, we want to add information about these phone duration deviations.
When the input speech representation is not present in the list and the utterance is recognized as another sequence of words that is present in the LM, the phone segmentation of this sequence of words will generally be characterized by deviations in phone durations. A straightforward way to capture this is to count the phones in the segmentation with durations that deviate substantially from the mean phone duration. We have implemented this by using predictors similar to those introduced in [42].

EURASIP Journal on Audio, Speech, and Music Processing
Phone duration distributions were derived from manually verified phonemic transcriptions of 42 hours of read native speech from the CGN corpus [47]. For each of the 46 phonemes, the 1st, 5th, 95th, and 99th percentile durations were calculated from these distributions. The predictors that were extracted from the segmentation are the number of phonemes in the decoded utterance that are shorter than the 1st (nrshorter_1) and 5th (nrshorter_5) percentile durations and the number of phonemes that are longer than the 95th (nrlonger_95) and 99th (nrlonger_99) percentile durations. These predictors were normalized by the total number of phonemes in the recognized utterance.
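The four duration predictors can be illustrated with a short sketch. The data layout (a list of phoneme and duration pairs, and a per-phoneme table of percentile durations) and all numeric values below are hypothetical, chosen only to show the counting and the normalization by utterance length.

```python
def duration_predictors(segmentation, percentiles):
    """Fractions of phones with unusually short or long durations.

    segmentation: list of (phoneme, duration_in_seconds) for the decoded utterance.
    percentiles:  dict mapping phoneme -> (p1, p5, p95, p99) durations in seconds,
                  estimated beforehand on native speech.
    """
    if not segmentation:
        raise ValueError("segmentation must contain at least one phone")
    counts = {"nrshorter_1": 0, "nrshorter_5": 0, "nrlonger_95": 0, "nrlonger_99": 0}
    for phoneme, duration in segmentation:
        p1, p5, p95, p99 = percentiles[phoneme]
        # A phone below the 1st percentile is also below the 5th, so the
        # counts are nested; the same holds for the 99th and 95th percentiles.
        counts["nrshorter_1"] += int(duration < p1)
        counts["nrshorter_5"] += int(duration < p5)
        counts["nrlonger_95"] += int(duration > p95)
        counts["nrlonger_99"] += int(duration > p99)
    n_phones = len(segmentation)
    return {name: count / n_phones for name, count in counts.items()}


# Hypothetical percentile table and segmentation (not values from the CGN corpus).
example_percentiles = {
    "a": (0.03, 0.04, 0.20, 0.28),
    "t": (0.02, 0.03, 0.15, 0.20),
    "s": (0.04, 0.05, 0.18, 0.25),
}
example_segmentation = [("a", 0.02), ("t", 0.30), ("s", 0.08), ("a", 0.10)]
example_predictors = duration_predictors(example_segmentation, example_percentiles)
```

In this toy utterance one of four phones is extremely short and one is extremely long, so each predictor equals 0.25.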

Predictor Combination.
To combine the five predictors, that is, LR, nrshorter_1, nrshorter_5, nrlonger_95, and nrlonger_99, into one confidence measure we have used a logistic regression model. Logistic regression modelling is a straightforward and fast method known to produce accurate predictions when the logit of a binary variable is a linear function of several explanatory variables [50]. It fits the logit of the probability (logarithm of the odds) of a binary event as a linear function of the set of explanatory variables:

$$\operatorname{logit}\bigl(p(y \mid \mathbf{p})\bigr) = \log\frac{p(y \mid \mathbf{p})}{1 - p(y \mid \mathbf{p})} = \beta_0 + \sum_{i} \beta_i\,p_i,$$

where $p(y \mid \mathbf{p})$ is the probability of a correctly or incorrectly decoded utterance $y$ given the confidence predicting variables $\mathbf{p}$. The optimal weights $\beta$ are chosen through Maximum Likelihood Estimation (MLE) in WEKA [51]. We trained and tested the model by using Leave-One-Speaker-Out cross-validation, where the model is trained on all speakers except one and then tested on the utterances of the speaker that was left out during training. This is repeated until all speakers are tested.
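The combination and the cross-validation scheme can be sketched as follows. This is not the WEKA setup used in the experiments: it is a plain-Python stand-in that fits the weights by batch gradient ascent on the log-likelihood, with a hypothetical learning rate, epoch count, and toy data restricted to two predictors (a log likelihood ratio and one duration fraction).

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(beta, x):
    """Probability of 'correctly decoded' for predictor vector x."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


def fit_logistic(X, y, learning_rate=0.5, epochs=2000):
    """Maximum-likelihood fit by batch gradient ascent; returns [b0, b1, ...]."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(beta, xi)
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b + learning_rate * g / len(X) for b, g in zip(beta, grad)]
    return beta


def leave_one_speaker_out(X, y, speakers):
    """Score every utterance with a model trained on all other speakers."""
    scores = [None] * len(X)
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        beta = fit_logistic([X[i] for i in train], [y[i] for i in train])
        for i in test:
            scores[i] = predict_proba(beta, X[i])
    return scores


# Hypothetical toy data: [log likelihood ratio, fraction of very short phones].
speakers = ["s1"] * 4 + ["s2"] * 4 + ["s3"] * 4
X = [
    [-0.2, 0.00], [-0.3, 0.05], [-1.5, 0.30], [-1.3, 0.25],
    [-0.1, 0.05], [-0.3, 0.00], [-1.3, 0.35], [-1.6, 0.20],
    [-0.3, 0.10], [-0.2, 0.05], [-1.2, 0.30], [-1.4, 0.40],
]
y = [1, 1, 0, 0] * 3  # 1 = correctly decoded, 0 = incorrectly decoded
scores = leave_one_speaker_out(X, y, speakers)
```

Each utterance is scored by a model that never saw its speaker, so the resulting scores can be pooled over speakers and thresholded in a later evaluation step.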

Evaluation.
We evaluated the discriminative ability of our utterance verifier using Receiver Operating Characteristic (ROC) curves, in which the two types of error rates, that is, the false-positive and false-negative rates, are plotted for different thresholds. The different confidence indicators and their combinations are evaluated at the point on the ROC curve where the error rates of both types are equal, the equal error rate (EER). 95% confidence intervals were calculated to investigate whether differences between EERs were significant.
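The EER can be approximated on a finite set of scores by sweeping the threshold over the observed values and taking the point where the two error rates are closest. The sketch below assumes that higher scores mean higher confidence and that an utterance is accepted when its score is at or above the threshold; the function name and toy data are hypothetical.

```python
def equal_error_rate(scores, labels):
    """Approximate EER and the threshold at which it is reached.

    scores: confidence per utterance (higher = more likely correctly decoded).
    labels: 1 for correctly decoded, 0 for incorrectly decoded.
    An utterance is accepted when score >= threshold.
    """
    n_correct = sum(labels)
    n_incorrect = len(labels) - n_correct
    if n_correct == 0 or n_incorrect == 0:
        raise ValueError("both classes must be present")
    best = None
    for threshold in sorted(set(scores)) + [float("inf")]:
        # False rejection: correctly decoded but scored below the threshold.
        fr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < threshold) / n_correct
        # False acceptance: incorrectly decoded but scored at or above it.
        fa = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= threshold) / n_incorrect
        gap = abs(fr - fa)
        if best is None or gap < best[0]:
            best = (gap, (fr + fa) / 2.0, threshold)
    return best[1], best[2]


# Hypothetical scores; one incorrect utterance outscores one correct utterance.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
eer, eer_threshold = equal_error_rate(toy_scores, toy_labels)
```

With these eight scores the two error rates meet at a threshold of 0.6, where one of four correct utterances is rejected and one of four incorrect utterances is accepted.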

Results.
The utterance error rate (UER) of our speech decoder on the set of decoding results where the correct transcription was present in the LM was 10.0% (see Section 4.2). In this case errors consist of substitutions with competing language model paths. The UER on the set without the correct transcriptions in the LM was of course 100.0%, so on average 55.0% of all cases were incorrectly recognized. The task for the UV was to discriminate between the correctly and incorrectly recognized cases. In Table 4, this ability is shown in terms of EER for the individual predictors and several predictor combinations. ROC curves of the best performing predictor and two combinations are shown in Figure 6.

Among the individual predictors, LR performs best (14.4%) and all the duration-related predictors perform much worse. The best result for a single duration predictor is 27.3% for nrshorter_1. When we combined all duration-related predictors (duration_comb), the EER dropped significantly relative to the best performing duration-related predictor, from 27.3% (95% confidence interval ±1.7) to 25.3%. Finally, by combining LR with duration_comb, the EER decreased significantly relative to LR alone, by 4.1 percentage points from 14.4% to 10.3%.
In Tables 5(a) and 5(b), percentages are shown using the EER threshold and using all predictors for the two different sets of decoding results, with and without the correct transcription in the LM, respectively. For example, in the set of results with the correct transcription in the LM, 80.8% of the utterances were classified as correct when they were indeed correctly decoded and 9.2% were correctly decoded but classified as incorrect (false reject). In the set without the correct transcription in the LM, 91.7% were classified as incorrect when they were incorrectly decoded, and 8.3% were classified as correct (false accept). The performance on the whole dataset is shown in Table 5(c).
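As a worked check on the figures for the set with the correct transcription in the LM: the 80.8% and 9.2% are shares of all utterances in that set, and together they make up the 90.0% that were correctly decoded (UER of 10.0%). Expressed as rates over the correctly decoded utterances only, they become:

```latex
\frac{80.8}{80.8 + 9.2} = \frac{80.8}{90.0} \approx 89.8\%
\quad\text{(correctly decoded and accepted)},
\qquad
\frac{9.2}{90.0} \approx 10.2\%
\quad\text{(correctly decoded but rejected)}.
```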

Discussion.
The duration-related predictors have a weak performance individually, but they still contain additional information relative to the likelihood ratio LR. The distributions of the duration-related predictors for correctly and incorrectly decoded utterances overlap severely. This was still the case when we normalized these predictors for the speaking rate within the utterance or when we used the probability of the phoneme durations in the utterance as a predictor. The latter we calculated through a kernel density estimation of the duration probability density per phoneme trained on the CGN native read speech data. Using these more complex predictors, the model was not able to make substantially better predictions.
By introducing a UV procedure and using the EER threshold, we are able to filter out 91.7% of the utterances that are not in the predicted list of responses. This comes at the cost of also rejecting utterances that are correctly decoded and accepting utterances that are incorrectly decoded. The ratio between these error rates depends on the threshold setting. We will discuss threshold calibration in the following section.

General Discussion
We carried out two experiments in order to evaluate methods for utterance selection and utterance verification which are going to be used in a CALL application for low-proficient L2 learners of Dutch. For utterance selection with the transcription of the response in the language model, our best error rates were between 10.0% and 6.9% after optimizing acoustic and language models. In 90% of the cases, the decoding result was equal to the corresponding transcription of the response (phonetic distance of 0) and in 93.1% of the cases, the decoder was able to select a cluster of transcriptions with a phonetic distance of 15 or smaller to the 1-Best in which the corresponding transcription was present.
Using an utterance verifier that combined acoustic likelihoods and duration information of the decoding result, 89.8% of the correctly decoded responses were accepted and 70% of the incorrectly decoded utterances could be rejected when the transcription of the response was present in the language model. In addition, 91.7% of the utterances with no representation in the language model could correctly be rejected.
These results apply when we only perform error detection on the 1-Best decoding result, but as explained in Section 3, error detection will probably be performed on the cluster of responses that have a small phonetic distance to the 1-Best decoding result. For example, if it is not clear whether a segment or a (short) word was pronounced or not, this can be ascertained in the second step through a more detailed analysis [19]. At the moment, we think that in the second step we can handle utterances with a phonetic distance smaller than 5, which usually corresponds to a difference of 1 or 2 segments, or possibly even utterances with a phonetic distance smaller than 10, which often boils down to a deviation by a short word. For the latter category, the best result obtained is an error rate of around 8%. This is encouraging, especially if we keep in mind that in a language learning application we can be conservative, in the sense that if we are not sufficiently confident about the recognition result we can always ask the language learner to try again.
Until now we have evaluated the performance of UV using the EER threshold, but this might not be the optimal threshold setting in the actual application. In our application the recognized utterance will probably be shown to the user so that he/she knows whether the utterance was correctly recognized and what the feedback is based on. If the system makes an error in recognizing the utterance, this will then be clear to the user. The system can make two types of errors: (a) a false rejection, in which case a correctly decoded utterance is classified as incorrect by the UV, or (b) a false acceptance, in which case an incorrectly decoded utterance is classified as correct. To determine which of these errors is more detrimental at this stage of the application, it is necessary to consider how such errors can be handled in the application and what their possible consequences are. In the case of a rejection, and therefore also of a false rejection, it is possible to ask the user to repeat the utterance. In concrete terms then, a false rejection implies that the user is unnecessarily asked to repeat the utterance. In the case of a false acceptance, an utterance will be shown to the user that he/she did not actually produce. This type of error would seem to be more detrimental because it can affect the credibility of the system. However, the degree of seriousness will depend on the degree of discrepancy between the utterance that was actually produced and the one that was recognized and shown by the system: the larger the deviation, the more serious the error. On the other hand, large deviations are less likely than small deviations. On the basis of such considerations we can indicate the seriousness of the two types of errors and therefore the costs that should be assigned to false rejections and false acceptances.
There are now three different factors that are important in choosing an application-dependent threshold, namely (1) the prior probability of a correct decoding $p_{\mathrm{correct}}$, (2) the cost of a false rejection $C_{\mathrm{FR}}$, and (3) the cost of a false acceptance $C_{\mathrm{FA}}$. To formalize the idea of taking into account different error costs and different prior distributions in the process of choosing a threshold, we can estimate the total cost of a specific threshold setting with a cost function:

$$C_{\mathrm{total}} = p_{\mathrm{FR}}\,C_{\mathrm{FR}}\,p_{\mathrm{correct}} + p_{\mathrm{FA}}\,C_{\mathrm{FA}}\,(1 - p_{\mathrm{correct}}),$$

where $p_{\mathrm{FR}}$ and $p_{\mathrm{FA}}$ are the probabilities of false rejection and false acceptance, respectively. This kind of cost function is also used in the NIST evaluation of speaker recognition systems [52]. Minimizing $C_{\mathrm{total}}$ on a development set will provide us with the optimal threshold setting given the application-dependent parameters $C_{\mathrm{FR}}$, $C_{\mathrm{FA}}$, and $p_{\mathrm{correct}}$. Using the UV with this application-dependent threshold calibration procedure could make an excellent research vehicle for future experiments with different error costs.
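The calibration procedure can be sketched as a search over candidate thresholds on a development set. The costs, the prior, and the toy scores below are hypothetical choices for illustration; in particular, a false acceptance is assumed to be five times as costly as a false rejection, and the prior of 0.45 merely mirrors the share of correctly recognized cases in the experimental set.

```python
def error_rates(scores, labels, threshold):
    """False rejection and false acceptance rates when accepting score >= threshold."""
    n_correct = sum(labels)
    n_incorrect = len(labels) - n_correct
    fr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < threshold) / n_correct
    fa = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= threshold) / n_incorrect
    return fr, fa


def total_cost(p_fr, p_fa, c_fr, c_fa, p_correct):
    """C_total = p_FR * C_FR * p_correct + p_FA * C_FA * (1 - p_correct)."""
    return p_fr * c_fr * p_correct + p_fa * c_fa * (1.0 - p_correct)


def calibrate_threshold(scores, labels, c_fr, c_fa, p_correct):
    """Return (threshold, cost) minimizing the total cost on a development set."""
    best = None
    for threshold in sorted(set(scores)) + [float("inf")]:
        p_fr, p_fa = error_rates(scores, labels, threshold)
        cost = total_cost(p_fr, p_fa, c_fr, c_fa, p_correct)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical development scores (1 = correctly decoded, 0 = incorrectly decoded).
dev_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
dev_labels = [1, 1, 1, 0, 1, 0, 0, 0]
threshold, cost = calibrate_threshold(dev_scores, dev_labels, c_fr=1.0, c_fa=5.0, p_correct=0.45)
```

With false acceptances weighted this heavily, the selected threshold (0.7) is stricter than the equal-error threshold for the same scores (0.6): the verifier rejects one correctly decoded utterance in order to avoid accepting any incorrectly decoded one.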