ASR for Under-Resourced Languages From Probabilistic Transcription

Abstract-In many under-resourced languages it is possible to find text, and it is possible to find speech, but transcribed speech suitable for training automatic speech recognition (ASR) is unavailable. In the absence of native transcripts, this paper proposes the use of a probabilistic transcript: a probability mass function over possible phonetic transcripts of the waveform. Three sources of probabilistic transcripts are demonstrated. First, self-training is a well-established semi-supervised learning technique, in which a cross-lingual ASR first labels unlabeled speech, and is then adapted using the same labels. Second, mismatched crowdsourcing is a recent technique in which non-speakers of the language are asked to write what they hear, and their nonsense transcripts are decoded using noisy channel models of second-language speech perception. Third, EEG distribution coding is a new technique in which non-speakers of the language listen to it, and their electrocortical response signals are interpreted to indicate probabilities. ASR was trained in four languages without native transcripts. Adaptation using mismatched crowdsourcing significantly outperformed self-training, and both significantly outperformed a cross-lingual baseline. EEG distribution coding and text-derived phone language models were both shown to improve the quality of probabilistic transcripts derived from mismatched crowdsourcing.

I. INTRODUCTION
Automatic speech recognition (ASR) has the potential to provide database access, simultaneous translation, and text/voice messaging services to anybody, in any language, dramatically reducing linguistic barriers to economic success. To date, ASR has failed to achieve its potential, because successful ASR requires very large labeled corpora; the human transcribers must be computer-literate, and they must be native speakers of the language being transcribed. Large corpora are beyond the resources of most under-resourced language communities; we have found that transcribing even one hour of speech may be beyond the reach of communities that lack large-scale government funding. In order to create the databases reported in this paper, for example, we sought paid native transcribers, at a competitive wage, for the 68 languages in which we have untranscribed audio data. We found transcribers willing to work in only eleven of those languages, and transcribers in only seven of those languages finished the task.
Instead of recruiting native transcribers in search of a perfect reference transcript, this paper proposes the use of probabilistic transcripts. A probabilistic transcript is a probability mass function, ρ_Φ(φ), specifying, as a real number between 0 and 1, the probability that any particular phonetic transcript φ is the correct transcript of the utterance. Prior to this work, machine learning has almost always assumed that the training dataset contains either deterministic transcripts (ρ_DT(φ) ∈ {0, 1}, commonly called "supervised training") or completely untranscribed utterances (commonly called "unsupervised training," in which case we assume that ρ_LM(φ) is given by some a priori language model). This article proposes that, even in the absence of a deterministic transcript, there may be auxiliary sources of information that can be compiled to create a probabilistic transcript with entropy lower than that of the language model, and that machine learning methods applied to the probabilistic transcript are able to make use of its reduced entropy in order to learn a better ASR. In particular, this paper considers three useful auxiliary sources of information: 1) SELF-TRAINING: ASR pre-trained in other languages is used to transcribe unlabeled training data in the target language. 2) MISMATCHED CROWDSOURCING: Human crowd workers who don't speak the target language are asked to transcribe it as if it were a sequence of nonsense syllables. 3) EEG DISTRIBUTION CODING: Humans who do not speak the target language are asked to listen to its extracted syllables, and their EEG responses are interpreted as a probability mass function over possible phonetic transcripts.

II. BACKGROUND
Suppose we require that, in order to develop speech technology, it is necessary first to have (1) some amount of recorded speech audio, and (2) some amount of text written in the target language. These two requirements can be met by at least several hundred languages: speech audio can be recorded from podcasts or radio broadcasts, and text can be acquired from Wikipedia, Bibles, and textbooks. Recorded speech is, however, not usually transcribed; and the requirement of native language transcription is beyond the economic capabilities of many minority-language communities.

A. ASR in Under-Resourced Languages
Krauwer [27] defined an under-resourced language to be one that lacks: stable orthography, significant presence on the internet, linguistic expertise, monolingual tagged corpora, bilingual electronic dictionaries, transcribed speech, pronunciation dictionaries, or other similar electronic resources. Berment [3] defined a rubric for tabulating the resources available in any given language, and proposed that a language should be called "under-resourced" if it scored lower than 10.0/20.0 on the proposed rubric. By these standards, technology for under-resourced languages is most often demonstrated on languages that are not really under-resourced: for example, ASR may be trained without transcribed speech, but the quality of the resulting ASR can only be proven by measuring its phone error rate (PER) or word error rate (WER) using transcribed speech. The intention, in most cases, is to create methods that can later be ported to truly under-resourced languages.
The International Phonetic Alphabet (IPA [21]) is a set of symbols representing speech sounds (phones) defined by the principle that, if two phones are used contrastively (i.e., they represent distinct phonemes) in any language, then those phones should have distinct symbolic representations in the IPA. This makes the IPA a natural choice for transcripts used to train cross-language ASR systems, and indeed ASR in a new language can be rapidly deployed using acoustic models trained to represent every distinct symbol in the IPA [39]. However, because IPA symbols are defined phonemically, there is no guarantee of cross-language equivalence in the acoustic properties of the phones they represent. This problem arises even between dialects of the same language: a monolingual Gaussian mixture model (GMM) trained on five hours of Levantine Arabic can be improved by adding ten hours of Standard Arabic data, but only if the log likelihood of cross-dialect data is scaled by 0.02 [18].
Better cross-language transfer of acoustic models can be achieved, but only by using structured transfer learning methods, including neural networks (NN) and subspace Gaussian mixture models (SGMM). SGMMs use language-dependent GMMs, each of which is the linear interpolation of language-independent mean and variance vectors [37], e.g., 16% relative WER reduction was achieved in Tamil by combining SGMM with an acoustic data normalization technique [32]. NN transfer learning can be categorized as tandem, bottleneck, pre-training, phone mapping, and multi-softmax methods. In a tandem system, outputs of the NN are Gaussianized, and used as features whose likelihood is computed with a GMM; in a bottleneck system, features are extracted from a hidden layer rather than the output layer. Both tandem [44] and bottleneck [47] features trained on other languages can be combined with GMMs [47] or SGMMs [20] trained on the target language in order to improve WER.
A hybrid ASR is a system in which the NN terminates in a softmax layer, whose outputs are interpreted as phone or senone [7] probabilities. Knowledge of the target language phone inventory is necessary to train a hybrid ASR, but it is possible to reduce WER by first pre-training the NN hidden layers with multilingual data [17], [45]. A hybrid ASR can be constructed using very little in-language speech data by adding a single phone-mapping layer [42] or senone-mapping layer [10] to the output of the multilingual NN. A multi-softmax system is a network with several different language-dependent softmax layers, each of which is the linear transform of a multilingual shared hidden layer [17], [38], [47].

B. Self-Training
Self-training is a class of semi-supervised learning techniques in which a classifier labels unlabeled data, and is then re-trained using its own labels as targets. Self-training is frequently used to adapt ASR from a well-resourced language to an under-resourced language [5], [30], or in some cases, to create target-language ASR by adapting several source-language ASRs [48]. A self-trained classifier tends to be too conservative, because the tails of the data distribution are truncated by the self-labeling process [40]; on the other hand, a self-trained classifier needs to be conservative, because the error rate of the learned classifier increases at a rate more than proportional to the error rate of the self-labeling process [19]. Self-training is therefore most useful when the in-language training data are filtered, to exclude frames with confidence below a threshold [46], and/or weighted, so that some frames are allowed to influence the learned parameters more than others [16]. Self-training of NN systems has been shown to be about 50% more effective (1.5 times the error rate reduction) than self-training of GMM systems [19].

C. Mismatched Crowdsourcing
In [24], a methodology was proposed that bypasses the need for native-language transcription: mismatched crowdsourcing sends target-language speech to crowd-worker transcribers who have no knowledge of the target language, then uses explicit mathematical models of second-language phonetic perception to recover an equivalent phonetic transcript (Fig. 1). Majority voting is re-cast, in this paradigm, as a form of error-correcting code (redundancy coding), which effectively increases the capacity of the noisy channel; interpretation as a noisy channel permits us to explore more effective and efficient forms of error-correcting codes. Assume that cross-language phone misperception is a finite-memory process, and can therefore be modeled by a finite state transducer (FST). The complete sequence of representations from utterance language to annotation language can therefore be modeled as a noisy channel represented by the composition of up to three consecutive FSTs (Fig. 1): a pronunciation model, a misperception model, and an inverted grapheme-to-phoneme (G2P) transducer.

Fig. 1. Mismatched crowdsourcing: crowd workers on the web are asked to transcribe speech in a language they do not know. Annotation mistakes are modeled by a finite state transducer (FST) model of utterance-language pronunciation variability (reduction and coarticulation), composed with an FST model of non-native speech misperception (mapping utterance-language phones to annotation-language phones), composed with an inverted grapheme-to-phoneme (G2P) transducer.
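As a concrete illustration of this noisy-channel view, the sketch below composes two of the channel stages symbol by symbol in plain Python. It is a toy simplification (real systems compose weighted FSTs, which also handle insertions, deletions, and context), and all symbols and probabilities in it are invented for illustration rather than taken from the paper.

```python
# Minimal sketch of the noisy-channel view of mismatched crowdsourcing.
# Real systems compose weighted FSTs; here each stage is a per-symbol
# conditional distribution, which ignores insertions and deletions.

def compose(stage1, stage2):
    """P(c|a) = sum_b P(b|a) * P(c|b), computed per input symbol."""
    out = {}
    for a, dist_b in stage1.items():
        acc = {}
        for b, p_ab in dist_b.items():
            for c, p_bc in stage2.get(b, {}).items():
                acc[c] = acc.get(c, 0.0) + p_ab * p_bc
        out[a] = acc
    return out

# Utterance-language phone -> annotation-language phone (misperception model)
misperception = {"b_h": {"b": 0.7, "p": 0.3}, "p_h": {"p": 0.9, "b": 0.1}}
# Annotation-language phone -> annotation-language letters (inverted G2P)
inv_g2p = {"b": {"b": 1.0}, "p": {"p": 0.8, "pp": 0.2}}

channel = compose(misperception, inv_g2p)
print(channel["b_h"])  # distribution over letters a crowd worker might type
```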

D. Electrophysiology of Speech Perception
The human auditory system is sensitive to within-category distinctions in speech sounds, but such pre-categorical perceptual distinctions may be lost in transcription tasks, where a listener must filter their percepts through the limited number of categorical representations available in their native language orthography. EEG distribution coding is a proposed new method that interprets the electrical evoked potentials of an untrained listener (measured by electroencephalography or EEG) as a probability distribution over the phone set of the utterance language (Fig. 2). A transcriber, in this scenario, listens to speech in both his native language and an unfamiliar non-native target language, while his EEG responses are recorded. From his responses to English speech, an English-language EEG phone recognizer is trained [9]. Misperception probabilities ρ(ψ|φ) are then estimated: for each non-native phone φ, the classifier outputs are interpreted as an estimate of ρ(ψ|φ).

III. ALGORITHMS THAT INDUCE A PROBABILISTIC TRANSCRIPT
A deterministic transcript is a sequence of phone symbols, φ = [φ_1, . . . , φ_M], where φ_m is a symbol drawn from the phone set of the utterance language.
A probabilistic transcript is a probability mass function (pmf) over the set of deterministic transcripts. Capital letters denote random variables, lowercase letters denote instances: Φ_m is a random variable whose instance is φ_m. Denote the probability of transcript φ as ρ_Φ(φ), where ρ ("reference") means that ρ_Φ(φ) is a reference distribution, that is, a distribution specified by the probabilistic transcription process, and not dependent on ASR parameters during training. The distribution label Φ is omitted when clear from the instance label, e.g., ρ(φ), but ρ_Φ(u). Superscript denotes waveform index, while subscript denotes frame or phone index. Absence of either superscript or subscript denotes a collection; thus Φ = Φ^1, . . . , Φ^L (with instance value φ = φ^1, . . . , φ^L) is the random variable over all transcripts of the database. In all of the work described in this paper, the probabilistic transcript is represented as a confusion network [31], meaning that it is the product of independent symbol pmfs:

ρ(φ) = ∏_{m=1}^{M} ρ(φ_m). (1)

The pmf ρ(φ) can be represented as a weighted finite state transducer (wFST) in which edges connect states in a strictly left-to-right fashion without skips, and in which the edges connecting state m to state m + 1 are weighted according to the pmf ρ(φ_m) (Fig. 3).

Fig. 3. A probabilistic transcript (PT) is a probability mass function (pmf) over candidate phonetic transcripts. All PTs considered in this paper can be expressed as confusion networks, thus, as sequential pmfs over the null-augmented space of IPA symbols. In this schematic example, ∅ is the null symbol, symbols in brackets are IPA, and numbers indicate probabilities.
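To make the confusion-network representation concrete, the following minimal sketch stores a PT as one pmf per segment and evaluates ρ(φ) for candidate transcripts. The symbols and probabilities mirror the schematic style of Fig. 3 and are purely illustrative.

```python
# A probabilistic transcript as a confusion network: one pmf per segment over
# null-augmented symbols. All numbers below are illustrative.
NULL = "<eps>"
pt = [
    {"p": 0.6, "b": 0.4},
    {"a": 0.7, "e": 0.2, NULL: 0.1},
    {"t": 0.5, "d": 0.5},
]

def transcript_prob(pt, phones):
    """rho(phi) = product of per-segment probabilities; phones may include NULL."""
    assert len(phones) == len(pt)
    prob = 1.0
    for pmf, ph in zip(pt, phones):
        prob *= pmf.get(ph, 0.0)
    return prob

print(transcript_prob(pt, ["p", "a", "t"]))   # 0.21
print(transcript_prob(pt, ["b", NULL, "d"]))  # 0.02
```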
Three different experimental sources were tested for the creation of a PT. Self-training is now well-established in the field of under-resourced ASR; we adopted the algorithm of Vesely, Hannemann and Burget [46]. Mismatched crowdsourcing used original annotations collected using published methods [25]. EEG was not used independently here, but rather, was used to learn a misperception model applicable to the interpretation of mismatched crowdsourcing.

A. Self-Training
The first set of PTs is computed using NN self-training. The Kaldi toolkit [36] is first used to train a cross-lingual baseline ASR, using training data drawn from six languages not including the target language. The goal of self-training, then, is to adapt the NN to a database containing L speech waveforms in the target language, each represented by an acoustic feature matrix x = [x_1, . . . , x_T], where x_t is an acoustic feature vector. The feature matrix x represents an utterance of an unknown phone transcript φ = [φ_1, . . . , φ_M] which, if known, would determine the sequence but not the durations of senones (HMM states) s = [s_1, . . . , s_T].
The feature matrix x is decoded using the cross-lingual baseline ASR, generating a phone lattice output. Using scripts provided by previous experiments [46], the phone lattice is interpreted as a set of posterior senone probabilities ρ(s_t|x_t) for each frame, and the senone posteriors serve as targets for re-estimating the NN weights. Experiments using other datasets found that self-training should use best-path alignment to specify a binary target for NN training [46], but, apparently because of differences in the adaptation set between our experiments and previous work, we achieve better performance using real-valued targets. As in previous work, senones with a posterior probability below 0.7 were removed from the training set, thus the training target was a number between 0.7 and 1.0.
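A minimal sketch of the frame selection just described, assuming the per-frame senone posteriors have already been extracted from the lattice into a NumPy array; the function name, array layout, and return format are our own choices, not those of the Kaldi scripts.

```python
import numpy as np

def build_self_training_targets(senone_posteriors, threshold=0.7):
    """senone_posteriors: (T, N) array of per-frame senone posteriors.

    Keeps only frames whose best senone posterior exceeds `threshold`, and
    retains the real-valued posterior (between 0.7 and 1.0) as the soft target.
    """
    best = senone_posteriors.argmax(axis=1)   # most likely senone per frame
    conf = senone_posteriors.max(axis=1)      # its posterior probability
    keep = conf >= threshold
    frames = np.where(keep)[0]                # frame indices retained for adaptation
    return list(zip(frames, best[keep], conf[keep]))  # (frame, senone id, soft target)
```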

B. Mismatched Crowdsourcing
The second set of PTs was computed by sending audio in the target language to non-speakers of the target language, and asking them to write what they hear. It would be preferable to recruit transcribers who speak a language with predictable orthography, but since transcribers in those languages were more expensive, this experiment instead recruited transcribers who speak American English. Denote by T the set of mismatched transcripts produced by these English-speaking crowd workers, which we wish to interpret as a pmf over target-language phone sequences, ρ(φ|T). As an intermediate step, prior work [25] developed techniques to merge texts into a confusion network ρ(λ|T) over representative transcripts in the annotation-language orthography (Fig. 4).
Once transcripts have been aligned and filtered to create the orthographic confusion network ρ(λ|T), they are then translated into a distribution over phone transcripts according to

ρ(φ|T) = Σ_λ ρ(λ|T) ρ(λ|φ) ρ(φ) / ρ(λ). (2)

The terms other than ρ(λ|T) in Equation (2) are estimated as follows. ρ(λ) is modeled using a unigram prior over the letter sequences in λ. ρ(φ) is modeled using either a cross-lingual phone unigram, a language-constrained cross-lingual unigram (the cross-lingual unigram, constrained to take values from the phone set of the target language), or a language-specific phone bigram ρ(φ) = ∏_{m=1}^{M} ρ(φ_m|φ_{m−1}). Sec. IV-C describes an algorithm for training the phone bigram without using proscribed test-language resources; Sec. V lists the PT accuracies achieved using each of these three approaches. ρ(λ|φ) is called the misperception G2P, as it maps to graphemes in the annotation language, λ, from phones in the utterance language, φ. Section III-C describes methods that decompose ρ(λ|φ) into separate misperception and G2P transducers, but it can also be trained directly using representative transcripts λ (and their corresponding native transcripts) for speech in languages other than the target language. The model learned in this way is essentially a machine translation model, which translates graphemes in the annotation language (λ) into phonemes in any possible utterance language (φ). We assume that misperceptions depend more heavily on the annotation language than on the utterance language, and that therefore a model ρ(λ|φ) trained using a universal phone set for φ is also a good model of ρ(λ|φ) for the target language. Note that, while this assumption is not entirely accurate, it is necessitated by the requirement that no native transcripts in the target language can be used in building any part of our system.
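The following toy calculation instantiates the per-segment effect of Eq. (2): for a fixed letter written by the crowd workers, the misperception G2P ρ(λ|φ) and the phone prior ρ(φ) are combined and normalized into a posterior over target-language phones (the denominator ρ(λ) cancels in the normalization). All probabilities are invented for illustration; the phone set echoes the Urdu example discussed later in Sec. V.

```python
# Toy single-segment instantiation of Eq. (2): score target-language phones
# given the letter the crowd workers wrote. All numbers are illustrative.
letters = ["p", "b"]
phones = ["p", "ph", "b", "bh"]          # Urdu-like example from Sec. V
rho_lambda_given_phi = {                 # misperception G2P, rho(lambda|phi)
    "p": {"p": 0.9, "b": 0.1},
    "ph": {"p": 0.8, "b": 0.2},
    "b": {"p": 0.1, "b": 0.9},
    "bh": {"p": 0.2, "b": 0.8},
}
rho_phi = {"p": 0.3, "ph": 0.2, "b": 0.3, "bh": 0.2}   # phone language model

def phone_posterior(letter):
    scores = {phi: rho_lambda_given_phi[phi][letter] * rho_phi[phi] for phi in phones}
    z = sum(scores.values())
    return {phi: s / z for phi, s in scores.items()}

print(phone_posterior("p"))   # mass spread over [p, ph], small mass on [b, bh]
```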

C. Estimating Misperceptions from Electrocortical Responses
The misperception G2P described in Section III-B was estimated using a combination of mismatched and deterministic transcripts of non-target languages. However, with a small amount of transcribed data in the utterance language, it is possible to estimate the misperception G2P using electrocortical measurements of non-native speech perception. In this approach, the misperception G2P is decomposed into two separate transducers, a misperception transducer ρ(ψ|φ) and an annotation-language G2P ρ(λ|ψ):

ρ(λ|φ) = Σ_ψ ρ(λ|ψ) ρ(ψ|φ), (3)

where φ is a phone string in the utterance language, ψ is a phone string in the annotation language, and λ is an orthographic string in the annotation language. ρ(λ|ψ) is an inverted G2P in the annotation language, e.g., trained on the CMU dictionary of American English pronunciations [28]. ρ(ψ|φ) is the mismatch transducer, specifying the probability that a phone string φ in the utterance language will be misheard as the annotation-language phone string ψ.
In principle, the mismatch transducer could be computed empirically from a phone confusion matrix, if experimental data on phone confusions were available for all phones in the target language, and those data were based on responses from a listener with the same language background as the crowd worker transcribers. These goals are hard to meet. An alternative is to use distinctive feature representations (originally proposed to characterize the perceptual and phonological natural classes of phonemes [22]) to predict misperceptions based on differences between the distinctive feature values of annotation- and utterance-language phones. Given the assumption that every distinctive feature shared by phones φ and ψ independently increases their confusion probability, their confusion probability can be expressed as

ρ(ψ|φ) ∝ exp(−Σ_k w_k(φ, ψ)), (4)

where w_k(φ, ψ) is smaller if φ and ψ share the k-th distinctive feature. The assumption of independence is a simplifying assumption, given that many distinctive features have overlapping acoustic correlates. For example, the frequencies of the two lowest resonances of the vocal tract (the primary cues for vowel identity) are determined by articulatory gestures of the lips, jaw and tongue that are commonly represented by three or more distinctive features (e.g., height, backness, rounding, and advanced tongue root). Moreover, the weights w_k will probably also depend on properties of the speaker and listener (language, dialect, and idiolect), but data to train such a rich model do not exist. However, a reasonable approximate model can be learned by assuming that the w_k depend only on information about the listener, which can be incorporated via measurements of electrocortical activity. In particular, the weights w_k of the distinctive features can be set based on similarity of electrocortical responses (measured using EEG) as determined by a classifier trained to compute distinctive feature representations from electrocortical responses to the listener's native language phones. Thus, suppose a listener first hears phones φ = ψ in the native language, EEG response signals y are recorded, and a bank of binary classifiers g_k(y) are trained to label the distinctive features f_k(φ) [9]. Second, the same listener hears non-native phones φ in a new language, and EEG response signals y are recorded; then the contributions in Eq. (4) can be estimated as the negative log probability that the k-th classifier, applied to the response to φ, outputs the distinctive feature value of ψ:

w_k(φ, ψ) = −log ρ(g_k(y) = f_k(ψ) | φ). (5)
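A minimal sketch of the log-linear model in Eq. (4), in which the per-feature penalty is applied only when φ and ψ disagree on a feature (a simplification of "w_k is smaller when the feature is shared"). The feature vectors, phone inventory, and weight values are illustrative stand-ins; in the paper, the weights would come from the EEG classifier confusions of Eq. (5).

```python
import math

# Log-linear feature-based confusion model (Eq. (4) sketch). Illustrative
# feature values; weights stand in for -log EEG classifier confusions (Eq. (5)).
features = ["voice", "continuant", "coronal"]

def confusion_prob(phi_feats, psi_inventory, weights):
    """Return rho(psi|phi) over an annotation-language phone inventory."""
    scores = {}
    for psi, psi_feats in psi_inventory.items():
        penalty = sum(weights[k] for k in features if phi_feats[k] != psi_feats[k])
        scores[psi] = math.exp(-penalty)
    z = sum(scores.values())
    return {psi: s / z for psi, s in scores.items()}

inventory = {
    "p": {"voice": 0, "continuant": 0, "coronal": 0},
    "b": {"voice": 1, "continuant": 0, "coronal": 0},
    "t": {"voice": 0, "continuant": 0, "coronal": 1},
}
eeg_weights = {"voice": 0.8, "continuant": 1.5, "coronal": 2.0}
print(confusion_prob({"voice": 1, "continuant": 0, "coronal": 0}, inventory, eeg_weights))
```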

IV. ALGORITHMS FOR TRAINING ASR USING PROBABILISTIC TRANSCRIPTION
A. Expectation Maximization

An ASR is a parameterized pmf, π(x, s|φ, θ), specifying the dependence of acoustic features, x, and senones, s, on the phone transcript φ and the parameter vector θ, where the notation π(·) denotes a pmf dependent on ASR parameters. Assume a hidden Markov model (HMM), therefore

π(x, s|φ, θ) = ∏_{t=1}^{T} π(x_t|s_t, θ) π(s_t|s_{t−1}, φ, θ).
The probability π(x, s, φ|θ) is computed by composing the following three weighted FSTs:

H • C • PT,

where the notation has the following meaning. The probabilistic transcript, PT, is an FST that maps any phone string to itself. This mapping is deterministic and reflexive, but comes with a path cost determined by the transcription probability ρ(φ), as exemplified in Fig. 3. The context transducer, C, maps any senone sequence s to a phone sequence φ [33]. This mapping is stochastic, and its path cost is determined by the HMM transition weights π(s|φ, θ) = ∏_{t=1}^{T} π(s_t|s_{t−1}, φ, θ). The acoustic model, H, maps any senone sequence to itself. This mapping is deterministic and reflexive, but comes with a path cost determined by the acoustic modeling probability π(x|s, θ) = ∏_{t=1}^{T} π(x_t|s_t, θ). The posterior probability π(s, φ|x, θ) is computed by composing the FSTs, pushing toward the initial state (normalizing so that probabilities sum to one), then finding the total cost of the path through PUSH(H • C • PT) with input string s and output string φ. The log likelihood of the training data is L(θ) = log Σ_s Σ_φ π(x, s, φ|θ); the EM algorithm maximizes it indirectly through the quality function Q(θ, θ') = Σ_s Σ_φ π(s, φ|x, θ) log π(x, s, φ|θ'). The analytical maximum of Q(θ, θ') can be computed efficiently using the Baum-Welch algorithm, but experiments reported in this paper did not do so, for reasons described in the next subsection.

B. Segmental K-Means Training
The EM quality function, Q(θ, θ'), has properties that make it undesirable as an optimizer for L. Suppose, as often happens, that there is a poor phone sequence, φ_p, that is highly unlikely given the correct parameter vector θ*, meaning that π(x, s, φ_p|θ*) is very low. Suppose that the initial parameter vector, θ, is less discriminative, so that π(x, s, φ_p|θ) > π(x, s, φ_p|θ*). Indeed, the best speech recognizer is a parameter vector θ* that completely rules out poor transcripts, setting π(x, s, φ_p|θ*) = 0; but in this case Q(θ, θ*) = −∞. It is therefore not possible for the EM algorithm to start with parameters θ that allow φ_p, and to find parameters θ* that rule out φ_p. With probabilistic transcription, this problem is quite common: if the human transcribers fail to rule out φ_p (e.g., because the correct and incorrect transcripts are perceptually indistinguishable in the language of the transcribers), then the EM algorithm will also never learn to rule out φ_p.
EM's inability to learn zero-valued probabilities can be ameliorated by using the segmental K-means algorithm [23], which bounds L(θ') as L(θ') ≥ R(θ, θ'), where

R(θ, θ') = log π(x, s*(θ), φ*(θ)|θ'),  (s*(θ), φ*(θ)) = argmax_{s,φ} π(s, φ|x, θ).

Given an initial parameter vector θ, therefore, it is possible to find a new parameter vector θ' with higher likelihood by computing its maximum-likelihood senone sequence and phone sequence, s*(θ) and φ*(θ), and by maximizing π(x, s*(θ), φ*(θ)|θ') with respect to θ'. Maximizing R(θ, θ') rather than Q(θ, θ') is useful for probabilistic transcription because it reduces the importance of poor phonetic transcripts.

C. Using a Language Model During Training
During segmental K-means, it is advantageous to incorporate as much information as possible about the utterance language. Define G to be an FST representing the modeled phone bigram probability π(φ|θ) = ∏_{m=1}^{M} π(φ_m|φ_{m−1}, θ). Training results can be improved by using H • C • PT • G to compute segmental K-means.
By assumption, phone bigram information is not available from speech: we assume that there is no transcribed speech in the target language. A reasonable proxy, however, can be constructed from text. Fig. 5 shows text data downloaded from Wikipedia in Swahili, and a segment of a knowledge-based G2P for the Swahili language. Because this phone bigram will also be used in ASR testing, it is constructed using a knowledge-based method that requires zero test-language training data: The G2P is constructed by looking up "Swahili alphabet" on Wikipedia, downloading the resulting web page, and converting it by hand into an unweighted finite state transducer [15]. By passing the former through the latter, it is possible to generate synthetic phone sequences in the target language.
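The sketch below illustrates the zero-resource recipe just described: raw text is passed through a hand-built letter-to-phone mapping and phone bigrams are counted with a little smoothing. The letter-to-phone table is a tiny illustrative fragment, not the actual Swahili G2P, and the function and variable names are our own.

```python
from collections import Counter, defaultdict

# Build a text-derived phone bigram LM (Sec. IV-C sketch). The g2p table below
# is a toy stand-in for the hand-built "Swahili alphabet" transducer.
g2p = {"a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
       "b": "b", "h": "h", "k": "k", "m": "m", "n": "n",
       "r": "r", "s": "s", "w": "w"}

def text_to_phones(text):
    return [g2p[ch] for ch in text.lower() if ch in g2p]

def phone_bigram(sentences, smoothing=0.5):
    counts = defaultdict(Counter)
    for sent in sentences:
        phones = ["<s>"] + text_to_phones(sent) + ["</s>"]
        for prev, cur in zip(phones, phones[1:]):
            counts[prev][cur] += 1
    inventory = {p for c in counts.values() for p in c}
    return {prev: {p: (c[p] + smoothing) / (sum(c.values()) + smoothing * len(inventory))
                   for p in inventory}
            for prev, c in counts.items()}

lm = phone_bigram(["mambo", "habari yako", "karibu sana"])
print(lm["a"])  # estimated pi(phone | previous phone = "a")
```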
Composing PT • G is complicated by the presence of null transitions in the PT. A null transition in the PT matches a non-event in the language model, for which normal FST notation has no representation. In order to compose the PT with the language model, therefore, it is necessary to introduce a special type of "non-event" symbol, here denoted "#2", into the language model (Fig. 6). A language model "non-event" is a transition that leaves any state, and returns to the same state (a self-loop). Such self-loops, labeled with the special symbol "#2" on both input and output language, are added to every state in G (Fig. 6 (b)). The probabilistic transcript, then, is augmented with the special symbol "#2" as the output-language symbol for every null-input edge (input symbol is φ_m = ∅).

D. Maximum A Posteriori Adaptation

Maximum a posteriori (MAP) adaptation has been widely applied to GMM and HMM parameter estimation problems such as speaker adaptation [12]. Formally, for an unseen target language, denote its acoustic observations x = (x_1^1, . . . , x_T^L), and its acoustic model parameter set as θ; then the MAP parameters are defined as

θ_MAP = argmax_θ π(x|θ) π(θ),

where π(θ) is the product of conjugate prior distributions, centered at the parameters of a cross-lingual baseline ASR. In a GMM-HMM, the acoustic model is computed by choosing a Gaussian component, G_t, whose mixture weight is c_jk = π_{G_t|S_t}(k|j), and whose mean vector and covariance matrix are μ_jk and Σ_jk. Maximum likelihood trains these parameters by computing γ_t(j, k) = π_{S_t,G_t}(j, k|x, θ), then accumulating weighted averages of the acoustic frames with weights given by γ_t(j, k). Segmental K-means quantizes π_{S_t}(j|x, θ) → {0, 1} using forced alignment, then proceeds identically. MAP adaptation assigns, to each parameter, a conjugate prior π(θ) with mode equal to θ̄ (the parameters of the cross-lingual baseline), and with a confidence hyperparameter τ_θ, resulting in re-estimation formulae that are linearly interpolated between the baseline parameters θ̄ and the statistics of the adaptation data, for example:

μ̂_jk = (τ_μ μ̄_jk + Σ_t γ_t(j, k) x_t) / (τ_μ + Σ_t γ_t(j, k)).
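A minimal NumPy sketch of the interpolated mean update written above, assuming per-frame occupancies γ_t(j, k) for one Gaussian have already been computed; the variable names and the default τ are illustrative choices, not values from the paper.

```python
import numpy as np

# MAP mean update sketch (Sec. IV-D): interpolate the cross-lingual baseline
# mean with the weighted mean of the adaptation data.
def map_update_mean(mu_baseline, frames, gammas, tau=10.0):
    """mu_baseline: (D,) baseline mean; frames: (T, D); gammas: (T,) occupancies."""
    gamma_sum = gammas.sum()
    weighted_mean = (gammas[:, None] * frames).sum(axis=0) / max(gamma_sum, 1e-10)
    return (tau * mu_baseline + gamma_sum * weighted_mean) / (tau + gamma_sum)
```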

E. Neural Networks
The NN acoustic model is π_{X_t|S_t}(x_t|j, θ) ∝ y_t(j), whose parameters θ = {c_j, w_j, w_vh} include the senone priors c_j, the softmax weight vectors w_j, and the parameters defining the hidden nodes h_t(v, w_vh). NNs are trained by using a GMM-HMM to compute an initial senone posterior, π_{S_t}(j|x, θ), then minimizing the cross-entropy between the estimated senone posterior and the neural network output y_t(j),

E(θ') = −Σ_t Σ_j π_{S_t}(j|x, θ) log y_t(j), (21)

using gradient descent. Preliminary experiments showed that forced alignment improves the accuracy of NNs trained from probabilistic transcripts: the best path through the PT, and the best alignment of the resulting senones to the waveform, were both computed using forced alignment. The resulting best senone string was used to train a NN using Eq. (21).
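A small NumPy sketch of this training target: the cross-entropy between soft senone targets and softmax outputs, together with its gradient with respect to the pre-softmax activations (which is what gets backpropagated into the hidden layers). It is a generic illustration of the criterion, not the Kaldi implementation.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_and_grad(activations, targets):
    """activations: (T, J) pre-softmax outputs; targets: (T, J) senone posteriors
    (soft targets, or one-hot rows after forced alignment)."""
    y = softmax(activations)
    loss = -np.sum(targets * np.log(y + 1e-12)) / len(y)
    grad_activations = (y - targets) / len(y)   # gradient for backpropagation
    return loss, grad_activations
```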
V. AUDIO DATA AND MISMATCHED CROWDSOURCING

Speech data were extracted from publicly available podcasts [43] hosted in 68 different languages. In order to generate test corpora (in which it is possible to measure phone error rate), advertisements were posted at the University of Illinois seeking native speakers willing to transcribe speech in any of these 68 languages. Of the ten transcribers who responded, six people were each able to complete one hour of speech transcription (the other four dropped out). One additional language was transcribed by workers recruited at I2R in Singapore, yielding a total of seven languages with native transcripts suitable for testing an ASR: Arabic (arb), Cantonese (yue), Dutch (nld), Hungarian (hun), Mandarin (cmn), Swahili (swh) and Urdu (urd). It is desirable to test the ideas in this paper with corpora larger than one hour per language, but larger corpora involve problems orthogonal to the purposes of this paper, e.g., the Babel corpora contain telephone speech, and therefore contain far more acoustic background noise than the podcast corpora used in this paper.
The podcasts contain utterances interspersed with segments of music and English. A GMM-based language identification system was developed in order to isolate regions that correspond mostly to the target language, which were then split into 5-second segments to enable easy labeling by the native transcribers. Native transcribers were asked to omit any 5-second clips that contained significant music, noise, English, or speech from multiple speakers. Resulting transcripts covered 45 minutes of speech in Urdu and 1 hour of speech in the remaining six languages.

The orthographic transcripts for these clips were then converted into phonemic transcripts using language-specific dictionaries and G2P mappings. In order to make it possible to transfer ASR from training languages (which have native transcripts) to a test language (that has no native transcripts), the phone set must be standardized across all languages; for this purpose, the phone set was based on the International Phonetic Alphabet (IPA; [21]). Similarly, in order to transfer ASR from training languages to a test language, the training transcriptions must be converted to phonemes using a grapheme-to-phoneme transducer (G2P). G2Ps were therefore assumed to be available in all training languages, but not in the test language. Since these G2Ps are only used for training and not test languages, five of them (Arabic, Dutch, Hungarian, Cantonese and Mandarin) were trained using lexical resources, and only two (Urdu and Swahili) were constructed using the zero-resource knowledge-based method described in Sec. IV-C. English words in each transcript are identified and converted to phones with an English G2P trained using CMUdict [28], then other words are converted into phonetic transcripts using language-dependent dictionaries and G2Ps. The Arabic dictionary is from the Qatari Arabic Corpus [11], the Dutch dictionary is from CELEX v2 [1], the Hungarian dictionary was provided by BUT [14], the Cantonese dictionary is from I2R, and the Mandarin dictionary is from CALLHOME [6]. For each language, we chose a random 40/10/10 minutes split into training, development and evaluation sets.

Mismatched transcripts were collected from annotators on Amazon Mechanical Turk. Each 5-second speech segment was further split into 4 non-overlapping segments to make the non-native listening task easier. The crowdsourcing task was set up as described in [25]; briefly, the segments were played to annotators, who transcribed what they heard (typically in the form of nonsense syllables) using English orthography. Each segment was transcribed by 10 distinct annotators. More than 2500 annotators participated in these tasks, with roughly 30% of them claiming to know only English (Spanish, French, German, Japanese, Chinese were some of the other languages they reported knowing).
The quality of a probabilistic transcript derived from mismatched crowdsourcing is significantly improved by using a phone language model during the decoding process (ρ(φ) in Eq. (2)). Phone language models for each target language were computed from Wikipedia texts using the methods described in Sec. IV-C. The label phone error rate (LPER) of the 1-best path through the resulting PTs is shown in Table I, computed with reference to a native transcript in each language. As shown, the use of a phone language model derived from Wikipedia text reduces LPER by about 10% absolute in each language.
LPER of the 1-best path does not accurately reflect the extent of information in the PTs that can be leveraged during ASR adaptation. Consider, for example, the four Urdu phones [p, pʰ, b, bʱ]. An attentive English-speaking transcriber must choose between the two letters <p,b> in order to represent any of these four phones. The misperception G2P therefore maps the letters <p,b> into a distribution over the phones [p, pʰ, b, bʱ]. There is no reason to expect that the maximizer of ρ(φ|λ) is correct, but there is good reason to expect the correct answer to be a member of a short N-best list (N ≤ 4 phones/grapheme). A fuller picture is therefore obtained by pruning the PT to a small number of paths, then searching for the most correct path in the pruned PT. One useful metric is entropy per segment, defined as H(Φ) = −(1/M) Σ_{m=1}^{M} Σ_u ρ_{Φ_m}(u) log_2 ρ_{Φ_m}(u); e.g., a PT in which every segment has two equally probable options would measure H(Φ) = 1. Fig. 7 shows the trend of LPER (for three languages) obtained by pruning the PT at several different levels of H(Φ). LPER rates drop significantly across all languages within 1 bit of entropy per phone, illustrating the extent of information captured by the PTs.
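For concreteness, the sketch below computes the per-segment entropy H(Φ) of a confusion-network PT in the same list-of-pmfs format used earlier, and shows one simple way to prune low-probability phones before re-measuring the entropy. The numbers are illustrative, and the pruning rule is a stand-in for whatever thresholding produced the operating points in Fig. 7.

```python
import math

def entropy_per_segment(pt):
    """H(Phi): average, over segments, of the entropy of each segment pmf (bits)."""
    total = 0.0
    for pmf in pt:
        total -= sum(p * math.log2(p) for p in pmf.values() if p > 0)
    return total / len(pt)

def prune(pt, min_prob=0.1):
    """Drop phones below min_prob in each segment and renormalize."""
    pruned = []
    for pmf in pt:
        kept = {ph: p for ph, p in pmf.items() if p >= min_prob}
        z = sum(kept.values())
        pruned.append({ph: p / z for ph, p in kept.items()})
    return pruned

pt = [{"p": 0.5, "b": 0.5}, {"a": 0.8, "e": 0.15, "o": 0.05}]
print(entropy_per_segment(pt))          # 1 bit for the first segment, ~0.88 for the second
print(entropy_per_segment(prune(pt)))   # lower after pruning low-probability phones
```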

VI. EEG RECORDING AND ANALYSIS
To compute distinctive feature weights for the misperception transducer shown in Eqs. (4) and (5), cortical activity in response to non-native phones was recorded by an EEG. Signals were acquired using a BrainVision actiCHamp system with 64 channels and 1000 Hz sampling frequency. All procedures were approved by the University of Washington Institutional Review Board.
Auditory stimuli were consonant-vowel (CV) syllables representing consonants of three languages: English, Dutch and Hindi. The inclusion of only two non-English languages was dictated by the relatively high number of repetitions needed for good signal-to-noise ratio from averaged EEG recordings. The choice of Dutch and Hindi was made based on language phonological similarity, defined as the number of many-to-one mappings (N_M2O) between the English phoneme inventory and the non-English phoneme inventory. Many-to-one mappings are expected to pose a problem for the non-native transcription task being modeled by the misperception transducer, so to test the contribution of EEG we chose languages that differed greatly in this property. Using distinctive feature representations of the phonemes in each inventory from the PHOIBLE database [34], a many-to-one mapping was defined by finding, for each non-English phoneme φ, the English phoneme ψ*(φ) to which it is most similar. The number of many-to-one collisions is then defined as

N_M2O = Σ_{i=1}^{|Ω_Ψ|} [ |{φ : ψ*(φ) = ψ_i}| > 1 ],

where |Ω_Ψ| is the size of the English phoneme inventory, and [·] is the unit indicator function. The frequency of many-to-one mappings is listed in Table II for several languages. Hindi was chosen for having a large number of many-to-one mappings with English, while Dutch has relatively few. Note that, although Hindi podcasts were not included in the training data described in Section V, colloquial spoken Hindi and Urdu are extremely similar phonologically [26], and considering that the auditory stimuli for the EEG portion of this experiment are simple CV syllables, it is reasonable to consider Hindi and Urdu as equivalent for the purpose of computing feature weights for the misperception transducer.
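A small sketch of this nearest-neighbor mapping and collision count, using toy binary feature vectors in place of the PHOIBLE representations; the similarity measure (count of agreeing features) and all names are illustrative assumptions.

```python
# Many-to-one count sketch (Sec. VI): map each non-English phoneme to its most
# similar English phoneme, then count English phonemes that receive more than
# one non-English phoneme. Feature vectors are toy stand-ins for PHOIBLE.
def nearest_english(phi_feats, english_inventory):
    return max(english_inventory,
               key=lambda psi: sum(phi_feats[k] == english_inventory[psi][k] for k in phi_feats))

def count_many_to_one(non_english_inventory, english_inventory):
    hits = {}
    for phi, feats in non_english_inventory.items():
        psi = nearest_english(feats, english_inventory)
        hits.setdefault(psi, []).append(phi)
    return sum(1 for phis in hits.values() if len(phis) > 1)

english = {"p": {"voice": 0, "asp": 0}, "b": {"voice": 1, "asp": 0}}
hindi_like = {"p": {"voice": 0, "asp": 0}, "ph": {"voice": 0, "asp": 1},
              "b": {"voice": 1, "asp": 0}, "bh": {"voice": 1, "asp": 1}}
print(count_many_to_one(hindi_like, english))  # 2: both /p, ph/ and /b, bh/ collide
```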
To construct the auditory stimuli, two vowels and several consonants were selected from the phoneme inventory of each language (18 consonants for English, 17 for Dutch, and 19 for Hindi). Consonants were chosen to emphasize differences in the many-to-one relationships between English-Dutch and English-Hindi, while maintaining roughly equal numbers of consonants for each language. The consonants chosen for each language are given in Table III; the vowels chosen were the same for all three languages (/a/ and /e/). Two native speakers of each language (one male and one female) were recorded (44100 Hz sampling frequency, 16 bit depth) speaking multiple repetitions of the set of CV syllables for their language. Three tokens of each unique syllable were excised from the raw recordings, downsampled to 24414 Hz (for compatibility with the presentation hardware, Tucker Davis Technologies RP2.1), and RMS normalized. Recorded syllables had an average duration of 400 ms, and were presented via headphones to one monolingual American English listener. The stimuli were presented in 9 blocks of 15 minutes per block, for a total of 135 minutes. Syllables were presented in random order with an inter-stimulus interval of 350 ms. Twenty-one repetitions of each syllable were presented, for a grand total of 9072 syllable presentations.

Table III. Consonants used in the EEG experiment.
EEG recordings were divided into 500 ms epochs. The epoched data were coded with a subset of distinctive features that minimally defined the phoneme contrasts of the English consonants. Where more than one choice of features was sufficient to define those contrasts, preference was given to features that reflect differences in temporal as opposed to spectral features of the consonants, due to the high fidelity of EEG at reflecting temporal envelope properties of speech [9]. The final set of features chosen was: continuant, sonorant, delayed release, voicing, aspiration, labial, coronal, and dorsal.
Epoched and feature-coded EEG data for the English syllables only were used to train a support vector machine classifier for each distinctive feature. The classifiers were then used (without re-training) to classify the EEG responses to the Dutch and Hindi syllables. Fig. 8 shows equal error rates of these classifiers when applied to the three languages. EER of the classifier when applied to English phones is comparable to those reported in [9], the only prior work to attempt recognition of speech phonemes from the EEG of the listener.

Fig. 9. Phone confusion probabilities between English and Dutch phones using models in which the negative log probability is proportional to unweighted or weighted distance between the corresponding distinctive feature vectors. Left: unweighted. Right: feature weights equal negative log confusion probability of EEG signal classifiers.
Eq. (4) defines a log-linear model of ρ(ψ|φ), the probability that a non-English phoneme φ will be perceived as English phoneme ψ. Denote by ρ_U(ψ|φ) the model of Eq. (4) with uniform binary weights for all distinctive features. Denote by ρ_EEG(ψ|φ) the same model, but with weights w_k derived from EEG measurements (Eq. (5)). Fig. 9 shows these two confusion matrices: ρ_U(ψ|φ) on the left, ρ_EEG(ψ|φ) on the right. The entropy of the binary weighting, ρ_U(ψ|φ), is too low: when a Dutch phoneme φ has a nearest-neighbor ψ*(φ) in English, then few other phonemes are considered to be possible confusions. ρ_EEG(ψ|φ) has a very different problem: since distinctive feature classifiers have been trained for only a small set of distinctive features, there are large groups of phonemes whose confusion probabilities cannot be distinguished (giving the figure its block-matrix structure). The faults of both models can be ameliorated by averaging them in some way, e.g., by computing the linear interpolation ρ_I(ψ|φ) = (1 − α)ρ_U(ψ|φ) + αρ_EEG(ψ|φ) for some constant 0 ≤ α ≤ 1.
In order to evaluate the effectiveness of the EEG-induced misperception transducer, we examined the LPER of mismatched crowdsourcing for Dutch when performed using 1) a multilingual misperception model ρ(λ|φ) (the machine translation model described in Sec. III-B), 2) a feature-based misperception transducer computed using binary weighting, ρ_U(ψ|φ), or 3) the EEG-induced transducer combined with the feature-based transducer, ρ_I(ψ|φ). Both method (2) and method (3) required the use of a G2P in order to compute ρ(λ|ψ): the Dutch G2P was estimated using the CELEX database, while the Hindi G2P was estimated using the zero-resource knowledge-based method described in Sec. IV-C. The constant α = 0.29 was chosen as the average of the values selected by all folds in a leave-one-out cross-validation. LPER of the multilingual model was 70.43% (as shown in Table I), of the feature-based model, 69.44%, and of the EEG-interpolated model, 68.61%.

VII. AUTOMATIC SPEECH RECOGNITION
ASR was trained in four target languages in topline, baseline, and experimental conditions. Training methods are detailed in Sec. VII-A. Results are described in Sec. VII-B.

A. ASR Methods
Automatic speech recognition (ASR) systems were trained in four languages (hun=Hungarian, cmn=Mandarin, swh=Swahili, yue=Cantonese), using three different types of transcription. First, a topline MONOLINGUAL system was trained in each language using speech transcribed by a native speaker of that language. Second, a baseline CL (cross-lingual) system was trained using data from other languages, and tested in the target language. Third, the experimental PT-ADAPT system was created by adapting the cross-lingual system to probabilistic transcriptions in the target language. The MONOLINGUAL topline system is trained using native transcripts, converted to the phone set of the test language using the G2Ps described in Sec. V. These resources were not available to the CL or PT-ADAPT systems, which were not permitted to use any natively transcribed training data in the test language.
Audio data, native transcripts, and probabilistic transcripts are as described in Sec. V. The MONOLINGUAL topline system was trained using 40 minutes of training data, then stream weights and insertion penalties were calculated using 10 minutes of development test data. Monolingual systems were trained using a maximum likelihood (ML) criterion using the 40 minute in-language training set: GMM parameters were initialized using a monophone system trained on the same 40 minutes, NN parameters were initialized using a restricted Boltzmann machine trained on five hours of unlabeled audio in the same language. The CL baseline systems were each trained using 40 minutes of training data in languages other than the test language. CL systems were trained using ML, maximum mutual information (MMI), minimum phone error rate (MPE), and state-based minimum Bayes risk (sMBR, [13]) training criteria. The PT-ADAPT system was initialized using the CL system (ML training), then adapted to the target language using PTs based on mismatched crowdsourcing (these transcripts are described in detail in Sec. V). Probabilistic transcripts based on EEG were not used to adapt ASR, because it is not yet possible to use EEG to generate probabilistic transcripts on a scale sufficient for ASR adaptation.
All systems were trained using the Kaldi [36] toolkit. Acoustic features consisted of MFCC (13 features), stacked ±3 frames (13 × 7 = 91 features), reduced to 40 dimensions using LDA followed by fMLLR. GMM-HMM systems directly observed this 40-dimensional vector; NN-HMM systems computed fMLLR+d+dd stacked ±5 frames (40 × 3 × 11 = 1320 features/frame). All systems used tied triphone acoustic models, based on a decision tree with 1200 leaves. Each GMM-HMM used a library of 8000 Gaussians, shared among the 1200 leaves. Each NN-HMM used six hidden layers with logistic nonlinearities, and with 1024 nodes per hidden layer, followed by a softmax output layer with 1200 nodes.
The PT-ADAPT system was adapted using MAP adaptation (Sec. IV-D) computed over weighted finite state transducers in Kaldi [36]. In order to efficiently carry out the required operations on the cascade H • C • PT • G, the cascade for PT includes an additional wFST restricting the number of consecutive deletions of phones and insertions of letters (to a maximum of 3). MAP adaptation for the acoustic model was carried out for a number of iterations (12 for yue & cmn, 14 for hun & swh), with a re-alignment stage in iteration 10.

B. ASR Results
Tables IV and V present phone error rates (PERs) for four different languages. The first column shows the phone error rate (PER) of monolingual topline systems: evaluation test results are followed by development test results in parentheses. The column titled CL lists cross-lingual baseline error rates. The column labeled ST lists the PERs of self-trained ASR systems. The column headed PT-ADAPT in Table IV lists PERs from CL ASR systems that have been adapted to PTs derived from mismatched crowdsourcing. Phone error rates are reported instead of word error rates because, in order to compute a word error rate, it is necessary to have either native transcriptions in the target language (thereby permitting the training of a grapheme-based recognizer) or a pronunciation lexicon in the target language. These resources are used by the monolingual topline, but not by any of the baseline or experimental systems.
The monolingual ASR is trained using only 40 minutes of audio and transcript data per language, but performs reasonably well (31.58% average PER, NN-HMM). The cross-lingual ASRs, however, perform poorly. Using a text-based phone bigram (denoted TEXT) gives significant improvement over a cross-language phone bigram (denoted CL), but significantly underperforms a system that has seen the test language during training. This is true even if the system has seen closely related languages during training: the Cantonese cross-lingual system has seen Mandarin during training, and the Mandarin system has seen Cantonese during training, but neither system is able to generalize well from its six training languages to its test language. Three different types of discriminative training were tested. MMI performs consistently worse than MPE and sMBR, and is therefore not listed in Table IV. Averaged across all languages and systems shown in Table IV, the development-test PERs of ML, MPE, and sMBR training are 73.43%, 73.04%, and 72.98% respectively; differences are not statistically significant, therefore only the ML system was tested on evaluation test data.
Evaluation test PER of each experimental system (columns ST and PT-ADAPT, 20 systems) was compared to evaluation test PER of the corresponding CL system (ML training, TEXT LM) using the MAPSSWE test of the sc_stats tool [35]. Each neural net PT-ADAPT system was also compared to the corresponding ST system (4 comparisons). There are therefore 24 independent statistical comparisons in Tables IV and V; the study-corrected significance level is 0.05/24 = 0.002.
Self-training was only performed using NN systems; no self-training of GMMs was performed, because previous studies [17] reported it to be less effective. The Swahili ST system was judged significantly better than CL at a level of p = 0.002 (denoted *); the Cantonese, Mandarin and Hungarian ST systems were not significantly better than CL at this level.
The relative reductions in PER of the PT-ADAPT system compared to both CL and ST baselines were all statistically significant at p < 0.001 (denoted **). This suggests that adaptation with PTs is providing more information than that obtained by model self-training alone.
PT-adapt GMM-HMM systems were trained using four different training criteria: ML, MMI, MPE and sMBR. MMI training consistently underperformed MPE and sMBR, and is therefore not shown. MPE training of PT-ADAPT systems improves their PER by a little more than 1% on average, comparable to the improvement provided to CL baseline systems.
PER improvements for Swahili are larger than for the other three languages. We conjecture this may be due to the relatively good mapping between Swahili's phone inventory and that of English. For example: all Swahili vowel qualities are also found in English, and the Swahili phonemes that would be unfamiliar to an English speaker (prenasalized stops, palatal consonants) have representations in English orthography that are fairly natural ("mb", "nd", etc. for prenasalized stops; "tya", "chya", "nya", etc. for palatals). In contrast: Mandarin, Cantonese, and Hungarian each have at least two vowel qualities not found in English; Mandarin and Cantonese have many diphthongs not found in English; and some of the consonant phonemes (e.g., Mandarin retroflexes) do not have representations in English orthography that are obvious or straightforward.

VIII. DISCUSSION
Models of human neural processing systems have often been used to inspire improvements in machine-learning systems (for a catalog of such approaches and a warning, see [4]). These systems are often called neuromorphic, because the system is engineered to mimic the behavior of human neural systems. In contrast to that approach, our incorporation of EEG signals into ASR resonates with the Human Aided Computing approach used in computer vision [41], [49]. Together with the EEG work presented here, this class of approach represents a less explored direction for the design of machine learning systems, whereby recorded neural data (rather than neuro-inspired models) are used as a source of prior information to improve system performance. Our work therefore suggests that, by thinking about the kinds of prior information required by a machine learning system, engineers and neuroscientists can work together to design specific neuroscience experiments that leverage human abilities and provide information that can be directly integrated into the system to solve an engineering problem.

The NN-HMM outperforms the GMM-HMM in all baseline conditions, but not always when adapted using PTs. Table V shows that PT adaptation improves the NN-HMM, but the benefit to a NN-HMM is not as great as the benefit to a GMM-HMM; for this reason, the accuracy of the PT-adapted GMM-HMM catches up to that of the NN-HMM. Preliminary analysis suggests that the NN is more adversely affected than the GMM by label noise in the PTs. A NN is trained to match the senone posterior probabilities π(s_t|x, φ, θ) computed by a first-pass GMM-HMM. Many papers have demonstrated that entropy in the senone posteriors is detrimental to NN training. In PT adaptation, however, entropy is unavoidable. Forced alignment is better than using soft alignment, but is not sufficient to make PT adaptation of the NN-HMM always better than that of the GMM-HMM.

Table I showed that PTs computed using a text-based phone bigram language model only achieve LPER in the range 50.45-70.88%, depending on the language. These high error rates are, perhaps, incomprehensible to most speech technology experts, who are accustomed to thinking of human transcriptions as having a 0.0% error rate, but there is good reason for them: the transcribers don't speak the target language, so they find some of its phone pairs to be perceptually indistinguishable. Future work will seek methods that can improve the robustness of NN training in the face of label noise.
The primary conclusion of this article is economic. In most of the languages of the world, it is impossible to recruit native transcribers on any verified on-line labor market (e.g., crowdsourcing). Without on-line verification, native transcriptions can only be acquired by in-person negotiation; in practice, this has meant that native transcriptions are acquired only for languages targeted by large government programs. Native transcription (NT) permits one to train an ASR with a PER of 31.58% (average, first column of Table V). Self-training (ST), by contrast, costs very little, and benefits little: average PER is 62.75% (Table V). Probabilistic transcription (PT) is a point intermediate between NT and ST: average PER is 52.29%, and the typical cost is $500 for ten transcribers per hour of audio. PT is therefore a method within the budget of an individual researcher. We expect that an individual researcher with access to a native population will wish to combine NT (as many hours as she can convince her informants to provide) with PT (on perhaps a much larger scale); future research will study the best strategy for combining these sources of information if both are available.

IX. CONCLUSIONS
When a language lacks transcribed speech, other types of information about the speech signal may be used to train ASR. This paper proposes compiling the available information into a probabilistic transcript: a pmf over possible phone transcripts of each waveform. Three sources of information are discussed: self-training, mismatched crowdsourcing, and EEG distribution coding. Auxiliary information from EEG is used, together with text-based phone language models, to improve the decoding of transcripts from mismatched crowdsourcing. Self-trained ASR outperforms cross-lingual ASR in one of the four test languages (Swahili). ASR adapted using mismatched crowdsourcing outperforms both cross-lingual ASR and self-training in all four of the test languages.

X. ACKNOWLEDGMENTS
This work was supported by JHU via grants from NSF (IIS), DARPA (LORELEI), Google, Microsoft, Amazon, Mitsubishi Electric, and MERL, and by NSF IIS 15-50145 to the University of Illinois. Parts of this work were previously published in [29].