Revisiting Models of Concurrent Vowel Identification: The Critical Case of No Pitch Differences

When presented with two vowels simultaneously, humans are often able to identify the constituent vowels. Computational models exist that simulate this ability; however, they predict listener confusions poorly, particularly when the two vowels have the same fundamental frequency. Presented here is a model that is uniquely able to predict the combined representation of concurrent vowels. This model predicts listeners' systematic perceptual decisions to a high degree of accuracy.

Widely accepted models generate segregated representations, which are compared with templates of individual vowels, to predict the concurrent vowel pair presented.

Meddis and Hewitt's model [5] is widely cited as it is able to qualitatively predict human improvement in vowel identification when pitch differences are introduced between the vowel pair. However, when no F0 differences are present, it under-predicts the correct identifications made by humans in their study (human: 57%, model: 37%). Recently, Chintanpalli and Heinz [8] further highlighted that although the model qualitatively reproduced the overall improvement with F0 differences, it very poorly accounted for the specific confusions made.

Even when the F0s of all vowels presented are identical, human CVI performance is greatly above chance [3]. This implies that identification cues beyond pitch differences are utilized that are not well accounted for in existing models. In this identical-F0 scenario, all existing models construct predictions of just individual vowels being identified, by comparing unseparated representations of concurrent vowel pairs with internal templates of individual vowels. Furthermore, to construct predictions of concurrent vowel pairs being identified, either deterministic algorithms are used (e.g. [4,5,7,8]), or probabilistic decisions are made following assumptions of independence (e.g. [3,6]).

Here we explore the consequences of an alternative recognition process, for the important case where there is no F0 difference between vowel pairs. We hypothesize that predicting the complete internal representation of the presented stimulus would be an optimal solution to the CVI task, and might produce results in line with human behaviour. Therefore, internal representations should describe concurrent vowel pairs (i.e. retaining dependent information), as opposed to individual vowels. Our model simulates different variants of auditory processing, followed by a naive Bayesian classifier, which allows for probabilistic predictions of human decisions and systematic comparison of different recognition strategies.

Synthetic vowels (steady-state harmonic complexes) were created using a Klatt synthesizer [9]. The [...] specified by Chintanpalli and Heinz [8]. The fundamental frequency of all vowels was 100 Hz, and all vowels were set to 65 dB SPL. All vowels had a duration [...] experiments in humans [10,11] or guinea-pigs [12].
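The Klatt synthesizer itself is beyond the scope of a short sketch, but the stimulus parameters above (steady-state harmonic complex, F0 = 100 Hz, 65 dB SPL) can be illustrated with a minimal stand-in. The sampling rate, duration, and flat harmonic amplitudes are assumptions for illustration, not the study's actual synthesis settings:

```python
import numpy as np

def harmonic_complex(f0=100.0, dur=0.4, fs=16000, level_db=65.0):
    """Steady-state harmonic complex: equal-amplitude harmonics of f0
    up to the Nyquist limit, scaled to the requested SPL (re 20 uPa)."""
    t = np.arange(int(dur * fs)) / fs
    n_harm = int((fs / 2) // f0)
    x = sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, n_harm + 1))
    # Scale the RMS so that 20*log10(rms / 20e-6) equals level_db.
    target_rms = 20e-6 * 10 ** (level_db / 20)
    return x * (target_rms / np.sqrt(np.mean(x ** 2)))

tone = harmonic_complex()
rms = np.sqrt(np.mean(tone ** 2))
print(round(20 * np.log10(rms / 20e-6), 1))  # 65.0
```

A real Klatt synthesizer would additionally shape the harmonic amplitudes with formant resonators; here the spectrum is deliberately left flat.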

The outputs of each filter were then half-wave rectified [...].

The task of the listeners, and our classifier, was to determine what stimulus had been presented for all instances of auditory activity (a). We did this using a naive Bayesian classifier, which determined regions of auditory activity (R_k) where a given stimulus class (C_k) was more probable than any other stimulus class to have produced said auditory activity (i.e. a ∈ R_k if k = arg max_i P(C_i | a)). Given the presentation of a concurrent vowel pair, the probability that our model predicted a certain stimulus class had been presented was [...]. These high-dimensional integrals were then evaluated numerically.

We modelled two approaches for classification, which differed in the stimulus classes used, each producing a confusion matrix (P [...] where v_z ∈ {i, a, u, ae, ɝ}). [...] To obtain predictions of concurrent vowel pair presentation probabilities, individual vowel presentation probabilities were multiplied together. This approach, assuming individual vowels are identified independently of one another, was initially proposed in [4].
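The decision rule a ∈ R_k iff k = arg max_i P(C_i | a) can be sketched as follows, assuming equal priors, isotropic Gaussian internal noise, and made-up templates (in the actual model, templates arise from the auditory-processing stages). The final lines illustrate the independence assumption used by the individual-class approach:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical internal templates: one mean activity vector per stimulus
# class (here 3 classes x 8 channels, purely for illustration).
templates = rng.normal(size=(3, 8))
sigma2 = 0.5  # variance of the internal noise (the single free parameter)

def posterior(a, templates, sigma2):
    """P(C_i | a) under equal priors and isotropic Gaussian noise."""
    log_lik = -np.sum((a - templates) ** 2, axis=1) / (2 * sigma2)
    log_lik -= log_lik.max()  # subtract max for numerical stability
    p = np.exp(log_lik)
    return p / p.sum()

# A lightly perturbed observation of class 1 falls in region R_1,
# i.e. class 1 attains the maximum posterior.
a = templates[1] + rng.normal(scale=0.07, size=8)
p = posterior(a, templates, sigma2)
print(int(np.argmax(p)))  # 1

# Individual-class models then form pair probabilities by independence:
# P(v_x, v_y) = P(v_x) * P(v_y), here as an outer product.
pair = np.outer(p, p)
```

The combined-class approach instead keeps one template per vowel *pair*, so the dependence between the two vowels' representations is never factored away.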

For each model variant, we selected the variance of the internal noise (σ²; the single free parameter) that gave the closest fit to the overall percentage of concurrent vowels correctly identified by listeners. [...] (Fig. 2, circles), despite the fact that no attempt was made to fit the confusions themselves.
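A single-parameter fit of this kind can be sketched as a one-dimensional grid search; the grid, the toy percent-correct function, and the 57% target below are illustrative assumptions, not the study's actual fitting procedure:

```python
import numpy as np

def fit_sigma2(percent_correct_fn, human_pc, grid):
    """Pick the internal-noise variance whose overall percent correct
    is closest to the listeners' (hypothetical 1-D grid search)."""
    errs = [abs(percent_correct_fn(s2) - human_pc) for s2 in grid]
    return grid[int(np.argmin(errs))]

# Toy stand-in: percent correct falls monotonically with noise variance.
toy_pc = lambda s2: 100.0 / (1.0 + s2)
best = fit_sigma2(toy_pc, human_pc=57.0, grid=np.linspace(0.1, 2.0, 191))
print(round(float(best), 2))  # 0.75
```

Because only the overall percent correct is fitted, the full pattern of confusions remains an out-of-sample prediction, which is what makes the agreement with listener confusions informative.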

Spectral processing models were best at predicting [...]

[...] The entropy of the decision probabilities, corresponding to their randomness, was lower for models of individual-class recognition (<4.86 bits) than either the human data (5.11 bits) or the combined-class recognition model (>5.05 bits). Thus, the models of individual-class recognition make more errors than people because they make the wrong decisions consistently, despite the probabilistic nature of the models.
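Randomness of decision probabilities reported in bits is naturally quantified by Shannon entropy; a minimal sketch, with hypothetical example distributions:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (bits) of a decision-probability distribution."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()          # normalize defensively
    nz = p[p > 0]            # 0 * log2(0) is taken as 0
    return float(-np.sum(nz * np.log2(nz)))

# A classifier that spreads its decisions is more "random" (higher
# entropy) than one that consistently picks the same, possibly wrong,
# response -- matching the contrast drawn in the text.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy_bits([0.97, 0.01, 0.01, 0.01]))  # low: near-deterministic
```

On this measure, a model can be too consistent: lower entropy than the human data signals systematically repeated wrong decisions rather than noisy ones.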

The combined-class model which predicted human decisions best used spectral processing, outperforming the temporal representation. Perhaps surprisingly, neither temporal nor spectral processing depended on whether filterbanks were based on human or guinea-pig bandwidth estimates (Fig. 3b). Further investigation revealed that for spectral processing, filters with narrower bandwidths approached human-like performance with more internal noise (Fig. 4, solid lines). This was not the case when using a temporal pathway, in which frequency resolution is not such a constraint (Fig. 4, dashed lines). In contrast, identification from individual classes (Fig. 4, dotted and dash-dotted lines) did not converge on human performance for any amount of internal noise.