Explaining Human Auditory Scene Analysis Through Bayesian Clustering

How auditory stimuli are processed to form unitary or segregated perceptual groups of sounds is still debated in the Auditory Scene Analysis literature. Mechanistic approaches to modeling this phenomenon have been somewhat successful but are often overly complicated and constrained to specific paradigms. Our approach favors simplicity. We have previously proposed a higher-level source inference model in the Bayesian statistical framework that implements only a few simple but sensible rules applied to the statistics of the stimuli, yet still captures results from behavioral data (Yates, Larigaldie, & Beierholm, 2017). We have expanded on this model to show that it can account for a wider range of well-known perceptual auditory phenomena. Several original experiments have also been conducted to explore a broader range of stimulus statistics. The model's responses give insight into possible underlying processes in the brain and could guide further behavioral experiments or medical exploration.


Introduction
While Gestalt psychology has mainly focused on object grouping in the visual modality, many phenomena of auditory stream segregation and integration have also been described (Bregman, 1994).
However, across all modalities it is still unclear how our perceptual systems cope with both omnipresent uncertainty and a virtually infinite number of perceptual cues to process, order and categorize in real time. Uncertainty in high-level perception is usually modeled successfully within the Bayesian framework (Körding et al., 2007; Trommershäuser, Körding, & Landy, 2012). But as the number of perceptual cues increases linearly, the number of possible clusterings explodes factorially. As a result, most Bayesian models cannot be applied to more ecological environments, as they are limited to a very small number of perceptual cues before calculations become intractable.
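To give a sense of this growth, the number of distinct ways to partition n cues into clusters is the Bell number B_n. The sketch below is an illustration only, not part of the published model; the function name `bell_numbers` is hypothetical. It computes these counts with the Bell triangle.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max], the number of set partitions of 0..n_max items.

    Uses the Bell triangle: each row starts with the last entry of the
    previous row, and each further entry adds the entry above-left.
    """
    bells = [1]   # B_0 = 1 (the empty set has one partition)
    row = [1]
    for _ in range(n_max):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


if __name__ == "__main__":
    for n, b in enumerate(bell_numbers(10)):
        print(f"{n:2d} cues -> {b} possible clusterings")
```

Ten cues already admit 115975 clusterings, which is why exhaustive evaluation of every hypothesis quickly becomes impractical.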
On the other hand, lower-level mechanistic approaches can be less limited in the number of percepts they consider (for a review, see Bee & Micheyl, 2008), but usually produce complex and situation-specific models. Furthermore, from a cognitive perspective, they are hard to interpret in terms of meaningful brain functions.
We have been developing a non-parametric Bayesian model (Yates et al., 2017). The aim of our model is to address the limitations of both approaches by handling a potentially unlimited number of perceptual objects in reasonable time without sacrificing simplicity or interpretability. Furthermore, it is designed to be abstract enough to easily incorporate new perceptual cues, and to be usable across modalities. Briefly, the model is a Bayesian clustering algorithm that processes perceptual cues sequentially in order to infer the probability that they were produced by a common source, via dimensional proximity and parsimony of hypotheses.
The model assumes that all perceivable stimuli created by a given source should have similar characteristics on every dimension, unless the source has had some time to change its state. That is, it is unlikely that a source could create two very different stimuli in a very short time. This implies that inference over such structure can be done by clustering of percepts. For example, if two sounds with frequencies F1 and F2 are produced by the same source, the pitch cannot change infinitely fast, since an oscillator would require an infinite impulse of energy to change its frequency discontinuously. This assumption can in general be summarized by proximity over a K-dimensional space, with K being the number of perceptual cues considered for each stimulus.

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0
On top of this generative process, our model also assumes parsimony in the number of plausible clusters. This is done by introducing a non-parametric prior in the form of a Chinese Restaurant Process (Aldous, 1985), gradually decreasing the probability of considering a new source plausible as more sounds have already been assigned to previous sources.
Implementing only these two reasonable assumptions is enough to successfully reproduce several well-known auditory phenomena. Original experimental data were also collected to further explore the model's predictive power.

Model specifications
The first aforementioned assumption can be modeled with the following generative process:
$$P(s_i \mid c_i) \propto \exp\left(-\frac{(\Delta f / \Delta t)^2}{2\sigma^2}\right)$$
where $s_i$ is sound number $i$, $c_i$ the cluster it belongs to (in other words, the source that caused it), $\Delta f$ the difference in the considered characteristic (for instance, frequency), $\Delta t$ the difference in time between two sounds' onsets, and $\sigma$ a constant that can be fitted to participants' responses. It follows that a source is most likely to cause stimuli whose characteristics change slowly over time. Indeed, as $\Delta f / \Delta t$ increases, the probability that the newer sound is in the same cluster as the previous sound decreases with a normal decay.
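The normal decay described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rate_likelihood`, its parameter names, and the choice of a zero-mean normal density with standard deviation `sigma` over the rate of change are all assumptions made for the example.

```python
import math


def rate_likelihood(delta_f, delta_t, sigma):
    """Zero-mean normal density evaluated at the rate of change delta_f / delta_t.

    delta_f: difference in the considered characteristic (e.g. frequency)
    delta_t: difference between the two sounds' onsets (must be positive)
    sigma:   assumed width of the decay, a free parameter to be fitted
    """
    if delta_t <= 0 or sigma <= 0:
        raise ValueError("delta_t and sigma must be positive")
    rate = delta_f / delta_t
    return math.exp(-0.5 * (rate / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    # Same jump in the cue, presented slowly and then quickly.
    print(rate_likelihood(4.0, 0.6, sigma=10.0))
    print(rate_likelihood(4.0, 0.2, sigma=10.0))
```

With the same cue difference, a shorter onset interval gives a larger rate of change and therefore a smaller likelihood that both sounds share a source.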
The second assumption is modeled by a nonparametric prior weighing these probabilities according to the number of sounds already in each cluster:
$$P(c_i = k \mid c_1 \ldots c_{i-1}) = \frac{n_k}{(i - 1) + \alpha}$$
when cluster $k$ has already been inferred, and:
$$P(c_i = k \mid c_1 \ldots c_{i-1}) = \frac{\alpha}{(i - 1) + \alpha}$$
when none of the previous clusters is equal to $k$.
Here, $n_k$ is the number of sounds in cluster $k$, $i$ is the total number of sounds considered and $\alpha$ is a constant that can be fitted to participants' responses. It follows that as more and more sounds are being considered, the probability of assignment to a new cluster decreases. On top of this, the model follows a rich-get-richer property, as clusters already comprised of many tones have a higher chance of getting more tones than clusters with fewer tones. Taken together, these properties can be considered as an implementation of Ockham's razor. For details of implementation see Yates et al. (2017).
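The prior described above is the standard Chinese Restaurant Process assignment rule, which can be sketched as follows. The function name `crp_prior` and its arguments are hypothetical, chosen for illustration; the concentration constant is called `alpha` here by assumption.

```python
def crp_prior(cluster_sizes, alpha):
    """Chinese Restaurant Process prior for the next sound.

    cluster_sizes: number of sounds already assigned to each existing cluster
    alpha:         concentration constant (free parameter)

    Returns (existing, new): a list with the prior probability of joining each
    existing cluster, and the prior probability of opening a new cluster.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    n_seen = sum(cluster_sizes)      # sounds considered so far
    denom = n_seen + alpha
    existing = [n_k / denom for n_k in cluster_sizes]
    new = alpha / denom
    return existing, new


if __name__ == "__main__":
    existing, new = crp_prior([3, 1], alpha=1.0)
    print(existing, new)
```

With three sounds in one cluster and one in another, the larger cluster receives the larger share of the prior (rich get richer), and the share left for a new cluster shrinks as more sounds are observed.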

Phenomena reproduction
A number of phenomena can be replicated by the model, but we will here only present two, both using auditory frequency as the sensory cue.
The first phenomenon is the second experiment taken from Bregman & Campbell (1971), highlighting how the speed of presentation affects perception of streams of tones. Behavioral data show that faster tones lead to an increased probability of subjects reporting two perceived streams of sounds rather than one. Figure 1 shows how our model successfully captures this. The model likelihood term constrains how fast streams can change in frequency, hence changes that are too fast make two streams more likely. The "slow sequence" had 100ms ISIs, 500ms tone duration and pitch differences of [0 4 8 26 30 34] semitones from the lowest tone. The "fast sequence" reduced the tone duration to 100ms.
Figure 1: Bregman & Campbell (1971). A slow sequence is perceived as a single stream, while a faster sequence is split in two. Stimuli are shown at the top; at the bottom are dendrogram tree plots based on the posterior distribution over clusterings. Across dendrograms, a unique color is assigned to clusters with more than 50 percent distance from other clusters.
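The effect of presentation speed on the rate of change can be worked through numerically. The sketch below is illustrative and rests on two assumptions not stated in the text: that the onset-to-onset interval equals ISI plus tone duration, and that the cue difference is measured in semitones. The function names are hypothetical.

```python
def onset_interval_s(isi_ms, duration_ms):
    """Onset-to-onset interval in seconds, assuming ISI is measured offset to onset."""
    return (isi_ms + duration_ms) / 1000.0


def change_rate(delta_semitones, isi_ms, duration_ms):
    """Absolute rate of change of the cue, in semitones per second."""
    return abs(delta_semitones) / onset_interval_s(isi_ms, duration_ms)


if __name__ == "__main__":
    jump = 4  # semitones between two neighbouring tones of the sequence
    slow = change_rate(jump, isi_ms=100, duration_ms=500)
    fast = change_rate(jump, isi_ms=100, duration_ms=100)
    print(f"slow: {slow:.2f} semitones/s, fast: {fast:.2f} semitones/s")
```

Under these assumptions, shortening the tones from 500ms to 100ms triples the rate of change for every jump in the sequence, which is what pushes the likelihood term toward a two-stream interpretation.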
The second phenomenon is taken from Bregman (1978) and shows that auditory streaming is cumulative. Sequences of tones that split into two perceived streams may initially be perceived as one stream. Figure 2 shows how our model successfully captures this. The non-parametric prior makes a single stream more likely when little information has been received. The "short sequence" had 26.6ms ISIs and pitch differences of 7 semitones, with two repetitions. The "long sequence" was instead repeated eight times.
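The role of the prior in this cumulative effect can be illustrated by computing the Chinese Restaurant Process prior probability that every sound heard so far belongs to one cluster. This is a sketch of the prior alone, under the assumption of a concentration constant `alpha`; the full model also weighs the likelihood, and the function name `prob_single_cluster` is hypothetical.

```python
def prob_single_cluster(n_sounds, alpha):
    """CRP prior probability that n_sounds sounds all share one cluster.

    The first sound always opens the first cluster; each later sound joins it
    with probability seen / (seen + alpha), where seen is the number of sounds
    already assigned.
    """
    if n_sounds < 1 or alpha <= 0:
        raise ValueError("n_sounds must be >= 1 and alpha must be positive")
    prob = 1.0
    for seen in range(1, n_sounds):
        prob *= seen / (seen + alpha)
    return prob


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, prob_single_cluster(n, alpha=1.0))
```

The probability falls as the sequence grows (for `alpha = 1` it equals 1/n), so a single stream is favored a priori when only a few tones have been heard.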

Novel experiments
If auditory scene analysis is indeed a process of perceptual clustering of auditory stimuli into separate streams, then we would expect subjects to be able to cluster more than two sets of stimuli, and consequently perceive more than two streams. A set of novel experiments using a new paradigm has been designed in order to explore the influence of several sensory cues on the formation of auditory streams, and a potentially higher maximum number of perceived streams. Only one of these experiments, using frequency as a sensory cue, will be presented here. The key realization is that subjects lose the order information of tones when these are assigned to different streams. Therefore, subjects should not detect a difference between sequences 1 and 2 when medium tones do not share a stream with either low or high tones, as long as they do not know which tone started the medium stream. This is ensured by introducing a general fade-in effect at the start of every sequence. Figure 4 shows that, as expected, higher frequency differences significantly decreased participants' capacity to tell the two sequences apart, implying that the middle tones are perceived as a separate cluster from both the high and the low tones. This strengthens the argument that auditory stream formation is influenced by proximity in frequency space, and that humans may hold three or more auditory streams simultaneously.

Overall, we show how several aspects of auditory scene analysis can be modeled based on very few normative assumptions. Experiments support the qualitative predictions of the model, an aspect that future work will expand on.

Figure 4: The first number in each condition indicates the difference in semitones from L to M1; the second number is the difference between M2 and H. The difference in frequency between M1 and M2 was always 3 semitones.