Temporal Order Judgment Reveals Local-Global Auditory Processes

Speech signals can be considered as acoustic sequences composed of local units (e.g. phonemes) which form global acoustic patterns (e.g. syllables). Extraction of speech information at both local and global scales is essential for comprehension. To decipher this process, we employed the temporal order judgement (TOJ) paradigm and investigated how the auditory system processes acoustic sequences. We selected four vowel segments of 30 ms and generated short acoustic sequences. We then examined listeners’ performance on TOJ of the vowel sequences using a same-different paradigm. The data showed that acoustic changes on a local scale, caused by reversing vowel segments, modified TOJ performance. Furthermore, the effect of local changes was attenuated as the inter-onset interval between vowel segments increased to the point where segments can be recognised individually. A follow-up experiment showed that recognition of each segment was modulated by segment position, indicating that different positions within an acoustic sequence contribute differently to TOJ. The results suggest that listeners perform TOJ by perceiving global patterns of acoustic sequences, which are further modulated by acoustic details on a local scale. Acoustic information on the local and global scales concurrently determines the identification of short acoustic sequences.


Introduction
Speech comprehension requires listeners not only to process elementary acoustic segments, such as phonemes and syllables, but also to encode their temporal order [5]. For example, 'cat' and 'act' are two different words composed of the same phonemes that differ only in the temporal order of those phonemes. Therefore, extracting the temporal order of acoustic elements is a fundamental process in speech perception [19].
Previous research used sequences composed of artificial acoustic stimuli, such as pairs of hisses, buzzes, tones, and clicks, and asked listeners to judge the temporal order of the components. The minimum onset interval between sounds necessary for temporal order judgement (TOJ) was found to be less than 30 ms [1,5,6,7,8]. However, it was argued that listeners could perform TOJ by perceiving global changes caused by the different temporal orders of the two components [18]. Accordingly, other studies used sequences with more than two acoustic components and presented those sequences repetitively to prevent listeners from perceiving global patterns [2,19]. In these studies, an onset interval longer than 100 ms was often found.
The two time constants potentially reflect two distinct processes involved in TOJ: one process relies on perceiving changes of acoustic sequences on a global scale, which are induced by changes in the temporal order of local elements; the other relies on recognition of each local element, with the temporal order determined cognitively afterwards. However, it is still unknown how these two processes, processing acoustic details at a local scale and perceiving global patterns, interact with each other and jointly determine TOJ.
Here, we investigate this question using a novel paradigm. Specifically, we used sequences composed of four short vowel segments, which had distinct acoustic structures and could be recognized individually. We then introduced changes of temporal order by reversing the order of vowel segments. Two experimental manipulations were made. First, we varied the positions of the segments that were reversed to examine whether local changes influence TOJ. Second, we manipulated the length of the intervals between segments, which influences recognition of individual segments [4,9,15]. We hypothesized that if recognition of acoustic sequences involves processes that operate on both local and global scales, then both the position of the changes on a local scale and the size of the intervals should modulate TOJ.

Experiment 1
We tested whether listeners can differentiate between the four short vowel segments presented in isolation, as the type of stimulus affects listeners' performance on TOJ [19].

Participants
Ten native English speakers (age 18 to 23 years; 7 female; one left-handed) gave written consent and participated in the experiment. None of the participants had hearing loss or neurological abnormalities according to self-report. We conducted all experiments in accordance with procedures approved by the NYU committee on Activities Involving Human Subjects.

Stimuli and procedures
Four English vowels spoken by a female speaker (close front, close-mid back, near-open near-front, and open-mid near-front) were used as stimuli (the stimuli can be found at https://edmond.mpdl.mpg.de/imeji/collection/rZWJgrvQrz2AP8DL?q=). Amplitudes of all tokens were normalized individually to 60 dB SPL. A segment of 30 ms was excised with a rectangular window from the middle of each vowel and was used in all experiments.
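The segment extraction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the recordings share the 44.1 kHz rate reported for stimulus presentation, and the function name `middle_segment` and the 200 ms placeholder recording are hypothetical. A rectangular window applies no tapering, so it reduces to a plain slice.

```python
def middle_segment(samples, sample_rate_hz=44100, duration_ms=30):
    """Return the central `duration_ms` of `samples`.

    A rectangular window leaves amplitudes untouched, so the
    operation is simply a slice centred on the recording.
    """
    n = round(sample_rate_hz * duration_ms / 1000)
    if n > len(samples):
        raise ValueError("recording is shorter than the requested segment")
    start = (len(samples) - n) // 2
    return samples[start:start + n]


# Hypothetical 200 ms recording; sample indices stand in for amplitudes.
recording = list(range(8820))
segment = middle_segment(recording)
print(len(segment))  # 1323 samples = 30 ms at 44.1 kHz
```

Because no taper is applied, the segment starts and ends abruptly; any onset or offset ramp would be an extra step that the text does not describe.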
A match-to-sample paradigm was used to examine the discriminability between different vowel segments (Figure 1, top). On each trial, the participants were first presented with one of the four vowel segments and, 700 ms later, with two vowel segments presented sequentially as match alternatives, one of which was the same as the sample. The two alternatives were separated by an inter-onset interval drawn uniformly between 400 ms and 600 ms. The participants indicated which of the two vowel segments matched the sample by pressing one of two buttons. No feedback was provided, as we wanted to test how well listeners can differentiate the vowel segments without prior experience. Forty trials were presented for each comparison between two vowel segments.
All stimuli were presented using MATLAB (The MathWorks, Natick, MA) at 16 bits, with a sampling rate of 44.1 kHz, over headphones (Sennheiser HD 380 Professional, Sennheiser Electronic Corporation, Wedemark, Germany). The d-prime value corresponding to 100 percent accuracy is 4.5, as half an incorrect trial was added.
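The ceiling of 4.5 can be reproduced with a short sketch. The text does not give the exact formula, so this assumes the common definition d' = z(hit rate) - z(false-alarm rate), that both rates are based on 40 trials, and that perfect scores are moved half a trial away from 0 or 1; the function name `d_prime` and its parameters are hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, n_signal, false_alarms, n_noise, correction=0.5):
    """d' = z(hit rate) - z(false-alarm rate).

    Counts are clamped `correction` trials away from 0 and from the
    maximum so that the z-transform stays finite (assumed correction).
    """
    z = NormalDist().inv_cdf
    hits = min(max(hits, correction), n_signal - correction)
    false_alarms = min(max(false_alarms, correction), n_noise - correction)
    return z(hits / n_signal) - z(false_alarms / n_noise)


# Perfect performance with 40 trials per rate (assumed trial counts).
ceiling = d_prime(hits=40, n_signal=40, false_alarms=0, n_noise=40)
print(round(ceiling, 2))  # 4.48, i.e. about 4.5
```

Under these assumptions the hit rate becomes 39.5/40 and the false-alarm rate 0.5/40, which gives a maximum d-prime of about 4.48.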

Results and Discussion
The discriminability between vowel segments is shown in Figure 1 (bottom). The d-prime values for all pairs of vowel segments are above 3.5 and close to the ceiling of 4.5, so listeners have no difficulty discriminating the vowel segments from each other. This result confirms that the acoustic information within 30 ms of each vowel segment suffices for the auditory system to differentiate the segments [9]. Therefore, any difficulty listeners have in identifying the temporal order of an acoustic sequence composed of such vowel segments cannot be due to an inability to recognize the individual components.

Experiment 2
In Experiment 2A, we examined how TOJ is modulated by the position of order reversal on the local scale and by the size of the interval between components. In Experiment 2B, we further tested how the position of vowel segments influences recognition of individual vowel segments.

Participants
Ten native English speakers (age 18-27 years; 7 females) in Experiment 2A and ten native English speakers (age 18-27 years; 7 females) in Experiment 2B gave written informed consent to participate in the study. All participants were right-handed with no known hearing deficits or neurological abnormalities.

Stimuli and procedures
In Experiment 2A, we created vowel sequences by randomly concatenating the four vowel segments in different temporal orders. The experimental procedure was a same-different task, in which two vowel sequences were presented sequentially with an inter-sequence interval uniformly distributed from 400 to 600 ms. Participants were asked to determine whether the two vowel sequences were the same or different. Six types of temporal order reversal were defined as a function of the positions of the two vowels involved (Figure 2, bottom): 1) reversing between the 1st and the 2nd segments; 2) between 2nd and 3rd; 3) between 3rd and 4th; 4) between 1st and 3rd; 5) between 2nd and 4th; 6) between 1st and 4th. The inter-onset intervals (IOI) between vowel segments were varied across six levels: 30, 50, 70, 110, 170 and 250 ms. Experiment 2A had a 6 x 6 design with two within-participant factors, vowel reversing position and IOI. Each condition contained 20 trials, and a total of 720 trials were included in the final analysis. The trials were randomly divided into four blocks and presented in pseudorandom order.
In Experiment 2B, the experimental procedure was a one-interval two-alternative forced-choice paradigm. A target vowel segment was first presented to the participants ten times before each block. On each trial, the participants were presented with one sequence of four vowel segments and had to determine whether the target vowel segment was presented in the first half (the first and second positions of the vowel sequence) or in the second half (the third and fourth positions). Target positions were binned into two categories in the analyses: Boundary (the first and fourth positions) and Middle (the second and third positions). IOI was also manipulated as in Experiment 2A. Experiment 2B had a 2 x 6 design with two factors, target vowel position and IOI. The participants were tested in four blocks, with a specific vowel segment used as the target for each block. Each block contained 360 trials (30 trials for each condition).
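The Experiment 2A design can be enumerated in a few lines. This is a sketch under stated assumptions rather than the authors' procedure: the vowel labels `V1` to `V4`, the function names, and the fixed random seed are hypothetical, and only the reversed ("different") version of each sequence is generated because the text does not specify how "same" trials were constructed. Onset times assume the first segment starts at 0 ms, so at an IOI of 30 ms the 30 ms segments are contiguous.

```python
import random
from itertools import product

VOWELS = ["V1", "V2", "V3", "V4"]  # hypothetical labels for the four segments
# Zero-based position pairs, in the order listed in the text.
REVERSALS = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (0, 3)]
IOIS_MS = [30, 50, 70, 110, 170, 250]
TRIALS_PER_CONDITION = 20


def reverse_pair(sequence, pair):
    """Return a copy of `sequence` with the two given positions swapped."""
    i, j = pair
    swapped = list(sequence)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def onset_times_ms(ioi_ms, n_segments=4):
    """Segment onsets relative to sequence start (assumed to begin at 0 ms)."""
    return [k * ioi_ms for k in range(n_segments)]


def build_trials(rng):
    """Enumerate the 6 x 6 conditions with 20 trials each, then shuffle."""
    trials = []
    for pair, ioi_ms in product(REVERSALS, IOIS_MS):
        for _ in range(TRIALS_PER_CONDITION):
            first = rng.sample(VOWELS, len(VOWELS))  # random temporal order
            trials.append({
                "reversal": pair,
                "ioi_ms": ioi_ms,
                "first": first,
                "reversed": reverse_pair(first, pair),
                "onsets_ms": onset_times_ms(ioi_ms),
            })
    rng.shuffle(trials)
    return trials


trials = build_trials(random.Random(0))
print(len(trials))  # 720
```

The count follows directly from the design: 6 reversal types x 6 IOIs x 20 trials = 720 trials.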

Results and Discussion
Results of Experiments 2A and 2B are shown in Figures 2 and 3, respectively. To measure the effects of reversing position and IOI in Experiment 2A, we conducted a two-way Reversing position x IOI repeated-measures ANOVA (rmANOVA) on d-prime values (Figure 2). A significant main effect was found for Reversing position (F(5,45) = 64.27, p < 0.001) and for IOI (F(5,45) = 88.84, p < 0.001). The interaction between Reversing position and IOI was also significant (F(9,225) = 8.83, p < 0.001). We then determined at which IOIs the effect of reversing position was significant by conducting a one-way rmANOVA at each IOI. We found a significant main effect of reversing position at IOIs from 30 ms to 170 ms (p < 0.01, Bonferroni correction applied). To measure the effects of segment position in Experiment 2B, we conducted a two-way Segment position x IOI rmANOVA on d-prime values. A significant main effect was found for Segment position (F(1,9) = 39.25, p < 0.001) and for IOI (F(5,45) = 35.65, p < 0.001). The interaction between Segment position and IOI was also significant (F(5,45) = 6.18, p < 0.001). We then determined at which IOIs the effect of segment position was significant by conducting a one-way rmANOVA at each IOI with segment position as the main factor. We found a significant main effect of segment position at IOIs from 30 ms to 110 ms (p < 0.05, Bonferroni correction applied) (Figure 3).
Our results from Experiment 2A confirm the previous finding that listeners perceive global patterns of acoustic sequences to identify temporal order [2,19], as the participants' performance should not be modified by the reversing position if they adopted a strategy of first recognizing each component and then identifying their temporal order. As IOI increases and the vowel segments are separated further apart, the effects of reversing position are attenuated. Experiment 2B showed that there is a position effect on the recognition of each component (Figure 3, bottom). The vowel segments at the boundary positions are better recognized, which explains the effect of reversing position in Experiment 2A.

General Discussion
We showed that TOJ involves auditory processes of both extracting local details and identifying global patterns. When IOI is short (e.g. < 170 ms), TOJ relies on perception of the global pattern of acoustic sequences, which is modulated by details of temporal order reversal on the local scale. When IOI increases beyond 170 ms and each acoustic component can be recognized, the effect of local acoustic changes disappears. Our study, though seemingly simple, reveals complicated auditory processes in speech perception: acoustic information, local and global, needs to be extracted concurrently to form a holistic percept [14,16].
The results of reversing position in Experiment 2A and segment position in Experiment 2B echo findings from studies on forward and backward masking [10,11]. As the vowel segments in the middle of acoustic sequences are masked by both the preceding and the following segments, the masking effects probably lead to worse recognition of these segments compared with those at the boundary positions. This finding suggests that acoustic information at different temporal positions within acoustic sequences contributes differently to forming the globally perceived pattern of the sequences. The effects of local acoustic details are modulated by IOI. As the time constants found in studies on forward and backward masking are often in the range of tens of milliseconds [12,13], the effect of local acoustic changes should not occur for IOIs longer than 100 ms. However, in our study, this effect persists at an IOI as long as 170 ms. This result is in line with previous findings that more than a hundred milliseconds are needed to recognize individual components in acoustic sequences.
The findings of the present study echo a recurrent theme on resolution and integration in the auditory system [3,15]. The auditory system needs to integrate acoustic information over a long timescale to perceive global patterns while extracting acoustic information on a short timescale to decipher fast acoustic changes. Our results lend support to the proposal that concurrent local-global processes exist in the auditory system [14,16].

Figure 1. Top: the match-to-sample paradigm used in Experiment 1. Bottom: d-prime values for the match-to-sample task. The vertical axis labels the sample; the horizontal axis labels the vowel segment that differs from the sample in the match pair. The scale bar shows d-prime value. In each box of the confusion matrix, the numbers show the group-averaged d-prime value and one standard error over participants (in parentheses).

Figure 2. Top: the same-different paradigm in Experiment 2A. Bottom: results of Experiment 2A. The vertical axis represents d-prime value and the horizontal axis represents inter-onset interval. The line style codes for the switching positions of vowel segments. The shaded boxes and the double arrows in the legend indicate which positions of vowel segments in the first sequence are reversed in the second sequence.

Figure 3. Top: the experimental paradigm in Experiment 2B. Bottom: d-prime values of Experiment 2B. The line style codes for segment position: boundary positions (solid) and central positions (dashed). The data show that vowel segments are more easily recognized in the boundary positions than in the central positions.