Attention During Story Listening Modulates Temporal Receptive Windows Across Human Cortex

The human brain integrates sensory information across time to represent dynamic, complex, everyday environments. Previous studies have shown that higher-level perceptual and cognitive cortical areas tend to integrate information over longer time windows, suggesting the presence of a hierarchy of temporal receptive windows (TRW) across the brain. Yet attentional modulations of TRW are unknown. Here, we investigated whether category-based attention modulates TRW within and beyond auditory cortex. Human subjects listened to narrated natural stories while their whole-brain BOLD responses were recorded during three tasks (passive listening, attention to humans, or attention to places) in separate runs. Contextual representations of the stories, derived from an LSTM neural network trained on a language modeling task, were used to fit voxelwise encoding models. Contextual information was distorted at multiple time scales to measure TRW during passive listening and during the two attention tasks. Our findings suggest that category-based attention modulates TRW across parietal and frontal cortices. The results also suggest that attention to places extends TRW in parietal cortex.


Introduction
Natural speech is represented at multiple time scales across the brain, from phonemes to semantically complex sentences (DeWitt & Rauschecker, 2012). Previous studies have introduced the notion of the temporal receptive window (TRW) of a cortical circuit: the length of time before a response during which information is integrated and affects that response (Hasson, Yang, Vallines, Heeger, & Rubin, 2008). Moreover, studies using natural speech stimuli have shown that there exists a hierarchy of increasing TRWs from early auditory cortex to higher cognitive areas (Lerner, Honey, Silbert, & Hasson, 2011). Recent reports provided evidence for attentional influences on cortical tuning for low-level features of auditory stimuli (Mesgarani & Chang, 2012). Yet it is currently unknown whether attention can modulate TRW.
Here, we investigated this question by studying TRW across the human brain during category-based attention in a natural story listening experiment. Five human subjects listened to over two hours of stories from The Moth Radio Hour (Lerner et al., 2011) while performing passive listening or one of two attention tasks (attend to "humans" or attend to "places") in different runs. We recorded whole-brain blood-oxygen-level-dependent (BOLD) responses using functional MRI. We then used rich contextual representations derived from a long short-term memory (LSTM) language model to fit voxelwise encoding models separately for each task and each individual subject (Jain & Huth, 2018). To estimate TRW, we trained different language models on contextual information that was scrambled at different time scales. We then fit separate voxelwise encoding models using features obtained from language models trained at different context-scrambling levels and assessed the prediction performance of the models. Finally, we estimated TRW by analyzing the voxelwise prediction performance of the fit models. We compared TRW between the passive listening task and the two attention tasks and assessed the attentional sensitivity of TRW across the brain. Furthermore, we compared TRW between the two attention tasks and computed a TRW bias index. Our findings suggest that category-based attention modulates TRW in parietal and frontal cortices. The results also suggest that attention to places extends TRW in parietal cortex. Moreover, TRW in strongly category-selective areas is biased toward the preferred category.

Experiment Design
The attention experiment was performed in a single session consisting of 12 runs. The stimulus consisted of 6 naturally spoken narrative stories from The Moth Radio Hour, totaling over two hours. A cue word was displayed before each run to indicate the attention task: "humans" or "places". In the attend-to-humans task, subjects attended to human categories (e.g. woman, man, boy). In the attend-to-places task, subjects attended to place categories (e.g. building, room, school). The passive listening experiment was performed in a single session consisting of 10 runs, and the stimulus consisted of 10 stories.

Stimulus Embedding
To obtain an embedding of the stimulus stories, we used an LSTM model trained on a language modeling task (LSTM-LM; Jain & Huth, 2018). First, using a large corpus of English text, an embedding space was constructed by computing co-occurrence statistics between each corpus word and a set of 985 common English words (Huth, de Heer, Griffiths, Theunissen, & Gallant, 2016). The corpus consisted of comments scraped from http://reddit.com, containing nearly 20M words. The LSTM was then trained in this embedding space. The LSTM comprised 3 layers, each with 985-dimensional hidden states. For each input word, the LSTM-LM used representations of the 20 preceding words to output a 985-dimensional representation vector (Jain & Huth, 2018).
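The first step above can be sketched as follows. This is a minimal illustration of a co-occurrence embedding, not the exact procedure of Huth et al. (2016): the window size, the count normalization (log-scaling here), and the function name `cooccurrence_embedding` are all assumptions for illustration.

```python
import numpy as np

def cooccurrence_embedding(corpus, basis_words, window=5):
    """Embed each corpus word by its co-occurrence counts with a fixed
    set of basis words (e.g. 985 common English words), accumulated
    over a symmetric context window and log-scaled. Hypothetical
    simplification of the embedding-space construction."""
    basis_index = {w: i for i, w in enumerate(basis_words)}
    counts = {}
    for t, word in enumerate(corpus):
        vec = counts.setdefault(word, np.zeros(len(basis_words)))
        lo, hi = max(0, t - window), min(len(corpus), t + window + 1)
        # count basis words in the window around position t (excluding t)
        for u in corpus[lo:t] + corpus[t + 1:hi]:
            j = basis_index.get(u)
            if j is not None:
                vec[j] += 1
    # log-scale to compress heavy-tailed co-occurrence counts
    return {w: np.log1p(v) for w, v in counts.items()}
```

With the full 985-word basis, each corpus word would map to a 985-dimensional vector, matching the dimensionality of the LSTM hidden states described above.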

Context Scrambling
To obtain the scrambled stimulus embedding at level l, we replaced the l-th through 20-th words of the context in the training samples with random words from the corpus. The LSTM was then trained from scratch using the scrambled context. This procedure yielded 20 different stimulus embeddings, one for each l ∈ [1, 20].
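A sketch of this scrambling step, under one assumption the text leaves implicit: positions are indexed 1 (the word immediately preceding the target) through 20 (the most distant), so that replacing positions l..20 preserves only the l − 1 nearest context words. The function name and signature are hypothetical.

```python
import random

def scramble_context(context, l, vocab, rng=None):
    """Scramble a 20-word context window at level l by replacing the
    words at positions l..20 (1-indexed; assumed ordered nearest to
    most distant) with random words drawn from the corpus vocabulary."""
    rng = rng or random.Random()
    assert len(context) == 20 and 1 <= l <= 20
    scrambled = list(context)
    for pos in range(l, 21):          # 1-indexed positions l through 20
        scrambled[pos - 1] = rng.choice(vocab)
    return scrambled
```

Training one LSTM-LM from scratch per level l then yields the 20 context-scrambled embeddings used below.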

Voxelwise Model Fitting and Testing
Voxelwise models were fit using regularized linear regression with an l2 penalty to avoid overfitting. A nested cross-validation procedure was used to fit a model for each voxel. In each of the 20 inner folds, models were fit on the training data for regularization parameters in the range [2^3, 2^20]. Pearson's correlation between actual and predicted responses (prediction score) on the test data was computed. Optimal regularization parameters were then selected to maximize the average prediction score across inner folds. Afterwards, the optimized parameters were used to fit models on the union of training and test data in each outer fold. To assess model performance, responses were predicted for the validation data using the fit models. Finally, models and prediction scores for each voxel were averaged across the 20 outer folds.
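The procedure above can be sketched for a single voxel as follows. This is a simplified illustration, not the paper's exact implementation: fold counts default to 4 rather than 20 to keep the example fast, folds are contiguous, and closed-form ridge is used in place of whatever solver the authors employed.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    # closed-form ridge solution: w = (X'X + alpha*I)^(-1) X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def pearson_r(a, b):
    # Pearson correlation between two 1-D arrays ("prediction score")
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def nested_cv_ridge(X, y, n_outer=4, n_inner=4,
                    alphas=2.0 ** np.arange(3, 21)):
    """Nested cross-validation for one voxel: inner folds select the
    regularization parameter from [2^3, 2^20]; the model is refit on
    all development data and scored on the held-out validation fold."""
    n = len(y)
    outer_folds = np.array_split(np.arange(n), n_outer)
    scores = []
    for val_idx in outer_folds:
        dev_idx = np.setdiff1d(np.arange(n), val_idx)
        inner_folds = np.array_split(dev_idx, n_inner)
        # pick the alpha maximizing the mean inner-fold prediction score
        mean_scores = []
        for alpha in alphas:
            rs = []
            for test_idx in inner_folds:
                train_idx = np.setdiff1d(dev_idx, test_idx)
                w = fit_ridge(X[train_idx], y[train_idx], alpha)
                rs.append(pearson_r(X[test_idx] @ w, y[test_idx]))
            mean_scores.append(np.mean(rs))
        best_alpha = alphas[int(np.argmax(mean_scores))]
        # refit on the union of training and test data, score on validation
        w = fit_ridge(X[dev_idx], y[dev_idx], best_alpha)
        scores.append(pearson_r(X[val_idx] @ w, y[val_idx]))
    # average prediction score across outer folds
    return float(np.mean(scores))
```

In practice this loop would run once per voxel and per context-scrambling level, producing the prediction scores aggregated in the next section.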

Voxelwise Temporal Receptive Windows
In each voxel, we aggregated the prediction scores of the 20 context-scrambled encoding models to form a 20-dimensional prediction profile. To capture the variance in the prediction profiles, we projected each profile onto the first principal component (PC) of the prediction profiles pooled across all subjects during passive listening. Only voxels for which the prediction score of the unscrambled model was higher than the mean were used to estimate the PCs. Finally, TRW values were normalized such that the 98th percentile mapped onto [0, 1].
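A sketch of this projection and normalization, assuming the first PC is taken via SVD of the mean-centered reference profiles and that values above the 98th percentile are clipped to 1 (the text does not spell out the clipping, so that detail is an assumption):

```python
import numpy as np

def trw_from_profiles(profiles, reference_profiles):
    """Project 20-dimensional prediction profiles onto the first
    principal component of the reference profiles (passive-listening
    profiles pooled across subjects), then rescale so the 98th
    percentile of the projections maps to 1. Illustrative sketch."""
    mean = reference_profiles.mean(axis=0)
    # first right singular vector of the centered data = first PC
    _, _, vt = np.linalg.svd(reference_profiles - mean, full_matrices=False)
    pc1 = vt[0]
    raw = (profiles - mean) @ pc1
    lo, hi = raw.min(), np.percentile(raw, 98)
    # map [min, 98th percentile] to [0, 1], clipping the top tail
    return np.clip((raw - lo) / (hi - lo + 1e-12), 0.0, 1.0)
```

The resulting per-voxel values in [0, 1] are the normalized TRW estimates compared across tasks below.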

Sensitivity and bias of TRW
We compared TRW between the passive listening task and the two attention tasks to assess the sensitivity of TRW to category-based attention. For each voxel, a sensitivity index (SI) was calculated from TRW_0, TRW_H, and TRW_P, the TRW during passive listening, attention to humans, and attention to places, respectively. In a voxel with an SI of 0, attention does not modulate TRW; a voxel with an SI of 1 is maximally modulated by category-based attention. Finally, we quantified a bias index (BI): a larger TRW during attention to humans versus attention to places yields a positive versus negative BI, in the range [−1, 1].
The average bias in human-selective areas (FFA and OFA) is 0.03 ± 0.01. The average bias in scene-selective areas (PPA and RSC) is −0.03 ± 0.02 (mean±std).

Conclusion
Our results demonstrate that during natural listening, the brain optimizes its search for attended categories by integrating information over longer time windows in higher cognitive areas. This finding implies that auditory perception in the real world is facilitated by a mechanism that dynamically modulates temporal receptive windows according to task demands.

Figure caption: (a) Normalized TRW averaged across five subjects. TRW increases from early auditory areas toward higher auditory areas and parietal and prefrontal areas. (b) Attentional sensitivity of TRW averaged across five subjects. Attentional sensitivity of TRW is relatively low in early auditory areas and category-selective areas. (c) Attentional bias of TRW averaged across five subjects. TRW bias was not significant in early auditory areas in temporal cortex (bootstrap test, p > 0.05). However, bias is more prominent in higher cognitive areas in prefrontal cortex (BA45, IFS, MFS, SFG), inferior parietal cortex (IPS and AG), and category-selective areas. Voxels for which the prediction score of the unscrambled model was lower than the average appear as gray curvature and were not included in the analyses. FFA, fusiform face area; OFA, occipital face area; PPA, parahippocampal place area; RSC, retrosplenial cortex.