Faces and voices in the brain: a modality-general person-identity representation in superior temporal sulcus

Face-selective and voice-selective brain regions have been shown to represent face-identity and voice-identity, respectively. Here we investigated whether there are modality-general person-identity representations in the brain that can be driven by either a face or a voice, and that invariantly represent naturalistically varying face and voice tokens of the same identity. According to two distinct models, such representations could exist either in multimodal brain regions (Campanella and Belin, 2007) or in face-selective brain regions via direct coupling between face- and voice-selective regions (von Kriegstein et al., 2005). To test the predictions of these two models, we used fMRI to measure brain activity patterns elicited by the faces and voices of familiar people in multimodal, face-selective and voice-selective brain regions. We used representational similarity analysis (RSA) to compare the representational geometries of face- and voice-elicited person-identities, and to investigate the degree to which pattern discriminants for pairs of identities generalise from one modality to the other. We found no matching geometries for faces and voices in any brain regions. However, we showed crossmodal generalisation of the pattern discriminants in the multimodal right posterior superior temporal sulcus (rpSTS), suggesting a modality-general person-identity representation in this region. Importantly, the rpSTS showed invariant representations of face- and voice-identities, in that discriminants were trained and tested on independent face videos (different viewpoint, lighting, background) and voice recordings (different vocalizations). Our findings support the Multimodal Processing Model, which proposes that face and voice information is integrated in multimodal brain regions.

Significance statement

It is possible to identify a familiar person either by looking at their face or by listening to their voice. Using fMRI and representational similarity analysis (RSA) we show that the right posterior superior temporal sulcus (rpSTS), a multimodal brain region that responds to both faces and voices, contains representations that can distinguish between familiar people independently of whether we are looking at their face or listening to their voice. Crucially, these representations generalised across different particular face videos and voice recordings. Our findings suggest that identity information from visual and auditory processing systems is combined and integrated in the multimodal rpSTS region.

Introduction

The face and the voice of a familiar person both provide a wealth of information regarding the person's identity, such as their name, our relationship to them, and memories of previous encounters. Knowledge about how the brain processes faces and voices separately has advanced significantly over the past twenty years: functional magnetic resonance imaging (fMRI) revealed cortical regions that are face-selective (Kanwisher et al., 1997) and regions that are voice-selective (Belin et al., 2000).

Despite these advances, we still have a limited understanding of how the brain combines and integrates face and voice information. Two major models have been put forward. The Multimodal Processing Model proposes that there are multimodal systems that process information about people and receive input from both face- and voice-responsive regions (Ellis et al., 1997; Campanella and Belin, 2007), with support from patient (Ellis et al., 1989; Gainotti, 2011) and fMRI studies. An alternative model proposes that modality-general identity information arises via direct coupling between face- and voice-selective regions (von Kriegstein et al., 2005). Both models predict person-identity representations that will generalise across faces and voices. Two recent studies found some support for the Multimodal Model by showing that multimodal regions in the STS and inferior frontal gyrus (Hasan et al., 2016; Anzellotti and Caramazza, 2017) could discriminate between the activation patterns of two face-identities based on voice information (and vice-versa). However, these studies did not show that the regions that could decode identities across modalities could also decode identities within each modality, which is a crucial feature of modality-general person-identity representations. Furthermore, these studies used very few identities and tokens per identity.

In our study, we included multiple, naturalistically varying face videos and voice recordings for each identity. We used representational similarity analysis (RSA) to compare the representational geometry of face-identities with the representational geometry of voice-identities (Analysis A), and to investigate the degree to which pattern discriminants for each pair of identities generalise from one modality to the other (Analysis B). Analysis A focused on the representational geometry of all identities, i.e. the entire structure of pairwise distances between the activity patterns elicited by these identities in each modality, and compared geometries across modalities. Analysis B focused on the discriminability of pairs of identities, and used a linear discriminant computed in one modality to test discriminability of the same pair of identities in the other modality (in a similar way to traditional pattern classification methods). These two analyses complement each other and allowed us to test different predictions regarding the nature of modality-general person-identity representations.

For Analysis A (RSA comparing representational geometries), we predicted that brain regions with modality-general person-identity representations would show matching representational geometries for face-identities and voice-identities. This analysis is constrained by two assumptions. The first assumption is that there is sufficient variability in the representational distances between different identities within modality, i.e. different degrees of similarity between identities. If all identities are equally distinct from each other, we do not expect to find correlations between geometries across the two modalities. The second assumption is that modality-general information dominates over any modality-specific information that may be present in the same voxels.
Specifically, it is possible that the voxels comprising the pattern estimates contain both unimodal and multimodal neurons (Quiroga et al., 2009). In this case, modality-specific information about the identities could override the influence of modality-general information on the representational geometry, and could result in non-matching representational geometries across modalities.

We thus also conducted Analysis B (RSA investigating identity discriminability), and we predicted that brain regions with modality-general person-identity representations would be able to discriminate between pairs of identities in one modality based on their representational distance in the other modality. This analysis focuses on one pair of identities at a time, and thus is not affected by the degree of variability in the representational distances between all identities. In addition, this analysis focuses on pattern discriminants that generalise across modalities, and we therefore expected it to be more sensitive in detecting modality-general person-identity representations even in the presence of modality-specific information.
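To make the logic of Analysis B concrete, the following is a minimal sketch, in Python with synthetic data, of a pattern discriminant trained on one modality and tested on the other. A standard linear discriminant classifier (scikit-learn) is used as a stand-in for the LDC approach described in the Methods; the voxel count, token count, and simulated shared identity signal are illustrative assumptions.

# Minimal sketch of the crossmodal generalisation logic of Analysis B:
# a linear discriminant is fitted on response patterns from one modality
# (faces) and tested on patterns from the other (voices). All data are
# synthetic; the study itself used the linear discriminant contrast (LDC)
# rather than classification accuracy.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_voxels, n_tokens = 100, 6  # hypothetical ROI size and tokens per identity

# A shared (modality-general) identity signal plus independent noise
identity_axis = rng.standard_normal(n_voxels)

def patterns(sign, n):
    """Simulate n activity patterns for one identity in one modality."""
    return sign * identity_axis + rng.standard_normal((n, n_voxels))

face_a, face_b = patterns(+1, n_tokens), patterns(-1, n_tokens)
voice_a, voice_b = patterns(+1, n_tokens), patterns(-1, n_tokens)

# Train the discriminant on the face patterns of identities A and B...
clf = LinearDiscriminantAnalysis()
clf.fit(np.vstack([face_a, face_b]), [0] * n_tokens + [1] * n_tokens)

# ...and test whether it separates the voice patterns of the same identities
accuracy = clf.score(np.vstack([voice_a, voice_b]), [0] * n_tokens + [1] * n_tokens)
print(f"crossmodal accuracy: {accuracy:.2f}")

If only modality-specific information were present, crossmodal accuracy would remain at chance; above-chance generalisation is the signature that Analysis B exploits.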

Participants
Participants were recruited at Royal Holloway, University of London and Brunel University London to take part in a behavioural and fMRI experiment. All participants were required to be native English speakers aged between 18 and 30, and to have been resident in the UK for a minimum of 10 years. These requirements were set to increase the likelihood of participants being familiar with the famous people whose faces and voices were presented in the experiment. In addition, participants completed an online Recognition Task (see below) as part of the screening procedure for the study and were only invited if they were able to recognise at least 75% of our set of famous people from both their face and their voice.

The face stimuli were selected so that the background did not provide any cues to the identity of the person. Other than the absence of speech, there were no constraints on the type of face movement. Examples of face movements included nodding, smiling, and rotating the head. However, all stimuli were selected to be primarily front-facing. Face stimuli were edited using Final Cut Pro X (Apple, Inc.) so that they were three seconds long and centred on the bridge of the nose. Six video clips of the face of each person were obtained from different original videos, each set against a different background.

Voice stimuli were edited using Audacity® 2.0.5 recording and editing software (RRID:SCR_007198) so that they contained three seconds of speech after removing long periods of silence. Voice stimuli were converted to mono with a sampling rate of 44,100 Hz, low-pass filtered at 10 kHz, and RMS-normalised using Praat. Six sound clips of the voice of each person were obtained from different original videos. All of the voice stimuli had different verbal content and were non-overlapping. The stimuli were selected so that the speakers' identity could not be determined based on the verbal content, conforming to the standards set by Van Lancker et al. (1985) and Schweinberger et al. (1997).
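The mono conversion, resampling, low-pass filtering, and RMS normalisation were performed in Audacity and Praat; purely for illustration, an equivalent pipeline could be scripted as below (using NumPy, SciPy, and the soundfile library; the file names and the RMS target are hypothetical, as the text does not specify a target level).

# Illustrative re-implementation of the voice-stimulus preprocessing:
# mono conversion, resampling to 44.1 kHz, 10 kHz low-pass, RMS normalisation.
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfiltfilt, resample_poly

TARGET_SR = 44100   # target sampling rate (Hz)
CUTOFF_HZ = 10000   # low-pass cutoff (Hz)
TARGET_RMS = 0.1    # hypothetical RMS target; not specified in the paper

audio, sr = sf.read("voice_clip.wav")  # hypothetical input file

# Convert to mono by averaging channels
if audio.ndim > 1:
    audio = audio.mean(axis=1)

# Resample to 44.1 kHz if needed (polyphase resampling)
if sr != TARGET_SR:
    audio = resample_poly(audio, TARGET_SR, sr)

# Low-pass filter at 10 kHz (zero-phase Butterworth)
sos = butter(4, CUTOFF_HZ, btype="low", fs=TARGET_SR, output="sos")
audio = sosfiltfilt(sos, audio)

# RMS normalisation to the target level
audio = audio * (TARGET_RMS / np.sqrt(np.mean(audio ** 2)))

sf.write("voice_clip_processed.wav", audio, TARGET_SR)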

MRI data acquisition
Participants were scanned using a 3.0 Tesla Tim Trio MRI scanner (Siemens, Erlangen) with a 32-channel head coil at the Combined Universities Brain Imaging Centre (CUBIC) at Royal Holloway, University of London. In each of the two scanning sessions, a whole-brain T1-weighted anatomical scan was acquired using magnetization-prepared rapid acquisition gradient echo (MPRAGE).

After separate pre-processing of the images in each session, images from the second scanning session were realigned to the structural image from the first session. Specifically, the structural image from session two was coregistered to the structural image from session one, and the transformation was then applied to all functional images from session two. As a result, all functional images were in the same space.

Vocal stimuli were presented in 20 blocks and included speech and non-speech vocalisations obtained from 47 speakers (Pernet et al., 2015). Non-vocal stimuli were presented in 20 blocks and consisted of industrial sounds, environmental sounds, and animal vocalisations. Within each block stimuli were presented in a random order that was fixed across participants. Participants were instructed to close their eyes and focus on the sounds. The TVA localiser was presented directly after the main experimental runs. The duration of a single run was approximately 10 minutes.

ROI definition. We used probabilistic maps from previous studies to define regional masks in which we predicted that our regions of interest (ROIs) would be located. We then defined ROIs by extracting all selective voxels within those regional masks for each participant. The masks for the voice-selective temporal voice areas (rTVA, lTVA) were based on a probabilistic map from Pernet et al. (2015). For all other regional masks, we used probabilistic maps that were obtained from a previous study conducted in the lab (unpublished data). In this previous study we tested 22 participants using the same face and voice localisers as the current study (we did not use the multimodal localiser in this previous study). We defined face-selective and voice-selective t-test images for each participant, thresholded each image at p<.05 (uncorrected), binarised the resulting image, and summed all images across participants to create face-selective and voice-selective probabilistic maps. In cases where there was some overlap between the masks for different regions we manually defined the borders of these masks using anatomical landmarks. For the temporal pole and anterior inferior temporal cortex (rTP-aIT, lTP-aIT) we considered the TP and aIT together, as the peaks were difficult to separate in most participants. We did not create a mask of the multimodal STS using this method due to the voice-selective STS region being much larger than the face-selective STS region. However, there was large overlap between our mask of the face-selective rpSTS and our masks of the rSTS/STG and rTVA, suggesting that this face-selective rpSTS region also responds to voices.

All of the regional masks (in MNI space) were registered and resliced to each participant's native space using FSL (version 5.0.9; RRID:SCR_002823; Jenkinson et al., 2012). These masks were then used to extract ROIs from the t-test maps obtained from the contrasts of interest from the face, voice, TVA, and multimodal localisers from the current study. All voxels that fell within the boundaries of the mask and that were significantly activated at p<.001 (uncorrected) were included in the subject-specific ROI.
If there were fewer than 30 voxels at p<.001, the threshold was lowered to p<.01 or p<.05. If we could not define 30 selective voxels even at p<.05, the ROI for that participant was not included in the analyses. We required that all ROIs be present in at least 20 participants (out of 30).

Participants performed an anomaly detection task that involved pressing a button when they saw or heard a novel famous person who was not part of the set of 12 famous people that they had been familiarised with prior to entering the scanner. Therefore, each run also contained 12 task trials presenting six famous faces and six famous voices that were not part of the set of famous people that the participants had been familiarised with.
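For illustration, the adaptive thresholding used to define the subject-specific ROIs (described above) could be implemented as in the following sketch; the t-map, regional mask, and degrees of freedom are synthetic stand-ins for the localiser outputs.

# A minimal sketch of the adaptive ROI-definition procedure: voxels inside
# a regional mask are selected at p < .001, relaxing to p < .01 and then
# p < .05 if fewer than 30 voxels survive.
import numpy as np
from scipy.stats import t as t_dist

MIN_VOXELS = 30

def define_roi(t_map, mask, df):
    """Return a boolean ROI volume, or None if it cannot be defined.

    t_map : voxel-wise t-values from the localiser contrast
    mask  : boolean regional mask (same shape as t_map)
    df    : degrees of freedom of the t contrast
    """
    for p in (0.001, 0.01, 0.05):       # progressively lenient thresholds
        t_crit = t_dist.ppf(1 - p, df)  # one-tailed critical t-value
        roi = mask & (t_map > t_crit)
        if roi.sum() >= MIN_VOXELS:
            return roi
    return None                         # ROI excluded for this participant

# Example with synthetic data
rng = np.random.default_rng(1)
t_map = rng.normal(1.5, 1.0, size=(40, 48, 40))
mask = np.zeros_like(t_map, dtype=bool)
mask[15:25, 20:30, 15:25] = True
roi = define_roi(t_map, mask, df=200)
print(None if roi is None else int(roi.sum()), "voxels selected")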

Stimuli were presented in a pseudorandom order which ensured that, within each modality, each identity could not be preceded or succeeded by one of the other identities more than once, and that each stimulus could not be succeeded by a repetition of the exact same stimulus. Face and voice clips were presented for three seconds with an SOA of four seconds. Thirty-six null fixation trials were added to each run (~25% of the total number of trials). Thus, each run contained 144 trials in total and lasted approximately 10 minutes.
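One simple way to generate sequences that satisfy these constraints is rejection sampling: reshuffle the trial list until no pair of identities is adjacent more than once within a modality and no stimulus immediately repeats. The sketch below illustrates this with simplified trial counts; the randomisation code actually used in the study is not described, so this is only one plausible implementation.

# A minimal sketch of the pseudorandomisation constraints, via rejection
# sampling. Trial counts are simplified relative to the actual design;
# in practice a few hundred reshuffles may be needed.
import random
from itertools import product

def valid(seq):
    pair_counts = {}
    for a, b in zip(seq, seq[1:]):
        if a == b:                          # exact same stimulus repeated
            return False
        if a[0] == b[0] and a[1] != b[1]:   # same modality, different identities
            key = (a[0],) + tuple(sorted((a[1], b[1])))
            pair_counts[key] = pair_counts.get(key, 0) + 1
            if pair_counts[key] > 1:        # identity pair adjacent twice
                return False
    return True

# Each trial is a (modality, identity, token) triple
trials = list(product(("face", "voice"), range(12), range(3)))

random.seed(0)
while True:
    random.shuffle(trials)
    if valid(trials):
        break
print(trials[:6])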

The presentation order of the three runs was counterbalanced across participants. The same three runs, with the same face videos and voice recordings that were presented in scanning session one, were also presented in session two. However, the three runs were presented in a different order in the two sessions (counterbalanced across participants), and stimuli within each run were presented in a new pseudorandom sequence. As an exception, the stimuli for the task trials were different in the two sessions in order to maintain their novelty.

General linear models. We computed mass univariate time-series models for each participant. Models were defined separately for each scanning session and each experimental run (six runs in total). Regressors modelled the BOLD response following the onset of the stimuli and were convolved with a canonical hemodynamic response function (HRF). We also used a high-pass filter cutoff of 128 seconds and an autoregressive AR(1) model to account for serial correlations. The 12 different identities in each modality were entered as separate regressors in the model. The resulting pattern estimates were used to compute representational dissimilarity matrices (RDMs) based on the LDC, and the mean LDC value across all cells of an RDM reflects the overall ability of that ROI to discriminate between identities. Mean LDC values for all participants can then be subjected to random-effects inference comparing against zero. Therefore, we predicted that crossmodal RDMs for regions with modality-general person-identity representations would show mean LDC values that are significantly greater than zero. We also investigated the ability of each ROI to discriminate between identities within modality, using the face and voice RDMs that were created in the previous analysis. We predicted that face or voice RDMs for regions that represent face or voice identity, respectively, would show mean LDC values that are significantly greater than zero.

Code and data accessibility. All the code and data for the above analyses will be made available after publication.

Exploratory whole-brain searchlight analyses. Despite including a broad range of functionally defined ROIs, it is possible that modality-general person-identity representations may exist in brain regions not included in our ROIs. Specifically, these representations may exist in brain regions that are not face-selective or voice-selective. Therefore, we used an exploratory whole-brain searchlight analysis to identify potential brain regions with person-identity representations outside these ROIs. We focused solely on modality-general person-identity representations in this exploratory analysis, as that was the main aim of this study.

For each participant we created 6 mm radius spheres centred on each voxel within a grey-matter mask of their brain (obtained from the segmentation procedure) using the RSA toolbox (Nili et al., 2014). Within each sphere we computed the same statistics as in the ROI analyses (the face-voice RDM correlation for Analysis A and the mean crossmodal LDC for Analysis B). The whole-brain searchlight maps from each analysis were normalised to MNI space using the normalisation parameters generated during the segmentation procedure and spatially smoothed with a 9-mm Gaussian kernel (full width at half maximum) to correct for errors in intersubject alignment. For group-level analysis, all searchlight maps were entered into a one-sample t-test to determine whether the correlation coefficient/mean LDC value was significantly greater than zero at each voxel. We used the randomise tool (Winkler et al., 2014) in FSL for inference on the resulting statistical maps (5,000 sign-flips). Clusters were identified with threshold-free cluster enhancement (TFCE).
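For illustration, the sign-flip permutation scheme underlying this kind of group-level inference can be sketched for a single statistic (FSL randomise applies it voxel-wise and combines it with TFCE); the participant values below are synthetic.

# A minimal sketch of a one-sample sign-flip permutation test
# (H0: the statistic is symmetrically distributed around zero).
import numpy as np

def sign_flip_test(values, n_perm=5000, seed=0):
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    observed = values.mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        flips = rng.choice([-1.0, 1.0], size=values.size)  # random sign-flips
        null[i] = (values * flips).mean()
    # One-tailed p: proportion of null means at least as large as observed
    return observed, (np.sum(null >= observed) + 1) / (n_perm + 1)

# Example: hypothetical mean LDC values for 30 participants in one ROI
rng = np.random.default_rng(42)
ldc_values = rng.normal(0.05, 0.1, size=30)
mean_ldc, p = sign_flip_test(ldc_values)
print(f"mean LDC = {mean_ldc:.3f}, p = {p:.4f}")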
Figure 1 shows the probabilistic maps thresholded to display all voxels that were present in at least 20% of the participants.

Mean response to faces and voices in ROIs
In order to confirm that each ROI showed the expected responsiveness to faces and voices, we computed the regional mean of the parameter estimates for faces and for voices across participants for each ROI and modality (Figure 2). As expected, mean beta values for faces were high and significantly greater than zero in all three face-selective ROIs (all significant in one-sample t-tests). In the face-selective rpSTS, the mean response to voices was also substantial, and differed significantly compared with faces (p=.0002), despite this ROI being defined using our face localiser. This is most likely due to the large overlap between this ROI and the voice-selective rSTS/STG and rTVA ROIs. This finding demonstrates that the rpSTS also showed substantial responses to voices.

We note that, although we still included the frontal pole ROI in the main analyses, we cannot be confident about the multimodal responses of this ROI. Also, we note that in all multimodal ROIs (OFC, FP, rTP-aIT, lTP-aIT, Prec./P.Cing.) mean beta values for voices were significantly higher than mean beta values for faces (all p≤.0011). We observed this consistently across all participants.

Analysis A: RSA comparing representational geometries

Our first main analysis compared the representational geometries of face- and voice-identities across and within modalities in each ROI. We computed face and voice RDMs separately for each session using the LDC and compared the RDMs using Pearson correlation (Figures 3 and 4). We then tested whether these correlations were significantly above zero.
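A minimal sketch of this RDM comparison follows, assuming 12 x 12 RDMs whose lower triangles are compared with a Pearson correlation; the RDMs here are random placeholders for the LDC-based RDMs computed from the fMRI patterns.

# A minimal sketch of Analysis A: correlate the lower triangles of a face
# RDM and a voice RDM (12 identities each).
import numpy as np
from scipy.stats import pearsonr
from scipy.spatial.distance import squareform

rng = np.random.default_rng(3)
n_ids = 12

def random_rdm(n):
    d = rng.random((n, n))
    d = (d + d.T) / 2      # symmetrise
    np.fill_diagonal(d, 0)
    return d

face_rdm, voice_rdm = random_rdm(n_ids), random_rdm(n_ids)

# squareform extracts the off-diagonal distances as a condensed vector
r, _ = pearsonr(squareform(face_rdm), squareform(voice_rdm))
print(f"face-voice RDM correlation: r = {r:.3f}")
# Per-participant r values would then be tested against zero across subjects.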

We predicted that face and voice RDMs would be correlated in ROIs that represent person-identity independently of modality. However, our results showed no significant correlations between face and voice RDMs in face-selective, voice-selective, or multimodal ROIs (Figure 3). It is possible that comparing RDMs across scanning sessions taking place on separate days did not allow us to detect subtle consistencies in the representational geometries for face-identities and voice-identities. To address this concern, we also compared face and voice RDMs within the same scanning session. However, we still found no significant correlations between face and voice RDMs. Therefore, using this method we found no evidence of modality-general person-identity representations in our ROIs.

We also predicted that there would be correlations between RDMs within the same modality in regions that represent identity within that modality. However, no correlations between face RDMs or between voice RDMs in any ROI were significant after correction for multiple comparisons.

Analysis B: RSA investigating identity discriminability

Our second main analysis tested the generalisation of pattern discriminants from one modality to the other. More specifically, we computed crossmodal RDMs and we tested whether linear discriminants computed on pairs of faces could be used to discriminate between pairs of voices, and vice-versa. We also tested whether each ROI could discriminate between pairs of stimuli within the same modality. Mean LDC distances across all cells in crossmodal, face, and voice RDMs were compared against zero.
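A minimal sketch of the crossmodal LDC logic, with synthetic patterns: a Fisher discriminant for a pair of identities is estimated from face patterns and applied to independent voice patterns of the same identities. The identity matrix used as the noise covariance is a simplifying assumption; in practice the LDC uses a covariance estimated from GLM residuals, typically with shrinkage regularisation.

# A minimal sketch of the crossmodal linear discriminant contrast (LDC).
# Positive values indicate that the discriminant generalises across
# modalities. All data are synthetic.
import numpy as np

rng = np.random.default_rng(7)
n_voxels = 100

# Shared identity signal driving both modalities (the hypothesis under test)
signal = rng.standard_normal(n_voxels)
noise = lambda: 0.8 * rng.standard_normal(n_voxels)

face_a, face_b = signal + noise(), -signal + noise()    # faces, session 1
voice_a, voice_b = signal + noise(), -signal + noise()  # voices, session 2

# Noise covariance would come from GLM residuals; identity is a stand-in here
sigma_inv = np.eye(n_voxels)

w = sigma_inv @ (face_a - face_b)         # discriminant from the face patterns
ldc = w @ (voice_a - voice_b) / n_voxels  # applied to the voice patterns
print(f"crossmodal LDC: {ldc:.3f}")       # > 0 suggests crossmodal generalisation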

We predicted that in brain regions with modality-general person-identity representations the mean LDC values for crossmodal RDMs would be significantly greater than zero. Our results showed that mean LDC values in these RDMs were significantly greater than zero in the rpSTS, and in the voice-selective lSTS/STG (Figure 5; Table 2). These results show that the rpSTS could discriminate pairs of face-identities based on pattern discriminants computed from pairs of voice-identities (and vice-versa), and therefore appears to form modality-independent person-identity representations.

We note that while the mean LDC values for crossmodal RDMs in the lSTS/STG were significant, the mean LDC value for face RDMs was not. While this result suggests that this region was able to discriminate identities based on crossmodal information, it is unlikely that a modality-general representation could exist without face-identity discrimination. Therefore, this result should be interpreted with caution. It is possible that, in addition to the rpSTS, the lpSTS also contains a modality-general person-identity representation and could be driving the positive result in the lSTS/STG. However, we were not able to test this because we could not localise the lpSTS in our participants using our face localiser.

We also predicted that mean LDC values for face RDMs and voice RDMs would be significantly greater than zero in ROIs that represent face-identity and voice-identity, respectively. We found that mean LDC values in face RDMs were significantly greater than zero in all ROIs originally defined as face-selective (rFFA, rOFA, rpSTS), in the TVAs, and in the multimodal Prec./P.Cing. (Figure 5; Table 3). These results show that all these regions could discriminate between face-identities. A follow-up analysis in which all overlapping rpSTS voxels were removed from the rTVA examined whether the significant result for faces in the rTVA was driven by this overlap. An additional control analysis suggested that identity discrimination in these regions is not solely driven by differences in gender.

Finally, we conducted additional exploratory searchlight analyses across the whole brain to determine whether there were brain regions with modality-general person-identity representations that were not included in our ROIs. The first searchlight analysis investigated correlations between face and voice RDMs across the whole brain, and we did not find any regions showing such correlations between face and voice representational geometries.
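As a reference for both searchlight analyses, the sphere-based procedure described in the Methods can be sketched as follows, assuming 3 mm isotropic voxels and a placeholder statistic standing in for the RDM correlation or mean crossmodal LDC.

# A minimal sketch of the searchlight procedure: for each grey-matter voxel,
# collect the voxels within a 6 mm radius sphere, compute a statistic on the
# patterns inside the sphere, and write it back to the centre voxel.
import numpy as np

VOXEL_MM = 3.0   # assumed isotropic voxel size
RADIUS_MM = 6.0  # searchlight radius used in the study

# Precompute the within-sphere voxel offsets once
r = int(np.ceil(RADIUS_MM / VOXEL_MM))
grid = np.mgrid[-r:r + 1, -r:r + 1, -r:r + 1].reshape(3, -1).T
offsets = grid[np.linalg.norm(grid * VOXEL_MM, axis=1) <= RADIUS_MM]

def searchlight(stat_fn, data, gm_mask):
    """data: (x, y, z, n_conditions) pattern estimates; gm_mask: boolean volume."""
    out = np.zeros(gm_mask.shape)
    for centre in np.argwhere(gm_mask):
        coords = centre + offsets                        # sphere around this voxel
        inside = np.all((coords >= 0) & (coords < gm_mask.shape), axis=1)
        x, y, z = coords[inside].T
        keep = gm_mask[x, y, z]                          # grey-matter voxels only
        out[tuple(centre)] = stat_fn(data[x[keep], y[keep], z[keep], :])
    return out

# Placeholder statistic standing in for the RDM correlation / mean LDC
stat = lambda sphere_patterns: float(sphere_patterns.mean())
vol = np.random.default_rng(5).random((20, 20, 20, 24))  # toy 4-D pattern data
mask = vol[..., 0] > 0.2                                 # toy grey-matter mask
print(searchlight(stat, vol, mask).shape)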

The second searchlight analysis investigated crossmodal generalisation of discriminants for pairs of identities across the whole brain. We found a number of clusters in which the mean LDC in crossmodal RDMs was significantly greater than zero (FWE-corrected threshold p ≤ .05), and below we report t-values and MNI coordinates for the peak grey-matter voxels in each cluster. Anatomical labels for peak voxels are based on the Harvard-Oxford cortical and subcortical structural atlases. The results showed a large cluster (k=1927, p=.007) with peaks in the right putamen (t=4.33, x=21, y=20, z=-1), the left posterior middle temporal gyrus (t=4.04, x=-57, y=-19, z=-7), and the right precentral gyrus (t=3.89, x=54, y=8, z=32). Significant clusters were also found in the right paracingulate gyrus (k=1340, p=.003).

Discussion

Our results showed crossmodal generalisation of pattern discriminants in the multimodal rpSTS, demonstrating that this region was able to discriminate familiar identities based on modality-general information in faces and voices. More specifically, the rpSTS could discriminate pattern estimates for pairs of face-identities based on linear discriminants computed from pattern estimates for pairs of voice-identities, and vice-versa. A crucial and novel aspect of our study is that we showed that the rpSTS not only discriminates between identities, but also generalises across multiple naturalistically varying face videos and voice recordings of the same identity. By always comparing pattern estimates across independent runs with different face and voice tokens for the same identities, we showed that the face- and