Left ventral occipitotemporal activation during orthographic and semantic processing of auditory words

The present fMRI study investigated the hypothesis that activation of the left ventral occipitotemporal cortex (vOT) in response to auditory words can be attributed to lexical orthographic rather than lexico-semantic processing. To this end, we presented auditory words in both an orthographic ("three or four letter word?") and a semantic ("living or nonliving?") task. In addition, an auditory control condition presented tones in a pitch evaluation task. The results showed that the left vOT exhibited higher activation for orthographic relative to semantic processing of auditory words, with a peak in the posterior part of vOT. Comparisons to the auditory control condition revealed that orthographic processing of auditory words elicited activation in a large vOT cluster. In contrast, activation for semantic processing was only weak and restricted to the middle part of vOT. We interpret our findings as speaking for orthographic processing in left vOT. In particular, we suggest that activation in left middle vOT can be attributed to accessing orthographic whole-word representations. While activation of such representations was experimentally ensured in the orthographic task, it might also have occurred automatically in the semantic task. Activation in the more posterior vOT region, on the other hand, may reflect the generation of explicit images of word-specific letter sequences required by the orthographic but not the semantic task. In addition, based on cross-modal suppression, the finding of marked deactivations in response to the auditory tones is taken to reflect the visual nature of representations and processes in left vOT.


Introduction
Over the last two decades a multitude of neuroscientific studies has shown that the left ventral occipitotemporal cortex (vOT) plays a critical role in reading. In particular, results from lesion studies suggest that damage to this region causes a relatively isolated deficit in visual word recognition (e.g., Leff et al., 2001; Cohen et al., 2003; Gaillard et al., 2006). In addition, functional neuroimaging studies have shown that the left vOT is one of the most consistently activated brain regions during word reading tasks (Jobard et al., 2003; Mechelli et al., 2003; Turkeltaub et al., 2002). One of the most prominent accounts of left vOT functioning is the so-called "Visual Word Form Area" (VWFA) hypothesis by Cohen, Dehaene, and colleagues (e.g., Cohen et al., 2000, 2002; Dehaene and Cohen, 2011), which suggests that a circumscribed region in the middle segment of the left vOT [y = −50 to −60 in Montreal Neurological Institute (MNI) standard space] becomes specialized for the encoding of orthographic stimuli (i.e., written words) through reading experience. However, it should be noted that the assumption that the left vOT hosts neuronal representations tuned to written words is controversial. A prominent opposing view is that the region acts as an interface area between generic visual input and phonology and semantics (Price and Devlin, 2003, 2004). One of the main assumptions of the VWFA hypothesis by Dehaene and Cohen (2011) is that orthographic representations in left vOT are primarily tuned to sublexical units (i.e., frequent letter sequences within words). This assumption is based on studies that have shown that pseudowords (i.e., non-words that are consistent with the structural constraints of the writing system) elicit activation in left vOT at least as strongly as do words (e.g., Dehaene et al., 2002).
In addition, other studies have shown that activation in this region is modulated by the familiarity of sublexical orthographic features (e.g., the frequency of pairs or triplets of letters), with frequent features eliciting higher activation relative to less frequent features (Binder et al., 2006; Vinckier et al., 2007). However, more recent studies that have simultaneously investigated the effects of multiple factors (e.g., sublexical and lexical frequency, length, etc.) on left vOT activation during reading have consistently failed to find sensitivity to sublexical familiarity in the left vOT (Hauk et al., 2008a; Graves et al., 2010; Woollams et al., 2011). Woollams et al. (2011), for example, tried to dissociate the effects of sublexical familiarity (i.e., bigram frequency) and lexicality (i.e., words vs. pseudowords). While an effect of lexicality (i.e., pseudowords > words) was found in left vOT, an effect of sublexical familiarity was only found in a more posterior inferior occipital region.
Our research group questioned the restriction of the VWFA to sublexical letter-string computation and proposed that the region also hosts representations of frequently encountered whole words (Kronbichler et al., 2004). In this perspective, the left vOT region would be functionally equivalent to the orthographic word lexicon of cognitive dual-route models of reading (Coltheart et al., 2001; Coltheart, 2004). Initial evidence for this came from Kronbichler et al. (2004), who showed that activation in left vOT was affected by the lexical frequency of written words, with reduced activation for high- relative to low-frequency words. This was interpreted as reflecting easier access to frequently relative to rarely used whole-word representations. Although other early neuroimaging studies failed to find a sensitivity to lexical frequency in left vOT regions (Fiez et al., 1999; Fiebach et al., 2002), more recent studies consistently replicated this effect (Hauk et al., 2008a, 2008b; Graves et al., 2010). Other studies have also shown that left vOT exhibits reduced activation to familiar compared to unfamiliar orthographic forms of the same phonological words (e.g., TAXI vs. TAKSI; Kronbichler et al., 2007, 2009; Bruno et al., 2008; Van der Mark et al., 2009; Twomey et al., 2011). These "orthographic familiarity" effects are also interpreted as reflecting orthographic whole-word coding in left vOT. In addition, Schurz et al. (2010) found a lexicality (words vs. pseudowords) by length (short vs. long letter strings) interaction effect on activation in left vOT. Specifically, a length effect (i.e., higher activation for long relative to short letter strings) was present for pseudowords but absent for words. The absence of a length effect for familiar words is expected when letter strings are assimilated by whole-word representations.
Specific support for orthographic whole-word coding in left vOT was also provided by an fMRI priming study (Glezer et al., 2009), which found that the priming effect on vOT activation present for word repetition (coat-coat) disappeared when the prime differed in just one letter from the target word (boat-coat). In contrast, for pseudowords the priming effect was dependent on the number of shared letters.
In the debate around the VWFA it has also been questioned whether left vOT is restricted to visual (word) processing as assumed by the original VWFA hypothesis. Instead, it has been argued, the region might have a polymodal function (Price and Devlin, 2003). Interestingly, a similar interpretation has recently been presented by Dehaene and Cohen (2011), who have suggested that the VWFA might be a "meta-modal" reading area. This was based on studies showing left vOT activation in congenitally blind individuals during reading of Braille (e.g., Büchel et al., 1998; Reich et al., 2011). However, in a recent fMRI study our research group has argued that it might be premature to generalize these findings from congenitally blind to sighted individuals (Ludersdorfer et al., 2013). We showed that the left vOT region which exhibited an orthographic familiarity effect (i.e., higher activation for visual pseudowords relative to words) was also activated for unfamiliar visual stimuli (i.e., false-fonts) and strongly deactivated for unfamiliar auditory stimuli (i.e., reversed spoken words) relative to rest. According to the phenomenon of cross-modal suppression (Laurienti et al., 2002), such deactivations during auditory processing should occur in brain regions dedicated to visual processes. Therefore, the findings of Ludersdorfer et al. were taken to speak for a visual rather than meta-modal role of the left vOT.
Complementary to reading research, an interesting approach to investigating orthographic processing in left vOT has been taken by studies presenting auditory words in the context of spelling or writing tasks. The relatively few studies in this field have consistently reported left vOT activation (for reviews see Purcell et al., 2011; Planton et al., 2013). A recent spelling study from our lab (Ludersdorfer et al., 2015) investigated whether left vOT activation during spelling can be attributed to accessing orthographic whole-word representations. We presented a spelling task in which participants had to indicate whether a visually presented letter was present in the spelling of an auditorily presented word. In the critical condition, correct spelling decisions had to be based on known word spellings (i.e., orthographic whole-word representations) since they could not be based on sublexical phoneme-letter conversions. No such reliance on orthographic word representations was possible in an additional spelling condition which presented unfamiliar pseudowords. Here, decisions had to be based on sublexical-phonological information (i.e., phoneme-letter correspondences). Consistent with the assumption of orthographic whole-word representations in left vOT, we found that the decisions based on known word spellings led to higher activation in left vOT compared to sublexical-phonological decisions.
The present study attempted to strengthen the evidence for visual-orthographic whole-word codes in left vOT. In reading studies, a major concern has been that the left vOT activation differences which are taken to reflect lexical (i.e., whole-word) orthographic processes (e.g., Kronbichler et al., 2004) could, in fact, be driven by lexico-semantic processes (Devlin et al., 2006; Duncan et al., 2010). Evidence for this comes from fMRI priming studies showing reduced left vOT activation for words preceded by a semantically related prime relative to an unrelated prime (Wheatley et al., 2005; Devlin et al., 2006). In addition, some studies have found a positive relation between activation in left vOT regions and semantic variables such as word imageability (Wise et al., 2000; Sabsevitz et al., 2005; Hauk et al., 2008b). Critically, the argument that the left vOT activation could be driven by semantics also applies to findings from spelling studies. For example, the higher vOT activation observed by Ludersdorfer et al. (2015) for word compared to pseudoword spelling could be taken to be affected by the obvious difference in the availability of word meaning.
To disentangle lexical orthographic and lexico-semantic contributions to left vOT activation, the present study presented the same auditory words in both an orthographic and a semantic decision task. In the orthographic task, participants had to indicate whether the written form of the presented word consisted of three or four letters. The fact that all words (German nouns) had three phonemes ensured that participants had to rely on known word spellings (i.e., orthographic whole-word representations), since sublexical-phonological strategies (e.g., counting the number of heard phonemes) would have led to erroneous responses. In the semantic task, participants had to decide whether the presented word referred to a living or a nonliving entity. If left vOT activation is indeed mainly driven by lexico-semantic processes, then higher activation for semantic relative to orthographic processing of auditory words was expected. The opposite pattern was expected if left vOT activation can be attributed to accessing orthographic whole-word codes (Ludersdorfer et al., 2015). One may also note that the latter finding would be generally consistent with an orthography-specific function of left vOT as assumed by the VWFA hypothesis (Dehaene and Cohen, 2011). In contrast, no differentiation or even higher activation for semantic relative to orthographic processing would speak for alternative accounts of left vOT function.
In addition to the direct comparison, activation for orthographic and semantic processing of auditory words in left vOT was evaluated in relation to an auditory control task in which participants had to evaluate the pitch of presented tones. The response of the left vOT to the tones was of interest for the hypothesis that processes and representations in this region are visual rather than meta-modal, as has been recently suggested (Dehaene and Cohen, 2011). As mentioned above, Ludersdorfer et al. (2013) interpreted a marked negative response of left vOT to unfamiliar reversed speech stimuli together with a positive response to visual stimuli as support for a visual role of left vOT. The tone stimuli of the present pitch evaluation task are much simpler than reversed spoken words. Therefore, it was of interest whether the present tones would also result in a negative vOT response, as expected from cross-modal suppression of a visual region.

Participants
Twenty-nine German-speaking participants (14 females) aged 18 to 35 years (M = 26 years) were recruited for the present fMRI study. All participants were right-handed, had normal or corrected-to-normal vision and reported no history of neurological disease or reading/spelling difficulties. All gave written informed consent and were paid for participation.

Tasks and procedure
In the scanner, participants performed three tasks: in the orthographic task, participants were presented with auditory words and had to indicate with a button press whether the written form corresponding to the presented word consisted of three or four letters. In the semantic task, participants were also presented with auditory words and had to decide whether the presented word referred to a living or a nonliving entity. In the pitch evaluation task, participants were presented with tones and had to decide whether the presented tone was high or low in pitch.
Each of the three tasks was presented in six blocks of four trials. Each block started with an instruction screen for 1600 ms indicating the task (i.e., orthographic, semantic, or pitch evaluation). Each of the following trials started with the presentation of a fixation cross centrally on the screen for 300 ms. Next, in the orthographic and the semantic task an auditory word (the average audio length was 2024 ms) was presented preceded by its definite article. In the pitch evaluation task a square wave tone (varying in length from 1600 to 1900 ms) was presented instead of a word. During the presentation of the word or tone the fixation cross remained visible on the screen. Following the auditory presentation, visual cues (i.e., single letters) appeared to the left and the right of the fixation cross. These cues represented the two response alternatives of the respective task. In the orthographic task, for example, the letters D (the beginning letter of DREI [German for three]) and V (the beginning letter of VIER [four]) were presented. The visual cues were visible for the entire response interval which was jittered between 3900 and 4700 ms. The total length of each block was 23.3 s. Across participants, the order of block presentation was pseudorandomized. Between the task blocks fixation periods of 14 s were inserted. The total length of the experiment amounted to approximately 11 min.
Participants were familiarized with all tasks outside the scanner. During recording of the functional brain images auditory stimuli were presented via MR-compatible headphones. Visual stimuli were projected on a semi-transparent screen by a video projector outside the scanner room. An MR-compatible response box was used for the participants to respond. Projection and timing of the stimuli as well as the recording of responses were controlled by Presentation software (Neurobehavioral Systems Inc., Albany, CA, USA).

Stimuli
For the orthographic and the semantic task of the present study 48 monosyllabic German nouns were selected. While all words consisted of three phonemes, half of the words had three letters (e.g., HUT [hat]) and the remainder four (e.g., KNIE [knee]). In addition, half of the words referred to living entities (e.g., BUB [boy]) and half to nonliving entities (e.g., ZUG [train]). The average lexical frequency of the words was 88 per million (SD = 188) and the average summated bigram frequency was 18,983 (SD = 16,492). The large standard deviations of these measures resulted from some words of extremely high frequency [e.g., FRAU (woman) with 857 occurrences per million] and some words consisting of very frequent bigrams [e.g., TIER (animal) with a summated bigram frequency of 94,848]. The words had on average 4.4 (SD = 2.2) orthographic and 12.7 (SD = 6.3) phonological neighbors. Both neighborhood size measures were calculated with CLEARPOND (Marian et al., 2012), considering only substitutions (i.e., the number of words of the same letter or phoneme length that differ in only one letter or phoneme). All item characteristics are based on the SUBTLEX database for German (Brysbaert et al., 2011). Word items were divided into two subsets that, across participants, were used about equally often in the orthographic and in the semantic task. The subsets were matched on all mentioned item characteristics. For the pitch evaluation task square-wave tones (100 or 300 Hz) were used.
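The substitution-only neighborhood definition given above can be illustrated with a minimal sketch. It is not the CLEARPOND implementation; the function name and the toy lexicon are hypothetical and serve only to make the counting rule concrete.

```python
def substitution_neighbors(word, lexicon):
    """Return lexicon entries of the same length as `word` that differ
    from it in exactly one position (substitution neighbors only)."""
    word = word.upper()
    neighbors = []
    for candidate in lexicon:
        candidate = candidate.upper()
        # Additions and deletions are ignored: only same-length items count.
        if len(candidate) != len(word) or candidate == word:
            continue
        mismatches = sum(a != b for a, b in zip(word, candidate))
        if mismatches == 1:
            neighbors.append(candidate)
    return neighbors


# Hypothetical toy lexicon (illustrative only, not taken from CLEARPOND or SUBTLEX).
TOY_LEXICON = ["HUT", "GUT", "MUT", "HAT", "HUND", "ZUG", "BUB"]

print(substitution_neighbors("HUT", TOY_LEXICON))  # ['GUT', 'MUT', 'HAT']
```

The same function would apply to phonological neighbors if words were passed as sequences of phoneme symbols instead of letters.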

Image acquisition and analysis
During each of two functional runs, 148 images sensitive to blood oxygenation level dependent (BOLD) contrast were acquired with a T2*-weighted echo-planar imaging sequence (echo time = 40 ms, TR = 2000 ms, flip angle = 86°, 21 slices with a thickness of 6 mm, 220-mm field of view with a 64 × 64 matrix resulting in 3.44 × 3.44 mm in-plane resolution). In addition, a low-resolution (3.5 × 3.5 × 6 mm) and a high-resolution (1 × 1 × 1.3 mm) structural scan were acquired from each participant with T1-weighted MPRAGE sequences. A 1.5-T Intera Scanner (Philips Medical System Inc., Maastricht, The Netherlands) was used for magnetic resonance imaging.
For preprocessing and statistical analysis, SPM8 software was used (http://www.fil.ion.ucl.ac.uk/spm) running in a MATLAB 7.6 environment (Mathworks Inc., Natick MA, USA). Functional images were realigned, unwarped, and slice-time corrected. The high-resolution structural image was preprocessed and normalized using the VBM8 toolbox (http://dbm.neuro.uni-jena.de/vbm8). The image was segmented into gray matter, white matter and CSF, denoised, and warped into MNI space by registering it to the DARTEL template of the VBM8 toolbox using the high-dimensional DARTEL registration algorithm (Ashburner, 2007). Based on these steps, a skull-stripped version of the structural image was created in native space. The functional images were co-registered to the skull-stripped structural image and then the parameters from the DARTEL registration were used to normalize the functional images to the MNI space. The functional images were further resampled to isotropic 3 × 3 × 3 mm voxels and smoothed with an 8 mm FWHM (full width half maximum) Gaussian kernel.
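For readers less familiar with the FWHM convention, the width of the smoothing kernel can be expressed as a Gaussian standard deviation. The following sketch uses the standard FWHM-to-sigma relation for a Gaussian; the function name is illustrative and the snippet is not part of the SPM8 pipeline described above.

```python
import math


def fwhm_to_sigma(fwhm_mm):
    """Convert the full width at half maximum of a Gaussian to its standard deviation."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


sigma_mm = fwhm_to_sigma(8.0)   # 8 mm FWHM kernel, as stated in the text
sigma_voxels = sigma_mm / 3.0   # expressed in units of the 3 mm isotropic voxels
print(round(sigma_mm, 2), round(sigma_voxels, 2))
```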
Statistical analysis of the fMRI data was performed within a two-stage mixed effects model. For the participant-specific first-level models, the onsets of each task's trials were modeled by a canonical hemodynamic response function (HRF) resulting in three regressors (orthographic, semantic, pitch evaluation). Additionally, the onsets of block instructions and visual cues were modeled as regressors of no interest. This trial-based (event-related) analysis allowed us to best characterize positive and negative activations for auditory words and tones relative to rest (i.e., fixation baseline). An analysis based on the task blocks would have obscured these comparisons since the task blocks also contained visual input (instruction and cue screens) as well as motor responses. The first-level models also included six covariates corresponding to the movement parameters (rotations and translations). The functional data were high-pass filtered with a cut-off of 128 s and corrected for autocorrelation by an AR(1) model (Friston et al., 2002). In the first-level models, the parameter estimates reflecting signal change for all conditions compared to rest were calculated in the context of a GLM (Henson, 2004). These participant-specific images were then used for the second-level random effects analysis. For statistical whole-brain comparisons we used a voxelwise threshold of p < 0.05, corrected for multiple comparisons using the family-wise error (FWE) rate, and a cluster extent threshold of 5 voxels. Anatomical descriptions for activation peaks are based on the probabilistic Harvard-Oxford Atlas (Desikan et al., 2006) as implemented in FSL (http://www.fmrib.ox.ac.uk/fsl), thresholded at 25%.
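The construction of a trial-onset regressor described above can be sketched as follows. This is an illustration only and not the SPM8 code: the double-gamma parameters (shapes 6 and 16, undershoot ratio 6) are assumed values of the kind commonly used for a canonical HRF, the onsets are hypothetical, and the function names are our own. The TR of 2 s and the 148 scans per run are taken from the acquisition description.

```python
import math


def gamma_pdf(t, shape, scale=1.0):
    """Gamma probability density at time t (in seconds); zero for t <= 0."""
    if t <= 0:
        return 0.0
    return (t ** (shape - 1) * math.exp(-t / scale)) / (math.gamma(shape) * scale ** shape)


def double_gamma_hrf(t, peak_shape=6.0, undershoot_shape=16.0, ratio=6.0):
    """Assumed double-gamma HRF: a positive response minus a scaled late undershoot."""
    return gamma_pdf(t, peak_shape) - gamma_pdf(t, undershoot_shape) / ratio


def build_regressor(onsets_s, n_scans, tr_s=2.0, hrf_length_s=32.0):
    """Convolve a stick function at the trial onsets with the HRF, sampled once per TR."""
    sticks = [0.0] * n_scans
    for onset in onsets_s:
        idx = int(round(onset / tr_s))
        if 0 <= idx < n_scans:
            sticks[idx] += 1.0
    kernel = [double_gamma_hrf(k * tr_s) for k in range(int(hrf_length_s / tr_s) + 1)]
    regressor = [0.0] * n_scans
    for i, stick in enumerate(sticks):
        if stick == 0.0:
            continue
        for k, h in enumerate(kernel):
            if i + k < n_scans:
                regressor[i + k] += stick * h
    return regressor


# Hypothetical onsets (in seconds) for one task within a run of 148 scans.
orthographic = build_regressor([10.0, 40.0], n_scans=148)
```

One such regressor per task (orthographic, semantic, pitch evaluation), together with the nuisance regressors, would form the columns of the first-level design matrix; sampling at the TR rather than at a finer microtime resolution is a simplification of this sketch.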

Behavioral results
Due to technical problems during data acquisition five participants had to be excluded from the behavioral analyses. Table 1 shows the mean response times (measured from the onset of the auditory stimuli) and accuracies (percentage of correct trials) for the remaining 24 participants. Outliers, more than two standard deviations from the group mean, were excluded (in all tasks fewer than three outliers were identified). Between-task comparisons showed that there was no difference in response times and accuracy between the orthographic and the semantic task (ts(23) < 1.49, ps > .15). However, both the orthographic and the semantic task led to prolonged response times relative to the pitch evaluation task (ts(23) > 9.81, ps < .001). In addition, the semantic task led to more errors than the pitch evaluation task (t(23) = 3.80, p < .001).
Additional within-task analyses showed that there was no difference between three and four letter responses in the orthographic task in either response times (t(23) = 1.95, p = 0.15) or accuracy (t(23) = −1.57, p = 0.13). In the semantic task, although there was no difference in response times (t(23) < 1), accuracy was lower for living compared to nonliving entities (t(23) = −3.6, p < 0.005). The latter difference, which was also responsible for the overall difference in accuracy between the semantic and the pitch evaluation task, resulted from some living items being frequently misclassified as nonliving (e.g., ABT [abbot]).
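The two-standard-deviation outlier criterion and the paired comparisons reported here can be sketched as follows. The use of the sample standard deviation is an assumption (the text does not specify it), the function names are illustrative, and the numbers are hypothetical rather than the study's data.

```python
import math
import statistics


def exclude_outliers(values, n_sd=2.0):
    """Keep values within n_sd sample standard deviations of the group mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= n_sd * sd]


def paired_t(a, b):
    """Paired t statistic and its degrees of freedom for two matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical values, for illustration only.
kept = exclude_outliers([1, 1, 1, 1, 1, 1, 1, 1, 1, 100])
t_value, df = paired_t([5, 6, 7, 9], [4, 4, 6, 6])
print(len(kept), round(t_value, 2), df)
```

With 24 participants, a paired comparison of this kind has 23 degrees of freedom, matching the t(23) values reported in this section.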

fMRI results
First, via whole-brain analyses we investigated activation differences between orthographic and semantic processing of auditory words, activations for auditory words in the two tasks relative to the tones, and activations for auditory words and tones relative to rest (positive and negative). In addition, we performed region-of-interest (ROI) analyses to visualize activation levels for all conditions relative to rest in left posterior ventral regions.

Orthographic versus semantic processing of auditory words
In a first step, we directly compared activation for orthographic and semantic processing of auditory words. As can be seen from the left column of Fig. 1, only a few activation differences were observed. More activation for semantic processing was only observed in the precuneus with a peak at [−3 −70 31] in MNI standard space (cluster size: 14 voxels; peak t value: 5.30). In the opposite direction, also only one cluster was identified. More activation for orthographic relative to semantic processing of auditory words was observed in the left vOT with a peak at [−45 −64 −11] (cluster size: 9 voxels; peak t value: 6.31). Even with a more lenient threshold (p < .001, uncorrected) we did not identify any left vOT region exhibiting more activation for the semantic relative to the orthographic task.
Table 2 and the middle column of Fig. 1 further show activation for the auditory words in the orthographic and semantic tasks relative to the tones in the pitch evaluation task. In line with the previous analysis, we primarily identified activation common to both orthographic and semantic processing. Regions activated for both tasks included bilateral superior and middle temporal gyri as well as the left inferior frontal gyrus, precentral gyrus, and parieto-occipital regions. Of specific interest were activations for the auditory words in left vOT. Here, we only identified activation for the orthographic task with a peak at [−45 −64 −11]. However, to more thoroughly investigate left vOT activations, we repeated the analyses with a more lenient threshold (p < .001, uncorrected) and restricted the search space to left vOT regions (i.e., left fusiform and inferior temporal gyrus). As can be seen in Fig. 1-B, this analysis revealed a small semantic activation cluster in middle vOT at [−39 −58 −17] (cluster size: 5 voxels, peak t value: 3.86) overlapping with the large orthographic activation cluster (with the lower statistical threshold the size of the orthographic activation cluster increased to 66 voxels).

Auditory stimuli versus rest
The right column of Fig. 1 shows positive and negative activations for the auditory words and tones versus rest. Positive activations for the auditory stimuli were found in bilateral superior/middle temporal regions and left precentral gyrus. Interestingly, even with a more lenient threshold (p < .001) no positive activations were observed in left vOT (only in the ROI-based analysis did we identify positive activations for the auditory words in these regions; see below). Marked deactivations in response to all auditory stimuli were observed in dorsal and ventral occipital regions. Of particular interest were deactivations in left ventral posterior regions (see Fig. 1-B). While all auditory stimuli led to deactivations in posterior parts, only the auditory tones did so in anterior parts including the left vOT (see Appendix A for statistics).

ROI-based analyses
We further investigated activation levels for auditory words and tones by means of ROI-based analyses. These analyses mainly served to supplement the whole-brain analysis by providing a clear visualization of activation levels relative to rest in left ventral posterior regions of interest. Three non-overlapping spherical ROIs (r = 4 mm) were defined along the posterior-to-anterior dimension and centered on maximum intensity voxels of an "effects of interest" contrast (i.e., all conditions versus rest) which was anatomically restricted to the left fusiform and inferior temporal gyrus. The anterior ROI was centered on the maximum intensity voxel between y = −50 and −59, the middle ROI between y = −60 and −69, and the posterior ROI between y = −70 and −90. Fig. 2 depicts the approximate locations of the ROIs. For all regions, mean contrast estimates were extracted for the three tasks versus rest. Outliers, more than two standard deviations from the group mean, were excluded (in all ROIs, fewer than three outliers were identified per task).
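To make the ROI definition concrete, the following sketch enumerates the voxels of a spherical ROI on the 3 mm isotropic grid of the normalized images. It assumes that the ROI center lies exactly on a voxel center; the function name is illustrative, and the example center is the whole-brain orthographic peak reported above, used here only as a plausible coordinate and not as one of the actual ROI centers.

```python
import itertools
import math


def sphere_voxels(center_mm, radius_mm=4.0, voxel_mm=3.0):
    """Return voxel centers (in mm) lying within radius_mm of center_mm,
    assuming center_mm itself is a voxel center on an isotropic grid."""
    reach = int(math.floor(radius_mm / voxel_mm))
    voxels = []
    for dx, dy, dz in itertools.product(range(-reach, reach + 1), repeat=3):
        offset = (dx * voxel_mm, dy * voxel_mm, dz * voxel_mm)
        if math.sqrt(sum(o * o for o in offset)) <= radius_mm:
            voxels.append(tuple(c + o for c, o in zip(center_mm, offset)))
    return voxels


roi = sphere_voxels((-45, -64, -11))
print(len(roi))  # 7: the center voxel plus its six face neighbors
```

Under these assumptions a 4 mm sphere covers seven voxels, because the diagonal neighbors lie about 4.24 mm from the center and therefore fall outside the radius.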
An initial TASK (orthographic, semantic, pitch evaluation) × ROI (posterior, middle, anterior) ANOVA revealed a significant interaction between the factors (F(4,112) = 11.56, p < .001) reflecting differentiated activation gradients for the three tasks across the ROIs. To investigate regional task differences, we additionally carried out one-way ANOVAs with the factor TASK for each ROI. These analyses showed that there was no difference between the tasks in the posterior ROI (F(2,56) < 1). In contrast, in both the middle and the anterior ROI significant task differences were identified [middle ROI: F(2,56) = 28.81, p < .001; anterior ROI: F(2,56) = 16.61, p < .001]. Paired t-tests showed that higher activation for the orthographic compared to the pitch evaluation task was present in both ROIs (ps < .05). In contrast, higher activation for the semantic compared to the pitch evaluation task was found in the anterior (p < .05) but not the middle ROI (p = .21). Direct comparisons of the auditory word tasks further revealed higher activation for the orthographic relative to the semantic task in both ROIs (ps < .05).
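As a consistency check, the reported degrees of freedom can be reproduced from the design. The sketch assumes a fully within-participant ANOVA with uncorrected (sphericity-assumed) degrees of freedom and 29 participants entering the analysis; neither assumption is stated explicitly in the text, and the function name is illustrative.

```python
def rm_anova_df(n_participants, *factor_levels):
    """Effect and error degrees of freedom for a within-participant effect
    involving factors with the given numbers of levels."""
    effect_df = 1
    for levels in factor_levels:
        effect_df *= levels - 1
    error_df = effect_df * (n_participants - 1)
    return effect_df, error_df


print(rm_anova_df(29, 3, 3))  # TASK x ROI interaction: (4, 112)
print(rm_anova_df(29, 3))     # one-way TASK effect per ROI: (2, 56)
```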
We additionally performed comparisons of regional task activation levels relative to rest (see Table 3). These comparisons revealed that while the pitch evaluation task resulted in significant deactivations throughout all ROIs, more differentiated gradients were found for the orthographic and the semantic task. Activation for the orthographic task was negative in the posterior but positive in both the middle and the anterior ROIs. Activation for the semantic task was negative in both the posterior and the middle ROIs but did not differ from rest in the anterior ROI.

Table 1. Behavioral results. Mean response times (measured from stimulus onset) and accuracy (percentage of correct trials) for the orthographic, the semantic, and the pitch evaluation task. Standard deviations are presented in parentheses.

Discussion
The aim of the present fMRI study was to investigate whether the left vOT activation in response to auditory words can be attributed to lexical orthographic processing or instead to lexico-semantic processing. To this end, we presented auditory words in both an orthographic ("three or four letter word?") and a semantic ("living or non-living?") task. For orthographic decisions, an opaque phoneme-grapheme relation (all words had three phonemes but could have three or four letters) assured that participants had to access whole-word representations and did not solve the task sublexically (e.g., serially converting each phoneme into a grapheme). Our main result was that we found higher left vOT activation for orthographic relative to semantic processing of auditory words. Furthermore, comparing the auditory words to tones presented in a pitch evaluation task revealed activation for orthographic processing in a large left vOT cluster. In contrast, activation for semantic processing was only small and restricted to middle vOT. In addition, the auditory tones elicited marked deactivations throughout left vOT regions.
The present result of higher left vOT activation for orthographic relative to semantic processing of auditory words strongly supports the assumption that activation in response to auditory words in this region reflects orthographic rather than lexico-semantic processing. In particular, in line with the conclusion of our previous spelling study (Ludersdorfer et al., 2015) we argue that left vOT activation can be attributed to the access to orthographic wholeword codes as suggested by previous reading-based findings (Kronbichler et al., 2004(Kronbichler et al., , 2007Glezer et al., 2009). Critically, both the present study as well as Ludersdorfer et al. experimentally ascertained that participants had to rely on known word spellings (i.e., orthographic whole-word representations) to respond correctly in the orthographic spelling task. In general, the present results are also consistent with the assumption of an orthography-specific function of the left vOT by proponents of the VWFA hypothesis (Dehaene and Cohen, 2011) and speak against alternative accounts . fMRI results. Panel A shows the results of the whole-brain comparisons (p b .05, FWE corrected). Panel B shows the results for the same comparisons (p b .001, uncorrected) restricted to left ventral posterior regions (i.e., left fusiform and inferior temporal gyrus) superimposed on axial slices. The left column presents activation differences between orthographic and semantic processing of auditory words, the middle column presents activations for auditory words in the two tasks relative to the tones in the pitch evaluation task, and the right column presents activations and deactivations for the auditory stimuli relative to rest. *Brain regions activated for auditory words or tones relative to rest.
The identified activation cluster for orthographic processing (in the orthographic > semantic and the orthographic > tones contrast) with a peak at MNI coordinates [−45 −64 −11] closely corresponds to the activation cluster found in our previous spelling study (Ludersdorfer et al., 2015), which was identified by contrasting orthographic spelling decisions on auditory words with a non-spelling control condition (i.e., gender decision) presenting the same words. A similar vOT region has also been consistently identified in previous spelling and writing studies (Rapp and Lipka, 2011; Rapp and Dufor, 2011). In a recent meta-analysis of such studies, Planton et al. (2013) identified a left vOT cluster with a peak at [−46 −62 −12].
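To make the correspondence between the two peaks concrete, the following minimal sketch computes the Euclidean distance between the present peak and the meta-analytic peak of Planton et al. (2013). It is an illustration only and not part of the reported analyses: the function name `peak_distance` is hypothetical, and the calculation assumes that both coordinates refer to the same MNI template, so that differences can be read directly in millimetres.

```python
import math

# Peak coordinates in MNI space (mm), taken from the text above.
present_peak = (-45, -64, -11)  # present study, orthographic contrasts
meta_peak = (-46, -62, -12)     # Planton et al. (2013) meta-analysis


def peak_distance(a, b):
    """Euclidean distance between two 3D coordinates (hypothetical helper)."""
    return math.dist(a, b)


distance_mm = peak_distance(present_peak, meta_peak)
print(f"{distance_mm:.2f} mm")  # prints "2.45 mm"
```

Under this assumption the two peaks lie about 2.4 mm apart (the square root of 1² + 2² + 1² = 6).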
Interestingly, both the present orthographic activation peak and the left vOT peaks of previous spelling-based fMRI studies (see Planton et al., 2013) are located in a more posterior left vOT segment than the region that reading studies have identified as orthography-selective, or VWFA. The latter is classically located in middle vOT between y = −50 and −60 (e.g., Cohen et al., 2000, 2002; Glezer et al., 2009). However, one should note that although spelling activation peaks are usually located more posteriorly, the activation clusters also extend into the more anterior parts of left vOT identified by reading studies. A possible explanation for the difference in peak location is that the more posterior peaks identified by spelling studies may not reflect the proper localization of abstract orthographic whole-word codes but the localization of explicit visual images of word-specific letter sequences. These letter sequences are derived from the more anteriorly located orthographic whole-word codes and are maintained for the demands of orthographic tasks, such as deciding whether the written form of an auditorily presented word consists of three or four letters. In reading studies, no such potentially effortful derivation and maintenance of letter sequences is required, because activation of the letter string is stimulus-driven and constitutes only a transient process on the way to whole-word recognition. This reasoning finds support in the results of our previous spelling study (Ludersdorfer et al., 2015), in which the left vOT activation for orthographic word spelling had a peak at y = −64 compared to a non-spelling control condition and a more anterior peak at y = −55 compared to pseudoword spelling. While the former contrast isolated spelling processes more generally, the latter specifically isolated the process of accessing orthographic word representations during spelling.
From this perspective it is of interest that, in addition to the orthographic task, the semantic task also elicited (weak) activation relative to the tones in left middle vOT. Activation of left vOT regions in response to auditory words presented in non-orthographic tasks (such as the present semantic task) is not an isolated finding. Several previous studies have identified activation in this region when participants listened to spoken words and had to make rhyming judgments (e.g., Booth et al., 2002; Yoncheva et al., 2010), repeat and think about the meaning of words, or simply evaluate whether a word did or did not follow an identical one (Ludersdorfer et al., 2013). Initially, these findings were taken to challenge an orthography-specific function of left vOT (Price and Devlin, 2003). The present findings, however, do not support this view. In contrast to the strong orthographic activation, semantic activation (relative to the tones) observed in the whole-brain analysis was only present at a very lenient statistical threshold (p < .001, uncorrected). Furthermore, the ROI-based analyses showed that in the anterior ROI in which semantic activation was found, orthographic processing still elicited significantly higher activation than semantic processing. It is possible that the weak activation reflects automatic access of auditory words presented in the semantic task to their corresponding orthographic representations. Although speculative, this interpretation is in line with Dehaene et al. (2010, 2015), who argue that left vOT activation in response to spoken words in non-orthographic tasks could reflect automatic top-down recruitment of orthographic codes during demanding tasks in which all available information is gathered to facilitate speech processing.
It is, however, also possible that orthographic and semantic activations in this middle vOT region stem from distinct sub-regions that are difficult to distinguish in group-based analyses. While the orthographic task engaged an orthographic region (i.e., the VWFA), the semantic task might have engaged a functionally distinct but overlapping region involved in multimodal word processing. Cohen et al. (2004), for example, identified a lateral inferior multimodal area (LIMA), which was indistinguishable from the VWFA in group comparisons but was reliably identified at the individual level. In future studies it will be important to employ high-resolution MRI in combination with individual analyses in order to delineate fine-grained subdivisions of left vOT.

Table 2 Brain regions activated for auditory words in the orthographic and the semantic task relative to the tones in the pitch evaluation task (p < .05, FWE corrected).

Activation plots on the right depict brain activation estimates (in arbitrary units) for auditory words in the orthographic and the semantic task as well as for the tones in the pitch evaluation task. Error bars denote ±1 SEM. Asterisks denote significant differences (p < .05).

A further finding of interest was that the tones elicited marked deactivations in left vOT regions. According to the phenomenon of cross-modal suppression (Kawashima et al., 1995; Laurienti et al., 2002), deactivations in response to auditory processing are expected in visual brain regions. Therefore, the presently found deactivations in large dorsal and ventral occipital areas (see Fig. 1) are not surprising. In contrast, as mentioned in the Introduction, the assumption that the left vOT is a visual region and constitutes the anterior end of the left ventral visual stream has been criticized. Instead, it has been suggested that the region might have a polymodal (Price and Devlin, 2003) or meta-modal function (Dehaene and Cohen, 2011). The latter was based on studies showing left vOT activation in congenitally blind individuals during reading of Braille (e.g., Reich et al., 2011). However, the presently found deactivations in left vOT in response to tones rather support a visual role of left vOT, at least in sighted individuals. More direct support for this comes from a previous study from our lab (Ludersdorfer et al., 2013), in which we showed that the left vOT exhibited marked deactivation to unfamiliar auditory stimuli together with a strong positive activation to unfamiliar visual stimuli. The characterization of the left vOT as dedicated to the visual domain has implications for our interpretation of the activations found in response to the auditory words. As mentioned, the higher activation for auditory words relative to the tones is taken to reflect access to orthographic representations. The deactivation to the tones suggests that these representations are shaped by visual experience (i.e., seeing visual words).
An unexpected finding was that activation differences between orthographic and semantic processing of auditory words were rather sparse, with both tasks eliciting largely common activation, most prominently in left middle temporal and inferior frontal brain regions. A possible explanation for this is that the present study might not have cleanly separated semantic from orthographic processes, so that word meaning may have been accessed not only during the semantic but also during the orthographic task. The present blocked task design, however, speaks against the possibility that this is merely an artifact of the experimental setup. Instead, this might point to a more general difficulty in separating semantic from orthographic processing of auditory words. The latter assumption finds support in some cognitive models of spelling which suggest that spellings for familiar words are not accessed directly from phonology but via the semantic system (e.g., Tainturier and Rapp, 2001).
Only one brain region, the left precuneus, was identified with higher activation for semantic compared to orthographic processing of auditory words. Interestingly, the precuneus is generally not associated with semantic processing per se. However, the region has been linked to mental imagery (Cavanna and Trimble, 2006), which constitutes a plausible strategy for the living/nonliving decision of the present semantic task. More surprising, however, was the failure to identify classical semantic regions such as the angular gyrus as well as the medial and anterior temporal lobes (Binder et al., 2009) in the comparison of the semantic with the auditory control condition. With respect to the angular gyrus, recent evidence suggests that the region is primarily engaged by the integration of complex semantic information (Binder et al., 2009; Binder and Desai, 2011; Seghier, 2013). It might be the case that the present living/nonliving decisions did not require such high-level semantic integration. The anterior temporal lobe, which has also been ascribed a pivotal role for amodal semantic memory by neuropsychological and PET studies (Lambon Ralph et al., 2010), is in general rarely identified in fMRI studies of semantic processing (Patterson et al., 2007). This probably results from a diminished fMRI signal in brain regions close to the air-filled sinuses (Devlin et al., 2000). With respect to the medial temporal lobe, there is also mixed evidence for activation during semantic processing tasks in fMRI studies (Otten et al., 2001; Chee et al., 1999). It has been suggested that when semantic processing is compared to very simple control tasks with relatively long interstimulus intervals (such as the pitch evaluation task of the present study), activation in the medial temporal cortex might be missed due to "mind-wandering" in the control task, which also activates these regions (Tieleman et al., 2005).
Of specific interest for the semantic activations of the present study, however, is a previous study by Booth et al. (2002), which also contrasted semantic processing of auditory words (i.e., a semantic association task) with processing of tones. Similar to the present findings, Booth and colleagues did not identify activations in the mentioned classic semantic regions but mainly in middle temporal and inferior frontal regions.

Conclusion
The present study provided evidence for the hypothesis that activation of the left vOT in response to auditory words reflects access to visual-orthographic representations of whole words. We found higher left vOT activation for orthographic processing ("three or four letter word?") relative to semantic processing ("living or nonliving?") of auditory words. Comparisons to tones presented in a pitch evaluation task showed that orthographic processing of auditory words elicited activation throughout left vOT. In contrast, semantic processing elicited only weak activation in middle vOT. We interpret our findings as speaking for orthographic processing in left vOT. In particular, we suggest that activation in the middle vOT, the classic localization of the Visual Word Form Area, can be attributed to access to orthographic whole-word representations. While activation of such representations was experimentally ascertained in the orthographic task, it may have also occurred automatically in the semantic task. Activation in the posterior vOT, on the other hand, may reflect the generation of explicit images of letter sequences required by the orthographic but not the semantic task. Based on the phenomenon of cross-modal suppression, the finding of marked deactivation in response to the auditory tones also supports the view that orthographic word representations in left vOT are of a visual nature.

Table 3 Statistical comparisons (paired t-tests) against rest for auditory words and tones in each of the left ventral posterior ROIs.

Acknowledgments
P. Ludersdorfer was supported by the Doctoral College "Imaging the Mind" of the Austrian Science Foundation (FWF-W1233), F. Richlan was supported by the Austrian Agency for International Cooperation in Education and Research (OeAD PL 11/2015), and M. Kronbichler was supported by the Austrian Science Foundation (FWF P-23916-B18). We want to thank Carys Deeley for proofreading the manuscript.

Appendix A