Reading direct speech quotes increases theta phase-locking: Evidence for cortical tracking of inner speech?

Growing evidence shows that theta-band (4–7 Hz) activity in the auditory cortex phase-locks to rhythms of overt speech. Does theta activity also encode the rhythmic dynamics of inner speech? Previous research established that silent reading of direct speech quotes (e.g., Mary said: “This dress is lovely!”) elicits more vivid inner speech than indirect speech quotes (e.g., Mary said that the dress was lovely). As we cannot directly track the phase alignment between theta activity and inner speech over time, we used EEG to measure the brain’s phase-locked responses to the onset of speech quote reading. We found that direct (vs. indirect) quote reading was associated with increased theta phase synchrony over trials at 250–500 ms post-reading onset, with sources of the evoked activity estimated in the speech processing network. An eye-tracking control experiment confirmed that increased theta phase synchrony in direct quote reading was not driven by eye movement patterns, and more likely reflects synchronous phase resetting at the onset of inner speech. These findings suggest a functional role of theta phase modulation in reading-induced inner speech.


Introduction
Inner speech is the subjective experience of speaking or hearing speech when no one is talking out loud. It is a pervasive psychological phenomenon in human cognition (Heavey and Hurlburt, 2008). On the one hand, it plays an important role in thinking (Sokolov, 2012), problem solving (Baldo et al., 2005), working memory (Marvel and Desmond, 2012), reading (Scheepers, 2011, 2018; Yao and Scheepers, 2015; Yao, 2021), and writing (Chenoweth and Hayes, 2003). On the other hand, dysfunctions of inner speech are often associated with symptoms in mental health disorders such as rumination in depression, auditory verbal hallucinations in schizophrenia, and associated disorders (McCarthy-Jones and Fernyhough, 2011).
The diverse uses and functions of inner speech are supported by its many forms, varying along several phenomenological dimensions (Grandchamp et al., 2019; McCarthy-Jones and Fernyhough, 2011). Regarding its acoustic and structural details and how it is engaged, inner speech can be expanded or condensed, and can be intentional or spontaneous. Expanded inner speech preserves much of the phonological and syntactic qualities of overt speech, whereas condensed inner speech keeps only the semantic core without verbal elaboration. In silent reading, for example, direct speech quotes are read faster when the preceding context indicates a fast (vs. slow) speaking rate (e.g., He said quickly vs. He said slowly) (Stites et al., 2013). This suggests that quotation-induced inner speech must contain speech-like temporal features in addition to 'default' phonological processing in silent reading (Yao and Scheepers, 2015). Moreover, direct speech reading times were highly correlated between silent and oral reading, providing further evidence that quotation-induced inner speech shares temporal features with overt speech.
At the neural level, expanded inner speech has been found to activate areas of the auditory cortex recruited in overt speech (Alderson-Day et al., 2020; Brück et al., 2014; McGuire et al., 1996; Yao et al., 2012). Using fMRI and eye tracking, Yao et al. (2012) compared neural responses to silent reading of direct vs. indirect speech quotes. While both kinds of reported speech activated the auditory cortex, direct speech quotes were associated with greater neural activity in areas of the auditory cortex that selectively respond to the human voice (Belin et al., 2000). These areas were also more active when direct (vs. indirect) speech quotes were read in equally monotonous voices, suggesting that more vivid inner speech was induced in a top-down fashion to provide the vivid prosodic representations that were expected for direct speech but were absent in the monotonous stimuli (Yao et al., 2012). While identifying the overlapping neural correlates of inner and overt speech represents encouraging progress in understanding inner speech, a detailed understanding of its exact neural mechanisms is still lacking, particularly regarding its temporal features.
A recently discovered neural mechanism for encoding overt speech's temporal structure in perception involves the tracking of speech amplitude envelopes by neural oscillations (Giraud and Poeppel, 2012). The phase alignment between neural oscillations and speech envelopes provides an efficient means for encoding acoustic features (Ding and Simon, 2012), parsing syllabic boundaries (Giraud and Poeppel, 2012), and combining smaller linguistic units into larger structures (Ding et al., 2016; Gross et al., 2013). While syllables are tracked predominantly by theta (4-7 Hz) oscillations (Ding et al., 2016), larger linguistic units such as words and phrases are tracked by even slower oscillations (< 3 Hz, see Keitel et al., 2017; Meyer et al., 2017). Moreover, cortical speech tracking is functionally relevant for comprehension (Peelle and Davis, 2012), as listening to intelligible (vs. unintelligible) speech is often associated with more precise speech tracking (Luo and Poeppel, 2007).
Given the temporal similarities between speech perception and expanded inner speech, we ask whether neural oscillations play a similar role in inner speech as in overt speech perception. Unlike overt speech, inner speech is not an external signal for the brain to 'track' as such, but emerges directly from neural activity itself. A dominant theory proposes that inner speech is the perceptual consequence of intended articulation ( Jack et al., 2019 ;Scott, 2013 ;Whitford et al., 2017 ). Under this framework, phase modulation of neural oscillations in the auditory cortex may be induced and entrained by motor signals (efference copies) from the speech production system ( Assaneo and Poeppel, 2018 ). Since speech rhythms depend on motor constraints inherent in producing speech, the phase coupling between efference copies and neural oscillations would 'transfer' such dynamics to the auditory cortex, giving rise to a quasiperceptual experience of speech. Other theories suggest that expanded inner speech may be reactivated memories of speech  or that it is perceptually simulated from a remix of stored speech features ( Barsalou, 2008 ;Yao and Scheepers, 2015 ). In either case, oscillatory neural firing patterns that encode speech rhythms in perception would be reactivated endogenously, organizing phonological and prosodic representations in speech-like phase structures. Regardless of how expanded inner speech may be generated, its temporal features are likely encoded and modulated by neural oscillations in the auditory cortex.
The present study explored this conjecture. Given that overt speech is predominantly phase-locked to theta oscillations in perception (Assaneo and Poeppel, 2018; Giraud and Poeppel, 2012), we hypothesised a similar relationship for expanded inner speech. This seems plausible given its shared temporal features with overt speech (Stites et al., 2013), and the fact that the syllabic rate of expanded inner speech in English (~5.8 syllables per second) falls within the theta range of 4-7 Hz (Netsell et al., 2016). Although we cannot directly track the phase alignment between theta oscillations and inner speech over time, we can nonetheless measure theta phase-locking to the onset of inner speech, whose timing can be determined and aligned across trials in silent reading.
We tested these predictions with EEG (Experiment 1) via spontaneous induction of more vivid inner speech in silent reading of direct vs. indirect speech quotes ( Alderson-Day et al., 2020 ;Stites et al., 2013 ;Yao, 2021 ). We chose this paradigm because it addresses two methodological limitations in inner speech research. First, many previous studies have elicited inner speech via subvocalisation or phonological judgements of simple words or sentences, which lacks the temporal complexity and spontaneity of everyday inner speech ( Jones and Fernyhough, 2007 ). In contrast, the current paradigm induces naturalistic inner speech in silent reading without explicit instructions to imagine it, which is more ecologically valid than task-elicited inner speech ( Hurlburt et al., 2016 ). Second, many inner speech elicitation tasks such as phonological judgements and speech imagery are often confounded by aspects of language processing (e.g., orthographic, semantic, and syntactic processes). The current paradigm controls such language confounds by manipulating inner speech between visually and linguistically matched reading conditions. As such, differential effects between conditions can only be attributed to inner speech manipulations rather than other language processes involved in reading. In this paradigm, if quotation-induced inner speech is temporally aligned with theta activity, we should observe increased theta phase-locking to the onset of direct compared to indirect speech quotes, with sources of the phase-locked activity estimated in bilateral auditory cortices ( Binder et al., 2000 ;Hickok and Poeppel, 2007 ;Price et al., 1996 ;Scott and Johnsrude, 2003 ). To ensure that any such increased phase-locking reflects differences in phase modulation rather than power, we additionally manipulated the loudness of inner speech which was expected to affect signal amplitude rather than phase ( Tian et al., 2018 ). 
Finally, we verified that theta phase modulation could not be explained by eye movement patterns for direct and indirect speech quotes in a separate control experiment (Experiment 2).

Participants
Thirty-two native speakers of British English participated in the EEG study (10 male, 22 female, M age = 22.7, SD age = 6.4). All were right-handed, had normal or corrected-to-normal vision, no language or learning disorders, and no history of neurological or psychiatric disorders. They were paid £12 for 2 hours of their time. All participants gave written informed consent and the experimental procedure was approved by the University of Manchester Research Ethics Committee (ref: 16248).

Materials and experimental design
A 2 (Quotation Style: direct vs. indirect speech) × 2 (Loudness: loud- vs. quiet-speaking) within-subject design was used. One hundred and twenty quartets of short stories were written as reading materials. Each story (see Table 1 for an example) described a scenario containing either a direct speech quote (1a, 2a) or an indirect speech quote (1b, 2b). To provide a variety of scenarios, the contexts preceding the quotations described either a loud-speaking (1) or a quiet-speaking (2) scenario. Crucially, critical speech quotations were identical across contexts and were matched word for word between the direct and indirect speech conditions except for unavoidable tense and pronoun changes. This ensured that speech quotations were matched across conditions of each item for length (number of words and syllables) and other linguistic characteristics such as grammatical complexity, so as to isolate inner speech from potential linguistic confounds. In addition to the 120 critical test items, 60 filler stories (without experimental manipulations) were prepared to conceal the intended experimental manipulations. Of the 60 filler stories, 24 contained direct speech quotes, 12 contained indirect speech quotes, and the remaining 24 did not contain any quoted speech.
The 480 critical stories were allocated to four stimulus lists using a Latin square design. Each list contained 120 stories with 30 stories per condition, plus the 60 filler stories. The order of the stories per list was randomised for each participant. Each list was randomly assigned to 8 participants.
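For intuition, the Latin-square rotation of item versions across the four lists can be sketched as follows. This is an illustrative reconstruction, not the study's actual allocation script; all function and variable names are ours.

```python
def build_lists(n_items=120, n_conditions=4):
    """Rotate each item's condition across lists so that every list
    contains each item exactly once, with 30 items per condition."""
    lists = {k: [] for k in range(n_conditions)}
    for item in range(n_items):
        for lst in range(n_conditions):
            # Condition assignment rotates with the list index
            condition = (item + lst) % n_conditions
            lists[lst].append((item, condition))
    return lists

lists = build_lists()
```

Under this rotation, across the four lists every item appears once in each of the four Quotation Style × Loudness conditions, matching the counterbalancing described above.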

Task procedure
Participants were seated in a sound-attenuated and electrically shielded room to silently read a series of written stories. The experiment was run in OpenSesame (Mathôt et al., 2012). The visual stimuli were presented on a gray background in a 30-pixel sans-serif font on a 24-inch monitor (120 Hz, 1024 × 768 resolution) approximately 100 cm from the participant.
The experiment started with 5 filler trials to familiarise participants with the procedure, after which the remaining 120 critical trials and 55 filler trials were presented in a random order. Each trial began with the trial number for 1000 ms, followed by a fixation dot on the left side of the screen (where the text would start) for 500 ms. The story was then presented in five consecutive segments at the center of the screen. Participants silently read each segment in their own time, and pressed the DOWN key on a keyboard to continue to the next segment. The first three segments of each story described the story background. The 4th segment displayed the text preceding the speech quotation (e.g., After checking the upstairs rooms, Gareth bellowed:) and the 5th segment displayed the speech quotation (e.g., "It looks like there is nobody here!"). In about a third of the trials, a simple question (e.g., Was the house empty?) was presented to measure participants' comprehension, which participants answered by pressing the LEFT ('yes') or RIGHT ('no') key. Answering the question triggered the presentation of the next trial.
Participants were given a short break every 20 trials, and there were 8 breaks in total. The experiment lasted approximately 45-60 min.

EEG acquisition and preprocessing
EEG and EOG activity was recorded with an analog passband of 0.16-100 Hz and digitised at a sampling rate of 512 Hz using a 64-channel Biosemi Active-Two system. The 64 scalp electrodes were mounted in an elastic electrode cap according to the international 10/20 system. Six external electrodes were used: two were placed on bilateral mastoids, two were placed above and below the right eye to measure vertical ocular activity (VEOG), and another two were placed next to the outer canthi of the eyes to record horizontal ocular activity (HEOG). Electrode-offset values were kept between − 25 mV and 25 mV.
The recorded EEG data were preprocessed in EEGLAB v14.1.2b (Delorme and Makeig, 2004). All electrodes were referenced to the average of the two mastoids. EOG activity was calculated by subtracting the EOG signals within each pair of EOG electrodes. The data of the 66 channels (64 EEG + 2 EOG) were high-pass filtered at 0.3 Hz to remove slow drifts and were down-sampled to 200 Hz, as the raw data were recorded at a higher sampling rate than needed for the analysis. For each participant, 120 critical trials were segmented from −1200 to 2000 ms relative to the presentation onsets of speech quotations (segment 5). Artifact trials were automatically marked based on (1) whether the EEG amplitude exceeded ±100 μV in the [−200, 1000] ms time window across the 64 EEG channels, and (2) whether the probability of observing the trial's data was more than 5 standard deviations from the mean EEG values within each EEG channel and across all 64 EEG channels (Delorme et al., 2001). The marked trials were then visually reviewed to check for validity. Trials with common, ICA-removable artifacts (e.g., blink-related peaks that exceeded the ±100 μV threshold) and trials with artifacts outside the critical [−200, 1000] ms window were kept. This resulted in an overall mean trial loss of 2.2% (2.5%, 2.0%, 2.4% and 2.1% in the four conditions, respectively). The remaining trials were band-pass filtered at 2-25 Hz and submitted to an ICA to isolate eye movement and other artifacts. Using the 'runica' algorithm with the default options, the ICA included both EEG and EOG channels and produced 66 components in total. Artifact components were identified using a semi-automated procedure: components reflecting eye blinks and muscular artifacts were classified automatically using the EEGLAB extension 'MARA' (Winkler et al., 2011). Components for HEOG were additionally identified if their activation time-courses were strongly correlated with HEOG activity (|r| > 0.7).
All components were visually reviewed before being declared artifact signals and removed (the number of components removed per participant ranged from 3 to 23, M = 11.3, SD = 5.0). The resulting ICA weights (minus artifact components) were projected back onto the pre-ICA, 0.3-100 Hz-filtered data. The EOG channels were dropped from further analyses.
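The two automatic trial-marking criteria above can be illustrated with a minimal NumPy sketch. This is a simplification we wrote for exposition, not EEGLAB's actual routine (which uses a joint-probability measure rather than the mean-absolute-value statistic used here); the ±100 μV and 5 SD thresholds mirror the text.

```python
import numpy as np

def mark_artifact_trials(epochs, times, amp_thresh=100.0, prob_sd=5.0):
    """Flag trials for visual review.
    epochs: array (n_trials, n_channels, n_samples) in microvolts.
    times: 1-D array of sample times in ms."""
    win = (times >= -200) & (times <= 1000)
    # Criterion 1: absolute amplitude exceeds ±100 µV in the critical window
    amp_bad = np.abs(epochs[:, :, win]).max(axis=(1, 2)) > amp_thresh
    # Criterion 2 (simplified stand-in for the joint-probability rule):
    # a trial's mean absolute amplitude is > 5 SD from the across-trial
    # mean within any channel
    trial_stat = np.abs(epochs[:, :, win]).mean(axis=2)  # (trials, channels)
    z = (trial_stat - trial_stat.mean(0)) / trial_stat.std(0)
    prob_bad = (np.abs(z) > prob_sd).any(axis=1)
    return amp_bad | prob_bad
```

A flagged trial would then be reviewed, and kept if the artifact is ICA-removable or falls outside the critical window, as described above.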

Statistical analysis

Reading time analysis
Reading times for direct and indirect speech quotes (sentence segment 5) were measured from sentence presentation onset to participants' key presses. They were divided by the number of syllables to account for variation in sentence length. For this analysis only, we excluded trials with extreme reading times longer than 500 ms per syllable (1.8% data loss). This cut-off was selected based on the distribution of reading times per syllable and on the fact that fixation durations in silent reading of English rarely exceed 500 ms (Rayner, 1998). To check for systematic differences in reading times between conditions, a Gamma generalised linear mixed model of RTs per syllable was fitted using the glmer function in the lme4 package (Bates et al., 2015) in R. We included a full factorial fixed-effect structure with deviation-coded Quotation Style and Loudness, and a maximal random-effect structure with Subject and Item as crossed random factors.
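The data preparation steps (per-syllable RTs, the 500 ms/syllable cut-off, and deviation coding of the two factors) can be sketched in pandas. The column names and values are hypothetical; the model itself was fitted in R with lme4, so only the preprocessing is shown here.

```python
import pandas as pd

# Hypothetical trial-level data; column names are ours, not the study's.
df = pd.DataFrame({
    "rt_ms":     [2400, 3600, 5200, 1800],
    "syllables": [12,   12,   8,    9],
    "quote":     ["direct", "indirect", "direct", "indirect"],
    "loudness":  ["loud", "quiet", "quiet", "loud"],
})
df["rt_per_syll"] = df["rt_ms"] / df["syllables"]
# Deviation coding (+0.5 / -0.5) for the two two-level factors
df["quote_dev"] = df["quote"].map({"direct": 0.5, "indirect": -0.5})
df["loud_dev"] = df["loudness"].map({"loud": 0.5, "quiet": -0.5})
# Exclude extreme trials (> 500 ms per syllable), as in the text
df = df[df["rt_per_syll"] <= 500]
```

With deviation coding, each fixed-effect estimate reflects a main effect averaged over the levels of the other factor, which matches the factorial F-tests reported below.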

Sensor space time-frequency analysis
At the subject level, time-frequency analyses of single-trial EEG data (to calculate intertrial phase-locking values (PLVs) and total power) were conducted using Morlet wavelet decomposition with seven cycles per wavelet, at frequencies from 1 to 30 Hz, in SPM12 ( http://www.fil.ion.ucl.ac.uk/spm/ ). PLV, also referred to as intertrial phase coherence (ITC), measures the variability of the relative phases over trials. It takes values from 0 to 1, with 0 reflecting no phase synchrony across trials and 1 reflecting identical phase in all trials (see equation 6 in Aydore et al., 2013). Total power averages power over trials and takes positive values (0 = no signal). Both PLVs and total power were calculated at each time-frequency point and channel, with the latter being log-scaled and baseline-corrected to the [−200, 0] ms window. They were averaged by condition using a robust averaging procedure in which statistical outliers in narrow time and frequency ranges were down-weighted without rejecting whole trials (Litvak et al., 2011). The averaged PLV and total power were then converted into NIfTI images by condition for each participant, comprising topography × time images for the theta-band analysis and time × frequency images for the broad-band analysis. These images are 3D (x, y, time) or 2D (time, frequency) matrices of averaged PLV and power. For topography × time images, the 2D representation of the topography was created by projecting the sensor locations onto a plane, before interpolating the data linearly between them onto a 32 × 32 pixel grid. The converted NIfTI images were submitted to a general linear model for statistical analysis using statistical parametric mapping (SPM) to compare conditions over time, frequency, and topographical space.
Gaussian smoothing was applied to the scalp × time volumes to accommodate spatial/temporal variability over subjects and ensure the images conform to the assumptions of the topological inference approach ( Litvak et al., 2011 ).
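For intuition, the PLV computation described above can be reduced to a few lines of NumPy: convolve each trial with a seven-cycle complex Morlet wavelet, normalise each time point to a unit phase vector, and take the magnitude of the across-trial mean. This is a minimal stand-in for SPM12's implementation, which differs in detail (e.g., wavelet normalisation and edge handling).

```python
import numpy as np

def morlet_plv(epochs, sfreq, freq, n_cycles=7):
    """Intertrial phase-locking value (PLV / ITC) at a single frequency.
    epochs: array (n_trials, n_samples). Returns PLV per time point:
    0 = no phase synchrony over trials, 1 = identical phase."""
    t = np.arange(-1, 1, 1 / sfreq)              # wavelet support (s)
    sigma_t = n_cycles / (2 * np.pi * freq)      # temporal SD of the Gaussian
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2 * sigma_t**2))
    wavelet /= np.abs(wavelet).sum()
    analytic = np.array([np.convolve(tr, wavelet, mode="same") for tr in epochs])
    phases = analytic / (np.abs(analytic) + 1e-12)  # unit phase vectors
    return np.abs(phases.mean(axis=0))

# Demo: 30 trials of a 5 Hz signal with identical phase give PLV near 1
sfreq, f = 200, 5
tt = np.arange(0, 3, 1 / sfreq)
locked = np.array([np.sin(2 * np.pi * f * tt) for _ in range(30)])
plv = morlet_plv(locked, sfreq, f)
```

Trials with random phases would instead yield PLV near zero (approximately 1/√n_trials), which is what distinguishes phase resetting at quotation onset from unsynchronised ongoing activity.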

Planned theta-band analysis.
Topography × time maps were generated by averaging PLVs and total power within the theta (4-7 Hz) frequency band. Data were interpolated to create a 32 × 32 pixel (4.3 mm × 5.4 mm per pixel) scalp map for each time point from −200 to 1000 ms relative to the quotation onset (i.e., when the critical speech quotation was presented). Topographic images were stacked to create a 3D space-time image volume. The volume was smoothed with a Gaussian kernel with FWHM = [16 mm, 16 mm, 16 ms], about three times the voxel size, in accordance with the assumptions of Random Field Theory (Kiebel and Friston, 2004; Worsley et al., 1996).
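The smoothing step amounts to converting each FWHM to a Gaussian sigma in voxel units (sigma = FWHM / (2√(2 ln 2)) / voxel size) and filtering the stacked volume. A sketch with SciPy, assuming a 5 ms time step from the 200 Hz down-sampled data (the per-axis voxel sizes follow the text; the function name is ours):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_scalp_time(volume, fwhm=(16.0, 16.0, 16.0), voxel=(4.3, 5.4, 5.0)):
    """Gaussian-smooth a stacked (x, y, time) scalp-map volume.
    fwhm is in (mm, mm, ms); voxel gives the per-axis voxel size in the
    same units (the 5 ms time step is our assumption from 200 Hz data)."""
    fwhm_to_sigma = 1.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # ~0.4247
    sigma = [f * fwhm_to_sigma / v for f, v in zip(fwhm, voxel)]
    return gaussian_filter(volume, sigma=sigma)
```

Note that a 16 ms FWHM over 5 ms time bins is indeed roughly three times the voxel size, consistent with the Random Field Theory guideline cited above.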

Broad-band analysis.
To verify that the observed results reflect theta-specific rather than broad-band phase-locked responses, a full time-frequency analysis was conducted. Time × frequency maps were generated, averaging PLVs and total power across all channels for each integer frequency from 1 to 30 Hz, and for each time point from − 200 to 1000 ms relative to the quotation onset. No smoothing was applied to ensure precise estimation of PLV and total power in the frequency dimension.

Group analysis.
Group-level analyses used F-tests to assess the effects of Quotation Style and Loudness on PLVs and total power across the scalp and time, and across time and frequency. The resulting mass-univariate SPMs entail a statistical test at each of tens of thousands of voxels and therefore require correction for multiple comparisons. Family-wise error (FWE) correction was applied at the cluster level at p < .05, with a cluster-defining threshold of p < .001 (i.e., clusters consisted of voxels 'surviving' this uncorrected threshold), using Random Field Theory (Flandin and Friston, 2017), which takes image smoothness into account.
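The first step of this procedure, forming clusters of voxels that survive the cluster-defining threshold, can be sketched with SciPy. The Random Field Theory correction itself (converting cluster extents to FWE-corrected p-values given the image smoothness) is handled by SPM and is not reproduced here.

```python
import numpy as np
from scipy import ndimage, stats

def clusters_above_threshold(f_map, df1, df2, p_thresh=0.001):
    """Label contiguous clusters of voxels whose F values exceed the
    uncorrected cluster-defining threshold (p < .001 by default).
    Returns the label image and the size of each cluster."""
    f_crit = stats.f.ppf(1 - p_thresh, df1, df2)
    supra = f_map > f_crit
    labels, n = ndimage.label(supra)
    sizes = ndimage.sum(supra, labels, index=range(1, n + 1))
    return labels, np.asarray(sizes, dtype=int)
```

In the full pipeline, each cluster's extent would then be compared against an RFT-derived critical size to declare cluster-level significance at p < .05.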

Source space analysis
Source estimation was performed on single-trial time-domain data using a template cortical mesh (Mattout et al., 2007). The mesh consists of 8196 nodes, tessellating the gray/white matter boundary of a single individual, with a mean inter-node distance of 4 mm. The neural generators were constrained to a lattice of dipoles on the cortical mesh, oriented perpendicular to its surface. A forward model was defined using a Boundary Element Method (BEM), which was inverted under the minimum norm (MN) hyperprior model. The MN model was chosen over the multiple sparse priors (MSP) model because it deploys reconstructed activity in a non-focal fashion, and hence is most resilient against intersubject variability in group analysis (Litvak and Friston, 2008). Estimated source activity was summarised in 3D NIfTI images per condition and subject, which were smoothed with a Gaussian kernel (FWHM = [8 mm, 8 mm, 8 mm]). The time/frequency window was selected based on the sensor-level results. The images were then submitted to group-level F-tests to assess the effects of Quotation Style and Loudness. As in the sensor space analysis, the resulting SPMs were family-wise error (FWE) corrected for multiple comparisons at the cluster level (p < .05, with a cluster-defining threshold of p < .001) using Random Field Theory (Flandin and Friston, 2017).
To verify theta phase synchrony in source space, we extracted source time-courses using a 5-mm spherical ROI at the peak voxel of each cluster. The time series underwent the same time-frequency transformation and averaging as in the sensor data analysis, estimating PLVs in the same time/frequency window used for the inversion.
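Selecting the mesh vertices that fall inside a 5-mm spherical ROI around a peak is a simple distance computation. A sketch (the vertex coordinates and peak are hypothetical; the actual extraction was done on the SPM cortical mesh):

```python
import numpy as np

def spherical_roi(vertices_mm, peak_mm, radius=5.0):
    """Return indices of source-space vertices within `radius` mm of a
    peak coordinate. vertices_mm: array (n_vertices, 3) in MNI mm."""
    d = np.linalg.norm(vertices_mm - np.asarray(peak_mm, dtype=float), axis=1)
    return np.where(d <= radius)[0]
```

The single-trial time series of the selected vertices would then be averaged and passed through the same Morlet/PLV pipeline as the sensor data.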

Results and discussion
All participants were debriefed after the experiment. They found the experiment 'interesting' and 'enjoyable' but none consciously noticed the experimental manipulations on quotation styles or loudness.

Reading time analysis
Mean reading times per syllable and comprehension question accuracies are summarised by condition in Table 2 . The GLMMs of RTs/syllable and comprehension question accuracies showed no significant effects of Quotation Style, Loudness or their interaction ( p s > 0.15), suggesting that reading times and comprehension were statistically indistinguishable between conditions.

Sensor space analysis
We observed significant main effects of Quotation Style in both thetaband and broad-band analyses.
In theta frequencies (4-7 Hz), we found significantly higher PLV over trials in silent reading of direct speech than indirect speech quotes from approximately 250-500 ms following sentence presentation onsets. The increased phase-locked effects were clustered in the left temporal and parietal channels ( Fig. 1 top left). No significant Loudness main effects or Quotation Style × Loudness interaction were observed. In terms of total power, we did not find significant power differences between direct and indirect speech in the topography × time analysis ( Fig. 1 bottom  left), nor did we find any significant Loudness main effects or Quotation Style × Loudness interaction.
In the broad-band (time-frequency) analysis across all channels, we observed increased PLV at 5 Hz but not at other frequencies between 1 and 30 Hz (Fig. 1, top right). No significant Loudness main effects or Quotation Style × Loudness interaction were observed. In comparison, we did not observe significant power differences between direct and indirect speech (Fig. 1, bottom right), nor did we observe any significant Loudness main effects or Quotation Style × Loudness interaction.

The results support the hypothesised higher phase synchrony over trials after the reading onset of direct relative to indirect speech quotes. The increased phase synchrony is specific to theta frequencies, particularly at 5 Hz. This phase-locked effect did not coincide with increases in total power, which suggests that neuronal responses to direct and indirect speech reading differ predominantly in phase modulation of theta activity. The lack of Loudness effects in either phase or power suggests that (1) inner speech in silent reading of direct quotes may not contain detailed loudness information and that (2) the loudness of inner speech may only be detectable indirectly using explicit imagery tasks and neural adaptation paradigms (Tian et al., 2018).
It is worth noting that the PLV topographies in our study are more left-lateralised than in other auditory studies in the literature. For example, the auditory evoked N1 response is typically associated with a fronto-central topography and can be affected by corollary discharge (e.g., Rosburg et al., 2008; Tian et al., 2018). We speculate that internally generated speech at the sentence level may produce different topographies from externally perceived auditory stimuli. While binaural auditory stimulation may produce more bilateral responses, inner speech may be internally generated from a more left-lateralised language network. Moreover, the impoverished spectro-temporal detail of inner speech means that it may engage a slightly different location in the STG/STS than, e.g., bilateral primary auditory cortex, which may result in a slightly different dipole orientation and a different projection to the scalp. Thus, to estimate the possible sources of the theta phase-locked effects, we conducted further analyses in source space.

Source space analysis
We performed source estimation (see Methods) and summarised the phase-locked source energy in a time (250-500 ms from sentence presentation onset) and frequency (4-7 Hz) window, based on the sensor space results. We observed different source activity between direct and indirect speech quotes in the left occipito-temporal and fusiform area (BA37), bilateral ventral and middle temporal areas (BA20/21), and the left inferior and middle frontal area (BA45/46). The source activity difference (thresholded at p < .001, uncorrected) is illustrated in Fig. 2. The MNI coordinates of the significant peaks and sub-peaks are provided in Table 3 to indicate the cortical regions that are likely to have contributed to the phase-locked effects observed on the scalp. These peaks are several centimetres apart, and sources at these locations should be easily distinguishable via the minimum norm estimation method.

Fig. 2. Direct Speech > Indirect Speech t-contrast for source activity at 250-500 ms and 4-7 Hz.
Note: also shown are three ROIs for the source space phase-locking analysis.

Table 3
Whole-brain coordinates for the Direct Speech > Indirect Speech t-contrast for source activity at 250-500 ms and 4-7 Hz, thresholded at p < .001 uncorrected. Note: L = left, R = right; X, Y, Z are coordinates in MNI space; k = cluster size; t = t value; z = z value; p cluster = p value at the cluster level.

It is worth noting that source localization errors for 64-channel EEG can amount to ~3 cm in distance (Song et al., 2015). As such, these peaks may not reflect the exact location of the phase-locked source energy, but general estimates of where it may originate from. These general estimates and their corresponding interpretations would not be affected by location errors on the order of ~3 cm.

The source space analysis, together with intertrial phase-locking analysis of the source ROIs (i.e., the peak voxel in each of the three clusters; Table 3), confirmed that the sources of increased theta phase-locked responses in direct speech can be estimated in the bilateral temporal cortices (superior temporal sulcus, medial and inferior parts of the temporal lobes) and the left inferior frontal areas. These areas are broadly in line with neural correlates of inner speech identified in previous fMRI studies (Alderson-Day et al., 2016, 2020). However, they do not quite match a typical motor-to-auditory corollary discharge circuit, which often involves more posterior parts of the inferior frontal cortex (e.g., pars opercularis) for speech planning, and somatosensory cortex and/or secondary auditory cortex for corollary discharge (Tian and Poeppel, 2013). Rather, they agree more with a memory retrieval-based simulation circuit in which lexico-semantic information and episodic memories in the prefrontal, medial and inferior temporal, and inferior parietal regions are retrieved and transformed into auditory representations of speech (Price, 2012; Tian et al., 2016).
Notably, the core inner speech circuit is complemented by additional visual processes in the occipital and fusiform areas (cf. similar occipito-fusiform fMRI activations reported in previous work). One possibility is that direct speech quotations are associated with more vivid multisensory simulations that include both auditory and visual aspects of the protagonist speaking. However, because occipito-fusiform activations are not observed when listening to direct vs. indirect speech quotations (Yao et al., 2012), these visual processes may be specific to reading. Increased theta phase synchrony in the fusiform areas may implicate greater phase-locked visual word form processing and orthographic-phonological conversion to inner speech, which may be necessary for inner speech to occur in silent reading.
Although the above interpretations may be plausible, one could also argue that differential theta phase-locked activity in occipito-parietal and temporal regions may be driven not by differential vividness of inner speech or grapheme-to-phoneme conversion, but by different eye movement patterns in direct vs. indirect quote reading. Theoretically, the phase of an EEG signal can be affected by oculomotor and visual processes in three ways. First, phase resetting can be caused by oculomotor and muscular activity at saccade onset (Berg and Scherg, 1991). Second, visually evoked responses following fixation onset can change the ongoing phase (Ossandón et al., 2010). Third, phase can be disturbed by subsequent saccades if fixation durations are not equally distributed between conditions (Nikolaev et al., 2016).
To rule out potential eye movement-related effects on theta phase synchrony, Experiment 2 recorded eye movements in silent reading of the same direct vs. indirect speech quotes. We compared the distribution of saccades preceding and following first fixations to test the effects of oculomotor and muscular activity on phase resetting, and compared the distribution of first fixation onsets and durations to examine the effects of visually evoked responses on phase. Increased theta phase synchrony in silent reading of direct speech quotes could be explained by more concentrated distribution (less dispersion) of first fixations and/or neighbouring saccades, time-locked to sentence presentation onset. If this distribution is not statistically distinguishable between conditions, it would suggest that the increased theta phase synchrony could not be driven by reading differences between conditions, and would be more plausibly explained by increased inner speech when reading direct quotes.
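One way to quantify whether one condition's first-fixation onsets are more concentrated (less dispersed) than the other's is a test for equality of variances. The sketch below uses simulated onset times purely for illustration; the onset values, group sizes, and the choice of Levene's test are our assumptions, not the study's reported analysis.

```python
import numpy as np
from scipy import stats

# Hypothetical first-fixation onsets (ms from sentence presentation onset).
# Equal dispersion between conditions would argue against an oculomotor
# account of the increased theta phase synchrony.
rng = np.random.default_rng(42)
direct = rng.normal(210, 30, size=120)
indirect = rng.normal(210, 30, size=120)
stat, p = stats.levene(direct, indirect)  # tests equality of variances
```

A non-significant result here (together with matched saccade distributions) would indicate that first fixations are similarly time-locked to sentence onset in both conditions.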

Participants
Twenty-four native speakers of British English who did not participate in Experiment 1 participated in the eye tracking control experiment (12 male, 12 female, M age = 24.3, SD age = 6.1). The inclusion criteria were identical to the EEG experiment. They were paid £6 for one hour of their time. All participants gave written informed consent and the experimental procedure was approved by the University of Manchester Research Ethics Committee (ref: 16248).

Materials and experimental design
The materials and experimental design were identical to the EEG experiment.

Task procedure
The task procedure was identical to the EEG experiment except that the experiment was conducted in an eye tracking lab. A SR-Research EyeLink 1000 eye tracker was used, running at 500 Hz sampling rate. Viewing was binocular but only the right eye was tracked. A chin rest was applied to keep the viewing distance constant and to prevent strong head movements during reading.

Eye movement data preprocessing
Raw EDF data files were first converted into ASC files using a file converter provided by SR Research. Fixation and saccade events as well as timestamps for the trial ID and sentence presentation onsets were then extracted.
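The event extraction described above can be sketched as a small parser over EyeLink ASC event lines (`EFIX`, `ESACC`, `MSG`). This is a minimal illustration, not the script used in the study; in particular, the onset message label `SEGMENT5_ONSET` is a hypothetical placeholder for whatever marker was actually written to the data stream during the experiment.

```python
def parse_asc_events(path, onset_label="SEGMENT5_ONSET"):
    """Extract fixation/saccade events and stimulus-onset messages from an
    EyeLink ASC file (converted from EDF with SR Research's edf2asc)."""
    fixations, saccades, onsets = [], [], []
    with open(path) as f:
        for line in f:
            tok = line.split()
            if not tok:
                continue
            if tok[0] == "EFIX":
                # EFIX <eye> <start> <end> <dur> <x> <y> <pupil>
                fixations.append({"start": int(tok[2]), "end": int(tok[3]),
                                  "dur": int(tok[4]),
                                  "x": float(tok[5]), "y": float(tok[6])})
            elif tok[0] == "ESACC":
                # ESACC <eye> <start> <end> <dur> <sx> <sy> <ex> <ey> <amp> <pv>
                saccades.append({"start": int(tok[2]), "end": int(tok[3])})
            elif tok[0] == "MSG" and onset_label in line:
                # MSG <timestamp> <label> -- marks a presentation onset
                onsets.append(int(tok[1]))
    return fixations, saccades, onsets
```

With events and onset timestamps in hand, fixation/saccade latencies relative to segment presentation (as analysed below) reduce to simple timestamp subtractions.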

Results and discussion
Similar to Experiment 1, all participants were debriefed after the experiment and none of them consciously noticed the experimental manipulations of quotation style or loudness.

Reading time analysis
As per Experiment 1, we excluded trials with extreme reading times that were longer than 500 ms per syllable (2.3% data loss) for the reading time analysis only. Mean reading times per syllable and comprehension question accuracies are summarised in Table 4 . The GLMM of RTs/syllable showed no significant effects ( p s > 0.21), suggesting that reading performance was statistically indistinguishable between conditions. The GLMM of comprehension question accuracies showed no effect of Quotation Style or Quotation Style × Loudness interaction ( p s > 0.54). However, it did reveal a significant Loudness main effect, b = 0.69, z = 2.32, p = .021, with answers to comprehension questions being more accurate following loud-speaking scenarios (97.2%) than quiet-speaking scenarios (94.5%). This 'loudness advantage' in comprehension mirrored a similar numerical trend in Experiment 1, which did not reach significance there.
Since this main effect was not directly relevant to our central contrast of direct vs. indirect speech, it was not explored any further. Overall, reading time and comprehension performance in Experiment 2 are comparable to those in Experiment 1.

Eye movement analysis
A single ROI around sentence segment 5 (the direct and indirect speech quotes) was created. First fixations were defined as the first fixations landing within this ROI. Pre-first-fixation saccade onsets, first fixation onsets and durations, as well as post-first-fixation saccade onsets (i.e., first fixation offsets) were calculated relative to the presentation onset of sentence segment 5. Pre-first-fixation saccades that started before the stimulus presentation were included in the analysis but not shown in the distribution plots in Fig. 3 .
As we were primarily interested in statistical dispersion differences of eye movement distributions, standard deviations of the four eye movement measures were summarised by condition at the subject level. Their means were also calculated to check for any fixation/saccade latency differences between conditions. The by-subject standard deviations and means were submitted to paired-sample t -tests at the group level for statistical comparisons. To evaluate the evidence for the alternative (H1) hypothesis, Bayes Factors ( BF 10 ) were also calculated at the group level using the Jeffreys-Zellner-Siow (JZS) prior ( Bayarri and Garcia-Donato, 2007 ). Both descriptive and inferential statistics are reported in Table 5 .
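As an illustration, the group-level comparison described above can be sketched as a paired t-test followed by a JZS Bayes factor computed via numerical integration (following the formulation in Rouder et al., 2009). This is a minimal sketch on simulated data, not the study's analysis script; the Cauchy prior scale r = √2/2 is an assumption, as the exact prior settings and software used are not specified here.

```python
import numpy as np
from scipy import integrate, stats

def jzs_bf10(t, n, r=np.sqrt(2) / 2):
    """One-sample / paired JZS Bayes factor BF10: Cauchy(0, r) prior on
    effect size under H1, point null under H0 (Rouder et al., 2009)."""
    nu = n - 1
    # Marginal likelihood under H0 (up to a constant shared with H1)
    h0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    # Under H1, integrate over g with an inverse-chi-square(1) prior
    def integrand(g):
        if g <= 0:
            return 0.0
        a = 1 + n * g * r**2
        return (a ** -0.5
                * (1 + t**2 / (a * nu)) ** (-(nu + 1) / 2)
                * (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g)))
    h1, _ = integrate.quad(integrand, 0, np.inf)
    return h1 / h0

# Example: by-subject SDs of first-fixation onsets in the two conditions
# (simulated here with no true difference between conditions)
rng = np.random.default_rng(0)
direct = rng.normal(80, 15, 24)
indirect = direct + rng.normal(0, 5, 24)
t_stat, p_val = stats.ttest_rel(direct, indirect)
bf10 = jzs_bf10(t_stat, n=24)  # BF10 < 1/3 would substantially favour H0
```

A BF10 below 1/3 (as reported in Table 5) indicates that the data are at least three times more likely under the null than under the alternative, which is the sense in which the eye movement measures provide positive evidence for no condition difference.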
There were no significant differences in any of the eye movement measures between silent reading of direct and indirect speech quotes, | t (23)|s < 0.895, p s > 0.384, BF 10 s < 0.309. All BF 10 s were below 1/3, providing substantial evidence for H0 over H1 ( Jeffreys, 1998 ). As such, Experiment 2 verified that the increased theta phase synchrony over trials in direct speech quotes could not be attributed to differences in reading patterns between direct and indirect speech quotes, and was more likely to reflect differences in inner speech.

General discussion
Motivated by findings on the phase alignment between theta activity and overt speech ( Assaneo and Poeppel, 2018; Giraud and Poeppel, 2012 ), the current study tested a similar phase relationship between theta activity and expanded inner speech by focusing on theta phase synchrony over trials at the onset of reading-induced inner speech. We used an established paradigm to induce perceptually vivid inner speech during silent reading of direct (vs. indirect) speech quotes ( Alderson-Day et al., 2020; Brück et al., 2014; Stites et al., 2013; Yao, 2021 ). Using EEG (Experiment 1), we observed increased phase synchrony over trials, but no change in power, in theta frequencies (4-7 Hz) at the onset of direct over indirect speech quotes. Differential phase-locked source activity was also observed between direct and indirect speech quotes in the left occipito-temporal and fusiform area (BA37), bilateral ventral and middle temporal areas (BA20/21), and the left inferior and middle frontal area (BA45/46). In contrast, no Loudness effect was observed in phase synchrony, in line with previous findings that the loudness of inner speech affects EEG amplitude rather than phase. However, the Loudness modulation of power was not observed in the current study either, suggesting that the loudness of inner speech may be impoverished in silent reading, and may only be fully activated and reliably detectable during explicit imagery tasks using neural adaptation paradigms ( Tian et al., 2018 ). Using eye tracking (Experiment 2), we found no eye movement differences at the reading onset of direct vs. indirect speech quotes, ruling out the possibility that the theta phase-locking differences were driven by differential eye movement patterns.
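For readers unfamiliar with the measure, phase synchrony over trials (inter-trial phase coherence, ITC) can be sketched as the length of the trial-averaged unit phasor of instantaneous theta phase. The sketch below uses a band-pass + Hilbert approach on simulated single-channel epochs; it is an illustration of the measure only, and the study's actual time-frequency pipeline may differ.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def itc_theta(trials, fs, band=(4.0, 7.0)):
    """Inter-trial phase coherence (ITC) of band-limited activity.
    trials: (n_trials, n_samples) epochs time-locked to quote onset.
    Returns ITC per time point, bounded in [0, 1]."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    theta = filtfilt(b, a, trials, axis=1)       # theta band-pass
    phase = np.angle(hilbert(theta, axis=1))     # instantaneous phase
    # ITC = modulus of the trial-averaged unit phasor e^{i*phase}
    return np.abs(np.exp(1j * phase).mean(axis=0))

# Simulated demo: 40 trials with a consistent 5 Hz phase reset at onset
# vs. 40 trials whose 5 Hz phase varies randomly from trial to trial.
fs = 250
t = np.arange(-0.2, 1.0, 1 / fs)
rng = np.random.default_rng(1)
locked = np.sin(2 * np.pi * 5 * t)[None, :] \
    + 0.5 * rng.standard_normal((40, t.size))
jitter = rng.uniform(0, 2 * np.pi, 40)[:, None]
unlocked = np.sin(2 * np.pi * 5 * t[None, :] + jitter) \
    + 0.5 * rng.standard_normal((40, t.size))
itc_locked = itc_theta(locked, fs)      # high: phases align over trials
itc_unlocked = itc_theta(unlocked, fs)  # low: phases uniformly spread
```

Note that both simulated conditions carry identical 5 Hz power; only the trial-to-trial phase consistency differs, which mirrors the dissociation between phase synchrony and power reported above.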
In particular, the differential theta phase synchrony cannot be explained by reading patterns between direct and indirect speech quotes. It is known that oral readers typically insert a pause before a direct speech quote (e.g., She said: [pause] "I am hungry! ") but not before an indirect speech quote (e.g., She said that she was hungry ) ( Yao, 2011 ). A similar pause may take place during silent reading of direct speech quotes and cause phase resetting at the onset of direct but not indirect speech quotes. We took precautions by using a self-paced reading paradigm where a key press was required, thereby creating an artificial pause before the presentation of speech quotes in both direct and indirect speech conditions. Even if a reading pause did precede direct speech quotes, it would have been equalised by the artificial pauses preceding both conditions. This reading paradigm worked as intended as there was no statistical difference in reading times between conditions in the EEG or the eye tracking experiment. The latter experiment further verified that eye movements (first fixations and the neighbouring saccades) in quotation reading were statistically indistinguishable between direct and indirect speech quotes.
The increased theta phase synchrony over trials at the onset of direct quote reading is therefore more plausibly explained by increased inner speech processing. First, previous research has consistently demonstrated more vivid inner speech during silent reading of direct rather than indirect speech quotes ( Stites et al., 2013; Scheepers, 2011, 2018 ). Second, the present study observed increased phase synchrony over trials in theta frequencies only, and at 5 Hz in particular, but did not observe theta power differences. Our findings are consistent with the speech tracking literature, which typically reports entrainment and resetting of theta phase but not theta power ( Luo and Poeppel, 2007; Peelle et al., 2013 ) and reports optimal auditory-motor synchrony in the theta range, particularly at ~4.5 Hz ( Assaneo and Poeppel, 2018; Giraud and Poeppel, 2012 ). Third, the sources of the increased phase-locked activity were estimated in bilateral temporal cortices, the left inferior frontal gyrus, and the occipito-temporal areas. These regions have been respectively associated with auditory speech processing ( Binder et al., 2000; Hickok and Poeppel, 2007; Price et al., 1996; Scott and Johnsrude, 2003 ), covert articulation/verbal working memory ( Paulesu et al., 1993; Shergill et al., 2001, 2002 ), and orthographic-phonological conversion ( Blomert, 2011; Hashimoto and Sakai, 2004 ), all of which are plausible components of inner speech in reading ( Alderson-Day and Fernyhough, 2015 ). Although the exact roles of these brain regions in inner speech remain to be established, they are nonetheless compatible with an inner speech account of the observed theta phase results.
However, it remains inconclusive whether the increased theta phase synchrony reflects greater evoked responses ( Obleser et al., 2012 ) at the onset of inner speech or greater phase modulation of ongoing oscillations ( Luo and Poeppel, 2007 ). One possibility is that evoked responses in direct speech quotes may reflect heightened motor imagery (of vocalization) during inner speech. Given that covert articulation is a key component of inner speech ( Alderson-Day and Fernyhough, 2015 ), readers may engage in stronger or more effortful subvocalization of direct speech quotes, particularly when they are loudly rather than quietly spoken. However, no loudness effects were observed. Moreover, the observed direct speech effects were neither detected in beta frequencies, which are typically modulated by motor imagery ( Kühn et al., 2006 ), nor observed in articulation-related motor areas (e.g., the premotor cortex) in the source analysis. It was therefore unlikely that the increased theta activity in direct speech quotes was driven by motor imagery. A second possibility relates to potentially greater speech monitoring or attention in inner speech ( Perrone-Bertolotti et al., 2014 ). In direct speech quote reading, higher degrees of self-monitoring or attention may be required to ensure that distinct phonological features are activated to represent the quoted speaker's voice, rather than one's own ( Clark and Gerrig, 1990 ). This may prepare the 'ventral' speech processing network ( Hickok and Poeppel, 2007, 2016 ) to anticipate distinct inner speech and thereby facilitate comprehension of the quoted speech. As the speech processing network is most sensitive at theta frequencies ( Giraud and Poeppel, 2012 ), increased phase-locked responses in this range may reflect a transient burst of anticipatory signals for distinctly vivid inner speech before the onset of direct quote reading. However, this explanation is not well supported by the data.
If self-monitoring/attention needs to be maintained during distinctly vivid inner speech, power increases should be sustained throughout direct quote reading. Yet, although both direct and indirect quote reading were associated with alpha power decreases, no significant power differences were observed between the two conditions. Moreover, the increased phase synchrony for direct speech quotes was detected at 250-500 ms post-reading onset but not before, suggesting that it was not anticipatory in nature. Finally, the lack of loudness effects suggests that the observed differences in theta phase synchrony between direct and indirect speech quotes were more likely to reflect distinct phase (rather than power) modulation of theta activity. Thus, the increased phase synchrony is most likely to reflect increased phase resetting of theta oscillations ( Luo and Poeppel, 2007 ), signalling the start of more vivid inner speech in direct speech reading. Just as theta oscillations encode the rhythms of overt speech, they may also encode the rhythms of inner speech in direct speech reading. Although phonological representations are matched almost word for word between direct and indirect speech reading, they may be arranged in different rhythms. While the indirect speech rhythm may follow the default rhythm of reading, the direct speech rhythm may deviate from it in a more speech-like arrangement, giving rise to a more distinct, vivid speech percept than one's default 'reading voice'. Such a rhythmic deviation in direct speech would necessarily cause a reset of ongoing oscillations at the onset of reading, resulting in increased theta phase synchrony over trials. Because different sentences were used, the exact rhythmic structures of inner speech varied considerably across trials.
While intertrial phase patterns may be more synchronous at the start of direct speech reading (due to increased phase resetting), they would become increasingly asynchronous as reading continues (given the varying sentence structures) and eventually become indistinguishable from the intertrial phase patterns in indirect quote reading.
Several open questions remain to be addressed. One concerns whether inner speech has an 'envelope' similar to that of overt speech. Indeed, word skipping and regressive eye movements in silent reading mean that inner speech could be more fragmented and scrambled, and may not necessarily exhibit the same envelope structure as overt speech in perception ( Brumberg et al., 2016 ). Given current technology, the 'envelope' of inner speech is not objectively measurable, and we are not yet able to provide more direct evidence for cortical tracking of such an 'envelope' as in overt speech. Future research will need to characterize the temporal structures of inner speech in different tasks and develop new methods to directly measure inner speech tracking.
A second open question concerns why increased phase synchrony in inner speech is not observed at higher frequencies than those typically observed in overt speech (i.e., theta frequencies). It is commonly recognised that silent reading (and hence inner speech) is faster than overt speech because it does not involve explicit articulation ( Alderson-Day and Fernyhough, 2015 ). It therefore seems counterintuitive that inner and overt speech would share similar timescales in an electrophysiological context. However, although inner speech is evidently faster than overt speech, the rate difference is relatively small. Recent research shows that the rate of expanded inner speech (phonologically detailed inner speech) is only ~11% faster than overt speech (5.8 Hz vs. 5.2 Hz), which is still within the theta range ( Netsell et al., 2016 ). Moreover, word skipping during silent reading ( Rayner, 1998 ) means that only key parts of a sentence are 'sampled' and converted into inner speech. This kind of 'fragmented' inner speech may give the illusion of being faster (as it covers the same amount of text in a shorter time) but may still possess the temporal properties of overt speech. To test whether the frequency of phase-locking depends on the rate of inner speech, future research will need to examine whether phase-locked responses would be observed at higher frequencies for fast vs. slow inner speech.
A third question concerns how individual differences in the vividness of inner speech may account for the presence or absence of effects. Vividness of inner speech varies greatly between individuals ( Alderson-Day et al., 2018 ), and individuals' sensitivity to the reading of direct speech quotes is also likely to differ. Through our post-experiment conversations with participants, we learned that some consciously imagined vivid voices during speech quote reading while others had no conscious awareness of an internal voice; some imagined specific people's voices (e.g., those of friends or family) for the quoted speakers while others used their own inner voice during reading. These different levels of awareness and uses of inner speech inevitably introduced noise into our data, which may have rendered weaker effects (e.g., loudness) more difficult to detect. That being said, we did observe significantly increased theta phase-locking during silent reading of direct (vs. indirect) speech quotes, highlighting that phase modulation may be a relatively robust and universal consequence of expanded inner speech in silent reading. To understand the effects of different kinds of inner speech, future research will need to model individual differences in inner speech and capture the specific type of inner speech used on a single-trial basis.
In sum, the present study characterised a neurophysiological correlate of inner speech in silent reading of direct (vs. indirect) speech quotes. The results showed increased theta phase synchrony over trials at the onset of direct quote reading. This phase modulation is most plausibly explained by perceptually vivid inner speech in direct quote reading. Although we cannot directly track the phase alignment between theta activity and inner speech over time, our findings open up an exciting research avenue towards a mechanistic understanding of inner speech. Future investigations will need to develop new methods for measuring the phase structure of inner speech and examine its temporal relations to theta oscillations and to eye movements in reading.

Data and code availability statement
The summarised behavioural data, SPM contrast images and analysis scripts (R and MatLab scripts) that support the findings of this study are available at https://reshare.ukdataservice.ac.uk/854892/ (DOI: 10.5255/UKDA-SN-854892).