Bridging phenomenology and neural mechanisms of inner speech: ALE meta-analysis on egocentricity and spontaneity in a dual-mechanistic framework

The neural mechanisms of inner speech remain unclear despite its importance in a variety of cognitive processes and its implication in aberrant perceptions such as auditory verbal hallucinations. Previous research has proposed a corollary discharge model in which inner speech is a truncated form of overt speech, relying on speech production-related regions (e.g. left inferior frontal gyrus). This model does not fully capture the diverse phenomenology of inner speech and recent research suggesting alternative perception-related mechanisms of generation. Therefore, we present and test a framework in which inner speech can be generated by two separate mechanisms, depending on its phenomenological qualities: a corollary discharge mechanism relying on speech production regions and a perceptual simulation mechanism within speech perceptual regions. The results of the activation likelihood estimation meta-analysis examining inner speech studies support the idea that varieties of inner speech recruit different neural mechanisms.


Introduction
Inner speech is an internal, speech-like experience without the presence of an external sound (Alderson-Day and Fernyhough, 2015). Inner speech has been implicated in a wide variety of cognitive tasks including working memory (Baddeley, 1992; D'Esposito, 2007), silent reading (Filik and Barber, 2011; Yao and Scheepers, 2011; Yao et al., 2011), behavioural self-regulation (Diaz et al., 2014), as well as task switching and goal tracking (Emerson and Miyake, 2003; Miyake et al., 2004). Impairments of inner speech are often associated with mental disorders such as auditory verbal hallucinations in schizophrenia (Frith, 1987) and deficits in metacognition (Langland-Hassan et al., 2017; Morin, 2009). Given the functional role of inner speech in cognition and the negative consequences of its impairments, it is imperative that we develop a robust understanding of the cognitive and neural underpinnings of inner speech. In the current paper, we assess two mechanistic models of inner speech and explore how they can be reconciled with the phenomenology of inner speech in a unifying theoretical framework. This framework is examined and verified through an Activation Likelihood Estimation (ALE) meta-analysis of the existing literature.

Two mechanistic models of inner speech

The corollary discharge model
The corollary discharge model proposes that inner speech is the predicted perceptual consequence of a planned articulatory movement (Jack et al., 2019; Jacobson, 1932; Scott, 2013; Scott et al., 2013; Watson, 1913). The intent to speak generates an efference copy of the articulatory signal, which enters a forward model to predict what the intended articulation would sound like. This prediction is then perceived internally as inner speech. The model is supported by functional magnetic resonance imaging (fMRI) and positron emission tomography (PET) research, which often reports that tasks thought to elicit inner speech (e.g., rhyme judgement, cued speech and metrical stress judgement) activate brain areas related to speech production, particularly the left inferior frontal gyrus (L-IFG) (Aleman et al., 2005; Curcic-Blake et al., 2013; Lurito et al., 2000; Shergill et al., 2001). Using magnetoencephalography (MEG) in a mental imagery task, Tian and Poeppel (2010) show that kinaesthetic estimation of articulatory imagery is followed by increased auditory neural activity ~170 ms later, favouring the idea that articulatory signals are subsequently transformed into corollary discharge. The articulation-derived corollary discharge is believed to provide the perceptual content of inner speech, which is found to attenuate the impact of matching overt speech on subsequent speech perception (Scott, 2013), and to reduce the amplitude of the N100 event-related potential (ERP) response to overt speech when they are matched in content and time (Jack et al., 2019).
Although the corollary discharge model posits that inner speech arises from articulatory signals, empirical evidence shows that overt articulation of irrelevant speech (articulatory suppression) does not fully suppress inner speech (Wheeldon and Levelt, 1995). This implies that inner speech may not be driven entirely by articulation. Moreover, brain areas associated with speech production are not always activated during inner speech, especially when it occurs spontaneously (Hurlburt et al., 2016; Yao et al., 2011). In these studies, spontaneous inner speech predominantly activates the auditory perceptual areas, suggesting that it can be generated perceptually without relying on articulatory signals (Barsalou, 2008; Yao and Scheepers, 2011; Yao et al., 2011, 2021).
This mixed evidence also highlights two unresolved questions for the corollary discharge model. First, the speed of inner speech is often faster than overt speech (Mackay, 1981; Netsell et al., 2016), suggesting that inner speech may not be a fully-fledged corollary discharge of the intended articulation. Corollary discharges are derived from efference copies that match and monitor our speech output. This matching process is both time-sensitive and content-specific (Jack et al., 2019), leading to perceptual attenuation of our own speech (Ford and Mathalon, 2005). If corollary discharges unfold faster than overt speech, this mismatch in timing could lead to errors in attributing the speech as our own. Consequently, faster inner speech is less likely to fully rely on corollary discharges. Instead, it could take a more abstract format by losing its audible-speech qualities, such that it does not need to adhere to the timescale afforded by fully-fledged corollary discharges. Alternatively, it could be condensed into incomplete fragments, creating the illusion of being faster than fully articulated overt speech. The reduced reliance on corollary discharges calls for alternative, complementary mechanisms to account for the perceptual qualities of inner speech at this faster timescale.
Second, inner speech often incorporates voices of others which contain vocal features distinct from one's own (McCarthy-Jones and Fernyhough, 2011; Alderson-Day et al., 2018). These 'foreign' vocal features cannot be provided solely by corollary discharges as they are physically constrained by one's own articulator. While our articulator offers some flexibility, e.g., in raising or lowering pitch to mimic how other people talk, it has limitations in generating voices that we cannot produce accurately out loud, such as those of Darth Vader or a dragon from The Hobbit. To attribute inner speech to other people, one needs to recognise that the voices one experiences cannot be explained by one's own corollary discharges. This mismatch suggests the need for additional mechanisms to create perceptual features that are not predicted by one's corollary discharges.

The perceptual simulation model
One way for inner speech to be generated perceptually is via perceptual simulation, as coined by embodied cognition theories (Barsalou, 1999, 2008). In overt speech perception, neurons in the auditory cortex encode perceptual features of speech in distinct firing patterns. These patterns are captured and stored by neurons in association areas, referred to as conjunctive neurons (Barsalou et al., 2003; Barsalou, 2008), which allows the former to be reactivated later to simulate the perceptual experience of the original speech (or part thereof) (Barsalou, 2008). As a variety of captured firing patterns accumulate, they can be integrated or remixed to create new patterns and consequently new perceptual experiences. This would allow for a finite number of captured speech experiences to enable perceptual simulations of potentially an infinite number of novel speech experiences (e.g., an imaginary speech by Donald Trump saying, "Make psychology great again!") (Barsalou et al., 2003).
The concept of novel perceptual experiences being generated using a finite number of stored memories is not limited to the embodied cognitive literature. Kosslyn et al. (2001) provide one early description of this mechanism in their examination of the neural foundations of mental imagery, where they describe a system in which mental images may be "created by combining and modifying stored perceptual information in novel ways". This mechanism whereby imagery is a simulation of perception occurring within sensory cortices continues to represent a dominant explanation of how visual imagery is generated (Dijkstra et al., 2019), and has been extended to the domain of inner speech where it is argued that memories of speech may be combined and reactivated such that an individual can generate novel inner speech experiences in the absence of motor activation (Carruthers, 2018).
Initial support for such a mechanism in inner speech comes from Tian et al. (2016), who provide evidence for a memory-based route for the mental imagery of speech.
The perceptual simulation model of inner speech neatly complements the corollary discharge model. First, perceptual simulations of speech do not depend on speech production, which could explain the limited effectiveness of articulatory suppression on inner speech (Wheeldon and Levelt, 1995) and the lack of activations in speech production areas during spontaneous inner speech in silent reading (Yao et al., 2011) and at rest (Hurlburt et al., 2016). Second, inner speech can be simulated at a faster timescale than overt speech as it is not constrained by how fast the articulators can physically move (Oppenheim and Dell, 2010). Third, vocal features that cannot be produced entirely by one's own articulator, such as vocal features of Darth Vader, or the opposite sex whose pitch is outside of one's own vocal range, can be perceptually simulated.
Nevertheless, it is worth noting that the perceptual simulation model has its own limitations. Theoretically, it remains underspecified how memories of speech fragments are combined in ways that adhere to grammatical rules of our language (e.g., Carruthers, 2018). Empirically, several neuroimaging studies did not observe activation of the auditory cortex during inner speech tasks (De Nil et al., 2000; Gulyás, 2001), which implies a lack of perceptual reactivation. Moreover, Aziz-Zadeh et al. (2005) found that repetitive transcranial magnetic stimulation (rTMS) of the L-IFG disrupted not only overt speech but also inner speech in a syllable counting task, suggesting that at least some types of inner speech depend on speech production regions. Finally, concurrent white noise has been found to either inhibit (Poulton, 1977) or improve (Wilding and Mohindra, 1980) performance on inner speech tasks, suggesting varying dependence of inner speech on perception-related mechanisms. These disparate results highlight that neither corollary discharge nor perceptual simulation alone could offer a full mechanistic account of inner speech.

Reconciling heterogeneous findings in inner speech research
To reconcile conflicting findings in the inner speech literature, one needs to recognise that inner speech is not a homogeneous, uniform phenomenon, but a multi-dimensional, flexible process manifested in a variety of forms (Hurlburt et al., 2013; McCarthy-Jones and Fernyhough, 2011). Recent evidence suggests that the exact forms of inner speech are at least in part determined by task conditions. For example, an fMRI study by Hurlburt et al. (2016) proposes that inner speech elicited by an explicit task (e.g., being asked to imagine saying 'elephant') is mechanistically different to inner speech generated spontaneously (inner speech captured during resting state). In Regions of Interest (ROI) analyses, they showed that task-elicited inner speech was associated with increased activation of the left IFG and decreased activation in Heschl's gyrus, whereas spontaneous inner speech was associated with increased activation in Heschl's gyrus and no significant effects in the left IFG. They argue that explicit inner speech tasks may rely more on speech production and increase cognitive demands. In contrast, spontaneous inner speech seems to rely less on speech production and may be better explained by a perceptual imagery mechanism.
This notion of a production-perception mechanistic divide in inner speech is also demonstrated by Tian et al. (2016). In their fMRI study, they compared neural activations during imagined articulation and imagined hearing of simple syllables. They found that imagined articulation induced greater activity in a frontal-parietal sensorimotor system resembling a corollary discharge network, with activation encompassing regions involved in articulation planning (inferior frontal, premotor, supplementary motor regions), forward model estimation (parietal somatosensory regions), and sound reconstruction (superior temporal regions). In contrast, imagined hearing primarily engaged lexico-semantic (mid temporal regions) and episodic memory networks (mid-frontal, intraparietal and mid-temporal regions). Auditory memories would be retrieved from this distributed network before being reassembled in the superior temporal regions to form an internally perceptible sensation. These findings converge on the idea that various forms of inner speech may be flexibly generated by two distinct neurocognitive mechanisms, with one relying on covert speech production (in line with corollary discharge) and the other on memory-based perceptual imagery (in line with perceptual simulation).
Instead of debating between individual mechanisms of corollary discharge or perceptual simulation, integrating them in a dual-mechanistic model could provide much-needed flexibility in accounting for the variety of inner speech and reconciling seemingly contradictory empirical findings. For example, the corollary discharge mechanism can efficiently produce inner speech in one's own voice and at one's own will. This type of inner speech is likely to be used in phonological judgement tasks (e.g., determining whether two words rhyme) and to activate brain areas associated with speech production. Conversely, the perceptual simulation mechanism can better explain inner speech spoken by another person as it bypasses one's physical constraints in articulating other people's voices. Tasks that elicit this kind of inner speech should be more likely to engage areas related to speech perception and memory.
Although this dual-mechanistic model is flexible and could theoretically explain all varieties of inner speech, it remains to be tested across a wider range of task conditions beyond those in Tian et al. (2016) and Hurlburt et al. (2016), and to be reconciled with the diverse phenomenology of inner speech (e.g., VISQ-R; Alderson-Day et al., 2018).

Bridging the mechanisms and the phenomenology of inner speech
There have been several attempts to characterise the phenomenology of inner speech (Clowes, 2007; Hurlburt et al., 2013; Perrone-Bertolotti et al., 2014). One prominent framework is provided by the Varieties of Inner Speech Questionnaire - Revised (VISQ-R) (Alderson-Day et al., 2018), which builds on an earlier version by McCarthy-Jones and Fernyhough (2011). The VISQ-R characterises the quality of inner speech using five factors: dialogic, condensed, other people in inner speech, evaluative/critical and positive/regulatory. The dialogic factor represents the extent to which the inner speech is a dialogue or a monologue, the condensed factor represents the extent to which inner speech is in abbreviated form or of typical structure, and other people in inner speech reflects whether the voice is of the speaker or of another person. The final factors, evaluative/critical and positive/regulatory, capture whether the inner speech serves evaluative purposes (e.g., thinking about a previous decision) or positive purposes (e.g., using inner speech to calm oneself), respectively. Whilst these dimensions are well motivated by the traditional Vygotskian model of inner speech (i.e., varying by dialogue and condensation) (Vygotsky, 1987), the extent to which they are supported by proposed neurocognitive mechanisms is largely underspecified.
A recent study by Grandchamp et al. (2019) built on McCarthy-Jones and Fernyhough (2011), along with other theoretical and empirical works, to develop a neurocognitive model of inner speech which varies along three dimensions: condensation, dialogality and intentionality. In this 'ConDialInt' model, condensation measures the sensorimotor detail in the representation of inner speech, which can range from being fully detailed to being comparatively abstract. Dialogality captures the extent to which inner speech takes the form of a dialogue vs. a monologue, as well as the number of speakers within that speech. At one extreme, inner speech can take the form of a monologue, consisting of a single voice speaking in a manner reminiscent of a soliloquy. At the opposite extreme, inner speech represents a dialogue involving the voice of the speaker and the imagined voices of others. Intentionality indicates how deliberately or spontaneously inner speech is generated. The model therefore displays significant overlap with the VISQ and VISQ-R whilst also being integrated into a cognitive framework. In describing the underlying neurocognitive mechanisms involved, Grandchamp et al. (2019) propose a hierarchical predictive control scheme that aims to subsume all subtypes of inner speech. In line with the corollary discharge model, the scheme consists of a series of feedforward and feedback connections that integrate a conceptualisation, a phonetic formulation and an articulatory planning component into a hierarchical whole. The ConDialInt model presents a novel and important contribution to the current understanding of inner speech as it brings both phenomenological and neurocomputational findings into a single framework.
However, some areas of the ConDialInt model remain to be reconciled with existing empirical evidence. For example, as the model is based on the corollary discharge mechanism, it cannot easily explain findings that some inner speech (e.g. spontaneous or in other-voice) does not activate speech production areas (Hurlburt et al., 2016; Raij and Riekki, 2017; Yao et al., 2011). One attempt to explain these findings through the lens of the ConDialInt model is to view the involvement of the articulators, and speech production regions, as a dynamic process which may be modulated based on task demands. This is akin to Oppenheim and Dell's (2010) concept of flexible abstractness, whereby the involvement of speech articulators in inner speech may be increased or decreased depending on the need for phonological detail brought about by situational demands. In this vein it may be argued that the lack of observable involvement of speech production regions in some forms of inner speech indicates a low level of activation within the corollary discharge system, rather than the involvement of a separate mechanism of generation. While this flexible implementation of the ConDialInt model offers a plausible explanation of some of the divergent findings within the inner speech neuroimaging literature, it offers a limited account of the lack of suppression of inner speech through concurrent articulation (Wheeldon and Levelt, 1995), and of the speed difference between inner and overt speech (Mackay, 1981; Netsell et al., 2016). Crucially, it does not explain the unique vocal features of inner speech that cannot be produced by one's own articulators. Perceptual simulation, a mechanism used in various types of mental imagery (e.g. visual imagery; Pearson et al., 2015), can help explain these gaps, particularly for inner speech in voices other than our own.
The ConDialInt model's dimension of dialogality raises further questions which require consideration. First, by categorising inner speech by its dialogality in a manner that ranges from a single voice to multiple distinct voices, the ConDialInt model risks combining types of inner speech with different mechanisms of generation and neural correlates (e.g. corollary discharge and perceptual simulation) into a single subtype. For example, inner speech which is highly dialogic could utilise one generative mechanism when the speaker hears his own inner voice, and a different generative mechanism when generating the voice of a second person with distinct vocal characteristics. The interplay between multiple characters inherent to Grandchamp et al.'s (2019) concept of dialogic inner speech also implicates systems such as Theory of Mind (Alderson-Day et al., 2016, 2020), which might play a role in the inner speech experience but do not constitute the speech-like experience per se. These questions present a hurdle to the empirical investigation of inner speech and suggest that the dimension of dialogality could be further refined, such that the chances of capturing neurocognitively distinct types of inner speech within a single dimension are reduced.
Additional questions arise regarding the condensation dimension of the ConDialInt model. Grandchamp et al. (2019) define condensed inner speech as that which "only involves the highest linguistic level (semantics), and has lost most of the acoustic, phonological, and even syntactic qualities of overt speech". Conceptually, the lack of these qualities implies that tasks requiring deliberate use of these qualities, like short-term memory rehearsal or phonological judgements, are unlikely to involve condensed inner speech. Rather, condensed inner speech seems to be inherently spontaneous. This raises the question of whether the dimension of condensation is orthogonal to that of deliberateness-spontaneity. Empirically, due to its spontaneous nature, condensed inner speech is difficult to manipulate or measure. Although current neuroimaging techniques can in principle capture brain activity linked to semantic processing, such as in the left anterior temporal lobe (Visser et al., 2010), it remains challenging to attribute this neural activity to condensed inner speech specifically, as this activity can also be generated by other non-verbal semantic processing. Given the lack of empirical data on condensation, it is not possible to conduct a meta-analysis on this dimension, or to articulate how condensed inner speech may interact with known neurocognitive mechanisms related to corollary discharges and/or perceptual simulations. Therefore, the exclusion of condensation from our framework is not a critique of its existence or its inclusion in Grandchamp et al.'s model, but a reflection of practical concerns in the current meta-analysis. This does not preclude further adjustments to our framework as condensation becomes better understood and more empirically substantiated. As it currently stands, however, this dimension remains challenging to investigate or analyse.
We therefore propose a simpler, more flexible two-dimensional cognitive framework. The original dimension of condensation is not included because it is not empirically manipulatable or measurable. The original dimension of dialogality is replaced by egocentricity. Rather than classifying inner speech by whether it is a monologue, dialogue, or the number of speakers represented within inner speech, the dimension of egocentricity measures the extent to which inner speech is a recreation of one's own voice (high egocentricity), or the voice of another individual (low egocentricity). Egocentricity ensures that mechanistically distinct types of inner speech (i.e. in own-voice and other-voice) are clearly differentiated rather than intermixed, as in the case of dialogic inner speech. Finally, the dimension of intentionality is replaced by spontaneity. This renaming serves primarily to avoid the implication that spontaneous inner speech may not have an intent. For example, an individual might evaluate his plans for the day in a spontaneous manner, which demonstrates intent whilst also being spontaneous. It is crucial to note that 'deliberateness' and 'spontaneity' function on a continuum in our framework. The continuum ranges from highly deliberate inner speech, elicited by explicit task demands (e.g., rhyme judgements), to highly spontaneous inner speech, which emerges in the absence of cues or clear task demands (e.g., mind wandering). Between these extremes, other forms of inner speech exist but are less studied. These include, for example, rehearsing a conversation before it happens, engaging in internal monologue during problem-solving tasks like the Tower of Hanoi, or spontaneously generating musical lyrics. Our framework is conceptualised to accommodate these diverse types of inner speech along the continuum of deliberateness and spontaneity.
Instead of relying on corollary discharge exclusively (Grandchamp et al., 2019), the present framework is dual-mechanistic, additionally incorporating the perceptual simulation mechanism. It straightforwardly predicts the relative contributions of the two mechanisms along the northwest-southeast diagonal of the egocentricity × spontaneity space (Fig. 1). That is, the more egocentric and deliberate inner speech is (i.e. more northwestward in Fig. 1), the more strongly it relies on articulation-derived corollary discharge. For example, tasks involving explicit phonological judgements are likely to elicit egocentric and deliberate inner speech and to activate speech production areas (Aleman et al., 2005; Lurito et al., 2000). In contrast, the less egocentric or deliberate inner speech is (i.e. more southeastward in Fig. 1), the more likely it resorts to perceptual simulation. For example, imagery of another person's voice (low egocentricity) and inner speech that emerges spontaneously would preferentially recruit the perceptual simulation mechanism, given the physical constraints in articulating voices of others and/or a lack of explicit intentions. These types of inner speech are more likely to activate temporal auditory cortices (Alderson-Day et al., 2016; Hurlburt et al., 2016; Marvel and Desmond, 2012; Yao et al., 2011) and may be modulated by oscillatory activity in the auditory cortex rather than in articulatory regions (Yao et al., 2021).
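The qualitative prediction above can be made concrete with a toy sketch. The framework itself specifies only a direction of change along the diagonal, not a functional form, so the linear averaging below, the [0, 1] scaling of both dimensions, and the function name `predicted_reliance` are all illustrative assumptions rather than part of the proposed model.

```python
def predicted_reliance(egocentricity: float, spontaneity: float) -> tuple[float, float]:
    """Toy weighting of the two mechanisms for one kind of inner speech.

    Both inputs are assumed to be scaled to [0, 1]; the linear average is an
    illustrative choice, not something the framework specifies.
    Returns (corollary_discharge, perceptual_simulation), which sum to 1.
    """
    if not (0.0 <= egocentricity <= 1.0 and 0.0 <= spontaneity <= 1.0):
        raise ValueError("egocentricity and spontaneity must lie in [0, 1]")
    deliberateness = 1.0 - spontaneity
    corollary_discharge = (egocentricity + deliberateness) / 2.0
    return corollary_discharge, 1.0 - corollary_discharge


# Own-voice, fully deliberate inner speech (e.g. a rhyme judgement).
rhyme_judgement = predicted_reliance(egocentricity=1.0, spontaneity=0.0)
# Other-voice, fully spontaneous inner speech.
other_voice_spontaneous = predicted_reliance(egocentricity=0.0, spontaneity=1.0)
```

Under these assumptions the first case sits at the corollary discharge extreme and the second at the perceptual simulation extreme, with intermediate forms falling between them.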
In addition to predicting mechanistic involvement across different inner speech tasks, the proposed framework could also bridge the corollary discharge and perceptual simulation mechanisms and the phenomenological qualities identified in VISQ-R (Alderson-Day et al., 2018). For example, high- and low-egocentric inner speech would be phenomenologically perceived as self- and other-monologic inner speech, respectively. The intermix of the two would support the types of dialogic inner speech which include multiple speakers/voices. While the dimensions of dialogality and other people are more concerned with the sensorimotor features and the agency of inner speech, the dimensions of evaluative/critical and positive/regulatory primarily capture its cognitive functions. Although the kinds of inner speech used for these functions are yet to be empirically studied, they can nevertheless be represented along the egocentricity and spontaneity dimensions. However, the dimension of condensation is not explicitly considered in the current framework as it likely correlates with egocentricity and spontaneity and cannot be easily manipulated or objectively observed.

Aims & hypotheses
To verify the proposed framework and its predicted neuroanatomical underpinnings, the present study carried out an Activation Likelihood Estimation (ALE) meta-analysis of the existing neuroimaging literature. Functional activation coordinates are compiled across multiple studies to identify which brain regions are consistently activated as inner speech varies along egocentricity and spontaneity. This kind of convergence analysis is more likely to reveal neural correlates inherent to inner speech, as it is less skewed by peripheral processes introduced by specific paradigms (e.g., increased working memory, verbal monitoring, or Theory of Mind). More importantly, it enables us to verify the distinct mechanisms of corollary discharge and perceptual simulation across a wider range of paradigms beyond the studies by Hurlburt et al. (2016) and Tian et al. (2016).
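The core idea of coordinate-based convergence can be sketched as follows. In the general ALE approach, each experiment's reported foci are blurred into a modelled activation (MA) map, and the maps are combined voxel-wise as a probabilistic union, ALE = 1 - ∏(1 - MA_i). The sketch below is a minimal illustration only: the grid size, kernel width and foci are hypothetical, the kernel is fixed rather than scaled by sample size, and no significance testing is performed, so it should not be read as the pipeline used in this study.

```python
import numpy as np


def modelled_activation(shape, foci, sigma):
    """MA map for one experiment: voxel-wise maximum over 3D Gaussian
    densities centred on each reported focus (unit voxel volume assumed)."""
    axes = [np.arange(n) for n in shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    norm = (2.0 * np.pi * sigma**2) ** -1.5  # peak of a normalised 3D Gaussian
    ma = np.zeros(shape)
    for focus in foci:
        dist_sq = np.sum((grid - np.asarray(focus)) ** 2, axis=-1)
        ma = np.maximum(ma, norm * np.exp(-dist_sq / (2.0 * sigma**2)))
    return ma


def ale_map(ma_maps):
    """Voxel-wise union of MA maps: ALE = 1 - prod_i (1 - MA_i)."""
    return 1.0 - np.prod(1.0 - np.stack(ma_maps), axis=0)


# Hypothetical foci in voxel coordinates on a small toy grid.
shape, sigma = (20, 20, 20), 2.0
experiments = [
    [(5, 5, 5)],
    [(5, 5, 5), (15, 15, 15)],
    [(6, 5, 5)],
]
ale = ale_map([modelled_activation(shape, foci, sigma) for foci in experiments])
```

In this toy example the voxel at (5, 5, 5), where foci from three experiments converge, receives a higher ALE value than the voxel at (15, 15, 15), which is reported by a single experiment.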
While the current state of neuroimaging literature is incomplete and focuses on specific forms such as those elicited by tasks (e.g., rhyme judgements), mind wandering, and silent reading, our meta-analysis relies on these well-studied forms to test our framework's predictions. The caveat lies in cautious generalisation to other types of inner speech. Less studied forms, such as internal dialogues incorporating voices of others, or those that occur in problem-solving tasks, are underexplored in the neuroimaging literature. The present framework aims to address this gap by conceptualising continuous dimensions of egocentricity and spontaneity in inner speech and serves as a generalised framework for further empirical validation and exploration of these under-studied forms.
We hypothesised that corollary discharge and perceptual simulation would be differentially engaged to produce a variety of inner speech. Inner speech would primarily engage the corollary discharge mechanism when higher in egocentricity and/or more deliberate, and rely more on the perceptual simulation mechanism as egocentricity decreases and/or spontaneity increases. In an ALE analysis, we predicted that inner speech which was deliberate and high in egocentricity would be associated with more consistent activations in speech production areas. These speech production areas primarily include the L-IFG, the left premotor cortex (L-PMC) and the supplementary motor area (SMA) (Booth et al., 2003; Lurito et al., 2000). Within the L-IFG, we expected greater activation of the pars opercularis subregion (BA44). This is because of previous work implicating the pars opercularis in phonological processing (Burton et al., 2005) and speech production (Tourville and Guenther, 2011), as well as its putative role in articulatory planning and efference copy generation in previous studies of inner speech (Molnar-Szakacs et al., 2005; Tian et al., 2016). We also predicted involvement of the left superior temporal sulcus and gyrus (L-STS/L-STG) as the terminus of corollary discharge (Tian et al., 2016; Tourville and Guenther, 2011). Inner speech which is low in egocentricity and/or spontaneous would be associated with more consistent activations primarily in the L-STG/STS but also in the episodic memory network, including the left medial temporal gyrus (L-MTG), the left medial frontal gyrus (L-MFG) and the superior parietal lobe/precuneus (L-SPL/PC) (Hurlburt et al., 2016; Kleider-Offutt et al., 2019; Linden et al., 2011; Tian et al., 2016).

Literature search
The search was planned and conducted in line with PRISMA guidelines for meta-analyses and systematic reviews (Moher et al., 2009). The literature search was conducted in May 2021 using three electronic databases (PubMed, Web of Science, Scopus) with the search query ("magnetic resonance imaging" OR "mri" OR "fmri" OR "positron emission tomography" OR "pet") AND ("inner speech" OR "auditory imagery" OR "covert speech" OR "speech imagery" OR "inner voice" OR "inner experience"). Searches were limited to publications mentioning these terms within the title, abstract or author keywords. No further search criteria (e.g. date of publication) were applied. This yielded 598 results, with 274 remaining after duplicates were removed. Manual searches of the reference sections of the resulting articles were conducted in order to include relevant studies which were not captured by the search terms. This yielded a further 22 relevant studies, which underwent screening along with the 274 studies, resulting in a total of 296 studies being screened (see Fig. 2 for a visualisation). The final studies are presented in Table 1.
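For readers wishing to reproduce the search, the query and the record counts reported above can be written down explicitly. The following is a minimal sketch: the helper `or_group` and the variable names are hypothetical, and each database requires its own field tags to restrict the search to title, abstract and keywords, which are not shown here.

```python
imaging_terms = [
    "magnetic resonance imaging", "mri", "fmri",
    "positron emission tomography", "pet",
]
topic_terms = [
    "inner speech", "auditory imagery", "covert speech",
    "speech imagery", "inner voice", "inner experience",
]


def or_group(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


# Boolean query as reported in the text.
query = f"{or_group(imaging_terms)} AND {or_group(topic_terms)}"

# Record flow, using the counts reported in the text.
records_identified = 598
records_after_deduplication = 274
records_from_reference_lists = 22

duplicates_removed = records_identified - records_after_deduplication
records_screened = records_after_deduplication + records_from_reference_lists
```

With the reported counts, 324 duplicates were removed and 296 records entered screening.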

Eligibility criteria

Contrast selection & grouping
Contrasts which compared inner speech to a baseline were selected. For the majority of the studies, these were either an inner speech > rest or an inner speech > fixation symbol contrast. Four additional studies utilised a baseline in which participants matched visual symbols (Aparicio et al., 2007; Booth et al., 2003; Hernandez et al., 2013; MacSweeney et al., 2009). Given that visual matching is not known to elicit inner speech, the inner speech > visual matching contrasts were also included in the analyses.
Studies were then grouped a priori based on their egocentricity and spontaneity. Studies were allocated to high and low egocentricity groups, based on whether the paradigm required participants to generate inner speech in their own voice (high egocentricity) or in another person's voice (low egocentricity). This yielded 16 studies in the high egocentricity group and 4 studies in the low egocentricity group.
Within the dimension of spontaneity, studies in which participants were required, explicitly or implicitly, to generate inner speech were classified as deliberate inner speech studies. For example, De Nil et al. (2000) asked participants to internally read single words and Hernandez et al. (2013) asked participants whether pairs of visually presented words rhymed. Studies in which inner speech occurred spontaneously, either in tasks not reliant on inner speech or in the resting state, were classified as spontaneous inner speech studies. For example, studies by Yao et al. (2011) and Alderson-Day et al. (2020) used a reading comprehension task, which does not require the use of inner speech to complete. Inner speech in these tasks emerges from spontaneous perceptual simulations of literary characters when reading direct quotations. Research by Hurlburt et al. (2016) also examined spontaneous inner speech but adopted a different approach. Participants were asked to report their internal state in the moments preceding the sounding of random auditory beeps. fMRI analysis then focused on the moments in which participants reported that they were engaging in inner speech. The division of studies by spontaneity yielded 14 studies in the deliberate inner speech group and 4 studies in the spontaneous inner speech group. Given that no studies were found which examined inner speech which was both spontaneous and low in egocentricity (i.e. spontaneous inner speech in other voices), we could not group studies into the four unique quadrants of the two-dimensional model.
It is worth noting that the numbers of included studies were unbalanced between the groups defined above. This was primarily because a disproportionately large number of studies used phonological judgement tasks such as rhyme judgement tasks (32 %). To ensure that our contrasts were not significantly skewed by overrepresented paradigms like rhyme judgements, we ran one set of analyses on the 'unbalanced' dataset, and re-ran the analyses on a sub-dataset where the numbers of studies were balanced across paradigms. This 'balanced' dataset contained 2 studies per paradigm-type, with a total of 14 experiments split across 7 paradigm categories (allocations in Appendix A1). When a particular paradigm-type was employed by more than 2 studies, the experiments with the largest sample sizes were selected. The 7 paradigm-types were: (1) other voice imagery, (2) tongue twister imagery, (3) mind wandering, (4) direct quotation reading, (5) word generation, (6) phonological judgement, (7) single word reading.
Within the results section, analysis using all included studies was labelled as the unbalanced dataset. Analysis of the paradigm-adjusted dataset was labelled as the balanced dataset.
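The balancing rule described above (two studies per paradigm-type, preferring the largest sample sizes) can be sketched as follows. This is a minimal illustration only: the function name `balance_by_paradigm`, the record fields and the example studies are hypothetical and do not correspond to the allocations in Appendix A1.

```python
from collections import defaultdict

def balance_by_paradigm(studies, per_paradigm=2):
    """Keep the `per_paradigm` largest-sample studies within each paradigm-type."""
    by_paradigm = defaultdict(list)
    for study in studies:
        by_paradigm[study["paradigm"]].append(study)

    balanced = []
    for group in by_paradigm.values():
        # Largest sample size first; Python's sort is stable, so ties keep input order.
        ranked = sorted(group, key=lambda s: s["n"], reverse=True)
        balanced.extend(ranked[:per_paradigm])
    return balanced

# Hypothetical study records (not the studies analysed in this paper).
studies = [
    {"id": "A", "paradigm": "phonological judgement", "n": 12},
    {"id": "B", "paradigm": "phonological judgement", "n": 30},
    {"id": "C", "paradigm": "phonological judgement", "n": 18},
    {"id": "D", "paradigm": "word generation", "n": 10},
    {"id": "E", "paradigm": "word generation", "n": 22},
]

selected = balance_by_paradigm(studies)
print(sorted(s["id"] for s in selected))
```

In this toy example the overrepresented paradigm loses its smallest study, while the paradigm with exactly two studies is retained in full.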

Activation likelihood estimation
Activation likelihood estimation (ALE) analysis was carried out using the BrainMap GingerALE tool, version 3.0.2 (www.brainmap.org). ALE analysis compiles reported activation coordinates across multiple fMRI studies to identify which brain regions are most likely associated with a cognitive task (Eickhoff et al., 2012; Turkeltaub et al., 2012). All MNI coordinates were converted to Talairach space using the icbm2tal transformation implemented in GingerALE (Lancaster et al., 2007). ALE analysis of the unbalanced and balanced datasets used a cluster-forming threshold of p < 0.001 (uncorrected, 1000 permutations), and a cluster-level family-wise error (FWE) corrected threshold of p < 0.05, as recommended by Müller et al. (2018). Because the subgroups divided by egocentricity and spontaneity each contained relatively few studies, the ALE analysis of the subgroups used a more liberal cluster-forming threshold of p < 0.01 (uncorrected, 1000 permutations) and a cluster-level family-wise error (FWE) corrected threshold of p < 0.05. The more liberal threshold of p < 0.01 is appropriate for smaller sample sizes, and has been adopted by previous ALE studies (Di et al., 2017; Falcone and Jerram, 2018; Ruiz Vargas et al., 2016).
Given the low number of studies in the low egocentricity (N = 4) and spontaneous (N = 4) conditions, both in absolute terms and relative to their high egocentricity/deliberate counterparts (N = 16 and 14, respectively), we adhered to GingerALE recommendations and did not run any contrast or conjunction analyses. The resulting ALE maps were rendered in MRIcroGL V1.2.2 (https://www.nitrc.org/projects/mricrogl/), with anatomical labelling of significant clusters and peaks being automatically calculated by GingerALE using the Talairach Daemon (http://talairach.org/) and exported to a spreadsheet.
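For readers unfamiliar with the ALE statistic, the toy sketch below illustrates its core computation on a one-dimensional grid: each study's foci are blurred into a modelled activation (MA) map, and the MA maps are combined across studies as a probabilistic union. It is not GingerALE's implementation, which operates on 3D voxel grids with sample-size-dependent kernel widths and permutation-based thresholding. The function names, the `peak` scaling, the kernel width and the example foci are all hypothetical.

```python
import math

def modelled_activation(grid, foci, sigma, peak=0.5):
    """Per-study modelled activation (MA) map on a 1D grid.

    Each focus is blurred with a Gaussian; where foci from the same study
    overlap, the maximum value is kept. `peak` is a hypothetical scaling.
    """
    return [
        max(peak * math.exp(-((x - f) ** 2) / (2 * sigma ** 2)) for f in foci)
        for x in grid
    ]

def ale_map(ma_maps):
    """Combine MA maps across studies: ALE = 1 - prod_i (1 - MA_i)."""
    ale = []
    for values in zip(*ma_maps):
        none_active = 1.0
        for v in values:
            none_active *= 1.0 - v
        ale.append(1.0 - none_active)
    return ale

grid = list(range(-10, 11))  # toy 1D "brain"
study_1 = modelled_activation(grid, [0], sigma=2.0)
study_2 = modelled_activation(grid, [0, 6], sigma=2.0)
ale = ale_map([study_1, study_2])

# Both toy studies report a focus at x = 0 but only one reports x = 6,
# so the ALE value is higher at x = 0.
print(ale[grid.index(0)] > ale[grid.index(6)])
```

The union form means that locations reported by several studies accumulate a higher ALE value than locations reported by only one, which is what the significance thresholds above are then applied to.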

Publication bias check: fail-safe N analysis
To evaluate how robust the ALE results are against publication bias (i.e. null results not being published, also known as the 'file-drawer effect'), a fail-safe N analysis was conducted on all datasets. This consists of re-running the GingerALE analysis whilst iteratively adding an increasing number of randomly-generated null-result studies (Acar et al., 2018). The fail-safe N is calculated per ALE cluster. Its value represents the highest number of null studies that can be added to a dataset whilst maintaining the significance of the cluster. Null-result experiments were generated in R, version 4.0.5 (https://www.r-project.org/) using the GenerateNull script (https://github.com/NeuroStat/GenerateNull; as used in Acar et al., 2018). The R script creates a pre-specified number of null-studies matched for the number of participants and foci contained within the real experiment list. Foci within the generated null-studies are distributed randomly throughout the grey matter. Given that there is an estimated upper bound of 30 unpublished studies with null findings per 100 published neuroimaging studies investigating language (Samartsidis et al., 2020), we re-analysed the unbalanced pooled dataset (N = 22) with up to 7 additional null studies (30 %) and re-analysed the balanced pooled dataset (N = 14) with up to 4 additional null studies (28.6 %). The datasets divided by egocentricity and spontaneity were also re-analysed using the following additional null studies for the unbalanced versions: low egocentricity (Nnull = 1; 25 %), high egocentricity (Nnull = 5; 31.3 %), deliberate (Nnull = 4; 28.6 %), spontaneous (Nnull = 1; 25 %). The balanced versions were re-analysed using the following additional null studies: low egocentricity (Nnull = 1; 25 %), high egocentricity (Nnull = 3; 30 %), deliberate (Nnull = 2; 25 %), spontaneous (Nnull = 1; 25 %). The clusters which survive the significance thresholds after the addition of ~30 % null studies are considered robust against potential file drawer effects.
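The numbers of null studies listed above are consistent with rounding 30 % of each dataset's size to the nearest whole study. The sketch below reproduces the counts reported for the pooled and unbalanced datasets under that assumption; the rounding rule and the helper name `null_study_count` are our illustration, not a documented step of the GenerateNull script.

```python
def null_study_count(n_published, ratio=0.30):
    """Number of null studies to add, assuming ~30 % of N rounded to the nearest integer."""
    return round(n_published * ratio)

# Dataset sizes and reported null-study counts, taken from the text above.
reported = {
    "unbalanced pooled": (22, 7),
    "balanced pooled": (14, 4),
    "low egocentricity": (4, 1),
    "high egocentricity": (16, 5),
    "deliberate": (14, 4),
    "spontaneous": (4, 1),
}

for name, (n, n_null_reported) in reported.items():
    n_null = null_study_count(n)
    print(f"{name}: N = {n}, null studies = {n_null} (reported: {n_null_reported})")
```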

Outlier check: jackknife analysis
The fail-safe N analysis was complemented by a jackknife analysis to cross-validate that the observed results were not driven by any single study in the dataset (Amanzio et al., 2013; Shao and Tu, 1995). This involved repeatedly re-running the analysis whilst excluding a single, different study each time. The results were then visually analysed and compared to the clusters produced in the original analysis in convergence coordinates and cluster size. Each cluster was scored as a percentage, which represents the proportion of analysis iterations in which the convergence was replicated. Clusters which were present in over 80 % of the iterations were considered robust (Yaple and Yu, 2020).
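The scoring logic of the jackknife procedure can be sketched as a leave-one-out loop. In the actual analysis each iteration was a full GingerALE run whose clusters were compared visually; here `toy_analyse` is a hypothetical stand-in that declares a region a cluster when the studies reporting it pool enough participants, and the study records and the `min_participants` value are invented for illustration.

```python
def jackknife_replication(studies, analyse, cluster):
    """Proportion of leave-one-out iterations in which `cluster` is still found."""
    hits = 0
    for i in range(len(studies)):
        subset = studies[:i] + studies[i + 1:]  # exclude one different study each time
        if cluster in analyse(subset):
            hits += 1
    return hits / len(studies)

def toy_analyse(subset, min_participants=35):
    """Hypothetical stand-in for an ALE run (NOT the real significance test)."""
    totals = {}
    for study in subset:
        for region in study["regions"]:
            totals[region] = totals.get(region, 0) + study["n"]
    return {region for region, n in totals.items() if n >= min_participants}

# Four hypothetical studies, one of which contributes most of the participants.
studies = [
    {"id": "S1", "n": 30, "regions": ["L-MTG"]},
    {"id": "S2", "n": 10, "regions": ["L-MTG"]},
    {"id": "S3", "n": 10, "regions": ["L-MTG"]},
    {"id": "S4", "n": 10, "regions": ["L-MTG"]},
]

score = jackknife_replication(studies, toy_analyse, "L-MTG")
robust = score > 0.80  # "present in over 80 % of the iterations"
print(f"replicated in {score:.0%} of iterations; robust: {robust}")
```

In a four-study dataset, a cluster that disappears in a single iteration scores 75 % and therefore cannot meet the 80 % criterion, which is worth bearing in mind when interpreting jackknife scores for the smallest subgroups.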

Ethics statement
All meta-analyses described in this research paper used pre-existing data collected across numerous peer-reviewed neuroimaging studies. These studies were all given ethical approval by their respective boards of ethics.

ALE clusters for inner speech - all studies
The upper panel of Fig. 3 shows the ALE results on the unbalanced dataset, illustrating the brain areas that displayed significant convergence across all included studies. The associated Talairach coordinates are presented in the 'unbalanced' section of Table 2. In total, six clusters were identified. The largest cluster was centred at the left medial frontal gyrus / supplementary motor area (Brodmann Area 6; BA6) and extended across the left superior frontal gyrus (BA6) and left cingulate gyrus (BA24). Additional clusters were centred on the left precentral gyrus (BA6 and BA44), left inferior frontal gyrus (BA44 / BA45), right insula (BA13) and right culmen.
The lower panel of Fig. 3 shows the ALE results on the balanced dataset, in which all paradigm types were represented by an equal number of studies with the largest sample sizes. The associated Talairach coordinates are presented in the 'balanced' section of Table 2. In total, three clusters showed significant convergence. The largest cluster was centred at the left medial frontal gyrus / supplementary motor area (BA6) and extended across the left superior frontal gyrus (BA6) and left cingulate gyrus (BA24). Two smaller clusters were centred on the right insula (BA13) and right culmen.

Inner speech as a function of egocentricity
The upper panel of Fig. 4 shows brain areas which displayed significant convergence for high egocentricity and low egocentricity studies, respectively, in the unbalanced dataset. Their Talairach coordinates are reported in the 'unbalanced' section of Table 3. The High Egocentricity studies converged on five clusters (coloured in red). The largest cluster was centred on the left precentral gyrus (BA6) and extended across the left inferior frontal gyrus and middle frontal gyrus. Additional clusters encompassed the left medial frontal gyrus / supplementary motor area (BA6), left precentral gyrus (BA44), right culmen and right insula (BA13). The Low Egocentricity studies converged on a single cluster centred on the right insula (BA13) and extending across the right inferior frontal gyrus (BA44) (coloured in blue).
The lower panel of Fig. 4 shows brain areas which displayed significant convergence for high egocentricity and low egocentricity studies, respectively, in the balanced dataset. The associated Talairach coordinates are presented in the 'balanced' section of Table 3. The High Egocentricity studies converged on five clusters (coloured in red). The largest cluster was centred on the left medial frontal gyrus / supplementary motor area (BA6). Two additional clusters were centred on the left precentral gyrus (BA6 & BA44), one of which also encompassed the left inferior frontal gyrus. The final clusters were centred on the right insula (BA13) and right culmen. The Low Egocentricity studies converged on a single cluster centred on the right insula (BA13) and extending across the right inferior frontal gyrus (BA44).
Note: Coordinates (x, y, z) represent the location of peak ALE statistic per cluster in Talairach space. The Unbalanced subsection shows the results of the dataset that included all eligible studies. The Balanced subsection shows the results of the dataset with an equal number of studies for each included paradigm-type. Area names in bold font represent regions whose activation was replicated in the balanced dataset. Asterisks in Fail-Safe N and Jackknife signify sufficient robustness against publication bias and outliers, respectively, as defined in the Methods section.

Inner speech as a function of spontaneity
The upper panel of Fig. 5 shows brain areas which displayed significant convergence for spontaneous and deliberate studies, respectively, in the unbalanced dataset. The Talairach coordinates associated with spontaneity analyses are presented in the 'unbalanced' section of Table 4. The Spontaneous studies converged on a single cluster centred on the left middle temporal gyrus and extending across the left superior temporal gyrus. The Deliberate studies converged on four clusters. The largest cluster was centred on the left precentral gyrus (BA6) and extended across the left inferior frontal gyrus and middle frontal gyrus. The three additional clusters were centred on the left medial frontal gyrus / supplementary motor area (BA6), left precentral gyrus (BA44) and right culmen / declive.
The lower panel of Fig. 5 shows brain areas which displayed significant convergence for spontaneous and deliberate studies, respectively, in the balanced dataset. The associated Talairach coordinates are presented in the 'balanced' section of Table 4. The Spontaneous studies converged on a single cluster centred on the left middle temporal gyrus (BA22) and extending across the left superior temporal gyrus (BA21). The Deliberate studies converged on two clusters. The largest cluster was centred on the left medial frontal gyrus / supplementary motor area (BA6), with an additional cluster centred on the left precentral gyrus (BA6).
Note: Coordinates (x, y, z) represent the location of peak ALE statistic per cluster in Talairach space. The Unbalanced subsection shows the results of the dataset that included all eligible studies. The Balanced subsection shows the results of the dataset with an equal number of studies for each included paradigm-type. All activation observed in the unbalanced dataset was replicated in the balanced dataset. Asterisks in Fail-Safe N and Jackknife signify sufficient robustness against publication bias and outliers, respectively, as defined in the Methods section.

Discussion
Building on the phenomenological variety of inner speech (Alderson-Day et al., 2018; Grandchamp et al., 2019) and the suggestions of two underlying neural mechanisms (Hurlburt et al., 2016; Tian et al., 2016), the current paper aimed to unify phenomenology and neural mechanisms in a two-dimensional framework. It proposes that inner speech can cognitively vary by egocentricity (in self-voice vs. other-voice) and spontaneity (deliberate vs. spontaneous), which has the potential to bridge the phenomenological qualities of inner speech (except for condensation) with distinct neurocognitive mechanisms of corollary discharge and perceptual simulation. Specifically, it hypothesised that inner speech would primarily engage the corollary discharge mechanism when higher in egocentricity and/or more deliberate, and rely more on the perceptual simulation mechanism as egocentricity decreases and spontaneity increases. Although not directly tested in the present paper, the framework also illustrated that the phenomenological qualities of inner speech (dialogicality, evaluative/critical and positive/regulatory) and its elicitation methods could in principle be accounted for along the continuous dimensions of egocentricity and spontaneity.
To validate the utility of this framework, we carried out an ALE meta-analysis to identify neural correlates that converged (1) across all inner speech paradigms and (2) across inner speech at opposite ends of egocentricity and spontaneity, respectively. An ALE analysis of all available studies found significant convergence on the left medial frontal gyrus / supplementary motor area (L-MFG / L-SMA; BA6), left precentral gyrus (L-PCG; BA6 & BA44), right insula (R-Insula, BA13), right culmen (R-Culmen) and left inferior frontal gyrus (L-IFG; BA44 / BA45). However, after adjusting the number of studies by paradigm, convergence was only observed over the left medial frontal gyrus / supplementary motor area (L-MFG / L-SMA; BA6), right insula (R-Insula, BA13) and right culmen (R-Culmen), supporting the hypothesis that distinct neural mechanisms could be involved in different forms of inner speech. Specifically, High Egocentricity inner speech converged on L-MFG / L-SMA (BA6), L-PCG (BA6 & BA44), R-Insula (BA13) and R-Culmen, whereas Low Egocentricity inner speech converged on R-Insula (BA13) only. Spontaneous inner speech converged on the left middle temporal gyrus (L-MTG; BA22) with a substantial portion of the cluster (> 25 %; Appendix A3) covering the left superior temporal gyrus (L-STG; BA22). Deliberate inner speech converged on the L-MFG / L-SMA (BA6), L-PCG (BA6), L-PCG (BA44) and R-Culmen, with only the L-MFG / L-SMA (BA6) and L-PCG (BA6) converging after sample size adjustments by paradigm. Despite a relatively low number of studies available in the inner speech literature, our analyses are robust and sensitive, having accounted for the unbalanced number of studies by paradigm, and cross-validated the results against file drawer effects and outlier studies.

Inner speech - all studies
The brain regions which showed significant convergence across all inner speech studies were regions broadly associated with overt speech production. The L-PCG and L-MFG / L-SMA encompass the primary and secondary motor areas and the L-IFG is typically reported in speech production tasks (Frankford et al., 2019). Convergence within these areas therefore suggests that some form of motor planning occurs during the generation of inner speech, as proposed by the corollary discharge model of inner speech. However, given the proximity of the L-MFG cluster to regions associated with hand/finger movement (Amiez and Petrides, 2014), convergence in L-MFG could also reflect finger movements related to button presses (a common feature in many inner speech tasks), rather than a process inherent to inner speech. The convergence on R-Insula is somewhat unexpected, as this region is under-investigated in the inner speech literature. Nevertheless, other research suggests that the insula could be involved in articulation and could be part of the corollary discharge circuit. For example, a study of macaques showed that stimulation of the insula triggers orofacial motor programmes such as chewing, mouthing, lip smacking and swallowing (Jezzini et al., 2012). In humans, speech production research (Oh et al., 2014) and lesion symptom research (Cereda et al., 2002; Dronkers, 1996; Duffau et al., 2001; Starkstein et al., 1988) have also causally associated the insula with articulation. However, other studies have suggested that activation of the insula reflects interoceptive processing (Marvel and Desmond, 2012; Modinos et al., 2009; Morin and Hamper, 2012). Interoception refers to the processes which underlie self-awareness, such as the detection, filtering and integration of information regarding one's own body (Craig, 2009). Thus, insular involvement in inner speech could represent increased self-awareness associated with inner speech (Morin and Hamper, 2012; Morin and Michaud, 2007; Morin, 2009).
It is worth noting that the above results (N = 22) could be skewed heavily towards a select number of paradigms, with phonological judgement elicitation paradigms accounting for more than 30 % of studies included. Once the number of studies was balanced between paradigm types (N = 14), convergence over L-PCG (BA6 & BA45) and L-IFG (BA44), areas typically associated with speech production and inner speech, was no longer observed. The absence of L-IFG and L-PCG was unlikely to be caused by a mere lack of statistical power, as an ALE analysis of a smaller sample of phonological judgement studies (N = 7) observed significant convergence over a cluster encompassing both the L-PCG and L-IFG (Appendix A2). On balance, the results are likely to reflect a genuine lack of convergence in these areas across different paradigms, as supported by their poor fail-safe N scores. This evidence (or the lack thereof) aligns with the argument made by Hurlburt et al. (2016), Tian et al. (2016) and this paper, that mechanisms in addition to corollary discharge must be considered when modelling different forms of inner speech.
However, we did not observe significant convergence within the speech perception areas either, which seemed to contradict the hypothesised involvement of perceptual simulation in inner speech. The lack of perceptual convergence may reflect differential levels of perceptual simulation along egocentricity and spontaneity. When inner speech is more egocentric and deliberate, it may be generated predominantly by corollary discharge, which can attenuate neural activity within speech perception areas (Ford et al., 2021; Hurlburt et al., 2016; Leube et al., 2010; Shergill et al., 2013). At opposite ends of these dimensions, perceptual simulation is more strongly engaged and activation is more likely to converge in speech perception areas.

Inner speech as a function of egocentricity
We hypothesised that high egocentricity inner speech would primarily activate the corollary discharge mechanism while low egocentricity inner speech would primarily activate the perceptual simulation mechanism. We predicted that the former would be associated with more consistent activations in speech production areas such as the L-IFG, L-PMC and SMA, whereas the latter would be associated with converging activations in speech perception areas (e.g., L-STG/STS) and in the memory network (e.g., L-MTG, L-MFG, L-SPL/PC).
The ALE analysis confirmed that high egocentricity inner speech was indeed associated with converging activations in L-IFG, L-PMC and the L-MFG / L-SMA, as well as the right insula and right culmen. The convergence was consistently detected in both the unbalanced and the balanced datasets, suggesting that it was unlikely to be skewed by any particular paradigm. ALE analysis of low egocentricity inner speech did not reveal significant convergence over speech perception areas or activations in the memory network. Instead, we observed significant convergence in a region encompassing the right insula (R-Insula) and right inferior frontal gyrus (R-IFG) across both datasets. The lack of convergence over speech perception and memory regions could reflect different levels of perceptual simulation along the dimension of spontaneity at low egocentricity, and could indicate that perceptual simulation only gains predominance when inner speech is both of low egocentricity and spontaneous. This type of inner speech is under-investigated and as such is not represented in our dataset, with all studies examining low egocentricity but deliberate inner speech. The observed convergence over the right insula and inferior frontal gyrus is unlikely to reflect articulatory or phonological representations, which were found in a meta-analysis by Vigneau et al. (2011) to be located exclusively in the left hemisphere. Activations of these regions are, however, associated with auditory verbal hallucinations, the majority of which are heard voices in second or third persons and low in egocentricity (Sommer et al., 2008). Further research on non-hallucinating participants suggests these right-hemisphere homologues may play a role in detecting unexpected self-voice changes (Johnson et al., 2021). While convergence across these regions could therefore indicate a greater demand on self-monitoring or an inherent inaccuracy of recreating acoustic representations of other voices as compared to one's own voice, there is currently insufficient research to evaluate or elaborate on this potential link.

Inner speech as a function of spontaneity
Within the dimension of spontaneity, we hypothesised that deliberate inner speech would preferentially recruit the corollary discharge mechanism, with spontaneous inner speech favouring the perceptual simulation mechanism. Similarly, we predicted that the former would be associated with increased convergence in speech production areas such as L-IFG, L-PMC and SMA, whereas the latter would be associated with converging activations in speech perception areas (e.g., L-STG/STS) and in the memory network (e.g., L-MTG, L-MFG, L-SPL/PC).
Largely in line with our predictions, analysis of deliberate inner speech yielded significant convergence of activation over speech production regions. Specifically, clusters of L-IFG, L-PMC, L-SMA, and parts of the right cerebellum (R-Culmen) were significant in the unbalanced dataset, but only L-PMC and L-SMA were consistently observed in the balanced dataset. The lack of convergence in L-IFG is of interest as it is invariably a part of the corollary discharge network according to computational and neuroanatomical models of speech production (Chen et al., 2011; Tourville and Guenther, 2011). For example, the DIVA model of speech production proposes that the left inferior frontal gyrus contains a speech sound map which serves as a repository of speech motor programs for each phonemic, syllabic or multi-syllabic sound a speaker might want to produce (Tourville and Guenther, 2011), with the motor commands contained within each motor program then representing the efference copies which are passed into forward models. The lack of observed L-IFG convergence in the balanced dataset could suggest that its involvement is not ubiquitous across all inner speech paradigms which deliberately generate inner speech. Given our previous proposal that the association between inner speech and the L-IFG was driven by the predominance of phonological-judgement paradigms in inner speech research, we conducted a post-hoc ALE analysis of the deliberate inner speech studies which were classed as phonological-judgement tasks in the Contrast Selection & Grouping stage of data analysis (see Section 2.3). The results of the ALE analysis demonstrated convergence over the L-IFG across phonological judgement tasks (Appendix A2). While the subdivision of the deliberate study pool reduces statistical power, justifying a degree of caution, this preliminary finding raises two questions: (1) Does L-IFG convergence in phonological judgement tasks represent a subprocess specific to phonological judgement (e.g. speech segmentation; Burton, 2001), rather than inner speech per se? (2) If L-IFG activation during phonological judgement does represent the generation of inner speech, does this indicate that other paradigms recruit different neurocognitive mechanisms to generate inner speech?
Analysis of spontaneous studies revealed significant convergence in L-MTG in both unbalanced and balanced datasets, with a substantial portion of the cluster encompassing the L-STG in both datasets. Significant convergence over speech perception brain regions (L-STG) aligns with our proposal that spontaneous inner speech preferentially relies on the perceptual simulation of speech within speech perceptual regions.
The inclusion of studies utilising two distinct paradigms, mind wandering sampling (Grandchamp et al., 2019; Raij and Riekki, 2017) and direct quotation reading (Alderson-Day et al., 2020; Yao et al., 2011), provides some evidence that the convergence over the L-MTG and L-STG is not attributable to a specific paradigm. This possibility was further examined via jackknife analysis, with results not reaching the predetermined robustness threshold. While this is typically an indicator of results being driven by an outlier study, the applicability of jackknife analysis to datasets with few studies is uncertain given the large proportion of data being removed with each iteration (e.g. 25 % in a four-study dataset). It is also noteworthy that these results align with the findings of an additional study not included in the GingerALE analysis, Hurlburt et al. (2016), in which task-elicited inner speech was associated with increased activation of the left inferior frontal gyrus and spontaneous inner speech was associated with increased activation of speech perception brain regions. Notably, Hurlburt et al. (2016) observed increased activation of Heschl's gyrus rather than the L-STG, but this discrepancy can likely be explained by their use of a region-of-interest approach which did not include the L-STG. The convergence of the cluster on portions of the L-MTG is also of interest. Given both the proximity and contiguity of the L-MTG cluster to the L-STG, and some evidence suggesting its involvement in the phonological processing of speech, it is plausible that this role relates to the phonological processing of the elicited inner speech (Ashtari et al., 2004). A role for the L-MTG in inner speech would align with previous findings suggesting that structural and connectivity abnormalities of the L-MTG are involved in the pathogenesis of auditory verbal hallucinations in schizophrenia (Cui et al., 2018; Zhang et al., 2017). However, the exact role the L-MTG plays in inner speech and auditory verbal hallucinations, and its relation to the proximate L-STG/STS, remains unclear.

Methodological considerations
Activation-likelihood estimation provides a useful approach to address some of the weaknesses of individual neuroimaging studies.By calculating converging regions of neural activation across studies with distinct paradigms, ALE can help distinguish between paradigm-specific correlates which might not directly subserve the investigated behaviour, and paradigm-independent correlates which are more likely parts of the core neural circuit of interest.By pooling together numerous studies, ALE also allows for an increased power to detect true effects (Acar et al., 2018).However, there remain several considerations which should be made when interpreting the meta-analytical data.As explored in the introduction, a fundamental shortcoming within the inner speech neuroimaging literature is the predominance of task-elicited inner speech paradigms and relative lack of spontaneous inner speech experiments.This imbalance was reflected in our dataset, with a small pool of spontaneous inner speech experiments.A similar challenge exists within the egocentricity dimension.Despite inner speech experiences in day-to-day life often following a dialogic structure (Fernyhough, 1996(Fernyhough, , 2004)), a comparatively small number of studies investigated low egocentric or dialogic inner speech as compared to high egocentric inner speech.This underlines an apparent tendency within the inner speech neuroimaging literature to adopt paradigms based on the ease of their implementation as opposed to their similarity to day-to-day inner speech.In both low egocentricity and spontaneous conditions, this led to a smaller pool of studies than ideal and prevented more comprehensive analysis into the effects of specific paradigms and the contrasting of dimensions.
A further limitation of this activation-likelihood estimation study and the ALE technique more generally, is that they analyse fMRI or PET data which are inherently correlational.Whilst this can be used to identify relationships between neural activation and behaviour, the degree to which behaviour is caused by that neural activation cannot be easily determined using these observational techniques.The results of these analyses could therefore serve as an empirical and theoretical basis on which future, causal research may be based.One avenue for further causal research could involve the use of brain stimulation techniques to disrupt processing within the speech production and speech perception regions, individually, as performance in various inner speech tasks is recorded.Using the model and predictions laid out in this paper, specific hypotheses can be made as to which tasks would be impaired by suppression of speech production regions as compared to speech perception regions.
Despite these shortcomings, the ALE findings are the result of best efforts given the current state of the literature, and serve to highlight the importance of interpreting inner speech as a phenomenon which can vary in its phenomenology, sensorimotor properties and neural correlates. While providing evidence in support of a model which explains a diverse range of findings within the neuroimaging literature, the meta-analysis also underscores the need for future research to incorporate a more diverse range of analytical techniques and elicitation paradigms to fully elucidate the mechanisms by which inner speech can be generated.

The utility of the current framework
The conceptual aim of the current framework was to explain the mechanisms by which inner speech can be generated and the variables that influence these mechanisms. The results of the ALE analyses broadly support this aim. First, we provide evidence that a framework classifying inner speech across egocentricity and spontaneity dimensions can allow for the identification of different neural circuits during inner speech generation. In turn, these various neural circuits indicate that inner speech is generated via multiple, distinct mechanisms. Second, by centring the framework around two fundamental dimensions (egocentricity and spontaneity) inherent to all varieties of inner speech, the framework also allows existing studies to be placed within the two dimensions post hoc. This helps identify which subtypes of inner speech are well documented within the research literature, and which subtypes remain under-investigated (e.g. low egocentricity and spontaneous subtypes). Finally, although not directly tested in the ALE analyses, the framework allows diverse phenomenologies to be easily mapped onto the two dimensions. This mapping can then be used to generate predictions about the neural correlates and cognitive mechanisms associated with the generated inner speech.
The lack of reliable L-IFG convergence challenges the predominant view that inner speech is invariably generated by the motor speech production system (Alderson-Day and Fernyhough, 2015; Jones and Fernyhough, 2007). L-IFG involvement in inner speech could instead be restricted to specific paradigms, with a preliminary analysis indicating that phonological judgement tasks are strongly associated with the L-IFG. This is notable as phonological judgements are commonly used to reliably induce inner speech in research settings. Future studies should weigh the convenience of phonological judgements as an inner speech induction technique against the possibility that they engage distinct neural and cognitive mechanisms when compared to other inner speech subtypes.
It is also notable that the current framework failed to predict the lack of convergence over the L-STG/STS in low egocentricity studies, with convergence instead being observed over the R-IFG. Given that the R-IFG is not commonly implicated as a region causally involved in the generation of inner speech, further research elucidating the neural and cognitive mechanisms driving low egocentricity inner speech is required. As the pool of low egocentricity studies consisted entirely of studies which were also deliberate, it remains unclear whether the observed neural correlates are specific to studies which are both low in egocentricity and deliberate, or whether they are a feature of low egocentricity inner speech more broadly. The investigation of inner speech which is low in egocentricity and spontaneous represents a compelling area for future research, given the current paucity of research and its regular occurrence within day-to-day inner speech experiences (McCarthy-Jones and Fernyhough, 2011).
The broader development and testing of the framework also exposed a relative lack of research investigating the exact mechanisms and neural correlates driving perceptual simulation in inner speech. This is of interest as the concept of perceptual simulation has received wide attention in explaining other types of sensory imagery, such as visual imagery (Ranganath and D'Esposito, 2005; Reddy et al., 2010). The precise involvement of different neural networks (perception, memory, lexical) in the perceptual simulation of speech therefore remains a topic requiring further consideration and empirical investigation. Moreover, the exact processes which facilitate the modification and combination of stored memories in order to create novel speech experiences remain poorly understood. Future research could examine the extent to which the more comprehensive literature on memory modification and reactivation within the visual cortex applies to the auditory cortex and inner speech (Favila et al., 2020).
The results of the ALE analyses yield patterns of neural activation that differ from those observed in Grandchamp et al. (2019). Grandchamp et al. (2019) observed consistent L-IFG activation throughout their investigation of inner speech across dialogality and intentionality dimensions, therefore lending support to a purely corollary discharge approach. However, our ALE analyses found convergence over the L-IFG to be particularly unreliable, as determined by both observed convergence across conditions and fail-safe N / jackknife analyses. Whilst the reasons for this divergence are difficult to determine without carrying out further analysis, it is notable that Grandchamp et al. (2019) predominantly utilised highly intentional inner speech tasks, some of which also involved semantic processing. For example, both the monologal self-voice inner speech condition and the monologal other-voice inner speech condition required participants to generate definitions for a visually presented object. Within these conditions, it is plausible that activation of the L-IFG could reflect semantic processing during object name retrieval (Krieger-Redwood and Jefferies, 2014) rather than inner speech per se. However, the involvement of the L-IFG in the verbal mind wandering condition remains less clear given the lack of a significant semantic component to the task.
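The logic of a jackknife (leave-one-experiment-out) robustness check can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the present analyses: the modelled activation values are invented, the fixed `threshold` stands in for a proper permutation-based significance test, and the function names (`ale_scores`, `jackknife_survival`) are hypothetical.

```python
from math import prod

def ale_scores(ma_maps):
    """ALE score per position as the union of per-experiment MA values."""
    n_positions = len(ma_maps[0])
    return [1.0 - prod(1.0 - ma[v] for ma in ma_maps) for v in range(n_positions)]

def jackknife_survival(ma_maps, position, threshold):
    """Fraction of leave-one-out runs in which `position` stays at or above
    `threshold`. The threshold is an illustrative stand-in for significance."""
    n_experiments = len(ma_maps)
    survived = 0
    for i in range(n_experiments):
        reduced = ma_maps[:i] + ma_maps[i + 1:]  # drop experiment i
        if ale_scores(reduced)[position] >= threshold:
            survived += 1
    return survived / n_experiments

# Hypothetical MA values for four experiments at two positions.
# Position 0 is supported modestly by every experiment;
# position 1 is driven almost entirely by a single experiment.
ma_maps = [
    [0.4, 0.9],
    [0.4, 0.0],
    [0.4, 0.0],
    [0.4, 0.0],
]

robust = jackknife_survival(ma_maps, position=0, threshold=0.7)
fragile = jackknife_survival(ma_maps, position=1, threshold=0.7)
```

Under these assumptions, convergence at position 0 survives the removal of any single experiment, whereas convergence at position 1 disappears whenever its one contributing experiment is left out. This is the sense in which convergence that depends on a small number of experiments is described as unreliable.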
It is also notable that Grandchamp et al. (2019) reported minimal activation of L-STG and L-MTG during their low intentionality task, which is at odds with Hurlburt et al. (2016) and our ALE analysis of spontaneous studies. Grandchamp et al. (2019) propose that the absence of L-STG / L-MTG activation in their study could be explained by their inclusion of verbal mind wandering experiences which were more condensed than those in Hurlburt et al. (2016). Although plausible, it is unclear from a neurocognitive perspective why condensed inner speech would not result in any activation of speech perceptual regions when compared to an implicit baseline, nor is it clear to what extent the analysed experiences were actually condensed. We judge Grandchamp et al.'s (2019) alternative explanation to be more likely: that the lack of L-STG / L-MTG activation in low intentionality inner speech was caused by insufficient statistical power to detect the effect. It is also plausible that the task methodology, which required participants to report the timing of the mind wandering experiences after the 30-s trial, produced timing data which were not accurate enough to isolate verbal mind wandering experiences from other cognitions during fMRI modelling and analysis. Nevertheless, Grandchamp et al.'s (2019) dimension of condensation remains an area worthy of further elucidation and could explain some of the divergent findings within our analyses. Given that it was excluded from our framework in part due to ambiguity in implementation, testing, and evidence, it is a concept worth revisiting when a larger corpus of research is available.
Given the finding that L-IFG activation is not an invariable feature across all forms of inner speech, and that there are more general differences in neural correlates across egocentricity and spontaneity of inner speech, we argue that our current framework is of significant utility when compared to models which posit that a motor route of generation subsumes all inner speech subtypes. Our pragmatic approach views inner speech as a dynamic phenomenon which varies in its phenomenological attributes and mechanisms of generation. Whilst we provide one framework which seeks to explain the relationship between phenomenological attributes and neurocognitive mechanisms, it is clear that further research on less studied inner speech subtypes (e.g. spontaneous inner speech in other voices) is vital to refining the model and developing a complete understanding of how inner speech is implemented in the brain. There is also a need for research to investigate the exact causal mechanisms by which corollary discharge and perceptual simulation operate, since much of the available research, including our own, investigates these mechanisms only at a correlational level. The benefits of a more complete understanding of inner speech are not limited to basic research, but could have a tangible impact on translational studies. For example, accurate and reliable functional mapping of the brain regions involved in inner speech generation could maximise the efficacy of brain stimulation interventions for auditory verbal hallucinations, a therapeutic approach which has yielded mixed results to date (Moseley et al., 2015).

Conclusion
In line with studies highlighting the diverse nature of inner speech (Alderson-Day et al., 2018; Hurlburt et al., 2016), the results of the ALE meta-analysis further demonstrate that distinct neural mechanisms are differentially engaged for inner speech that varies in its egocentricity and spontaneity. In particular, speech production areas implicated in the motor route of generation are consistently engaged in highly egocentric and deliberate inner speech, but not in inner speech which is low in egocentricity or spontaneous. The current study makes three important contributions. First, it provides evidence that varieties of inner speech are supported by more than one neural mechanism. Second, it provides a flexible and useful cognitive framework that bridges the diverse phenomenology of inner speech and the two underlying neural mechanisms. Third, it demonstrates that our current understanding of inner speech is highly skewed by paradigms that require explicit phonological judgements. It is crucial that we test different types of inner speech across a range of paradigms to triangulate the neurocognitive mechanisms that causally produce various forms of inner speech, as well as the auxiliary mechanisms that underpin inner speech (e.g., working memory, attention, verbal monitoring, Theory of Mind). In conclusion, the present study provides a novel contribution to the research literature by showing that different neural mechanisms are engaged for inner speech that varies in its egocentricity and spontaneity. It also provides a flexible cognitive framework that bridges the phenomenology of inner speech and its underlying neural mechanisms. The study highlights the importance of testing different types of inner speech across a range of paradigms to better understand the neurocognitive mechanisms that causally produce and support inner speech.

Fig. 1. Schematic illustration of the egocentricity × spontaneity framework, with its underpinning neural mechanisms and its hypothetical mapping with phenomenological qualities. Note: Red colour represents the involvement of the corollary discharge mechanism. Blue colour represents the involvement of the perceptual simulation mechanism. Highlighted brain areas on the four brains indicate which and to what extent brain areas would be activated in the four quadrants of this framework (colour saturation levels indicate the strengths of involvement/activation along the northwest-southeast diagonal). Brain areas in red represent speech production (planning) regions including the left inferior frontal gyrus, left premotor cortex and supplementary motor area. Regions in blue represent speech perception regions in the left superior temporal cortex. The black arrow indicates the pathway along which efference copies are sent from the production areas to the perception areas. The fronto-parieto-temporal memory network is expected to be involved in the perceptual simulation mechanism but is not drawn to keep the illustration simple and tidy. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 2. Flow diagram of literature review process.Notation boxes in the screening section represent the various reasons for study exclusion and the number of studies excluded.

Fig. 3. Areas showing a significant ALE statistic across all studies, shown at FWE p < 0.05 at the cluster-level.

Fig. 4. Areas showing significant ALE scores in High Egocentricity (red) and Low Egocentricity (blue) studies at FWE p < 0.05 at the cluster-level. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. Areas showing significant ALE scores in Spontaneous (blue) and Deliberate (red) studies at FWE p < 0.05 at the cluster-level. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. A2. ALE scores in an analysis of phonological judgement studies at FWE p < 0.05 at the cluster-level.

Table 1
List of eligible studies which were included in the meta-analysis.

Table 2
Significant clusters across all studies in the balanced and unbalanced datasets.

Table 3
Clusters showing a significant ALE statistic across high egocentricity and low egocentricity studies, respectively. Shown at FWE p < 0.05 at the cluster-level.

Table 4
Clusters showing a significant ALE statistic across Spontaneous and Deliberate studies, respectively. Shown at FWE p < 0.05 at the cluster-level.