A meta-analysis of fMRI studies of language comprehension in children

The neural representation of language comprehension has been examined in several meta-analyses of fMRI studies with human adults. To complement this work from a developmental perspective, we conducted a meta-analysis of fMRI studies of auditory language comprehension in human children. Our analysis included 27 independent experiments involving n = 625 children (49% girls) with a mean age of 8.9 years. Activation likelihood estimation and seed-based effect size mapping revealed activation peaks in the pars triangularis of the left inferior frontal gyrus and bilateral superior and middle temporal gyri. In contrast to this distribution of activation in children, previous work in adults found activation peaks in the pars opercularis of the left inferior frontal gyrus and more left-lateralized temporal activation peaks. Accordingly, brain responses during language comprehension may shift from bilateral temporal and left pars triangularis peaks in childhood to left temporal and pars opercularis peaks in adulthood. This shift could be related to the gradually increasing sensitivity of the developing brain to syntactic information.


Introduction
Hemodynamic activity during language comprehension has been extensively examined using fMRI in adults. Since typical sample sizes of single studies range from about 10 to 30 participants, meta-analytic methods have been used to increase statistical power and detect robust effects across experiments (Binder et al., 2009; Ferstl et al., 2008; Rodd et al., 2015; Vigneau et al., 2006). Following this approach, three canonical regions underlying language comprehension in adults were consistently found: the left inferior frontal gyrus (IFG), the left middle temporal gyrus (MTG), and the left superior temporal gyrus (STG; Binder et al., 2009; Ferstl et al., 2008; Rodd et al., 2015; Vigneau et al., 2006). While the pars opercularis of the left IFG and the left STG were related to syntactic processing (Rodd et al., 2015; Vigneau et al., 2006), the pars triangularis and orbitalis of the left IFG and the left MTG were related to semantic processing (Binder et al., 2009; Ferstl et al., 2008). When pooling across tasks, peak activation was localized in the pars opercularis of the left IFG and MTG (Rodd et al., 2015).
A growing body of literature has reported fMRI results obtained from language comprehension experiments with children. At the word level, experiments have targeted phonological processing with the two-word rhyme judgement task (Cao et al., 2008; Cone et al., 2008; Desroches et al., 2010) and the first-sound matching task (Raschle et al., 2014), semantic processing with the noun categorization task (Balsamo et al., 2006), and (morpho-)syntactic processing with the morphological awareness task (Arredondo et al., 2015). At the sentence level, the description definition task has been frequently used (Bartha-Doering et al., 2018; Berl et al., 2014; Moore-Parks et al., 2010). In this task, children listen to a noun preceded by a short description. They are asked to judge whether the two are matching (e.g. "A long yellow fruit is a banana") or not (e.g. "Something you sit on is a spaghetti"). Other tasks include a judgment on whether two semantically or syntactically manipulated sentences convey the same meaning (Borofsky et al., 2010; Nuñez et al., 2011), the semantic/syntactic acceptability task (Brauer et al., 2011), and the sentence-picture matching task. Moreover, the passive listening task has also been employed in a number of studies (Knoll et al., 2012; Monzalvo et al., 2012). This task has also often been used for stories (Horowitz-Kraus et al., 2016; Romeo et al., 2018; Sroka et al., 2015), sometimes additionally asking children to answer content-related questions or retell the stories (Vannest et al., 2019).
While meta-analyses of fMRI studies of language comprehension in children are currently not available, the existing literature has been synthesized in one qualitative and one quantitative review (Skeide and Friederici, 2016; Weiss-Croft and Baldeweg, 2015). These reviews suggest that activity in left-lateralized regions in the IFG, MTG, and STG known from the adult literature is already broadly established by 3 years of age. Focused activity in the pars opercularis of the left IFG, however, emerges only gradually towards adulthood when children become more sensitive to syntactic information (e.g. morphology, word order). Younger children, in contrast, rely more on semantic information and thus more strongly recruit the pars triangularis of the left IFG (Skeide and Friederici, 2016; Weiss-Croft and Baldeweg, 2015).
Here we conducted the first statistical synthesis of the fMRI literature on language comprehension in children. To this end, we quantified the overlap of hemodynamic activations reported in previous studies using activation likelihood estimation (ALE; Eickhoff et al., 2012; Turkeltaub et al., 2002) and seed-based effect size mapping (SDM; Albajes-Eizagirre et al., 2019). Following the currently available original and review articles, we hypothesized two major differences in the activation patterns associated with language processing in children compared to adults. First, we expected activation peaks in the temporal cortex to be less strongly lateralized to the left hemisphere. This hypothesis was based on a number of individual studies in children which have found significant clusters of activation not only in the left MTG and STG, but also in the right MTG and STG. The latter effect is typically not consistently found in adults (e.g. Holland et al., 2007; Horowitz-Kraus et al., 2015; Sroka et al., 2015; Szaflarski et al., 2006). Second, we hypothesized that the distribution of activity in the left IFG would differ such that peak activity in children would be found in the pars triangularis, while peak activity in adults would be found in the pars opercularis. This hypothesis was based on previous work indicating that at least until their first years in school, children rely more strongly on semantic information (typically associated with enhanced recruitment of the pars triangularis of the left IFG) and only gradually become more sensitive to syntactic information (typically associated with peaks of activation in the pars opercularis of the left IFG; e.g. Nuñez et al., 2011; Skeide et al., 2014).

Literature search
The PubMed database (https://www.ncbi.nlm.nih.gov/pubmed/) was used to identify articles containing the terms "fMRI AND language AND children" or "functional MRI AND language AND children" in their respective title or abstract. As of August 2019, this search yielded 356 results after removing duplicate entries. These results were screened to exclude any articles that did not meet one or more of the following predefined inclusion criteria: (1) The article was written in English.
(2) There was at least one group of healthy, monolingual children with a mean age between 3 and 15 years. The lower boundary of this age range was set by the feasibility of task-based fMRI studies in children and the upper boundary was chosen to include a gap of 3 years until adulthood (18 years). (3) The children completed a natural language task during fMRI scanning, not an artificial language task (e.g. scrambled syllables). (4) The authors conducted a random-effects analysis using a general linear model to obtain whole-brain within-group results. (5) Peak coordinates were reported in Talairach or Montreal Neurological Institute (MNI) space.
Thus, articles only reporting the results of conjunction analyses of multiple tasks (e.g. reading and listening tasks) or groups (e.g. children and adults or children with and without reading difficulty) were excluded to maintain the main focus of our analysis on the activation associated with specific types of language tasks in typically developing children. Applying these criteria, we identified 41 eligible articles, 37 of which used language comprehension tasks, whereas four deployed language production tasks (e.g. overt or covert verb generation tasks). As previously noted by Weiss-Croft and Baldeweg (2015), the small number of language production experiments might be explained by the problem of speech-related movement artifacts, which is aggravated in studies with children.
In a recent simulation study, Eickhoff et al. (2016) demonstrated that at least 17 to 20 independent experiments should be included in any ALE-based meta-analysis of neuroimaging data. This ensures acceptable robustness and power by minimizing the chance that the results are driven by single experimental results and by maximizing the chance to detect small- and medium-size effects. Accordingly, we were not able to run separate analyses for both comprehension and production tasks. Instead, we decided to narrow down our analysis to those articles investigating language comprehension. Of these, 27 articles included at least one condition in which stimuli (words, sentences, or stories) were presented auditorily, whereas ten articles exclusively used visual stimuli. We excluded these ten reading experiments because, as before, their number was insufficient to conduct a robust and sufficiently powered ALE-based meta-analysis (Eickhoff et al., 2016), thus focusing our analysis on auditory language comprehension experiments. Finally, two of the remaining articles were excluded because fMRI was recorded while children were asleep in the scanner (Redcay et al., 2008). One further article (Monzalvo et al., 2012) was excluded because the same contrast and group of children had already been included as part of another, more comprehensive publication (Monzalvo and Dehaene-Lambertz, 2013). This entire selection process, which is summarized as a flowchart in Fig. 1, yielded a final sample of 24 articles that could be included in the present meta-analysis.
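The article counts in the selection process can be retraced with simple arithmetic. The following sketch only restates the numbers given in the text; the variable names are illustrative and not part of the original analysis.

```python
# Article counts as reported in the text (variable names are illustrative).
eligible = 41                                   # articles meeting all inclusion criteria
production = 4                                  # language production tasks
comprehension = eligible - production           # language comprehension tasks

visual_only = 10                                # reading experiments only
auditory = comprehension - visual_only          # at least one auditory condition

asleep = 2                                      # children asleep during scanning
duplicate_sample = 1                            # contrast and sample already included elsewhere
final_sample = auditory - asleep - duplicate_sample

print(comprehension, auditory, final_sample)
```

Each intermediate value should match the corresponding count in the paragraph above (37 comprehension articles, 27 auditory articles, 24 articles in the final sample).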
Post hoc, we included various related search terms ("brain", "function*", "magnet*", "BOLD", "child", "development", and all names of the tasks used in the already included studies). Furthermore, we screened the reference lists and tables of the original articles and reviews. These screening procedures did not reveal any additional suitable studies that were not yet captured by the initial PubMed search.

Activation likelihood estimation
To identify converging activation across the experiments reported in these articles, we conducted an activation likelihood estimation (ALE; Eickhoff et al., 2012; Turkeltaub et al., 2002) as implemented in the GingerALE software, version 3.0.2 (http://brainmap.org/ale/). ALE performs a coordinate-based meta-analysis of the peak coordinates reported in fMRI experiments to determine where in the brain results converge at an above-chance level. "Experiment" refers to one type of task (i.e. auditory language comprehension) in one specific sample (i.e. one group of children). Hence, multiple fMRI contrasts reported within a single article constitute multiple independent experiments if they are obtained from different samples, but should be pooled into a single experiment when they are obtained from the same sample to control for within-group effects (Eickhoff et al., 2017; Turkeltaub et al., 2012). In our meta-analysis, the 24 articles reported 32 contrasts for auditory language comprehension. Ten of these contrasts were investigated in identical or overlapping samples and their foci were thus pooled into one experiment (see Table 1). This procedure resulted in 27 experiments reporting 453 foci in total (mean = 16.8, median = 11 per experiment). Eight of these experiments reported foci in Talairach space and were converted to MNI space using the icbm2tal function as implemented in GingerALE (Lancaster et al., 2007).

Fig. 1. PRISMA flowchart of the selection process for included articles. An initial screening of the 356 articles listed on PubMed (as of August 2019) revealed 41 articles that reported whole-brain coordinates in standard space obtained from fMRI experiments targeting language processing in healthy children. Of these, 24 articles using auditory language comprehension tasks were included in the statistical analysis. ICA = independent component analysis.
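The pooling rule for contrasts from the same sample can be illustrated with a minimal sketch. It assumes that every contrast carries a hypothetical `sample_id` that already resolves identical or overlapping samples to the same label; the coordinates and labels below are made up for illustration and do not come from the included studies.

```python
from collections import defaultdict

def pool_contrasts(contrasts):
    """Merge the foci of all contrasts that share a sample into one experiment."""
    experiments = defaultdict(list)
    for contrast in contrasts:
        experiments[contrast["sample_id"]].extend(contrast["foci"])
    return dict(experiments)

# Hypothetical toy input: three contrasts, two of them from the same sample.
contrasts = [
    {"sample_id": "study_A", "foci": [(-48, 20, 14), (-56, -34, 2)]},
    {"sample_id": "study_A", "foci": [(52, -28, 4)]},
    {"sample_id": "study_B", "foci": [(-50, 24, 10)]},
]

experiments = pool_contrasts(contrasts)
foci_counts = [len(foci) for foci in experiments.values()]
mean_foci = sum(foci_counts) / len(foci_counts)
print(len(experiments), foci_counts, mean_foci)
```

In this toy case, three contrasts collapse into two experiments. The same logic applied to the real data would turn the 32 contrasts into the 27 experiments described above.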
As a first step, the ALE algorithm created a binary map for each experiment in which all activated voxels were assigned a value of 1 and all other voxels were assigned a value of 0. Next, to account for the uncertainty associated with using condensed peak information instead of parametric whole-brain maps, a three-dimensional Gaussian distribution was fitted around each of these peaks, smoothing out their activation across the neighboring voxels. ALE determines the amount of uncertainty based on the sample size of the respective experiment, with foci from larger samples being smoothed with a narrower kernel (Eickhoff et al., 2009). This resulted in separate mean activation maps for all experiments, which were then combined into a single ALE map using a random-effects approach. To do so, each voxel was assigned a value corresponding to the union of its activation probabilities from the individual mean activation maps. This so-called ALE value indicates, for each gray matter voxel, the degree of convergence in activation between all included experiments. Finally, the map of ALE values was statistically thresholded to check in which voxels convergence could be expected to be above chance level. As recommended on the basis of a recent simulation by Eickhoff et al. (2016), we combined an uncorrected cluster-forming voxel-wise height threshold of p < .001 and a cluster-wise family-wise error (FWE) correction with a threshold of p < .05 based on 1000 random permutations. All voxels surviving this threshold were interpreted as showing above-chance convergence between experiments reflecting the "true" activation associated with auditory language comprehension in children. Local peaks within these significant clusters were assigned their respective anatomical gray matter labels using the Anatomy toolbox, version 2.2c (Eickhoff et al., 2005, 2006, 2007) in SPM12 (https://www.fil.ion.ucl.ac.uk/spm/software/spm12/).
This toolbox provides anatomical labels for peak coordinates in MNI space based on probabilistic maps and, for peaks within the IFG, their probability (in %) of belonging to Brodmann Area 44 (pars opercularis; Amunts et al., 2004). Finally, the Talairach Daemon atlas as implemented in GingerALE (Lancaster et al., 1997, 2000) was used to determine the Brodmann Area of peaks outside the IFG.

Table 1. Descriptive information of the 27 experiments included in the meta-analysis.
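The core of the ALE computation, smoothing each focus with a Gaussian and taking the voxel-wise union of probabilities across experiments, can be sketched on a one-dimensional toy grid. This is a simplified illustration and not the GingerALE implementation: the grid, the foci, and the kernel widths are hypothetical, the kernel width is passed in directly rather than derived from sample size, and taking the maximum over the foci of one experiment is an assumption made here so that nearby foci from the same experiment do not add up.

```python
import math

def activation_map(foci, n_voxels, sigma):
    """Per-experiment activation map on a 1-D grid (toy stand-in for a 3-D volume)."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    ma = []
    for voxel in range(n_voxels):
        values = [norm * math.exp(-((voxel - f) ** 2) / (2.0 * sigma ** 2)) for f in foci]
        # Assumption: a voxel takes the largest value over the experiment's foci.
        ma.append(max(values, default=0.0))
    return ma

def ale_union(ma_maps):
    """ALE value per voxel as the union of probabilities: 1 - prod_i (1 - MA_i)."""
    ale = []
    for voxel in range(len(ma_maps[0])):
        p_none = 1.0
        for ma in ma_maps:
            p_none *= 1.0 - ma[voxel]
        ale.append(1.0 - p_none)
    return ale

# Hypothetical experiments: (foci, sigma); a larger sample would get a smaller sigma.
toy_experiments = [([10], 2.0), ([11], 3.0), ([30], 2.0)]
maps = [activation_map(foci, 40, sigma) for foci, sigma in toy_experiments]
ale = ale_union(maps)
print(round(ale[10], 3), round(ale[30], 3))
```

Voxel 10 lies close to foci from two experiments and therefore receives a higher ALE value than voxel 30, which is supported by a single experiment. The permutation-based thresholding described in the text is not part of this sketch.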

Comparison with previous adult meta-analysis
The pattern of activation associated with language comprehension in children was compared to previous meta-analytic work on language comprehension in adults. To achieve this, we reproduced the meta-analysis by Rodd et al. (2015), which included 54 studies on semantic and syntactic language processing with a total of 957 adult subjects and 320 foci. Details of the literature search, inclusion criteria, and meta-analytic methods can be found in the original publication (Rodd et al., 2015). For the purpose of the present study, we deviated from the original analysis in two aspects. First, we excluded any experiments using visual stimuli (i.e. reading experiments), in line with our meta-analysis in children, which included only experiments using auditory stimuli. This resulted in a subset of 23 studies in adults with a total of 431 subjects and 105 foci. Second, in the original publication, data were thresholded and corrected based on the false discovery rate. In contrast, here we used an uncorrected cluster-forming voxel-wise height threshold of p < .001 and a cluster-wise FWE-corrected threshold of p < .05, which is identical to the threshold that we used for analyzing the data of the children. The cluster-wise FWE-corrected threshold was preferred in accordance with a recent simulation study by Eickhoff et al. (2016). These authors demonstrated that this threshold provides the highest statistical power and therefore the highest sensitivity to detect "true" effects that were known a priori by simulating the data. At the same time, this threshold turned out not to inflate the number of spuriously significant clusters. Thresholding using the false discovery rate as in Rodd et al. (2015), on the other hand, was shown to lead to both substantially reduced statistical power and an increase in the number of spurious clusters.
After obtaining the thresholded ALE map of the adult experiments, we compared it statistically to the ALE map of experiments in children. To this end, we subtracted ALE maps to identify clusters where activation was found more consistently in one group compared to the other group (children > adults, adults > children). Additionally, we created a conjunction map showing similarities in activation between the two groups. In each case, the resulting ALE map was thresholded using an uncorrected cluster-forming voxel-wise height threshold of p < .001 and a cluster-wise FWE-corrected threshold of p < .05. Clusters that were significant at this level were anatomically labeled using the Anatomy toolbox in SPM12 and assigned to a Brodmann Area based on the Talairach Daemon atlas.
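The two map operations used for the group comparison can be sketched as voxel-wise arithmetic. This toy example assumes two already thresholded ALE maps of equal length with zeros outside significant clusters; the values are hypothetical, and the significance testing of the difference map is omitted.

```python
def subtract_maps(map_a, map_b):
    """Voxel-wise difference map_a - map_b (positive where map_a converges more)."""
    return [a - b for a, b in zip(map_a, map_b)]

def conjunction_map(map_a, map_b):
    """Voxel-wise minimum of two thresholded maps (non-zero only where both are)."""
    return [min(a, b) for a, b in zip(map_a, map_b)]

# Hypothetical thresholded ALE values for four voxels.
children = [0.0, 0.03, 0.05, 0.0]
adults = [0.02, 0.04, 0.0, 0.0]

children_minus_adults = subtract_maps(children, adults)
adults_minus_children = subtract_maps(adults, children)
conjunction = conjunction_map(children, adults)
print(conjunction)
```

Only the second voxel is non-zero in both maps, so it is the only voxel that survives the conjunction, while the third voxel shows up exclusively in the children > adults difference.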

Seed-based effect size mapping
An alternative approach to statistically synthesize results from multiple fMRI experiments is seed-based effect size mapping (SDM; Albajes-Eizagirre et al., 2019). Similar to ALE, SDM uses a coordinate-based random-effects approach that combines the information of peak coordinates in standard space across multiple experiments. While ALE treats all peak coordinates the same, SDM accounts for the effect size associated with each peak and reconstructs the original parametric maps of the individual experiments before combining them into a meta-analytic map. Hence, while ALE maps quantify the degree of overlap in peak activation across experiments, SDM estimates the effect size of activation or deactivation for each voxel. Although the SDM method is still less commonly used compared to ALE (Acar et al., 2018), we thought it might complement our main results in three aspects. First, the fact that SDM uses a different algorithm than ALE renders it possible to scrutinize the robustness and replicability of the results obtained from ALE. Second, SDM differentiates between voxels with significant activation and deactivation while ALE only captures activation. Finally, SDM makes it possible to include covariates and compute meta-regression analyses as a means to estimate the influence of potentially confounding variables.
We performed the additional SDM meta-analysis using the same peak coordinates as before but adding, whenever possible, their associated t- or z-values (the latter being converted to a t-value). This analysis was conducted using the SDM-PSI software, version 6.11 (https://www.sdmproject.com/). First, effect size maps were built for the 27 individual experiments. This was accomplished by (a) converting the t-value of each peak coordinate into an estimate of effect size (Hedges' g) using standard formulas (Hedges, 1981) and (b) convolving these peaks with a fully anisotropic unnormalized Gaussian kernel (α = 1, FWHM = 20 mm) within the boundaries of the default gray matter template as provided by SDM (voxel size = 2 × 2 × 2 mm). Effect sizes for peaks with unknown t- or z-values were estimated from a threshold-based imputation based on the mean effect size of peaks for which t-values were known. Imputation was conducted separately for groups of experiments with different statistical thresholds. Second, the individual effect size maps were combined using a random-effects general linear model. Third, the statistical significance of activations in the resulting meta-analytic effect size map was examined by comparing it to 1000 random permutations of activation peaks within the gray matter template. Finally, the meta-analytic maps were thresholded using an uncorrected voxel-wise height threshold of p < 0.001 and a cluster-wise extent threshold of k = 50 voxels, which approximately corresponds to the FWE-corrected thresholding procedure implemented in ALE. Peak coordinates of the resulting meta-analytic clusters of activation were anatomically labeled using the Anatomy toolbox in SPM12 and assigned to a Brodmann Area based on the Talairach Daemon atlas.
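As an illustration of step (a), the following sketch converts a peak t-value into Hedges' g. It assumes a one-sample (within-group) t-test with n - 1 degrees of freedom and uses the common small-sample correction factor 1 - 3 / (4 df - 1); the exact formulas applied inside SDM-PSI may differ, and the function name and example values are hypothetical.

```python
import math

def hedges_g_one_sample(t_value, n):
    """Approximate Hedges' g from a one-sample t-value and the sample size n."""
    df = n - 1
    cohens_d = t_value / math.sqrt(n)            # uncorrected standardized effect
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)    # small-sample bias correction
    return correction * cohens_d

# Hypothetical peak: t = 4.0 in a sample of 16 children.
g = hedges_g_one_sample(4.0, 16)
print(round(g, 4))
```

The correction shrinks the effect size slightly, and more strongly so in small samples, which is relevant here because most included experiments tested fewer than 40 children.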
SDM was also used to assess the effect of four potentially confounding variables on the results of the meta-analysis, namely, age (mean age of children in each experiment), baseline (1 = rest/fixation, 2 = active), type of language task (1 = story listening, 2 = decision tasks at the sentence level, 3 = decision tasks at the word level), and software package used for image processing and statistical analysis in the original publication (1 = SPM, 2 = FSL/LIPSIA/AFNI). For each of the four variables, a separate linear model was calculated in SDM to identify clusters that significantly covaried with the respective variable. All preprocessing and thresholding parameters were kept the same as in the main analysis.

Jackknife sensitivity analysis
To explore how potentially spurious results in the literature (e.g. driven by publication bias) would affect the results of our ALE analysis, we conducted a Jackknife sensitivity analysis. To this end, we ran 27 different meta-analyses in ALE, each with a different experiment of the original sample being left out. We visually inspected how well each of these simulations reproduced the original results in terms of number, location, and size of significant ALE voxels. Substantial variability would indicate that the results are driven by the specific study that had been left out, thus compromising the robustness to spurious (e.g. false positive or p-hacked) findings.
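The leave-one-out logic can be sketched independently of the meta-analytic software. In the sketch below, `run_meta_analysis` is a placeholder for a full ALE run, and the toy stand-in simply reports every region named by at least two experiments; the experiment and region labels are hypothetical.

```python
def jackknife(experiments, run_meta_analysis):
    """Repeat the meta-analysis once per experiment, leaving that experiment out."""
    results = {}
    for left_out in experiments:
        subset = {name: data for name, data in experiments.items() if name != left_out}
        results[left_out] = run_meta_analysis(subset)
    return results

def toy_meta_analysis(experiments):
    """Toy stand-in for ALE: regions reported by at least two experiments."""
    counts = {}
    for regions in experiments.values():
        for region in regions:
            counts[region] = counts.get(region, 0) + 1
    return {region for region, count in counts.items() if count >= 2}

# Hypothetical experiments and the regions in which they report peaks.
toy_experiments = {
    "exp1": {"L_IFG", "L_STG"},
    "exp2": {"L_IFG", "R_STG"},
    "exp3": {"L_IFG", "L_STG", "R_STG"},
    "exp4": {"L_IFG"},
}

results = jackknife(toy_experiments, toy_meta_analysis)
stable = set.intersection(*results.values())
print(sorted(stable))
```

In this toy case only one region is reproduced in every leave-one-out run, whereas the other two depend on individual experiments. In the actual analysis, the comparison between runs was done by visual inspection of the resulting ALE maps rather than by set intersection.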

Fail-safe N analysis
To further evaluate the robustness of the present results against unpublished studies with null results in the "file drawer" (e.g. driven by bias towards publishing positive results), we carried out a fail-safe N analysis. The rationale behind this approach is to investigate the effect of iteratively adding null-result experiments to our original sample (Acar et al., 2018). Null-result experiments were created in R, version 3.6.1 (https://www.r-project.org), matching the real experiments in terms of sample size and number of foci reported, but with foci being distributed randomly across the gray matter. Next, new meta-analyses were computed in ALE by iteratively adding one null experiment after another to the original data. For each significant cluster in the original analysis, the fail-safe N was defined as the highest number of null experiments that could be added until the cluster failed to reach statistical significance. Thus, fail-safe N indicates how many fMRI studies with non-significant results could be hidden in the file drawer without compromising the significance of a certain cluster. To increase reliability, the whole procedure was repeated with ten different, randomly generated sets of null-result experiments, each representing one potential file drawer. The mean fail-safe N of these ten simulations was calculated separately for each cluster.
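The iterative procedure can be sketched as follows. The original null experiments were generated in R; this Python sketch is only an illustration under several assumptions: `cluster_is_significant` is a placeholder for a complete ALE run plus thresholding, null experiments reuse the real experiments as templates in a cycling order, and the toy significance rule and coordinates are hypothetical.

```python
import random

def make_null_experiment(template, gray_matter_voxels, rng):
    """Null experiment matching a real one in sample size and number of foci."""
    return {
        "n_subjects": template["n_subjects"],
        "foci": [rng.choice(gray_matter_voxels) for _ in template["foci"]],
    }

def fail_safe_n(real, gray_matter_voxels, cluster_is_significant, max_null, seed=0):
    """Highest number of null experiments that can be added while the cluster survives."""
    rng = random.Random(seed)
    experiments = list(real)
    n_added = 0
    while n_added < max_null:
        template = real[n_added % len(real)]  # assumption: cycle through real experiments
        null_experiment = make_null_experiment(template, gray_matter_voxels, rng)
        candidate = experiments + [null_experiment]
        if not cluster_is_significant(candidate):
            break
        experiments = candidate
        n_added += 1
    return n_added

def toy_significance(experiments):
    """Toy rule: at least 25% of experiments report a focus at the cluster voxel."""
    hits = sum(1 for e in experiments if (0, 0, 0) in e["foci"])
    return hits / len(experiments) >= 0.25

# Hypothetical data: three real experiments, all with a focus at (0, 0, 0);
# null foci are drawn from voxels that never include the cluster voxel.
real = [{"n_subjects": 20, "foci": [(0, 0, 0)]} for _ in range(3)]
voxels = [(x, 0, 0) for x in range(1, 50)]
n_fs = fail_safe_n(real, voxels, toy_significance, max_null=100)
print(n_fs)
```

Repeating the call with different `seed` values and averaging the results would correspond to the ten simulated file drawers described above, although in this deterministic toy case every seed yields the same value.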
Following Acar et al. (2018), we also pre-specified lower and upper boundaries for the fail-safe N of each cluster based on the following considerations. A recent modelling approach to data from the BrainMap database (http://brainmap.org/) indicates that there might be up to 30 unpublished null studies per 100 published neuroimaging studies in the language domain (Samartsidis et al., 2019). Using this conservative estimate of the file drawer effect, we pre-specified that the fail-safe N for each cluster should exceed a lower boundary of eight added null experiments (approximately 30% of the 27 real experiments). The upper boundary was pre-specified as the number of null experiments that could be added so that the real experiments still made up 10% or more of the foci contributing to a particular cluster. This ensures that the significance of a cluster is driven by the majority of experiments instead of a few highly influential ones. Only if the actual fail-safe N obtained from the simulation lies between these two boundaries can the cluster be assumed to be robust against both a potential file drawer effect and hyper-influential effects of a few experiments.
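The lower boundary and the resulting decision rule can be written down in a few lines. The sketch below derives the lower boundary from the 30% file drawer estimate and takes the upper boundary as a given input, because its computation depends on which experiments contribute to each cluster; the function names are illustrative, and the example uses the fail-safe N and upper boundary that this article reports for cluster 4.

```python
def lower_boundary(n_real_experiments, file_drawer_rate=0.30):
    """Number of null experiments implied by the assumed file drawer rate."""
    return round(n_real_experiments * file_drawer_rate)

def classify_fail_safe_n(fail_safe_n, lower, upper):
    """Compare a cluster's fail-safe N against the pre-specified boundaries."""
    if fail_safe_n < lower:
        return "below lower boundary"
    if fail_safe_n > upper:
        return "above upper boundary"
    return "within boundaries"

lower = lower_boundary(27)
cluster_4 = classify_fail_safe_n(115, lower, upper=113)
print(lower, cluster_4)
```

A value below the lower boundary would point to vulnerability to the file drawer effect, whereas a value above the upper boundary would suggest that a small number of experiments drives the cluster.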

Descriptive statistics
Twenty-seven experiments reported in 24 articles published between 2003 and 2019 were included in the present meta-analysis. Participants were 625 typically developing, monolingual children with a mean age of 8.9 years (range: 3-15 years). Gender was approximately equally distributed (49% females) and children were almost exclusively right-handed (96%). Of the 27 experiments, eight involved judgments at the word level, 12 involved judgments at the sentence level, and seven involved listening to spoken stories. A descriptive overview of these experiments is provided in Table 1 and the distributions of mean ages and sample sizes of the experiments are depicted in Fig. 2.

Activation likelihood estimation
Five activation clusters associated with auditory language comprehension in children showed significant convergence across the experiments (p < .05, cluster-wise FWE-corrected). The largest peak was found in the pars triangularis of the left IFG (Brodmann Area [BA] 45). The corresponding cluster extended across the pars opercularis of the left IFG (BA 44) to left middle and superior frontal cortices (BA 46, BA 6), and left precentral cortices (BA 6). Moreover, the cluster extended across the pars orbitalis of the left IFG (BA 47) to the left insula (BA 13). A smaller cluster was detected in the pars triangularis and the pars orbitalis of the right IFG and the right insula. Two other clusters covered the STG (BA 22, BA 41) and the MTG (BA 21, BA 38) bilaterally. Finally, one more cluster was identified in left premotor and anterior cingulate regions (BA 6, BA 8, BA 32, BA 24; Fig. 3, Table 2).

Comparison with adult meta-analysis
A comparison of this pattern of activations associated with language comprehension in children to the pattern observed in adults revealed a number of similarities, including clusters of common activation in the left IFG (BA 13, BA 45), the left MTG and STG (BA 22), the right STG (BA 13, BA 22, BA 41), and the left medial frontal gyrus (BA 6, BA 9; Figs. 4 and 5, Table 3).
Children revealed significantly more consistent activation in the right STG and MTG (BA 21, BA 22), the left medial and superior frontal gyri (BA 6, BA 8, BA 9), the pars triangularis of the IFG (BA 45), the left STG and MTG (BA 21, BA 41), and the left and right insulae (BA 13). Adults showed more consistent activation than children in the pars opercularis of the left IFG (BA 44; Fig. 6, Table 4).

Seed-based effect size mapping
Repeating the meta-analysis using seed-based effect size mapping, we reproduced the five clusters obtained with ALE and their respective peaks in the pars triangularis of the left IFG (BA 45), the right insula (BA 13), bilateral MTG (BA 21), and left premotor cortex (BA 6). One additional cluster not identified by ALE emerged in the left fusiform gyrus. Furthermore, the left frontal and bilateral temporal clusters as obtained from SDM were markedly larger in size (Fig. 7, Table 5).

Effects of potentially confounding variables
None of the potentially confounding variables we examined (mean age of children, type of language task, type of baseline condition, and software package used for statistical analysis) were significantly related to any of the converging activation clusters for language comprehension in children. Changing the cluster-forming threshold of p < .001 to the extremely liberal threshold of p < .05, we found an effect of age in the left supplementary motor area (BA 6), no effects of baseline or software package, and an effect of task in the left supplementary motor area (BA 8), left inferior temporal gyrus (BA 37), and right superior temporal gyrus (BA 21). The effect of age indicated that the older the children within an experiment, the stronger the activation in BA 6. The effect of task indicated that the more complex the auditory speech stimuli used in the experiment (stories vs. sentences vs. words), the stronger the activation in BA 8, BA 37, and BA 21. However, these effects failed to reach significance at the established conservative threshold.

Fig. 4. ALE map of significant clusters associated with language comprehension in adults. These data were reproduced using the sample of studies reported in a previous meta-analysis by Rodd et al. (2015). Maps depict clusters with above-chance overlap (p < .05, cluster-wise FWE-corrected) and their associated ALE value (color bar), that is, the degree of non-random convergence in activation between experiments at any given voxel.

Jackknife sensitivity analysis
In the Jackknife sensitivity analysis, the five significant clusters revealed by ALE were reproduced in all 27 simulations, regardless of which of the original experiments was left out (Table 6).

Fail-safe N analysis
The fail-safe number of null experiments that could be added without altering the significance of the five clusters ranged from N = 24 for cluster 5 (right insula) to N = 115 for cluster 4 (left superior frontal gyrus; Fig. 8). In each case, this number exceeded the required lower boundary of fail-safe N = 8, that is, the maximum number of null studies we estimated to be in the file drawer. Only for cluster 4 (left superior frontal gyrus), the value of fail-safe N = 115 slightly exceeded the desired upper boundary (in this case, fail-safe N = 113), potentially indicating that this cluster was driven by a very small number of experiments (Acar et al., 2018).

Discussion
To our knowledge, here we report the first statistical synthesis of the fMRI literature on auditory language comprehension in healthy children. Meta-analyzing data reported in 24 original research articles with a total sample size of more than 600 children, we detected significant overlap in hemodynamic activation in left IFG and MTG/STG, as well as, to a lesser degree, their right-hemispheric homologues. Compared to a previous meta-analysis in adults, children revealed significantly more consistent activation in bilateral (especially right) STG and the pars triangularis and pars orbitalis of the left IFG, and significantly less consistent activation in the pars opercularis of the left IFG. In contrast to previous reviews, in which results are reported on the level of entire gyri or sulci, the present meta-analysis provides precise coordinates of consistent activation peaks in standard space. This information provides the basis for future region-of-interest studies on language processing.
According to the work of Eickhoff et al. (2016), the statistical power of the current meta-analysis to detect not only large, but also small- and medium-size effects can be assumed to be acceptable. Nevertheless, meta-analytic power is intrinsically limited by the amount of currently available data (27 independent experiments). It should also be noted that most of the included individual experiments relied on sample sizes of 10-40 children (Fig. 2B). This presumably limited their power to detect small- and medium-size effects. These effects, in turn, were not reported as peak coordinates in the respective articles and could therefore not be included in the present analysis (Weiss-Croft and Baldeweg, 2015).
The robustness of the present results to different meta-analytic approaches was confirmed by comparing activation likelihood estimation with seed-based effect size mapping. This comparison revealed that both frameworks generated largely overlapping activation clusters. We also analyzed the robustness of the present findings to publication bias in the literature. To this end, we simulated false positives in the published literature (Jackknife sensitivity analysis) and a file drawer of unpublished studies with non-significant results (fail-safe N analysis). These analyses indicated that all of the identified clusters were robust against deleting single experiments and against adding randomly generated null experiments. The reported differences between children and adults could be in part explained by differences in age, task, and baseline, or also to a lesser degree by the different number of studies included. While we found no evidence for significant effects of age, task, and baseline, we cannot exclude that the results are influenced by the different number of studies which is inherent to the current literature. The lack of an age effect might be explained by the age sampling variability intrinsic to the current literature. Specifically, about 60% of all studies included children with a mean age between 8 and 12 years while the age range of 3-7 years is slightly underrepresented and the age range of 13-15 years is strongly underrepresented (Fig. 2A). This might have limited the statistical power of the meta-analysis to detect age-related differences.

Fig. 5. ALE map of significant clusters associated with language comprehension in both children and adults as a result of a conjunction analysis. Maps depict clusters with above-chance overlap (p < .05, cluster-wise FWE-corrected) in the ALE maps of both children (Fig. 3) and adults (Fig. 4). The color bar represents the voxel-wise minimum convergence between these two images.
Pooling across multiple studies, we provide evidence that the lateralization of language processing to the left hemisphere does not appear adult-like yet at a mean age of about 9 years. This finding is not in line with previous reviews stating that language lateralization is largely established by 3-5 years of age (Skeide and Friederici, 2016;Weiss-Croft and Baldeweg, 2015). It cannot be excluded, however, that the lack of lateralization we observed was overestimated due to the large age range of the present meta-analytic sample (3-15 years). Another explanation for this discrepancy could be that systematic reviews combine whole-brain and region-of-interest results. The current meta-analysis, however, is entirely based on whole-brain results to ensure that different brain regions are equally likely to reveal a significant effect.
Our observation that children, compared to adults, recruit bilateral superior temporal cortices more consistently is in line with a large body of literature suggesting that the functional responses of the language system are not mature before young adulthood (Nuñez et al., 2011;Skeide et al., 2014;Wang et al., 2019). Specifically, children still have to rely on low-level semantic and syntactic processing implemented in the temporal cortex, while high-level semantic and syntactic processing only gradually emerges towards adulthood with an increasing involvement of the left IFG (Nuñez et al., 2011;Skeide et al., 2014;Wang et al., 2019). The notion of immature language processing in children is further corroborated by the described activation differences in the left IFG. Following our previous work, we interpret the observation that children do not yet recruit the pars opercularis to an adult-like extent as a lack of specialization of controlled syntactic processing (Nuñez et al., 2011;Skeide et al., 2014;Wang et al., 2019). Alternatively, the increasing involvement of the left pars opercularis could also be related to controlled phonological processing that is refined in the course of literacy learning in school (Brennan et al., 2013). Phonological processing, however, is typically related to the dorsal pars opercularis, while in the present study, the main difference between children and adults was found in the ventral pars opercularis, a subregion that is typically related to syntactic processing (Brennan et al., 2013;Zaccarella and Friederici, 2015). Disentangling phonological, semantic, and syntactic processes during language comprehension will only be possible with a larger database and thus remains a challenge for future work.
Fail-safe N analysis for the five significant clusters associated with language comprehension in children. For every significant cluster obtained from the ALE analysis, fail-safe N indicates how many null experiments with nonsignificant findings could be hidden in an imaginary file drawer without compromising the statistical significance of the cluster. Light gray shading indicates the desirable fail-safe N values based on a priori considerations (see main text for details).
Error bars indicate 95% confidence intervals of the mean.
Besides the left IFG, several other regions revealed consistent activation during auditory language comprehension. Within the left temporal lobe, the left MTG is linked to the activation of lexical representations (Lau et al., 2008) and the left STG is linked to the decoding of spectro-temporal features of phonemes (Hickok and Poeppel, 2007). The right STG, in contrast, is associated with decoding supra-segmental acoustic features, i.e. the prosody of the speech input (Friederici, 2011). Within the precentral gyrus, the premotor area is thought to support language comprehension by activating subvocal articulation codes for phonemes (Pulvermüller et al., 2006). In addition to activation differences in the language system, we found that children activated left medial and superior frontal gyri and the right insula more consistently than adults. These areas are linked to executive functions (e.g. cognitive control, performance monitoring, salience detection) and may point to the general effect that language comprehension tasks are more demanding for children than for adults (de la Vega et al., 2016;Uddin, 2015;van Noordt and Segalowitz, 2012).

Conclusion
The present meta-analysis suggests two developmental activation shifts during language comprehension that require longitudinal corroboration, namely, a triangularis-to-opercularis shift in the left inferior frontal cortex and a bilateral-to-left shift in the temporal cortex. These trajectories can be interpreted as neurodevelopmental correlates of the gradually increasing sensitivity to syntactic information.