Infants' performance in spontaneous-response false belief tasks: A review and meta-analysis.

Evidence obtained with new experimental paradigms has renewed the debate on the development of theory of mind in general and false belief ascription in particular. Namely, several studies purport to show that infants already have the capacity to attribute false beliefs. The aim of the current meta-analysis is to review and summarize the empirical evidence about spontaneous-response false belief tasks in infants younger than 2 years old. Fifty-six false belief conditions using the violation-of-expectation, the anticipatory looking, and the interactive paradigms were included in this meta-analysis, comprising 1469 infants. The role of several moderators was examined, following Wellman et al.'s meta-analysis (2001). Results show that correct performance on spontaneous-response false belief tasks was about 1.76 times more likely than incorrect performance (β = 0.57, 95% CI 0.33; 0.80, p < .0001). Moderator analyses revealed that (i) year of publication had a significant influence on performance, reducing the average log odds of successful performance (β = -0.11, 95% CI: -0.16; -0.06, p < .0001); and (ii) correct performance was more likely than incorrect performance when the task was conducted in the violation-of-expectation paradigm (β = 0.75, 95% CI: 0.25; 1.26, p = .003). However, heterogeneity was high across the studies, and the funnel plot revealed an asymmetric distribution suggesting that studies with small effect sizes were not published. These results cast doubt on the alleged robustness of the phenomenon: its effect size decreases as time passes, it seems to depend on the type of paradigm employed, and the variance across studies is not yet well understood.


Introduction
In our everyday social interactions, we frequently attribute mental states (e.g., beliefs, desires and intentions) to other people in order to predict or explain their behavior. For example, if you know that Ben wants to drink orange juice, and you know that Ben believes that the orange juice is in the fridge, you can predict that he will search for the orange juice in the fridge. This ability to impute mental states to others (and also to oneself) to make predictions about the future behavior of other agents is called theory of mind (Premack & Woodruff, 1978, p. 515). Competence in theory of mind consists of understanding agents as holders of beliefs and desires, which jointly give rise to intentions and goals, carried out in action.
False-belief ascription was chosen as diagnostic of theory of mind capacity (Bennett, 1978; Dennett, 1978; Harman, 1978) because an agent with a false belief holds an informational state incongruent with reality. Thus, this ability requires that an individual understands the situation in the agent's terms, which are necessarily different from how the individual himself conceives of it. As a result, developmental psychologists devised the false belief task (FBT) to investigate children's ability to represent another person's mistaken belief (Wimmer & Perner, 1983).
There are two main versions of the classical FBT: the "change of location" or the "unexpected transfer" version (Baron-Cohen, Leslie, & Frith, 1985; Wimmer & Perner, 1983), and the "unexpected contents" variant (Gopnik & Astington, 1988; Hogrefe, Wimmer, & Perner, 1986). In the first FBT using the anticipatory looking paradigm, Southgate and colleagues tested infants in a spontaneous-response FBT. They predicted that infants would look in anticipation to where an agent with a false belief would search for an object, on the grounds of the visual information previously available to both. Interestingly, the object was not just moved to another location but removed from the scene. After Southgate and colleagues' positive results, other authors started to use infants' first look as a dependent measure of false belief understanding.
In addition, a new set of spontaneous-response FBT emerged that took advantage of infants' intentional interaction abilities, such as active helping and referential communication. These tasks typically involve simple language and introduce participants to a scene in which an agent does or does not witness an event. Children are then given a verbal prompt, but one that only indirectly taps their representation of the agent's epistemic state. Authors measure young children's spontaneous behaviors in response to the prompt, such as helping and pointing, as a proxy for belief understanding.
In the first FBT using the interactive paradigm, infants watched as an agent placed a toy inside a box and, later, as another person switched the toy from that box to a second one while the agent either witnessed the switch (true belief condition) or did not (false belief condition). Then, the agent unsuccessfully attempted to open the box that originally contained the toy, and infants, who had previously been instructed on how to open the boxes, spontaneously helped her. Buttelmann et al. (2009) expected that infants would help the agent to open the other box, where the toy was actually hidden, only in the false belief condition. Since then, many other studies have employed infants' helping or pointing behavior to measure false belief ascription.
Each of the previous studies on early false belief understanding provided positive results showing that infants, from 15 months old, can attribute false beliefs to another agent. This new picture of young children tracking another agent's false beliefs gave rise to a paradox in development: on the one hand, infants succeed in different non-verbal and spontaneous versions of the FBT but, on the other hand, children cannot reliably pass the verbal, explicit versions of the FBT until they are 4 years old.
Different theoretical approaches tried to explain this developmental paradox. For instance, advocates of a nativist or an early, full mentalistic capacity account take the results of spontaneous-response FBT at face value, claiming that they prove genuine false belief understanding (Carruthers, 2013; Jacob, 2013; Leslie, 1994; Leslie, Friedman, & German, 2004; Scott & Baillargeon, 2017). They argue that 3-year-olds' difficulties with the explicit FBT are related to processing demands but not to their understanding of beliefs, thus applying the strategy first proposed by Fodor (1992) of an innate, full competence but limited performance due to the task's requirements. Working memory and related executive control abilities change throughout children's development, but not their ability to understand propositional attitudes in general, which is assumed to be complete, universal, and either innate or emerging very early in ontogeny. In other words, 2-year-old children do attribute full-blown beliefs (beliefs with propositional contents) as part of their understanding of others as intentional agents.
Empiricists, on the other side, deny that infants' performance on spontaneous-response FBT shows that they attribute beliefs in the first place. From their point of view, infants' behavior can be explained by low-level processes such as a novelty preference (Heyes, 2014) or simpler behavioral rules (Perner & Ruffman, 2005). For example, Heyes (2014) has forcefully argued that the novelty generating the surprise in the VOE and the AL procedures appears at a lower representational level than true or false belief representations. Infants' looking behavior is a function of the degree to which the observed (perceptual novelty) and remembered or expected (imaginal novelty) low-level properties of the test stimuli (their colors, shapes, and movements) are novel with respect to the earlier events encoded by the infants in the experiment. On Perner's view, on the other hand, infants' performance in these tasks can be accounted for in terms of learned or innate behavioral rules, such as that "people tend to look for an object where they last saw it and not necessarily where the object actually is" (Perner & Roessler, 2012).
More recently, though, a two-systems account of belief ascription has been put forward (Apperly & Butterfill, 2009; Butterfill & Apperly, 2013; Low, Apperly, Butterfill, & Rakoczy, 2016). According to this dual-system view, passing the spontaneous-response FBT depends on a fast and implicit but inflexible system. Such a system reasons about "belief-like" states, called "registrations", and develops early in ontogeny. A registration is a non-propositional, extensional state that holds relations to objects and properties and has its effects directly on action. Someone who reasons about registrations would track beliefs, true and false, but in a limited range of situations. For example, infants can attribute registrations regarding the location of an object and thus succeed in many spontaneous-response FBT (Apperly & Butterfill, 2009). However, they cannot track beliefs that involve quantifiers, indefinitely complex combinations of properties, or aspectuality, that is, situations in which a protagonist refers to an object under one aspect in contrast to another (Apperly & Butterfill, 2009; Rakoczy, Bergfeld, Schwarz, & Fizke, 2015; Rakoczy, 2017). The fact that registrations have signature limits explains why young children fail spontaneous tasks in which the protagonist was mistaken about the identity of an object (Fizke, Butterfill, van de Loo, Reindl, & Rakoczy, 2017). Since registrations are relational states, they fail to capture the intensionality characteristic of propositional attitudes and the aspectuality of the agent's mental state (Rakoczy, 2017).
On the other hand, a late-developing and flexible system supports the attribution of genuine beliefs. This system is effortful and explicit, but also inefficient, and supports children's success in explicit FBT. Beliefs, unlike registrations, are propositional, intentional states, which engage in holistic and complex interactions with other propositional attitudes. Such a flexible system allows children to overcome the limitations posed by the early system and track more complex beliefs.
Lately, Tomasello (2018) has argued that insofar as the previous theories are based on individual cognition, the developmental puzzle will not be solved. The key to understanding beliefs requires an account of the processes of social and mental coordination with other agents and their perspectives. According to Tomasello (2018), the origins of this coordination lie in infants' joint attentional activities around their first birthdays, in which they relate two perspectives. To pass the spontaneous-response FBT, infants just need to display their ability to triangulate, the same one that they use in joint attention, which involves tracking what the agent sees or has seen in the test, and how this information will affect the agent's behavior. But such epistemic tracking does not imply that the child should understand that the agent's belief is incorrect; in other words, the infant does not need to compare the agent's perspective with his own view of the situation or with the objective situation (Tomasello, 2018).
While three-year-old children answer the false belief question focusing on the objective perspective, 4- to 5-year-old children pass the FBT because they come to coordinate the three perspectives involved: the child's, the agent's, and the objective perspective. Much social and communicative interaction with others is required before such an achievement, which is mainly characterized by communicative exchanges involving joint attention to mental content (Tomasello, 2018).
Almost fifteen years after the seminal study by Onishi and Baillargeon, there is no consensus about the capacities displayed by infants in these non-verbal tasks. After the positive evidence obtained in the pioneering studies, new tasks either confirmed a similar pattern of results or failed to replicate the original findings (most of which were not published). Given the mixed pattern of the accumulated evidence, it is time for a meta-analysis of the literature on implicit theory of mind. Specifically, this meta-analysis explores the strength of correct performance in spontaneous-response FBT and whether (and how) other variables moderate this effect.

Inclusion and exclusion criteria
We used the following criteria to select studies to include in the meta-analysis.
1. We only included studies testing participants' capacity to attribute false beliefs in which the participants' mean age was 26 months or younger. Thus, we excluded all false belief conditions testing older children.
2. We only included tasks that employed one of these three paradigms for implicit false belief ascription: the violation-of-expectation methodology, the anticipatory looking paradigm, and the interactive paradigms (grouping together tasks that require infants' helping and pointing behaviors).
3. We only included implicit false belief conditions; thus, we excluded true belief, ignorance, and control conditions that might also be present in the studies.
4. We only included false belief conditions in which typically developing infants were tested. Thus, we excluded false belief tests carried out on samples with atypical development.
5. We included all the tasks published from Onishi and Baillargeon's seminal paper (2005) until 4 February 2019. Although we checked for unpublished data in the published overview of studies, most of these data are now published and we have already included them in the current meta-analysis 3.

Moderator and mediator analyses
Each condition included in the meta-analysis was coded for the following variables, most of them following Wellman et al. (2001) 4:

1. Author.
2. Year of publication.
3. Paradigm: the paradigm employed in each condition, that is, the violation of expectation (VOE), the anticipatory looking (AL), or the interactive paradigm.
4. Age: the mean age of participants in each condition (expressed in months). In some cases, the authors reported the mean age of participants for the whole group (grouping together either true and false belief conditions or two false belief conditions).
5. Sample size: the number of participants in a condition.
6. Dropped participants: the number of participants excluded from each condition. In many cases, the authors reported the total number of excluded participants in the whole study, grouping together false and true belief conditions, or two false belief conditions.
7. Familiarization trials: the number of familiarization trials performed in each condition. When the authors employed warm-up trials instead of familiarization trials, we coded the familiarization trials as 0.
8. Belief: the type of belief attributed to the agent. We coded four levels, distinguishing beliefs about: (a) an object's location (L); (b) a non-obvious property of an object (NP); (c) the identity of an object (I); or (d) (un)expected contents (C).
3 There is only one missing study from the published survey that satisfies points 1 to 4 of our inclusion criteria and that is still unpublished (a study from Surian and Franchin). Unfortunately, we could not obtain more unpublished results. However, meta-analysis is a technique that draws conclusions from the available results and estimates how many articles were not published.
4 We did not include three variables from Wellman and colleagues' meta-analysis: the nature of the target object, the type of question, and the temporal marker. The first was difficult to code in our data: it distinguished three levels (a real object, a toy, or a videotaped object), and in many conditions it was not clear whether the target real object was also a toy. The remaining two variables corresponded to the linguistic version of the test.
P. Barone, et al. Infant Behavior and Development 57 (2019) 101350

9. Agent: three levels describing the nature of the agent: (a) a real, present person; (b) a videotaped person; or (c) a virtual agent (including cases of a human-like individual, a geometric shape, or an animal).
10. Real presence of the target object: whether, at the test trial, the target object was (a) real and present (i.e., the object was inside one of the boxes) or (b) absent (i.e., the object had been removed from the scene).
11. Object movements: the number of displacements of the object that the agent did not see before the object arrived at its final position.
12. Motive of the transformation: two levels capturing whether the key transformation (e.g., the change of location or the substitution of unexpected contents) was done (a) to trick: to explicitly trick the agent, or (b) for another reason: for some other reason, including no explicit reason at all.
13. Salience of the agent's mental state: we distinguished four levels: (a) absence: the false belief state had to be tracked from the agent's absence during the key events; (b) back turned: the false belief state had to be tracked due to the agent's back being turned during the key transformation; (c) first-person experience: the false belief experience was demonstrated initially on the children themselves; and (d) blindfold: the agent wore a blindfold during the transformation.
14. Interaction: whether there was interaction between the child and the agent during the test.
15. Design: the design employed in each condition. We distinguished between (a) a within-subjects design (WS): participants were tested in both test events (congruent and incongruent events in the VOE case) or in both conditions (false and true belief conditions); and (b) a between-subjects design (BS): participants were tested in only one test event (in the VOE case) or in only one false belief condition.
16. Test trials: the number of test trials conducted with each child.
Depending on the paradigm employed, we coded the dependent variable in each case. In the VOE paradigm, we coded the mean looking time (in seconds) that infants spent looking at (a) the expected event, that is, when the agent acted according to her false belief; and (b) the unexpected event, that is, when the agent acted in a way that was not according to her false belief. In the AL and the interactive paradigms, we coded the proportion of passers, that is, the number of infants who showed the correct anticipatory behavior (in the AL paradigm) 5 or the correct helping or pointing behavior (in the interactive paradigms).

Search strategies & coding procedures
The literature on implicit false belief tasks is relatively recent (the first study was published in 2005), so new studies become known in the field quickly and are cited as new articles appear. Nevertheless, we checked the databases EbscoHost-PsycINFO, SCOPUS, and Web of Science (WoS) to ensure we were taking into account all the studies published to date. The search terms employed were: false belief, implicit false belief tasks, false belief + infants, spontaneous response false belief tasks, violation of expectation + false beliefs, looking paradigm + false beliefs, theory of mind.
We considered only studies published in English, and we included only published studies. All the studies were coded by the first author, after agreement on the codes with the third author. Table 1 shows the final list of references included in the current meta-analysis.

Statistical methods
We calculated the effect sizes based on the data described in the articles and on the responses from the authors to requests for further information 6 .
On the one hand, the dependent variable in the VOE paradigm is measured on a quantitative scale (mean looking time). Therefore, we used the standardized mean difference (d) as the effect size. The formula of d is (m1i - m2i)/spi, where m1i and m2i represent the means of the two groups, and spi the pooled standard deviation of the two groups (for which the standard deviations of the scores in the two groups and both sample sizes must be reported).
5 AL tasks also report the differential looking score (DLS). This is a continuous variable that considers the total amount of looking time at both locations (correct and incorrect areas of interest). The score results from the difference between the looking time to the correct window and the looking time to the incorrect window, divided by the total looking time to both windows: DLS = (time(correct AOI) - time(incorrect AOI)) / (time(correct AOI) + time(incorrect AOI)). We coded the DLS when the authors reported it, or we calculated it when the authors reported the mean looking times to each window. We could calculate it in 11 out of the 15 conditions. However, we lacked the information to calculate the corresponding effect sizes and standard errors, which is why we did not use this quantitative measure.
6 In one study, the authors reported that 25 out of 50 infants made a correct first saccade in both the false belief 1 and false belief 2 conditions (Grosse Wiesmann et al., 2018). Separate information on passers in each condition was not reported and could not be obtained via personal communication. However, as these conditions differ in the number of movements the object makes outside of the agent's view (a variable that we coded), we split the sample, assigning half to FB1 and half to FB2, in order to maintain the same overall performance while coding the two different conditions.
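The DLS formula above is a simple normalized difference. As a minimal sketch (the function name is ours; the authors worked in R):

```python
def differential_looking_score(time_correct, time_incorrect):
    """DLS = (correct - incorrect) / (correct + incorrect).
    Ranges from -1 (all looking at the incorrect AOI)
    to +1 (all looking at the correct AOI)."""
    return (time_correct - time_incorrect) / (time_correct + time_incorrect)

# An infant looking 3.0 s at the correct window and 1.0 s at the incorrect one:
differential_looking_score(3.0, 1.0)  # → 0.5
```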
The variance of d is equal to (n1i + n2i)/(n1i · n2i) + d²/(2 · (n1i + n2i)), where n1i and n2i are the two sample sizes. For within-subjects designs, in which infants were tested on both congruent and incongruent test trials, the standardized mean change was calculated. Its formula is (m1i - m2i)/Sdif, where m1i and m2i refer to the observed means at the two measurement occasions. To calculate the standard deviation of the paired differences, the corresponding standard deviations of the two measurement occasions and the correlation coefficient between them must be reported. No condition that employed a within-subjects design reported the correlation between participants' two measures. Therefore, we fixed a medium correlation of 0.5 for these cases 7.
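The two effect-size formulas can be sketched as follows. These are hypothetical helper functions (the analysis itself was run in R with metafor), and Sdif is computed from the usual formula for the SD of paired differences, √(sd1² + sd2² − 2·r·sd1·sd2), with r fixed at 0.5 when unreported, as in the text:

```python
import math

def cohens_d(m1, m2, sd1, sd2, n1, n2):
    """Standardized mean difference for between-subjects designs:
    d = (m1 - m2) / pooled SD."""
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

def standardized_mean_change(m1, m2, sd1, sd2, r=0.5):
    """Standardized mean change for within-subjects designs,
    assuming Sdif = sqrt(sd1^2 + sd2^2 - 2*r*sd1*sd2)."""
    s_dif = math.sqrt(sd1**2 + sd2**2 - 2 * r * sd1 * sd2)
    return (m1 - m2) / s_dif
```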
On the other hand, studies conducted with the AL and the interactive paradigms provide data for single groups on a dichotomous dependent variable: (1) correct anticipatory look (that is, looking in anticipation at the place where the agent should go according to her false belief) versus incorrect anticipatory look; or (2) correct helping or pointing behavior (i.e., opening the correct box, giving the appropriate object, or pointing to the correct location) versus incorrect helping or pointing behavior.
Therefore, we used the logit transformed proportion (log odds) as an outcome measure for the studies from both paradigms. The log odds for single groups is equal to the log((xi/ni)/(1-(xi/ni))) where xi denotes the number of individuals experiencing the event of interest and ni refers to the total number of individuals in the group. The variance of log odds is equal to 1/xi + 1/(ni-xi).
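The logit transformation and its variance can be sketched directly from the formulas above (hypothetical function names):

```python
import math

def log_odds(x, n):
    """Logit-transformed proportion for a single group:
    log((x/n) / (1 - x/n))."""
    p = x / n
    return math.log(p / (1 - p))

def log_odds_variance(x, n):
    """Variance of the log odds: 1/x + 1/(n - x)."""
    return 1 / x + 1 / (n - x)

# For 12 passers out of 18 infants:
log_odds(12, 18)           # log(2) ≈ 0.693
log_odds_variance(12, 18)  # 1/12 + 1/6 = 0.25
```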
Finally, we converted d scores into log odds by applying the formula d · (π/√3), and the variance of d into the variance of log odds following the formula v(d) · (π²/3). Then, we calculated the effect size standard error (SE), which is equal to the square root of the variance. 95% confidence intervals were calculated in R (R Core Team, 2013). The lower bound of the confidence interval results from applying the formula log odds - (1.96 · SE), while the upper bound is given by log odds + (1.96 · SE).

Table 1 lists the studies and the false belief conditions used in the meta-analysis (56 conditions in total).

7 Another formula for the standardized mean change consists of (m1i - m2i)/S1 (Ausina & Meca, 2015), requiring no correlation coefficient to perform the calculation. We compared the results of this formula to those obtained with the previous formula using a correlation of .50, and the values were very close.
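The conversion from d to the log-odds scale and the confidence interval construction can be sketched as follows (hypothetical function names; these are the standard conversion formulas used in the text):

```python
import math

def d_to_log_odds(d):
    """Convert a standardized mean difference into log odds: d * pi / sqrt(3)."""
    return d * math.pi / math.sqrt(3)

def var_d_to_var_log_odds(v):
    """Convert the variance of d into the variance of log odds: v * pi^2 / 3."""
    return v * math.pi**2 / 3

def confidence_interval(log_odds, variance):
    """95% CI: log odds -/+ 1.96 * SE, with SE = sqrt(variance)."""
    se = math.sqrt(variance)
    return log_odds - 1.96 * se, log_odds + 1.96 * se
```

For example, `confidence_interval(0.57, 0.0144)` gives bounds of roughly 0.33 and 0.81, matching the interval reported for the average effect size.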
We employed a random effects model, in which we assume that the true effects are normally distributed (Borenstein, Hedges, Higgins, & Rothstein, 2009). Under this model, two sources of variability are assumed: intra-study variability (due to sampling error) and inter-study variability (each study estimates its own parametric effect) (Ausina & Meca, 2015).
Heterogeneity was assessed by calculating different statistics. First, we used the Q statistic, which is the weighted sum of squares (WSS) on a standardized scale, and its p-value as a test of significance. It can be compared with the expected WSS (under the assumption that all studies share a common effect) to yield a test of the null hypothesis and an estimate of the excess variance (Borenstein et al., 2009). We also quantified T², which reflects the variance of the true effects, and I², which reflects the proportion of observed dispersion that is due to true heterogeneity and is expressed as a ratio. Higgins, Thompson, Deeks, and Altman (2003) suggest that values on the order of 25%, 50%, and 75% might be considered low, moderate, and high, respectively. Finally, when we added moderators to the model, we report R², which reflects the amount of heterogeneity accounted for by the variables included in the model.
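These heterogeneity statistics can be illustrated with a self-contained sketch. Note that the paper estimates T² by maximum likelihood in metafor; the DerSimonian-Laird estimator is used here only because it has a simple closed form:

```python
import numpy as np

def heterogeneity(effects, variances):
    """Q, tau^2 (DerSimonian-Laird) and I^2 (in %) for a set of
    study effects and their sampling variances."""
    y, v = np.asarray(effects, float), np.asarray(variances, float)
    w = 1.0 / v
    theta_fixed = np.sum(w * y) / np.sum(w)       # fixed-effect mean
    Q = np.sum(w * (y - theta_fixed) ** 2)        # weighted sum of squares
    df = len(y) - 1
    C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / C)                 # between-study variance
    I2 = 0.0 if Q == 0 else max(0.0, (Q - df) / Q) * 100
    return Q, tau2, I2
```

With two studies of equal precision but very different effects, e.g. `heterogeneity([0.0, 1.0], [0.1, 0.1])`, Q = 5, tau² = 0.4 and I² = 80%, illustrating how dispersion beyond sampling error inflates both statistics.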
Possible publication bias was assessed with a funnel plot. The funnel plot is a graphical display of effect sizes against their standard errors: in the absence of publication bias, studies will be symmetrically distributed around the mean effect size, since the sampling error is random. In the presence of publication bias, the studies are expected to show symmetry at the top of the plot, a few studies missing in the middle, and more studies missing near the bottom (Borenstein et al., 2009, p. 284).
We used the metafor package (Viechtbauer, 2010) in R (R Core Team, 2013) to calculate effect sizes and heterogeneity estimates, to run the meta-analysis, and to produce the forest, funnel, and meta-regression plots. 8
From these definitive 33 papers, we additionally excluded some conditions in which the authors tested older children (Fizke et al., 2017; Grosse Wiesmann, Friederici, Disla, Steinbeis, & Singer, 2018; Kulke, Reiß, Krist, & Rakoczy, 2018; Priewasser, Rafetseder, Gargitter, & Perner, 2018; Schuwerk, Priewasser, Sodian, & Perner, 2018) or deaf participants (one condition from Meristo et al., 2012), and other conditions due to insufficient data to calculate the effect sizes (two VOE conditions from Powell, Hobbs, Bardis, Carey, & Saxe, 2018). The final selection of articles amounted to 56 false belief conditions in which a total of 1469 infants were tested. Table 2 reports the descriptive information for each false belief condition included in the meta-analysis, such as the year of publication, the sample size, infants' mean age, the paradigm employed, the effect size, and the 95% confidence intervals. Table 3 provides a descriptive summary of the database grouped by the variables coded. The mean age of all the participants in the database was 19.55 months and, of the 56 false belief conditions, 19 employed the VOE paradigm, 15 used the AL procedure, and 22 the interactive methodology.

Model 1
We built a random effects meta-analytic model with maximum likelihood estimation on infants' performance in spontaneous-response false belief tasks. This first model (model 1) included no covariates. The results showed that the estimated average effect size (calculated in log odds) is β = 0.57, 95% CI [0.33; 0.80] (see Fig. 2). This average log odds corresponds to an odds of 1.76 9, meaning that correct performance on spontaneous-response false belief tasks was about 1.76 times more likely than incorrect performance. In terms of probabilities, the probability of correct performance was 64% 10. The null hypothesis can be rejected (z = 4.69, p < .0001).
8 All the data employed in this meta-analysis as well as the R script are available at https://osf.io/re8uj/?view_only=d2605771dd664831a104318db9ff7aa9
9 The odds are the exponential function of the log odds. In this case, the exponential of 0.5663 results in 1.76.
10 Probabilities result from applying the formula odds/(1+odds).
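The chain of conversions in footnotes 9 and 10 (log odds → odds → probability) is quickly verified:

```python
import math

def log_odds_to_odds(b):
    """Odds are the exponential of the log odds."""
    return math.exp(b)

def odds_to_probability(odds):
    """Probability = odds / (1 + odds)."""
    return odds / (1 + odds)

# The model-1 estimate reported in the text:
odds = log_odds_to_odds(0.5663)  # ≈ 1.76
p = odds_to_probability(odds)    # ≈ 0.64
```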
The total set of studies was heterogeneous, as evidenced by a significant Q-value (Q(55) = 163.81, p < .001). T², which reflects the amount of true heterogeneity (i.e., the variance of the true effects), is 0.50 (SE = 0.15). The proportion of observed dispersion in the data that is due to heterogeneity is I² = 71.5%, which indicates a medium-high proportion of variance due to heterogeneity rather than chance.
The funnel plot showed that all the effect sizes falling within the grey area would be statistically non-significant in a two-tailed test (Fig. 3). We can see an asymmetrical distribution of the effect sizes, with more studies on the right part of the plot (showing larger effect sizes) that have less precision (smaller samples). This graphical display suggests a publication bias: some studies with non-significant results were not published.

Model 2
Part of the heterogeneity found may be due to the influence of moderators (Viechtbauer, 2010). We examined this possibility by fitting a mixed-effects model including theoretically relevant predictors, namely the year of publication, infants' age, the paradigm employed, the type of belief, the type of agent, the motive of the transformation, the design of the study, and the presence of the target object, as moderators. This second model (model 2) includes eight predictors. We performed a multiple meta-regression on model 2 using the restricted maximum likelihood method. Table 4 shows the results.
In this model, the estimated amount of residual heterogeneity is T² = 0.23, suggesting that 53.80% (R²) of the total amount of heterogeneity can be accounted for by the eight moderators included in the model. The proportion of observed dispersion due to heterogeneity (I²) is reduced to 52.14%. We can reject the null hypothesis based on the omnibus test (QM = 49.67, df = 12, p < .0001).
Before analyzing the implications of each moderator, a note of caution is warranted. Given the large number of moderators considered and their confounds with one another, we need to be circumspect in interpreting the results about the influence of these variables on children's performance.
Examining the predictors of model 2, we find that year of publication (z = -4.54, p < .0001), infants' age (z = 2.26, p = .02), and the motive of the transformation (to trick the agent; z = 2.01, p = .04) have a significant influence on performance. The test for residual heterogeneity is significant (QE = 87.20, df = 43, p < .0001), indicating that other moderators, not considered in the model, influence children's performance. These results indicate that, as time passes, the probability of passing the test diminishes by 0.15 units (95% CI: -0.22; -0.09) in terms of the average log odds of successful performance. This means that correct performance on spontaneous-response false belief tasks becomes 0.86 times as likely for each additional year of publication, suggesting that null results or children's incorrect performance on the task are currently being published. On the other hand, when older infants are tested, the log odds of successful performance increase by 0.08 units (95% CI: 0.01; 0.15), meaning that correct performance becomes 1.08 times more likely as infants get older. When the motive of the transformation is to trick the agent, there is an increase of 0.65 units (95% CI: 0.02; 1.29) in the log odds of correct performance, indicating that correct performance is 1.91 times more likely than incorrect performance. However, the confidence intervals of these two predictors almost include zero, so they should be interpreted cautiously. The effects of all other predictors were not significant (all p values > .05).

Model 3
Lastly, we examined a simpler mixed-effects model including only the year of publication and the paradigm as moderators. The results of the multiple meta-regression for model 3, which includes these two predictors, are presented in Table 5.
The estimated amount of residual heterogeneity is T² = 0.19, suggesting that 61.30% (R²) of the total amount of heterogeneity can be accounted for by the inclusion of these two moderators in the model. The proportion of observed dispersion due to heterogeneity (I²) is 48.19%. We can reject the null hypothesis based on the omnibus test (QM = 40.12, df = 3, p < .0001). A closer inspection of the predictors shows that year of publication has a significant influence on performance (z = -4.32, p < .0001), as does the VOE paradigm (z = 2.92, p = .003).
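R² in a meta-regression is the proportional reduction in between-study variance (tau²). The total tau² of the moderator-free model is not quoted here, but the reported residual T² = 0.19 and R² = 61.30% jointly imply a total of roughly 0.49; a minimal sketch of the relation, using that implied value:

```python
def r_squared(tau2_total: float, tau2_residual: float) -> float:
    """Proportional reduction in between-study variance (meta-regression R^2)."""
    return 1 - tau2_residual / tau2_total

# tau2_total ~0.49 is implied by the reported T^2 and R^2, not reported directly
print(round(r_squared(0.49, 0.19), 3))  # ~0.612, i.e. ~61% accounted for
```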
Results indicate that as the year of publication increases, there is a reduction of 0.11 units (95% CI: -0.16; -0.06) in the average log odds of successful performance. That is, correct performance on spontaneous-response false belief tasks becomes 0.90 times as likely as time passes, in the same vein as the result of model 2. On the other hand, performing the task in the VOE paradigm implies an increase of 0.75 (95% CI: 0.25; 1.26) in the log odds of successful performance. This means that correct performance is 2.12 times as likely as incorrect performance when the test is run with the VOE paradigm.

Fig. 2. Forest plot of all the false belief conditions grouped by the paradigm employed. Each study is represented on a different row. The effect size and the 95% confidence interval for each condition are reported on the right. The black square in the middle of each line represents the effect size of that condition, and its size varies according to its weight in the meta-analysis. The mean effect size of all conditions and its 95% confidence interval appear at the bottom. The meta-analytic information for each paradigm appears below each subgroup of conditions.

Models' comparison
The models were compared using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Table 6 contains the values of each criterion for the three models. Comparing model 2 with model 1, both AIC and BIC decrease: AIC falls from 160.26 to 123.74, and BIC from 164.28 to 148.40, showing that the second model fits better. Considering model 3 in comparison with model 2, AIC increases (AIC = 129.09) while BIC decreases considerably (BIC = 138.84). Because BIC penalizes each additional variable included in the model, the fit of model 3 -with only two moderators- improves in comparison with model 2, which possesses eight predictors. As the AICs of model 2 and model 3 are not substantially different, and the fit of the third model is better than that of the first, model 3 is preferable over the other two models.

Table 5
Results of the analysis of a multiple meta-regression for model 3, assuming a mixed effects model. Note: SE = standard error; CI = confidence interval.
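The AIC/BIC trade-off described above can be made explicit with the values from Table 6; lower is better on both criteria, and BIC's heavier per-parameter penalty is what tips the comparison toward the sparser model 3:

```python
# Model-selection values from Table 6 (quoted in the text); lower is better.
models = {
    "model 1": {"AIC": 160.26, "BIC": 164.28},
    "model 2": {"AIC": 123.74, "BIC": 148.40},
    "model 3": {"AIC": 129.09, "BIC": 138.84},
}

best_aic = min(models, key=lambda m: models[m]["AIC"])  # "model 2"
best_bic = min(models, key=lambda m: models[m]["BIC"])  # "model 3"
# The criteria disagree because BIC charges more per parameter,
# so the two-moderator model 3 wins on BIC despite a slightly higher AIC.
```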

A meta-analysis per paradigm
We also ran separate meta-analyses for each of the paradigms employed (see Fig. 2). Results show that the VOE paradigm had a significant mean effect size, β = 1.28, 95% CI [0.86; 1.69], implying that correct performance on the task was about 3.58 times as likely as incorrect performance (see Table 7). The results from the interactive paradigms also indicate a significant mean effect size, β = 0.31, 95% CI [0.01; 0.61], suggesting that correct performance on the task was about 1.36 times as likely as incorrect performance. However, this is a smaller effect, and the confidence interval almost reaches zero. On the other hand, the meta-analysis for the AL paradigm yielded no significant mean effect size (β = 0.16, 95% CI: -0.24; 0.57).
These results suggest that the average effect size found when all paradigms were included as moderators comes from the VOE and interactive methodologies.
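For context, the effect size used throughout is the log odds of correct over incorrect performance within a condition. The sketch below shows how such an effect size and its standard large-sample variance could be computed from raw counts; the counts are invented for illustration, not taken from any study in the meta-analysis:

```python
import math

def log_odds_effect(n_correct: int, n_incorrect: int):
    """Log odds of correct performance and its approximate variance
    (1/a + 1/b, the standard large-sample formula for a log odds)."""
    lo = math.log(n_correct / n_incorrect)
    var = 1 / n_correct + 1 / n_incorrect
    return lo, var

# Hypothetical condition: 18 of 24 infants respond correctly
lo, var = log_odds_effect(18, 6)
ci = (lo - 1.96 * math.sqrt(var), lo + 1.96 * math.sqrt(var))
# lo ~1.10, i.e. an odds ratio of exp(lo) = 3.0 for this invented condition
```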

Main effects analyses
We also inspected the main effects of all the variables we coded. We will first consider the effect of the year of publication and then analyze the other methodological moderators separately. Year of publication might have a special status, likely indexing publication bias rather than methodological differences.¹¹

Year of publication
The year of publication, on its own, has a significant influence on performance (z = -5.32, p < .0001). As the year of publication increases, the probability of passing the test falls by 0.14 units (95% CI: -0.19; -0.09) in terms of the average log odds of successful performance (see Fig. 4). This result suggests that children's correct performance on spontaneous-response false belief tasks becomes 0.87 times as likely as the year of publication increases, in the same direction as we found in models 2 and 3.
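The meta-regression behind Fig. 4 can be sketched as a weighted least-squares fit with random-effects weights 1/(v_i + tau²). The data below are invented for illustration; the actual analysis was run with dedicated meta-analysis software:

```python
import numpy as np

# Invented conditions: log odds shrinking with publication year
year = np.array([2005, 2007, 2010, 2012, 2015, 2018], dtype=float)
lo = np.array([1.4, 1.1, 0.8, 0.5, 0.2, 0.0])       # hypothetical log odds
v = np.array([0.30, 0.25, 0.20, 0.15, 0.10, 0.10])  # sampling variances
tau2 = 0.19                                          # residual heterogeneity

w = 1.0 / (v + tau2)                 # random-effects weights
X = np.column_stack([np.ones_like(year), year - year.mean()])
W = np.diag(w)
# Weighted least squares: solve (X'WX) beta = X'W y
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ lo)
# beta[1] is the estimated change in log odds per year (negative here)
```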

Methodological moderators
We found that the following variables produced a significant main effect: the paradigm, the type of agent, the object's movements, the salience of the agent's mental state and the sample size. Each of them has, on its own, a significant influence on young children's performance. All the remaining methodological moderators produced non-significant main effects. Table 8 presents the estimates, the p values and the confidence intervals for all the significant predictors.
The effect of the paradigm was already analyzed in model 3. On the other hand, when the agent was a videotaped person, there was a change of -0.62 (95% CI: -1.15; -0.08) in the average log odds of successful performance.¹² Regarding the object's movements, when the object was moved twice outside the view of the agent, there was a change of -1.16 (95% CI: -1.90; -0.42) in the average log odds of successful performance. When the agent's false belief was induced because she was turned back during the key events, there was a change of -0.77 (95% CI: -1.31; -0.24).¹³ The sample size also seemed to influence performance: the bigger the sample size, the lower the probability of finding correct performance in the task (β = -0.01, p = .02, 95% CI: -0.02; 0.00). However, the confidence interval for this effect size includes zero. Finally, Figs. 5-7 show how the variables are distributed across the conditions recruited for the meta-analysis. Fig. 5a shows that all the studies that found large effect sizes were run with small samples; when bigger sample sizes were employed, the effect sizes tended to decrease. Fig. 5b, on the other hand, shows the relation between dropped participants and the effect sizes found in each condition: only a few studies excluded more than 20 participants. In addition, in the only two studies in which all infants were included in the sample, the resulting log odds of correct performance were low.

Note (Table 6): AIC = Akaike information criterion; BIC = Bayesian information criterion.

Table 7
Results of the meta-analyses per paradigm, assuming random effects models. Note: SE = standard error; CI = confidence interval; I² = dispersion due to heterogeneity.

¹¹ Thanks to an anonymous reviewer for this suggestion.
¹² We may say that, in the anticipatory looking paradigm, agents are mainly videotaped persons. However, this category does not completely overlap with paradigm: a videotaped person is also used in one VOE study and some AL tasks employ a virtual agent (not a person).
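The small-study pattern visible in Fig. 5a is what an Egger-type regression formalizes: standardized effects are regressed on precision, and an intercept far from zero signals funnel-plot asymmetry. A sketch with invented data (not the conditions analyzed here):

```python
import numpy as np

# Invented conditions: larger effects come with larger standard errors
lo = np.array([1.5, 1.2, 0.9, 0.4, 0.2, 0.1])    # hypothetical log odds
se = np.array([0.60, 0.50, 0.40, 0.25, 0.15, 0.10])

z = lo / se               # standardized effects
precision = 1.0 / se
# Egger-style regression: z = intercept + slope * precision
slope, intercept = np.polyfit(precision, z, 1)
# A large positive intercept is consistent with small-study / publication bias
```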
Effect size as a function of infants' age is shown in Fig. 6. As infants become older, correct performance tended to decrease, although the trend was non-significant (p = .06).¹⁴ Interestingly, most of the tests were carried out with 18-month-old infants. This age group, however, showed an uneven performance on the task: they passed the VOE tests but showed all possible results in the other two paradigms (above-chance, at-chance and below-chance performance). Interestingly, there were no studies carried out in the specific age range from 19 to 23 months old. Also, there was only one study employing the VOE paradigm with older infants (24 months old), and interactive tests were consistently conducted from 18 months of age onwards.
Fig. 7 shows the dispersion of the data in each paradigm according to (a) the design employed, (b) the interaction between the child and the agent, and (c) the motive of the transformation. Interactive paradigms mostly employed between-subjects designs (testing the child in only a FB condition), while within-subjects designs were used in the other paradigms (although not as often as between-subjects ones; Fig. 7a). Fig. 7b shows how interaction overlaps with the paradigm: AL and VOE paradigms are non-interactive measures, and interaction appears in those tasks in which infants should help or point to the agent. Finally, Fig. 7c shows that the transformation of the scene (i.e., the change of location) was mainly done to trick the agent in the interactive tasks. All the other changes that provoked a false belief in the agent were done for other reasons (or none at all).

Fig. 4. Meta-regression scatterplot of the log odds of correct performance as a function of the year of publication, separating the studies according to the paradigm employed (anticipatory looking, interactive or violation-of-expectation paradigms). Each circle represents one false belief condition and its size varies depending on its weight in the meta-analysis. Each meta-regression line takes into account the effect size predicted by both moderators (year of publication and paradigm).

Fig. 6. Scatterplot of the log odds of correct performance as a function of infants' age. Each circle represents one condition and is colored according to the paradigm employed.

Note (Table 8): SE = standard error; CI = confidence interval; MS = mental state.

¹³ Similarly, the agent is turned back during the belief induction phase only in the AL tasks. However, this category does not overlap with all the studies using the AL paradigm since it does not occur in four tasks.
¹⁴ This result contradicts the effect of infants' age found in model 2: when age was included as a predictor, a positive estimate was obtained. However, the negative influence that age has on its own was explained in model 2 by another predictor (i.e., the year of publication).

Discussion
This meta-analysis reveals that, on average, correct performance on spontaneous-response false belief tasks is more likely than incorrect performance, suggesting that the tasks tap a real phenomenon. However, the meta-analysis simultaneously uncovers important issues: (i) an increase in year of publication implies a decrease in correct performance; (ii) considerable heterogeneity is present among studies; (iii) the asymmetrical distribution of effect sizes in the funnel plot suggests publication bias; and (iv) correct performance depends on the type of paradigm used. Taken together, these results indicate that the developmental puzzle is still unsolved.
First of all, we saw that large effect sizes were found at the onset of research on implicit false belief attribution, but this outcome has not been replicated in more recent experiments. Relatedly, those large effect sizes were found with small samples, whereas later studies used bigger samples.
Secondly, the high percentage of heterogeneity found across studies suggests that, despite the significant effect, we still do not know which factors really explain the variance across studies. Even though two of the parameters (i.e., the year of publication and the paradigm) considerably reduced such variance among studies -from a high to a medium ratio of heterogeneity- much variance in the dataset remained that the model could not explain.
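I² (reported above as 48.19% for model 3) expresses how much of the total dispersion reflects between-study heterogeneity rather than sampling error. A sketch of the relation, using the reported residual tau² and an invented "typical" within-study variance:

```python
def i_squared(tau2: float, typical_v: float) -> float:
    """I^2: proportion of total dispersion due to between-study heterogeneity."""
    return tau2 / (tau2 + typical_v)

# With the reported residual tau^2 = 0.19 and a typical sampling variance
# of ~0.20 (invented for illustration), I^2 lands near 49%, close to the
# 48.19% reported for model 3
print(round(i_squared(0.19, 0.20), 2))  # 0.49
```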
Fig. 7. Scatterplot of the log odds of correct performance as a function of the paradigm employed. Each circle represents one condition and is colored according to (a) the design employed -between-subject (BS) or within-subject (WS) design; (b) the interaction between the child and the agent; and (c) the motive of the false belief transformation -whether to trick the agent or for no reason at all.

Thirdly, the skewed distribution of the effect sizes hints at the presence of publication bias in this literature. Current initiatives to look for unpublished papers and to replicate previous studies imply that many researchers have become aware of this bias. For example, some journals have invited researchers to submit replications of the tasks (for instance, the special issue Understanding theory of mind in infancy and toddlerhood published by the journal Cognitive Development), and the scientific community is making an effort to conduct direct as well as conceptual replications. Especially relevant for the future of the field is the ManyBabies framework (Frank et al., 2017), which is currently running a large-scale replication on infants' theory of mind. The project ManyBabies 2 aims to run three waves of multi-lab studies developing tasks to measure VOE, AL and interactive behaviors. In this effort, researchers from different theoretical points of view will collaborate to devise the experiments and run them in their own labs. The present meta-analysis, then, constitutes a first step in that direction insofar as it provides an overall picture of the field that can be compared with future overviews of the area as new results accumulate.
Fourth, the type of task employed seems to be relevant. When the year of publication and the paradigm were included in the model, only the VOE paradigm -of the three- produced a significant effect.
This implies that infants reliably performed above chance only within this methodology. But if false belief attribution were achieved at these ages, then it should take place regardless of the task used, just as has been found for elicited-response FBT. Studies employing within-subject designs are critical to shed light on this lack of convergence across paradigms (Poulin-Dubois & Yott, 2018; Powell et al., 2018).
In addition, many of the VOE effects are due to a single lab; further, analytic decision-making varies substantially from paper to paper in this body of work. These studies combine three looking-time measures in different ways: disruptive-looking and continuous-looking measures (used to end a trial) and a cumulative minimum looking time (before which a trial could not end) are systematically varied across conditions and studies without independent justification (Rubio-Fernández, 2019b). This fact certainly obscures the interpretation of the results in the VOE paradigm.
On the other hand, there is great variability in the confirmatory looking times across the FB conditions in the VOE paradigm, which is surprising in itself because it could not be predicted a priori (all the looking times for expected and unexpected trials are reported in Table 9). Looking times varied considerably from one experimental condition to another: they stretched from 4.52 to 29.5 s when unexpected outcomes were presented and from 3.27 to 18.8 s in the case of expected events. At least part of this variability could be due to the different materials employed in each task, the setting in which studies were embedded, and the cultural context in which the studies were run. Still, this variability should make researchers ask how to set and justify a criterion for expected versus unexpected looking times, rather than stipulating one ad hoc depending on the times found.¹⁵ In any case, what matters for the VOE methodology is whether longer looking times in fact indicate a surprise reaction, rather than duration per se. As was well established long ago, longer looking may also indicate a preference for novelty (Fantz, 1967): without surprise, longer looks might indicate a preference for an event that is seen as new rather than unexpected. This point brings us to a second consideration, common in the debate about the use of this methodology to assess conceptual development in 1-year-olds: the familiarization phase involved in the VOE paradigm might just induce perceptual expectations instead of the assumed expectations in terms of belief and intention attribution. The fact that a long and constant familiarization phase is required reinforces this possibility, providing the participants with some anticipation of where the elements involved in the scene are likely to move.
In other words, the familiarization phase might induce the infant to attend to a particular location (Falck, Brinck, & Lindgren, 2014) or to a particular perceptual configuration (Heyes, 2014).¹⁶ However, the studies did not include a measure of familiarization or habituation. An independent way to assess the expectations generated, if any, during the familiarization phase would be welcome. The design should also involve greater perceptual variability in terms of colors, distances, objects and all the perceptual details of the situation. This could easily be tested with (i) a control condition in which perceptual changes were introduced in the familiarization phase to check whether the VOE response still appeared, thereby indicating conceptual understanding; and (ii) a control condition in which the familiarization phase involved perceptually different instances of the same intentional event. Additionally, if infants were already able to ascribe beliefs in the familiarization phase, a very brief familiarization period would suffice.

¹⁵ In Onishi and Baillargeon's (2005) first false belief condition, a mean looking time of 17.47 seconds was taken to indicate an expected look, while in the second false belief condition, a mean looking time of 19.89 seconds was taken to indicate an unexpected look (just two seconds of difference).
¹⁶ As a corollary, accumulated looking time is a very rough measure of VOE; social looking has been proposed as an alternative, more reliable measure of VOE than looking time (Dunn & Bremner, 2017). While the VOE paradigm relies on a measure that only focuses on the macrostructure of looking, looking times are in fact composed of individual fixations of variable duration and a mix of active information processing and blank stares (Aslin, 2007, p. 50). In the VOE paradigm researchers manually collect information about infants' looking time to the scene, but which aspects of it infants are exactly looking at remain unknown or uncontrolled. The use of more sophisticated technology, like eye-tracking devices, is needed to examine what infants are doing during that period of time. An example of that endeavor is Dörrenberg, Rakoczy, and Liszkowski's (2018) paper, in which they also measure pupil size with an eye-tracker.
Main effects analyses showed how multiple aspects, on their own and in combination, decreased the probability of correct performance in these tasks. We may be unaware of many other variables that moderate infants' performance. Although we attempted to code more parameters, like the separation between locations and the delay between hiding the object and testing, most authors did not report such information. For instance, the distance between the two locations (i.e., boxes) was reported in only 13 out of 56 conditions; in those studies, the distance varied considerably, from 5 cm to 100 cm. The delay between hiding the object in its last position and testing, on the other hand, was not specified. In some cases it could be estimated, but only imprecisely, since infants controlled the end of the trials in the VOE studies. The analysis of both parameters could help explain the variability in the data.

Study limitations
We note some potential limitations of the current meta-analysis that simultaneously constitute avenues for new research. First, we only focused on peer-reviewed journal publications; hence we did not have access to reports that failed to be published. The present meta-analysis, as we claimed before, is a necessary first step toward a panoramic view of the evidence published so far. It is also an open invitation to submit non-published results in order to increase the number of false belief conditions considered and rerun the analyses. Second, other moderators that we did not code could have influenced the phenomenon (like nationality, the distance between locations, the delay between hiding and testing, etc.). Third, although we chose to work with the more prolific paradigms in the field, new ways to measure implicit false belief ascription are being developed, like the use of electroencephalography (EEG) to record brain wave patterns during the task (Southgate & Vernetti, 2014).
Fourth, we selected samples in which infants were younger than 26 months old. In future work, it would also be compelling to study the developmental trend of the capacity to attribute beliefs and see what happens in the transition from two to three years of age. This decision would also imply broadening the inclusion criteria in order to select verbal spontaneous-response paradigms, like the verbal AL and the verbal VOE tasks (He, Bolz, & Baillargeon, 2012; Scott, He, Baillargeon, & Cummins, 2012). Finally, we only included conditions in which infants were required to attribute false beliefs to another. However, almost all studies include additional conditions, like true belief as well as ignorance and other control conditions. Infants' above-chance performance on implicit FBT would be doubtful if they, for example, performed at or below chance level in the true belief condition. As Rubio-Fernández (2019a) correctly pointed out, different kinds of criteria have been proposed for passing the FBT in infants and older children. While a differential performance between true and false belief conditions is expected for infants, older children also have to perform above chance level in both conditions in the traditional task. So, a careful analysis of infants' performance in spontaneous-response true belief tasks is also needed to get a more comprehensive view of infants' mentalistic abilities.

Conclusion
In conclusion, this meta-analysis is the first to propose an integrative view of all the published studies on spontaneous-response false belief tasks in children younger than 26 months old. Infancy is a key developmental period in which implicit methodologies are being used to test the origins of many psychological constructs, such as theory of mind and the specific capacity to attribute false beliefs. The emerging picture from this systematic review, however, is puzzling. Although correct performance on spontaneous-response FBT was on average more likely than incorrect performance, the analyses revealed publication bias and high heterogeneity across the data. While several authors have shown that preschoolers display a robust developmental pattern from below-chance to above-chance performance in the standard FBT (Wellman et al., 2001), infants' performance on the nonverbal FBT is equivocal and does not show any clear developmental pattern.
Understanding the behavior of another agent according to her false belief would minimally require robust attributions. By this we mean that the attributor, who ascribes a false belief to an agent, should correctly help or inform her when necessary, look in anticipation to the location where she will go, and look longer when she acts incongruently with her epistemic state. Experimental settings should move forward in that direction and test the same children across a battery of implicit tasks that employ different paradigms. In this way, we could examine individual differences in performance (e.g., infants who show understanding across all the implicit measures and infants who fail all of them) and the convergent validity of the tests (like Poulin-Dubois & Yott, 2018, who failed to find convergent validity). Similarly, tasks that concurrently combine two implicit measures are welcome, like measuring infants' anticipatory looks and (un)expected looks (e.g., Dörrenberg, Rakoczy, & Liszkowski, 2018) or participants' anticipatory looks and interactive behavior (work in progress).

Declaration of Competing Interest
None.