Comparison between the Short Story Task and the Reading the Mind in the Eyes Test for evaluating Theory of Mind: A replication report

Abstract
Introduction: The ability to attribute emotional states, beliefs, and intentions to others has been termed Theory of Mind (ToM), mentalizing, and mind reading. The purpose of this study was to find an instrument to measure ToM in the Mexican population that would yield similar results to those obtained in other cultures and could discriminate between individuals. To achieve this objective, we replicated a study which compared two measures of ToM in a sample of English-speaking, neurologically intact adults.
Methods: A sample of young Mexican adults (n = 118) was evaluated on the Reading the Mind in the Eyes Test (RMET) and a test that uses naturalistic narrative stimuli, the Short Story Task (SST), and on tests of general cognitive ability, executive functions, and empathy.
Results: We found a significant correlation between the ToM tests, and both tests correlated with verbal ability, general cognitive ability, and empathy, similar to what was seen in a previous study. Both tests discriminated between individuals and were challenging enough that we found no perfect scores.
Conclusions: These results show that both the RMET, which taps into emotion recognition and its categorization with language, and the SST, which relies on narrative fiction to test the ability to interpret mental states, show concurrent validity in a sample of neurologically intact young adults from a Latin-American culture; these tests may be useful in the clinical setting and for basic research into ToM.

M. Giordano is a full professor at the Instituto de Neurobiologia, UNAM. Currently, her group is interested in understanding the cognitive and neural basis of pragmatic language comprehension. Pragmatic language is used in everyday interactions to communicate intentions; it includes, but is not restricted to, figurative language, as in metaphors, and indirect speech. There is limited knowledge about the cognitive operations that are involved in pragmatic language comprehension, and most of that information comes from studies in clinical populations. She uses psychometric and neuroimaging tools to answer these questions.

PUBLIC INTEREST STATEMENT
The ability to attribute emotional states, beliefs, and intentions to others has been termed Theory of Mind (ToM), mentalizing, and mind reading. Since human beings are a social species, social competence is evolutionarily adaptive and indeed essential for adequate interactions. ToM may be one of three central processes that are part of social information processing, the other two being social perception and action observation. ToM is likely to involve a variety of more basic sub-processes that have yet to be described. Psychometric tools that are reliable and can discriminate among neurologically intact adults from different cultural backgrounds are necessary and useful for research in ToM. The purpose of this study was to find a reliable instrument to measure ToM in the Mexican population that would yield similar results to those obtained in other cultures and could discriminate between individuals.

Introduction
The ability to attribute emotional states, beliefs, and intentions to others has been termed Theory of Mind (ToM) (Schurz, Radua, Aichhorn, Richlan, & Perner, 2014), mentalizing (Frith & Frith, 2006), and mind reading (Turner & Felisberti, 2017). The term ToM was initially used by Premack and Woodruff (1978) to describe the observation that non-human primates appeared to anticipate the actions of a human faced with a problem. The concept can also be traced to Daniel Dennett's philosophical proposal of the intentional stance (Dennett, 1987).
This ability changes across the lifespan, is associated with effective social interactions, and prosocial behavior (Turner & Felisberti, 2017), and is central for utterance interpretation (Cummings, 2017). Deficits in ToM are present in various disorders including autism spectrum disorder, psychiatric disorders such as schizophrenia (Turner & Felisberti, 2017), and after brain lesions (Shamay-Tsoory & Aharon-Peretz, 2007).
ToM is a multidimensional construct (Turner & Felisberti, 2017), and it has been argued that it needs to be reformulated and deconstructed (Schaafsma, Pfaff, Spunt, & Adolphs, 2015). Some of the proposed elements of ToM include perceptual discrimination and categorization of the socially relevant stimuli and interoceptive signals elicited by those stimuli, semantic or conceptual knowledge, and executive and motivational processes. Social content and causal inference are distinctive features of ToM (Schaafsma et al., 2015). Two ToM components have been identified as fundamental predictors of well-being, the affective, or "hot" component involving the comprehension of emotional states; and the cognitive, or "cold" component involving the attribution of thoughts and beliefs (Henry, Cowan, Lee, & Sachdev, 2015).
Due to the multidimensional nature of ToM, many different tests have been designed. A recent review categorized the existing tests for ToM into three classes: emotion recognition tests; cognitive and affective mentalizing tasks that measure attribution of beliefs, intentions, desires, and emotions; and multidimensional measures that combine these features (Turner & Felisberti, 2017). A meta-analysis using a different approach divided the studies into story-based and nonstory-based (Mar, 2011).
The Reading the Mind in the Eyes Test (RMET; Baron-Cohen, Jolliffe, Mortimore, & Robertson, 1997) has been widely used. It involves recognition of mental state terms, probes non-automatic processes, and is sensitive to variation in neurologically typical adults. The revised version of the RMET includes 36 items with four response options per item (Baron-Cohen, Wheelwright, Hill, Raste, & Plumb, 2001). Some studies have found correlations between RMET and measures of empathy or IQ (reviewed in Vellante et al., 2013).
Naturalistic narrative stimuli are an alternative that allows ToM targets to be integrated into a context (Turner & Felisberti, 2017). An example is the Short Story Task (SST), a test developed by Dodell-Feder, Lincoln, Coulson, and Hooker (2013). This task requires reading a fictional story written by Ernest Hemingway about two characters in a romantic relationship and includes questions that assess explicit mental state reasoning and comprehension. The test demonstrated sensitivity to variations among neurologically typical adults and concurrent validity with the RMET and the Interpersonal Reactivity Index (IRI) (Dodell-Feder et al., 2013).
Because of our interest in the cognitive basis of pragmatic language and the central role that ToM plays in utterance interpretation (Cummings, 2017), our purpose was to find a dependable measure of ToM for neurologically intact Spanish-speaking Mexican adults that would yield similar results to those obtained in other cultures and could discriminate between individuals. Although there may be a universal cognitive substrate for ToM, this needs empirical confirmation. Cultural tools and practices shape brain processes (Kitayama & Park, 2010), there is variation in human psychology and behavior across human populations (Henrich, Heine, & Norenzayan, 2010), and there are radical differences between languages (Evans & Levinson, 2009). For these reasons we needed to compare the performance of our participants in these ToM tests with that of participants from a different culture who speak a different language. To achieve this objective, we replicated the study by Dodell-Feder et al. (2013), and correlated the RMET (Baron-Cohen et al., 2001) and the SST (Dodell-Feder et al., 2013) with each other, and also with empathy and general intelligence. In addition, we measured the internal consistency of these tests and their correlation with executive functions.
Empathy and general intelligence were measured to make our study comparable to that of Dodell-Feder et al. (2013). Empathy refers to the ability to recognize and identify what someone else is feeling (cognitive component), and to share that emotional state (emotional component) (Lucas-Molina, Pérez-Albéniz, Ortuño-Sierra, & Fonseca-Pedrero, 2017). To measure empathy, we used the Interpersonal Reactivity Index (IRI), a self-report instrument consisting of four separate subscales that have been widely used to test both components of empathy (Lucas-Molina et al., 2017). The fantasy subscale of the IRI was found to correlate with the ToM measures (Dodell-Feder et al., 2013), and we expected to see similar results in our sample. Executive functions were measured to identify whether performance on the ToM tests was related to domain-general functions, such as inhibitory control, working memory, verbal fluency, and mental flexibility. In addition, these measures, together with general intelligence, allowed us to describe our sample more fully.

Methods
Participants (n = 118, 73 females) were young (23.03 ± 3.61 years of age) Mexican students. Inclusion criteria were having Spanish as their maternal language, being currently enrolled in a bachelor's or graduate degree program, being between 18 and 35 years of age, and signing the informed consent form. Exclusion criteria included having been diagnosed with a neurological or psychiatric disorder, and scores above the norm on the Symptom Checklist 90 (SCL-90; Cruz-Fuentes, López-Bello, Blas-García, González-Macías, & Chávez-Balderas, 2005). The study was approved by the institutional ethics committee and complied with the federal guidelines of the Mexican Health Department (http://www.salud.gob.mx/unidades/cdi/nom/compi/rlgsmis.html), which agree with international regulations.

Intelligence and executive functions tests
The Wechsler Adult Intelligence Scale (WAIS-IV) was used to obtain the Perceptual Reasoning Index (PRI), and Verbal Comprehension Index (VCI). Together, these indices are used to calculate the General Ability Index (GAI). Executive functions (inhibitory control, working memory, verbal fluency, and mental flexibility) were evaluated using the Batería Neuropsicológica de Funciones Ejecutivas y Lóbulos Frontales (BANFE) (Flores Lázaro, Ostrosky Shejet, & Lozano Gutiérrez, 2014).

Theory of mind and empathy tests
The IRI was used to measure cognitive and emotional empathy (Davis, 1980). It is a self-report questionnaire that includes four subscales: Fantasy (FS) determines the identification with fictional characters; Perspective Taking (PT) relates to the consideration of others' viewpoints; Empathic Concern (EC) measures feelings of compassion and concern for others in need; and Personal Distress (PD) relates to the reaction of discomfort to the distress of others (Lucas-Molina et al., 2017). A recent study has confirmed the construct validity and equivalence between genders of the four-factor structure of the IRI in a large sample of Spanish-speaking college students (Lucas-Molina et al., 2017).
To test ToM, we used the Argentinian version of the RMET in Spanish (downloaded from the Autism Research Center webpage; http://www.autismresearchcentre.com/arc_tests) (Baron-Cohen et al., 1997, 2001). The test was programmed in PsychoPy (version 1.83.01; Peirce, 2007). The test consists of 36 photographs of eyes expressing different mental states. Each photograph was displayed in the center of the screen, with one of four mental state terms in each corner. Participants had to press the letter q, p, z, or m, each corresponding to one of the corners of the screen. To increase the difficulty of the task, they did not have access to the glossary of terms. The other ToM test was the SST (Dodell-Feder et al., 2013). This test consists of reading a short story and answering a spontaneous mental state inference question, five reading comprehension questions, and eight questions about the beliefs, emotions, intentions, and desires of the characters in the story. A Spanish translation was obtained and revised by a professor of Linguistics to ensure that it was written in correct Spanish and that it was comprehensible. The task was applied according to the instructions by Dodell-Feder et al. (2013).

Statistics
We used SPSS (v. 24.0, IBM Corp.) to calculate descriptive statistics, correlations, and internal consistency measures (Cronbach's alpha); only raw scores were used. To compare between genders, we used t-tests for independent samples. A probability of less than 0.05 was considered significant, and Bonferroni correction was applied for multiple correlations between tests of empathy and ToM, and between intelligence and executive functions tests.
For the correlation analysis our sample was three times larger than the minimum size (n = 30) for a power of 80% (Sample Size Calculators from UCSF Clinical and Translational Science Institute; calculated using the r = 0.49 value reported by Dodell-Feder et al. (2013)).
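The minimum sample size quoted above can be approximated with the standard Fisher z formula for detecting a correlation. The sketch below is illustrative only: it assumes a two-sided test at alpha = 0.05, and the function name is hypothetical. It is not a reproduction of the UCSF calculator itself.

```python
import math
from statistics import NormalDist


def min_n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect a correlation r (two-sided test).

    Uses the Fisher z approximation: N = ((z_alpha + z_beta) / C)^2 + 3,
    where C = atanh(r). The two-sided alpha is an assumption here.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    c = math.atanh(r)                              # 0.5 * ln((1 + r) / (1 - r))
    return ((z_alpha + z_beta) / c) ** 2 + 3


# r = 0.49 is the RMET-SST correlation reported by Dodell-Feder et al. (2013).
n_required = min_n_for_correlation(0.49)
print(f"Approximate minimum N: {n_required:.1f}")
```

Under these assumptions the result is close to the n = 30 stated above, so a sample of 118 is more than three times the minimum.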

Results
In agreement with results obtained by previous studies (Dodell-Feder et al., 2013), we found a significant positive correlation (r(116) = 0.351, p < 0.001) between the RMET and SST (mental state reasoning score; MSR) (Figure 1); better performance in the RMET was associated with a higher score on MSR. The distribution of scores was normal with moderate skewness and kurtosis. No differences in ToM scores between female and male participants were found. The only differences between genders were in perceptual reasoning (WAIS-PRI score), which was higher for males (t(115) = 2.175, p < 0.05; 95% CI [0.4442, 9.4835]), and in the empathic concern subscale of the IRI (IRI-EC), which was higher for females (t(106) = 2.122, p < 0.05; 95% CI [0.1285, 3.7881]). The descriptive statistics of the various tests are presented in Table 1.
Regarding empathy, and in agreement with previous studies (Dodell-Feder et al., 2013), we found that both ToM tests correlated only with the fantasy subscale of the test of empathy (IRI-FS) (r(108) = 0.212, 0.219, p < 0.030, for RMET and MSR/SST, respectively). The correlations did not reach significance when correcting for the number of comparisons for all tests of social cognition, i.e., empathy and ToM. Importantly, the partial correlation between the two ToM tests was still significant when controlling for the fantasy subscale of the test of empathy (IRI-FS) (r(105) = 0.297, p < 0.003). The IRI data of ten participants were not stored due to a hardware malfunction.
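To make the partial correlation concrete, the sketch below applies the standard first-order formula to the zero-order correlations reported in the Results. It is illustrative only, and the function name is hypothetical. The zero-order values come from samples of different sizes (the RMET-MSR correlation uses the full sample, while the IRI correlations use only participants with stored IRI data), so the output is not expected to match the reported r(105) = 0.297 exactly.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation between x and y, controlling for z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator


# x = RMET, y = MSR/SST, z = IRI-FS (zero-order values from the Results).
r_partial = partial_correlation(r_xy=0.351, r_xz=0.212, r_yz=0.219)
print(f"Approximate partial correlation: {r_partial:.3f}")
```

Because the control variable correlates only weakly with each ToM test, removing its contribution changes the RMET-MSR association very little, which is consistent with the pattern described above.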
In terms of internal consistency measures (Cronbach's alpha), the RMET had an alpha of 0.530; the mean inter-item correlation was 0.032. For the RMET, we found that most items were answered above chance (25%); item 19 had less than 50% correct responses, followed by items 1, 7, and 9, which were slightly above 50%. Mean performance was 71.20% accurate (Table S1). The 95% confidence interval for the mean score was 24.93-26.32 (Table 1) with a perfect score being 36; the lowest score was 17 and the highest 33. These results are similar to those obtained in other non-clinical populations (Baron-Cohen et al., 2001; Dodell-Feder et al., 2013), and show that there is enough variability in the responses to discriminate between individuals (Figure 2; Table S1). To obtain more information about the items included in the RMET, a correlation between mean response time and mean correct responses per item was calculated. The analysis showed a significant negative correlation between mean response time and mean correct answers per item for the RMET (r = −0.617, N = 36, p < 0.001; Figure 3), indicating that the items in the test are heterogeneous in terms of difficulty. Items that were answered accurately were also answered faster and were probably clear-cut for the participants; in contrast, those that took more time had a lower mean score and were perhaps more challenging for the participants. Performance in this test also correlated with verbal fluency, so it is possible that the more challenging items required a finer distinction to be made among the mental state terms.
For the SST, internal consistency (Cronbach's alpha) was 0.514 for the comprehension questions and 0.621 for the mental state reasoning questions, with an inter-item correlation of 0.166 for this section of the test. Only 18% of the participants made a spontaneous mental state inference, and none received a perfect score in this test. Only 10% of the participants obtained a perfect score of 10 for the comprehension questions. For the mental state reasoning questions, there were no perfect scores; the maximum score was 13 out of 16. The first comprehension question had the lowest mean score, while the first and second items for the mental state reasoning questions had the lowest mean scores. Just as for the RMET, we found that there was enough variability to distinguish between individuals (Figure 4). In contrast with the study by Dodell-Feder et al. (2013), we did not find perfect scores in either of the ToM tests.
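The relation between the number of items, the mean inter-item correlation, and alpha can be illustrated with the standardized (Spearman-Brown) form of Cronbach's alpha. This is a sketch under stated assumptions: the alphas reported above were computed in SPSS from raw scores, which uses item variances and covariances, so the standardized values below are only expected to approximate them. The function name is hypothetical, and the item counts (36 RMET items, 8 mental state reasoning questions) are taken from the test descriptions in the Methods.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized Cronbach's alpha from the mean inter-item correlation."""
    return (n_items * mean_inter_item_r) / (1 + (n_items - 1) * mean_inter_item_r)


# RMET: 36 items, mean inter-item correlation 0.032.
alpha_rmet = standardized_alpha(36, 0.032)
# SST mental state reasoning: 8 questions, inter-item correlation 0.166.
alpha_msr = standardized_alpha(8, 0.166)

print(f"RMET standardized alpha: {alpha_rmet:.3f}")
print(f"MSR standardized alpha:  {alpha_msr:.3f}")
```

The comparison shows why a long test with very weakly correlated items (RMET) can end up with an alpha similar to, or lower than, that of a short test whose items correlate more strongly (MSR).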

Discussion
The results of this study replicate those by Dodell-Feder et al. (2013) in a sample selected from the Mexican population, with a different language and culture, in terms of the convergent validity between a widely-used instrument, the RMET (Baron-Cohen et al., 2001), and an instrument that uses fiction to test ToM, the SST (Dodell-Feder et al., 2013). These two measures shared about 12% of the variance between them and correlated with verbal abilities (WAIS-VCI). RMET scores also correlated with a measure of verbal fluency, indicating the reliance of this test on lexical skills. Scores on both instruments also correlated with intellectual capacity as measured by the General Ability Index from the WAIS-IV, and with the fantasy subscale of a measure of empathy (IRI). The correlation between both measures of ToM was still significant when controlling for those cognitive variables, similar to what Dodell-Feder et al. (2013) found.
The contribution of this study is the demonstration that both the RMET, that taps into the recognition of emotions and its categorization with language, and the SST, which relies on narrative fiction to test the ability to interpret mental states, are both useful tests for neurologically intact young adults from a Spanish speaking, Mexican culture. Our results, for the most part, reproduce those by the proponents of the SST, thus widening the choices available to researchers working with this kind of population that are interested in studying ToM, its underlying mental operations, and its relation to cognitive functions. We did not find ceiling effects, that is, there were no perfect scores in either test, and we found that both tests could discriminate between individuals. Although our results argue for the use of both of these tests in the Mexican population, it is essential to make some considerations.
Figure 3. Correlation between mean correct per item (percentage) and mean reaction time (sec) to answer each item for the Reading the Mind in the Eyes Test. A significant negative correlation was found (r = −0.617, N = 36, p < 0.001).
It must be noted that the proportion of variance that these two tests shared in our sample was 12% (r = 0.35), while for the sample reported by Dodell-Feder et al. (2013) it was 24% (r = 0.49). Among the possible reasons for this are differences between the samples, including cultural differences, differences in the language of the tests, as well as a statistical effect (Button et al., 2013). In terms of the differences between samples, 50% of their participants made at least one spontaneous mental state inference, while few of our participants answered the spontaneous inference question correctly. The average scores for mental state reasoning and comprehension in their sample were two points above those of the sample in the present study. Cultural differences may also play a role. Because the narration is a story written by an American writer, albeit one well known worldwide, it may reflect subtleties of American culture that are not easily understood by a person born and raised in a Latin American culture. Nevertheless, the results of both studies coincide in the correlations observed between the SST/MSR and the fantasy subscale of the IRI (IRI-FS), and general cognitive ability, which suggests that the SST/MSR is measuring the same cognitive construct in both samples.
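The shared-variance figures above follow from squaring each correlation coefficient (the coefficient of determination). A minimal sketch, with a hypothetical function name:

```python
def shared_variance(r):
    """Proportion of variance shared by two measures with correlation r."""
    return r ** 2


present_study = shared_variance(0.35)   # RMET-MSR correlation in this sample
previous_study = shared_variance(0.49)  # value from Dodell-Feder et al. (2013)

print(f"Present study:  {present_study:.0%}")
print(f"Previous study: {previous_study:.0%}")
```

Because the relation is quadratic, a moderate difference in r (0.35 versus 0.49) corresponds to roughly a twofold difference in shared variance.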
A few studies in children have evaluated the role of bilingualism in the development of ToM and have found that bilingual children outperform monolinguals in typical ToM tasks when controlling for factors such as language proficiency, age, socioeconomic status, and intelligence. The explanations given by the authors vary: some stress the enhancement of executive functions in bilingual children (Goetz, 2003; Kovács, 2009), working memory (Nguyen & Astington, 2014), metalinguistic understanding (Goetz, 2003), and a positive influence on social cognition (Diaz & Farrar, 2018; Goetz, 2003). In this study we did not formally evaluate language proficiency in other natural languages besides Spanish, which was the maternal language of our participants, and all tests were given in Spanish. It could be of interest to evaluate whether the advantage observed in bilingual children extends to adults who learn a second language later in life.
The SST showed better internal consistency than the RMET: we found an alpha of 0.53 for the RMET, and 0.62 for the MSR/SST. In the previous study, an alpha of 0.54 for MSR/SST was reported (Dodell-Feder et al., 2013). In the case of the RMET, the reported values of alpha vary (see Vellante et al., 2013); in a reduced version of the test in a Spanish sample it was 0.56 (Redondo & Herrero-Fernández, 2018). The RMET may be measuring more than one factor, which would influence its internal consistency. Some have found it to be unidimensional (Preti, Vellante, & Petretto, 2017); others (Olderbak et al., 2015; Redondo & Herrero-Fernández, 2018) have not found a single factor solution, and instead obtained a shorter version of the test which yielded better internal consistency and homogeneity. We calculated the internal consistency of the 19-item version obtained in the Spanish study (Redondo & Herrero-Fernández, 2018) and found only a marginal increase in internal consistency from 0.53 to 0.58 (data not shown). The main limitation of the present study is the absence of a factor analysis approach that could have contributed to resolving the controversy related to the RMET, and to supporting the view that the SST is evaluating only one factor. Although the number of participants was enough to have adequate power for correlation analysis, it was not enough to conduct a factor analysis (Costello & Osborne, 2005).
The heterogeneity between items in the RMET could be related to the words used to describe the mental states in the test. We used the Argentinian version of the test because the terms used were more common in Mexico than those in the Spanish version. Nevertheless, when the mental state terms were searched in a corpus of contemporary Mexican Spanish (Silva-Pereyra, Rodríguez-Camacho, Prieto-Corona, & Aubert-Vázquez, 2013), only 75% of the words were found, and their frequencies varied widely. Indeed, it has been suggested that a complete back translation for the RMET is necessary (Redondo & Herrero-Fernández, 2018), and we propose that regional variations in the use of Spanish should be taken into consideration. In this study, we found a significant negative correlation between mean score and response time for the RMET, showing that, as could be expected, participants take longer to answer the more challenging items. Also, performance in this test correlated with verbal fluency, so it is possible that the more challenging items required a finer distinction to be made among the mental state terms. Similar to our findings, item 19 was answered correctly by less than 50% of participants in both Argentinian (Allegri, 2006) and Spanish (Fernandez-Abascal et al., 2013) samples, while in other languages the items with the fewest correct responses are different (for a review, see Khorashad et al., 2015).
Our results suggest that the RMET, which evaluates emotion recognition, and a fictional story that requires the attribution of mental states to its characters share a portion of variance, indicating that they may be measuring a shared construct, theory of mind. However, theory of mind is a multidimensional construct (Turner & Felisberti, 2017) that could be broken down into basic processes (Schaafsma et al., 2015). In this regard, neuroimaging studies could contribute to characterizing the various processes involved. Individual neuroimaging studies and meta-analyses of those studies have provided a statistical map of the possible neural substrate underlying ToM (e.g., Mar, 2011; Schurz et al., 2014). Of particular interest to the present study are the results of the meta-analysis by Schurz et al. (2014). These authors found that tasks such as the RMET recruit left inferior frontal areas, temporal regions including the temporoparietal junction (TPJ), and medial prefrontal cortex (mPFC). These areas have been implicated in reasoning about others' beliefs and intentions, considered cognitive ToM, as well as in making inferences about their valence, considered motivational ToM (Koster-Hale, Richardson, Velez, Asaba, Young, & Saxe, 2017). Importantly, Koster-Hale et al. (2017) found that the brain areas recruited in relation to knowledge about others are different from those involved in other components of linguistic and conceptual processing.
Another meta-analysis (Mar, 2011) is particularly relevant for the present study because the SST uses narrative fiction to test ToM, and Mar's research showed that this particular type of task taps into brain areas involved in ToM. Briefly, Mar (2011) carried out a meta-analysis using the activation likelihood estimation (ALE) approach for neuroimaging studies of story comprehension and studies that used story-based and nonstory-based ToM tests. The conjunction analysis of the three types of tasks showed overlap in areas associated with ToM including mPFC, bilateral posterior superior temporal sulcus (pSTS), TPJ, bilateral middle temporal gyrus, and left inferior frontal gyrus (pars opercularis). Based on those results, Mar (2011) proposed that ToM processes are employed during narrative comprehension, as readers infer the mental states of characters similarly to how they understand them in real-life conspecifics. However, these brain areas include but are not limited to ToM and narrative comprehension; an alternative explanation is that they subserve more general cognitive processes. Processes such as the projection of the self, scene construction, associative processing, or the integration of motivational systems with language/categorization systems appear to share this network (Mar, 2011).
Finally, Yang, Rosenblau, Keifer, and Pelphrey (2015) have proposed that ToM is one of three central processes that are part of social information processing, the other two being social perception and action observation. They highlight the role of the pSTS, which is at the intersection of these processes and supports social information processing on different levels, from temporal integration of sensory cues of others' behaviors to representation of a basic form of intentionality. The proposed neural model includes connections between areas of social perception (amygdala, orbital frontal cortex, fusiform gyrus), areas of action observation (inferior parietal lobule, inferior frontal gyrus), and areas of ToM (TPJ, mPFC, anterior temporal lobule, posterior cingulate cortex/precuneus).
Together, the theoretical proposals and neural models reviewed above underscore the complexity and multidimensional nature of ToM, and its relevance as one of the psychological processes that encode socially and emotionally relevant inputs to provide a response that is adequate to the social context. Given the social nature of humans, expertise in social perception is evolutionarily adaptive, and deficits in this ability seriously affect social competence and hinder valuable interactions (Yang et al., 2015). Psychometric tools that are reliable and do not show ceiling effects in educated, neurologically intact adults from different cultural backgrounds are necessary and useful not only for the clinical setting but also for research in ToM, in order to propose a universal hypothesis regarding the basic operations that it entails, as well as its relation to other cognitive functions.

Conflict of interest
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.