Constructing three emotion knowledge tests from the invariant measurement approach

Background. Psychological constructionist models like the Conceptual Act Theory (CAT) postulate that complex states such as emotions are composed of basic psychological ingredients that are more clearly respected by the brain than basic emotions. The objective of this study was the construction and initial validation of Emotion Knowledge (EK) measures from the CAT frame by means of an invariant measurement approach, the Rasch Model (RM). Psychological distance theory was used to inform item generation.
Methods. Three EK tests (emotion vocabulary, EV; close emotional situations, CES; and far emotional situations, FES) were constructed and tested with the RM in a community sample of 100 females and 100 males (age range: 18–65), both separately and conjointly.
Results. Data-RM fit was sufficient. The effect of type of test and emotion on Rasch-modelled item difficulty was then tested. Significant effects of emotion on EK item difficulty were found, but the only statistically significant difference was that between "happiness" and the remaining emotions; neither type of test nor interaction effects on EK item difficulty were statistically significant. Gender differences were tested after corroborating that differential item functioning (DIF) would not be a plausible alternative hypothesis for the results. No statistically significant sex-related differences were found in EV, CES, FES, or total EK. However, the sign of d indicates that female participants consistently scored higher than male ones, a result that will be of interest for future meta-analyses.
Discussion. The three EK tests are ready to be used as components of a higher-level measurement process.


INTRODUCTION
Contrary to the view of emotions as discrete natural events, an amalgam of expression and behavior with a distinct neural basis, constructionism posits that they are not "ontologically objective" categories or brute facts (Barrett, 2012; Searle, 2010) but ontologically subjective categories. These categories depend on collective intentionality, and not only on physical actions and body states. That some of the so-called emotional behaviors (e.g., fight or freeze) have innate circuits does not imply that discrete emotions have them too (Lindquist et al., 2012). At the neural level, there are no one-to-one (bi-univocal) correspondences between a given emotion and areas of activation (Lindquist et al., 2012), and a variety of emotional experiences are associated with dynamic interactions of extended neural networks (Raz et al., 2016).
If we consider emotion categories as ontologically subjective categories, then we can think of them as cognitive tools allowing us to represent the shared meaning of changes in the natural world, i.e., the shared meaning of both internal physical changes and of sensory changes external to the perceiver (Barrett, 2012). Psychological constructionist models such as the Conceptual Act Theory (CAT) postulate that complex states (e.g., emotions and cognitions) are composed of basic psychological ingredients that are brain-based (Barrett, 2009a). The CAT hypothesizes that physical changes are transformed into emotions when taking on psychological functions that require socially shared conceptual knowledge to be meaningful to the perceiver; it is in this sense that emotions are real: they are both biologically evident and part of our social reality (Barrett, 2006; Barrett, 2012; Barrett, 2014; Wilson-Mendenhall et al., 2011). Some emotion categories serve this purpose only for members of one particular culture, but others (e.g., happiness, sadness, anger, fear, and disgust) can be thought of as closer to universal, and so they are typically found in experimental and developmental studies (Lindquist et al., 2014; Tracy & Randles, 2011).
In any case, emotion categories are not context-independent representations: Emotion knowledge (EK) is situated (Barrett, 2012; Barrett, 2014; Wilson-Mendenhall et al., 2011), and thus there are cultural and individual differences in the use of emotion words, a skill that is closely related to emotional intelligence (Barrett, 2009b). Currently, it is not clear whether ability-based emotional intelligence is a construct with the same status as fluid and crystallized intelligence or whether it is already defined by extant constructs, such as acculturated knowledge/crystallized intelligence (MacCann et al., 2014). The predominant operationalization has been the Mayer-Salovey-Caruso Emotional Intelligence Test Battery, whose psychometric properties are not optimal (Orchard et al., 2009). Given that emotional aptitude variables predict dependent variables as relevant as perceived stress (Rey, Extremera & Pena, 2016) or depressive symptoms (Luque-Reca, Augusto-Landa & Pulido-Martos, 2016), we should start to test narrowly defined emotion domains, each requiring its own theories and measures (Matthews, Zeidner & Roberts, 2012).
The fact that both categorical knowledge and contextual information are constitutive of emotions is a substantive reason to prefer invariant measurement models over the reflective structural equation models usually employed in validation studies; there are also psychometric reasons (e.g., Engelhard & Wang, 2014) to prefer invariant measurement models to the formative structural equation models recommended by Coan & Gonzalez (2015) in the CAT context. An implementation of the invariant measurement approach is the Rasch model (RM; Rasch, 1960), increasingly used to validate psychological and neuropsychological tests (Delgado, 2012; Engelhard & Wang, 2014; Miguel, Silva & Prieto, 2013; Prieto et al., 2010). The probability that subject n passes item i is modeled as $P_{ni} = \exp(B_n - D_i) / [1 + \exp(B_n - D_i)]$, where $B_n$ is the person level and $D_i$ is the item location. This logistic one-parameter model shows the property of specific objectivity, allowing the algebraic separation of item and person parameters (the person parameter can be eliminated when estimating the item parameters). This is so because the sum score for an item or person is a sufficient statistic for the corresponding parameter, i.e., it captures all the information about the corresponding parameter that is contained in the sample.
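The response function just described can be written as a short function. This is a minimal sketch for illustration only; the function name and the example values are assumptions, not part of the study's analysis, which was run in Winsteps.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model.

    ability is the person level B_n and difficulty is the item location D_i,
    both in logits. The expression is algebraically equal to
    exp(B_n - D_i) / (1 + exp(B_n - D_i)).
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A person located exactly at the item's difficulty passes it half of the time.
p_equal = rasch_probability(1.0, 1.0)

# A person one logit above the item passes it with probability e / (1 + e).
p_above = rasch_probability(2.0, 1.0)
```

Only the difference between person level and item location enters the function, which is what allows persons and items to share one scale.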
One of the main advantages of the RM derives from the fact that it is a conjoint measurement model: If empirical data fit the model adequately, then person measures (e.g., aptitude, personality trait) and item locations (e.g., difficulty, severity) can be jointly located on an interval scale (variable map) whose unit is the logit. When using the RM, item parameter estimations are sample-independent and person parameter estimations are independent of the particular items that have been used; this is not true of the classical measurement model. A point of great interest for psychological measurement is that the level of analysis in the RM is the individual, while Structural Equation Models (which are statistical models for covariances) use the group as the level of analysis (Engelhard & Wang, 2014).
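The invariance of item comparisons can be checked numerically: under the RM, the difference in log-odds between two items is the same for every person. The sketch below is an illustration under assumed values; the two item locations are borrowed from the emotion means reported in the results, and the function names are hypothetical.

```python
import math

def log_odds(ability: float, difficulty: float) -> float:
    """Log-odds of success for one person on one item under the Rasch model."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return math.log(p / (1.0 - p))

def item_contrast(ability: float, d_easy: float, d_hard: float) -> float:
    """Difference in log-odds between an easier and a harder item for one person."""
    return log_odds(ability, d_easy) - log_odds(ability, d_hard)

# Illustrative item locations (logits) and three arbitrary person levels.
D_EASY, D_HARD = -1.54, 0.66
contrasts = [item_contrast(b, D_EASY, D_HARD) for b in (-1.0, 0.0, 2.19)]
# Each contrast equals D_HARD - D_EASY, whatever the person level.
```

Because the person level cancels out of the contrast, the comparison between the two items does not depend on who answered them.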
Thus, the general objective of this study was the construction and initial validation of EK measures from a psychological constructionist theoretical frame, the CAT, by means of an invariant measurement approach, the RM. When validating situational tests of emotion understanding, it has been found that items describing situations with close/concrete receivers are easier than those in which receivers are far/abstract (Delgado, 2016), a result that is predicted by psychological distance theories (Soderberg et al., 2015; Trope & Liberman, 2010), and so the close/far distinction has been taken into account in the generation of items for the situational tests. Three EK tests (vocabulary, close situations and far situations) have been constructed and tested with the RM, both separately and conjointly, given that they are all EK measures.

Participants
The sample was composed of 100 females and 100 males, with ages ranging from 18 to 65 years old, Spanish as first language, and Spanish nationality. Roughly half of them (n = 94) were young adults (18-30). The educational level was high (155 participants were attending or had attended college or further education).
Even though the property of specific objectivity allows the person-independent estimation of item parameters, i.e., no representative sample is needed, we obtained the most heterogeneous sample that was available to us by recruiting participants from various Spanish regions in an art museum that was public and open to all.

Instruments
Three tests were constructed with LiveCode (LiveCode Ltd., 2011) and implemented on a portable computer. Identification, gender, age, informed consent, response option and right/wrong answers were asked for and automatically stored by the application. Each of the three tests was composed of 40 multiple-choice items, eight for each of the five emotion "families" of happiness, sadness, anger, fear, and disgust. There was no time limit, and feedback on the total score (number of correct answers; possible range: 0-120) was provided at the end.
Example item stems from the situational tests: "On arriving home, Maria finds out that her best friend has arranged a surprise party for her. What does Maria feel?"; "Years ago, an acquaintance got a secure job in her field of interest. What did she feel?"

Emotion Vocabulary (EV)
Each item stem was an emotion word, carefully selected from the corpus of the Royal Spanish Academy, CORPES XXI, which has 25 million forms for each year between 2001 and 2012 (Real Academia Española, 2015). The five response options, of which only one was correct, were happiness, sadness, anger, fear, and disgust (alegría, tristeza, ira, miedo, and asco, in standard Spanish). The subject had to choose the response option whose meaning was the closest to that of the target word.

Close Emotional Situations (CES)
Item stems were verbal scenarios showing a character and a close/concrete moment, act, object and place. Scenarios described concrete variations of the prototypes of the five emotion "families". Some 40 first names (half of them male) were selected from the database of the National Statistical Institute so that each scenario showed a different character identified by his/her name. There were five response options: happiness, sadness, anger, fear, and disgust. This is the adequate level of specificity for this kind of test, given that it has been found that emotional inferences from verbal scenarios are more specific than valence and class, but not more specific than emotion "families" (Molinari et al., 2009). The subject had to choose the option that best described the emotion that would be typical to feel in that concrete situation.

Far Emotional Situations (FES)
Item stems were verbal scenarios showing a far/abstract time, character and situation. Scenarios described abstract variations of the prototypes of the five emotion "families", and the main character was not identified by his/her first name but by a generic label (half of them female, e.g., "a girl"). Response options were the same as in the previous tests, and the task was to choose the one that best described the emotion that would usually be felt in that abstract situation.

Procedure
Participants were approached by a university researcher with a visible identification card, and asked about age, provenance and first language to verify that inclusion criteria were met. After consent to use the data for research purposes was obtained, the tests were individually administered on a portable computer.

Data analysis
Responses to the three tests were separately analyzed with the RM. Then, after conjoint scaling of items and persons, the effect of type of test and emotion on item difficulty was probed by means of factorial ANOVA.
Rasch analyses were performed with the computer program Winsteps 3.80.1 (Linacre, 2013). Data-model fit is assessed by outfit (calculated by adding the squared standardized residuals after fitting the model over items or subjects to form chi-square-distributed variables) and infit (an information-weighted form of outfit). Infit/outfit values over 2 distort the measurement system (Linacre, 2013). Unidimensionality is a requirement of the model, which does not imply that performance is due to a unique psychological process (Reckase, 1979); component analyses of residuals are performed by Winsteps 3.80.1 in order to test this assumption. As a guideline, Rasch measures should account for at least 20% of the total variance, and it is recommended that the unexplained variance in the first contrast be lower than 3 (Miguel, Silva & Prieto, 2013).
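The two fit statistics can be illustrated with a small sketch that computes mean-square infit and outfit for a single item. It is a simplified illustration under assumed person measures and response patterns, not a reproduction of the Winsteps computation; all names and values are hypothetical.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) mean squares for one item.

    responses: 0/1 answers, one per person.
    abilities: person measures in logits, in the same order.
    """
    squared_residuals = []
    variances = []
    for x, b in zip(responses, abilities):
        p = rasch_probability(b, difficulty)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))  # model variance (information) of the response
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: the same residuals weighted by their information.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

abilities = [-1.0, 0.0, 1.0, 2.0, 3.0]   # hypothetical person measures
expected_pattern = [0, 0, 1, 1, 1]       # low scorers fail, high scorers pass
surprising_pattern = [1, 0, 1, 1, 0]     # the most able person fails the item

infit_ok, outfit_ok = item_fit(expected_pattern, abilities, 0.5)
infit_bad, outfit_bad = item_fit(surprising_pattern, abilities, 0.5)
```

In this toy example the surprising pattern pushes both statistics above the cutoff of 2. Outfit reacts more strongly than infit because the unexpected failure comes from a person located far from the item.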
Differential item functioning (DIF) analysis tests the generalized validity of the measures for different groups. In this case, given that there is evidence of female superiority in the accuracy of affective judgements (Hall, Gunnery & Horgan, 2016), a plausible alternative hypothesis is the instrumental one: items could be functioning differently for males and females. Thus, the DIF hypothesis was tested: The standardized difference between item calibrations in the case of two groups (i.e., male and female) was calculated and tested using Bonferroni-corrected alpha levels; the Rasch aptitude estimates from the analysis of all the data were held constant, providing the conjoint measurement scale in logit units (Linacre, 2013).
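The DIF test described above can be sketched as a standardized difference between two group-specific item calibrations, compared against a Bonferroni-corrected alpha. The sketch assumes a normal approximation for the test statistic, which is a simplification of the t-based test reported by Winsteps; the function name, the calibrations and the standard errors are hypothetical.

```python
from statistics import NormalDist

def dif_test(d_female, se_female, d_male, se_male, n_items, alpha=0.05):
    """Standardized difference between two item calibrations (logits).

    Returns (z, p_value, flagged), where flagged is True when the two-sided
    p-value is below the Bonferroni-corrected level alpha / n_items.
    """
    contrast = d_male - d_female  # positive: the item is harder for males
    z = contrast / (se_female ** 2 + se_male ** 2) ** 0.5
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha / n_items

# Hypothetical calibrations for two items in a 40-item test.
z_small, p_small, flagged_small = dif_test(0.10, 0.25, 0.30, 0.25, n_items=40)
z_large, p_large, flagged_large = dif_test(-0.80, 0.25, 0.90, 0.25, n_items=40)
```

With 40 items the corrected level is .05/40 = .00125, so only a large contrast relative to its standard error is flagged.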

EV test
One person and two items got extreme scores (i.e., zero or perfect raw scores) and so their Rasch measures were not estimated. The Rasch analysis of the remaining data indicates good data-model fit. For items, mean infit was .99 (SD = .08) and mean outfit was .91 (SD = .31). For persons, mean infit was 1.00 (SD = .25) and mean outfit was .91 (SD = .64). No item showed infit/outfit over 1.5. Twelve persons (6%) showed outfit over 2, but just one of them showed infit over 2. The percentage of variance explained by EV measures was 33.9% and the component analysis of residuals showed that the unexplained variance in the first contrast was 2.3. Finally, item reliability (.96) and model person reliability (.72) were good enough. Table 1 shows the main results of the item analysis.
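Extreme scores are left out because a zero or perfect raw score has no finite maximum-likelihood Rasch estimate: the likelihood keeps improving as the measure moves toward infinity. The following sketch illustrates this for a perfect person score; the item difficulties and function names are hypothetical.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_likelihood(ability, responses, difficulties):
    """Log-likelihood of a 0/1 response vector for one person at a given ability."""
    total = 0.0
    for x, d in zip(responses, difficulties):
        p = rasch_probability(ability, d)
        total += math.log(p if x == 1 else 1.0 - p)
    return total

difficulties = [-1.0, 0.0, 0.5, 1.0]   # hypothetical item locations (logits)
perfect_score = [1, 1, 1, 1]

# The log-likelihood of a perfect score rises with every increase in ability,
# so no finite ability value maximizes it.
trial_abilities = (1.0, 3.0, 5.0, 7.0)
log_likelihoods = [log_likelihood(b, perfect_score, difficulties) for b in trial_abilities]
```

The same argument applies, with the sign reversed, to zero scores and to items that everybody passes or fails.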
Average person aptitude in logit units was 2.19, SD = 1.04, range = −1.12 to 4.78. Just one item (EV24, happiness) showed sex-related DIF favoring male subjects, i.e., male subjects had a higher probability of passing this item than female subjects with the same total score. No gender differences (impact) in Rasch measures were found, Welch-t (196) = .68, p = .50, d = −.11 (conventionally, 0 = female, 1 = male). As an illustration, Table 2 shows the map of the variable, where the right side shows item locations while person measures are situated at the left.

CES test
Seven persons obtained extreme scores; their Rasch measures were not estimated. The Rasch analysis indicates good data-model fit: Item mean infit = .98 (SD = .11), mean outfit = .87 (SD = .25); person mean infit = 1.00 (SD = .17), mean outfit = .87 (SD = .74). No item showed infit/outfit over 2. Eight persons (4%) showed outfit over 2. The percentage of variance explained by close emotional situations measures was 22.9% and the component analysis of residuals showed that the unexplained variance in the first contrast was 2.3. Item reliability and model person reliability were .93 and .58, respectively. Table 3 shows the main results of the item analysis.

FES test
Four persons got extreme scores. The Rasch analysis of the remaining data indicates good data-model fit: For items, mean infit was .97 (SD = .11) and mean outfit was .88 (SD = .32). For persons, mean infit was 1.00 (SD = .18) and mean outfit was .88 (SD = .59). No item showed infit/outfit over 2; eight persons showed outfit over 2. The percentage of variance explained by measures was 22.9% and the component analysis of residuals showed that the unexplained variance in the first contrast was 3.2. Finally, item reliability (.93) and model person reliability (.65) were acceptable. Table 4 shows the main results of the item analysis.
Average person aptitude in logit units was 2.46, SD = 1.09, range = −2.44 to 4.34. No item showed sex-related DIF. No gender differences in Rasch measures were found, Welch-t (183) = .14, p = .89, d = −.02 (conventionally, 0 = female, 1 = male).

Total EK score
Two items had extreme scores and so their Rasch measures were not estimated. The Rasch analysis of the responses to the remaining 118 items indicates good data-model fit: for items, mean infit was .99 (SD = .07) and mean outfit was .90 (SD = .25). For persons, mean infit was 1.00 (SD = .14) and mean outfit was .90 (SD = .42). No item showed infit/outfit over 2, and just six persons out of 200 showed outfit over 2. The percentage of variance explained by EK measures was 22.7% and the component analysis of residuals showed that the unexplained variance in the first contrast was 4.5 (2.9%). Finally, item reliability (.94) and model person reliability (.83) were good. Table 5 shows the main results of the item analysis.
With regard to the type of test, item mean difficulty (in logit units and in ascending order) was −0.26 (CES), −0.01 (FES) and 0.28 (EV). As to emotions, item mean difficulty (in ascending order) was −1.54 (HAPPINESS), −0.07 (FEAR), 0.26 (SADNESS), 0.56 (ANGER), and 0.66 (DISGUST). A factorial ANOVA of the effects of type of test and emotion on EK item difficulty was statistically significant, F (14,103) = 5.09, p < .001. Neither type of test, F (2,103) = 1.78, p = .17, nor the interaction effects, F (8,103) = 1.06, p = .40, were statistically significant. Emotion effects on EK item difficulty were found, F (4,103) = 14.02, p < .001; Bonferroni post hoc tests indicated that the only statistically significant difference was that between HAPPINESS and the remaining emotions.
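The degrees of freedom reported for the factorial ANOVA follow from the design: 118 calibrated items crossed by 3 tests and 5 emotions. The short check below assumes that all 15 test-by-emotion cells contain at least one item; the function name is hypothetical.

```python
def factorial_anova_df(n_observations: int, levels_a: int, levels_b: int) -> dict:
    """Degrees of freedom for a two-way factorial ANOVA with interaction."""
    df_a = levels_a - 1
    df_b = levels_b - 1
    df_interaction = df_a * df_b
    return {
        "a": df_a,
        "b": df_b,
        "interaction": df_interaction,
        "model": df_a + df_b + df_interaction,
        # One mean is estimated per cell, so the error term loses one df per cell.
        "error": n_observations - levels_a * levels_b,
    }

# 118 items with estimated difficulty, 3 types of test, 5 emotions.
df = factorial_anova_df(118, 3, 5)
```

The resulting values (2 for type of test, 4 for emotion, 8 for the interaction, 14 for the model and 103 for the error term) match the F tests reported above.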

DISCUSSION
Three emotion knowledge tests have been constructed from a psychological constructionist theoretical frame: one vocabulary and two situational tests. Although items were generated in Spanish, the careful selection of the emotion words and the substantive background (conceptual act and psychological distance theories) should facilitate their adaptation to other languages and/or cultures. Test items and specifications will be made available upon request to accredited researchers for non-commercial purposes.
The RM, an invariant measurement approach, was used for the initial validation of the three tests (EV, CES and FES), both separately and conjointly, given that all the items were designed to provide EK measures. In the four cases, data-model fit was good enough, so the probability of a response can be expressed as an additive function of a person parameter and an item parameter; this is consistent with the quantitative assumption that is implicitly made, but not tested, in most psychological assessment situations (Michell, 1999). The first contrast of the component analysis of residuals was slightly over the recommended value for one of the tests, as well as for the conjoint scaling of the three tests. However, some evidence of multidimensionality should be expected when measuring complex constructs; e.g., it is better tolerated when measuring math ability than when measuring geometry aptitude (Linacre, 2013). It is relevant to note here that the use of parametric statistical methods takes interval status for granted, even though the nature of many scoring systems is ordinal at best. We have evaluated the interval scaling assumption with the RM, which because of its desirable metric properties can be used to quantify different types of experimental data (Delgado, 2007). Other advantages of the RM, at the practical level, are the ease of interpreting and communicating results: because both participants and items are located on the same variable, comparisons can be made concerning what items have been passed by what persons (Prieto et al., 2010).
As to gender differences, at least three quantitative reviews have shown clear evidence of female superiority in the accuracy of affective judgments; effect sizes are small-to-medium following conventional standards (Hall, Gunnery & Horgan, 2016). In our study, the testing of gender differences was carried out after corroborating that item DIF would not be a plausible alternative hypothesis for the results. No statistically significant sex-related differences were found in EV, CES, FES or total EK score. However, the sign of d indicates that female participants scored consistently higher than male ones: −.11 (EV), −.21 (CES), −.02 (FES) and −.17 (total EK), a result that will be of interest for future meta-analyses. For instance, a recent multi-level meta-analysis by Thompson & Voyer (2014) found that the effect size of the difference in emotion perception, a basic emotional aptitude, is d = −.19 (if coded as female = 0, male = 1), not far from the d = −.17 found in our study for the EK measures. Given such a small effect size, finding statistically significant sex-related differences in EK would require studies with very large samples.
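The sample size implied by such an effect can be approximated with the usual normal-approximation formula for comparing two independent means. This is a rough sketch under assumed settings (two-sided alpha of .05, power of .80, equal group sizes), none of which come from the study; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group to detect a standardized
    mean difference d with a two-sided test (normal approximation)."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2.0 * (z_alpha + z_power) ** 2 / d ** 2)

# Effect size found for total EK (the sign is irrelevant for the calculation).
n_needed = n_per_group(abs(-0.17))
```

Under these assumptions the requirement is several hundred participants per group, well above the 100 per group available here.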
Finally, when EK items from the three tests were conjointly scaled, item difficulty did not differ significantly as a function of the original test (CES, FES or EV, in ascending order of average item difficulty), and so the tests could be used somewhat interchangeably when measuring EK with time restrictions. This does not imply that the three tests are measuring the same processes (in an essentialist way), only that there is one latent variable (EK), that all items tap into it, and that the level of this EK variable is, at a given moment, the focus of measurement interest (Wu, Tam & Jen, 2017). Descriptively, the average item difficulty was ordered as expected from psychological distance theories: CES item scenarios were designed as the most concrete ones, while the EV items (words) were the most abstract stimuli. As to emotions, only HAPPINESS items were significantly easier than the remaining ones, corroborating results from previous research in emotion recognition and emotion understanding (Delgado, 2012; Delgado, 2016; Russell, 1994; Suzuki, Hoshino & Shigemasu, 2006). From the perception science field, it has been suggested that the "happiness superiority effect" could have evolved because happy faces are communicatively less ambiguous than the remaining facial expressions of emotion (Becker et al., 2011).
Thus, the three tests are ready to be used as components of a higher-level measurement process (Newton & Shaw, 2013). A promising application field is the assessment of EK as a mediator of change in social competence, given that EK is consistently associated with various social and behavioral outcomes in children and teenagers (Trentacosta & Fine, 2010) and EK deficits are found in disorders such as alexithymia (Lumley, Neely & Burger, 2007).