Affective stimuli are increasingly used in emotion research. The most frequently used stimulus categories include emotional scenes and faces, but words have also been used in a variety of tasks including lexical decision (Eviatar & Zaidel, 1991; Kanske & Kotz, 2007; Kuchinke et al., 2005; Nakic et al., 2006; Schacht & Sommer, 2009; Scott, O’Donnell, Leuthold, & Sereno, 2009), memory tasks (Kuchinke et al., 2006; Sim & Martinez, 2005), versions of the Stroop task (van Hooff, Dietz, Sharma, & Bowman, 2008), mental imagery (Osaka, Osaka, Morishita, Kondo, & Fukuyama, 2004), the attentional blink (Mathewson, Arnell, & Mansfield, 2008), attentional orienting (Stormark, Nordby, & Hugdahl, 1995), and emotionality judgments (Maddock, Garrett, & Buonocore, 2003). The biggest advantage of word stimuli is that they can be tightly controlled for physical attributes (size, complexity, color composition, luminance), frequency of occurrence in everyday life, or concreteness of the underlying concept. To make use of these advantages, normative data for word stimuli are needed.

Several databases offer affective norms for words in different languages, including English (Altarriba, Bauer, & Benvenuto, 1999; Stevenson, Mikels, & James, 2007), German (Lahl, Göritz, Pietrowsky, & Rosenberg, 2009; Võ et al., 2009; Võ, Jacobs, & Conrad, 2006), Spanish (Redondo, Fraga, Padrón, & Comesaña, 2007), and Finnish (Eilola & Havelka, 2010). Typically, these norms are obtained from rating studies in which participants evaluate the valence and arousal of the stimuli. However, the degree to which the norms are applicable in emotion research critically depends on their validity. Only if the ratings validly measure the emotional status of the stimuli will it be advisable to use normative data to select affective stimuli. The present study addresses this problem by measuring the convergent validity of affective norms for words. Convergent validity is defined as the degree to which one variable correlates with another with which it is theoretically predicted to correlate. We therefore correlated valence and arousal ratings of the same word stimuli presented in different sensory modalities. We selected a subsample of words from the Leipzig Affective Norms for German (LANG) database (Kanske & Kotz, 2010), which contains ratings of the visually presented (written) words. These words were spoken by professional actors to obtain an auditory version of each stimulus, which was then rated for valence and arousal. Theoretically, the emotional status of the stimuli should be the same in different modalities, which we tested by correlating the visual and auditory rating scores. This is a critical test of convergent validity, as the two types of materials differ greatly: Whereas visual stimuli are present all at once, auditory stimuli evolve over time. In addition to the emotional word meaning, auditory stimuli also contain an emotional prosodic modulation. Furthermore, distinct neural pathways underlie the processing of visual and auditory affective signals (for a review, see Schirmer & Kotz, 2006).
To our knowledge, ours is the first study that has tested the validity of affective norms cross-modally. We chose the valence and arousal dimensions because these have been shown to be the most unambiguous factors explaining variance in words (Hager & Hasselhorn, 1994). To reduce variance based on word class (Osterhout, 1997; Perani et al., 1999), we only included nouns.

To summarize, the present study probes the validity of affective word norms in German by correlating the ratings of word stimuli in vision and audition. This cross-modal approach should provide insights into the validity of affective rating studies and normative data.

Method

Participants

All procedures were in accordance with the ethical standards of the local committee on human experimentation and with the Helsinki Declaration of 1975, as revised in 1983.

A sample of 30 native German speakers was recruited from the University of Leipzig. There were 16 female participants; mean age was 23.2 years (SD = 2.8). All participants were right-handed according to the Edinburgh Handedness Inventory (Oldfield, 1971), with a mean laterality quotient of 87.8 (SD = 20.4). All participants reported normal or corrected-to-normal vision, normal hearing, and no history of mental disorders or emotional problems according to the Depression Anxiety Stress Scales (DASS; Lovibond & Lovibond, 1995).

Materials and procedures

From the 1,000 German nouns in the LANG database, we selected a subset of 120 that were prototypical for the following categories: (1) negative and high-arousing, (2) neutral and low-arousing, and (3) positive and high-arousing. The descriptive statistics are displayed in Table 1. There were no significant differences in concreteness, frequency of usage, number of letters, or number of syllables between the categories. Two professional actors, one female and one male, both native speakers of German, made several auditory recordings of each word with the emotional expression corresponding to the word’s emotional valence. Recordings were made with Algorec 2.1 (Algorithmix GmbH, Waldshut-Tiengen, Germany), and the sound files were further processed with PRAAT (Institute of Phonetic Sciences, University of Amsterdam). Two versions of each positive, negative, and neutral word from each speaker were chosen for the rating study, excluding any recordings that contained background noise or were outliers in duration or intensity. In total, participants were presented with 480 different auditory stimuli (120 words × 2 speakers × 2 versions). To control for differences in loudness, all stimuli were normalized in sound intensity to 75 dB SPL.
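
The intensity normalization step can be illustrated with a minimal sketch. It is not the PRAAT procedure used in the study; it assumes that sample values can be treated as sound pressure in pascals and that level is expressed relative to the conventional 20 µPa reference. The function names and the synthetic tones are hypothetical.

```python
import numpy as np

REF_PRESSURE_PA = 2e-5  # conventional 20 µPa reference for dB SPL

def intensity_db(samples):
    """Mean level in dB re 20 µPa, treating sample values as pascals (assumption)."""
    rms = np.sqrt(np.mean(np.square(samples)))
    return float(20.0 * np.log10(rms / REF_PRESSURE_PA))

def scale_to_db(samples, target_db=75.0):
    """Multiply the waveform by a constant gain so its mean level equals target_db."""
    gain = 10.0 ** ((target_db - intensity_db(samples)) / 20.0)
    return samples * gain

# Synthetic 0.5 s tones standing in for two recordings of different loudness
t = np.linspace(0.0, 0.5, 22050, endpoint=False)
quiet = 0.01 * np.sin(2 * np.pi * 220 * t)
loud = 0.20 * np.sin(2 * np.pi * 220 * t)

normalized = [scale_to_db(x) for x in (quiet, loud)]
levels = [intensity_db(x) for x in normalized]  # both equal the 75 dB target
```

A single gain per file leaves the temporal and spectral shape of each recording untouched, so prosodic cues other than overall loudness are preserved. A silent file (RMS of zero) would need to be handled separately.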

Table 1 Descriptive statistics of the 120 words selected

Participants completed one session during which they rated the words for valence (negative–neutral–positive) and for arousal (high arousing–low arousing). The order of the tasks was counterbalanced. Words were presented in a different randomization for each participant. Ratings were made on 9-point Likert scales using Self-Assessment Manikins (Bradley & Lang, 1994; Hodes, Cook, & Lang, 1985) for both valence and arousal. The assignment of the scale endpoints to the left and the right was counterbalanced across participants. During the rating, participants were seated in a comfortable chair in a sound-attenuating room and wore headphones (Sennheiser HD 202). Stimuli were presented with ERTS (Experimental Run Time System; Berisoft Cooperation, Frankfurt, Germany).
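
The randomization and counterbalancing described above could be set up along the following lines. This is only a sketch under assumptions: the actual assignment scheme implemented in ERTS is not reported, and the function name, the placeholder stimulus labels, and the rule of alternating by participant number are hypothetical.

```python
import random

def make_session_plan(participant_id, stimuli, seed=0):
    """Build one participant's plan: shuffled stimuli, task order, scale orientation."""
    rng = random.Random(seed * 10_000 + participant_id)  # reproducible per participant
    order = list(stimuli)
    rng.shuffle(order)  # a different randomization for each participant
    # Alternate task order across consecutive participants
    tasks = ("valence", "arousal") if participant_id % 2 == 0 else ("arousal", "valence")
    # Alternate scale orientation across consecutive pairs of participants
    flipped = (participant_id // 2) % 2 == 1
    return {"order": order, "tasks": tasks, "scale_flipped": flipped}

# Placeholder labels for the 480 auditory stimuli (hypothetical names)
stimuli = [f"stim_{i:03d}" for i in range(480)]
plans = [make_session_plan(p, stimuli) for p in range(30)]
```

With 30 participants, task order splits evenly (15 and 15), whereas the four combinations of task order and scale orientation cannot be filled equally, because 30 is not divisible by 4.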

Results

Figure 1 shows the valence and arousal ratings of the 120 selected words in the visual and auditory modalities. (The visual ratings were taken from the LANG database.) Within each modality, we observed a quadratic relationship between valence and arousal (visual: r_quad = .88, p < .001; auditory: r_quad = .98, p < .001), demonstrating the typical distribution of valence and arousal values. We then correlated the visual and auditory ratings of the 120 words. Valence ratings correlated highly across modalities (r = .98, p < .001), as did arousal ratings (r = .97, p < .001). For the respective scatterplots, see Fig. 1.
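
The two kinds of coefficients reported above can be sketched as follows. The data are synthetic stand-ins, not the LANG ratings, and the group size is chosen only so that the total is 120. Treating the quadratic coefficient as the correlation between arousal and the fitted values of a second-order polynomial in valence is an assumption about how it was computed; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-word mean ratings on 9-point scales (NOT the LANG data)
n_per_group = 40  # hypothetical; three clusters give 120 words in total
valence_visual = np.concatenate(
    [rng.normal(m, 0.4, n_per_group) for m in (2.5, 5.0, 7.5)]
)
arousal_visual = (2.0 + 0.6 * (valence_visual - 5.0) ** 2
                  + rng.normal(0.0, 0.4, valence_visual.size))
valence_auditory = valence_visual + rng.normal(0.0, 0.3, valence_visual.size)
arousal_auditory = arousal_visual + rng.normal(0.0, 0.3, arousal_visual.size)

def pearson_r(x, y):
    """Pearson product-moment correlation between two rating vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def quadratic_r(x, y):
    """Correlation between y and the least-squares quadratic fit of y on x."""
    coeffs = np.polyfit(x, y, deg=2)
    return pearson_r(np.polyval(coeffs, x), y)

r_quad_visual = quadratic_r(valence_visual, arousal_visual)   # within modality
r_valence = pearson_r(valence_visual, valence_auditory)       # across modalities
r_arousal = pearson_r(arousal_visual, arousal_auditory)       # across modalities
```

The cross-modal coefficients are computed per dimension over words, so each word contributes one pair of mean ratings (visual, auditory) to each correlation.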

Fig. 1 (a) Mean valence and arousal ratings for each word in the visual and in the auditory modality. (b) Mean valence and arousal ratings across the visual and auditory modalities, including regression slopes

To control for potential inflation of the correlation coefficients due to the selection of extreme groups (and the consequently enhanced variance), we used two approaches (Feldt, 1961). First, assuming the same linear relation between written and spoken valence and arousal ratings (same regression coefficient, same residuals) but reduced variance (estimated from the variance in the entire list of 1,000 word ratings; i.e., for valence, SD = 1.358; for arousal, SD = 1.597), we found only slightly reduced correlation coefficients (valence, r = .979; arousal, r = .949). Second, the most conservative approach assumes random variation of the intermediate data points. We therefore repeated the analysis including pseudorandomly created data. These data points were created to be intermediate in visual valence and arousal ratings (e.g., to fall between the negative and neutral stimuli). We then assigned them auditory ratings that varied randomly between the minimum and maximum valence or arousal values obtained in the auditory rating. To complement this, we also used the opposite strategy (intermediate in auditory ratings and randomly varying in visual ratings). These analyses represent a lower boundary for the correlation, as it is unlikely that empirical values would have been completely random. Since there were 30 words in each group (negative, neutral, positive), we also used 30 data points for each intermediate group (i.e., 30 neutral–negative and 30 neutral–positive data points of medium arousal). The results showed reduced, but still highly significant, correlation coefficients between the visual and auditory ratings, ranging from r = .78 to r = .82 (all ps < .001).
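
Both control analyses can be sketched in a few lines. The first function uses the algebra that follows from the stated assumptions (same regression slope and residual variance, different predictor SD); whether this matches the exact computation in Feldt (1961) is an assumption. The second function mimics the pseudorandom intermediate data. All data are synthetic, and the function names, cluster parameters, and gap boundaries are hypothetical; only the reference SD of 1.358 is taken from the text.

```python
import numpy as np

def variance_adjusted_r(r, sd_sample, sd_reference):
    """r expected with unchanged slope and residual variance but predictor SD = sd_reference."""
    k = sd_reference / sd_sample
    return float(r * k / np.sqrt(1.0 - r ** 2 + (r * k) ** 2))

def r_with_random_fillers(x, y, gaps, n_fill, rng):
    """Correlation after adding points with x inside the gaps and y uniform over y's range."""
    x_fill = np.concatenate([rng.uniform(lo, hi, n_fill) for lo, hi in gaps])
    y_fill = rng.uniform(y.min(), y.max(), x_fill.size)  # unrelated to x by construction
    x_all = np.concatenate([x, x_fill])
    y_all = np.concatenate([y, y_fill])
    return float(np.corrcoef(x_all, y_all)[0, 1])

rng = np.random.default_rng(7)

# Synthetic extreme groups of 30 words each (NOT the empirical ratings)
visual = np.concatenate([rng.normal(m, 0.4, 30) for m in (2.5, 5.0, 7.5)])
auditory = visual + rng.normal(0.0, 0.3, visual.size)

r_observed = float(np.corrcoef(visual, auditory)[0, 1])
# Approach 1: shrink the predictor SD to the SD of the full word list
r_adjusted = variance_adjusted_r(r_observed, visual.std(ddof=1), sd_reference=1.358)
# Approach 2: fill the gaps between groups with 30 uncorrelated points each
r_lower = r_with_random_fillers(
    visual, auditory, gaps=[(3.3, 4.2), (5.8, 6.7)], n_fill=30, rng=rng
)
```

Swapping the roles of `visual` and `auditory` in the second call gives the complementary analysis, in which the filler points are intermediate in the auditory ratings and random in the visual ratings.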

Discussion

The present study aimed to probe the validity of normative data regarding valence and arousal ratings of emotional and neutral word stimuli. The correlations between the ratings of visual and auditory words were very high for both affective dimensions, confirming that affective norms of the words are valid measures. These data justify the use of word stimuli selected on the basis of affective norms for experimental studies of emotion.

Valence and arousal ratings displayed the typical quadratic function that has also been observed in other normative studies with words and pictorial stimuli (Eilola & Havelka, 2010), indicating that highly negative and positive words are more likely to be rated as high arousing, while neutral words are more likely to be rated as low in arousal. This underlines the comparability of the present results with previously established norms. We used this dimensional approach because it seems well suited to the characterization of word stimuli, for which it is not always possible to assign a primary emotional category (e.g., to words like “bomb,” “vacation,” or “pizza”). However, we acknowledge attempts to use discrete emotional categories (Stevenson et al., 2007), and future studies could validate the affective norms in both dimensional and categorical ratings.

The major result of the present approach is the strong correlation between the visual and auditory word ratings, which indicates the high validity of the present normative data. This cross-modal validation strategy is unorthodox: It does not compare different ways of measuring valence and arousal, but manipulates the stimulus characteristics themselves. The manipulations include the change in sensory modality and the additional prosodic modulation, which also conveys emotional information. These alterations should, however, not change the affective character of the stimuli, as confirmed by the high correlations between the ratings.

One limitation of the present study is the use of extreme groups (i.e., words that are high or low in arousal and either very negative, neutral, or positive). This may inflate the observed correlation coefficients (Feldt, 1961). To address this issue, we used two approaches. Making the plausible assumption that the regression slope is the same for data that are not affected by the variance increase of extreme groups, we found only slightly reduced correlation coefficients. We also repeated the analyses including artificial, randomly varying data between the extreme groups. This is a conservative approach, because it inserts data with absolutely no correlation between the visual and auditory ratings, which would be unlikely if ratings were actually acquired for intermediate words. Nevertheless, the correlations remained highly significant, indicating that the relation between visual and auditory ratings was stable, despite the use of extreme groups.

A critical question is whether the reported results would generalize to other normative data and to different samples. We believe that they generalize to different samples, as we previously showed that correlations between the affective ratings of different samples and of the same sample at different time points are very high (Kanske & Kotz, 2010). However, it remains an open question whether the observed validity of the present norms is specific to the current database or indicates that affective word norms are generally highly valid. General validity seems very likely, as the instructions and methodology used in the different rating studies are very similar. Nevertheless, it is an empirical question and should be addressed in future studies.

The present results demonstrate the validity of the Leipzig Affective Norms for German (LANG), which provides a comprehensive database of German nouns with a wide range of emotional status ratings. Subsamples of these words have already been successfully used to study the neural basis of emotional processing with functional magnetic resonance imaging and electroencephalography (Kanske & Kotz, in press a, b). Therefore, these norms can support experimental studies on emotion by helping researchers to select highly controlled word samples.