The Complex Affective Scene Set (COMPASS): Solving the Social Content Problem in Affective Visual Stimulus Sets

Social information, including faces and human bodies, holds special status in visual perception generally, and in visual processing of complex arrays such as real-world scenes specifically. To date, unbalanced representation of social compared with nonsocial information in affective stimulus sets has limited the clear determination of effects as attributable to, or independent of, social content. We present the Complex Affective Scene Set (COMPASS), a set of 150 social and 150 nonsocial naturalistic affective scenes that are balanced across valence and arousal dimensions. Participants (n = 847) rated valence and arousal for each scene. The normative ratings for the 300 images together, and separately by social content, show the canonical boomerang shape that confirms coverage of much of the affective circumplex. COMPASS adds uniquely to existing visual stimulus sets by balancing social content across affect dimensions, thereby eliminating a potentially major confound across affect categories (i.e., combinations of valence and arousal). The robust special status of social information persisted even after balancing of affect categories and was observed in slower rating response times for social versus nonsocial stimuli. The COMPASS images also match the complexity of real-world environments by incorporating stimulus competition within each scene. Together, these attributes facilitate the use of the stimulus set in particular for disambiguating the effects of affect and social content for a range of research questions and populations.


Introduction
In daily life, people encounter vast arrays of stimuli that compete for visual attention and cognitive resources. Selection of some types of information over others is the end result of a complicated algorithmic process that integrates immediate perceptual salience with the viewer's prior experience and current state. The phenomena underlying and influencing selection are the subject of research focused on how affective information is processed in typical daily life and in more extreme circumstances. From the simplest and most elegant behavioral tasks to the rapidly developing technology of brain imaging, a common essential element is the use of visual stimuli that allow valid measurement of the mechanisms of interest. It follows that stimuli that correspond well to the visual and affective complexity of the physical world, while controlling for the attributes that are most likely to confound results, are necessary for investigation of the interaction of affect with visual selection. Within this framework, visual social (i.e., human) content, such as human faces and bodies, has special status in the competition for prioritized processing. For example, faces or bodies are fixated first in naturalistic scenes (e.g., Fletcher-Watson, Findlay, Leekam, & Benson, 2008; Rosler, End, & Gamer, 2017), faces attract gaze in experimental tasks even at a cost (Cerf, Frady, & Koch, 2009), and only responses to social stimuli reflected the effects of anhedonia in people with schizophrenia compared to healthy controls (e.g., Bodapati & Herbener, 2014). However, unbalanced representation of social and nonsocial information in affective stimulus sets has limited the clear determination of effects as attributable to, or independent of, social content.
For example, neutral social images are underrepresented in some sets, which can result in lower power for that category, or in the need to repeat images, which, given that novelty is a factor in affective processing, can weaken the magnitude of the results. We developed the Complex Affective Scene Set (COMPASS), a novel set of social and nonsocial naturalistic affective scenes, to fill a major gap among existing sets by specifically balancing social content across affect categories (i.e., combinations of valence and arousal), and also by incorporating visual complexity and human diversity.
A growing number of visual stimulus sets are available for use in studies of visual and affective processing. Of these, the most well known and well characterized is the International Affective Picture System (IAPS; Lang, Bradley, & Cuthbert, 2008), which includes images that vary along the dimensions of valence and arousal and span most of the affective space. More recently, the Open Affective Standardized Image Set (OASIS; Kurdi, Lozano, & Banaji, 2017) was introduced as a more current alternative to the IAPS. Both sets include social and nonsocial stimuli, although they are not balanced across affect dimensions. Other stimulus sets have been developed to address specific themes or content. For example, the Geneva Affective Picture Database (GAPED; Dan-Glauser & Scherer, 2011) predominantly includes unpleasant affective stimuli, such as images depicting violations of human and animal rights. The Nencki Affective Pictures System (NAPS; Marchewka, Żurawski, Jednoróg, & Grabowska, 2014) includes specific categories of affective images, as well as erotic (NAPS ERO; Wierzba et al., 2015) and fear-provoking (NAPS SFIP; Michałowski, Droździel, Matuszewski, Koziejowski, Jednoróg, & Marchewka, 2017) subsets. A number of well-developed sets include exclusively social stimuli, most of which are emotional faces (e.g., Karolinska Directed Emotional Face set; Lundqvist, Flykt, & Öhman, 1998; Montreal Set of Facial Displays of Emotion; Beaupre & Hess, 2005; NimStim; Tottenham et al., 2009; Pictures of Facial Affect; Ekman & Friesen, 1976; the Warsaw Set of Emotional Facial Expression Pictures; Olszanowski, Pochwatko, Kuklinski, Scibor-Rylski, Lewinski, & Ohme, 2015), and which offer the capacity to compare responses among human emotional facial expressions.
Depending on the research question, one or several of the previously published stimulus sets could be the most appropriate and useful. However, we suggest that our set represents a unique combination of attributes that make it especially relevant and useful for questions concerning processing of complex daily environments, while controlling for several potentially confounding factors. For example, although, as noted, some sets include both social and nonsocial stimuli, none balances these attributes within affect categories. Such balance facilitates direct comparison without the problem of unequal stimulus numbers per category. To meet our objective of creating a set of naturalistic affective scenes that balances social content, we selected images that vary along several salient dimensions and attributes.

Affect dimensions
COMPASS images vary along two well-established dimensions of affect: valence (unpleasant to pleasant) and arousal (low to high activation; e.g., Barrett, 2006). The COMPASS set is consistent with other affective image sets in that the images fall into six broad combinations of arousal and valence (higher arousal unpleasant, higher arousal pleasant, moderate arousal unpleasant, moderate arousal pleasant, moderate arousal neutral, and lower arousal neutral) that are represented by the boomerang shape of the canonical affective circumplex (e.g., Barrett, 2006; Posner, Russell, & Peterson, 2005). Although we categorize the images in this way to represent much of the affective space, we also note that these two dimensions do not have objective cutoffs between levels, and the categories thus should be used as a helpful guide rather than an absolute evaluation of image content. Also, as noted earlier, because many questions in affective science center on typical daily affective experience, rather than representing affectively extreme experiences, one of our objectives was to represent a range of affective experiences that people typically encounter in daily life. Given this objective, the COMPASS set does not include affectively extreme stimuli, such as strongly aversive (e.g., mutilated bodies) or strongly erotic (e.g., couple engaged in sexual activity) content.

Social content
COMPASS scenes are balanced by social content, which we define as representation of humans. Social scenes include clearly discernible people as at least one of the most salient focal points, and nonsocial scenes either do not include people or include people as non-salient percepts (e.g., smaller figures in the background). Social content is a crucial attribute in the context of affective evaluations, because visual and neural processing of affective information differs between social and nonsocial information. For example, pupillometry and eye-tracking studies show that social information preferentially captures visual attention compared to nonsocial information (e.g., Fitzgerald, 1968), and compared to low-level salient features such as contrast and luminance. In addition, the neural regions engaged in affective evaluation of social information differ from those of nonsocial information (e.g., Harris, McClure, Van den Bos, Cohen, & Fiske, 2007).
Within existing affective stimulus sets that include social and nonsocial stimuli, the inclusion of social content often is confounded with arousal or valence (e.g., Colden, Bruder, & Manstead, 2008). For example, social images (e.g., two people hugging or arguing) have more extreme pleasant or unpleasant valence ratings in comparison with nonsocial images (e.g., garbage on the street). In a subset of the IAPS images, images with humans were rated as more arousing, and more unpleasant or pleasant than images with inanimate content. Images with humans also were rated as more unpleasant or less pleasant than images with non-human animal content (Colden et al., 2008). Further, differential processing of images with human content is not limited to downstream top-down processing such as explicit ratings; emotional images were associated with enhanced initial allocation of attention only when the images contained humans (Löw, Bradley, & Lang, 2013). In addition, social images often have greater visual complexity in comparison with nonsocial images, which often consist of simple single objects, and neutral images are more likely to include single non-human objects than social content. COMPASS has equal numbers of social and nonsocial scenes within each affective category. As a result, our stimulus set controls for the potential confound of human content with affect, and similarly can be used to disambiguate the effects of affect and social content, by comparison of data from social and non-social images.

Stimulus competition
The COMPASS set was designed specifically to represent naturalistic visual arrays, or scenes, rather than discrete single objects or people. We defined a "complex scene" as an image that includes at least two salient points of interest, such that the salient content competes for visual attention. This characteristic is especially important for research questions and methods that require stimulus competition for valid measurement of specific visual mechanisms such as initial allocation of attention to or disengagement of attention from affective content (e.g., Desimone & Duncan, 1995). For example, a valid test of preferential allocation of attention to a specific type of information within a single stimulus array requires that there are alternative targets of attention within the array. The COMPASS set is thus especially useful and appropriate for paradigms that assess covert or overt (e.g., eye tracking) allocation of attention that favors some visual content over other visual content.

Representation of human diversity
A final distinguishing feature of COMPASS is the representation in the images of people from a variety of racial, ethnic and cultural backgrounds. The impetus for such inclusion was the parallel with the extremely diverse daily environment of the geographic location of our lab in New York City. For example, for studies testing neural or endocrine responses to naturalistic affective information in trauma-exposed participants, it is important that the images reflect the participants' daily experiences. As a result of the inclusion of diverse people and settings, COMPASS can be more reliably applied in a variety of subject populations.

The influence of sex/gender
There are well-known and documented sex/gender differences in affective processing of visual stimuli (e.g., Andreano, Dickerson, & Barrett, 2014; Cahill, 2006; Soares, Pinheiro, Costa, Frade, Comesana, & Pureza, 2015; Wrase, Klein, Gruesser, Hermann, Flor et al., 2003) and these differences can be particularly pronounced in processing of human faces (e.g., Proverbio, 2016). For these reasons, affective stimulus sets commonly include both overall norms and subdivision by sex/gender. Generally, but not exclusively, the evidence supports that female participants show greater neural responses to unpleasant, higher arousal stimuli, and rate them accordingly, whereas male participants show greater neural responses to pleasant, higher arousal stimuli and rate them accordingly. One potential explanation for differences in affective processing is biology, such as the effects of sex hormones between groups, but also sex hormone differences within women due to fluctuations across the menstrual cycle. Additional explanations implicate gender, whereby societal expectations and experiences of men and women might predispose them to respond differently to affective stimuli. For the purpose of our stimulus set development and norming, we do not make a specific claim regarding the individual or interacting roles of biology or environment; however, we did anticipate sex/gender differences in affective ratings, consistent with the preponderance of the literature.

Stimulus set development goals
We present the Complex Affective Scene Set (COMPASS), a normed set of 300 complex, affectively balanced, naturalistic scenes that include representation of cultural, racial, and ethnic diversity, and that represent visual arrays that are reasonably likely to be encountered in daily life. These images were selected to accomplish the overall goal of creating an affective scene set that balances social (human) and non-social content and covers the canonical affective space, or the combinations of valence and arousal (i.e., affect categories). Within that broad goal, our first aim was to develop a set of complex scenes that approximate daily life experiences and therefore can be used to estimate the magnitude of typical everyday affective responses. In the interest of capturing affective processing of typical, everyday scenes, our set does not include the valence and arousal extremes such as mutilated bodies or strongly erotic content. Our second aim was to distinguish between the influences of affect category and social content on valence and arousal ratings by including equal numbers of social and nonsocial scenes within each affective category. Because it is not possible to entirely remove the affective qualities of human images, this balance is the best strategy to facilitate direct comparisons between social and nonsocial affective content.
Along with the major aims of the development of the stimulus set, for which we predicted only that the end result would cover the affective space as comprehensively as possible while also balancing relevant attributes, we had two evidence-driven hypotheses. First, given the known special status of social information over non-social information, although during each phase of stimulus set development we sought to achieve equivalence in affective ratings, we hypothesized that the social stimuli nonetheless would continue to be processed differently. Because we used participant ratings in early development phases to select stimuli with approximate affective equivalence for the final stimulus set, comparison of subsequent affect ratings between social and non-social scenes would be circular. Instead, to address this question we conceptualized response time for initial affective ratings as a proxy for processing time. We hypothesized that longer response times for social versus non-social ratings would provide an index of the persistent special status of social content, even when the affective ratings themselves were roughly equivalent. In addition, given the extensively documented sex/gender differences in affective processing, we hypothesized that our data would replicate the previous pattern of sex/gender differences in ratings: men would rate pleasant images as more pleasant and more arousing than would women, whereas women would rate unpleasant images as more unpleasant and more arousing than would men.

Participants
The COMPASS scenes were rated by 847 participants (71% women, 29% men; age M = 20.5, SD = 4.6, range = 18-53; Table 1). An a priori power analysis showed that detecting a small effect with alpha at .05 and power at .80 would require 230 participants per group.
It is essential to be adequately powered to detect even small effects that, if undetected, could invalidate the stimulus set and/or tests of sex differences. Given that the recruitment population was known to have a female:male ratio of approximately 2:1, we set a recruitment target to fill the male participant n, with the understanding that open enrollment would result in twice as many female participants. Participants were recruited from a large, non-residential urban university. This student population is extremely ethnically and racially diverse and includes a high percentage of non-traditional students. More than one third of the participants (38%) were born outside of the US. For these participants, the mean number of years in the US was M = 11.7 (SD = 6.4). English was the first language for 57% of the participants, and 75% also reported additional languages.

Image selection criteria
Stimuli were selected from non-copyrighted images on the internet and photographs taken by lab members. We selected full-color images of complex scenes with multiple focal points and excluded images of single objects. As our goal was to create a set of naturalistic scenes, we excluded pictures that appeared to be posed or digitally enhanced, as well as pictures of famous people or places. For the same reason, we excluded images at the extreme ends of the arousal dimension, such as those depicting extreme violence or openly erotic content.

Image specifications
We resized all images to 500 × 667 pixels by adding horizontal and/or vertical black bars where necessary. Because written words capture visual attention, we blurred visible logotypes or written words using Adobe Photoshop. For each scene, we calculated mean luminance as the average pixel value of the gray-scale image, and contrast as the standard deviation across all pixels of the gray-scale image (Bex & Makous, 2002).
The final COMPASS set includes 100 unpleasant (50 each higher and moderate arousal), 100 neutral (50 each moderate and lower arousal), and 100 pleasant (50 each higher and moderate arousal) scenes. Because pleasant and unpleasant information typically is rated as more arousing than neutral information (e.g., Libkuman, Otani, Kern, Viger, & Novak, 2007), pleasant and unpleasant scenes ranged from moderate to higher arousal, whereas neutral scenes ranged from lower to moderate arousal.
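The luminance and contrast measures described above can be illustrated with a minimal Python sketch. The gray-scale conversion weights (ITU-R BT.601) and the use of the population standard deviation are assumptions made for illustration; the text specifies neither, and the four-pixel "image" is invented.

```python
from statistics import fmean, pstdev


def to_gray(rgb_pixels):
    """Convert (R, G, B) tuples in 0-255 to gray-scale values.

    Assumption: BT.601 luma weights; the conversion actually used
    for COMPASS is not stated in the text.
    """
    return [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in rgb_pixels]


def luminance_and_contrast(gray_pixels):
    """Return (mean luminance, contrast).

    Mean luminance is the average gray value; contrast is the standard
    deviation across all pixels (population SD assumed here).
    """
    return fmean(gray_pixels), pstdev(gray_pixels)


# Invented example: half black, half white pixels.
pixels = [(0, 0, 0), (255, 255, 255), (255, 255, 255), (0, 0, 0)]
mean_lum, contrast = luminance_and_contrast(to_gray(pixels))
print(mean_lum, contrast)
```

For a real image, the flat list of pixels would come from an image library of the reader's choice; the two summary statistics are computed the same way.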

Social content
Scenes that included clearly discernible people in the foreground or as one of the primary focal points were classified as social (67% of all social scenes contain clearly visible faces). COMPASS includes 150 social and 150 nonsocial scenes (within each category: 25 each higher arousal unpleasant, moderate arousal unpleasant, higher arousal pleasant, moderate arousal pleasant, moderate arousal neutral, lower arousal neutral).

Human diversity
The social scenes in COMPASS include representation of racial and ethnic diversity. Because human diversity is an important attribute but not a primary set design factor, the race/ethnicity category is not balanced by number of scenes. Thirty-nine percent of the social scenes include White people, 19% mixed ethnic groups, 16% people of unclear ethnicity, 13% Asian people, 10% Black people, and 3% Latinx people. Forty-one percent of the social scenes include both male and female people, 33% only male people, 20% only female people, and in 7% the gender is unclear (e.g., face is not discernible). Most social scenes (81%) include multiple people.

Set development
We report only data from the final normed stimulus set; however, the stimulus selection and norming procedure had four phases. In phases 1-3, we iteratively developed the final set of 300 stimuli (please see the Supplemental Information for additional detail regarding the first 3 phases of scene selection). In each phase, participants (phase 1 n = 496, phase 2 n = 486, and phase 3 n = 723) rated valence and arousal for a set of scenes. At the end of each of the first three phases, we selected the scenes whose ratings were consistent with the assigned valence and arousal categories and included them in the next phase. We also discarded scenes that showed a bimodal valence distribution, thus indicating affective ambiguity, and replaced them with new scenes. In the fourth phase (the data reported in this paper), all 847 participants rated the final set of 300 images.

Study procedure
Following consent, a researcher explained the procedure. Each participant then completed the computer rating task and a questionnaire. At the end of the study, participants were debriefed and granted course credit. The study protocol was approved by the Institutional Review Board and carried out in accordance with Standard 8 of the American Psychological Association's Ethical Principles of Psychologists and Code of Conduct.

Rating task
Each participant was seated 60 cm from the computer screen. To control for room illumination and prevent screen glare, overhead lights were turned off and a small 60-watt floor lamp provided the only light source besides the screen. The computer task was administered on a Dell PC with a 19" (1280 × 1024 resolution) no-glare display using E-Prime software.
Participants were informed that the purpose of the study was to learn how people respond to pictures that represent different settings and events, and that they would be viewing and rating 300 pictures (see Text S1 for task instructions). Participants were instructed to provide two ratings for each scene according to their initial reactions. The first rating was for how unpleasant or pleasant the scene made them feel (1 = unpleasant to 9 = pleasant), and the second rating was for how arousing or activating they found the scene to be (1 = low arousal to 9 = high arousal). Because the primary goal of the ratings procedure was to create a stimulus set that had social representation in affect categories that have infrequent social representation in other stimulus sets (e.g., neutral), we prioritized valence ratings rather than counterbalancing the response order. Participants were also informed that the task was not timed. After each participant completed three practice trials, the researcher left the room.
Participants rated four blocks of 75 images, with an opportunity to rest and stretch between blocks. The order of blocks and the order of images within each block were randomized for each participant. For each trial, an image was presented on the computer screen. The participant pressed the spacebar to advance to the first response screen, which showed a 9-point rating scale for valence (unpleasant to pleasant). After the participant entered a valence rating using the keyboard, a 9-point rating scale for arousal (low to high) appeared on the screen. After the participant entered an arousal rating, the next image appeared. Most participants completed the ratings task within 30-40 minutes.
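The block and within-block randomization can be sketched as follows. This is an illustrative Python sketch, not the E-Prime implementation: the image identifiers are hypothetical, and it assumes that block membership was fixed and only the orders were shuffled, which the text does not state explicitly.

```python
import random


def build_trial_order(blocks, rng):
    """Shuffle the order of blocks, then the order of images within each block."""
    block_order = [list(block) for block in blocks]
    rng.shuffle(block_order)      # randomize block order
    for block in block_order:
        rng.shuffle(block)        # randomize image order within the block
    return block_order


# Hypothetical identifiers for the 300 scenes, split into four blocks of 75.
image_ids = [f"img_{i:03d}" for i in range(300)]
blocks = [image_ids[i:i + 75] for i in range(0, 300, 75)]

# A per-participant seed (hypothetical) makes each order reproducible.
order = build_trial_order(blocks, random.Random(42))
print([len(block) for block in order])
```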

Questionnaire
The demographics questionnaire included items about gender, age, race/ethnicity, birthplace, number of years in the US, parents' birthplaces, and first language. The latter items were included to control for known cultural influences on affective ratings.

Results
Data were analyzed using Matlab R2017a and SPSS 24. We excluded trials with reaction times slower than 4000 ms or faster than 150 ms, due to the unreliability of very fast or very slow response times for rating tasks. This filter resulted in the exclusion of 18,619 (7%) individual valence ratings and 37,861 (15%) individual arousal ratings. After exclusions, each image retained valence ratings by an average of 756 participants (SD = 13, range 698-786) and arousal ratings by an average of 692 participants (SD = 17, range 643-737). We have reported all manipulations, measures, and exclusions.
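The response-time filter and the reported exclusion percentages can be illustrated with a short Python sketch. The trial records and their field names are hypothetical, and treating the 150 ms and 4000 ms boundaries as retained is an assumption. The percentage check assumes a denominator of 847 participants × 300 scenes.

```python
MIN_RT_MS = 150    # trials faster than this are excluded
MAX_RT_MS = 4000   # trials slower than this are excluded


def keep_trial(rt_ms):
    """True when the response time falls inside the retained window."""
    return MIN_RT_MS <= rt_ms <= MAX_RT_MS


def percent_excluded(n_excluded, n_participants=847, n_scenes=300):
    """Share of all possible ratings that were excluded, in percent."""
    return 100 * n_excluded / (n_participants * n_scenes)


# Hypothetical trial records, invented for illustration only.
trials = [
    {"scene": "NegHighSoc_22", "rt_ms": 95, "rating": 1},
    {"scene": "PosHighNonsoc_1", "rt_ms": 1240, "rating": 8},
    {"scene": "NeutMidNonsoc_1", "rt_ms": 5210, "rating": 5},
    {"scene": "NegMidSoc_15", "rt_ms": 4000, "rating": 6},
]
kept = [t for t in trials if keep_trial(t["rt_ms"])]

print(len(kept))
print(round(percent_excluded(18619)), round(percent_excluded(37861)))
```

Under these assumptions the counts reported in the text correspond to roughly 7% of valence ratings and 15% of arousal ratings.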

Affective scene categorization
Image names reflect their respective valence, arousal, and social content categories (e.g., NeutLowSoc = neutral valence, lower arousal, social scene). We note that the image names utilize "Negative" and "Positive" rather than the more accurate "Unpleasant" and "Pleasant" due to easier readability of the former when abbreviated. Similarly, we use "Mid" in the image names as a proxy abbreviation for "Moderate".
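For readers who process the norms programmatically, the naming convention can be decoded with a small Python sketch. The pattern below is inferred from the scene names that appear in this paper (e.g., NegHighSoc_22, PosHighNonsoc_1, NeutMidNonsoc_1); it is an assumption about the full file-naming scheme, and the function name is hypothetical.

```python
import re

# Pattern inferred from the names quoted in the text; the trailing index is optional
# so that bare category labels such as "NeutLowSoc" also parse.
NAME_PATTERN = re.compile(r"^(Neg|Neut|Pos)(High|Mid|Low)(Soc|Nonsoc)(?:_(\d+))?$")

VALENCE = {"Neg": "unpleasant", "Neut": "neutral", "Pos": "pleasant"}
AROUSAL = {"High": "higher", "Mid": "moderate", "Low": "lower"}


def parse_scene_name(name):
    """Split a scene name into valence, arousal, social content, and index."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized scene name: {name!r}")
    valence, arousal, social, index = match.groups()
    return {
        "valence": VALENCE[valence],
        "arousal": AROUSAL[arousal],
        "social": social == "Soc",
        "index": int(index) if index is not None else None,
    }


print(parse_scene_name("NeutLowSoc"))
print(parse_scene_name("PosHighNonsoc_1"))
```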

Summary statistics of scene-wise valence and arousal ratings
The average valence rating across all scenes was 4.87 (SD = 1.88). The lowest (most unpleasant) mean valence rating of 1.36 (SD = 1.02) was for scene NegHighSoc_22 (childhood bullying). The highest (most pleasant) mean valence rating of 8.25 (SD = 1.24) was for scene PosHighNonsoc_1 (tropical island). The average scene-wise standard deviation of valence ratings was 1.63 (SD = 0.26). Scene NegHighSoc_12 (man assaulting a woman) had the smallest standard deviation of valence ratings (M = 1.44, SD = 0.96). Scene NegMidSoc_15 (crying man hugging dog) had the largest standard deviation of valence ratings (M = 5.68, SD = 2.56).
The average arousal rating across all scenes was 4.41 (SD = 0.98). Scene NeutMidNonsoc_1 (a parking lot) had the lowest mean arousal rating and the smallest standard deviation of arousal ratings (M = 2.27, SD = 1.74). Scene NegHighSoc_22 (childhood bullying) had the highest mean arousal rating of 7.05 (SD = 2.70). The average scene-wise standard deviation of arousal ratings was 2.46 (SD = 0.23). Scene NegHighNonsoc_7 (severed buffalo heads) had the largest standard deviation of arousal ratings (M = 6.41, SD = 2.91). Figure 1 shows distributions of scene-wise means and standard deviations of valence and arousal ratings. The distribution of mean valence ratings is bimodal with one peak near 2 and the other one near 6 (skewness = -0.22). The distribution of mean arousal ratings approaches normality (skewness = 0.14). The distributions of scene-wise standard deviations of valence (skewness = 0.36) and arousal (skewness = -0.47) ratings also approach normality, although the standard deviations of arousal ratings are larger than the standard deviations of valence ratings. Summary statistics for COMPASS and IAPS norms by affective category (from Grühn & Scheibe, 2008) are presented in Table S1.

Valence and arousal ratings
Consistent with other stimulus sets (e.g., Libkuman et al., 2007), COMPASS valence and arousal ratings showed a boomerang-shaped relationship, such that scenes at the extremes of the valence dimension were rated as more arousing than scenes in the middle of the dimension (Figure 2). Bivariate distributions of valence and arousal ratings for each image are presented in Figure S2. Also consistent with previous reports (e.g., Kurdi et al., 2017), there was an M-shaped relationship between the means and standard deviations of valence ratings (Figure 3). This result indicates that standard deviations tend to be smaller for scenes with mean ratings closer to the three anchor points (1 = unpleasant, 5 = neutral, 9 = pleasant), and larger for scenes with valence means between the anchor points. In contrast, there was a linear relationship between the means and standard deviations of arousal ratings (Figure 3), indicating greater variability in arousal ratings for higher arousal scenes.

Valence
Mean valence ratings by scene category are presented in Table 2. For each scene, we calculated mean valence ratings across all participants. We then calculated mean ratings across all the scenes within each Valence × Arousal × Social Content Category and conducted a Valence Category × Arousal Category × Social Content repeated-measures ANOVA with valence ratings as the dependent variable. There was a main effect of Valence Category (F(2,1692) = 7595, p < .001, ηp² = .90). Post-hoc pairwise comparisons with Bonferroni correction showed that pleasant scenes had higher (more pleasant) valence ratings than neutral scenes, which had higher valence ratings than unpleasant scenes (all ps < .001). There was a main effect of Arousal Category (F(1,846) = 2565, p < .001, ηp² = .75), such that lower arousal scenes had higher (more pleasant) valence ratings than higher arousal scenes. There was also a main effect of Social Content (F(1,846) = 560, p < .001, ηp² = .40), such that nonsocial scenes had higher valence ratings than social scenes. Finally, there was a Valence Category × Arousal Category × Social Content interaction (F(2,1692) = 1150, p < .001, ηp² = .58), driven by a larger effect of Social Content on higher arousal pleasant scenes compared to other categories. Higher arousal nonsocial pleasant scenes had higher valence ratings than higher arousal social pleasant scenes (see Figure 4).

Arousal
Mean arousal ratings by scene category are presented in Table 2. For each scene, we calculated mean arousal ratings across all participants. We then calculated mean ratings across all the scenes within each Valence × Arousal × Social Content Category and conducted a Valence Category × Arousal Category × Social Content repeated-measures ANOVA with arousal ratings as the dependent variable. There was a main effect of Valence Category (F(2,1692) = 454, p < .001, ηp² = .35). Post-hoc pairwise comparisons using Bonferroni correction showed that unpleasant scenes had higher arousal ratings than pleasant scenes, which had higher arousal ratings than neutral scenes (all ps < .001). There was a main effect of Arousal Category (F(1,846) = 796, p < .001, ηp² = .48), such that higher arousal scenes had higher arousal ratings than lower arousal scenes. There was also a main effect of Social Content (F(1,846) = 24.9, p < .001, ηp² = .03), such that nonsocial scenes had higher arousal ratings than social scenes. Finally, there was a Valence Category × Arousal Category × Social Content interaction (F(2,1692) = 269, p < .001, ηp² = .24), driven by a greater effect of Social Content on higher arousal pleasant scenes, compared to other affective categories. Higher arousal nonsocial pleasant scenes were rated as more arousing than higher arousal social pleasant scenes (see Figure 4).
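As a reading aid, the partial eta squared values reported with these ANOVAs can be recovered from each F statistic and its degrees of freedom, using the standard identity partial eta squared = (F × df_effect) / (F × df_effect + df_error). The Python sketch below applies it to several of the reported effects; the function name is hypothetical, and this is a consistency check rather than the authors' SPSS procedure.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, F, df_effect, df_error) taken from the reported results.
reported = [
    ("valence ratings: Valence Category", 7595, 2, 1692),
    ("valence ratings: Arousal Category", 2565, 1, 846),
    ("valence ratings: Social Content", 560, 1, 846),
    ("arousal ratings: Valence Category", 454, 2, 1692),
    ("arousal ratings: Arousal Category", 796, 1, 846),
    ("arousal ratings: Social Content", 24.9, 1, 846),
]

for label, f_value, df_effect, df_error in reported:
    print(label, round(partial_eta_squared(f_value, df_effect, df_error), 2))
```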

Participant gender and affect ratings
Given the well-documented gender differences in affective processing of visual stimuli (e.g., Cahill, 2006; Wrase, Klein, Gruesser, Hermann, Flor et al., 2003), we calculated scene-wise valence and arousal ratings for men and women separately (see Table 3 and Figure 5). We conducted a Valence Category × Arousal Category × Social Content repeated-measures ANOVA with Participant Gender as a between-subjects factor and valence ratings as the dependent variable. There was a main effect of Participant Gender on valence ratings (F(1,840) = 9.76, p = .002, ηp² = .01), with men providing higher valence ratings on average than women. There was also a Valence Category × Arousal Category × Social Content × Participant Gender interaction (F(2,1680) = 65.1, p < .001, ηp² = .07): women rated nonsocial higher arousal pleasant scenes as more pleasant than did men and social higher arousal pleasant scenes as less pleasant than did men (Figure 6).
We also conducted a Valence Category × Arousal Category × Social Content repeated-measures ANOVA with Participant Gender as a between-subjects factor and arousal ratings as the dependent variable. There was a main effect of Participant Gender on arousal ratings (F(1,840) = 5.60, p = .018, ηp² = .01), with women providing higher arousal ratings than men. There was also a Valence Category × Arousal Category × Social Content × Participant Gender interaction (F(2,1680) = 28.1, p < .001, ηp² = .03): women rated nonsocial higher arousal pleasant scenes as more arousing than did men and social higher arousal pleasant scenes as less arousing than did men (Figure 6).
[Figure note: Asterisks denote significant gender differences in ratings. See Table S2 for the t-test statistics. * p < .05; ** p < .01; *** p < .001. Panel title: Mean Valence & Arousal Ratings, Women; legend: Social, Nonsocial.]
To identify scene content for which valence and arousal ratings differed by participant gender, we conducted scene-wise independent samples t-tests on valence and arousal ratings. To correct for multiple comparisons, we used a Bonferroni-corrected alpha of 0.05/300 = 0.000167. For each of four generally gender-discrepant content categories, we tested mean valence and arousal ratings by participant gender using independent samples t-tests.
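The Bonferroni step described above can be sketched as follows. This is only an illustration of the thresholding logic: the scene identifiers and p-values are hypothetical toy values, the function names are assumptions, and the t-tests themselves are not reproduced here.

```python
def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test alpha after Bonferroni correction for n_tests comparisons."""
    return family_alpha / n_tests


def flag_scenes(p_values: dict, n_tests: int, family_alpha: float = 0.05) -> list:
    """Return the scene IDs whose p-value falls below the corrected threshold."""
    threshold = bonferroni_alpha(family_alpha, n_tests)
    return sorted(scene for scene, p in p_values.items() if p < threshold)


# One test per COMPASS scene, as in the text: 0.05 / 300.
alpha_per_scene = bonferroni_alpha(0.05, 300)

# Hypothetical p-values from scene-wise gender comparisons (toy data only).
toy_p_values = {"scene_001": 0.00001, "scene_002": 0.0004, "scene_003": 0.03}
flagged = flag_scenes(toy_p_values, n_tests=300)

print(f"corrected alpha = {alpha_per_scene:.6f}")
print("scenes with gender differences after correction:", flagged)
```

In this toy example, only the first scene survives correction; a p-value of .0004 would pass an uncorrected .05 threshold but not the corrected one.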
Men rated higher arousal pleasant scenes depicting scantily dressed women as more pleasant (
We also tested differences in image complexity between social and nonsocial scenes using two common measures of image complexity: JPEG compressibility and entropy (Donderi, 2006; Machado et al., 2015). Overall, social content had a small effect on COMPASS image complexity, with nonsocial COMPASS scenes being somewhat more complex than social scenes. However, the effect of social content on image complexity depended on the measure used (see Supplemental Information for the detailed results).
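To make the two kinds of complexity measure concrete, the sketch below computes Shannon entropy of 8-bit grayscale intensities and a compression ratio for two toy images. It rests on stated assumptions: lossless zlib compression stands in for the JPEG codec named in the text (so absolute values will differ from JPEG-based estimates), the pixel data are synthetic, and none of the function names come from the authors' scripts.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy(pixels: bytes) -> float:
    """Shannon entropy, in bits per pixel, of 8-bit grayscale intensities."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def compression_ratio(pixels: bytes) -> float:
    """Compressed size divided by raw size; higher means less compressible.

    Assumption: zlib (lossless) is used as a stand-in for JPEG compression.
    """
    return len(zlib.compress(pixels, 9)) / len(pixels)


# Two synthetic 64 x 64 "images": a flat gray field and a seeded random texture.
n_pixels = 64 * 64
flat_image = bytes([128]) * n_pixels
rng = random.Random(0)
textured_image = bytes(rng.randrange(256) for _ in range(n_pixels))

for label, image in (("flat", flat_image), ("textured", textured_image)):
    print(label, round(shannon_entropy(image), 2), round(compression_ratio(image), 3))
```

The flat image has zero entropy and compresses almost completely, whereas the random texture scores high on both measures. Real scenes fall between these extremes, and the two measures need not rank images identically, which is in line with the measure-dependent result noted above.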

Rating response times by scene category
Due to the special status accorded to social information over nonsocial information, we tested response times (RTs) for valence ratings by category. Mean RTs for valence ratings by scene category are presented in Table 4. For each scene, we calculated mean RTs across all participants.
We then calculated mean RTs across all the scenes within each Valence × Arousal × Social Content Category and conducted a Valence Category × Arousal Category × Social Content repeated-measures ANOVA with RTs for valence ratings as the dependent variable. There was a main effect of Valence Category (F(2,1692) = 38.5, p < .001, ηp² = .04). Post-hoc pairwise comparisons with Bonferroni correction showed that participants had slower rating RTs for unpleasant compared to pleasant scenes, which had slower rating RTs than neutral scenes (all ps < .001). There was a main effect of Arousal Category (F(1,846) = 21.2, p < .001, ηp² = .02), such that rating RTs were slower for higher arousal scenes compared to lower arousal scenes. There was also a main effect of Social Content (F(1,846) = 536, p < .001, ηp² = .39): rating RTs were slower for social compared to nonsocial scenes.
There was also a Valence Category × Social Content interaction (F(2,1692) = 307, p < .001, ηp² = .27), driven by a greater effect of social content on RTs for pleasant scenes compared to other affect categories. Rating RTs were slower for pleasant social scenes compared to pleasant nonsocial scenes. There was also an Arousal Category × Social Content interaction (F(1,846) = 113, p < .001, ηp² = .12), such that rating RTs were slower for higher arousal social compared to higher arousal nonsocial scenes.
Finally, there was a Valence Category × Arousal Category × Social Content interaction (F(2,1692) = 116, p < .001, ηp² = .12): for social scenes, the effect of valence category on RTs was moderated by arousal category. For higher arousal social scenes, RTs were fastest for negative scenes and slowest for positive scenes. For lower arousal social scenes, RTs were fastest for neutral scenes and slowest for negative scenes. In contrast, for nonsocial scenes, there was no rating RT difference by arousal category for neutral and positive scenes. However, for negative nonsocial scenes, RTs were slower for lower arousal scenes (Figure 7).

Discussion
We present the Complex Affective Scene Set (COMPASS), a novel set of 300 social and nonsocial complex, naturalistic affective scenes normed on the dimensions of valence (unpleasant to pleasant) and arousal (low to high activation). This set achieves our primary goals and contributes to existing measurement tools in the following ways.

Coverage of the canonical affective space
Our primary goal was to develop a set of complex scenes that capture daily life experiences, and we did not include affectively extreme stimuli that were less likely to represent daily experience. Consequently, for the arousal dimension, most COMPASS scenes had mean ratings near the midpoint of the arousal scale (i.e., between 4 and 5 on the 1-9 scale). In comparison with published IAPS norms (i.e., Grühn & Scheibe, 2008; Lang et al., 1999), unpleasant and pleasant COMPASS scenes have on average lower arousal ratings, whereas neutral COMPASS scenes have similar arousal ratings. For valence, it is important to note that our division of stimuli into discrete unpleasant, neutral, and pleasant categories was designed to provide coverage of the affective space as much as possible, and that the boundaries for categorization were somewhat arbitrary with respect to the nature and definition of the continuous valence dimension. The "neutral" category covers the middle range of the scale from unpleasant to pleasant; however, there is no possible absolute determination of the scale number at which an image is neither pleasant nor unpleasant. With this caveat, the ratings demonstrate good coverage of the space but without the extremes. Most of the unpleasant COMPASS scenes had valence ratings that corresponded to the middle of the generally unpleasant range (i.e., between 2 and 3 on the 9-point scale with 1 as most unpleasant), and most of the pleasant COMPASS scenes had mean valence ratings that corresponded to the middle of the generally pleasant range (i.e., between 6 and 7 on the 9-point scale with 9 as most pleasant). Unpleasant IAPS and COMPASS scenes had similar valence ratings, whereas pleasant IAPS scenes were rated as more pleasant than pleasant COMPASS scenes.
Consistent with the norms for other affective stimulus sets (Kurdi et al., 2017; Lang et al., 1999; Libkuman et al., 2007), the distribution of valence and arousal ratings of COMPASS scenes is shaped like a boomerang, such that highly pleasant and highly unpleasant scenes were rated as more arousing than neutral scenes. However, unpleasant COMPASS scenes were rated as more arousing than pleasant scenes. Consistent with previous reports (e.g., Kurdi et al., 2017), there was less variability in valence ratings for scenes with means near the three anchor points (1 = highly unpleasant, 5 = neutral, 9 = highly pleasant), compared to scenes with mean valence ratings between the anchor points. In contrast, variability in arousal ratings was directly related to the direction of arousal ratings, such that the most arousing scenes also had the greatest variability in arousal ratings.
Although our data are not intended to address the conceptual debates regarding the structure of affect or the affect-emotion distinction, our measurement model favors the bipolar valence-arousal approach outlined by the circumplex model of affect (e.g., Barrett & Russell, 1999). We utilized the bipolar valence-arousal model because we were most interested in a person's initial affective response to visual information, as when a trauma-exposed person first encounters a trauma-relevant stimulus in the environment. We concur that on a longer timescale people can experience some degree of pleasant and unpleasant affect alternatingly (e.g., Kron, Pilkiw, Banaei, Goldstein, & Anderson, 2015); however, the literature also supports that only one affective state will predominate initially. Relatedly, we agree with the perspective that valence and arousal constitute the basic affective units experienced by humans, whereas identification and categorization of an emotion requires application of a conceptual label, which by then is one step removed from the initial experience. We were interested primarily in the former, which is why we did not focus on emotion. Our intent is that this stimulus set should be appropriate for testing additional sets of questions, however, and we encourage the use of the set to further test the dual unipolar model of valence (e.g., Kron, Goldstein, Lee, Gardhouse, & Anderson, 2013; Kron et al., 2015), and to further test the interdependence of valence and arousal ratings (e.g., Larsen, Norris, & Cacioppo, 2003). We support an approach whereby researchers are careful about their own questions and about the risks of imposing universal claims about the nature of affective experience where individual differences not only exist (e.g., Kuppens, Tuerlinckx, Russell, & Barrett, 2012), but are vital for a clearer understanding of the fundamental mechanisms of affect.

Solution for affectively unbalanced social content
Our second primary goal was to distinguish between the influence of affect category and social content on valence and arousal ratings by including equal numbers of social and nonsocial scenes within each affective category. On average, nonsocial scenes were rated as more pleasant and higher in arousal than social scenes, and this effect was driven largely by the higher arousal pleasant category. Specifically, whereas social and nonsocial scenes had similar valence and arousal ratings within the unpleasant and neutral categories, nonsocial higher arousal pleasant scenes were rated as more pleasant and arousing than social higher arousal pleasant scenes. These results are consistent with previously reported confounding effects of social content on valence and arousal ratings (e.g., Colden et al., 2008), supporting the importance of controlling for the social content of affective stimuli. Because the COMPASS images intentionally exclude the highest arousal pleasant (e.g., erotic) and unpleasant (e.g., mutilated bodies) content due to the goal of representing more everyday experiences, researchers who wish to equate arousal and valence extremes between social and nonsocial stimuli might choose to add images from sets such as the IAPS to fit that purpose.

Demonstration of the persistence of the special status of social content
We designed the COMPASS set to provide a set of social and nonsocial images that are balanced across affect categories, and our development process resulted in the elimination of scenes that did not contribute to this goal. However, this methodological contribution does not eliminate the actual special status effect of social information. Once we had a balanced set of 300 images, we also sought to demonstrate the persistence of the social content effect. Because participants were instructed to respond quickly and in accord with their initial impression of each scene, the most efficient way to complete the 300 image ratings was to respond quickly to each image. We reasoned that slower RTs for the first rating for each image (i.e., the valence rating) would provide evidence for the persistence of a special effect of social content. Consistent with our expectations, initial ratings for social scenes took longer than initial ratings for nonsocial scenes. Although it is not possible with simple rating data to isolate the precise mechanism or mechanisms that account for this effect, the slower response time is consistent with greater attentional capture by social information, for example. In addition, this effect was moderated by image valence, with faster RTs for unpleasant and slower RTs for pleasant social information, suggesting more efficient processing of depicted negative affect relative to depicted positive affect. These results are consistent with prior evidence of more distributed brain network activation in response to the mere presence of social information (e.g., Tso, Rutherford, Fang, Angstadt, & Taylor, 2018), and greater relevance detection for social information (e.g., Schacht & Vrticka, 2018; Vrticka, Sander, & Vuilleumier, 2013). Regardless of mechanism, the persistence of the special status of social information, when controlling for valence and arousal, is clear.

Replication of rater gender effects
Consistent with prior evidence of gender differences in affective ratings and physiological reactivity to unpleasant stimuli (e.g., Bradley, Codispoti, Sabatinelli, & Lang, 2001; Lithari et al., 2010), women rated unpleasant COMPASS scenes as more unpleasant and more arousing than did men. In addition, women rated higher arousal pleasant nonsocial scenes as more pleasant and more arousing than did men, and higher arousal pleasant social scenes as less pleasant and less arousing than did men. Together, these results are consistent with previously reported gender effects on affect ratings of specific content categories, such as erotica and highly unpleasant scenes (e.g., Kurdi et al., 2017; Marchewka et al., 2014).

Stimulus competition and additional scene characteristics
In addition to the primary attributes of affect dimensions and social content, the COMPASS scenes also incorporate additional characteristics that position the set well for certain types of research questions. First, the scenes feature stimulus competition in the form of two or more visually salient points of interest. This characteristic is important for questions addressing allocation of visual attention. Combined with careful placement of pre-stimulus fixation points and selection of presentation timing parameters, these scenes can be used to test initial fixation, shifts of attention, and disengagement of attention (e.g., Weierich, Treat, & Hollingworth, 2008) within a single image. In addition, because the set includes representation of human diversity, subsets of the scenes can be used to test interactions of affect with race perception or culture-specific visual information.

Potential constraints on generality
Two characteristics of our sample might constrain the generality of our results. First, the public, non-residential, urban university sample from which we recruited has a very large proportion of non-traditional age students (sample age range 18-53), and the vast majority of these students have had a much broader and less privileged variety of life experiences than the canonical "WEIRD" (i.e., Western, Educated, Industrialized, Rich, Democratic; Henrich, Heine, & Norenzayan, 2010) undergraduate samples. Nonetheless, the sample had a relatively young mean age (i.e., 20.5, SD = 4.6), such that normative ratings provided by younger (i.e., adolescent) or older samples might differ. In addition, our sample was composed of participants who live in or very near New York City, and the daily experience of life in a large, densely populated, racially and ethnically diverse urban area might have influenced ratings of some of the images, and in particular the social images, which included representation of a range of races and ethnicities. We welcome researchers to conduct norming studies with this stimulus set in additional populations. We also note that although our sample was predominantly female, our male subsample was large enough to adequately power between- and within-group tests, as reported, and thus this imbalance is not likely to have affected generality with regard to sex or gender. In addition, although the absolute whole sample means represent twice as many ratings from women as from men, in our view the absolute means are less important than the coverage of the affective space as well as the expected within-group (i.e., within-gender) patterns that are consistent with the affective space. Together, our whole sample data and the analyses by gender both support the achievement of a stimulus set that covers the affective space.
In addition to further norming in additional populations, due to its unique attributes, including visual complexity, human diversity, and naturalistic everyday-life content, the COMPASS stimulus set can be used to study affective processing in complex daily environments using a variety of methods, including eye-tracking, psychophysiology, and neuroimaging (e.g., Mauss & Robinson, 2009), while controlling for potential confounds, and in particular social content. We provide the basic affective norms for the COMPASS set; however, future research will be necessary to characterize COMPASS scenes along other dimensions that might influence affective processing, such as memorability and distinctiveness. Similarly, our strategy of collecting valence ratings before arousal ratings, although important for our stimulus set development goals, also might have constrained generality; rating order could have influenced the valence and/or arousal ratings, and future work counterbalancing or switching the order will address that question.

Usage
COMPASS images and image usage rules are available without cost to researchers upon request at the following link: www.compass-scenes.com. In addition to the images, the downloadable content includes scene-wise norms for the total sample and separately by participant gender, and scene-wise attributes (e.g., affective category, social content category, scene content, image orientation, luminance, and contrast).

Data Accessibility Statement
The stimuli, presentation materials, participant data, and analysis scripts can be found on this paper's project page at www.compass-scenes.com.

Note
1 To estimate the impact of trial exclusion on scene ratings, we also calculated mean valence and arousal ratings for each scene without excluding trials based on RTs (see Figure S1). The largest absolute difference in mean scene-wise valence ratings before and after trial exclusion was 0.13 (possible range: 0-8), whereas the largest absolute difference in mean scene-wise arousal ratings was 0.26 (possible range: 0-8), suggesting that exclusion of potentially unreliable trials had little impact on mean scene ratings.
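The comparison described in this note can be sketched as follows, assuming trials are stored as (scene, rating, RT) records and using the exclusion bounds given for Figure S1 (RTs below 150 ms or above 4000 ms). The data and function names are hypothetical and serve only to illustrate the computation, not to reproduce the authors' analysis scripts.

```python
from statistics import mean

RT_MIN_MS, RT_MAX_MS = 150, 4000  # exclusion bounds stated for Figure S1


def scene_means(trials, exclude_by_rt):
    """Mean rating per scene; each trial is a (scene_id, rating, rt_ms) tuple."""
    ratings_by_scene = {}
    for scene_id, rating, rt_ms in trials:
        if exclude_by_rt and not (RT_MIN_MS <= rt_ms <= RT_MAX_MS):
            continue  # drop trials that were implausibly fast or slow
        ratings_by_scene.setdefault(scene_id, []).append(rating)
    return {scene: mean(values) for scene, values in ratings_by_scene.items()}


def largest_abs_difference(trials):
    """Largest scene-wise change in mean rating caused by RT-based exclusion."""
    kept = scene_means(trials, exclude_by_rt=True)
    full = scene_means(trials, exclude_by_rt=False)
    return max(abs(full[scene] - kept[scene]) for scene in kept)


# Toy trials for illustration only (not COMPASS data).
toy_trials = [
    ("scene_A", 7, 900), ("scene_A", 6, 1200), ("scene_A", 2, 90),      # 90 ms: too fast
    ("scene_B", 3, 1500), ("scene_B", 4, 2100), ("scene_B", 4, 5200),   # 5200 ms: too slow
]
max_difference = largest_abs_difference(toy_trials)
print(f"largest scene-wise difference: {max_difference:.2f}")
```

With only three trials per scene, a single excluded trial shifts the toy means far more than in the reported data, where each scene mean aggregates ratings from many participants.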

Additional Files
The additional files for this article can be found as follows: • Figure S1. Mean scene-wise valence and arousal ratings before and after exclusion of trials with RTs <150 ms or >4000 ms. DOI: https://doi.org/10.1525/collabra.256.s1 • Figure S2.