We live in a multisensory world, filled with sights, sounds, smells, textures, and tastes. We need to correctly integrate the information from different senses to create a unified understanding of the world—the binding problem. This article deals with “property binding” (Treisman, 1996): linking together the different sensory properties of individual objects.

Shams and Kim (2010) suggested that, faced with multisensory input, brains attempt to minimize perceptual errors across all domains, using at least some top-down processes. Some combinations of information are therefore more likely to be bound together than others. This can happen through crossmodal correspondences (CMCs): pairs of cross-sensory stimuli that “go together,” apparently automatically (e.g., Evans & Treisman, 2010; but see Spence & Deroy, 2013). One example is the kiki–bouba effect: Participants typically pair spiky shapes with names containing high-pitched vowels (e.g., kiki), and round shapes with names containing low-pitched vowels (e.g., bouba; e.g., Bremner et al., 2013). CMCs occur in many sensory pairings: high luminance pairs with tactile softness (Ludwig & Simner, 2013), and blackberry odor pairs with “piano” (Crisinel & Spence, 2011). CMCs may occur for a variety of reasons, including (adult remnants of) neonatal inability to differentiate sensory inputs, statistical coupling of sensory dimensions in the environment, and semantic “matching” of stimuli (e.g., Mondloch & Maurer, 2004; Spence, 2011; Walker, Walker, & Francis, 2012).

Early studies on CMCs generally explored complex stimuli (e.g., Karwoski, Odbert, & Osgood, 1942, had participants draw visual responses to music); more-recent studies have focused on single CMCs. However, we lack information about how CMCs interact. This topic has been systematically approached only by Eitan and Rothschild (2011), who studied imagined tactile qualities of musical notes, and by Woods, Spence, Butcher, and Deroy (2013), in an online study of interactions between sounds, shapes, and emotions.

Interactions between CMCs are important, since real-world objects do not have only two sensory dimensions. For example, drums have visual, tactile, and auditory properties. A drum may have a dark color but a light weight (i.e., opposing ends of the dark–light and heavy–light dimensions; Ward, Banissy, & Jonas, 2008). Do we predict that the drum makes a high sound because of its weight (Walker et al., 2012), or a low sound because of its color (Hubbard, 1996)?

In this study, we investigated the existence of interactions between auditory–visual CMCs (Spence & Deroy, 2013). We displayed visual stimulus pairs varying in luminance (lightness), saturation (color intensity), size, and/or vertical position, with auditory stimulus pairs varying in pitch. Participants decided which auditory stimuli “went with” which visual stimuli. Our goal was to determine the principles used to combine multiple CMCs.

We tested three models of CMC interaction. First we tested the summation model, based on sensory cue integration models (Trommershäuser, Körding, & Landy, 2011), in which the strengths of the individual CMCs add. When CMCs are consistent, crossmodal associations are strengthened. When CMCs conflict, they cancel out completely or partially, depending on their relative strengths. Second, we examined the hierarchy model, in which there was a hierarchy of CMCs, with some dominating others. Third, we examined the majority model, which applies when the characteristics of a stimulus point to different pitches (e.g., a small, low-luminance, low-position stimulus would pair with high pitch in terms of size, but with low pitch in terms of luminance and position). In this model, participants’ pitch choices were predicted by the majority of the feature correspondences (in this case, low pitch).

Method

Participants

Because this was a novel line of research, and relied on proportions of responses across participants as the dependent measure, we wanted to sample as many participants as possible in the time available. We collected data online (https://uelpsychology.org/soundvision), recruiting 113 participants (76 female, 30 male, two other, and five who declined to respond; 18–67 years of age, mean = 30.82, SD = 11.39) from personal contacts and online communities of volunteers. Seventy-nine of these informants were monolingual English speakers, ten were bilingual native speakers of English, and the remaining 24 were nonnative speakers of English.

All participants gave informed consent. The experiment was approved by the Research Ethics Committee of the University of East London.

Materials, design and procedure

The visual stimuli were two circles on a midgray background (Table 1). The circles varied in luminance, saturation, size, and position. We chose four hues: red (hue in HSL system: 0), yellow (58), green (120), and blue (240). Within participants, hue was held constant, and the other characteristics varied. Each characteristic had three levels: low/large, medium, and high/small. For luminance and saturation, “low” was a value of 16%, “medium” 50%, and “high” 85% in the HSL system. All circles were presented with their centers aligned. We report positions and sizes as they appeared on a 56-cm widescreen monitor where the image occupied a rectangle with width 106 mm and height 79 mm (monitor sizes will have varied, since this was an online experiment). “Low” circles had centers 56 mm from the top of the image background, “medium” 40 mm, and “high” 23 mm. “Large” circles had a diameter of 25 mm, “medium” of 16 mm, and “small” of 8 mm.
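The HSL values above can be turned into displayable colors with a short conversion. The following is a minimal sketch, assuming the standard (CSS-style) HSL definition and treating "luminance" as the HSL lightness channel; the function name `hsl_to_rgb8` and the dictionaries are illustrative and are not taken from the experiment's own code.

```python
import colorsys

# Hue angles (degrees) and percentage levels quoted in the text.
HUES = {"red": 0, "yellow": 58, "green": 120, "blue": 240}
LEVELS = {"low": 0.16, "medium": 0.50, "high": 0.85}


def hsl_to_rgb8(hue_deg, saturation, lightness):
    """Convert an HSL triple to 8-bit RGB (assumes the standard HSL model)."""
    # colorsys expects (hue, lightness, saturation), each scaled to 0..1.
    r, g, b = colorsys.hls_to_rgb(hue_deg / 360.0, lightness, saturation)
    return tuple(round(c * 255) for c in (r, g, b))


# Medium saturation and medium luminance for each of the four hues.
for name, hue in HUES.items():
    print(name, hsl_to_rgb8(hue, LEVELS["medium"], LEVELS["medium"]))
```

Note that `colorsys` takes lightness before saturation, which is the reverse of the usual HSL ordering.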

Table 1 Example visual stimuli for each of the four levels

Each pair of circles was either the same (i.e., both medium) or opposite (e.g., one large, one small) on all four within-participants characteristics. This gave us four “levels” of stimuli. At Level 1, the circles varied in one characteristic (e.g., one high and the other low luminance, but for all other characteristics both were medium). At Level 2, the circles varied on two characteristics; at Level 3, on three characteristics; and at Level 4, on all four. We describe pairs according to the characteristics of the left circle; the right circle’s characteristics are implied in that description. Participants saw every possible combination of circles twice; the second time, the circles’ left–right positions were reversed (a total of 80 stimuli).
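The stimulus count can be checked by enumeration. The sketch below assumes that each pair is described by the left circle's state on each dimension ("low", "medium", or "high", where "low" stands for "large" on the size dimension), and that the all-medium pair is excluded because its circles would be identical. Under that reading, a mirrored pair is simply the description with every "low" and "high" swapped, so both presentations are counted.

```python
from collections import Counter
from itertools import product

DIMENSIONS = ("luminance", "saturation", "size", "position")
STATES = ("low", "medium", "high")  # state of the LEFT circle on each dimension


def enumerate_stimuli():
    """Return every left-circle description except the all-medium pair."""
    return [combo for combo in product(STATES, repeat=len(DIMENSIONS))
            if any(state != "medium" for state in combo)]


def level_of(combo):
    """Level = number of dimensions on which the two circles differ."""
    return sum(state != "medium" for state in combo)


stimuli = enumerate_stimuli()
per_level = Counter(level_of(combo) for combo in stimuli)
print(len(stimuli), dict(sorted(per_level.items())))
# prints: 80 {1: 8, 2: 24, 3: 32, 4: 16}
```

The per-level counts follow from choosing which k of the four dimensions vary and then one of two directions for each.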

We used the responses to Level 1 stimuli to predict the responses at Levels 2–4. Therefore, it was unimportant that the perceptual distances between values were not identical across the stimulus dimensions; participants needed only distinguish between the values on each dimension. The experiment was programmed using Javascript.

The auditory stimuli were created using Audacity (http://audacityteam.org/). These were two pure-tone sine waves, each of 1,000-ms duration. One was at a pitch of 261.63 Hz, the other at 523.25 Hz. In each trial, participants heard both beeps. The order of the beeps was counterbalanced across participants, who were randomly assigned across the eight conditions (4 hues × 2 auditory orders). Twenty-eight participants were assigned to the red condition, 28 to green, 30 to yellow, and 27 to blue.

At the start of each trial, the visual stimuli appeared on the screen (see Fig. 1). Participants clicked to play the first auditory stimulus, with the second following automatically after 2,000 ms silence. Participants could replay the stimuli as needed before deciding which beep went with which visual stimulus.

Fig. 1

Example trial (high luminance, medium saturation, medium size, low position), as viewed by the participant after the video has been played. The participant could not see the radio-button decisions beneath the video until it had been played once

Prior to the 80 experimental trials, the participants completed four practice trials with stimuli not used in the main study.

Initial analysis and statistics

For the reported analyses, we used data from all participants. Similar results were found for monolingual native English speakers alone.

Initial analysis of the Level 1 stimuli established the association strengths of each individual CMC. Figure 2a shows the proportions of participants who chose high beeps for each stimulus at Level 1. In all cases, we found reliable and significant correspondences between sensory dimensions. The high beep was associated with stimuli with higher luminance, saturation, or position, or stimuli that were smaller. Because the auditory stimuli were matched for physical amplitude, the high beep was probably perceived as louder (International Organisation for Standardization (ISO), 2003). It is thus likely that the strength and direction of association were determined by both pitch and loudness. This does not affect our interpretation of the results, which concern how different visual dimensions combined in determining the CMCs.

Fig. 2

Results for the Level 1 conditions, in which the stimuli varied on only a single dimension. (a) Proportions of participants who chose a high-pitched beep as the one that went with each stimulus. Error bars show binomial 95% confidence intervals. (b) Strengths of the associations between the frequency of the auditory stimulus and each dimension of the visual stimuli, calculated using probit analysis (see the text for details). “Low” and “high” map to “large” and “small,” respectively, for the size dimension

Association strengths were modeled by assuming that each value on each visual dimension has a particular strength of association with the high beep, relative to the low beep. We also assumed some variation in association strength across the population, modeled using a normal distribution. Using probits, we transformed the proportions of participants choosing each beep for each visual stimulus, to quantify the association strengths in units of the standard deviation of the variability (Thurstone, 1927):

$$ p\left(R=\mathrm{LEFT}\mid S\right)=\phi(S), $$
(1)

where R = LEFT represents a participant choosing the left stimulus, S is the association strength, and ϕ is the cumulative distribution function of the standard normal distribution. Association strength is quantified in terms of the variability in responses across observers.
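As an illustration of this transformation, the sketch below converts a response proportion to an association strength and back, using the standard normal distribution from Python's standard library. The function names are illustrative assumptions; the text does not say which software was used for the probit step.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def association_strength(proportion):
    """Probit transform: the S for which phi(S) equals the observed proportion."""
    # inv_cdf raises StatisticsError for proportions of exactly 0 or 1.
    return STD_NORMAL.inv_cdf(proportion)


def choice_probability(strength):
    """Eq. 1: probability of the response, given association strength S."""
    return STD_NORMAL.cdf(strength)


print(round(association_strength(0.975), 2))  # prints: 1.96
print(round(choice_probability(0.0), 2))      # prints: 0.5
```

A proportion of .5 maps to a strength of 0 (no preference), and proportions further from .5 map to strengths further from 0, measured in standard deviations of the assumed population variability.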

Probit values for the Level 1 stimuli are plotted in Fig. 2b. These values were fixed at 0 for neutral stimuli: When both circles have the same value on a dimension, there can be no preference associated with that dimension. These associations were used to predict the outcomes for stimuli containing variations in multiple dimensions. We predicted these results using each model as follows:

Summation

The simplest assumption is that association strengths will add:

$$ S_{\mathrm{TOTAL}}=S_{\mathrm{LUM}}+S_{\mathrm{SAT}}+S_{\mathrm{SIZE}}+S_{\mathrm{POS}}. $$
(2)

This model assumes that every dimension contributes to the combined association with equal (unit) weight; the association strengths themselves may differ between dimensions.
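A minimal sketch of the summation prediction follows. The strength values are hypothetical placeholders chosen for illustration (they are not the Level 1 estimates plotted in Fig. 2b), and the sign convention, in which a positive value means the left circle goes with the high beep, is also an assumption of the sketch.

```python
from statistics import NormalDist

# HYPOTHETICAL Level 1 strengths in probit units, indexed by the left circle's
# state. "medium" is fixed at 0; for size, "low" = large and "high" = small.
STRENGTHS = {
    "luminance":  {"low": -1.0,  "medium": 0.0, "high": 1.0},
    "saturation": {"low": -0.5,  "medium": 0.0, "high": 0.5},
    "size":       {"low": -0.5,  "medium": 0.0, "high": 0.5},
    "position":   {"low": -0.25, "medium": 0.0, "high": 0.25},
}


def summed_strength(stimulus):
    """Eq. 2: add the single-dimension strengths with unit weights."""
    return sum(STRENGTHS[dim][state] for dim, state in stimulus.items())


def predicted_proportion(stimulus):
    """Map the summed strength back to a response proportion via Eq. 1."""
    return NormalDist().cdf(summed_strength(stimulus))


# Conflict case: bright (favors high beep) but large (favors low beep).
conflict = {"luminance": "high", "saturation": "medium",
            "size": "low", "position": "medium"}
print(summed_strength(conflict))                 # prints: 0.5
print(round(predicted_proportion(conflict), 3))  # prints: 0.691
```

In this made-up case the two cues partially cancel, so the predicted preference is weaker than for luminance alone but still favors the high beep.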

Hierarchy

In this model, there is a hierarchy of CMCs. For any stimulus, the CMC is predicted by the dominant association, which is not necessarily the dimension with the strongest association when presented alone. Rather, this model assumes a specific order in which the dimensions are considered, with the association determined by the first dimension, within this order, on which stimuli differ. Since we tested four CMCs, there were 24 (4 × 3 × 2 × 1) possible hierarchies. We calculated correlations between the predicted and actual responses for each stimulus, for all hierarchies, and chose the hierarchy that best fit the data. This method provided considerable freedom to achieve the best fit; the other models contained no free parameters.
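The search over hierarchies can be sketched as follows. This is an illustrative reconstruction under stated assumptions: stimuli are dictionaries mapping each dimension to the left circle's state, `strengths` holds single-dimension association strengths in the same layout, and fit is scored with a plain Pearson correlation. The helper names are hypothetical.

```python
from itertools import permutations
from math import sqrt

DIMENSIONS = ("luminance", "saturation", "size", "position")


def hierarchy_prediction(stimulus, order, strengths):
    """Strength of the first dimension in `order` on which the circles differ."""
    for dim in order:
        state = stimulus[dim]
        if state != "medium":      # "medium" means the circles match here
            return strengths[dim][state]
    return 0.0                     # circles identical on every dimension


def pearson(xs, ys):
    """Plain Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def best_hierarchy(stimuli, observed, strengths):
    """Score all 24 orderings; return (order, r) with the highest correlation."""
    scored = []
    for order in permutations(DIMENSIONS):
        predicted = [hierarchy_prediction(s, order, strengths) for s in stimuli]
        scored.append((pearson(predicted, observed), order))
    r, order = max(scored)
    return order, r
```

Because the best of 24 orderings is selected after seeing the data, this model has an advantage in flexibility that the other two models lack.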

Majority

In this model, where there was conflict between the directions of CMCs, the response was determined by majority vote, regardless of the strengths of the individual CMCs. If all stimulus dimensions, and experimental manipulations, had the same strength, then the predictions of the summation and majority models would agree. However, if, for example, one dimension was particularly dominant, this might outweigh the combined effects of other dimensions that predicted the opposite response.
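The difference between the majority and summation predictions can be shown with a small sketch. The strengths below are hypothetical and deliberately make size dominant; the treatment of ties (returning 0) is an assumption, since the text does not state how tied votes were handled.

```python
def sign(x):
    """Return +1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def majority_prediction(stimulus, strengths):
    """+1 if most differing dimensions favor the high beep, -1 if most favor
    the low beep, 0 for a tie (tie handling is an assumption of this sketch)."""
    return sign(sum(sign(strengths[d][s]) for d, s in stimulus.items()))


def summation_prediction(stimulus, strengths):
    """Direction predicted by adding the strengths, as in the summation model."""
    return sign(sum(strengths[d][s] for d, s in stimulus.items()))


# HYPOTHETICAL strengths with size dominant; for size, "high" = small.
STRENGTHS = {
    "luminance":  {"low": -0.5, "medium": 0.0, "high": 0.5},
    "saturation": {"low": -0.5, "medium": 0.0, "high": 0.5},
    "size":       {"low": -2.0, "medium": 0.0, "high": 2.0},
    "position":   {"low": -0.5, "medium": 0.0, "high": 0.5},
}

# A small, low-luminance, low-position circle.
example = {"luminance": "low", "saturation": "medium",
           "size": "high", "position": "low"}
print(majority_prediction(example, STRENGTHS))   # prints: -1 (low beep)
print(summation_prediction(example, STRENGTHS))  # prints: 1 (high beep)
```

With these made-up values, two weak cues outvote one strong cue under the majority rule, whereas the strong cue outweighs them under summation.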

Results

For each model, we calculated correlations between the predicted and actual responses for Level 2, 3, and 4 stimuli (Table 2). The summation model predicted the data well, with all correlations being significant. The correlations for the majority model were also significant, but lower than those for the summation model. The correlations for the hierarchy model, which did not take account of all CMCs, were in all cases lower, and nonsignificant for Level 4 stimuli. Therefore, for stimuli containing multiple CMCs, all visual dimensions contribute to participants’ decisions.

Table 2 Correlation coefficients and significance levels for the fits of the probit summation, hierarchy, and majority models

To further test the summation model, we created a generalized linear model with a binomial distribution and a probit linking function. A full factorial model was used, with color saturation and luminance, the width of the stimulus, and its distance from the center of the screen as covariates. Each of these covariates was significant (luminance, Wald χ² = 1,734.8, p < .001; saturation, Wald χ² = 424.1, p < .001; size, Wald χ² = 348.0, p < .001; position, Wald χ² = 203.1, p < .001). None of the two-way interactions were significant, but there were significant three-way interactions between luminance, size, and position (Wald χ² = 4.10, p = .043); luminance, saturation, and size (Wald χ² = 9.00, p = .003); and saturation, size, and position (Wald χ² = 6.69, p = .01).

We also predicted the main effect and two-way interaction results using probits for the Level 1 stimuli (Fig. 2b), using a linear regression after centering the data for each dimension. The results were combined according to Eq. 2, and the predicted proportion of “left” responses was calculated from the resulting probit value. These results are plotted in Figs. 3 (main effects) and 4 (two-way interactions). The figures show good predictions for luminance and size. However, the effect of saturation, in particular, was less than expected. A simple linear model therefore does not appear to fully account for associations made when stimuli vary across multiple visual dimensions. This apparent difference was tested using a generalized linear model with saturation as a covariate, fit separately to the data from different levels of luminance. The effect of saturation was significantly greater for neutral-luminance stimuli (b = .026 [95% confidence limits: .024–.028]; Wald χ² = 739.2; p < .001) than for those with low (b = .001 [–.0001 to .0003]; Wald χ² = 14.934; p = .24) or high (b = .003 [.002–.005]; Wald χ² = 15.8; p < .001) luminance. Participants’ responses were only strongly influenced by saturation when luminance was neutral.

Fig. 3

Proportions of “left” responses associated with the higher tone, as a function of each visual dimension, for stimuli pooled over all other visual dimensions. The color symbols indicate the participants’ responses, and the solid black lines the predictions of the probit model. Error bars and dotted black lines represent 95% binomial confidence limits of the data and the model predictions, respectively

Fig. 4

Proportions of “left” responses associated with the higher tone, as a function of each pair of visual dimensions, for stimuli pooled over all other visual dimensions. In all cases, one dimension is plotted on the horizontal axis, and the black, red, and blue symbols (color only in the online figure) represent the “low,” “medium,” and “high” values on the other dimension, respectively. The dashed lines of each color show the predictions of the probit model. Error bars and dotted lines indicate 95% binomial confidence limits of the data and the model fits, respectively

To interpret the significant three-way interactions, we performed separate analyses for each stimulus size, with luminance and position, luminance and saturation, or saturation and position as predictors (Table 3). We found significant main effects of luminance, saturation, and position in all conditions. For medium-sized objects, there was a significant interaction between luminance and saturation, consistent with the reduced effect of saturation at low and high levels of luminance.

Table 3 Results of the generalized linear models, performed separately for small, medium, and large stimuli, with luminance and position or luminance and saturation as predictors

All calculations were performed using the HSL system. It is possible that different results could have been obtained if the stimuli were analyzed in a different color space. For example, the CIE Lightness, Chroma, Hue (LCh) space might be considered more appropriate, since distances in this space relate to just-noticeable differences in color. We recalculated our probit predictions in the LCh color space, but found little difference in the overall fits of the model, regardless of whether the HSL (R² = .534 across all stimuli) or the LCh (R² = .525) space was used.

Discussion

We examined how visual characteristics interact to determine which auditory pitch “goes with” a given visual stimulus. We found the predicted associations of high pitch with high luminance, high saturation, small size, and high position when one visual characteristic was varied (following, e.g., Evans & Treisman, 2010; Hamilton-Fletcher, 2015; Klapetek et al., 2012). Our study extends previous research by using visual stimuli that differed on two or more characteristics. A linear summation model predicted participants’ choices more accurately than a majority or a hierarchy model, although some results did not fit this model.

The summation model’s overall success in predicting participants’ responses suggests a general strategy of weighting the available visual cues to determine the best auditory match, perhaps via neural intensity matching (Spence, 2011) or a generalized system for dealing with magnitude (Walsh, 2003). However, we need to account for the few results that violate the model (the lower effects of saturation at low and high luminances, and the three-way interactions of luminance, position, and size; luminance, saturation, and size; and saturation, position, and size). The decreased effects of saturation at low and high luminances appear to be the result of Garner interference. In Garner’s (1976) paradigm, participants are presented with stimuli that vary along two perceptual dimensions, and then make decisions about one dimension. Information from the irrelevant dimension can interfere with decision making about the relevant information. When this happens, the dimensions are integral and viewed as one super-dimension. In our results, luminance and saturation integrate to form one super-dimension (see, e.g., Burns & Shepp, 1988), except when luminance is medium and does not differ between the two stimuli. However, other violations of the model are not clear-cut instances of Garner interference. One possibility is that the dimensions are incompletely integrated, so that participants’ decisions are influenced not only by the dimensions at unequal relative strengths, but also by interactions between the different dimensions.

Explaining summation in the context of theories about CMCs

How CMCs arise is a matter of ongoing investigation (e.g., Lindborg & Friborg, 2015). Eventually it should be possible to make a broad taxonomy of the fundamental mechanisms of CMCs. Some probably occur earlier in processing than others (e.g., a CMC based on statistical features of the environment probably occurs earlier than a language-based one), so early-occurring CMCs are likely to impact on later ones.

It is also possible that some CMCs begin at an early stage of processing and spread to other CMCs (e.g., a structural CMC that becomes encoded in language). These hypothetical CMCs would likely have more effect on perceptions and decisions than those that occur at only one level. That is, if a multilevel CMC conflicts with a single-level one, the multilevel CMC is likely to “win.”

Limitations and future directions

Online testing has advantages, including ease and speed of participant recruitment, but also disadvantages (Woods, Velasco, Levitan, Wan, & Spence, 2015). Repeated participation is one concern. However, since this study was unpaid and was informally reported by some participants to be tedious, the participants would not have repeatedly participated for money or fun.

The variety of participant hardware and system settings used will have affected the exact presentation of the stimuli. However, because we asked participants to judge the comparative visual features of stimuli presented at the same time, this cross-participant variance should not matter. This does, however, mean that our experiment cannot speak to whether CMCs and the interactions between them are relative or absolute. Consequently, an important next step will be to replicate this experiment in laboratory conditions. We do not expect very different results: When millisecond accuracy in presentation or response collection is not required, participants largely behave similarly in the lab and online (Woods et al., 2015).

To keep the experiment short, we tested only one auditory dimension. Therefore, we cannot know whether our findings are specific to the relationships of visual dimensions with pitch, or whether the same interactions would occur if we tested, say, duration instead. This question may also be applied to other sensory pairings—for example, differing tactile stimuli being matched with visual stimuli.

A consideration for future research is whether the same relationships between CMCs would have appeared if we had presented visual characteristics varying in a single dimension alongside auditory or tactile stimuli varying in multiple dimensions. Is summation a general feature of CMCs, or is it unique to vision? Evidence showing that timbre, pitch, and loudness interact to varying extents in speeded classification paradigms (Melara & Marks, 1990) suggests that similar results would be seen at least with aurally multidimensional CMCs.

Finally, it is not clear whether the CMC interactions we have reported are implicit or explicit. This could be tested using speeded classification tasks (see Marks, 2004). For example, temporal-order judgments (e.g., Parise & Spence, 2009) would allow an exploration of whether interactions occur at perceptual or decisional levels. An analysis of response times would also allow an exploration of the impacts of multiple conflicting or converging cues on decision making.

Applications

CMCs are used in packaging design (e.g., Becker, van Rompay, Schifferstein, & Galetzka, 2011), though not always successfully (Crisinel & Spence, 2012). Since real-world objects have multiple sensory dimensions, the existence of nonsummative effects of different dimensions indicates that it is important to consider which features of packaging or advertising are most strongly associated with the dimension that needs to be emphasized.

Our findings will also be helpful for designers of sensory substitution devices, such as the vOICe (Meijer, 1992), which allow the “translation” of information from one sense to another (for a review, see Hamilton-Fletcher & Ward, 2013). Having explicit knowledge about the relationships between different CMCs will allow for better design of default settings that are intuitively correct to most, reducing the time needed to learn to use such devices (Auvray, Hanneton, & O’Regan, 2007). The findings could also help when comparing devices that pair a single quality (e.g., pitch) with others, such as saturation (Bologna, Deville, Pun, & Vinckenbosch, 2007) and luminance (Doel, 2003).