Does multisensory study benefit memory for pictures and sounds?

Studies have found a multisensory memory benefit: higher recognition accuracy for unimodal test items that were studied as bimodal items than for those studied as unimodal items. This is a surprising finding because the encoding specificity principle predicts that memory performance should be better with greater overlap between processing during study and test. We used the method of Thelen, Talsma, and Murray (2015), who previously found a multisensory memory benefit. Items were presented as unimodal (picture or sound) or bimodal (picture and sound) items in a continuous recognition task in which only one modality was task-relevant. In four experiments we obtained little evidence for a difference in memory performance between items studied as unimodal or bimodal stimuli, but there was a benefit of study-test overlap in format if sound was the task-relevant modality. Task-induced attention for the irrelevant modality or response bias may have played a role in previous studies. We conclude that the multisensory memory benefit may not be a general finding, but rather one that is found only under conditions that induce participants to pay attention to the task-irrelevant modality.

Memory might be better for events that are experienced in multiple sensory modalities than for events that are experienced in a single sensory modality (Meyerhoff & Huff, 2016; von Kriegstein & Giraud, 2006; see Shams & Seitz, 2008 for a review). The most striking finding is that this benefit of multisensory learning is found even when participants are tested with unimodal targets. For example, when participants are tested with a picture of a dog, they are more likely to recognize the picture as a studied item if earlier on the picture had been presented together with the sound of a dog than if it had been presented without a sound (Heikkilä, Alho, Hyvönen, & Tiippana, 2015; Lehmann & Murray, 2005; Meyerhoff & Huff, 2016; Moran et al., 2013; Thelen et al., 2015).
These findings are in contrast with the idea that memory performance benefits from an overlap in the type of processing between study and test, formulated as the encoding specificity principle (Tulving & Thomson, 1973) and the principle of transfer appropriate processing (Morris, Bransford, & Franks, 1977). These principles are supported by many findings that memory performance is positively related to the overlap in processing between study and test (e.g., Barclay, Bransford, Franks, McCarrel, & Nitsch, 1974; Blaxton, 1989; Cabeza, 1994; Parks, 2013; Pecher, Zeelenberg, & Barsalou, 2004; Pecher, Zanolie, & Zeelenberg, 2007; Roediger & Adelson, 1980; Roediger & Blaxton, 1987; Van Dantzig, Cowell, Zeelenberg, & Pecher, 2011; Zeelenberg, 2005; Zeelenberg, Pecher, Shiffrin, & Raaijmakers, 2003; see also Roediger, Weldon, & Challis, 1989). One interpretation of these findings is that the context in which an item is presented and the task instructions affect which features are attended and encoded in memory, and consequently determine what is an effective retrieval cue during test. Following these principles, we would expect better memory performance for stimuli with greater overlap between study and test. Participants are likely to process a bimodal stimulus differently than a unimodal stimulus; for example, they might attend to different features if a picture is presented with a sound than if it is presented without a sound. Several studies have shown that memory for words is affected by the overlap in the modality-specific meaning that was activated during study and test (Pecher et al., 2004; Van Dantzig et al., 2011). During test, when the picture is presented without sound, the features activated by the test item will be more similar to those activated during unimodal study than during bimodal study. Thus, for items that are tested as unimodal items, performance should be better if they are studied as unimodal items than if they are studied as multimodal items.
The finding of the opposite result, namely a benefit for multisensory items, deviates from this general memory principle. The finding of a robust multisensory memory benefit that generalizes across stimuli and procedures would be interesting and stimulate further research into the processes driving the effect. Moreover, a robust multisensory memory benefit might be helpful in developing methods that optimize learning and retention of new information (e.g., Mayer, 2008).
One possibility is that the benefit for multimodal study items is due to integrative processing that connects the information from different modalities into a single object representation (Quak, London, & Talsma, 2015). Shams and Seitz (2008) have proposed two such integrative mechanisms that might account for the multisensory advantage. Their proposals are based on the idea that information is processed in modality-specific brain areas that are connected through higher level multisensory areas (like convergence zones; Binder & Desai, 2011; Damasio, 1989). Such multisensory areas are activated when information from different sensory modalities is integrated (Woods & Newell, 2004). One proposed explanation for the multisensory advantage is that, during study, activation in the non-tested modality causes stronger activation of the tested modality via these multisensory areas, which results in a stronger memory trace than when information from only one modality is presented. The other proposed explanation is that a multimodal stimulus leads to a stronger memory trace because there is additional encoding in multisensory areas, and during test the unimodal stimulus activates this entire multisensory memory (Shams & Seitz, 2008). The latter explanation is supported by auditory cortex activation during test of words that were studied with a sound compared to words that were studied without a sound (Nyberg, Habib, McIntosh, & Tulving, 2000). Other studies too have shown that stimuli from one modality may activate brain regions associated with a different modality when the items were previously studied as multisensory items including that modality (see Thelen & Murray, 2013, for a review). Although this finding is consistent with the idea that a unisensory stimulus can activate a multisensory memory, such brain activity alone does not show that memory performance is better for multisensory than unisensory items.
Both integrative explanations are based on the claim that multisensory integration is essential for the multisensory benefit. Thelen et al. (2015) further argued that integrative processes will result in stronger memories when information from different modalities is congruent (e.g., a picture of a dog and the sound of a barking dog), that is, when it activates the same object in long-term memory, but will lead to interference when the information is incongruent (e.g., a picture of a dog and the sound of a train). Thus, memory is expected to be better for congruent multimodal items than for unimodal items because a stronger representation is encoded for the target, but worse for incongruent multimodal items than for unimodal items because there is interference from the incongruent object.
Although there is some evidence for a benefit of multisensory learning (Lehmann & Murray, 2005; Matusz et al., 2015; Meyerhoff & Huff, 2016; Moran et al., 2013; Thelen et al., 2015), not all studies have obtained such a benefit (Cohen, Horowitz, & Wolfe, 2009; Nyberg et al., 2000). Moreover, although Thelen et al. observed a negative effect of incongruent multimodal items compared to unimodal items, others found no disadvantage for incongruent multisensory items compared to unimodal items (Heikkilä et al., 2015; Heikkilä, Alho, & Tiippana, 2017; Lehmann & Murray, 2005; Matusz et al., 2015; Moran et al., 2013). Canits, Pecher, and Zeelenberg (2018) found no benefit of congruent motor actions on later memory for pictures even though congruency did affect immediate motor responses during the study task. Thus, the currently available evidence, which comes from only a few studies, partially supports the idea that multisensory integration explains the memory advantage for multimodal items. The advantage for congruent stimuli is consistent with this explanation, but the absence of a disadvantage for incongruent stimuli (in most published studies) might be hard to explain.
In the current study we aimed to further investigate the effect of multimodal stimulus presentation on memory. Although studies have obtained a multisensory benefit, most of these did not use an optimal design to allow strong conclusions. First, it is often not clear if materials were properly counterbalanced. In most studies, the numbers of sound items and picture items do not match, so that a fully counterbalanced design would not be possible (Lehmann & Murray, 2005; Murray et al., 2004; Murray, John, & Wylie, 2005; Thelen et al., 2015; Heikkilä et al., 2015, 2017; Heikkilä & Tiippana, 2016). If some items are systematically presented as unimodal and other items are systematically presented as bimodal, a confound exists between stimulus materials and experimental condition. It seems possible that items in one condition are more memorable than those in another condition. Second, with the exception of Thelen et al. (2015), in the continuous recognition paradigm presentation format (unimodal vs. bimodal) and study status (studied vs. nonstudied) were correlated because study items could be unimodal or bimodal, but test items were always unimodal (Lehmann & Murray, 2005; Matusz et al., 2015; Moran et al., 2013; Murray et al., 2004). In a continuous recognition paradigm (Shepard & Teghtsoonian, 1961), as was used in all five aforementioned studies, the participant has to distinguish study (initial presentation) and test (repeated presentation) items mixed in the same continuous stream of items. Because all bimodal items were study items, the correct response to a bimodal item was always 'new'. Because only a portion (varying from 25 to 50%) of the study items were unimodal but all test items were unimodal, the probability that a unimodal item was old was larger than 50% (varying from 67 to 80%). As a result of this correlation participants may have had a bias to respond 'old' to unimodal items. Moran et al. indeed observed a higher false alarm rate for unimodal than for bimodal items, which might indicate such a bias. The effect of such a bias might be that on their first presentation, unimodal and bimodal items are processed differently. For example, participants may put less effort into encoding unimodal study items than bimodal study items because they are seen as test items. Moreover, when all test items are unimodal, multimodal items are less frequent and, therefore, may be more distinctive than unimodal items. Memory is better for distinctive items (Hunt & McDaniel, 1993), and thus having unequal numbers of unimodal and bimodal items might lead to differences in memory performance that are unrelated to their modality per se. To summarize, many findings of a multisensory advantage may still be open to alternative explanations because of suboptimal designs. Therefore, we conclude that the evidence for a multisensory advantage is not very strong.
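The 67 to 80% range mentioned above follows from simple counting. As a minimal sketch, assume that every studied item is repeated exactly once, that every repetition is unimodal, and that a proportion p of the study items is unimodal. The function name and the single-repetition assumption are ours and serve only as an illustration.

```python
def p_old_given_unimodal(p_unimodal_study: float) -> float:
    """Probability that a unimodal item is an 'old' (repeated) item.

    Assumes each study item is repeated exactly once and that every
    repetition is unimodal, as in the designs discussed above.
    """
    if not 0.0 <= p_unimodal_study <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    # Per studied item: 1 unimodal test trial and p unimodal study trials.
    return 1.0 / (1.0 + p_unimodal_study)


for p in (0.50, 0.25):
    print(f"{p:.0%} unimodal study items -> "
          f"{p_old_given_unimodal(p):.0%} of unimodal items are old")
```

With 50% and 25% unimodal study items this gives 67% and 80%, respectively, which matches the range stated in the text.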
In the present study, we wanted to collect more data to establish whether, compared to unimodal items, memory is better for bimodal-congruent items and worse for bimodal-incongruent items. We replicated the continuous recognition experiment by Thelen et al. (2015) with some modifications. We chose the study by Thelen et al. because they presented the same number of unimodal and multimodal items during study and test and therefore their design did not suffer from a correlation between presentation format and study status. Thelen et al. (2015) presented object pictures and sounds in a continuous recognition task in which only one modality was task-relevant (i.e., participants made recognition decisions to either pictures or sounds). Participants had to decide if the item was presented for the first time (i.e., a study trial) or repeated (i.e., a test trial). Half of the items were presented as unimodal stimuli and the other half as bimodal stimuli where the irrelevant sensory modality was congruent, incongruent, or meaningless. We dropped the meaningless condition so that more items could be presented in the congruent and incongruent conditions. Unlike Thelen et al., who presented items several times in different blocks, we presented each item only once for study and once for test. In addition, unlike Thelen et al., who tested the same participants in the picture memory and sound memory tasks, creating even more repetitions of items, we tested separate groups of participants in the picture memory task (Experiment 1) and the sound memory task (Experiment 2). Thelen et al. obtained an interaction between task order and presentation format on the study trial. Although they did not present all data that would be needed to fully understand the interaction, it appears that the multisensory advantage in picture memory was restricted to participants who had done the auditory memory task first.
This suggests that auditory information may have been attended more by participants who had previously been instructed to remember these sounds than by participants who did the picture memory task first. Another important change was that we removed a potential confound in stimulus materials as described in the previous paragraph. The appendix in Thelen et al. shows that they presented 144 different pictures but only 106 different sounds. Although they did not fully specify how items were counterbalanced and we thus do not know the details, the different numbers of pictures and sounds suggest that 38 pictures were presented only in the visual-only condition and therefore that there was a confound between materials and study condition in their experiment. We created a set of stimuli with equal numbers of pictures and sounds such that each picture was paired to a sound. We used a random selection procedure such that all items were equally likely to be presented in a specific condition. Finally, Thelen et al. analyzed only data for the items that were presented as unimodal items on the repeated (i.e., second) presentation, but we also looked at the items in multimodal conditions to assess the effect of overlap between study and test. We tested at least twice as many participants per experiment as Thelen et al. who tested 26 participants in their experiment. A replication of Thelen et al.'s findings would show an advantage for unimodal test items studied as bimodal-congruent items and a disadvantage for items studied as bimodal-incongruent items compared to items studied as unimodal items. However, based on the literature on transfer appropriate processing we might expect a different pattern of results, namely that the advantage of bimodal study items would be restricted to items that were also tested as bimodal items.

Experiment 1

Participants
Fifty-four students at the Erasmus University participated for course credit.

Stimuli
A set of 232 picture-sound pairs was used as congruent pairs. Of these, 224 were used in the experimental trials and 8 were used for practice. Of these pairs, 168 were from Moran et al. (2013), kindly provided by Zachary Moran. The other 64 pairs were created by pairing pictures and sounds, similar in quality to those provided by Moran, retrieved from various websites. The pictures were colored line drawings of animals (e.g., dog, bee) and objects (e.g., trumpet, helicopter). The sound files were 500 ms clips of a typical sound made by the animal or object. Incongruent picture-sound pairs were created by randomly pairing pictures to sounds with the restriction that the sound could not be that of a closely related item (e.g., a picture of a trumpet with the sound of a saxophone). A full list of items is provided in Appendix A. For each participant, a different random allocation of items to conditions was created.

Design
Each picture was presented twice. In the continuous recognition task, all items could be considered test items, where the response to an initial presentation should be "new" and the response to a repeated item should be "old". We will call the initial presentation of an item a study trial because this is when the item is initially encoded in memory. We will call the repeated presentation of an item a test trial, because responses on these trials indicate if participants recognize the item as being presented previously. It is important to note that study and test trials were mixed into a single sequence (i.e., there are no separate study and test blocks) and the task for the participant was the same on all trials, namely decide whether the item was new or old. For both study and test trials, half of the items were presented alone as a picture, the unimodal condition. The other half was presented with a sound, the bimodal condition. Half of the sounds were congruent with the picture and half were incongruent. Of all the items in each of the three study conditions (unimodal, bimodal-congruent, bimodal-incongruent), half were subsequently presented at test alone as a picture and half were presented with a sound (congruent or incongruent). The full set of conditions with number of items in each condition is presented in Table 1.
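The crossing of study and test conditions can be sketched as a random allocation of items to cells. This is only an illustration under stated assumptions: the cell sizes below are derived from the halves described above applied to the 224 experimental items (with the bimodal test half assumed to split evenly into congruent and incongruent), not copied from Table 1, and the function and constant names are hypothetical.

```python
import random
from collections import Counter

CONDITIONS = ("unimodal", "bimodal-congruent", "bimodal-incongruent")
# Assumed share of items per condition: half unimodal, a quarter each bimodal.
SHARES = {"unimodal": 2, "bimodal-congruent": 1, "bimodal-incongruent": 1}


def allocate(n_items: int = 224, seed: int = 0) -> list[tuple[str, str]]:
    """Return one (study, test) condition pair per item, in random item order."""
    cells = []
    for study in CONDITIONS:
        for test in CONDITIONS:
            # Shares multiply: unimodal/unimodal gets 2 * 2 of 16 parts.
            count = n_items * SHARES[study] * SHARES[test] // 16
            cells.extend([(study, test)] * count)
    if len(cells) != n_items:
        raise ValueError("n_items must be divisible by 16")
    random.Random(seed).shuffle(cells)  # use a different seed per participant
    return cells


design = allocate()
for cell, count in sorted(Counter(design).items()):
    print(cell, count)
```

Shuffling the cell labels over a fixed item list gives every item the same chance of ending up in any cell, which is the property that removes a confound between materials and condition.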

Procedure
The experimental procedure was closely based on that of Thelen et al. (2015). Participants were tested individually. They were seated at a desk wearing headphones with a keyboard and monitor on the desk in front of them, all connected to the PC that was used to run the experiment. They were instructed that a list of pictures that they had to remember would be shown on the monitor, and more specifically, that they had to decide for each picture whether it was 'old' or 'new'. Participants were informed that some of the pictures were presented together with a sound, but that the old-new decision should be based only on the picture. The experiment started with 16 practice trials, followed by two blocks of 224 experimental trials. Each trial started with a 500 ms fixation marker (+) in the center of the screen, followed by a picture that was shown for 500 ms. If a sound was played, it started at the same time as the picture and played for 500 ms. Each picture was followed by a variable ISI of 900-1500 ms (mimicking the timing used by Thelen et al.) until the next trial started. Participants kept their index fingers on the z and m keys throughout the block. They pressed the z key for new items and the m key for old items. Feedback was provided only during the practice trials. After each trial, the word "Correct!" in blue or "Incorrect" in red was displayed in the center of the screen for 250 ms. During the experimental trials no feedback was provided. Items from different conditions were mixed in a semi-random order. Repetitions were presented at a lag that varied from 5 to 15 items.

Results
Because participants had to respond before the start of the next trial there were trials on which no response was recorded. Only trials on which a participant responded were included in the following calculations and analyses (94.9% of all trials). The accuracy for new and old items was calculated for each condition. Data from three participants were removed from the analyses because their accuracy was below 60% (note that 50% accuracy represents chance performance). All data that were used for the analyses are available on https://osf.io/vkqbf/.
The mean hit rates and false alarm rates for the three study conditions are presented in Table 2. For each type of trial d' (d-prime) was calculated using the Snodgrass and Corwin (1988) correction. Because hit rates (and false alarm rates) may be influenced by response bias, d' provides a better measure of memory sensitivity than hit rates alone. The false alarm rates are calculated from 'old' responses to the first presented items grouped by study condition, so to calculate d' the comparison is between the first presentation and the second presentation of items in the same format. This was the best comparison to eliminate the effect of response bias toward a particular format. The average d' values for all conditions are shown in Fig. 1. A 3 (study condition) by 3 (test condition) ANOVA showed an effect of study condition, F(2,100) = 4.89, p = .009, partial η² = 0.09. It seems that memory was slightly better for items that were studied as picture-only than for items studied as a picture with sound, which is the opposite of the previously reported multisensory advantage.
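As an illustration of the sensitivity measure, the following minimal sketch computes d' with the Snodgrass and Corwin (1988) correction, which adds 0.5 to each response count and 1 to each trial total so that rates of 0 or 1 do not produce infinite z scores. The function name and the example counts are hypothetical and are not taken from the data.

```python
from statistics import NormalDist


def d_prime(hits: int, n_old: int, false_alarms: int, n_new: int) -> float:
    """d' with 0.5 added to each count and 1 added to each trial total."""
    hit_rate = (hits + 0.5) / (n_old + 1)
    fa_rate = (false_alarms + 0.5) / (n_new + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts for one participant in one condition (not real data).
print(round(d_prime(hits=24, n_old=28, false_alarms=3, n_new=28), 2))
```

Here the hits come from repeated presentations and the false alarms from first presentations in the same format, following the comparison described above.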
Memory was better for items tested with sound than for items tested as picture-only, F(2,100) = 3.63, p = .030, partial η² = 0.07. There was no interaction between study and test condition, F(4,200) = 1.13, p = .344, partial η² = 0.02. The absence of an interaction indicates that the overlap between study and test format did not affect memory. Because Thelen et al. (2015) analyzed only responses to items tested as unimodal stimuli, we did a second analysis to separately test the effect of study condition on items tested as picture-only. In addition to calculating p values, we calculated the JZS Bayes Factor (BF), which is the ratio of p(D|H0) and p(D|H1), the probabilities of observing the data under the null hypothesis and the alternative hypothesis, respectively. The Bayes Factor thus provides a relative measure of the extent to which the data provide evidence for the null hypothesis of no effect or the alternative hypothesis (Rouder, Speckman, Sun, Morey, & Iverson, 2009). Bayes Factors between 3 and 10 can be considered moderate evidence, and Bayes Factors above 10 can be considered strong evidence. Bayes Factors were calculated using JASP (JASP Team, 2017). The Bayes Factor varies as a function of the prior that is set. In our analyses we always set the Cauchy prior width at 0.707, and in Appendix B we show how the Bayes Factor varies with different values of this prior. These plots show that, although the strength of the evidence varies with the prior, in most cases the direction of the Bayes Factor is not affected by our choice of prior.
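To make the Bayes Factor more concrete, the sketch below approximates the JZS BF01 for a one-sample or paired t test from the t value and the sample size, using the integral form given by Rouder et al. (2009) with a Cauchy prior of width r. It is not the JASP implementation: the function name, the midpoint-rule integration, and the integration limit are our own choices, and the sample size of 51 assumes the 54 participants of Experiment 1 minus the three exclusions.

```python
import math


def jzs_bf01(t: float, n: int, r: float = 0.707, steps: int = 20000) -> float:
    """Approximate BF01 for a one-sample (or paired) t test, Cauchy(0, r) prior."""
    nu = n - 1
    # Likelihood of t under H0 (constants shared with H1 cancel in the ratio).
    null_like = (1 + t * t / nu) ** (-(nu + 1) / 2)

    # Under H1 the effect size is Normal(0, r^2 * g), g ~ inverse-chi-square(1).
    # Writing g = 1 / z^2 with z standard normal turns the integral over g
    # into an expectation over z, approximated by a midpoint rule on (0, 8).
    upper = 8.0
    width = upper / steps
    alt_like = 0.0
    for i in range(steps):
        z = (i + 0.5) * width
        a = 1 + n * r * r / (z * z)
        like = a ** -0.5 * (1 + t * t / (a * nu)) ** (-(nu + 1) / 2)
        density = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        alt_like += 2 * like * density * width  # factor 2 covers z and -z
    return null_like / alt_like


# t(50) = 1.67 with an assumed N of 51 (first comparison in Experiment 1).
print(round(jzs_bf01(1.67, 51), 2))
```

Because the input t value is rounded, the result should land close to, but not necessarily exactly on, the Bayes Factor obtained from the raw data. Passing other values for r shows how the Bayes Factor changes with the prior width.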
For unimodal test items there was no difference between items studied unimodally and items studied with a congruent sound, t(50) = 1.67, p = .101, BF01 = 1.79. For unimodal test items there was also no difference between items studied unimodally and items studied with an incongruent sound, t(50) = 1.63, p = .109, BF01 = 1.90. Plots showing the Bayes Factor as a function of the Cauchy prior width are provided in Appendix B. Thelen et al. (2015) found that, compared to memory for items studied as pictures only, memory for items studied with congruent sounds was better and memory for items studied with incongruent sounds was worse. In contrast, the present experiment did not show such an effect. If anything, our results indicated better memory for items studied as picture-only than for items studied with a sound.

Experiment 2
In Experiment 2 memory for sounds was investigated. Participants studied sounds with or without a picture in the same continuous recognition paradigm that was used in Experiment 1.

Participants
Sixty-six students at the Erasmus University participated for course credit.

Stimuli and procedure
The same set of stimuli was used as in Experiment 1. Each sound was presented twice, and could be presented as sound-only, with a congruent picture, or with an incongruent picture. The procedure was the same as that used for Experiment 1, with the exception that participants responded whether the sound was 'old' or 'new'. They were informed that some of the sounds were presented together with a picture, but that the old-new decision should be based only on the sound.

Table 2. Mean hit rates and false alarm rates in Experiments 1 to 4 (standard errors of the mean in parentheses). Note. Hit and false alarm rates reflect values obtained before applying the Snodgrass and Corwin (1988) correction.

Results
Data from 4 participants were excluded because the sounds did not play properly during their session, and data from 13 participants were excluded because their accuracy was below 60%. Only trials on which a participant responded were included (91.7% of all trials). The mean hit rates and false alarm rates are presented in Table 2. The average d' values for all conditions are shown in Fig. 2. A 3 (study condition) by 3 (test condition) ANOVA showed that memory was better for items studied with pictures than as sound-only, F(2,96) = 9.01, p < .001, partial η² = 0.16, and this effect seems to be restricted to the items that were tested with the same picture (i.e., the picture with which it had been presented on the first presentation), as indicated by the interaction effect, F(4,192) = 29.27, p < .001, partial η² = 0.38. Test format did not affect memory, F(2,96) = 1.09, p = .341, partial η² = 0.02.
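The reported effect sizes can be checked against the F ratios, because partial η² equals F × df_effect / (F × df_effect + df_error). The sketch below applies this standard identity to the Experiment 2 values reported above; the function name is ours.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values and degrees of freedom as reported for Experiment 2.
for label, f_value, df1, df2 in [
    ("study condition", 9.01, 2, 96),
    ("study x test interaction", 29.27, 4, 192),
    ("test condition", 1.09, 2, 96),
]:
    print(f"{label}: {partial_eta_squared(f_value, df1, df2):.2f}")
```

The three values come out as 0.16, 0.38, and 0.02, in agreement with the effect sizes reported in the text.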
To compare our results with those of Thelen et al., we analyzed the sound-only test condition separately and, compared to items studied as unimodal, found no advantage for items studied as bimodal-congruent, t(48) = 0.49, p = .626, BF01 = 5.74, nor a disadvantage for items studied as bimodal-incongruent, t(48) = 0.51, p = .614, BF01 = 5.70. Thus, we did not replicate Thelen et al.'s results.
Overall, these results suggest that task-irrelevant pictures may support memory for sounds, but only when the picture during study and test was the same. Interestingly this was found for both congruent and incongruent pictures. In other words, it was the overlap between study and test that mattered, not the congruency between the sound and the picture.
It is surprising that we did not replicate Thelen et al.'s (2015) findings of better memory for unimodal test items that had been studied as bimodal-congruent items than for unimodal test items studied as unimodal items. A possible explanation is that participants did not have enough time or resources to process the task-irrelevant modality. The presentation speed of items in the continuous recognition task was quite fast. The time from onset of one stimulus to the onset of the next stimulus varied between 1400 and 2000 ms, during which time participants had to process the stimulus, decide whether it was first presented or repeated, and make a response. Under such demanding conditions, when there is competition for resources, top-down processes might work to select the task-relevant modality and ignore the task-irrelevant modality (Talsma, Senkowski, Soto-Faraco, & Woldorff, 2010, but see Alais, Morrone, & Burr, 2006). The absence of a multisensory study advantage in our results might be due to the demanding task that caused participants to ignore the task-irrelevant modality. Although we used the same timing of presentations as Thelen et al., they presented the same items in a picture recognition and a sound recognition task to the same participants, which may have increased their participants' attention to the task-irrelevant modality despite the demanding conditions. Fast pacing is not critical to finding a multisensory benefit, however: to accommodate fMRI measurements, Murray et al. (2005) varied the intertrial interval between 6000 and 10,000 ms and also observed a multisensory benefit. In the next two experiments we increased the time between trials by 2000 ms in an attempt to make the task less demanding and give more room for attention to the task-irrelevant modality.

Experiment 3

Participants
Sixty students at the Erasmus University participated for course credit.

Stimuli and procedure
The stimuli and procedure were the same as those in Experiment 1, except that the ISI was increased by 2000 ms to a variable duration between 2900 ms and 3500 ms.

Results
The accuracy for new and old items was calculated for each condition. Data from four participants were removed from the analyses because their accuracy was below 60%. Only trials on which a participant responded were included (92.4% of all trials). The mean hit rates and false alarm rates are presented in Table 2. The average d' values for all conditions are shown in Fig. 3. A 3 (study condition) by 3 (test condition) ANOVA showed no effects of study condition, F(2,110) = 1.40, p = .251, partial η² = 0.03, or test condition, F(2,110) = 0.24, p = .789, partial η² = 0.00. There was no interaction between study and test condition, F(4,220) = 1.66, p = .162, partial η² = 0.03. The absence of an interaction indicates that the overlap between study and test format did not affect memory.
A separate analysis of the items tested as picture-only showed, compared to items studied as unimodal, no advantage for items studied as bimodal-congruent, t(55) = 0.09, p = .931, BF01 = 6.83, nor a disadvantage for items studied as bimodal-incongruent, t(55) = 0.93, p = .357, BF01 = 4.55. Thus, increasing the inter-stimulus interval did not result in an effect of study modality on recognition memory performance for pictures.

Experiment 4

Participants
Sixty-six students at the Erasmus University participated for course credit.

Stimuli and procedure
The stimuli and procedure were the same as those in Experiment 2, except that the ISI was increased by 2000 ms to a variable duration between 2900 and 3500 ms as in Experiment 3.

Results
Data from seven participants were excluded because their accuracy was below 60%. Only trials on which a participant responded were included (92.6% of all trials). The mean hit rates and false alarm rates are presented in Table 2. The average d' values for all conditions are shown in Fig. 4. A 3 (study condition) by 3 (test condition) ANOVA showed that memory was better for items studied with pictures than as sound-only, F(2,116) = 31.42, p < .001, partial η² = 0.35, and this effect seems to be restricted to the items that were tested with a picture and mainly due to better memory when the picture during study and test was the same, as indicated by the interaction effect, F(4,232) = 26.27, p < .001, partial η² = 0.31. In addition, there was an effect of test condition: memory was better for items tested with a picture than as sound-only, F(2,116) = 6.22, p = .003, partial η² = 0.10.
A separate analysis of the items tested as sound-only showed, compared to items studied as unimodal, an advantage for items studied as bimodal-congruent, t(55) = 2.25, p = .028, although the Bayesian analysis indicates that the evidence was very weak, BF10 = 1.45. We found no disadvantage for items studied as bimodal-incongruent, t(58) = 1.55, p = .125, BF01 = 2.26. Thus, we partially replicated Thelen et al. (2015), who found better memory for sounds studied with a congruent picture than for items studied as sound-only. We did not replicate their finding that memory was better for items studied as sound-only than for items studied with an incongruent picture.

General discussion
In four experiments we tested memory for pictures (Experiments 1 and 3) and sounds (Experiments 2 and 4) representing common objects and animals. Items were presented as unimodal, bimodal-congruent, and bimodal-incongruent items in a continuous recognition task.
Format for new and repeated items could be the same or different in a fully crossed design. The results showed little evidence for a multisensory benefit and some evidence that memory for sounds benefited from study-test overlap. Based on the results of Thelen et al. (2015) and others (Lehmann & Murray, 2005; Moran et al., 2013), we had expected an overall benefit for bimodal-congruent study conditions over unimodal study conditions. However, as shown in Figs. 1-4, no such general benefit was obtained; memory performance in the bimodal-congruent study condition was not consistently higher than in the unimodal study condition. A benefit for bimodal-congruent items was found, but only when participants had to memorize sounds (see Figs. 2 and 4), and there it was largely restricted to bimodal-congruent test items that had also been studied as bimodal-congruent items. Importantly, and contrary to the idea that memory is only enhanced for bimodal-congruent items, a similar pattern was found for bimodal-incongruent items, indicating that these results are due to study-test overlap. In the unimodal test condition in particular we had expected performance to be better when items had been studied in the bimodal-congruent condition than when items had been studied in the unimodal condition. In Figs. 1-4, performance in the unimodal test condition is shown in the three leftmost bars. As can be seen, we did not find an advantage for the bimodal-congruent study condition over the unimodal study condition, except for a very weak effect in Experiment 4. In addition, we did not find that performance in the bimodal-incongruent study condition was below that of the unimodal study condition.
We were also interested in possible effects of study-test overlap in modality. Based on the encoding specificity and transfer-appropriate processing principles we expected better performance if the format of items (unimodal, bimodal-congruent, bimodal-incongruent) was the same during study and test, compared to when it was different (see also Meyerhoff & Huff, 2016). When memory was tested for pictures, the presence of a sound during study did not have an effect, nor was there an effect of study-test overlap (see Figs. 1 and 3). When memory was tested for sounds, however, there was an effect of the presence of a picture and this effect was primarily due to study-test overlap (see Figs. 2 and 4). We even found that memory performance for sounds was improved by a semantically incongruent picture at study if the same incongruent picture was also presented at test. Sound memory might be affected more by a picture than picture memory by a sound because memory for sounds is weaker than memory for pictures (Cohen et al., 2009; Heikkilä et al., 2017). Moreover, visual information is dominant in general (Colavita, 1974), also in processing meaningful pictures and sounds (Yuval-Greenberg & Deouell, 2009). This means that task-irrelevant visual information is harder to ignore than task-irrelevant auditory information. Therefore, sound memory may have benefitted from the overlap in visual information between study and test because sound memory is weak and visual information is hard to ignore. Picture memory, on the other hand, may not have benefitted from overlap in sound information because picture memory is strong already and sound information is easy to ignore. Thus, overlap in picture would be more effective than overlap in sound.
The current study used d' as a measure of recognition memory accuracy (Macmillan & Creelman, 2005; Snodgrass & Corwin, 1988). Although this is the standard measure of performance in recognition memory tasks, and much preferred over just considering hit rates, d' accurately reflects recognition memory accuracy only if the assumptions underlying its calculation are met. One such assumption is that the variances of the target and foil distributions of familiarity are equal. Studies indicate, however, that this equal-variance assumption is consistently violated (e.g., Ratcliff, Gronlund, & Sheu, 1992); the variance of the target distribution is typically larger than the variance of the foil distribution. The violation of the equal-variance assumption is particularly problematic when comparing conditions in which participants use different response biases (e.g., Grider & Malmberg, 2008). As we argued in the Introduction, such differences may have been present in studies where there was a correlation between presentation format (unimodal vs. bimodal) and study status (studied vs. nonstudied). The present study, however, was designed so that there was no correlation between presentation format and study status, thereby eliminating the motivation for participants to be differently biased in the unimodal and bimodal conditions. To the extent that in the present study d' accurately reflected differences in performance between conditions, we would expect to find similar findings for alternative measures of memory accuracy, such as d' e (Grider & Malmberg, 2008), or with procedures developed to eliminate response biases, such as two-alternative forced-choice (Grider & Malmberg, 2008; Zeelenberg, Wagenmakers, & Raaijmakers, 2002; Zeelenberg, Wagenmakers, & Rotteveel, 2006). To our knowledge, no study on multisensory memory has used such alternative approaches.
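To illustrate why a violation of the equal-variance assumption matters mainly when response bias differs between conditions, the following minimal Python sketch computes d' as z(hit rate) minus z(false alarm rate) and applies it to a simple Gaussian signal detection model. The function names and the parameter values (target mean 1.0, target standard deviation 1.25, criteria 0.0 and 1.0) are illustrative assumptions, not values or code from the present experiments.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for z-transforms


def d_prime(hit_rate: float, fa_rate: float) -> float:
    """Equal-variance sensitivity index: z(H) - z(FA).

    Rates of exactly 0 or 1 have no finite z-score and would need a
    correction before calling this function (not handled here).
    """
    return STD_NORMAL.inv_cdf(hit_rate) - STD_NORMAL.inv_cdf(fa_rate)


def rates_for_criterion(criterion: float, target_mean: float, target_sd: float):
    """Hit and false alarm rates for a hypothetical Gaussian model.

    Foil familiarity is assumed to be N(0, 1); target familiarity is
    assumed to be N(target_mean, target_sd). An item is called 'old'
    when its familiarity exceeds the criterion.
    """
    hit_rate = 1.0 - NormalDist(target_mean, target_sd).cdf(criterion)
    fa_rate = 1.0 - STD_NORMAL.cdf(criterion)
    return hit_rate, fa_rate


def d_prime_at(criterion: float, target_mean: float = 1.0, target_sd: float = 1.0) -> float:
    """d' that would be computed for a participant using the given criterion."""
    return d_prime(*rates_for_criterion(criterion, target_mean, target_sd))


if __name__ == "__main__":
    for sd in (1.0, 1.25):  # equal vs. unequal variance (illustrative values)
        liberal = d_prime_at(0.0, target_sd=sd)
        conservative = d_prime_at(1.0, target_sd=sd)
        print(f"target SD {sd}: d' liberal = {liberal:.3f}, conservative = {conservative:.3f}")
```

Under these assumptions, the computed d' is the same for both criteria when the variances are equal, but it changes with the criterion when the target variance is larger, even though the underlying distributions are identical. This is the sense in which differences in bias between conditions could masquerade as differences in memory accuracy.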
The lack of a general multisensory memory benefit in our study suggests that the effect is not robust and may be sensitive to task-specific factors. One likely factor is that the presence of a multisensory benefit depends on the amount of attention that participants pay to the task-irrelevant modality. In Thelen et al. (2015), all participants were tested in both a picture recognition memory and a sound recognition memory task (see also Heikkilä et al., 2015; Heikkilä et al., 2017). Even though Thelen et al. separated these tasks by a week, participants may still have paid attention to the task-irrelevant modality in the second task, especially because the same target items were presented in both tasks and the task-irrelevant modality in the second task had been the task-relevant modality in the first task. For example, if the second task was sound memory, bimodal items would be accompanied by pictures that had been the to-be-remembered items in the first task and may for that reason have captured attention.
Such a task-specific carry-over effect on attention cannot explain the multisensory benefit obtained by three other studies in which only one modality was task-relevant (Lehmann & Murray, 2005; Moran et al., 2013; Murray et al., 2005). In the studies by Lehmann and Murray (Experiment 2) and Murray et al., participants had to remember pictures while sounds were task-irrelevant; in the study by Moran et al., participants had to remember sounds while pictures were task-irrelevant. That is, in contrast to the Thelen et al. (2015) study, only one modality was ever task-relevant. In these studies, however, items were presented in a continuous recognition task as either unimodal or bimodal items on their first presentation, but always as unimodal items on their second presentation. As discussed in the Introduction, this may have introduced a response bias based on the modality of the items. Moreover, because across all trials the correct response to a bimodal item was always 'new' and the correct response to a unimodal item was more likely to be 'old' than 'new', participants may have been motivated to pay close attention to the task-irrelevant modality because that would help them make their responses. In addition, in this design unimodal items are more frequent (75%) than bimodal items (25%). Unusual items are commonly better remembered than usual items when the two types are mixed (McDaniel & Bugg, 2008). Because the bimodal items were the unusual items, it is possible that the memory benefit for bimodal items was the result of their unusualness rather than the bimodal format itself. To circumvent these potential problems, we used the continuous recognition task with the design of Thelen et al. (2015), in which all formats were equally likely. Our results suggest that using such a design eliminates the multisensory memory benefit.
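The base rates in such a design can be worked out with a few lines of Python. The sketch below assumes, hypothetically, that every item was presented exactly twice and that half of the first presentations were bimodal; these assumptions reproduce the 75%/25% split mentioned above but are not confirmed details of the cited studies, and the function name is our own.

```python
from fractions import Fraction


def design_base_rates(p_bimodal_first: Fraction = Fraction(1, 2)) -> dict:
    """Trial proportions for a continuous recognition design in which
    first presentations ('new') are unimodal or bimodal and second
    presentations ('old') are always unimodal.

    Assumes each item appears exactly twice, so half of all trials are
    first presentations and half are second presentations.
    """
    half = Fraction(1, 2)
    new_bimodal = half * p_bimodal_first
    new_unimodal = half * (1 - p_bimodal_first)
    old_unimodal = half  # every repetition is unimodal
    old_bimodal = Fraction(0)  # bimodal items are never repetitions

    share_unimodal = new_unimodal + old_unimodal
    share_bimodal = new_bimodal + old_bimodal
    return {
        "share_unimodal": share_unimodal,
        "share_bimodal": share_bimodal,
        # Probability that 'old' is the correct response, given the format
        "p_old_given_unimodal": old_unimodal / share_unimodal,
        "p_old_given_bimodal": old_bimodal / share_bimodal,
    }


if __name__ == "__main__":
    for name, value in design_base_rates().items():
        print(f"{name}: {value} ({float(value):.2f})")
```

Under these assumptions, a participant who only attended to presentation format could respond 'new' to every bimodal item without error and would be correct on two thirds of unimodal items by responding 'old', which is why format is informative about study status in this design.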
Ours is not the first study to obtain little evidence for a multisensory memory benefit. Studies that presented items in separate study and test blocks also did not obtain a multisensory advantage (Cohen et al., 2009; Nyberg et al., 2000). Although this is a different procedure than the continuous recognition task, there is no obvious reason why the effect of multimodal presentation format should be restricted to the continuous recognition task. In fact, Duarte, Ghetti, and Geng (2021) have suggested that the multisensory advantage might be limited to recollection processes in memory. Such recollection processes might play a larger role when participants have more time to respond, as might be the case in blocked study-test designs compared to the fast-paced continuous recognition paradigm. In both procedures participants receive explicit memory instructions and the test taps into long-term memory. These and our results suggest that the multisensory memory benefit is not a general finding but rather one that is restricted to specific task details.
A multisensory benefit in memory would be consistent with findings of multisensory benefits in attention (Evans, 2020; Quak et al., 2015; Talsma et al., 2010). Attention is increased by items in a task-irrelevant modality that are spatially or temporally aligned with the task-relevant stimulus. For example, Van der Burg, Olivers, Bronkhorst, and Theeuwes (2008) found that participants were faster to detect the orientation of a horizontal or vertical line segment among oblique distractor line segments if a change in color of the target segment was accompanied by a tone than if the color change occurred without a tone. The stimuli used in studies that investigated the effect of multimodal stimuli on attention were often simple stimuli such as visual gratings or single tones. The effects of unimodal vs. multimodal presentations on attentional processes with more complex stimuli such as a picture of a dog or the sound of a dog bark might be different because they require deeper processing to establish congruency. Thus, processing the second, task-irrelevant modality may actually hurt performance because it requires processing resources and thus takes attention away from the task-relevant modality. Moreover, evidence suggests that conditions that facilitate initial processing of a stimulus may not always result in better memory. For example, Canits et al. (2018) found no benefit of congruent motor actions on later memory for pictures even though congruency did affect immediate motor responses during the study task. Thus, although it may seem reasonable to expect memory benefits for congruent conditions that show an effect during study, the relation between congruency during study and later memory performance may not be so straightforward.
The effect of study-test overlap in Experiments 2 and 4 shows that participants did not completely ignore the task-irrelevant modality. It appears that some information from the task-irrelevant modality was encoded in memory; otherwise the overlap between study and test would not have affected memory performance. However, the irrelevant information at study was not helpful unless it was also presented at test. A similar conclusion can be drawn from Nyberg et al. (2000), who measured activation in auditory brain regions during a memory test for pictures. They showed that these auditory regions were activated more by unimodal test pictures that had been studied with a sound than by pictures that had been studied without a sound. Responses were not more accurate, however, for items studied bimodally than for items studied unimodally. Thus, even though these results suggest that irrelevant sounds were encoded at study and activated during test, they did not support memory performance.
To conclude, our results show little benefit of multisensory study on memory. Previous findings of such a benefit may have been due to context-specific carry-over effects of attention allocation or to unequal distributions of items over conditions and responses. Our results showed a strong memory benefit due to study-test overlap, for both congruent and incongruent items, only when the exact multisensory item was repeated at test; this does not indicate a general multisensory integration benefit on memory performance.