Introduction

People often have to perform multiple tasks simultaneously. For example, drivers control their cars’ speed and direction while monitoring nearby cars and passengers and sometimes attending to their phones. Moreover, some information is received through hearing and touch, such as horn sounds and the vibrations caused by speed bumps. The human ability to multitask, however, has a capacity limitation, which constrains performance and may lead to serious consequences such as accidents.

One of the main paradigms for studying multitasking is the psychological refractory period (PRP). When two stimuli (S1 and S2) are presented with a short stimulus onset asynchrony (SOA) and both require fast responses (R1 and R2), the reaction time to the second stimulus (RT2) increases with decreasing SOA (Pashler, 1994).

Theories of the psychological refractory period (PRP)

The dominant account of the PRP effect is the response-selection bottleneck (RSB) model (Pashler, 1994; Welford, 1952). It divides the processing of a task into three stages: perceptual processing, response selection, and response execution; response selection for two tasks cannot be carried out simultaneously. According to the RSB model, the response time to the first target (RT1) does not depend on SOA, and RT2, plotted against SOA, decreases with a slope of −1 when SOA is shorter than RT1. Many studies have supported the RSB model (e.g., Ruthruff et al., 2003, 2009). Recently, it has been argued that the central bottleneck is located not in response selection but in the timing of response initiation (Klapp, Maslovat, & Jagacinski, 2019). Some PRP studies have found a backward crosstalk effect: performance on Task 1 was better when R2 was congruent with R1 (Hommel, 1998; Hommel & Eglau, 2002). The RSB model cannot explain the backward crosstalk effect, and updated versions of the RSB model have been proposed to account for it (Hommel, 1998; Janczyk, Renas, & Durst, 2018). Klapp et al.’s (2019) model also allows a backward crosstalk effect.
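To make this prediction concrete, consider a minimal formalization (our notation, not Pashler’s original): let \( A_i \), \( B_i \), and \( C_i \) denote the durations of the perceptual, response-selection, and motor stages of Task \( i \). Because response selection of Task 2 must wait until response selection of Task 1 has finished,

\[ \mathrm{RT1} = A_1 + B_1 + C_1, \qquad \mathrm{RT2} = \max(\mathrm{SOA} + A_2,\; A_1 + B_1) + B_2 + C_2 - \mathrm{SOA}. \]

For short SOAs the second term in the maximum dominates, so \( \mathrm{RT2} = A_1 + B_1 + B_2 + C_2 - \mathrm{SOA} \), a line with slope −1; for long SOAs, \( \mathrm{RT2} = A_2 + B_2 + C_2 \), independent of SOA.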

The central capacity-sharing (CCS) model (Tombu & Jolicoeur, 2003) introduced an additional parameter: the sharing proportion (SP), the proportion of central capacity allocated to Task 1. The CCS model assumes there is no bottleneck in processing, and that the central stages of two tasks can be carried out at the same time; the PRP effect is instead caused by a limited central capacity. Although the CCS model does not assume a bottleneck, it includes the RSB model as the special case in which SP equals one. When SP is less than one, the central capacity is shared between the two tasks. According to the CCS model, RT2, plotted against SOA, decreases with a slope of −1 when SOA is shorter than RT1; when SP is less than one, RT1 decreases as SOA increases, and a backward crosstalk effect is possible. Studies of the backward crosstalk effect support the CCS model over the RSB model (e.g., Huestegge & Koch, 2010; Stephan & Koch, 2010). Moreover, in most (67.1%) of the studies that have analyzed the effect of SOA on RT1, RT1 was not independent of SOA (Strobach et al., 2015).

The executive control of the theory of visual attention (ECTVA; Logan & Gordon, 2001) combines the theory of visual attention (Bundesen, 1990) and the exemplar-based random walk model (Nosofsky & Palmeri, 1997). It explains the PRP effect by running the theory of visual attention twice. ECTVA assumes that under short SOAs both tasks have access to the central capacity, but with different and dynamic priorities, and that the early processing of Task 2 is largely inhibited (e.g., reduced to 10%) after the response selection of Task 1. According to ECTVA, RT1 decreases as SOA increases, a backward crosstalk effect should be observed, and RT2, plotted against SOA, decreases with a slope even steeper than −1 when SOA is shorter than RT1. Figure 1 compares the three models.

Fig. 1

Response times for the two tasks as predicted by the response-selection bottleneck (RSB), central capacity-sharing (CCS), and executive control of the theory of visual attention (ECTVA) models. Response selection for Task 1 and Task 2 was assumed to take three units of central capacity each. In CCS and ECTVA, the priority of Task 1 was assumed to be twice that of Task 2 (represented by the height of the shared capacity), so SP is 0.67. Under a long SOA, all models make the same prediction. Under a short SOA (SOA < RT1), the RSB model predicts that RT1 is independent of SOA and that RT2 decreases with a slope of −1 as SOA increases. The CCS model predicts that RT1 decreases as SOA increases with a slope that depends on SP, and that RT2 decreases with a slope of −1 as SOA increases. ECTVA predicts that RT1 decreases as SOA increases with a slope that depends on the relative priority of Task 1 to Task 2, and that RT2 decreases with a slope steeper than −1 as SOA increases. The early processing of Task 2 (the gray rectangle) is largely lost after the response selection of Task 1
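For readers who want an executable illustration of Fig. 1, the sketch below simulates the three models with a simple time-stepped scheme. All parameter values (stage durations, SP = 0.67, ECTVA’s 90% loss of early Task 2 processing) are illustrative assumptions taken from the figure; the structure is ours, not any of the cited models’ published implementations.

```python
A1, B1, C1 = 1.0, 3.0, 1.0   # Task 1 perceptual, central, motor durations
A2, B2, C2 = 1.0, 3.0, 1.0   # Task 2 (arbitrary units, as in Fig. 1)

def simulate(soa, model, sp=0.67, loss=0.9, dt=0.001):
    """Return (RT1, RT2) under 'rsb', 'ccs', or 'ectva'."""
    t, prog1, prog2 = 0.0, 0.0, 0.0   # elapsed time, central-stage progress
    rt1 = None
    while prog2 < B2:
        t += dt
        active1 = t > A1 and prog1 < B1       # T1 response selection running
        active2 = t > soa + A2                # T2 perceptual stage finished
        if model == "rsb":                    # strict bottleneck: T2 waits
            share1 = 1.0 if active1 else 0.0
            share2 = 1.0 if (active2 and not active1) else 0.0
        else:                                 # CCS/ECTVA: capacity is shared
            share1 = sp if (active1 and active2) else float(active1)
            share2 = (1.0 - sp) if (active1 and active2) else float(active2)
        prog1 += share1 * dt
        prog2 += share2 * dt
        if rt1 is None and prog1 >= B1:
            rt1 = t + C1
            if model == "ectva":              # early T2 processing mostly lost
                prog2 *= (1.0 - loss)
    return rt1, t + C2 - soa                  # RT2 is measured from S2 onset

for model in ("rsb", "ccs", "ectva"):
    for soa in (0.5, 2.0, 8.0):
        rt1, rt2 = simulate(soa, model)
        print(f"{model:5s}  SOA={soa:4.1f}  RT1={rt1:.2f}  RT2={rt2:.2f}")
```

Running the sketch reproduces the qualitative pattern in Fig. 1: identical predictions at long SOAs; at short SOAs, a flat RT1 and a slope of −1 for RT2 under RSB, an SOA-dependent RT1 with a slope of −1 for RT2 under CCS, and a slope steeper than −1 for RT2 under ECTVA.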

Effect of modality

All the models mentioned above are independent of (input and output) modalities. However, some recent studies support the notion that dual-task performance depends on modalities (Huestegge & Hazeltine, 2011; Morrison, Burnham, & Morrison, 2015). Dual-task costs are affected by ideomotor compatibility, i.e., the similarity between the stimuli and the sensory feedback of the responses (Greenwald, 1972; Greenwald & Shulman, 1973; Lien, McCann, Ruthruff, & Proctor, 2005; Lien, Proctor, & Allen, 2002). The input-output modality pairing also affects dual-task performance (Hazeltine & Ruthruff, 2006; Hazeltine, Ruthruff, & Remington, 2006): visual-manual and auditory-vocal pairings generally result in smaller dual-task costs than visual-vocal and auditory-manual pairings. Moreover, the somatosensory system has some unique features compared with vision and audition: somatosensory signals travel through the spinal cord, and their processing in the brain is hemispherically asymmetric (Coghill, Gilron, & Iadarola, 2001; Symonds, Gordon, Bixby, & Mande, 2006; Vallar, 2007). This evidence suggests that the role of modalities in multitasking is worth exploring (Huestegge & Hazeltine, 2011).

However, the effect of input modalities on the PRP has not been fully explored: PRP studies using tactile stimuli are rare, compared with the abundance of visual and auditory PRP studies (e.g., Dux, Ivanoff, Asplund, & Marois, 2006; Hunt & Kingstone, 2004; Lien et al., 2002; Pashler, 1994; Sigman & Dehaene, 2008). Indeed, Huestegge and Hazeltine (2011) reviewed 53 dual-task studies published between 2006 and 2011 and found that only one involved touch. That study, presumably Collignon and De Volder (2009), found that blind participants responded faster than sighted participants in auditory-tactile tasks; it is not directly relevant to the PRP, however, because its SOA exceeded 1,500 ms. An earlier relevant study is that of Brebner (1977), who found a PRP effect with tactile stimulation on both hands (as cited in Pashler, 1994).

A recent study involving visual, auditory, and tactile stimuli found that the PRP effect existed for visual-visual, auditory-visual, and tactile-visual input modality combinations (Hibberd, Jamson, & Carsten, 2013): drivers first discriminated visual, auditory, or tactile stimuli as Task 1 and then, after a varying SOA, braked when the leading car’s brake lights turned red. Drivers performed the braking task more slowly as the SOA decreased, and this effect was not moderated by the input modality of Task 1. Still, the combination of audition and tactility has not been explored in previous PRP studies. Moreover, research on crossmodal dual-tasking has found asymmetrical influences between modalities (Arnell & Larson, 2002; Chun & Potter, 1995, Exp. 5; Soto-Faraco et al., 2002, Exp. 1). For instance, Arnell and Larson (2002) found that performing an auditory Task 1 affected the performance of a visual Task 2, but performing a visual Task 1 did not affect the performance of an auditory Task 2. Therefore, although the PRP has been studied with tactile-visual tasks (Hibberd et al., 2013), the combination of visual-tactile tasks still needs to be explored.

To sum up, input modality plays a key role in dual-task performance. However, few previous studies have employed tactile stimuli, none has combined audition with tactility, and none has covered visual-tactile tasks. This study fills these gaps by exploring and comparing the PRP effect in all the possible combinations of input modalities within one experiment. Based on previous studies, we predicted that RT2 in the auditory-visual, visual-auditory, and tactile-visual conditions would decrease with a slope of −1 as SOA increases. We made no predictions for RT2 in the other conditions, for RT1, or for accuracy, because of the lack of relevant results.

Method

Participants

Thirty participants (23 female), aged 18–30 years (M = 23.6, SD = 2.79 years), took part in the experiment. All had normal or corrected-to-normal vision and normal auditory and tactile senses. They were rewarded with 36–56 CNY (equivalent to US$5.18–8.06) according to their performance. This research was approved by the institutional review board at the Institute of Human Factors and Ergonomics, Tsinghua University. Informed consent was obtained from each participant.

We calculated the required sample size based on the effect size of SOA on RT2 in a previous study (Hibberd et al., 2013), using the F to \( \eta_p^2 \) transformation (Lakens, 2013) and G*Power (Faul, Erdfelder, Lang, & Buchner, 2007). Because the effect size was rather large, \( \eta_p^2 \) = .55, only three to 11 participants were needed to achieve a power of .95. However, to be conservative and to counterbalance the order of input modalities, we recruited 30 participants.
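The conversion can be sketched in a few lines of Python. The conversion of \( \eta_p^2 \) to Cohen’s f follows Lakens (2013); the power routine below is a between-subjects approximation from statsmodels, whereas the reported numbers came from G*Power’s repeated-measures procedure, so the exact output will differ.

```python
from math import sqrt
from statsmodels.stats.power import FTestAnovaPower

# Convert partial eta squared to Cohen's f (Lakens, 2013).
eta_p2 = 0.55
f = sqrt(eta_p2 / (1.0 - eta_p2))   # ~1.11, a very large effect

# Approximate total N for a one-way ANOVA with 5 SOA levels
# (between-subjects bound; a repeated-measures design needs fewer).
n_total = FTestAnovaPower().solve_power(effect_size=f, alpha=0.05,
                                        power=0.95, k_groups=5)
print(round(f, 2), round(n_total))
```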

Apparatus and stimuli

We used an HTC Vive Pro to present visual stimuli in its head-mounted display (HMD). Tactile stimuli were presented as vibrations of the Vive controllers. Auditory stimuli were presented through a pair of noise-cancelling earphones connected to the HMD. Each stimulus had one of two magnitude levels: visual stimuli were white circles on a black background with areas in a 16:100 ratio; auditory stimuli were square-wave sounds with volumes in a 16:100 ratio; and tactile stimuli were vibrations with magnitudes in a 16:100 ratio.

Design

This study used a 6 (modality: visual-auditory, auditory-visual, visual-tactile, tactile-visual, auditory-tactile, and tactile-auditory) × 5 (SOA: 75, 150, 300, 600, and 1,200 ms) within-subjects design. Modality, i.e., the combination of input modalities, was counterbalanced between subjects. The SOA for each trial was randomly selected from the five levels with equal probability. RT and accuracy data for both tasks were analyzed, as advocated by Strobach et al. (2015).
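As an illustration of the design, the sketch below generates one participant’s schedule: block order rotated across participants (a simple rotation scheme, assumed here for illustration; the exact counterbalancing scheme is not reported) and a uniformly random SOA per trial.

```python
import random

MODALITIES = ["VA", "AV", "VT", "TV", "AT", "TA"]
SOAS_MS = [75, 150, 300, 600, 1200]
TRIALS_PER_BLOCK = 50  # plus 10 training trials, omitted here

def schedule(participant_id: int):
    # Rotate block order across participants (illustrative scheme).
    k = participant_id % len(MODALITIES)
    blocks = MODALITIES[k:] + MODALITIES[:k]
    return [(m, [random.choice(SOAS_MS) for _ in range(TRIALS_PER_BLOCK)])
            for m in blocks]

for modality, soas in schedule(participant_id=7):
    print(modality, soas[:5], "...")
```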

Task and procedure

Information about the participants’ gender and age was collected during online recruitment. Each participant arrived at the laboratory individually, sat on a chair, put on the HMD and earphones, held one controller in each hand, and rested their elbows on the chair arms. Participants were familiarized with the visual, auditory, and tactile stimuli until they could easily distinguish the small and large stimuli in each modality. The experiment consisted of six blocks, corresponding to the six modality conditions, and block order was counterbalanced between subjects.

Within each block, participants completed ten training trials and then 50 test trials. In each trial, participants first saw a fixation cross at the center of the panel. The cross disappeared after 500 ms, and the first stimulus appeared and stayed for 100 ms. After a random SOA, the second stimulus appeared and stayed for 100 ms. Participants could respond to each stimulus as soon as it appeared. Trials were terminated by the participant’s response to the second stimulus or after a maximum response window of 3 s. Subsequently, a blank screen was shown during an inter-trial interval of 1 s.

The participants’ task was to judge the magnitude (small or large) of the two stimuli as accurately and as quickly as possible, with the order of the manual responses matching the order of stimulus onset. To respond, participants pointed the virtual laser from the controller at a button on the virtual panel and pulled the trigger on the controller. RT was defined as the time from stimulus onset to the click. The action was very similar to using a mouse: move and click. Participants were explicitly instructed to follow these requirements: use the left hand for Task 1 and the right hand for Task 2; do not wait until the second stimulus appears to respond to the first one; and always prioritize Task 1. Figure 2 presents the interface of the experiment.

Fig. 2

The experiment interfaces. The “small” and “large” buttons were labeled in Chinese in the experiment. T1 indicates Task 1; T2 indicates Task 2

Results

We report the performance of Task 2 and then Task 1. No trials were excluded; analyzing only trials with correct responses did not change the results (see Fig. 4). The Greenhouse-Geisser correction was used wherever the assumption of sphericity was violated. The data set is available on the Open Science Framework (Rau & Zheng, 2019).

Performance of Task 2

Table 1 presents the means and SDs of the RT and accuracy of Task 2. We conducted two 6 (modality) × 5 (SOA) repeated-measures analyses of variance (RM-ANOVAs), one on the RT and one on the accuracy of Task 2. Table 2 presents the results. Figure 3 depicts the RT2 pattern.
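An analysis of this form can be reproduced with, e.g., pingouin in Python. The sketch below assumes a long-format data frame with one row per participant × modality × SOA cell mean; the file name and column names are our assumptions, not the study’s materials.

```python
import pandas as pd
import pingouin as pg  # any RM-ANOVA routine would work equally well

# Assumed long format: one cell mean per subject x modality x SOA.
df = pd.read_csv("prp_task2.csv")  # columns: subject, modality, soa, rt2, acc2

for dv in ("rt2", "acc2"):
    aov = pg.rm_anova(data=df, dv=dv, within=["modality", "soa"],
                      subject="subject")
    print(aov[["Source", "F", "p-unc", "ng2"]])
```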

Table 1 Mean (SD) RT and accuracy of Task 2
Table 2 Effect of modality, SOA, and their interaction on RT and accuracy of Task 2
Fig. 3

Response time of Task 2 as a function of stimulus onset asynchrony (SOA, in ms) by modality. Error bars represent 95% CIs. AT auditory-tactile, AV auditory-visual, TA tactile-auditory, TV tactile-visual, VA visual-auditory, VT visual-tactile

The effect of SOA on RT2 was significant: RT2 decreased as the SOA increased, first with a slope of about −1 when SOA was less than 500 ms, and then with a much flatter slope. We also ran six separate RM-ANOVAs on the effect of SOA on RT2 within each modality condition. Table 3 shows that the effect of SOA on RT2 was significant in every condition, with large effect sizes and high power. Therefore, the PRP effect existed in all six modality combinations.

Table 3 Effect of SOA on RT and accuracy of Task 2 in different modalities

Significant effects are shown in bold

RT reaction time, AT auditory-tactile, AV auditory-visual, TA tactile-auditory, TV tactile-visual, VA visual-auditory, VT visual-tactile

The main effect of modality was significant: generally, the auditory-tactile task had the longest RT2, and the tactile-visual task had the shortest. Multiple comparisons with Bonferroni correction (see Table 4) revealed that RT2 of the auditory-tactile task was longer than that of any other modality; RT2 of the tactile-auditory task was longer than that of the visual-auditory, tactile-visual, and visual-tactile tasks; RT2 of the auditory-visual task was longer than that of the visual-auditory task; and the other comparisons were nonsignificant. The differences in RT2 between modalities could be partly ascribed to task difficulty: the auditory task was the most difficult, followed by the tactile task, and the visual task was the least difficult (see the next section for details).

Table 4 Difference in RT2 (ms) between modalities with Bonferroni correction
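The comparisons in Table 4 correspond to Bonferroni-corrected pairwise tests of the following kind (same assumed data frame as above; pairwise_tests was named pairwise_ttests in older pingouin versions):

```python
import pandas as pd
import pingouin as pg

df = pd.read_csv("prp_task2.csv")  # assumed columns: subject, modality, soa, rt2
post = pg.pairwise_tests(data=df, dv="rt2", within="modality",
                         subject="subject", padjust="bonf")
print(post[["A", "B", "T", "p-corr"]])
```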

Regarding the interaction, the differences in RT2 between modalities (mostly between the auditory-tactile task and the others) gradually became smaller as the SOA increased (see Fig. 3). At short SOAs, the features of Target 1 had more influence on RT2; at longer SOAs, the features of Target 2 had more influence on RT2. We noticed that the slope for the auditory-tactile task seemed steeper than in the other conditions. A one-sample t-test showed that the slope of RT2 against SOA for the auditory-tactile task at SOAs below 1,000 ms (M = −1.22, 95% CI [−1.39, −1.05]) was steeper than −1, t(29) = −2.599, p = .015. This is predicted only by ECTVA.
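This test can be reproduced by fitting one RT2-on-SOA regression per participant and then testing the slopes against −1; a minimal sketch with assumed column names:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("prp_task2.csv")   # assumed columns: subject, modality, soa, rt2
at = df[(df["modality"] == "AT") & (df["soa"] < 1000)]

# One slope per participant, then a one-sample t-test against -1.
slopes = (at.groupby("subject")
            .apply(lambda g: stats.linregress(g["soa"], g["rt2"]).slope))
t, p = stats.ttest_1samp(slopes, popmean=-1.0)
print(slopes.mean(), t, p)
```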

The effects of SOA, modality, and their interaction on the accuracy of Task 2 were all significant. Further analysis revealed that the accuracy of Task 2 was affected by SOA only in the auditory-tactile condition (see Table 3): accuracy generally increased from 83% to 95% as the SOA increased (see Table 1). In the other modalities, accuracy stayed above 95%. This pattern is very similar to the attentional blink effect, which will be discussed later.

We carried out an additional analysis to explore whether the relatively worse performance in the auditory-tactile condition (accuracy of Task 2 depending on SOA, and an RT2 slope steeper than −1) was caused by a forward crosstalk effect. Table 9 presents the method and results. Neither the RT nor the accuracy of Task 2 was affected by the congruence between the two tasks or by its interaction with SOA. Thus, the relatively worse performance in the auditory-tactile condition could not be attributed to a forward crosstalk effect.
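A sketch of such a congruence analysis: a trial is coded congruent when S1 and S2 have the same magnitude, cell means are computed, and congruence is crossed with SOA in an RM-ANOVA. The trial-level file and column names are our assumptions; the actual analysis is reported in Table 9.

```python
import pandas as pd
import pingouin as pg

trials = pd.read_csv("prp_trials.csv")  # assumed: subject, modality, soa, s1_mag, s2_mag, rt2
trials["congruent"] = trials["s1_mag"] == trials["s2_mag"]

at = trials[trials["modality"] == "AT"]
cells = at.groupby(["subject", "congruent", "soa"], as_index=False)["rt2"].mean()
aov = pg.rm_anova(data=cells, dv="rt2", within=["congruent", "soa"],
                  subject="subject")
print(aov[["Source", "F", "p-unc"]])
```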

Performance of Task 1

Tables 5, 6, and 7 in the Appendix present the descriptive statistics and the ANOVA results for Task 1 performance. As Table 7 illustrates, the effect of SOA on RT1 was significant in the tactile-visual and auditory-tactile conditions: RT1 generally decreased as the SOA increased. The accuracy of Task 1 was not affected by SOA in any modality and stayed above 94%.

To explore whether task difficulty differed between modalities, we conducted a repeated-measures ANOVA of the effect of S1 modality on Task 1 performance (see Table 8). Accuracy in the visual task was higher than in the auditory and tactile tasks, but the latter two did not differ from each other (p = .141). RT in the visual task was shorter than in the auditory and tactile tasks, and RT in the tactile task was shorter than in the auditory task. Therefore, the auditory task was the most difficult, the visual task was the easiest, and the tactile task fell in between.

We carried out an additional analysis to explore whether the effect of SOA on RT1 in the auditory-tactile and tactile-visual conditions was caused by a backward crosstalk effect. Table 9 presents the method and results. In the auditory-tactile condition, the effect of congruence on the accuracy of Task 1 and the interactions between congruence and SOA on the RT and accuracy of Task 1 were significant. No significant effect was found in the tactile-visual condition. In other words, we observed a backward crosstalk effect in the auditory-tactile but not the tactile-visual condition.

Discussion

In this study, the PRP effect, i.e., the effect of SOA on RT2, existed in all the crossmodal combinations of vision, audition, and tactility. This is consistent with previous studies (Brebner, 1977; Hibberd et al., 2013). Only ECTVA could predict all the results in the auditory-tactile condition: RT1 increased, and the accuracy of Task 2 decreased, with decreasing SOA; the slope of RT2 against SOA was steeper than −1; and a significant backward crosstalk effect was observed. In the tactile-visual condition, RT1 increased with decreasing SOA, which could be explained by the CCS model. The results for the other combinations of modalities could be explained by both the RSB and the CCS models.

The pattern of Task 2 accuracy in the auditory-tactile condition resembles the attentional blink effect: accuracy on the second target decreases as the SOA between the two stimuli decreases (Raymond, Shapiro, & Arnell, 1992). The PRP and attentional blink effects have been found at the same time when RT1 is extremely long (Marti, Sigman, & Dehaene, 2012). The typical attentional blink paradigm employs rapid serial visual presentation and backward masking of the second target (Giesbrecht & Di Lollo, 1998; but see Jannati, Spalek, & Di Lollo, 2011). In this study, however, we found an attentional blink effect in the auditory-tactile condition with only two stimuli and no backward masking. This could be partly explained by the higher difficulty, and thus longer RT, of the auditory task (see Table 8), which led to the loss of S2 information from working memory. Furthermore, to explain the difference between the auditory-tactile and auditory-visual conditions, we need to refer to some modality features: visual information can linger for a while after the stimulus disappears, in the form of an afterimage and the visuospatial sketchpad; there is no such counterpart for tactile stimuli, which might therefore be more easily lost. In another study (Rau, Zheng, Wang, Zhao, & Wang, in press), we found that performing auditory-tactile tasks generally resulted in higher accuracy than performing purely tactile tasks. Combining the results of the two studies, we speculate that dual-task accuracy is highest when the tasks are distributed across the visual and (then) tactile modalities, followed by auditory-tactile tasks, with purely tactile tasks last.

In both the auditory-tactile and the tactile-visual conditions, RT1 also decreased as SOA increased. This violates the prediction of the RSB model but can be explained by both the CCS model and ECTVA: a portion of the central capacity is allocated to Task 2 when the SOA is short, so less capacity is left for Task 1 and RT1 becomes longer. This was further supported by the backward crosstalk effect in the auditory-tactile (but not the tactile-visual) condition: early processing of the response selection of Task 2 affected the response selection of Task 1. To explain the difference between the auditory-tactile and tactile-visual conditions, we again refer to modality features: somatosensory processing relies on (the postcentral gyrus of) the right hemisphere more than the left (Coghill et al., 2001; Symonds et al., 2006; Vallar, 2007). The response execution of Task 1, performed with the left hand, was controlled by the precentral gyrus of the right hemisphere, so R1 and the processing of a tactile Task 2 were cortically adjacent. Another explanation is the similarity between auditory and tactile stimuli: auditory stimuli, i.e., sound waves, are physical vibrations, and vibrations inevitably emit noise (which is why noise-cancelling earphones were used).

As mentioned above, in the auditory-tactile condition the slope of RT2 against SOA was steeper than −1, which was predicted only by ECTVA: the early processing of Task 2 was largely lost and had to restart almost from zero after the response selection of Task 1. Neither the RSB model nor the CCS model can explain this result.

Limitations and directions for future studies

The responses in this study, i.e., moving and then clicking, were more complex than the key presses used in most RT experiments. The RT included both the time to move from the home position to the target and the time to pull the trigger, which made RTs longer overall. Besides, performing complex manual responses with both hands may lead to bimanual interference (Scalf, Banich, Kramer, Narechania, & Simon, 2007; but see Swinnen & Wenderoth, 2004), and this interference might also affect RT. However, bimanual interference is not moderated by SOA (Huestegge & Koch, 2010) and should be independent of stimulus modality. Moreover, as Fig. 3 suggests, the RT measurement issue did not obscure the PRP effect. Still, we recommend that future studies use simple responses such as key presses.

The two tasks in this study both required semantic processing, and the dual-task costs may partly arise from this interference (Wickens, 2002; Wu & Liu, 2008). Future studies could add task type as a factor (Wahn & König, 2017). The PRP effect becomes weaker when the two responses are of different modalities (Klapp et al., 2019). Future studies should explore the crossmodal PRP with better control of input-output modality pairings (Hazeltine et al., 2006), using vocal responses (Hazeltine & Ruthruff, 2006) and even eye movements (Pashler, 1993) besides manual responses, though they will also need to consider the RT issue mentioned above.

Comparisons of RT2 between the visual-tactile and visual-auditory conditions and between the auditory-tactile and auditory-visual conditions did not support the assumed compatibility between haptic input and manual output. Whether there is an output modality compatible with haptic input is left for future studies to explore. There are still possible ideomotor compatibilities within this pairing, e.g., pressing a touchscreen harder when a strong vibration is felt; this could easily be tested with current smartphones (e.g., the iPhone X). Task difficulty across modalities was not matched in this study, which might have complicated the results. Future multisensory studies should consider matching task difficulty before the experiment (see Riggs & Sarter, 2019).

The PRP effect exists in all the modality combinations. The results in the auditory-tactile condition are best explained by ECTVA; the results in the tactile-visual condition are best explained by the CCS model; the results of the other conditions can be explained by both the RSB and the CCS models. The processing of visual, auditory, and tactile information is thus limited by the central capacity. However, the differences between modalities cannot be explained by any of these modality-independent theories, though they can be explained by modality features to some extent. New theories incorporating input (and perhaps output) modalities are needed to account for these (and future) results.