Introduction

In the numerical version of the Stroop paradigm, participants compare simultaneously presented Arabic digits based either on their numerical (numerical task) or physical (physical task) dimension (Besner and Coltheart 1979; Henik and Tzelgov 1982). Assessing the physical features of digits is significantly faster than assessing their value (e.g. Kadosh et al. 2007; Szucs and Soltész 2007). Regardless of the task instructions, an incongruent trial consists of a pair of digits in which the numerically larger digit has a smaller font size (Fig. 1). Although size and numerical information are processed automatically and in parallel, they interact and evoke response competition (Henik and Tzelgov 1982; Kadosh et al. 2007; Szucs and Soltész 2007; Szűcs and Soltész 2008; Santens and Verguts 2011). As with other conflict tasks, this interference manifests in slower reaction times (RTs) and error rates compared to neutral and congruent trials. While there is a large body of work on number–size interaction, very little is known about the origins of errors in numerical and physical comparisons. Notably, none of the aforementioned studies reported the RTs of erroneous responses.

Fig. 1
figure 1

Scheme of the numerical Stroop paradigm

As is commonly observed in conflict tasks, RTs of errors are significantly shorter than RTs of hits (e.g. Gratton et al. 1988; Falkenstein et al. 1991; White et al. 2011). A simple explanation of this phenomenon is that an erroneous response is executed before the stimulus is fully processed (Pailing and Segalowitz 2004; Danielmeier et al. 2009). An elucidation was provided by studies using electromyography and lateralized readiness potentials (Gratton et al. 1988; Szűcs et al. 2009). According to them, early processing of task-irrelevant stimulus information overwhelms the relevant feature of the target and triggers activation of the incorrect motor response, resulting in a high likelihood of error commission. If this response competition is reduced and overcome, a delayed correct reaction is executed. Considering that incongruent trials in both numerical and physical Stroop tasks elicit such interference (Kadosh et al. 2007; Szucs and Soltész 2007), one could speculate that errors in both tasks have shorter RTs than hits. As will be discussed later, it is not the case.

Computational and neuroimaging studies have provided substantial insights into numerical cognition. In general, number processing consists of two steps: the transformation of an external sensory number input to an internal number representation, and the transformation of a numerical representation to a task-relevant decision before the execution of a motor response (Verguts and Fias 2008). According to functional magnetic resonance imaging (fMRI) studies, perceptual encoding into a numerical representation occurs in the inferior temporal gyrus, which selectively responds to written Arabic numerals (Shum et al. 2013; Amalric and Dehaene 2016). The intraparietal sulcus (IPS) acts as a cortical hub to extract quantitative information to estimate magnitude, space, or time (Hubbard et al. 2005; Nieder and Dehaene 2009), all of which are necessary when making decisions. On the other hand, electroencephalography (EEG) studies have contributed a comprehensive overview of the precise timing of each processing stage. The numerical representation is evaluated until 200 ms after the presentation of digits, as marked by early temporo-occipital P1/N1 visual potentials (Pinel et al. 2001; Szũcs et al. 2007). Number–size interference can be observed later (~ 450 ms after stimulus onset) within the centro-parietal positivity (CPP), also referred to as P3b (Kadosh et al. 2007; Szucs and Soltész 2007; Szűcs and Soltész 2008, 2012; Beldzik et al. 2015a).

In our previous study, in which only the numerical task was implemented (Beldzik et al. 2015a), we applied independent component analysis (ICA) to EEG data and successfully isolated several functionally distinct neural sources including two lateral temporo-occipital components that showed early visual P1/N1 potentials, a fronto-central component explaining all variance of error-related negativity, and a centro-parietal component showing pronounced CPP. The analysis of response-aligned event-related potentials (ERPs) revealed that the CPP increases linearly before the response until it reaches its maximum just after response onset. Moreover, the slope of this increase was sensitive to all experimental conditions that vary in RT.

This effect can be attributed to the recently established decision variable (DV) signal. DV was first introduced by sequential sampling models that focused on explaining two-choice perceptual decisions (Smith and Ratcliff 2004; Bogacz et al. 2006). According to the models, noisy sensory evidence is accumulated in favour of a particular outcome upon reaching an action-triggering threshold. DV represents the accrual of priors, evidence, and value into a quantity. The model has proven extremely successful at accounting for performance on a variety of different cognitive tasks (Ratcliff and McKoon 2008). It was soon employed by neurobiologists and supported by empirical data from single-cell recordings in monkeys (Platt and Glimcher 1999; Gold and Shadlen 2007). According to these studies, the build-to-threshold DV dynamics are represented by neurons in the lateral part of IPS. The DV signal was later isolated in human EEG studies (O’Connell et al. 2012). The authors showed that DV is reflected by the build-up rate of the response-aligned CPP as it exhibited every aspect of the dynamics observed in single-neuron counterparts. It increases steadily with incoming evidence, peaks at the time of response execution, scales with target detection difficulty, and accounts for trial-to-trial variance in RT. Recently, Twomey et al. (2015) indicated that the commonly observed centro-parietal P3b potential exhibits the same accumulation characteristics as CPP, thus it encodes the DV signal. The authors pointed out that these effects observed at the slope of response-aligned P3b have been overlooked in previous EEG studies, possibly due to the spatiotemporal overlap with distinct frontal potentials. Indeed, as demonstrated by our previous study (Beldzik et al. 2015a), the fronto-central and centro-parietal independent sources explained over 80% of variance of the P3b potential. Once separated using ICA, the effects on the slope of CPP revealed the classic build-to-threshold dynamic.

In this study, which was a part of broader research project, numerical and physical Stroop tasks were implemented in separate experiments using EEG and fMRI. Both experimental sessions revealed an interesting and novel behavioural pattern of RT measurements. That is, errors committed in the numerical task were, as expected, faster than correct responses, whereas errors in the physical task showed the opposite pattern of slower RTs than hits. Thus, the goal of this study was to reveal the neural underpinnings of erroneous responses in the numerical Stroop paradigm using the temporal and spatial advantages of EEG and fMRI, respectively. Group ICA was applied to both EEG and fMRI data to separate EEG independent sources and identify the neural networks involved in the task. The unique pattern of RT measurements was used as a marker when exploring the time-courses of independent components. Based on the results from our previous study (Beldzik et al. 2015a), we hypothesise that the response-locked CPP is sensitive to the task and accuracy interactions. Particularly, we expect to find greater CPP amplitude before the response and smaller linear slope coefficient. Considering the compelling involvement of the IPS in numerical cognition (Nieder and Dehaene 2009) and its alleged locus of DV signal (Gold and Shadlen 2007), we hypothesise that this brain structure shows greater involvement in the longer taking numerical and physical decisions, regardless of its accuracy.

Materials and methods

Participants

Thirty-four participants (mean age of 23.4 ± 2.4 years, 17 females) met the inclusion criteria for MRI scanning and the experiment requirements: right-handedness, normal or corrected-to-normal vision, no physical and psychiatric disorders, being drug-free. They were informed about the procedure and goals of the study and gave their written consent. The study was approved by the Bioethics Commission at Jagiellonian University.

Experimental task

The numerical version of the Stroop paradigm, also known as the size congruity paradigm (Besner and Coltheart 1979), was prepared and generated using E-Prime 2.0 (©Psychology Software Tools). In the EEG experiment, the task was presented on a 17-inch screen (luminance 300 cd/m2, brightness level 70%) located approximately 80 cm from the eyes of participants. In the fMRI experiment, a mirror was placed on the head coil so participants could see the 32-inch screen (luminance 400 cd/m2, brightness level 50%) situated behind the scanner approximately 100 cm from the mirror. In the former experiment, a keypad was provided in front of the participants, whilst in the latter experiment two keypads were placed in the participants’ hands.

In the task, the stimuli were a pair of Arabic digits (from 1 to 9) of varying size (from 20 to 52 pt in steps of 4 pt) in a black Arial font on a light grey background (Fig. 1). In the numerical task, subjects were instructed to press a button with the left/right index finger if the digit on the left/right side had higher magnitude. In the physical task, they were instructed to indicate the digit with the greater font size. There were three congruence conditions: (1) congruent, when the numerically larger digit was physically larger; (2) neutral, when the two digits were of the same physical size (numerical task) or the same two digits had a different size (physical task); (3) incongruent, when the numerically larger digit was physically smaller. One block comprised 160 trials, which were equally distributed between congruence condition as well as left/right correct responses. Stimuli were presented in a pseudorandom order to avoid carryover effects. At the beginning of each block, the instruction “numerical task” was presented. After 80 trials, the instruction “physical task” informed participants about the change of task. There were four blocks, each differing in the type of feedback provided after the response (words ‘Good!’, ‘Bad!’, ‘You win!’, ‘You lose!’ after correct or erroneous responses, respectively). However, feedback types were not the concern of this study, thus blocks were treated as equivalent sessions.

The stimulus was preceded by a blank screen for 300 ms. The stimulus then appeared for 350 ms and was followed by a blank screen during which the response was registered for 200 ms. Next, feedback or a black screen was presented for 900 ms and was followed by the fixation point (a hash symbol, #). The only difference in the task protocols between the EEG and fMRI experiments was the time intervals between the events. In the EEG experiment, the stimulus-feedback interval was on average 1000 ms (varying between 800 and 1200 ms), whereas the interval between feedback and the next stimulus was on average 1600 ms (varying between 1100 and 2100 ms). Together, there were 4150 ms on average between trials. In the fMRI experiment, the stimulus–feedback interval was set to a constant 250 ms, whereas the interval between feedback and next stimulus was on average 3400 ms (varying between 2000 and 4800 ms). Together, there were 5200 ms on average between trials.

Behavioural data analysis

The same behavioural analysis was applied to data from both experiments. Accuracy and reaction times (RT) were analysed in Statistica 12.0 (StatSoft Inc., OK, US). These measures were subjected to a 2 × 3 way repeated measure ANOVA with task and congruency conditions. To ensure a clear comparison between correct and erroneous responses, all further time course analyses were performed solely on incongruent trials. RTs for correct and erroneous responses in both tasks underwent a 2 × 2 repeated measures ANOVA.

EEG data acquisition and analysis

Dense-array EEG data (HydroCel Geodesic Sensor Net, EGI System 300; Electrical Geodesic Inc., OR, USA) was collected from 256 channels (band-pass filtered at 0.01–100 Hz with a vertex electrode as a reference) at a sampling rate of 250 Hz and recorded with Net Station Software (Version 4.5.1, Electrical Geodesic Inc., OR, USA). Further offline data analysis was conducted with the open-source EEGLAB toolbox (Delorme and Makeig 2004). Data were digitally filtered to remove frequencies below 0.5 Hz and above 35 Hz. Average reference was recomputed and bad channels were automatically removed by kurtosis measures with a threshold value of 20 standard deviations. The choice of the threshold was adjusted based on our previous study (Beldzik et al. 2015a) to ensure that only true outliers (usually 5% of channels) were automatically removed; for instance, electrodes not adhering to the scalp. Next, continuous data were visually inspected to remove channels or time epochs containing high amplitude, high-frequency muscle noise and other irregular artefacts. The removed channels were interpolated.

Independent component analysis at the individual level was used to remove artefacts from the data. Due to the large number of channels, decomposition of EEG data with the Infomax algorithm was preceded by Principle Component Analysis. Fifty ICs were obtained and visually inspected for each subject. According to the characteristic spatiotemporal pattern (Jung et al. 2000; Delorme and Makeig 2004), components recognized as blinks, heart rate, saccades, muscle artefacts, or bad channels were removed. Next, epochs were extracted from − 400 to 1200 ms with regard to stimulus presentation and were baseline corrected from − 100 to 0 ms. Thus, both stimulus-locked and response-locked ERPs had the same prestimulus baseline. This choice was determined by the fact that the responses in the numerical Stroop task occur within the same time range (300–700 ms) as the effects on ERP amplitudes (Szűcs and Soltész 2012; Beldzik et al. 2015a). In such cases, a pre-response baseline could considerably reduce the effects.

To obtain reliable and stable group ICA decomposition, all incongruent trials entered the group ICA using the EEGIFT toolbox (Eichele et al. 2011). Since the toolbox requires the same number of trials for each subject, we chose to remove the rearmost correct trials (as they were substantially more numbered than incorrect trials), so that the final trial count would be evened out to the lowest number in the sample (i.e. 191). The number of independent sources to extract was set to seven based on our previous study with the numerical task (Beldzik et al. 2015a). The stability of ICA decomposition was validated with ICASSO (Himberg et al. 2004), with 50 random initializations of the Infomax algorithm. Data were back-reconstructed using the moo-icar option, i.e. the multivariate objective optimization ICA with reference (Du and Fan, 2013). Since the reconstruction adds some noise to component activity, a 35 Hz low pass filter was reapplied to the time courses with the use of the EEGLab iirfilt() function. To estimate the location in the brain of each independent source, the scalp maps were imported to BESA 5.3 (MEGIS Software GmbH, Gräfelfing, Germany) and underwent Classical LORETA Analysis Recursively Applied (CLARA).

Component ERPs were extracted from − 100 to 700 ms with regard to stimulus presentation and from − 400 to 400 ms with regard to reaction occurrence. ERPs were obtained for 2 × 2 (accuracy × task) conditions to search for a neural measure that accounted for the behavioural pattern found here. Thus, besides a significant interaction, a significant post hoc test between correct and erroneous responses within each task was required. A repeated measures ANOVA was applied to point-by-point amplitudes and peak latencies within the time range of recognized ERP components. Specifically, the N1 was verified within windows ranging from 0 to 300 ms with regard to stimulus onset. The error-related negativity (ERN) was 0–150 ms with regard to response onset. Finally, the CPP was verified in the stimulus-locked window from 200 to 600 ms and the response-locked window from − 300 to 100 ms. To protect against Type-I errors, multiple time point comparisons were corrected with the false discovery rate (FDR) method.

To establish the build-up rate of CPP, which represents the accumulation, we measured the temporal slope of each condition in each subject’s response-locked waveforms (Murphy et al. 2015). Build-up rate was computed as the slope of a straight line fitted to the signal within the temporal window of extending length. That is, fitting was conducted in a stepwise fashion for time windows, which started at − 100, − 104, − 108,…, until − 400 ms and ended at the response onset. The coefficient of determinant, R 2, was obtained for each fit and averaged across subjects. Since R 2 represents the goodness of the fit, this way (1) the linearity of the slope can be verified (2) the start of the CPP build-up can be establish for each condition using maximal R 2 value measured among extending temporal windows. Thus, maximal R 2 value defined the final length of the window for each condition and final fit to the statistical testing. Both the slope coefficients and constant terms of each fit underwent statistical testing using a 2 × 2 repeated measures ANOVA.

fMRI data acquisition and analysis

Magnetic resonance imaging (MRI) was performed using a 3 T scanner (Magnetom Skyra, Siemens) with a 20-channel head/neck coil. High-resolution, whole-brain anatomical images were acquired using a T1-MPRAGE sequence. A total of 176 sagittal slices were obtained (voxel size 1 × 1 × 1.1 mm3; TR = 2300 ms, TE = 2.98 ms, flip angle = 9°) for coregistration with the fMRI data. Functional T2*-weighted images were acquired using a whole-brain echo planar pulse sequence (EPI) with the following parameters: 3 mm isotropic voxel, TR = 2000 ms, TE = 30 ms, flip angle = 90°, FOV 192 × 192 mm2, GRAPPA acceleration factor 2, phase encoding A/P. The whole-brain image (cerebellum excluded) was covered with 35 axial slices, taken in an interleaved, ascending fashion. The acquisition time for each session was 15 min. Due to magnetic saturation effects, the first four volumes (dummy scans) of each session were discarded instantly.

The standard pre-processing procedure was applied using Analysis of Functional NeuroImage (AFNI) software (Cox 1996). Each 3D image was time-shifted so that the slices were aligned temporally. After head motion correction, the functional EPI data sets were coregistered to structural scans and smoothed using a full-width at a half maximum isotropic Gaussian kernel of 4 mm. Voxels with low-signal intensity located outside the brain were excluded from the functional images by a clipping function. During the scaling procedure, the percent signal change was calculated for each voxel. Anatomical and functional images were transformed into a coordinate system of Talairach space (Talairach and Tournoux 1988).

Next, group ICA was conducted using the GIFT toolbox (Calhoun et al. 2001). The estimation determining the number of components was performed using the minimum description length (MDL) criteria implemented in the software (Li et al. 2007). The stability of ICA decomposition was validated with ICASSO (Himberg et al. 2004), with 50 random initializations of the Infomax algorithm. In the case of insufficient stability scores, ICA was performed with the numbers neighbouring the number estimated with the MDL criteria. Data were back-reconstructed using the gica template option. The group average maps were inspected to identify and discard those ICs primarily associated with artefacts that represented signals from large vessels, ventricles, motion, and susceptibility. From the remaining ICs, time-courses were interpolated to 100 ms resolution and epochs were extracted − 1 to 12 s prior to the stimulus. These epochs were baseline corrected by subtracting the value at zero. Epochs were divided into 2 × 2 (accuracy × task) conditions and subjected to statistical testing. Specifically, values at the peak of hemodynamic responses underwent a 2 × 2 repeated measures ANOVA for each neural network. As with EEG data analyses, besides the significant interaction effect, a significant post hoc test between correct and erroneous responses within each task was required.

Results

Behavioural results

In general, participants committed 7.8% errors and 0.5% omissions in the EEG experiment, and 11.4% errors and 5.3% omissions in the fMRI experiment. The main effect of the congruence condition on both error rates and RTs confirmed the occurrence of conflict (Table 1). Since most of the errors were committed on incongruent trials, it was possible to base further behavioural, EEG, and fMRI data analyses solely on incongruent trials. The numerical task evoked more errors than the physical task (Table 2). RTs of correct and erroneous responses showed a significant interaction in both experiments (Fig. 2; EEG: F (1,33) = 161; p < 0.001; fMRI: F (1,33) = 295, p < 0.001). Particularly, errors in the physical task had longer RTs than correct trials, whereas errors in the numerical task showed the opposite pattern (Table 3). Moreover, significant and opposite pattern in RTs was observed when comparing hits to errors within each task.

Table 1 Behavioural results—all trials
Table 2 Error rate of incongruent trials (% of all)
Fig. 2
figure 2

Reaction time of incongruent trials in a EEG and b fMRI experiments. Asterisk indicates Tukey’s post hoc p < 0.001. Num numerical task, Phys physical task

Table 3 Reaction time of incongruent trials in ms

EEG results

Seven independent sources were reliably obtained (component stability was above 0.98) with the group ICA. Each of the estimated EEG sources can be characterized by its topography (Fig. 3a), the constellation of underlying brain sources (Fig. 3b), and event-related time courses (Fig. 3c). Local maxima of independent component source reconstructions are listed in Table 4. Three of the components had clearly identifiable ERPs and matched the independent components obtained previously (Beldzik et al. 2015a). The fronto-central component with the source estimated in the medial frontal cortex showed early (~ 100 ms after the stimulus onset) negativity, which can be linked to the visual anterior N1 potential (Vogel and Luck 2000). The activity of this component in response-locked data revealed pronounced ERN and followed it positivity (Holroyd and Coles 2002). The inferior temporal component with the source estimated approximately in the right inferior temporal cortex and left insula manifested in early visual activity of temporo-occipital N1 potential (Pinel et al. 2001). No significant task and accuracy interactions were found within these aforementioned ERP components.

Fig. 3
figure 3

ICA results of EEG data. Independent components’ a mean scalp topography, b location in brain estimated using CLARA and c stimulus-locked and response-locked ERPs for both tasks. R right, L left

Table 4 Local maxima of EEG independent components’ source reconstructions reported in Talairach coordinates

The centro-parietal component (Fig. 4a) with the source estimated in the bilateral inferior parietal cortices showed pronounced P3b/CPP potential (Polich 2007; Twomey et al. 2015). A significant interaction effect was found on the amplitudes within both stimulus-locked and response-locked ERPs (around 300 ms post stimuli and around −110 and − 18 ms prior to response; see Table 5 for details; Fig. 4a). Average R 2 values of the linear fit to the signal before the response determined the time window in which CPP increases linearly for each condition (Supp. Figure 1). Maximal R2 was found for windows beginning at −352 ms (R 2 = 0.78) for numerical correct, − 142 ms (R 2 = 0.91) for physical correct, − 164 ms (R 2 = 0.80) for numerical error, and − 344 ms (R 2 = 0.59) for physical error. Lower R 2 values for erroneous trials in comparison to correct ones reflect lower number of trials, which contributed to ERPs. The obtained slope coefficients and constant terms showed significant interaction effect between accuracy and task instructions (Table 5; Fig. 4b).

Fig. 4
figure 4

Interaction effects on centro-parietal EEG component. a ERPs of stimulus-locked (left) and response-locked (right) data. Bars represent mean potential within the significant time range. b Linear fit to response-locked ERPs for each condition. Bars presents mean slope coefficients and constant terms. Grey shading indicates significant (p cor < 0.05) interaction effect. Num numerical task, Phys physical task

Table 5 Significant task and accuracy interactions of the centro-parietal EEG component

fMRI results

Although MDL criteria estimated 25 components on average, ICA decomposition with a total of 29 proved most reliable (component stability was above 0.97). All activations are reported at the FDR-corrected threshold, p < 0.001. Based on the literature (Kelly et al. 2010; Varoquaux et al. 2010), 11 components were classified as neural networks (Fig. 5; Table 6). Three maps corresponded to visual areas. Two of them are commonly observed in resting state studies (Beckmann et al. 2005; Smith et al. 2009; Varoquaux et al. 2010): the medial visual network, which comprised extensive activation in the lingual gyrus, and the lateral visual network ranging from the bilateral middle occipital to the superior parietal cortices. Less commonly observed is the temporo-parietal network involving bilateral inferior temporal gyri and intraparietal sulcus (IPS). The executive control network usually includes the bilateral prefrontal cortex and the core regions of anterior cingulate and insular cortices (Smith et al. 2009; Domagalik et al. 2012). Here, the executive control and prefrontal networks were separated. In line with the study of Damoiseaux et al. (2008), the default mode network was divided into anterior and posterior parts. The left and right frontoparietal and the auditory networks are highly reproducible, as indicated by all the referenced studies. Lastly, the network comprising the precuneus (Domagalik et al. 2012) is less often observed. Only the temporo-parietal network achieved the established criteria of significant interaction [Fig. 6; F (1,33) = 44.1; p < 0.001; LSD post hocs: numerical (Cor vs. Err) p = 0.032, physical (Cor vs. Err) p < 0.001, correct (Num vs. Phys) p = 0.011, error (Num vs. Err) p < 0.001].

Fig. 5
figure 5

ICA results of fMRI data. Brain maps (a) and stimulus-locked hemodynamic responses for both tasks (b). R right, L left

Table 6 Local maxima of fMRI independent components reported in Talairach coordinates
Fig. 6
figure 6

Temporo-parietal network. Networks’ map (a), stimulus-locked hemodynamic responses (b), and mean peak activities for each condition (c). Asterisk indicates significant LSD post hoc: *p < 0.05, **p < 0.001. Num numerical task, Phys physical task

Discussion

In the study, we aimed to verify the neural substrates of correct and erroneous decisions in numerical and physical comparisons. To achieve this goal, we focused solely on incongruent trials as they yield the size congruity effect (Besner and Coltheart 1979) and trigger errors. The size congruity effect originates from the parallel processing of physical and semantic information in the brain, which interact and evoke response competition (Henik and Tzelgov 1982; Kadosh et al. 2007; Szucs and Soltész 2007; Szűcs et al. 2009). Here, we replicated a common finding that assessing the physical feature of the digits is faster than assessing their value (Fig. 2). Errors in the numerical task showed a typical-for-conflict task effect of shortened RTs in comparison with correct trials (Gratton et al. 1988; White et al. 2011). An interesting finding that has not been reported by any previous conflict study is the effect of longer RTs of errors in physical tasks compared to correct trials. This pattern of cross-shaped interaction was used as a marker for neural measurements to explore the origins of this phenomenon.

The EEG data were analysed using group ICA (Eichele et al. 2011) to separate functionally distinct neural sources. Decomposition of EEG data into functionally distinct independent sources was similar to our previous study (Beldzik et al. 2015a), including a centro-parietal component with a robust CPP potential. Although there was no effect on the latency of stimulus-locked CPP, the abrupt slope and sharp peak around the timing of response execution suggest that CPP is more accurately represented by an ERP average aligned to subject responses (Makeig and Onton 2009). In agreement to common findings of decreased P3b amplitudes for more difficult and longer-to-react conditions (West and Alain 2000; Szűcs and Soltész 2012), we found significant task and accuracy interaction within the stimulus-locked P3b/CPP potential (Fig. 4a). However, in light of the aforementioned methodological aspects, P3b/CPP amplitude decreases for more difficult conditions due to the temporal variability of the peak onset (Twomey et al. 2015). Thus, analysing the response-locked CPP is more accurate and informative approach.

As hypothesised, a significant interaction among task and accuracy conditions was found on the slope of response-aligned CPP around 110 ms before the response (Fig. 4a). In light of the interpretation of Twomey et al. (2015), this slope represents the build-up rate of accumulating evidence in favour of a particular outcome. To strengthen this association, we measured the build-up rate by fitting a straight line to the signal before the response onset separately for each condition (Murphy et al. 2015). The significant accuracy and task interaction of the slope coefficients confirmed that the build-up rates tracked the RT pattern of each condition (Fig. 4b). In addition, the constant terms showed the interaction effect, i.e. the value of CPP amplitude at the onset of the response. Due to the shared topography of the CPP signal, we cannot definitely distinguish between correct numerical and physical responses; however, these results indicate that each response has a characteristic build-up rate and action-triggering threshold. Following that logic, errors in the numerical task have the same build-up characteristic as hits in the physical task. This suggests that errors in the numerical task were committed due to accumulation of the wrong evidence. In other words, evidence was accumulated towards the decision of which digit was bigger in size rather than magnitude. Finally, the CPP build-up rate parameters of errors in the physical task indicate that, at least partially, subjects accumulate wrong evidence in favour of the numerically, instead of physically, greater digit. This potentially explains why errors in the physical task have longer RTs than hits.

The fMRI data were analysed in a similar manner to the EEG data. Group ICA was used to separate neural sources, event-related averaging was performed on components’ time-courses, and peaks of hemodynamic responses were searched for significant task and accuracy interaction. Eleven neural networks were obtained (Fig. 5), but only one showed the expected pattern of activity: greater activation for correct vs. error trials in the numerical task and greater activation for error vs. correct in the physical task. This network involves the bilateral inferior temporal gyrus as well as a bilateral cluster extending along the IPS to the postcentral gyrus; therefore, it was labelled temporo-parietal (Fig. 6a). The architecture of this component shows some similarities to the dorsal attention network (Fox et al. 2005; Visintin et al. 2015); however, the lack of frontal eye field inclusion speaks against this association.

More evident is a link to the recently established ‘number network’ (Amalric and Dehaene 2016; Nieder 2016), which comprises bilateral intraparietal, inferior temporal, and dorsal prefrontal sites. The network is activated by all domains of mathematics such as analysis, algebra, topology, and geometry (Amalric and Dehaene 2016). As indicated by studies using intracranial electrophysiological recordings in humans, the inferior temporal gyrus selectively responds to written Arabic numerals, but not to morphologically, semantically, or phonologically similar stimuli (Shum et al. 2013). IPS was activated during a variety of number-processing and calculation tasks implemented in fMRI (Pinel et al. 2004; Kadosh et al. 2005; Kaufmann et al. 2005) and is considered a key node for the representation of the semantic aspect of numerical quantity (Nieder and Dehaene 2009). Moreover, studies using single-neuron recordings in the IPS of monkeys have identified ‘number neurons’ which encode the number of elements in a set and its cardinality or numerosity, irrespective of stimulus modality (see a review by Nieder 2016). In brief, the involvement of the temporo-parietal network in the numerical Stroop paradigm is highly consistent with the literature.

Due to separate EEG and fMRI experiments, we were unable to couple independent components in the single-trial levels; however, partial consistency might be observed between EEG source location measurements (Fig. 3) and fMRI results (Fig. 6a). Particularly, the inferior temporal EEG component had the highest source probability in the right inferior temporal gyrus, whilst the centro-parietal EEG component had bilateral parietal distribution of the estimated source. Together, these results combine when and where the numerical processing occurs (Verguts and Fias 2008). The inferior temporal component with a source in the inferior temporal gyrus encodes numerical/physical sensory information into numerical representation (Shum et al. 2013; Amalric and Dehaene 2016), whereas the centro-parietal component with a source in IPS accumulates quantitative information into the DV until reaching an action-triggering threshold (Nieder and Dehaene 2009; O’Connell et al. 2012). The similar pattern of amplitude modulations by task and accuracy conditions in EEG and fMRI measurements supports this association. Interestingly, topography of the inferior temporal EEG component indicates right hemispheric dominance, whereas fMRI data show bilateral activations. The same discrepancy can be observed in the literature as some studies point to the right inferior temporal supremacy in response to selective numbers (Park et al. 2012; Shum et al. 2013), whereas others report bilateral activations in that brain region during numerical cognitive tasks (Pinel et al. 2004; Amalric and Dehaene 2016). Future studies, which implement simultaneous EEG and fMRI recordings, might consider verifying the link between these components and clarify the hemispheric inconsistency of the inferior temporal activations.

An interesting question remains regarding the nature of the DV signal. Does it reflect accumulation of the numerical quantity information (“seven is greater”) or spatial aim-of-reach information (“press left button”)? Several recent findings speak in favour of the latter perspective. First, Twomey et al. (2015) indicated that the DV signal is supramodal as it was detected during auditory and visual oddball tasks. Second, an fMRI study implementing a saccadic task revealed that, as opposed to regions involved in conflict resolution, IPS exhibits strong trial-by-trial variations with RTs (Domagalik et al. 2014; Beldzik et al. 2015b). Third, de Lafuente et al. (2015) showed using single-neuron recordings in monkeys that when decisions are communicated by hand movements, the reach-related neurons from the medial IPS exhibit a gradual modulation of their firing rates consistent with the representation of an evolving DV. When decisions are communicated by eye movements, neurons in the lateral IPS represent the accumulation of evidence bearing on the potential targets of saccades.

On the other hand, quantity and space are part of a generalized magnitude system in which “two kinds of quantity must be combined to determine behaviourally important decision variables” (Gallistel and Gelman 2000; Walsh 2003). The numerical–spatial interactions arise from common parietal circuits for attention to external space and internal representations of numbers (Hubbard et al. 2005). This perspective is also supported with studies applying single-neuron recordings in monkeys. Tudusciuc and Nieder (2007) found anatomically intermingled single neurons in the IPS of monkeys which encode continuous spatial, discrete numerical, or both types of quantities. This and several subsequent studies have provided evidence that the IPS encodes numerosity as a perceptual category, regardless of behavioural relevance (see review by Nieder 2016).

In summary, when one is presented with a pair of digits that are incongruent in size and numerical value, this sensory information is encoded in parallel in the bilateral inferior temporal gyri, which manifests in posterior N1 activity of the inferior temporal component. This information is then integrated in the IPS, and the DV is accumulated in favour of a particular outcome until reaching an action-triggering threshold, which manifests in CPP potential of the centro-parietal component. These brain regions are functionally connected into one coherent neural network. Analysis of the DV signal for each condition revealed characteristic build-up rates and action-triggering thresholds for the two tasks, which differ only in the instructions given by the experimenter. Our results suggest that errors are committed due to accumulation of evidence in favour of the other (i.e. wrong) task instruction. This interpretation fully explains the behavioural pattern observed here: errors in numerical comparisons have shorter RTs than hits (and similar RTs to hits in physical comparisons), whereas errors in physical comparisons show longer RTs than hits. On a final note, our study highlights the advantages of implementing group ICA in both EEG and fMRI data.