The Cognitive Reflection Test (hereafter, CRT) measures the ability to suppress a prepotent but incorrect intuitive answer and to engage in cognitive reflection when solving a set of mathematical word problems (Frederick, 2005). The most famous CRT item is the “bat and ball” problem: “A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? ___ cents.” Participants usually come up with an appealing, intuitive, yet incorrect answer—10 cents—instead of the correct answer—5 cents—which requires more analytical processing and some formal computation. The test has become increasingly popular, attracting more than 2,000 Google Scholar citations in the 12 years since its publication, and it has become a leading measure of rational thinking (Toplak, West, & Stanovich, 2011). It gained popularity because it predicts an extensive array of variables: inter alia, biases in reasoning, judgment, and decision-making (e.g., Campitelli & Labollita, 2010; Frederick, 2005; Lesage, Navarrete, & De Neys, 2013; Liberali, Reyna, Furlan, Stein, & Pardo, 2012; Sirota, Juanchich, & Hagmayer, 2014; Toplak et al., 2011; Toplak, West, & Stanovich, 2014, 2017); real-life decision outcomes (Juanchich, Dewberry, Sirota, & Narendran, 2016); moral reasoning (Baron, Scott, Fincher, & Metz, 2015); paranormal beliefs and belief in God (Pennycook, Cheyne, Seli, Koehler, & Fugelsang, 2012; Pennycook, Ross, Koehler, & Fugelsang, 2016); and political beliefs (Deppe et al., 2015; but see Kahan, 2013; for a review, see Pennycook, Fugelsang, & Koehler, 2015a).
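
For readers who want the computation spelled out, the intended answer follows from one line of algebra (a minimal sketch, with b denoting the ball’s price in dollars):

    \[
      b + (b + 1.00) = 1.10 \quad\Rightarrow\quad 2b = 0.10 \quad\Rightarrow\quad b = 0.05\ \text{dollars, i.e., 5 cents.}
    \]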

Critically, much of this research, whether correlational or experimental, has presented the test in its original form—an open-ended answer format—so that participants have had to construct their responses (e.g., De Neys, Rossi, & Houde, 2013; Frederick, 2005; Johnson, Tubau, & De Neys, 2016; Liberali et al., 2012; Royzman, Landy, & Leeman, 2015; Sirota et al., 2014; Szaszi, Szollosi, Palfi, & Aczel, 2017; Toplak et al., 2011). Sometimes, however, an ad hoc multiple-choice version of the CRT has been used—most commonly a two- or four-option format (e.g., Gangemi, Bourgeois-Gironde, & Mancini, 2015; Morsanyi, Busdraghi, & Primi, 2014; Oldrati, Patricelli, Colombo, & Antonietti, 2016; Travers, Rolison, & Feeney, 2016). In the latter approach, the equivalence between the open-ended and multiple-choice versions of the test has been implicitly assumed or explicitly claimed, and similar processes have been inferred from the two types of tests. Indeed, if such equivalence holds, then using a validated multiple-choice version of the CRT would be more convenient, since it would most likely be quicker to administer and to code than the open-ended CRT. Furthermore, an automatic coding scheme would eliminate any potential coding ambiguity—for instance, whether “0.05 cents,” a formally incorrect answer to the bat and ball problem, should count as incorrect or as correct on the assumption that participants mistook the unit of the answer and meant dollars (i.e., “0.05 dollars”) rather than cents.

There are several good empirical and theoretical reasons to expect differences between the response formats of the Cognitive Reflection Test. First, evidence from educational measurement research indicates that, despite high correlations between open-ended (also called constructed-response) and multiple-choice versions, multiple-choice tests usually lead to better overall performance (e.g., Bridgeman, 1992; Rodriguez, 2003). Open-ended questions are more difficult to solve than multiple-choice questions for stem-equivalent items (i.e., items that differ only in whether answer options are listed), because presenting options enables a different array of cognitive strategies that increase performance (Bonner, 2013; Bridgeman, 1992). For instance, if participants generate an incorrect answer that does not appear among the listed options, the limited answer set provides unintentional corrective feedback that eliminates that particular solution. With multiple-choice questions, participants can also use a backward strategy, in which they pick an answer from the listed options and try to reconstruct the solution from it; the sketch below illustrates this strategy. Finally, participants can simply guess when they are uncertain.
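
To make the backward strategy concrete, here is a minimal base R sketch applied to the bat-and-ball item introduced above (the two candidate answers are our own illustration, not options taken from any particular test): each listed option is plugged back into the problem’s constraints and checked.

    # Candidate ball prices in cents (hypothetical answer options).
    ball_cents <- c(10, 5)
    # The bat costs $1.00 (100 cents) more than the ball.
    bat_cents <- ball_cents + 100
    # An option is consistent only if bat + ball equals the stated $1.10 total.
    data.frame(ball = ball_cents, bat = bat_cents,
               consistent = (ball_cents + bat_cents) == 110)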

Second, there might be theoretical reasons for the nonequivalence of tests with different response formats, and such nonequivalence could be consequential. Cognitive conflict, which triggers deeper cognitive processing according to several dual-process theories (De Neys, 2012, 2014; Kahneman & Frederick, 2005), might be more pronounced when an explicit correct option is presented alongside an intuitively appealing but incorrect alternative (Bhatia, 2017). Thus, a multiple-choice version of the CRT might be easier because the explicit options would trigger cognitive conflict with a higher likelihood, which, in turn, would facilitate engagement in cognitive reflection. As a result, the multiple-choice version should be more strongly associated with the benchmark variables usually linked with the cognitive reflection the CRT is assumed to measure (e.g., belief bias, paranormal beliefs, denominator neglect, and actively open-minded thinking; Pennycook et al., 2012; Toplak et al., 2014). Limited process-oriented evidence has indicated that pronounced cognitive conflict is present when a multiple-choice version of the CRT is used: the mouse trajectories of participants who responded correctly revealed that they were attracted to the incorrect intuitive response (Travers et al., 2016), whereas evidence of conflict was missing in a thinking-aloud study using the open-ended version of the CRT (Szaszi et al., 2017). Clearly, other factors, such as the different sensitivities of the employed process-oriented methodologies to conflict, might account for this difference, but the response format of the CRT remains a possible reason as well.

In addition, even if people were equally likely to detect a conflict in the two response formats and to engage in reflective thinking afterward, they might still fail to correct their initial intuition because they lack the relevant mathematical knowledge (Pennycook, Fugelsang, & Koehler, 2015b). This is supported by thinking-aloud evidence, in which performance on the open-ended CRT was partly explained by a lack of the specific knowledge needed to solve the problems (Szaszi et al., 2017). Since the correct answer is already included in the multiple-choice version of the test, this format might therefore be easier. Another consequence would be that such a test should have a weaker association with numeracy than the open-ended CRT does (Liberali et al., 2012). In other words, construct nonequivalence would imply that different cognitive processes take place in the different formats of the CRT, which should result in different levels of performance and different correlational patterns with the benchmark variables usually associated with the CRT.

The present research

In the present experiment, our overarching aim was to test the construct equivalence of three different formats of the Cognitive Reflection Test (and its variations). To do so, we compared their means and their correlational patterns with other typically predicted constructs. We did not rely on the correlation between the different versions of the test, because such a correlation does not necessarily reflect equivalence of the underpinning cognitive processes that manifest themselves in the final scores. Specifically, we set three main aims to fill the gaps outlined above. First, we tested whether the CRT response format affects performance in the test, in terms of both the reflectiveness score (i.e., correct responses) and the intuitiveness score (i.e., appealing but incorrect responses). Second, we tested whether the CRT response format altered the well-established associations between performance in the CRT and benchmark variables: belief bias, denominator neglect, paranormal beliefs, actively open-minded thinking, and numeracy. Third, we tested the psychometric quality of the different formats of the test by comparing their internal consistencies. In addition, from a more practical perspective, we also had expectations concerning completion time.

According to the construct equivalence hypothesis, which is assumed in present research practices, (i) the effect of the answer format on the reflectiveness and intuitiveness scores will be negligible, (ii) the correlational patterns with the outcome variables will not differ across the different test formats, and (iii) the tests’ scores will have similar internal consistencies. In contrast, according to the construct nonequivalence hypothesis, which we derived from the mathematical problem-solving literature and from the theoretical reasons outlined above, (i) multiple-choice versions will lead to higher reflectiveness scores and lower intuitiveness scores, due to a different array of more successful cognitive strategies, better chances of detecting cognitive conflict, and/or better chances of identifying the correct response; (ii) the correlational patterns with the outcome variables will differ across the different test formats—the multiple-choice versions should be more strongly associated with the predictive validity variables (belief bias, paranormal beliefs, and denominator neglect), since they share similar processes, and should be less strongly correlated with numeracy; and (iii) the multiple-choice versions will have higher internal consistency in their summation scores. Finally, we predicted that the multiple-choice versions of the CRT would be quicker to complete than the open-ended version.

Method

Participants and design

We powered the experiment to detect a small-to-medium effect size (f = .17, corresponding to ηp2 = .03). Given α = .05 and power (1 − β) = .90 for a between-subjects analysis of variance with three groups, the power analysis resulted in a minimum required sample size of 441 participants (Cohen, 1988). Such a sample size would also be sensitive enough to detect a medium effect size difference between two correlations (i.e., Cohen’s q ≈ 0.32). The participants were recruited from an online panel (Prolific Academic). Panel members were eligible to participate only if they fulfilled all four of the following conditions: (i) their approval rate in previous studies was above 90%, (ii) they had not taken part in previous studies conducted by our lab in which we had used the Cognitive Reflection Test, (iii) they were UK nationals, and (iv) they resided in the UK. The first criterion aimed to minimize careless responding (Peer, Vosgerau, & Acquisti, 2014), the second criterion aimed to reduce familiarity with the items, and the last two criteria aimed to guarantee a good level of English proficiency. The participants were reimbursed £1.40 for their participation, which lasted, on average, 17 min. A total of 452 participants (with ages ranging from 18 to 72 years, M = 37.0, SD = 12.3 years; 60.2% of whom were female) completed the questionnaire. The participants had various levels of education: less than high school (0.7%), high school (39.4%), undergraduate degree (44.2%), master’s degree (12.2%), and higher degrees such as PhD (3.5%).
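
The reported sample size can be reproduced, approximately, with the pwr package in R (a sketch under our own assumptions; the article does not state which software was used for the power analysis):

    # install.packages("pwr")  # if not already installed
    library(pwr)
    # Cohen's f = .17 corresponds to eta_p^2 of roughly .03, since f^2 = eta^2 / (1 - eta^2).
    pa <- pwr.anova.test(k = 3, f = .17, sig.level = .05, power = .90)
    ceiling(pa$n)       # participants required per group (about 147)
    3 * ceiling(pa$n)   # minimum total sample size (about 441)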

In a between-subjects design, the participants were randomly allocated to one of the three versions of the Cognitive Reflection Test and then answered items from the five benchmark tasks, which were presented in a random order.

Materials and procedure

Cognitive Reflection Test: Response format manipulation

After giving informed consent, the participants solved the extended seven-item Cognitive Reflection Test, composed of the original three items (Frederick, 2005) and four additional items (Toplak et al., 2014). The test was presented in one of three formats: (i) the original open-ended version, (ii) the two-option multiple-choice version, or (iii) the four-option multiple-choice version. Each item was presented with the same stem, followed by the response format specific to the assigned test version (see the items in the supplementary materials)—for instance:

“In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake?”

  (i) The open-ended version:

    ____ days

  (ii) The two-option multiple-choice version:

    □ 47 days
    □ 24 days

  (iii) The four-option multiple-choice version:

    □ 47 days
    □ 24 days
    □ 12 days
    □ 36 days

The two-option MCQ version always featured the correct answer and the intuitive incorrect answer. The four-option MCQ version featured the correct answer and the intuitive incorrect answer, plus two other incorrect answers that had been found to be the most common incorrect answers, after the intuitive incorrect answer, in a previous study (Sirota, Kostovicova, Juanchich, Dewberry, & Marshall, 2018, manuscript in preparation). The presentation order of the CRT items, as well as of the individual options in the MCQ versions of the test, was randomized for each participant. After solving the CRT, the participants assessed their familiarity with all of the CRT items, presented in a random order: “Have you answered any of the following questions prior to taking this survey?” (Yes/No).
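
To illustrate how the reflectiveness and intuitiveness scores defined earlier can be computed in exactly the same way for all three formats, here is a minimal base R sketch (the data frame, item labels, and response classifications are hypothetical):

    # One row per participant x item; "type" is the coded response category.
    responses <- data.frame(
      id   = rep(1:2, each = 3),
      item = rep(c("bat_ball", "lily_pads", "machines"), times = 2),
      type = c("correct", "intuitive", "correct",
               "intuitive", "intuitive", "other")
    )
    # Reflectiveness = count of correct responses; intuitiveness = count of intuitive responses.
    reflectiveness <- tapply(responses$type == "correct",   responses$id, sum)
    intuitiveness  <- tapply(responses$type == "intuitive", responses$id, sum)
    cbind(reflectiveness, intuitiveness)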

The participants then completed measures of three indicators of predictive validity—(i) belief bias, (ii) paranormal beliefs, and (iii) denominator neglect—and of two indicators of construct validity—(iv) actively open-minded thinking beliefs and (v) numeracy.

Belief bias

The participants assessed the logical validity of the conclusions of eight syllogisms (Evans, Barston, & Pollard, 1983; Markovits & Nantel, 1989). Each syllogism featured two premises and one conclusion. Four of the syllogisms had an unbelievable conclusion that followed logically from the two premises. For instance: “Premise 1: All things that are smoked are good for the health. Premise 2: Cigarettes are smoked. Conclusion: Cigarettes are good for the health.” The other four syllogisms featured a believable conclusion that did not follow logically from the premises. For instance: “Premise 1: All things that have a motor need oil. Premise 2: Automobiles need oil. Conclusion: Automobiles have motors.” The belief bias score had good internal consistency (Cronbach’s α = .86). Responses driven by believability rather than logic (i.e., incorrect responses; +1 each) were summed to create a belief bias score (0–8), with higher values indicating a stronger bias.

Paranormal beliefs

We assessed paranormal beliefs across different domains (e.g., witchcraft, superstition, spiritualism) with the Revised Paranormal Belief Scale (Tobacyk, 2004). The participants expressed their agreement with 26 statements (e.g., “It is possible to communicate with the dead”) on a 7-point Likert scale (1 = Strongly Disagree, 2 = Moderately Disagree, 3 = Slightly Disagree, 4 = Uncertain, 5 = Slightly Agree, 6 = Moderately Agree, 7 = Strongly Agree). The scale had excellent internal consistency (Cronbach’s α = .95). We averaged the participants’ responses (1–7), with higher values indicating stronger paranormal beliefs.

Denominator neglect

We used five scenarios describing a game of chance in which the participants could draw a single ticket from one of two bowls—a small bowl and a big bowl—each containing folded tickets (Kirkpatrick & Epstein, 1992; Toplak et al., 2014). The small bowls featured a higher probability of winning and ratios with smaller denominators than the big bowls. For instance, the small bowl contained ten tickets, one of which was a winning ticket, therefore giving a 10% chance of winning, whereas the large bowl contained 100 tickets, eight of which were winning tickets, giving an 8% chance of winning. Denominator neglect occurs when participants prefer to choose from the bigger bowl, not realizing that they are actually more likely to draw a winning ticket from the smaller bowl (with the smaller denominator). The participants indicated which bowl they would prefer in a real-life situation in order to hypothetically win £8, on a 6-point Likert scale (1: I would definitely pick from the small bowl, 2: I would pick from the small bowl, 3: I would probably pick from the small bowl, 4: I would probably pick from the large bowl, 5: I would pick from the large bowl, 6: I would definitely pick from the large bowl). The ratios of winning to losing tickets in the small and big bowls were the same as those used in Toplak et al. (2014): 1:10 versus 8:100, 1:4 versus 19:81, 1:19 versus 4:96, 2:3 versus 19:31, and 3:12 versus 18:82. The answers had good internal consistency (Cronbach’s α = .80). The ratings in the five scenarios were averaged (range 1–6), with higher values indicating a stronger tendency to neglect denominators.
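
The arithmetic behind the example above, as a two-line sketch in R:

    p_small <- 1 / 10    # one winning ticket out of ten (10%)
    p_big   <- 8 / 100   # eight winning tickets out of 100 (8%)
    p_small > p_big      # TRUE: preferring the big bowl reflects denominator neglect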

Open-minded thinking

We used the Actively Open-Minded Thinking Beliefs Scale to measure beliefs about open-mindedness (Baron, 2008; Stanovich & West, 1997). The participants expressed their agreement with 11 statements (e.g., “People should revise their beliefs in response to new information or evidence”) on a 5-point Likert scale (anchored at 1 = Completely Disagree and 5 = Completely Agree). The scale had good internal consistency (Cronbach’s α = .79). Responses were averaged (1–5), with higher values indicating stronger beliefs in open-minded thinking.

Numeracy

We used the Lipkus Numeracy Scale, perhaps the most commonly used measure of numeracy in this context (Lipkus, Samsa, & Rimer, 2001). The measure consists of 11 simple mathematical tasks, which tap into general numeracy, including understanding of basic probability concepts, ability to convert percentages to proportions, and ability to compare different risk magnitudes (e.g., “The chance of getting a viral infection is .0005. Out of 10,000 people, about how many of them are expected to get infected?”). The scale had satisfactory internal consistency (Cronbach’s α = .65). The participants’ correct answers were summed (0–11), so higher scores indicated higher numeracy.

Finally, the participants answered some socio-demographic questions (age, gender, and education level) and were debriefed. We conducted the study in accordance with the ethical standards of the American Psychological Association. We have reported all the measures in the study, all manipulations, any data exclusions, and the sample size determination rule.

Results

Effect of test response format on performance

The percentages of correct and intuitive incorrect responses, as well as item difficulty, item discrimination, and item-total correlations, were very similar across the three formats of the seven-item CRT (Tables 1 and 2). Overall, the participants correctly answered similar numbers of problems in all test versions of the seven-item CRT (Fig. 1, top left). We found no statistically significant differences between the three formats in the number of correctly solved problems, F(2, 449) = 0.31, p = .733, ηp2 < .01. A Bayesian analysis, using the BayesFactor R package and default priors (Morey & Rouder, 2015), yielded strong evidence (evidence categorization as recommended in Lee & Wagenmakers, 2014) to support the model assuming a null format effect relative to the model assuming a format effect, BF01 = 30.3. The effect of format on reflectiveness scores remained nonsignificant when we entered familiarity with the items as a covariate, F(2, 448) = 0.27, p = .271, ηp2 < .01, and the effect of familiarity was also nonsignificant, F(1, 448) = 1.73, p = .190, ηp2 < .01. A Bayesian analysis including the same covariate (as a nuisance term) yielded strong evidence to support the model assuming a null format effect relative to the model assuming a format effect, BF01 = 31.2. We also tested the robustness of the findings with a six-item version of the CRT, which excluded the seventh item because that item is presented with three response options even in the original open-ended version of the test. The conclusion remained the same: There was no detectable format effect between the open-ended, two-option, and four-option formats of the six-item CRT (M1 = 2.8, M2 = 3.1, M3 = 2.8; SD1 = 2.1, SD2 = 1.8, SD3 = 1.9, respectively), F(2, 449) = 0.86, p = .426, ηp2 < .01, and there was strong evidence supporting the null-effect model relative to the alternative model, BF01 = 18.2.
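
The analyses above can be sketched in R as follows (the data frame and its column names are hypothetical; the Bayesian model comparison uses the BayesFactor package cited in the text, with its default priors, and the nuisance-term comparison reflects our reading of the analysis):

    library(BayesFactor)
    # crt: hypothetical data frame with one row per participant;
    #   format   - factor: "open", "two_option", "four_option"
    #   reflect  - reflectiveness score (number of correct responses)
    #   familiar - familiarity with the items (covariate)

    summary(aov(reflect ~ format, data = crt))          # frequentist one-way ANOVA

    bf_format <- anovaBF(reflect ~ format, data = crt)  # BF10 for the format effect
    1 / extractBF(bf_format)$bf                         # BF01: evidence for the null

    # Covariate as a nuisance term: compare the covariate-only model with the
    # model that adds the format effect (both are tested against an intercept-only model).
    bf_cov  <- lmBF(reflect ~ familiar, data = crt)
    bf_full <- lmBF(reflect ~ familiar + format, data = crt)
    bf_cov / bf_full                                    # BF01 for the format effect given the covariate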

Table 1 Correct, intuitive incorrect, and, if applicable, other incorrect responses across three versions of the Cognitive Reflection Test
Table 2 Item difficulty, discrimination, and item-total correlations for the three response formats of the Cognitive Reflection Test (correct responses)
Fig. 1

Effects of three different CRT formats on the numbers of correct responses (top left) and intuitive (incorrect) responses (top right) on the seven-item CRT, as well as on numbers of correct responses (bottom left) and intuitive (incorrect) responses (bottom right) on the three-item CRT. Each graph represents the individual data points, their density, and the mean (middle bold line) with its 95% confidence interval (CI; box borders)

The participants also correctly answered similar numbers of problems in all test versions of the original three-item CRT (Fig. 1, bottom left). The format effect was not statistically significant, F(2, 449) = 1.19, p = .306, ηp2 = .01, and we found strong evidence supporting the model assuming no effect relative to the effect, BF01 = 13.4. The null effect on reflectiveness scores remained when we controlled for familiarity of the items, F(2, 448) = 1.08, p = .342, ηp2 = .01 [there was a nonsignificant effect of familiarity, F(1, 448) = 1.21, p = .273, ηp2 < .01]. We found strong evidence to support the model assuming the null format effect (including the covariate as a nuisance term) relative to the model assuming the format effect, BF01 = 15.2. Thus, the evidence found here clearly supported the hypothesis of construct equivalence.

We observed more variability in the numbers of intuitive responses across the different test formats of the seven-item CRT, with the two-option test giving rise to higher numbers of intuitive answers (Fig. 1, top right). We found a significant effect of the response format, F(2, 449) = 7.47, p < .001, ηp2 = .03, and strong evidence to support the model assuming the format effect relative to the no-effect model, BF10 = 25.4. The two-option CRT yielded more intuitive responses than the open-ended CRT, t = 3.59, p < .001, and also more than the four-option CRT, t = 3.01, p < .001, but there was no difference between the open-ended and four-option CRTs, t = – 0.59, p = .825. (All p values are based on Tukey’s adjustment.) The effect of format on intuitiveness score was stable even when we controlled for the familiarity of the items, F(2, 448) = 7.49, p < .001, ηp2 = .03, whereas familiarity did not have a significant effect, F(1, 448) = 1.38, p = .241, ηp2 < .01. The format effect was further corroborated by strong relative evidence, BF10 = 27.3. The pattern of results was the same when Item 7 was removed from the averages of the open-ended CRT, the two-option CRT, and the four-option CRT (M1 = 2.3, M2 = 2.9, M3 = 2.4; SD1 = 1.7, SD2 = 1.8, SD3 = 1.7, respectively). The effect of the format was still significant, F(2, 449) = 6.30, p = .002, ηp2 = .03, and we found moderate evidence to support the model assuming the format effect, BF10 = 8.7.
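
One standard way to obtain Tukey-adjusted pairwise comparisons in base R is sketched below (the article does not specify its implementation; crt$intuit denotes the hypothetical intuitiveness score):

    fit <- aov(intuit ~ format, data = crt)
    summary(fit)        # omnibus format effect
    TukeyHSD(fit)       # Tukey-adjusted pairwise differences among the three formats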

However, for the original three-item CRT, we did not find a significant effect of the test format on the number of intuitive answers, F(2, 449) = 1.50, p = .224, ηp2 = .01 (Fig. 1, bottom right). The data provided strong relative evidence supporting the null-format-effect model, BF01 = 10.0. The effect of format on intuitiveness scores remained nonsignificant when we controlled for familiarity with the items, F(2, 448) = 1.40, p = .247, ηp2 = .01, and the effect of familiarity was also nonsignificant, F(1, 448) = 2.90, p = .089, ηp2 = .01. The null-format-effect model (with the covariate as a nuisance term) was again supported by strong relative evidence, BF01 = 11.4.

Effect of test response format on validity

The construct equivalence hypothesis predicted that the correlational patterns of correct and intuitive CRT responses with belief bias, paranormal beliefs, and denominator neglect, as predictive validity variables, as well as with actively open-minded thinking and numeracy, as construct validity variables, would be similar across the three test formats. Overall, the correlations were in the expected directions and of the expected strengths: as indicated by the confidence intervals in Fig. 2, all of the correlations were significantly different from zero, with the exception of a nonsignificant correlation between intuitive responses on the seven-item open-ended CRT and paranormal beliefs (Fig. 2). We observed only small correlational variations between the three test formats of the seven-item and three-item CRTs and the predicted variables (Fig. 2, left panels). The four-option format, followed by the two-option format, sometimes yielded higher correlations than the open-ended format—most notably for belief bias: –.52 and –.51 versus –.36 (Fig. 2, top left)—but in other tasks, such as denominator neglect, the correlations were remarkably similar to each other. We tested the differences between correlations using a series of z tests adjusted for multiple comparisons (given the three comparisons per variable, we used a Bonferroni adjustment, which lowered the significance level from .05 to .017), implemented in the cocor R package (Diedenhofen & Musch, 2015). None of the differences between the correlations reached statistical significance (see Table 3). In other words, the different formats of the CRT predicted the outcome variables to similar extents.
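
For example, the largest contrast noted above (belief bias: r = –.36 in the open-ended group vs. r = –.52 in the four-option group) can be tested with the cocor package cited in the text; the group sizes below are approximate (about one third of the total sample each):

    library(cocor)
    cocor.indep.groups(r1.jk = -0.36, r2.hm = -0.52,
                       n1 = 151, n2 = 151,
                       alternative = "two.sided")
    # With three pairwise format comparisons per variable, the Bonferroni-adjusted
    # significance level is .05 / 3 = .0167.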

Fig. 2

Effects of three different CRT formats on the correlational patterns with predictive validity variables and a confounding variable of numeracy for correct responses (top left) and intuitive (incorrect) responses (top right) on the seven-item CRT, as well as on the correlational patterns for correct responses (bottom left) and intuitive (incorrect) responses (bottom right) on the three-item CRT. Each horizontal error bar represents a point estimate of the Pearson correlation coefficient (r) and its 95% CI

Table 3 Differences between the correlations of the open-ended and multiple-choice versions of the CRT with the indicators of predictive and construct validity

Effect of test response format on internal consistency

All three response formats of the seven-item CRT—open-ended, two-option, and four-option—had good internal consistency for the reflectiveness score, α = .79, 95% CI [.74, .84]; α = .73, 95% CI [.66, .79]; and α = .71, 95% CI [.63, .78], respectively. We did not find significant differences between these three alphas, χ2(2) = 3.29, p = .193 (using the cocron R package; Diedenhofen & Musch, 2016). Similarly, the alphas for the intuitiveness scores, α = .68, 95% CI [.59, .75]; α = .73, 95% CI [.66, .79]; and α = .64, 95% CI [.54, .72], respectively, did not differ significantly, χ2(2) = 2.36, p = .307.
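
The alpha comparisons were run with the cocron package cited above; a minimal sketch using its summary-statistics interface (the argument names follow our reading of the package documentation, and the group sizes are approximate):

    library(cocron)
    cocron.n.coefficients(alpha = c(.79, .73, .71),  # reflectiveness-score alphas
                          items = c(7, 7, 7),        # seven CRT items per format
                          ns    = c(151, 151, 150),  # approximate group sizes
                          dep   = FALSE)             # independent groups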

The internal consistencies of the three-item CRT were also similar across the response formats. The reflectiveness-score alphas did not differ significantly: α = .73, 95% CI [.64, .80]; α = .61, 95% CI [.49, .71]; and α = .60, 95% CI [.47, .70]; χ2(2) = 3.72, p = .156. This was the case for the intuitiveness scores as well: α = .67, 95% CI [.57, .75]; α = .61, 95% CI [.49, .71]; and α = .58, 95% CI [.45, .68]; χ2(2) = 1.15, p = .563. Thus, any differences in the correlational patterns would be less likely to be due to differences in the internal consistencies of the scales.

Effect of test response format on completion time

Finally, we looked at the time taken to complete the three format versions of the CRT. As expected, the open-ended CRT (M = 5.9, SD = 4.0, Mdn = 4.8, IQR = 3.9 min) took substantially more time to complete than either the two- or the four-option CRT (M = 3.5, SD = 1.5, Mdn = 3.2, IQR = 2.0 min; M = 4.5, SD = 2.6, Mdn = 3.8, IQR = 2.8 min, respectively). A nonparametric Kruskal–Wallis test (used due to data skewness) confirmed that the difference was statistically significant, χ2(2) = 44.71, p < .001. The open-ended CRT took longer than the two-option test, Mann–Whitney U = 6,352, p < .001, and than the four-option CRT, Mann–Whitney U = 8,490, p = .001. The four-option CRT also took longer than the two-option CRT, Mann–Whitney U = 9,067, p = .001. Thus, multiple-choice versions of the CRT are much quicker to complete without compromising the predictive validity of the tests.
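
A sketch of the completion-time comparisons in base R (crt$minutes is a hypothetical column holding each participant’s completion time in minutes):

    kruskal.test(minutes ~ format, data = crt)   # omnibus test across the three formats
    # Pairwise Mann-Whitney (Wilcoxon rank-sum) tests between formats
    # (no p-value adjustment applied in this sketch):
    pairwise.wilcox.test(crt$minutes, crt$format, p.adjust.method = "none")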

Discussion

In a well-powered experiment, we found that different test response formats—open-ended, two-option, and four-option—did not significantly affect the number of correct responses in the original three-item Cognitive Reflection Test (Frederick, 2005) or in its seven-item extension (Toplak et al., 2014). Overall, the response format also did not alter the number of intuitive responses, except in the case of the two-option format of the seven-item CRT, which yielded a higher rate of intuitive responses than the open-ended and four-option formats, possibly because the intuitive option is more prominent when only two options are presented. Furthermore, we found no detectable differences in the patterns of correlations of the test with benchmark indicators of predictive and construct validity—belief bias, denominator neglect, paranormal beliefs, actively open-minded thinking, and numeracy. Finally, all three formats had similar internal consistencies, regardless of the type of scoring (reflectiveness vs. intuitiveness). Overall, these findings favor the construct equivalence hypothesis over the nonequivalence hypothesis.

Our findings are surprising in the context of the literature on mathematical word problems and educational testing, because in those fields multiple-choice questions have been shown to be easier to solve than open-ended questions (e.g., Bonner, 2013; Bosch-Domènech, Brañas-Garza, & Espín, 2014). This might be due to the specific nature of the CRT items. The strategies believed to be responsible for better performance on multiple-choice mathematical problems—such as corrective feedback, guessing, or backward solutions (Bonner, 2013; Bridgeman, 1992)—might not work so well when an intuitively appealing but incorrect answer is provided. For instance, it seems less likely that participants would resort to guessing when an appealing intuitive option is available. Similarly, corrective feedback relies on participants generating an answer that is not in the offered set of options, so that the item itself signals that the generated answer must be wrong; there is little benefit from such feedback, however, if the intuitive incorrect answer and the correct answer are the two most commonly generated answers. Our findings therefore indicate that the three versions of the CRT capture similar processes.

Methodologically speaking, we created and validated four new measures of cognitive reflection: two-option and four-option three-item CRTs, as well as their equivalents for the extended version, the two-option and four-option seven-item CRTs. The four-option versions seem particularly well suited for use in both experimental and correlational research. They offer the same level of difficulty, similar internal consistency, and the same predictive power as the open-ended version of the CRT (in fact, the four-option CRT was nominally the best-ranked predictor among the formats), while being substantially quicker for participants to answer. In addition, coding the answers can be completely automated, which saves time for researchers and eliminates the coding ambiguity that may lead to coding errors. The additional financial cost of the open-ended version is not trivial, due to its cumulative nature. For example, Prolific Academic, which is one of the cheapest panel providers in the UK, currently charges an additional £0.20 for roughly 100 s of additional time per participant, which is the extra time associated with using the open-ended version of the CRT compared with the four-option version. This represents an additional £100 for a study with 500 participants (similar to the study presented here). On top of this, employing a research assistant for three hours to code the 3,500 open-ended CRT answers from such a study would cost an additional £60. Hence, running a single study with the open-ended CRT rather than the four-option CRT would be about £160 more expensive, and three studies in one manuscript would add up to £480 (i.e., the cost of running one new study).
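
The cost arithmetic above, spelled out in R:

    extra_panel_fee <- 0.20     # GBP per participant for ~100 s of extra completion time
    n_participants  <- 500      # participants in a typical study
    coding_cost     <- 60       # GBP for ~3 h of research-assistant coding (3,500 answers)
    per_study <- extra_panel_fee * n_participants + coding_cost   # 160 GBP per study
    3 * per_study                                                 # 480 GBP across three studies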

The argument regarding the coding ambiguity is not negligible either. For example, in the bat and ball problem, the correct answer is supposed to be indicated in pence (cents in the US)—and hence it should be “5.” But is “0.05” also a correct answer or not? Is “£0.05” a correct response? A strict coding scheme would not classify such answers as being correct responses even though, clearly, they are not reasoning errors and should be—in our view—coded as correct. There are no coding instructions in the original article, nor a standardized coding practice regarding this test, and one can rarely see any details provided on coding of the CRT in other articles. So, there are obvious degrees of freedom in deciding on a coding scheme for the open answers in the CRT, and it is not obvious what exactly should constitute the correct response. To illustrate the extent of the difference, in our research reported here, when we followed the strict coding for the bat-and-ball problem (n = 147), around 10% (six out of 58) of the originally correct answers were recoded as “other incorrect,” and around 11% (nine out of 83) of the originally intuitive answers were recoded as “other incorrect” responses—a significant change in absolute performance, according to a marginal homogeneity test, p < .001. The automatic coding of a multiple-choice version of the CRT eliminates this problem and would allow the CRT performance reported in future studies to be more comparable.
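
To illustrate the difference between the two coding schemes, here is a minimal base R sketch for the bat-and-ball item (the specific strings accepted under the lenient scheme are our own illustration, not the article’s coding instructions):

    code_bat_ball <- function(answer) {
      a <- trimws(tolower(answer))
      # Strict coding: only the literal "5" (pence) is correct; only "10" is intuitive.
      strict <- ifelse(a == "5", "correct",
                ifelse(a == "10", "intuitive", "other incorrect"))
      # Lenient coding: also accept answers expressed in pounds or dollars.
      correct_set   <- c("5", "5p", "0.05", ".05", "£0.05", "$0.05")
      intuitive_set <- c("10", "10p", "0.10", "0.1", ".10", ".1", "£0.10", "$0.10")
      lenient <- ifelse(a %in% correct_set, "correct",
                 ifelse(a %in% intuitive_set, "intuitive", "other incorrect"))
      data.frame(answer = answer, strict = strict, lenient = lenient)
    }
    code_bat_ball(c("5", "0.05", "£0.05", "10", "0.10", "1"))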

The two-option versions of the test are appropriate too. However, the two-option seven-item CRT yielded a higher number of intuitive responses without compromising the predictive patterns; this might have consequences for researchers for whom the average absolute level of intuitive responses is important. Future research should also consider whether different cognitive processes are involved when response formats vary in other stem-equivalent judgment and decision-making tasks. For instance, one could ask whether the processes and performance behind the base-rate fallacy are the same when constructed responses are used, as in textbook versions of Bayesian reasoning tasks (e.g., Sirota, Kostovičová, & Vallée-Tourangeau, 2015), as when multiple-choice questions are used, as in stereotypical base-rate problems (e.g., De Neys, Cromheeke, & Osman, 2011).

Three limitations of our research require further discussion and should be addressed in future research. First, we have shown that the response format did not alter predictive or construct validity, as indicated by the benchmark variables; however, even though our selection of such variables covers different cognitive domains, it is not exhaustive. Future research should explore other outcome variables and examine whether the response format yields any changes in predictive or construct validity. Second, even though the format did not alter performance or predictive patterns, the validity and reliability of multiple-choice versions of the CRT clearly depend on how the multiple-choice options are constructed. Here, we adopted a transparent and consistent procedure according to which the two remaining options were the most common incorrect (nonintuitive) responses generated by participants in other studies. Changes in the provided choices (in the four-option version) might affect the construct equivalence—for example, adding an additional appealing incorrect answer could increase cognitive conflict and subsequent cognitive engagement (Bhatia, 2017). Therefore, for future research, we recommend using multiple-choice tests that have been validated (see the supplementary materials and https://osf.io/mzhyc/ for quick implementation) or, in the case of ad hoc development, at least testing the construct equivalence of the new multiple-choice versions. In addition, as pointed out by our reviewer, it is also possible to imagine construct nonequivalence in the opposite direction. For example, one could argue that the open-ended version of the CRT is better at testing the spontaneous detection of an incorrect intuition, which might be the reason why the CRT predicts the predictive validity tasks. Even though we did not find supportive evidence for this direction of the nonequivalence hypothesis, future research should consider the possibility when further testing construct nonequivalence. Finally, even though we recruited from the general adult population, our sample was relatively well educated; it is still possible that the test format would play a significant role in less educated samples, and future research should address this possibility empirically.

Conclusion

We developed and validated multiple-choice versions (with two and four options) of the three-item (Frederick, 2005) and seven-item (Toplak et al., 2014) Cognitive Reflection Test. We have shown that the response format did not affect the performance, predictive patterns, or psychometric properties of the test. Prior research has used various response formats while assuming the construct equivalence of these tests; our findings are aligned with this assumption. We recommend the use of the four-option multiple-choice version of the test in future correlational and experimental research, because it saves time and eliminates coding errors without losing predictive power.