The formation and revision of intuitions

This paper presents 59 new studies (N = 72,310) which focus primarily on the "bat and ball problem." It documents our attempts to understand the determinants of the erroneous intuition, our exploration of ways to stimulate reflection, and our discovery that the erroneous intuition often survives whatever further reflection can be induced. Our investigation helps inform conceptions of dual process models, as "system 1" processes often appear to override or corrupt "system 2" processes. Many choose to uphold their intuition, even when directly confronted with simple arithmetic that contradicts it - especially if the intuition is approximately correct.

Mental operations range from rapid, effortless, perceptual impressions (recognizing a face) to more deliberate computations that one must choose to execute (algebra). Sometimes, operations that are effortful initially (eleven minus three) become nearly automatic later.
Research examining the ease or difficulty of mental operations often goes under the label of dual systems or dual processes. In the framework advanced by Kahneman and Frederick (2002, 2005), a fast and intuitive system proposes initial answers which a slower, more reflective system scrutinizes and then accepts, rejects, or revises. Others have produced similar frameworks (Sloman, 1996; Stanovich & West, 2000).
We assume that exposure to any stimulus will initiate at least one cognitive process, which takes some time to execute. The output of that process may be sufficient to permit a response. In other cases, the stimulus evokes a second cognitive process which may be initiated concurrently with the first, somewhat later (as shown in Fig. 1), or after the first process has yielded some output needed to initiate a subsequent operation. If a second process is initiated, we assume only that it concludes after the first, and that its output may either affirm or compete with the output of the first process for control of the overt response.
Fig. 1. Generic dual process model
In the most discussed examples in dual process research, subsequent considerations conflict with an initial impression. Consider "Linda," who Tversky and Kahneman (1983) described as having opposed nuclear power and taken an interest in issues of discrimination. Subjects must decide whether she is more likely to be:
(a) a bank teller
(b) a bank teller who is active in the feminist movement
Given Linda's description, most readily conceptualize her as a feminist, and are eager to express that inference. However, those who scrutinize this intuition may recognize that the set of bank tellers encompasses feminist bank tellers, and thereby change their answer from (b) to (a). We'd consider that to be the reflective response, as it is emitted later and associated with superior reasoning on other tasks. Analogously, consider the question below:
Were the 9/11 hijackers cowards? Yes No
The intuitive response here is "Yes," because most are eager to attach a negative label to a negatively evaluated target. Once again, however, those who think "harder" may question the suitability of that label. Accordingly, "No" responses are produced more slowly and also betoken superior reasoning abilities (see Appendix A).
In the questions above, the intuitions (feminists care about discrimination, hijackers are bad) exist apart from the stimulus. In other cases, the intuition may emerge from an operation on elements within the stimulus. Consider the "bat & ball" problem below.²
A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball.

How much does the ball cost? ____ cents
The first sentence references two objects and specifies their sum ($1.10). The second includes the words "more than," inviting respondents to subtract something from that sum, with the only remaining number ($1.00) providing an attractive candidate. The subtraction yields a 10-cent ball, which is the modal response. With a 10-cent ball, the problem's two requirements cannot be mutually satisfied. If the two prices sum to $1.10, the bat must cost $1.00 (which is only 90 cents more than a 10-cent ball). If the two prices differ by $1.00, the bat itself must cost $1.10 (and the two prices would sum to $1.20).
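The arithmetic in the preceding paragraph can be checked mechanically. The following is a minimal sketch, not part of the original studies; the function names `solve_ball` and `satisfies_constraints` are hypothetical, and prices are held in integer cents to avoid rounding issues.

```python
# Prices are kept in integer cents to avoid floating-point rounding.
TOTAL_CENTS = 110       # "A bat and a ball cost $1.10 in total."
DIFFERENCE_CENTS = 100  # "The bat costs $1.00 more than the ball."


def solve_ball(total: int, difference: int) -> int:
    """Ball price when ball + bat == total and bat - ball == difference."""
    # ball + (ball + difference) = total  =>  ball = (total - difference) / 2
    # Integer division is exact here because total - difference is even.
    return (total - difference) // 2


def satisfies_constraints(ball: int, bat: int, total: int, difference: int) -> tuple[bool, bool]:
    """Return (sum constraint met, difference constraint met)."""
    return (ball + bat == total, bat - ball == difference)


ball = solve_ball(TOTAL_CENTS, DIFFERENCE_CENTS)
bat = ball + DIFFERENCE_CENTS
print(ball, bat)  # 5 105
print(satisfies_constraints(ball, bat, TOTAL_CENTS, DIFFERENCE_CENTS))  # (True, True)

# The intuitive pair: a 10-cent ball and a $1.00 bat.
print(satisfies_constraints(10, 100, TOTAL_CENTS, DIFFERENCE_CENTS))  # (True, False)

# A 10-cent ball with a bat that really is $1.00 more expensive.
print(satisfies_constraints(10, 110, TOTAL_CENTS, DIFFERENCE_CENTS))  # (False, True)
```

Only the 5-cent ball (with a $1.05 bat) meets both requirements; a 10-cent ball can meet one or the other, but never both.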
The bat and ball problem is often used to illustrate a dual process model of cognition, in which a fast "system" provides tentative answers that a slower "system" inspects and (only) sometimes revises. This conception is supported by observations that the intuitive answer is (a) initially considered by many who ultimately respond correctly (Frederick, 2005; Szaszi, Szollosi, Palfi, & Aczel, 2017; Travers, Rolison, & Feeney, 2016), (b) more common under mnemonic load or time constraints (Borghans, Meijers, & Ter Weel, 2008; Johnson, Tubau, & De Neys, 2016) and (c) produced more quickly than the correct answer, despite the superior computational abilities of those who respond correctly (see Appendix B). Kahneman and Frederick (2002, 2005) proposed that judgmental errors raise questions about both the production of erroneous intuitions ("System 1" questions) and the factors that facilitate or inhibit their detection and revision ("System 2" questions). We will retain this schema to organize our discussion, but will also raise objections to it along the way.
We first show that there are multiple routes to the 'intuitive' response, as distinctions can be made even among those who say 10; some do so because they misread the question, whereas others appear to just subtract the smaller number from the larger one, with little regard for the words in which those numbers are embedded. Such results are not easily situated within dual-system frameworks, which usually emphasize differences in the degree to which intuitions are scrutinized, but neglect differences in the sophistication or complexity of operations that led to the so-called intuition.
The second part of our paper further chafes dual system models. In contrast with the common assumption that reflection will reject faulty intuitions, we find that intuitions typically survive whatever kinds of reflection we can induce. Executing the intuitive operation appears to inhibit or impair the reasoning processes needed to detect the error. For example, upon concluding that the bat costs $1.00 and the ball costs $0.10, many respondents will explicitly affirm that those two values differ by $1.00.
Although our attempts to induce reflection only slightly raised performance for the standard problem, we later show that performance improves dramatically for variants of the problem in which the heuristic operation yields results that more radically violate the stipulated constraints. We propose the notion of an "approximate checker" which pardons small errors but not large ones.
We conclude by situating these findings within a broader discussion of dual system theories and propose ways of distinguishing the willingness to think from the ability to think.

Forming intuitions
Kahneman and Frederick (2002, 2005) used the term attribute substitution to describe situations in which respondents unwittingly answer a simpler version of the question they encounter. With that concept in mind, consider a "lite" version of the bat and ball problem.
A bat and a ball cost $1.10 in total. The bat costs $1.00.

How much does the ball cost? ____ cents
Of course, 10 cents is the correct answer to this question. If you find yourself stopping here to ponder how it differs from the original, you can appreciate how readily it might be substituted.³ Further evidence for this substitution is revealed when respondents must specify the price of both items.
A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball.
How much does the ball cost? ______ How much does the bat cost? ______
Among 196 MTurk participants, 121 made the common error, concluding that the ball cost 10 cents,⁴ and all but two of them concluded that the bat cost $1.00. In other words, nearly everyone who missed the problem generated prices which satisfied its first constraint (summing to $1.10) but violated its second (differing by $1.00). This result suggests that respondents were substituting the simpler "lite" version (where the bat costs $1.00) for the actual question (where the bat costs $1.00 more than the ball).
2 This is the best known of the three items comprising the "Cognitive Reflection Test" or CRT proposed by Frederick (2005).
3 The lite version is not only easier, but more typical. In his survey of mathematics textbooks, Mayer (1981) found that word problems which assign values to variables (as in the lite version) are six times more common than those which assign values to relations (as in the standard problem). Thus, much as people are prone to apply a solution strategy from a preceding problem to a subsequent problem (Luchins, 1942), they are more likely to apply solution strategies from problems they encounter more frequently.
4 Widespread exposure to this problem on Amazon's Mechanical Turk is a well-known issue. Although repeated exposure has surprisingly little impact on the item's predictive validity (Bialek & Pennycook, 2018; Meyer, Zhou, & Frederick, 2018; Stagnaro, Pennycook, & Rand, 2018), it could still affect response processes. Accordingly, for all of our MTurk studies (and some of our other studies), we asked participants whether they had seen the problem before and excluded those who said they had. Thus, the referenced Ns in the paper refer to the subset of participants who were plausibly seeing the item for the first time. Appendix C reports demographics for all studies reported in the main text.
A. Meyer and S. Frederick
We further tested the hypothesized substitution by asking 615 MTurkers to reproduce the problem from memory. Among those who solved the problem, nobody misremembered it as the lite variant, but among those who made the 10-cent error, 23% did so. Although this is some evidence for the posited substitution, 61% of those who said 10 cents could recall the words that their answer implies they neglected: "more than the ball." (See Appendix D and Hoover & Healy, 2019.) Moreover, we found no effect of emphasizing the "neglected" detail, such as by bolding the words more than the ball. (We discuss these studies in Appendix E, though see Hoover & Healy, 2019; Mata, 2020; Mata, Ferreira, & Sherman, 2013, who each found effects using comparable manipulations on smaller samples.)
We initially regarded the posited substitution as the thoughtless error, but later learned that many respondents follow an even simpler strategy: subtracting the smaller number from the larger one. In the standard problem, these two strategies yield the same answer (10), but if one instead asks about the price of the bat (as below) substitution would yield the answer 100, whereas subtraction would yield the answer 10.⁵
A bat and a ball cost $110 in total. The bat costs $100 more than the ball.
How much does the bat cost? ______
Among 1001 respondents on Google Consumer Surveys (hereafter GCS) who answered the "bat price" problem, the $10 "subtraction" response was nearly as common as the $100 "substitution" response (31% vs. 34%), was emitted much faster (23 s vs. 36 s),⁶ and was associated with even shallower reasoning in other tasks (see Appendix F).⁷ This very simple and very fast subtraction "strategy" is also evident in the "lite difference" variant below:
A bat and a ball cost $110 in total. The bat costs $100.
What is the difference in price between the bat and the ball? ______
Here, there is no opportunity to misinterpret the second number as the price of the bat, because it is the price of the bat. However, the second number can still be subtracted from the first, and many do that: among 1032 GCS respondents, 56% answered $10.⁸
These results create issues for dual process theories, by suggesting three "types" or "levels" of thought: (1) a super-fast subtraction strategy in which the smaller number is subtracted from the larger one, (2) a medium speed strategy involving the unwitting substitution of a similar, but simpler, question, and (3) a slow strategy of generating values that actually satisfy both of the stipulated constraints. This ternary classification may be accommodated by the dual system nomenclature if one regards the "two" so called "systems" as endpoints on some thought continuum, but such results still raise questions about the criteria used to position responses (or people) on that continuum. Should thinking "styles" or "levels" be characterized in terms of the overt response, own reaction time, average reaction time of others who produced that response, or from evidence that an initial thought was overridden (even if replaced by a new thought that was also incorrect)?
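The three posited strategies can be written out to show where their answers coincide and where they diverge. This is an illustrative sketch only; the function names (`subtract_numbers`, `substitute_lite`, `solve`) are hypothetical labels for the strategies described above, not procedures taken from the studies.

```python
TOTAL = 110   # dollars: "A bat and a ball cost $110 in total."
STATED = 100  # dollars: the second number in the problem


def subtract_numbers(total: int, stated: int) -> int:
    """Subtract the smaller stated number from the larger, ignoring the wording."""
    return max(total, stated) - min(total, stated)


def substitute_lite(total: int, stated: int) -> dict:
    """Read "costs $100 more than the ball" as "costs $100" (the lite reading)."""
    return {"bat": stated, "ball": total - stated}


def solve(total: int, difference: int) -> dict:
    """Prices that satisfy both the sum and the difference constraints."""
    ball = (total - difference) / 2
    return {"bat": ball + difference, "ball": ball}


# Standard problem (asks for the ball): the two shortcuts coincide.
print(subtract_numbers(TOTAL, STATED),
      substitute_lite(TOTAL, STATED)["ball"],
      solve(TOTAL, STATED)["ball"])  # 10 10 5.0

# Bat price problem (asks for the bat): the shortcuts come apart.
print(subtract_numbers(TOTAL, STATED),
      substitute_lite(TOTAL, STATED)["bat"],
      solve(TOTAL, STATED)["bat"])  # 10 100 105.0

# Lite difference problem: the bat really costs $100, so the ball costs $10
# and the true difference is $90; subtraction still returns 10.
lite_ball = TOTAL - STATED
print(subtract_numbers(TOTAL, STATED), STATED - lite_ball)  # 10 90
```

Asking for the bat rather than the ball is what separates the subtraction response ($10) from the substitution response ($100), which the standard problem cannot do.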

Maintaining intuitions
Whether resulting from subtraction or substitution, respondents are highly and often maximally confident that their $0.10 response is correct (see Appendix G). Why does the error remain hidden in plain sight? The constraint that the two prices differ by $1.00 is one of just three sentences, clearly stated, and sometimes even emphasized. Moreover, verifying the intuitive response requires nothing more than adding $1.00 and $0.10 to ensure that they sum to $1.10 (they do) and subtracting $0.10 from $1.00 to ensure that they differ by $1.00 (they don't). Since essentially everyone can perform these verification tests, the high error rate means that they aren't being performed or that respondents are drawing the wrong conclusion despite performing them.
If respondents aren't attempting to verify their answer, encouraging them to do so may help. We tested this in five studies involving a total of 3219 participants who were randomly assigned to either a control condition or to one of four warning conditions shown below. Two studies were administered to students who used paper and pencil. The rest were web-based surveys of a broader population.⁹

SIMPLE WARNING
Be careful! Many people miss this problem.

CONSTRAINT WARNING
The warnings improved performance, but not by much (see Table 1). This suggests that they failed to engage a checking process, or that the checking process was insufficient to remedy the error.¹⁰ Others find similarly modest effects of asking respondents to reflect on initial responses, for the bat and ball problem (Bago & De Neys, 2019) and for other reasoning tasks (Lawson, Larrick, & Soll, 2020; Thompson, Turner, & Pennycook, 2011).
5 To avoid the need to recode decimal errors, here we specified the prices as dollars rather than cents. We use both versions of the problem throughout this paper. Solution rates are about the same.
6 Unless otherwise specified, response times are geometric means.
7 The bat price problem has the advantage of partitioning subjects into three "tiers" of reasoning, rather than two. A control condition (N = 1009) revealed that it is solved at the same rate as the standard problem (20% vs. 19%).
8 The lite difference problem is solved more often (32%) than the bat price problem (20%) or standard problem (19%), and more quickly (33 s vs. 55 s and 48 s). However, for all three problems, the $10 error is comparably fast (23 s, 23 s, and 25 s, respectively).
9 Here, and in the next study, the problem was sometimes presented by itself and sometimes as the first item in the 3-item CRT (Frederick, 2005). In the Constraint Warning condition, the problem's first sentence "A bat and a ball cost $1.10 in total." was printed in red; and its second "The bat costs $1.00 more than the ball." was printed in blue.
10 Although warnings have only modest effects on solution rates, they do increase time spent on the problem. We presume that this extra time was spent engaged in mental activity related to the problem, which one might reasonably call "checking." Nevertheless, as discussed in Appendix G, these checks not only failed to markedly improve performance, they also failed to reduce confidence in the erroneous intuition.
Since these warnings were ineffective, we next tried an even stronger manipulation by telling respondents that 10 cents is not the answer. We conducted eight such experiments, with a total of 7766 participants. In five studies (three online and two paper and pencil), participants were randomly assigned to either the control condition or to a Hint condition in which the words "HINT: 10 cents is not the answer" appeared next to the response blank.
A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball.
How much does the ball cost? $_____ HINT: 10 cents is not the answer.
In three other studies (two online and one in-lab), we used a within-participant design in which the Hint was provided after the participant's initial response. In those studies, respondents could revise their initial (unhinted) response, and we recorded both their initial and final responses. The results of all eight studies are shown below in Table 2.
The hint that the answer wasn't 10 cents helped substantially, but, more notably, many (and sometimes most) still failed to solve the problem.¹¹ Though the bat and ball problem is often used to categorize people as reflective (those who say 5) or intuitive (those who say 10), these results suggest that the "intuitive" group can, and should, be further divided into the "careless" (who answer 10, but revise to 5 when told they are wrong) and the "hopeless" (who are unable or unwilling to compute the correct response, even when told that 10 is not the answer). The careless fit neatly into the dual process framework, but the hopeless do not, and they create problems for those using this item as a measure of reflection in non-elite populations. If the very thing that reflection selectively provides to those who have enough of it (the realization that the answer cannot be 10) is provided to all by telling them that the answer is not 10, and responses still vary, the problem must also be measuring other things.¹² As shown in Fig. M of Appendix M, the relative size of these three groups depends on the cognitive abilities of the populations being tested: Though many highly intelligent people get this problem wrong, nearly all of them are careless, whereas many others cannot solve it, even after being alerted to the common error, suggesting that it requires more effort than they are willing to expend or greater abilities than they possess. A recent study (Enke et al., 2021) implies the latter, as offering participants a full month's salary for solving the problem only modestly increased solution rates (from 35% to 48%).
Manipulations intended to raise solution rates may fail to do so in part because the operations producing the intuition disrupt or degrade the execution of subsequent operations needed to detect the error. Consider the results of the following study, in which 2010 GCS respondents were randomly assigned to one of two conditions.

MINUEND ABSENT
A bat costs $100 more than a ball.
If you said the bat cost $100 and the ball costs $10, would your prices be correct? YES NO
(18% answered YES)

MINUEND PRESENT
A bat and a ball cost $110 in total.
The bat costs $100 more than the ball.
If you said the bat cost $100 and the ball cost $10, would your prices be correct? YES NO
(53% answered YES)

When the heuristic operation is encouraged by supplying the 110 minuend (from which 100 might be subtracted) many more respondents erroneously affirm $100 and $10 as the correct prices, even though both conditions clearly stipulate that the bat costs $100 more, and even though rejecting that pair of prices requires no further mental effort: no need to determine the cost of either object.
A subsequent study reveals that once the outputs of the intuitive operation are expressed, they become even more recalcitrant to requests for further scrutiny. In that study, 124 passengers on a commuter ferry between Connecticut and Long Island were either told that a bat cost $1.00 and a ball cost $0.10, or were presented with the standard question, which required them to generate prices for each object. We then asked all participants whether the pair of prices (which they had either been provided with or generated) differed by $1.00. Among those provided with a $1.00 bat and $0.10 ball, only 6% said "Yes," whereas 76% of those who had generated those same two prices did so.¹³ Once again, the heuristic operation (subtraction) appears to disrupt or degrade subsequent operations involving its output. (See Appendices H and I for further data and discussion.)
Table note: Main script indicates percent correct. Subscript indicates seconds to respond. a A special thank you here to Bob Spunt, who helped design this study and collected these data.
11 These effects are much larger than those of similar hints administered after participants have already attempted multiple variants of the problem (Janssen, Raoelison, & de Neys, 2020), but somewhat smaller than removing the 10-cent lure from a set of response options which include the correct answer (Patel, Baker, & Scherer, 2019).
12 An examination of Table 2's subscripts reveals that some of the "hopeless" may be better described as stubborn: they maintain their 10-cent response despite the hint that that answer is wrong. In our three within-subject studies, some participants seem to have assumed that we were repudiating the form of their 10-cent response rather than its content, as they modified their response from one form of 10 cents to another, such as rewriting a decimal response (0.1) as a whole number (10).
13 Among the 57 participants in the Generated condition, 41 (or 72%) entered the two intuitive prices. Another 12 participants gave the correct pair of prices ($0.05 and $1.05) and all of them affirmed the $1.00 difference.

PROVIDED PRICES
A bat costs $1.00 and a ball costs $0.10.
With those prices, does the bat cost $1.00 more than the ball? YES NO
(6% answered YES)

GENERATED PRICES
A bat and a ball cost $1.10 in total.
The bat costs $1.00 more than the ball.
How much does the ball cost? How much does the bat cost?
Is your "bat" answer $1.00 more than your "ball" answer? YES NO
(76% answered YES)

In stark contrast to the account that the 10-cent error indicates an unwillingness to check (Frederick, 2005; Kahneman & Frederick, 2002, 2005), the error survives at least some cursory version of the checking process that ought to expose it. Even when their attention is directed to the constraint specifying that prices differ by $1.00, most respondents nevertheless maintain that their $1.00 and $0.10 responses satisfy that constraint. This result has hallmarks of simultaneous contradictory belief (Sloman, 1996), because respondents who report that $1.00 and $0.10 differ by $1.00 obviously do not actually believe this. It is also akin to research on Wason's four card task showing that participants will rationalize their faulty selections, rather than change them (Beattie & Baron, 1988; Wason & Evans, 1974). It could also be considered as an Einstellung effect (Luchins, 1942), in which prior operations blind respondents to an important feature of the current task, or as an illustration of confirmation bias, in which initial erroneous interpretations interfere with the processes needed to arrive at a correct interpretation (Bruner & Potter, 1964; Nickerson, 1998).
The durability of the intuition is further evidenced by the types of manipulations researchers have resorted to, such as providing respondents with arguments for why $5 is correct (Trouche, Sander, & Mercier, 2014) or testing whether classrooms of university students asked to discuss the problem can reach a correct consensus (Claidière, Trouche, & Mercier, 2017 showed, reassuringly, that they can).
Similarly, as shown below, we ran two studies on GCS in which we asked respondents to either consider the correct answer (N = 2002) or to simply enter it (N = 1001).

Consider $5
A bat and a ball cost $110 in total. The bat costs $100 more than the ball.
How much does the ball cost? Before responding, consider whether the answer could be $5. $_____

Enter $5
A bat and a ball cost $110 in total.
The bat costs $100 more than the ball.

How much does the ball cost?
The answer is $5. Please enter the number 5 in the blank below.

$_____
Asking respondents to consider the correct answer more than doubled solution rates, but only to 31%. Asking them to simply enter the correct answer worked better, as 77% did so, though, notably, the intuitive response emerged even here. See Appendix J for further data.
Of course, interpreting the results from such extreme manipulations as "solution rates" obviously distorts what it means to "solve" a problem.
Moreover, the very existence of such manipulations (and their lack of complete efficacy) undermines a conclusion many draw from dual process theories of reasoning: that judgmental errors can be avoided merely by getting respondents to slow down and think harder.¹⁴

Revising intuitions
Historically, the bat and ball problem has been used to illustrate the laziness or inefficacy of corrective operations (Kahneman & Frederick, 2002, 2005). We endorse this view, but further suggest that the presence of the erroneous intuition prevents the correct conclusion from being drawn even when checks are attempted. An explicit affirmation that $1.00 and $0.10 differ by $1.00 suggests that the erroneous intuition corrupts any subsequent operations; it is not merely acceded to (see Risen, 2016), but endorsed.
There are, however, limits to this endorsement. Much as people can more quickly distinguish numbers that are further apart (Moyer & Landauer, 1967), their endorsement of the intuitive operation diminishes when it yields values that more strongly violate the problem's requirements. To illustrate this, we randomly assigned 10,044 GCS participants to one of ten conditions. The problem always began as usual: "A bat and a ball cost $110 in total…" but we varied the number specified in the second sentence: "The bat costs [X] more than the ball," with values ranging from $100 to $10. As the specified difference gets smaller, the intuitive operation (subtracting the difference from the total) yields values that more radically violate the problem's requirements. For example, if the specified difference is $40, the intuitive operation ($110 minus $40) yields a $70 ball, which would be more than half of the $110 total it is supposed to share with a more expensive item.
For each of those ten conditions, Fig. 2 shows the fraction who Solve the problem, who Subtract the difference from the total (the posited intuitive operation), or who give some Other answer. As the price difference between the bat and ball decreases, participants slow down (see Appendix K) and solution rates rise markedly, from 14% to 57%.¹⁵ In other words, the results of the posited intuitive operation (subtraction) appear to receive greater scrutiny when yielding values that more radically violate the problem's requirements.¹⁶
14 Of course, subjects can learn to solve the problem, with the benefit of instruction (Boissin, Caparos, Raoelison, & De Neys, 2021; Hoover & Healy, 2017) or exposure to multiple variants in succession (Raoelison et al., 2021; Raoelison & De Neys, 2019).
15 Though not shown in Figure 2, we had an eleventh and twelfth condition in which price differences were $34 and $54 (total N = 2050). Contradicting the notion of "desirable difficulties" (Bjork, 1994; Alter et al., 2007; Mastrogiorgio and Petracca, 2014), performance was worse (26% and 24% correct) than the nearest conditions with round numbers. For further discussion of disfluency effects on solution rates, see Meyer et al. (2015) or Lawson, Larrick, and Soll (2022).
16 We are not the first to observe effects of the numbers specified on the problem's solution rate, and these other results are also consistent with our "approximate checker" hypothesis. Frederick (2005) found improved performance when the items summed to 37 cents and differed by 13 cents. Baron et al. (2015) found dramatically improved performance when the two objects summed to $5.50 and differed by $1.00. Silva (2005) found that just 8% of his sample could solve the standard problem, whereas 93% could do so when the two objects summed to 3 cents and differed by 1 cent. However, unlike the aforementioned manipulations, Silva's result is probably not explained by our approximate checking hypothesis, since, with these values, subtracting the smaller number from the larger one actually yields the price of both objects (2 & 1) and the problem stipulates that the ball is the cheaper of the two.
To illustrate our "approximate checker" hypothesis, suppose that someone charging $39 an hour worked 37 hours and submitted an invoice for $1513. You'd probably not scrutinize it, since it is close to, and less than, $1600 (i.e., 40 × 40). It may be adaptive to assume that plausible answers are correct, and it follows that people will have greater difficulty solving problems when the intuitive error is approximately correct.
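To make concrete how the size of the violation grows as the stated difference shrinks, the sketch below tabulates the subtraction answer against the correct answer. It is illustrative only: the text says the ten differences ranged from $100 to $10, and the $10 steps used here are an assumption, as are the helper names `intuitive_pair` and `difference_shortfall`.

```python
TOTAL = 110  # dollars, as in "A bat and a ball cost $110 in total"


def intuitive_pair(total: int, difference: int) -> tuple[int, int]:
    """Prices implied by subtracting the stated difference from the total."""
    ball = total - difference  # the posited intuitive operation
    bat = total - ball         # whatever is left of the total
    return ball, bat


def difference_shortfall(total: int, difference: int) -> int:
    """How far the intuitive pair falls short of the stated price difference."""
    ball, bat = intuitive_pair(total, difference)
    return difference - (bat - ball)


# Assumed grid: ten differences from $100 down to $10 in $10 steps.
# Columns: stated difference, subtraction answer, correct ball price, shortfall.
for stated in range(100, 0, -10):
    ball, bat = intuitive_pair(TOTAL, stated)
    correct_ball = (TOTAL - stated) / 2
    print(stated, ball, correct_ball, difference_shortfall(TOTAL, stated))

# The invoice example: a rough bound versus the exact product.
rough_bound = 40 * 40  # 1600
exact = 39 * 37        # 1443
invoice = 1513
print(invoice < rough_bound, invoice - exact)  # True 70
```

With a stated difference of $100 the intuitive pair misses the difference constraint by $10; with $40 it misses by $70. The invoice figure likewise passes the rough 40 × 40 bound even though it differs from the exact product.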
The results of the prior study dovetail with other dual-process research on conflict detection, which finds, among other things, that base-rates are more likely to be incorporated into judgments if they are sufficiently extreme. For instance, if told that Bill likes carpentry, judgments of the likelihood that he is an engineer (vs. a lawyer) often neglect whether the relevant population is mostly engineers or mostly lawyers (Kahneman & Tversky, 1973). However, if the base rates are made sufficiently disparate (if only 5 of 1000 people in the sample are engineers) respondents do consider them; they notice (and must then resolve) the conflict between the disparate base rates and their assumption that engineers are more inclined towards carpentry (De Neys & Glumicic, 2008; Pennycook, Fugelsang, & Koehler, 2015).
In the context of the bat and ball problem, the cued operation (subtracting the smaller number from the larger one) only creates a conflict if respondents notice the constraints "hidden" in the problem stem. Respondents' superior performance when the cued operation yields an answer that more radically violates the problem constraints suggests that those constraints were never completely hidden (otherwise, the degree of violation wouldn't have any effect). Accordingly, although most subjects in our within-subject Hint experiments who initially say 10 cents either maintain that response (n = 877) or revise to 5 cents (n = 731), the sizable remainder who do neither (n = 490) are much more likely to adjust down to nine cents (n = 113) than up to eleven cents (n = 14). We interpret this as further evidence that they maintain some awareness of the constraint that they are still largely neglecting: that the two prices need to differ by $1.00, and, correspondingly, that the correct answer must be less than 10. (See also Bago, Raoelison, & De Neys, 2019.)
Our approximate checker hypothesis suggests that even without being pressed to revise their initial response, a 9-cent ball will feel more correct than an 11-cent ball, because a 9-cent ball (and $1.01 bat) violates the $1.00 difference requirement less than an 11-cent ball (and 99-cent bat). We tested this conjecture in a study involving 1909 GCS respondents who were randomly assigned to one of nine conditions. In each, the standard bat and ball problem was presented along with two response options: the correct answer (5 cents) and an alternative value, X, which varied from 6 cents to 14 cents.¹⁷ Unsurprisingly, performance was much lower in the condition where X was the tempting lure (10) than in the other eight, more curious, conditions in which many respondents were forced to choose between two unintuitive options. But Fig. 3 further reveals that respondents perform worse if the alternative option is closer to 10, and worse for the four conditions with values below 10 (67%) than for the four with values above 10 (76%). Both of these additional results appear consistent with some version of our approximate checker hypothesis, though they remain distinct, as the first suggests scrutiny being withheld from responses that more closely resemble the dominant intuition (10), and the second suggests scrutiny being withheld from responses that more closely satisfy a requirement stipulated in the problem stem (that the bat and ball prices ought to differ by $1.00).
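The claim that a 9-cent ball violates the difference requirement less than an 11-cent ball can be made explicit. The sketch below assumes, as the parenthetical bat prices in the text do, that the bat is priced so the two items still sum to $1.10; the function `difference_violation` is a hypothetical measure introduced here for illustration, not one reported in the studies.

```python
def difference_violation(ball_cents: int, total: int = 110, required: int = 100) -> int:
    """Cents by which (bat - ball) misses the required difference,
    assuming the bat is priced so that the two items still sum to `total`."""
    bat_cents = total - ball_cents
    return abs((bat_cents - ball_cents) - required)


# The nine alternative response options, 6 to 14 cents.
for alternative in range(6, 15):
    print(alternative, difference_violation(alternative))

print(difference_violation(9), difference_violation(11))  # 8 12
print(difference_violation(5), difference_violation(10))  # 0 10
```

On this measure, every alternative below 10 cents misses the $1.00 difference by less than every alternative above 10 cents, which is the sense in which lower values "more closely satisfy" the stipulated requirement.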
Of course, positing that the quality of an intuition is intuitively appraised is awkward, since it suggests that the intuitive "system" is checking itself. Yet the foregoing data do seem to argue against a fully deliberate checking process (in which violations of any degree would be equally wrong). Further, they appear consistent with the finding by Johnson et al. (2016) that, even under mnemonic load, respondents are less confident in their 10-cent responses to the standard problem than in their 10-cent responses to the lite variant.¹⁸ (See Appendix K for further data and discussion of the approximate checker hypothesis.)

¹⁷ This study was inspired by an episode of the game show "Who Wants to be a Millionaire?" which aired on November 10th, 2014. In that episode, the contestant (Erin LaVoie) correctly answered her first round question (regarding the meaning of the word "contusion") and then received the following question in round 2: "Try this tricky math question that stumps many Ivy Leaguers: A bat and ball cost $1.10. The bat costs $1 more than the ball. How much does the ball cost?" (A) $0.30 (B) $0.20 (C) $0.15 (D) $0.05. Curiously, this set of response options omitted (or excluded) the typical error ($0.10). With no obvious answer present, Erin first used her "Plus One" lifeline (in which a friend in the audience joins her to offer assistance), but after receiving insufficient help, then decided to use her "Jump the Question" lifeline to skip the question, forgoing her payoff from answering correctly ($5000), but removing her risk of answering incorrectly (which ends the game).

General discussion
The title of a recent best-seller, Thinking: Fast and Slow, reflects the view that different types of cognitive processes can and should be distinguished (Kahneman, 2011). Others question the value of such an endeavor (Keren, 2013; Keren & Schul, 2009; Kruglanski & Gigerenzer, 2011; Melnikoff & Bargh, 2018). The debate surrounding dual system theories is energized partly by differences in use of the term theory, whose meaning can range from a preliminary notion to a precisely stated and falsifiable hypothesis. Many dual system theorists regard their "theories" as provisional frameworks that help characterize and organize distinctions they find important, whereas critics may demand the sort of precision that would permit a decisive test of whether cognition has one system or two.¹⁹

Deciding how many mental 'systems' to enumerate depends on the sorts of distinctions one wishes to emphasize. If contrasted with, say, the digestive system, nearly everyone would attribute all thoughts to a single cognitive "system." But finer distinctions may also be useful, as Shweder (1977) so eloquently expresses:

A useful distinction in the study of human thought is between intuitive and non-intuitive concepts. Concepts can be arranged along a continuum having to do with the relative ease with which they can be attained and in the kinds of learning inputs and environmental orchestration that are required for acquisition and application to occur… More intuitive concepts are acquired even under highly degraded learning conditions… [and] seem to be available without conscious effort or reflection… these concepts seem to be merely "released" by experience… In contrast, nonintuitive concepts require special learning conditions for their acquisition (e.g., massive instructional input, an orderly and explicit organization of learning trials, high motivation, etc.)
In our view, advocacy of dual process theories is typically nothing more (or less) than an endorsement of the possible value of distinguishing the types of thought a stimulus might generate or require; a desire to characterize mental operations in terms of their speed, the amount of attention they demand or consume, their accessibility to introspection, and their difficulty of acquisition. Of course, not every distinction warrants the application of different labels: subjects would solve 7 × 12 faster than they would solve 18 × 27, but a book entitled Thinking: Fast and Slow would not be very compelling if these were the only sorts of data cited in support of the eponymous distinction.
Though the bat and ball problem has often been upheld as emblematic of the dual system framework, it may not be as canonical as its frequent citation suggests. First, the posited heuristic operation (subtraction) is, itself, a "rule-based manipulation of symbols," which is ordinarily ascribed to "System 2" (Sloman, 1996, p. 4). Second, this operation is sensitive to mnemonic load (DeStefano & LeFevre, 2004), which is sometimes taken as the defining feature of "Type 2" processes (Evans & Stanovich, 2013). Third, although intuitions are often defined by their lack of introspective access, those who miss this question know exactly how they arrived at their answer ($1.10 minus $1.00 equals $0.10); what they have trouble understanding is why that operation is inappropriate here.²⁰ Bago and De Neys (2019) further suggest that the 5-cent solution may not require deliberation (though we are skeptical of this claim, as discussed in Appendix L).
Further complications with dual system models arise when intuitions override reflection. Consider a version of the classic Monty Hall problem offered by Margolis (1987):

Two Queens and a King are taken from a deck of playing cards, placed face down and shuffled. If you select the King, you win a prize. You first point to a card. The dealer then checks the two remaining cards and turns over a Queen. You may either keep the card you first pointed to or select the other card that remains face down. Is there any advantage to switching?
Since there are two ways to lose but just one way to win, you will be pointing to a losing card two out of three times.In both of those two cases, the remaining card that the dealer has not turned over will be the King.Thus, if you switch, you'll double your chance of winning: from 1 in 3 to 2 in 3.
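The counting argument above can be checked by enumerating the three equally likely first picks. This is a minimal sketch; the card ordering and the function name are arbitrary choices made for illustration.

```python
from fractions import Fraction

CARDS = ("King", "Queen", "Queen")  # positions 0, 1, 2 after shuffling


def win_probabilities():
    """Enumerate each possible first pick; return (P(win by staying), P(win by switching))."""
    stay_wins = 0
    switch_wins = 0
    for first_pick in range(len(CARDS)):
        others = [i for i in range(len(CARDS)) if i != first_pick]
        # The dealer turns over a Queen from the two remaining cards.
        # (If both are Queens, which one is revealed does not affect the outcome.)
        revealed = next(i for i in others if CARDS[i] == "Queen")
        remaining = next(i for i in others if i != revealed)
        stay_wins += CARDS[first_pick] == "King"
        switch_wins += CARDS[remaining] == "King"
    n = len(CARDS)
    return Fraction(stay_wins, n), Fraction(switch_wins, n)


stay, switch = win_probabilities()
print(f"stay: {stay}, switch: {switch}")
```

Staying wins only when the first pick was the King (1 case in 3); switching wins in the other two cases.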
Though few can offer any sensible rebuttal to this logic, it does not typically unseat the dominant intuition. Many conclude that they are missing something: that there's been some sleight of hand (Margolis, 1987). This turns the usual dual-process story on its head. If people remain incredulous following exposure to logic they cannot rebut, System 1 is effectively checking and overriding System 2. We find something similar with the bat and ball problem, as respondents seem to maintain a belief in a 10-cent ball (and $1.00 bat), despite having had their attention directed to the requirement that those two prices must differ by $1.00.
Intuitive answers may be even more influential for problems lacking any promise of an algorithmic solution. Consider our "minor injuries" problem, below:

The Department of Transportation is deciding between two different roadway designs. These are associated with different types of auto accidents, and, consequently, with different rates of serious injuries and minor injuries. Please enter the number of minor injuries that would make the two designs equivalent, all things considered.

[Table columns: Serious injuries, Minor injuries]

[...] 1016). Though all are absurd upon reflection (as all imply that minor injuries are as bad or worse than serious ones), their presence, coupled with the absence of any obvious alternative solution strategy, makes this problem difficult. Indeed, Meyer et al. (2023) find that fewer than one in fifty "solve" it (respond with a number above 1016). Moreover, this problem may more closely resemble those we commonly confront, which lack established algorithms that might be used to check or override an intuition. Thus, the difficulty of inducing respondents to reflect on their 10-cent ball answer may understate the difficulty of inducing reflection more generally.

¹⁸ Our proposal that checking may actually be as intuitive as production of the intuition itself concords with research summarized by De Neys and Bonnefon (2013), who find that those who succumb to the intuitive errors on classic heuristics and biases problems (such as Linda) are less confident than those who perform similar operations to correctly solve easier variants (De Neys, Rossi, & Houdé, 2013), take longer to respond (De Neys & Glumicic, 2008), show greater autonomic activation (De Neys, Moyens, & Vansteenwegen, 2010), and greater activation in brain regions supposed to mediate conflict detection (De Neys, Vartanian, & Goel, 2008; Simon, Lubin, Houdé, & De Neys, 2015).

¹⁹ When a skeptic challenged J.B.S. Haldane to explain how evolutionary theory could be falsified, he famously shot back "Fossilized rabbits in the Precambrian." (None have yet been found.) It is difficult to imagine a dual-system theorist producing a comparably pithy answer to a similar challenge.

²⁰ Further, the modest arithmetic abilities required to generate the intuition (i.e., subtraction) presumably correlate positively with the more demanding abilities required to solve the problem, thereby failing the "stochastic independence" criterion proposed by Tulving (1985), which is honored by those who propose distinct systems in the context of vision (Weiskrantz, 2009; Weiskrantz et al., 1974) or memory (Tulving, Schacter, & Stark, 1982).
Although we've focused here on trying to understand why people typically miss the bat and ball problem rather than why their failure or success predicts other traits, the two issues are obviously related. We've proposed that performance on this item (and other items intended to measure cognitive reflection) is determined by the ability to detect and reject the erroneous intuition and by the ability to solve the problem once the error is detected. To help distinguish these two abilities, in some of our studies, participants first responded and were then told that the answer is not 10. As noted earlier, this "hinted" procedure serves to partition respondents into three groups: the reflective (who reject the common intuitive error and solve the problem on the first try), the careless (who answer 10, but revise to 5 when told they are wrong), and the hopeless (who are unable or unwilling to compute the correct response, even after being told that 10 is incorrect).
Expressed or implied claims that items intended to measure cognitive reflection have surplus predictive validity over other "regular" math problems suggest that the ability to suppress an activated intuition is an important cognitive skill distinct from numeracy or mathematical ability (Frederick, 2005). While some have affirmed this claim (Primi, Morsanyi, Chiesi, Donati, & Hamilton, 2016; Shenhav, Rand, & Greene, 2012; Toplak, West, & Stanovich, 2011), others have disputed it (Attali & Bar-Hillel, 2020; Otero, Salgado, & Moscoso, 2022). Our analysis of the hinted procedure does suggest that these are dissociable skills. Specifically, as shown in Table 3, the careless perform nearly as well as the reflective on a subset of Raven's Matrices,²¹ but nearly as poorly as the hopeless on the "Linda" problem.²² This suggests that the bat and ball problem predicts Raven's scores because it requires mathematical ability (which the careless and reflective both possess), but predicts performance on the Linda problem because it also requires the ability to suppress an activated intuition (which the careless and hopeless both lack).
More generally, we predict that items intended to measure cognitive reflection will be superior predictors for tasks which demand (and benefit from) an ability to detect and suppress a dominant intuition (such as Linda and other counterintuitive problems), but will function like "regular" math items for most other tasks whose items generally fail to induce a dominant intuition that must be suppressed (most numeracy tests, GRE math, and, perhaps, Raven's Matrices). Since providing the hint nullifies the importance of detecting the intuitive error, it goes some way to transform the CRT into a "regular" math test and we'd then expect it to function more like those tests. This general claim is supported by Table 3, by Appendix M, and by subsequent work (Meyer et al., 2023).²³

If the bat and ball problem does measure anything distinct from general mental ability, such as willingness to reason carefully, we assume that it does so by permitting those disinclined toward reflection a chance to exit early while still feeling successful. Many other problems share this feature, such as the "XYZ" problem below.
Since the universal intuition is obvious, we presume that nearly everyone who "solves" this item feels some sense of success. However, it offers little opportunity for the reflective to differentiate themselves, as negligibly few will have the motivation or ability to check whether this system of equations mutually entails the intuitive answer.²⁴

Ideally, any item advanced as a measure of cognitive reflection should satisfy four criteria. It should (a) generate an intuition, which (b) requires suppression. Further, it should (c) contain cues to reject that intuition, and (d) allow those who do so to solve the problem.²⁵ The XYZ problem clearly satisfies (a) and (b), but clearly fails (c) and (d). By contrast, the bat and ball problem often satisfies all four criteria. It clearly satisfies (c), as it explicitly states that the two prices must differ by 100. For elite populations, it also satisfies (d). But elsewhere it does not, as many cannot solve the problem even when told that 10 is not the answer.
To the extent that problems satisfy these four criteria, we'd expect wrong answers to be generated more quickly than correct answers and highly concentrated at the putative intuition. Further, we'd expect those who miss such problems will judge them as easier than those who solve them (see Frederick, 2005; Mata et al., 2013). To help illustrate these criteria, consider three logically equivalent variants of a novel item that we call the "smokers" problem.²⁶

²¹ We used items 2, 8, 14, 20, 26, and 34 from Raven's Advanced Progressive Matrices.

²² Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable? Linda is a bank teller. OR Linda is a bank teller and is active in the feminist movement.

atypical answers for the ball's price (210, 105, 50, etc.). Though such responses are often negligibly rare, Pennycook, Cheyne, Koehler, and Fugelsang (2016) note that the CRT has greater predictive validity when it is scored in terms of number correct (5, 5, and 47) than when (reverse) scored as number of intuitive responses (10, 100, and 24). Expressed differently, they find it better to treat atypical answers as incorrect answers (failed to solve it) than as correct answers (managed to avoid intended trap). Some interpret this as evidence against the traditional dual-process interpretation of the CRT, which places emphasis on surmounting the intuition. We aren't fully persuaded by this critique, as these atypical answers could instead reflect submission to some unintended lure or to corruption of subsequent thinking by temporary consideration of the intended lure, even after it has been (partially) dismissed.

²⁴ The correct answer is 25/6.

²⁵ If an item possesses these characteristics, the presence of a correct answer is sufficient evidence that an intuition has been rejected (i.e., that a person is "reflective"), but it is not the only kind of evidence. For instance, suppose Adam said, "I first thought the ball cost $10, but then I realized that can't be right, because then the bat, itself, would cost $110. I left it blank because I just couldn't figure out what the right answer is, but now I can't stop thinking about it." He's clearly reflective (if not especially numerate). So is Beth, who answered $105, which is clearly not a thoughtless error (as revealed by response times and common sense). Carl might also be considered reflective if he answered $10 but indicated very low confidence in his answer. Though all missed the problem, each displayed reflection: they all appeared to notice and care about facts that conflicted with their (likely) intuition. Thus, "requirement" (d) is more pragmatic than essential. To the extent an item satisfies criteria (a) and (b), a correct answer is good evidence that an intuition has been suppressed, but not the only sort of evidence.

²⁶ This problem was inspired by an example provided by Chris Chabris.
#1: If 3 in 30 men smoke and 1 in 30 women smoke, then 1 in ___ people smoke. (48% correct)
#2: If 1 in 10 men smoke and 2 in 60 women smoke, then 1 in ___ people smoke. (9% correct)
#3: If 1 in 10 men smoke and 1 in 30 women smoke, then 1 in ___ people smoke. (7% correct)

Since these variants all require averaging the same two fractions (1/10 and 1/30), they are logically equivalent. But they are psychologically distinct, as revealed by marked differences in responses, response times, and judged difficulty. Variant #1 fails as a measure of reflection because the intuitive operation (averaging numerators) "happens" to yield the correct answer. Variant #2 fails because it evokes no intuition; no simple operation promises to yield a solution. Variant #3 is more promising because it (a) suggests an intuitive operation (averaging denominators) which (b) yields an erroneous solution, such that reflection will be required.²⁷

Fig. 4 plots the responses to these three variants by their response times and judged difficulty. Variant #1 is answered quickly and judged to be easy. Variant #2 is answered slowly and judged to be difficult. Only variant #3 resembles the bat and ball problem, as wrong answers are highly concentrated (almost all 20 or 40) and emitted much more quickly than correct answers. Moreover, those who miss this variant regard it as easier than those who solve it (note that the red 20 and red 40 in the bottom left are well below the red 15 in the top right).
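The equivalence of the three variants, and the lure in variant #3, can be verified with exact fractions. This is a small sketch that assumes equal numbers of men and women (as the averaging described above implies); the function and variable names are illustrative only.

```python
from fractions import Fraction


def one_in_n(men_rate: Fraction, women_rate: Fraction) -> Fraction:
    """Return n such that 1 in n people smoke, assuming equal numbers of men and women."""
    overall_rate = (men_rate + women_rate) / 2   # average of the two smoking rates
    return 1 / overall_rate


# The two rates stated in each variant of the "smokers" problem.
variants = {
    1: (Fraction(3, 30), Fraction(1, 30)),
    2: (Fraction(1, 10), Fraction(2, 60)),
    3: (Fraction(1, 10), Fraction(1, 30)),
}
answers = {k: one_in_n(men, women) for k, (men, women) in variants.items()}

# The intuitive shortcut for variant #3: average the two denominators.
lure = Fraction(10 + 30, 2)

print(answers)   # every variant reduces to the same value
print(lure)      # the erroneous intuitive answer
```

All three variants reduce to 1 in 15, whereas averaging the denominators of variant #3 gives 20, one of the two common wrong answers.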

Concluding remarks
When we began studying the bat and ball problem, we assumed respondents missed it because they didn't bother to check. Accordingly, we assumed that they'd be able to solve it if we directed their attention to the features of the problem that differentiate it from the problem we thought they were unwittingly solving instead (bat and ball "lite") or to the constraint the typical answer violates (that the prices differ by 100).
We discovered instead that many respondents maintain the erroneous response in the face of facts that plainly falsify it, even after their attention has been directed to those facts. Although subjects' apparent sensitivity to the size of the heuristic error merits further research, the remarkable durability of that error paints a more pessimistic picture of human reasoning than we were initially inclined to accept; those whose thoughts most require additional deliberation benefit little from whatever additional deliberation can be induced.

respondents, solution rates for the standard problem were not influenced by question order, but prior presentation of the standard problem lowered performance on the small difference variant.

STANDARD PROBLEM FIRST
A bat and a ball cost $110 in total. The bat costs $100 more than the ball. How much does the ball cost? $____ 19% correct
A bat and a ball cost $110 in total. But this time, the bat costs $10 more than the ball. In this case, how much does the ball cost? $____ 40% correct

SMALL DIFFERENCE FIRST
A bat and a ball cost $110 in total. The bat costs $10 more than the ball. How much does the ball cost? $____ 60% correct
A bat and a ball cost $110 in total. But this time, the bat costs $100 more than the ball. In this case, how much does the ball cost? $____ 17% correct

Another distinct alternate account of the price difference effect is suggested by Trémolière and De Neys (2014). If respondents expect bats to cost substantially more than balls, a smaller difference between item prices will cause the heuristic operation (subtracting the smaller number from the larger one) to yield prices that more strongly violate this expectation. However, we are skeptical of this interpretation, in part from the (non) results of a study we conducted on GCS in which 411 respondents were randomly assigned to one of the two conditions below. Although the semantic content of the Father & Son condition would seem to invalidate the heuristic operation more forcefully (yielding a 20-year-old father with a 30-year-old son), solution rates are nearly unaffected. Accordingly, we doubt that the effects of the price difference on solution rates (see Fig. 2) reflect beliefs about the relative cost of bats and balls, and, further, we doubt that the heuristic error will typically be very sensitive to manipulations of the semantic content. (The prevalence of the subtraction error in the bat price variant discussed on page 3 provides further evidence of the neglect of semantic detail.)

AL & BOB
Varying the difference between item prices is not the only way to manipulate the degree to which the intuitive operation violates the stipulated constraints; that can also be achieved by manipulating the specified total.For example, Baron, Scott, Fincher, and Metz (2015) report that just 38% solved the standard bat and ball problem (in which the two items sum to $1.10 and differ by $1.00), whereas 90% solved a "soup and salad" variant (in which the two items sum to $5.50 and differ by $1.00).
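To make the contrast between these two variants concrete, the sketch below computes the correct answer, the answer produced by the cued subtraction, and the price difference that the subtracted answer would imply. Working in integer cents, the function names, and the framing of the answer as "the cheaper item" are assumptions made for illustration.

```python
def solve(total_cents: int, diff_cents: int) -> int:
    """Price of the cheaper item: (total - difference) / 2, in cents."""
    return (total_cents - diff_cents) // 2


def heuristic(total_cents: int, diff_cents: int) -> int:
    """The cued operation: subtract the stated difference from the stated total."""
    return total_cents - diff_cents


def implied_difference(total_cents: int, cheaper_cents: int) -> int:
    """Difference between the two prices implied by a proposed answer."""
    other_cents = total_cents - cheaper_cents
    return other_cents - cheaper_cents


for label, total, diff in [("bat and ball", 110, 100), ("soup and salad", 550, 100)]:
    guess = heuristic(total, diff)
    print(label, solve(total, diff), guess, implied_difference(total, guess))
```

For the standard problem, the subtracted answer (10 cents) implies a 90-cent difference, which is close to the required $1.00. For the $5.50 total, the subtracted answer ($4.50) would make the supposedly cheaper item the more expensive one by $3.50, a far more conspicuous violation.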
To further test the effect of manipulating the total price, we randomly assigned 1286 Prolific participants to one of seven conditions: a control condition (total of $110 and difference of $100), three conditions which hold the total price at $110 while reducing the difference, and three conditions which increase the specified total while holding the difference at $100. (Though we doubt it matters much, all conditions involved a "clabor" and a "plonket", to remove any semantic variance in how "realistic" the resulting prices were.) Solution rates from these seven variants are shown in Table K2. The top row is the standard control condition (total = 110, difference = 100). The left side reports results of the variants that manipulated the difference (while holding the total constant). The right side reports results of the conditions that manipulated the total (while holding the difference constant).
Regardless of whether manipulations involve the difference or the total, solution rates increase if differences are a smaller proportion of the total. While this provides further support for our approximate checker hypothesis, the effects are more modest than implied by the aforementioned "soup & salad" variant. Although we recognize the irony of suggesting this in the context of a 2nd table in a Kth appendix, perhaps "more research is needed" regarding the problem elements that do or do not matter.

Within each cell, the first number is the price total, the second number is the price difference, and the third number is the solution rate.
Appendix L. Reassessing the putative evidence for intuitive solutions

We assume that solving the bat and ball problem requires slow, effortful deliberation. Respondents who solve it take much longer than those who don't (see our Appendix B), and solution rates are markedly reduced by the imposition of time limits (Borghans et al., 2008) or mnemonic load (Johnson et al., 2016).
By contrast, Bago and De Neys (2019) propose that many can solve the problem intuitively. As part of their case for a "Smart System 1," they use a multi-trial, two-response paradigm in which respondents must first respond quickly under mnemonic load, but later get to respond again with no time pressure or load. They claim that most of those who ultimately solve the problem could do so intuitively (i.e., quickly, and despite cognitive load).
We are unpersuaded. First, their respondents don't just answer the standard bat and ball problem; they answer many slightly modified variants of the bat and ball problem interspersed among versions of "bat and ball lite." Many of the so-called "intuitive" solutions are from these later trials; from variants of a problem respondents have already repeatedly encountered. We think it is important to distinguish intuiting an answer from quickly applying a solution strategy discovered during an earlier trial.
Our concern that repeated exposure exaggerates how many participants appear to intuit the solution is based, among other things, on our reanalysis of data in Raoelison and De Neys (2019), who used the two-response paradigm described above, and data from Raoelison, Keime, and De Neys (2021), who intermixed 4-s and 25-s trials in a single response paradigm. Fig. L pools the speeded responses from these two papers and plots solution rates by trial. On the first trial (red dot), only 1 in 30 respondents select the correct answer. The "intuitive" solution rate increases dramatically over the next forty or so trials, before falling slightly and leveling out close to 25% (which could be achieved by fatigued respondents randomly choosing one of the four response options).

Second, most of these experiments, and the independent replication by Burič and Konrádová (2021), used a multiple-choice response format (in which the correct answer is presented alongside one, two, or three incorrect options). To illustrate our objections to this paradigm, suppose you put respondents under mnemonic load, enforced a six second time limit, used experimental instructions which alert respondents to the distinction between intuitive and deliberate responses, and then posed the "XYZ" problem, as below:

Some may select 25/6 because the setup implies that there are two possible answers and 4 may seem suspiciously obvious. But we'd not conclude that any of these respondents were solving this set of equations, much less doing so within 6 seconds, under load.
When Bago and De Neys (2019) used an open-ended response format, they still found that 15 of the 50 respondents who eventually answered the first trial correctly could do so on the first of the two response opportunities (within about 6 seconds and despite the imposition of concurrent cognitive load). However, we'd regard that result as an anomaly.³² First, our foregoing analyses revealed that only 1 in 30 respondents produced the correct answer on the first trial even when it was included as one of the response options. Second, when we presented the standard open-ended problem to a sample of 2000 American internet users who were not part of any regular participant pool that might have previously exposed them to it, 288 answered correctly. However, none did so within 6 seconds, and only eight did so within 12 seconds.³³ Further details are presented in Appendix B.

³² We suspect that this result is a small sample fluke or reflects prior exposure to the problem. Like us, they excluded participants based on self-reported exposure to the problem. However, these self-reports should be interpreted cautiously. Meyer et al. (2018) found that many (1368 out of 4731) who claimed that they had never seen the problem before had, in fact, both seen and answered that identical problem at least once before (based on repeatedly appearing MTurk IDs).
Furthermore, the putative "Smart System 1" is not correlated with cognitive ability in ways one might expect, or in ways that have been claimed. In Raoelison, Thompson, and De Neys (2020), cognitive ability was operationalized as each participant's score on a 12-item Raven's APM and a 4-item "verbal CRT." Their claim that cognitive ability correlated positively with both initial (intuitive) accuracy and final (reflective) accuracy on the sequentially presented bat and ball variants holds only if later trials are included (which we find problematic for the aforementioned reasons).³⁴ Table L presents correlations between bat and ball accuracy and cognitive ability for each trial within their second experiment.³⁵ On the first trial, only 2 of 54 participants initially selected the correct answer to the bat and ball problem. Both scored more than two standard deviations below the sample mean on their 16-item test of cognitive ability, and both switched to the wrong answer after deliberating, suggesting that many of the so-called intuitive solutions actually reflect an inability to even perform the intuitive calculation within the permitted time.³⁶

We doubt that selecting the correct answer from a small set of provided options after repeated exposure to variants of the same problem represents anything resembling an intuitive solution to the bat & ball problem. Given the instructions which emphasize two kinds of answers and repeated exposure to isomorphs of the standard problem, we suspect, instead, that respondents either eventually recognize that their intuition may be incorrect (and thereby start choosing a counter-intuitive answer from the provided list) or learn to apply a problem-specific shortcut they eventually discover, such as dividing the difference between the two numbers by two.³⁷

Though we reject Bago and De Neys's (2019) claim that an appreciable fraction of respondents can solve this problem without engaging in substantial deliberation, we take no issue with Bago and De Neys's (2017) broader claim that many reasoning problems contain multiple competing principles, that more than one of them can sometimes be quickly apprehended, and that this can create conflict which reduces confidence in the more dominant intuition (if the problem states almost everyone in the sample is a lawyer, Bill likely is too, even though he sounds more like an engineer). We also support (and, indeed, provide further evidence for) their proposal of rapid, nearly unconscious monitoring of the quality of quickly generated candidate responses.
As a final note regarding intuitive and deliberative responding, it seems important to distinguish the claim that 5 may be an intuitive response from the distinct, but related (?) claim, that some who ultimately solve the bat and ball problem never entertained 10 cents as a potential response. For instance, Szaszi and co-authors (2017) asked 219 respondents to solve the bat & ball problem out loud. Of the 38 who solved it, only 14 explicitly mentioned the 10-cent intuition. While we assume this substantially underestimates the fraction who computed or considered that value, we agree that some who possess the ability to solve the problem may not seriously entertain 10 cents as a potential response, perhaps because they (a) immediately encode it as an algebra problem and start doing the math, (b) disbelieve they'd be asked to merely subtract 100 from 110, (c) notice that the second statement does not simply say that the bat costs $1, which it would if subtraction were the only required operation, or (d) somehow intuitively appreciate the principle that one can create a difference of n units between two things by subtracting n/2 units from one thing and adding it to the other (if Andrew gives Shane $5, the difference in their wealth has increased by $10).

Following Raoelison et al. (2020), we excluded participants if they reported familiarity with the bat and ball problem, missed the response deadline, or failed to recall the mnemonic load. After those exclusions, the sample sizes used to compute these correlations were about 55 for each trial.
Appendix M. The careless and the hopeless

Some of our within-subject Hint experiments included questions besides the Bat and Ball problem: six items from Raven's Advanced Progressive Matrices, which is widely upheld as a measure of general intelligence (Jensen, 1998), and Tversky and Kahneman's (1983) "Linda question" (which also plausibly measures the ability to resist an intuition).
Some of these studies also included the other two items from Frederick's (2005) Cognitive Reflection Test and employed the same within-subject Hint procedure. For "Widgets",³⁸ we told subjects that the answer was not 100. For "Lilypads",³⁹ we told them it was not 24. As shown in the tables below, the result for the bat & ball item reported in the main text replicates for these two items as well: in both cases, the largest gap in Raven's scores is between the careless and the hopeless, whereas the largest gap in the "Linda" problem is between the reflective and the careless.⁴⁰

The Hint manipulation enables us to distinguish the ability to catch the intuitive error on one's own from the ability to perform the required math once the error has been pointed out. Table M3 reports how solving these items (either without or with hints) predicts performance on a second reasoning task (the Raven's items or the "Linda" problem). For all three CRT items, the hint strengthened the relation with Raven's scores, but weakened the relation with solving the Linda problem (i.e., avoiding the conjunction fallacy). If the items are aggregated, both of these "opposing" effects are significant.⁴¹

Correspondingly, Fig. M shows that those with higher cognitive abilities (operationalized by their Raven's scores) were not only more likely to solve the bat and ball problem initially (reflective responses) but also more likely to use the hint to correct their initial error (careless responses).

Table M3
Relation between performance on CRT items (before & after hint), and performance on two other reasoning tasks (six Raven's matrices & the Linda problem).

Fig. 2. Effect of price difference on percent of respondents who: Solve the problem, Subtract the price difference from $110, or give some Other incorrect answer.

Fig. 3. % Choosing 5 cents over decoy for nine different decoys. Decoy varies between-subjects, forming nine binary-choice bat and ball conditions. Error bars indicate standard errors of the mean.
Al and Bob are 50 years old in total. Al is 20 years older than Bob. How old is Bob? ____ 32% correct

FATHER & SON
A father and a son are 50 years old in total. The father is 20 years older than the son. How old is the son? ____ 36% correct

Fig. M. Distribution of bat and ball responses by performance on Raven's APM. The Reflective area indicates the percent of respondents initially answering correctly (5). The Careless area indicates the percent of respondents initially answering with the intuitive error (10), but later revising to 5 when told that 10 is wrong. The Hopeless area indicates the percent of respondents initially answering 10, but failing to revise to 5 when told that 10 is wrong. The unlabeled grey area indicates the percent of respondents who initially answered something other than 5 or 10.
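The four response categories described in the caption above can be expressed as a small classification rule. This is a minimal sketch under stated assumptions, not the authors' analysis code: the function name `classify`, its parameters, and the label "Other" for the unlabeled grey area are all hypothetical.

```python
def classify(initial, after_hint=None, correct=5, lure=10):
    """Assign a respondent to a category from the Hint procedure.

    initial    -- answer given before any hint (in cents)
    after_hint -- answer given after being told the lure is wrong, if any
    correct    -- the correct answer (5 for the bat and ball problem)
    lure       -- the intuitive error (10 for the bat and ball problem)
    """
    if initial == correct:
        return "Reflective"   # solved it without a hint
    if initial != lure:
        return "Other"        # neither the correct answer nor the lure
    if after_hint == correct:
        return "Careless"     # gave the lure, then revised to the correct answer
    return "Hopeless"         # gave the lure and did not revise to the correct answer


labels = [classify(5), classify(10, 5), classify(10, 10), classify(7)]
```

With the defaults, `labels` is `["Reflective", "Careless", "Hopeless", "Other"]`. The `correct` and `lure` parameters are left adjustable so that the same rule could be applied to the "Widgets" and "Lilypads" items, for which the hinted-away answers were 100 and 24 respectively.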

Table 2
Effects of "Hint: 10 cents is not the answer."

Table 3
Raven's and Linda performance by Bat and Ball response.

Table B
Response times for the most common responses.

Table L
Correlations between cognitive ability and bat and ball accuracy

Table M1
Raven's and Linda performance by Widgets response.