People are explanatory creatures. We often seek to generate explanations based on our own knowledge of how the world works. However, our ability to generate complete explanations on our own is frequently inadequate. We may not have all of the evidence or the expertise to be able to form accurate models of complex phenomena. So we use the knowledge of experts, friends, and communities to piece together explanations. Our beliefs about science are not limited to intuitive preconceptions, but are also derived from scientists who inform us of how things work. Our beliefs about the economy are affected not only by our own experiences, but also by what economists and politicians tell us about large-scale financial systems. We rely on the explanations of others to form our own beliefs. How, then, do we evaluate the explanations of others?

Explanatory criteria

A common view has emerged that the quality or value of an explanation can be determined by how well it satisfies a set of criteria known as explanatory virtues (Lipton, 2004; Thagard, 1978; Harman, 1965; Mackonis, 2013; Glymour, 2014; Lombrozo, 2011). However, there is disagreement about what counts as an explanatory virtue, how these virtues are defined and measured, and how they are weighted when we evaluate an explanation. Two commonly proposed virtues are simplicity and coherence. For example, a good explanation should be simple, invoking the smallest number of causes needed to explain a phenomenon (e.g., Lombrozo, 2007). A good explanation should also be coherent; it should be compatible with our existing beliefs, and consistent with the evidence and with itself (e.g., Thagard, 1989).

We may also evaluate an explanation using other criteria, such as the credibility of the explainer, or how well the explanation is articulated, that do not reflect the intrinsic value of an explanation. These criteria are useful in satisfying goals beyond identifying the information inherent to an explanation (Patterson, Operskalski, & Barbey, 2015). For instance, a well-articulated explanation can be useful for pedagogical reasons. The perceived credibility of an explainer may affect whether or not one believes the explanation, regardless of its intrinsic merit. While these criteria do not affect the inherent quality of an explanation, they may still serve important pragmatic functions and can be useful indicators of explanation quality.

Everyday explanations

Philosophers have examined features central to scientific explanation that may improve our understanding. Does an explanation need to appeal to general laws (Hempel, 1965)? Should an explanation aim to unify the widest range of phenomena (Kitcher, 1989)? A common method in philosophical inquiry is to analyze existing scientific explanations: what types of explanations do scientists provide, and what makes them good or bad explanations?

However, many of the criteria used for evaluating explanations in a scientific context may differ from the criteria that are important for explaining everyday events. Explanations in non-scientific domains may require a different set of explanatory criteria because they are structured differently. For example, historical explanations are more likely to appeal to a narrative, and less likely to invoke general laws (Dray, 2000).

Although some philosophical theories suggest that abstract explanations are desirable (e.g., Strevens, 2007), people sometimes prefer explanations that are more concrete (Bechlivanidis, Lagnado, Zemla, & Sloman, 2017) and less generalizable (Khemlani, Sussman, & Oppenheimer, 2011). Despite philosophical claims that explanations should be simple (e.g., Thagard, 1978), people tend to explain inconsistencies by positing additional causes rather than disputing a premise, resulting in a more complex causal structure (Khemlani & Johnson-Laird, 2011; Johnson-Laird, Girotto, & Legrenzi, 2004). Non-scientific explanations may also serve different explanatory goals. For instance, Newtonian mechanics is a source of good explanations for pedagogical and most practical purposes, even though Einstein’s relativistic mechanics provides a more faithful explanation of how the world works.

In contrast to the philosophical literature on explanations, psychologists have tended to study short and simple explanations (e.g., Kelemen & Rosset, 2009; Weisberg, Keil, Goodstein, Rawson, & Gray, 2008; Cimpian & Salomon, 2014). These explanations have minimal causal structure, often only a single causal relation. Some experiments of this type rely on causal inference (e.g., Lombrozo, 2007; Khemlani et al., 2011); they ask participants to identify the cause or causes that best explain the observed effects, often holding constant the probability of an effect given its cause. We intend to test whether results obtained with these paradigms also apply to explanations that are more naturalistic.

Explanations that do consist of multiple causal relations require people to consider additional criteria, such as whether there are gaps in the causal structure (Keil, 2006). For example, it is undoubtedly true that leaves change color in autumn because chlorophyll in the leaves breaks down. However, this explanation omits parts of the causal model, such as why chlorophyll causes leaves to be green, and what causes chlorophyll to break down. People may be sensitive to this omission, leading them to evaluate the explanation negatively even if they agree on the primary cause.

In more natural settings, we sometimes construct complex causal explanations in order to explain many pieces of evidence. Pennington and Hastie (1986, 1988) found that people explain complex events by constructing stories around the evidence, and that these stories can differ depending on the order in which evidence is presented. These stories can be evaluated by how well they cohere with the available evidence (Byrne, 1995) using a set of coherence principles (Thagard, 1989). It is generally taken for granted that these principles are desirable, and subsequent work has provided some empirical support for them (Read & Marcus-Newhall, 1993; Schank & Ranney, 1992).

We should also consider how an explanation fits with our broader knowledge of the world. When evaluating a single explanation, we should consider possible alternative explanations (Fernbach, Darlow, & Sloman, 2010) and counterfactuals (Woodward & Hitchcock, 2003). When explanations provide evidence in support of a causal mechanism (Sloman, 2005), that evidence should be evaluated independently to determine whether it is credible and relevant (Kuhn, 1991).

Real-world explanations are typically more nuanced than experimental stimuli, and thus provide a more ecologically valid way of understanding the explanatory criteria people use to evaluate explanations. Experimental stimuli used to test explanatory criteria are often focused on a narrow subset of explanation types—for instance, explanations of token events or of classes of events (types), but not both. Though many explanatory criteria have been established for evaluating scientific explanations, we test whether those same criteria are seen as virtues in everyday contexts. In addition, evaluating explanations can require us to engage in a number of processes simultaneously, including dialectical reasoning (resolving inconsistencies), probabilistic reasoning (finding the most likely causes, or the causes that make the effect most likely), and didactic methods (educating the reader). We observe whether previously touted explanatory virtues endure in the face of these multiple goals.

Experiment 1

To investigate how people evaluate everyday explanations, we compiled a small corpus of explanations that were generated in a non-scientific and non-experimental context. Specifically, we gathered explanations from Reddit’s Explain Like I’m Five (ELI5; www.reddit.com/r/explainlikeimfive), an Internet community that receives roughly 7 million unique visitors per month. The explanations in our corpus were rated by participants on a host of explanatory criteria that have been proposed in prior literature.

Method

Participants

Two hundred and forty participants located in the United States were recruited using Amazon’s Mechanical Turk (Paolacci, Chandler, & Ipeirotis, 2010). Five participants were removed from the data set prior to analyses for failing an attention check question (Oppenheimer, Meyvis, & Davidenko, 2009). Of the remaining 235 participants, 131 were male and 104 were female, aged 18–69 years (median age of 34 years).

Materials

Eight explananda (see Table 1) were selected from ELI5 with three explanations for each, for a total of 24 explanations. The explananda were selected to fit into one of four categories: historical, public health, legal, and social policy. These categories were chosen to contrast with scientific explanation, and they also reflect topics of interest to the general public. By selecting explanations from several categories, we sought to identify whether explanatory criteria are domain-general rather than applying only in certain domains. The explanations also varied in style, including a mixture of token and type explanations, as well as teleological and mechanistic explanations. We selected explananda from ELI5 that had a high level of engagement (i.e., many unique explanations and many “votes” from the site’s users). For each explanandum, we chose three different explanations that proposed distinct mechanisms or offered different evidence in support of a given mechanism. In addition, the specific explanations were chosen because they varied prima facie on several explanatory criteria, such as appeals to expertise and evidence, complexity, and generality. An example explanation is shown in Table 2, and all of the explanations are provided in the Supplementary Material.

Table 1 Explananda used in Experiment 1
Table 2 An example explanation used in Experiment 1 to explain “If Ebola is so difficult to transmit (direct contact with bodily fluids), how do trained medical professionals with modern safety equipment contract the disease?”

Procedure

Each participant was shown one explanandum with one corresponding explanation. After reading the explanation in full, participants assessed the quality of the explanation by rating whether the text constitutes a “good explanation.” Afterwards, participants rated the remainder of the attributes (see Table 3) in a randomized order. To prevent participants from referring to their previous ratings, each attribute was rated on its own page, with two exceptions. Generality was rated on the same page as principle consensus because the latter question refers to the former. Evidence credibility and evidence relevance were rated last and on the same page, after participants were asked to highlight any evidence in the explanation. All attributes were rated using a 7-point Likert scale ranging from Strongly Disagree to Strongly Agree.

Table 3 List of attributes rated in Experiment 1

Results

Overview

We first examined the relation between explanation quality and each attribute without controlling for the other attributes. A mean score was computed for each attribute for each of the 24 explanations. Partial correlations were computed using a mixed-effects model for each attribute, in each case treating quality as the dependent variable, one of the 20 remaining attributes as a fixed effect, and explanandum as a random effect. A partial correlation was used in place of a simple Pearson correlation because the 24 data points are not truly independent (there are three explanations for each explanandum). All subsequent correlations reported for Experiment 1 reflect a partial R after controlling for explanandum as a random effect.
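A minimal sketch of this analysis, assuming a data frame with one row per explanation (24 rows) and hypothetical column names; converting the fixed effect’s t-statistic to a partial R is one standard approximation rather than necessarily the exact computation used here:

```python
# Sketch: partial correlation between one attribute and quality, with
# explanandum as a random effect (hypothetical column names throughout).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def partial_r(df: pd.DataFrame, attribute: str) -> float:
    """Fit quality ~ attribute with a random intercept per explanandum,
    then convert the fixed effect's t-statistic to a partial R."""
    result = smf.mixedlm(f"quality ~ {attribute}",
                         data=df, groups=df["explanandum"]).fit()
    t = result.params[attribute] / result.bse[attribute]
    resid_df = len(df) - 2  # rough residual df: n minus two fixed effects
    return float(t / np.sqrt(t ** 2 + resid_df))

# Usage: columns are "quality", "explanandum", and one column per attribute.
# r = partial_r(ratings, "internal_coherence")
```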

Of the 20 attributes, 14 significantly predicted explanation quality, as shown in Table 4. The six that did not were: the desired complexity of an explanation, whether the evidence was credible (evidence credibility), whether the evidence was relevant (evidence relevance), whether the explanation referred to an expert, whether the participant had a lot of prior knowledge in the domain, and whether the explanandum required an explanation (requires explanation). To aid in interpretation, we also corrected for multiple comparisons using a full Bonferroni correction, though this correction is likely overly conservative: all tests were planned a priori, and the tested hypotheses are often complementary rather than orthogonal. Nonetheless, six attributes survived the multiple-comparison correction, suggesting that their relation with explanation quality may be particularly strong: whether the explanation had a number of possible alternatives, the articulation of the explanation, whether there were gaps in the explanation (incompleteness), whether the parts of the explanation fit together (internal coherence), whether the explanation was regarded as true (perceived truth), and whether most people agree with the general rule provided in the explanation (principle consensus).

Table 4 Means and SDs for each attribute as well as partial correlations between each attribute and explanation quality, controlling for explanandum as a random effect. Adjusted p-values are computed using a full Bonferroni correction for multiple comparisons

Though many of the attributes were able to predict explanation quality, we also observed substantial covariance between the attributes. The attribute correlation matrix (Fig. 1) depicts the magnitude of the correlation between all attributes pairwise, including explanation quality. It is likely that many of these attributes are not independent predictors of explanation quality, but instead reflect a smaller number of latent factors. For further discussion of how these attributes group together, see the Supplementary Material.

Fig. 1 The magnitude (absolute value) of each pairwise correlation is shown after controlling for explanandum as a random effect. A hierarchical clustering procedure using average-linkage is shown to aid visualization

Expertise

When evaluating an explanation, it can be helpful to assess the credibility of the explainer. Our knowledge about an individual can be used to predict what else that person is likely to know (Keil, Stein, Webb, Billings, & Rozenblit, 2008), which could play a role in judging whether the premises of an explanation are true. However, it is not always clear what cues are used to judge expertise. For example, while we might expect experts to use more technical language, using long words needlessly can make an author appear less intelligent (Oppenheimer, 2006). Similarly, scientific jargon does not always affect ratings of explanation quality (Weisberg, Taylor, & Hopkins, 2015; though see Eriksson, 2012).

Additionally, classifying the explainer as an expert can make the explanation more credible, but a good explanation does not have to be constructed by an expert. If two explanations are identical except for their source, they are presumably equivalent in their explanatory power even if they are not assigned the same degree of belief.

We assessed expertise using two dependent measures: whether the explanation referred to an expert (expert), and whether the participant believed the explanation was written by an expert (perceived expertise). The two factors are not identical, though they are related, R = .61, p = .002. An explanation can refer to an expert by self-identifying the explainer as an expert, or by citing an authoritative source. In contrast, someone may judge an explanation to be written by an expert through the quality of the language and level of technical sophistication. Both factors positively predict explanation quality; however, perceived expertise is a stronger predictor (see Table 4). One possibility is that identifying an expert primarily serves to increase the perceived expertise of the explainer. A mediation model lends support to this hypothesis: although both expert and perceived expertise are positive predictors of quality (and each other), only perceived expertise is a significant predictor of quality when using multiple regression (Fig. 2). Sobel’s test (Sobel, 1982) suggests that perceived expertise mediates the relation between expert and quality, z(24) = 1.87, p = .06.
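The mediation analysis can be sketched as follows; for simplicity, this version fits ordinary least squares regressions on hypothetical column names and ignores the random effect of explanandum:

```python
# Sketch of the Sobel (1982) test for mediation (hypothetical column names).
import numpy as np
import statsmodels.formula.api as smf

def sobel_z(df):
    # a-path: predictor (expert) -> mediator (perceived expertise)
    m1 = smf.ols("perceived_expertise ~ expert", data=df).fit()
    a, se_a = m1.params["expert"], m1.bse["expert"]
    # b-path: mediator -> outcome (quality), controlling for the predictor
    m2 = smf.ols("quality ~ perceived_expertise + expert", data=df).fit()
    b, se_b = m2.params["perceived_expertise"], m2.bse["perceived_expertise"]
    # Sobel's z for the indirect effect a*b
    return a * b / np.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
```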

Fig. 2 Perceived expertise mediates the relation between expert (reference to an expert or self-identification) and explanation quality

Coherence

One of the most often cited explanatory virtues is coherence. Despite having received much attention in the literature, the term has been defined in several distinct, sometimes contradictory ways. While some authors use coherence to refer to consistency with prior knowledge and beliefs (Murphy & Medin, 1985; Mackonis, 2013), other authors use it to refer to whether the components of an explanation are compatible or complement each other (Thagard, 1989; Bovens & Olsson, 2000; Keil, 2006).

We distinguish between internal and external coherence. External coherence refers to how much of the explanation overlaps or “fits” with what the reader already knows. Internal coherence refers to “how well the parts of the explanation fit together.” We found that internal coherence is nearly twice as predictive as external coherence (internal: R = .82; external: R = .47; see Table 4). In a multiple regression, external coherence no longer correlated significantly with quality judgments after accounting for internal coherence (internal: R = .73, p < .001; external: R = .21, p = .33). Previous research has suggested that people may not spontaneously generate or consider possible alternatives when evaluating an explanation (Hirt & Markman, 1995). This failure to take an outside view when reasoning (Sloman & Lagnado, 2015) may explain why internal coherence takes precedence over fit with background knowledge.

Articulation

Despite providing no epistemic value, the articulation of an explanation was a strong predictor of perceived explanation quality (R = .79). We examined several linguistic markers to determine if surface features could explain perceived articulation and, by extension, predict explanation quality.

Articulation was correlated with a multitude of surface features, such as the number of words in an explanation (R = .64, p = .002), the median word frequency in an explanation (R = –.54, p = .02), and the average word length (R = .45, p = .056). Perceived articulation also correlated with two well-known, closely related readability metrics (Flesch, 1948; Kincaid, Fishburne, Rogers, & Chissom, 1975): Flesch Reading Ease (R = –.54, p = .018) and Flesch-Kincaid Grade Level (R = .63, p = .003). Additionally, the proportion of nouns in an explanation predicted articulation (R = .54, p = .016).
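Both metrics are simple functions of average sentence length and syllables per word; below is a minimal sketch using the standard published formulas, where the vowel-group syllable counter is a crude stand-in (an assumption) for a proper syllable dictionary:

```python
# Sketch: Flesch Reading Ease and Flesch-Kincaid Grade Level.
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels (minimum one).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences  # mean words per sentence
    spw = syllables / len(words)  # mean syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level
```

Lower Reading Ease scores and higher Grade Level scores both indicate harder text, which is why the two metrics correlate with articulation in opposite directions.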

Oddly, none of these metrics were significantly correlated with judgments of explanation quality (all R < .31, all p > .17), with the exception of word count (R = .60, p = .003). This finding is peculiar, given that articulation was highly correlated with explanation quality. As such, it is not entirely clear whether explanations are rated highly because they are articulate, or whether this correlation is the result of a third variable. For instance, an intelligent person might be skilled at both writing and explaining (identifying the causal structure), even if one does not directly impact the other.

Simplicity

A guiding principle in explanatory reasoning is that of Occam’s Razor: All things being equal, the simplest hypothesis should be preferred. Thus, we initially predicted a negative correlation between subjective complexity and explanation quality. Surprisingly, we observed a positive correlation, with explanations that were rated as more complex also rated as better explanations (R = .49, p = .03).

To further investigate this relationship, we examined other measures of complexity. Explanations may be deemed complex for many reasons, and it is not immediately clear what aspect of complexity our subjective measure is capturing. One possibility is that an explanation may be complex because it appeals to a large number of mechanisms. That is, the explanation suggests the explanandum occurred as a result of many causal pathways. Alternatively, an explanation may be complex because it is very detailed. An explainer may go into great detail about even a single mechanism. We test both of these hypotheses.

Causal pathways

One reason an explanation may be judged complex is that its underlying causal structure appeals to a large number of mechanisms. The four authors jointly identified the causal model for each of the 24 explanations in our corpus (see Fig. 3 for an example; all of the causal models are provided in the Supplementary Material) by identifying causal language in the explanation (Sloman, 2005). Each node in a causal model represents a cause, an effect, or both. Node labels were used as shorthand to represent the underlying cause or effect. A directed link from node A to node B indicates that A is a cause of B, though not necessarily a sufficient cause. Non-causal information, such as the credibility of the speaker or flowery language, was not included in the causal model. Simple facts that are not causally related to the rest of the explanation were also excluded. Specific anecdotes and evidence used in the explanation were represented in the causal model insofar as causal relations were distilled from these more concrete examples. The causal models were constructed to be acyclic, consistent with Bayesian graphical models used elsewhere (e.g., Sloman, 2005). Though this process is somewhat subjective, we converged on a single causal model for each explanation.

Fig. 3 An example causal model reflecting the explanation shown in Table 2. The causal model shows three root causes (depicted with a dashed border) that are connected to the explanandum (contracting Ebola)

We estimated the number of causal mechanisms in an explanation by counting the number of root causes in the model (nodes without a parent node and connected by some pathway to the explanandum). This measure is consistent with previous research that suggests the number of unexplained causes, rather than the absolute number of causes, has an impact on explanation judgments (Lombrozo & Vasilyeva, 2017). As predicted, the number of root causes significantly predicts explanation quality, R = .64, p = .005. A reasonable objection might be that explanations that appeal to more causes also explain more effects. However, the correlation remains significant even when we control for the number of final effect nodes in the model and a subjective measure of how much the explanation explains (scope), R = .63, p = .015.
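With each causal model stored as a directed graph, root causes are simply ancestor nodes of the explanandum with no incoming links. The sketch below assumes a networkx representation; the toy edges are illustrative, not one of the actual coded models:

```python
# Sketch: count the root causes connected to the explanandum in a causal DAG.
import networkx as nx

def count_root_causes(g: nx.DiGraph, explanandum: str) -> int:
    ancestors = nx.ancestors(g, explanandum)  # all nodes with a path to it
    return sum(1 for n in ancestors if g.in_degree(n) == 0)

# Toy model: two root causes ("gear failure", "fatigue") converge on the effect.
g = nx.DiGraph([("gear failure", "exposure"),
                ("fatigue", "protocol lapse"),
                ("protocol lapse", "exposure"),
                ("exposure", "contracting Ebola")])
print(count_root_causes(g, "contracting Ebola"))  # -> 2
```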

Explanation length

Another reason an explanation may be complex is that it contains many details. One way to operationalize this is to simply count the number of words in an explanation. Explanations that use more words to describe the causal system can be seen as more detailed. Indeed, we found that as the length of an explanation increased, so did its perceived quality, R = .60, p = .004. Furthermore, the number of causal mechanisms and explanation length appear to be independent predictors of explanation quality, as both are significant predictors in a multiple regression, Rwords = .52 (p = .02), Rroot_nodes = .51 (p = .03). One caveat is that explanation length is also correlated with articulation, as reported earlier. When subjective articulation was included in the regression analysis, explanation length no longer predicted perceived quality, Rwords = .26 (p = .27), Rroot_nodes = .43 (p = .08), Rarticulation = .44 (p = .04), indicating shared variance between the three attributes.

Complexity and expertise

Complexity may also have indirect effects on ratings of explanation quality. For example, it is possible that a complex explanation may make the explainer seem knowledgeable, which in turn increases the quality of an explanation.

Explanation quality is significantly correlated with both perceived expertise and judgments of complexity (see Table 4). In addition, subjective complexity is strongly correlated with perceived expertise, R = .55, p = .008. We tested the hypothesis that perceived expertise mediates the relationship between complexity and explanation quality. However, Sobel’s test for mediation (Fig. 4) does not reach significance (z = 1.7, p = .088).

Fig. 4 Perceived expertise is strongly related to both complexity and explanation quality, though it is not a significant mediator

We also conducted a multiple regression analysis to see if perceived expertise could explain variance in explanation quality ratings independent of other measures of complexity (subjective complexity, explanation length, and number of root causes). Although subjective complexity is no longer significant in this analysis (see Table 5), the other predictors remain significant. This finding suggests that although all of the factors in Table 5 reflect measures of complexity, these factors are not interchangeable.

Table 5 Using multiple regression analysis, explanation length, number of root causes, and perceived expertise each significantly predict explanation quality ratings

Incompleteness

We expected that explanations containing gaps in the proposed causal mechanisms would be rated lower than explanations that did not contain any gaps. That is, if an explanation suggests that A causes B, but it is not immediately clear how A causes B, participants will be sensitive to this omission. In support of this, we found that ratings of incompleteness (whether “there are gaps in the explanation”) significantly correlated with explanation quality (R = –.65, p < .001).

We explored this further by examining the average path length in each of the causal models, measuring the average number of steps from a root cause to the explanandum. Pathways that contain more steps are likely to contain fewer gaps, and could be rated higher. However, this was not the case, R = .08, p = .74.
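One way to operationalize this measure, under the same graph representation assumed earlier, is the mean shortest-path distance from each root cause to the explanandum; treating "average path length" this way is our assumption:

```python
# Sketch: mean number of causal links from each root cause to the explanandum.
import networkx as nx

def mean_root_path_length(g: nx.DiGraph, explanandum: str) -> float:
    roots = [n for n in nx.ancestors(g, explanandum) if g.in_degree(n) == 0]
    return sum(nx.shortest_path_length(g, r, explanandum)
               for r in roots) / len(roots)
```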

Discussion

These findings suggest that the explanatory criteria used to evaluate everyday explanations may differ from those previously identified. The biggest departure from existing theories is the finding that people prefer complex explanations—specifically, a preference for explanations that appeal to multiple causal mechanisms (though see Ahn & Bailenson, 1996). One limitation of the study, however, is its reliance on correlational analyses. In addition, by using naturalistic explanations that were not modified extensively, the explanations vary in many respects other than complexity. To address these concerns, we conducted an additional experiment that manipulated the number of mechanisms present in each explanation.

Experiment 2

We conducted a follow-up study using controlled stimuli to examine whether explanations with multiple independent causal pathways are preferred. We expected that people would prefer explanations that appeal to multiple causal mechanisms, even when a single mechanism is sufficient.

Method

Participants

Ninety participants located in the United States participated in the experiment via Amazon’s Mechanical Turk.

Materials

For each of the six explananda listed in Table 6, we created two explanations (denoted A and B) that appeal to entirely distinct mechanisms. Explanations were constructed from material found in multiple online sources, including Reddit, Wikipedia, and HowThingsWork.com. For instance, one explanation for why China’s population is rising despite its one-child policy is that ethnic minorities and rural populations are exempt from the rule; another explanation suggests that Chinese citizens are living longer on average, and that wealthy couples can afford to pay the fines associated with violating the policy. Explanations were designed to be roughly equal in length, amount of detail, and number of mechanisms. We also created a third explanation that was simply a concatenation of the other two (denoted AB). This explanation encompasses the other two, as it appeals to all of the causal mechanisms in A and B and does not vary in any other way. See the Supplementary Material for the full text of each explanation used in the study.

Table 6 List of explananda used in Experiment 2

Procedure

Participants read an explanandum and were asked to make two ratings on a 7-point Likert scale: “How many reasons or mechanisms should a good answer to this question include?” and “How detailed should a good answer to this question be?” The response scale for both questions was ordinal, ranging from “1—A good answer would offer only one reason or mechanism” to “7—A good answer would appeal to many reasons or mechanisms” for the former question, and from “1—Very little detail is needed” to “7—A lot of details are needed” for the latter question.

On the following page, participants read the full explanation and rated the number of mechanisms in the explanation (from “1—Only a single mechanism” to “7—A lot of mechanisms”), the amount of detail in the explanation (from “1—Not detailed at all” to “7—Very detailed”), the overall quality of the explanation (whether it was a “good” explanation to the question being asked, from “1—Strongly disagree” to “7—Strongly agree”), and whether the participant learned a lot from the explanation (from “1—I didn’t learn anything at all” to “7—I learned a great deal”).

This procedure was repeated for all six explananda. Each participant was shown only one of three explanations (A, B, or AB) for each explanandum. The explanations were counterbalanced so that each explanation was presented to exactly 30 participants. The order of the explanations was also counterbalanced. Prior to beginning the experiment, participants were shown an example explanation and informed of the types of questions they would be answering.

Results

For all six explananda, the concatenated AB explanation was rated as having more mechanisms than the A and B explanations (pooled), all p < .05. In five cases, the AB explanation was rated as having significantly more mechanisms than its nearest competitor (A or B; all p < .05, except p = .12 for the ships explanandum; see Fig. 5). This validates our manipulation: explanations with more mechanisms were rated as such. Additionally, AB explanations were rated as having more details than their corresponding A and B explanations for all six explananda (pooled), as well as having more details than their nearest competitor (A or B), all p < .001.

Fig. 5 Participants were shown one explanation (A, B, or AB) for each of the six explananda. The AB explanations were rated as appealing to more mechanisms than the A and B explanations. The response scale ranged from “1—Only a single mechanism” to “7—A lot of mechanisms.” Error bars: standard error of the mean

In all six explananda, the quality of the AB explanation was rated significantly higher than the individual A and B explanations (pooled), all p < .05 (see Fig. 6). The concatenated explanation (AB) typically performed better than the second-most preferred explanation (A or B); the difference was significant for three explananda (p < .05), nearly significant for one (ships: p = .07), and not significant for two (plague: p = .3; vaccines: p = .77). All of the component explanations (A or B) except one were rated as above average in quality (above the midpoint). Despite this, participants showed a preference for the concatenated explanation (AB) in each case. In addition, 75 of 90 participants rated AB explanations higher on average than the A or B explanations (across all six explananda), binomial test p < .001.
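The binomial test here treats each of the 90 participants as a single trial, with success defined as rating the AB explanations higher on average; a minimal check (assuming no ties) might look like:

```python
# Sketch: two-sided binomial test of 75 successes out of 90 trials vs. chance.
from scipy.stats import binomtest

result = binomtest(k=75, n=90, p=0.5, alternative="two-sided")
print(result.pvalue)  # far below .001
```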

Fig. 6 Participants were shown one explanation (A, B, or AB) for each of the six explananda. The AB explanations were rated as better explanations than the A and B explanations. Error bars: standard error of the mean

Perhaps participants judged AB explanations as better because the phenomena to be explained were complicated, and thus benefited from an appeal to numerous mechanisms. Prior to reading each explanation, participants indicated that a good explanation should appeal to multiple mechanisms (mean ratings for the six explananda range from 3.8 to 5.7). We explored this possibility further by examining the relative complexity of each explanation: whether an explanation that appealed to more mechanisms than a “good explanation to the question should include” would still be preferred. We calculated a measure of relative complexity for each explanation by subtracting the mean rating of the number of mechanisms a good explanation should appeal to from the mean rating of the number of mechanisms the explanation did appeal to. As shown in Fig. 7, the majority of the AB explanations appealed to more mechanisms than initially expected from a good explanation.
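Concretely, relative complexity is a difference of two means computed per explanation; a minimal sketch, assuming a long-format data frame with hypothetical column names:

```python
# Sketch: relative complexity = mean perceived mechanisms minus mean
# expected mechanisms, computed separately for each explanation.
import pandas as pd

def relative_complexity(ratings: pd.DataFrame) -> pd.Series:
    # ratings: one row per participant x explanation, with columns
    # "explanation", "mechanisms_perceived", and "mechanisms_expected"
    grouped = ratings.groupby("explanation")
    return (grouped["mechanisms_perceived"].mean()
            - grouped["mechanisms_expected"].mean())
```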

Fig. 7 Participants were shown one explanation (A, B, or AB) for each of the six explananda. Prior to seeing an explanation, they rated how many mechanisms “a good explanation to the question should include,” and after reading the explanation rated “how many mechanisms does this explanation appeal to.” The y-axis denotes the latter minus the former. Error bars: standard error of the mean

Furthermore, this relative complexity measure predicted quality ratings as well as or better than the ordinal rating of the number of mechanisms alone. A multiple regression controlling for explanandum found a near-significant effect of relative complexity (partial R = .45, p = .07) but no significant effect of the number of mechanisms (partial R = .26, p = .31). However, we found no evidence that an explanation could suffer from being “too complex”; in no case was the AB explanation rated worse than either of its component explanations.

Discussion

Using controlled stimuli that differed only in the number of mechanisms, we replicated one of the key findings of Experiment 1, showing that people prefer explanations that appeal to multiple causal mechanisms. We found that relative complexity (observed minus expected complexity) was a strong predictor of explanation quality; however, even those explanations that were “too complex” were rated highly. These results contrast with previous findings that have shown people prefer explanations that appeal to the fewest causes (Lombrozo, 2007; Read & Marcus-Newhall, 1993).

One possible reason for the discrepancy is that our participants are driven by the desire to know that the explanandum is fully accounted for. In probabilistic terms, this is best represented as the likelihood of the explanandum given a set of causes. This interpretation is consistent with some previous findings (Pacer, Williams, Lombrozo, Xi, & Griffiths, 2013; Vasilyeva & Lombrozo, 2015). Providing additional causes typically increases the likelihood of the explanandum, making it seem inevitable. This hypothesis leads to an interesting prediction that complex explanations should not be preferred after controlling for the likelihood of the explanandum. In contrast, experiments that use causal inference paradigms require participants to select the causes that best explain the observed data. For instance, Lombrozo (2007) uses a causal inference paradigm to provide support for the simplicity principle, but instructs participants that the effect always follows from a cause—thus, regardless of which explanation a participant prefers, the explanandum is inevitable. It is possible that causal inference paradigms overemphasize a particular explanatory goal: selecting the most likely cause.
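To see why citing additional causes typically raises the likelihood of the explanandum, consider a noisy-OR parameterization, in which independent causes each have some probability of producing the effect; this model is purely illustrative, not one the experiments tested:

```python
# Sketch: under noisy-OR, each additional independent cause can only
# raise (never lower) the probability of the explanandum.
def noisy_or(cause_strengths):
    """Each strength w is P(effect occurs | that cause alone)."""
    p_absent = 1.0
    for w in cause_strengths:
        p_absent *= 1.0 - w
    return 1.0 - p_absent

print(noisy_or([0.6]))       # one cause:  0.6
print(noisy_or([0.6, 0.5]))  # two causes: 0.8 (effect looks more inevitable)
```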

Our data are inconsistent with an alternative hypothesis that participants should prefer explanations that increase the likelihood of a set of causes given the explanandum (Pearl, 1988). This hypothesis predicts a preference for simplicity, as the likelihood of a set of causes necessarily decreases (or remains the same) as additional causes are proposed. Moreover, an emphasis on predictive, as opposed to diagnostic, reasoning may lead people to endorse causes that are not true. For instance, one explanation suggested that China’s population is growing in part because of a decline in the death rate, despite the fact that the death rate has actually increased slightly since the one-child policy took effect (The World Bank, 2015). Our results suggest that when participants are not judging whether a cause is true (as in causal inference paradigms), they may not evaluate the likelihood of the causes at all and instead assume them to be true (Fricker, 2002).

Another possibility is that people sometimes prefer simple explanations because natural phenomena are often the result of few causal mechanisms. Thomas Aquinas (1945) advocates for simplicity in his claim that “nature does not employ two instruments when one suffices.” In this sense, a preference for simplicity could be a rational adaptation to the environment. However, many phenomena have complex antecedents. Carruthers (2006) counters that biological explanations may not be simple because “one should expect biological systems to be messy and complicated, full of exaptations and smart kludges” (p. 151). Similarly, Salmon (2001) argues that the “desirability of simplicity seems to be an empirical question. In the social sciences, for example, it appears that simple hypotheses may be considered implausible because they are apt to be oversimplifications” (p. 129). Our results echo Carruthers and Salmon: even prior to reading an explanation, participants expected each explanandum to be explained through multiple causal mechanisms, and so single-cause explanations may have appeared too simplistic.

General discussion

Everyday explanations differ from scientific explanations in their structure and in their goals. In Experiment 1, we evaluated potential explanatory criteria and found that some explanatory virtues, such as internal coherence, are good predictors of explanation quality even for everyday explanations. However, other criteria, such as articulation and perceived expertise, are also highly predictive, even though they do not reflect the intrinsic quality of an explanation.

In two experiments, we find evidence that when evaluating explanations, people prefer explanations that are subjectively complex and appeal to multiple causal mechanisms. While this finding is at odds with claims that simpler explanations are usually better, it is consistent with prominent theories in philosophy that suggest the role of an explanation is to identify the causal network of events leading up to an event (Salmon, 1984; Strevens, 2008). The explanations used in the present experiments differ from those often used in the psychological literature because they attempt to be complete and, for the most part, describe the causal mechanisms that led to an event. Rather than asking participants to ascribe causes to events or adjudicate between simple causes, participants were asked to evaluate the aptness of an entire causal system.

The ubiquity of explanations in our lives leads us to constantly evaluate potential causal mechanisms affecting the world. Deepening our understanding of what leads us to accept some explanations and reject others has implications for scientific communication, public policy, legal precedents, and beyond. Our current findings suggest that everyday explanatory practices are more complex and nuanced than previously thought.