Deontology and Utilitarianism in Real Life: A Set of Moral Dilemmas Based on Historic Events

Moral dilemmas are frequently used to examine psychological processes that drive decisions between adhering to deontological norms and optimizing the outcome. However, commonly used dilemmas are generally unrealistic and confound moral principle and (in)action so that results obtained with these dilemmas might not generalize to other situations. In the present research, we introduce new dilemmas that are based on real-life events. In two studies (a European student sample and a North American MTurk sample, total N = 789), we show that the new factual dilemmas were perceived to be more realistic and less absurd than commonly used dilemmas. In addition, factual dilemmas induced higher participant engagement. From this, we draw the preliminary conclusion that factual dilemmas are more suitable for investigating moral cognition. Moreover, factual dilemmas can be used to examine the generalizability of previous results concerning action (vs. inaction) and concerning a wider range of deontological norms.

In the present work, we present a set of new moral dilemmas based on real-life situations, such as the organ harvesting situation described above. We suggest that these dilemmas are experienced as more realistic, which causes higher participant engagement. Dilemmas with high participant engagement, in turn, likely engage psychological processes that are more similar to moral judgments in real life. Accordingly, these factual dilemmas are more appropriate for examining moral cognition. In addition, factual dilemmas have other advantages over typically used dilemmas; they do not confound (in)action and moral principle, employ more varied situations, and increase the range of deontological norms that can be examined. By testing factual compared with other dilemmas, the present work is chiefly focused on improving the predominantly used method in moral dilemma research. As such, it does not aim at testing psychological theories. At the same time, we are convinced that this methodological contribution will improve future theory tests.

Moral Dilemma Judgments
Dilemmas are used to examine decisions between adhering to moral norms and optimizing consequences. Conforming to the dominantly used terminology, we call judgments deontological if they favor the option prescribed by the philosophical principle of deontology (e.g., Kant, 1785/1998). Deontology requires acting in a way that is consistent with universal norms (e.g., do not kill) even if the consequences of the action involve suffering (Kant, 1785/1998). Thus, deontology forbids killing even if this refusal leads to the deaths of many more people. In contrast, we call a judgment utilitarian if it favors the option prescribed by the principle of utilitarianism (e.g., Mill, 1861/1998). Utilitarianism requires optimizing the overall well-being regardless of the nature of the action itself (Mill, 1861/1998). Thus, utilitarianism may require killing one person if it is the only means to save several people. We use these terms purely descriptively and for terminological coherence with previous research; we are not implying that deontological or utilitarian judgments are driven by participants' philosophical considerations. Deontological compared with utilitarian judgments have also been conceptualized as favoring norms compared with consequences and as action-based compared with outcome-based (Cushman, 2013). As the present work is methodological, we take an agnostic stance concerning these positions.
Deontological and utilitarian judgments can result from several psychological processes. For example, norm-based reasoning has been found to increase deontological judgments (Broeders et al., 2011; Körner & Volk, 2014), whereas manipulations that increase the outcome disparity between the two options have been found to increase utilitarian judgments (Cao et al., 2017; Kawai et al., 2014; Moore et al., 2008; Petrinovich et al., 1993). Thus, both concerns about norms and consequences influence moral dilemma judgments. In addition, psychological processes that are unrelated to the moral principles of deontology and utilitarianism have also been shown to influence moral judgments; most prominently, affective processes (Bartels & Pizarro, 2011; Gleichgerrcht & Young, 2013; Valdesolo & DeSteno, 2006), the amount of cognitive deliberation (Moore et al., 2008; Suter & Hertwig, 2011), and social evaluation and self-presentation processes (Everett et al., 2016; Rom & Conway, 2018).
However, the above-described results were obtained using selections from the same set of moral dilemmas (introduced by Greene et al., 2001; extended and modified by Moore et al., 2008; see also Christensen et al., 2014;Lotto et al., 2014). These trolley-type dilemmas are the predominantly used stimuli in moral dilemma research (e.g., Hauser et al., 2007;Laakasuo et al., 2017). Trolley-type dilemmas are based on a small set of highly stylized situations with many shared characteristics, for example, emergency situations and an identifiable victim who suffers as a result of the utilitarian option. This homogeneity of stimuli undermines stimulus sampling (Wells & Windschitl, 1999) and entails the risk of results being shaped by the specific characteristics of the stimulus set (see e.g., Bauman et al., 2014).
This very low realism is problematic for at least two reasons: noncomparability between the world portrayed in dilemmas and the real world and, relatedly, the engagement of different psychological processes. First, it has been argued that trolley-type dilemmas are so unrealistic as to presuppose a very different world from the world we know (e.g., Davis, 2012). For example, in dilemmas, all consequences of each possible action follow deterministically (e.g., torturing a suspect will definitely lead to an innocent victim's being saved, whereas not torturing the suspect will definitely lead to the victim's death), and the actor has complete knowledge of these consequences. However, as moral intuitions are developed in real life (e.g., where torture might do more harm than good), assessing them in situations where real-world aspects are stipulated to be absent may yield nonsense results. Consequently, decisions may not be transferable to moral situations that could occur in real life (Davis, 2012; Elster, 2011). Therefore, some scholars consider it doubtful what we can learn from answers to trolley-type dilemmas (see also Martena, 2018; Sauer, 2018; Woodward & Allman, 2007).
Second, trolley-type dilemmas may engage different processes compared with real-world dilemma situations. Although to date we know little about psychological processes during real-life dilemma judgments, amusement is certainly not prominent. Trolley dilemmas, in contrast, have been found to amuse participants (about one third of participants felt amused by the trolley dilemma and about two thirds by the footbridge dilemma; Bauman et al., 2014). Thus, low realism probably alters the psychological processes dominating moral judgments (see also Knutson et al., 2010).
In a theoretical model of narrative text processing, low realism has been postulated to negatively affect text processing (Busselle & Bilandzic, 2008). Specifically, the detection of realism violations is postulated to impair imagery concerning the narrative, thereby reducing participant engagement and consequently outcomes that depend on participant engagement, such as entertainment and persuasion (Busselle & Bilandzic, 2008;Busselle & Vierrether, 2022). Consistent with this theory, realism has been found to predict message evaluation so that higher realism was associated with more persuasion (Cho et al., 2014) and narrative-consistent thoughts (Green, 2004). Conversely, fictional stories with low realism have been found to yield no narrative persuasion (Nera et al., 2018). In sum, low (vs. high) realism adversely affects text processing by reducing engagement. In the present research, we partially test this theoretical prediction by examining whether dilemma realism is associated with participant engagement.
Empirically, scenarios are most frequently judged to be unrealistic if they are (a) implausible, (b) untypical for most people, or (c) not based on facts (Hall, 2003). Trolley-type dilemmas are typically low on all three realism criteria. First, these dilemmas are usually not based on facts but are purely hypothetical. Second, decisions that directly and immediately result in the life or death of other human beings are very rare, and it is rarer still that such decisions involve out-of-control trains or other fanciful emergencies. Thus, the situations are certainly anything but typical events in most people's lives. Third, many trolley-type dilemmas consist of implausible options; both the range of options and these options' alleged consequences contradict most readers' knowledge about the world. Specifically, in many dilemmas, it seems implausible that a group will certainly die when a specific action is not performed (e.g., five people will be overrun by a nearing trolley: no one will notice the trolley in time, no one will survive the contact) but that the whole group will be saved if the suggested action is performed (one heavy man will stop the trolley: pushing him will have him falling on the tracks, and his weight is sufficient to stop the trolley that would otherwise kill five).
Plausibility has been found to systematically influence dilemma judgments (Körner et al., 2019). If a dilemma's range of options and these options' consequences were implausible (vs. plausible), participants were less willing to transgress deontological moral norms. Thus, low plausibility led to comparatively deontological judgments (Körner et al., 2019; for similar results, see Kneer & Hannikainen, 2022). As dilemmas have no correct answer, it is not clear whether the relatively utilitarian judgments in the plausible dilemma versions are less biased than the relatively deontological judgments in implausible versions. However, as plausibility, a central aspect of realism, influences moral judgments, this indicates that participants are influenced by low realism instead of ignoring it. Thus, dilemma judgments may be distorted by variations in realism. The association between realism and dilemma judgments will be further examined in the present research.

Additional Problems With Trolley-Type Dilemmas
In addition to low realism, trolley-type dilemmas have been criticized for confounding moral principle and action compared with inaction (Gawronski & Beer, 2017; see also Manfrinati et al., 2013; Miller et al., 2014; Patil, 2015). Specifically, in trolley-type dilemmas, the deontological option is always an inaction while the utilitarian option always involves active interference. In the footbridge dilemma, for example, the deontological option means not interfering, letting the trolley proceed and overrun five people, whereas the utilitarian option requires actively pushing a stranger onto the tracks so that his bulk stops the trolley. Time pressure, for example, might alter either moral processes or the preference for inaction; both could lead to the same judgments. Thus, using only dilemmas where moral principle and action/inaction are confounded, one cannot distinguish between processes that affect inaction tendencies and processes that affect utilitarian compared with deontological tendencies (for an approach to disentangle these determinants, see . Note, however, that preferences for action/inaction need not necessarily be unrelated to morality. Regarding harm caused by commission (breaking proscriptive norms) as more severe than harm caused by omission (breaking prescriptive norms) can be genuinely moral (Janoff-Bulman et al., 2009; cf. Spranca et al., 1991). Thus, processes related to (in)action may be genuinely moral processes.
Finally, trolley-type dilemmas are problematic because of their strong predominance in the field. Many highly influential studies use only trolley dilemma variations (e.g., Awad et al., 2020;Greene et al., 2009;Hauser et al., 2007;Valdesolo & DeSteno, 2006). This predominance induces other scientists to adhere to this de facto standard. Barak-Corren and Bazerman (2017, p. 284), for example, justified their decision to use the trolley dilemma by stating "our use of the trolley problem is not motivated by any great interest in railway ethics, and there are disadvantages . . . [T]he trolley problem has been extensively studied, and our experimental approach and hypotheses draw extensively upon the specific lessons of this existing literature." Thus, by now, trolley dilemmas are so extensively used that more researchers feel pressured to use them. A common method within a field of research is useful for comparing results and building a coherent body of knowledge. However, when results are only demonstrated with one method, it is unclear to which situations these results can be generalized (Wells & Windschitl, 1999). Accordingly, the strong reliance on a small set of dilemmas has raised doubts about the generalizability of moral judgment results (Bauman et al., 2014).
In sum, two arguments, namely that trolley dilemmas are perceived as unrealistic and that they confound moral principle and (in)action, have raised concerns about the validity of trolley-type dilemmas for examining moral cognition. In addition, their strong predominance has raised doubts about the generalizability of established findings beyond trolley-type dilemmas. Thus, expanding and improving the set of dilemmas is a worthwhile goal for moral cognition research. There have been previous attempts at using more realistic dilemmas (e.g., Baron et al., 2015; Kahane et al., 2012; Piazza & Landy, 2013; Takamatsu, 2019). However, these dilemmas were generally created on the spot to examine specific questions. Accordingly, each publication contained only a small number of dilemmas and these dilemmas were usually not extensively pretested. In the present research, we reverse the focus; instead of testing theoretically derived predictions about moral judgments, we focus on examining whether the present new dilemmas are seen as more realistic and lead to higher participant engagement than trolley-type dilemmas.

The Present Factual Dilemmas
In the present research, we introduce a set of new dilemmas. These factual dilemmas are created as stimuli for moral psychology experiments where trolley-type dilemmas predominate. The basic requirement for these types of dilemmas is that they describe situations where a decision has to be made between two mutually exclusive options, one consistent with deontology and the other consistent with utilitarianism. Accordingly, our goal was to keep this conflict between deontology and utilitarianism as clear as possible. The new features compared with trolley-type dilemmas were (a) increased realism, which should facilitate participant engagement (Busselle & Bilandzic, 2008); (b) more situational variety (e.g., not only emergency situations) as well as more varied norms; and (c) removal of the confound between (in)action and moral principle.
A main advantage of the present factual dilemmas is that all dilemmas are based on real-life occurrences. By describing events that did occur, they fulfill the realism criterion of factuality. As they contain specific details, such as place and date, we expect that their factual nature is apparent to readers. We also hypothesize that factual (vs. trolley-type) dilemmas are perceived to be more plausible and more typical of real-life events.
Thematically, factual dilemmas involve predominantly medical decisions, emergencies resulting from accidents, and situations arising through war, terrorism, or crime. For an overview, see Tables 1 and 2; for full texts, see the Appendix. Structurally, factual dilemmas fall into three distinct classes. First, a number of factual dilemmas are similar to trolley-type dilemmas in that they describe a decision whether or not to kill someone to save more lives. The main differences are that factual-killing dilemmas are more realistic and include a wider range of situations as well as information about the general setting and the people involved.
Second, in some factual dilemmas, the pairing of action (vs. inaction) with utilitarianism (vs. deontology) is reversed. As already mentioned, trolley-type dilemmas always employ proscriptive deontological norms, that is, forbidding specific actions. Some factual dilemmas employ prescriptive deontological norms, that is, norms that involve actions, such as saving human lives. Studies using dilemmas where action coincides with deontology can show whether judgments in dilemmas with proscriptive and prescriptive norms employ similar psychological processes.
Third, factual dilemmas employ a wider set of deontological moral norms. Trolley-type dilemmas always involve the question of whether or not it is appropriate to kill a human being, with torture as the only exception. In addition to 22 life-or-death situations (decisions about killing or saving lives), the factual dilemmas contain 16 dilemmas involving other deontological moral norms, such as norms about bodily inviolateness, animal suffering, or intentionally breaking laws (for a complete list, see Table 2). These dilemmas can be used to test whether an influence on dilemma judgments generalizes beyond killing/saving decisions to other moral dilemma judgments. In sum, in addition to increasing realism, the factual dilemmas allow for three steps of increasing generalization compared with trolley-type dilemmas, from more varied and detailed but structurally similar dilemmas, over a generalization concerning action/inaction, to general dilemma situations (i.e., beyond killing/saving) contrasting consequence optimization with the adherence to commonly endorsed deontological moral norms.
Here, we present two studies where participants evaluated trolley-type and our newly developed factual dilemmas. To determine whether factual dilemmas are superior to trolley-type dilemmas as stimuli for moral psychology studies, we used perceived realism as a first criterion. Trolley-type dilemmas are generally perceived to be unrealistic, which could distort results (see above). Moreover, realism has been found to influence dilemma judgments (Kneer & Hannikainen, 2022; Körner et al., 2019). Therefore, we consider highly realistic dilemmas to be superior as stimuli for moral psychology. Accordingly, we asked participants in both studies to evaluate dilemma realism. As a second quality criterion, participant engagement was assessed. High participant engagement is a precondition of dilemma judgment validity (Bauman et al., 2014). Moreover, higher realism should increase participant engagement (Busselle & Bilandzic, 2008). To measure participant engagement, we assessed response times (Study 1) and participants' ratings of how seriously they took the dilemmas (Study 2). We predicted that factual dilemmas would be perceived to be more realistic than trolley-type dilemmas and would lead to higher participant engagement.
Note (Tables 1 and 2). a Suffering subset of benefited: describes whether the victims of the utilitarian choice also belong to the group of people who suffer when a deontological choice is made. In factual-killing dilemmas, this is death avoidability (will the victim die in either case; see, e.g., Moore et al., 2008). b Opposite framing reasonable: Can the question be framed so as to make the other principle (deontology or utilitarianism) coincide with active interference (without substantial dilemma modifications)? c Unless one considers that GB is no longer in EU.

Study 1
We developed dilemmas based on historic facts. For this, we gathered information on various events where norms and consequences seemed incommensurate. We excluded situations for which viable alternative actions (compared with both the default and the intervention option presented in the dilemma scenario) seemed possible (see Körner et al., 2019) and situations for which we did not find at least one source reporting the main facts in a reputable journalistic or historic/scientific outlet. Using the remaining events, we wrote dilemma texts, highlighting the situation, the options, and their consequences in a manner similar to trolley-type dilemmas. However, the descriptions were generally longer and contained more details. The main aim of Study 1 was to test how these factual dilemmas are evaluated compared with trolley-type dilemmas. Our hypothesis was that factual dilemmas would be perceived to be more realistic than trolley-type dilemmas: more typical of real-life situations, more plausible, and less absurd. Moreover, if higher realism increases participant engagement, we would expect participants to spend more time on reading and answering factual (vs. trolley-type) dilemmas.

Method
To prevent fatigue-related effects, the dilemmas were split into smaller subsets of nine to 20 dilemmas, comprising six sets, each used in one data collection wave. 1 In both studies, all measures, manipulations, and exclusions are reported. Neither study was preregistered. All stimuli, materials, raw data, and analysis files can be found at: https://osf.io/cg5tq/.
Participants. A total of 545 German-speaking participants (406 female, 139 male, M_age = 27 years, SD_age = 9) from the local participant pool (consisting mainly of students from different disciplines) completed the study in the lab, approximately equally distributed across the data collection waves, in exchange for 9€/hour. 2 We expected at least medium-sized effects when comparing factual and trolley-type dilemmas for realism. However, to gain more precise estimates (and be able to compare individual dilemmas), we employed a much larger sample size, at least 80 participants per dilemma. The final sample size depended on logistical constraints. With 545 participants and α = .05, the sensitivity analysis for Study 1 yields 80% power to detect d_z = 0.12.
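As a rough plausibility check on the sensitivity figure reported above, the smallest detectable standardized paired difference can be approximated in a few lines. This is only a sketch under assumptions that the text does not state: a two-sided paired test, a normal approximation in place of the noncentral t distribution, and a hypothetical helper name (`min_detectable_dz`).

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_dz(n, alpha=0.05, power=0.80):
    """Approximate the smallest d_z detectable with n paired observations.

    Assumes a two-sided test and uses the normal approximation
    d_z = (z_{1 - alpha/2} + z_{power}) / sqrt(n).
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value
    z_power = standard_normal.inv_cdf(power)          # quantile for target power
    return (z_alpha + z_power) / sqrt(n)


# Sample size, alpha, and power as reported for Study 1.
print(round(min_detectable_dz(545), 2))  # prints 0.12
```

Under these assumptions the approximation agrees with the reported value; an exact computation based on the noncentral t distribution would differ only marginally at this sample size.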
Dilemmas. As factual dilemmas, we used 16 killing dilemmas (similar to trolley-type dilemmas), six saving dilemmas (using the opposite moral principle-action pairing), and 16 dilemmas involving other norms (10 of which use a utilitarianism-action pairing and 6 a deontology-action pairing). For an overview of the content, see Tables 1 and 2, and for full texts (German and English), see the Appendix.
For comparison, we selected 15 frequently used trolley-type dilemmas from Moore et al. (2008) and Greene et al. (2001)

Measures. The moral judgment question was How appropriate is it to [perform suggested action] to [achieve goal]?
Participants answered on a 7-point scale, ranging from 1 (not at all appropriate) to 7 (completely appropriate).
Then participants were asked to evaluate their judgment and the dilemma on several 7-point scales. First, using the questions from Christensen et al. (2014), participants evaluated their feelings during the moral judgment, specifically experienced valence (1 = very negative; 7 = very positive) and arousal (1 = not at all arousing; 7 = very arousing). Next they rated the judgment difficulty (1 = not at all difficult; 7 = very difficult).
For dilemma realism, participants evaluated absurdness, typicality, and plausibility. Specifically, they rated the dilemma concerning absurdness (1 = not at all absurd; 7 = very absurd). Then, they indicated their agreement with four statements about dilemma typicality (adapted from Bauman et al., 2014), which were averaged into a typicality score: (a) "The scenario is realistic," (b) "The scenario is similar to decisions that people have to make in real life," (c) "This would never happen in real life" (reverse coded), and (d) "The scenario resembles real moral predicaments." Then, plausibility was assessed as agreement with two statements, which were averaged into a plausibility score (Körner et al., 2019, Experiment 1): (a) "It is probable that the described actions would have the stated consequences" and (b) "It is plausible that, given the circumstances, there are no other options; that is, no reasonable action to achieve a better outcome." Both typicality and plausibility were answered using 7-point scales (1 = do not agree at all; 7 = agree completely).
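The scoring described above can be illustrated with a minimal sketch. The function and argument names are hypothetical and are not taken from the authors' analysis files; the only assumptions carried over from the text are the 7-point response format, the reverse coding of the "would never happen" item, and the averaging of items into scores.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 7  # 7-point agreement scale described in the text


def reverse_code(rating):
    """Mirror a rating on the scale, so that 1 becomes 7, 2 becomes 6, and so on."""
    return SCALE_MIN + SCALE_MAX - rating


def typicality_score(realistic, similar_to_real_life, never_happens, resembles_real):
    """Average the four typicality items; the third item is reverse coded."""
    return mean([
        realistic,
        similar_to_real_life,
        reverse_code(never_happens),
        resembles_real,
    ])


def plausibility_score(consequences_probable, no_other_options):
    """Average the two plausibility items."""
    return mean([consequences_probable, no_other_options])


# Made-up ratings for a single participant and a single dilemma.
print(typicality_score(6, 5, 2, 7))
print(plausibility_score(5, 4))
```

Reverse coding before averaging ensures that higher values on every item, and therefore on the score, indicate higher perceived typicality.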
For the latter three (of six) data collection waves, participants were asked for each scenario (filler, trolley-type, and factual) whether, if the dilemma had been based on real-life events, they remembered hearing about it (by choosing one of the following four options, no; it seems familiar but I have no specific recollections; I can remember the scenario but not what was decided; I can remember the scenario and its aftermath). Finally, participants were given the option to comment on the dilemma.
Procedure. Participants were informed that they would be reading and evaluating descriptions of dilemma situations, some of which would be based on historic events (that is, they had occurred in real life) while others were purely hypothetical. After providing informed consent, participants first read and evaluated a filler scenario 3 to become familiar with the questions. Then they evaluated trolley-type dilemmas and factual dilemmas in random order with the restriction that not too many dilemmas of the same type succeeded each other, interspersed with a few filler scenarios. Participants were not informed to which kind (filler, factual, or trolley-type) any scenario belonged. After answering the moral judgment question, evaluating their feeling during the judgment, and evaluating the dilemma for realism and familiarity, the next dilemma ensued. After participants had evaluated all dilemmas, they provided demographic information and were given the chance to comment on the study.

Results and Discussion
For each participant, mean scores for all judgment dimensions were calculated separately for factual and trolley-type dilemmas. Response times were divided by the number of words before means for each dilemma type were calculated. Tables 3 and 4 depict all evaluations on the dilemma level. As can be seen, dilemma evaluations vary substantially, both for trolley-type dilemmas and for factual dilemmas. Among trolley-type dilemmas, Footbridge was perceived to be particularly unrealistic whereas Crying Baby and Lifeboat were generally perceived to be comparatively realistic. Among factual dilemmas, historically very distant situations were generally perceived to be less realistic than more current dilemmas. Table 5 depicts the means for the different subclasses of dilemmas separately. Although judgment, valence, arousal, and difficulty did not significantly differ depending on dilemma type, there seem to be nonsignificant tendencies indicating that dilemmas involving other norms might be judged less arousing and less difficult than life-or-death dilemmas. Table 6 shows exploratory correlations between moral judgment and dilemma evaluations as well as all dilemma evaluations among each other. These results indicate that all realism aspects (typicality, plausibility, and absurdness) correlated with moral judgments. Specifically, the higher the perceived realism of a dilemma, the more participants were willing to endorse the suggested action (for the same results separately for each dilemma type, see the Supplemental Materials). This finding is consistent with previous results (Körner et al., 2019), indicating that high plausibility increases utilitarian judgments.
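The response-time normalization mentioned at the start of this section (time divided by the number of words, then averaged per participant and dilemma type) can be sketched as follows. The record layout, the field names, and the example values are hypothetical and serve only as an illustration; they are not the structure of the published data files.

```python
from collections import defaultdict
from statistics import mean


def per_word_rt_means(trials):
    """Return mean response time per word for each (participant, dilemma type).

    Each trial is a dict with the hypothetical keys 'participant',
    'dilemma_type', 'rt_seconds', and 'n_words'.
    """
    grouped = defaultdict(list)
    for trial in trials:
        key = (trial["participant"], trial["dilemma_type"])
        # Normalize by text length so that longer dilemmas are comparable.
        grouped[key].append(trial["rt_seconds"] / trial["n_words"])
    return {key: mean(values) for key, values in grouped.items()}


# Made-up trials for one participant.
example_trials = [
    {"participant": "p1", "dilemma_type": "factual", "rt_seconds": 60.0, "n_words": 200},
    {"participant": "p1", "dilemma_type": "factual", "rt_seconds": 90.0, "n_words": 300},
    {"participant": "p1", "dilemma_type": "trolley", "rt_seconds": 20.0, "n_words": 100},
]
for key, value in per_word_rt_means(example_trials).items():
    print(key, value)
```

Dividing by word count matters here because the factual dilemma texts are described as generally longer than the trolley-type texts, so raw response times would not be comparable across dilemma types.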

Study 2
Study 2 again compared factual dilemmas and trolley-type dilemmas. This time, we employed an online sample residing in the United States to examine whether the results from Study 1 generalize beyond German participants (which seemed especially important as several factual dilemmas occurred in Europe). Our hypothesis was that factual dilemmas would again be judged to be more realistic than trolley-type dilemmas. Moreover, as realism should influence participant engagement, we also hypothesized that participants would take factual dilemmas more seriously than trolley-type dilemmas.
Note. Typicality, plausibility, and absurdness: 1 = not at all, 7 = completely; Arousal: 1 = not at all arousing, 7 = very arousing; Valence: 1 = very negative feelings, 7 = very positive feelings; Decision difficulty: 1 = not difficult, 7 = very difficult; moral judgment: 1 = not at all appropriate, 7 = completely appropriate; utilitarian judgment: identical to appropriateness rating of action = utilitarian dilemmas; reverse coded for action = deontological dilemmas; Familiarity: percentage of participants who stated that they had heard about the situation and knew the outcome; missing values indicate that the question was omitted in the study batch containing this dilemma.

Method
The dilemmas were split into three subsets, so that each subset contained 22 dilemmas (2 identical filler scenarios, 7-8 trolley-type dilemmas, and 12-13 factual dilemmas). Each participant was randomly assigned to evaluate one subset of dilemmas.  Conway and Gawronski (2013). To be able to examine whether there are any group differences between the three dilemma-subset conditions, every dilemma set included Fumes. Neither moral judgments nor any dilemma evaluations of Fumes differed between the dilemma subset conditions (see https://osf.io/cg5tq/). As factual dilemmas, we used the same dilemmas as in Study 1. They were translated to English and sometimes slight modifications were made.

Participants.
Materials and procedure. The instructions were the same as in Study 1. After providing informed consent, participants first read and evaluated the Heinz dilemma, a filler scenario. For each scenario (filler, trolley-type, and factual), participants read the scenario and answered the moral judgment question How appropriate is it to [perform suggested action]? 6 Participants answered on a 7-point scale, ranging from 1 (completely inappropriate) to 7 (completely appropriate). Then participants were asked to evaluate their judgment and the dilemma on the same scales as in Study 1 with three changes. First, the scale anchors for the typicality and plausibility questions were changed to 1 = strongly oppose; 7 = strongly support. Second, the question about participants' prior knowledge of the situation was modified. They indicated whether they thought the scenario was based on true facts and, if so, whether they remembered hearing about it by choosing one of the following four options: no, I do not think it happened; yes, I think it happened, but I do not remember hearing about it; yes, I remember the situation but not the results; yes, I remember the situation and also the results. Third, participants indicated how seriously they took the scenario by indicating their agreement with the statement "It was easy to take the scenario and the moral decision seriously" using a 7-point scale (1 = strongly oppose; 7 = strongly support). Except for these modifications, the materials and procedure were the same as in Study 1.

Results and Discussion
For each participant, mean evaluation scores for each dimension were calculated separately for factual and trolley-type dilemmas. Factual dilemmas were judged to be more typical  Table 9 depicts the means for all evaluation dimensions for the different subclasses of dilemmas separately. This time, valence did differ significantly depending on dilemma type, with the least negative evaluations for dilemmas involving other norms compared with life-or-death decisions. Table 10 shows exploratory correlations among all judgments and dilemma evaluations. As in Study 1, the results indicate that all realism aspects (typicality, plausibility, and absurdness) correlated with moral judgments (for the same results separately for each dilemma type, see the Supplemental Materials). As in Study 1, higher realism was associated with more utilitarian judgments, indicating that realism affects morally relevant cognitive processes.

General Discussion
The majority of studies on moral dilemma judgments use a small set of dilemmas that have been widely criticized (e.g., Gigerenzer, 2010; Sauer, 2018). These trolley-type dilemmas are unrealistic, sometimes to the point of being ludicrous; they confound moral principle with (in)action; and they employ a small set of highly similar and impoverished scenarios. Low realism could distort moral judgments (Kneer & Hannikainen, 2022; Körner et al., 2019), and the strong predominance of homogeneous stimuli raises questions about the generalizability of empirical results (Woodward & Allman, 2007). In the present research, we presented and tested a set of new dilemmas based on historic events. These factual dilemmas were (a) perceived to be more realistic than the predominating trolley-type dilemmas; specifically, they were judged to consist of more plausible options, to be more similar to real-world moral quandaries, and to be less absurd. This higher perceived realism is probably why factual dilemmas led (b) to higher participant engagement; specifically, participants spent more time reading and deciding on factual compared with trolley-type dilemmas (Study 1) and were able to take them more seriously (Study 2). Therefore, if researchers intend that participants get deeply engaged with moral dilemmas, the present research provides initial evidence that this goal is more easily achieved with factual than trolley-type dilemmas.
In addition to testing the new, factual dilemmas, the present research also provides the first systematic realism assessment of frequently used trolley-type dilemmas. Our results indicate that trolley-type dilemmas vary in realism. While Lifeboat, Crying Baby (this one more in Europe than the United States), and Border Crossing were judged to be comparatively realistic, other dilemmas were judged unrealistic, especially the popular Footbridge. It is noteworthy that several studies contrast the footbridge dilemma with other trolley variations. Many of these find that an employed manipulation influences only judgments in the footbridge dilemma (e.g., Claessens et al., 2020; Costa et al., 2014; Duke & Bègue, 2015), concluding that the manipulation influences only judgments in up-close, personal dilemmas. Instead, the present results suggest that different judgments for Footbridge (vs. other dilemmas) might be related to its low realism. We advise researchers to employ only highly realistic dilemmas and especially advise against the usage of Footbridge as well as Vitamins and Vaccine test. Similarly, we also advise against one of our new dilemmas, Lion Hunt, as its realism evaluations are markedly inferior to the other factual dilemmas.
Compared with the German lab-based sample, the American MTurk sample showed smaller effect sizes, that is, smaller differences between factual and trolley-type dilemmas. This might have several reasons. First, the two samples differ in their average experiences with moral dilemmas. While the lab participants in Study 1 were comparatively unused to dilemma studies, many MTurk workers participate in a great number of studies (e.g., Chandler et al., 2014), and repeated experiences with dilemmas might dull the experience of scenarios as unrealistic or absurd. This would explain the higher realism evaluations for trolley-type dilemmas in the MTurk (vs. lab) sample. Second, low motivation might lead to giving the same response to many questions. Overall, we observed less response differentiation in the MTurk (vs. lab) sample (see Tables 3 and 6), and if there is less systematic variance, we would expect smaller effect sizes. Third, there might be cultural differences in what is deemed realistic. Differences in knowledge about the historic context might explain, for example, why U.S. (but not German) participants judged a dilemma about a Native American tribe fighting for its rights in 1830 more realistic than a dilemma about Nazi atrocities during World War II. Even for the majority of dilemmas that happened roughly in the present time, different laws and cultural practices might have influenced what is judged realistic. In general, when selecting dilemmas for experiments, realism judgments by a sample of participants with a background similar to that of the experimental participants should be considered.
Note that both the present participants and the factual dilemmas over-represent the Western world. Future research needs to determine whether these dilemmas are also judged to be realistic and lead to higher participant engagement than trolley-type dilemmas in other parts of the world. Moreover, adding more dilemmas that happened outside of Europe would probably increase realism for non-Western participants. In addition, future research should examine judgments in factual dilemmas by representative samples of participants.

Validity Criteria for Moral Dilemmas
In the present research, we used realism and participant engagement as criteria to determine dilemma quality for moral psychology experiments. Using additional validation criteria would enable firmer conclusions. However, comparisons with previous findings are, in our opinion, not suitable to determine stimulus validity. These comparisons seem problematic because there is no way of knowing which findings are psychologically "correct" and which might be stimulus artifacts. Accordingly, neither finding similar effects for factual and trolley-type dilemmas nor finding different effects necessarily indicates validity.
Ultimately, the goal of any stimulus set should be to enable high internal and external validity. Concerning internal validity, stimuli need to be free from confounds (here, for example, concerning moral principle and action/inaction). In addition, dilemmas must depict a conflict between deontology and utilitarianism, for which, we argue, extraordinary situations are preferable to mundane situations. First, scenarios need unrecoverable losses to prevent participants from re-appraising the situation in a way that dissolves the dilemma. If bad things happen to a good person, participants tend to think that this person will be compensated (Harvey & Callan, 2014; see also Hafer & Rubel, 2015; Liu & Ditto, 2013). Many everyday sufferings, such as discomfort or monetary losses, could be compensated in the future, whereas severe suffering, especially death, cannot be compensated. This might explain why judgments concerning smaller stakes (monetary losses) have been found to yield different results from judgments concerning life-or-death decisions (Gold et al., 2014). Second, in everyday life, decisions frequently involve many different options or uncertain consequences. However, as moral dilemmas need to clearly contrast deontology and utilitarianism, they require scenarios whose consequences occur with very high probability. For example, in the conjoined twin case, one baby would definitely die during the operation, and both babies would definitely die if the operation were not performed. In most real-life health care decisions, consequences are less certain. However, with uncertain consequences, one might gamble on making a decision that leads to optimal consequences without violating deontological principles (e.g., the survival of both babies). Thus, with uncertain consequences, the moral conflict task might, through re-appraisal, become an optimization task.
In sum, for high internal validity, it is essential to have scenarios in which the consequences can be plausibly said to follow with high probability and entail unrecoverable losses, ensuring that deontology and utilitarianism do indeed require mutually exclusive actions.
External validity is the extent to which a study uncovers psychological processes that generalize across contexts and that play an important role in real-life situations (see Eastwick et al., 2013; Evans et al., 2015; Mook, 1983). As there is hardly any research on psychological processes in real-world dilemma situations, it is difficult to know which processes should be active in dilemma judgments. Accordingly, the external validity of moral dilemmas cannot be examined in the short run. A feasible alternative is predictive validity. Predictive validity could be examined by testing whether judgments, first, predict participants' own behavior or, second, predict how participants evaluate decisions made by others. First, in trolley-type dilemmas, judgments were found not to predict (or to predict only weakly) participants' own behavior in similar situations (Bostyn et al., 2018). Thus, future research needs to determine whether factual dilemmas fare better at predicting participants' behavior. Second, although individuals rarely have to decide on dilemmas themselves, they form public opinion and elect politicians who decide about dilemma-related topics, such as euthanasia and behavior in health pandemics, war, and terrorist attacks. Thus, citizens frequently evaluate dilemmas at secondhand. Accordingly, examining whether factual (and trolley-type) dilemmas predict moral judgments at secondhand would be another valuable test of predictive validity.
[…] participants who might have participated in many studies before, using well-worn dilemmas could prevent participants from forming new judgments. New dilemmas, in contrast, have to be evaluated on the spot and cannot be answered by memory retrieval. Besides being used in their present form, factual dilemmas can be modified. For example, when using them in combination with purely hypothetical dilemmas to enlarge the dilemma pool (i.e., without contrasting hypothetical and factual dilemmas), they may be shortened to make them more similar to trolley-type dilemmas (e.g., some of the present dilemmas, Rwanda, Coma, and Bishops, were modified to increase the number of dilemmas suitable for multinomial modeling; Körner et al., 2020). Note, however, that modifications could change perceived realism. Moreover, not only the dilemma texts but also the judgment question could be modified. For example, instead of evaluating how appropriate the focal action is, participants could be asked to judge whether the action is permissible, required, or signals a good moral character. Thereby, the factual dilemmas could be used to examine moral minimalism (Aktaş et al., 2017; Royzman et al., 2015) or moral character judgments (Uhlmann et al., 2013).
Factual dilemmas can be used to examine the generalizability of findings obtained with trolley-type dilemmas. First, factual-killing dilemmas may be used to examine whether any manipulation or process that was established with trolley-type dilemmas also influences dilemma judgments when realism is higher, the situations are more varied, and more context is available. Second, factual-saving dilemmas can be used to examine whether a manipulation or process influences dilemma judgments not only concerning proscriptive but also concerning prescriptive life-or-death decisions (for initial evidence that realism might influence judgments in prescriptive and proscriptive dilemmas differently, see the Supplemental Figures S1 & S2).
The final level of generalization concerns norms beyond life-or-death decisions. Factual dilemmas involving other norms (see Table 2) can be used to examine whether a manipulation or process influences dilemma judgments beyond killing/saving decisions. Thus, when selecting factual dilemmas, researchers can choose to examine dilemma judgments at differing degrees of generality.
In addition, some of the dilemmas can be easily reframed so that either the deontological or the utilitarian option is consistent with active interference (see Tables 1 and 2). This increases flexibility in usage, as these dilemmas can be used either in settings that require the traditional action-utilitarian pairing or in settings where the role of action compared with inaction in moral judgments is examined.
Making dilemmas more realistic involves adding information about the people involved and their relations to each other. From a purely cognitive perspective, one might deem this information unnecessary. However, for external validity, we argue that adding details and humanizing the protagonists (e.g., by mentioning their gender, nationality, or relationship to oneself) is useful (Hester & Gray, 2020). If a manipulation only influences dilemma judgments when the situation descriptions are devoid of any personal or social information, the generalizability of its influence is limited. On the downside, however, the richer descriptions present in factual dilemmas likely decrease effect sizes because the additional information introduces unsystematic variance. Thus, we recommend increasing the sample size when using factual dilemmas.
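The link between added unsystematic variance and required sample size can be made concrete with a rough sketch. All numbers below are hypothetical, and the calculation uses a textbook normal approximation for a simple two-group, two-sided comparison, which is an illustrative assumption and not the design or analysis of the present studies. The sketch shows how a fixed raw difference yields a smaller standardized effect once independent noise variance is added, and how the approximate sample size needed for a given power grows accordingly.

```python
from math import ceil, sqrt
from statistics import NormalDist


def attenuated_d(raw_diff, sd_systematic, sd_noise):
    """Standardized effect when independent unsystematic variance is added."""
    return raw_diff / sqrt(sd_systematic ** 2 + sd_noise ** 2)


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-group comparison
    (normal approximation; an illustrative assumption)."""
    z = NormalDist().inv_cdf
    return ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / d ** 2)


# Hypothetical values: same raw difference, sparse vs. richly described scenarios.
d_sparse = attenuated_d(raw_diff=0.5, sd_systematic=1.0, sd_noise=0.0)
d_rich = attenuated_d(raw_diff=0.5, sd_systematic=1.0, sd_noise=0.75)

n_sparse = n_per_group(d_sparse)
n_rich = n_per_group(d_rich)
print(d_sparse, n_sparse)
print(d_rich, n_rich)
```

Because the required sample size scales with the inverse square of the standardized effect, even a moderate increase in unsystematic variance translates into a noticeably larger sample under these assumptions.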
Factual dilemmas vary along some dimensions that have previously been found to influence dilemma judgments (e.g., Christensen et al., 2014; Moore et al., 2008), for example, whether the protagonist would also benefit from the suggested action or is a disinterested bystander (see Tables 1 and 2 for each dilemma's level). These variations could be either seen as potential confounds or as desirable to ensure the robustness of any examined effect (note that the present results are very similar when controlling for these factors, see Supplemental Materials). When regarded as confounds, dilemmas only from one combination of these categories (dilemmas where the victim is only going to suffer in one of the options and where no option involves self-benefit) could be selected. Alternatively, to examine generalizability, these dilemma dimensions could be systematically examined, statistically controlled, or allowed to vary.

Conclusion
Despite widespread criticism, trolley-type dilemmas are used in the overwhelming majority of studies on moral dilemma judgments. With trolley-type dilemmas, evidence for the influence of various psychological processes on dilemma judgments has been accumulated. We suggest that it is time to use more varied and more realistic stimuli. Using more varied dilemmas in psychological studies would inform us how specific or general previously examined cognitive processes are in their influence on judgments, showing how much morality theories generalize. Using more realistic dilemmas should increase participant engagement, leading to more valid results. Here, we suggested new dilemmas based on facts. We showed that factual dilemmas are perceived as more realistic than trolley-type dilemmas and induce higher participant engagement. From this, we tentatively conclude that factual dilemmas are suitable stimuli for moral psychology research.