Can online interfaces enhance learning for public decision-making? Eliciting citizens' preferences for multicriteria decision analysis

Innovative online interfaces informing and consulting citizens about their preferences for multicriteria decision analysis (MCDA) could make public decision-making more participatory. We propose a three-faceted framework of learning for decision-making and used it to test newly designed online weight elicitation interfaces. We investigated two features meant to enhance learning: fully-fledged gamification with a narrative, interaction with nonplayer characters, and ambient music; and learning loops (LL) using consistency checks of elicited weights and the challenge to resolve inconsistencies. We operationalized our framework with a novel systematic set of measurement instruments providing complementary data types. We designed a 2 × 2 between-subject experiment with pre- and postquestionnaires. Answers from 769 respondents, representative of the Swiss population in age and gender, indicated that the interfaces successfully raised awareness about wastewater management. Gamification was helpful: respondents performed better in the factual learning test, and unexpected social learning occurred. However, gamification lowered the perception of process understanding. The LL were beneficial: objectively, respondents performed better in the factual learning test. However, respondents perceived the LL as cognitively demanding and their factual learning as lower. Our structured assessment highlighted the need for further research to investigate, for instance, high interpersonal variability and the disparities between tested and perceived learning. Measuring preference construction remains challenging, and social learning should be added to the assessment framework. Applying such structured assessment of learning outcomes to more traditional operational research interventions would provide a baseline for future comparison.


General motivation
Making structured decisions, for instance with MCDA (Keeney & Raiffa, 1976) and value-focused thinking (Keeney, 1992), inherently implies learning (e.g. Arvai, Gregory & McDaniels, 2001; Ferretti, Pluchinotta & Tsoukiàs, 2019; Franco & Lord, 2011; Gregory et al., 2012; Linkov et al., 2006; Marttunen & Hämäläinen, 2008). Learning is topical in behavioral OR (Operational Research), and previous studies have focused on evaluating learning in OR interventions (e.g. Monks, Robinson & Kotiadis, 2014, 2016; Thompson, Howick & Belton, 2016). The present work contributes to this endeavor by proposing a comprehensive framework for measuring learning in OR interventions that support decision-making. The framework adds process understanding to a previously proposed framework comprising factual learning and preference construction. This enhanced framework is applied to investigate the effects of two newly designed features meant to enhance learning: gamification and learning loops.
Learning for decision-making is all the more important if participation is broadened. Public engagement is at the core of community OR (e.g. Johnson, Midgley & Chichirau, 2018; Midgley, Johnson & Chichirau, 2018) and evaluating such public interventions has gained attention (e.g. Midgley et al., 2013; White, 2006). Decision analysts beyond community OR have also been concerned with increasing participation (e.g. French & Argyris, 2018; Ríos Insua & French, 2010). Broadening participation allows non-subject-matter experts to engage in decision-making. Ideally, these stakeholders (1) learn facts about the decision topic; (2) construct preferences; and (3) understand the evaluation of alternatives based on objectives that are important to them (value-focused thinking, Keeney, 1992). Knowing that this indeed occurs would increase our confidence in their inputs. This is particularly important if the engagement process is happening online because no analyst or moderator can check the reliability of the answers (e.g. Bessette, Campbell-Arvai & Arvai, 2016; Philpot, Philpot, Hipel & Johnson, 2022; Vieira, Oliveira & Bana e Costa, 2020). The present work contributes to the effort of increasing citizen participation in decision-making.
We contribute by testing innovative online interfaces to elicit the relative importance given to objectives (weights) for later use in multicriteria decision analysis (MCDA), specifically, multiattribute value theory (MAVT) (Keeney & Raiffa, 1976). We focused on weights, as they are important preference parameters. The present study includes novelties in two features meant to enhance learning: gamification and learning loops. First, we tested a newly designed gamification, based on studies on the gamification of learning and of online surveys (Section 1.2.2). The innovations are the storyline of the tested interfaces, which connects closely to the decision topic at stake, gender-neutral nonplayer characters, and ambient music. The nongamified interface includes the following novelties: a user-friendly interface, and simple instructions that are available at any time. Second, we tested learning loops with consistency checks, in line with decision analysis and learning literature (Section 1.2.3). As an innovative contribution, these learning loops are tested for the first time for a real-world decision including a full set of ten objectives.
In addition to the above-mentioned innovations regarding the two features expected to enhance learning (the gamification and the learning loops), the present study systematically operationalized the measurement of the three learning facets. These instruments generate three complementary data types, namely quantitative performance test, quantitative self-assessment, and qualitative self-assessment (Section 2.4). Finally, to investigate the effect of both innovative features on the three learning facets, we carefully designed the experiment, resulting in a 2 × 2 between-subject experiment with a large and diverse representative population sample.
Below, we present the key related research and our hypotheses (Sections 1.2.1 to 1.2.3; summarized in Section 1.2.4). We continue by presenting the research design and online interfaces in Section 2. We provide the results of our experiment in Section 3. Finally, in Section 4, we summarize our results, reflect on these, and discuss insights for participatory OR and learning through decision-making.

Summary of key related research

Learning for decision-making
Making structured decisions is often considered an iterative learning process (e.g. Arvai et al., 2001; Gregory et al., 2012; Linkov et al., 2006; Marttunen & Hämäläinen, 2008). Engaged stakeholders are led into new ways of thinking about their decision problem (Ferretti et al., 2019; Franco & Lord, 2011; Kaplan, 2008) (quotes in Supplementary Material S1). This can be interpreted using the five categories of learning, observed at the individual or group level: change, discovery, explanation, clarification, and creation (Belton & Elder, 1994). Engaged stakeholders can change their way of thinking about the problem, discover new alternatives and perspectives on the issue, better explain the system, and clarify or create their preferences for themselves and in comparison with preferences of others. Making a structured decision inherently implies learning.
Learning is one of the typically studied aspects within behavioral OR (BOR; Franco, Hämäläinen, Rouwette & Leppänen, 2021). Early work proposes using the value of information to evaluate the expected benefit of learning, in the sense of knowing a decision aspect with more certainty (Clemen, 1991). Information then relates to the probabilities associated with the outcomes of an alternative. New information is valuable if refining the probability changes the decision choice. The value of learning broadens this concept to the whole decision process (McDaniels & Gregory, 2004). Recent BOR publications on learning refer to the theory of action and its learning loop (Monks et al., 2014, 2016). The theory of action is originally used for learning at the level of an organization (Argyris, 1978). Monks et al. (2014) tested the high involvement hypothesis. It suggests that stakeholders who build a model for discrete event simulation learn more than stakeholders who reuse existing models. In a second study, they confirmed that model builders are better at transferring learning from one case to another than model reusers, but model builders are more prone to the overconfidence bias (Monks, Robinson & Kotiadis, 2016). Others refer to constructivist learning theories, in particular transformational learning theory (Mezirow, 2000). They focus on investigating critical learning incidents: those moments of surprise experienced when one's mental model fails to make sense of a situation but a change in the model enhances understanding (Thompson et al., 2016). Our work follows this BOR interest in learning and builds on the constructivist approach.
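To make the value-of-information idea concrete, the following minimal Python sketch computes the expected value of perfect information for a hypothetical two-alternative decision. The alternatives, states, payoffs, and probabilities are illustrative assumptions and are not taken from the study or from Clemen (1991).

```python
# Hypothetical payoffs of two alternatives under two uncertain states.
# All numbers are illustrative assumptions, not data from the study.
PAYOFFS = {
    "A": {"good": 100.0, "bad": 0.0},
    "B": {"good": 60.0, "bad": 50.0},
}
PROBS = {"good": 0.5, "bad": 0.5}


def expected_value(alternative, payoffs, probs):
    """Probability-weighted payoff of one alternative."""
    return sum(probs[s] * payoffs[alternative][s] for s in probs)


def value_without_information(payoffs, probs):
    """Best expected payoff when the choice is made before the state is known."""
    return max(expected_value(a, payoffs, probs) for a in payoffs)


def value_with_perfect_information(payoffs, probs):
    """Expected payoff when the state is known before choosing."""
    return sum(probs[s] * max(payoffs[a][s] for a in payoffs) for s in probs)


# Information has value here because the preferred alternative depends on the state.
evpi = value_with_perfect_information(PAYOFFS, PROBS) - value_without_information(PAYOFFS, PROBS)
print(f"Expected value of perfect information: {evpi:.1f}")
```

In this example, the information is worth something only because knowing the state would switch the choice from B to A in the "good" state. If one alternative were best in every state, the computed value would be zero.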
Given our assumption that preferences are constructed, as opposed to pre-existing and retrieved solely from memory (Lichtenstein & Slovic, 2006), the transformational learning theory (Mezirow, 2000) constitutes an obvious theoretical basis for our work. The transformational learning theory suggests that individuals learn when they actively give meaning to their experience (Mezirow, 2000). This is particularly triggered when individuals face cognitive dissonance (Adcock, 2012) or a disorienting dilemma. Disparities between individuals' existing knowledge or belief and new information lead them to reflect on and adapt their knowledge, mental model, and values. This critical reflection is central to the transformational learning theory. Individuals can critically reflect on the experience itself, termed content reflection; the way they deal with the experience, process reflection; and the long-term established social constructs of the environment in which the experience takes place, premise reflection (Mezirow, 2000).
We suggested conceptualizing an individual's learning for decision-making as content reflection (Aubert & Lienert, 2019). Seen through the lens of preference construction, we interpret content reflection as twofold. First, individuals need to learn facts about the decision at stake by acquiring information and structuring it so that it becomes useful. This factual learning reinforces or changes pre-existing mental models, or creates one if individuals had no prior internal representation of the decision problem. Second, individuals need to confirm or change values relating to the various aspects of the decision by constructing preferences (Aubert & Lienert, 2019). Then, we proposed a novel interpretation of process reflection for learning for decision-making (Aubert, Esculier & Lienert, 2020, 2022). We considered process reflection as a by-product of preference construction: individuals reflect on the decision-making process and learn how to solve a complex problem, for instance by understanding value-focused thinking. Therefore, we understand individual learning in decision-making as composed of factual learning, preference construction, and process understanding.
We postulate that OR interventions supporting decision-making facilitate these three aspects of learning. We developed new online interfaces to elicit weights for later use in MAVT from affected citizens, who are not subject-matter experts. These interfaces provide state-of-the-art information on objectives and alternatives to enable factual learning. To facilitate preference construction, we follow recommendations to use standard ratio methods such as swing and trade-off for eliciting preferences. These methods emphasize the range of the possible outcomes (Eisenführ, Weber & Langer, 2010; Montibeller & von Winterfeldt, 2015; Riabacke, Danielson & Ekenberg, 2012; von Winterfeldt & Edwards, 1986). Moreover, our interfaces guide the individuals through the elicitation process with instructions that emphasize thinking in terms of objectives (i.e., value-focused thinking) when evaluating possible alternatives. We hypothesize that using the interfaces enables factual learning, preference construction, and process understanding.

Gamification to enhance learning
In addition to testing the occurrence of learning with the developed interfaces, we investigate the effect(s) of two features meant to enhance learning. The first one is presented in this section, the second in Section 1.2.3. First, we investigate whether gamifying an online survey for preference elicitation of weights for use in MCDA can enhance learning from non-subject-matter experts. Gamification is defined as "the use of game design elements in non-game contexts," products, and services to motivate desired behaviors (Deterding, Dixon, Khaled & Nacke, 2011, p. 1). Variations of this definition exist, because the concept is relatively new (e.g. Koivisto & Hamari, 2019; Landers, Auer, Collmus & Armstrong, 2018). Gamification can trigger psychological effects that in turn can lead to effective performance at tasks and persistence in complex actions (Landers et al., 2018; Ryan & Deci, 2017). Gamification has been tested for multiple purposes, including education and survey research (e.g. Dichev & Dicheva, 2017; Keusch & Zhang, 2015). We briefly synthesize the state of the art of gamification in these two fields because they are relevant for learning and designing effective online survey interfaces.
Gamification of surveys, sometimes called surveytainment (Kostyk, Zhou & Hyman, 2019), was initially proposed as a remedy for existing problems such as high dropout rates and low quality of data collected through online surveys. Gamification is expected to maintain the respondents' interest levels (Bailey, Pritchard & Kernohan, 2015), which in turn should counter heuristics and other cognitive shortcuts, thereby limiting satisficing behaviors (Krosnick, 1991) such as random answering and speeding through the survey (Keusch & Zhang, 2015). However, gamification can distract respondents from the survey, particularly when it is not connected with the survey topic (Guin, Baker, Mechling & Ruyle, 2012). Other results indicate that gamification leads to a lower overall response rate and increases the time spent answering, but does not introduce biases (Harms, Biegler, Wimmer, Kappel & Grechenig, 2015). Overall, results are equivocal: it is unclear whether gamification of surveys is beneficial. Our work, adapted to the context of decision-making, contributes to answering this open question.
Gamification of learning is also reported to be beneficial in some cases, and detrimental in others. Some review papers in education literature report successes (e.g. Kasurinen & Knutas, 2018; Subhash & Cudney, 2018). Others have more nuanced views. For instance, in a review of 41 papers on introducing gamification to higher education, only 26% provide empirical evidence for the effectiveness of gamification in improving learning, motivating students, and increasing participation; 64% are inconclusive, and 10% even negative (Dichev & Dicheva, 2017). Another critical example is the conceptual paper entitled "why gamification fails in education" (van Roy & Zaman, 2017). These authors propose nine heuristics for designing meaningful gamification with the self-determination theory (Ryan & Deci, 2017). Examples of these heuristics are to set challenging but manageable goals, and to align gamification with the goal of the activity in question (van Roy & Zaman, 2017). Gamifying learning is considered promising (Plass, Homer & Kinzer, 2015; Ryan & Rigby, 2019), but its effectiveness has not yet been conclusively demonstrated.
We developed a new gamified online survey interface for weight elicitation. In doing so, we followed recommendations from the literature and feedback from test users and game designers on our earlier prototypes (Aubert & Lienert, 2019; Aubert, Lienert & von Helversen, 2022). While game designers, scientists, and citizens have worked together for modeling and simulation exercises for a long time, such collaboration is innovative for online preference elicitation for MCDA (Aubert, Bauer & Lienert, 2018). Because gamifying online weight elicitation is so new, we proceeded stepwise by first testing lighter (less expensive) gamification (i.e., not including as many game elements as now), and only with student samples (Aubert & Lienert, 2019). S4.1 includes a summary table displaying the main differences from our previous studies. As a result, the new gamification presented in this paper connects the storyline with the survey topic, and provides manageable goals to the respondents, among further new elements presented in Section 2.2 and S4.1 (e.g., gender-neutral nonplayer characters, ambient music). We hypothesized that respondents using the gamified interface would show improved results for factual learning and preference construction compared to respondents using a control (nongamified) interface. Based on the literature, we could not formulate a clear hypothesis about the effect of gamification on process understanding. Moreover, we tested our hypotheses in a real-world population survey in Switzerland.

Learning loops to enhance learning
Second, we investigated the effect of learning loops for MCDA preference elicitation. Learning loops challenge the respondents to think twice about their preferences. As highlighted earlier when discussing the transformational learning theory (Mezirow, 2000), cognitive dissonance (Adcock, 2012) triggers reflection on and adaptation of one's knowledge, mental model, and values. Facing such instances enhances learning.
In MCDA, recommendations for preference elicitation include asking consistency check questions (Anderson & Clemen, 2013; Payne, Bettman & Schkade, 2006). In practice, the decision analysts make sure that they elicit preferences that represent the opinion of the stakeholders by asking an additional set of questions (e.g. Hobbs & Meier, 1994; Martin, 2021; Marttunen & Hämäläinen, 2008). For instance, one can use multiple weight elicitation methods and compare the resulting weights. Or, if using swing weight elicitation hierarchically, one can ask about the relative importance of objectives in different branches of the tree, and check if the answer is consistent with the elicited weights. If using pairwise trade-off elicitation, one can ask about a pair of objectives that was not used to calculate the weights, and check if the answer is consistent with the elicited weights. If the stakeholders' answers are consistent, the preferences elicited are considered reliable. If they are not consistent, the decision analysts guide the stakeholders through the weight elicitation again. The decision analysts can ask the stakeholders to "think harder". However, in at least one experiment about elicitation of probability distributions from experts, priming experts to think harder was not very effective in making them revise their initial judgments (Ferretti, Montibeller & von Winterfeldt, 2022). The effect of learning loops created by consistency check questions needs to be investigated.
In education literature, some studies show that tasks requiring high mental effort enhance learning (e.g. Hamari et al., 2016). Such studies refer to the theories of flow (Csikszentmihalyi, 1990) and self-determination (Ryan & Deci, 2017) and suggest that challenges engage the learners if the cognitively demanding tasks match their skills. Thanks to this increased engagement, the challenges lead to greater persistence and better performance at tasks. In the classroom, unchallenging tasks lead to disengagement (Hamari et al., 2016), which hinders learning. In lifelong learning, persistence in the face of challenges was reported to be facilitated if the learner believes that intelligence is malleable (Sheffler, Rodriguez, Cheung & Wu, 2022) and perceives internal motivation and self-efficacy (Merriam & Baumgartner, 2020; Ryan & Deci, 2017). Therefore, if tailored to the learners' needs, challenges can help create an engaging learning environment. Some recent online interfaces designed for use without facilitation by a decision analyst create cognitive dissonance to help stakeholders make decisions that are consistent with their own values (e.g. Bessette et al., 2016; Philpot et al., 2022). They provide feedback on how closely a portfolio or a proposal aligns with the stakeholders' prioritized objectives and values. Resolving the cognitive dissonance and disparities explained in the feedback increases mental effort and stress but also increases process understanding (Bessette et al., 2016). We proposed a concept of learning loops for online preference elicitation, comparing weights elicited with swing and trade-off methods (Aubert & Lienert, 2019). However, the experimental design of this previous study did not allow us to investigate the effect of the learning loops feature on learning for decision-making.
In the present study, we therefore investigated the effect of learning loops, using newly designed interfaces (Section 2.2). We implemented two methods for weight elicitation, the swing and trade-off methods. In the online interface, the weights resulting from the two methods are compared and shown to the respondents. The respondents are then given the opportunity to reconsider their weights in case of inconsistencies (Aubert & Lienert, 2019). In the present study, the instructions are simpler and always accessible, thereby overcoming issues that users reported upon testing previous online weight elicitation interfaces we had developed (Aubert et al., 2020; Aubert, Lienert & von Helversen, 2022). We based our hypotheses on insights from the literature and the idea that learning loops created by consistency checks can provide a "challenge" and cognitive dissonance. We hypothesized that respondents using the interface with learning loops would show better results for factual learning, preference construction, and process understanding than respondents using a control interface (without learning loop). We also hypothesized that learning loops would increase the mental effort.
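The comparison of weights from the two elicitation methods can be illustrated with a minimal Python sketch. The objective names, the swing points, the trade-off weights, and the tolerance of 0.05 are hypothetical assumptions chosen for illustration; the text above does not specify how the interfaces define an inconsistency.

```python
def normalize(raw_scores):
    """Rescale raw importance scores (e.g. swing points) so that weights sum to 1."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("at least one objective needs a positive score")
    return {objective: score / total for objective, score in raw_scores.items()}


def flag_inconsistencies(weights_a, weights_b, tolerance=0.05):
    """Return objectives whose weights differ by more than `tolerance` between methods.

    The tolerance is an assumed threshold, not a value reported in the study.
    """
    return {
        objective: round(weights_a[objective] - weights_b[objective], 3)
        for objective in weights_a
        if abs(weights_a[objective] - weights_b[objective]) > tolerance
    }


# Hypothetical answers of one respondent for three illustrative objectives.
swing_points = {"low costs": 100, "good water quality": 80, "few odours": 20}
tradeoff_weights = {"low costs": 0.30, "good water quality": 0.42, "few odours": 0.28}

swing_weights = normalize(swing_points)
flagged = flag_inconsistencies(swing_weights, tradeoff_weights)

# Objectives in `flagged` would be shown to the respondent, who may then
# repeat the elicitation (the learning loop) or keep the original answers.
for objective, difference in flagged.items():
    print(f"{objective}: swing minus trade-off = {difference:+.3f}")
```

In this sketch, a nonempty result would trigger the invitation to reconsider the weights; an empty result would let the respondent proceed.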

Measuring the three facets of learning for decision-making
To measure learning, we operationalized a set of instruments (described in Section 2.3). We based our measures on education literature, where the concept of constructive alignment (Biggs, 1996) suggests developing assessments directly connected to each previously defined learning goal. We formulated the following learning goals, based on our proposed three-faceted framework of learning for decision-making: (i) Respondents can remember facts about objectives and alternatives considered in the decision at stake. (ii) Respondents can express consistent preferences with regard to the relative importance of objectives and (iii) are confident that the elicited preferences reflect their actual preferences. (iv) Respondents can differentiate objectives from alternatives and (v) can understand that the relative importance given to objectives is used to derive a ranking of alternatives.
In addition, both BOR literature and education literature call for collecting complementary data types (den Haan & van der Voort, 2018; Franco et al., 2021), i.e., quantitative and qualitative data. Therefore, we designed a set of measurement instruments collecting quantitative (performance test and self-report on Likert scales) and qualitative (text) data for all learning goals. While we could reuse measurement instruments from our previous studies for factual learning (e.g., Aubert et al., 2022), for the present study, we had to revise the measurement instruments for preference construction (Aubert et al., 2020, 2022). Moreover, as an innovation in this study, we created new instruments for measuring process understanding (i.e., a performance test and new self-report questions on Likert scales). This comprehensive set of instruments to measure learning for decision-making, derived from a systematic operationalization, is to the best of our knowledge new.

Summary of novelties and hypotheses
This study presents several innovations. The first and second novelties relate to the features meant to enhance learning. Specifically, the tested interfaces (gamified and with learning loops) are new, and their respective designs build on lessons learnt from previous attempts. Additionally, the present experimental design, with a large representative population sample, allows us to investigate the effect of the learning loops feature specifically. Finally, we propose a comprehensive novel set of instruments to measure learning for decision-making. Table 1 summarizes the hypotheses formulated in the introduction, following our proposed conceptualization of individual learning for decision-making. It is worth mentioning that, based on the literature, we could not formulate clear hypotheses about the effect of gamification on process understanding, nor about the interaction of both features meant to enhance learning.

Decision case: wastewater infrastructure
Our experiment addresses a relevant decision about wastewater infrastructure planning. Today's centralized wastewater system has been highly successful, but also has drawbacks regarding sustainability and costs. Increasingly, decentralized wastewater systems are being developed and discussed (Hoffmann et al., 2020; Larsen, Hoffmann, Lüthi, Truffer & Maurer, 2016). These are no longer connected to sewers; instead, wastewater is treated on-site in people's homes. Wastewater infrastructure needs renewal in many OECD countries, and especially in rural municipalities, this offers a window of opportunity for transitioning to decentralized systems. For our experiment, we used information from a decision about wastewater infrastructure planning in two Swiss rural municipalities (Beutler & Lienert, 2019). Deciding about future wastewater management usually affects the inhabitants via wastewater fees. Additionally, decentralized wastewater infrastructure can directly affect people in their homes. The decision in this experiment consisted of choosing between six alternatives, based on preferences regarding ten objectives. Based on information collected in the Swiss cases (Beutler & Lienert, 2019) and expertise from earlier wastewater projects (Haag, Zürcher & Lienert, 2019; Lienert, Duygan & Zheng, 2016), we considered four higher-level objectives, which we specified with two to three lower-level objectives (Fig. 1). The six alternatives were: rehabilitation of the local wastewater treatment plant (WWTP), connection to the regional WWTP, decentralized package plants, decentralized package plants with urine separation, decentralized package plants with urine and feces separation, and construction of a new local WWTP with urine separation (see S2 and S3 for a detailed description of the objectives and alternatives).

The experiment: procedure and treatments
We tested if the three aspects of learning occurred while using the specifically designed interface to elicit weights, and evaluated the gamification and learning loops effects as hypothesized in the introduction (Table 1). To test this, we designed a 2 × 2 between-subject experiment. The two varying factors defining our four treatments were nongamified (control) vs. gamified interface, and without (control) vs. with learning loops (Fig. 2). Note that we could not exclude potential interactions between the treatments. We briefly introduced the survey topic (wastewater management) and explained the survey procedure. Thereafter, respondents completed a knowledge test about wastewater management in Swiss rural contexts. One of the four treatments followed, in which we informed all respondents about the ten objectives and six possible alternatives and elicited how respondents weighed the objectives. In the two treatments with learning loops, respondents also ranked the alternatives from most to least preferred after reading about the alternatives and before weight elicitation. Thereafter, we directed respondents to a post-treatment questionnaire that included a repetition of the knowledge test, performance tests, self-reported assessments, open textbox questions, and thanks (Fig. 2).
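The structure of the 2 × 2 between-subject design can be sketched in Python as the Cartesian product of the two factors, with each respondent assigned to exactly one treatment. The balanced random assignment shown here is an illustrative assumption: in the study, the market research company handled assignment and applied age and gender quotas, which this sketch ignores.

```python
import itertools
import random
from collections import Counter

# The two factors of the experiment, each with a control and a treatment level.
INTERFACES = ("nongamified", "gamified")
LEARNING_LOOPS = ("without learning loops", "with learning loops")

# The four treatments are all combinations of the two factors.
TREATMENTS = list(itertools.product(INTERFACES, LEARNING_LOOPS))


def assign(respondent_ids, seed=0):
    """Balanced random assignment (hypothetical procedure, no quotas).

    Respondents are shuffled and then dealt to the treatments in turn,
    so each respondent sees exactly one treatment (between-subject).
    """
    rng = random.Random(seed)
    ids = list(respondent_ids)
    rng.shuffle(ids)
    return {rid: TREATMENTS[i % len(TREATMENTS)] for i, rid in enumerate(ids)}


assignment = assign(range(8))
group_sizes = Counter(assignment.values())
for treatment, size in group_sizes.items():
    print(treatment, size)
```

With group sizes known per cell, main effects of each factor and their interaction can later be examined by comparing outcomes across the four groups.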

Nongamified (control) vs. gamified treatments
The nongamified treatment (Fig. 3) was an interface specifically designed for weight elicitation (Aubert & Masson, 2021). It included improvements of two previously tested prototypes (Aubert et al., 2020, 2022). It described the objectives and alternatives and implemented swing weight elicitation following textbook recommendations (Anderson & Clemen, 2013; Eisenführ et al., 2010; Payne et al., 2006). Novel features to overcome earlier problems included instructions with a graphic example, which remained accessible at all times. This clarified instructions, and avoided repetitive text. Moreover, the new user interface was simpler than the previous interfaces and included a progress bar. It also provided warning messages. For instance, in swing weight elicitation, a message popped up if the respondent intended to rate a lower-ranked alternative higher than a higher-ranked one (S4.1, screenshots in S4.2). Finally, although of lesser importance for the experiment, the new interface included an administration back-end, which significantly eased creating and modifying surveys as no coding knowledge is required.
The gamified treatment (Fig. 3) consisted of an interface that included game elements (also see screenshots in S4.3). A narrative provided a goal, and interactions occurred with gender-neutral nonplayer characters. The innovation of using gender-neutral characters was based on feedback from earlier prototype testing and game designers (Aubert & Lienert, 2019). The novel narrative was strongly anchored in the wastewater decision topic. The survey respondents were candidates in the election to become the mayor of New Waterton. In the neighboring town, wastewater management problems had occurred. New Waterton's citizens wanted to avoid these issues, and the wastewater topic became decisive for their vote. To increase their chances of being elected, the candidates (i.e., respondents) needed to communicate consistently about their preferences. As another new feature, the respondents chose an avatar and provided a name for the exchanges with the nonplayer characters. Each nonplayer character represented a citizen who advocated a single objective. There were additional nonplayer characters who gave feedback and guidance, and challenged the respondents to do better when relevant. To ensure gender neutrality, another novel aspect, characters resembled animals. The interface blended weight elicitation into the narrative, thereby taking some minor liberties for the sake of simplicity regarding textbook recommendations for weight elicitation (S4.4). The interface included original artwork and ambient music that corresponded with the narrative, an innovation not included in our earlier prototypes. It provided more choices to the respondents than the nongamified treatment. For instance, respondents decided on the order of tasks within a chapter. It rewarded the respondents (e.g. with feedback from nonplayer characters, or by being elected). Like the nongamified version, it included a progress bar. The informative texts about the ten objectives and the six alternatives were strictly the same as in the nongamified treatment. The objectives' icons and the graphical description of alternatives were also the same.
After a competitive call, the company Entrée de Jeux (https://www.entree-de-jeux.ch/, retrieved on 14.6.2022) conceptualized the gamification, and the company Opinion Games (https://www.opiniongames.ch/, retrieved on 14.6.2022) developed it. The company Youmi (https://www.youmi-lausanne.ch/, retrieved on 14.6.2022) developed the nongamified interface. Both developments closely involved the first author. As a small aside, we wish to point out that this collaboration between science and two game companies was complex, time-consuming, and expensive.

Without (control) vs. with learning loops
In the treatment with learning loops, we elicited preferences with two methods: swing and trade-off in series (for method description, see e.g. Eisenführ et al., 2010). After elicitation, the two sets of elicited weight preferences were presented to the respondents. Comparing the results originating from the two methods allowed respondents to reflect and think twice about their weight preferences. This comparison step mimics consistency check questions by a facilitator in direct interactions (e.g. Hobbs & Meier, 1994; Payne et al., 2006). After comparing the two sets of elicited weights, the respondents could decide to repeat the elicitation and change their answers, which created a learning loop. We calculated a ranking of alternatives using the additive aggregation model of multiattribute value theory (MAVT), linear marginal value functions, and the elicited weights (Keeney & Raiffa, 1976). This ranking was compared with the holistic ranking of alternatives from most to least preferred, which the respondents provided before weight elicitation but after reading about the alternatives. The respondents also received a graphical representation of the prediction matrix that showed how well each alternative performed on each objective. The concept of these two embedded learning loops has been pretested and is described in detail in Aubert and Lienert (2019). The treatment without learning loop consisted of an initial holistic ranking of alternatives, swing weight elicitation, and a final ranking of alternatives based on the elicited weights and the additive MAVT model without feedback.
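To make the ranking step concrete, the following is a minimal sketch of additive MAVT aggregation with linear marginal value functions, followed by a comparison with a holistic ranking. The objective names, attribute ranges, weights, and predictions are hypothetical and serve only as an illustration; they are not the study's data, and the function names are our own.

```python
from typing import Dict, List, Tuple

# objective -> (worst level, best level); hypothetical ranges
Ranges = Dict[str, Tuple[float, float]]


def linear_value(level: float, worst: float, best: float) -> float:
    """Linear marginal value: 0 at the worst level, 1 at the best level."""
    return (level - worst) / (best - worst)


def additive_value(performance: Dict[str, float],
                   weights: Dict[str, float],
                   ranges: Ranges) -> float:
    """Weighted sum of marginal values (additive MAVT aggregation)."""
    return sum(
        weights[obj] * linear_value(performance[obj], *ranges[obj])
        for obj in weights
    )


def rank_alternatives(predictions: Dict[str, Dict[str, float]],
                      weights: Dict[str, float],
                      ranges: Ranges) -> List[str]:
    """Return alternative names ordered from most to least preferred."""
    scores = {alt: additive_value(perf, weights, ranges)
              for alt, perf in predictions.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical example: cost is better when lower, so worst > best.
ranges = {"cost": (100.0, 20.0), "water_quality": (0.0, 10.0)}
weights = {"cost": 0.4, "water_quality": 0.6}
predictions = {
    "A1": {"cost": 60.0, "water_quality": 8.0},
    "A2": {"cost": 20.0, "water_quality": 3.0},
    "A3": {"cost": 100.0, "water_quality": 10.0},
}

model_ranking = rank_alternatives(predictions, weights, ranges)
holistic_ranking = ["A3", "A1", "A2"]  # hypothetical respondent ranking
same_top_choice = model_ranking[0] == holistic_ranking[0]
print(model_ranking, same_top_choice)
```

In this illustration the model-based and holistic rankings disagree on the top alternative, which is the kind of discrepancy the learning loop feeds back to the respondent.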

Sample definition and respondent recruitment
A market research company (Intervista; https://www.intervista.ch/, retrieved on 14.6.2022) invited the respondents by email. In the email, they mentioned the unusual length of the survey, at least 45 min, and that it was only possible to complete it from a desktop computer or laptop. Respondents received a link to the opening part of the experiment. Intervista made sure to invite each respondent only to a single treatment, and assigned them to one of the four treatments. We mandated the company to invite respondents from the German-speaking part of Switzerland, following age and gender quotas to represent the Swiss population in each treatment (also see Section 3.1 and S5). Upon completion, respondents received a token compensation according to the company's point system. Because we targeted non-subject-matter experts, the survey started with a filter question asking how much respondents knew about wastewater management. If they knew rather a lot or a lot, they could not proceed to the survey. We excluded the knowledgeable respondents, because we assumed that they would not learn much about wastewater management. However, this could be the topic of a follow-up experiment. In the present study, only respondents knowing nothing at all or a little about wastewater management could answer because our survey was explicitly designed for informing and gathering preferences of laypeople.
We determined the desired sample size from previous studies (Ryan, 2017) and an a priori statistical power analysis (Faul, Erdfelder, Buchner & Lang, 2009). We aimed for 200 respondents per treatment. Each subsample was comparable in age, gender, and education distributions (Section 3.1, S6 and S7). We collected data for the nongamified treatments between March and April 2021 and for the gamified treatments between October and November 2021.

Data analysis
The statistical analyses for the quantitative data were performed with the R project for statistical computing (R Development Core Team, 2020). We looked at central tendencies and distributions. We investigated whether the treatments explained the measures using linear regression analyses for the performance tests (function lm; performance = f(gamified, learning_loop, gamified*learning_loop)) and beta regressions for the Likert scale measures, because they are bounded (function betareg; self-reported = f(gamified, learning_loop, gamified*learning_loop)). Beta regressions required normalizing the 1 to 5 scales. We checked for outlying observations with studentized residuals and Cook's distance, for absence of multicollinearity (i.e. absence of linear correlation between the explanatory variables) with the variance inflation factor (VIF) and tolerance, and for independent errors (absence of autocorrelation) with the Durbin-Watson test. The checks for assumptions were acceptable for all models.
The qualitative data were coded in an iterative process. We retained the qualitative answers of 833 respondents, assuming that if they took the time to write something, their input could be useful even if they showed speeding behavior in the previous questions and had been excluded from the main analysis (Section 3.1). The answers to the four questions (Table 2) were redistributed according to the three aspects of learning for decision-making. The coding scheme and data are freely available in the data package (https://doi.org/10.25678/0008WS).
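As an illustration of the normalization step, the sketch below rescales a 1 to 5 Likert mean to the unit interval and builds the explanatory variables of the 2 × 2 design. The text states only that the scales were normalized; the additional "squeeze" into the open interval (0, 1), in the form commonly attributed to Smithson and Verkuilen (2006), is an assumption added here because beta regression cannot handle exact 0 or 1. It is written in Python for illustration, whereas the analyses themselves were run in R.

```python
from typing import List


def rescale_likert(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Map a score on [low, high] linearly onto [0, 1]."""
    return (score - low) / (high - low)


def squeeze_open_interval(y: float, n: int) -> float:
    """Assumed transformation: pull 0 and 1 slightly inside (0, 1).

    n is the sample size; this step is not described in the text.
    """
    return (y * (n - 1) + 0.5) / n


def design_row(gamified: int, learning_loop: int) -> List[int]:
    """Intercept, two treatment dummies, and their interaction."""
    return [1, gamified, learning_loop, gamified * learning_loop]


n_respondents = 769  # final sample size reported in the study
examples = [1.0, 3.0, 5.0]
transformed = [
    squeeze_open_interval(rescale_likert(s), n_respondents) for s in examples
]
print(transformed)
print(design_row(gamified=1, learning_loop=1))
```

The transformed values would then be passed as the response to a beta regression routine such as betareg in R.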
After coding, we produced contingency tables for the occurrence of the codes and corrected the counts by the number of respondents in each treatment.We report hereafter only on those codes that differ by a factor of 2 between treatments.
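The correction by treatment size and the factor-of-2 reporting rule can be sketched as follows. The counts and group sizes are hypothetical, the function names are our own, and treating a code that is absent in one treatment as "differing" is an assumption made for the sketch.

```python
from typing import Dict


def code_rates(counts: Dict[str, int],
               n_respondents: Dict[str, int]) -> Dict[str, float]:
    """Occurrences of one code per treatment, divided by treatment size."""
    return {t: counts[t] / n_respondents[t] for t in counts}


def differs_by_factor(rates: Dict[str, float], factor: float = 2.0) -> bool:
    """True if the highest rate is at least `factor` times the lowest."""
    lowest, highest = min(rates.values()), max(rates.values())
    if lowest == 0.0:
        # Assumption: a code absent in one treatment but present in
        # another is reported as a difference.
        return highest > 0.0
    return highest / lowest >= factor


# Hypothetical counts for two codes in two treatments of 400 respondents.
sizes = {"gam": 400, "nogam": 400}
complexity = code_rates({"gam": 32, "nogam": 14}, sizes)
other_code = code_rates({"gam": 20, "nogam": 15}, sizes)

print(differs_by_factor(complexity))  # rates 0.08 vs 0.035
print(differs_by_factor(other_code))  # rates 0.05 vs 0.0375
```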

Respondents and pre-analyses
Table 2
Measures used to assess the three aspects of learning. Columns of the original table: Learning goals; Type; Measures.

Factual learning
Respondents can remember facts about objectives and alternatives considered in the decision at stake. (S8.1)
Quantitative/performance test:
- Ten multiple-choice questions about the facts concerning the objectives to consider when deciding about wastewater management (one per objective). For each question, four possible answers were available. Respondents were asked to choose the correct statements among the four propositions. Order of questions and choices for each question was randomized. In the initial test, there was a fifth option: 'I do not know'. Rating system: 1 point if responses to all four choices were correct (which meant the choice was correctly checked or correctly not checked); 0.5 points if three choices were correct; 0 points otherwise (also 0 points if 'I do not know' was chosen). Knowledge score for objectives varied from 0 to 10.
- Six multiple-choice questions about the alternatives to choose from in our decision case, one per alternative. As for objectives, order of questions and choices for each question were randomized. Same rating system as for the objectives. Knowledge score for alternatives varied from 0 to 6.
Quantitative/self-reported:
- Self-reported confidence in the answers provided for the knowledge test (from 1 "Not at all confident. I am 1 percent confident." to 5 "Very confident. I am 99 percent confident.")
- Three self-reporting 5-point Likert scale questions on factual learning (from 1 no learning at all to 5 very much learning). Self-reported factual learning was calculated as the mean of the three questions and varied from 1 to 5.
Qualitative:
- Optional textbox question asking "Please elaborate on the facts you have learnt about wastewater management." The answers were coded right or wrong per objective and alternative. Statements indicating a preference change were coded as well.

Preference construction
Respondents can express consistent preferences with regards to the relative importance of objectives and are confident that the elicited preferences reflect their actual preferences. (S8.2)
Quantitative/performance test:
- Two multiple-choice questions about the respondent's preferences on objectives (one for the lower level of the objectives hierarchy, one for the upper level). For each question, four possible answers were available. Respondents were asked to choose the correct statements among the four propositions. Order of the choices for each question was randomized. Same rating system as for the objectives. Preference construction score varied from 0 to 2.
- Percentage of respondents with weight patterns: (i) indifferent between all objectives (equal), (ii) one objective receiving a weight over 0.95 (single).
Quantitative/self-reported:
- Three self-reporting 5-point Likert scale questions on preference construction (from 1 no preference construction at all to 5 very much preference construction). Self-reported preference construction was calculated as the mean of the three questions and varied from 1 to 5.
Qualitative:
- Optional textbox question asking "Please elaborate what exactly you have learnt about your preferences." The answers were coded per objective and alternative. Statements indicating factual learning were coded as well.

Process understanding
Respondents can differentiate objectives from alternatives and can understand that the relative importance given to objectives is used to derive a ranking of alternatives. (S8.3)
Quantitative/performance test:
- Four multiple-choice questions about value-focused thinking, the meaning of objectives and alternatives, and the task performed. For each question, four possible answers were available. Respondents were asked to choose the correct statements among the four propositions. Order of the choices for each question was randomized. Same rating system as for the objectives. Process understanding score varied from 0 to 4.
Quantitative/self-reported:
- Four self-reporting 5-point Likert scale questions on process understanding (from 1 no understanding at all to 5 very much understanding). Self-reported process understanding was calculated as the mean of the four questions and varied from 1 to 5.
Qualitative:
- Two optional textbox questions: "Please elaborate what exactly you have learnt about the way to tackle complex decision problems" and "Will you use this method to tackle a complex decision in a different context? Describe shortly how you would do it." The answers were coded per objective and alternative, and as being correct or wrong. Statements indicating a preference change were coded as well.

Extraneous cognitive load (ECL)
Is there an effect of the gamification on ECL? Is there an effect of learning loops on ECL? (S8.4)
Quantitative/self-reported:
- Four self-reporting 5-point Likert scale questions on the mental effort required for process understanding (from 1 no ECL at all to 5 very much ECL). Self-reported ECL was calculated as the mean of the four questions and varied from 1 to 5.

In total, 2094 respondents started the survey (Table 3). We kept only the complete responses. The length of the survey induced high dropout: median time varied from 42 min for the nongamified treatment without learning loop to 65 min for the gamified treatment with learning loop. Unfortunately, during data collection for the gamified treatment, the server stopped working, which prevented access. This contributed to a higher loss of respondents for the gamified treatment. We removed 51 respondents with satisficing behaviors such as speeding (less than half the median time) and straightlining (e.g. respondents choosing the same value on Likert scales for questions where it does not make sense). We excluded five respondents who wrote dubious comments, such as "One always eats kebab with cocktail sauce." The final sample size was 769 respondents (Fig. 2, Table 3); 54% identified as female, 46% as male; age varied between 18 and 84 (the average age of the sample was 49); most (45%) had a university degree (full description in tables in S6). In the analyzed sample, there was no statistical difference between the four treatment groups regarding gender, age and education (S6). Additionally, there was a balanced distribution of these covariates across the treatments (S7); we thus did not use the socio-demographic variables in the analyses.
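The two satisficing filters can be sketched as simple rules. This is only an illustration under stated assumptions: the text does not say whether the median was computed per treatment or overall, and the straightlining check below (identical answers across a block of Likert items) is a simplification of the judgment described above. The durations and answers are hypothetical.

```python
from statistics import median
from typing import List, Sequence


def is_speeder(duration_min: float, median_min: float) -> bool:
    """Speeding: completion time below half the median time."""
    return duration_min < 0.5 * median_min


def is_straightliner(answers: Sequence[int]) -> bool:
    """Simplified check: the same value on every item of a Likert block."""
    return len(answers) > 1 and len(set(answers)) == 1


# Hypothetical completion times (minutes) within one treatment.
durations: List[float] = [42.0, 50.0, 18.0, 65.0, 44.0]
median_time = median(durations)
speeders = [d for d in durations if is_speeder(d, median_time)]

print(median_time, speeders)
print(is_straightliner([3, 3, 3, 3]), is_straightliner([3, 4, 3, 2]))
```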

Performance test for learning about objectives
Overall, the final scores for factual learning were significantly higher than the initial scores by 1.5 points of 10 total points (t(768) = 23.806, p < .001, Cohen's d = 0.81). Therefore, the respondents' delta knowledge scores (ΔKS = KS_final − KS_initial) were positive on average (Table in S9.11). For our samples, the KS increased by 1.22 to 1.88 points. This indicates that only limited learning about the objectives occurred. The multiple linear regression analysis showed that gamification (β = −0.55, p = .002) and learning loop (β = −0.39, p = .03) had an effect on factual learning about objectives (F(3, 765) = 5.33, p < .001): the nongamified treatment had a lower delta score than the gamified treatment by 0.55 points of 10 total points; the sample without learning loop had a lower delta score than the sample with learning loops by 0.39 points. However, the interpersonal variability was high: only a very small proportion of the variance was explained (R² = 1.7%) (for reasons of space, the full regression table is given in Tab. S912). Finally, fewer respondents had negative ΔKS or null ΔKS in the gamified treatments than in the nongamified ones (Table 4). In other words, in the nongamified treatments, more respondents had lower knowledge scores at the end of the survey than at the beginning. The binomial logistic regression (χ²(3) = 13.245, p = .004) showed that gamification was a significant predictor of negative ΔKS (β = 0.65, z = 2.51, p = .012), while the learning loop and the interaction (gamified*learning loop) terms were not significant predictors (Tab. S913).
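To make the knowledge score (KS) and ΔKS concrete, the rating system of Table 2 can be sketched as follows. The option labels and answers are hypothetical, and the function names are our own; only the scoring rule (1 point for four correct choices, 0.5 for three, 0 otherwise or for 'I do not know') is taken from the text.

```python
from typing import Iterable, Sequence, Set


def score_question(checked: Set[str], correct: Set[str],
                   options: Sequence[str], dont_know: bool = False) -> float:
    """Score one multiple-choice question with four propositions.

    A choice counts as correct if it is checked and true, or
    unchecked and false.
    """
    if dont_know:
        return 0.0
    n_right = sum((opt in checked) == (opt in correct) for opt in options)
    if n_right == len(options):
        return 1.0
    if n_right == len(options) - 1:
        return 0.5
    return 0.0


def knowledge_score(question_scores: Iterable[float]) -> float:
    """Sum of question scores (0 to 10 for the ten objectives)."""
    return sum(question_scores)


options = ["a", "b", "c", "d"]
correct = {"a", "c"}  # hypothetical correct statements

full = score_question({"a", "c"}, correct, options)   # all four right
partial = score_question({"a"}, correct, options)     # three right
wrong = score_question({"b"}, correct, options)       # one right
unsure = score_question(set(), correct, options, dont_know=True)

ks_initial = knowledge_score([partial, wrong, unsure])
ks_final = knowledge_score([full, partial, wrong])
delta_ks = ks_final - ks_initial
print(full, partial, wrong, unsure, delta_ks)
```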

Performance test for learning about alternatives
Overall, the final scores were significantly higher than the initial scores by 0.6 of 6 possible points (t(768) = 13.289, p < .001, Cohen's d = 0.59). Thus, the respondents' delta knowledge scores for alternatives (ΔKS_alt = KS_alt,final − KS_alt,initial) were positive but low (Tab. S914). For our samples, KS_alt increased by 0.41 to 0.71 points. Learning about the alternatives occurred, but was limited. The multiple linear regression analysis showed no statistically significant effect of the treatments on factual learning about alternatives (F(3, 765) = 2.37, p = .07) (Tab. S915). The proportions of negative ΔKS_alt or null ΔKS_alt did not differ much between the treatments (Table 4). However, the logistic regression (χ²(3) = 9.00, p = .029) showed that the interaction term (gamified*learning loop) was a significant predictor of negative ΔKS_alt (β = 0.72, z = 2.43, p = .015): respondents with the nongamified treatment without learning loop more often had a negative ΔKS_alt, meaning that their KS_alt at the end was lower than at the start (Tab. S916).

Self-reported factual learning
Overall, the respondents' self-reported factual learning was slightly positive (3: fairly, 4: rather better/increased on a 5-point Likert scale; Table 5). Respondents with learning loops had the lowest self-reported learning. The beta regression analysis showed that the learning loop had an effect on self-reported factual learning (β = 0.18, p = .002): the sample without learning loop had slightly higher self-reported learning. The interaction between gamified and learning loop showed only a trend (β = −0.13, p = .098). However, again, the interpersonal variability was high: little of the variance was explained (R² = 2.0%) (Tab. S917).
Overall, the respondents' confidence in their answers to the knowledge test slightly increased at the end of the survey compared to the beginning (by 0.49 on average, t(768) = 17.229, p < .001, Cohen's d = 0.58; Table 6). On average, the confidence increased more for respondents without learning loop than for those receiving the learning loops, and more for respondents with gamification than for those receiving the nongamified treatment (Table 6). The beta regression analysis showed a trend for the learning loop as a predictor of self-reported confidence in the answers to the knowledge test (β = 0.06, p = .073). However, the interpersonal variability was high: little of the variance was explained (R² = 1.6%) (Tab. S918).

Qualitative data
About ten percentage points more of the respondents receiving the gamified treatment (55.3%) left comments that were coded as correct than of those receiving the nongamified treatment (44.9%) (χ²(1) = 9.060, p = .003). More than twice as many respondents receiving the gamified treatment (8.1%) acknowledged the complexity of the wastewater system than respondents receiving the nongamified treatment (3.6%) (χ²(1) = 7.808, p = .006), for instance "that wastewater management is a complex business: it requires lots of expert knowledge to be able to make a good decision," "Complex system with many dependencies and conflicting objectives," "it is a much bigger and more complex issue than I first thought." More than twice as many respondents receiving the nongamified treatment (3.1%) wrote a negative comment suggesting that the information was unclear or that there was too much information than respondents receiving the gamified treatment (1.0%) (χ²(1) = 4.262, p = .054). For instance: "the explanations were complicated… at least for me as a beginner," "too technical and too difficult," "difficult to understand for laypeople". Answers from treatments with and without the learning loop did not differ much.
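Comparisons of proportions such as those above rest on a chi-square test with one degree of freedom on a 2 × 2 table (treatment by presence of a code). The sketch below computes the Pearson statistic without continuity correction; whether a correction was applied in the study is not stated, so this choice is an assumption, and the counts are hypothetical rather than taken from the data.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for the table [[a, b], [c, d]].

    Rows: treatments; columns: code present / code absent.
    No continuity correction is applied (assumption).
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


CRITICAL_VALUE_DF1_ALPHA_05 = 3.841  # chi-square critical value, df = 1

# Hypothetical counts: 55 of 100 vs. 45 of 100 respondents with the code.
statistic = chi_square_2x2(55, 45, 45, 55)
significant = statistic > CRITICAL_VALUE_DF1_ALPHA_05
print(statistic, significant)
```

With these hypothetical counts, a ten-percentage-point difference in groups of 100 would not reach significance; with larger groups the same difference would.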

Performance test for preference construction
Table 3
Statistics of respondents. Start: the number of respondents who started the survey; Complete: the number of respondents who answered until the end (% of those who started in parentheses); Sample: the number of respondents after data cleaning (removing speeding, straightlining, dubious comments) (% of those who started in parentheses); %Lost: the proportion of respondents in the sample relative to those who started. LL: learning loop. nogam: nongamified. gam: gamified.

Overall, the participants' preference construction score was 0.84 (ranging from 0 to 2; Tab. S921). The lowest score was obtained for the respondents with the nongamified treatment and no learning loop (but the differences were not statistically significant, Tab. S922). We checked for suspicious patterns in weights: equal weights, or all or nearly all weight assigned to one objective (Table 7). A general linear regression (χ²(3) = 51.726, p < .001) showed that neither gamification, learning loop, nor their interaction were good predictors of the occurrence of patterns (Tab. S923). However, only two patterns occurred without learning loop, against 40 with learning loops.
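The check for suspicious weight patterns can be sketched with the two definitions from Table 2: "equal" (indifference between all objectives) and "single" (one objective receiving a weight over 0.95). The tolerance used to decide that weights are equal is an assumption, and the weight vectors are hypothetical.

```python
from typing import Sequence


def weight_pattern(weights: Sequence[float],
                   single_threshold: float = 0.95,
                   tolerance: float = 1e-9) -> str:
    """Classify a normalized weight vector as 'single', 'equal', or 'none'."""
    if max(weights) > single_threshold:
        return "single"
    if max(weights) - min(weights) <= tolerance:  # tolerance is an assumption
        return "equal"
    return "none"


# Hypothetical weight sets over ten objectives, each summing to 1.
equal_weights = [0.1] * 10
single_weights = [0.96] + [0.04 / 9] * 9
mixed_weights = [0.3, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.03, 0.02]

patterns = [weight_pattern(w)
            for w in (equal_weights, single_weights, mixed_weights)]
print(patterns)
```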

Self-reported preference construction
Overall, the respondents' self-reported preference construction was slightly positive (3: fairly, 4: rather better/increased on a 5-point Likert scale; Table 8). The pre-analysis for reliability between items resulted in a dubious Cronbach's alpha (below 0.7); therefore, we did not aggregate them in one construct. Respondents with learning loops generally had lower self-reported preference construction than those without, except for item 2 in the gamified treatment (Table 8). The beta regression models to explain items 1 and 2, both about the perceived occurrence of preference construction, showed no significant effect of the treatment (Tab. S924, S925). The beta regression model to explain item 3, about confidence in the elicited preferences, showed some effect of learning loop (β = 0.14, p = .070): with the learning loops, the confidence was lower (Tab. S926). Again, little of the variance was explained (R² for item 3 = 0.92%). Results about self-reported preference construction could neither support nor contradict our assumptions.

Qualitative data for preference construction
The proportion of respondents with negative comments about preference construction was higher (22.3%) for those receiving the gamified treatment than the nongamified treatment (14.3%) (χ²(1) = 9.092, p = .003). More than twice as many respondents receiving the gamified treatment (6.5%) wrote a comment suggesting that they faced difficulties in trading off objectives than those receiving the nongamified treatment (3.1%) (χ²(1) = 5.265, p = .031). The difference was more pronounced for those with learning loops (8.6% for gamified vs. 3.7% for nongamified), although it was not statistically significant (χ²(1) = 4.214, p = .056, n.s.). For instance, they wrote how important two objectives were, despite the fact that these objectives cannot be achieved at the same time: e.g. "Saving energy and water does not go both ways," "that it is difficult to reconcile environmental protection with finances and attractiveness for the population," "It's like having to choose between eating or drinking." Finally, more than twice as many respondents receiving the gamified treatment (7.3%) wrote a comment suggesting population acceptance issues and conflicting opinions than those receiving the nongamified treatment (1.5%) (χ²(1) = 16.773, p < .001), for instance "Time cost for the end-user should be close to 0, else the system won't be accepted," "I did not think that wastewater management should consider the attractiveness of toilets to increase acceptance," "User acceptance is very important" (more quotations in S927). However, it should be noted that a small proportion of respondents used the term "preference" imprecisely (varying between 0.9 and 4.8%), for instance "When different preferences have the same priority," "I became aware of the weighting of my preferences in comparison to each other," "I rate individual preferences higher than others". Eight respondents, seven from the nongamified and one from the gamified treatment, expressed doubts about the reliability of the elicited preferences, for instance "Would be interested to know how many results can be effectively used," "it will be difficult to collect reliable and valid results," "the final results are therefore rather falsified". Finally, 5.1% of the respondents receiving learning loops reported no preference construction compared to 0.5% of the respondents receiving the treatment without learning loop (χ²(1) = 15.869, p < .001), for instance "nothing new learnt," "little to nothing, rather confusion," "actually, I have not yet reflected about preferences in wastewater management!". More results are presented in S927.

Performance test for process understanding
Overall, the respondents' process understanding score was 2.06 (ranging from 0 to 4; Tab. S931). In the multiple linear regression, the highest score was obtained by the respondents with the gamified treatment with learning loops (by 0.17 points, p = .073; R² = 0.2%) (Tab. S932).

Self-reported process understanding
Overall, the respondents' self-reported process understanding was slightly positive (3: fairly, 4: rather better/increased on a 5-point Likert scale; Tab. S933). The multiple beta regression analysis showed that gamification had an effect on self-reported process understanding: the subsample without gamification had a slightly higher self-reported process understanding than the subsample with (0.12; p = .054) (Tab. S934). However, here again, very little of the variance was explained (R² = 0.55%).

Qualitative data for process understanding
Compared with those receiving the gamified treatment, the respondents receiving the nongamified treatment more often wrote text about process understanding that was coded as negative rather than positive. Fewer respondents receiving the nongamified treatment without learning loop (31%) wrote comments coded as positive than respondents receiving the gamified treatment without learning loop (48%) (χ²(1) = 12.914, p < .001).
Among respondents writing comments, 8.3% understood that trade-offs were unavoidable, 5.8% differentiated between alternatives and objectives, and 5.3% answered that they would reuse the method.
Reasons for not learning about the process were diverse: the method was too cognitively demanding (2.5%), they already knew the method (2%), applied it already (1%), used other methods (3%), or decided according to gut feeling (1.3%). The other methods mentioned were heuristics, AHP process, discussion with experts, SWOT analysis, "seven thinking steps," cost-benefit analysis, simple utility analysis, risk analysis, and the FORDEC tool (S9.35).
More respondents receiving the gamified treatment (5.7%) made comments suggesting that they were thinking in terms of alternatives than those receiving the nongamified treatment (1.3%) (χ²(1) = 12.201, p < .001), for instance "options that one must weight against one another," "Thinking in options," "before a decision for the optimal objective, one should tradeoff between options". More respondents receiving the nongamified treatment with learning loops (5.6%) reported that the survey was cognitively demanding than in the three other treatments (varying between 1% and 1.7%), for instance "the survey completely exhausted me," "Still difficult. One had to absorb and remember a relatively large amount of information in a concentrated way," "The methods are interesting, but also a real challenge."

Extraneous cognitive load
Overall, the respondents reported "rather no" to "some" extraneous cognitive load (3: some, 2: rather no on a 5-point Likert scale; Tab. S941). Respondents with learning loops reported higher extraneous cognitive load than those without. The beta regression analysis showed that learning loop had an effect on the self-reported extraneous cognitive load: the subsample with learning loops reported statistically significantly higher extraneous cognitive load than the subsample without (0.25, p = .0001) (Tab. S942). Little variance was explained by the model (R² = 2.26%). The qualitative data reported in Section 3.4 on process understanding corroborated these results.

Summary of the results
Some of our results met our expectations, although their practical relevance was limited in a number of cases (Table 9). For instance, the scores of the knowledge tests statistically significantly increased between the start and the end of the survey (confirming hypothesis H1a), but this increase was small (similarly to results in Aubert & Lienert, 2019). Moreover, the treatments explained very little of the overall variance in the dependent variables. Our results are in line with previous studies comparing designs of online surveys, including gamified ones, which have reported few differences in response patterns (e.g. Guin et al., 2012). In our case, factual learning was higher with gamification and learning loops (confirming hypotheses H1b and H1c). Note that our preliminary study with only 107 students had shown no statistically significant differences between the gamified survey and the control (Aubert & Lienert, 2019). However, the same respondents that received the learning loop treatment actually perceived their factual learning as lower (contrary to hypothesis H1c). Some preference construction occurred during the survey (confirming H2a), but significantly less for those receiving the nongamified treatment without learning loop than for the three other treatments (no hypothesis was formulated for the interactions between the treatments). Equally mixed results (obtained with different measure instruments than the ones used in this study) were reported for preference construction in Aubert and Lienert (2019). Process understanding regarding value-focused thinking, which had not been measured as such in earlier studies, occurred (confirming H3a) and was perceived as higher without gamification (no hypothesis was formulated). The extraneous cognitive load was perceived as significantly higher with learning loops, where respondents were asked to adjust their weights in cases of inconsistency (confirming H4). Overall, to some degree, all treatments enabled the occurrence of the three learning aspects that are prerequisites to decision-making, with slight differences between them. We discuss the results in the following section by reflecting on the improved operationalization of the measure instruments, the effects of the two features meant to enhance learning (gamification and learning loops), and how this study contributes to behavioral operational research (BOR).

Items of self-reported preference construction (Table 8). Item 1: How much did you learn about your preferences (i.e., by realizing that some objectives are more important than others)? Item 2: In the preference elicitation, did you answer in a way that reflects your personal preference? Item 3: How confident are you about the preferences that you expressed during the preference elicitation?

Measuring learning for decision-making
Before discussing the results, we reflect on our newly proposed comprehensive set of instruments to measure learning for decision-making. These instruments are derived from a systematic operationalization and include novel measures for process understanding. Given our proposed conceptualization of learning for decision-making (which is discussed further in Section 4.4.2), we can recommend the further use of our measure instruments for factual learning, as well as for process understanding. Note that future work should adapt the knowledge test for factual learning to the topic of the decision at hand, and validate the test prior to the experiment (e.g. Aubert et al., 2022).
Our measures for preference construction should be improved. Measuring preference construction remains a challenge for decision analysts, because there is no right or wrong preference. Specifically, in our instrument, there were too few items in the performance test. It would be very important to develop new quantitative instruments to measure some kind of performance in preference construction. Also, measures for self-reported preference construction should be improved. Despite using three items found in the literature, the internal consistency between these was not high enough to aggregate them. In general, further work should focus on the internal consistency and external validation of the measure instruments. This requires careful experimental testing, as is standard in psychology.

Learning loops
Our previous experimental design did not allow us to investigate the effect of learning loops (Aubert & Lienert, 2019). In the present study, based on the literature, we had hypothesized that learning loops improve factual learning (H1c), preference construction (H2c), and process understanding (H3c), but increase the mental effort (H4). Results did not support the hypotheses on preference construction and process understanding (H2c and H3c). However, with learning loops, where respondents could adjust their weights in case of inconsistencies, respondents had a significantly higher wastewater knowledge score (supporting H1c), although the difference was small. With learning loops and nongamified survey, the performance scores for preference construction were also higher. Interestingly, respondents receiving the survey with learning loops perceived lower factual learning (contradicting H1c) and higher extraneous cognitive load (supporting H4), wrote more comments suggesting that the survey was complex or cognitively demanding (supporting H4), and, without gamification, also voiced doubts about the reliability of the expressed preferences. We thus observed a contradictory effect of learning loops: performance tests showed that factual learning somewhat increased, but the perception of this learning was lower.

Table 9
Summary of results. Hypotheses are summarized in Table 1. LL: learning loop. √: statistically significant confirming hypothesis. NA: no hypothesis was formulated (e.g. we did not know how learning loop and gamification would interact, neither did we foresee how gamification would affect process understanding). ×: hypothesis was not supported. (t): trend. n.s.: statistically insignificant difference between treatments.
The learning loops challenged many respondents, despite the improved interfaces, as indicated by the qualitative data and the higher extraneous cognitive load (ECL). Learning loops had been perceived as too complex in the qualitative assessments of our preliminary test (Aubert & Lienert, 2019) and we expected that it would be easier to deal with them with novel, improved instructions. Challenging tasks can positively influence learning (Hamari et al., 2016; Sheffler et al., 2022). However, to be positively perceived, the challenge should create a state of "flow" that matches the capabilities of the learners (Csikszentmihalyi, 1990). Otherwise, the challenge overwhelms, stresses, or bores learners (Csikszentmihalyi, 1990; Hamari et al., 2016). The higher ECL and lower perceived learning reported with learning loops suggest that our challenge was negatively perceived even though the factual learning test performance was often higher than without learning loop. Future work could investigate the disparity between measured and perceived learning also reported in education literature (e.g. Deslauriers, McCarty Logan, Miller, Callaghan & Kestin, 2019), and the associated causalities, for instance using the theory of flow (Csikszentmihalyi, 1990).
We maintain that learning loops with consistency checks are important and good practice (Anderson & Clemen, 2013; Payne et al., 2006). They have been recommended on the basis of practical applications (e.g. Hobbs & Meier, 1994; Martin, 2021; Marttunen & Hämäläinen, 2008), and more recently of a computational experiment (Lahtinen, Hämäläinen & Jenytin, 2020). Our learning loops can be improved in two ways. The first is to better assist respondents. Personalized feedback and instructions could guide users lost in learning loops, for example with a chatbot using artificial intelligence (AI). AI has been used to develop learner-centered online learning platforms (Merriam & Baumgartner, 2020) and to elicit stakeholders' preferences (Toffano et al., 2022). The second way is to simplify the learning loops by designing less cognitively demanding consistency checks. As recently demonstrated for discrete choice modeling, simple models increased problem understanding (Tako, Tsioptsias & Robinson, 2020). For instance, learning loops could ask the respondents to select correct statements about their preferences instead of requiring a numerical answer, for example: "In my opinion, improving Objective X from worst to best is more important than improving Objective Y from worst to best." Alternatively, objectives could be directly ranked. In sum, the learning loops are good, but should be simplified. This follows a recent proposition of cognitive psychology scientists, who also argue that complex models, although possibly more accurate, add no value compared to simpler and more understandable models that can easily be used in practice (e.g. "fast and frugal heuristics tree") (Katsikopoulos, Ozgur, Buckmann & Gigerenzer, 2020). Moreover, although challenges can improve learning (Hamari et al., 2016; Sheffler et al., 2022), one BOR study found such a challenge ineffective in debiasing overprecision in estimates (Ferretti et al., 2022). Future work should explore these think-harder strategies for preference construction further.
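To illustrate what a statement-based consistency check could look like, the following minimal sketch compares previously elicited weights with ordinal statements selected by a respondent. It is only an illustration under our own assumptions: the function name `find_inconsistencies`, the data layout, and the objective labels are hypothetical and do not describe the survey software used in this study.

```python
def find_inconsistencies(weights, statements):
    """Return the ordinal statements that the elicited weights contradict.

    weights    -- dict mapping an objective label to its elicited weight
    statements -- list of (more_important, less_important) pairs, each one
                  encoding a selected statement such as "improving X from
                  worst to best is more important than improving Y"
    """
    contradicted = []
    for more_important, less_important in statements:
        # A statement is contradicted when the objective claimed to be more
        # important did not receive a strictly larger weight.
        if weights[more_important] <= weights[less_important]:
            contradicted.append((more_important, less_important))
    return contradicted


# Hypothetical example with three objectives.
elicited = {"Objective X": 0.5, "Objective Y": 0.3, "Objective Z": 0.2}
selected = [("Objective X", "Objective Y"), ("Objective Z", "Objective Y")]
print(find_inconsistencies(elicited, selected))
```

In such a design, only the contradicted statements would be shown back to the respondent, who could then revise either the statement or the weights.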
Finally, other ways might be found to enhance (and assess) preference construction. A recent online example, also using cognitive dissonance, proposed to activate values (Philpot et al., 2022). These authors prompted respondents to compare their preferences for alternatives and their values. Our results support future attempts using learning loops if they keep the ECL low. Research could assess their effectiveness and underlying mechanisms.

Gamification
We had hypothesized that gamification improves factual learning (H1b) and preference construction (H2b). Results supported the hypothesis on factual learning (H1b) but not that on preference construction (H2b). In our previous studies with smaller student samples, we did not observe statistically significant differences in factual learning between the gamified and control treatments (Aubert & Lienert, 2019). Additionally, in the earlier studies, we did not observe differences in preference construction (note that the measure instruments differed as we used improved, novel instruments in the present study). Overall, both gamified and nongamified treatments were neutral to good in all three learning aspects, although the treatments explained little of the variance. As expected, gamification led to somewhat higher knowledge scores for understanding wastewater objectives (supporting hypothesis H1b), and, rather unexpectedly, to a lower perceived understanding of value-focused thinking (no hypothesis was formulated). Respondents with gamification wrote more comments. These indicated that respondents learned facts about wastewater management (supporting H1b) and realized that making trade-offs is necessary but difficult (contradicting H2b). Moreover, respondents voiced preferences for alternatives and concerns about the population's acceptance of decentralized wastewater alternatives. Finally, the dropout rate was higher for the gamified treatment than the nongamified one. Technical issues caused some dropouts, but some respondents also emailed us that they stopped the survey because its gamified format trivialized the important topic. The innovative game elements included the storyline with the challenge of winning the next election, an avatar chosen by the respondent who meets gender-neutral nonplayer characters, and ambient music, among others (Section 2.2 and S4.3). The perceived trivialization of the wastewater topic criticized by some respondents may be due to using animal characters (see right panel in Fig. 3) rather than people as in the earlier prototypes (Aubert & Lienert, 2019; Aubert et al., 2022). After discussion with game designers, we had decided on animals as they are gender-neutral. Appropriate types of characters, avatar and nonplayer, for serious games could be a future research topic (e.g. Kim, Lee & Chung, 2023).
The high dropout rate and the higher number of voluntary qualitative comments are in line with the existing literature on gamified surveys (e.g. Bailey et al., 2015; Keusch & Zhang, 2015). The dropout rate in the gamified treatment might have biased our study: only those open to the new format could have answered the feedback questions. However, our samples with and without gamification are representative of the targeted population in gender and age (S5-7). Thus, we assume that the preferences are representative of our targeted population. Future studies could let the respondents choose between a gamified and a nongamified survey and investigate the respondents' profiles for characteristics that may determine their choice. These may include the attitude to games (Guin et al., 2012) and individual motivational orientations. According to causality orientation theory (Loughrey & Broin, 2018; Ryan & Deci, 2017), individuals react differently to motivational stimuli. For instance, an autonomy-oriented person may be frustrated by external stimuli, whereas the same stimuli may motivate a controlled-oriented person. Moreover, personality traits such as the Big Five (openness, conscientiousness, extraversion, agreeableness, and neuroticism) could influence whether people like gamification (Triantoro, Gopal, Benbunan-Fich & Lang, 2019, 2020).
For those open to this new format, gamification successfully created awareness about the complexity of wastewater management (supporting hypothesis H1b) and evoked a positive attitude. However, this positive attitude might have biased the feedback, because those who disliked the format were underrepresented (i.e. they dropped out). We observed factual learning (supporting hypothesis H1b), but it was limited. Additionally, the qualitative feedback for factual learning and preference construction clearly indicated that respondents realized that decisions about wastewater management required making trade-offs, but they found it difficult to make these trade-offs. Our mixed results thus allow us to state that gamification led to successful awareness raising rather than successful learning.
Another unexpected but, we think, positive result is that respondents receiving the gamified treatment voiced concerns about the population's acceptance of decentralized wastewater alternatives. They highlighted that such decisions require broad information, should satisfy divergent interests, and that the best compromise should be found. We interpreted this as social learning, or learning about diverse worldviews (den Haan & van der Voort, 2018; Reed et al., 2010). This could have been primed by the narrative of the gamified survey. It included meeting citizens who advocated a single objective, and respondents playing candidates to represent these citizens at the municipal council. Some respondents forgot that the aim of the survey was to collect their own preferences to inform a decision. Rather, they interpreted their role as that of a mediator. This is a known effect: the game activity can distract from the main task (Keusch & Zhang, 2015). This may also have caused the lower perceived process understanding with gamification. Follow-up studies could focus on this transfer of learning from the "magic circle" of the game (Huizinga, 1949) to the real-world task.
Some counterarguments about gamification also warrant consideration. First, our gamification was strongly constrained by the norms of the swing and trade-off weight elicitation methods (Eisenführ et al., 2010; von Winterfeldt & Edwards, 1986). Considering simpler methods to elicit weights would be interesting, and the game designers were also keen to simplify weight elicitation (S4.4). Follow-up attempts to gamify weight elicitation for MCDA could ask respondents simply to rank objectives. We would then calculate weights from rankings (e.g. Riabacke et al., 2012; Roberts & Goodwin, 2002). Second, developing our gamified survey required substantial time and money and relied on game designers. Although our gamified survey concept is easily adaptable, new texts and artwork will be needed for the next application case, which will consume resources. Before gamifying a survey, one should weigh the pros and cons, which can be context specific.
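As an illustration of how weights could be calculated from a simple ranking of objectives, the sketch below implements two widely used surrogate weighting rules, rank sum and rank-order centroid. It is a minimal example under our own assumptions: the function names and the objective labels are hypothetical, ties are not handled, and we do not claim that either rule is the one that would be retained in a follow-up study.

```python
from fractions import Fraction


def _check_ranks(ranks):
    # Ranks must be a strict ordering 1..n (1 = most important); no ties.
    n = len(ranks)
    if sorted(ranks.values()) != list(range(1, n + 1)):
        raise ValueError("ranks must be the integers 1..n without ties")
    return n


def rank_sum_weights(ranks):
    """Rank sum rule: w_i = 2 * (n + 1 - r_i) / (n * (n + 1))."""
    n = _check_ranks(ranks)
    return {obj: Fraction(2 * (n + 1 - r), n * (n + 1)) for obj, r in ranks.items()}


def rank_order_centroid_weights(ranks):
    """Rank-order centroid rule: w_i = (1 / n) * sum_{k = r_i}^{n} 1 / k."""
    n = _check_ranks(ranks)
    return {
        obj: sum(Fraction(1, k) for k in range(r, n + 1)) / n
        for obj, r in ranks.items()
    }


# Hypothetical ranking of three objectives (1 = most important).
ranking = {"objective A": 1, "objective B": 2, "objective C": 3}
for name, rule in [("rank sum", rank_sum_weights),
                   ("rank-order centroid", rank_order_centroid_weights)]:
    weights = rule(ranking)
    print(name, {obj: float(w) for obj, w in weights.items()})
```

Both rules return weights that sum to one; the rank-order centroid rule assigns a larger share to the top-ranked objective than the rank sum rule does.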
Finally, the novel nongamified survey developed here presents some practical advantages. First, the admin interface enables new weight elicitation surveys to be created quickly for other decisions and the weight elicitation process meets the standards. Second, the nongamified survey can also support interviews and facilitated workshops, enabling display of the weights. Third, the code is open source and available: the software for the nongamified survey only needs to be deployed on a server (Aubert & Masson, 2021).

OR and e-participation
Our four survey treatments successfully raised awareness in a sample representative of the Swiss population about a topical decision. Respondents learnt about an unfamiliar topic, constructed and shared preferences, and understood that deciding about wastewater management requires making difficult trade-offs between many objectives. The interfaces informed respondents effectively. Moreover, the survey can support co-deciding, if it is used before and/or after more deliberative types of participation, such as stakeholder workshops and citizens' fora. In sum, our online survey could support the three main aims of public participation in decision-making: informing, consulting, and co-deciding (Arnstein, 1969; de Gooyert, Rouwette, van Kranenburg & Freeman, 2017). The decision-makers should clarify the aim of participation at the start and inform respondents about the decision problem and methods used (Linkov et al., 2006).
Our online survey supported a detailed, in-depth understanding of the decision problem and enabled many citizens to participate. This aligns with the paradigms of deep and broad participation (Gregory, Satterfield & Hasell, 2016; Papadopoulos & Warin, 2007). However, the digital divide remains a reality, with disparities in internet access and use (Merriam & Baumgartner, 2020). Thus, surveys such as ours complement rather than replace existing participatory decision-making processes (French & Argyris, 2018; Gregory et al., 2016; Philpot et al., 2022).
Our nongamified survey can be used in other settings. It can support an interactive group weight elicitation workshop, for instance with visualization, or individual interviews. In these examples, a decision analyst would facilitate the process. To date, the novel interface does not support asynchronous, autonomous use by a group. Future development could add features to display weights of multiple stakeholders and to automatically support a fair group process, for instance following the group Delphi method (Niederberger & Renn, 2018).
Originally, we developed the survey to enable individuals to answer without guidance from a decision analyst. Our results displayed high interpersonal variability, with fewer significant differences between the treatments than expected. It is intuitively apparent that different types of games appeal to different types of people (Guin et al., 2012; Triantoro et al., 2019). Additionally, we observed that the learning loops also triggered people differently, possibly depending on their cognitive capabilities. A recent meta-analysis, based on 30 articles on risk preferences, confirmed that cognitive abilities can lead to erroneous choice behaviors (Mechera-Ostrovsky, Heinke, Andraszewicz & Rieskamp, 2022). Consequently, high interpersonal variability challenges the design of an interface meant for a broad public. For this reason, we suggest that citizens could choose their preferred interface, and further studies should examine individual characteristics. Further studies should also test different learning loops and continue to identify the benefits and drawbacks of surveytainment (Kostyk et al., 2019).

Learning and decision-making
Decision analysis, including modeling and structured decision-making, has long been described as an iterative learning process (e.g. Belton & Elder, 1994). McDaniels and Gregory (2004) suggest capturing learning as a separate fundamental objective in structured decision-making. In the BOR literature, learning as a dependent variable is mostly studied at the individual level (Monks et al., 2014, 2016; Tako et al., 2020; Thompson et al., 2016) despite the call to follow the rich literature on group model building of system dynamics (Franco et al., 2021). Our study also focused on learning for individuals but targeted laypeople. Such affected stakeholders can be represented in a group decision-making process, but it is rare except in community OR (e.g. Midgley et al., 2018).
Based on learning (e.g. Mezirow, 2000) and decision analysis literature (e.g. Belton & Elder, 1994), we conceptualized learning for decision-making as comprising factual learning, preference construction, and process understanding. Our novel assessment method proved satisfactory, except for preference construction. Unexpectedly, the qualitative feedback in the gamified version of the present study highlighted that social learning is a fourth facet of learning that we had not considered. Further studies should refine our assessment method to measure preference construction and to include social learning. Our results confirmed the complementarity of performance tests, self-reported assessments, and qualitative data (den Haan & van der Voort, 2018; Franco et al., 2021; Hamari et al., 2016). Complementary data revealed that perceived learning can differ from learning measured by performance tests, as observed here and in previous studies (Aubert & Lienert, 2019; Deslauriers et al., 2019). It also raised interesting questions. For example: how can we explain that the difference between the final and initial knowledge scores was more often negative for alternatives than for objectives? Can this be explained only by tiredness (the questions about alternatives came after those about objectives)? Or did we emphasize objectives so strongly (following value-focused thinking) that alternatives appeared less important? We propose testing our assessment method in other contexts, for instance in decision-making workshops. This would provide a baseline for comparing our interfaces. Finally, to better unravel critical learning incidents, other methods are more suitable, such as think-aloud protocols, video analysis, click-data, and eye-tracking (Franco et al., 2021; Hamari et al., 2016). We encourage future BOR studies to continue investigating learning as a process.
Understanding the learning process will enable improving the design of OR interventions. Learning needs time, as respondents need to process the information (Gregory et al., 2012). Several respondents reported in the qualitative feedback that they would need to be further informed or would look for complementary information sources. We propose that future development could decouple the informing part from preference elicitation. The informing part could include links and references to information sources. Additionally, the preference elicitation part should ensure that the facts are always easily accessible. It could also include small training and practice tasks for swing and trade-off weight elicitation to enhance process understanding (Anderson & Clemen, 2013). Moreover, the respondents should be given the opportunity to take a break from the survey (which was not possible in our case), and could be offered a chance to review and revise their preferences after a set time. Our novel structured assessment method presented in this paper enables comparing all these development options and identifying which most improve the respondents' learning.
As a drawback, these developments are resource intensive. Although learning requires time, the survey length is critical. Long surveys increase dropout rates. Good practice recommends a maximum of 20 questions, per stage if multistage, and at most 13 min to complete the survey (Bailey et al., 2015). However, if respondents think that the data are important, they seem to accept a higher survey burden (Guin et al., 2012). Our survey was long and complicated, and should be shortened. Simpler consistency checks could be tested for the learning loops. The relevance of consistency checks to enhance preference construction should be clarified, as the think-harder strategy seemed to raise doubt and distrust among some respondents. Finally, the interfaces should be adapted for a range of devices, because people increasingly use smaller screens, including cell phones. Because of the costs of such developments, we strongly recommend that OR researchers and practitioners first investigate how to best enhance learning.

Conclusion and outlook
Individual learning for decision-making is topical for BOR. In our study, we focused on laypeople, investigating whether learning for decision-making occurred when citizens answered specifically designed online weight elicitation surveys. In addition, we investigated the effect of two features meant to increase learning: gamification and learning loops. We used a comprehensive, innovative set of instruments to measure learning for decision-making, derived from a systematic operationalization.
Our conceptualization of learning for decision-making comprised factual learning, preference construction and process understanding. Because we targeted laypeople, we interpreted process understanding as understanding value-focused thinking, i.e. differentiating objectives from alternatives and understanding that the relative importance of objectives is used to derive a ranking of alternatives. Although more comprehensive than previous attempts, our framework missed social learning, which was unexpectedly triggered by the storyline of the gamified interface. Future OR work investigating learning for decision-making should consider all four learning facets.
Using newly-designed interfaces for eliciting weights online, we successfully raised awareness about a public decision concerning wastewater management. Most respondents learnt something about the facts, their own preferences, and value-focused thinking. Gamification and learning loops enhanced factual learning; however, learning loops decreased the perception of factual learning and gamification decreased the perception of process understanding. Our results also revealed high interpersonal variability, which is challenging when designing user-centered interfaces. In future research, we could study the effects of letting the respondents choose the survey format (e.g. gamified or not).
The gamification feature was useful: it enhanced topical awareness and social learning. However, dropout was high, the practical relevance of the differences to the nongamified interface was low, and developing the gamified version was resource-intensive. For online weight elicitation, based on the present study, we recommend adding a few game elements to the nongamified interface, rather than developing a fully-fledged gamified interface.
The learning loop feature was useful to increase factual learning, but was clearly cognitively demanding and lowered the perceived learning on all three facets. We propose using "think-harder" strategies to further investigate this difference between tested and perceived learning. OR researchers interested in online weight elicitation have to develop tools that assist the respondents more effectively and may consider simplified procedures. We could investigate whether simply ranking the objectives to derive weights is sufficient. We could compare the effects of our learning loops with simpler loops, since we found that thinking harder does not necessarily improve learning. Most likely, we should consider the aim of participation: raising awareness does not require the same tools as co-deciding.
Finally, we newly operationalized the measure instruments of learning for decision-making with complementary types of data (quantitative performance test and self-assessment, qualitative self-assessment). We used this new assessment framework to compare our four interfaces. Further development should revise the measures for preference construction, and add performance tests and self-reported measures for social learning, thus transcending individual learning. We insist on using complementary data types: qualitative feedback provided insights that helped explain the quantitative results. Other instruments such as think-aloud protocols may well be appropriate if the aim is to unravel the learning processes, not only the learning outcomes. OR researchers could use our learning for decision-making framework to measure learning in traditional settings, such as interviews and workshops. This baseline would allow comparison with innovative tools such as the ones presented here.
To conclude, despite many new open questions, our results with these innovative interfaces proved promising: they successfully raised awareness among many affected citizens in public decision-making.

Fig. 3
Examples of screenshots of the interfaces in the nongamified (left) and gamified treatment (right). Both display the swing weight elicitation.

Table 1
Summary of hypotheses.LL: learning loop.
(3b) Based on the literature, we could not formulate a clear hypothesis about the effect of gamification on process understanding.
(3c) Using an online survey interface with LL improves process understanding compared to using an interface without LL.
(4) Extraneous cognitive load (ECL)
(4) LL increase the mental effort compared to no LL.
A.H. Aubert et al.

Table 4
Number of respondents (and percentage) with a negative (lower final knowledge score compared to initial) and null Δ KS (no learning) for objectives and alternatives. Δ KS = final KS − initial KS; Δ KS obj can vary between −10 and 10. Δ KS alt can vary between −6 and 6. obj: objectives. alt: alternatives. noLL: no learning loop. LL: learning loop. nogam: nongamified. gam: gamified. *Statistically significant.
* Cronbach's alpha is a measure of how closely related the items of a scale are. If it is close to 1, the items are reliably related and can be averaged.

Table 6
Change in confidence in the answers from the knowledge tests. Difference in confidence in the answers (post-test − pretest); varied from min = −4 to max = 4. SD = standard deviation. Mdn = median. noLL: no learning loop. LL: learning loop. nogam: nongamified. gam: gamified.