The importance of effectiveness versus transparency and stakeholder involvement in citizens’ perception of public sector algorithms

ABSTRACT This paper sheds light on how much citizens value different features of public sector algorithms, specifically whether they prioritize effectiveness over transparency and stakeholder involvement in algorithm design or instead see effectiveness as less important. It does so with choice-based conjoint designs that present variants of algorithms used in policing and health care to respondents from representative German samples. Two studies with overall more than 3000 participants show that people are ready to trade away transparency and stakeholder involvement for small effectiveness gains. Citizens thus seem unlikely to demand accountable algorithms even in sensitive areas like policing and health care.


Introduction
Algorithmic decision-making systems based on machine learning (abbreviated as algorithms in the following) promise to assist or replace human decision-makers while offering effectiveness and efficiency gains. Indeed, the use of such systems is rapidly increasing both in the private and in the public sector (Wirtz, Weyerer, and Geyer 2019). Their scalability, cost-effectiveness, and performance, it has been argued, can substantially alter and improve bureaucratic operations and public value creation (Agarwal 2018; Busuioc 2020; Young, Bullock, and Lecy 2019; Giest and Klievink 2022). However, even though algorithms might work well in certain domains, people affected by them may often not know whether algorithms actually serve their interests. In fact, algorithms may show undesirable biases and can lead to unfair discrimination based on sensitive features such as gender (Barocas and Selbst 2016; Lepri et al. 2018; Pasquale 2015).
Given these potential problems, there has been a surge of scholarly interest in the transparency and accountability of algorithms (Felzmann et al. 2020; Kroll et al. 2017; Veale and Brass 2019; Wieringa 2020; Yeung, Howes, and Pogrebna 2019). While realizing algorithmic accountability is a general challenge, public administration research has highlighted that distinct transparency and accountability problems arise when algorithms are used by state agencies (Busuioc 2020; Levy, Chasalow, and Riley 2021; Oswald et al. 2018; Veale and Brass 2019; Wirtz and Müller 2019). In a nutshell, the use of this technology creates new information asymmetries and shifts discretion to those developing and implementing algorithms, actors that necessarily make value-laden choices about the algorithm design. Relinquishing control through 'technological outsourcing' can therefore be the result (Dickinson and Yates 2021, p. 2).
CONTACT: Pascal D. König (pascal.koenig@sowi.uni-kl.de). Supplemental data for this article can be accessed at https://doi.org/10.1080/14719037.2022.2144938
The present paper studies algorithms used in the public sector by examining how much citizens favour an algorithm's effectiveness, i.e. how well an algorithm performs, over its transparency and stakeholder involvement in the design of the algorithm. Focusing on the relative importance of and possible trade-offs between these features of an algorithm directly speaks to central debates on the prioritization of values guiding public management (van der Wal, de Graaf, and Lawton 2011) and different sources of decision-making legitimacy (Schmidt 2013; Strebel, Kübler, and Marcinkowski 2018). It has been noted that a focus on technological capabilities and an emphasis on managerial aspects of effectiveness and efficiency commonly guide the adoption of algorithms in government (Gil-Garcia, Dawes, and Pardo 2018; Schiff, Schiff, and Pierson 2021; Wirtz and Müller 2019). Furthermore, the results of our research are highly relevant as algorithms are discussed as a tool to increase efficiency and performance, features that are central under the New Public Management paradigm (Peters 2011; Verbeeten and Speklé 2015). We focus on citizens' evaluations because, to date, we know very little about how the public thinks about these issues. Yet, knowledge about how much citizens are concerned about algorithms matters for questions of bureaucratic legitimacy and regulatory action. Of course, regardless of what citizens think, there need to be rules put in place that safeguard public accountability. Nonetheless, if indeed policymakers are primarily concerned with the performance of algorithms and citizens care mainly about this aspect, too, public officials may be tempted to gradually implement opaque and unaccountable algorithms that improve the bottom line.
A nascent literature on how citizens view algorithms in government (Aoki 2020; Grimmelikhuijsen 2022; Kennedy, Waggoner, and Ward 2022; Miller and Keiser 2021; Schiff, Schiff, and Pierson 2021) has shown that transparency leads to more positive views of these algorithms, and that the kind and degree of transparency matters, too. It has, however, not focused on how citizens prioritize different algorithm features, specifically transparency versus effectiveness. Nor has it examined the role of stakeholder involvement in algorithm design so far. A literature dealing with fair, transparent, and accountable algorithms has emphasized that both transparency (e.g. Lepri et al. 2018; Wieringa 2020) and stakeholder involvement (e.g. Busuioc 2020; Felzmann et al. 2020) are important instruments for mitigating algorithmic biases. But how much do citizens care about and prioritize these features, specifically when they cannot maximize them together with algorithm effectiveness?
By studying this question, we add novel empirical insights to research on how citizens perceive public sector algorithms. Our main contribution lies in assessing citizens' preferences by probing how citizens prioritize relevant features and how much they are willing to trade one against the other. This does not mean that citizens will always and necessarily face such trade-offs in real life. Rather, we confront them with trade-off decisions as a specific methodological approach to determine how much importance people give to these features and which they prioritize. This approach, conjoint analysis, was originally developed in marketing studies to deal with the problem, known from decision research, that people are usually not aware of their preferences for certain features of an option and are unable to self-report them accurately. Hence, while people may value both transparency and effectiveness highly when asked directly about their importance, they may still clearly prefer one over the other when confronted with trade-off decisions.
To elicit such preferences and obtain estimates of how much people value algorithm features relative to each other, we therefore use conjoint analysis as an indirect method of preference measurement. It is well suited to register the relative weight that people give to a set of features and also deals with the problem of desirability bias (Horiuchi, Markovich, and Yamamoto 2021). We conduct two choice-based conjoint analyses using original survey data from a sample of over 3000 respondents representative of the German population. These analyses are based on real-world cases in which algorithms inform frontline decision-making with potentially important consequences for citizens. In our first study, we examine algorithms for predictive policing, whereas the second study replicates these results for predictive policing and additionally introduces an algorithm used for predicting skin cancer. Adding the skin cancer domain allows us to test whether the results found for predictive policing also hold for another domain that might affect citizens' lives deeply. The analyses provide evidence that people are ready to trade away stakeholder involvement in algorithm design and algorithm transparency for small gains in effectiveness, even in sensitive areas such as policing and health care.
The paper is structured as follows. Sections two and three present the state of the art on citizens' evaluations of algorithms in the public sector and the theoretical assumptions guiding the analysis. Section four describes the data, materials, and methods before section five turns to the results from the analysis. Section six closes with a summary and discussion of the findings in light of theoretical contributions on algorithms in government.

The problem of algorithm bias
In this study, we aim to evaluate whether citizens prioritize transparency and/or stakeholder involvement over the effectiveness of an algorithm. Focusing on these features makes sense for several reasons that are closely linked to the vast literature on algorithm bias. In fact, a central concern regarding the use of algorithmic systems is that they may show biases, intended or unintended, that go against the interest of those whom they are supposed to serve (Krafft, Zweig, and König 2020; Lepri et al. 2018; Martin 2019). While this constitutes a general agency problem arising with algorithms, the problem can gain special weight in public sector applications, such as for detecting tax fraud or for predicting criminal behaviour. There, citizens are vulnerable to the power wielded by public actors and often cannot opt out of decision-making, a constellation that makes decision biases potentially very harmful and problematic for democratic legitimacy (Warren 2014). At the same time, the use of algorithms creates distinct accountability challenges. They increase information asymmetries and can be black-boxed due to being opaque, highly complex or even inherently unintelligible (Busuioc 2020; Oswald et al. 2018). Their opaqueness also easily hides value-laden design choices that determine how the algorithm performs and which societal values it realizes and prioritizes over others (Levy, Chasalow, and Riley 2021; Veale and Brass 2019). This makes mechanisms for aligning algorithms with the interests of citizens and with public values especially important.
A broad literature has pointed to measures for safeguarding the accountability of algorithms as particularly important in this regard (for an overview, see Wieringa 2020). In a governance perspective, accountability refers to a relationship in which an actor has an obligation to justify her conduct (here: the state to its citizens) (Bovens 2007). Accountability requires transparency, but also comprises answerability, i.e. the possibility to demand justifications, and the possibility to sanction (Warren 2014). When transferring this notion of accountability to algorithmic systems, meaningful algorithmic accountability entails not only technical approaches for making algorithmic systems (the decision model and underlying parameters) transparent and intelligible (on this, see Guidotti et al. 2018), but also comprises justifying the choices behind the algorithm design, such as why certain values have been built into it (Wieringa 2020). Knowing the decision model of an algorithm is one thing, but knowing why this model has been chosen and how it relates to certain guiding values is a different question.
Often, the technical transparency of the algorithmic system will already be a major issue. The decision models of algorithmic systems will usually not be transparent through being openly accessible or even intelligible, i.e. compatible with human semantic interpretation (e.g. as numerical feature weights). Rather, they are mostly either inaccessible or they are accessible but complex or even unintelligible (e.g. in the case of neural networks). Achieving technical transparency then requires rules to ensure access and/or using technical methods for making it transparent and intelligible (Guidotti et al. 2018; Rosenfeld and Richardson 2019).¹ However, realizing transparency will hardly be of much use to individuals as laypersons because they will usually lack the expertise to properly assess the algorithm design, including the data and the learning process for generating its decision models. Mere technical transparency as access to some underlying model will therefore mostly not suffice for generating accountability (Ananny and Crawford 2018; Felzmann et al. 2020). Instead, it will often be necessary to delegate this task to experts who perform audits or impact assessments to ascertain whether an algorithm is designed to realize positive goals and values (Kaminski and Malgieri 2020; Oetzel and Spiekermann 2014).
Another way of dealing with the issue of algorithmic bias, it has been argued, is to involve stakeholders in the process of designing an algorithm. We call this feature 'stakeholder involvement'; it is the second key element examined in our study besides transparency. In fact, various scholars have argued that the design of algorithmic systems entails value-laden choices that create a need for involving stakeholders, particularly citizens (Busuioc 2020; Felzmann et al. 2020; König and Wenzelburger 2021; Levy, Chasalow, and Riley 2021). They argue that by including stakeholders at different stages of algorithmic development, for instance by clarifying trade-offs and discussing the underlying values involved in them, algorithm bias will be reduced and legitimacy enhanced. Although use cases are still rare, existing work suggests that involving stakeholders in algorithm design decisions is feasible and can align the algorithms' performance with stakeholders' values (e.g. Lee et al. 2019; Zhu et al. 2018).

Citizens' evaluations of algorithms in the public sector
In light of the central role of transparency, it is hardly surprising that existing work on citizens' evaluations of algorithms in the public sector shows a prevailing focus on algorithmic transparency. Existing evidence indicates that violations of fairness and transparency expectations lead to lower acceptance of algorithms (Schiff, Schiff, and Pierson 2021). Yet, absent such violations, citizens seem to be more accepting of algorithms based on a presumed higher fairness if they have previously experienced discrimination by human decision-makers (Miller and Keiser 2021). Regarding transparency, citizens seem to place particular value on more demanding and meaningful transparency through being able to get explanations for the algorithm design and performance (Grimmelikhuijsen 2022; see also König, Wurster, and Siewert 2022). However, people also appear to be swayed already by large datasets used for developing an algorithmic model and by developer reputation, and they care about having a human in the loop (Kennedy, Waggoner, and Ward 2022).
Regarding commercial applications, such as online recommender systems, the evidence similarly suggests that people value transparency and that greater transparency through provided explanations enhances trust in algorithms (Liu 2021; Shin 2021; Shin and Park 2019; for an overview, see Glikson and Woolley 2020). Yet, there is also evidence that too much transparency can have a negative effect on trust (Kizilcec 2016; Schmidt, Biessmann, and Teubner 2020). Interestingly, studies on the relationship between transparency and trust in government and public administration more generally, too, have found that transparency can lead to lower trust, possibly because it primes suspicions and concerns among citizens (Grimmelikhuijsen et al. 2013; Grimmelikhuijsen, Piotrowski, and Van Ryzin 2020).² Although existing studies clearly suggest that citizens care about transparency, there is a lack of systematic evidence on how much importance they give to transparency in relation to the performance or effectiveness of an algorithm, and specifically how willing they are to trade off one against the other. Nor is it known how much citizens care about stakeholder involvement as a way to directly align the algorithm design with citizens' views and thus avoid undesirable biases. As citizen involvement and deliberation can engender more positive views about government (Halvorsen 2003), it might also increase acceptance of algorithms in the public sector.
The relative weight of these aspects and evaluative criteria in the use of algorithms in public services directly concerns what has been described as fundamental standards that can legitimate governing authority and public decision-making: Legitimacy can stem from the performance and the outputs realized (output legitimacy), from the acceptability of the process through which outputs are realized (throughput legitimacy), and from decisions being based on the inputs of those affected by them (input legitimacy) (Schmidt 2013; Strebel, Kübler, and Marcinkowski 2018). In the following, we therefore examine how much citizens value features of algorithmic systems that are linked to these three dimensions of legitimacy. First, regarding the output dimension, we examine the effectiveness of an algorithm, which can be understood as its performance, for instance, when making predictions. One can quantify this performance with a range of different statistical measures, such as accuracy, recall, and precision.³ As we detail below, we will narrowly refer to effectiveness understood as the performance in terms of only one criterion, the so-called recall or true positive rate. Second, by studying the role of algorithm transparency, we cover the throughput dimension of legitimacy. Finally, the input dimension is represented through incorporating stakeholder involvement in algorithm design as a further feature.
One can derive competing expectations about whether citizens see the output, the throughput, or the input dimension as more important than the others from two strands of the literature. First, work based on procedural justice theory has shown that people do accept even undesirable decisions as long as the process leading to them satisfies certain procedural standards, such as involving different views and accountability mechanisms (Tyler 2000). Findings from this research suggest that there is some innate appreciation, rooted in social-psychological needs, of a fair and transparent process. One would thus expect that citizens perceive stakeholder involvement and transparency as more important than effectiveness (H1).
Second, research on democratic governance has found, in turn, that citizens care mainly about the outputs of a political system, whereas input and throughput are of secondary importance (Strebel, Kübler, and Marcinkowski 2018). Similarly, citizens seem to readily give up accountability mechanisms in liberal democracy if this furthers their own policy preferences (Gidengil, Stolle, and Bergeron-Boutin 2021; Graham and Svolik 2020). These contributions thus suggest that individuals' plain material self-interest leads them to primarily care about results. Correspondingly, one would expect that citizens perceive effectiveness as more important than stakeholder involvement and transparency (H2). The following section describes how we empirically determine how much citizens value these different aspects.

Samples and choice of settings
Participants of both studies were recruited via respondi AG, a panel provider certified according to the internationally recognized norm ISO 26362. The samples of both studies were drawn from a participant pool that is representative of the German population aged 18 to 74. The criteria for drawing the sample were a valid representation of the population regarding gender, age, and region quotas (see SI Tables A8 and A9 for sample composition and information on the online panel). Samples that represent the population of interest have been shown to be important in conjoint designs (Hainmueller, Hangartner, and Yamamoto 2015).
We developed the choice designs as a Bayesian optimal design. This process involved finding those optimal choice sets (consisting of algorithm designs as options) of a predefined quantity that are statistically most efficient for estimating the parameters (the feature level partworths) given prior knowledge about their direction and size. This iterative selection process of choices is performed while modelling uncertainty about the priors by sampling them from a prior distribution. To have information about such priors, we first obtained information about the relative coefficient sizes to be expected from a pre-test with a university student sample (N = 87, see SI Table A5 for results). These coefficients were then used as priors for the coefficients (partworth utilities) to generate a choice design that is statistically efficient (see SI for details).
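The efficiency criterion behind such a design can be illustrated with a minimal sketch. It assumes a plain multinomial logit model, normally distributed priors, and the D-error as the efficiency measure; the paper's SI describes the actual procedure, so the function names, the toy designs, and the prior values below are hypothetical and serve only to show the principle of averaging a design's error over draws from the prior.

```python
import math
import random

def choice_probs(alternatives, beta):
    """Logit choice probabilities for the alternatives of one choice set."""
    utilities = [sum(x * b for x, b in zip(alt, beta)) for alt in alternatives]
    top = max(utilities)
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def mnl_information(design, beta):
    """Fisher information matrix of a multinomial logit model for a design."""
    k = len(beta)
    info = [[0.0] * k for _ in range(k)]
    for alternatives in design:
        p = choice_probs(alternatives, beta)
        # Probability-weighted mean attribute vector of this choice set.
        mean = [sum(pj * alt[a] for pj, alt in zip(p, alternatives)) for a in range(k)]
        for pj, alt in zip(p, alternatives):
            dev = [alt[a] - mean[a] for a in range(k)]
            for r in range(k):
                for c in range(k):
                    info[r][c] += pj * dev[r] * dev[c]
    return info

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # treat the matrix as singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def bayesian_d_error(design, prior_mean, prior_sd, n_draws=200, seed=0):
    """Average D-error over parameter draws from an assumed normal prior."""
    rng = random.Random(seed)
    k = len(prior_mean)
    errors = []
    for _ in range(n_draws):
        beta = [rng.gauss(m, prior_sd) for m in prior_mean]
        det = determinant(mnl_information(design, beta))
        errors.append(det ** (-1.0 / k) if det > 0 else math.inf)
    return sum(errors) / n_draws

# Toy designs (hypothetical): four choice sets, two alternatives, two dummy attributes.
design_varied = [[(1, 0), (0, 1)], [(1, 1), (0, 0)], [(1, 0), (0, 0)], [(0, 1), (0, 0)]]
design_repeated = [[(1, 0), (0, 1)]] * 4  # same contrast four times, not identifiable

error_varied = bayesian_d_error(design_varied, prior_mean=[0.5, 1.0], prior_sd=0.3)
error_repeated = bayesian_d_error(design_repeated, prior_mean=[0.5, 1.0], prior_sd=0.3)
```

A search algorithm would compare many candidate designs in this way and keep the one with the lowest average error; the sketch only evaluates two fixed candidates and omits the search itself as well as the 'no algorithm' option.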
The pre-test coefficients were also used as priors for calculating an adequate sample size for the main study based on a power analysis for discrete choice experiments (de Bekker-Grob et al. 2015). According to the analysis, presuming an alpha of 0.05 and a beta of 0.2, about 200 cases were sufficient to estimate at least all of the highest (third) feature levels and almost all of the second levels. We therefore opted for recruiting about 600 participants, i.e. 300 for each of the two experimental conditions in Study 1. For Study 2, N = 2661 participants were recruited, with more than 600 respondents for each of the four variants of the conjoint design. In both studies, participants were excluded from the analyses if they failed an attention or a speeding check or a self-report control question on whether their data should be included in data analyses. For the speeding check, only participants who took at least five seconds for their decision on average over all ten choice sets were included in the analyses.⁴ Participants who chose more quickly between the algorithms in the choice task are likely not to have taken the task seriously.
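The exclusion rules just described can be expressed as a simple filter. The following is a minimal sketch; the record layout and field names (`attention_ok`, `wants_data_included`, `decision_times`) are hypothetical and not taken from the study's data files, and only the five-second threshold comes from the text.

```python
from statistics import mean

MIN_MEAN_SECONDS = 5.0  # threshold stated in the text

def passes_speeding_check(decision_times):
    """True if the mean decision time over the choice sets is at least 5 s."""
    return mean(decision_times) >= MIN_MEAN_SECONDS

def keep_participant(record):
    """Apply the attention check, the self-report question, and the speeding check."""
    return (
        record["attention_ok"]
        and record["wants_data_included"]
        and passes_speeding_check(record["decision_times"])
    )

# Two made-up participants with ten decision times (in seconds) each.
careful = {"attention_ok": True, "wants_data_included": True,
           "decision_times": [9, 7, 6, 8, 5, 6, 7, 4, 6, 5]}
speeder = {"attention_ok": True, "wants_data_included": True,
           "decision_times": [2, 3, 2, 4, 3, 2, 3, 2, 3, 2]}

retained = [p for p in (careful, speeder) if keep_participant(p)]
```

Note that the rule applies to the average, so a single quick decision (the 4-second one above) does not by itself lead to exclusion.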
The choice sets reflected two real-life cases where algorithms are used in the public sector: Predictive policing and the prediction of skin cancer risk. These cases are similar in several respects: They both relate to an existential value (security, health) and involve algorithms that predict risk (for burglary or skin cancer) and indicate an increased screening as the policy solution (more surveillance, more skin cancer screening). However, they are different in one important respect: Skin cancer prediction algorithms regularly achieve a much higher level of true positive rates than predictive policing algorithms. This contrast made it possible to test whether the perceived importance of differences in the true positive rate (effectiveness) depends on the overall level of that rate.

Conjoint designs
To analyse how much importance participants place on the algorithms' features, both studies use a choice-based conjoint design in which participants saw ten choice sets. These sets were presented in a randomized order for each participant to prevent sequence effects. We limited the designs to ten choice sets and constrained the number of the algorithms' features to four to ensure that participants were not overwhelmed by information. Cognitive overload in conjoint designs increases the noise in the response behaviour due to inattention and fatigue (Reed Johnson et al. 2013). It was important to avoid these problems because algorithms and their features are still largely unfamiliar to the broad public and because in online surveys, participants cannot ask an experimenter to clarify the instructions in case of misunderstandings. Therefore, the materials were carefully developed and pre-tested for comprehensibility (see SI for details on methodological choices and instruction, and SI Table S4 for the main choice design).
In each choice set, the participants chose between two different configurations of an algorithm and the third option of applying no algorithm at all (see SI Figure S1). The presented algorithm configurations differed with respect to stakeholder involvement, transparency, and effectiveness. We also included running costs as an additional feature to put a price tag on the algorithm and make the options more realistic. As the use of an algorithm will have fiscal implications, including running costs allows us to probe citizens' willingness to pay for certain features of an algorithm, such as transparency. Each of the features has three levels, which were set at meaningful and largely realistic options that could be easily understood by participants. The feature levels for stakeholder involvement and transparency are informed by the literature on algorithmic biases discussed above. Based on the idea that stakeholder involvement (I.) in algorithm development, as in other forms of participatory policy design, can differ regarding the degree of involvement and influence, we chose the following levels: (1) no involvement of stakeholders, (2) stakeholders are asked to provide their opinions about the algorithm, (3) stakeholders are asked to give their consent to the use of the algorithm.
Regarding the transparency of the algorithm (II.), we followed the assumption that there are different ways of realizing transparency and that not all of them are equally meaningful and effective (e.g. Felzmann et al. 2020; Grimmelikhuijsen 2022; Wieringa 2020). Transparency as mere access to information about the design and operating logic of the algorithm is not generally enough to ascertain whether an algorithm's output shows an undesired bias. Yet, citizens may still value such transparency even if it is rather weak. Further, even if they had more detailed information about the algorithm and technical design decisions behind it, laypersons operating the algorithm would hardly be able to properly scrutinize an algorithm and determine if it best realizes purported goals (e.g. Felzmann et al. 2020; Lepri et al. 2018). Based on these considerations, we chose the following levels of transparency: (1) no transparency at all (only the software company that develops the algorithm has insight into its operating logic), (2) the organization using the algorithm (e.g. the police department employing predictive policing) has insight into the general operating logic of the system, (3) thorough testing and scrutiny of the algorithm by independent experts (e.g. experts who work for the government).
For the algorithm's effectiveness (III.) we chose differences in the true positive rate (or recall) as an easy-to-understand and widely used measure of an algorithm's performance.⁵ It states how many actual positive outcomes are correctly detected by an algorithm. We deliberately chose to keep it simple and did not bother participants with also considering other performance measures besides the true positive rate, focusing their attention on comparisons between different true positive rates. In the study on predictive policing, we stated how many of 100 burglaries occurring in a municipality were correctly predicted by an algorithm. As further specified in the description, this rate is calculated for predictive policing based on burglaries and predictions that are made for the next five days and within quadrants of 500 × 500 metres (see SI for the instructions). Hence, a true positive rate of 10% means that of 100 occurrences across these geographic units (e.g. over the course of several months), 10 have been anticipated. For the skin cancer prediction, the presented number similarly reflects how many of all occurring positive cases, i.e. people who do develop skin cancer, are detected. As we varied the overall level of effectiveness depending on the experimental designs of studies 1 and 2, we will describe the chosen values further below.
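As a small worked example of this measure, the true positive rate is the number of correctly predicted positive cases divided by all actual positive cases. The function name below is illustrative only; the numbers reproduce the example of 10 anticipated burglaries out of 100 from the text.

```python
def true_positive_rate(true_positives, false_negatives):
    """Recall: share of actual positive cases that were correctly predicted."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("the rate is undefined without actual positive cases")
    return true_positives / actual_positives

# 100 burglaries occurred; 10 were anticipated and 90 were missed.
rate = true_positive_rate(true_positives=10, false_negatives=90)
```

The measure deliberately ignores false alarms, i.e. predicted burglaries that did not occur, which is why it is only one of several possible performance criteria.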
Finally, the running costs of an algorithm (IV.) are simply the amount of money that an algorithm would cost a household per year. We framed the running costs as something that concerns the participants as taxpayers directly and that is easy to understand. We distinguished between the following three levels of running costs: (1) 6 Euros per household per year, (2) 12 Euros per household per year, (3) 18 Euros per household per year.⁶

Experimental groups and manipulations
In addition to using the conjoint design, we included experimental manipulations in the two studies. In Study 1 (only predictive policing), we introduced a between-subjects factor to test whether information on a human expert's effectiveness in predicting crimes affects respondents' evaluation of the algorithm. The reasoning behind this contrast was that respondents presumably do not have a clear benchmark regarding what constitutes a realistic or adequate algorithm effectiveness. Further, following previous work, it is conceivable that the evaluation of algorithms and especially their performance depends on how these are presented in comparison to a human decision-maker (Hou and Jung 2021; Juravle et al. 2020). Specifically, people might place less importance on effectiveness gains if they know the performance of an algorithm to already surpass that of a human. If this holds true, one would need to consider this issue in the analysis. Thus, in Study 1, participants were randomly assigned to one of two groups. The first group received information on how humans perform at predicting burglaries. The other group of participants did not receive this information on human performance (see SI for details).
In Study 2, we dropped this experimental manipulation as we did not find any corroborating evidence for its effect in Study 1. Instead, we introduced two other between-subjects factors. First, we varied the algorithm's domain of application by adding the domain of skin cancer prediction to the existing domain of predictive policing. Second, because the actual performance of algorithms differs between these two domains, we additionally varied the level of algorithm effectiveness between participants. In the first group, our participants chose between algorithms with high true positive rates; in the second group, the choice sets contained algorithms with low true positive rates. Together, these two between-subjects factors resulted in four groups to which citizens were randomly assigned. As Figure 1

Overview Study 1
We investigated citizens' evaluations of an algorithm used for the prediction of burglaries in their municipality. Before participants began the choice task, they were introduced to the experimental set-up (see SI on the materials). Participants were told that these algorithms were programmed to predict burglaries in certain areas of their municipality. They were also informed that these algorithms were intended to efficiently allocate police resources and assist the police in establishing public safety. In other words, we presented these algorithms as potentially useful and beneficial for citizens but emphasized that they could vary in their design. We explained that the design of an algorithm refers to a specific configuration of stakeholder involvement, transparency, effectiveness, and running costs. The two algorithms that were going to be presented in each set would differ on these four features.
Participants were also familiarized with these four features. For example, in the description of transparency as one of the algorithms' features, we wrote that without transparency, the prediction of burglaries could show certain biases that remained undetected. Such a bias could be, for instance, that the algorithm does not work equally well in different districts. This description pointed to the value of transparency as a way to detect and avoid undesired biases (for details on the description of features, see SI on the materials and instructions). To create a realistic scenario for the choice between the algorithms' configurations, the effectiveness of the algorithm was described by actual true positive rates in predictive policing, which range between 5 and 10% (Mohler et al. 2015). As we were also interested in how much people would prefer an algorithm with a hypothetical performance better than what currently seems to be achieved by these systems, we varied the effectiveness in terms of true positive rates between 5, 10, and 15%.
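With these effectiveness levels, the Study 1 attribute space can be enumerated explicitly. The sketch below lists the four features with three levels each as described in the text; the dictionary keys and the short level labels are shorthand introduced here for illustration and are not the wording shown to participants.

```python
from itertools import product
from math import comb

FEATURES = {
    "stakeholder_involvement": ["none", "opinions asked", "consent required"],
    "transparency": ["developer only", "using organization", "independent experts"],
    "true_positive_rate_pct": [5, 10, 15],
    "running_costs_eur": [6, 12, 18],
}

# Full factorial: every combination of one level per feature.
profiles = [dict(zip(FEATURES, levels)) for levels in product(*FEATURES.values())]

n_profiles = len(profiles)        # 3 ** 4 possible algorithm configurations
n_pairs = comb(n_profiles, 2)     # unordered pairs that could form a choice set
```

The full factorial contains 81 configurations and 3240 possible pairs, which makes clear why the ten choice sets shown to participants had to be selected by an efficient design rather than by presenting all combinations.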
For the experimental contrast between providing information about human performance and no such information, we also drew on empirical evidence (Mohler et al. 2015) and chose a true positive rate of 2.1% for the human expert. We asked our participants a control question to check if they could recall the human expert's performance score right before they started the choice task on algorithms. Only participants with the correct answer to this question were included in the analysis.

Results Study 1: Predictive Policing
The results of Study 1 are displayed in Figure 2 (see SI Tables S7 and S8 for regression results). The figure depicts the coefficients of a multinomial logistic regression for the two experimental conditions (with human expert comparison vs. without human expert comparison). The estimated multinomial choice model specifies the probabilities of participants making an observed choice as a function of differences between the utilities of the presented options (i.e. the algorithm configurations in a given choice set). The coefficients for each of the feature levels can be interpreted as their partworth utilities in relation to the respective reference categories of the features. These partworth utilities together combine into the overall utility of an algorithm relative to an algorithm exhibiting only the reference categories, i.e. no stakeholder involvement, no transparency, an effectiveness (true positive rate, or recall) of 5%, and running costs of 6 Euros per household per year.
Both conditions (human expert comparison vs. no human expert comparison) yield strikingly similar results, indicating that the additional information on the human expert's effectiveness in predicting burglaries as a benchmark did not affect citizens' evaluations of the algorithms' feature importance. They seem to evaluate the algorithms merely based on the features of the algorithm itself, i.e. regardless of the information provided on the human experts' effectiveness. The coefficient for the 'no-choice' option (i.e. when people stated that they would not want any of the presented algorithms to be implemented) is positive and strongly significant. This means that participants do not prefer the adoption of an algorithm under all circumstances. Rather, they choose to have no algorithm at all if the combined partworth utilities of an algorithm's features are lower than the partworth (positive coefficient) of the 'no-choice' option. Remarkably, the partworth of the 'no-choice' option is already surpassed by an algorithm with a true positive rate of at least 10%. Hence, with a true positive rate of 10%, respondents already favour an algorithm at a cost of 6 Euros per household and year with no transparency and no stakeholder involvement over not having an algorithm at all. Thus, an algorithm's effectiveness emerges as a factor that very strongly contributes to the algorithm's overall utility in citizens' eyes.
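The logic of the choice model, including the 'no-choice' option, can be sketched in a few lines of Python. The sketch sums partworths into an overall utility and converts the utilities of two algorithm profiles and the 'no-choice' option into multinomial logit choice probabilities. All numbers are illustrative assumptions chosen only to mirror the pattern described above; they are not the estimates reported in SI Tables S7 and S8. The names `PARTWORTHS`, `NO_CHOICE`, `utility`, and `choice_probabilities`, as well as the 12-Euro cost level, are hypothetical.

```python
import math

# Illustrative partworths (assumptions, not the paper's estimates).
# Reference levels have a partworth of 0 by construction.
PARTWORTHS = {
    "stakeholder": {"none": 0.0, "involved": 0.3},
    "transparency": {"none": 0.0, "internal": 0.675, "external": 0.675},
    "effectiveness": {5: 0.0, 10: 0.75, 15: 1.5},  # true positive rate in %
    "cost": {6: 0.0, 12: -0.3},                     # Euros per household per year
}
NO_CHOICE = 0.7  # assumed utility of rejecting both algorithms


def utility(profile):
    """Overall utility of a profile: the sum of its feature-level partworths."""
    return sum(PARTWORTHS[feature][level] for feature, level in profile.items())


def choice_probabilities(profile_a, profile_b):
    """Multinomial logit probabilities for option A, option B, and no choice."""
    utilities = {
        "A": utility(profile_a),
        "B": utility(profile_b),
        "none": NO_CHOICE,
    }
    denominator = sum(math.exp(u) for u in utilities.values())
    return {name: math.exp(u) / denominator for name, u in utilities.items()}


# An algorithm with a 10% true positive rate and no accountability features
# versus the pure reference profile.
option_a = {"stakeholder": "none", "transparency": "none", "effectiveness": 10, "cost": 6}
option_b = {"stakeholder": "none", "transparency": "none", "effectiveness": 5, "cost": 6}
probabilities = choice_probabilities(option_a, option_b)
print(probabilities)
```

Under these assumed values, option A is the most likely choice because its utility exceeds that of the 'no-choice' option, whereas the reference profile falls below it.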
Both stakeholder involvement in the development of an algorithm and algorithm transparency significantly increase the probability of choosing an algorithm. It does not seem to matter, though, what kind of transparency is realized: whether the institution deploying the algorithm has insight into the algorithm's general functionality or external independent experts can thoroughly scrutinize and audit the algorithm. Similarly, there is no sign that the specific kind of stakeholder involvement in the algorithm's development matters.
The strong impact of the algorithm's effectiveness on its overall utility can be further illustrated by quantifying how much change in percentage points of its true positive rate citizens were willing to trade off for, e.g., transparency assured through testing by independent experts (see SI Table S7, model 4). Based on our analysis, increasing the algorithm's true positive rate by one percentage point corresponded to an increase in its overall utility of about 0.15. This means that, e.g., the algorithm's transparency ensured by internal oversight (the organization that is deploying the algorithm, e.g. the government, has insight into the general operating logic of the system) corresponds to about the same additional utility as increasing the true positive rate of the algorithm by 4.5 percentage points. Calculating these trade-offs therefore illustrates how a higher effectiveness can easily compensate for a lack of accountability features. In this sense, an algorithm's effectiveness seems to dominate citizens' evaluations of an algorithm, while an algorithm's accountability (stakeholder involvement in algorithm development and the transparency of the algorithm) is less important to citizens.
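The trade-off calculation amounts to dividing a feature's partworth by the utility gained per percentage point of the true positive rate. A minimal Python sketch follows; the per-point utility of 0.15 is the approximate value stated above, whereas the partworth of 0.675 is merely the value implied by the 4.5-point equivalence (4.5 × 0.15) and is not read from SI Table S7. The function name `points_equivalent` is hypothetical.

```python
def points_equivalent(feature_partworth, utility_per_point):
    """Express a feature's partworth in percentage points of true positive rate."""
    if utility_per_point <= 0:
        raise ValueError("utility_per_point must be positive")
    return feature_partworth / utility_per_point


UTILITY_PER_POINT = 0.15          # approximate value reported in the text
INTERNAL_TRANSPARENCY = 0.675     # implied partworth (assumption: 4.5 * 0.15)

equivalent = points_equivalent(INTERNAL_TRANSPARENCY, UTILITY_PER_POINT)
print(f"Transparency is worth about {equivalent:.1f} percentage points")
```

The same ratio shows why a feature counts for more points when effectiveness gains are valued less: holding the partworth fixed, a smaller utility per point yields a larger equivalent.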

Overview Study 2
In Study 2, we explored citizens' evaluations of algorithms in predictive policing, as in Study 1, while also introducing algorithms used for predicting skin cancer as a second setting. Note that algorithms for skin cancer prediction, in contrast to algorithms for predictive policing, typically show high effectiveness in terms of the true positive rate, which reaches scores of up to about 90% (Roffman et al. 2018). We therefore varied the true positive rates (i.e. the effectiveness) of algorithms predicting skin cancer at 85, 90, and 95%. To control for the fact that effectiveness is very different from the predictive policing setting, we used the contrast between a low and a high effectiveness condition as a between-subjects factor for both domains of application, i.e. predictive policing and skin cancer prediction. In the low effectiveness condition, citizens were confronted with an algorithm's true positive rate of 5, 10, or 15%, as already used in Study 1. This is a rather realistic effectiveness for predictive policing but strongly underestimates the actual performance of algorithms applied for skin cancer prediction. In the high effectiveness condition of Study 2, participants were presented with a true positive rate of 85, 90, or 95%, reflecting the real performance of algorithms for skin cancer prediction while being highly unrealistic for predictive policing.
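The resulting between-subjects structure of Study 2 (two domains crossed with two effectiveness conditions) can be written down compactly. The following Python sketch enumerates the four cells and randomly assigns participants to them. Simple random assignment with a fixed seed is an assumption made for illustration and may differ from the procedure used by the survey provider; the names `DOMAINS`, `TPR_LEVELS`, `CONDITIONS`, and `assign` are hypothetical.

```python
import itertools
import random

DOMAINS = ("predictive policing", "skin cancer prediction")
# True positive rates (in %) shown in each effectiveness condition.
TPR_LEVELS = {"low": (5, 10, 15), "high": (85, 90, 95)}

# The four between-subjects cells: domain x effectiveness condition.
CONDITIONS = list(itertools.product(DOMAINS, TPR_LEVELS))


def assign(participant_ids, seed=0):
    """Randomly assign each participant to one of the four cells."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    return {pid: rng.choice(CONDITIONS) for pid in participant_ids}


assignment = assign(range(8))
for pid, (domain, condition) in assignment.items():
    print(pid, domain, condition, TPR_LEVELS[condition])
```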
As in Study 1, we pre-tested our experimental material with a student sample (see SI Table S3 for the results) before inviting citizens to participate. Using the same choice design as for predictive policing, but with high true positive rates, the pre-test's coefficients did not reveal a marked deviation from the coefficients observed in Study 1. Therefore, we ran the same choice design as in Study 1. The instructions and the choice tasks (see SI on materials) that we used for the skin cancer prediction condition were analogous to the instructions in the predictive policing case. We also included a question on how much knowledge respondents had about predictive policing or health care and/or the use of algorithms in these domains. This question was used for a robustness check based on excluding respondents with a high level of knowledge (see SI Table S14 for results) and allowed us to analyse the data separately for the entire sample and for a sample without participants with special knowledge of algorithms and/or the domains in which they were applied.

Results Study 2: Comparison of Predictive Policing and Skin Cancer Prediction
The results of Study 2 are shown in Figure 3. This figure differentiates between the two domains (predictive policing and skin cancer prediction) and between the low versus high effectiveness condition (see SI Tables S11 and S12 for regression tables). The analysis yields several insights. First, the coefficient estimates from the multinomial logistic regressions on the participants' choices between the presented algorithms are largely the same as in Study 1. Second, the coefficients are rather similar in the predictive policing and the skin cancer domain. Looking at the low effectiveness condition, which is realistic for predictive policing but unrealistic for the skin cancer domain, the coefficients are highly similar for these two domains. Hence, the domain as such does not notably affect citizens' partworth utilities of the various features of the algorithms. Third, the estimated feature level partworths are even rather similar for the low versus high effectiveness condition. This is surprising as the presented true positive rates in both domains differ strongly.
Overall, the message from Study 2 is the same as in Study 1 and clearly more in line with H2 than with H1: Algorithm features that guarantee stakeholder involvement, and even more so features guaranteeing the transparency of an algorithm, have a discernible positive effect on the estimated overall utility of an algorithm among citizens (or the general public), but it is the algorithm's effectiveness that clearly has the strongest influence on the overall utility of an algorithm according to respondents' choices (see note 7).
Figure 3 also indicates that the relative importance of differences in an algorithm's effectiveness at least to some degree depends on the level of effectiveness that respondents were presented with. For those who saw the highly effective algorithms (true positive rates between 85 and 95%), the coefficients for effectiveness are significantly smaller than in the low effectiveness condition with true positive rates between 5 and 15% (SI Table S12). This pattern holds for both domains, i.e. predictive policing and skin cancer prediction, but is particularly visible in the former. It implies that the added utility per unit increase of the true positive rate gets smaller with higher values of this feature.
Figure 4 illustrates these differences in the feature trade-offs separately for the low and the high effectiveness condition. The bars represent how many percentage points of a change in the true positive rate citizens would trade away for gains in the other features of the algorithm (based on SI Table S11, models 2, 4, 6, and 8). In the low effectiveness condition, the utility of an algorithm's transparency is equal to roughly a 4-point increase in its true positive rate. Yet, in the high effectiveness condition, citizens are willing to trade more than a 6-point increase for having some sort of transparency. This means that gains in an algorithm's effectiveness are comparatively less important to citizens in the high effectiveness condition.

Discussion
The results from our studies on two algorithms used in the public sector strongly suggest that when citizens must make trade-offs between transparency and stakeholder involvement in algorithm design on the one hand and the algorithm's effectiveness on the other, they clearly prioritize the latter. According to the findings, citizens on average trade away algorithm transparency based on testing of the algorithm by independent experts for an increase of about 4 to 6 percentage points in the true positive rate of an algorithm, with the true positive rate representing the algorithm's effectiveness in our studies. Stakeholder involvement and algorithm transparency emerge as comparatively unimportant in citizens' evaluations of algorithms in our analysis. These results are consistent for the two domains of predictive policing and skin cancer prediction.
The central role of algorithm effectiveness does not mean, though, that citizens are not interested in stakeholder involvement and transparency of an algorithm. In line with previous findings (Grimmelikhuijsen 2022; König, Wurster, and Siewert 2022; Liu 2021; Schiff, Schiff, and Pierson 2021; Shin 2021; Shin and Park 2019), we find that respondents do appreciate more transparent algorithms. However, and interestingly, the extent of transparency does not seem to matter, as we do not find that the scrutiny and auditing of algorithms by independent experts is valued more than basic transparency. Given that some recent studies argue that explaining algorithms better, e.g. by providing information on what features were crucial in producing a given algorithmic decision, is more valuable than basic transparency (Grimmelikhuijsen 2022; König, Wurster, and Siewert 2022), this is an interesting finding: Professional auditing by experts should arguably be more rigorous than explaining an algorithm to laypersons. The findings furthermore add to existing research by showing that stakeholder involvement in algorithm design, too, matters for how citizens evaluate algorithms. At the same time, the extent to which stakeholders have influence over design decisions again makes no difference for citizens' perceptions of algorithms.
In sum, we can take away from the analysis that transparency and stakeholder involvement do lead to more positive evaluations of algorithms in the public sector. However, the importance of both these features is clearly surpassed by the effectiveness of the algorithms. Notably, this is similar to what has been found with a vignette experiment for the trade-off between effectiveness (as perceived usefulness) and privacy protection (Willems et al. 2022).
Altogether, the findings have implications for the implementation and regulation of algorithms and are relevant for scholars, policymakers, and educators. The results tell us that when asked to choose between different applications, citizens would clearly favour a higher effectiveness over higher transparency and greater stakeholder involvement in the design of an algorithm. As a theoretical implication of our research, the findings are more in line with the idea of citizens' self-interest leading them to care mainly about outputs (our second hypothesis, H2) rather than with some inherent appreciation and prioritization of process standards, as indicated by procedural justice theory (Tyler 2000) and reflected in H1.
Hence, the results suggest that if policymakers were to offer different algorithms to citizens, they could justify a less transparent algorithm based on public demand. This is particularly important when bearing in mind the observation that when algorithms are put in place, it is often because of concerns that prioritize effectiveness and efficiency (Schiff, Schiff, and Pierson 2021). Our finding that citizens prioritize effectiveness over measures that safeguard transparency and stakeholder influence parallels such an output-orientated view and indicates that citizens will hardly present a strong counterweight to that orientation. This altogether increases the likelihood of a gradual 'technological outsourcing' (Dickinson and Yates 2021) with potentially problematic information asymmetries. The results therefore underscore the importance of having robust rules in place to safeguard the transparency of algorithms and guarantee their accountability. Given the dominance of effectiveness concerns in citizen evaluations, putting such rules in place is arguably an even bigger challenge.
When interpreting the results, several limitations need to be kept in mind, though. First, people's evaluations might be different in other decision contexts, especially in a setting in which they have been or are negatively affected by an algorithm. Existing evidence suggests that citizens are likely more concerned about algorithm transparency in such cases (Schiff, Schiff, and Pierson 2021). Further, procedural justice theory also posits that people demand standards of algorithm accountability and fairness especially when they dislike the outputs of an algorithm's decision. Future research could shed further light on how our findings might change when citizens are presented with the outcomes of an algorithm in a loss instead of a gain frame, or when the context in which an algorithm is to be applied is one in which citizens' personal stakes are higher than in our study. Second, it also appears that with algorithms used at the highest levels of political decision-making, citizens do not care much about performance, but want strong accountability mechanisms for algorithms or request that algorithms are not applied at all (Starke and Lünich 2020). Hence, much seems to depend on the concrete application area, and further research could look more systematically at how such contexts affect citizens' evaluations of algorithms.
Third, we let citizens choose between algorithms differing in combinations of useful features to indirectly estimate citizens' latent preferences for an algorithm's features. This design parallels conjoint analysis as it is commonly performed in consumer research, where this method serves to indirectly obtain valid measures of consumers' preferences for specific product dimensions (e.g. the sweetness and the calories of a chocolate bar). However, confronting citizens with specific algorithm designs (differing in performance levels and other features) is not a natural situation that they will encounter in their daily lives very often. Hence, while the method used here has the advantage of getting at latent preferences, we acknowledge that the choice sets presented in the study are less rooted in daily choice situations than products that people evaluate in marketing studies. It is conceivable that citizens' opinions about algorithms are largely shaped by heuristics, given that algorithms are hardly something to which people will give much thought.
Fourth, while we focused on the relative importance that people place on certain algorithm features, this should not distract from the fact that absolute levels of importance matter too. We deliberately confronted citizens with trade-off decisions in order to determine how much they favour certain features of an algorithm over other features. Yet, certain levels of an algorithm's feature, such as no transparency at all, may not even be an option for people to choose, particularly in the public sector.
Furthermore, there may well exist algorithmic systems that do not entail the trade-offs investigated in the present research. Relatedly, it is also an important question whether citizens generally demand an adequate absolute level of an algorithm's feature, e.g. of effectiveness or transparency. This concerns especially the aspect of effectiveness, as it is not per se clear what constitutes an adequate (or acceptable) performance of an algorithm that justifies applying it in the public sector. Indeed, our results suggest that citizens do not have a clear reference point when evaluating the effectiveness of an algorithm. The experimental group that saw much less effective algorithm designs (in terms of the true positive rate) was not more inclined to reject these algorithms than the group of citizens that was confronted with the very effective algorithms instead. This anchoring effect (Mussweiler 2002) means that policymakers probably have quite some leeway in framing algorithms as effective, as perceptions of effectiveness do not depend on some absolute standard in most populations.
Note that besides the citizen perspective, there is also some need to extend our research to recent findings on how public officials view algorithms (e.g. Criado and de Zarate-Alcarazo 2022; Yigitcanlar et al. 2019) and specifically on whether officials evaluate trade-offs similarly to citizens. We furthermore point out that there are, of course, other features of algorithms that could be included in a preference estimation such as the one performed above. Our findings can only make statements about the relative importance of the algorithm features that we investigated. To avoid cognitively burdening citizens with a technical matter, we chose a parsimonious design while focusing on the theoretically interesting trade-off between an algorithm's effectiveness and the features of transparency and stakeholder involvement. One may also want to study further what role complete automation versus having a human in the loop plays for citizens. We have refrained from including this contrast as complete automation is entirely unrealistic in the applications that we explored.
While research on algorithms like the present one may seem to have a narrow focus, it has rather broad implications for questions of democratic and bureaucratic legitimacy. Using algorithms in the public sector introduces challenges and tensions by promising to improve outputs in order to increase citizens' life-satisfaction, but this might happen at the cost of input and especially throughput legitimacy. At the same time, these tensions can also provide an opportunity to instil an appreciation of an algorithm's transparency and accountability in citizens. Democratic government, after all, is not just about outputs, but also about the decision-making procedures producing them.

Notes
1. These challenges are discussed under explainable artificial intelligence (for an overview, see Guidotti et al. 2018), which not only involves technical questions but also extends into philosophical debates about what constitutes an explanation.
2. Recent work has also examined the perceptions of public officials and found that they approach algorithms with caution (Yigitcanlar et al. 2019), but also see greater transparency as a potential benefit (Criado and de Zarate-Alcarazo 2022).
3. It should be noted that 'accuracy' is sometimes used generically to refer to the performance of an algorithm. However, in the technical sense, it refers to one specific way of quantifying the performance of a classifier, namely as all correctly predicted outcomes divided by all outcomes. A problem with this measure is that any meaningful interpretation is only possible in comparison to the baseline, i.e. the ratio of positive to negative outcomes.
4. The speeding filter reduces the number of respondents by 79 in Study 1 and by 326 in Study 2. Without the speeding filter, the coefficients are slightly smaller, but the results are essentially the same (see SI Tables S9 and S13).
5. The true positive rate (recall) is arguably more meaningful for participants than the positive predictive value (precision). A high precision means that whenever a prediction of a positive outcome (burglary) is made, it is often correct. However, this could mean that only few such positive predictions are made overall, such that only few of all occurring outcomes are detected. Recall, in contrast, represents the rate at which occurring outcomes are detected.
6. This variation of the running costs of an algorithm's application per year was informed by piloting studies on the costs of algorithms on predictive policing in Germany. For the sake of comparability, we kept these values the same in the skin cancer prediction setting (for the complete choice design, including information on the algorithms' features that citizens received prior to the experiment, see SI Table S4).
7. These results hold true for the entire sample. To explore potential subgroup differences based on relevant features, we performed several analyses splitting groups by their high vs. low scores on algorithmic literacy, the self-reported importance of security/of health, technophobia, personality traits (Big Five), gender, and age. We used a median split to generate subgroups of citizens for these analyses. These individual differences between participants were measured in a post-experimental questionnaire. The findings from these additional analyses do not indicate consistent differences in the trade-offs between the algorithms' features described in the main analyses (see SI Tables S15 to S20 for details). Thus, our results are not affected by differences in respondents' attitudes. Nor are they affected by excluding participants with expert knowledge about the domain in question (i.e. predictive policing or skin cancer prediction respectively) and/or about algorithms in these domains (see SI Table S14).
As Figure 1 illustrates, choice set A of Study 1 was replicated in the predictive policing/low effectiveness condition of Study 2.

Figure 1. Overview of the study designs. Dashed arrows indicate that citizens were randomly assigned to the domain and experimental conditions. Note that choice sets A of Study 2 were a replication of the choice set in Study 1. The choice sets in the variants A, B, C, and D varied depending on the randomized domain and experimental condition.

Figure 2. The coefficient estimates from a multinomial regression reflect the partworths of the algorithm's features used for predictive policing. These partworths have to be interpreted in relation to the reference categories of the respective features. The horizontal error bars reflect the 95-percent confidence intervals. Running costs are Euros per household per year.

Figure 3. The coefficient estimates from a multinomial regression reflect the partworths of the algorithm's features used for predictive policing and skin cancer prediction. These partworths have to be interpreted in relation to the reference categories of the respective features. The horizontal error bars reflect the 95-percent confidence intervals. A) The feature level partworths of the algorithm for predictive policing. B) The feature level partworths of the algorithm for skin cancer prediction. In the low effectiveness condition, citizens are shown algorithms with true positive rates of 5, 10, and 15%. In the high effectiveness condition, the true positive rates are 85, 90, and 95%. Running costs are Euros per household per year.

Figure 4
illustrates these differences in the feature trade-offs separately for the low and the high effectiveness condition.The bars represent how many percentage points of a change in the true positive rate citizens would trade away for gains in the other

Figure 4. The partworths of the algorithm's features expressed in changes of the algorithm's effectiveness, indicated by changes of the true positive rate in percentage points. The x-axis displays this change of the algorithm's true positive rate (effectiveness) in percentage points. The bars indicate how many percentage points of the true positive rate citizens are estimated to trade for a given algorithm's feature as compared to the absence of such a feature (e.g. stakeholder consent versus no stakeholder involvement). A) Evaluation of algorithms for predictive policing. B) Evaluation of algorithms for skin cancer prediction. In the low effectiveness condition, citizens were shown algorithms with true positive rates of 5, 10, and 15%. In the high effectiveness condition, the true positive rates were 85, 90, and 95%. Running costs are Euros per household per year.