Introduction

There are high hopes of improving medical decision-making by utilizing machine learning (ML) algorithms, and there are pressing concerns about the responsible application of these systems. On the one hand, ML algorithms may improve the accuracy and efficiency of decision-making. On the other hand, there are concerns about the fairness of decisions based on algorithmic systems.Footnote 1 In this article I consider a related but distinct question: When are decisions produced by deference to an algorithmic output legitimate? People tend to find decisions they disagree with acceptable to obey if the decision procedure meets certain normative conditions (Waldman, 2020, p. 107). Thus there is an important theoretical question about the conditions that an algorithmic procedure must satisfy for its decision output to be such that stakeholders should accept it. Clearly, if a procedure does not generate legitimate decisions, then stakeholders can rightfully refuse to comply with them, and the moral and epistemic virtues of the procedure will be in vain. Moreover, a lack of legitimacy seems like a strong reason why those who have the power to adopt an algorithmic decision procedure should not do so. In this article I argue that in the context of a subset of medical decisions, the properties that confer legitimacy on an algorithmic decision procedure are instrumental properties of the procedure. I do not claim that the properties that are relevant for the legitimacy of the kind of decisions I discuss in this article are relevant in all other algorithmic decision contexts in health and elsewhere. However, while I will not explicitly argue for it, I contend that the instrumental value of an algorithmic decision procedure is always relevant for its legitimacy, even if non-instrumental properties of the procedure matter too in certain contexts.

My focus is on the use of binary recommendations from black-box ML algorithms to make referral decisions in primary care. This focus is motivated by the fact that algorithmic decision support is already being developed and deployed in medicine to make consequential referral decisions about individuals. This trend is supported by increasing evidence that ML systems may prove effective for clinical referrals to tertiary care centres (Alam & Hallak, 2021).Footnote 2 It is my contention that the framework and discussion presented in this article in relation to cases of referral decisions will also provide valuable input to other decision contexts.

Let me also note that I do not claim that there is something new or unique about the use of black-box ML algorithms that raises normative issues around legitimacy that are not actualized in non-algorithmic decision procedures. Rather, my aim is to show how the proceduralist approach to legitimacy developed in the context of political decision-making may help to analyze the legitimacy of a subset of medical decisions deferring to the recommendations of a much-debated type of algorithm. Thus, I aim to clarify how the conceptual framework of Proceduralism can help systematize the discussion of the justified use of algorithms in medical practice. As I show in the final section, such systematization will also be relevant for clarifying the way in which claims about (un)fairness can be understood as a criticism of the legitimacy of an algorithmic decision procedure.

In this article I will focus on cases in which ML algorithms are to be deployed by practitioners to make important decisions about the referral of patients for further examination. In these contexts, a scarce resource is being allocated on the basis of the recommendation of the algorithm, and the question is whether this decision procedure transmits legitimacy to the resulting decisions.

The first medical AI device that was approved for marketing by the FDA is called IDx-DR.Footnote 3 Diabetes patients are at risk of developing diabetic retinopathy (DR), a condition which may result in blindness if not treated in time. The device allows practitioners to acquire an expert-level assessment of diabetes patients in the clinic. The practitioner takes a set of images of the patient’s retinas and uploads them to IDx-DR. In a matter of minutes the device provides the physician with a binary recommendation:

1. “More than mild diabetic retinopathy detected: refer to an eye care professional,” or

2. “Negative for more than mild diabetic retinopathy; rescreen in 12 months.”

A key benefit of the device is that it will enable practitioners, who have no expertise in analyzing retinal images, to make more accurate referral decisions about diabetes patients. In turn this will ensure better use of scarce eye specialist resources, which will be allocated to those who need them most, and thus improve quality of care for patients.

There are several other areas of medicine in which considerable work and funding are going into developing ML algorithms that may assist practitioners in making referral decisions (Alam & Hallak, 2021). In dermatology, ML algorithms have been tested to perform the task of identifying skin lesions such as melanoma with higher accuracy than human experts. Introducing such models into primary care may improve practitioners’ ability to distinguish skin lesions requiring biopsy from benign lesions that do not require allocation of specialist resources (Jones et al., 2022). As with the case of diabetic retinopathy, using ML algorithms to make more accurate referrals of patients with suspicious skin lesions will support a more effective use of expert resources and earlier diagnosis of skin cancer to the benefit of patients. For brevity I will focus my discussion on the case of IDx-DR.

In the use contexts described, referral decisions are made under uncertainty. Whether a patient has DR is unobservable for the decision-maker. Using IDx-DR the decision-maker can make a “best guess” about whether the patient has DR on the basis of observable features present in the images. The algorithm calculates a score estimating how likely it is that the patient has DR, applies a threshold T to the score, and outputs a binary classification—“positive for DR” or “not positive for DR”—corresponding to whether the patient’s score is at or above the threshold. For example, if the algorithm calculates a score on a scale from 0 to 100, the threshold for how high a score a patient must have to be recommended for specialist examination may be set at 75.Footnote 4 A decision about whether to refer the patient is then made on the basis of the algorithmic classification and recommendation.
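For illustration, the thresholding step can be sketched in a few lines of Python. This is a minimal sketch only: the scoring model is proprietary and not shown, and the threshold value and function name are hypothetical choices for the example.

```python
# A minimal sketch of the thresholded classification step described above.
# The risk score would come from the underlying (black-box) model; here it
# is simply passed in as a number. Threshold and wording are illustrative.

THRESHOLD = 75  # hypothetical cut-off on a 0-100 scale, as in the example

def classify(risk_score: float, threshold: float = THRESHOLD) -> str:
    """Map a continuous risk score to one of the two binary recommendations."""
    if risk_score >= threshold:
        return "More than mild diabetic retinopathy detected: refer to an eye care professional"
    return "Negative for more than mild diabetic retinopathy; rescreen in 12 months"

print(classify(82.0))  # score at or above threshold -> referral
print(classify(40.0))  # score below threshold -> rescreen in 12 months
```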

For the sake of argument, I will assume that the decision follows automatically from the classification such that a positive classification results in an offer to see a specialist, and a negative classification results in a decision to offer rescreening in 12 months.Footnote 5 Thus, the sort of cases I have in mind are cases in which a practitioner defers a decision to the binary output of an algorithm. While there are many other uses of medical AI that one might consider, I think it is fair to say that many of the ML algorithms expected to improve healthcare in the future are of this kind.

Following Estlund (2008, p. 8) I will assume that a decision can be legitimate even if it is incorrect. Hence, I will assume a fallibilist account according to which the output of a decision procedure may be both incorrect and legitimate. I will focus on proceduralist accounts of legitimacy. According to Proceduralism, a decision is legitimate if it is produced by an appropriate procedure (Monaghan, 2022, p. 110).Footnote 6 This allows Proceduralism to recognize that incorrect decisions can be legitimate. For example, proceduralists may argue that even an incorrect guilty verdict is legitimate and therefore should be accepted because of features of the criminal procedure that produced it.

At the heart of Proceduralism we find the Transmission Thesis:


Transmission Thesis A procedure P with properties Q will transmit normative property N to its output O. (Monaghan, 2022, p. 114).


As should be clear, the normative property I have in mind in this article is legitimacy. The output I have in mind is the referral recommendations of IDx-DR. The procedure I have in mind is one where a decision-maker defers decisions to the output of an ML algorithm with little or no human input (Binns, 2018, p. 543, see also Barocas et al., 2022, p. 25). I will refer to such decisions as “algorithmic decisions” and to a decision-maker who defers to an ML algorithm as an “algorithmic decision-maker.”

The question that I want to ask is this: Under what conditions, if any, are referral decisions based on the output of IDx-DR legitimate? To decide on this question, I will distinguish between instrumental and non-instrumental Q properties. And I will argue that in the decision context envisaged for IDx-DR and similar systems to be used by practitioners in primary care to make referral decisions, all Q properties are instrumental and that they are accuracy and fairness.

The question about the legitimacy of algorithmic decisions is an overlooked discussion in the debate about algorithmic decision-making. There has been a lot of focus on questions about the fairness and explainability of algorithms, and while there is a growing literature on legitimacy and algorithmic decision-making in various contexts, it is comparatively small and less focused on a sharply defined problem. Danaher (2016) is concerned with what he calls “the threat of algocracy,” which is the threat that opaque algorithmic decision procedures pose to the legitimacy of decision-making. Danaher’s focus is mainly on what I refer to as a non-instrumental feature of algorithmic procedures and how to deal with it assuming that it threatens legitimacy. Thus, I consider this article to complement Danaher’s interesting considerations. Other recent discussions of algorithmic legitimacy include Brownsword (2022), Chomanski (2022), Grimmelikhuijsen and Meijer (2022), Lazar (unpublished), Waldman (2020), and Wang et al. (2023). While these authors all discuss algorithmic decision-making in terms of legitimacy, they do not offer an explicit and systematic analysis of what to identify as the Q properties of an algorithmic decision procedure from a proceduralist perspective. To do so is the task of this article.

Let me finally note that, as posed, the question about which Q properties a procedure must have to be legitimate expresses a normative approach to legitimacy. This is a different question from the descriptive question about which conditions a procedure should meet to be perceived to be legitimate by stakeholders. My focus on the normative perspective is not to be taken as a dismissal of the relevance of the descriptive perspective. It is just not within the scope of this article to consider the descriptive perspective.

The plan is this. In section “The accuracy argument” I present a common instrumental justification for implementing algorithmic decision-making in primary care, namely that it will increase the accuracy and efficiency of referral decisions. In section “Algorithmic instrumental proceduralism” I provide a more detailed discussion of instrumental accounts of algorithmic legitimacy. In section “Non-instrumental Q properties?” I turn to consider the view that algorithmic legitimacy may also require a decision procedure to have certain non-instrumental properties and consider three widely proposed candidates. In section “Concluding remarks: algorithmic fairness and legitimacy” I conclude and consider the relationship between algorithmic legitimacy and fairness.

The accuracy argument

A key reason for using algorithmic decision-making is that it will improve the quality of decisions in the sense that the proportion of correct decisions will be higher than for alternative procedures. In many contexts, such as in the case of using IDx-DR, referral decisions must be made under uncertainty. Increasing the accuracy of these decisions thus seems like a very good reason to deploy an algorithmic procedure. For example, if there is a limited number of eye specialist examinations available for diabetes patients, and we think that those with DR should receive them, then the rational thing to do would seem to be to use the procedure that will maximize the frequency of correct decisions.

This argument seems compelling both from the perspective of expert human decision-makers and from the perspective of decision subjects. Thus, even expert decision-makers will have “a prima facie epistemic and professional obligation” to defer their decisions to the algorithmic system if it is more accurate and reliable than they are (Bjerring & Busch, 2020, p. 349). And from the point of view of decision subjects, it would also be rational to prefer the epistemically superior algorithmic procedure. After all, the more accurate the procedure, the more confident decision subjects can be that they get a correct decision. In short, from an epistemic point of view, the point of view of getting it right,Footnote 7 it would be rational for stakeholders to defer decisions to algorithmic recommendations because the algorithm knows best.Footnote 8

In the context of legitimacy, the Accuracy Argument expresses a form of Instrumental Proceduralism. The legitimacy of the procedure is exclusively a matter of how well it serves as a tool for achieving a procedure-independent value. Once we have agreed on what that value is, there are no further features of the procedure that are relevant for determining its legitimacy.

Some argue that there is more to the justification of an algorithmic procedure than its accuracy. Thus Grgić-Hlača et al. (2016) claim that there are properties of the procedure that may be relevant for whether it is legitimate independently of whether, and if so how, they affect the quality of the outcome of the procedure. I will refer to such properties as non-instrumental Q properties.

Grgić-Hlača et al. (2016) find that it is inappropriate for a decision procedure to use protected features, such as race features, as an input even if it improves the accuracy of the system. Some argue that for a procedure to be legitimate it must be possible for humans to understand how it arrived at a particular decision, even if that requirement excludes available and more accurate procedures.Footnote 9 Others have argued that a procedure should treat people who are alike in relevant respects in the same way by applying a single classification threshold.Footnote 10 For example, if male diabetes patients with a risk score above T are classified as high risk and recommended for further examination, then female diabetes patients with a risk score above T should be classified as high risk and recommended for further examination too.Footnote 11 While this requirement is compatible with maximal accuracy, the property of having a single threshold is non-instrumental in that it is required regardless of whether it increases or decreases the instrumental value of the algorithm. And, as will become clear, the single threshold property is not compatible with other instrumental values that one might find relevant for legitimacy.Footnote 12

I have suggested that Q properties will be either instrumental or non-instrumental, and that accuracy is an instrumental Q property with respect to the legitimacy of IDx-DR. As I will show in the next section, I think that a strong case can be made for the claim that fairness is an additional instrumental Q property relevant for the legitimacy of IDx-DR.

Algorithmic instrumental proceduralism

In the debate about algorithmic decision-making, prominent scholars have defended a view which I will call Accuratism. According to Accuratism, the only Q property of a procedure that is relevant for its transmission of legitimacy is its accuracy. One of the most detailed defenses of this view is presented in Corbett-Davies and Goel (2018), where the authors argue for deploying an “unconstrained” algorithm to maximize a purely instrumental value of the algorithm, namely its “social utility.” Corbett-Davies and Goel (2018) present their argument within a utility-maximization framework. However, given their assumptions, Corbett-Davies and Goel’s (2018) maximization approach can in effect be understood in terms of maximization of accuracy (Barocas et al., 2022, p. 19, see also Mitchell et al., 2021, p. 149). Hence, for the sake of simplifying the exposition I will present their argument in terms of accuracy rather than utility maximization. This should not make a substantive difference to the form of their argument, which I think illustrates the perspective of Instrumental Proceduralism well. In this section I first present Corbett-Davies and Goel’s account. Then I argue that from the point of view of Instrumental Proceduralism additional instrumental values seem to be relevant to consider as well.

Corbett-Davies and Goel (2018) find that the most accurate algorithm will apply a single, uniform threshold to all individuals, and may require using protected features as input (Corbett-Davies & Goel, 2018, p. 3, see also Corbett-Davies et al., 2017, p. 797). Well aware that applying a single uniform threshold will result in differential impact on different groups, Corbett-Davies and Goel claim that such differential impact would be justified by the fact that the system will achieve superior accuracy (2018, p. 7). Thus, in effect they suggest that stakeholders should accept as legitimate the decisions made by the decision procedure because it is the most accurate, even when the procedure has a very different impact on different groups of people. The properties of the system should be tuned to produce maximal accuracy. I will call a form of Instrumentalism that only accepts a single value as instrumentally relevant Monistic Instrumental Proceduralism. According to this view, decisions made by IDx-DR are legitimate if and only if deferring to the output of IDx-DR is the most accurate available procedure for making referral decisions.

Corbett-Davies and Goel (2018) represent one version of Monistic Instrumental Proceduralism about algorithmic legitimacy. However, instrumentalists may disagree about what to consider to be the relevant instrumental value. For example, one might argue that the legitimacy of an algorithmic decision procedure is a matter of its ability to achieve a certain (e.g., egalitarian) pattern of distribution of correct and incorrect decisions across groups.

In a recent article Monaghan (2022) argues that instrumental proceduralists should be pluralistic and recognize that the distribution of incorrect decisions matters for legitimacy in addition to accuracy. Thus, Monaghan claims that failures must be “relatively uniformly distributed in the population” (Monaghan, 2022, p. 111), and that even if a procedure is highly accurate “(…) certain distributions of failure can render it inappropriate and block the transmission of legitimacy” (Monaghan, 2022, p. 123). Thus, a very accurate procedure may lack legitimacy if a minority group suffers the “vast majority of the procedural failures” (Monaghan, 2022, p. 122, see also Barocas et al., 2022, pp. 19–20). In that case minority members are within their rights to ask, “Why should I obey the output of this procedure? It clearly does not work for us” (Monaghan, 2022, p. 122).Footnote 13 Importantly, the complaint is not that the procedure’s accuracy for the minority group is too low. Monaghan suggests that a procedure that is 90% accurate can be reasonably rejected by members of a group if that group suffers all the failures. This is clear from the following passage:

We cannot expect or demand that people obey the outcome of a procedure on the grounds that it tends to be reliable if they shoulder most of the burden of the unreliability. (Monaghan, 2022, p. 122).

Consider the use of IDx-DR on a minority and a majority group. Suppose the minority group makes up 20% of the population as well as 20% of the people subjected to the decision procedure. To make the burden of failures vivid, assume also that all failures are false negatives: some patients with DR are not recommended for specialist examination. Now consider two cases in which all failures of a 90% accurate procedure fall on one of the two groups.

Case A For every 100 decisions, there are 80 out of 80 (100%) correct decisions for the majority group and 10 out of 20 (50%) correct decisions for the minority group.

Case B For every 100 decisions, there are 20 out of 20 (100%) correct decisions for the minority group and 70 out of 80 (87.5%) correct decisions for the majority group.

In both Case A and Case B one group shoulders the whole burden of failures. According to Monaghan’s argument neither of these failure distributions should be acceptable to the group shouldering the whole burden of failures. Hence the procedure will not be legitimate despite its superior accuracy.
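To make the arithmetic behind the two cases explicit, here is a minimal sketch in Python; the numbers are simply those of Cases A and B above.

```python
# Illustrative check of Cases A and B: identical overall accuracy,
# very different distributions of failure across the two groups.

cases = {
    "Case A": {"majority": (80, 80), "minority": (10, 20)},  # (correct, total)
    "Case B": {"majority": (70, 80), "minority": (20, 20)},
}

for name, groups in cases.items():
    total_correct = sum(correct for correct, _ in groups.values())
    total = sum(n for _, n in groups.values())
    print(f"{name}: overall accuracy {total_correct / total:.0%}")
    for group, (correct, n) in groups.items():
        print(f"  {group}: {correct}/{n} correct ({correct / n:.1%})")
```

Both cases print an overall accuracy of 90%, while the group-wise accuracies diverge sharply, which is exactly the feature Monaghan’s argument targets.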

Based on this sort of consideration, Monaghan finds that, in addition to overall accuracy, instrumentalists must also recognize the distribution of failures across groups as a condition on the legitimacy of the procedure. I will call the view that there is more than one instrumental value that must be considered for a procedure to be legitimate Pluralistic Instrumental Proceduralism.

There are a couple of questions that arise in relation to Pluralistic Instrumental Proceduralism. One question is how to understand the instrumental value identified as “relatively uniformly distributed across groups.” Consider the following case:

Case C For every 100 decisions, there are 75 out of 80 (93.75%) correct decisions for the majority group and 15 out of 20 (75%) correct decisions for the minority group.

There is a sense in which Case C achieves equal distribution of failures across groups in that half of the failures burden the minority group and half of the failures burden the majority group. However, there is another sense in which the minority group is still much more burdened by the procedure than the majority group. Members of the minority group will suffer failure in 25% of cases, whereas majority group members will suffer failure in only 6.25% of cases. If the burden of failure is the same for all individuals, the expected burden for a minority member will be four times as high as the expected burden for a majority member. So, minority members still seem to have good reason to claim that the procedure in Case C “does not work for them.” Instead, they might argue that burdens should be proportional to group size. Thus, the minority group should suffer 20% of the failures and the majority group should suffer 80% of the failures.
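Again the arithmetic can be spelled out in a few lines, using the same illustrative numbers as in Case C:

```python
# Case C: failures are split equally in absolute numbers (5 and 5),
# but the per-person failure rates differ sharply between the groups.

majority_correct, majority_total = 75, 80
minority_correct, minority_total = 15, 20

majority_failure_rate = (majority_total - majority_correct) / majority_total  # 5/80
minority_failure_rate = (minority_total - minority_correct) / minority_total  # 5/20

print(f"Majority failure rate: {majority_failure_rate:.2%}")  # 6.25%
print(f"Minority failure rate: {minority_failure_rate:.2%}")  # 25.00%
print(f"Expected burden ratio: {minority_failure_rate / majority_failure_rate:.0f}x")  # 4x
```

A proportional distribution, by contrast, would put 2 of every 10 failures on the minority group and 8 on the majority group, equalizing the per-person failure rates at 10%.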

Instrumentalists who find error distribution to be relevant for an algorithmic procedure’s legitimacy may disagree about what constitutes a fair distribution. As documented by the rapidly growing literature on algorithmic fairness there is a whole family of statistical fairness criteria which consider some sort of equal performance across groups to be necessary for the fairness of an algorithmic procedure (see e.g., Mitchell et al., 2021; Corbett-Davies & Goel, 2018; Verma & Rubin, 2018 for overviews of different statistical parity criteria of fairness). While there is widespread disagreement about what fairness criteria to accept (see Holm, 2023a for discussion), the focus on error distribution strongly indicates that, in the context of algorithmic decision-making, instrumental considerations other than accuracy are widely thought to be relevant for the legitimacy of algorithmic decision procedures.Footnote 14

Another question arising in relation to Pluralistic Instrumentalism concerns how to trade off competing instrumental values. It is widely recognized that the most accurate algorithmic procedure may result in unequal error rates across groups. Proponents of Pluralist Instrumentalism will be committed to answering hard questions about how much accuracy it is acceptable to sacrifice in order to achieve a more equal distribution of incorrect decisions across groups. Even so, when it comes to the legitimacy of using IDx-DR recommendations for referral decisions, it certainly seems relevant to consider the distribution of errors across salient groups in addition to accuracy. How exactly to weigh a gain in equality against a loss in accuracy is a further question (see Holm, 2023b for discussion).

Before I end this section, let me briefly return to Monistic Instrumental Proceduralism. One might think that accuracy considerations are essential to an instrumental account of algorithmic legitimacy. This seems to be assumed by Monaghan. Equal error distribution does not replace accuracy as the instrumental value relevant for legitimacy, but is put forward as an additional value to consider. Hence, we get a pluralistic account. However, an Instrumental Proceduralist may argue that achieving a certain distributional pattern is the only instrumental value relevant for legitimacy. This kind of view seems to be defended by Wachter et al. (2021), who argue in favor of what they call “bias transforming” fairness measures when assessing algorithmic decision procedures. Such measures will, by definition, not be fulfilled by a perfectly accurate algorithm (Wachter et al., 2021, p. 761). Interpreted in light of the notion of legitimacy, their view suggests that a legitimate algorithm will be one which addresses past wrongful discrimination inherent in the historical data by enforcing a certain distribution to correct for these historical injustices. Thus Wachter et al. write:

Historical trends in decision-making have led to diminished and unequal access to opportunities and outcomes among certain groups. It is in this sense that the status quo is not neutral. Maintaining it by treating it as a neutral baseline for comparison cannot therefore be considered a politically, ethically, or legally neutral act. (Wachter et al., 2021, p. 768)

A Monistic Instrumental Proceduralist might thus argue that a legitimate decision procedure for referrals to specialist treatment need not consider accuracy a relevant instrumental value at all. What is important is how the procedure contributes to achieving “substantive equality” in the resource distribution across salient groups.Footnote 15

In this section I have distinguished and considered Monistic and Pluralistic Instrumental Proceduralism about algorithmic legitimacy. Regardless of which form one adopts, there is a basic claim that one subscribes to: properties of the procedure that do not contribute to maximizing instrumental value do not contribute to its ability to transmit legitimacy.

Non-instrumental Q properties?

There is much debate about whether fair algorithms must have certain non-instrumental Q properties. Some argue that it is impermissible to use protected features such as race features as input. Some argue for requiring that only a single threshold is used for classification. And there is a huge debate about the importance of ensuring that algorithmic decisions can be explained.Footnote 16 What input features are used, whether a single threshold is applied, and whether an algorithmic output is explainable are all non-instrumental Q properties of an algorithmic decision procedure. Moreover, these non-instrumental properties are often considered relevant for the legitimacy of algorithmic procedures. What do instrumentalists like Corbett-Davies and Goel (2018) have to say about these non-instrumental Q properties?

Regarding input features, Corbett-Davies and Goel (2018) claim that there should be no limitations on which input features it is permissible to use. Input features which have predictive value will be useful for increasing the accuracy of the algorithm, and leaving them out will typically “lead to unjustified disparate impacts” in the sense that it precludes using an alternative algorithm which may perform better for some group and achieve the same or higher overall accuracy.Footnote 17 Mayson (2019, p. 2218) presents a powerful practical example of this problem:

When I worked in New Orleans as a public defender, the significance of arrest there varied by race. If a black man had three arrests in his past, it suggested only that he had been living in New Orleans. Black men were arrested all the time for trivial things. If a white man, however, had three past arrests, it suggested that he was really bad news! White men were hardly ever arrested; three past arrests indicated a highly unusual tendency to attract law enforcement attention. A race-blind algorithm would not observe this difference. It would treat the two men as posing an identical risk.

Proponents of Pluralist Instrumentalism, who aim to achieve a distributive pattern such as the same error rates across groups, will also find that explicit consideration of protected features may be required for legitimacy. In addition to its contribution to higher accuracy, consideration of protected features will be necessary for achieving equal error rates across salient groups, since one will have to apply different thresholds to members of different groups.

Another issue is whether it is ever morally permissible to deploy multiple thresholds when using an algorithm to make referral decisions. Doing so means that people who are equally at risk of disease will not be treated alike. For example, to achieve equality in error rates across dark-skinned and light-skinned patients, it will typically be required that individuals with the same risk of suffering a harmful skin lesion are not treated the same. Thus, one might argue that a decision procedure must apply a single threshold to be legitimate because members of the group facing the higher threshold can, it seems, reasonably complain that the procedure is morally objectionable.

Proponents of Accuratism can point out that if risk scores mean the same for members of different groups,Footnote 18 optimizing for accuracy dictates that the procedure should apply a single classification threshold to all individuals. This is because from any algorithm deploying more than one threshold, we can switch to a single-threshold algorithm and make a gain in accuracy (see Corbett-Davies & Goel, 2018, p. 12), as the sketch below illustrates. However, when Mitchell et al. (2021) conclude that Corbett-Davies and Goel’s view is that a “decision is considered to be fair if individuals with the same score (…) are treated equally, regardless of group membership” (Mitchell et al., 2021, p. 149), this is a misunderstanding of their view. Being instrumentalists, Corbett-Davies and Goel (2018) do not, and should not, claim that it is of moral significance in itself to use a single threshold. Rather, what they point out is that Accuratism entails it. This is important because it means that the use of a single threshold is no more fundamental to the instrumentalism of Corbett-Davies and Goel (2018) than it is to the instrumentalism of someone who considers legitimacy to require equal error rates across groups. Hence it does not make Pluralistic Instrumentalism less plausible than Accuratism (Barocas et al., 2022, pp. 100–103). Accuratists, like Corbett-Davies and Goel, would arguably require differential thresholds if it were necessary to achieve maximal accuracy.
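The entailment can be shown with a toy calculation. The sketch below assumes perfectly calibrated scores, so that a patient with score p has DR with probability p; the group labels, score values, and threshold choices are all invented for illustration.

```python
# Toy illustration of the Accuratist entailment: with calibrated scores,
# a single threshold of 0.5 maximizes expected accuracy, while group-specific
# thresholds (which a pluralist might use to shift errors) cost accuracy.

def expected_accuracy(scores, threshold):
    """Expected fraction of correct decisions when classifying positive at >= threshold."""
    # A patient with calibrated score p is correctly classified with
    # probability p if called positive, and 1 - p if called negative.
    return sum(p if p >= threshold else 1 - p for p in scores) / len(scores)

group_a = [0.2, 0.4, 0.55, 0.7, 0.9]  # hypothetical calibrated risk scores
group_b = [0.1, 0.3, 0.45, 0.6, 0.8]
everyone = group_a + group_b

single = expected_accuracy(everyone, 0.5)
grouped = (expected_accuracy(group_a, 0.6) * len(group_a)
           + expected_accuracy(group_b, 0.4) * len(group_b)) / len(everyone)

print(f"Single threshold 0.5:       expected accuracy {single:.3f}")   # 0.710
print(f"Group thresholds 0.6 / 0.4: expected accuracy {grouped:.3f}")  # 0.690
```

With calibrated scores, classifying a patient as positive exactly when p >= 0.5 maximizes the expected number of correct decisions for each patient, so any pair of group-specific thresholds can only match or lower expected accuracy, even when it yields a more equal error distribution.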

Proponents of Pluralist Instrumental Proceduralism will find it legitimate for IDx-DR to apply different thresholds to different groups to achieve a more equal distribution of incorrect recommendations. However, any such adjustments must be balanced against the ensuing loss in accuracy. What to identify as the right balance is a thorny practical question. Suffice it to say that the pluralists will be tasked with clarifying how this balancing is to be done with respect to the application in question. The fact that such explicit discussion of the weighing of fairness and accuracy will not arise for Accuratism makes for an important difference between the two views. However, and importantly, the fact that the pluralist instrumentalist is presented with this practical challenge does not amount to a theoretical weakness.

In the context of using an algorithmic system such as IDx-DR to make referral decisions, can the system be said to issue legitimate decisions if patients with the same risk score will not receive the same decision? Consider Paula and Paul. They belong to different salient groups and receive the same score, but Paula gets referred to specialist examination and Paul does not. It seems as if Paul can reasonably complain that he is being treated worse than Paula, despite being like her in all relevant respects, namely in his risk of requiring specialist examination.

While I recognize that this is a very difficult question, I think the best response available to the pluralist instrumentalist is to point to the nature of the decision context. When a healthcare system requires referral decisions, the practitioner functions as a gatekeeper whose task it is to contribute to the overall aim of distributing the resource to those who need it (Greenfield et al., 2016).Footnote 19 Thus, if the practice of requiring referral decisions is accepted in the first place, a population-level perspective is at the heart of the decision context. And, as already argued, when distributing a scarce resource across a population, concerns about group fairness also become relevant for the legitimacy of the procedure. From the population-level perspective, the point of risk scoring individuals is not to treat them according to their risk score, but to treat them in order to achieve a population-level goal such as the right balance between overall accuracy and fairness. From this perspective, when Paul complains that he did not get referred and Paula did, the response to Paul would be that “the decision is not really about you.”

This consideration also provides an indication of why I want to limit my defense of Pluralist Instrumentalism to referral decisions such as those made deferring to IDx-DR. There are many medical decisions which are not to be made from a population-level perspective.Footnote 20

The third non-instrumental property I will mention in relation to legitimacy is explainability. London writes:

Despite this accuracy, deep learning systems can be black boxes. Although their designers understand the architecture of these systems and the process by which they generate the models they use for classification, the models themselves can be inscrutable to humans. (London, 2019, p. 17)

The inscrutability of algorithmic decisions is sometimes considered to be a problem for the legitimacy of the algorithmic procedure when it is used to make important decisions about people:

If an individual is given a longer prison sentence because of a decision of an algorithm, it is at least plausible to think that there is a moral obligation to explain to that individual why the algorithm produced the result that it did. This prima facie moral obligation is at the heart of the “right to explanation” contained in the European Union’s General Data Protection Regulation (GDPR). (Biddle, 2021, p. 7).

Why not require that referral decisions be explainable even if it detracts from the instrumental value of the procedure?Footnote 21 As before, I find that the best response to this question is to point to the decision context. The decision problem is essentially about the distribution of a scarce resource across a population. This is a different decision context from those in which a practitioner is engaging in decision-making about, e.g., how to treat an individual patient for a diagnosed condition. For a wide range of clinical decisions, simply deferring to the output of a black box algorithm will be in conflict with the principles of shared decision-making (Elwyn et al., 2010, see also Holm, 2023c). Such decision contexts are essentially oriented towards the needs and values of the individual patient. This, I contend, is not so for pre-screening and referral decisions.

Concluding remarks: algorithmic fairness and legitimacy

I have presented a proceduralist framework for discussing algorithmic legitimacy and I have argued that in the case of IDx-DR legitimacy will require the procedure to have certain instrumental properties. I will end by considering how competing and much debated criteria of algorithmic fairness relate to my discussion of algorithmic legitimacy.

Arguing that an algorithmic decision procedure is unfair is a way to criticize the procedure’s legitimacy. Within the framework of algorithmic legitimacy presented, it becomes clear that different fairness considerations are directed at different accounts of what to consider as the Q properties.

To see this, consider the view that the true positive rate should be the same across groups—what Hardt et al. (2016) call Equality of Opportunity. According to this view, all patients who in fact have DR should have the same chance of a positive classification and recommendation to see a specialist, regardless of group membership. If this is not the case, then the algorithm is not fair, which plausibly entails that its decisions are not legitimate. However, arguing against the legitimacy of an algorithmic decision procedure by pointing out that it does not satisfy Equality of Opportunity presupposes that procedural legitimacy is a matter of the procedure's usefulness as an instrument for achieving a procedure-independent value. Moreover, if Equality of Opportunity is also seen as sufficient for fairness, which Hardt et al. seem to suggest, then the criticism assumes that one will take a procedure’s legitimacy to be purely a matter of its value as an instrument for achieving a certain distribution of correct decisions. One would assume Monistic Instrumental Proceduralism.
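For concreteness, checking Equality of Opportunity on a set of labelled decisions amounts to computing the true positive rate per group, as in this minimal sketch; the records and group labels are invented for illustration.

```python
# Minimal sketch: checking Equality of Opportunity (equal true positive
# rates across groups) on hypothetical labelled referral decisions.

from collections import defaultdict

# (group, has_DR, referred) -- invented records for illustration only
records = [
    ("A", True, True), ("A", True, True), ("A", True, False), ("A", False, False),
    ("B", True, True), ("B", True, False), ("B", True, False), ("B", False, False),
]

positives = defaultdict(int)       # patients who in fact have DR, per group
true_positives = defaultdict(int)  # of those, how many were referred

for group, has_dr, referred in records:
    if has_dr:
        positives[group] += 1
        if referred:
            true_positives[group] += 1

for group in sorted(positives):
    tpr = true_positives[group] / positives[group]
    print(f"Group {group}: true positive rate {tpr:.0%}")
# Unequal rates (here 67% vs 33%) violate Equality of Opportunity.
```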

On the other hand, if one argues that an algorithmic procedure is unfair because it deploys multiple thresholds, this is only an argument against the legitimacy of its decisions on the assumption that non-instrumental properties of a procedure are also relevant for its legitimacy. This brings out an important theoretical insight. From the point of view of legitimacy, proponents of single-threshold algorithms or constraints on input features are not in direct disagreement with proponents of Equality of Opportunity because they assume different accounts of legitimacy. Rather, from the point of view of legitimacy, the direct competitors to a fairness principle such as Equality of Opportunity are those who propose alternative statistical fairness criteria or alternative values such as accuracy or utility as relevant for judging the legitimacy of a decision procedure.