Achieving Fair Inference Using Error-Prone Outcomes

Recently, an increasing amount of research has focused on methods to assess and account for fairness criteria when predicting ground truth targets in supervised learning. However, recent literature has shown that prediction unfairness can potentially arise due to measurement error when target labels are error prone. In this study we demonstrate that existing methods to assess and calibrate fairness criteria do not extend to the true target variable of interest, when an error-prone proxy target is used. As a solution to this problem, we suggest a framework that combines two existing fields of research: fair ML methods, such as those found in the counterfactual fairness literature and measurement models found in the statistical literature. Firstly, we discuss these approaches and how they can be combined to form our framework. We also show that, in a healthcare decision problem, a latent variable model to account for measurement error removes the unfairness detected previously.


I. Introduction
Supervised learning is used to guide human decisions across a wide range of fields. In sensitive areas such as healthcare or criminal justice, a key concern is that these decisions are equitable and fair. To this end, an active area of research investigates how fairness criteria can be incorporated into supervised learning [1]-[6]. This literature has focused on supervised learning for a single objective, assumed to be the target variable of interest.
However, focusing on fair inference for a single objective is not sufficient in many real-world applications. The motivating example for this paper is presented in [7]: a commercial health prediction algorithm, widely used by health insurance companies and affecting millions of patients, exhibits significant racial bias. At a given risk score, black patients are considerably sicker than white patients, as evidenced by signs of uncontrolled illnesses. The bias arises because the algorithm predicts healthcare costs rather than illness, but unequal access to care means that less money is spent caring for black patients than for white patients. Thus, substantial racial biases arise, despite healthcare cost appearing to be an effective proxy for health by some measures of predictive accuracy, and despite these predictions complying with conventional standards of fair inference on outcomes [8]. The situation presented in [7] is but one example of a more general practice of using a proxy to measure outcomes that cannot be directly measured; another example would be predicting true criminal recidivism using only observed recidivism, which is an error-prone proxy [9]. In this paper, we suggest using an approach from the field of social science: making use of multiple observable proxies to build a measurement model representing the unobserved (latent) variable of interest.
This issue cannot be ignored because fairness is generally conceptualised on a level more abstract than the proxy label [10]; for example, it is reasonable to require that fairness in a healthcare need prediction system should extend to a person's true health status. However, it is challenging to measure a patient's true health status, as such measures are typically impossible to observe directly. In social science, a common approach is to make use of multiple observable indicators to build a measurement model representing the unobserved (latent) variable of interest. We propose to integrate such an approach when developing prediction models.
This paper addresses the problem of prediction unfairness arising from measurement error. By considering the supervised learning problem at the level of a latent variable of interest, we reformulate the problem as one of adequate measurement modelling. In effect, instead of requiring perfect measurement to achieve fairness, we propose that researchers developing a prediction model to be used for decision-making collect several independent, possibly error-prone, measures of the variable of interest (e.g., health). These measures act like error-prone labels made by independent annotators, each containing some information about the true health status (similar to, e.g., [11], [12]). We then suggest combining measurement models from the statistical literature with techniques from the literature on fair ML to assess and ameliorate the problem of unfair predictions in the face of measurement error.
Our contributions are as follows:
• We illustrate that existing methods to examine unfairness in error-prone outcomes are insufficient;
• We suggest a framework, based on the existing measurement modelling literature, to investigate and ameliorate such issues;
• We perform an exemplary analysis to demonstrate the suggested approach. In an existing healthcare application, this demonstrates that replacing one proxy with another does not lead to parity, while our approach does.
In Section II, we provide a summary of basic concepts in fairness. In Section III, prior approaches with respect to fair inference are discussed. In Section IV, the failure of these approaches when making use of proxies is discussed, and the proposed framework is introduced based on existing measurement models. In Section V, the proposed framework is then applied to the exemplary data set provided by [7].

II. Problem Definition
We consider probabilistic classification and regression problems with a set of features X and true outcome Z. Among the features, there is a sensitive feature S ∈ X (e.g., race, gender), with respect to which discriminatory predictions are to be avoided. Furthermore, although the prediction problem is with respect to the true outcome Z (e.g., "health" or "crime"), this outcome is not directly observed; instead, we have observed a set of error-prone proxy variables Y. For example, in practice a proxy for "health", Y ∈ Y, might be the costs of healthcare or the number of chronic conditions experienced by the patient, whereas, instead of "crime", the number of arrests might be measured. Following [8], we represent the goal of the regression or classification problem as a query on the (generative) joint distribution p(Z, X), potentially after conditioning on a set of "fixed" covariates C, i.e., the (discriminative) conditional joint p(Z, X\C | C). Typically, this query will be the point prediction Ẑ := E(Z | X).
Following standard social-scientific measurement theory [13], the fact that Y is a measurement proxy for Z is reflected by a causal model, in the sense of [14], [15], in which Z → Y, i.e., the true outcome is a common cause of all available proxy variables. Because Z is an unobserved latent variable, our causal model will be identifiable only through additional assumptions of conditional independence; we discuss these assumptions later. The key point to note here is that, generally, $\hat{Y} \neq \hat{Z}$, i.e., predictions using error-prone proxies as labels, Ŷ, will, of course, differ from the Ẑ that would have been obtained had the true labels been available.

III. Related Work
A large and growing literature on fairness of predictions for the error-free outcome Z exists, with divergent and sometimes mutually exclusive definitions of the notion of algorithmic fairness. An excellent overview of this literature can be found in [6], which identified 20 separate definitions. Broadly, a distinction can be made between statistical metrics, distance-based measures, and causal reasoning [6].
Statistical metrics define fairness as the presence or absence of a (conditional) independence in the joint distribution p(Z, Ẑ, S). For example, take a classification problem in which the decision is taken as d := I(Ẑ > τ), where I is the indicator function and τ is some threshold on the predicted score. Statistical parity ("group fairness") is then defined as $p(d = 1 \mid S = s) = p(d = 1 \mid S = s')$ for all s ≠ s', i.e., the decision should not depend on the sensitive attribute, whereas predictive parity is defined as $p(Z = 1 \mid d = 1, S = s) = p(Z = 1 \mid d = 1, S = s')$ for all s ≠ s', i.e., the positive predictive value should not depend on the sensitive attribute. Further definitions include conditional statistical parity [2], overall accuracy equality [1], and well calibration [4].
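As an illustration of these two statistical metrics, the following minimal sketch computes, per group, the rate of positive decisions (compared across groups for statistical parity) and the positive predictive value (compared across groups for predictive parity). The function name `parity_metrics` and the toy data are hypothetical and serve only to make the definitions concrete; they are not taken from any of the cited works.

```python
from collections import defaultdict

def parity_metrics(z, z_hat, s, tau):
    """Per-group P(d = 1) and P(z = 1 | d = 1), with d = I(z_hat > tau)."""
    counts = defaultdict(lambda: {"n": 0, "d": 0, "tp": 0})
    for zi, zh, si in zip(z, z_hat, s):
        d = int(zh > tau)          # decision from thresholding the predicted score
        c = counts[si]
        c["n"] += 1                # group size
        c["d"] += d                # positive decisions in the group
        c["tp"] += d * zi          # positive decisions with a positive true outcome
    return {
        g: {
            "positive_rate": c["d"] / c["n"],
            "ppv": c["tp"] / c["d"] if c["d"] else float("nan"),
        }
        for g, c in counts.items()
    }

# Hypothetical toy data: binary outcome, predicted score, sensitive attribute.
z     = [1, 0, 1, 1, 0, 1, 0, 0]
z_hat = [0.9, 0.8, 0.7, 0.2, 0.1, 0.9, 0.6, 0.3]
s     = ["b", "b", "b", "b", "w", "w", "w", "w"]

metrics = parity_metrics(z, z_hat, s, tau=0.5)
```

In this toy example the positive-decision rates differ between the groups (0.75 versus 0.5), so statistical parity is violated; the positive predictive values differ as well, so predictive parity does not hold either.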
Distance-based measures of fairness account for the non-sensitive predictors X\S, in addition to the observed and predicted outcomes and sensitive attribute. The well-known "fairness through awareness" framework [3] generalises several of the preceding notions, such as statistical parity, by defining fairness as "similar decisions for similar people". Consider a population of potential applicants P, and consider any randomised output from the prediction algorithm, M(p), p ∈ P. Fairness is achieved whenever the distance among the decisions M made for two people is at least as small as the distance between these people, i.e., when $D(M(p), M(p')) \leq d_P(p, p')$ for any p, p' ∈ P. Here, D and d_P are arbitrary metrics on the distance between outputs and people, respectively. Careful choice of these metrics can yield some of the above definitions as special cases. Since the fairness condition can be trivially achieved, for example by always outputting a constant regardless of the input, the prediction model should be trained by minimising a loss function under the above constraint.
Finally, in recent years, results from the causal modelling literature have been leveraged to define and achieve "counterfactual" fairness [5], [8]. In these definitions one first considers a causal model involving Y, X\S, and S such as Panel A of Fig. 1. This causal model then induces a counterfactual distribution $p_{S \leftarrow s}(\hat{Z} \mid X)$, i.e., the distribution we would observe if S were set to the value s [14]. [5] then defined counterfactual fairness as $p_{S \leftarrow s}(\hat{Z} \mid X, S = s) = p_{S \leftarrow s'}(\hat{Z} \mid X, S = s)$ for all s'.
Fig. 1. Graphical representation of causal relations between the sensitive feature (S), the predictors (X), and the error-prone outcome (Z) in the naive case (A), in the measurement error framework (B), and in the measurement error framework with differential item functioning on the Y_1 proxy (C). The dotted arrow indicates the discriminatory causal pathway (as in [8]) which is blocked when performing fair inference, evaluating E[Z | X, S] to compute a risk score Ẑ.
Note that this definition looks superficially similar to the definition of statistical parity (group fairness), but is distinct because it refers to an individual. A disadvantage of this definition is that any causal effect of the sensitive attribute on the prediction is deemed illegitimate. Based on the same framework, [8] suggested a more general definition: some causal pathways originating in S are denoted discriminatory, while others are not. Fairness is then achieved by performing inference on a distribution p*(Z, X), in which the "fair world" distribution p*(Z, X) is close in a Kullback-Leibler sense to the original p(Z, X), but all discriminatory pathways have been blocked (up to a tolerance) using standard causal inference techniques. Note that, if all causal pathways originating in S are deemed discriminatory and the tolerance is set to zero, the counterfactual fairness criterion by [5] will be satisfied.

A. Fair Inference in Error-prone Outcomes
The existing methods from Section III do not consider the target Z to be error-prone. However, in practice, the target feature Y ∈ Y in the data set is not a perfect representation of the true underlying outcome Z. There can be several sources for this imperfect representation. For example, the true underlying outcome of interest may not be directly measurable at all (i.e., Z ≠ Y for any possible Y). In this case, the outcome of interest will only partially explain any feature used as its proxy. For example, in using healthcare costs Y as a proxy for health Z, the observed value will in part be determined by other factors besides Z, such as the location of residence of the patient. Then, even if the outcome of interest were "true healthcare costs" (and thus in principle measurable), the observed feature will in practice still not be an infallible proxy, because health records are never perfect observations and always contain some form of noise [16]. Together, such sources of noise in the observation process are termed "measurement error", and any outcome Z containing measurement error can be considered latent [17] and modelled as such.
Crucially, the presence of measurement error may result in unfair inferences for the error-prone outcome, even after applying the procedures presented in Section III to account for unfairness. This is shown in a compelling example by [7], who concluded that commercial algorithms used by insurance companies for patient referral contain a fundamental racial bias. In the algorithm under consideration, healthcare costs Y ∈ Y are used as a proxy for health Z. [7] illustrated that although there is no bias in healthcare costs, there is strong racial bias in other proxies of health such as whether patients have chronic conditions. Specifically, in order to be referred to a primary care physician, the true underlying health status Z of black patients was worse than that of white patients. [7] concluded that fair inference requires selecting a better proxy for health as the outcome variable Z. Indeed, their analyses were possible precisely due to the availability of different proxies of health, such as the number of chronic conditions. However, we note that solving racial bias in a new proxy does not guarantee the absence of racial bias in other proxies indicating other aspects of health. Instead, here we suggest incorporating several proxies, or indicators Y, in a measurement model for the unobserved, error-prone outcome Z [18]. In the next section, we introduce the existing literature on measurement models and its approach to fair inference.

B. Fair inference in Measurement Models
When outcomes are thought to be error-prone, an existing literature suggests the use of measurement models [16], [19]. At their core, measurement models describe the causal relationship between observed scores Y and unobserved "true scores" Z as Z → Y. A measurement model adequately represents the empirical conditions of measurement if conditional independence can be assumed [20]. More specifically, measurement models assume that Y_1 and Y_2 are conditionally independent given Z, i.e., $p(Y_1, Y_2 \mid Z) = p(Y_1 \mid Z)\, p(Y_2 \mid Z)$. A plethora of variations of measurement models assuming conditional independence have been developed, such as latent class models [21], item response models [22], mixture models [23], factor models [24], structural equation models [25], and generalised latent variable models [26].
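The conditional independence assumption can be made concrete with a small simulation. The sketch below assumes a hypothetical linear Gaussian measurement model with two proxies (loadings 1.0 and 0.8, unit error variances); these values are chosen purely for illustration. The proxies are correlated marginally because they share the common cause Z, but their correlation vanishes once Z is partialled out. In real data Z is unobserved, so this check is only possible here because the data are simulated.

```python
import random

def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def residuals(y, z):
    """Residuals of a simple least-squares regression of y on z."""
    n = len(y)
    my, mz = sum(y) / n, sum(z) / n
    slope = (sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
             / sum((zi - mz) ** 2 for zi in z))
    return [yi - my - slope * (zi - mz) for yi, zi in zip(y, z)]

rng = random.Random(0)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]            # latent "true score"
y1 = [zi + rng.gauss(0, 1) for zi in z]            # proxy 1 = Z + independent error
y2 = [0.8 * zi + rng.gauss(0, 1) for zi in z]      # proxy 2 = 0.8 Z + independent error

marginal = pearson(y1, y2)                               # clearly positive
partial = pearson(residuals(y1, z), residuals(y2, z))    # close to zero
```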
Measurement models are suggested here as a convenient way to account for a latent variable's relationship to sensitive features. The measurement error of a proxy variable (e.g., Y_1) is then assumed to differ over different groups of S. To account for group differences in proxy variables, a large body of literature is available in which this issue is known under different labels. Generally, these approaches are applied within the structural equation modelling (SEM) framework [27], as SEM explicitly separates the measurement model (Z → Y) from the structural model (X → Z). Approaches for investigating how features S influence Z include the study of item bias [28], Differential Item Functioning (DIF) [29] and measurement invariance [30]. For an extensive overview of the different approaches and their benefits and drawbacks, we refer to [30]-[33].

C. Proposed Method for Fair Inference on Latent Variables
We propose our framework for fair inference on outcomes which are measured only through error-prone proxies in a step-by-step manner. To clarify the framework and make it more comparable to earlier work, we use the running example of health risk score prediction from [7]. Their healthcare data set contains several clinical features X at time point t - 1 (e.g., age, gender, care utilisation, biomarker values and comorbidities) which are used to predict healthcare cost Z at time t. In addition, the patient's race is the sensitive feature S, coded as S = b for black patients and S = w for white patients. The relations between these features are shown in panel A of Fig. 1.
Based on X, the expectation of a person's healthcare cost is used as a risk score Ẑ := E[Z | X, S]. The risk score is used to make a decision D to refer a patient to their primary care physician to consider program enrolment. More specifically, D = 1 if Ẑ is above the 55th percentile. In this setting, attributes X can be legitimately controlled. However, conditional on X, both groups in S should have equal probability of being referred: $p(D = 1 \mid X, S = b) = p(D = 1 \mid X, S = w)$. As mentioned in Section IV-A and shown by [7], this procedure leads to bias in other proxies of Z, such as a patient's number of chronic conditions.
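The referral rule can be sketched as follows. The risk scores and group labels are hypothetical toy values, and the helper `refer` is an illustrative name; the percentile is computed with the standard-library `statistics.quantiles`, whose interpolation method may differ from the one used in the original analysis. Note that the sketch reports marginal referral rates per group only; assessing the conditional criterion would additionally require conditioning on X.

```python
import statistics

def refer(risk_scores, percentile=55):
    """Return decisions D = I(score > percentile cut point) and the cut point."""
    # quantiles(..., n=100) returns the 99 percentile cut points; index 54 is the 55th.
    cut = statistics.quantiles(risk_scores, n=100)[percentile - 1]
    return [int(score > cut) for score in risk_scores], cut

# Hypothetical toy data: 100 patients with risk scores 0..99, alternating group labels.
risk_scores = [float(i) for i in range(100)]
groups = ["b" if i % 2 == 0 else "w" for i in range(100)]

decisions, cut = refer(risk_scores)

# Marginal referral rate per group of the sensitive feature.
referral_rate = {
    g: sum(d for d, s in zip(decisions, groups) if s == g) / groups.count(g)
    for g in ("b", "w")
}
```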
Our proposed framework is a SEM implementation of the second and third panels of Fig. 1.The general structure of the model is that of a Multiple Indicator, Multiple Causes (MIMIC) model.
In SEM, a latent variable (a hypothetical construct that is not directly observed) can be related to observable variables, such as indicators and causes of the latent variable, through sets of regression equations [34], where parameters are typically estimated by means of maximum likelihood [35]. A MIMIC model is a particular structure of a SEM model where a latent variable is simultaneously related to both observed indicator and cause variables [36]. In our model, the outcome variable Z (e.g., health) has multiple proxy indicators (e.g., chronic conditions, healthcare costs, hypertension), and the X features predict Z directly (thus the proxies only indirectly). A graphical representation of the MIMIC SEM model is shown in Fig. 2. This implementation imposes additional assumptions on the general causal graphs, most notably linear relationships between the variables and multivariate Gaussian residuals. We implement our proposed correction procedure on the outcome variable Z in an existing fair inference approach [8] by means of the following steps:
1. The data set is split in half to obtain a training set and a test set.
2. Regression parameters (X, S → Z) are estimated on the training set using the MIMIC model.
3. The path from race to health is blocked by setting S = b for all rows in the test set.
4. Predictions are generated for the adjusted test set by using the parameter estimates obtained in step 2.
To summarise, during estimation of the regression parameters (X → Z), health is conditioned on race, but during prediction the path from race to health is blocked by setting S = b. Following the notation of [8], this yields a "fair world" distribution p*(Z, X). The expectation Ẑ = E[Z | X, S] is then computed from this distribution, meaning that for two participants who differ only on S but not on X, the risk score Ẑ will be exactly the same. Because in SEM the latent outcome Z is modelled as a linear combination of the different proxies, the risk score is a reflection of the underlying health rather than only health cost.
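The four steps can be sketched on simulated data. As a loudly stated simplification, the sketch below replaces the MIMIC model by an ordinary least-squares regression on a single observed outcome, because fitting a latent variable model would require a dedicated SEM package; the helper names `ols` and `design`, the single predictor x, and all simulated coefficients are hypothetical. The sketch therefore illustrates only the "estimate with S, predict with S fixed to b" logic, not the measurement model itself.

```python
import random

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def design(x, s):
    """Design row: intercept, predictor x, and indicator for S = w (S = b is the reference)."""
    return [1.0, x, 1.0 if s == "w" else 0.0]

# Simulated data (assumed coefficients: intercept 1.0, slope 2.0, group effect 0.5).
rng = random.Random(1)
data = []
for _ in range(2000):
    x = rng.gauss(0, 1)
    s = rng.choice(["b", "w"])
    y = 1.0 + 2.0 * x + 0.5 * (s == "w") + rng.gauss(0, 1)
    data.append((x, s, y))

# Step 1: split the data set in half.
train, test = data[:1000], data[1000:]

# Step 2: estimate the regression parameters (X, S -> outcome) on the training set.
beta = ols([design(x, s) for x, s, _ in train], [y for _, _, y in train])

# Step 3: block the path from S by setting S = b for all rows in the test set.
adjusted = [design(x, "b") for x, _, _ in test]

# Step 4: generate predictions for the adjusted test set.
risk = [sum(b * v for b, v in zip(beta, row)) for row in adjusted]
```

Because the indicator for S is zero in every adjusted row, the resulting risk score depends on x only, so two test rows that differ only on S receive the same score.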

V. Experiments
In this section, we evaluate the proposed framework on an application of the procedures discussed in this paper. We first prepare the data set as provided by [7] to create a basic risk score based on healthcare cost similar to the commercial risk score reported in their paper. Then, we illustrate our argument from Section IV-A: we perform fair inference on the proxy measure for health (healthcare cost) to show that this does not solve the issue of unfairness in other proxy measures. This is a reproduction of the results shown by [7]. Next, we use the SEM framework from Section IV-C to show how including a formal measurement model for Z (as in panel B of Fig. 1) can largely solve the issue of unfairness in the proxies. Last, we show how existing differential item functioning (DIF) methods in the SEM framework (panel C of Fig. 1) can aid in interpreting the extent to which proxy measures contain unfairness. Fully reproducible R code for this section is available as supplementary material to this paper at the following DOI: 10.5281/zenodo.3708150.

A. Data Preparation and Feature Selection
Log-transformations are applied to highly skewed variables at time-point t, such as costs, to meet the assumption of normally distributed residuals in regression procedures. As an additional normalisation step, the predictors at time-point t - 1 are re-scaled to homogenise their levels of variance. The data set is then split into a training and a test set. In this section, estimation is always done on the training set and inference is done on the test set.
To simplify our proposed framework for the purpose of this application, we select a subset of features at time-point t - 1 for prediction of the target of interest at time point t, health.We want our procedure to be comparable to the commercial algorithm which produces the risk scores described in [7].If the features we select are the same features used by the commercial algorithm, then our procedure would yield very similar results upon generating a risk score.Unfortunately, the predicted risk scores used by [7] cannot be replicated exactly using the provided data set.
To select the subset of predictor features for further use in our procedure, we performed a LASSO regression [37] where all available features at time-point t - 1 are used as predictor variables, and the provided algorithmic risk score at time-point t is used as a target. Following the guidelines by [38], we used cross-validation to select the optimal λ penalty value. This yields a set of non-zero predictors which predict the algorithmic risk score well. Spearman's rank correlation between the commercial and the replicated risk score is high (ρ = 0.82), indicating that the commercial and replicated risk scores perform similarly in the rank-based cutoff applied in [7]. The predictors selected in this model are used as predictors X in the structural equation models of the following sections.
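The rank correlation is the relevant agreement measure here because the referral decision depends only on the ordering of patients by risk score. A minimal sketch of Spearman's rank correlation, computed as the Pearson correlation of average ranks, is given below; the function names and the two short score vectors are hypothetical and unrelated to the actual data set.

```python
def average_ranks(values):
    """1-based ranks, with tied values receiving the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1      # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Hypothetical scores for five patients; only two adjacent patients swap order.
commercial = [0.1, 0.4, 0.35, 0.8, 0.9]
replicated = [1.0, 3.0, 4.0, 9.0, 20.0]
rho = spearman(commercial, replicated)
```

Any strictly increasing transformation of either score leaves ρ unchanged, which is why the log-transformation of costs does not affect this comparison.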

B. Fair Inference on Cost as a Proxy of Health
Panel A of Fig. 1 illustrates conditional statistical parity as defined by [6]. To perform standard statistical parity correction, the outcome Z is conditioned on sensitive feature S when estimating the coefficients of the prediction model (X → Z), and during prediction all subjects are assumed to have the same level of S, e.g., S = b, such that $\hat{Z} = E[Z \mid X, S = b]$. However, in the current situation we do not measure Z directly, but only a proxy Y ∈ Y. Standard parity correction for this proxy does not necessarily mean that parity is achieved for other proxies [7]. The reason for this is illustrated in Fig. 3. Panel A illustrates that statistical parity is present when plotting the risk score against healthcare costs, meaning that for a given risk score, the healthcare costs for both races are approximately equal. However, Panel B illustrates that when the number of chronic conditions is plotted against healthcare costs, there are differences between the two race groups, meaning that for a given number of chronic conditions, white patients cost more than black patients.
As a result, standard statistical parity correction on healthcare cost does not remove the disparity in chronic conditions. This becomes visible when comparing Panel B of Fig. 3 with Panel A of Fig. 4. In addition, from Panel B of Fig. 4 it can be seen that the results improve compared to not including race at all (Panel A of Fig. 4), yet race differences remain for the chronic conditions proxy. As a consequence, individuals belonging to S = b will still have a lower health status when being selected for intervention.

C. Fair Inference on Latent Health
One reason why conditional statistical parity is not met when following Panel A of Fig. 1 can be that Ẑ is a (bad) proxy. Instead of using one bad proxy, it is better to use multiple (bad) proxies as indicators of an unobserved latent variable measuring 'true health'. How such a model can be specified is illustrated in Panel B of Fig. 1. Such a model can be applied in practice by following the steps in the framework described in Section IV-C. Similarly to [6], the sensitive feature is excluded during prediction. Fig. 4 shows the effect of including a measurement model in constructing risk scores. The figure illustrates that using a measurement model with multiple imperfect measurements of health as indicators for 'true health' substantially improves conditional statistical parity, when compared to either the uncorrected risk score on a proxy, or a parity-corrected risk score on the proxy. Additionally, Table I shows a numerical summary which corroborates this finding. Here, we created a prediction model for the number of chronic conditions using both risk score and race. The parameter for race then indicates whether a race difference exists for health, conditional on the risk score. This conditional dependence becomes close to 0 when using the latent risk score (95% CI = [-0.113, 0.012]). Thus, by using this measurement model, the problem that individuals belonging to S = b had a lower health status when being selected for intervention is minimised.
Fig. 3. Although the risk score displays statistical parity on healthcare costs (no differences between the lines in panel A), these costs conditional on health (as measured by chronic illness) depend on race (panel B). This causes statistical disparity for the risk score on the level of health (Fig. 4, panel B). Figure replicated from [7].

D. Investigating Unfairness in Proxies
When using a measurement model with multiple imperfect measurements of health as indicators of 'true health', differences in measurement error over the different groups of the sensitive feature can still be present. Panel C of Fig. 1 illustrates how differences over the sensitive feature groups in the error-prone indicator variables can be incorporated directly when estimating 'true health'. For example, differences in measurement error of healthcare cost can be present for the different groups of race.
Including a DIF parameter δ on the healthcare cost variable yields a model which fits significantly better on the test set than the model without the DIF parameter ($\chi^2(1) = 50$, p < 0.001). The value of the DIF parameter on cost is estimated as δ = 0.198 (95% CI = [0.172, 0.225]). This means that for the same level of health, the log-healthcare costs of the white race class in this data set are estimated to be 0.198 higher. Equivalently, the cost of healthcare for white patients is $(e^{0.198} - 1) \cdot 100\% = 21.9\%$ higher than that for black patients, given an equal level of health as measured by the measurement model (95% CI = [18.7, 25.2]).
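The conversion from a difference on the log scale to a percentage difference on the original cost scale can be checked with a few lines of code. The function name `log_gap_to_percent` is a hypothetical helper introduced for illustration; it only applies the transformation (e^δ - 1) · 100% to the reported point estimate.

```python
import math

def log_gap_to_percent(delta):
    """Percentage difference on the original scale implied by a log-scale gap delta."""
    # expm1(delta) computes e**delta - 1 accurately for small delta.
    return math.expm1(delta) * 100.0

point = log_gap_to_percent(0.198)   # reported DIF estimate on log-healthcare cost
```

Rounded to one decimal place, `point` equals 21.9, in agreement with the percentage reported in the text.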
Applying the same procedure to the other indicators leads to estimates of DIF for those indicators. The results are shown in Table II. This table shows that some proxies have stronger DIF than others, meaning some proxies are more unfair than other proxies. Notably, the avoidable healthcare cost and the renal failure items have low levels of DIF for race, whereas the healthcare cost and the number of active chronic conditions have strong DIF.

VI. Conclusion
In this paper, we have argued that when measurement error is at play, performing fair inference on a proxy measure of the outcome is insufficient to achieve a fair inference on the true outcome. This manifests itself, as shown in [7], as unfairness in other proxy measures of the outcome of interest. Alternatively, in this study we proposed to make use of existing measurement models containing multiple error-prone proxies for the outcome of interest. In addition, fair inference can be accounted for in each of these proxies simultaneously if needed by allowing for measurement error in proxies to differ over groups defined by differing values of a sensitive feature. We provided a framework to perform these estimations and applied this framework to the exemplary data set provided by [7]. Here, it was concluded that fair inference was accounted for when multiple proxies were used in a measurement model instead of a single proxy. Additionally accounting for differences in measurement error over race groups was not needed to further improve fairness in predicted risk scores, although substantive group differences were found for some proxies.

Fig. 4. Effect of including a measurement model in constructing risk scores. The first panel shows the uncorrected risk score based on healthcare cost, the middle panel shows the same risk score but corrected for the sensitive feature, and the third panel shows the corrected risk score based on the latent health outcome using a measurement model.
Fig. 2. Structural equation model for the proposed framework on the healthcare data set. For clarity, residual variances of the endogenous variables are not drawn in the diagram. EHR stands for Electronic Health Record. For more information on the variables used in the model, see [7].

TABLE I. Estimated Conditional Parity on the Number of Chronic Conditions for Different Risk Scores. β Parameters Are Linear Regression Parameters, Indicating the Deviation of White Patients From Black Patients in the Number of Chronic Conditions, Conditional on Risk Score. For Example, a Value of -0.963 Means that White Patients Have on Average 0.963 Fewer Chronic Conditions for the Same Risk Score.

TABLE II. Estimated Differential Item Functioning Parameters for Each Indicator (Proxy) of Health. δ Parameters Should Be Interpreted as the Mean Deviation of the Black Patients Compared to the White Patients Given Health.