Algorithmic encoding of protected characteristics in chest X-ray disease detection models

Summary
Background It has been rightfully emphasized that the use of AI for clinical decision making could amplify health disparities. An algorithm may encode protected characteristics, and then use this information for making predictions due to undesirable correlations in the (historical) training data. It remains unclear how we can establish whether such information is actually used. Besides the scarcity of data from underserved populations, very little is known about how dataset biases manifest in predictive models and how this may result in disparate performance. This article aims to shed some light on these issues by exploring methodology for subgroup analysis in image-based disease detection models.
Methods We utilize two publicly available chest X-ray datasets, CheXpert and MIMIC-CXR, to study performance disparities across race and biological sex in deep learning models. We explore test set resampling, transfer learning, multitask learning, and model inspection to assess the relationship between the encoding of protected characteristics and disease detection performance across subgroups.
Findings We confirm subgroup disparities in terms of shifted true and false positive rates which are partially removed after correcting for population and prevalence shifts in the test sets. We find that transfer learning alone is insufficient for establishing whether specific patient information is used for making predictions. The proposed combination of test-set resampling, multitask learning, and model inspection reveals valuable insights about the way protected characteristics are encoded in the feature representations of deep neural networks.
Interpretation Subgroup analysis is key for identifying performance disparities of AI models, but statistical differences across subgroups need to be taken into account when analyzing potential biases in disease detection. The proposed methodology provides a comprehensive framework for subgroup analysis enabling further research into the underlying causes of disparities.
Funding European Research Council Horizon 2020, UK Research and Innovation.

Introduction particular type of information, say a patient's racial identity, is being used by the prediction layer, even if we know that such information is present in the original input data, either explicitly, e.g., in tabular data, or implicitly, e.g., in medical scans. An important point we make in this article is that even the presence of such information in the learned feature representation of the penultimate layer is not a sufficient indication that this information is being used by the prediction layer. We illustrate this with some concrete examples and a real world, clinical application of disease detection in chest X-ray.

The unreasonable performance of deep learning
Despite the opaque behavior of deep neural networks, these models are now ubiquitous, and have become the state-of-the-art approach for most image-based prediction tasks. 3 An intriguing example for the 'power' of deep learning is the discovery that cardiovascular risk factors can be accurately predicted from retinal fundus images, including age, biological sex, smoking status, systolic blood pressure and major adverse cardiac events. 4 Additional biomarkers were discovered shortly after. 5 Another remarkable finding was that deep neural networks were capable of predicting patients' experienced pain from knee X-rays enabling an algorithmic approach for reducing unexplained racial disparities in pain. 6 Here, the technique of deep learning does not only help us to discover such previously unknown associations between 'raw' inputs and outputs but also to capture them in compact mathematical models that we can use for making accurate predictions on new data. 2 In addition to being able to predict a patient's age and biological sex, 7 a recent study further demonstrated that deep neural networks are also capable of recognizing a patient's racial identity from chest X-ray and other medical scans with astonishing accuracy. 8 This study by Gichoya et al. is remarkable in multiple ways. Not only was it unknown that it is possible to recognize racial identity from these scans, it also appears to be a task that expert radiologists are not capable of doing (or at least not trained for). The exact mechanism and types of imaging features that are being used for making these predictions are yet to be uncovered. But more importantly, these results may have profound implications in the context discussed earlier, that there is a real risk that ML models may amplify health disparities. 
9-11 Since it seems straightforward (given the very high accuracy) to extract features related to racial information from medical scans, any spurious correlations between race and clinical outcome present in the data could be picked up by a model that is trained for clinical diagnosis. Assuming that features predictive of race are easier to extract than features associated with pathology, there are concerns that the model may learn 'shortcuts' that could manifest an undesirable association in the model between the patient's race and the prediction of disease. 12 This emphasizes again the importance of knowing what information is being used when a model makes predictions.
In the study by Gichoya et al. the authors tried to establish whether a deep neural network trained for disease detection may have implicitly learned to recognize racial information. They used a specific type of test

Research in context
Evidence before this study We searched PubMed and Google Scholar for machine learning, deep learning, and AI studies focusing on "model bias", "algorithmic bias", "fairness", "subgroup analysis", "performance disparities", "protected characteristics", and "ethical implications" in the context of "disease detection methods" published before April 2022. The list of publications was complemented with the authors' knowledge of the literature and suggestions from colleagues. This process identified several relevant publications, all discussed in the manuscript. There has been an increase in studies reporting biases in image-based disease detection models, highlighting the potential risk that the use of machine learning could amplify health disparities. The connection between the ability to predict patient characteristics from images and reported subgroup disparities in disease detection models has not yet been established, and demands further investigation.

Added value of this study
The proposed methodology including test-set resampling, multitask learning, and model inspection provides a comprehensive approach for subgroup analysis of disease detection models, complementing recent frameworks for auditing AI algorithms. Our findings may advance an important and timely discussion about what constitutes safe and ethical use of AI.

Implications of all the available evidence
To establish whether specific information is used by machine learning models for making predictions is of high relevance and concerns all stakeholders including clinicians, patients, developers, regulators, and policy makers. Identifying the underlying causes of disparate performance remains a challenge and will require further research. Besides the development of technical solutions to construct fair and equitable predictive models, it will be equally important to consider ethical limitations and regulatory requirements, highlighting the need for representative and unbiased test sets to faithfully assess potential biases in machine learning models.
based on transfer learning that we will discuss below. The study found that it is possible to predict race from the feature representations of a disease detection model, even when race and disease were poorly correlated. It was concluded that deep learning models are at risk of incorporating these unintended cues in their decision making. 8 This conclusion is of particular concern in the light of recent studies that found performance disparities across racial subgroups. 13,14 The connection between the ability to predict patient characteristics from images and subgroup disparities in disease detection models demands further investigation as the implications are of high relevance for the safe and ethical use of AI.

The inter-relationship of prediction tasks
One seemingly intuitive approach to investigate whether particular information is being used for making predictions is to check if the model (or more precisely, the learned feature representations) trained for a primary task of disease detection can be used for a secondary task for predicting patient characteristics. Assuming the secondary task can be performed reasonably well, one may conclude that the two tasks are closely related.
Here, we need to first clarify what we mean by "tasks are related". From a machine learning perspective, one may distinguish between three scenarios as illustrated in Fig. 2 with the example of separating colors and shapes. In scenario A, the two tasks are unrelated both on the feature- and the output-level; in this case, we can solve each task independently using a different set of features, and no information about the other task is relevant or helpful. In scenario B, the tasks are related on a feature-level but not on an output-level; here, the two tasks make use of the same feature representation, but apply different weights and aggregations for making the predictions. The information about one task, however, remains irrelevant for the other. In scenario C, the tasks are related both on a feature- and output-level, and it appears impossible to disentangle the tasks. Solving one task will, at least to some degree, also solve the other task. We can say that the information related to each task is correlated with the other task. In practice, there can of course be different degrees of correlation.
Fig. 1: An input medical scan is processed by a sequence of network layers in the so-called 'backbone' applying non-linear transformations whose parameters (or weights) are learned during training. This results in a complex feature representation at the penultimate layer which is then processed by the final prediction layer. The prediction layer assigns weights to each feature in the penultimate layer, aggregates the weighted features, and generates an output prediction. The prediction layer has the role of making the decision on what information is being used for making predictions.
Fig. 2: Illustration of different inter-relationships of prediction tasks. (a) The two classification tasks of separating colors (blue vs orange) and shapes (crosses vs circles) are unrelated, both on the feature- and the output-level. The color classification can be performed by only considering feature x1 while shape information is irrelevant. Similarly, shapes can be classified using feature x2 with color being irrelevant. (b) The two tasks are related on a feature-level but not on their outputs. In both tasks, the features x1 and x2 need to be considered for classifying colors and shapes, however, shape information remains irrelevant for separating colors, and vice versa. While in both tasks the exact same features are being used, they are combined in different ways. (c) The two tasks are related both on a feature- and an output-level. Solving one of the tasks also solves the other. Shape and color information is highly correlated. The dashed green and gray lines indicate the optimal decision boundaries for color and shape classification, respectively.
To establish whether certain information (say shape) is being used for solving the primary task (say color classification), we need to identify which is the relevant scenario. Indeed, we can see that in scenario A, the feature representation from the primary task (here just a single feature x1) is not useful for solving the secondary task. We thus may safely conclude that shape has no role to play when classifying color and vice versa in this given example. In scenario C, we would find the opposite. The secondary task can be solved using the model learned for the primary task (and vice versa). We may conclude that there is no way of disentangling shape and color information, and a model trained for one task may use the information related to the other task. Scenario B is the most intriguing one. Here, we would find that the secondary task can be solved by using the features from the primary task. However, we need to learn a new set of feature weights specific to the secondary task, as the weights will be different from the ones learned for the primary task. So, while the information about one task is neither relevant nor helpful for the other task, we can still solve each task by using the features from the other task. This is an important observation which is relevant in the context of our real world application of disease detection. In cases where we have knowledge that the secondary task information (say a patient's race) should not be used for the primary task (say detection of disease), we need to avoid using models under scenario C. Models under scenario B, however, are potentially safe to use despite the fact that their feature representations could be predictive for the secondary task.
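Scenario B can be made concrete with a small synthetic sketch. Everything in it is an illustrative assumption rather than part of the study: the two Gaussian features, the rule "color depends on x1 + x2, shape depends on x1 - x2", and the hand-written logistic regression. It shows that weights learned for color are useless for shape, while a new set of weights on the same two features solves shape almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scenario B: both tasks need x1 and x2, but combine them differently.
X = rng.normal(size=(2000, 2))
color = (X[:, 0] + X[:, 1] > 0).astype(float)  # primary task
shape = (X[:, 0] - X[:, 1] > 0).astype(float)  # secondary task


def fit_logistic(X, y, lr=0.5, steps=2000):
    """Logistic regression trained with plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def accuracy(X, y, w, b):
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


w_color, b_color = fit_logistic(X, color)
w_shape, b_shape = fit_logistic(X, shape)

acc_color = accuracy(X, color, w_color, b_color)  # primary task, own weights
acc_reuse = accuracy(X, shape, w_color, b_color)  # color weights applied to shape
acc_shape = accuracy(X, shape, w_shape, b_shape)  # new weights, same features
label_corr = float(np.corrcoef(color, shape)[0, 1])  # output-level relationship
```

In this toy setting the two labels are essentially uncorrelated, yet the shared features support both tasks once task-specific weights are learned, which is the situation the text describes as potentially safe.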
In the following, we explore different methods including transfer learning, test set resampling, multitask learning, and model inspection to study the inner workings of disease detection models, drawing connections between their subgroup performance and the way protected characteristics are encoded.

Study population
We study the behavior of deep convolutional neural networks trained for detecting different conditions using two publicly available chest X-ray datasets, CheXpert and MIMIC-CXR. 15,16 The datasets contain detailed patient demographics including self-reported racial identity, biological sex, and age. The CheXpert sample contains a total of 42,884 patients with 127,118 chest X-ray scans divided into three sets for training (76,205), validation (12,673) and testing (38,240). The MIMIC-CXR sample contains 43,209 patients with 183,207 scans divided into training (110,280), validation (17,665), and testing (55,262). No scans from the same patient are used in different subsets. The validation sets are used for model selection, while the test sets are the held-out sets for measuring disease detection performance and assessing model behavior. Both CheXpert and MIMIC-CXR are highly imbalanced and skewed across subgroups. The large majority of scans are from patients identifying as White (78% and 77%), while scans from patients identifying as Asian (15% and 4%) and Black (7% and 19%) are underrepresented. Black patients in CheXpert are on average 5-8 years and in MIMIC-CXR 2-5 years younger than Asian and White patients. The proportion of females in CheXpert is 40%, 43%, and 49% for White, Asian, and Black patients, and 43%, 44%, and 60% in MIMIC-CXR. The study samples and splits are identical to the ones used in the study by Gichoya et al. 8 A detailed breakdown of the population characteristics is provided in Table 1 with a visual summary provided in Fig. S1 in the Supplementary Material.

Disease detection models
The basis of our investigation is a real world, clinical application of image-based disease detection. We note that developing the disease detection models is not the primary concern of our study, nor do we claim any contribution in this respect. Here, we are studying models that have been used in previous works. 8,13,14 We train deep neural networks for detecting 14 different conditions annotated in the CheXpert and MIMIC-CXR datasets. Similar to previous work, we use a multi-label approach as patients may have multiple conditions. The presence of each condition is predicted via a dedicated one-dimensional output which is passed through a sigmoid function to obtain predictions between zero and one. The simultaneous detection of the individual conditions uses a common feature representation obtained from a shared neural network backbone. We focus our analysis on two tasks, one for detecting the presence of a specific pathology ('pleural effusion' label) and another aiming to rule out the presence of disease ('no finding' label). These two labels are mutually exclusive which makes them suitable for our model inspection, discussed later. The varying prevalence of these labels across subgroups and datasets makes them particularly interesting to study in the context of performance disparities under population and prevalence shifts (cf. Table 1 and Fig. S1). The prevalence of 'no finding' is 8%, 9%, and 10% for White, Asian, and Black patients in CheXpert, and 31%, 29%, and 38% in MIMIC-CXR. For 'pleural effusion', the prevalence is 41%, 42%, and 33% in CheXpert, and 27%, 27%, and 16% in MIMIC-CXR. The 'no finding' label was also the focus of a recent study which reported subgroup disparities assumed to be associated with underdiagnosis bias. 14 By using the same data and models, our performance analysis may shed further light on the reported issue of algorithmic bias.
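The multi-label output described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the random `features` array stands in for the shared backbone output, the weights are untrained, and the binary cross-entropy loss is an assumed choice, since the text does not name the training loss.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


n_scans, n_features, n_conditions = 4, 1024, 14

# Placeholder for the shared feature representation of four scans.
features = rng.normal(size=(n_scans, n_features))

# One dedicated one-dimensional output per condition (untrained weights).
W = rng.normal(scale=0.01, size=(n_features, n_conditions))
b = np.zeros(n_conditions)

logits = features @ W + b
probs = sigmoid(logits)  # each entry lies in (0, 1), independently per condition


def multilabel_bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all scans and conditions (assumed loss)."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))


targets = rng.integers(0, 2, size=probs.shape).astype(float)  # toy labels
loss = multilabel_bce(probs, targets)
```

Because each condition has its own sigmoid, the 14 outputs of a scan do not need to sum to one, which is what allows several conditions to be flagged for the same patient.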

Test-set resampling for unbiased estimation of subgroup performance
Previous studies have reported subgroup disparities in the form of shifted true and false positive rates in underserved populations, raising concerns that models may pick up bias from the training data which is then replicated at test-time. 13,14 A limitation in these studies, however, arises from their use of test data exhibiting the same biases as the training data (due to random splitting of the original datasets) which complicates the interpretation of the reported disparities. 17 Population and prevalence shifts across subgroups are known to cause disparities in predictive models. 18,19 In order to faithfully assess the behavior of a potentially biased model, one would require access to an unbiased test set which is difficult to obtain. For this reason, we explore the use of strategic resampling with replacement to construct balanced test sets that are representative of the population-of-interest. 20 Test set resampling allows us to correct for variations across subgroups such as racial imbalance, differences in age, and varying prevalence of disease. Controlling for specific characteristics when estimating subgroup performance and contrasting this with the performance found on the original test set may allow us to identify underlying causes of disparities. A visual summary of the population characteristics of the resampled test-sets is provided in Fig. S2 in the Supplementary Material.
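A minimal sketch of resampling with replacement is given below. It is not the authors' implementation: the function name `resample_balanced`, the toy test set, the target of 1000 scans per group, and the target prevalence of 30% are all assumptions for illustration, and the sketch controls only for group size and prevalence, not for age.

```python
import numpy as np

rng = np.random.default_rng(0)


def resample_balanced(group, label, n_per_group, prevalence, rng):
    """Sample indices with replacement so that every group has the same size
    and the same fraction of positive labels."""
    n_pos = int(round(n_per_group * prevalence))
    idx = []
    for g in np.unique(group):
        pos = np.flatnonzero((group == g) & (label == 1))
        neg = np.flatnonzero((group == g) & (label == 0))
        idx.append(rng.choice(pos, size=n_pos, replace=True))
        idx.append(rng.choice(neg, size=n_per_group - n_pos, replace=True))
    return np.concatenate(idx)


# Toy imbalanced test set with group-specific prevalence (illustrative values).
group = np.repeat(np.array(["White", "Asian", "Black"]), [780, 150, 70])
base_prev = {"White": 0.41, "Asian": 0.42, "Black": 0.33}
label = np.array([rng.random() < base_prev[g] for g in group]).astype(int)

idx = resample_balanced(group, label, n_per_group=1000, prevalence=0.30, rng=rng)
new_group, new_label = group[idx], label[idx]
```

Evaluating a model on `idx` instead of the original test set removes the group imbalance and prevalence differences from the comparison, so any remaining subgroup shift is less likely to be an artifact of test-set composition.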

The supervised prediction layer information test
Besides subgroup performance, it is equally important to assess whether patient characteristics are encoded and then potentially used in a disease detection model. One approach to test this is based on transfer learning, which has been used in the study by Gichoya et al. 8 The approach freezes the trained feature extractor of the primary-task model (what is often called the neural network 'backbone') and then replaces the prediction layer with a new one (cf. Fig. 1). The new prediction layer is then trained specifically for the secondary task to learn a new set of weights assigned to the features in the penultimate layer. The features are generated by passing the input data through the frozen backbone from the primary task. We may then measure the accuracy of this new prediction layer on some test data. We might conclude that the two tasks are related and possibly even share information when the level of accuracy is reasonably high. In the following, we refer to this approach as the 'supervised prediction layer information test' or SPLIT. We argue that SPLIT is insufficient to confirm whether a disease detection model may have implicitly learned to encode protected characteristics such as racial identity. Considering the example of shape and color classification, SPLIT can only tell us whether we are either in scenario A, in which case the accuracy obtained with SPLIT would have to be very low as the feature learned for one task is not useful for the other task (we call this a negative SPLIT result), or we are in one of the other two scenarios, B or C. However, SPLIT is unable to distinguish between those two. In fact, even the absolute value of the observed SPLIT accuracy is uninformative, as we can easily confirm for the given example in scenario B (cf. Fig. 2). Here, SPLIT would result in perfect classification of shape, despite shape being irrelevant for the classification of color. Obtaining a positive SPLIT result is a necessary condition for establishing that information is being used, but not a sufficient one. SPLIT is like a diagnostic test that has 100% sensitivity but unknown specificity.
Table 1: Breakdown of demographics over the set of patient scans by racial groups and training, validation and test splits. Percentages in brackets are with respect to the number of scans. We also report the number of unique patients for each group.
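The mechanics of SPLIT can be sketched on synthetic data. The sketch below is an assumption-laden toy, not the study's pipeline: the "backbone" is a fixed random ReLU projection (so by construction it was never trained to encode the secondary attribute), the inputs are random vectors instead of images, and the new prediction layer is a hand-written logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)


def auc(scores, y):
    """AUC via the rank-sum (Mann-Whitney U) identity; ties are not averaged."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return float((ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))


def fit_head(F, y, lr=0.5, steps=500):
    """Train only a new linear prediction layer on frozen features."""
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
        w -= lr * F.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


# Toy inputs and a secondary attribute that is present in the raw input.
n, d_in, d_feat = 3000, 20, 64
X = rng.normal(size=(n, d_in))
secondary = (X @ rng.normal(size=d_in) > 0).astype(float)

# Frozen backbone: a random projection followed by ReLU, never updated.
W_frozen = rng.normal(size=(d_in, d_feat)) / np.sqrt(d_in)
F = np.maximum(X @ W_frozen, 0.0)
F = F - F[:2000].mean(axis=0)  # center using training statistics only

w, b = fit_head(F[:2000], secondary[:2000])
split_auc = auc(F[2000:] @ w + b, secondary[2000:])
```

A clearly positive `split_auc` here cannot mean the backbone "learned" the attribute, because the backbone was never trained at all. It only shows that the signal survives the feature transformation, which is why a positive SPLIT result alone is not conclusive.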
To confirm these shortcomings, we applied SPLIT for both race and sex classification to different disease detection model backbones. Our main model uses the whole training set including all patients that identify as White, Asian, or Black. We trained two other disease detection models each using only a subgroup of patients to contrast our findings on the encoding of racial identity and biological sex. To minimize the effect of different amounts of training data on the model performance, we used the subgroups with the largest number of scans available, which are the groups of patients that identified as White, and male patients. The models trained on subgroups only are not exposed to the same variation in patient characteristics as the model trained on all data. In addition, we also considered two backbones that were neither trained for disease detection nor with any medical imaging data. One of them is based on random initialization of the network weights where the backbone then acts as a random projection of the input imaging data. This backbone is entirely untrained before applying SPLIT, and when combined with a prediction layer resembles a shallow model with limited capacity similar to a logistic regression. The motivation here is to provide baseline results for assessing the general difficulty of the tasks. The other non-medical backbone corresponds to a network trained for natural image classification using the ImageNet dataset. 21

Assessing task relationships via multitask learning
While SPLIT cannot provide a definite answer whether specific information is being used or not, with a simple tweak to the prediction model, we can assess more explicitly the relationship of tasks. Recall that under scenario C where tasks are closely related both on the feature- and the output-level, the information from one task should be helpful for solving the other task. We can assess this by using the idea of multitask learning where we train a single model for simultaneously solving multiple prediction tasks. 22 This can be easily implemented with deep neural networks by using one shared neural network backbone and multiple prediction layers. In our setting of disease detection from chest X-rays, we can simply add two prediction layers to our model, one for predicting biological sex and one for racial identity, connected to the same penultimate layer of the backbone as the disease prediction layer. We then make explicit use of the labels for disease, sex, and race during training, to learn a feature representation that is shared across the three prediction tasks. If a patient's sex or racial identity is directly related to the prediction of disease (for example, due to unwanted correlations in the historical data), we may find that the task-specific features align in similar 'directions' in the feature space. Thus, inspecting the learned feature representation of a multitask model and comparing it to the feature space learned by a single task disease detection model may provide valuable insights about the interrelationship of these prediction tasks.
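The structure of such a multitask model, with one shared representation and three prediction layers, can be sketched as a single forward pass. The sketch makes several assumptions that are not stated in the text: the random `features` array stands in for the shared backbone output, the heads are untrained, sex is treated as a binary output, race as a three-class softmax, and the three losses are summed with equal weight.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def bce(p, y, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def cross_entropy(p, y, eps=1e-7):
    picked = p[np.arange(len(y)), y]  # probability assigned to the true class
    return float(-np.mean(np.log(np.clip(picked, eps, 1.0))))


n, d = 8, 1024
features = rng.normal(size=(n, d))  # placeholder for the shared penultimate layer

# Three prediction layers attached to the same features (untrained weights).
heads = {
    "disease": rng.normal(scale=0.01, size=(d, 14)),
    "sex": rng.normal(scale=0.01, size=(d, 1)),
    "race": rng.normal(scale=0.01, size=(d, 3)),
}

p_disease = sigmoid(features @ heads["disease"])
p_sex = sigmoid(features @ heads["sex"])
p_race = softmax(features @ heads["race"])

# Toy labels for the three tasks.
y_disease = rng.integers(0, 2, size=(n, 14)).astype(float)
y_sex = rng.integers(0, 2, size=(n, 1)).astype(float)
y_race = rng.integers(0, 3, size=n)

# Equal weighting of the task losses is an assumption of this sketch.
total_loss = bce(p_disease, y_disease) + bce(p_sex, y_sex) + cross_entropy(p_race, y_race)
```

In a real training loop the gradient of `total_loss` would flow into the shared backbone, which is what forces the representation to serve all three tasks at once.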

Unsupervised exploration of feature representations
We employ a model inspection approach utilizing unsupervised machine learning techniques that allow us to directly explore what information is encoded, how it is distributed, and whether it aligns in the learned feature space with the primary task of disease detection. We recall that the prediction layer makes the ultimate decision about what information to use. The difficulty is that the feature representations are typically high-dimensional. In a DenseNet-121, the representations in the penultimate layer have 1024 dimensions. 23 To inspect the learned feature representations, we need to make use of dimensionality reduction techniques.
We use principal component analysis (PCA) to capture the main modes of variation within the feature representations. We then generate two-dimensional scatter plots for the first four PCA modes, and overlay different types of patient information. Additionally, we use t-distributed stochastic neighbor embedding (t-SNE), a popular algorithm for visualizing high-dimensional data, to capture the similarity between samples in the feature space. 24 We also plot the output predictions of the primary task prediction layer. In our case, the output of the disease detection model has 14 dimensions (one output for each of the 14 conditions). We may either apply dimensionality reduction on the 14-dimensional outputs, or focus on specific conditions of interest. Here, we focus on the two tasks of classifying 'no finding' and 'pleural effusion'. The two-dimensional logits (which are the unnormalized prediction scores for the two output classes) can then be directly visualized in a single scatter plot. Samples that are labeled neither 'no finding' nor 'pleural effusion' are labeled in the plots as 'other'. For each scatter plot of PCA, t-SNE and logit outputs, we also visualize the corresponding (marginal) distributions that one obtains when projecting the 2D data points against the axes of the scatter plots. We then visually check if any patterns emerge in these visualizations, and we use statistical tests to compare the marginal distributions in PCA and logit space for all relevant pairs of subgroups using the two-sample Kolmogorov-Smirnov test. Contrasting the encodings in the feature embeddings with the outputs of the prediction layer may allow us to assess whether particular information is used for making predictions.
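The PCA projection and the comparison of marginal distributions can be sketched as follows. The data are synthetic and purely illustrative: a 32-dimensional feature space in which the disease label shifts one direction strongly while a subgroup attribute is independent of the features. The sketch computes only the two-sample Kolmogorov-Smirnov statistic by hand; a library routine such as `scipy.stats.ks_2samp` could be used to obtain p-values.

```python
import numpy as np

rng = np.random.default_rng(0)


def pca_project(features, n_modes=4):
    """Project centered features onto the first principal modes (via SVD)."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_modes].T


def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Toy feature space: disease moves the features, the subgroup does not.
n, d = 600, 32
disease = rng.integers(0, 2, size=n)
subgroup = rng.integers(0, 2, size=n)
features = rng.normal(size=(n, d))
features[:, 0] += 4.0 * disease

modes = pca_project(features)
mode1 = modes[:, 0]

d_disease = ks_statistic(mode1[disease == 1], mode1[disease == 0])
d_subgroup = ks_statistic(mode1[subgroup == 1], mode1[subgroup == 0])
```

In this toy case the first PCA mode separates the disease labels (large statistic) but not the subgroups (small statistic), which corresponds to the pattern one would hope to see when subgroup information is not aligned with the primary task.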

Statistics
The primary metrics for performance evaluation of the disease detection models include the area under the receiver operating characteristic curve (AUC), true positive rate (TPR), and false positive rate (FPR). TPR and FPR in subgroups are determined at a fixed decision threshold, which is optimized for each model to yield an FPR of 0.20 on the whole patient population. TPR is equal to sensitivity (and recall), while FPR is equal to 1 - specificity. We also report Youden's J statistic which is defined as J = sensitivity + specificity - 1, or simply J = TPR - FPR, providing a combined measure of classification performance. The relationship between TPR and FPR under different decision thresholds is illustrated in ROC curves. AUC and ROC curves allow the comparison of a model's classification ability independent of a specific decision threshold, while TPR and FPR allow the identification of threshold shifts causing subgroup disparities. SPLIT performance for race and sex classification is primarily measured with AUC. We also report TPR/FPR calculated for a decision threshold optimized for the highest Youden's J statistic. For the three-class race classification, we use a one-vs-rest approach for each racial group to measure classification performance. For all reported results, bootstrapping (stratified by targets) with 2000 samples was used to calculate 95% confidence intervals. 25 Two-sample Kolmogorov-Smirnov tests are used in the unsupervised feature exploration to determine p-values for the null hypothesis that the marginal distributions for a pair of subgroups are identical in PCA and logit space.
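The threshold selection, the subgroup metrics, and the stratified bootstrap can be sketched together. The scores, labels, and groups are synthetic, the function names are hypothetical, and choosing the threshold as a quantile of the negative-class scores is one simple way (an assumption here) to reach an FPR of about 0.20 on the whole population.

```python
import numpy as np

rng = np.random.default_rng(0)


def threshold_for_fpr(scores, y, target_fpr=0.20):
    """Threshold at which roughly target_fpr of all negatives score higher."""
    return float(np.quantile(scores[y == 0], 1.0 - target_fpr))


def tpr_fpr_j(scores, y, thr):
    pred = scores > thr
    tpr = float(pred[y == 1].mean())
    fpr = float(pred[y == 0].mean())
    return tpr, fpr, tpr - fpr  # Youden's J = TPR - FPR


def bootstrap_ci(scores, y, thr, n_boot=2000, alpha=0.05):
    """Percentile interval for J, resampling positives and negatives separately."""
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = np.concatenate([rng.choice(pos, len(pos)), rng.choice(neg, len(neg))])
        stats[i] = tpr_fpr_j(scores[idx], y[idx], thr)[2]
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


# Toy scores for a binary label across three hypothetical subgroups.
n = 3000
y = rng.integers(0, 2, size=n)
group = rng.integers(0, 3, size=n)
scores = 1.5 * y + rng.normal(size=n)

thr = threshold_for_fpr(scores, y)  # fixed once on the whole population
overall = tpr_fpr_j(scores, y, thr)
per_group = {g: tpr_fpr_j(scores[group == g], y[group == g], thr) for g in range(3)}
ci_low, ci_high = bootstrap_ci(scores, y, thr)
```

Keeping `thr` fixed while evaluating each subgroup is what exposes threshold shifts: a subgroup whose TPR and FPR both move in the same direction can keep a similar J while still being treated differently by the model.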

Role of the funding source
The funders had no role in study design, data collection, data analysis, data interpretation, or writing of the report. The authors had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Ethics
This research is exempt from ethical approval as the analysis is based on secondary data which is publicly available, and no permission is required to access the data.  Table S1 in the Supplementary Material.

Disparities in disease detection performance
While AUC seems largely consistent across groups and ROC curves appear similar in shape, we observe some clear TPR/FPR shifts when using a fixed threshold for the entire population. On CheXpert, Black patients have an increased FPR of 3% for 'no finding' and a decreased FPR of 4% for 'pleural effusion' compared to the target value of 20%. On MIMIC-CXR, we observe an increased FPR of 5% for Black patients and a decreased FPR of 3% for Asian patients for 'no finding'. For 'pleural effusion', we also observe a decreased FPR of 5% for Black patients, and some shifts across biological sex. These findings seem to confirm that a decision threshold optimized over the whole patient population may not generalize and lead to disparate performance in underrepresented subgroups. 14 It is important to note, however, that most of the observed shifts do not imply that the models perform worse (or better) on some subgroups, but that they rather perform differently. A change in FPR usually comes with a corresponding change in TPR, reflected in largely consistent values for Youden's J statistic across subgroups. Nonetheless, such changes in model behavior would have important clinical implications. For example, a consistently increased FPR for 'no finding' would mean that in practice some patients are more likely to be underdiagnosed than others. 14
Articles www.thelancet.com Vol 89 March, 2023
A key question is whether the model may have implicitly learned to perform differently on different subgroups due to specific biases in the training data and the model's ability to recognize patient characteristics from the images. To assess the effect of the training characteristics on the subgroup disparities, we evaluated performance on the CheXpert and MIMIC-CXR test sets, each using the model trained on the other dataset.
Because CheXpert and MIMIC-CXR vary substantially in the distribution of patient characteristics and prevalence of disease, we may expect to see differences compared to a model that was trained and tested on subsets from the same dataset. The results are given in Tables 4 and 5. We observe similar disparities across subgroups as before despite the different training characteristics of the CheXpert and MIMIC-CXR disease detection models. The shifts in TPR/FPR largely remain the same, suggesting that these disparities are potentially caused by the specific composition of the test sets, rather than by model bias.
To investigate the effect of the test set characteristics, we assessed performance of each disease detection model using strategic resampling to create race balanced test sets while controlling for age differences and disease prevalence in each racial group. The results are given in Tables 2 and 3 with the corresponding ROC curves shown in Figs. 3 and 4. We find that using resampled test sets has a large effect on reducing TPR/

SPLIT performance
The results for applying SPLIT to race classification on CheXpert are summarized in Fig. 5 (with additional metrics provided in    , which indicates that it is possible to successfully train a prediction layer to recognize race from chest X-ray using features learned for natural image classification. Here, the model trained on ImageNet, however, cannot have possibly learned to extract racial information as it has never seen any medical data, and yet, its features are useful for race classification when training a new prediction layer. In the case of ImageNet, we are likely seeing an example of scenario B discussed earlier, where the same features are useful for different tasks that are, however, unrelated on an output-level. We also observe higher than chance classification accuracy for the backbone with random weights (with AUC for White 0.70 (0.70-0.71), Asian 0.75 (0.74-0.76), and Black 0.67 (0.66-0.68)) which suggests that even random data transformations retain some signal from the raw images about racial identity. Fig. 6 and Table S3 provide results for applying SPLIT to biological sex classification. Here, we replaced the backbone trained on White patients with a backbone trained on male patients. Similar to race classification, we obtain high positive SPLIT results for all four backbones. We observe AUC values above 0.90 for the backbones trained on ImageNet, all patients, and male patients, which further confirms that even high SPLIT responses are insufficient for drawing conclusions whether specific patient information is encoded in the backbone and used for making predictions about the presence of disease. SPLIT results on MIMIC-CXR are also provided in Figs. 5 and 6 with more details in Tables S4 and S5. SPLIT results using a ResNet-34 on CheXpert are provided in the Supplementary Material in Fig. S3 and Tables S6 and S7, all leading to similar findings.   Tables 2  and 3). 
The multitask model also preserves similar high performance for both race and sex classification when compared to the corresponding single task models (cf. Figs. 5 and 6). The fact that for the multitask model the performance for the individual prediction tasks is largely unaffected may suggest that in our setting race and sex information is not informative for detection of disease. One may argue, though, that if a (single task) disease detection model is already capable of implicitly classifying sex and race, explicitly adding this information may not affect the performance. However, if such implicit encoding of race and sex in a single task model were as strong as the explicit use of the patient information in a multitask model, we would expect to find SPLIT responses on the single task disease detection backbone close to the performance of a multitask model when classifying race and sex. This is not the case, and in fact, there are large differences in average AUC between the multitask models and SPLIT of about 0.17 and 0.07 for race and sex classification, respectively. Here, the inspection of the feature representations discussed next aims to shed some light on the key differences in the way patient characteristics are encoded in the different models. Additional multitask results for a ResNet-34 on CheXpert with similar findings are provided in Table S1 in the Supplementary Material.
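To make the comparison of average AUC concrete, the following is a minimal sketch of a one-vs-rest AUC computation and of the gap between two models. It is not the authors' evaluation code: the function names, the toy probabilities, and the variable names `multitask_probs` and `split_probs` are hypothetical, and the pairwise AUC formula is the standard rank-based definition rather than anything specified in the text.

```python
from statistics import mean
from typing import Dict, List, Sequence


def binary_auc(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """AUC as the probability that a random positive sample scores higher
    than a random negative sample (ties count as one half)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:  # O(n^2) pairwise comparison, fine for a small illustration
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probs: List[Dict[str, float]], labels: List[str]) -> Dict[str, float]:
    """One AUC per class: that class is treated as positive, all others as negative."""
    classes = sorted(set(labels))
    return {c: binary_auc([p[c] for p in probs], [y == c for y in labels]) for c in classes}


# Hypothetical toy predictions (assumption: not data from the study).
labels = ["White", "White", "Asian", "Asian", "Black", "Black"]
multitask_probs = [
    {"White": 0.8, "Asian": 0.1, "Black": 0.1},
    {"White": 0.7, "Asian": 0.2, "Black": 0.1},
    {"White": 0.1, "Asian": 0.8, "Black": 0.1},
    {"White": 0.2, "Asian": 0.6, "Black": 0.2},
    {"White": 0.1, "Asian": 0.2, "Black": 0.7},
    {"White": 0.2, "Asian": 0.2, "Black": 0.6},
]
split_probs = [
    {"White": 0.5, "Asian": 0.3, "Black": 0.2},
    {"White": 0.3, "Asian": 0.3, "Black": 0.4},
    {"White": 0.4, "Asian": 0.4, "Black": 0.2},
    {"White": 0.3, "Asian": 0.2, "Black": 0.5},
    {"White": 0.2, "Asian": 0.3, "Black": 0.5},
    {"White": 0.4, "Asian": 0.3, "Black": 0.3},
]

multitask_auc = one_vs_rest_auc(multitask_probs, labels)
split_auc = one_vs_rest_auc(split_probs, labels)
# Difference in average one-vs-rest AUC between the two models.
gap = mean(multitask_auc.values()) - mean(split_auc.values())
print(multitask_auc, split_auc, round(gap, 3))
```

A large positive gap in this kind of comparison is what the paragraph above interprets as evidence that the single task backbone encodes the characteristic less strongly than a model trained on it explicitly.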

Unsupervised feature exploration
The inspection of the learned feature representations via PCA and t-SNE helps to uncover the relationship between patient characteristics and disease detection, allowing us to assess whether a model may be under scenario B or C (cf. Fig. 2). In Fig. 7, we show a variety of different plots produced for the single task disease detection model trained on CheXpert applied to the resampled test-set for an unbiased assessment. The feature representations typically align well with the ground truth labels, which can be observed clearly in the first mode of PCA separating samples that have different disease labels along the direction of largest variation. A similar pattern is observed in the t-SNE embedding and the logit outputs for 'no finding' and 'pleural effusion'. This is because the features are learned to be discriminative with respect to the disease detection task. Samples with the same disease labels should obtain similar feature values and logit outputs; hence, visible grouping will emerge in the scatter plots. We also observe a separation in the marginal distributions along the dimension that best separates the data. The idea of the unsupervised exploration is then to inspect whether other types of information may show similar patterns, which would indicate that the inspected information is related to the primary task. In such a case, we may have a strong indication that the model is under scenario C, and the inspected information may indeed be used for making predictions for the primary task.
[Figure caption fragment] Classification performance is determined in a one-vs-rest approach for each racial group. The first four columns are the race classification results for SPLIT using different neural network backbones. Columns five and six correspond to results from a single task race classification model and the multitask model trained jointly for disease, sex, and race classification. SPLIT performance on the ImageNet backbone and disease detection backbones is very similar.
If no patterns emerge, neither in the embeddings of the feature representations nor in the logit outputs, and there are no obvious differences in the marginal distributions across subgroups, one may be cautiously optimistic that the model is under scenario B. 26 For the single task disease detection model in Fig. 7, we observe no obvious patterns in the scatter plots for biological sex and race, neither in the feature embeddings nor in the logit outputs. This is despite the high SPLIT responses for race and sex classification (cf. Figs. 5 and 6). In contrast, we observe some visual patterns for age where younger patients are grouped together with features and logit outputs aligning with 'no finding', which is not surprising as age strongly correlates with the presence of disease. We also observe a slight shift in the marginal distribution of 'pleural effusion' for Black patients (Fig. 7d, fourth column) which explains the disparate performance in TPR/FPR. Fig. 8 provides the corresponding plots for the multitask model trained on CheXpert. Interesting observations can be made when comparing the feature representations of the multitask model with the ones obtained for the single task disease detection model, both trained and tested on the same set of patients. Recall that the multitask model is explicitly exposed to the patient characteristics during training. Here, we observe that biological sex and racial identity are strongly encoded in the feature representation of the multitask model, clearly separating the patients from different subgroups. However, we observe that the separation of subgroups is not aligned with the direction in feature space separating disease. Inspecting the PCA plots, in particular, we observe that biological sex becomes the predominant factor of variation encoded in the first PCA mode. Disease seems best separated in the second mode, while racial identity seems to be mostly encoded in the third and fourth modes of PCA.
Given that the modes are orthogonal (which is a property of PCA), this may indicate that the most discriminative features for disease, sex, and race are largely unrelated on the output-level which would resemble scenario B in Fig. 2. Similar to the single task model, no obvious patterns emerge for race and sex along the direction of disease. In the logit outputs of the disease prediction layer we again observe a shift for the marginal distribution of 'pleural effusion' for Black patients. More subtle interactions, however, may be missed by the visual inspection. For this reason, we also report p-values for statistical tests performed on all relevant pairs of subgroups when comparing their marginal distributions in PCA and logit space. The results are provided in Table 6. For the PCA modes that primarily encode disease (mode 1 for the single task model, mode 2 for the multitask model), the differences for all except one of the pairwise comparisons within race and sex subgroups are statistically non-significant. However, some tests indicate possible associations between disease and race in the feature space and the logit outputs. Asian patients show differences on the output for 'no finding' (when compared to White and Black patients). The tests also suggest an association of Black patients with the logit output of 'pleural effusion'. The statistical tests alone, however, are not sufficient indicators of disparate performance and need to be considered in combination with a comprehensive subgroup performance analysis.
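The pairwise comparison of marginal distributions described above can be sketched as follows. The passage does not name the statistical test, so the two-sample Kolmogorov-Smirnov test used here is an assumption, as are the function names and the toy logit values. The p-value uses the common asymptotic approximation with a small-sample correction and is only meant as an illustration.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def ks_two_sample(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Two-sample Kolmogorov-Smirnov statistic and an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    # Walk both sorted samples and track the largest gap between empirical CDFs.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # The alternating series is numerically 1 in this range.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def pairwise_tests(groups: Dict[str, Sequence[float]]) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Run the test for every pair of subgroups on one marginal (a PCA mode or a logit)."""
    return {
        (g1, g2): ks_two_sample(groups[g1], groups[g2])
        for g1, g2 in combinations(sorted(groups), 2)
    }


# Hypothetical logit values per subgroup (assumption: not data from the study).
logits = {
    "White": [-1.2, -0.8, -0.5, 0.1, 0.4, 0.9, 1.3, 1.8],
    "Asian": [-1.1, -0.9, -0.4, 0.0, 0.5, 1.0, 1.2, 1.7],
    "Black": [0.6, 0.9, 1.1, 1.4, 1.9, 2.2, 2.5, 3.0],
}
for pair, (stat, p_value) in pairwise_tests(logits).items():
    print(pair, round(stat, 3), round(p_value, 4))
```

With several subgroup pairs and several marginals tested at once, a correction for multiple comparisons would normally be considered before interpreting individual p-values.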

Discussion
The objective of this article was to highlight the general difficulties when trying to answer the question of what information is used when ML models make predictions. We have highlighted that SPLIT is insufficient and cannot provide definite answers. We argue that our proposed combination of test-set resampling, multitask learning, and unsupervised exploration of feature representations provides a comprehensive framework for assessing the relationship between the encoding of patient characteristics and disease detection. Our work fits well within the recent discussion of algorithmic auditing which specifically highlights subgroup analysis as an important component. 27 [...] both racial identity and biological sex in the context of image-based disease detection in chest X-ray. We found that previously reported disparities for 'no finding' disappeared when correcting for statistical subgroup differences using resampled test sets. 14 However, our analysis confirmed disparate performance for detecting 'pleural effusion' in Black patients. While we could not find strong evidence that race information is directly or indirectly used by the disease detection models, we have to remain careful as weak correlations between prediction tasks may not be detected due to limitations of using visual interpretation of the embedding plots. 26 The statistical analysis of the marginal distributions in PCA and logit space suggests some association between protected characteristics and prediction of disease. Identifying the underlying causes of disparate performance remains a challenge and will require further research. Besides the scarcity of data from underrepresented groups, which may limit the generalization capabilities of machine learning models, other sources of bias such as label noise are likely to contribute to subgroup disparities. 29,30
Label noise is of particular concern as it cannot be corrected for with strategies such as resampling, and instead, would require careful re-annotation of the dataset. A possible source of label noise is systematic misdiagnosis of certain subgroups causing a severe form of annotation bias. In the presence of multiple sources of bias, and the absence of specific knowledge (or assumptions) about the extent of bias, assessing model fairness is difficult. 31 It has been argued previously that integrating causal knowledge about the data generation process is key when studying performance disparities in machine learning algorithms. 18,32 It is worth highlighting that there is a very active branch of machine learning research aiming to develop methodology that can prevent (or at least discourage) the use of protected characteristics for decision making. 33 Here, the goal is to learn fair representations that do not discriminate against groups or individuals. 34 A popular approach in 'fair ML' is adversarial training where a secondary task model is employed during training of the primary task. [35][36][37] The secondary, adversarial model acts as a critic to assess whether the learned feature representations contain features predictive of subgroups. This is related to SPLIT, with the difference that the secondary task directly affects the learning of the feature representations, similar to multitask learning, but encouraging the active removal of predictive features during training. 38 Other approaches focus on fair predictions by auditing and correcting performance disparities across subgroups during and even after the primary task model has been trained. 39,40 These advances in fair ML are encouraging, in particular, in cases where we can identify the causes of disparities (e.g., specific biases in the training data) and we have reliable information that the use of protected characteristics would be harmful.
However, it is also worth highlighting that in many applications it may not be obvious that this is the case. In fact, some works in the fairness literature show that under certain circumstances in order to obtain fair machine learning models, the use of characteristics related to subgroup membership may be desired or even required. 41,42 Ethical limitations and regulatory requirements will also need to be considered for the development of such technical 'solutions'. 43 In any case, approaches for model inspection, such as unsupervised exploration of feature representations, will remain important to establish whether certain information may or may not be used for making predictions.
In conclusion, we would like to re-emphasize the need for rigorous validation of AI including assessment of performance across vulnerable patient groups. Reporting guidelines such as CONSORT-AI, 44 STARD-AI, 45 and others, advocate for complete and transparent reporting when assessing AI performance. A detailed failure case analysis with results reported on relevant subgroups is essential for gaining trust and confidence in the use of AI for critical decision making. Disparities across patient groups can only be discovered with detailed performance analysis which requires access to representative and unbiased test sets. [46][47][48] We believe no machine learning training strategy or model inspection tool alone can ever replace the evidence gathered from well designed and executed validation studies and these will remain key in the context of safe and ethical use of AI. 1,49 New frameworks for auditing AI algorithms will likely play an important role for clinical deployment. 28,50,51 We would hope that our work makes a valuable contribution by complementing these frameworks, offering a practical and insightful approach for subgroup performance analysis of image-based disease detection models.

Contributors
B.G. conceived and designed the study and conducted the experiments; B.G., C.J., M.B., and S.W. performed the statistical analysis, interpreted the results, and verified the underlying data; S.W. performed data preprocessing. The authors jointly conceptualized, edited, and reviewed the manuscript, and approved the final version of the manuscript.

Data sharing statement
All data used in this work is publicly available. The CheXpert imaging dataset together with the patient demographic information can be downloaded from https://stanfordmlgroup.github.io/competitions/chexpert/. The MIMIC-CXR imaging dataset can be downloaded from https://physionet.org/content/mimic-cxr-jpg/2.0.0/ with the corresponding patient demographic information available from https://physionet.org/content/mimiciv/1.0/.
All information to recreate the exact study sample used in this paper including splits of training, validation, and test sets, and all code that is required for replicating the results is available under an open source Apache 2.0 license in our dedicated GitHub repository https://github.com/biomedia-mira/chexploration.