Fairness in Predicting Cancer Mortality Across Racial Subgroups

This cohort study evaluates whether racial bias exists in a machine learning model that identifies cancer mortality risk among patients with solid malignant tumors.


Introduction
Machine learning (ML) models can transform cancer care by providing oncologists with more accurate and accessible information to augment clinical decisions. There are a variety of use cases for incorporating ML in oncology.1,2 Implementation of a mortality predictive model integrated into a decision support system that encourages clinicians to discuss end-of-life care with patients improved rates of these conversations.3 However, ML models may also encode bias and exacerbate existing health care disparities.4-7 For example, patients from minoritized racial and ethnic groups have poor access to health care,8 resulting in low sample sizes or missing or incomplete data in training datasets for ML. The use of models based on data lacking detail may further propagate inequities. Such models may demonstrate promising performance in the overall population but fail to meet standards for specific subgroups. While it is possible to create a separate model for each group, small sample sizes for specific groups may not allow for comprehensive implementation. Because race is socially constructed, its inclusion as a variable within prediction tools may lead to unwanted effects, including perpetuation of disparities.9,10 Conversely, excluding race may overlook important factors, such as social determinants of health, and continue to bias the algorithm.
When developing an ML model, it is important to evaluate for fairness using a variety of methods. President Biden's executive order on artificial intelligence (AI) highlighted the importance of developing AI tools that do not contribute to discrimination in health care.11 Approaches include examining the data used in development to identify issues in subgroup data quality that may limit model generalizability; measuring performance metrics to ensure there are no between-group discrepancies8; and calculating fairness metrics, which provide additional measurements of the extent of bias,12 with the most common metrics being equal opportunity, equalized odds, and disparate impact. Equal opportunity asserts that the true-positive rates (TPRs) between groups should be equal. Equalized odds specifies that the TPRs and the false-positive rates (FPRs) between groups should be equal. Disparate impact measures whether the positivity rate (ie, percentage of predicted positive [PPP]) is equal between groups.12

Oncologists may incorporate a predictive model in decision-making to identify patients for serious illness conversations (SICs), which occur between clinicians and patients and/or family members to elicit patients' values, preferences, and goals for medical care.13 Patients with cancer who engage in SICs often cite a better quality of life and goal-concordant care.14 However, most patients from minoritized racial and ethnic groups who have cancer die without a documented conversation.15 Oncology practitioners have identified many barriers to SICs, with patient and familial factors, such as difficulty accepting a poor prognosis and lack of agreement on decision-making, being among the most common.16 Uncertainty in estimating prognosis is another contributor; clinicians cannot identify patients at risk of short-term mortality using existing tools and often overestimate prognosis.17 Prognostic uncertainty, racial bias, and structural racism may lead clinicians to assume that patients from minoritized racial and ethnic groups do not want to have these conversations.18 Machine learning has the potential to help clinicians prognosticate, which can help prioritize patients for SICs. Caution is warranted to ensure that models do not exacerbate existing bias.
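To make these three definitions concrete, the following minimal Python sketch (illustrative only, not the study's code) computes the per-group quantities each metric compares; the arrays y and yhat for a given racial group are assumed inputs.

```python
import numpy as np

# Illustrative sketch: per-group rates underlying the three fairness metrics.
# y = true 180-day mortality labels, yhat = model predictions for one group.
def group_rates(y: np.ndarray, yhat: np.ndarray) -> dict:
    tp = np.sum((y == 1) & (yhat == 1))
    fn = np.sum((y == 1) & (yhat == 0))
    fp = np.sum((y == 0) & (yhat == 1))
    tn = np.sum((y == 0) & (yhat == 0))
    return {
        "tpr": tp / (tp + fn),      # equal opportunity compares TPRs
        "fpr": fp / (fp + tn),      # equalized odds also compares FPRs
        "ppp": (tp + fp) / len(y),  # disparate impact compares positivity rates
    }
```

Each fairness metric is then the ratio of the corresponding rate between two racial groups.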
This study describes assessments to identify racial bias (measurable statistical variation in model performance and/or fairness metrics across racial groups) during the development of an ML model that predicts mortality among patients with solid tumors. The model was developed with the intention of clinical use to identify patients for SICs, and the racial bias assessments were an important step to ensure safety prior to implementation.

Methods

Data Sources and Patient Selection
Patients were included in this cohort study if they had a date of cancer diagnosis recorded in the Mount Sinai Health System (MSHS) cancer registry between January 2016 and December 2021, were 21 years of age or older, and received care at an ambulatory cancer clinic within the MSHS. Data were obtained by matching records from the MSHS cancer registry, Social Security Death Index, and electronic health record and included all available retrospective data up to the date when databases were accessed for cohort extraction (February 2022). Our study followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline.19 The MSHS institutional review board approved this study and waived informed consent due to minimal risk based on the protected health information that was accessed.
The MSHS cancer registry is part of the national cancer registry; all data were abstracted by certified tumor registrars to adhere to the North American Association of Central Cancer Registries' standards.20 The Social Security Death Index was obtained from the Social Security Administration's death information master file and was used to determine the date of death. Variables included were admission-discharge-transfer events, laboratory results, assessments from nursing flow sheets, cancer stage, and cancer treatment plans (the eTable in Supplement 1 shows the full list). Cancer stage and status and patients' race and ethnicity were obtained from the cancer registry. Race categories were Asian, Black, Native American, White, and other (not broken down further in the database accessed) or unknown, and ethnicity categories were Hispanic, non-Hispanic, unknown, and missing. Cancer stage was specified as the stage at the time of incident cancer diagnosis. Cancer status was defined as present or not currently detectable (cancer was cured or in complete response from treatment). Race and ethnicity were recorded by certified tumor registrars as part of routine operations. Based on data availability and interval of measurements, different sampling logic was used to extract measurements for each variable (the eTable in Supplement 1 includes the number of measurements sampled). Missing values were imputed by using the median value of the variable over the entire cohort at the time of sampling.
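As a minimal sketch of this imputation step, assuming a pandas DataFrame with hypothetical laboratory columns (not the study's actual pipeline):

```python
import pandas as pd

# Fill each variable's missing entries with the median computed over the
# entire cohort, as described above (column names are hypothetical).
def impute_with_cohort_median(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median(skipna=True))
    return out

cohort = pd.DataFrame({"albumin": [3.1, None, 4.0], "creatinine": [1.0, 0.8, None]})
imputed = impute_with_cohort_median(cohort, ["albumin", "creatinine"])
```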

Model Creation
The target variable was 180-day mortality from the date of prediction. The period for prediction was set as the 6 months leading up to the most recent recording of the patient's vital status (eg, if the patient's last recording was in December 2021, the prediction period was defined as June to December 2021). For the retrospective validation cohort, each patient received 1 prediction. To accentuate the differences in data between deceased and living patients, we structured the death case profiles using data from the 30 days preceding death. This strategy was implemented to reduce the cosine similarity between the data representations of the two classes, thereby enhancing the ML process during training. Profiles were structured differently for records of living vs deceased individuals because the objective of the model was to predict the patient's chance of dying at any time within the next 180 days rather than to predict patient status specifically at day 180.
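A minimal sketch of this windowing logic, assuming a hypothetical patient record with patient_id, death_date, and last_vital_status_date fields (not the study's code):

```python
from datetime import timedelta
import pandas as pd

# For deceased patients, build the profile from the 30 days preceding death;
# for living patients, from the 6 months before the last vital-status record.
def profile_window(patient: pd.Series) -> tuple:
    if pd.notna(patient["death_date"]):
        end = patient["death_date"]
        start = end - timedelta(days=30)
        label = 1  # death occurs within 180 days of the prediction date
    else:
        end = patient["last_vital_status_date"]
        start = end - timedelta(days=180)
        label = 0
    return start, end, label
```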
The dataset was randomly split into the training dataset (70%) and the test dataset (30%).
Within the preliminary dataset, it was noted that there was a higher cumulative prevalence of records with a status of alive compared with dead. There were concerns that a training dataset that overrepresented one class in the target variable could lead to model underperformance. To manage this, the training dataset was adjusted using random undersampling; we randomly removed excess records of alive patients until both classes in the target variable were equally balanced. Ten-fold cross-validation was used to train the model using the random forest algorithm from the open-source Apache Spark ML library, and recursive feature elimination was used for feature selection.21,22 The importance of each feature was calculated using the Gini coefficient. Variable importance is the sum of the Gini coefficients for each measurement of a variable.
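A sketch of this training setup using the Apache Spark ML library is shown below. The DataFrame train (with label and features columns) and the parameter grid values are assumptions, and the recursive feature elimination loop is omitted; this is illustrative, not the study's code.

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Random undersampling: keep all death records, downsample alive records
# until the two classes are balanced (assumes a DataFrame `train`).
n_dead = train.filter("label = 1").count()
n_alive = train.filter("label = 0").count()
alive_sampled = train.filter("label = 0").sample(
    withReplacement=False, fraction=n_dead / n_alive, seed=42
)
balanced = train.filter("label = 1").union(alive_sampled)

# 10-fold cross-validation over a Spark ML random forest.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
grid = ParamGridBuilder().addGrid(rf.numTrees, [100, 200]).build()  # hypothetical grid
cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
    numFolds=10,
)
model = cv.fit(balanced)

# Gini-based importances; variable importance sums these across a
# variable's sampled measurements.
importances = model.bestModel.featureImportances
```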

Model Assessment
The primary outcomes for the study were discriminatory performance and fairness metrics among each race category (Asian, Black, Native American, White, and other or unknown) in the test dataset.
Overall model discriminatory performance was measured using the F1 score and the area under the receiver operating characteristic curve (AUROC).23,24 Fairness metrics were equal opportunity, equalized odds, and disparate impact. Based on consensus among the clinical operational leadership (including C.B.S.) who stewarded the planned implementation of the tool, it was prespecified that a threshold of 80% (ie, a fairness metric ratio between 0.80 and 1.25 [1/0.8]) was evidence of no racial bias. We also considered a stricter threshold of 90% (ratios between 0.90 and 1.11) as a sensitivity analysis.
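As a sketch of how this prespecified check can be applied (illustrative function and values, not the study's code):

```python
# A metric ratio between two groups counts as showing no bias if it falls
# within [threshold, 1/threshold]: 0.80-1.25 primary, 0.90-1.11 sensitivity.
def within_threshold(group_metric: float, reference_metric: float,
                     threshold: float = 0.80) -> bool:
    ratio = group_metric / reference_metric
    return threshold <= ratio <= 1.0 / threshold

print(within_threshold(0.72, 0.78))        # ratio ~0.92 -> True at the 80% threshold
print(within_threshold(0.72, 0.78, 0.90))  # also True at the stricter 90% threshold
```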

Statistical Analysis
Descriptive statistics were used to evaluate patient characteristics in the test cohort. The predicted mortality was compared with the actual mortality. The equal opportunity ratio was calculated as the ratio of TPRs. The equalized odds ratio comprised both the ratio of TPRs and the ratio of FPRs. The disparate impact ratio was the ratio of positivity rates. Ratios of TPRs, FPRs, and PPPs and their 95% CIs were computed.
The formula used for 95% CIs was exp(ln[ratio of proportions] ± [Z × SE]). All statistical calculations were performed using Python, version 3.9.13 (Python Software Foundation).
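A minimal sketch of this CI computation, assuming the standard log (Katz) method for the SE of a ratio of two proportions and made-up group counts:

```python
from math import exp, log, sqrt

# 95% CI for a ratio of two proportions: exp(ln[ratio] +/- Z * SE), with
# SE = sqrt(1/events_a - 1/total_a + 1/events_b - 1/total_b) (Katz method).
def ratio_ci(events_a, total_a, events_b, total_b, z=1.96):
    ratio = (events_a / total_a) / (events_b / total_b)
    se = sqrt(1 / events_a - 1 / total_a + 1 / events_b - 1 / total_b)
    return ratio, exp(log(ratio) - z * se), exp(log(ratio) + z * se)

# Example: TPR ratio between two racial groups (hypothetical counts).
print(ratio_ci(events_a=90, total_a=120, events_b=80, total_b=100))
```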

Results
Of the cohort, 19.2% had other or unknown race, and 0.1% had missing race data. A total of 9.6% were Hispanic and 83.6%, non-Hispanic; 6.7% had unknown ethnicity, and 0.1% had missing ethnicity data (Table 1). The Native American cohort was small (n = 45) and was not included in subsequent subgroup analyses. The most common cancers were breast (20.3%), genitourinary (19.9%), and gastrointestinal (19.1%). Among patients who were dead compared with those who were alive, there was a higher proportion of patients who were older than 65 years (63.8% vs 47.8%) and who were Black (25

Discussion
This study comprehensively assessed potential racial bias in a model predicting mortality for patients with solid tumors. In anticipation of clinical implementation, the AUROC and F1 score metrics demonstrated good agreement across the 4 analyzed racial subgroups, indicating similar performance for all groups (ie, no evidence of racial bias). Furthermore, all 6 comparisons of racial categories met all 3 fairness metrics within our safety threshold established prior to expected deployment. It is crucial to recognize that fairness metrics are descriptive tools for operational leaders, not prescriptive mandates. The decision to implement this model ultimately hinges on the interpretation of these metrics within the specific clinical context. While some numerical variation was observed in the fairness metric scores, all ratios remained within the prespecified range of 0.80 to 1.25 (1/0.8), suggesting that the prediction model is fair.
Our workflow prioritized achieving equal opportunity over the stricter equalized odds criterion if the latter proved to be impractical. A positive prediction triggers an SIC, which focuses on understanding the patient's goals of care. False positives leading to SICs for low-risk patients are unlikely to cause harm, as the intervention is merely a conversation. Furthermore, the prediction is meant to supplement clinical judgment, not to replace it. Clinicians can initiate conversations with any patient for whom they deem the conversation to be necessary, regardless of the model output. Therefore, a slight discrepancy in FPRs across race categories might not be significantly detrimental in this context.

Table 3. Fairness Metrics Comparisons Across Races

a Other race was not broken down further in the database accessed.