Phenotyping Hepatic Immune-Related Adverse Events in the Setting of Immune Checkpoint Inhibitor Therapy

PURPOSE We present and validate a rule-based algorithm for the detection of moderate to severe liver-related immune-related adverse events (irAEs) in a real-world patient cohort. The algorithm can be applied to studies of irAEs in large data sets. METHODS We developed a set of criteria to define hepatic irAEs. The criteria include: the temporality of elevated laboratory measurements in the first 2-14 weeks of immune checkpoint inhibitor (ICI) treatment, steroid intervention within 2 weeks of the onset of elevated laboratory measurements, and intervention with a duration of at least 2 weeks. These criteria are based on the kinetics of patients who experienced moderate to severe hepatotoxicity (Common Terminology Criteria for Adverse Events grades 2-4). We applied these criteria to a retrospective cohort of 682 patients diagnosed with hepatocellular carcinoma and treated with ICI. All patients were required to have baseline laboratory measurements before and after the initiation of ICI. RESULTS A set of 63 equally sampled patients were reviewed by two blinded, clinical adjudicators. Disagreements were reviewed and consensus was taken to be the ground truth. Of these, 25 patients with irAEs were identified, 16 were determined to be hepatic irAEs, 36 patients were nonadverse events, and two patients were of indeterminant status. Reviewers agreed in 44 of 63 patients, including 19 patients with irAEs (0.70 concordance, Fleiss' kappa: 0.43). By comparison, the algorithm achieved a sensitivity and specificity of identifying hepatic irAEs of 0.63 and 0.81, respectively, with a test efficiency (percent correctly classified) of 0.78 and outcome-weighted F1 score of 0.74. CONCLUSION The algorithm achieves greater concordance with the ground truth than either individual clinical adjudicator for the detection of irAEs.


INTRODUCTION
Immune-related adverse events (irAEs) associated with immune checkpoint inhibitor (ICI) therapy can lead to premature termination of therapy, reduce quality of life, cause organ damage, and can be fatal. [10][11] Infrastructure and approaches to report irAEs in real-world settings remain under development and are less standardized than in clinical trials. 12,13 Direct drug-induced liver toxicity is identified using validated instruments, such as the Roussel-UCLAF Causality Assessment Method (RUCAM). 14 The RUCAM provides a highly sensitive, highly specific detection instrument when used by clinical domain experts to evaluate individual patients. However, the RUCAM is intended to be applied at the level of the individual patient and can suffer from reduced performance because of subjectivity and limited reliability. 15 An electronic update to the RUCAM, the Revised Electronic Causality Assessment Method (RECAM), was recently reported based on registry data, providing similar performance to the RUCAM. 15 Both the RUCAM and RECAM focus on clinical diagnosis. The presentation of liver injury is heterogeneous and may vary greatly with the type of medication and patient population. In particular, irAEs associated with ICI therapies may present over a prolonged duration following initiation. 15 As such, ICI-specific tools to identify hepatic irAEs are needed.
Here, we present and validate a rule-based computable phenotype for the detection of moderate to severe hepatic irAEs in the setting of ICI therapy using longitudinal patient-level electronic health record (EHR) data. Our phenotype is intended to identify hepatic irAEs in retrospective, real-world, large-scale clinical data where individual chart review is not feasible, in a cohort of patients with hepatocellular carcinoma (HCC) and underlying chronic liver disease in whom biopsies are unlikely to be performed. We use a minimal set of features for ease of use and to facilitate future predictive modeling, where a high-dimensional variable space may introduce sparseness and create selection bias. Our phenotype builds upon the RUCAM/RECAM criteria. We use longitudinal patient-level EHR data, particularly changes in laboratory values and the chronology of ICI treatment and of medications used to treat clinically actionable moderate to severe irAEs. Using this approach, we provide a framework to test other associations between irAEs and patient characteristics, and a set of chronologic criteria to facilitate physician review of potential irAEs.

METHODS
We conducted a retrospective study using data derived from a national cohort spanning patients in the Veterans Health Administration (VHA). 16 We obtained electronic data on all patients who initiated ICI treatment in the US Department of Veterans Affairs (VA) system using the VA Corporate Data Warehouse, a national, continually updated repository of VHA electronic health records developed specifically to facilitate research. 17 The study was approved by the Research and Development Committee and Institutional Review Board of the VA Boston Healthcare System and received a waiver of informed consent because the study presented minimal risk and could not practicably be conducted without a waiver.
Our cohort comprised patients diagnosed with HCC and treated with ICI. Patients with HCC were identified using a previously validated algorithm between January 1, 2015, and June 30, 2021 (Fig 1). 18 Relevant ICI therapies were adjudicated from a list of therapies derived from the HemOnc ontology. 19 We defined the date of first ICI as the earliest ICI order date and took this to be the index date for this study.
We restricted our study to patients with at least one baseline laboratory measurement of ALT, AST, or bilirubin (TBIL) within 42 days before the initiation of ICI therapy (see the Data Supplement, Methods: Baseline laboratory value determination and Consideration of RUCAM/RECAM criteria and determination of time cutoffs, and the Data Supplement, Results: Data Supplement, Fig S1 and Table S1, for additional details). We also required all included patients to have relevant therapy orders present in the VA instantiation of the Observational Medical Outcomes Partnership (OMOP) data model 20 and to have at least one hematology-oncology progress note to facilitate chart review.
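The baseline laboratory requirement above can be illustrated with a minimal sketch. The function name `has_baseline_lab` and the boundary handling (day 42 included, the index date itself excluded) are assumptions made for illustration and are not specified in the text.

```python
from datetime import date, timedelta

BASELINE_WINDOW_DAYS = 42  # look-back window stated in the Methods


def has_baseline_lab(lab_dates, index_date, window_days=BASELINE_WINDOW_DAYS):
    """Return True if any ALT/AST/TBIL measurement date falls in the
    look-back window before the index date (first ICI order date).

    Assumption: the window is [index_date - window_days, index_date),
    i.e., day 42 is included and the index date itself is excluded.
    """
    earliest = index_date - timedelta(days=window_days)
    return any(earliest <= d < index_date for d in lab_dates)


if __name__ == "__main__":
    index = date(2020, 2, 1)
    print(has_baseline_lab([date(2020, 1, 1)], index))   # lab 31 days before
    print(has_baseline_lab([date(2019, 12, 1)], index))  # lab 62 days before
```

Whether measurements taken on the index date should count as baseline is a design choice that would need to be confirmed against the Data Supplement.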
We considered the criteria of the RUCAM/RECAM in conjunction with expert clinical domain opinion as well as the role of patient history and the influence of comorbidities, which may elevate ALT, AST, or TBIL directly through presentation or indirectly as the result of associated medications, using a phecode-only PheWAS (see the Data Supplement, Methods: Analysis of comorbidities and patient-informed kinetics and Consideration of RUCAM/RECAM criteria and determination of time cutoffs, and the Data Supplement, Results: Data Supplement, Figs S2 and S3 and Table S2, for additional details).
We identified patients with baseline chronic steroid exposure, defined as a steroid exposure era lasting longer than 90 days in the 6 months before the initiation of ICI (n = 8 patients; see the Data Supplement, Methods: Chronic steroid exposure, for additional details). For patients who experienced changes in ALT, AST, and/or TBIL grades after the initiation of ICI, we calculated the time differences between the ICI cycle start/stop, steroid interval start/stop, and the change in ALT/AST/TBIL grade start/stop (Fig 2).
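As a rough sketch of the chronic steroid flag described above, the following assumes that steroid exposure eras are available as (start, end) date pairs. The helper name `is_chronic_steroid`, the 183-day approximation of 6 months, and the overlap logic are illustrative assumptions; the exact definition is given in the Data Supplement.

```python
from datetime import date, timedelta


def is_chronic_steroid(eras, index_date, min_days=90, lookback_days=183):
    """Flag baseline chronic steroid exposure.

    eras: list of (start_date, end_date) steroid exposure eras.
    Assumptions (not confirmed by the text): an era counts if it begins
    before the index date, overlaps the look-back window, and lasts
    strictly longer than min_days; 6 months is approximated as 183 days.
    """
    window_start = index_date - timedelta(days=lookback_days)
    for start, end in eras:
        overlaps_window = start < index_date and end >= window_start
        if overlaps_window and (end - start).days > min_days:
            return True
    return False


def days_between(event_date, reference_date):
    """Signed day difference, e.g., elevation onset relative to ICI start."""
    return (event_date - reference_date).days


if __name__ == "__main__":
    index = date(2020, 7, 1)
    print(is_chronic_steroid([(date(2020, 1, 1), date(2020, 6, 1))], index))
    print(days_between(date(2020, 8, 19), index))  # 49 days, i.e., week 7
```

The same signed differences can be computed between an elevation and the start of steroids or the end of ICI, which are the intervals the phenotype compares against its cutoffs.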
We selected the time cutoffs based on the reported time course of hepatic irAEs and clinical practice to identify irAEs 6 and on the observed kinetics of laboratory value elevations and steroid treatments in our patient cohort (see the Data Supplement, Methods: Analysis of comorbidities and patient-informed kinetics, and Consideration of RUCAM/RECAM criteria and determination of time cutoffs, and the Data Supplement, Table S2, for additional details). We first grade elevations of ALT, AST, and/or TBIL. After assigning each elevation a Common Terminology Criteria for Adverse Events (CTCAE) grade, we iterate through each elevation present in the patient's EHR. We classify any elevation as an acute hepatic irAE if the criteria summarized in Fig 2B are met. Regardless of grade, if all criteria are satisfied but the patient has been identified as having baseline chronic exposure to steroids, the event is classified as indeterminant. An elevation that does not satisfy the criteria is classified with non-irAE status.
We assessed the performance of our initial detection algorithm against the results of clinical expert chart review (see the Data Supplement, Methods: Chart review, for additional details) and took the consensus as the ground truth. Multiclass algorithm classification performance was then assessed by comparing the algorithm classification to the label determined by chart review on a per-class basis. The confusion matrix was determined for each class and then specificity, sensitivity, positive predictive value, negative predictive value, and accuracy were computed (see the Data Supplement, Methods: Confusion matrix definitions and Multiclass algorithm performance, for additional details).
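The classification rules described in this section and in the Fig 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the `Elevation` record and its field names are hypothetical, "within 2 weeks" is read as 0 to 14 days after the elevation, and the grade 2 rule is assumed to apply regardless of chronic steroid history.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Elevation:
    """One graded ALT/AST/TBIL elevation (hypothetical record layout)."""
    grade: int                          # CTCAE grade of the elevation
    weeks_since_ici: float              # onset after ICI start or most recent cycle
    days_to_steroid: Optional[float]    # None if no steroid was given
    steroid_duration_days: float = 0.0
    days_to_ici_stop: Optional[float] = None  # None if ICI was not stopped


def classify(e: Elevation, chronic_steroid: bool) -> str:
    """Return 'irAE', 'indeterminant', or 'non-irAE' for one elevation."""
    # (a) onset between 2 and 14 weeks after ICI initiation / latest cycle
    in_window = 2 <= e.weeks_since_ici <= 14
    # (b) steroids started within 2 weeks of the elevation (assumed: after it)
    steroid_prompt = e.days_to_steroid is not None and 0 <= e.days_to_steroid <= 14
    # (c) steroids continued for at least 2 weeks
    steroid_sustained = e.steroid_duration_days >= 14

    if in_window and steroid_prompt and steroid_sustained:
        return "indeterminant" if chronic_steroid else "irAE"

    # Additional grade 2 rule: ICI stopped within 2 weeks of the event
    ici_stopped = e.days_to_ici_stop is not None and 0 <= e.days_to_ici_stop <= 14
    if e.grade == 2 and in_window and ici_stopped:
        return "irAE"
    return "non-irAE"


if __name__ == "__main__":
    event = Elevation(grade=3, weeks_since_ici=7, days_to_steroid=3,
                      steroid_duration_days=21)
    print(classify(event, chronic_steroid=False))
    print(classify(event, chronic_steroid=True))
```

Keeping each criterion as a named boolean makes it straightforward to report which rule an elevation failed, which is the kind of time-parameter output that can streamline chart review.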
We also determined patient characteristics stratified by baseline grade based on EHR data (see the Data Supplement, Methods: Baseline patient characteristics, for additional details).
To demonstrate potential applications of our algorithm for population-scale studies, we calculated the incidence of immune-mediated hepatitis by type and compared univariate relationships between baseline elevations of laboratory values, previous treatment with tyrosine kinase inhibitors (TKIs), median peak eosinophil to leukocyte ratio (ELR), median peak neutrophil to lymphocyte ratio (NLR) during 0-14 weeks of treatment, and the intrapatient variability of ELR and NLR between those patients deemed to have experienced an irAE and those deemed not to have experienced an irAE (see the Data Supplement, Methods: Calculations for downstream applications of the irAE phenotype to prior TKI exposure and derived ELR and NLR, for additional details).
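A peak ratio such as the peak NLR or ELR during the first 14 weeks could be derived along the following lines. The function `peak_ratio` and the tuple layout of its input are hypothetical; the study's actual calculation is described in the Data Supplement.

```python
def peak_ratio(measurements, max_week=14):
    """Return the highest numerator/denominator ratio observed in weeks
    0..max_week of treatment, or None if no usable measurement exists.

    measurements: iterable of (week, numerator, denominator), e.g.
    (week, neutrophil_count, lymphocyte_count) for NLR. Rows with a
    non-positive denominator are skipped (an assumption made here to
    avoid division by zero).
    """
    ratios = [
        num / den
        for week, num, den in measurements
        if 0 <= week <= max_week and den > 0
    ]
    return max(ratios) if ratios else None


if __name__ == "__main__":
    # Hypothetical counts; the week-20 value lies outside the window.
    labs = [(1, 4.0, 2.0), (5, 9.0, 1.5), (20, 30.0, 1.0)]
    print(peak_ratio(labs))
```

The per-patient peaks can then be summarized by their median within the irAE and non-irAE groups.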
For all categorical variables with counts, P values were calculated using Pearson's chi-squared test for count data as implemented in the stats R package, while continuous variables, including age and Charlson score, were compared using the Kruskal-Wallis test. 21

RESULTS
We identified 682 patients overall diagnosed with HCC and treated with ICI between January 1, 2015, and June 30, 2021, who met our inclusion criteria (Fig 1). Patient characteristics are described in Table 1 (see the Data Supplement, Results: Extended patient characteristics, for additional details). We considered history of previous decompensation events associated with cirrhosis. 22,23 A statistically significantly greater proportion of patients with one or more baseline laboratory values equivalent to grade 1 or higher toxicity had at least one decompensation event compared with patients whose laboratory values were within normal range. Previous local/regional therapy showed statistically significant differences at the P < .05 level when comparing across baseline grades. Most patients whose baseline laboratory values correspond to toxicity grades 0-2 were treated with transcatheter arterial chemoembolization before ICI initiation, while most patients with baseline laboratory values equivalent to grade 3 and 4 toxicities did not receive previous local or regional therapy.

Interannotator Agreement
Sixty-three of 682 patients were selected for annotation to create a balanced sample of patients based on the number of patients determined to be irAEs, no irAEs, and indeterminant status by an initial version of the algorithm (see Table 2 and the Data Supplement, Methods: Chart review, for additional details). Of these patients, expert adjudicators agreed on the irAE status of 44 of 63 patients (69.8%, Fleiss' kappa: 0.434). Annotators agreed on the date of first dose of ICI, defined as a date difference between reviewers of 7 days or less, in 55 of 63 patients (87.3% concordance with a median absolute difference of 0 days and IQR of [0-2] days).
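For readers unfamiliar with the agreement statistic, the standard Fleiss' kappa formula can be sketched as below. The rating table in the example is hypothetical and is not the study's annotation data; the function name `fleiss_kappa` is likewise illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned subject i to
    category j; every row must sum to the same number of raters (>= 2).
    Raises ValueError when chance agreement is 1 (all ratings in a
    single category), where kappa is undefined.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement: mean proportion of agreeing rater pairs per subject
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall category proportions
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical table: 5 patients, 2 raters,
    # categories = (irAE, non-irAE, indeterminant)
    table = [[2, 0, 0], [2, 0, 0], [0, 2, 0], [1, 1, 0], [0, 0, 2]]
    print(round(fleiss_kappa(table), 3))
```

Raw percent agreement is simply the fraction of patients on whom both reviewers agree; 44 of 63 corresponds to the 69.8% reported above.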

Algorithm Performance
Performance results and confusion matrices are presented in Table 3 (see extended results in the Data Supplement, Table S3, for additional details). Of 63 adjudicated patients, 25 were determined to be irAEs of any CTCAE grade or type, of which 16 were determined to be hepatic irAEs, two were determined to be indeterminant, and the remaining 36 patients did not present any irAE based on chart review consensus.
The algorithm achieves an F1 score of 0.74, weighted by prevalence of irAEs. For detection of hepatic irAEs, the specificity of the algorithm is 0.81 and the sensitivity is 0.63, with a resulting positive predictive value (PPV) of 0.53, negative predictive value (NPV) of 0.82, and a test efficiency of 0.76. 24 In the context of hepatic irAEs, the algorithm achieves a percent concordance of 74.6%, which corresponds to a Fleiss' kappa of agreement with the ground truth (consensus of expert opinion) of 0.40 (P < .001). The algorithm correctly identifies seven of nine available event dates for hepatic irAEs (77.8%); 10 of 10 types of laboratory values associated with the event (100%); and seven of 10 (70%) grades of severity, including one of two (50%) CTCAE grade 4 irAEs, five of six (83.3%) grade 3 irAEs, and one of two (50%) grade 2 irAEs. By comparison, the algorithm achieves a specificity of 0.82 and sensitivity of 0.44 for detection of any irAE, with a resulting PPV of 0.63 and NPV of 0.68, an overall efficiency of 0.68, and a prevalence-weighted F1 score of 0.65.
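The per-class metrics reported here follow the usual confusion-matrix definitions, which can be sketched as below. The counts in the example are hypothetical and are not taken from Table 3; `binary_metrics` and `weighted_f1` are illustrative names, and the prevalence weighting is assumed to be a support-weighted average of per-class F1 scores.

```python
def binary_metrics(tp, fp, fn, tn):
    """One-vs-rest metrics for a single class from its confusion matrix.

    Assumes every denominator is non-zero (no handling of empty classes).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    efficiency = (tp + tn) / (tp + fp + fn + tn)  # percent correctly classified
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "efficiency": efficiency,
        "f1": f1,
    }


def weighted_f1(per_class_f1, per_class_support):
    """Average per-class F1 scores weighted by the number of true cases
    in each class (assumed meaning of 'prevalence-weighted')."""
    total = sum(per_class_support)
    return sum(f * s for f, s in zip(per_class_f1, per_class_support)) / total


if __name__ == "__main__":
    # Hypothetical counts for illustration only
    for name, value in binary_metrics(tp=8, fp=2, fn=2, tn=8).items():
        print(f"{name}: {value:.2f}")
    print(weighted_f1([1.0, 0.5], [3, 1]))
```

In a multiclass setting, `binary_metrics` would be applied once per class (irAE, non-irAE, indeterminant) by treating that class as positive and all others as negative.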
By contrast, the all-comers algorithm, in which any elevation of ALT, AST, or TBIL is inferred to be an irAE, achieves sensitivities of 1 and 0.92 for hepatic and nonhepatic irAEs, respectively, with a specificity of 0.21 for both hepatic and any irAE. The approach identifies the correct event date in 10 of 15 (66.7%) hepatic irAEs, the correct type of laboratory value in 14 of 16 detected patients (87.5%), and the correct grade in 12 of 16 detected patients (75%), including one of one grade 1 hepatic irAEs (100%), one of three grade 2 detected hepatic irAEs (33.3%), nine of 10 (90%) grade 3 detected hepatic irAEs, and one of two (50%) grade 4 hepatic irAEs correctly detected (see the Data Supplement, Results, Table S3 and Fig S4, for additional details).

Error Analysis
We undertook a root cause analysis to understand the performance of the algorithm. Of six irAEs misclassified as non-irAEs, three of six (50%) were events corresponding to CTCAE grade 1 severity, below the intended design specifications of the algorithm. Of these, one was misclassified because of the date of the adverse event relative to ICI initiation and two because of the lack of steroid treatment, consistent with the clinical management of grade 1 hepatic irAEs. 1,6,8,11 One grade 2 irAE was misclassified because of an initial date greater than allowed by our rules and one grade 2 event was misclassified because of no use of steroids, which may reflect irAE management and differences in clinicians' choice of management with additional knowledge of the patient's history. 1,6,8,11 One grade 3 irAE was misclassified because of a steroid treatment duration that exceeded the duration selected in the rules, which reflects the wide and varied presentation of immune-mediated hepatitis (see the Data Supplement, Results: Patient-informed kinetics, Extended patient characteristics, and Analysis of false positives, for additional details).
Baseline FIB-4 score unavailability reflects the requirement that PLT and ALT/AST be derived from laboratory reports within 14 days of one another and most proximate to and within 42 days before the initiation of ICI.
b May sum to larger than the stratum of the cohort as the same patient may be in multiple strata, for example, a patient may have multiple types of decompensation events or previous therapy.
Unknown previous targeted therapy exposure reflects the requirement that drug era be within 67 days of an encounter coded with an ICD-9/10 code for HCC.

Downstream Applications of irAE Detection/ irAE Incidence
We first used our algorithm to calculate the incidence of immune-mediated hepatitis by type (see the Data Supplement, Results: Incidence of irAEs in the cohort, and the Data Supplement, Table S4, for additional details).

DISCUSSION
Overall, the performance of the algorithm is comparable with the agreement of individual clinician experts when evaluating retrospective patients for hepatic irAEs. Clinical domain experts achieve somewhat higher concordance when identifying irAEs compared with the algorithm, 69.8% compared with 62.5%, but the algorithm's overall agreement with the ground truth, as indicated by the test efficiency, is 76.2%, exceeding the concordance of annotators. Moreover, the algorithm exceeds interannotator concordance for hepatic irAEs on the basis of the presenting laboratory value, events of grade 2 or 3 severity, and the grade of severity. The algorithm also exceeds interannotator concordance for the date of the adverse event. Although some of these performance differences may reflect typographical errors on the part of annotators, such factors would also be present in any chart abstraction and highlight some of the advantages of implementing automated irAE detection.
NOTE. Here, the irAE algorithm is compared with the all-comers approach, in which any elevation of ALT, AST, or TBIL is inferred to be an irAE (an algorithm selected for sensitivity = 1). Abbreviations: Con., concordance; Eff., efficiency (accuracy); FN, false negative; FP, false positive; irAE, immune-related adverse event; Macro.F1, macro-weighted F1 score; PPV, positive predictive value; Prev., prevalence; Sens., sensitivity; Sp., specificity; TN, true negative; TP, true positive; W.F1, prevalence-weighted F1 score.
The algorithm achieves high specificity (0.81) and NPV (0.84) but modest sensitivity (0.63) and PPV (0.53) for detection of hepatic irAEs. Although not intended as a causality assessment tool for clinical decision making, to provide context for these performance metrics, recent drug-induced liver toxicity algorithms based on the RUCAM assessment report PPVs between 0.01 and 0.402. 25,26
NOTE. Peak NLR in the first 14 weeks of ICI treatment was significantly higher in those patients who were determined to have experienced a hepatic irAE than those who did not at the level of P < .05. NLR variability was also observed to be higher in those patients who experienced irAEs but did not rise to the level of significance of P < .05. ELR and NLR counts reflect availability of data adhering to (1) baseline measurements and (2) temporal definitions in the first 14 weeks of treatment. Abbreviations: ELR, eosinophil to leukocyte ratio; ICI, immune checkpoint inhibitor; irAEs, immune-related adverse events; NLR, neutrophil to lymphocyte ratio.
In comparison, we also evaluated an all-comers algorithm to mimic the approach used in some randomized controlled trials of ICI 1 and to mimic a design tuned to a sensitivity of 1. 25,26 We see that, while this approach does indeed detect all true patients with irAEs, it does so with the expected loss of specificity. As severe irAEs are relatively rare, this loss of specificity may be particularly detrimental for applications of case screening, where the algorithm may overestimate the caseload requiring subsequent screening. This contrasts with the use case of a clinical trial, in which the clinician may have additional knowledge of patient history to attribute elevations in ALT, AST, and TBIL to ICI.
Patient characteristics highlight the difficulty in identifying irAEs by routine laboratory biomarkers as well as justify the focus on the temporal relationship between elevations in AST, ALT, and TBIL, and prescribing patterns of steroids associated with the clinical management of irAEs.
Despite the modest PPV, our approach identifies NLR as a distinguishing marker of irAE, supported by previous studies 9 and suggests that there is no direct relationship between exposure to TKI therapy before ICI and the development of irAEs.
Previous studies used emergent encounters and steroid treatment to identify ICI-mediated irAEs to examine outcomes. 27,28 NLR and detection of irAEs was found to be prognostic of irAEs, not necessarily hepatic in nature, but in a cohort of patients in which irAEs were identified by clinician adjudication. 12 Here, we demonstrate a computable hepatic irAE phenotype using only routine laboratory components and medication orders extracted from the EHR. The output of our phenotype (1) can be useful in large-scale EHR-based studies to identify hepatic irAEs where chart abstraction would be unfeasible, (2) includes considerations of onset timing relevant to the extended courses associated with ICI therapy (>90 days after initiation), and (3) if desired, summarizes and can provide output of time parameters to streamline chart review.
As with any cohort study, our phenotype has several limitations. It was developed using a cohort of patients from the VHA, which is composed predominantly of male patients, which may limit its applicability to other cohorts or study populations. Our cohort inclusion criteria included a diagnosis of HCC and, as such, may not be applicable to cohorts of patients with different diagnoses. Future work will examine its applicability to other diagnoses. For our purposes, we only seek to identify hepatic irAEs. However, patients may still experience a range of nonhepatic irAEs. We recognize that our phenotype has modest performance as a tool intended to facilitate retrospective identification of irAEs and not intended as a tool for clinical decision making.
In conclusion, we developed a rule-based algorithm to retrospectively identify hepatic irAEs in a national cohort of patients with HCC using continuous monitoring of routine biomarkers. Our approach achieves comparable performance to clinical experts and provides a framework for facilitating clinician review. We obtain a modest PPV that improves upon reported RUCAM results by 31.8% and has a specificity of 0.81 and overall efficiency of 0.76 despite using only biomarkers that are neither sensitive nor specific to irAEs.

FIG 1. Flow diagram of the HCC cohort studied. Six hundred eighty-two patients treated with ICI for HCC met inclusion criteria. Patients are stratified by their maximum baseline grade of either ALT, AST, or TBIL. HCC, hepatocellular carcinoma; ICI, immune checkpoint inhibitor; OMOP, Observational Medical Outcomes Partnership; TBIL, bilirubin; VA, US Department of Veterans Affairs.
FIG 2. irAE detection algorithm. (A) A conceptual example of the time intervals considered in our hepatic irAE phenotype. These include: the time intervals between elevations in ALT, AST, and/or TBIL, ICI initiation/end, the introduction of steroid treatment, and the duration of steroid treatment. In this example, the irAE occurred at week 7 after ICI initiation. Here, steroid therapy was given for more than 2 weeks, and both steroid initiation and cessation of ICI therapy occurred within 2 weeks (in this example, within 1 week) after the elevation of the ALT/AST/TBIL lab. (B) Our phenotype definitions depending upon CTCAE grade of the potential event. (1) For all grades, an elevation in ALT, AST, or TBIL is determined to be an irAE if (a) it occurs between 2 and 14 weeks after the initiation of ICI or most recent cycle of ICI, (b) steroid treatment is given within 2 weeks of the elevation, and (c) for a duration of at least 2 weeks, provided that the patient has no history of chronic steroid exposure. If these conditions are met, but the patient has a history of chronic steroid exposure, the elevation of ALT, AST, and/or TBIL is deemed to be of indeterminant status. If any of these three conditions are unmet, the elevation of ALT, AST, and/or TBIL is deemed to be a non-irAE. (2) For grade 2 events, we also deem an elevation in ALT, AST, or TBIL to be an irAE if (a) it occurs between 2 and 14 weeks after the initiation of ICI or most recent cycle of ICI and (b) ICI therapy is stopped within 2 weeks of the event. CTCAE, Common Terminology Criteria for Adverse Events; ICI, immune checkpoint inhibitor; irAE, immune-related adverse event; LFT, liver-associated blood test; TBIL, bilirubin.

TABLE 1. Baseline Patient Characteristics Stratified by Highest Baseline Toxicity Grade (ALT, AST, or TBIL)

TABLE 2. Interannotator Agreement Including Hepatic + Nonhepatic irAEs
NOTE. Two clinician-adjudicators reviewed 63 patients for chart review. Reviewers agreed on irAE status 69.8% of the time, leading to a Fleiss' kappa of 0.434. Interannotator agreement exceeded 85% for date of first ICI (where agreement was defined as a date within 7 days), adverse event type, and the laboratory value presenting the irAE. However, reviewers only achieved agreement of 42.1%, 47.4%, and 52.9% for the adverse event date, grade of severity, and date of treatment, respectively. Disagreements include cases in which one reviewer did not provide a response; quantiles reflect provided numerical inputs. Abbreviations: ICI, immune checkpoint inhibitor; irAE, immune-related adverse event.

TABLE 3. Algorithm Performance and Confusion Matrices

TABLE 4. Patient Characteristics Associated With Detected irAEs (all patients determined by the algorithm to be irAEs)