Do changes in health reveal the possibility of undiagnosed pancreatic cancer? Development of a risk-prediction model based on healthcare claims data

Background and objective Early detection methods for pancreatic cancer are lacking. We aimed to develop a prediction model for pancreatic cancer based on changes in health captured by healthcare claims data. Methods We conducted a case-control study on 29,646 Medicare-enrolled patients aged 68 years and above with pancreatic ductal adenocarcinoma (PDAC) reported to the Surveillance, Epidemiology, and End Results (SEER) tumor registries program in 2004–2011 and 88,938 age- and sex-matched controls. We developed a prediction model using multivariable logistic regression on Medicare claims for 16 risk factors and pre-diagnostic symptoms of PDAC present within 15 months prior to PDAC diagnosis. Claims within 3 months of PDAC diagnosis were excluded in sensitivity analyses. We evaluated the discriminatory power of the model with the area under the receiver operating characteristic curve (AUC) and performed internal validation by bootstrapping. Results The prediction model on all cases and controls reached an AUC of 0.68. Excluding the final 3 months of claims lowered the AUC to 0.58. Among new-onset diabetes patients, the prediction model reached an AUC of 0.73, which decreased to 0.63 when claims from the final 3 months were excluded. Performance measures of the prediction models were confirmed by internal validation using the bootstrap method. Conclusion Models based on healthcare claims for clinical risk factors, symptoms and signs of pancreatic cancer are limited in distinguishing those who go on to a diagnosis of pancreatic cancer from those who do not, especially when claims that immediately precede the diagnosis of PDAC are excluded.



Introduction
Over 50,000 new cases and 40,000 deaths from pancreatic cancer occur annually in the U.S. [1] With a 5-year survival proportion below 10%, pancreatic cancer is the deadliest solid organ cancer. [1,2] If current trends continue, pancreatic cancer will become the second leading cause of cancer death by 2030. [3] Most pancreatic cancer patients have advanced stage disease at diagnosis; [1] therefore, strategies for detecting pancreatic cancer earlier could expand treatment options and improve survival.
Metabolic and gastrointestinal changes are strongly associated with incident pancreatic cancer. For example, people with new diagnoses of diabetes are at approximately 4-fold increased risk of pancreatic cancer diagnosis in the next two years. [4][5][6] In some patients, new-onset diabetes reflects a paraneoplastic phenomenon arising from tumor in the pancreas. [7,8] Development of pancreatic ductal adenocarcinoma (PDAC) is also often marked by unintentional weight loss. [9] Recent diagnosis of pancreatitis is also strongly associated with PDAC risk, with an odds ratio (OR) of 13.6, reflecting potential misdiagnosis of PDAC as pancreatitis, or the causation of pancreatitis by the developing neoplasm. [10] Similarly, recent initiation of proton-pump inhibitor (PPI) use is related to PDAC risk (OR = 6.2), suggesting that PDAC-related abdominal discomfort is sometimes treated as dyspepsia. [11] Collectively, changes in health as manifested in healthcare claims could potentially be used to detect PDAC at earlier stages. Previous prediction models for PDAC that have incorporated data on changes in health have shown modest discriminative power, but have varied applicability to the general population in the U.S. [11][12][13] We hypothesize that predictive modeling using healthcare claims from a national insurance program in the U.S. can help identify older adults who are at high risk of pancreatic cancer. Using Medicare-linked data on cancer diagnoses reported to Surveillance, Epidemiology, and End Results (SEER) cancer registries between January 2004 and December 2011, we conducted a matched retrospective case-control study to develop a prediction model for pancreatic cancer.

Data sources
The SEER database includes information on cancer incidence and survival from population-based registries in geographic regions currently comprising approximately 28% of the U.S. population. [14] Linkage of SEER to Medicare claims on inpatient and outpatient procedures and diagnoses offers a unique population-based source of information on patterns of care before and after diagnosis that can be used for epidemiological and health services research. [15,16] For the purposes of the current analyses, we extracted pathology and diagnosis information on PDAC cases from SEER, selected controls from a matched random sample of Medicare members, and extracted covariate data from Medicare claims. SEER-Medicare data pertaining to pancreatic cancer cases and controls were obtained and analyzed as a limited data set without direct identifiers. The Institutional Review Board of Cedars-Sinai Medical Center has approved this study.

Selection of cases
Based on topography code C25.x and ICD-O-3 histology codes for adenocarcinoma of the pancreas (8000, 8010, 8020, 8021, 8022, 8050, 8140, 8141, 8211, 8230, 8260, 8441, 8450, 8453, 8470, 8471, 8472, 8473, 8480, 8481, 8500, 8503, 8521), [17] we identified all newly diagnosed PDAC patients at least 68 years old. We chose 68 years as the minimum age so that eligible patients had at least three years enrollment duration in Medicare Parts A and B prior to diagnosis of pancreatic cancer. We only included people with PDAC that was confirmed by microscopy, laboratory test, direct visualization, or imaging, and excluded cases with unknown months of diagnoses or those diagnosed at autopsy. Because SEER reports only the month and year of cancer diagnosis, we set the 1st of the month as the diagnosis date for the purposes of designating pre-diagnosis claims.

Selection of controls
Using the 5% random sample of Medicare beneficiaries, we selected 3 controls for each case and matched them by sex, 5-year age group and year of diagnosis. Controls were free of pancreatic cancer as of July 1st of the same year as case diagnosis, and had been enrolled in Medicare Parts A and B for at least three years as of that point in time. This methodology parallels control selection methods by Engels et al. [18] The same control was allowed to be sampled across multiple years; however, each control was only sampled once in a calendar year. Index date was defined as July 1st of the same year as the matched case.
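The matching rule described above can be illustrated with a minimal sketch. It is not the authors' procedure: the record layout (dictionaries with `id`, `sex`, `age`, `year`), the 5-year age-group boundaries, and the function names are all assumptions made for illustration. The pool is assumed to hold one record per person per calendar year, so a person can be drawn at most once in a given year but may be drawn again in another year.

```python
import random
from collections import defaultdict

def age_group(age, width=5):
    """Lower bound of the age group (e.g. 72 -> 70). Boundaries are an assumption."""
    return (age // width) * width

def match_controls(cases, pool, ratio=3, seed=0):
    """Draw `ratio` controls per case from the same (sex, age group, year) stratum.

    `cases` and `pool` are lists of dicts with keys id, sex, age, year.
    Each pool record is one person-year, so a control is used at most once
    per calendar year but can be sampled again in a different year.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in pool:
        key = (person["sex"], age_group(person["age"]), person["year"])
        strata[key].append(person["id"])
    for ids in strata.values():
        rng.shuffle(ids)  # random order so that pop() is a random draw

    matches = {}
    for case in cases:
        key = (case["sex"], age_group(case["age"]), case["year"])
        available = strata[key]
        if len(available) < ratio:
            raise ValueError(f"not enough controls in stratum {key}")
        matches[case["id"]] = [available.pop() for _ in range(ratio)]
    return matches
```

Sampling without replacement within a stratum is what enforces the once-per-year constraint in this sketch.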

Covariates
On the basis of published literature and consensus among investigators with expertise in oncology, gastroenterology and epidemiology, we selected clinical health changes known to be associated with PDAC, including acute pancreatitis, chronic pancreatitis, any abdominal pain, chest pain, diabetes mellitus, weight loss/anorexia/cachexia, nausea and/or vomiting, digestive problems, dyspepsia/gastritis/peptic ulcer disease, fatigue, itching/pruritus, depression, jaundice, gallbladder disease, acute cholecystitis, and esophageal reflux. S1 Table lists these covariates and their corresponding ICD-9 codes. We extracted ICD-9 coded claims for these factors from Medicare inpatient and outpatient data files.

Healthcare access
Healthcare claims are more likely to be consistent among patients who make use of recommended preventive services. A proxy indicator for such individuals among Medicare enrollees is compliance with the annual influenza vaccine recommendation, which is correlated with health literacy and motivation to seek care. [19,20] To adjust for healthcare access, we included influenza vaccination in all models. Compliance with the vaccine recommendation was determined by extracting claims data on receipt of influenza vaccination (HCPCS codes G0008, Q2035, Q2036, Q2037, Q2038) in the 12-month period prior to index date.
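A minimal sketch of the vaccination indicator described above, assuming claims are available as (HCPCS code, service date) pairs. That layout and the function name are hypothetical, not the actual Medicare file format, and the 12-month window is approximated as 365 days.

```python
from datetime import date, timedelta

# HCPCS codes for influenza vaccination as listed in the text.
FLU_VACCINE_CODES = {"G0008", "Q2035", "Q2036", "Q2037", "Q2038"}

def had_flu_vaccine(claims, index_date, window_days=365):
    """True if any flu-vaccine claim falls in the window before index_date.

    `claims` is an iterable of (hcpcs_code, service_date) pairs (assumed layout).
    The window is [index_date - window_days, index_date), i.e. roughly 12 months.
    """
    start = index_date - timedelta(days=window_days)
    return any(
        code in FLU_VACCINE_CODES and start <= service_date < index_date
        for code, service_date in claims
    )
```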

Statistical analysis plan
To visualize the trends of claims for covariates of interest prior to diagnosis with PDAC and to identify a pre-diagnosis window of time when such trends diverge between cases and controls, we summarized the ratios of the percentage of cases to the percentage of controls who had healthcare claims for the covariates of interest within 24 months prior to diagnosis. The 24-month history was divided into 3-month intervals (a total of 8 quarter years). For the purpose of the main prediction model, we included claims within 15 months prior to PDAC diagnosis or index date, both to incorporate as many covariates as possible that diverge between the cases and the controls and to have sufficient lead time prior to pancreatic cancer diagnosis to identify potentially useful early detection signals.
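The quarterly summary described above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the index date falls on the 1st of a month, counts whole calendar months back from it, and numbers quarters from 1 (the final 3 months) to 8. The function names and the per-person "set of quarters" representation are assumptions.

```python
from datetime import date

def quarter_before_index(claim_date, index_date):
    """Map a claim to pre-diagnosis quarter 1..8 (1 = final 3 months).

    Returns None for claims on or after the index date or more than
    24 calendar months before it. Assumes index_date is the 1st of a month.
    """
    months_before = (
        (index_date.year - claim_date.year) * 12
        + (index_date.month - claim_date.month)
    )
    if months_before < 1 or months_before > 24:
        return None
    return (months_before - 1) // 3 + 1

def case_control_ratio(case_quarters, control_quarters, quarter):
    """Ratio of the percentage of cases to the percentage of controls with a claim.

    Each argument is a list holding, per person, the set of quarters in which
    that person had at least one claim for the covariate of interest.
    """
    pct_cases = sum(quarter in q for q in case_quarters) / len(case_quarters)
    pct_controls = sum(quarter in q for q in control_quarters) / len(control_quarters)
    return pct_cases / pct_controls if pct_controls else float("nan")
```

Under this numbering, the main model window of 15 months corresponds to quarters 1 to 5, and the sensitivity analysis that drops the final 3 months corresponds to quarters 2 to 5.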
To describe covariate distributions of the case and control sample groups, we computed frequencies and percentages for categorical variables and medians and interquartile ranges for continuous variables. The primary outcome was the occurrence of PDAC. We compared covariate distributions between the case and control groups by Wilcoxon rank-sum statistics or chi-square statistics, as appropriate. To quantify associations between the covariates and the outcome, we constructed unconditional logistic regression models under adjustment for the matching variables: sex, age group, and year of diagnosis. Because we sampled some patients more than once, we accounted for repeated measurements on the same control across multiple years by robust variance estimates. Variables initially considered for inclusion in the multivariable model included race and influenza vaccine status and all of the covariates described above.
Model selection was conducted by a stepwise variable selection procedure based on the Quasi-likelihood under the Independence model Criterion (QIC) statistic. [21,22] The final multivariable model was the one with the lowest QIC value; QIC is an analogue of Akaike's information criterion [23] for correlated data. Age group, sex, year of diagnosis, race and influenza vaccine status were kept in the model regardless of statistical significance.

Model performance
We evaluated the sensitivity of the models at specificities of 99% or higher, 95-99%, and <95%. We set thresholds based on specificity, rather than sensitivity, given the infrequency of the disease and the high cost of false positives (e.g., patient anxiety, costly imaging). Performance of the models in predicting occurrence of pancreatic cancer was further assessed with measures of discrimination and calibration. [24] Discrimination was evaluated by the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC, or C-index). Calibration of the prediction models was evaluated with calibration slopes and intercepts, and graphically assessed by plotting predicted versus observed probability of the occurrence of PDAC based on the loess algorithm. [25] Internal validation of the models was performed by estimating and correcting for possible overfitting and optimism in the model performance estimates by bootstrap methods with 1000 replicates. [25][26][27]
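To make the two discrimination measures above concrete, the sketch below computes the AUC as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control (ties count one half), and reads off sensitivity at a threshold chosen to guarantee at least a target specificity. It is a plain-Python illustration with hypothetical function names, not the SAS or R code used in the study, and it does not implement the bootstrap optimism correction.

```python
import math

def auc(case_scores, control_scores):
    """AUC as the fraction of case-control pairs ranked correctly (ties = 0.5)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

def sensitivity_at_specificity(case_scores, control_scores, specificity):
    """Sensitivity at the lowest threshold giving at least `specificity`.

    A person is flagged positive when the score is strictly above the threshold,
    so at least ceil(specificity * n_controls) controls are classified negative.
    """
    ranked = sorted(control_scores)
    n_negative = math.ceil(specificity * len(ranked))
    threshold = ranked[n_negative - 1] if n_negative > 0 else float("-inf")
    return sum(s > threshold for s in case_scores) / len(case_scores)
```

The pairwise loop is quadratic in sample size; for samples as large as those in this study a rank-based formulation would be preferable, but the quantity computed is the same.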

Sensitivity analyses
To evaluate how the prediction model may have been influenced by claims immediately preceding the diagnosis of PDAC, which may reflect diagnostic work-up for cancer, we conducted sensitivity analyses excluding claims occurring less than 3 months prior to PDAC diagnosis. Because new-onset diabetes can be an early indicator of pancreatic cancer, [7,8] and has been the focus of published prediction models, [11][12][13] we also performed sensitivity analyses among those with new claims for diabetes within 15 months prior to the index date, without any claim for diabetes prior to this period. Finally, a separate prediction model was also created based on claims presented 16-24 months prior to the index date, to evaluate possible prediction utility further before diagnosis. To consider the influence of including weak associations in the prediction models, we also constructed models with parsimonious selection of variables that were associated with PDAC with OR > 2 for each of the models above. In all models, except for new-onset diabetes, we included relevant claims within the specified time period whether or not they were the first ever claim for the condition.
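The new-onset diabetes definition used in the sensitivity analyses above (a diabetes claim within 15 months before the index date and none before that period) can be expressed as a small rule. The sketch assumes each diabetes claim has already been converted to whole months before the index date, with 1 meaning the month immediately preceding it; that representation and the function name are illustrative assumptions.

```python
def is_new_onset_diabetes(diabetes_months_before, window=15):
    """True if diabetes claims exist within `window` months and none earlier.

    `diabetes_months_before` lists, for each diabetes claim, the number of
    whole months between the claim and the index date (assumed representation).
    """
    months = list(diabetes_months_before)
    has_recent_claim = any(1 <= m <= window for m in months)
    has_earlier_claim = any(m > window for m in months)
    return has_recent_claim and not has_earlier_claim
```

Because eligibility required at least three years of enrollment before the index date, "no earlier claim" in this sketch can only be checked against the claims history that is actually observed.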
All statistical analyses were performed using SAS 9.4 (SAS Institute, Inc., Cary, North Carolina) and R version 3.5.0 (The R Foundation for Statistical Computing). The Institutional Review Board of Cedars-Sinai Medical Center approved the study. We followed the STROBE guidelines for reporting of results of case-control studies, [28] and the PROBAST guidelines for reporting on potential bias and applicability of prediction models. [29]

Results
In total, 51,540 non-deceased pancreatic cancer patients with known diagnosis month and year were reported to SEER between 2004 and 2011; 44,882 of these were malignant primary PDAC. Diagnosis was confirmed by microscopy, laboratory tests, or imaging in 41,305 cases, of whom 29,646 met all our study eligibility criteria. Of note, 23,332 of the cases (79%) were microscopically confirmed. We selected 88,938 controls matched to the cases. Table 1 provides characteristics of the cases and controls.

Pre-diagnostic claims history in cases and controls
Covariates such as chronic pancreatitis, acute pancreatitis, jaundice and poorly controlled diabetes were present more frequently in cases than in controls from as early as 24 months prior to cancer diagnosis or matched date. In addition to these factors, covariates such as upper abdominal pain, gallbladder disease, digestive symptoms and weight loss were present in a greater proportion of patients with pancreatic cancer than of controls within 15 months prior to cancer diagnosis or matched date. All factors were more elevated in cases than in controls in the last 3 months prior to cancer diagnosis, and ratios for cases vs. controls increased steeply in this quarter (Fig 1). A summary of proportions of cases and controls with claims for each covariate by quarter is provided in S2 Table.

Multivariable results
Table 2 shows the results of multivariable analyses. In the analyses focusing on the 15 months before diagnosis, factors significantly associated with PDAC included black race (OR = 1.14) relative to white race, and presence of at least 1 claim for acute pancreatitis (OR = 4.72), chronic pancreatitis (OR = 3.72), diabetes mellitus (OR = 1.52), dyspepsia (OR = 1.25), gallbladder disease (OR = 1.34), any abdominal pain (OR = 2.38), weight loss (OR = 2.70), and jaundice (OR = 24.0). Influenza vaccination (OR = 0.82), depression (OR = 0.72), and chest pain (OR = 0.89) were significantly associated with reduced PDAC risk.
Excluding claims from the final 3 months before index date weakened these associations. For example, acute pancreatitis and jaundice were associated with 3.1-fold and 3.8-fold increased risk of PDAC, respectively. The strength of the association for diabetes did not change with the exclusion of the final 3 months of claims, but that for weight loss decreased from an OR of 2.70 to 1.57. Dyspepsia and gallbladder disease, which were associated with PDAC in the 1-15 month model, were no longer significantly associated with PDAC risk when we excluded claims from the final 3 months.
Table 3 presents the covariate distributions between the case and control groups among those with new-onset diabetes, comprising 7.8% of the cases (n = 2,319) and 3.8% of the controls (n = 3,400). The results of the multivariable model for persons with new-onset diabetes are presented in Table 4 and show similar trends to the entire case-control sample. Patients with acute pancreatitis, chronic pancreatitis, abdominal pain, weight loss, and jaundice experienced increased risk of PDAC. Also of note, in persons with new claims for diabetes, poorly controlled diabetes was additionally associated with PDAC risk. As in the model based on the full subject sample, depression was negatively associated with PDAC risk. Excluding the final 3 months of claims attenuated the associations between the covariates and PDAC risk. Regardless, acute pancreatitis, chronic pancreatitis, abdominal pain, weight loss, and jaundice remained associated with PDAC risk. Poorly controlled diabetes and nausea/vomiting were no longer associated with PDAC risk and were omitted from the model, while depression and chest pain were inversely associated with PDAC risk.

Model performance
models in persons with new-onset diabetes. For these subjects, at a specificity of 99%, the prediction model on ≤15 months of claims yielded a sensitivity of 18.2%, and the model excluding the final 3 months of claims yielded a sensitivity of 4.4%. The corresponding 1-year positive predictive values were 3.5% and 0.87%, respectively, assuming a baseline annual risk of PDAC of 200 cases per 100,000 person-years after new-onset diabetes. [31]
For each of the models presented above, we also examined parsimonious models including only risk factors associated with PDAC with OR > 2 (acute pancreatitis, chronic pancreatitis, diabetes, abdominal pain, weight loss and jaundice). Parsimonious models performed slightly worse than the QIC-driven models, but by no more than 0.01 AUC point (S3 Table).
Considering that claims more distant from the index date potentially offer greater lead time, we developed a prediction model based on claims 16-24 months prior to PDAC diagnosis, for which the AUC (0.552) was lower than that of the <15 months model (S3 Table).
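The positive predictive values reported above for persons with new-onset diabetes can be reproduced from the stated sensitivity, specificity, and assumed baseline risk using Bayes' rule. The short sketch below is a worked check rather than the authors' code; the function name is illustrative, and it treats the annual risk of 200 per 100,000 as the 1-year prevalence of PDAC among those screened.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = true positives / (true positives + false positives)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

annual_risk = 200 / 100_000  # baseline risk after new-onset diabetes, from the text

ppv_15_months = positive_predictive_value(0.182, 0.99, annual_risk)
ppv_excluding_3_months = positive_predictive_value(0.044, 0.99, annual_risk)

print(f"{ppv_15_months:.1%}, {ppv_excluding_3_months:.2%}")  # 3.5%, 0.87%
```

Even at 99% specificity, false positives among the roughly 99.8% of people without PDAC outnumber true positives by a wide margin, which is why the positive predictive value stays in the low single digits.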

Discussion
In this analysis of older adults in the U.S., we showed that healthcare claims for risk factors and PDAC-related symptoms and signs start to increase months ahead of PDAC diagnosis and that healthcare utilization intensifies nearing the time of PDAC diagnosis. The AUC of the prediction model built on 15 months of claims prior to the index date reached 0.68 when all study subjects were considered and 0.73 among persons with new-onset diabetes. With omission of claims in the three months before diagnosis, the AUCs dropped substantially both for all cases and controls (0.58) and for persons with new-onset diabetes (0.63). At a specificity threshold of 99%, models that incorporate all claims within 15 months of the index date have limited sensitivity of 16-18%, which drops to 4-5% when the final 3 months of claims are excluded.
Two previously published models have focused on new-onset diabetes: one was based on new-onset diabetes patients aged ≥50 years in the U.K. and incorporated clinical diagnoses as well as laboratory data from electronic health records; [12] the other was based on biochemically determined new-onset diabetes patients aged ≥50 years in Olmsted County, Minnesota, and incorporated data on changes in glucose and weight. [13] The U.K. model reached an AUC of 0.82 by internal validation and the Olmsted County model reached an AUC of 0.87 by external validation within another population in Olmsted County (S4 Table). Our model in new-onset diabetes patients, with an AUC of 0.73, differs from previous models in three major aspects: age range, regional scope, and type of data. Our study population comprised persons aged ≥68 years, who have a higher baseline incidence of type 2 diabetes than younger persons; therefore, the likelihood that a recent diagnosis of diabetes could be attributable to pancreatic cancer is lower. Our model comprised Medicare patients spanning 28 SEER regions in the nation. Variability in documenting and billing clinical diagnoses may have been greater than in the U.K. and in Olmsted County, whose health systems are less heterogeneous. [32,33] Finally, our model relied on insurance claims rather than medical records; thus, information on laboratory test results and self-reported complaints was lacking. Because continuous formats of laboratory test results (e.g., glucose level) and weight provide more granular information on physiological state than binary diagnoses, incorporating such parameters may explain more of the variation in PDAC risk.
Previously published pancreatic cancer prediction models on populations not selected by diabetes status include a Korean nationwide study that incorporated laboratory data from regular health examinations (AUC = 0.81), [34] a population-based case-control study in Connecticut incorporating questionnaire-based data on ethnic ancestry, ABO blood group, smoking cessation, pancreatitis and recent use of proton-pump inhibitor medications (AUC = 0.764), [11] and a pancreatic cancer consortium (PanScan) analysis of multiple observational studies with questionnaire-based data on epidemiologic risk factors and blood group genotype (AUC = 0.61) [35] (S4 Table). Our prediction model in the overall population reached an AUC of 0.68, which was lower than that estimated in the Korean and Connecticut models. We attribute the lower performance to the lack of information on laboratory test results and medications, to the lack of self-reported data in claims databases, and to the older age of our population (≥68 years). In addition to the advantage of laboratory tests described above, over-the-counter medications like proton-pump inhibitors provide indications of abdominal pain prior to seeking help from a health professional, thus adding more granular and potentially earlier information on subclinical health changes. Also, Medicare claims data do not include lifestyle risk factors for PDAC, such as smoking and alcohol consumption, or family history of cancer, which increase the risk of PDAC. [36][37][38] The availability of such risk factor data would have improved our models.
The finding that models incorporating data on health changes leading up to pancreatic cancer diagnosis performed better than the PanScan model, which relied on data on static etiologic risk factors and ABO genotypes, [35] suggests that models based on such etiologic risk factors do not identify well exactly when such factors should operate, compared to prediction models based on changes in health. Whether prediction models based on recent changes in health aid in detecting cancer sufficiently early for better treatment options, especially potentially curative resection or aggressive multifractionated radiation, is a critical question. In our sensitivity analysis, excluding the final 3 months of claims before the index date led to a substantial drop in AUC. One of the strongest predictors was jaundice, which was associated with 24-fold risk of PDAC when all claims within 15 months of the index date were included and 3.8-fold risk when the final 3 months were excluded. The odds ratios for other strong predictors of pancreatic cancer, such as chronic pancreatitis, acute pancreatitis, abdominal pain and weight loss, also attenuated substantially when the final 3 months of claims were excluded. With longitudinal data from healthcare claims, we observe that healthcare claims are more frequent in PDAC patients than in controls prior to PDAC diagnosis; however, these health changes are often noted very close (<3 months) to the diagnosis of PDAC, thus limiting their predictive value for early detection.
In our analyses, one limitation of using Medicare files is that healthcare claims not billed to Medicare would not have been reflected in the files. By restricting the population to those continuously enrolled in both Medicare Parts A (inpatient care) and B (outpatient care), we limited the population to those who have opted for fee-for-service outpatient reimbursement through Medicare, which therefore would have records of most services covered for its members. Another limitation of Medicare claims data is that claims do not distinguish incident from prevalent conditions. Indeed, knowing the duration of a condition since onset can help improve the model, as demonstrated by Risch et al. [11] For conditions like diabetes, pancreatitis and dyspepsia, the strength of the association with PDAC decreases with time since onset; thus, parameterizing the timing of the onset of disease would enhance the fit of the model. Another limitation of Medicare claims data is the lack of representation of younger people who may still be at risk of PDAC. Regardless, the mean age of PDAC diagnosis is 70; [17] thus, our model applies to a majority of older persons in the U.S. at risk for PDAC. Although we aimed to include a comprehensive list of risk factors and symptoms of PDAC, some factors may not have been represented in our analysis. An example is back pain, which has been associated with PDAC, with odds ratios ranging from 1.3 to 1.4. [39,40] While including additional factors could improve the prediction model, relatively weak associations are unlikely to improve the predictive performance of the model appreciably.

Conclusion
We created a PDAC prediction model that applies to Medicare enrollees living in SEER regions in the U.S. The model provides some information bearing upon the emergent diagnosis of pancreatic cancer, but not enough on its own to be useful in population screening. Excluding the final 3 months of claims prior to PDAC diagnosis reduced the discriminative performance of the model appreciably. Future models should consider sensitivity analyses excluding health changes noted in the final months before PDAC diagnosis in order to evaluate the true clinical utility of prediction models for PDAC early detection.

Supporting information
S1 Table. Covariates of pancreatic cancer and their ICD-9 codes.