Administrative healthcare data to predict performance status in lung cancer patients

The dataset includes 4,488 patients diagnosed with lung cancer (ICD-O-3 [3], C33-C34) in 2010–2012 and 2016–2018 in the territory of the Agency for Health Protection (ATS) of Milan, Italy, and selected from its population cancer registry on the basis of availability of the following information: performance status (PS), age, sex, and stage at diagnosis. The dataset also includes the following variables, extracted from the health databases of the ATS and linked to the variables derived from the cancer registry through deterministic record linkage on a unique key (tax code): Charlson comorbidity index, presence of chronic obstructive pulmonary disease, number of hospitalizations, outpatient visits, emergency accesses and prescribed drugs in the previous year, and dispensed durable medical equipment in the previous three years. The dataset was used to develop a logistic prediction model for PS, dichotomized as ‘poor’ (ECOG 3–5) and ‘good’ (ECOG 0–2), on the basis of all other variables in the dataset. The prediction model was developed on a 50% random subsample of the described dataset (development dataset, n = 2,244) and validated on the remaining half. The area under the curve (AUC) of the model was 0.76 in the development sample and 0.73 in the validation sample. The developed model was used to predict ‘good’ vs. ‘poor’ PS in a sample of patients with advanced lung cancer, from the same registry and years, for whom this information was not available. Researchers using registry data, or electronic claims, to perform studies of oncologic therapy effectiveness for lung cancer could use the reported coefficients to predict PS value, dichotomized as ‘good’ or ‘poor’.

Specifications Table
Subject: Oncology
Specific subject area: Evaluation of real-world effectiveness of oncologic therapy for lung cancer
Type of data: Tables
How data were acquired: Data from the cancer register were extracted from the register database hosted at the Epidemiology Unit of the Agency for Health Protection of Milan. Variables from the administrative health databases of the ATS of Milan, stored in the ATS data warehouse, were linked to the cancer register data on a unique identifier (tax code). Data were then anonymized. All procedures were performed in the safe environment of the ATS.
Data format: Raw and analyzed data
Parameters for data collection: The dataset includes patients diagnosed with lung cancer (ICD-O-3, C33-C34) in 2010-2012 and 2016-2018 in the territory of the Agency for Health Protection (ATS) of Milan, Italy, and registered in its population cancer registry. Excluded were death-certificate-only cases, non-malignant and non-epithelial tumours, and patients with missing information on stage or Performance Status.
Description of data collection: Information on patients, including Performance Status and stage, was derived from the cancer registry. All other variables were extracted from the health databases of the

Value of the Data
• Performance Status is an important confounding factor in studies of treatment effectiveness for lung cancer. The reported regression coefficients allow prediction of Performance Status in individual lung cancer patients from diverse cohorts, using few variables commonly available from administrative health databases.
• Researchers using registry data, or electronic claims, to perform studies of oncologic therapy effectiveness for lung cancer are frequently confronted with the lack of availability of performance status for part of or the entire cohort. They could use the reported coefficients to predict Performance Status value, dichotomized as 'good' or 'poor'.
• The dataset and the reported coefficients could be used to perform external validation of the model, including re-calibration for a population with a different baseline risk (e.g. different stage distribution).

Data Description
The dataset includes 4,488 patients with lung cancer. There are four variables derived from the cancer registry of the Agency for Health Protection (ATS) of Milan: patient Eastern Cooperative Oncology Group (ECOG) performance status (knownPS: 0 = Fully active to 5 = Dead), sex (Sex: 1 = Male, 2 = Female), and age in years (Age) and stage (Stage: IA, IB, IIA, IIB, IIIA, IIIB, IV), both at the time of diagnosis. The following 7 variables were obtained from the administrative health databases of the ATS: Charlson comorbidity index (Cindex, 0 to 11), presence of chronic obstructive pulmonary disease (COPD: 1 = yes or 0 = no), number of hospitalizations (N_admission), outpatient visits (N_outpatient_visits) and emergency accesses (N_emergency_access) in the previous year, dispensed durable medical equipment (Durable_equip: 1 = yes or 0 = no) and number of prescribed drugs (N_prescription). The variable Devel_valid indicates whether the record was randomly assigned to the development (D) or validation (V) dataset.
Table 1 displays descriptive statistics of all variables included in the dataset, i.e. numbers and percentages for categorical variables, and median and interquartile range for continuous, non-normally distributed variables. Some of the continuous variables are additionally described after categorization, for easier interpretation. The statistics are presented for the entire dataset and separately for: patients with 'poor' (ECOG 3-5, n = 776) and 'good' (ECOG 0-2, n = 3,712) known performance status; and the development (n = 2,244) and validation (n = 2,244) subsets. The p-value of the appropriate test for difference between the development and the validation set (i.e. χ² or Mann-Whitney U test) is presented in the last column of the table.
Table 2 displays the estimated logistic regression parameters (β) and standard errors (s.e.) for the model predicting 'good' Performance Status (ECOG 0-2) in lung cancer patients using cancer registry data and information derived from administrative databases of the ATS of Milan.
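As a minimal sketch of how the codings described above can be used, the following Python snippet dichotomizes knownPS into 'good' (ECOG 0-2) and 'poor' (ECOG 3-5) and tallies records by Devel_valid. The three example records are invented for illustration; only the variable names and codings are taken from the description of the dataset.

```python
from collections import Counter


def dichotomize_ps(known_ps: int) -> str:
    """Map an ECOG score (0-5) to 'good' (0-2) or 'poor' (3-5)."""
    if known_ps not in range(6):
        raise ValueError(f"ECOG performance status must be 0-5, got {known_ps}")
    return "good" if known_ps <= 2 else "poor"


# Hypothetical records that reuse the dataset's variable names.
records = [
    {"knownPS": 0, "Devel_valid": "D"},
    {"knownPS": 3, "Devel_valid": "V"},
    {"knownPS": 2, "Devel_valid": "D"},
]

# Count patients by dichotomized PS and by development/validation assignment.
ps_counts = Counter(dichotomize_ps(r["knownPS"]) for r in records)
split_counts = Counter(r["Devel_valid"] for r in records)
```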

Experimental Design, Materials and Methods
The Performance Status (PS) scale, developed by the Eastern Cooperative Oncology Group (ECOG) in 1982 [1], describes a patient's level of functioning in terms of their ability to care for themselves, daily activity, and physical ability (walking, working, etc.). We wanted to estimate the average treatment effect (ATE) of immune checkpoint inhibitors in any line of treatment in a 2016-2018 population-based cohort of patients with advanced non-small cell lung cancer (NSCLC) [2]. PS was among the variables needed for adjustment, but it was available for only 23% of the 1,673 patients included in the study. To obtain the information for the remaining patients, a prediction model for PS dichotomized as 'good' or 'poor' was developed on the dataset described here.
The dataset includes all patients diagnosed with lung cancer (ICD-O-3 [3], C33-C34) in 2010-2012 and 2016-2018 in the territory of the Agency for Health Protection (ATS) of Milan, Italy, and registered in its population cancer registry, a member of the International Association of Cancer Registries (IACR) [4]. The number of incident cases of lung tumour in the period was 14,441. Exclusions were: death-certificate-only cases (DCO, i.e. diagnosed only by means of a death certificate, n = 311), non-malignant tumours (ICD-O-3 [3] behaviour code different from /3, n = 67), non-epithelial tumours (ICD-O-3 [3] morphology code equal to or higher than 8680/3, n = 63) and patients with missing information on stage (n = 2,767). Of the remaining 11,233 patients, 4,488 had a PS value recorded in the cancer registry, either abstracted from clinical records or derived by trained research nurses from the same source, and were used to develop (n = 2,244 random records) and validate (n = 2,244 remaining records) the prediction model.
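The selection flow above can be checked with simple arithmetic. The short Python sketch below reproduces the reported counts; the dictionary keys are descriptive labels chosen for this example, while all numbers are those stated in the text.

```python
# Incident lung tumour cases in the study periods, as reported in the text.
incident_cases = 14_441

# Exclusions, as reported in the text (labels are illustrative).
exclusions = {
    "death certificate only": 311,
    "non-malignant tumour": 67,
    "non-epithelial tumour": 63,
    "missing stage": 2_767,
}

# Patients remaining after all exclusions.
remaining = incident_cases - sum(exclusions.values())

# Patients with a recorded PS value, split 50/50 at random.
with_known_ps = 4_488
development_n = with_known_ps // 2
validation_n = with_known_ps - development_n

print(f"Remaining after exclusions: {remaining}")
print(f"Development / validation: {development_n} / {validation_n}")
```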
Age in years at diagnosis, sex, and TNM 8th edition stage at diagnosis [5] were derived from the cancer registry.
Table 1. Sample characteristics, and number of contacts with the health system and of prescribed drugs in the year prior to lung cancer diagnosis in 4,488 patients with known Performance Status residing in the territory of the Agency for Health Protection (ATS) of Milan.
The remaining variables were derived from administrative health databases of the Lombardy Regional Health System, available at the ATS level for registered residents, as follows:
- number of hospitalizations in the year before diagnosis: by tax code, sum of any hospital admission recorded in the hospital discharge sheet (SDO) database in the 365-day window ending 30 days before the date of diagnosis.
- number of outpatient visits in the year before diagnosis: by tax code, sum of any record with an ICD-9-CM code starting with '89' in the outpatient database in the 365-day window ending 30 days before the date of diagnosis.
- number of emergency accesses in the year before diagnosis: by tax code, sum of any emergency room access recorded in the emergency care database in the 365-day window ending 30 days before the date of diagnosis.
- dispensed durable medical equipment in the 3 years before diagnosis: by tax code, whether any durable medical equipment among portable oxygen, walkers, canes, wheelchairs, and hospital beds had been dispensed in the 365 × 3-day window ending 30 days before the date of diagnosis.
- number of prescribed drugs in the year before diagnosis: by tax code, number of prescribed drugs with different ATC codes [6] in the outpatient drug dataset in the 365-day window ending 30 days before the date of diagnosis.
- Charlson comorbidity index: calculated by adapting the algorithm of Quan et al. [7], based on hospital discharge sheets, to include information from the outpatient drug and exemption from co-payment datasets, the latter including exemptions for named chronic diseases. The algorithm specifications for defining the presence of the chronic diseases included in the comorbidity index from the outpatient drug and exemption from co-payment datasets are the same as those described in the supplementary material of Murtas et al. [8].
- chronic obstructive pulmonary disease: considered present if the subject had a hospitalization prior to diagnosis with any of the following ICD-9-CM codes in any diagnosis field: 416.8, 416.9, 490.x to 505.x, 506.4, 508.1, 508.8 [7]; or if the subject was older than 45 years and had prescriptions of drugs with ATC codes starting with R03 (drugs for obstructive airway diseases) with a defined daily dose (DDD) [6] of at least 30% in the year before diagnosis.
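The look-back windows used for the contact counts can be sketched as follows. This is an illustrative Python example, not the code used to build the dataset: the function names are hypothetical, and the treatment of the window boundaries (start included, end excluded) is an assumption, since the text does not specify it.

```python
from datetime import date, timedelta


def lookback_window(diagnosis: date, window_days: int = 365, gap_days: int = 30):
    """Return (start, end) of a window that ends gap_days before diagnosis."""
    end = diagnosis - timedelta(days=gap_days)
    start = end - timedelta(days=window_days)
    return start, end


def count_events(event_dates, diagnosis: date, window_days: int = 365,
                 gap_days: int = 30) -> int:
    """Count events with start <= date < end (boundary handling is assumed)."""
    start, end = lookback_window(diagnosis, window_days, gap_days)
    return sum(start <= d < end for d in event_dates)


# Hypothetical patient diagnosed on 30 June 2017.
diagnosis_date = date(2017, 6, 30)
admissions = [
    date(2017, 6, 15),  # inside the 30-day gap: not counted
    date(2017, 1, 10),  # inside the window: counted
    date(2016, 5, 30),  # before the window: not counted
]
n_admission = count_events(admissions, diagnosis_date)

# Durable equipment uses a three-year window and a yes/no indicator.
equipment_dates = [date(2015, 3, 1)]
durable_equip = int(count_events(equipment_dates, diagnosis_date,
                                 window_days=365 * 3) > 0)
```

The same function covers hospitalizations, outpatient visits and emergency accesses, since they differ only in the source of the event dates.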
The date of diagnosis in the above calculations was always the date recorded in the cancer register which, as per international rules, is the date of first histological or cytological confirmation when available [9]. For this reason, events in the 30 days preceding the date of diagnosis were not counted, as they were likely related to the lung cancer diagnostic process. Deterministic record linkage on a unique key was used to match all information at patient level within the information system of the ATS, which houses both the cancer registry and the administrative data; the linked data were anonymized prior to analysis. The resulting dataset is described in Table 1.
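Deterministic linkage on a unique key amounts to an exact-match join. The Python sketch below illustrates the idea on invented records; the function name, the field name tax_code, the choice to keep only matched patients, and anonymization by simply dropping the key are all assumptions made for this example and do not describe the actual ATS procedure.

```python
def link_on_key(registry, admin, key="tax_code"):
    """Exact-match join of two record lists on a unique key, then drop the key."""
    admin_by_key = {row[key]: row for row in admin}
    if len(admin_by_key) != len(admin):
        raise ValueError("linkage key is not unique in the administrative data")

    linked = []
    for record in registry:
        match = admin_by_key.get(record[key])
        if match is None:
            continue  # assumption: unmatched registry records are skipped
        merged = {**record, **match}
        merged.pop(key)  # assumption: remove the identifier after linkage
        linked.append(merged)
    return linked


# Invented example records.
registry_rows = [
    {"tax_code": "AAA", "Age": 71, "Stage": "IV"},
    {"tax_code": "BBB", "Age": 64, "Stage": "IIA"},
]
admin_rows = [
    {"tax_code": "AAA", "N_admission": 2, "COPD": 1},
]

linked_rows = link_on_key(registry_rows, admin_rows)
```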
A logistic regression model was fitted on a 50% random subsample of the described dataset (development dataset, n = 2,244) with dichotomized PS 'Good' (ECOG 0-2) vs. 'Poor' (ECOG 3-5) as the dependent variable and the following predictors: age, sex, stage, Charlson comorbidity index, chronic obstructive pulmonary disease, number of hospital admissions, outpatient visits, emergency accesses and prescribed drugs in the previous year, and durable medical equipment in the previous three years. The predictors to be included in the model were chosen on the basis of the literature [5]. A full model was developed, without automatic model selection, given the high number of events and the minimal cost of collecting this information, and in order to maximize the expected discrimination ability based on administrative data only. The interactions included in the model (age × sex, and Charlson comorbidity index × number of prescribed drugs) were pre-specified on the basis of subject-matter knowledge. Age was included in the model as a restricted cubic spline with 3 knots. Coefficients and standard errors for the fitted model are presented in Table 2. In the development sample, the AUC of the model was 0.76 and the Brier score was 0.12. In the validation dataset, the AUC was 0.73 and the Brier score was 0.12. The calibration intercept and slope in the validation dataset were 0.27 and 0.81, respectively.
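For readers who want to apply or re-evaluate a model of this kind, the following Python sketch shows the building blocks mentioned above: the nonlinear term of a 3-knot restricted cubic spline, the conversion of a linear predictor to a probability, and the AUC and Brier score. It is a sketch under stated assumptions, not the authors' code: the function names are hypothetical, the spline scaling by the squared distance between the outer knots is one common convention that may differ from the one underlying Table 2, and the knot positions and data in the usage lines are invented.

```python
import math


def rcs_nonlinear_term(x, knots):
    """Nonlinear basis term of a 3-knot restricted cubic spline.

    Assumed scaling: division by (t3 - t1) ** 2.
    """
    t1, t2, t3 = knots

    def pos_cube(u):
        return max(u, 0.0) ** 3

    term = (pos_cube(x - t1)
            - pos_cube(x - t2) * (t3 - t1) / (t3 - t2)
            + pos_cube(x - t3) * (t2 - t1) / (t3 - t2))
    return term / (t3 - t1) ** 2


def predicted_probability(linear_predictor):
    """Inverse logit: probability of 'good' PS from the linear predictor."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def auc(y_true, y_prob):
    """AUC as the share of (event, non-event) pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Invented outcomes (1 = 'good' PS) and predicted probabilities.
y_example = [1, 1, 0, 0]
p_example = [0.9, 0.6, 0.7, 0.2]
example_auc = auc(y_example, p_example)
example_brier = brier_score(y_example, p_example)

# Invented knot positions for age, for illustration only.
age_term = rcs_nonlinear_term(72.0, knots=(55.0, 70.0, 78.0))
```

In practice the linear predictor would be the sum of the Table 2 coefficients multiplied by the corresponding predictor values, including the spline term for age and the two interaction terms.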

Ethics Statement
The study project of which these data are a part was approved by the ethics committee Milan Area 2 (protocol review number 231_2021bis of March 17, 2021, study id 2059).

Declaration of Competing Interest
These data, and the described analysis, have been used within a project supported by Roche S.p.A. and M.S.D. Italia s.r.l. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors declare that they have no other competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.