Missing data and chance variation in public reporting of cancer stage at diagnosis: Cross-sectional analysis of population-based data in England

Highlights
• Indicators of early stage at diagnosis are routinely reported for geographical populations.
• The current specification of these indicators results in biased and unreliable comparisons.
• Changes to the approach to handling missing data, and the reporting period are suggested.
• Organisational indicators for cancer care must address bias from missing data and low reliability.


Introduction
The percentage of cancer patients diagnosed at an 'early stage' (i.e. TNM stages 1-2) has been routinely reported for National Health Service commissioning organisations (Clinical Commissioning Groups, CCGs) since 2014 [1], following recommendations in the 2011 national cancer strategy for England [2]. Recently, this indicator has been adopted into a pay-for-performance scheme for CCGs [3]. Typical CCGs meeting the relevant targets in a given year would receive a financial incentive of £250,000. The aim of these public reporting and pay-for-performance schemes is to promote diagnosis of cancer at an earlier stage and thereby improve outcomes for patients across England. We further summarise this policy context and the technical aspects of the indicator in Box 1.
Indicators used for comparing the performance of healthcare organisations should, among other considerations, be both valid and reliable. Valid indicators truly measure the intended construct of interest, while reliability indicates the precision by which the construct is measured. The validity of performance indicators based on routinely collected healthcare data may be undermined by missing information [4,5]. Low reliability, where measures are not precise enough to distinguish organisational performance, is a prevailing concern when person-level measures are aggregated into organisation-level scores [6][7][8][9]. Frequently, indicators are published and used in pay-for-performance schemes without these concerns being examined or addressed.
https://doi.org/10.1016/j.canep.2017.11.005 Received 10 August 2017; Received in revised form 9 November 2017; Accepted 11 November 2017
The validity and reliability of the early stage indicator for CCGs as currently specified have not been evaluated. Currently, patients with cancer with no recorded stage are treated as though they had late stage cancer, but an alternate specification excluding such patients may be more appropriate. Furthermore, the annual reporting period may be either unnecessarily long or too short to allow for reliable estimation of performance. In this article, we demonstrate how appropriate statistical techniques may be used to examine the properties of this indicator, and identify specific improvements to reduce bias and improve its reliability.

Examining bias arising from missing data in indicators of early stage at diagnosis
In the study year (2013) stage completeness across all 10 cancer sites was 82%, ranging from 71% to 91% for renal and endometrial cancer, respectively. We used multiple imputation by chained equations (MI) to produce a 'best estimate' early stage indicator, which we treated as the gold standard. Separately by cancer site, a binary early stage indicator for each patient was imputed with logistic regression [12], using auxiliary information on important patient and tumour characteristics associated with stage at diagnosis including patient age, sex, tumour grade (partially missing), CCG, and survival time from diagnosis [13][14][15][16]. The MI indicator for each CCG was estimated as the mean percentage of tumours diagnosed at early stage over ten imputed datasets [17]. Appendix A contains further details of the imputation model.
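The pooling step described above can be illustrated with a minimal sketch. It covers only the averaging of a CCG's early stage percentage over imputed datasets; the chained-equation imputation model itself is not reproduced. The function name and the `predicted_prob_early` input (model-based probabilities of early stage for unstaged tumours) are hypothetical stand-ins, and a full multiple imputation would also draw the model parameters afresh for each imputed dataset.

```python
import random
from statistics import mean


def mi_early_stage_percentage(observed_early, predicted_prob_early,
                              n_imputations=10, seed=2013):
    """Pool a CCG's early stage percentage over several imputed datasets.

    observed_early: 0/1 flags for tumours with recorded stage (1 = early).
    predicted_prob_early: hypothetical probabilities of early stage for
        tumours with no recorded stage (stand-in for the imputation model).
    """
    rng = random.Random(seed)
    n_total = len(observed_early) + len(predicted_prob_early)
    percentages = []
    for _ in range(n_imputations):
        # Impute each missing stage with a Bernoulli draw from its probability.
        imputed = [1 if rng.random() < p else 0 for p in predicted_prob_early]
        n_early = sum(observed_early) + sum(imputed)
        percentages.append(100.0 * n_early / n_total)
    # The MI point estimate is the mean over the imputed datasets.
    return mean(percentages)
```

Averaging the percentage across imputed datasets, rather than using a single imputed dataset, limits the extra noise introduced by the random draws.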
We judged a priori that indicators based on the MI approach were not suitable for routine use in public reporting, primarily because follow-up periods must have elapsed before survival information is available for use in imputation models, and also because of the computational complexity of the underlying statistical methods and end users' lack of familiarity with them. Simpler approaches would be preferable provided they are not associated with a substantial degree of bias. We therefore investigated the degree of bias in CCG scores under two simpler approaches for producing early stage indicators. The first is the 'missing-is-late' indicator, the percentage of all tumours with recorded early stage, which assumes that tumours without recorded stage information are advanced stage tumours. The missing-is-late approach is currently used to produce early stage indicators [1,3,10]. The second is the 'complete-case' indicator, the percentage of staged tumours diagnosed at early stage, which is estimated using only tumours with observed stage. We described the degree of bias in the missing-is-late and complete-case indicators by comparing organisational estimates against the 'best estimate' MI indicator.
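The two simpler specifications differ only in their denominator, which a short sketch makes concrete. The function names and the example counts are illustrative assumptions, not values from the study.

```python
def missing_is_late(n_early, n_late, n_missing):
    """Percentage of ALL tumours with recorded early stage.

    Tumours with no recorded stage stay in the denominator, which is
    equivalent to treating them as late stage.
    """
    return 100.0 * n_early / (n_early + n_late + n_missing)


def complete_case(n_early, n_late):
    """Percentage of STAGED tumours diagnosed at early stage."""
    return 100.0 * n_early / (n_early + n_late)


def bias(indicator_value, best_estimate):
    """Difference (percentage points) between an indicator and a reference value."""
    return indicator_value - best_estimate


# Hypothetical CCG: 400 early, 300 late, 150 tumours with no recorded stage.
example_missing_is_late = missing_is_late(400, 300, 150)
example_complete_case = complete_case(400, 300)
```

For the hypothetical counts above, the missing-is-late value is lower than the complete-case value because the 150 unstaged tumours are added to the denominator only.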

Examining the reliability of early stage indicators
The statistical reliability of a measure indicates its reproducibility (consistency) in repeated measurement and its robustness to random measurement error. Here we are concerned with organisation-level (or Spearman-Brown) reliability which represents the extent to which organisational measures (in our case the measured percentages of cancer patients diagnosed in early stage) reflect true differences between organisations, as opposed to random (i.e. chance) variation [7][18][19][20]. For further details of the calculation of reliability for binary indicators, see Appendix B.
Mixed effects logistic regression models were used to model variation in the percentage of tumours diagnosed at early stage estimated using the complete-case indicator. Our main focus was the composite (all 10 cancers) indicator for CCGs, but we performed similar analyses for each individual cancer site (see Appendix B) and for local government organisations (local authorities) and general practices. These models produced an estimate of the organisation-level variance on the log-odds scale. The estimated variance was used to calculate odds ratios for diagnosis at early rather than late stage comparing the 75th/25th and 95th/5th percentiles of the distribution to illustrate the variation between organisations. Importantly, this was the underlying (true) variation which can be thought of as that which would be seen with very large sample sizes in each organisation, such that the influence of sampling variation would be minimal. This underlying (true) variation will be less than the variation in observed stage metrics as the latter will also include a contribution from chance/sampling [19]. The organisation-level variance on the log-odds scale was also used to calculate the reliability for each indicator based on the number of cases in the study year.
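The percentile-based odds ratios can be sketched as follows, assuming (as is usual for mixed effects logistic models, though not stated explicitly here) that the organisation-level random effects are normally distributed on the log-odds scale. The function name and the example variance are illustrative.

```python
from math import exp, sqrt
from statistics import NormalDist


def percentile_odds_ratio(variance_log_odds, upper=0.75, lower=0.25):
    """Odds ratio comparing two percentiles of the organisation distribution.

    Assumes normally distributed random effects with the given variance
    on the log-odds scale.
    """
    sd = sqrt(variance_log_odds)
    standard_normal = NormalDist()
    # Distance between the two percentiles, in log-odds units.
    gap = sd * (standard_normal.inv_cdf(upper) - standard_normal.inv_cdf(lower))
    return exp(gap)


# Hypothetical organisation-level variance of 0.01 on the log-odds scale.
or_75_25 = percentile_odds_ratio(0.01, 0.75, 0.25)
or_95_5 = percentile_odds_ratio(0.01, 0.95, 0.05)
```

Because the odds ratio depends only on the estimated variance, it describes the underlying spread between organisations and is not inflated by sampling variation in any one year.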
Box 1 Early stage at diagnosis indicator
In the English National Health Service (NHS), the planning, funding and monitoring of healthcare delivery is the responsibility of 'healthcare commissioning' organisations currently known as Clinical Commissioning Groups. These are responsible for geographically defined populations. There are about 200 Clinical Commissioning Groups across England, covering an average general population of about 250,000 residents. To support and promote their planning, funding and monitoring function, high level performance indicators for Clinical Commissioning Groups are published annually, across different disease areas, including cancer. In England, a nationwide population-based cancer registration system has been in existence since 1971. In recent years, the modernisation of cancer registration systems has enabled the capturing of information on stage at diagnosis for a high proportion of patients. This has allowed for the introduction of the 'early diagnosis' indicator for Clinical Commissioning Groups studied in our paper. This indicator relates to the stage at diagnosis of 10 different solid tumour sites, and can be met by a Clinical Commissioning Group if either of the following criteria apply: a) 60% or greater proportion of all registered cases with relevant tumours are known to have been diagnosed in TNM stages 1 or 2; or b) there has been a 4% or greater absolute increase within a year in the proportion of all registered cases with relevant tumours known to have been diagnosed in TNM stages 1 or 2.

In addition to estimating the reliability of the observed data, model outputs were used to estimate the number of tumours required for each organisation to have a reliable estimate of the percentage diagnosed at an early stage based on reliability thresholds of 0.7 and 0.9. A reliability of 0.7 or higher is commonly required in public reporting, while a reliability of 0.9 may be required for high-stakes reporting, including pay-for-performance schemes [6][19][20][21]. Following this we calculated the number of years of data required for reliable reporting at current completeness levels.
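The required number of tumours can be obtained by inverting the reliability formula. The sketch below assumes the commonly used approximation in which the sampling variance of an organisation's log-odds is 1/(n π (1 − π)), so that reliability is σ² / (σ² + 1/(n π (1 − π))). The function names and the example inputs are hypothetical.

```python
from math import ceil


def tumours_required(target_reliability, variance_log_odds, proportion_early):
    """Smallest number of staged tumours giving at least the target reliability.

    Solves  lambda = s2 / (s2 + 1 / (n * p * (1 - p)))  for n.
    """
    within = 1.0 / (proportion_early * (1.0 - proportion_early))
    n = target_reliability / (1.0 - target_reliability) * within / variance_log_odds
    # The small tolerance guards against floating-point noise before rounding up.
    return ceil(n - 1e-9)


def years_required(n_required, tumours_per_year):
    """Whole years of data needed to accumulate the required number of tumours."""
    return ceil(n_required / tumours_per_year)


# Hypothetical inputs: variance 0.01 on the log-odds scale, 50% early stage.
n_public_reporting = tumours_required(0.7, 0.01, 0.5)
n_pay_for_performance = tumours_required(0.9, 0.01, 0.5)
```

The ratio λ / (1 − λ) rises from about 2.3 at λ = 0.7 to 9 at λ = 0.9, which is why the higher threshold requires almost four times as many tumours for the same organisation.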
To illustrate the direct impact of low reliability, we used the estimated distribution of CCG performance in 2013 to evaluate expected misclassification rates for CCGs on the Quality Premium pay-for-performance thresholds. Estimating the overall CCG misclassification rate (in respect of both targets combined) was not possible using one year of data. We therefore performed two similar simulation processes, one for investigating the 60% criterion and one for the ≥4% change criterion (Appendix D). Each proceeded as follows. We started with a list of 209 CCGs and the number of staged tumours (N_i) in 2013 for each CCG. We simulated plausible values of the true performance of each CCG, P_i, using the intercept and random effect from our multi-level model, and mapping back from the log-odds scale to the probability scale. We used the binomial distribution with probability of success P_i and number of trials N_i to generate plausible observed performances for each CCG, given the simulated underlying performance and actual number of staged tumours. For the ≥4% change criterion we simulated two years of data for each CCG with a true, uniform change in performance between the two years, repeated for true changes between −4% and +12%, in steps of 0.1%. We repeated each simulation 10,000 times, examining the sensitivity, specificity, and positive and negative predictive values of both the 60% and ≥4% change criteria. All analyses were carried out in Stata 13 [22].
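The simulation for the 60% criterion can be sketched in a few lines. This is an illustrative Python version, not the Stata code used in the study; the function names, the intercept, the random-effect standard deviation and the CCG sizes passed in are all hypothetical inputs.

```python
import random
from math import exp


def expit(x):
    """Map a log-odds value back to the probability scale."""
    return 1.0 / (1.0 + exp(-x))


def simulate_threshold_misclassification(staged_counts, intercept, sd_log_odds,
                                         threshold=0.60, n_sims=100, seed=1):
    """Count true/false positives and negatives against a fixed threshold."""
    rng = random.Random(seed)
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for _ in range(n_sims):
        for n_i in staged_counts:
            # Plausible true performance P_i: intercept plus a normal random effect.
            p_i = expit(intercept + rng.gauss(0.0, sd_log_odds))
            # Observed early stage count: binomial draw built from n_i Bernoulli trials.
            early = sum(rng.random() < p_i for _ in range(n_i))
            observed_meets = early / n_i >= threshold
            truly_meets = p_i >= threshold
            if observed_meets and truly_meets:
                counts["tp"] += 1
            elif observed_meets:
                counts["fp"] += 1
            elif truly_meets:
                counts["fn"] += 1
            else:
                counts["tn"] += 1
    return counts


def sensitivity(counts):
    """Share of truly qualifying CCGs that appear to meet the target."""
    positives = counts["tp"] + counts["fn"]
    return counts["tp"] / positives if positives else None


def positive_predictive_value(counts):
    """Share of apparently qualifying CCGs that truly meet the target."""
    flagged = counts["tp"] + counts["fp"]
    return counts["tp"] / flagged if flagged else None
```

A call such as `simulate_threshold_misclassification([691] * 209, intercept=0.2, sd_log_odds=0.1)` would mimic 209 equally sized CCGs; in practice the actual staged tumour counts and the fitted model estimates would be supplied instead.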
There was little association between CCG early stage percentages estimated using the indicator based on multiply imputed data and CCG percentages of tumours with missing stage (Fig. 1 panel A). In contrast, when using the missing-is-late specification, we observed a very strong negative relationship between early stage and missing stage percentages (panel B). The complete-case specification did not show a clear association between these two measures (panel C). Fig. 2 shows the bias associated with the amount of missing stage information compared with the 'best estimate' MI indicator (i.e. where bias is the indicator of interest minus the 'best estimate' MI indicator, so that negative values indicate under-estimation of the early stage percentage). Bias in the missing-is-late specification increased in magnitude rapidly as the percentage of tumours with missing stage information increased; median bias across all CCGs was −6% (range −30% to −2%). Using a complete-case specification typically produced less biased estimates than the missing-is-late approach across all CCGs, irrespective of the degree of data completeness. There was a slight positive association between the degree of bias and the percentage of patients with missing stage among CCGs with < 20% missing stage data, and no apparent association among CCGs with > 20% missing stage data. Median bias in the complete-case specification across all CCGs was +2% (range −2% to +7%). Importantly, between-CCG variation in bias due to missing data under the missing-is-late specification (observed range of bias: 28%) was larger than observed variation in early stage on the 'best estimate' (observed range of performance: 21%), while this was not the case for the complete-case indicator (observed range of bias: 9%).
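The strong negative relationship under the missing-is-late specification follows directly from how the two simpler indicators are defined. As a short worked identity, using notation introduced here purely for illustration (E_i early stage tumours, S_i staged tumours, T_i all tumours, and m_i = 1 − S_i/T_i the proportion with missing stage in CCG i):

$$
\text{missing-is-late}_i \;=\; \frac{E_i}{T_i} \;=\; \frac{E_i}{S_i}\cdot\frac{S_i}{T_i} \;=\; \text{complete-case}_i \times (1 - m_i)
$$

For example, a hypothetical CCG with a complete-case value of 55% and 20% missing stage would score 55% × 0.8 = 44% under the missing-is-late specification, so its score falls mechanically as stage completeness falls, whatever its true performance.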

Reliability of the complete-case indicator
The median reliability of the early stage indicator for CCGs was 0.66 (Table 1), despite strong evidence of variation between CCGs (p < 0.0001) and moderate sample sizes for each CCG (median 691 staged tumours). This is below levels of reliability required for use in public reporting or pay-for-performance schemes. The aggregation of three years of data would suffice to produce indicators suitable for public reporting (λ ≥ 0.7) for 90% of CCGs. Indicators for 90% of CCGs with sufficient reliability for use in pay-for-performance schemes (λ ≥ 0.9) would require aggregation of nine years of data. Reliability estimates for individual sites are given in Table C1. For breast and lung cancer, indicators based on three and four years of incident cases, respectively, would allow for adequate reliability (λ ≥ 0.7) for about 70% of all CCGs. For other cancer sites, eight (renal cancer) to 35 (endometrial cancer) years would be required. Results for local authorities were similar, while general practice indicators had very low reliability (Table C2).

Probable misclassification on CCG Quality Premium targets for reporting periods of varying length
Considering the CCG Quality Premium criterion providing financial incentives to CCGs which have 60% of tumours diagnosed at stage 1 or 2 in a single year, based on our simulation (which assumes the complete-case indicator is used), we would expect 40 of the 209 CCGs to appear to meet this 60% target, of which only 21 would have an underlying or long-run performance of 60% or higher, giving a positive predictive value of 53% (Fig. 3). We would expect 29 CCGs to have underlying performance above the 60% target, of which one quarter (eight of 29) would appear to miss the target, giving a sensitivity of 74%. Aggregating multiple years of data reduces expected misclassification rates. Using 2.5 (9) years of data, giving reliability of 0.7 (0.9) for more than 90% of CCGs, increases the expected number of true positives to 23 (25) and reduces the expected number of false positives to 11 (5) (Table C3).
For the 4% year-on-year increase criterion of the CCG Quality Premium, misclassification rates depend on the size of underlying changes in performance expected in the long-term for individual CCGs as well as CCG size. If the CCGs' underlying performance did not change, then with very large sample sizes we would not expect to see any CCGs meet this target. However, based on the actual sample sizes for one year of data we would expect 8% of CCGs to be misclassified as meeting the target if the underlying performance did not change for any CCG (Fig. 4). Furthermore, for a CCG to have an 80% chance of meeting the 4% improvement target they would have to improve their underlying performance such that they increased the percentage of cases diagnosed at early stage by 6.2% (Fig. 4).
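The effect of chance on the year-on-year criterion can also be approximated analytically for a single CCG. The sketch below is a normal approximation, not the simulation used in the study: it assumes two independent years with the same number of staged tumours and uses the baseline proportion for the sampling variance in both years. The function names are hypothetical, and the 55% baseline in the example is an assumed value (691 is the median number of staged tumours reported above).

```python
from math import sqrt
from statistics import NormalDist


def sd_of_change(proportion_early, n_per_year):
    """Approximate standard deviation of the observed year-on-year change."""
    return sqrt(2.0 * proportion_early * (1.0 - proportion_early) / n_per_year)


def prob_meets_change_target(true_change, proportion_early, n_per_year, target=0.04):
    """Chance that the observed change reaches the target, given the true change."""
    sd = sd_of_change(proportion_early, n_per_year)
    return 1.0 - NormalDist(mu=true_change, sigma=sd).cdf(target)


def change_needed_for_power(power, proportion_early, n_per_year, target=0.04):
    """True change needed for a given chance of the observed change reaching the target."""
    sd = sd_of_change(proportion_early, n_per_year)
    return target + NormalDist().inv_cdf(power) * sd


# Hypothetical CCG with 691 staged tumours per year and 55% early stage.
false_positive_chance = prob_meets_change_target(0.0, 0.55, 691)
change_for_80_percent = change_needed_for_power(0.80, 0.55, 691)
```

The two example quantities can be compared with the simulation-based figures reported above, bearing in mind that the simulation covers all CCGs with their actual sizes rather than one CCG of median size.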

Discussion
The current specification of the early stage indicator for English commissioning organisations is biased due to organisational variation in stage completeness. For the period we examine, the degree of bias is so large that it dominates the variability in this indicator. An alternative specification of the indicator based only on tumours with recorded stage is substantially less biased. Nonetheless, such complete-case indicators will not be reliable when based on one year of data, and will be associated with a high degree of random misclassification if used in pay-for-performance schemes. Complete-case indicators will be suitable for public reporting if based on three-year reporting periods. Timely ...
Fig. 1. Observed early-stage percentage calculated using: A. the 'best estimate' multiple imputation approach; B. the missing-is-late approach; and C. the complete-case approach, plotted against the percentage of tumours with no recorded stage information, CCGs, England 2013.
There are no previously published evaluations of the bias or reliability of indicators of cancer stage at diagnosis. Many studies have evaluated the reliability of other performance indicators in healthcare for physicians [7,9], hospitals [23,24], and general practices [8,21] including for several diagnostic activity indicators reported in the Cancer Services Public Health Profiles [19]. Bias due to missing data is also a common problem for measures based on routinely collected data, and multiple imputation in particular is commonly used to correct this in cancer registry data [4,25,26].
The key strength of our study is that we use the same English cancer registry data as the early stage indicator, ensuring our results are directly relevant to the current public reporting and pay-for-performance schemes in England. The main weakness is the lack of an objective gold standard for assessing bias in the indicator. Our estimates of bias under different specifications of the indicator are based on comparisons with complete data produced using multiple imputation, as by definition we do not know the stage of tumours with no recorded stage. This approach could itself be biased if the 'missing at random' assumption does not hold, but this is mitigated by the inclusion of important auxiliary information in the imputation process [15,16,25].
As we had no data on successive years, we only estimated true misclassification rates against the 60% early stage target, but as we have shown, CCGs may be additionally misclassified when considering the 4% early stage improvement criterion. The degree of misclassification we report therefore represents an under-estimate.
Among the 10 cancer sites included in the current indicators, some have higher than average proportion of late stage disease (e.g. lung cancer) whereas the opposite is true for other sites (e.g. breast cancer). The indicator does not take into account between-CCG variation in site-specific incidence or in patient demographics, and this may reduce the validity of the current indicator for comparing CCG performance [27,28]. Adjusting for case-mix factors would be expected to reduce variation between organisations, and so a potential case-mix adjusted indicator might be more valid but less reliable. Future studies should establish the degree by which case-mix drives apparent organisational attainment and potential implications for public reporting conventions.
Continuing improvements in stage completeness in English cancer registry data will reduce the size and the variation of bias in the missing-is-late approach. However, bias due to missing stage information under this approach will remain a major problem until all CCGs have very similar stage completeness rates. In our study year the alternative complete-case approach has less bias than the current missing-is-late approach even for CCGs with very high stage completeness, and so would be expected to remain the best option as stage completeness continues to improve.
Aggregating 3 years of data will produce a reliable early stage indicator, suitable for use in public reporting, and we endorse this approach. Pay-for-performance schemes for Clinical Commissioning Groups should not use the early stage indicator, as sufficiently reliable indicators require more than eight aggregate years of data which greatly limits potential uses. The resulting high levels of misclassification on the indicator when based on a single year mean that many CCGs will receive financial rewards despite their underlying performance being below the pay-for-performance threshold. The opposite is also true, i.e. some CCGs should be rewarded but will not be. Appropriate process indicators could give more accurate, reliable, and timely information about local diagnostic performance for cancer [29,30], where there are clear links between processes and improved stage at diagnosis, survival, or quality of life. Screening coverage, for example, is a useful measure for breast, colorectal and cervical cancers [31,32]. Other examples might include organisational measures of use of endoscopies or urgent referrals for suspected cancer (otherwise known as 'two-week-wait' referrals), as they are associated with clinical outcomes [33,34]. More generally, there is a need for research to identify diagnostic process indicators which are truly linked to better outcomes for cancer patients, and to identify the organisations best placed to improve local and national performance.
Table 1 Number of CCGs, staged tumours per CCG, odds ratios over estimated underlying distribution of CCG performance, quartiles of the reliability of the complete-case early stage indicator, and the number of tumours and associated aggregated years of data for 50%, 70%, 90% and 100% of CCGs to have reliability of 0.7 or higher or of 0.9 or higher.
The development of indicators of cancer diagnosis must involve the evaluation and correction of issues of bias and low reliability. The methods we have highlighted here allow for investigation of these problems, and should form part of the process for the development of such indicators before their introduction into practice. Organisations should not be ranked on severely biased quality measures, and financial incentives should only be linked to highly reliable indicators. Cancer stage indicators should not form part of pay-for-performance schemes for CCGs, and public reporting of the early stage indicator should use three-year reporting periods and be calculated as the percentage of staged tumours diagnosed at an early stage.

Authorship contribution statement
GL and GAA conceived the study. GAA and MB designed the study. MB and DG analysed data. All authors contributed to decisions about data analysis interpretation and drafted the article. All authors approved the final version for submission.

Conflicts of interest
None.

GL is funded by a Cancer Research UK Advanced Clinician Scientist Fellowship award (grant number C18081/A18180). We thank Lucy Elliss-Brookes, Sean McPhail, and Sam Johnson for helpful discussions about the design of early stage indicators. Data used in this study were collated, maintained and quality assured by the National Cancer Registration and Analysis Service, which is part of Public Health England (PHE).

Appendix A. Details of multiple imputation of stage for patients with tumours with no recorded stage information
Stage data were 82% complete overall, with at least 70% completeness for each cancer site. However, stage completeness and the distribution of stage at diagnosis where known varied substantially by site (Fig. A1), and stage completeness also varied substantially by CCG (Fig. A2).
Multiple imputation is a recommended method for handling missing stage information in cancer registry data (Table A1). We created a binary stage variable being 'early' (TNM stages 1 or 2) or 'late' (TNM stages 3 or 4) stage. Imputation was performed separately for each cancer site, splitting colorectal cancer into colon and rectal cancer.
We used logistic regression to impute the binary indicator of early stage at diagnosis on:
• CCG of patient at diagnosis
• Region of residence of patient at diagnosis
• Sex of patient
• Interaction between sex and region
• Age group of patient at diagnosis (30-39, then five-year age groups, then 90-99, except for prostate and bladder cancer where the youngest age group was 30-44 due to smaller numbers in this age range)
• Interaction between age group and region
• Deprivation group, fifth of the income domain of IMD 2010
• Interaction between deprivation group and region
• Ethnicity of patient (white or non-white)
We only included patients aged 30-99 at diagnosis. We felt that predictors of stage at diagnosis for patients outside this age range may not reflect those of more typical patients. There were few patients either aged 29 and under (1591 of 208,141, 0.8%) or 100 and older (104 of 208,141, 0.05%), so separate imputation was not feasible.
Screening detection status was applicable for breast, colon and rectal cancers. For melanoma and endometrial cancer, early mortality and non-microscopic diagnosis were both extremely rare and the inclusion of such indicators led to problems with model convergence. For melanoma, tumour grade is both less clinically relevant and had low completeness.
All variables used in imputation models were complete, except for tumour grade. For cancer sites other than melanoma, we used predictive mean matching to impute tumour grade based on the (possibly imputed) binary indicator of early stage at diagnosis and on the other variables and interactions used in imputing stage.
Thus for melanoma we used multiple imputation by logistic regression, while for other sites we used multiple imputation by chained equations. We used ten iterations of the chain as burn-in, having previously checked graphically that doing so led to convergence.

Appendix B. Organisation-level reliability for binary indicators
The statistical reliability of a measure generally indicates its reproducibility (consistency) in repeated measurement and its robustness to random measurement error. Here we are concerned with organisation-level reliability, also termed unit-level reliability where units could be commissioners, providers, or geographical areas. In the context of our study, organisation-level (or Spearman-Brown) reliability represents the extent to which measured percentages of cancer patients diagnosed in early stage reflect true differences between organisations, as opposed to random (i.e. chance) variation. Alternatively, the Spearman-Brown reliability is the proportion of the observed organisational variation not due to chance.
Poor reliability often arises when the typical number of cases per organisation (in a given reporting period) is small. The problem is further exacerbated when small sample sizes are combined with limited variation between organisations. Reliable indicators can help to classify organisational performance and thus enable accurate targeting of improvement efforts and rewards. Conversely, using unreliable indicators can lead to harm through wasting of scarce improvement resources and related opportunity costs. Further, misclassified 'poorly performing' organisations may sustain unfair reputational or financial loss [6,9].
Reliability takes a value between 0 and 1, with higher values denoting more reliable indicators. A reliability of 0.5 indicates that half of the observed variance is due to chance. A reliability of 0.7 is often required for public reporting of indicators, while a reliability of 0.9 may be required for pay-for-performance use [6,20,21]. Organisation-level reliability λ_i for organisation i is defined as

$$
\lambda_i = \frac{\sigma^2}{\sigma^2 + \sigma^2_{w,i} / n_i}
$$

where σ² is the between-organisation variance, σ²_{w,i} is the within-organisation variance for organisation i, and n_i is the number of cases in organisation i. For continuous indicators, this calculation is straightforward [6]. For binary indicators, the within-organisation variance will depend directly on the level of achievement at each individual organisation, according to the binomial distribution [18,20]. It is important to note that as reliability depends on both the organisational sample size and organisational achievement it is specific to each organisation rather than to the indicator as a whole.
We used mixed effects logistic regression models to estimate the organisation-level variance on the log-odds scale (σ²). Reliability is then given by

$$
\lambda_i = \frac{\sigma^2}{\sigma^2 + \dfrac{1}{n_i \hat{\pi}_i (1 - \hat{\pi}_i)}}
$$

where \hat{\pi}_i is the observed performance of organisation i on the indicator as a proportion and n_i is the number of cases in organisation i [18]. From this formula it can be seen that higher reliability can be achieved by increasing the between-unit variation or by increasing sample sizes. Additionally, for binary indicators, higher reliability is achieved with performance closer to 50%.
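A minimal sketch of the calculation for a single organisation is given here, assuming the commonly used approximation in which the sampling variance of an organisation's log-odds is 1/(n π (1 − π)). The function name and the example values are hypothetical.

```python
def binary_indicator_reliability(variance_log_odds, n_cases, proportion):
    """Organisation-level reliability of a binary indicator.

    variance_log_odds: between-organisation variance on the log-odds scale.
    n_cases: number of cases for this organisation in the reporting period.
    proportion: observed performance of the organisation as a proportion.
    """
    # Approximate sampling (chance) variance of the organisation's log-odds.
    sampling_variance = 1.0 / (n_cases * proportion * (1.0 - proportion))
    return variance_log_odds / (variance_log_odds + sampling_variance)


# Hypothetical organisation: variance 0.01, 400 cases, 50% early stage.
example_reliability = binary_indicator_reliability(0.01, 400, 0.5)
```

In this hypothetical example the between-organisation variance and the sampling variance are equal (both 0.01), so half of the observed variance would be attributable to chance.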
Appendix C. Reliability of early stage indicators for the composite indicator for CCGs, local authorities and general practices, with years of data required for indicators suitable for public reporting and pay-for-performance use and associated expected misclassification rates