Challenges in risk estimation using routinely collected clinical data: The example of estimating cervical cancer risks from electronic health-records.

Electronic health-records (EHR) are increasingly used by epidemiologists studying disease following surveillance testing to provide evidence for screening intervals and referral guidelines. Although EHR are cost-effective, undiagnosed prevalent disease and interval censoring (in which asymptomatic disease is only observed at the time of testing) raise substantial analytic issues when estimating risk that cannot be addressed using Kaplan-Meier methods. Based on our experience analysing EHR from cervical cancer screening, we previously proposed the logistic-Weibull model to address these issues. Here we demonstrate how the choice of statistical method can impact risk estimates. We use observed data on 41,067 women in the cervical cancer screening program at Kaiser Permanente Northern California, 2003-2013, as well as simulations to evaluate the ability of different methods (Kaplan-Meier, Turnbull, Weibull and logistic-Weibull) to accurately estimate risk within a screening program. Cumulative risk estimates from the statistical methods varied considerably, with the largest differences occurring for prevalent disease risk when baseline disease ascertainment was random but incomplete. Kaplan-Meier underestimated risk at earlier times and overestimated risk at later times in the presence of interval censoring or undiagnosed prevalent disease. Turnbull performed well, though it was inefficient and not smooth. The logistic-Weibull model performed well, except when event times did not follow a Weibull distribution. We have demonstrated that methods for right-censored data, such as Kaplan-Meier, result in biased estimates of disease risks when applied to interval-censored data, such as screening programs using EHR data. The logistic-Weibull model is attractive, but the model fit must be checked against Turnbull non-parametric risk estimates.


Introduction
Large-scale epidemiologic research is shifting from formally designed trials and observational studies to increasing use of electronic health-records (EHR) that contain longitudinal data for millions of people in routine clinical practice. In particular, EHR facilitate studies of screen-detected disease and precursors in population screening programs, where risks of disease are typically low and thus very large studies are required. The longitudinal nature of these records is of particular importance when estimating risk over time, for example to provide evidence on which to base screening intervals.
Three key features of electronic health record data should be accounted for when disease is asymptomatic and only detected through screening or clinical testing. First, asymptomatic disease onset is only known to occur between the last time considered 'disease-free' and the time of diagnosis; this is called interval censoring.(1) Unlike designed observational studies, where researchers control when participants return, patients are encouraged to comply with doctors' suggestions, but can return at any time for unknown reasons, and thus the interval censoring is irregular and may be informative (i.e., linked to risk of disease). Second, disease may be present at the first screen (prevalent disease); this is called left-censoring. Third, prevalent disease is not always immediately diagnosed. In particular, people with negative screening tests generally do not undergo definitive disease ascertainment (known as verification bias (2)), and in the real world many individuals with a positive screen will not have a diagnostic test, or will postpone such testing. Consequently, when prevalent disease was not ascertained but disease is diagnosed at future screens, some of that 'incident' disease is actually undiagnosed prevalent disease. In our experience, designed observational studies either exclude prevalent disease (and even incident disease diagnosed too close to baseline), or assume all detected disease is incident; however, prevalent disease is often relevant, not merely a nuisance factor to be removed. These three issues also apply to designed observational studies of screening, if baseline disease ascertainment is not performed on everyone (or at least a representative sample) or if there is substantial variation in visit times between participants. As risk estimates from routine clinical practice are crucial to inform screening guidelines, risk must be estimated accurately using appropriate methodology.
We critically examine the performance of different methods for the analysis of EHR to estimate risk, providing examples to show the magnitude of bias when using off-the-shelf methods in realistic settings. The standard Kaplan-Meier method cannot account for the above three features, yet is commonly used. We have proposed a new model, the logistic-Weibull model,(3) that does account for the above three issues. We have also proposed a modification of the non-parametric Turnbull method (3) to check the fit of the logistic-Weibull model. We use simulations to examine critically the assumptions underlying all methods, and sensitivity to these assumptions, to recommend a robust methodology and practical advice for epidemiologists calculating risk from EHR. In addition to the simulations, we provide an example using observed cervical screening data.

Cervical cancer and cervical screening
Cervical cancer originates from a persistent high-risk human papilloma virus (HPV) infection, which may progress to asymptomatic precancer, occult cancer and symptomatic invasive cancer. There is also natural regression within this process, when the body naturally clears/suppresses HPV or even apparent precancer, immunologically. Cervical screening has been highly successful at preventing cervical cancer where good programmes with wide coverage exist, through detection and removal of precancers. There are currently two tests which are widely used in cervical screening programmes. Traditionally cytology was used, where the cells are examined and abnormal cells are identified. More recently HPV testing was introduced, which tests for the presence of carcinogenic HPV. HPV testing has higher sensitivity to detect pre-cancerous disease than cytology, but is less specific.(4) Depending on the result of the tests, women are invited back for their next test at a routine screening interval, invited back earlier for more intensive surveillance, or referred for magnified examination (colposcopy), biopsy and possible treatment. Disease ascertainment only occurs at colposcopy. Cervical cancer is an ideal disease for a screening programme because detectable and highly treatable asymptomatic pre-cancerous lesions can be targeted, and their rate of growth and invasion is typically slow. As there are harms associated with screening as well as benefits, it is important that screening intervals are an appropriate length; short enough not to miss the passage through precancer to cancer, but long enough that the risk of detecting precancer is not negligible. It is therefore useful to know how long after a given screening result precancer first becomes detectable, to determine appropriate screening intervals. It is also important to estimate prevalent risk accurately, as this informs whether to intervene when the initial test results are known. 
Data from screening programmes are routinely collected, containing the date and result of each screening test.

Methods for estimating risk
There are three main classes of models available to analyse data: non-parametric, semiparametric and parametric models. Statistical models are simplifications of reality, embodying numerous assumptions. Non-parametric models for a distribution function make no distributional assumptions, unlike parametric models, which specify the distribution in terms of parameter(s). Semi-parametric models have both parametric and non-parametric components. More details on the methods described below are available in Supplementary Material 1.
The Kaplan-Meier method,(5) which is non-parametric, is the most widely used method of estimating risk of precancerous cervical disease as a function of time from an 'entry' screen.(6-10) However, Kaplan-Meier is not appropriate when disease is screen-detected. The most popular adaptation of the Kaplan-Meier estimator equates the time of onset with time of diagnosis, which consistently overestimates the time of onset. To improve this, the midpoint of the interval in which disease could have occurred can be imputed as the time of onset. This too causes bias unless the screening interval is homogeneous and short.(11) The Kaplan-Meier estimator only correctly estimates time of screen-detected disease diagnosis (not time of disease onset) at times when all participants have their disease status ascertained. However, risk of diagnosis for asymptomatic conditions is largely determined by the testing schedule, limiting consideration to the chosen screening intervals underlying the data. Consideration of disease onset is required to calculate risks for any possible screening interval, and requires consideration of prevalent and incident disease.
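The two naive adaptations described above can be illustrated with a small sketch. It assumes each woman's data are summarised as an interval (L, R], with R set to infinity when no disease was found; the function names and the toy data are illustrative and are not taken from the study.

```python
import math

def impute_event_times(intervals, rule):
    """Collapse (L, R] intervals to right-censored data for Kaplan-Meier.

    rule="diagnosis" uses R as the event time; rule="midpoint" uses (L + R) / 2.
    Intervals with R = math.inf are censored at L (the last disease-free screen).
    """
    times, events = [], []
    for left, right in intervals:
        if math.isinf(right):
            times.append(left)
            events.append(0)
        else:
            times.append(right if rule == "diagnosis" else (left + right) / 2)
            events.append(1)
    return times, events

def kaplan_meier_risk(times, events):
    """Return [(time, cumulative risk)] at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = n_tied = 0
        # Group all observations tied at time t (events and censorings).
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_tied += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, 1 - survival))
        n_at_risk -= n_tied
    return curve

# Toy data (hypothetical): two women with disease found, two without.
intervals = [(0, 2), (1, 3), (2, math.inf), (3, math.inf)]
by_diagnosis = kaplan_meier_risk(*impute_event_times(intervals, "diagnosis"))
by_midpoint = kaplan_meier_risk(*impute_event_times(intervals, "midpoint"))
```

On these toy data the estimated cumulative risk at 2 years is 0.25 when the diagnosis time is used and 0.5 when the midpoint is used, so the same observations give different curves depending only on the imputation rule.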
The Turnbull estimator (12) is non-parametric, and appropriate for prevalent (including undiagnosed) and incident disease. We have adapted Turnbull to account for undiagnosed baseline disease (3) (see Supplementary Material 1). However, the adapted Turnbull method cannot account for covariates and can result in survival curves with big steps and wide confidence intervals even in a dataset of 1 million women.
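For intuition, the following is a minimal sketch of a Turnbull-type self-consistency (EM) iteration for interval-censored data. It is not the adapted Turnbull method described above and does not handle undiagnosed baseline disease. As simplifying assumptions, it places probability mass on the cells formed by all consecutive interval endpoints rather than on Turnbull's innermost intervals, runs a fixed number of iterations, and requires every interval to have positive width.

```python
import math

def turnbull_em(intervals, n_iter=500):
    """Estimate the probability mass in each cell (a, b] from (L, R] intervals.

    Right-censored observations use R = math.inf. Returns (cells, masses).
    """
    cuts = sorted({x for interval in intervals for x in interval})
    cells = list(zip(cuts[:-1], cuts[1:]))
    # alpha[i][j] is 1 when cell j lies inside observation i's interval.
    alpha = [
        [1.0 if (left <= a and b <= right) else 0.0 for a, b in cells]
        for left, right in intervals
    ]
    masses = [1.0 / len(cells)] * len(cells)
    for _ in range(n_iter):
        updated = [0.0] * len(cells)
        for row in alpha:
            denom = sum(a * p for a, p in zip(row, masses))
            for j, (a, p) in enumerate(zip(row, masses)):
                # Expected share of this observation falling in cell j.
                updated[j] += a * p / denom
        masses = [u / len(intervals) for u in updated]
    return cells, masses

# Toy data (hypothetical): the last woman is right-censored at 1 year.
cells, masses = turnbull_em([(0, 1), (0, 1), (0, 2), (1, math.inf)])
```

The cumulative risk at the end of a cell is the running sum of the masses up to that cell. Because mass is only assigned to cells, the resulting curve is a step function, which is the source of the steps mentioned above.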
To account for covariates and improve statistical efficiency, we recently developed the logistic-Weibull model, a parametric model that jointly models prevalent and incident disease.(3) The absence/presence of prevalent disease is modelled using logistic regression, and cumulative risk of incident disease among women who did not have disease diagnosed at baseline is modelled using a Weibull survival model. Weibull models account for interval censoring, and are reasonable when follow-up times are short relative to a woman's typical time to event.(13) The cumulative risk is a weighted sum of prevalent logistic-regression disease risk $\pi_i(\beta)$ and incident Weibull-model disease risk $1 - S_i(t; \gamma, \tau)$ at time $t$:

$$\text{cumulative risk}_i(t) = \pi_i(\beta) + \{1 - \pi_i(\beta)\}\{1 - S_i(t; \gamma, \tau)\},$$

where $X_i$ and $Z_i$ are vectors of covariates, $\beta$ are regression coefficients for prevalent disease effects, $\gamma$ are regression coefficients for incident disease effects, and $\tau$ governs the shape of the Weibull distribution. Because undiagnosed baseline disease can be considered as "missing data", we developed an EM algorithm to estimate model parameters.(14) In the special case of no undiagnosed baseline disease (i.e. everyone at baseline undergoes definitive disease ascertainment), the logistic regression and Weibull regression can be conducted separately to obtain parameter estimates. Logistic-Weibull models were recently used by Katki et al (15-17) to estimate cervical cancer and pre-cancer risks among 1 million women undergoing cervical screening to inform U.S. risk-based screening guidelines for cervical cancer.(18) Both Turnbull and the Weibull models account for interval censoring, and are fitted to interval data, that is, the interval $(L_i, R_i]$ in which disease could have occurred, for all women who were not known to have prevalent disease. As the software for fitting models to interval-censored data cannot handle intervals of width 0, for Turnbull we define prevalent disease to occur in the interval [0, 0.01 years), to avoid overlapping with incident disease.
In the logistic-Weibull model the intervals for participants in whom baseline disease status was not ascertained, but disease was present in follow-up, are set to begin at exactly 0, whereas incident disease intervals that begin at zero are pushed to 0.01 years.
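The weighted sum described above can be sketched as follows. This is an illustration only: it assumes the usual prevalence-incidence mixture form, risk = π + (1 − π)(1 − S(t)), a logistic model for π, and a Weibull survival function S(t) = exp(−(t / scale)^shape) whose log-scale is linear in the covariates. The function names and the parametrisation are assumptions and do not describe the PIMixture interface.

```python
import math

def prevalent_risk(x, beta):
    """Logistic-regression probability of prevalent disease."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-eta))

def weibull_survival(t, z, gamma, shape):
    """Probability of remaining free of incident disease at time t."""
    log_scale = sum(zi * gi for zi, gi in zip(z, gamma))
    return math.exp(-((t / math.exp(log_scale)) ** shape))

def cumulative_risk(t, x, beta, z, gamma, shape):
    """Prevalent risk plus incident risk among those without prevalent disease."""
    pi = prevalent_risk(x, beta)
    return pi + (1.0 - pi) * (1.0 - weibull_survival(t, z, gamma, shape))

# Intercept-only example: 3% prevalent disease, shape 2, log-scale 3.1.
beta = [math.log(0.03 / 0.97)]
risk_at_baseline = cumulative_risk(0.0, [1.0], beta, [1.0], [3.1], 2.0)
risk_at_3_years = cumulative_risk(3.0, [1.0], beta, [1.0], [3.1], 2.0)
```

At time 0 the cumulative risk equals the prevalent risk, and it increases smoothly thereafter as incident disease accrues.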
The Weibull model smooths over flats and sudden peaks which could result from Kaplan-Meier or Turnbull models, producing estimates at all time points. The Weibull model was chosen based on the Armitage and Doll model of carcinogenesis, (19) which hypothesised that the accumulation of a few critical steps in pathogenesis (e.g., mutations) was required to transform a normal cell to a cancerous cell. We have recently generalized logistic-Weibull to a semiparametric logistic-Cox model that makes fewer distributional assumptions, and a weakly-parametric model using integrated B-splines to model incident disease. This model can be useful when no parametric assumptions are a good fit, but a smooth risk estimate is preferred. (20) While a testing schedule with shorter screening intervals will result in more accurate estimation, the results of Turnbull and logistic-Weibull analyses are not biased by the testing schedule. Both Turnbull and logistic-Weibull assume baseline disease ascertainment occurs at random, though for logistic-Weibull this can be at random conditional on covariates which can be adjusted for, such as prior screening history or age. We note that whilst these methods are generalizable to other screen-detected diseases, Weibull may not always be an appropriate parametric model.
We compare risk estimates from Kaplan-Meier, Kaplan-Meier using the interval midpoint as date of diagnosis, adapted Turnbull and Weibull/logistic-Weibull, to demonstrate the impact the choice of statistical method can have on the results.

Cervical Screening Data
Data were available on 1,037,065 women aged 30-64 who had at least one cervical screening co-test (HPV test and cervical cytology) between 1 January 2003 and 30 June 2013 at Kaiser Permanente Northern California (KPNC). We estimate risk of precancer or worse (CIN2+, including cervical intraepithelial neoplasia grade 2, 3, adenocarcinoma in situ, or cancer) following an HPV-positive, cytology-negative co-test using data from the 41,067 women who had this screening result. This is a relatively common, non-normal screening result, the management of which remains controversial.
To investigate the impact of the choice of interval in which disease is defined to have occurred, we reanalyse these data using three different definitions for intervals, defined in Supplementary Material 2. Different interval definitions allow analyses to account for test insensitivity. As this example uses observed data, the true risk is unknown.

Simulations
We carried out a series of simulations to examine critically the properties of Kaplan-Meier using the diagnosed time and the interval midpoint as the event time, adapted Turnbull and Weibull/logistic-Weibull models. Unless otherwise specified, screening was simulated to occur independently of disease risk at intervals drawn from an exponential distribution with mean 3.5 years, variance 9 years, up to a maximum of 10 years of follow up. The screening test was presumed to always detect disease present at the time of testing.
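To make the link between a screening schedule and interval censoring concrete, the sketch below turns a true onset time into the interval that would be observed. It assumes a disease-free screen at time 0, exponentially distributed gaps between screens, a perfectly sensitive test and 10 years of follow-up. It uses only the stated mean gap of 3.5 years and does not attempt to match the stated variance. The function name is illustrative.

```python
import math
import random

def observe_interval(onset, rng, mean_gap=3.5, max_follow_up=10.0):
    """Return (L, R]: last disease-free screen and first screen with disease.

    R is math.inf when no screen at or after onset occurs within follow-up.
    """
    last_negative = 0.0  # baseline screen, assumed disease-free
    t = 0.0
    while True:
        t += rng.expovariate(1.0 / mean_gap)
        if t > max_follow_up:
            return last_negative, math.inf
        if t >= onset:
            return last_negative, t
        last_negative = t

rng = random.Random(2024)
example_interval = observe_interval(onset=4.2, rng=rng)
```

The true onset time is never returned: only the bracketing screens are, which is why methods that treat R as the event time overestimate the time of onset.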
In simulation 1, event times were simulated assuming a Weibull distribution (Figure 1a), with disease present in 1.8% by 3 years and 7.0% by 6 years (corresponding to shape parameter 2 and log-scale parameter 3.1) with no prevalent disease at time 0. In total, 1000 datasets were simulated from this distribution with population sizes of (i) 250, (ii) 2500 and (iii) 10,000 people.
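As a quick check of the stated parameters, the sketch below evaluates the Weibull cumulative distribution function, assuming the parametrisation F(t) = 1 − exp(−(t / scale)^shape) with scale = exp(log-scale). The function name is illustrative.

```python
import math

def weibull_cumulative_risk(t, shape=2.0, log_scale=3.1):
    """Cumulative risk of disease onset by time t (in years)."""
    scale = math.exp(log_scale)
    return 1.0 - math.exp(-((t / scale) ** shape))

risk_3y = weibull_cumulative_risk(3.0)
risk_6y = weibull_cumulative_risk(6.0)
print(f"3 years: {100 * risk_3y:.1f}%, 6 years: {100 * risk_6y:.1f}%")
```

Under this parametrisation the values round to 1.8% at 3 years and 7.0% at 6 years, matching the risks quoted for simulation 1.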
In simulation 2, we assumed a prevalent disease risk of 3%, and thereafter event times were simulated under the Weibull distribution used in simulation 1. Baseline disease status was ascertained in 20% of the population selected at random. 1000 datasets of size 10,000 were simulated from this distribution. We then repeated the 1000 simulations, allowing for nonrandom baseline disease ascertainment: 52% of the population with prevalent disease had their baseline disease status ascertained, as did 19% of the population without prevalent disease, corresponding to 20% of the population overall.
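The overall ascertainment proportion under non-random ascertainment follows from weighting the two ascertainment probabilities by the 3% prevalence:

```latex
\begin{align*}
P(\text{ascertained}) &= 0.03 \times 0.52 + 0.97 \times 0.19 \\
                      &= 0.0156 + 0.1843 = 0.1999 \approx 20\%
\end{align*}
```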
In simulation 3, event times were simulated under a gamma distribution (Figure 1b), with disease present in 2.1% by 2 years and 35.2% by 4 years (corresponding to shape parameter 8 and scale parameter 0.6). 1000 datasets of size 10,000 were simulated.
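The quoted gamma risks can be checked without special libraries because the shape parameter is an integer, in which case the gamma distribution is an Erlang distribution with a closed-form cumulative distribution function. The sketch assumes the shape-scale parametrisation; the function name is illustrative.

```python
import math

def erlang_cumulative_risk(t, shape=8, scale=0.6):
    """Gamma CDF for integer shape: 1 - exp(-x) * sum_{k<shape} x^k / k!."""
    x = t / scale
    partial_sum = sum(x ** k / math.factorial(k) for k in range(shape))
    return 1.0 - math.exp(-x) * partial_sum

risk_2y = erlang_cumulative_risk(2.0)
risk_4y = erlang_cumulative_risk(4.0)
print(f"2 years: {100 * risk_2y:.1f}%, 4 years: {100 * risk_4y:.1f}%")
```

The values round to 2.1% at 2 years and 35.2% at 4 years. The steep rise between these two time points is a shape that a Weibull curve cannot follow closely.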
Simulating data from known distributions allowed the true risk to be known, and compared to the estimated risks. We calculated the sum of the squared difference between the estimates from each model and the true cumulative risk calculated each day up to 10 years to compare the goodness-of-fit between the models.
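The goodness-of-fit measure can be sketched as below. The sketch assumes 365 days per year and a daily grid starting at day 1, neither of which is specified in the text, and it takes the estimated and true risk curves as functions of time in years. The function name is illustrative.

```python
def sum_squared_error(estimated_risk, true_risk, years=10.0, days_per_year=365):
    """Sum of squared differences between two risk curves on a daily grid."""
    n_days = int(years * days_per_year)
    total = 0.0
    for day in range(1, n_days + 1):
        t = day / days_per_year
        total += (estimated_risk(t) - true_risk(t)) ** 2
    return total

# Hypothetical curves: an estimate that is always 1 percentage point too high.
def truth(t):
    return 0.01 * t

def biased_estimate(t):
    return 0.01 * t + 0.01

sse = sum_squared_error(biased_estimate, truth)
```

Because the measure sums over every day, a small but persistent bias accumulates; a curve that matches the truth at every grid point scores zero.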

Results
When analysing the observed KPNC cervical screening data, we see that the risk estimates vary considerably over time between the four methods of analysis (Figure 2). Naïve Kaplan-Meier provided the lowest risk estimates at earlier times, and the highest estimates at later times. The largest difference in risk estimates is at baseline, where Kaplan-Meier estimates are close to zero due to low levels of prevalent disease ascertainment, whereas Turnbull and logistic-Weibull models provide higher estimates of prevalent (including undiagnosed) disease. Other than for prevalent disease, Kaplan-Meier using the midpoint provides similar estimates to the Turnbull and logistic-Weibull models in this example.
In the simulations, we can compare the risks estimated by each method to the true risk. For simulation 1, where event times were simulated from a Weibull distribution with no prevalent disease, we show the risk in a single simulation with sample sizes 250, 2,500 and 10,000, and the mean risk over 1,000 simulations with sample size 2,500, for each of the four methods considered (Figure 3). Kaplan-Meier substantially underestimates risk until around 8.5 years, after which it overestimates risk. Although Kaplan-Meier using the interval midpoint provides reasonable risk estimates at some time points, it is inaccurate at other points (years 5-7). In general, it is impossible to know when these estimates will be accurate. The Turnbull estimates are generally reasonable; although the mean over 1,000 simulations yields unbiased risks (Figure 3d), the risks in any given analysis (Figures 3a, 3b, 3c) remain choppy, as additional jump points accrue slowly as the sample size increases. Increasing sample size reduces the choppiness of Kaplan-Meier, Kaplan-Meier using the interval midpoint and Turnbull estimates. In contrast to Kaplan-Meier and Turnbull, the Weibull model provides smooth estimates of risk at all times. The sum of squared errors (a measure of goodness of fit, where low values indicate better fit) was lowest for Weibull (sample size 10,000: $9.1 \times 10^{-6}$), then Turnbull (0.059).
Using data from the same simulation, we present the 90% empirical confidence intervals for the Turnbull and Weibull models, as well as the true risk, in Figure 4. The confidence intervals are wider for the Turnbull than Weibull model, especially for smaller sample sizes.
In our simulations, a Turnbull analysis of a sample of approximately 20,000 gave confidence intervals of the same width as a Weibull model fitted to a sample of 2,500, i.e. the sample required was eight times bigger (not shown).
Although Kaplan-Meier can be adapted to handle disease diagnosed at baseline, it cannot be adapted to handle undiagnosed prevalent disease, which results from incomplete disease ascertainment at baseline. Using simulation 2, where 3% of the sample had prevalent disease, with 20% random baseline disease ascertainment, Figure 5 shows that Kaplan-Meier underestimated baseline disease risk by 80%, as disease status was ascertained in only 20% of the sample in this simulation. The (unadapted) Turnbull model also underestimated prevalent disease risk. The pure Weibull model does not allow prevalent disease, resulting in a poor fit to the data. In contrast, the adapted Turnbull and logistic-Weibull provide accurate estimates of prevalent disease risk. Table 1 shows the prevalent disease estimates for sample sizes 250, 2500 and 10,000, for random and non-random baseline disease ascertainment. The Kaplan-Meier estimates are as expected (0.52*3% + 0.19*0%), estimating the risk as the amount of disease that was detected at baseline. Although the Turnbull and logistic-Weibull estimates were quite biased with a sample size of 250, in different directions, they were less biased than the Kaplan-Meier estimates. Turnbull provided the least biased results with the larger sample sizes.
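The expected naive estimates of prevalent disease follow directly from the proportion of prevalent disease that is ascertained at baseline. Assuming the naive estimate equals the proportion of the whole sample with disease detected at baseline, and a true prevalence of 3%:

```latex
\begin{align*}
\text{random ascertainment: }     & 0.20 \times 3\% = 0.6\%,
  \quad 1 - \frac{0.6}{3} = 0.80 \\
\text{non-random ascertainment: } & 0.52 \times 3\% + 0.19 \times 0\% = 1.56\%,
  \quad 1 - \frac{1.56}{3} = 0.48
\end{align*}
```

The second quantity on each line is the relative underestimation, so the 80% figure for random ascertainment corresponds to the first line.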
In Figure 6, where data were simulated from a gamma distribution in simulation 3, the naïve Kaplan-Meier estimates are much lower than the true risks, whereas the estimates from Kaplan-Meier using the midpoint are high but move closer to the truth from 4 years. The Weibull model produces reasonable estimates at most time points, but from years 1.5-3 the estimates are poor; for example, at 2 years the true risk is 2.1%, but the Weibull model estimate is 4.6%. The Weibull estimates also diverge from the truth from 4 years. The Turnbull estimates are unbiased.
Using the observed KPNC cervical screening data, we estimated disease risks using three sets of rules to define the start and end of the interval in which disease occurred. Table 2 demonstrates that, although there are differences in risk estimates between the risk-estimation methods, the results are very similar (the maximum absolute difference is 2%) within each method for the three interval definitions. In this example, 30% of the intervals are affected by the choice of when to start and end the interval. The biggest difference occurs in the prevalent disease risk. More details on the results in Table 2 can be found in Supplementary Material 2.

Discussion
It is well known that the choice of analysis model affects the validity of the results. For electronic health-records of screen-detected disease, where the data are interval-censored, standard methods for right-censored data, such as naïve Kaplan-Meier, are not appropriate, as they produce biased results. The Turnbull estimate is unbiased, though is choppy and has wide confidence intervals, whereas parametric estimators are smooth and require much smaller sample sizes to produce the same width confidence intervals. It is important to handle prevalent disease appropriately, particularly when using parametric models, and to check the parametric logistic-Weibull model versus the non-parametric adapted Turnbull, as the underlying distribution of the data is rarely known, and the Weibull model may not provide a good fit to the data.
Kaplan-Meier methods may be appropriate in the absence of undiagnosed baseline disease, and when the intervals between screens are short or everyone returns at regular intervals. Such conditions might hold for designed studies, but are unlikely to hold for EHR data from routine clinical practice.
The risk of detectable prevalent disease is of interest as it determines whether to refer patients immediately for further (potentially invasive) tests. Outside of trials, it is difficult to know what proportion of women with a negative test actually have disease at baseline. This is because the women with negative tests who are referred for further tests are not a random sample of the women with negative tests; rather they are the women believed to be at higher risk, for example due to symptoms. This is also true for the situation modelled here, of risk following an HPV-positive cytology-negative screening result, where guidelines suggest women are invited for a repeat test at 12 months. Unlike Kaplan-Meier, both the adapted Turnbull (14) and logistic-Weibull models allow some disease found in follow-up to be undiagnosed prevalent disease, though both assume baseline disease ascertainment occurs at random (missing completely at random in Turnbull; missing at random, given covariates which can be adjusted for, in logistic-Weibull), which may not be true. When baseline disease ascertainment depends on baseline disease status, this is a form of informative censoring, which none of these methods are designed to handle; therefore the prevalent disease risks were overestimated by both Turnbull and logistic-Weibull for the larger sample sizes, as women at higher risk were preferentially sampled. Actual referral data is required to validate whether estimated prevalent disease risks are correct; additionally, if we are to refer immediately for uncomfortable, risky, and/or costly tests it is especially important for prevalent disease that the risk estimate is for diagnosable disease. If prevalent disease status is never verified then the estimated prevalent disease risk is purely model driven. Therefore prevalent disease risk is best estimated when prevalent disease status is verified for everyone, or at least a representative subsample of participants.
Although the logistic-Weibull model makes assumptions, it has advantages over the non-parametric Turnbull estimates. Logistic-Weibull accounts for covariates, and allows differences in risk between groups to be tested. The Turnbull estimates can be choppy and estimates may not be generated for all time points, a problem that occurs even in our dataset of 1 million women. Cumulative risk curves can be flat for a length of time before increasing implausibly suddenly.(13) In contrast, logistic-Weibull estimates are smooth and defined at all times. As screening guidelines are based on risk of disease at a given time point, Turnbull estimates may be less reliable if there are jumps in the estimated risk around the time of interest.
Parametric assumptions can shrink confidence intervals, particularly with smaller sample sizes, but we reiterate the importance of checking distributional assumptions and the parametric estimates against the Turnbull estimates. In Figure 6, the Weibull model provided a good estimate at some time points, but did not provide a good fit overall. In principle any parametric model can be fit, but the fit should be checked. We have developed two more flexible models; first the logistic-Cox model, which relaxes Weibull assumptions to require only proportional hazards for incident disease. The second is a weakly parametric model using integrated B splines, which is smooth and robust to different distributions, though has some loss of power as more parameters must be estimated. (20) However, these models are computationally intensive and thus we promote parametric models, such as the logistic-Weibull, for routine use as long as the fit is checked. Our package for fitting the logistic-Weibull, logistic-Cox and weakly-parametric models, as well as adapted Turnbull, in R, PIMixture, is available at https://dceg.cancer.gov/tools/analysis/PIMixture.
Although the interval definitions for when disease could have occurred are subjective, they can incorporate scientific knowledge about the disease and screening tests. Wider intervals in which disease could occur imply less efficiency and larger confidence intervals around risk estimates, but are appropriate if tests are known to lack sensitivity or specificity. For different choices of reasonable intervals in which disease could occur, we find similar results for each analysis method. Future work could consider how best to use information from individuals with positive screening results, but no confirmed diagnosis. These individuals are at higher risk of disease, but are currently right-censored identically to individuals censored following a negative screening test. Reasonable intervals in which disease could occur should reflect a scientific understanding of disease pathogenesis and test quality, to narrow down when disease may have occurred/could occur.
It is important to consider the difference between undetectable disease that exists, detectable disease, and clinically relevant disease. Detectable disease that is not clinically relevant, although of potential etiologic interest, is not usually ascertained in routine clinical practice. Finally, undetectable latent disease is conceptually important, but difficult to account for.
When estimating cumulative risks over time from electronic health record data, it is important to consider whether interval censoring and the possibility of undiagnosed prevalent disease are present. If so, appropriate statistical methods should be applied, such as logistic-Weibull. However the model fit must be checked against Turnbull non-parametric risk estimates.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

• Onset of screened disease is unobserved and known only to occur between screen visits
• Additionally, disease may be prevalent, but is not always diagnosed immediately

Figure 3. Risk estimates using naïve Kaplan-Meier, Kaplan-Meier using the midpoint, Turnbull and Weibull models, and the true risks, with event times simulated from a Weibull distribution: a) sample size 250, from one simulation; b) sample size 2,500, from one simulation; c) sample size 10,000, from one simulation; d) sample size 2,500, mean risk estimates from 1,000 simulations. Note that the Turnbull, Weibull and truth lines overlie one another.

Figure 4. 90% intervals of risk estimates using Turnbull and Weibull models, and the true risks, from 1,000 simulations, with event times simulated from a Weibull distribution with sample sizes 250 and 10,000.

Figure 5. Mean risk estimates using naïve Kaplan-Meier, Kaplan-Meier using the midpoint, Turnbull, adapted Turnbull, Weibull and logistic-Weibull models, and the true risks, from 1,000 simulations, sample size 10,000, with event times simulated from a Weibull distribution, and 3% prevalent disease. Note that the adapted Turnbull, logistic-Weibull and truth lines overlie one another.

Figure 6. Mean risk estimates using naïve Kaplan-Meier, Kaplan-Meier using the midpoint, Turnbull and Weibull models, and the true risks, from 1,000 simulations, sample size 10,000, with event times simulated from a gamma distribution. Note that the Turnbull and truth lines overlie one another.

Table 1. Mean prevalent disease estimates from 1,000 simulations for random and non-random baseline disease ascertainment, when event times were simulated from a Weibull distribution with 3% prevalent disease.

Table 2. Estimated risk of CIN2+ following an HPV-positive, cytology-negative screening result using Kaplan-Meier, Turnbull and logistic-Weibull, among 41,067 women aged 30-64, using three definitions of intervals.