Estimating disease burden using national linked electronic health records: a study using an English population-based cohort.

Background
Electronic health records (EHRs) have the potential to be used to produce detailed disease burden estimates. In this study we created disease estimates using national EHRs for three high burden conditions, compared estimates between linked and unlinked datasets and produced stratified estimates by age, sex, ethnicity, socio-economic deprivation and geographical region.

Methods
EHRs containing primary care (Clinical Practice Research Datalink, CPRD), secondary care (Hospital Episode Statistics, HES) and mortality records (Office for National Statistics, ONS) were used. We used existing disease phenotyping algorithms to identify cases of cancer (breast, lung, colorectal and prostate), type 1 and 2 diabetes, and low back pain. We calculated age-standardised incidence of first cancer, point prevalence for diabetes, and primary care consultation prevalence for low back pain.

Results
7.2 million people contributing 45.3 million person-years of active follow-up between 2000–2014 were included. CPRD-HES combined and CPRD-HES-ONS combined lung and bowel cancer incidence estimates by sex were similar to cancer registry estimates. Linked CPRD-HES estimates for combined type 1 and type 2 diabetes were consistently higher than those of CPRD alone, with the difference steadily increasing over time from 0.26% (2.99% for CPRD-HES vs. 2.73% for CPRD) in 2002 to 0.58% (6.17% vs. 5.59%) in 2013. Low back pain prevalence was highest in the most deprived quintile, and the difference in prevalence compared to the least deprived quintile increased over time between 2000 and 2013, with the largest difference of 27% (558.70 per 10,000 people vs 438.20) in 2013.

Conclusions
We used national EHRs to produce burden of disease estimates, including detailed estimates by deprivation, ethnicity and geographical region. National EHRs have the potential to improve disease burden estimates at a local and global level and may serve as more automated, timely and precise inputs for policy making and global burden of disease estimation.


Introduction
Over 98% of the population in England are registered with a general practice. Almost all general practices use Electronic Health Records (EHRs) during their individual consultations to record clinical diagnoses, symptoms, lab results, tests, referrals to other specialties and prescriptions 1. In this analysis we also consider data captured during visits to secondary care hospitals, national disease registries and death records as EHRs.
These high coverage data could be used to generate automated, timely and detailed burden of disease estimates for national and local policy makers. The algorithms used to create such estimates could be made openly accessible, allowing estimates to be widely, regularly and consistently produced. However, Global Burden of Disease studies in England found that the availability of accurate local data was better for mortality than for morbidity, a situation that is problematic for estimating burden for conditions with high levels of morbidity but low mortality such as low back and neck pain, skin and subcutaneous diseases, and depressive disorder 2. Primary care records linked to data from secondary care and mortality records may provide a comprehensive single source of consistently collected national data that can be used to give a detailed and consistent picture of disease burden over time for a diverse range of conditions. National EHRs in England have been used extensively to conduct research studies 3 and these data are available for long time periods (for example, Hospital Episode Statistics from 1997, and Clinical Practice Research Datalink from 1987 1). Whilst national primary care data coverage is high, there is a great deal of regional variation in the different clinical computer systems used 4. Additionally, the strengths and weaknesses of using linked national EHRs to produce stratified (e.g. by ethnicity, region or deprivation) disease burden estimates are uncertain.
In this study we aimed to address evidence gaps in estimating disease burden using national EHRs. To achieve this we had three objectives. First, to use a single source (primary care) and linked multi-source (primary care records linked to data from secondary care and mortality records) national EHRs to produce routine disease burden estimates for conditions with both high mortality (cancer) and high morbidity (diabetes and low back pain). Nearly everyone with these three conditions is registered with a GP and/or uses NHS acute services and, as a result, will be included in Hospital Episode Statistics even if some treatment is provided in the private sector. Second, to compare these single source national EHR estimates of disease burden to existing incidence and prevalence data from other sources, including disease registries and national cross-sectional health surveys. These comparisons enable us to triangulate our estimates with those currently used for public health and healthcare planning purposes. Our comparison estimates were selected based upon their strengths, including: high completion and coverage levels (e.g. National Cancer Registration and Analysis Service, Quality Outcomes Framework); high quality sampling strategies (e.g. Health Survey for England); and high quality of the primary care data used as a result of multiple iterations of the data gathering process and training with participating general practices (e.g. Consultations in Primary Care Archive, CiPCA). Finally, to examine the ability of single source national EHR data to produce disease burden estimates over time and stratified by age, sex, ethnicity, deprivation and region.

Study design
We conducted a national population-based retrospective cohort study using EHRs to estimate the disease burden of four cancers (lung, colorectal, breast, prostate), diabetes (type 1 and 2) and low back pain in England. Two cohorts were created: first, a cohort containing data only from primary care records and, second, a nested cohort containing linked data from primary care, secondary care and national death certification registrations.

Data resources
For the first cohort we used data from the Clinical Practice Research Datalink (CPRD GOLD), a large primary care database that contains anonymized electronic health records from more than 11 million people. We used records that were classified as acceptable for research, having met previously specified data quality standards and been shown to be broadly representative nationally 1,5. Data within CPRD can relate to a consultation with a healthcare professional within primary care or describe other activities relating to the individual's health or care, including repeat prescriptions, lab test results or diagnoses made outside primary care. These data are recorded in the EHR using the Read code system 6.

Amendments from Version 1
In our response to peer review, we have reiterated our choice of estimating incidence for cancer and prevalence for diabetes and low back pain, aiming to align our results with previously published estimates and ensure comparability. Specifically, our cancer data are compared with the national cancer registry, diabetes comparisons are made with the Health Survey for England, and back pain comparisons rely on consultation prevalence. This approach addresses the historical difficulty in determining the exact duration of back pain through GP and hospital records. We have removed the sentence "Analyses by strata were restricted to those with 5 incident/prevalent cases or more" as we had sufficient sample size that this was not an issue in our final analysis. Figure 3 and Figure 5 included 95% confidence intervals, but these were not labelled correctly; we have updated this in the manuscript. In Figure 2 and Figure 4 error bars represent 95% confidence intervals, which we have now stated in the figure titles. In Figure 4 we have clarified that the prevalence refers to low back pain. Finally, we have added a paragraph of further discussion describing global analyses that have examined disease surveillance efforts, and now include the five helpful references provided by the reviewer.

Through the CPRD linkage scheme, and where practices have provided consent, data were linked at person-level to inpatient data from Admitted Patient Care (APC) Hospital Episode Statistics (HES), a dataset containing details of all hospital admissions in England. Datasets were linked with a deterministic data linkage algorithm using date of birth, gender, postcode and NHS number (a unique ten digit numerical identifier assigned to NHS patients at birth or at first interaction with the healthcare system). Linkage was done by the Medicines and Healthcare products Regulatory Agency (MHRA) and NHS England, with 88% of patients with research quality records having a valid NHS number and being eligible for linkage in the CPRD standard linked dataset release 7.
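As an illustration only, deterministic linkage on the four identifiers named above can be sketched as an exact match on a normalised key. This is a simplified assumption and not the algorithm used for the CPRD linked release, which is not specified here; the field names (`nhs_number`, `date_of_birth`, `gender`, `postcode`) and both functions are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def link_key(record: Record) -> Tuple[str, str, str, str]:
    """Build a normalised key from the four identifiers (hypothetical field names)."""
    return (
        record["nhs_number"].replace(" ", ""),
        record["date_of_birth"],
        record["gender"].strip().upper(),
        record["postcode"].replace(" ", "").upper(),
    )


def link_records(primary: List[Record], hospital: List[Record]) -> List[Tuple[Record, Record]]:
    """Pair each primary care record with hospital records sharing an identical key."""
    index: Dict[Tuple[str, str, str, str], List[Record]] = {}
    for h in hospital:
        index.setdefault(link_key(h), []).append(h)
    return [(p, h) for p in primary for h in index.get(link_key(p), [])]


# Toy records: formatting differs, but the identifiers agree after normalisation.
primary = [{"nhs_number": "943 476 5919", "date_of_birth": "1950-01-01",
            "gender": "f", "postcode": "ab1 2cd"}]
hospital = [
    {"nhs_number": "9434765919", "date_of_birth": "1950-01-01",
     "gender": "F", "postcode": "AB1 2CD"},
    {"nhs_number": "9434765870", "date_of_birth": "1950-01-01",
     "gender": "F", "postcode": "AB1 2CD"},
]
pairs = link_records(primary, hospital)
```

A deterministic approach of this kind only links records whose identifiers agree exactly after normalisation, which is why a valid NHS number matters for eligibility.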
Data used to estimate cancer disease burden were also linked to mortality data from the Office for National Statistics (ONS) death certification register. Clinical diagnoses and procedures recorded during hospital admissions were recorded using the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10) clinical classification system 8. We used the Index of Multiple Deprivation (IMD), linked by the MHRA, to explore how disease burden estimates varied by relative deprivation in England 9.

Inclusion/exclusion criteria
Our study population included male and female people registered at English general practices. The study period we investigated was from 1st January 2000 to 31st December 2013. People contributed active follow-up from the latest of: the date the primary care practice started to provide 'Up-to-Standard' data 1, the current registration date of the person, or the start of the study period. People stopped contributing active follow-up from the earliest of: the last date CPRD collected data from the practice, the date the individual transferred out of the practice, date of death (recorded either in CPRD or ONS), or the end of the study. For prevalence estimates, the earliest date of a diagnostic Read code (in clinical, test or referral records) or ICD code found within the active follow-up period was used as the date of diagnosis.
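The start and end rules for active follow-up can be expressed compactly. The following is a minimal sketch under stated assumptions, not the study's own code (the analyses were run in Stata); the function name and argument names are hypothetical.

```python
from datetime import date
from typing import Optional, Tuple

STUDY_START = date(2000, 1, 1)
STUDY_END = date(2013, 12, 31)


def active_follow_up(
    up_to_standard_date: date,
    registration_date: date,
    last_collection_date: date,
    transfer_out_date: Optional[date] = None,
    death_date: Optional[date] = None,
) -> Optional[Tuple[date, date]]:
    """Return (start, end) of active follow-up, or None if the window is empty."""
    # Start: the latest of practice up-to-standard date, registration, study start.
    start = max(up_to_standard_date, registration_date, STUDY_START)
    # End: the earliest of last collection, transfer out, death, study end.
    candidates = [last_collection_date, STUDY_END]
    if transfer_out_date is not None:
        candidates.append(transfer_out_date)
    if death_date is not None:
        candidates.append(death_date)
    end = min(candidates)
    return (start, end) if start <= end else None


window = active_follow_up(
    up_to_standard_date=date(1998, 5, 1),
    registration_date=date(2003, 3, 15),
    last_collection_date=date(2015, 1, 1),
    transfer_out_date=date(2010, 6, 30),
)
```

In this toy case follow-up starts at registration (the latest start candidate) and ends at transfer out (the earliest end candidate).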

Case definitions for national EHRs.
We expanded existing rule-based disease phenotyping algorithms to identify cases of cancer (breast, lung, colorectal and prostate), type 1 and 2 diabetes, and low back pain. Cancer diagnoses were determined using validated Read and ICD-10 terms as outlined by Bhaskaran et al. 10. Read and ICD-10 terms used to identify type 1 and 2 diabetes cases were based on work described by Eastwood et al. 11. Our low back pain estimates were derived using the same method and an expanded code list from Jordan et al. 2014 12, with the addition of exacerbation of backache (16C8.00) and lumbalgia (N142.12) codes.

Case definitions for comparison EHRs.
Cancer incidence estimates were compared to those from the cancer registration statistics produced by the Office for National Statistics using the same method of standardisation (e.g. European Standard Population 2013 in 5-year age bands) to enable direct comparison 13. Table 1 summarises the sources of data used, case definitions, coverage and phenotypic depth of these cancer registry data. We used the following groups of ICD-10 codes for our comparison estimates from the registry: C34 for lung, C18-C20 for bowel, C50 for breast and C61 for prostate cancer. Our diabetes prevalence estimates were compared to estimates from the Health Survey for England (HSE) and the Quality Outcomes Framework (QOF). We applied HSE methods for estimating the annual diabetes prevalence. In survey year 2010 and for previous years of the survey, type 2 diabetes was defined as a self-reported diagnosis at age 35 or older and not treated with insulin, and type 1 as diagnosed before age 35 and treated with insulin. For later surveys, 2011 onwards, time of diagnosis was not used as a proxy measure to classify people by diabetes type. For QOF we used annual point prevalence (percentage) for type 1 and 2 diabetes. Our low back pain estimates were compared to HSE and the Consultations in Primary Care Archive (CiPCA). The final algorithms used for all conditions are provided in the Extended Data File 2 14.
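Direct age standardisation, the method referred to above for cancer incidence, weights each age-specific rate by the share of a standard population in that age band. The sketch below is illustrative only: the three age bands and their weights are made-up values chosen for easy arithmetic, not the European Standard Population 2013 weights, and the function name is hypothetical.

```python
from typing import Sequence


def directly_standardised_rate(
    cases: Sequence[float],
    person_years: Sequence[float],
    standard_weights: Sequence[float],
    per: float = 100_000,
) -> float:
    """Weighted average of age-specific rates, expressed per `per` person-years."""
    if not (len(cases) == len(person_years) == len(standard_weights)):
        raise ValueError("all inputs must have one entry per age band")
    total_weight = sum(standard_weights)
    weighted_sum = sum(
        w * (c / py) for c, py, w in zip(cases, person_years, standard_weights)
    )
    return per * weighted_sum / total_weight


# Hypothetical data for three age bands (young, middle, old).
cases = [10, 50, 200]
person_years = [100_000, 100_000, 100_000]
toy_weights = [50, 30, 20]  # illustrative weights, NOT the ESP 2013 values

asr = directly_standardised_rate(cases, person_years, toy_weights)
```

Here the age-specific rates are 10, 50 and 200 per 100,000 person-years, so the standardised rate is (50×10 + 30×50 + 20×200) / 100 = 60 per 100,000 person-years. Using the same standard weights for the EHR and registry series is what makes the two directly comparable.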

Statistical analysis
We estimated the annual incidence of the four cancers studied, point-prevalence for diabetes and consultation period-prevalence for back pain for the period 2000-2013.
For fair comparison with existing data sources and estimates, where possible we used methodology consistent with these existing data. Annual incidence and prevalence were estimated for all included people. Population estimates were described for both cohorts and compared with the estimated population of England in 2013 (Table 2). Analyses were stratified by age, sex, deprivation, ethnicity and region. We analysed data sub-regionally by using the nine statistical regions in England as defined by ONS. Quintiles of Index of Multiple Deprivation (IMD), at the person-level, were used as a measure of deprivation. Ethnicity was determined using methods previously described by Mathur et al. 15,16 with further details provided in Extended Data File 2 14. An available case analysis approach was used for all analyses. Analyses were performed using Stata version 14 SE and Stata version 15 SE. RWA, HE, AM, KB, RM, LS had access to the database population used for the study.

Cancer estimates: first incidence.
Incidence was calculated by dividing the number of incident cases by the total active follow-up time (in 100,000 person-years) for lung, bowel, breast and prostate cancer. Follow-up time ended at the time of the first cancer. Previous work has found evidence to suggest misclassification of prevalent cases, including cancer, as incident cases recorded in the first year after registration 17. For example, a healthcare professional registering a new person with their practice may enter the date of the primary care registration rather than the retrospective date of diagnosis, which leads to overestimation of incidence rates because the event occurred before registration and the corresponding years at risk were not included. In order to avoid this potential bias, active follow-up was restricted to exclude the first year after registration for all incidence estimates 18. People identified with cancer before the active follow up period

Point-prevalence (estimates for diabetes).
Annual point prevalence of diabetes among those aged 17 years or older was calculated at mid-year (defined as 2nd July). The analysis was restricted to this age group to be consistent with QOF estimates. For a person to be included in the denominator for a given year they needed to be 17 years or older and be contributing active follow-up at mid-year. For a person to be included in the numerator they needed to be contributing to the active follow-up and have at least one diagnosis Read code recorded before or at the mid-year point for that year 11. We calculated annual point prevalence by year of age for diabetes type 1 and 2 alone and for both types combined, the latter of which enabled comparisons with HSE and QOF estimates. When calculating prevalence estimates using linked CPRD-HES, we used CPRD dates and, if diabetes was not recorded in CPRD but was recorded in HES, the date of hospital admission recorded in HES was used.
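The numerator and denominator rules for mid-year point prevalence can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's Stata code; the `Person` fields, function names and toy cohort are hypothetical, and `first_diabetes_code` stands for the earliest qualifying diagnosis date from CPRD (or HES for the linked cohort).

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass
class Person:
    date_of_birth: date
    follow_up_start: date
    follow_up_end: date
    first_diabetes_code: Optional[date] = None  # earliest diagnosis date, if any


def age_on(dob: date, on: date) -> int:
    """Age in completed years on a given date."""
    return on.year - dob.year - ((on.month, on.day) < (dob.month, dob.day))


def point_prevalence(people: Iterable[Person], year: int, min_age: int = 17) -> float:
    """Percentage of eligible people with a diagnosis on or before 2nd July of `year`."""
    mid_year = date(year, 7, 2)
    numerator = denominator = 0
    for p in people:
        in_follow_up = p.follow_up_start <= mid_year <= p.follow_up_end
        if not in_follow_up or age_on(p.date_of_birth, mid_year) < min_age:
            continue
        denominator += 1
        if p.first_diabetes_code is not None and p.first_diabetes_code <= mid_year:
            numerator += 1
    return 100.0 * numerator / denominator if denominator else float("nan")


# Toy cohort: only the first three people below are eligible at mid-2010.
cohort = [
    Person(date(1950, 1, 1), date(2000, 1, 1), date(2013, 12, 31), date(2005, 3, 1)),
    Person(date(1960, 1, 1), date(2000, 1, 1), date(2013, 12, 31)),
    Person(date(1955, 1, 1), date(2000, 1, 1), date(2013, 12, 31), date(2011, 1, 1)),
    Person(date(2000, 1, 1), date(2000, 1, 1), date(2013, 12, 31)),  # under 17
    Person(date(1970, 1, 1), date(2000, 1, 1), date(2009, 12, 31)),  # left before 2010
]
prevalence_2010 = point_prevalence(cohort, 2010)
prevalence_2012 = point_prevalence(cohort, 2012)
```

In the toy cohort, one of three eligible people has a diagnosis by mid-2010 and two of three by mid-2012, showing how a diagnosis counts in every year from the one in which it is first recorded.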

Period-prevalence (estimates for low back pain).
Period prevalence, defined as a 1-year consultation prevalence, was estimated for low back pain. The denominator was the total number of active people at mid-year (defined as 2nd July) for each year of the study. Among these people, where one or more consultations for back pain were recorded within an individual's active follow-up period, the person was counted once in the numerator. Consistent with the methodology applied to incidence, active follow-up was restricted to follow-up after one year from registration with the practice. People that were in their first year since primary care practice registration at mid-year were therefore excluded from the numerator and denominator for that year. Primary care records from CPRD were defined as consultation records if they were classified as either a face-to-face or telephone consultation (Extended Data File 2 14) 12,19. Primary care consultations were restricted to those that had a low back pain Read code in either their clinical, test or referral records. These additional codes did not exist at local level for this previous study. Secondary care admissions were restricted to those with an ICD-10 code 12 recorded with any diagnosis (code list given in Extended Data File 2 14).

of the English population. The median follow-up time for CPRD active people was 9.0 years (IQR 3.5-13.4). Of these active individuals, 2.4 million were linked to HES and, of these linked people, 99.9% had Index of Multiple Deprivation (IMD) data recorded and 79.6% had ethnicity recorded.

In the following sections we describe our results for each condition in the following order. First, we present a description of our linked and unlinked estimates of disease burden for each disease over time. Second, we compare our new estimates of disease burden to existing incidence and prevalence data. Third, we compare stratified estimates (e.g. by age, sex, region and IMD deprivation group) from our results with comparator estimates.
In 2006 and 2007, QOF estimates were considerably lower than CPRD and CPRD-HES estimates (the largest difference in 2007 of 0.62% and 0.92%, respectively), but from 2008 to the end of the study period QOF estimates were similar to CPRD-HES estimates. We also compared our CPRD and CPRD-HES estimates to those from the Health Survey for England (HSE). HSE showed an increase in type 2 prevalence but did not show a similar increase in type 1 (Extended Data File 2 14). HSE weighted estimates for type 1 and type 2 diabetes were very similar over time to our CPRD-HES estimates.
Between 2000 and 2013, prevalence of type 2 diabetes increased over time for each age group (Figure 3d). Prevalence was similar between the 65-74 and 75+ age groups up to 2007 (0.5% greater among 75+ in 2007: 12.5% vs 12.0%). After 2007 there was a reduction in the relative increase of prevalence over time and by 2013 the difference between these age groups was greatest, at 3.5% greater in the 75+ age group (16.9% vs 13.4%). Prevalence of type 2 was highest in the most deprived quintile for IMD and increased more in this group compared to the least deprived, widening the difference between these quintiles over time. Prevalence in the least deprived group increased from 1.5% (95%CI 1.2-1.8%) in 2000 to 4.8% (95%CI 4.3-4.8%) in 2013 and from 2.5% (95%CI 2.1-2.8%) to 6.9% (95%CI 6.5-7.2%) in the  Black with a 3.7%, 3.5% and 3.1% increase respectively between 2000 and 2013. Geographically, prevalence was generally higher in the North of England and the Midlands compared to the South and East of England. Prevalence estimates of type 2 diabetes using CPRD and CPRD-HES showed similar patterns by region in 2013 (Figure 3f).
Prevalence of low back pain increased with age using national CPRD and CPRD regional data for the West Midlands (Figure 5). Our prevalence estimates of low back pain for the CPRD region of West Midlands were generally consistent with those from CiPCA, with overlapping 95% confidence intervals. Low back pain estimates increased up to age group 45-64 for males and females for our national EHR (CPRD-HES) estimates and CiPCA.
Prevalence estimated using linked CPRD-HES data was 38% greater among women than men in 2013 (prevalence ratio 1.38, 95%CI 1.37-1.40), with a prevalence difference of 153.6 per 10,000 people (534.3 vs 380.8), a trend that was consistent over time. Between 2000 and 2013 there was an overall increase in prevalence among those with White and 'other' ethnicity, while there was a decline among those with South Asian, Black and mixed ethnicity, with Black, South Asian and White ethnicities becoming similar in 2013 (Figure 4c). The largest difference between any two ethnic groups was 252.5 per 10,000 people, or 67% (PR 1.67 (95%CI  4e).

Discussion
Our study uses national EHRs for routine surveillance of disease burden. We used data from more than seven million people to estimate disease burden for three diverse conditions that have either high levels of morbidity, mortality or both.
Our results address important evidence gaps in estimating disease burden using a single source of national electronic health records and meet our three study objectives.
First, we have used national EHRs to produce estimates for conditions with a high mortality such as cancer and high morbidity such as diabetes and low back pain. We undertook this work building upon existing validated disease phenotypes and applied these to national EHRs. In future it will be possible to reapply our code relatively quickly to update estimates of these three sets of conditions, and also to expand the number of phenotypes to provide more comprehensive estimates for a wider range of conditions. With the move in primary care to SNOMED 20 coding lists (a structured clinical vocabulary for use in an electronic health record), Read code morbidity lists will need to be converted to SNOMED.
Second, for cancer and diabetes, we have compared our single source national EHR estimates of disease burden to current national incidence and prevalence data that have high completion and coverage (e.g. National Cancer Registration and Analysis Service, Quality Outcomes Framework) and shown that these estimates are broadly comparable. Our prostate and breast cancer estimates were higher than cancer registry estimates, which may reflect the true burden of disease as a result of under-reporting of these common conditions in the cancer register, or they may be false over-estimates. Falsely inflated results may occur due to misclassification bias within primary care records. For example, positive prostate specific antigen results may initially be incorrectly classified as prostate cancer in primary care records, but actually be due to benign prostate enlargement. This misclassification is less likely to occur within the cancer registry data. Diagnoses in primary care can be recorded using non-specific Read codes, e.g. codes that specify back pain rather than low or upper back pain, or diabetes rather than type 1 or type 2 diabetes. In order to produce more specific and meaningful morbidity prevalence estimates, we encourage GPs to record diagnoses completely and accurately. Our CPRD-HES estimates for type 1 and type 2 diabetes were very similar over time to those from the Health Survey for England, which is produced using self-reported data from a stratified random sample of the population. Our prevalence estimates for low back pain were generally consistent with estimates from CiPCA, which includes high quality primary care data as a result of multiple iterations of the data gathering process and training with participating general practices. Our results are also consistent with more recently published data 21.
Finally, we have demonstrated that national EHR data from across healthcare settings can produce disease burden estimates over time and stratified by age, sex, ethnicity, deprivation and region. Our results provide national estimates of low back pain and, in addition, provide important data on subgroups. Our data indicate that South Asian and Black ethnicities had the highest prevalence estimates for low back pain and that the burden was highest in the lower socio-economic groups.
A weakness of the data used is that they are derived from one primary care computer system, which is distributed unevenly across the UK with low representation in the north-east and midlands. Future work should develop a framework (a governance framework and a harmonised data model) to enable these studies to be run on general practice databases from all major systems. We did not cluster by general practice in the construction of the confidence intervals; doing so would have resulted in wider confidence intervals than those currently presented. There were high levels of missing data for ethnicity, and White populations were overrepresented and South Asian and Black populations underrepresented in our data.
Our results showed variation by ethnicity and as a result our national and age group estimates, which did not control for ethnicity, may be biased towards levels found in White ethnic groups. Missing data appeared not to have caused substantial bias in the comparison of EHR-based estimates versus those from national registries or surveys for the outcomes investigated in this work. However, missing data could be an issue with other EHRs, even in high population coverage settings, when certain variables are not missing completely at random, which should be considered when exploring the potential automated derivation of disease burden estimates from EHRs. Secondary, tertiary, or further illnesses or health conditions in patients with high levels of co-morbidities may not be comprehensively recorded in busy general practice consultations, especially for elderly patients or patients with communication difficulties. Prevalence or incidence of these illnesses or conditions, and for certain subgroups of patients, may be underestimated in the EHR as a result.
Our proposed methods have several strengths over existing disease burden data sources. Compared to cross-sectional surveys, our methods do not rely on self-reported outcomes, are clinician coded and, once validated, could be produced at lower annual costs on a routine basis. The data are national in scale and are detailed enough to produce small local area estimates without the need for substantial statistical modelling.
A recently conducted systematic review examined the challenges and solutions of using electronic health record-based disease surveillance systems 22. The review found that many challenges exist, including the amount of time and investment required to set up such systems and the privacy and security challenges of such approaches. However, it also identified potential opportunities, including the use of machine learning algorithms for unstructured data and appropriate technical solutions for data processing. Other studies have compared the results of electronic health record disease surveillance systems to survey results and found that they can produce robust estimates of smoking and obesity prevalence in the general population 23. Other UK and international studies have examined the use of electronic disease surveillance for specific diseases including musculoskeletal conditions 24, chronic disease 25 and cardiovascular disease 26 and described their use in these specific settings.
In the UK, a growing number of studies are using primary care data linked to HES data. It is encouraging that the data are broadly representative nationally and that estimates from this study provide temporal trends comparable with other work to date.
Our data support a growing body of evidence from other studies for the validity of national EHRs for estimating disease burden from a diverse range of conditions including cardiovascular disease 27,28, bronchiectasis 29, obesity, and learning disabilities 30 and a chronological map of 308 physical and mental health conditions 31. This growing body of evidence suggests that this approach is generalisable to a wide range of conditions and we envisage that health data providers such as CPRD and NHS England will increasingly be able to provide data within shorter periods of time for pre-approved studies. Expanding this approach could enable continuous collection of registry data for some conditions, instead of using intermittent validation surveys, or augmenting national EHR where needed with individual or clinician reported information. Linked EHRs could be used to produce a picture of the individual journey through primary care and secondary care and these results would allow opportunities for primary, secondary and tertiary prevention to be identified and acted upon. Population level estimates of multiple morbidity, for which we currently lack data in England, could also be produced using these data. In countries with health systems with high population coverage and complete linked electronic health records, automated, timely and precise disease burden estimates could potentially be available based on use of existing data.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Version 1
Reviewer

Joyce C Ho
Emory University, Atlanta, USA

The article assesses the use of electronic health records (EHRs) to produce disease burden estimates. Using existing disease phenotypes for 4 types of cancer, type 1 and type 2 diabetes, and lower back pain, the EHR-based estimates are compared to disease registries and national cross-sectional health surveys. The article also examines the disease burdens stratified by age, sex, ethnicity, deprivation, and region. The results suggest that the estimates are consistent with existing collection mechanisms and can offer more timely and precise inputs for estimating disease burden.
The article overall is well-written and the statistical analysis and methods are appropriate, especially the use of existing, validated disease phenotypes to construct the disease estimates from EHR. Unfortunately, access to the primary care EHR data is not available due to privacy concerns, so the study is not fully reproducible, but it is sufficiently described that the process could be replicated.
One limitation of the article is the lack of discussion compared to other EHR-based disease surveillance efforts worldwide. Two notable omissions include a recent survey on EHR-based surveys and estimating musculoskeletal conditions (the latter one seemingly uses the same dataset). There are also 3 works linked below that, while not necessary, would be interesting to compare and contrast with the article's findings. Since these citations focus on the United States, not all findings will be relevant or repeatable given differences in EHR data collection, but discussions of EHR data quality and other insights might be portable and help to understand how generalisable the approach is to other countries.

Thaddäus Tönnies
Leibniz Center for Diabetes Research at the Heinrich Heine University, Düsseldorf, Germany
Aldridge et al. estimated the burden of several diseases in terms of incidence and prevalence using electronic health records linked to additional data sources. The long-term goal is to provide a data source of consistently collected national data that can be used to give a consistent, automated, and timely picture of disease burden over time for a diverse range of conditions. Overall, the paper is clearly written and presents interesting results which are relevant for decision makers and the scientific community working with EHR. I only have some minor comments. According to the authors, one advantage of EHR is the potential to provide timely results. However, the data in this study are rather dated (2000 to 2013). I think this is totally fine for this study, as it appears more concerned with feasibility than with substantive results. Nevertheless, it would be interesting to have some information on how timely results might be available in future, once the procedures in this study are routinely implemented.

1. Is there a rationale for estimating incidence for cancer, but prevalence for diabetes and low back pain? Results for the incidence of diabetes and low back pain would also be very relevant. In fact, it would be a strength of this study if incidence of diabetes could also be estimated, since data on prevalence are more often available.

2. During linkage, the sample size substantially decreased from 11 million to 2.4 million. Please specify what led to this reduction, whether this could introduce selection bias and if representativeness can still be assumed. Also discuss whether the difference between linked and unlinked estimates of disease burden could be due to this selection rather than improved detection of cases.

3. The authors stratified the data to obtain stratum-specific estimates. A probably more efficient approach could be to use regression models with the stratifying variables as (continuous) predictors. This might also overcome problems with small sample sizes in some strata (only strata with at least 5 cases were analyzed).

4. The results were compared to published estimates. Please mention if one of the comparators can be considered the "gold standard" or most valid source.

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Epidemiology
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
substantive results. Nevertheless, it would be interesting to have some information on how timely results might be available in future, once the procedures in this study are routinely implemented." Author response: The reviewer correctly identifies that this is a feasibility study, so the age of the data does not affect our results. We agree that it would be useful to have more timely results in future. It is possible that the data could be provided every month, quarter or year by the data provider (CPRD) if permissions were sought and approved, which would provide a useful close to real-time monitoring mechanism.
Review comment: "Is there a rationale for estimating incidence for cancer, but prevalence for diabetes and low back pain? Results for the incidence of diabetes and low back pain would also be very relevant. In fact, it would be a strength of this study if incidence of diabetes could also be estimated, since data on prevalence are more often available." Author response: We estimated incidence for cancer, but prevalence for diabetes and low back pain, to make our results comparable with previously published estimates. In this paper, for cancer we compare with the national cancer registry. For diabetes, we compare with the National Health Survey for England. For back pain, we use consultation prevalence (the percentage of patients who have had at least one consultation about back pain in a year), similar to Jordan (reference 12). This is because historically there is a lag between having back pain and seeing the GP, and it is therefore difficult to know both the start and end date of back pain using GP and hospital records. We agree with the reviewer that incidence for diabetes could be calculated using these data, along with prevalence of cancer. We agree that these estimates would be useful in future studies.
Review comment: "During linkage, the sample size substantially decreased from 11 million to 2.4 million. Please specify what led to this reduction, whether this could introduce selection bias and if representativeness can still be assumed. Also discuss whether the difference between linked and unlinked estimates of disease burden could be due to this selection rather than improved detection of cases." Author response: Table 2 describes the demographic distribution of patients registered with practices contributing active follow-up to CPRD and the subset with linked data. Table 2 shows that the subset of participants with linked data has a broadly similar demographic distribution to those without linked data. The differences in our estimates of disease burden between linked and unlinked data are more likely driven by differences in health care utilization and clinical practice for the conditions in our study (e.g. cancer is much more commonly managed in hospital than in primary care, and therefore cases are more likely to appear in hospital records) rather than any smaller selection bias introduced through linkage.
Review comment: "The authors stratified the data to obtain stratum-specific estimates. A probably more efficient approach could be to use regression models with the stratifying variables as (continuous) predictors. This might also overcome problems with small sample sizes in some strata (only strata with at least 5 cases were analyzed)."
Author response: We have removed the sentence "Analyses by strata were restricted to those with 5 incident/prevalent cases or more" as we had sufficient sample size that this was not an issue in our final analysis.As a result, we also feel that our approach is appropriate.Future analyses could consider using regression models with the stratifying variables as (continuous) predictors, but unfortunately, we would not be able to do this for our current analysis as we are no longer able to access underlying data.
Review comment: "The results were compared to published estimates. Please mention if one of the comparators can be considered the "gold standard" or most valid source." Author response: We deliberately avoided the term "gold standard" in our manuscript, as we believe no such "gold standard" data source exists for this setting and study. Instead, our estimates were compared to the highest quality and most appropriate existing data.
were 7.2 million acceptable people contributing 45.3 million person-years of active follow-up between 2000-2013 (Table 2; CPRD Active 2000-2013). Our nested primary care cohort (CPRD-HES Active 2000-2013) linked data to secondary care and death registry data and included 5.6 million people contributing 35.6 million person-years of follow-up between 2000-2013. At mid-2013 there were approximately 3.0 million active CPRD individuals. This equates to 5.6% (3,015,525 / 53,865,817)

Figure 1. Directly age-standardized estimates of cancer for males and females (2000-2013), comparing estimates created using the following: 1) primary care records; 2) linked primary care and hospitalisation records; 3) linked primary care, hospitalisation and death records; and 4) national bespoke cancer registry records.

Figure 2. Incidence of cancer for males and females by age category, comparing national bespoke cancer registry estimates (2013) to linked primary care, hospitalisation and death records. Error bars represent 95% confidence intervals.

Figure 3. Annual point-prevalence for diabetes comparing national primary care quality outcome framework data to linked CPRD-HES for (a) Type 1 and Type 2 combined; (b) Type 1 and Type 2 combined and separate; (c) Type 2 by gender; (d) Type 2 by age group; (e) Type 2 by ethnicity (error bars represent 95% confidence intervals); and (f) Type 2 by region comparing unlinked CPRD and linked CPRD-HES (error bars represent 95% confidence intervals).

Figure 4. Prevalence estimates (per 10,000 people) for low back pain (LBP) by (a) source, (b) gender, (c) ethnicity, (d) age group, (e) region and (f) Index of Multiple Deprivation (IMD). Primary care records were used for all panels except (a) and (e), where unlinked primary care and linked primary care and hospitalisation records were used. Error bars represent 95% confidence intervals.

Figure 5. 1-year consultation prevalence for low back pain nationally, in the West Midlands and in the Consultations in Primary Care Archive for North Staffordshire (CiPCA; includes regional data from North Staffordshire general practices) in 2010, using primary care records, with secondary care data in CiPCA recorded by primary care from hospital correspondence. Error bars represent 95% confidence intervals. Note: CiPCA estimates are from Jordan et al. 2014 12.
Report 19 October 2023 https://doi.org/10.21956/wellcomeopenres.21567.r67551 © 2023 Ho J. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Figure 4: Please specify in the legend to which disease the prevalence refers.
Review comment: "Figure 1, 3, 5: I suggest to add confidence intervals to the lines." Author response: Figures 3 and 5 included 95% confidence intervals where possible, but these were not labelled correctly. Thank you for noting this omission; we have updated the figure title to accurately reflect this point.
Review comment: "Figure 2, 4: Please specify in the legend, what the error bars represent." Author response: Thank you for noting this omission. Error bars represent 95% confidence intervals, which we have now stated in the figure title.
Review comment: "Figure 4: Please specify in the legend, to which disease the prevalence refers."

Table 1. Existing English and international data sources for estimating population disease burden, their scale, strengths and weaknesses.
Data source | Description and coverage | Phenotypic depth | Case definition used in study | Scale | Strengths* | Weaknesses*
*Strengths and weaknesses specifically in relation to disease burden estimation.

Table notes:
Acceptable patients registered with English CPRD up-to-standard practices among males and females in our study period (2000-2013).
*If active patients at mid-year are in their first year since current registration, these patients are excluded. Note: these are the patients that contribute to 1-year consultation prevalence and incidence estimates.
There are no publicly available population estimates by ethnicity for mid-2013, so 2011 was used.
Data sources for national estimates:
Age and Gender (mid-2013): https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/lifeexpectancies/adhocs/005676englishpopulationestimatesanddeathsbysexandsingleyearofage1993to2013
Region (mid-2013): via Population Estimates, Analysis Tool mid-2013: http://webarchive.nationalarchives.gov.uk/20160106144703/http://www.ons.gov.uk/ons/rel/pop-estimate/population-estimates-for-uk-england-and-wales--scotland-and-northern-ireland/2013/index.html
Ethnicity (mid-2011 [census year]): from the National Archives through Nomis: https://www.nomisweb.co.uk/, then go to Data Downloads/Query Data/Census 2011/Key Statistics
All last accessed on 20th February 2018.

began were excluded from our analyses. We used direct age-standardization, standardizing to the European Standard Population 2013 in 5-year age-bands.
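As an illustration of direct age-standardization, the following minimal Python sketch computes a weighted average of age-specific rates. It is not the study's code: the function name, the three age bands, the case counts, the person-years and the standard-population weights are all hypothetical placeholders, and a real analysis would use the full set of 5-year bands and the published European Standard Population 2013 weights.

```python
from typing import Sequence


def directly_standardised_rate(
    cases: Sequence[float],
    person_years: Sequence[float],
    standard_pop: Sequence[float],
    per: float = 100_000,
) -> float:
    """Weighted average of age-specific rates, weighted by a standard population."""
    if not (len(cases) == len(person_years) == len(standard_pop)):
        raise ValueError("all inputs must have one entry per age band")
    total_weight = sum(standard_pop)
    # Age-specific rate in each band is cases / person-years; each rate is
    # weighted by that band's share of the standard population.
    weighted_sum = sum(
        w * (c / py) for c, py, w in zip(cases, person_years, standard_pop)
    )
    return per * weighted_sum / total_weight


# Hypothetical inputs for three illustrative age bands (not study data and
# not the European Standard Population 2013 weights).
example_cases = [10, 20, 30]
example_person_years = [1_000, 1_000, 1_000]
example_weights = [50, 30, 20]

example_rate = directly_standardised_rate(
    example_cases, example_person_years, example_weights
)
print(f"Age-standardised rate per 100,000 person-years: {example_rate:.1f}")
```

Because the weights come from a fixed standard population rather than from the study cohort, rates computed this way can be compared across years or data sources with different age structures.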

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? No
Are the conclusions drawn adequately supported by the results? Yes

: 4832-4843
3. Kraus EM, Brand B, Hohman KH, Baker EL: New Directions in Public Health Surveillance: Using Electronic Health Records to Monitor Chronic Disease. J Public Health Manag Pract. 28 (2): 203-206
4. Williams BA, Voyce S, Sidney S, Roger VL, et al.: Establishing a National Cardiovascular Disease Surveillance System in the United States Using Electronic Health Record Data: Key Strengths and Limitations. J Am Heart Assoc. 2022; 11 (8): e024409
5. McVeigh KH, Newton-Dame R, Chan PY, Thorpe LE, et al.: Can Electronic Health Records Be Used for Population Health Surveillance? Validating Population Health Metrics Against Established Survey Data. EGEMS (Wash DC). 2016; 4 (1): 1267

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Machine learning, Electronic health records
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

differences in EHR data collection, but discussions of EHR data quality and other insights might be portable to better understand how generalizable it is to other countries." Author response: In response to this helpful comment, we have added a paragraph of further discussion on this topic describing global analyses that have examined disease surveillance efforts, including the five helpful references provided by the reviewer.
https://doi.org/10.21956/wellcomeopenres.21567.r62082 © 2023 Tönnies T. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.