An algorithm‐based approach to ascertain patients with rare diseases in electronic health records using hypereosinophilic syndrome as an example

Improved hypereosinophilic syndrome (HES) ascertainment in electronic health record (EHR) databases may improve disease understanding and management. An algorithm to ascertain and characterize this rare condition was therefore developed and validated.

• To improve the ascertainment of patients with HES, an algorithm was developed based on unique HES features identified from healthcare records (UK database) of patients with versus without HES (age, sex, and index date matched).
• The strongest predictors of HES versus non-HES cases (odds>1000 times greater) were an ICD-10 code for white blood cell disorders plus BEC ≥1500 cells/μL in the 24 months pre-index.
• The novel algorithm was sufficiently sensitive to ascertain HES versus non-HES cases; this methodology could be applied to other rare diseases.

Plain Language Summary
Patients with hypereosinophilic syndrome (HES) have high numbers of eosinophils, a type of white blood cell found in blood and body tissues. Eosinophil accumulation can lead to organ damage and failure. HES can be difficult to diagnose because symptoms can overlap with those of other conditions. Improved HES ascertainment could improve understanding and treatment of this condition.
An algorithm was developed to help ascertain patients with HES from databases containing patient healthcare records. Using a UK database, 88 patients with HES and 2552 similar patients without HES were ascertained. Patients with and without HES were compared based on the presence of other diseases, prescribed treatments and laboratory results. Any features that differed between the cohorts were included in 270 different algorithm models. The best performing model, which most accurately predicted patients with and without HES, was selected. The final model was able to ascertain 61/88 true HES cases (69% accuracy). This study found that patients with HES have unique features compared with patients without HES. Including these unique HES features in an algorithm may help to ascertain patients with HES, even without a confirmed HES diagnosis. It may be useful in ascertaining other rare diseases from healthcare record databases.

| INTRODUCTION
Hypereosinophilic syndrome (HES) is a group of rare hematologic disorders characterized by elevated blood and tissue eosinophil counts with organ damage and dysfunction, without evidence of secondary causes of hypereosinophilia. 1,2,4,5 The International Cooperative Working Group on Eosinophil Disorders (ICOG-EO) uses the following criteria to identify HES: blood eosinophil count >1500 cells/μL on two examinations ≥1 month apart and/or tissue eosinophilia; organ damage and/or dysfunction attributable to tissue eosinophilia; and exclusion of other causes of organ damage. 6 The low incidence and prevalence of HES, 7 along with diagnostic complexities, make investigation of novel treatments and management strategies challenging. Novel and validated methods to ascertain HES patients in real-world populations are required. A specific International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) diagnosis code for HES was not available until 2021. 8 Prior to this, a general code for eosinophilia (D72.1) was used, making patients with HES difficult to ascertain in most electronic health record (EHR) databases. However, specific codes for HES, including READ, EMIS and Systematized Nomenclature of Medicine (SNOMED) codes, are available in the UK-based Clinical Practice Research Datalink (CPRD)-Aurum EHR database. SNOMED provides a standardized coding scheme for capturing precise information from clinicians. 9 Consequently, the CPRD-Aurum EHR database represents a valuable tool for developing an algorithm to ascertain cases of HES.
Improved ascertainment of HES cases facilitates further examination of clinical characteristics, treatment patterns and outcomes, enabling optimization of disease management.
This study aimed to develop and internally validate an algorithm to ascertain and characterize cases of HES, based on a HES reference population, using data from the CPRD-Aurum database. The HES cohort was defined as patients with an existing READ/SNOMED/EMIS code for HES. A matched non-HES cohort comprised patients with similar characteristics but without specific diagnosis codes for HES. Patients from both cohorts were identified in the CPRD Aurum primary care database. Unique characteristics of the HES cohort were evaluated and used to develop an algorithm to ascertain HES cases without the need for specific SNOMED diagnosis codes (Figure 1). Index date was the date of the first diagnosis code for HES during the study period (HES cohort) or the date of the eosinophil blood test (non-HES cohort) that met the specified inclusion/exclusion criteria.

| Patients
Patients in the HES cohort had ≥1 specific SNOMED code for HES during the study period and no other record of HES within the 12 months prior to index. Patients in the non-HES cohort had no SNOMED code for HES at any time during their medical history and ≥1 blood eosinophil code during the study period. Patients included in both cohorts were ≥18 years of age, had ≥12 months of CPRD data prior to index and were eligible for linkage. From this sample, duplicated patients were removed to ensure each patient's data were unique. The HES cohort was matched to a randomly selected non-HES cohort (1:29 matching ratio) by age (±2 years), sex and index date (±12 months). When >29 controls were matched to a case, a random sample of 29 controls was selected for the final study sample (patient attrition in Supplementary Table 1).
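The matching rule described above can be illustrated with a minimal Python sketch. This is not the study's code (the analyses were run in SAS); the `Patient` record, the function names and the approximation of ±12 months as 365 days are assumptions made for illustration only, and the sketch does not handle how controls are shared or excluded across cases.

```python
import random
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Patient:
    """Hypothetical minimal patient record used only for this sketch."""
    patient_id: str
    age: int            # age at index, in years
    sex: str            # e.g. "F" or "M"
    index_date: date


MAX_CONTROLS = 29        # 1:29 matching ratio reported in the text
AGE_WINDOW_YEARS = 2     # controls within +/-2 years of the case
INDEX_WINDOW_DAYS = 365  # assumption: +/-12 months approximated as 365 days


def eligible(case: Patient, control: Patient) -> bool:
    """True if the control matches the case on sex, age and index date."""
    return (
        control.sex == case.sex
        and abs(control.age - case.age) <= AGE_WINDOW_YEARS
        and abs((control.index_date - case.index_date).days) <= INDEX_WINDOW_DAYS
    )


def match_controls(case: Patient, pool: list, rng: random.Random) -> list:
    """Return up to MAX_CONTROLS eligible controls for one case.

    If more than MAX_CONTROLS candidates are eligible, a simple random
    sample (without replacement) of MAX_CONTROLS is drawn.
    """
    candidates = [c for c in pool if eligible(case, c)]
    if len(candidates) > MAX_CONTROLS:
        candidates = rng.sample(candidates, MAX_CONTROLS)
    return candidates
```

Passing a seeded `random.Random` instance makes the control selection reproducible, which is useful when the matched sample needs to be regenerated for audit.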

| Variables
Pre-defined variables of interest included demographic characteristics of the overall cohort and clinical variables based on expected unique features of the HES cohort, including the most commonly reported comorbidities, most commonly prescribed treatments, pre-defined potential HES treatment, and blood eosinophil measurement variables (Supplementary Table 2). As part of the data cleaning process, all variables were checked for outliers, and blood eosinophil count units were unified to cells/μL. Asthma and white blood cell conditions were identified using ICD-10 and READ codes, with most SNOMED codes mapped to a READ code. Eosinophil and treatment variables were based on SNOMED codes only. Variables identified in the HES cohort were then assessed in the non-HES cohort. The clinical variables, including HES treatments, are detailed in Supplementary Table 2; a list of CPRD Aurum-specific codes for these variables is provided in the Supplementary Materials appendix.
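As an illustration of the unit harmonization step and of how a threshold variable such as "blood eosinophil count ≥1500 cells/μL in the 24 months prior to index" might be derived, a minimal Python sketch follows. The unit labels, the function names, the 730-day approximation of 24 months and the choice to include the index date itself are all assumptions for illustration; using the maximum value in the window follows the convention stated in the Table 1 footnote.

```python
from datetime import date, timedelta

# Assumed unit labels; real EHR extracts use many spellings and would need
# a fuller, validated mapping.
UNIT_TO_CELLS_PER_UL = {
    "cells/uL": 1.0,
    "10^9/L": 1000.0,  # 1 x 10^9 cells/L equals 1000 cells/uL
    "10^6/L": 1.0,     # 1 x 10^6 cells/L equals 1 cell/uL
}


def to_cells_per_ul(value: float, unit: str) -> float:
    """Convert a blood eosinophil count to cells/uL."""
    return value * UNIT_TO_CELLS_PER_UL[unit]


def max_bec_before_index(tests, index_date: date, lookback_days: int = 730):
    """Maximum eosinophil count (cells/uL) in the look-back window.

    `tests` is an iterable of (test_date, value, unit) tuples. Returns None
    when no test falls inside the window (a missing value, not zero).
    """
    start = index_date - timedelta(days=lookback_days)
    values = [
        to_cells_per_ul(value, unit)
        for test_date, value, unit in tests
        if start <= test_date <= index_date
    ]
    return max(values) if values else None


def bec_flag(tests, index_date: date, threshold: float = 1500.0) -> bool:
    """True if the maximum count in the window meets the threshold."""
    peak = max_bec_before_index(tests, index_date)
    return peak is not None and peak >= threshold
```

Returning `None` for patients without a test in the window keeps missing values distinguishable from low values, which matters because missed cases in this study had lower or missing blood eosinophil values.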

| Algorithm development
After variables of interest were identified and summarized for cases and controls separately, the algorithm was developed over three stages: variable review; model fitting and identification of the top five performing models; and internal validation. All analyses were conducted using SAS version 9.4.
Figure 1 Study design. CPRD, Clinical Practice Research Datalink; HES, hypereosinophilic syndrome.
Twenty-three variables were selected for model fitting and grouped into four categories: potential HES treatment (2 variables); asthma (3 variables); white blood cell treatment and condition-related variables (3 variables); and eosinophils (15 variables). Variables were selected if they showed the largest absolute differences between cases and controls in the summary measure and were clinically meaningful for HES patients. Models were tested for their predictive power, where the response variable was a binary variable based on HES status (yes/no). All models had six predictor variables: four main predictors of interest (one from each category described above), sex, and age (as a continuous variable). Only one variable from each of the four categories was included in each model, as variables within the same category were expected to be highly correlated. Age and sex were included in all models to minimize risk of confounding by these matching factors. 11 Following this approach, 270 models were tested (mathematical formula in Supplementary Materials). Each model was fitted using Firth logistic regression. 12 The Hosmer-Lemeshow test 13 was used to assess the null hypothesis that the logistic regression model was a good fit for the data; this test gave no evidence against the model (p > 0.05).
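The size of the model space follows from choosing one variable per category: 2 × 3 × 3 × 15 = 270. A minimal Python sketch of that enumeration is shown below. The variable names are hypothetical placeholders (the actual variables are those listed in the paper's tables and supplement), and the exact formula used by the authors is given in their Supplementary Materials, not here.

```python
from itertools import product

# Hypothetical placeholder names; only the category sizes come from the text.
CATEGORIES = {
    "hes_treatment": [f"hes_treatment_{i}" for i in range(1, 3)],  # 2 variables
    "asthma": [f"asthma_{i}" for i in range(1, 4)],                # 3 variables
    "white_blood_cell": [f"wbc_{i}" for i in range(1, 4)],         # 3 variables
    "eosinophils": [f"eosinophil_{i}" for i in range(1, 16)],      # 15 variables
}
ALWAYS_INCLUDED = ("age", "sex")  # matching factors included in every model


def enumerate_models(categories, always_included):
    """Return every predictor set with exactly one variable per category."""
    return [
        tuple(choice) + tuple(always_included)
        for choice in product(*categories.values())
    ]


models = enumerate_models(CATEGORIES, ALWAYS_INCLUDED)
print(len(models))  # 270
```

Each tuple in `models` holds the six predictors of one candidate model, ready to be passed to whichever fitting routine is used.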
For each model, performance measures of Akaike's information criterion (AIC), area under the receiver operating characteristic curve (AUROC) and sensitivity at the 50% predicted probability threshold were calculated. Models with ICD-10 variables were prioritized; if performance measures were similar between models with and without ICD-10 variables, the ICD-10 model was selected.
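For readers unfamiliar with these three measures, the following Python sketch shows one common way to compute them from observed outcomes and fitted probabilities. It is an illustration under stated assumptions, not the study's SAS implementation: AIC is computed from the ordinary binomial log-likelihood (software fitting Firth models may report it from a penalized likelihood), AUROC uses the pairwise (Mann-Whitney) formulation, and a patient is classed as a predicted case when the probability is at or above the threshold.

```python
import math


def log_likelihood(y_true, p_hat, eps=1e-12):
    """Binomial log-likelihood of 0/1 outcomes given fitted probabilities."""
    return sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, p_hat)
    )


def aic(y_true, p_hat, n_params):
    """Akaike's information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood(y_true, p_hat)


def auroc(y_true, p_hat):
    """Probability that a random case scores higher than a random control.

    Ties count as half. Requires at least one case and one control.
    """
    cases = [p for y, p in zip(y_true, p_hat) if y == 1]
    controls = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(
        (pc > pn) + 0.5 * (pc == pn) for pc in cases for pn in controls
    )
    return wins / (len(cases) * len(controls))


def sensitivity(y_true, p_hat, threshold=0.5):
    """Proportion of true cases with predicted probability >= threshold."""
    case_probs = [p for y, p in zip(y_true, p_hat) if y == 1]
    detected = sum(1 for p in case_probs if p >= threshold)
    return detected / len(case_probs)
```

Because the three measures reward different properties (parsimony-adjusted fit, ranking ability, and case detection at a fixed cut-off), they can rank candidate models differently.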
Since AIC, AUROC and sensitivity measures do not always agree on the best model, the top five development-stage models were selected. Three models were each selected on an individual measure (best AIC score, highest AUROC, and highest sensitivity); two models that performed well on a combination of the three measures were also selected. Selection was based on the best measure value and manual review, with models with ICD-10 variables (where applicable) prioritized (for more details, see Results section).
Variables identified for each of the five models were internally validated using the leave-one-out cross-validation (LOOCV) method, 14 with a 50% predicted probability threshold. The model with the highest internal validity, as judged by accuracy, discrimination measured by AUROC, sensitivity and specificity, was selected as the final model. Sensitivity and specificity were re-calculated for the final model at an 80% predicted probability threshold to allow for a balance between true positives, false positives and false negatives when ascertaining cases of HES. Wilson's score method incorporating continuity correction was used to obtain 95% confidence intervals (CIs). 15 Positive predictive value (PPV) and negative predictive value (NPV) were calculated using sensitivity and specificity combined with a predefined prevalence of 1.15 per 100 000 persons. This prevalence estimate was based on a previous study using the same database (CPRD Aurum) and the same SNOMED codes, which calculated the annual prevalence of HES in adults in 2018. 16 A sensitivity analysis of the final model was performed to examine the importance of the blood eosinophil count variable on model performance.
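The interval and predictive-value calculations named above can be sketched in Python as follows. This is an illustrative implementation of the standard continuity-corrected Wilson score interval and of the usual Bayes-rule expressions for PPV and NPV, not the authors' SAS code; function names are hypothetical, and results may differ slightly from published values depending on rounding and implementation details.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% CI


def wilson_cc_interval(successes, n, z=Z_95):
    """Wilson score interval with continuity correction for a proportion."""
    p = successes / n
    q = 1 - p
    denom = 2 * (n + z * z)
    if successes == 0:
        lower = 0.0
    else:
        root = math.sqrt(z * z - 2 - 1 / n + 4 * p * (n * q + 1))
        lower = max(0.0, (2 * n * p + z * z - 1 - z * root) / denom)
    if successes == n:
        upper = 1.0
    else:
        root = math.sqrt(z * z + 2 - 1 / n + 4 * p * (n * q - 1))
        upper = min(1.0, (2 * n * p + z * z + 1 + z * root) / denom)
    return lower, upper


def ppv(sens, spec, prevalence):
    """Positive predictive value from sensitivity, specificity, prevalence."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sens, spec, prevalence):
    """Negative predictive value from sensitivity, specificity, prevalence."""
    true_neg = spec * (1 - prevalence)
    false_neg = (1 - sens) * prevalence
    return true_neg / (true_neg + false_neg)


# Illustrative inputs: counts reported in the text for the final model
# (61/88 cases and 2551/2552 controls correctly classified) and the
# predefined prevalence of 1.15 per 100 000 persons.
sens, spec, prev = 61 / 88, 2551 / 2552, 1.15 / 100_000
print(wilson_cc_interval(61, 88))
print(ppv(sens, spec, prev), npv(sens, spec, prev))
```

With a prevalence this low, even a very high specificity yields a small PPV, because false positives from the large unaffected population outnumber true positives.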

| Sample size
Please see the Supplementary material for a description of the sample size determination.

| Patient population
Of the 135 patients with ≥1 HES code between January 2012 and June 2019, 88 patients met the study criteria and were included in the HES cohort (Supplementary Table 1); 2552 matched patients were included in the non-HES cohort. Patient demographics are summarized in Table 1.
The most commonly reported READ codes were summarized for HES and non-HES cohorts (Table 1). The top 5 conditions in primary care records using the hospital admission care data (%, ICD-10 codes) were: other disorders of white blood cells (34%, D72); pneumonia
A strategy combining READ and SNOMED codes was used; most SNOMED codes were mapped to a READ code. Some SNOMED codes did not have an associated READ code, but these were not consistent between cases and would not have been among the top 5 most reported conditions.

| Algorithm development
Clinically meaningful variables differing between HES and non-HES cohorts were included as predictor variables (Table 2). Of the top five models selected, the top AIC (model 1) and the top sensitivity (model 3) models had the best AIC scores, an AUROC of 93%-94%, 69% sensitivity and the lowest numbers of false-positives (n = 3-4) and false-negatives (n = 27). The second-best AUROC model (model 4) performed less well than the other four models (44% sensitivity; 49 false-negatives; 7 false-positives); however, this model was included to "loosen" the requirement for patients to meet a specific blood eosinophil count threshold as in the other four models.
There was no evidence of poor fit in any of the selected models (Hosmer-Lemeshow p-values >0.05).
Following LOOCV validation, the top AIC (model 1) and top sensitivity (model 3) were found to be top-performing models; both had 69% sensitivity, the ability to detect 61/88 true HES cases, and an AUROC of 85% (Table 4 and Figure 2). The top sensitivity model (model 3) was selected as the final model, as it gave three false-positives versus four for the top AIC model (model 1).

| Final model
Final model estimates are shown in Table 5. The strongest predictors of a HES case were an ICD-10 code for other disorders of white blood cells at any time during patient history, followed by a blood eosinophil count ≥1500 cells/μL in the 24 months prior to index. Patients with these characteristics had >1000 times higher odds of having HES versus those without these characteristics. Patients with a code for a potential HES treatment in the 6 months post-index and an ICD-10 asthma code at any time during patient history had 23 times and 9 times higher odds, respectively, of being a HES case versus patients lacking these codes.
Age at index and sex were not statistically significant predictors for a HES case.
A sensitivity analysis of the final model, removing the blood eosinophil count variable, found that at the 80% predicted probability threshold, the model ascertained 33/88 true HES cases and 2550/2552 true non-HES controls (Supplementary Table 3).
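As a worked check, the sensitivity and specificity implied by the counts in this sensitivity analysis follow directly from the reported fractions (for comparison, the final model with the blood eosinophil variable retained ascertained 61/88 cases, a sensitivity of about 69%):

```latex
\text{sensitivity} = \frac{33}{88} = 0.375 \approx 38\%,
\qquad
\text{specificity} = \frac{2550}{2552} \approx 0.9992
```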

| DISCUSSION
The algorithm developed supports ascertainment of patients with HES from EHR databases using a combination of medical codes, data
Test rate refers to the proportion of "left out" patients correctly ascertained as having HES or not having HES.
Patients with HES had >1000 times higher odds of having an ICD-10 code for "other disorders of white blood cells" (D72) (including eosinophilia [D72.1]) during their patient history, or a blood eosinophil count of ≥1500 cells/μL in the 24 months prior to index, versus non-HES cases. Patients with HES also had 23 times higher odds of having received potential HES treatment during the 6 months post-index, and 9 times higher odds of having an ICD-10 code for asthma at any time during their patient history, versus non-HES patients. Although high odds ratios indicate that the predictor variables are strongly related to the outcome of having HES, in statistical terms these predictor states are almost necessary but not sufficient to ascertain HES cases in the database. 17 Model sensitivity fell to 38% when the blood eosinophil criterion was removed from the final model, suggesting that both this and the "other disorders of white blood cells" criterion are crucial for the ascertainment of a HES case. Age and sex, included to adjust for confounding, 11 were not significant predictors of HES in the final model.
The final model included covariates using ICD-10 codes, as AUROC and sensitivity measures were similar between models with and without ICD-10 codes. Although AIC in the final model was 6.5 points higher than in the top AIC model, the final model had the second lowest AIC value of the top five models, suggesting good overall model performance. An AUROC value of 93% for the final model indicated a good ability to differentiate between HES and non-HES cases.
High AUROC would be expected as it is based on the inherent high specificity of the variables that were carefully selected based on the unique features of the HES cohort.PPV and NPV estimates were not informative for model selection, reflecting the very low prevalence of HES (1.15 per 100 000 persons) 16 on which these estimates are based.
The algorithm was specifically developed for ascertainment of patients with HES but the methodological approach could also be applied to other rare diseases. Diagnosis of rare diseases is difficult, and may take an average of 5-8 years to confirm. 18,19 Data from a patient-based survey showed that it could take >20 years and 11 or more physicians to reach a diagnosis. 18,19 These challenges make it more difficult to develop disease management strategies and investigate novel treatments. By using a HES-specific SNOMED code in the CPRD-Aurum database to ascertain patients, a set of clinical characteristics was selected to allow for ascertainment of patients in the absence of a HES-specific diagnostic code. This algorithm could be applied to other EHR databases to ascertain HES cases, while the modeling principle may be used to help ascertain cases of other rare diseases in EHR databases. As such, these methods can facilitate further in-depth examination of clinical characteristics, treatment patterns and outcomes in patients with HES or other rare diseases.
This study had several limitations, which should be considered when interpreting results. First, although SNOMED codes were the most specific way to ascertain patients in this EHR database, their use for ascertaining patients with HES requires further validation, or confirmation by the physicians that use these codes, as is the case for other validation studies done within this database. 20 Second, as with all database studies, prescribed medications do not always reflect utilization. Third, as clinical expertise is important for diagnosing this rare condition, there is the potential for misdiagnosis with the algorithm.
Fourth, other diseases can be the cause of eosinophilia, such as parasitic disease, eosinophilic granulomatosis with polyangiitis, allergic bronchopulmonary aspergillosis, eosinophilic chronic rhinosinusitis, and cancer. However, as HES-specific codes within the database were used to develop this algorithm, it was assumed that HES was diagnosed after excluding all other potential causes. Reassuringly, the most common differential diagnosis codes for HES were not among the most frequent comorbidities present in the HES population. Fifth, this algorithm may not be applicable to all databases or to countries outside England. Sixth, there may be more innovative approaches to addressing the research question, including using machine learning techniques (e.g., decision trees) and natural language processing (enabling the extraction of flexible search terms, such as "HES" and "hypereosinophilic syndrome" from unstructured EHR text); these are ongoing areas of research in many other rare diseases. 21
The statistical limitations of this study include the large difference in sample size between HES and non-HES cohorts that caused problems when fitting standard or exact logistic regressions; however, Firth logistic regression was utilized to overcome these issues. 12 As some events were very rare, calculations of exact odds ratios and specificity were not informative. Additionally, as there are few published algorithms to help ascertain rare diseases, it is difficult to compare the sensitivity of our algorithm to other studies. The variables considered for algorithm development were pre-defined and clinically driven, and do not represent an exhaustive list of all possible variables.
Interaction terms between the variables were not investigated and should be considered in future studies. Lastly, model validation using LOOCV was performed on the same dataset used for the development of the algorithm. This was unavoidable, due to the rarity of HES and the consequent small size of the HES cohort. As such, external validation of the final model is still required using either a new cohort from the CPRD-Aurum or an alternative data source.
Even without external validation, this study still has substantial value, having identified a set of variables with a clear association with HES.
In many medical databases, a diagnosis of HES is not recorded, and the association found here will facilitate the ascertainment of patients likely to have HES in such databases.

| CONCLUSIONS
The unique characteristics of patients with HES in comparison with a matched non-HES cohort enabled the development of an algorithm with sufficient sensitivity to ascertain cases of HES in an EHR database (CPRD-Aurum). The strongest predictors of HES were having a code for other white blood cell disorders any time during patient history and blood eosinophil counts ≥1500 cells/μL in the 2 years prior to index, followed by a code for a potential HES treatment in the 6 months post-index or asthma at any time during patient history. This algorithm-based approach could be applied to ascertain cases of HES in other databases and may be useful for ascertainment of cases of other rare diseases.
Estimates used formulas based on sensitivity, specificity and a pre-defined HES prevalence of 1.15 patients with HES per 100 000 persons. 14

| Study design
This cross-sectional study (GSK ID: 213879) used data (1st January 2012 to 30th June 2019) from the CPRD-Aurum database linked at person level to the Hospital Episode Statistics database. CPRD-Aurum is a database containing routinely collected data from primary care practices in England that use EMIS clinical systems, capturing diagnoses, symptoms, prescriptions, referrals and tests for over 19 million patients, 10 and uses READ, local EMIS, and SNOMED codes. The Hospital Episode Statistics database contains information on admitted patient care and outpatient appointments in England using ICD-10 coding. In this study, only the admitted patient care data were used, due to the lack of ICD-10 diagnosis codes available in the outpatient dataset. Access to the CPRD Aurum and linked Hospital Episode Statistics data was obtained under the CPRD license agreement and under the terms of the approved study protocol.

on prescribed treatments and laboratory results in the form of blood eosinophil counts in the absence of specific HES diagnosis codes. The final model had a sensitivity of 69% (95% CI: 59%, 79%) to ascertain HES cases at an 80% probability threshold based on unique characteristics identified in these patients when compared with matched non-HES controls. Although none of the covariates included in the final model represent the common definition of HES or would in themselves constitute a diagnosis of HES, individual covariates differentially impacted the probability of a patient having a confirmed HES diagnosis (SNOMED code). Therefore, these covariates in combination represent useful predictors of HES. Cases missed by the algorithm had lower or missing blood eosinophil values and were therefore not selected as cases. Patient recovery, or unrecorded treatment, might explain the lack of these specific variables for these patients; however, if the algorithm had been more inclusive toward these patients, its specificity would have been reduced.

Table 4 LOOCV validation of the top five models.

Model 1 (Top AIC) • Potential HES treatment - 6 months post-index • ICD-10 (3rd digit) asthma a c

AUTHOR CONTRIBUTIONS
Requena, Sandra Joksaite, and Nicholas Galwey contributed to the conception or design of the study, acquisition of study data, data analysis, and interpretation. Rupert W Jakes contributed to the conception or design of the study, data analysis, and interpretation.

ACKNOWLEDGMENTS
This study is based in part on data from the Clinical Practice Research Datalink obtained under license from the UK Medicines and Healthcare products Regulatory Agency. The data are provided by patients and collected by the NHS as part of their care and support. The interpretations and conclusions contained within this manuscript are those of the authors alone. The authors would like to acknowledge the contributions of Melissa Van Dyke in the conception/design of the study

Table 6 Performance measures of the final model at the 80% probability threshold for the ascertainment of a patient with HES.

Table 1
Summary of patient demographics and clinical characteristics at index.
a If a patient had ≥1 value in the time period of interest, the maximum value was used in the summary measure.

Table 2
Variables selected for inclusion in the development models.
Of 270 fitted models, the top five were selected for internal validation. Selected models comprised: the top AIC model; the top AUROC/second-
prior to or post-index date, an ICD-10 code for asthma and a blood eosinophil count >1000-1500 cells/μL or blood eosinophil count code in the 24 months prior to index. Apart from the top AIC model (model 1), which included a READ code for eosinophilia instead, all models included an ICD-10 code for other disorders of white blood cells.
ascertained 61/88 in the HES cohort as HES cases and 2551/2552 in the non-HES cohort as controls, giving a sensitivity of
Table 3
Top five models based on AIC, AUROC and sensitivity testing.
Abbreviations: AIC, Akaike's information criterion; AUROC, area under the receiver operating characteristic curve; HES, hypereosinophilic syndrome; HL, Hosmer-Lemeshow goodness of fit test for logistic regression; ICD-10, International Classification of Diseases, Tenth Revision.
a Any time during patient history.
b A patient was predicted as having HES if their predicted probability was ≥50%.
Summary of odds ratios for the final model. Analysis using Firth logistic regression model where the response variable is HES status (Yes, case; No, control) and covariates are listed as rows.
Covariates in the final model included HES treatment 6 months post-index, asthma ICD-10 code (3rd digit), other disorders of white blood cells ICD-10 code (3rd digit), a blood eosinophil count ≥1500 cells/μL 24 months prior to index, age and sex. Abbreviations: CI, confidence interval; HES, hypereosinophilic syndrome; ICD-10, International Classification of Diseases, Tenth Revision; NPV, negative predictive value; PPV, positive predictive value.