Assessing heterogeneity of electronic health-care databases: A case study of background incidence rates of venous thromboembolism

Purpose: Heterogeneous results from multi-database studies have been observed, for example, in the context of generating background incidence rates (IRs) for adverse events of special interest for SARS-CoV-2 vaccines. In this study, we aimed to explore different between-database sources of heterogeneity influencing the estimated background IR of venous thromboembolism (VTE). Methods: Through forest plots and random-effects models, we performed a qualitative and quantitative assessment of heterogeneity of VTE background IR derived from 11 databases from 6 European countries, using age- and gender-stratified background IR for the years 2017–2019 estimated in two studies. Sensitivity analyses were performed to assess the impact of selection criteria on the variability of the reported IR. Results: A total of 54 257 284 subjects were included in this study. Age–gender pooled VTE IR varied from 5 to 421/100 000 person-years and IR increased with increasing age for both genders. Wide confidence intervals (CIs) demonstrated considerable within-data-source heterogeneity. Selecting databases with similar characteristics had only a minor impact on the variability as shown in forest plots and the magnitude of the I² statistic, which remained large. Solely including databases with primary care and hospital data resulted in a noticeable decrease in heterogeneity. Conclusions: Large variability in IR between data sources and within age group and gender strata warrants the

2. After mitigating unwanted heterogeneity through harmonization of database characteristics, some heterogeneity might remain, but this should be considered a source of knowledge; our study confirmed prior knowledge that VTE background IRs differ depending on the age and gender of the individual.
3. The level of heterogeneity in estimates depends on differences in database characteristics. In our study, databases collecting data from different parts of the health-care system were the largest contributors to heterogeneity in estimates.
4. When heterogeneity is present, a careful trade-off has to be made for the choice of IR, between stratified estimates or a pooled estimate, to support use in pharmacoepidemiological and regulatory evaluation.
5. To attenuate heterogeneity, a pre-screening of database characteristics through a metadataset and adequate analytical tools at the study design stage might be considered.

Plain Language Summary
Real-world data collected in everyday clinical practice can complement information used in regulatory decision-making and provide evidence to support the benefit-risk assessment of medicines. To improve the added value of real-world data for regulatory decision-making, regulators pool information from multiple databases to provide a more accurate picture of the outcome of interest. However, there is regularly variability, also called heterogeneity, in study outcomes when using data from different databases, and this poses challenges for interpretation and communication. In this study, we examined incidence rates of venous thromboembolism, identified as a potential side effect for some COVID-19 vaccines, derived from multiple databases. We investigated how differences in database characteristics might cause variation between rate estimates and concluded that the largest contributor to heterogeneity was the use of data from different health-care settings. Understanding which database characteristics contribute to variability can help to mitigate variation. This can be done by selecting databases with similar data characteristics, such as harmonised codes to refer to clinical outcomes and comparable selection criteria for participants, and by using appropriate statistical methods to analyse the variability.
Overall, our study provides an overview of the complexity of real-world evidence and can be used to better understand and analyse sources of variability.

| INTRODUCTION
In the past two decades, the usage of large health-care databases has increased greatly. 1 Regulatory agencies such as the European Medicines Agency (EMA) and the US Food and Drug Administration (FDA) have highlighted the value of real-world data (RWD) in medicines regulation. 2,3 In Europe, the initiation of the Data Analysis and Real World Interrogation Network (DARWIN EU), 4 as well as the European Health Data Space, is changing the landscape of real-world evidence (RWE) generation towards multi-database studies.
While there are already a number of advantages in using RWD in regulatory decision-making, those benefits can be increased by using more than one data source. 5 Trivially, incorporating data from multiple data sources in an analysis will increase sample size. This can be crucial in situations with low event counts, such as for estimating the incidence rates (IRs) of a rare disease. While observational data are more generalizable to the real world than randomised controlled trials, the level of generalizability can be increased even further by covering a broader and more representative population, thus possibly mitigating selection biases that are specific to single databases, and by allowing for the quantification of true differences between populations.
Even though the benefits of using multiple data sources cannot be denied, data from those sources should not be pooled without a preliminary assessment of the suitability of pooling, due to inherent differences in their characteristics. Simulations have shown increased risks of false-positive and false-negative safety signals when pooling data from multiple databases. 6 This heterogeneity can take multiple forms, some of which are desirable for understanding true differences in outcome event rates, while others make interpretation of results and decision-making regarding selection of suitable background rates for specific purposes, such as observed-to-expected analyses for vaccines, highly challenging. 7 Sources of heterogeneity can be categorized into three types: measurement heterogeneity, information heterogeneity (both may be considered methodological heterogeneity), and true heterogeneity (also called clinical heterogeneity). 8 While measurement (e.g., clinical classification systems) and information heterogeneity (e.g., granularity of clinical codes) can generally be considered undesirable, clinical heterogeneity has its value, for example, by improving external validity of results or understanding differences in prescription patterns or impact of risk minimisation measures (RMMs) between different geographical regions, health-care systems or behaviours. However, to understand heterogeneity, it is important to use appropriate tools to detect, report and account for it.
During the COVID-19 pandemic, RWD rapidly provided impactful evidence on safety and effectiveness of therapeutics and vaccines. 9 This included the generation of background IR for adverse events of special interest (AESIs) for COVID-19 vaccines. 10 Those background rates continue to be used in observed-to-expected analyses to estimate the expected number of cases in the general population prior to the introduction of COVID-19 vaccination, or during SARS-CoV-2 circulation in non-vaccinated populations.
The list of AESIs included the concepts of deep vein thrombosis (DVT) and pulmonary embolism (PE). These two concepts make up the term venous thromboembolism (VTE). 11 EMA pharmacovigilance activities identified VTE as a possible adverse event of Jcovden (formerly COVID-19 Vaccine Janssen) 12 and listed VTE as an adverse event of Vaxzevria. 13 Background rates were used to calculate the excess number of cases potentially linked with these vaccines. However, the reported background rates showed large differences between EU countries as reflected through national health-care records. 14,15 The objective of this study was to explore data characteristics that trigger heterogeneity in IR through both descriptive and statistical measures, using VTE as a case study. This investigation will further provide support in selecting adequate statistical methods for handling heterogeneity when pooling observations to derive meaningful pooled estimates to support regulatory decision-making.

| Data
To demonstrate an analytical workflow for handling heterogeneity between databases, we selected VTE, a safety concern listed for a class of COVID-19 vaccines, as a case study.
In order to assess potential adverse reactions related to approved COVID-19 vaccines in the EU, EMA funded two studies through large research consortia: the ACCESS project with University Medical Center Utrecht and the EU PE&PV Research Network 15,16 and a study by ERASMUS University Medical Center 17 to generate aggregated background IRs of AESIs, including VTE. Both consortia reported background IRs from multiple databases stratified by age group and gender, using the same eight age categories. IRs were estimated by dividing the number of incident cases by the total person-time at risk, with individuals entering the study cohort on their first visit after January 1, 2017, and being followed until the outcome, exit from the database or end of the study period. The study period covered 2017-2020. In the ACCESS protocol, the study population included all individuals who were observed in the databases for at least 1 day during the study period (January 1, 2017 to last date available) and who had at least 1 year of data availability before cohort entry, except for individuals <1 year of age with data available since birth.
In the ERASMUS protocol, the study population was defined slightly differently: people observed on January 1, 2017, January 1, 2018, or January 1, 2019 had to be observed continuously for at least 365 days with no event before this observation date. Ninety-five percent CIs were calculated using an exact method described by Ulm. 18 Case definitions for VTE were developed by the researchers independently. ACCESS utilized the CodeMapper tool 19 to find harmonized definitions across coding systems. The full list of included concepts and details on its generation process is publicly available. 20 Through using the OMOP common data model for its analyses, clinical codes in databases to which ERASMUS has access were mapped to the SNOMED system, ensuring harmonized case definitions. The list of clinical codes included by ERASMUS can be accessed in the ATLAS application. 21 Table A1 in the appendix shows included ICD10 codes for both ACCESS and ERASMUS. Only ICD10 codes are shown since both research organisations harmonized their definitions across coding systems. For both PE and DVT, the definition by ACCESS includes a broader range of concepts. For PE, the additional concepts are related to septic PE. The additional concepts for DVT mostly correspond to phlebitis, thrombophlebitis and DVT related to pregnancy.
As described in a report by the FDA, 22 there is no clear guidance on whether these concepts should be included or not.
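The IR calculation described above (incident cases divided by person-time, with an exact 95% CI) can be sketched as follows. This is a minimal illustration, not the ACCESS or ERASMUS code: it computes exact Poisson limits by bisection on the Poisson distribution function, which is assumed here to correspond to the exact approach the source attributes to Ulm. The function names and the example counts are hypothetical, and the term-by-term summation is only suitable for modest event counts (the leading term underflows for expected counts above roughly 700).

```python
import math

def poisson_cdf(k: int, mu: float) -> float:
    """P(X <= k) for X ~ Poisson(mu), summed term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return min(total, 1.0)

def _root(f, lo: float, hi: float, steps: int = 200) -> float:
    """Bisection for a function increasing in mu with f(lo) < 0 < f(hi)."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def exact_poisson_ci(events: int, alpha: float = 0.05):
    """Exact two-sided CI for the expected count given an observed count."""
    hi_bound = events + 10.0 * math.sqrt(events + 1.0) + 10.0
    if events == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= events | mu) = alpha / 2
        lower = _root(
            lambda mu: (1.0 - poisson_cdf(events - 1, mu)) - alpha / 2.0,
            0.0, float(events))
    # Upper limit: P(X <= events | mu) = alpha / 2
    upper = _root(
        lambda mu: alpha / 2.0 - poisson_cdf(events, mu),
        float(events), hi_bound)
    return lower, upper

def incidence_rate(events: int, person_years: float, per: int = 100_000):
    """IR and exact CI, expressed per `per` person-years."""
    lo, hi = exact_poisson_ci(events)
    scale = per / person_years
    return events * scale, lo * scale, hi * scale

# Hypothetical stratum: 10 incident VTE cases in 100 000 person-years
rate, lo, hi = incidence_rate(events=10, person_years=100_000)
print(f"IR = {rate:.1f} per 100 000 PY (95% CI {lo:.2f} to {hi:.2f})")
```

Bisection is used only to keep the sketch free of third-party dependencies; a chi-square quantile function from a statistics library would give the same limits in closed form.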
A short overview of the databases is provided in Table 1. Further details, including demographic characteristics and total population, are provided in the corresponding published reports. 15,16 All databases are listed in the ENCePP research database, which also shows a list of relevant research publications they have been used in. For three databases (PHARMO, BIFAP, and SIDIAP), the IR had been estimated both on the total population and on the subset of subjects with linked primary care and hospital records (PC-H linkage). For the primary analysis, the total population estimates were used.
The data included the years 2017-2019. ACCESS reported IRs by year; hence, rates were pooled based on counts to match the data structure of ERASMUS, which reported only combined estimates for all years. Data from the Danish registries (DCE-AU) were reported only for the years 2010-2013 and thus were not included in the analysis.
For the PHARMO database, only data for 2017 and 2019 were reported, due to an error in the imputation of a subset of data for 2018; BIFAP data with hospital linkage were only reported for 2017-2018.
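Pooling yearly rates "based on counts", as described above, is assumed here to mean summing events and person-years across the reported years before dividing, rather than averaging the yearly rates. A minimal sketch with hypothetical counts; a year that was not reported (as for PHARMO in 2018) is simply left out of the list:

```python
def pooled_rate(yearly_counts, per=100_000):
    """Pool yearly (events, person_years) pairs into a single rate.

    yearly_counts: list of (events, person_years) tuples, one per reported year.
    """
    total_events = sum(events for events, _ in yearly_counts)
    total_person_years = sum(py for _, py in yearly_counts)
    return total_events / total_person_years * per

# Hypothetical yearly counts for one database and stratum (2017, 2018, 2019)
yearly = [(120, 1_000_000), (150, 1_100_000), (130, 900_000)]

pooled = pooled_rate(yearly)
mean_of_rates = sum(e / py * 100_000 for e, py in yearly) / len(yearly)

print(f"pooled from counts: {pooled:.2f} per 100 000 PY")
print(f"unweighted mean of yearly rates: {mean_of_rates:.2f} per 100 000 PY")
```

The two numbers differ whenever person-time varies between years, because pooling from counts weights each year by its person-time.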

| Analysis
We used forest plots to visualize heterogeneity, displaying estimated IRs as squares with CIs for each database. The size of the squares is proportional to the precision of the estimate.
A random-effects meta-analysis, using the restricted maximum likelihood (REML) estimation method, was performed on the log scale of the IRs to calculate a summary estimate and to quantify the level of heterogeneity, thereby allowing for heterogeneity between databases, which is more realistic than assuming that the true value of the estimand is exactly the same for each database.
To quantify the absolute value of this heterogeneity, we reported estimates of τ², measuring the 'dispersion of true effect sizes between studies in terms of the scale of the effect size', 23 and I², measuring what proportion of variation in the observed effects is due to variation in true effects, that is, due to inherent differences between the investigated data sources rather than sampling error. 24 Borenstein et al. 18 stress that I² represents a proportion rather than an absolute value. Therefore, we estimated the level of heterogeneity in comparison with statistical variability rather than heterogeneity itself. In addition, a prediction interval was calculated. 25 Such an interval combines uncertainty due to sampling variation and due to heterogeneity to provide an approximate range of true values.
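The quantities described above (pooled log-scale estimate, τ², I² and a prediction interval) can be illustrated with a short sketch. Several simplifications are assumed and should be read as such: the method-of-moments (DerSimonian-Laird) estimator of τ² is used as a simpler stand-in for REML, the variance of a log IR is approximated by 1/events, and a normal quantile replaces the t quantile usually used for prediction intervals, so the interval is somewhat too narrow when few databases are pooled. The function name and the example counts are hypothetical; at least two databases with non-zero event counts are required.

```python
import math
from statistics import NormalDist

def random_effects_log_ir(events, person_years, level=0.95, per=100_000):
    """Random-effects pooling of incidence rates on the log scale.

    Uses the DerSimonian-Laird estimator of tau^2 (not REML).
    """
    y = [math.log(e / t) for e, t in zip(events, person_years)]  # log rates
    v = [1.0 / e for e in events]      # approximate variance of each log rate
    k = len(y)

    # Fixed-effect weights and Cochran's Q
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))

    # Between-database variance and proportion of variation due to it
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0

    # Random-effects summary estimate
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_ci = z * se                            # uncertainty of the mean only
    half_pi = z * math.sqrt(tau2 + se ** 2)     # adds between-database spread

    def to_rate(x):
        return math.exp(x) * per

    return {
        "pooled_ir": to_rate(mu),
        "ci": (to_rate(mu - half_ci), to_rate(mu + half_ci)),
        "pi": (to_rate(mu - half_pi), to_rate(mu + half_pi)),
        "tau2": tau2,
        "i2": i2,
    }

# Hypothetical stratum reported by four databases
result = random_effects_log_ir(
    events=[300, 160, 50, 40],
    person_years=[200_000, 100_000, 100_000, 150_000],
)
print(result)
```

The prediction interval is never narrower than the CI in this construction, because it adds τ² to the variance of the pooled mean.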
Finally, based on the available metadata characteristics of each of the included databases (see Table 1), several supplementary analyses were performed omitting or selecting a set of databases meeting selected criteria, aiming to reduce potential unwanted measurement and information heterogeneity, in order to assess more accurately true differences in subject-level data (i.e., true heterogeneity). These exploratory analyses made it possible to determine the contribution of each database characteristic to the heterogeneity in IR across databases.
The following supplementary analyses were performed:
a. Restricting the analysis to a subpopulation; only those databases with linkage between primary care and hospital data can provide information on the influence of the health-care setting (i.e., type of data source). Including data sources with only primary care data might lead to underestimation in case of in-patient diagnosis. Data sources with only hospital data may underestimate the events in case of out-patient diagnosis.
b. Restricting the analysis to only those databases using the same clinical classification system for diseases. In this study we included only databases that used the International Classification of Diseases (ICD10) 26 to diagnose VTE as it is the most widely used vocabulary among the available data sources.
c. Restricting the analysis to databases with homogeneous case definition. Table A1 in the appendix specifies the ICD10 codes used to diagnose VTE in the two studies. We performed separate analyses by study, that is, ACCESS and ERASMUS, to explore differences in case definitions and population selection criteria.
The analyses were performed using the R software 27 package meta. 28
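The logic of the supplementary analyses listed above (re-estimating heterogeneity on a subset of databases chosen by a metadata characteristic) can be sketched as follows. The source analyses were run with the R package meta; this Python sketch is only an illustration. The database names, metadata field and counts are hypothetical, and I² is computed from Cochran's Q with the variance of each log IR approximated by 1/events.

```python
import math

def i_squared(events, person_years):
    """I^2 from Cochran's Q for log incidence rates (variance ~ 1/events)."""
    y = [math.log(e / t) for e, t in zip(events, person_years)]
    w = [float(e) for e in events]          # inverse-variance weights
    mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    return max(0.0, (q - df) / q) if q > 0 else 0.0

# Hypothetical metadata and counts for one age-gender stratum
databases = [
    {"name": "DB_A", "setting": "primary care + hospital", "events": 300, "person_years": 200_000},
    {"name": "DB_B", "setting": "primary care + hospital", "events": 160, "person_years": 100_000},
    {"name": "DB_C", "setting": "primary care only", "events": 50, "person_years": 100_000},
    {"name": "DB_D", "setting": "hospital only", "events": 40, "person_years": 150_000},
]

def i2_for(selection):
    return i_squared([d["events"] for d in selection],
                     [d["person_years"] for d in selection])

linked = [d for d in databases if d["setting"] == "primary care + hospital"]
i2_all = i2_for(databases)
i2_linked = i2_for(linked)

print(f"I^2, all databases: {i2_all:.2f}")
print(f"I^2, PC-H linkage only: {i2_linked:.2f}")
```

The same pattern applies to the other restrictions (classification system, study consortium): only the filter condition changes. As the source recommends, such selection criteria are best fixed before the analysis is run.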

| RESULTS
A total of 60 080 169 subjects contributed to the 13 databases. See Tables A2 and A3 for an overview of the reported IRs stratified by age category and gender, by database and by study.
Two databases (CPRD GOLD and SIDIAP) were used by both consortia. Differences in defining the study cohort resulted in the cohort entry criteria not being identical. After removing the duplicated databases, a total of 54 257 284 subjects derived from 11 databases were included in the main analysis, representing collectively all age and gender subgroups from six countries.
As the first step, we display the age-gender-database-specific IR estimates in a forest plot (Figure 1) using total population estimates.
The forest plot showed a relatively large amount of heterogeneity between databases and within strata of age groups and gender. While for the 0-19 age group the IRs appeared to be in the same order of
TABLE 1 Overview of main characteristics by data source.
Next to the age-gender-database-specific IRs, age-gender IRs from meta-analyses were calculated. In our study, the meta-analysis estimated IRs of VTE ranging from 5 to 421 per 100 000 person-years depending on the age-gender stratum. The wide confidence interval of the summary (i.e., pooled) measure identified large patient-level differences even within each stratum. Table A4 in the appendix displays the age-gender IR estimates and CIs of the pooled measure for VTE from meta-analyses.
Figure A1 in the appendix shows the calculated prediction interval for the primary analysis. The prediction intervals for each age-gender group were notably wide, confirming the substantial population-level heterogeneity observed across data sources.
Table 2 shows the estimated I² and τ² values by age-gender stratum. The values for I² indicated that a majority of the observed variability is due to differences between databases rather than random sampling error. Supporting the impression from the forest plot, we observed an increasing estimate of I² with increasing age in both sexes. There did not seem to be an age-related trend in the estimates of τ², but estimates for τ² appeared to be lower for males than for females.
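Because I² is a proportion, its value depends on the precision of the individual estimates as well as on τ². A short worked example may help; it uses a commonly quoted interpretation of I² in terms of a "typical" within-database variance σ² and, as an assumption for illustration, approximates the variance of a log IR by 1/d for d observed events. The numerical values are hypothetical.

```latex
I^2 = \frac{Q - (k-1)}{Q} \;\approx\; \frac{\tau^2}{\tau^2 + \sigma^2},
\qquad \sigma^2 \approx \frac{1}{d}

% Same between-database variance, different database sizes:
\tau^2 = 0.05,\; d = 100:\quad
  I^2 \approx \frac{0.05}{0.05 + 0.01} \approx 0.83

\tau^2 = 0.05,\; d = 10\,000:\quad
  I^2 \approx \frac{0.05}{0.05 + 0.0001} \approx 0.998
```

With the between-database variance held fixed, I² therefore moves towards 100% simply because the databases are large, which is why τ² and the prediction interval are reported alongside it.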

| Sensitivity analyses
In the first sensitivity analysis, we restricted the databases to those with PC-H linkage (Figure 2). The forest plot demonstrated a relatively large decrease in heterogeneity when restricting the analysis to databases with PC-H linkage, across all age groups. It became apparent that this restriction of databases primarily leads to low estimates being excluded from the analysis. In Table A5 in the appendix, which lists I² and τ² estimates for all sensitivity analyses, we noticed lowered I² estimates especially for younger age groups and considerably lowered τ² values for all age groups. Figure 3 did not imply any reduction in heterogeneity when restricting the analysis to databases using ICD10 codes to diagnose VTE. Both range and distribution of estimates were similar to the primary analysis. The same was true for estimates of I² and τ². Figure 4a, b showed forest plots of the analysis considering data from ACCESS and ERASMUS separately. Note that due to including the estimates from both the total population and the subpopulation with PC-H linkage, there was some dependence between the estimates of BIFAP, SIDIAP, and PHARMO. For ACCESS, the hospital database PHARMO showed far lower estimates than all other databases included. This could be linked to an oversampling of the denominator. Apart from this, visually there seemed to be some [...]

| DISCUSSION
This study explored heterogeneity in background IRs of VTE reported from 11 data sources spanning six EU countries and derived from two observational studies, by focusing on the database as a source of heterogeneity. Through investigating data source characteristics potentially introducing differences in estimated IRs, our aim was to investigate the amount of unwanted (i.e., methodological) heterogeneity or uncertainty between data sources to provide more valid conclusions for safety surveillance activities. Data sources used in this study were mostly from primary care settings, partly with linkage to hospital data. The study used aggregated background IRs of VTE, considered a relevant AESI for a class of EU-approved COVID-19 vaccines. Substantial heterogeneity in the background IRs was observed between all included data sources, in addition to observed within-data-source differences across age groups and genders. Age was the main contributor to the heterogeneity as shown in our study. Overall, it was observed that background rates increased with increasing age with no clear pattern in IR between males and females. The observation of increased IRs with increasing age is in line with another study on VTE, 29 with the same study also suggesting a difference in IR between genders: IRs increase markedly with age for men and women; the overall age-adjusted IR is higher for men (130 per 100 000) than women (110 per 100 000). The observed heterogeneity in the different age-gender strata is a source of information that leads to a better understanding of the burden of VTE in the general population. However, as demonstrated through the summary estimate and CIs, we still found substantial heterogeneity between data sources within each stratum, suggesting that there still might be unobserved patient-level heterogeneity and therefore a single estimate for each stratum might be inaccurate.
In an attempt to understand the contribution of database characteristics to the reported heterogeneity, we performed several exploratory analyses. Our databases included data derived from both hospital and primary care settings. In all data sources, when estimating background rates, it is important to consider how the population denominator was derived. When linking data between the two settings, depending on the mechanism of linkage, there is a risk of only capturing those subjects that had a hospital visit recorded, which could lead to biased estimates. Restricting the analysis to databases that included a link with hospital data (PC-H linkage) resulted in a moderate decrease in the reported variability. In contrast, we did not see a decrease in heterogeneity when using only databases that used the ICD-10 vocabulary, demonstrating that the type of vocabulary used for clinical classification of VTE could not be identified as a major source of heterogeneity.
Differences in background rates remained between the two studies even though the time at risk in which the rates were collected, the age-gender subgroup definitions and the analytical methods were similar.
Comparing more closely the methodology applied in the two studies, differences in case definitions were noted, with some clinical codes only included in one of the studies (Table A1 in the appendix). The inclusion and exclusion criteria for individuals also differed, leading to non-identical study populations even within the same data source. Since we did not find any systematic differences in estimates between the two consortia, it is unlikely that differences in case definition or inclusion criteria had a large influence on observed heterogeneity.
When quantifying heterogeneity using the statistical measure I², a considerable amount of heterogeneity (close to 100%) is reported.
The large values of I² are not surprising, as the large sample sizes in every database imply small variance estimates. In particular, the I² estimates in the 0-19 age group seem to be influenced by this fact: due to a larger sample size, the variance is lower than for the other age groups, leading to I² estimates that appear too high in comparison with the other age groups when looking at the forest plot.
In this study, we have calculated pooled estimates for the primary analyses. However, when large heterogeneity is present, focusing on a pooled estimate is not advisable, given that the pooled estimate will derive largely from the particular choice of databases and the relative weights associated with each database. 30 Following the classification in Deeks et al., 8 it is not advised to combine the estimates if the value of I² is estimated to be larger than 90%. In addition, when reporting the results after combining estimates, attention needs to be given to uncertainty quantification. In addition to the common risk of misinterpretation for CIs, 31 CIs for random-effects meta-analyses are easily misinterpreted to quantify dispersion of study effects. However, CIs only represent the uncertainty in estimating the mean effect size, not taking into account variability due to different database characteristics. This means that CIs are always smaller than the range of observed estimates.
A major strength of this study is the use of data aggregated from a large number of data sources independently provided by two research consortia using the same calculation method to estimate the IRs. This enabled the exploration of database-specific aspects related to heterogeneity. From 11 databases, 54 257 284 subjects contributed to the main analysis; with the databases spanning a large part of Europe, this can be considered a representative sample of the total population.
The above exploratory analyses show that even when certain database characteristics are harmonized, significant heterogeneity is still present. A limitation of our study is that only aggregated data were available, which prevented us from investigating potential sources of heterogeneity attributable to patient characteristics. For instance, comorbidities may have an influence on IRs. 32 A final limitation is that clinical validation of the diagnosis codes was not performed in these studies, which might have led to different frequencies of misclassification or underreporting of VTE cases, which may affect estimates of heterogeneity. 33 This study highlights the challenges regarding the varying levels of available information about database characteristics and the difficulty of identifying sufficiently detailed information about the data sources. For example, some differences can only be explored through subject matter expertise about the corresponding health-care systems. Health-care systems might differ between regions, implying possible differences in the probability of recording certain events even in the same health-care setting. The process of clinical coding could also influence the quality of recording, with different levels of quality control or incentives for correct coding. With the large level of observed heterogeneity, an important recommendation is to use the same databases when comparing estimates at different time-points, for example, for pre- or post-exposure IRs of AESI.
The above considerations highlight the necessity for careful assessment of the suitability of databases to include in multi-database studies. In the two studies, a variety of databases was included because many different AESIs were considered simultaneously. For a study on a specific outcome, more specific restrictions on the databases should be placed a priori. In our study we have observed that the type of data source is one of the most important considerations.
Based on subject matter knowledge or available validation studies, it should be evaluated in which type of setting the most accurate estimation is possible.
Our analysis also highlights the importance of unified case definitions and inclusion and exclusion cohort criteria to select adequate data sources in multi-database studies. This is evidenced in this study by the CPRD data source used by both study groups to address the same objective, but with a difference of 17% in the total number of individuals included in the study cohort, most likely related to differences in the operationalization of case definitions.
It is important to note that the methods used for detecting and addressing heterogeneity should be specified before starting any meta-analysis. When the forest plot shows outliers among observed rates, it can be tempting to exclude the corresponding databases from the analysis without further investigating causes for outliers. This practice is, however, likely to introduce bias and should be avoided in most situations. Criteria for excluding certain databases should be specified prior to performing the analysis, but even then, it is advisable to also present results with the excluded databases, as a sensitivity analysis. In parallel, the choice of method for handling heterogeneity should also be prespecified, conditional on the outcome of the method for detecting heterogeneity. Also, it is preferable that the method for estimating the meta-analysis model and its statistical heterogeneity (e.g., REML), the methods for quantifying CIs (e.g., the Hartung-Knapp and Sidik-Jonkman modifications to the Wald method) and prediction intervals are prespecified. 34 More specifically, exploring the level of heterogeneity using multiple databases must be considered if these rates are intended to be used to support safety signal detection activities and to avoid misleading recommendations. One of the current initiatives is the DIVERSE project with the aim to develop guidelines for the identification, collection and reporting of heterogeneity in multi-database studies. 35 In addition, EMA's list of metadata for Real World Data catalogues, 36 which will be the basis of a catalogue of RWD sources, will provide researchers with standardized, relevant information about databases to use for RWE studies. Another approach would be to develop a set of metrics to measure database heterogeneity or to develop phenotype libraries to identify important variables in different databases.
[...] quantify factors influencing IRs through a set of sensitivity analyses using patient-level data. Finally, the use of SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms), 37 a terminology that can cross-map to other classifications and code systems, may reduce the variability among estimates derived from different data sources. Nonetheless, the mapping of original coding systems to SNOMED may not reduce the heterogeneity as such, but may merely conceal possible heterogeneity introduced by different classification systems to operationalize the case definition of VTE. This is evidenced in this study by the ERASMUS [...] differences in the case definitions and population cohorts as defined in the protocols used by the two study groups. The use of HARPER (harmonized protocol template to enhance reproducibility) 38 to operationalize code definitions will improve the creation of unambiguous clinical codes in studies integrating data from multiple data sources.
Our study can be utilized to better understand the complexity of RWE and to illustrate the importance of a cautious selection of databases, based on their characteristics, so that the observed heterogeneity represents true differences, to ultimately improve the reliability of RWE. Our findings should be considered in context of similar analyses with other databases and in other settings.

FIGURE 1 Age-gender-stratified IR estimates and 95% CIs* for VTE by database and pooled. *Due to the CIs being too small compared with the size of the square, some of the CIs are not noticeable in the figure.
FIGURE 2 Age-gender-database-specific IR estimates and 95% CIs for VTE in databases with PC-H linkage. *Due to the CIs being too small compared with the size of the square, some of the CIs are not noticeable in the figure.
FIGURE 3 Age-gender-database-specific IR estimates and 95% CIs for VTE in databases using ICD10. *Due to the CIs being too small compared with the size of the square, some of the CIs are not noticeable in the figure.
FIGURE 4 a. Age-gender-database-specific IR estimates and 95% CIs for VTE in databases from ACCESS. b. Age-gender-database-specific IR estimates and 95% CIs for VTE in databases from ERASMUS. *Due to the CIs being too small compared with the size of the square, some of the CIs are not noticeable in the figure.
TABLE 2 Age-gender-stratified I² and τ² estimates from meta-analyses.
FIGURE A1 Age-gender-stratified prediction interval* and 95% CIs for VTE by database and pooled. *Due to the prediction interval being too small compared with the size of the square, some of the prediction intervals are not noticeable in the figure.
TABLE A2 Age-gender-stratified IRs per 100 000 person-years (with 95% CIs) for VTE for the databases provided by ERASMUS.
TABLE A3 Age-gender-stratified IRs per 100 000 person-years (with 95% CIs) for VTE for the databases provided by ACCESS.
TABLE A4 Age-gender IR estimates and CIs for VTE from meta-analyses.
TABLE A5 Age-gender-stratified I² and τ² estimates from meta-analyses for the different sensitivity analyses.

10991557, 2023, 9, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/pds.5631 by Utrecht University, Wiley Online Library on [18/01/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License