Cohort Profile: The National Institute for Health Research Health Informatics Collaborative: Hepatitis B Virus (NIHR HIC HBV) research dataset

Purpose: The National Institute for Health Research (NIHR) Health Informatics Collaborative (HIC) was established to enable re-use of routinely collected clinical data across National Health Service (NHS) Trusts in the United Kingdom to support translational research. Viral hepatitis is one of the first five exemplar themes and hepatitis B virus (HBV) is the current focus of the theme. The NIHR HIC HBV dataset, derived from the central data repository of NIHR HIC viral hepatitis theme, aims to describe and characterise HBV infection in secondary care in the United Kingdom, and provides a resource for translational research.

Participants: The dataset comprises >5000 individuals (99% adults aged ≥18, 1% children aged <18) with chronic HBV (CHB) infection from five NHS Trusts across England, representing clinical data collected between August 1994 and August 2021. 

Findings to date: Data on demographics, laboratory tests, antiviral treatment, elastography scores, imaging/biopsy reports, death information, and potential risk factors for liver disease have been collected. Data are captured by electronic patient record (EPR) systems, and records are updated prospectively as new results are added. This cohort profile describes the dataset in its current form. Among the adults, 55% are male, and the median age at index date (defined as the first recorded positive hepatitis B virus surface antigen (HBsAg) or HBV DNA in EPR systems) was 40 years (interquartile range [IQR]: 32-50). For those individuals with ethnicity reported, 30% were Asian, 24% were Black, 30% were White, and the remaining 16% were mixed or other ethnic groups. Currently, the median follow-up duration of the adult patients in this dataset was 5.0 (IQR: 2.7-7.5) years, with 9.3 (95% CI: 8.2-10.5) deaths per 1,000 person-years. We have already conducted several analyses using subsets of this dataset including an evaluation of distribution and trajectories of HBsAg and HBV viral load in CHB, reviewing the use of antiviral treatment, quantifying the burden of liver disease in the untreated population, and studying the use of laboratory biomarkers to improve stratification and surveillance. 

Future plans: Longitudinal data collection is continuing, with the sample growing in size, more parameters being collected, average follow-up increasing, and more NHS Trusts participating. This dataset offers important opportunities for epidemiological studies and biomedical informatics research, as well as characterising an HBV population for clinical trials through external collaborations with industry.

Chronic infection with hepatitis B virus (HBV) is a global health problem, resulting in an estimated 887 000 deaths worldwide in 2015. 1 Unlike deaths from other infections such as tuberculosis, human immunodeficiency virus (HIV) or malaria, the number of viral hepatitis deaths [the majority of which are attributable to HBV and hepatitis C virus (HCV) infection] has increased since 1990. 2 To advance towards international goals for eliminating viral hepatitis, 3 it is important to accurately estimate the baseline burden, to develop and deliver interventions based on real-world data and to monitor progress towards targets at regional and national levels. 4 As the prevalence of HBV infection is low across the UK overall, there are limited data describing population characteristics and disease burden. 5,6 Chronic HBV (CHB) nevertheless presents a concern in certain populations, as a result of increased prevalence and/or risk factors for the development of long-term liver disease (e.g. chronic coinfection with HIV 7 or other hepatitis viruses, 8,9 diabetes mellitus or metabolic syndrome, 10,11 alcohol abuse, 11 migrants from countries/regions with a high prevalence of HBV 12,13). Chronic infection can lead to pathology which has a major impact on quality of life and life expectancy, including cirrhosis, end-stage liver failure and hepatocellular carcinoma (HCC). Following the successes of direct-acting antiviral drugs for HCV treatment as well as potential cure strategies targeting the reservoir in HIV infection, the clinical and research communities have focused increasing attention on cure strategies for HBV. There is therefore a pressing need for national-level data collection to evaluate population characteristics, identify risk factors, assess treatment deployment, develop predictive models for outcomes and provide a foundation for clinical trials for HBV.
• The dataset is populated with routinely collected clinical data captured from electronic patient record (EPR) systems; follow-up frequency of each individual depends on clinical practice, and the median follow-up duration is 5.1 (IQR: 2.8-8.0) years.
• Data on demographics, laboratory tests, antiviral treatment, elastography scores, imaging/biopsy reports, death information and potential risk factors for liver disease have been collected.
• Over time, the cohort will continue to grow in size, average follow-up duration will increase and more NHS Trusts will participate.
• This dataset offers important opportunities for epidemiological studies and biomedical informatics research, as well as characterizing an HBV population for clinical trials, including external collaborations with industry.
Routinely collected clinical data have been accumulated in electronic patient record (EPR) systems in the UK's unified secondary care services. The National Institute for Health Research (NIHR) Health Informatics Collaborative (HIC) was established in 2014 to enable re-use of these 'big' data for translational research. 14 The NIHR HIC viral hepatitis theme provides a framework for collection of data for HBV, HCV and hepatitis E virus (HEV) in secondary care across National Health Service (NHS) Trusts (distinct regional organizations, each a separate legal entity, responsible for provision and commissioning of health care) in England, UK. In this cohort profile, we specifically introduce the large prospective multicentre cohort for CHB established within this theme. The challenges that had to be overcome in order to share data included establishing a unified governance framework across separate organizations, variations in data entry practice, data definitions and clinical practice between sites, the de-identification required for large amounts of important free-text data and different levels of expertise in clinical informatics at different sites. 14 With funding from the NIHR HIC and local support by NIHR Biomedical Research Centres (BRCs) at participating sites, the dataset continues to expand over time, with additional NHS Trusts joining the NIHR HIC viral hepatitis theme, and existing members refining the quality and quantity of data submitted.
Who is in the cohort?

Locations and setting
The NIHR HIC HBV cohort is a multisite dataset populated with anonymized routinely collected clinical data from individuals (including adults and children) with CHB attending secondary care services across the UK. Current data are from England, but the NIHR HIC provides opportunities to expand the dataset to represent other locations within the UK. The locations of the current 10 participating sites are shown in Figure 1, of which six have submitted data up to February 2022.
At each site, routinely collected clinical data are captured in local electronic systems. However, these systems were originally designed for local clinical services rather than for research purposes, so data entry practice and storage format are not unified. Different sites use different types of EPR system for clinical solutions (e.g. Cerner Millennium, Epic), and even when sites are using the same type of EPR system, the data record style and integration are locally customized. To overcome such challenges, we have developed an informatics infrastructure and established a comprehensive governance framework for collecting data between heterogeneous EPR environments, detailed previously. 14 All the laboratory assays used at each site were undertaken on validated platforms in UK laboratories with clinical accreditation.
Data since the date of EPR system implementation are retrospectively captured, though historical data pre-dating the implementation are also included at some sites. Further data are added prospectively, with updates submitted on request and transferred to the theme central data repository. Thus, the start date (earliest available data) can vary by years between Trusts, due to different time lines of EPR system introduction, but the end date (latest available data) is mostly within the same calendar year.
The central data repository of the NIHR HIC HBV dataset is hosted by the theme lead centre (Oxford University Hospitals NHS Foundation Trust) under a governance framework that includes a data sharing agreement and terms on contractual responsibilities, confidentiality, intellectual property and publication. 14 Data subjects (patients) are informed about the processing and data use via the Trusts' Privacy Notices and public-facing materials, and can opt out from having their data shared with this dataset via the National Data Opt-out. A scientific steering committee, made up of at least one representative from each participating site, meets regularly to review data collection, feed back progress on active projects, consider updates to the database and review all data requests.

Data anonymization and data protection
Each participating Trust anonymizes their data by removing direct patient identifiers locally and assigning each unique patient a study identifier prior to transmitting data to the central data repository. 14 This allows researchers to conduct analysis without the possibility of patient identification. To avoid submitting duplicate records to the lead centre, each Trust is responsible for locally maintaining a link between the patient's local identifier and the anonymous study identifier used in the dataset. 14 The central data repository is based on a secure data access platform and only authorized personnel are permitted to access data. To access the data for research purposes, a formal request with research proposals must be submitted to the theme scientific steering committee for review. When reporting or publishing data, information on small numbers (e.g. fewer than five) of study subjects in special groups is suppressed to avoid the risk of individuals being identified, as per national guidelines on managing data protection risk. 15,16

Eligibility criteria
The inclusion criteria up to May 2021 were (i) individuals for whom data are recorded in the EPR systems; and (ii) individuals with CHB, defined by two positive hepatitis B surface antigen (HBsAg) tests and/or detectable HBV DNA at least 6 months apart (Supplementary Table S1, available as Supplementary data at IJE online). In June 2021, an update was agreed to relax the criteria for inclusion, such that a single positive HBsAg or HBV DNA test was considered sufficient. Although this potentially adds a small number of cases of acute infection, it renders many more cases of chronic infection eligible for data inclusion and thus provides a more complete picture of all HBV infections. The cases with acute infection can be subsequently excluded when analysis is performed on the dataset, if the study requires a stringent case definition of chronic infection.
The exclusion criteria were: (i) patients without records of demographics or (ii) patients without mandated laboratory data (Supplementary Tables S2, S3, available as Supplementary data at IJE online).
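The two inclusion rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by the collaborating sites: the function names are hypothetical, the input is assumed to be the list of dates on which a patient had a positive HBsAg or detectable HBV DNA result, and "6 months" is approximated as 182 days.

```python
from datetime import date

# Assumption: "6 months" is approximated as 182 days for illustration.
SIX_MONTHS_DAYS = 182


def meets_original_criteria(positive_dates):
    """Original rule (up to May 2021): two positive results >= ~6 months apart."""
    if len(positive_dates) < 2:
        return False
    # The earliest and latest positive results give the widest possible gap.
    gap_days = (max(positive_dates) - min(positive_dates)).days
    return gap_days >= SIX_MONTHS_DAYS


def meets_relaxed_criteria(positive_dates):
    """Relaxed rule (from June 2021): a single positive result is sufficient."""
    return len(positive_dates) >= 1
```

A study needing the stringent CHB definition could apply `meets_original_criteria` to the wider population admitted by `meets_relaxed_criteria`.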

Index date, baseline period and numbers of subjects
For each individual, we defined the first episode of positive HBsAg or HBV DNA recorded in the EPR system as their 'index date' (Figure 2A). The baseline period was defined as the period within 365 days of the index date. For some patients, the index date may be later than the time when they were clinically diagnosed with CHB due to geographical migration across regions/countries. We are unable to capture the retrospective data that might be stored in a different Trust for patients even when data from this Trust are added into the dataset, as it is not possible to map patients between Trusts within the dataset due to anonymization.
At the most recent update in February 2022, the NIHR HIC HBV dataset consists of 6080 CHB patients, with index dates between August 1994 and December 2020. Cumulative numbers of cases in the cohort over time are presented in Figure 2B. Individuals aged <18 years (n = 89) are not described in the remaining text but they are included in the dataset when they reach 18 years of age.
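As an illustration of the index date and baseline period defined in this section, the following Python sketch derives both from a list of positive test dates. The function names are hypothetical, and the baseline window is assumed here to be symmetric (365 days either side of the index date); a study using a one-sided window would need to adapt the comparison.

```python
from datetime import date


def index_date(positive_dates):
    """Index date: first recorded positive HBsAg or HBV DNA result."""
    return min(positive_dates)


def in_baseline_period(result_date, idx, window_days=365):
    """True if a result falls within `window_days` of the index date.

    Assumption: the window is symmetric around the index date.
    """
    return abs((result_date - idx).days) <= window_days
```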
How often have they been followed up?
Individuals were followed from the index date until they died or were lost to follow-up (defined as no new records within the 24 months preceding the most recent data update). Patients lost to follow-up, who are subsequently re-enrolled into care in their original Trust, would be included back into the dataset as the same participant; whereas if later they move into another Trust which is contributing to this dataset, they would appear as a new participant once they have hepatitis B serology done. The follow-up frequency of each individual is variable (influenced by clinical requirements and patient preference), and is subject to influence by other factors, including disruptions to clinical services caused by the COVID-19 pandemic since early 2020.
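The follow-up rules above can be expressed as a small classification function. The sketch below is illustrative only: the function names and status labels are hypothetical, "24 months" is approximated as 730 days, and follow-up for patients who have not died is assumed to end at their last recorded data point.

```python
from datetime import date

# Assumption: "24 months" is approximated as 730 days for illustration.
LOST_TO_FOLLOW_UP_DAYS = 730


def follow_up_status(last_record, data_update, death_date=None):
    """Classify a patient as died, lost to follow-up or active."""
    if death_date is not None:
        return "died"
    if (data_update - last_record).days > LOST_TO_FOLLOW_UP_DAYS:
        return "lost to follow-up"
    return "active"


def follow_up_years(idx, last_record, death_date=None):
    """Follow-up duration in years from the index date.

    Assumption: follow-up ends at death if recorded, else at the last record.
    """
    end = death_date if death_date is not None else last_record
    return (end - idx).days / 365.25
```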

Follow-up duration and frequency, and availability of longitudinal data
Currently, median follow-up duration of the adult patients in this dataset has been 5.1 (IQR: 2.8-8.0) years; 5.46% to the mortality rate reported by an Asian study of CHB patients with similar age profile. 17 The demographics, follow-up duration and coinfection characteristics of adults who died (n = 327) or were lost to follow-up (n = 1301), compared with those who are active (n = 4363) in the cohort, are presented in Table 1.
In line with clinical guidelines, ultrasound is routinely used for surveillance; CT and MRI scans are less frequently used (typically only if concerns are raised by other imaging, laboratory or clinical features). Three sites (contributing data for 1882 adults with CHB) have submitted imaging reports to date. During follow-up, 1084/1882 (57.6%) and 783/1882 (41.6%) patients had one or more and two or more ultrasound examination(s), respectively (Figure 3C). For those with two or more ultrasound examinations, 18% were on high-intensity surveillance (≤6 months), 24% on moderate-intensity surveillance (>6-12 months), 26% on low-intensity surveillance (>12-24 months) and 32% on surveillance with intervals >24 months (Figure 3D).
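The surveillance-intensity bands above can be reproduced from a patient's ultrasound dates along the following lines. This is a hedged sketch rather than the analysis code behind Figure 3D: the function names are hypothetical, intensity is assumed to be based on the mean interval between consecutive scans, the first band is assumed to mean intervals of up to 6 months, and a month is approximated as 30.44 days.

```python
from datetime import date

# Assumption: average month length used to convert days to months.
DAYS_PER_MONTH = 30.44


def mean_interval_months(scan_dates):
    """Mean gap in months between consecutive scans (None if fewer than two)."""
    ordered = sorted(scan_dates)
    if len(ordered) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) / DAYS_PER_MONTH


def surveillance_intensity(months):
    """Map a mean scan interval in months to the bands used in the text."""
    if months <= 6:
        return "high"
    if months <= 12:
        return "moderate"
    if months <= 24:
        return "low"
    return ">24 months"
```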

What has been measured?
Data model
A standardized data model, used by all collaborating sites for data mapping, extraction and submission, has been designed and released in the Mauro Data Mapper (used as the NIHR HIC's metadata catalogue; see: https://modelcatalogue.cs.ox.ac.uk/nihr-hic/#/home). An overview of data classes and elements defined in the data model is provided in Supplementary Table S2 and detailed definitions are provided in the Supplementary XML Schema Definition (XSD) file.

Data inference
One principle of the designed data model is to collect source data as they appear in EPR systems and to allow researchers to infer information of interest using raw data collected. All the inferred fields included in the data model are presented in the Supplementary XML schema definition (XSD) file. Here, we used the inferred variables coinfection exposures and liver disease severity. Chronic viral coinfections (HIV, HCV, HDV), and acute infection or past exposure to HEV, were identified from laboratory tests (Supplementary Table S3). Liver fibrosis and cirrhosis were characterized based on Ishak or METAVIR scores from biopsy reports 18 or on liver stiffness measurements from transient elastography (FibroScan) if available; otherwise, we used AST to platelet ratio index (APRI) 19 and FIB-4 20,21 scores (details available as Supplementary data at IJE online). We used pre-defined thresholds for significant/advanced fibrosis and cirrhosis: 1.5 and 2.0 for APRI score, respectively; 19 3.25 and 3.6 for FIB-4 score, respectively. 20,21 Decompensation and HCC information was retrieved from clinical and imaging reports if available.

Figure 3 (caption, truncated). Patients with differing intensity of ultrasound surveillance for those who had two or more ultrasound scans. In panel A, the mean of the time intervals between consecutive tests was calculated for each patient, and the violin plot and boxplot were drawn from these mean values with outliers (observations below the 1st percentile or above the 99th percentile) removed. Boxplots indicate the median and quartiles with whiskers reaching up to 1.5 times the interquartile range. The violin plot outlines illustrate kernel probability density, i.e. the width of the blue shaded area represents the proportion of the data located there. Data beyond 30 months were not shown in the plots. In panel B, the x-axis indicates the number of measurements for patients who had longitudinal data (i.e. two or more measurements) on a test. TI, time interval.
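To illustrate how the APRI and FIB-4 thresholds in this section might be applied to raw laboratory values, the sketch below uses the commonly published formulas for the two scores. It is an assumption-laden example rather than the dataset's inference code: the function names are hypothetical, the AST upper limit of normal defaults to an assumed 40 U/L, scores strictly above a cut-off are assumed to meet it, and inputs are assumed to be in U/L (AST, ALT), 10^9/L (platelets) and years (age).

```python
import math


def apri(ast, platelets, ast_uln=40.0):
    """AST to platelet ratio index: (AST / ULN) x 100 / platelets (10^9/L).

    Assumption: AST upper limit of normal defaults to 40 U/L.
    """
    return (ast / ast_uln) * 100.0 / platelets


def fib4(age, ast, alt, platelets):
    """FIB-4: (age x AST) / (platelets (10^9/L) x sqrt(ALT))."""
    return (age * ast) / (platelets * math.sqrt(alt))


def classify(score, fibrosis_cutoff, cirrhosis_cutoff):
    """Apply a pair of cut-offs (assumption: strictly greater than)."""
    if score > cirrhosis_cutoff:
        return "cirrhosis"
    if score > fibrosis_cutoff:
        return "significant/advanced fibrosis"
    return "below threshold"


# Cut-offs as stated in the text: (fibrosis, cirrhosis).
APRI_CUTOFFS = (1.5, 2.0)
FIB4_CUTOFFS = (3.25, 3.6)
```

For example, `classify(apri(100, 100), *APRI_CUTOFFS)` evaluates an AST of 100 U/L with a platelet count of 100 x 10^9/L against the APRI cut-offs.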

What has it found?
At baseline, for adults (n = 5991), the median age was 39 years (IQR: 32-50) and 55% were male; 4796 had ethnicity recorded, among whom 32% were Asian, 23% were Black, 30% were White and the remaining 15% were mixed or other ethnic groups (Supplementary Table S4, available as Supplementary data at IJE online).
This present cohort (as per updates up to February 2022) comprises CHB patients of diverse ethnicities from six secondary care NHS Trusts across England, mostly representing adults in middle life. The proportion of patients receiving antiviral treatment varies by gender, age and ethnicity, which warrants further investigation. A large majority of patients in this cohort had longitudinal measurements of relevant laboratory parameters, providing promising opportunities for longitudinal analyses.
We have already undertaken studies using this framework, with more in process. During the COVID-19 pandemic, we investigated service disruptions, revealing that reductions in rates of surveillance closely track COVID-19 incidence and periods of population lockdown. 22 Using this dataset, we have reported a bimodal viral load distribution 23 and found evidence of a virological set point in untreated patients. 23,24 In a comparison of tenofovir disoproxil fumarate (TDF)-treated vs untreated patients, we reported variable ethnicity distributions across the two groups and some evidence for liver fibrosis progression in the untreated group, highlighting a need for further evidence for expanded treatment. 24 A study of HBsAg and HBeAg clearance dynamics demonstrated that these markers may contribute to prognostication and patient-stratified care and provide a foundation for advancing insights into mechanisms of disease control. 25 The list of publications is available at [https://hic.nihr.ac.uk] with ongoing/planned studies presented in Table 2.
What are the main strengths and weaknesses?
As the biggest dataset reflecting CHB in secondary care in England, and growing year on year with improving quality, the NIHR HIC HBV dataset is an invaluable resource for answering diverse questions, supporting collaborations, refining approaches for care stratification and treatment and influencing policy for health interventions. As the HBV field moves towards new therapeutics, with a quest for cure strategies, clear information about the characteristics of HBV infection in different settings will be essential to underpin the design and implementation of clinical trials and ultimately to inform equitable access to treatment.
The strengths of NIHR HIC HBV derive from the broad interdisciplinary and cross-site collaboration among clinicians (including hepatology, infectious disease and microbiology specialists), informaticians, project managers and data managers/analysts, representing the NHS Trusts, the NIHR BRCs and the affiliated universities actively participating in the research. Each NHS Trust publishes regularly updated Patient and Public Involvement strategies and engages with patients and the public about the research supported using Trust resources on a regular basis.
The multisite approach integrates CHB data for a broad cross-section of populations from secondary care, and produces comprehensive records on a large scale, with an automatic data validation process. Longitudinal clinical data are particularly important for informing treatment and stratification. Data collection is continuing, with the sample size growing, collection of more parameters being completed, average follow-up increasing and expansion to include more NHS Trusts across the UK. The diversity and statistical power of the dataset will therefore be enhanced for future analyses, providing robust and reliable results despite heterogeneous intrinsic characteristics that exist in patients from different sources.
We recognize limitations which can influence data quality and completeness. Although assays are performed on validated platforms, methods of laboratory tests vary by site or period, e.g. different approaches are used for HBsAg quantification and variable equations are used for eGFR calculation. Therefore, data may need calibration or transformation before analyses, and differences must be flagged before data comparison across sites. As different Trusts prioritize different tests, various levels of missingness exist in liver biochemistry like AST and in serology markers such as HBeAg and anti-HBe. However, these data can influence planning and improving the standard and consistency of clinical care, as well as improving access to new treatments as these become available. Data at some sites are not currently linked to national registries/sources such as the Office for National Statistics death registrations. Additionally, free-text imaging and liver biopsy reports are not systematically available, as anonymization processes that are novel to some sites must be performed before data can be shared. Meanwhile, some data points are difficult to capture from EPR systems, such as treatment records stored in local pharmacy systems, elastography scores recorded in inconsistent and inaccessible formats and self-reported alcohol data not consistently recorded. Although data noise is a common limitation accompanying use of routinely collected clinical data, findings will become more robust as larger study populations are assimilated, electronic systems become better at data capture and the data model is further refined.
Our original inclusion criteria required two episodes of positive HBsAg and/or HBV DNA tests 6 months apart, which might result in some cases with missing data being excluded. The relaxation of the inclusion criteria from June 2021 to one positive HBsAg and/or HBV DNA test will provide a wider population available for investigation, while still allowing researchers to apply their own criteria to narrow down the population to include only the more stringent diagnosis of CHB if required for a particular question. Additionally, many individuals with HBV infection are not diagnosed or not receiving clinical care, and thus not represented in secondary care datasets. These individuals may include a disproportionate number in vulnerable groups, including migrants 26 (and perhaps specifically non-English speakers), people who inject drugs 27 and those in prison or detention centres. 28 Although comparable HBV datasets are more available in other countries, such as China and the USA, 29-31 comprehensive data on HBV in the UK are scarce, except for data reported from certain populations 32-34 or the primary care population. 5,6 We believe this secondary care cohort can start to fill evidence gaps, especially by collating laboratory, imaging and treatment data which are not currently well captured in primary care.
As an exemplar case, this cohort profile not only highlights the potential utility of a CHB cohort, but also demonstrates that routine clinical data are a valuable resource for translational research. Our use of data during the COVID-19 pandemic 35 highlights how the resource can be quickly adapted to address new questions as they arise.
Can I get hold of the data? Where can I find out more?
Any potential collaborations are welcomed, and data may be made available to researchers on request following positive review by the steering committee. Further details are available at [https://hic.nihr.ac.uk]. Queries regarding data access and more information about the dataset can be sent to Prof. Eleanor Barnes [ellie.barnes@ndm.ox.ac.uk] or directed to [orh-tr.nihrhic@nhs.net].

Ethics approval
The research database for the NIHR HIC viral hepatitis theme was approved by South Central-Oxford C Research Ethics Committee (REF Number: 21/SC/0060). All methods for data collection, transmission and management for the NIHR HIC HBV cohort were carried out in accordance with relevant guidelines and regulations. The requirement for written informed consent was waived by South Central-Oxford C Research Ethics Committee, because data have been anonymized before transmission to the theme central data repository.

Data availability
See 'Can I get hold of the data?' above.

Supplementary data
Supplementary data are available at IJE online.