ukbpheno v1.0: An R package for phenotyping health-related outcomes in the UK Biobank

Summary The complexity and volume of data associated with population-based cohorts means that generating health-related outcomes can be challenging. Using one such cohort, the UK Biobank—a major open access resource—we present a protocol to efficiently integrate the main dataset and record-level data files, to harmonize and process the data using an R package named “ukbpheno”. We describe how to use the package to generate binary phenotypes in a standardized and machine-actionable manner. For complete details on the use and execution of this protocol, please refer to Yeung et al. (2022).

workstations (high-performance computing cluster is not required). Figure 1 illustrates the main concept of data integration and transformation using the ukbpheno package.
In this protocol, we show how to download and process the data from the UK Biobank as well as to generate health related phenotypes with the use of ukbpheno. More specifically, we utilize the ukbpheno package to first harmonize various data into single episode format and then generate the phenotypes ( Figure 2). We then demonstrate the use of package for in-depth exploration of the harmonized data with example visualizations and analyses. As an example, we demonstrate the ascertainment of type 2 diabetes, considering the rationales put forward by Eastwood and colleagues (outlined in S1 Appendix) (Eastwood et al., 2016). We use data from self-report sources (both touchscreen and nurse interview), secondary care data and primary care data to identify people with type 2 diabetes while ruling out other types of diabetes (gestational diabetes and type 1 diabetes) as well as use of metformin due to other conditions. In the second part we show how to automate the phenotyping process, namely to generate multiple cardiometabolic traits used in our previous studies (van der Harst and Verweij, 2018;Verweij et al., 2020;Benjamins et al., 2022;Yeung et al., 2022). The diagnosis codes used for these traits are shown in Table 1. With these phenotypes, we perform some example analyses including the generation of a clinical characteristic table and a survival analysis. Health outcome information is collected in numerous ways. Individual-level data are stored in the main dataset which includes self-report outcomes during the nurse interview, data from cancer registry and death registry. Clinical variables and biomarkers which may be useful for defining a health outcome such as glycated hemoglobin are also stored in the main dataset. Record-level data from hospital inpatient data, primary care data, death registry as well as COVID-19 test results are available through the Data Portal. It is important to note the data from the Data Portal is updated more frequently than the main dataset. As a result, the data in the main dataset and the data from the Data Portal may not be updated at identical dates. Users should be aware of the source of the data and consult the Showcase for the corresponding censoring dates.
Note: The UK Biobank data are an open access resource to any bona-fide researchers. The data analyzed in this protocol were obtained through application number 74395. Please refer to the following link to request access to UK Biobank data: https://www.ukbiobank.ac.uk/ enable-your-research/apply-for-access. a. Follow the steps in section 2 of the data access guide (https://biobank.ctsu.ox.ac.uk/ bbdatan/Accessing_UKB_data_v2.3.pdf) to obtain and decrypt the main dataset from the UK Biobank Data Showcase (Showcase) (https://biobank.ctsu.ox.ac.uk/crystal/download. cgi). Files needed in this step include: i. An MD5 Checksum to verify download (sent to principal investigator and authorized delegates). ii. A key file for decryption (sent to principal investigator and authorized delegates). iii. Utilities software ''ukbmd5'' and ''ukbunpack'' available on Showcase. b. Generate a metadata file (.html) of the decrypted main dataset (.enc_ukb) using utility ''ukbconv'' (available on Showcase).
c. Generate a tab-separated file (.tab) of the decrypted main dataset (.enc_ukb) using utility ''ukbconv'' CRITICAL: Do not run the accompanied ukbxxxxx.R script generated at this step. Only the tab separated file (.tab) is required.
2. Obtain the record-level data from Data Portal on Showcase (https://biobank.ndph.ox.ac.uk/ showcase/). a. Click the ''Login'' on the top of the webpage which redirects users to the Access Management System. b. Once logged in, select ''Projects'' on the left panel of the Access Management System and view the relevant project (''View/Update''). c. Inside the project, select the ''Data'' tab to connect the Showcase in ''logged-in'' mode.
i. Once in the Showcase, move to ''Downloads'' tab on the top of the webpage. ii. In the ''Downloads'' page, researchers with access to record-level data will see a tab ''Data Portal'' next to the tab ''Datasets''. iii. Connect to the record repository in the ''Data Portal'' tab. d. Download the complete data tables (.txt) through the '' Table Download'' tab. i. Request access to record-level hospital inpatient data via Field 41259.
ii. Request access to record-level primary care data via Field 42038, 42039 and 42040.
iii. Request access to record-level death register via Field 40023. 3. Obtain the most recent participant withdrawal list for the project.
a. This list is sent by the UK Biobank to the principal investigator and delegated collaborators and it contains the identifiers of participants who should be excluded in the analyses.

CRITICAL:
The time required at this step varies depending on the size of dataset that has been approved for the project (please see expected outcomes for examples).
Optional: We included a shiny app to cross-reference codes between systems using the mapping file provided by UK Biobank. (https://github.com/niekverw/ukbpheno/blob/master/inst/ util/shiny.lookup_codes.R). Download the code map file (Excel workbook) provided by the UK Biobank (https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=592): (1) locate the shiny app script and run the shiny app, and (2) visit the address returned (usually in the form of http:// 127.0.0.1:xxxx) in a web browser and use the app. A screenshot of the shiny app can be found in Figure 4. b. Fill in fields with conditions in the ''TS'' (touchscreen) column.
Note: For example, a composite phenotype ''diabetes mellitus'' may include two phenotypes ''type 1 diabetes'' and ''type 2 diabetes''. Alternatively, for the phenotype ''type 2 diabetes'' we may want to exclude any cases with also a ''type 1 diabetes'' diagnosis.

Load input files in R
Timing: 15 min Input files required by the package include data files from UK Biobank including the main dataset, the metadata file and optionally data tables from Data Portal; the completed definition table and data setting file.
6. Specify data file paths in R.
8. Read data setting file. The pre-curated data setting file specifies the characteristics of each data source which are taken into account in the data harmonization process.
9. Run the ''read_definition_table()'' function to process the definition table. a. The function ''read_definition_table()'' expands parent codes using the code maps and sort out codes relevant for inclusion and exclusion accordingly. i. Code maps include all available codes. b. The function will also cross-check codes entered in the definition with the code maps and warn users of any non-matching codes e.g., i. A specific ICD10 code may not exist in the UK Biobank ICD10 code map as this code is not present in the data. ii. There may be typos.
Optional: Alternatively download the code maps from the UK Biobank Showcase or create them manually by extracting all unique codes from your data using ''get_all_exisiting_codes()'' which generates flat-form (non-hierarchical) code maps. Adjust the data setting file accordingly.
At the end of the harmonization, all clinical events will be returned in the same episode format. In addition, the original data from main dataset and a full list of participants are also returned.
a. The ''allow_missing_fields'' flag specifies whether field(s) required on the definition table but missing in the main dataset is allowed and ignored. If this flag is set to ''FALSE'', the harmonization step will halt in case of any missing field. b. If the participant withdrawal list is provided, records of these individuals will be removed.
Note: The function harmonized_ukb_data() harmonizes all available data (minimally works with only the main dataset and meta-data file). Additionally, the function will check if all fields required on the definition table are present in the main dataset and inform the user if any field is missing.
Note: Time required to harmonize the data is dependent on size of the files. Factors that should be taken into considerations include number of fields approved for the particular project, number of participants included as well as if record-level primary care data are present. a. ''lst.data'' contains data from all sources in same episode format documenting ''identifier'', ''code'',''eventdate'' and an ''event'' column. i. Diagnosis codes without associated actual event date will have date of visit to assessment center (such as self-report diabetes) in the ''eventdate'' column and ''0'' in the ''event'' indicating that the date does not reflect a true event ( Figure 6).
b. ''dfukb'' is a subset of the main dataset and contains only columns necessary for the definition of target phenotypes. c. ''vct.identifiers'' is a vector of identifiers of all participants in the main dataset.
Generate phenotype and explore the data Timing: 2 h 12. To define case/control status of the participants, we need the phenotype (diabetes) definition, the harmonized data tables, the data settings and the individuals to be included (either specified by a vector of participant identifiers or a data-frame containing identifier in the first column and reference dates in the second column).
13. Use ''get_cases_controls()'' function to obtain the case/control status. The function returns a list of three data.  (Table 4). Included case/control is marked with 2/1 respectively while excluded case/control will be marked with -2/-1 in this table. b. ''all_event_dt.Include_in_cases'' is data.table object including all event episodes supporting the diagnosis for the cases included (Table 5). c. ''all_event_dt.Include_in_cases.summary'' is a data.table object with the same format with ''df.casecontrol'' but includes only cases (both included and excluded case). 14. Generate timeline plot to check the relative contribution by various data sources over time (Figure 8). Events with known event date will be included.
15. Use ''make_upsetplot()'' to examine the overlaps between the data sources at baseline to gain insight on their relationships (Figure 9).
16. Generate summary descriptions on the events with ''get_stats_for_events''. For example, generation of a frequency plot of codes among all events from secondary care may help verify or refine the definition (Figure 10).
17. Explore secondary care code count by individual ( Figure 11). 18. Generate a timeline of the codes contributing to diagnosis for a particular individual (please replace the identifier if copied from the cell below) (Figure 12).
To make the definition of the type 2 diabetes more precise, we may screen and exclude individuals with evidence of other types of diabetes as well as the use of metformin not due to diabetes.
19. First identify participants with specific diabetes codes (gestational diabetes, type 1 and type 2 diabetes) as well as general diabetes code.
20. Identify use of different anti-diabetic medications. Find individuals on metformin likely due to diseases other than diabetes by cross checking with the list of individuals with diabetes diagnoses.

Generate phenotypes in batch
Timing: 15-30 min This session demonstrates how to generate multiple phenotypes and make a clinical characteristics table with these phenotypes, stratified by type 2 diabetes status. An example definition table with the selected cardiometabolic diseases, family history of these diseases and diabetes medication usage is provided in the package. We additionally extract demographic information namely age and sex as well as biomarkers BMI, blood glucose, glycated hemoglobin and self-report insulin use within one year of diabetes diagnosis from the main dataset.
22. Read and process the definition file.
23. Extract only the required fields from the main dataset using read_ukb_tabdata(). a. The metadata provides information such as data type of these fields. b. Extract age at assessment center visit (Field 21003), sex (Field 31), body mass index (Field 21001), glycated hemoglobin level (Field 30750), glucose level (Field 30740), self-report insulin use within the first year of diabetes diagnosis (Field 2986), UK Biobank assessment center visited (Field 54) and Date of attending assessment center (Field 53).
28. To match type 2 diabetes case to control by age, sex and body mass index. We extract those variables and remove individuals with missing values in either the target phenotype or any covariates.
29. Format the coding of the phenotype and name the rows by participant identifier in preparation for the matchit() function. Run the matchit() function to match 2 controls to each case. Examine the result.

EXPECTED OUTCOMES
Harmonization of UK Biobank data After the harmonization step, the user should have in the R workspace the data from both Data Portal (record level data) as well as a subset of the main dataset relevant for the target phenotypes (as specified in the definition table). Users can perform further analyses in R as they see fit (see reporting of clinical characteristics, survival analysis and case-control matching sections).
Using our machine with 64 GB RAM (DDR4 2667 MHz) and 10-core processor (Intel Skylake) under an Ubuntu 16 system, it took 12 min to load and process the main dataset and all record level files. Of which the most time-consuming step was loading the main dataset (six minutes with file size 35.5 GB) followed by the primary care data (three and half minutes with total file size 8.5 GB). The harmonization step was additionally tested on an Ubuntu 16 machine with 16 GB RAM (DDR4 2667 MHz) and 6-core processor (AMD Ryzen 5). It took approximately 45 minutes to complete on this machine.

Ascertainment of diabetes status
The ''get_cases_controls()'' function from ukbpheno determines the case/control status for a target phenotype following the inclusion/exclusion criteria outlined in the definition table. Users can extract prevalent as well as incident cases with a specific reference time point. Results presented in this protocol are produced using data downloaded in June 2021, with primary care data for around 45% of the UK Biobank cohort available and data censoring dates shown in Table 7 (UK Biobank, 2022b).
Following the example definition of ''DmRxT2'', users should expect a prevalence of 4.7% at the baseline visit, an estimate close to (Eastwood et al., 2016). Additionally, it can be observed in the disease timeline ( Figure 8) as well as the UpSet plot (Figure 9) that the majority of prevalent cases during baseline visit periods were attributed to self-report sources rather than secondary care (hesin) and hence use of secondary care alone for ascertainment would miss out a lot of cases, as reported by Eastwood and colleagues.
Examining the diabetes diagnoses captured in secondary care system, user should be able to obtain similar frequency plots as shown in Figure 10. Counts of diagnosis code per individual have a highly skewed distribution with median of 3 records from the secondary care ( Figure 11).
With the plot_individual_timeline() function, user should expect visualization of diagnosis codes of a selected participant over time. Figure 12 shows such a timeline of a hypothetical participant.
Lastly in the cross-examinations with other types of diabetes and medication, users should expect similar numbers as reported in the prevalence algorithms by Eastwood and colleagues (Eastwood et al., 2016). Users could verify these conditions by inspecting the records in the harmonized data or further investigate with other information (such as the biomarker / genetics). For instance, users should be able to observe that around 300 individuals with young onset diabetes received an E119 diagnosis after their baseline visits.

Phenotype generation in batch
In this part users should be able to generate the cardiometabolic phenotypes namely, atrial fibrillation or flutter, coronary artery disease, hypertrophic cardiomyopathy, heart failure, type 2 diabetes, hypertension and hyperlipidemia. Users should also obtain the related phenotypes including family history of diabetes, family history of heart disease, family history of hypertension and the use of diabetes medications. Using our machine with 64 GB RAM (DDR4 2667 MHz) and 10-core processor (Skylake), it took three minutes to process all 12 phenotypes.

Report clinical characteristics at baseline and survival analysis
In this part users should obtain a clinical characteristics table stratified by type 2 diabetes status ( Table 8). The diagnosis was ascertained with only type 2 diabetes specific diagnosis codes; potential cases with type 1 diabetes specific diagnosis codes (at any time point) were excluded; potential controls with any diabetes diagnosis codes or medications were also excluded. This resulted in four stratified groups in the table -''Case inclusion'', ''Case exclusion'', ''Control inclusion'' and ''Control exclusion''. It can be observed that the group of ''Case exclusion'' had the highest blood glucose and glycated hemoglobin. The proportion of ''Case exclusion'' (with specific type 1 diabetes diagnosis codes) on insulin was also the highest with the highest percentage of starting insulin within the first year of diagnosis. Higher blood glucose and glycated hemoglobin as well as proportion of diabetes medication can be observed in the group of ''Control exclusion'' (with non-specific diabetes diagnosis codes -evidence of diabetes but no type 1 / type 2 specific records at baseline) than the ''Control inclusion'' group. The non-zero percentage of use of diabetes medication (oral diabetes medication / insulin) at baseline observed in the ''Control inclusion'' group was likely individuals with pre-diabetes; while the negative values of ''Years since type 2 diabetes diagnosis'' observed were due to the individuals who received diabetes diagnosis sometime after their baseline visits (incident cases).
In the survival analysis on new-onset heart failure (outcome of interest) between participants with or without type 2 diabetes at baseline, user would obtain a Kaplan-Meier plot similar to Figure 13. Participants with type 2 diabetes at baseline were more likely to have heart failure. The between-group difference was assessed by log-rank test with p<0.0001 in the data analyzed.

Case-control matching with MatchIt
Following up with the type 2 diabetes phenotype we created, it can be seen that case-control ratio is unbalanced. It can also be seen that the controls were younger, more likely to be female and had Figure 6. Screenshot of the harmonized records  (Table 8). We used the MatchIt package to match two controls to each case by these three covariates. After matching users should see these variables are balanced in cases and controls ( Figure 14).

LIMITATIONS
Phenotyping of a heterogenous resource such as UK Biobank can be challenging due to multiple data origins compounded with varying coverage both in terms of time and individuals.
In the current protocol we demonstrate how to use the ukbpheno package to define health-related outcomes and we presented possible analyses with regard to the phenotyping process.
With the ukbpheno package, the case/control status is determined by presence of diagnosis records. In the case of contradictory records, an individual might be classified with theoretically mutually exclusive diagnoses. Ad-hoc analyses would be needed to refine the phenotypes of these individuals should such precision be required.  It is also of note that, while the users can assign different weights to different data sources by adjusting minimum instance filter in the setting, it is not possible to directly weight on individual codes which could be important for a certain phenotype. The limitation could be circumvented by an implementation on the definition level (separating the codes in various definitions). It is important to recognize that there is no gold-standard for many of the health-related outcomes. Users should decide on their definitions based on the study questions at hand.

TROUBLESHOOTING
Problem 1 Unable to download/decrypt the main dataset (step1).

Potential solution
Ensure the most updated project key which is valid for one year after initiation.

Problem 2
Unable to download data from Data Portal (step 2).

Potential solution
Request the relevant fields listed in step 2 via the UK Biobank Access Management System (UK Biobank, 2022a). Refers to the classification system of the code Figure 8. Disease timeline of type 2 diabetes by different data sources ll OPEN ACCESS

Potential solution
Installation of the dependent package ''xx'' was not successful. Find the error message on console, resolve the error and restart the installation. Possible reason of error includes ''dependency ''xx'' is not available (for R version x.x.x) '', which may be solved by specifying the version of package compatible for the corresponding R version.

Problem 4
Certain codes are missing after reading in the definition table (step 9).

Potential solution
The package drops codes that are not available in data. If a certain code is dropped not due to this reason, check for special characters (accented characters are not recognized) and make sure codes are comma separated. Protocol Problem 5 Data processing step fails (step 10).

Potential solution
Run ''get_all_varnames()'' with the processed definition table and meta-data file to check if but some required fields are missing in the main dataset. Either remove the missing fields from definition table or allow missing fields (set ''allow_missing_fields'' flag to TRUE) in the ''harmonized_ukb_data()''.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources should be directed to the lead contact, Ming Wai Yeung (m.w.yeung@umcg.nl).

Materials availability
This study did not generate new unique reagents.
Data and code availability Data used and generated are listed in the key resources table. The UK Biobank data are open to bonafide researchers and accessible with an approved research project. Application guideline for the UK Biobank data can be found at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access.