UKB.COVID19: an R package for UK Biobank COVID-19 data processing and analysis

COVID-19 caused by SARS-CoV-2 has resulted in a global pandemic with a rapidly developing global health and economic crisis. Variations in the disease have been observed and have been associated with the genomic sequence of either the human host or the pathogen. Worldwide scientists scrambled initially to recruit patient cohorts to try and identify risk factors. A resource that presented itself early on was the UK Biobank (UKBB), which is investigating the respective contributions of genetic predisposition and environmental exposure to the development of disease. To enable COVID-19 studies, UKBB is now receiving COVID-19 test data for their participants every two weeks. In addition, UKBB is delivering more frequent updates of death and hospital inpatient data (including critical care admissions) on the UKBB Data Portal. This frequently changing dataset requires a tool that can rapidly process and analyse up-to-date data. We developed an R package specifically for the UKBB COVID-19 data, which summarises COVID-19 test results, performs association tests between COVID-19 susceptibility/severity and potential risk factors such as age, sex, blood type, comorbidities and generates input files for genome-wide association studies (GWAS). By applying the R package to data released in April 2021, we found that age, body mass index, socioeconomic status and smoking are positively associated with COVID-19 susceptibility, severity, and mortality. Males are at a higher risk of COVID-19 infection than females. People staying in aged care homes have a higher chance of being exposed to SARS-CoV-2. By performing GWAS, we replicated the 3p21.31 genetic finding for COVID-19 susceptibility and severity. The ability to iteratively perform such analyses is highly relevant since the UKBB data is updated frequently. As a caveat, users must arrange their own access to the UKBB data to use the R package.


Introduction
The ongoing global pandemic of coronavirus disease 2019 (COVID-19), caused by a novel coronavirus (severe acute respiratory syndrome coronavirus 2, SARS-CoV-2), has resulted in a rapidly developing global health and economic crisis. Most people with COVID-19 never develop symptoms or suffer only mild symptoms. However, about 5% of cases are critical (defined as respiratory failure, septic shock, and/or multiorgan dysfunction or failure) (Wu and McGoogan 2020), possibly leading to severe lung damage and death. These and other clinical observations led to the hypothesis that genetic factors in the host, the pathogen, or both could be responsible, at least in part, for this variation. Scientists worldwide initially scrambled to recruit patient cohorts to identify genetic risk factors.
UK Biobank (UKBB) (RRID: SCR_012815) is a long-term biobank study that recruited 500,000 volunteers aged 40-69 years from across the UK in 2006-2010. UKBB's large-scale database is a global research resource accessible to approved researchers who are undertaking health-related research. All participants provided detailed information about their lifestyle, underwent physical measurements, and had blood, urine and saliva samples collected. The samples of all participants have undergone SNP array typing and are now also undergoing whole-exome and whole-genome sequencing. UKBB has become a major contributor to the advancement of modern medicine and treatment, enabling a better understanding of a wide range of serious and life-threatening diseases.
Researchers can apply for access to the data, and hundreds of researchers worldwide are using the UKBB data to carry out research on many different diseases. The UKBB has facilitated first-time analyses on traits such as brain imaging phenotypes (Elliott et al., 2018).
The UK has been badly affected by COVID-19. As of 20 May 2021, there had been over 127,000 reported deaths in the UK, with an estimated 4.5 million infections. Worldwide, there had been more than 3 million reported deaths due to COVID-19, with continually increasing rates of infection in India and South America. The UKBB was an early, available population genetic resource that could be harnessed to better understand COVID-19 risk factors, and as it evolves it continues to serve as a powerful cohort for such studies.
UKBB has taken swift strides to help tackle the global pandemic by undertaking four major initiatives: a serology study, a COVID-19 repeat imaging study, a coronavirus self-test antibody study and health data linkage. UKBB has been receiving COVID-19 test data for UKBB participants in England and has linked the test result data with health data. The test results data are being updated every two weeks. In addition, UKBB is making more frequent updates of death and hospital inpatient data (including critical care admissions) on the Data Portal. This rapidly changing dataset requires a tool that can process the up-to-date data as frequently as the data are updated, in a standardised, reproducible, and largely automated manner, to permit rapid re-analysis of the data and to enable other researchers to use such a tool as a basis for their analyses.
Therefore, we developed an R package, UKB.COVID19 (built in R version 4.0.5), which summarises COVID-19 test results, combines test results data with hospitalisation data and death register data, performs association tests between COVID-19 susceptibility/severity and potential risk factors (age, sex, blood type, socioeconomic status, comorbidities etc.) and generates input files for genome-wide association studies (GWAS). Ethics approval was granted through WEHI project 17/09LR by the WEHI's Human Research Ethics Committee (HREC).

Implementation
UKB.COVID19 was built in R (version 4.0.5) and currently depends on the following R packages: questionr, data.table, tidyverse, magrittr, here, and dplyr. COVID-19 related data files from UKBB can be directly imported in the R package without any pre-processing.

REVISED Amendments from Version 2
The newly revised article contains additional information as suggested by the reviewer, which includes 1) a discussion of long COVID and relevant functions in the UKB.COVID19 R package; 2) a "Statistical analysis" section in the methods section; 3) a vignette in the UKB.COVID19 R package.
Any further responses from the reviewers can be found at the end of the article.

Operation
UKB.COVID19 is distributed as part of the CRAN R package repository and is compatible with Mac OS X, Windows, and major Linux operating systems. UKB.COVID19 is maintained at GitHub (https://github.com/bahlolab/UKB.COVID19). The archived source code can be found at http://doi.org/10.5281/zenodo.5174381 (Wang et al., 2021). All analyses are performed using R (version 4.0.5). All functions and descriptions are listed in Table 1.

COVID-19 test results data
COVID-19 test results data are being provided to the UKBB by Public Health England (PHE), Public Health Scotland (PHS) and SAIL Databank for English, Scottish and Welsh data respectively. The data have been updated approximately once every two weeks since 16 March 2020. Most samples tested for the COVID-19 disease-causing virus, SARS-CoV-2, are from combined nose/throat swabs. In intensive care settings, lower respiratory tract samples may also have been taken and analysed. The data consist of the encoded participant ID, the date the specimen was taken, the specimen type (e.g. nasal, nose and throat, sputum), the laboratory that processed the sample, whether the sample was reported as positive or negative for SARS-CoV-2, the requesting organisation description, as well as other variables. The test result data used in the analyses of this report are up to 6 April 2021.

Death register data
The death register data include the date of death and the primary and contributory causes of death, coded using the ICD-10 system. The death register data have been updated every one or two months. The death register data used in the analyses of this report are up to 23 March 2021.

Hospital inpatient data
The hospital inpatient data consist of seven tables:
1) HESIN: the overall master table, providing information on admissions and discharges, the type of admission and other information related to the inpatient record as a whole.
2) HESIN_DIAG: diagnosis codes (ICD-9 or ICD-10) relating to inpatient records, including primary diagnoses and secondary diagnoses. The primary diagnosis is the main condition treated or investigated during the relevant episode. A secondary diagnosis is a clinically relevant contributory factor or issue that impacts the primary diagnosis (including chronic conditions).
3) HESIN_OPER: operations and procedures codes (OPCS-3 or OPCS-4) relating to inpatient episodes.
4) HESIN_CRITICAL: a child table of HESIN containing further information about those hospital episodes that required treatment in a critical care unit.
5) HESIN_PSYCH: a sibling table to HESIN containing fields relating to administrative aspects of psychiatric admissions.
6) HESIN_MATERNITY: a sibling table to HESIN containing fields relating specifically to maternity admissions.
7) HESIN_DELIVERY: information regarding a child born as a result of a HESIN_MATERNITY record, where applicable.
In this study, we use the HESIN, HESIN_DIAG, HESIN_OPER, and HESIN_CRITICAL tables. The hospital inpatient data used in the analyses of this report are up to 5 February 2021.

Phenotype definition
The makePhenotypes function defines multiple COVID-19 traits, related to susceptibility, severity and mortality, which may be used for association testing and GWAS (Table 2).
For susceptibility analysis, we generated a proxy variable that includes all participants who have been tested for COVID-19 and defines those who received at least one positive result as cases.
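The proxy definition above can be illustrated with a minimal base R sketch on simulated data. It does not call makePhenotypes; the data frame layout and the column names (eid, result, pos_vs_neg, pos_vs_all) are assumptions made for illustration only.

```r
# Simulated test results, one row per test (hypothetical column names)
tests <- data.frame(
  eid    = c(1, 1, 2, 3, 3, 3),
  result = c(0, 1, 0, 0, 0, 1)  # 1 = positive, 0 = negative
)
all_ids <- 1:5  # participants 4 and 5 were never tested

# A tested participant is a case if at least one test was positive
any_pos <- aggregate(result ~ eid, data = tests, FUN = max)

pheno <- data.frame(eid = all_ids)
# Positive versus negative, defined among tested participants only (NA if untested)
pheno$pos_vs_neg <- any_pos$result[match(pheno$eid, any_pos$eid)]
# Positive versus all other participants (untested participants count as controls)
pheno$pos_vs_all <- ifelse(is.na(pheno$pos_vs_neg), 0, pheno$pos_vs_neg)

print(pheno)
```

The two columns correspond to the two ways of choosing controls: tested-negative participants only, or everyone without a positive result.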

Non-genetic risk factors
The risk_factor function generates formatted variables for several non-genetic risk factors from the linked health data provided by UKBB. These variables are all established risk factors for SARS-CoV-2 exposure and/or COVID-19 severity (Pijls et al., 2021; Wolff et al., 2021; Booth et al., 2021). The currently selected risk factors are listed in Table 3. The multi-category variables are converted into multiple dummy variables. For the blood type factor, three dummy variables encoding the blood types A, AB, and O are added to the data for comparison with blood type B (baseline). For the ethnic background factor, Black, Asian, Mixed, and other ethnic backgrounds (BAME) are added to the data to permit comparison with white Europeans (baseline).
Simple associations between COVID-19 phenotypes and these common risk factors may be examined using the log_cov function, which fits a logistic regression model and formats the results for quick interpretation.
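As a hedged illustration of the dummy coding described above, the following base R sketch encodes a simulated blood type variable with type B as the baseline. It is not the risk_factor implementation; the resulting column names (bloodA, bloodAB, bloodO) come from model.matrix and are not necessarily the names used by the package.

```r
# Simulated blood types for six participants
blood <- factor(c("A", "B", "O", "AB", "O", "B"))

# Make blood type B the reference (baseline) level
blood <- relevel(blood, ref = "B")

# Build the design matrix and drop the intercept, leaving three 0/1 dummy
# variables (A, AB, O); participants with type B are all zeros
dummies <- model.matrix(~ blood)[, -1, drop = FALSE]

print(colnames(dummies))
print(dummies)
```

Choosing the baseline only changes the interpretation of the coefficients: each dummy variable then estimates the effect of that blood type relative to type B.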
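The kind of output such a test produces can be sketched in base R with simulated data. This is an assumption-laden illustration rather than the log_cov code: the variable names are hypothetical, and the confidence intervals here are Wald intervals, which may differ from the interval method used by the package.

```r
set.seed(1)
n <- 500

# Simulated covariates and a binary phenotype (all names hypothetical)
dat <- data.frame(
  age = rnorm(n, mean = 60, sd = 8),
  sex = rbinom(n, 1, 0.5)  # 1 = male, 0 = female
)
dat$case <- rbinom(n, 1, plogis(-3 + 0.04 * dat$age + 0.5 * dat$sex))

# Multivariable logistic regression
fit <- glm(case ~ age + sex, data = dat, family = binomial)
est <- coef(summary(fit))

# Odds ratios with 95% Wald confidence intervals and p-values
res <- data.frame(
  OR    = exp(est[, "Estimate"]),
  lower = exp(est[, "Estimate"] - 1.96 * est[, "Std. Error"]),
  upper = exp(est[, "Estimate"] + 1.96 * est[, "Std. Error"]),
  p     = est[, "Pr(>|z|)"]
)
print(res)
```

Exponentiating a coefficient gives the odds ratio per unit increase of the covariate, holding the other covariates fixed.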

Comorbidities
The comorbidity_summary function summarises the disease history records of each individual from the hospital inpatient diagnosis data. To meet different research aims, the function allows restriction to a time period and filtering of annotations to primary diagnoses only or all diagnoses (using the "Date.start", "Date.end" and "primary" arguments, respectively).
For illustration, if we are interested in conditions that co-occur with COVID-19, we can set the episode start date to 16 March 2020 ("Date.start = 16/03/2020"), when the first COVID-19 test result was recorded, and choose to use all diagnoses ("primary = FALSE"). If we are interested in individuals with reported comorbidities that put them at higher risk from SARS-CoV-2, we can restrict to episodes before the COVID-19 outbreak in the UK, for example "Date.end = 01/01/2020", and focus only on the primary diagnoses ("primary = TRUE"). Comorbidity categories are generated using the block categories of the ICD10 codes, which are shown in the second column of Table 4. We include ICD10 chapters 1-14 and 17 and exclude several chapters, such as pregnancy, childbirth, and consequences of external causes. For instance, the first category is "A00-A09", representing intestinal infectious diseases. Within the period restricted by the start and end dates, cases are defined as any participants who were diagnosed with any subclass under the block A00-A09 in the hospital inpatient diagnosis data. In this way, 164 binary variables are generated, each representing a comorbidity category. The R function generates a text file including all comorbidity categories, which can be used in the comorbidity association tests.
The comorbidity_asso function performs association tests between each comorbidity category and the selected phenotype using logistic regression models, adjusting for covariates that can be set using the argument "cov.name". By default, the covariates include sex, age, and BMI. Different ethnic backgrounds can be chosen for the test by setting the argument "population". By default, all populations are included. The function outputs a table comprising odds ratios (ORs), confidence intervals (CIs) of the ORs, and p-values for all the comorbidity categories.
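The block-based case definition can be sketched in base R as follows. This is a simplified illustration on simulated records, not the comorbidity_summary implementation; the column names (eid, icd10, epistart, level) and the helper in_block are hypothetical.

```r
# Simulated inpatient diagnosis records (hypothetical column names)
diag <- data.frame(
  eid      = c(1, 1, 2, 3),
  icd10    = c("A047", "J189", "A090", "I10"),
  epistart = as.Date(c("2020-04-01", "2019-06-15", "2020-05-20", "2020-07-01")),
  level    = c(1, 2, 1, 1)  # 1 = primary diagnosis, 2 = secondary diagnosis
)
all_ids <- 1:4

# TRUE if the three-character ICD-10 category lies within the block from..to
in_block <- function(code, from, to) {
  key <- substr(code, 1, 3)
  key >= from & key <= to
}

# Restrict to a period and, optionally, to primary diagnoses only
date_start   <- as.Date("2020-03-16")
date_end     <- as.Date("2020-12-31")
primary_only <- FALSE

keep <- diag$epistart >= date_start & diag$epistart <= date_end
if (primary_only) keep <- keep & diag$level == 1

# Binary variable for block A00-A09 (intestinal infectious diseases)
cases   <- unique(diag$eid[keep & in_block(diag$icd10, "A00", "A09")])
A00_A09 <- as.integer(all_ids %in% cases)
print(data.frame(eid = all_ids, A00_A09 = A00_A09))
```

Repeating the last step for every block yields one binary variable per comorbidity category.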

Preparation of files for genetic analyses
The UKB.COVID19 package provides several functions to facilitate GWAS or other genetic analyses using the UKBB data. We provide two functions, sampleQC and variantQC, to allow easy cleaning of the genetic data using quality control (QC) metrics supplied by UKBB (Bycroft et al., 2018). A third function, makeGWASFiles, outputs phenotype files, which may be used as input for the GWAS software packages PLINK (Purcell et al., 2007) and SAIGE (Zhou et al., 2018).
The sampleQC function outputs a csv file summarising sample-level QC metrics and produces lists of IDs for inclusion and/or exclusion in downstream analyses. The function identifies individuals to be excluded from genetic analyses based on: 1) exclusion by UKBB before imputation due to high heterozygosity or missingness (>5%), 2) mismatches between genetically predicted and recorded sex, 3) an apparent excess number of relatives in the UKBB cohort (≥ 10 relatives), 4) putative sex chromosome aneuploidy, 5) withdrawn consent. The user has the option of further restricting to individuals of "White British" ancestry (determined using genetic principal components) by using the ancestry argument. Finally, the user can specify whether the inclusion/exclusion sample lists should be formatted for PLINK or SAIGE.
The variantQC function identifies variants to be included in downstream analyses, based on minor allele frequency (MAF) and imputation quality (INFO score), with thresholds specified by the user (defaults are MAF ≥ 0.001 and INFO ≥ 0.5). The function outputs lists of variants passing these thresholds in two formats, given the two types of SNP IDs available in the UKBB imputed genetic data release: 1) snpIncludeSNPIDs_minMaf0.001_minInfo0.5.txt contains the unique SNP identifiers; 2) snpIncludeRSIDs_minMaf0.001_minInfo0.5.txt contains the rsid or the reference panel marker ID (note these IDs are not guaranteed to be unique). The function also outputs a file containing the IDs of the subset of SNPs used by UKBB for calculating ancestry principal components (Bycroft et al., 2018). This subset of SNPs is suitable for analyses where a pruned set of independent SNPs is preferred, for example for calculation of a genetic relatedness matrix (GRM).
The makeGWASFiles function generates a phenotype file suitable for use in association analyses by either SAIGE or PLINK (Purcell et al., 2007), with the file format specified by the user. The function utilises the phenotypes data frame generated by the makePhenotypes function, and the user can select specific phenotypes. The output phenotype file also contains the first 20 ancestry principal components and the genotyping array, as these are likely to be required as covariates in any genetic analyses. The user can also specify additional covariates (e.g. those generated by the risk_factor function) to be written to the phenotype file. Finally, the user can choose to output phenotypes only for the individuals passing all QC (using the output file from the sampleQC function) or for all individuals.
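The exclusion logic listed above can be sketched in base R on a simulated QC table. This is not the sampleQC code: the column names are hypothetical, and setting both FID and IID to the participant ID in the PLINK-style list is an assumption.

```r
# Simulated sample-level QC flags (hypothetical column names)
qc <- data.frame(
  eid              = 1:6,
  het_miss_outlier = c(FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE),
  sex_mismatch     = c(FALSE, FALSE, FALSE, TRUE,  FALSE, FALSE),
  n_relatives      = c(0, 1, 12, 0, 2, 0),
  aneuploidy       = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  withdrawn        = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

# A sample is excluded if it fails any one of the criteria
qc$exclude <- with(qc, het_miss_outlier | sex_mismatch |
                       n_relatives >= 10 | aneuploidy | withdrawn)

include_ids <- qc$eid[!qc$exclude]

# PLINK keep/remove lists use two columns, family ID and individual ID;
# here both are set to the participant ID (an assumption)
plink_keep <- data.frame(FID = include_ids, IID = include_ids)
print(plink_keep)
```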
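A minimal base R sketch of this kind of threshold filter is shown below, using a simulated variant table. It does not reproduce variantQC itself; the column names and the way the file name is assembled from the thresholds are assumptions chosen to mirror the file names quoted above.

```r
# Simulated variant-level summary (hypothetical column names)
snps <- data.frame(
  snpid = c("1:1000_A_G", "1:2000_C_T", "1:3000_G_A", "1:4000_T_C"),
  rsid  = c("rs1", "rs2", "rs3", "rs4"),
  maf   = c(0.2, 0.0005, 0.05, 0.3),
  info  = c(0.95, 0.99, 0.4, 0.6)
)

min_maf  <- 0.001
min_info <- 0.5

# Keep variants that pass both thresholds
pass <- snps$maf >= min_maf & snps$info >= min_info

# Write the passing rsids to a file whose name records the thresholds
out <- file.path(
  tempdir(),
  sprintf("snpIncludeRSIDs_minMaf%s_minInfo%s.txt", min_maf, min_info)
)
writeLines(snps$rsid[pass], out)
print(readLines(out))
```

Encoding the thresholds in the file name makes it harder to mix up variant lists produced under different QC settings.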

GWAS
We performed QC for the genotype data from UKBB using the sampleQC function, with the ancestry = "WhiteBritish" option, and the variantQC function, with thresholds MAF = 0.01 and INFO = 0.8. Phenotype files for SAIGE were generated using the makeGWASFiles function, containing all variables generated by the risk_factor function. Using the output files from the sampleQC and variantQC functions, we filtered the directly genotyped data using PLINK (Purcell et al., 2007), and the imputed data using QCTool version 2. We then performed GWAS of all COVID-19 phenotypes using SAIGE (Zhou et al., 2018). First, the null model was fitted for each phenotype with 20 ancestry principal components (PCs), genotyping array, and associated non-genetic risk factors as covariates, and we used the pruned subset of SNPs to construct the GRM. Subsequently, genome-wide association testing was undertaken using the filtered imputed data.

Statistical analysis
To assess the associations between non-genetic risk factors and COVID-19 phenotypes (including susceptibility, severity, and mortality), we employed multivariable logistic regression models using the 'glm' function from the R package stats. To identify genetic variants associated with COVID-19 phenotypes, we performed GWASs using the SAIGE software. Principal component analysis (PCA) was performed to account for population stratification, and the first 20 principal components (PCs) were included as covariates in the analysis. Additionally, we adjusted for age, sex, BMI, SES, smoking status, residence in aged care facilities and genotyping array in the regression models. The association between each SNP and the phenotypes was tested using a logistic regression model, as follows: logit(COVID-19 phenotype) ~ SNP + age + sex + BMI + SES + smoking status + aged care status + genotyping array + PC1-20. To account for multiple testing, the Bonferroni correction was applied. Loci reaching the genome-wide significance threshold (p < 5 × 10⁻⁸) were considered significant. Manhattan plots and quantile-quantile (QQ) plots were generated to visualize the results using the R package ggplot2. All analyses were carried out using R (version 4.0.5).
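The model formula and the significance threshold can be written out in a few lines of base R. This is a sketch under stated assumptions: the variable names are hypothetical, and reading 5 × 10⁻⁸ as a Bonferroni correction for roughly one million independent tests at a 0.05 level is the conventional interpretation, not a detail stated above.

```r
# Covariates in the per-SNP model (hypothetical variable names)
pcs    <- paste0("PC", 1:20)
covars <- c("age", "sex", "BMI", "SES", "smoke", "inAgedCare", "array")

# pheno ~ SNP + age + ... + array + PC1 + ... + PC20
f <- reformulate(c("SNP", covars, pcs), response = "pheno")
print(f)

# Conventional genome-wide threshold: 0.05 divided by about one million
# independent common-variant tests (assumed interpretation)
alpha     <- 0.05
n_tests   <- 1e6
threshold <- alpha / n_tests
print(threshold)
```

A formula built this way can be passed to glm with family = binomial for a single SNP; genome-wide testing in practice relies on dedicated software such as SAIGE.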

Results
We applied the R package UKB.COVID19 to the data released in April 2021.
We applied GWAS only to the white British participants in the UKBB. Therefore, we repeated the non-genetic risk factor association tests for self-reported "white" participants only. The results show that age, sex, BMI, SES, smoking, and residence in an aged care home are associated with COVID-19 susceptibility in white British participants. Together with the effects of the two genotyping arrays and the first 20 PCs, these risk factors are used to adjust susceptibility in the GWAS. The genome-wide significant COVID-19 susceptibility locus identified in our GWAS is 3p21.31 (Figure 1 and Table 6). The most statistically significant SNP is rs2771616 within the glycine transporter gene SLC6A20 (3p21.31, p-value = 3.36 × 10⁻⁹), followed by SNPs rs73062389 (3p21.31; SLC6A20; p-value = 5.16 × 10⁻⁹) and rs73062394 (3p21.31; SLC6A20; p-value = 6.68 × 10⁻⁹), which are in strong linkage disequilibrium (LD) (r² = 1 and r² = 1) (Table 7). SLC6A20 encodes an amino acid transporter that interacts with ACE2, the main receptor that SARS-CoV-2 uses to gain entry into host cells (Elhabyan et al., 2020; Hoffmann et al., 2020). This locus has also been previously identified by other studies (The Severe Covid-19 GWAS Group "Genomewide Association Study of Severe Covid-19 with Respiratory Failure", 2020), several meta-analyses of which have also made use of the UKBB COVID-19 data (Host Genetics Initiative, 2021). All genome-wide significant GWAS hits with gene annotations are available in Table 7.
6 and 7). Specifically, the most significant SNP for both the COVID-19 hospitalisation and critical care GWASs is located in the gene LZTFL1 (rs35044562 in locus 3p21.31; p-value = 1.55 × 10⁻¹⁰ and p-value = 2.23 × 10⁻⁹, respectively). According to the Genotype-Tissue Expression (GTEx) project, LZTFL1 is widely expressed throughout the body and encodes a protein involved in protein trafficking to primary cilia, which are microtubule-based subcellular organelles acting as antennas for extracellular signals. In T lymphocytes, LZTFL1 participates in the immunologic synapse with antigen-presenting cells, such as dendritic cells (these cells prime T-lymphocyte responses) (Kaser 2020; Seo et al., 2011; Jiang et al., 2016).

COVID-19 mortality
By 23 March 2021, 16,465 UKBB participants had received positive COVID-19 test results. Among these, 1,042 individuals died from COVID-19. We performed the same association tests for COVID-19 mortality as for susceptibility and severity. The results (Table 10) show that males have a much higher chance of dying from COVID-19 than females (OR = 1.89, 95% CI = [1.63, 2.20], p-value < 10⁻⁵), consistent with previously published results from independent cohorts (Peckham et al., 2020). The black ethnic group is at a much higher mortality risk from SARS-CoV-2 compared to white individuals (OR = 2.04, 95% CI = [1.38, 2.94], p-value = 0.0002). Age, BMI, SES, and smoking are positively associated with COVID-19 mortality. People living in aged care homes are at a much higher risk of dying from COVID-19. For self-reported white individuals, age, sex, BMI, SES, smoking, and being in an aged care home are positively associated with COVID-19 mortality. Therefore, all these covariates were used to adjust the mortality phenotype for GWAS. However, no genome-wide significant signal was detected for this GWAS (Figure 5).

COVID-19 comorbidities
We were interested in the co-occurrence of COVID-19 and comorbidities in individuals who had suffered from severe COVID-19. Therefore, we divided the hospital inpatient diagnosis records into those before and after the start of the COVID-19 pandemic, using the date 16 March 2020, when COVID-19 testing commenced in the UK. We performed association testing for each comorbidity using logistic regression models and adjusted COVID-19 severity (defined as whether the patient received critical care treatment) for sex, age, BMI, SES, smoking and aged care status. Tables 11 and 12 list the top ten diseases associated with severe COVID-19 before and after 16 March 2020, respectively. From Table 12, we found that the conditions that commonly co-occur with COVID-19 are pneumonia, respiratory diseases, renal failure, metabolic disorders, hypertensive diseases, heart disease and other bacterial diseases. People who have ever had mental disorders, influenza and pneumonia, renal failure, respiratory diseases, bacterial, viral, or other infections, malignant neoplasms of lymphoid, haematopoietic and related tissue, or other blood diseases tend to have severe symptoms after being infected by SARS-CoV-2.

APOE e4
Several publications have reported that the APOE e4 genotype is associated with COVID-19 susceptibility and severity (Numbers and Brodaty 2021; Kuo et al., 2020a, 2020b). APOE e4 is a known risk factor for dementia, a finding that has been replicated many times (Liu et al., 2013; Safieh, Korczyn, and Michaelson 2019; Emrani et al., 2020). One explanation for people with APOE e4 being at higher risk of COVID-19 could be a higher risk of exposure, as these individuals are more likely to reside in care homes, which have suffered from high rates of infection. This is particularly likely to be the case in UKBB, where 47% of participants are older than 70 years. To test this hypothesis, we performed GWAS tests with and without adjustment for aged care status. The APOE e4 signal was genome-wide significant without adjustment for aged care status but disappeared after adjustment (Figure 6), suggesting that this finding is not robust and may be due to ascertainment bias.

Use cases
To demonstrate the functionality and utility of UKB.COVID19, we present a basic tutorial for using the package. Due to the restrictions on using UKBB data, we illustrate the use cases using simulated data. The SAIGE GWAS script example can be found on GitHub: https://github.com/bahlolab/UKB.COVID19/tree/main/inst/GWAS.
ppl" are the susceptibility phenotypes, which denote 1) UKBB participants with COVID-19 positive versus negative results and 2) participants with positive results versus all the other participants.

Discussion
We developed an R package that can reproducibly analyse and produce input files for GWAS studies for COVID-19 traits, using the UKBB resource.
The R package can be easily applied to the frequently updated UKBB COVID-19 datasets, facilitating rapid analyses. By applying the R package to data released in April 2021, we found that age, BMI, SES and smoking are positively associated with COVID-19 susceptibility, severity and mortality. Males are at a higher risk of COVID-19 infection than females. People residing in aged care homes were also at higher risk, potentially because they have other pre-existing conditions, and may also have a higher chance of exposure to SARS-CoV-2. By performing GWAS, we replicated previous findings (Pairo-Castineira et al., 2021; Zeberg and Pääbo, 2020; "Genomewide Association Study of Severe Covid-19 with Respiratory Failure", 2020; Host Genetics Initiative, 2021) that the locus 3p21.31 is associated with COVID-19 susceptibility and severity.
The COVID-19 Host Genetics Initiative brings together the human genetics community to generate, share, and analyse data to learn the genetic determinants of COVID-19 susceptibility, severity, and related outcomes. They have been performing large-scale meta-analyses using existing biobanks, including UKBB, and periodically provide updated releases of their results, making available genome-wide summary statistics and providing an online browser for exploring the latest results (https://app.covid19hg.org/). We primarily advocate the use of these resources for exploring genetic associations with COVID-19 susceptibility and severity. However, we anticipate our R package will enable researchers to undertake more bespoke genetic analyses, using the most up-to-date UKBB COVID-19 data, to meet the aims of their studies. Such analyses may include adjusting for non-genetic risk factors or comorbidities to explore mediators, polygenic risk score analyses, or Mendelian Randomisation studies.
Long COVID, also known as post-acute sequelae of SARS-CoV-2 infection, refers to a range of symptoms that persist for weeks or months after the acute phase of COVID-19 has resolved. These symptoms can include fatigue, shortness of breath, cognitive dysfunction, and various other systemic issues, significantly impacting the quality of life of affected individuals. The UKB.COVID19 package provides multiple functions to facilitate long COVID analysis.
For instance, the 'comorbidity_summary' and 'comorbidity_asso' functions can be used to summarise potential long COVID symptoms and assess their associations with risk factors, such as age, sex and certain pre-existing conditions. Furthermore, researchers can focus on subsets of participants reporting persistent symptoms consistent with long COVID to investigate genetic risk factors using GWAS. These analyses hold promise for uncovering the biological underpinnings of long COVID and identifying potential therapeutic targets to alleviate its impact.
There are several limitations of the UKBB COVID-19 data. First, UKBB is not a nationally or globally representative sample. The majority of participants are of white British ethnicity. UKBB participants were more likely to be older, to be female, and to live in less socioeconomically deprived areas than nonparticipants. Compared with the general population, participants were less likely to be obese, to smoke, and to drink alcohol daily, and had fewer self-reported health conditions (Fry et al., 2017). Initiatives such as OpenSafely (Williamson et al., 2020) have aimed to examine risk factors for COVID-19 disease in an unascertained UK population via electronic health records. These data, however, are not presently available for use by the wider research community, due to the possibility of re-identification of individuals. The recent OpenSafely flagship paper examined health records of over 17 million individuals in England, of whom 10,926 had a COVID-19 related death, and found that male sex, greater age and deprivation, and non-white ethnicities were major clinical risk factors for mortality. Despite the ascertainment of the UKBB, it is reassuring that these established risk factors are also associated with COVID-19 outcomes in this cohort.
Second, the UKBB COVID-19 dataset evolved as testing scaled up in line with the national testing strategy, and thus the COVID-19 data are also subject to ascertainment bias. UK testing was initially largely restricted to healthcare workers and individuals with symptoms in hospitals. A positive result in an individual not recorded as a healthcare worker was therefore a reasonable proxy for severe disease early on in the pandemic. Testing capacity subsequently increased to include more community testing under pillar 2 of the national strategy, and as of 27 April 2020, NHS England directed hospitals to test all non-elective patients admitted overnight, including asymptomatic patients. To maximise ascertainment of cases and to evaluate disease severity, SARS-CoV-2 testing data should be used in combination with linked medical records (i.e. hospital inpatient records and death records), as we have implemented in this package. More recently, UKBB has made primary care records available for COVID-19 research. These data, not yet utilised by the UKB.COVID19 package, will further improve case identification. Nonetheless, there are likely to be many individuals in the UKBB who contracted COVID-19, in particular those with milder disease, who will not be captured by the available data.
COVID-19 susceptibility should, in principle, be defined as whether or not a person becomes infected after exposure to SARS-CoV-2. However, exposure to SARS-CoV-2 is not easy to determine. Furthermore, not everyone has an equal chance of being exposed to SARS-CoV-2 (for example, exposure will vary by occupation), nor does everyone have the same likelihood of being tested, due to testing strategies, as noted above. Such data idiosyncrasies have the potential to distort associations in observational studies, and also in genetic analyses through population stratification. This issue of ascertainment, or collider bias, in the context of COVID-19 is discussed at length by Griffith et al. (2020). Analyses using the UKBB data should therefore be undertaken and interpreted within the context of changing testing capacity and other limitations regarding phenotype definitions.
We welcome further suggestions and improvements for this R package, which we hope will reduce the barrier to utilising the UKBB data for COVID-19 research.

Edgar Gonzalez-Kozlova
Icahn School of Medicine at Mount Sinai, New York, NY, USA

Dear authors,
Fantastic job preparing a package to facilitate data retrieval and analysis.
I would like to see a few additions that can only strengthen the article.
> Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Prepare make a vignette available showcasing exactly how you intended the package to be used. While the article describes well the study and package, packages without a vignette are disregarded.
> Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
There is no mention of long covid in the article. Long term effects of COVID19 can't be ignored. Please include a discussion and/or status of long covid patients in the article.
> Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
A statistics section is missing from the methods section. Every test used in the article should be clearly described and justified in methods.

Reviewer Expertise: Computational Biology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Author Response 23 Jul 2024
Longfei Wang
------------------
Reviewer Comment: 1. Please make a vignette available showcasing exactly how you intended the package to be used. While the article describes the study and package well, packages without a vignette are disregarded.

------------------
Reviewer Comment: 2. There is no mention of long COVID in the article. Long-term effects of COVID-19 cannot be ignored. Please include a discussion and/or status of long COVID patients in the article.
Author Response: Thank you for your suggestion. We have added a discussion of long COVID and provided relevant functions in UKB.COVID19.
Long COVID, also known as post-acute sequelae of SARS-CoV-2 infection, refers to a range of symptoms that persist for weeks or months after the acute phase of COVID-19 has resolved. These symptoms can include fatigue, shortness of breath, cognitive dysfunction, and various other systemic issues, significantly impacting the quality of life of affected individuals. The UKB.COVID19 package provides multiple functions to facilitate long COVID analysis. For instance, the 'comorbidity_summary' and 'comorbidity_asso' functions can be used to summarise potential long COVID symptoms and assess their associations with risk factors, such as age, sex and certain pre-existing conditions. Furthermore, researchers can focus on subsets of participants reporting persistent symptoms consistent with long COVID to investigate genetic risk factors using GWAS. These analyses hold promise for uncovering the biological underpinnings of long COVID and identifying potential therapeutic targets to alleviate its impact.
------------------
Reviewer Comment: 3. A statistics section is missing from the methods section. Every test used in the article should be clearly described and justified in methods.
Author Response: We added a statistics section in the methods section.

Statistical analysis
To assess the associations between non-genetic risk factors and COVID-19 phenotypes (including susceptibility, severity, and mortality), we employed multivariable logistic regression models using the 'glm' function from the R package stats. To identify genetic variants associated with COVID-19 phenotypes, we performed GWASs using the SAIGE software. Principal component analysis (PCA) was performed to account for population stratification, and the first 20 principal components (PCs) were included as covariates in the analysis. Additionally, we adjusted for age, sex, BMI, SES, smoking status, residence in aged care facilities and genotypic array in the regression models. The association between each SNP and the phenotypes was tested using a logistic regression model, as follows: logit(COVID-19 phenotype) ~ SNP + age + sex + BMI + SES + smoking status + aged care status + genotypic array + PC1-20. To account for multiple testing, the Bonferroni correction was applied. Loci reaching the genome-wide significance threshold (p < 5 × 10⁻⁸) were considered significant. Manhattan plots and quantile-quantile (QQ) plots were generated to visualize the results using R package ggplot2. All analyses were carried out using R (version 4.0.5).
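As a small illustration of how the genome-wide threshold relates to the Bonferroni correction, the sketch below assumes roughly one million independent tests (a common rule of thumb, not a figure taken from this article) and applies the genome-wide and suggestive thresholds to three made-up p-values. The variant labels are hypothetical.

```r
# Bonferroni-style genome-wide threshold, assuming ~1 million independent tests.
# The number of tests is an assumption for illustration only.
alpha <- 0.05
n_tests <- 1e6
gw_threshold <- alpha / n_tests   # 5e-8
suggestive_threshold <- 1e-5

# Hypothetical p-values for three variants (labels are placeholders).
p_values <- c(variant_a = 2.23e-9, variant_b = 3e-6, variant_c = 0.2)

# Classify each variant against the two thresholds.
genome_wide <- p_values < gw_threshold
suggestive <- p_values < suggestive_threshold & !genome_wide

print(data.frame(p = p_values, genome_wide, suggestive))
```

Only the first hypothetical variant passes the genome-wide threshold; the second would appear between the blue and red lines of a Manhattan plot.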

------------------
Competing Interests: No competing interests were disclosed.
The rationale is well explained, sufficient details of the code, methods, and analysis are provided, outputs are well described and conclusions are sound and appropriate.
However, some minor points should be considered:
○ It is not clear how comorbidities are retrieved, classified (at which level of ICD-10), and analysed.
○ Authors should discuss how they choose to classify severity (the distinction between critical care and advanced critical care, for example) and why they choose to include all Covid patients (for example severity 2-3 vs 0-1 instead of severity 2-3 vs 1). Why not consider it as an ordinal variable?
Reviewer Expertise: biostatistics
days" in the HESIN_CRITICAL table or received advanced respiratory support, such as E85.1 invasive ventilation, in the HESIN_OPER table. The commonly used GWAS tools, such as SAIGE and PLINK, do not support ordinal categorical phenotypes. Therefore, we converted this ordinal variable into four binary variables named "hospitalisation", "critical care", "advanced critical care" and "mortality" (Table 2). However, users can get the ordinal variable by simply summing the four binary variables. We assume that participants who tested positive for COVID-19 but were not admitted to hospital had no or mild symptoms and hence classified them as controls in the severity phenotypes.
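The summing step mentioned in the response can be sketched as follows. This is a minimal example on a made-up data frame; the column names are hypothetical and need not match the names written by the package. It relies on the nested case definitions (each more severe level also counts as a case at every milder level).

```r
# Toy phenotype table with four nested binary severity indicators.
# Column names are hypothetical, chosen for illustration only.
pheno <- data.frame(
  id = 1:5,
  hospitalisation        = c(0, 1, 1, 1, 1),
  critical_care          = c(0, 0, 1, 1, 1),
  advanced_critical_care = c(0, 0, 0, 1, 1),
  mortality              = c(0, 0, 0, 0, 1)
)

severity_cols <- c("hospitalisation", "critical_care",
                   "advanced_critical_care", "mortality")

# Because the indicators are nested, their row sum is the ordinal level (0 to 4).
pheno$severity_ordinal <- unname(rowSums(pheno[, severity_cols]))

print(pheno)
```

The resulting score could then be treated as an ordered factor in tools that support ordinal outcomes.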

Authors should specify if they consider mortality due to Covid or with Covid
We apologise for the lack of clarity. We defined the mortality case as mortality due to Covid. In the article, we wrote: For mortality, we include all individuals who received at least one positive test result and define those whose primary cause of death is recorded as being due to COVID-19 as cases.
To make it clearer, we corrected the definition of mortality in Table 2 from "1 = death with COVID-19" to "1 = death due to COVID-19".
Competing Interests: No competing interests were disclosed.

Thomas Michael Palmer
Population Health Sciences, University of Bristol Medical School, Bristol, UK
Before I review this R package properly there are some basic fixes to the GitHub repository version which require attention.
1. The package has an unusual history. Two versions have been released on CRAN; however, as I can see from the website, it was "Archived on 2021-10-06 as email to the maintainer was undeliverable". So I recommend that the authors contact CRAN to get the package unarchived.
2. The CRAN archive shows versions 0.1.0 and 0.1.1; however, the GitHub repo shows version 0.1.0 in its DESCRIPTION file. The repo should have the latest version in it.

CRAN team. They replied that a CRAN team member had tried to contact me but the email bounced. However, my email address is correct and has not been changed since I submitted the package. I have resubmitted the package with an increased version number and with minor changes according to your suggestions. It has been unarchived (https://cran.r-project.org/web/packages/UKB.COVID19/index.html). My apologies that the package on GitHub was out-of-date. I have updated the latest version in GitHub.

3. Whilst the versions listed on CRAN, 0.1.0 and 0.1.1, must have been CRAN compliant, otherwise they would not have been allowed on CRAN, unfortunately the code in the GitHub repo is no longer CRAN compliant, and simply running R CMD check on the code in the repo gives 2 R CMD check errors and 1 note. These R CMD check errors should be fixed and the R CMD check note should also be fixed by adding the relevant entries to the .Rbuildignore file.
I have updated the latest version in GitHub and double checked it with R CMD check. There are no errors, warnings, or notes from R CMD check now.
4. The script in the tests/testthat folder does not use any of the testthat functions as it should. This should be improved or removed.
Thanks for your suggestion. I have improved the scripts in the tests/testthat folder with proper testthat functions.
5. Personally I find the name of the package unusual; I don't prefer full-stops/periods in package names.
Thanks for your suggestion. The package has been on CRAN for a while. People may have included the package in their scripts. These scripts will break if I change the name of the package, and it may be hard for everyone to find the renamed package. So I decided to keep the name and will definitely use proper names for the packages I build in the future.
6. Returned objects from the functions could be defined under one of R's class systems, e.g., S3.
Thanks for your suggestion. I have defined the returned objects under the S3 class system.

Figure 3. The Q-Q plot and Manhattan plot of COVID-19 critical care GWAS. Sample size is 11,974. In the Manhattan plot, each point denotes a SNP located on a particular chromosome (x-axis). The significance level is presented on the y-axis. The red line indicates the threshold for genome-wide significance (5 × 10⁻⁸), while the blue line indicates the threshold for suggestive genome-wide significance (1 × 10⁻⁵). The light green dots are the genes of interest, including SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, XCR1, HLA-G, CCHCR1, NOTCH4, ABO, OAS1, OAS2, OAS3, APOE, DPP9, TYK2, IFNAR2, TMPRSS2, ACE2, and TLR7. The critical care phenotype is adjusted by age, sex, body mass index, socioeconomic status, smoking, if in an aged care home, array, and PC1-20. The result shows that the locus at 3p21.31 is genome-wide significantly associated with COVID-19 critical care. The most significant SNP in the COVID-19 critical care GWAS is located in the gene LZTFL1 (rs35044562 in locus 3p21.31; p-value = 2.23 × 10⁻⁹).

Figure 4. The Q-Q plot and Manhattan plot of COVID-19 advanced critical care GWAS. Sample size is 11,974. In the Manhattan plot, each point denotes a SNP located on a particular chromosome (x-axis). The significance level is presented on the y-axis. The red line indicates the threshold for genome-wide significance (5 × 10⁻⁸), while the blue line indicates the threshold for suggestive genome-wide significance (1 × 10⁻⁵). The light green dots are the genes of interest, including SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, XCR1, HLA-G, CCHCR1, NOTCH4, ABO, OAS1, OAS2, OAS3, APOE, DPP9, TYK2, IFNAR2, TMPRSS2, ACE2, and TLR7. The advanced critical care phenotype is adjusted by age, sex, body mass index, socioeconomic status, smoking, if in an aged care home, array, and PC1-20. No genome-wide significant signals were found.

Figure 5. The Q-Q plot and Manhattan plot of COVID-19 mortality GWAS. Sample size is 12,790. In the Manhattan plot, each point denotes a SNP located on a particular chromosome (x-axis). The significance level is presented on the y-axis. The red line indicates the threshold for genome-wide significance (5 × 10⁻⁸), while the blue line indicates the threshold for suggestive genome-wide significance (1 × 10⁻⁵). The light green dots are the genes of interest, including SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, XCR1, HLA-G, CCHCR1, NOTCH4, ABO, OAS1, OAS2, OAS3, APOE, DPP9, TYK2, IFNAR2, TMPRSS2, ACE2, and TLR7. The mortality phenotype is adjusted by age, sex, body mass index, socioeconomic status, smoking, if in an aged care home, array, and PC1-20. No genome-wide significant signals were found.

Figure 6. COVID-19 susceptibility GWAS tests with and without aged care status covariate adjustment. a. COVID-19 susceptibility GWAS without care home status covariate adjustment. The model we used is: susceptibility ~ age + sex + BMI + PC1-20 + array + SNP. b. COVID-19 susceptibility GWAS with care home status covariate adjustment. The model we used is: susceptibility ~ age + sex + BMI + PC1-20 + array + inAgedCare + SNP. The APOE e4 signal was genome-wide significant without aged care status but was gone after aged care status adjustment, suggesting that this finding is not robust and may be due to ascertainment bias.
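To make the comparison of the two models concrete, the sketch below simulates one simple mechanism by which a variant can show an association that disappears after adjusting for care home residence. It is a toy simulation under stated assumptions (the variant raises the chance of living in aged care, and aged care residence raises the chance of being a case), not a reanalysis of the UKBB data; all effect sizes and variable names are invented.

```r
# Simulated data only; effect sizes and names are assumptions for illustration.
set.seed(42)
n <- 50000
snp <- rbinom(n, 2, 0.15)                               # allele count 0/1/2
in_aged_care <- rbinom(n, 1, plogis(-3 + 1.0 * snp))    # variant -> aged care
case <- rbinom(n, 1, plogis(-2 + 2.0 * in_aged_care))   # aged care -> case
dat <- data.frame(case, snp, in_aged_care)

# Model a: no adjustment for aged care status.
fit_unadj <- glm(case ~ snp, data = dat, family = binomial())
# Model b: adjusted for aged care status.
fit_adj <- glm(case ~ snp + in_aged_care, data = dat, family = binomial())

beta_unadj <- unname(coef(fit_unadj)["snp"])
beta_adj <- unname(coef(fit_adj)["snp"])
print(c(unadjusted = beta_unadj, adjusted = beta_adj))
```

In this simulation the variant has no direct effect on case status, so the adjusted log-odds estimate should shrink towards zero relative to the unadjusted one.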
Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly
Competing Interests: No competing interests were disclosed.
Clinica e Biostatistica Direzione Scientifica, Fondazione IRCCS Policlinico San Matteo, Pavia, Italy
Annalisa De Silvestri
Scientific Direction, IRCCS Policlinico San Matteo Foundation, Pavia, Italy
Authors developed a potentially useful R-package tool to analyze data from the UKBB COVID-19 database, which summarises COVID-19 test results, performs association tests between COVID-19 susceptibility/severity and potential risk factors such as age, sex, blood type and comorbidities, and generates input files for GWAS.

○ Authors should specify if they consider mortality due to Covid or with Covid
Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Partly
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.

Reviewer Report 02 December 2021
https://doi.org/10.5256/f1000research.58938.r100445
© 2021 Palmer T. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Table 1. Description of R functions in the UKB.COVID19 R package.
risk_factor: Selects several potential non-genetic risk factors from the linked health data provided by UKBB and generates an output file including the selected risk factors for the downstream analyses. Automatically returns sex, age at birthday in 2020, socioeconomic status, self-reported ethnicity, most recently reported body mass index, most recently reported pack-years of smoking, whether they reside in aged care (based on hospital admissions data and COVID-19 test data) and blood type. The function also allows users to specify fields of interest (field codes, provided by UK Biobank) and to specify more intuitive names for selected fields.
comorbidity_asso: Performs association tests using logistic regression models, adjusts the tested phenotype with covariates and outputs a table comprised of odds ratios (ORs), 95% confidence intervals (CIs) of ORs, and p-values for all the comorbidity categories.
sampleQC: Collates genetic QC data, as provided by UKBB, and outputs lists of samples for inclusion/exclusion, for use with PLINK (Purcell et al., 2007) and/or SAIGE (Zhou et al., 2018). Also outputs a CSV file summarising sample-level QC metrics.
variantQC: Collates genetic QC data, as provided by UKBB, and outputs lists of variants for inclusion in downstream analyses, for use with PLINK and/or SAIGE.
makeGWASFiles: Outputs phenotype files, formatted to be used as input for GWAS, or other genetic analyses, with PLINK and/or SAIGE.
log_cov: Performs association tests using logistic regression models.
By 6 April 2021, 77,222 individuals in the UKBB had received COVID-19 tests and 16,562 had tested positive for COVID-19 on at least one occasion. The pheno.type = "susceptibility" option summarises the COVID-19 test results data and generates a susceptibility phenotype for association tests and GWAS.
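The logic of collapsing repeated test records into a per-person susceptibility phenotype can be sketched in base R. This is a simplified illustration on made-up records, not the package's implementation: it uses only test results (the package's case definition also draws on hospital and death records), and the column names are hypothetical. The coding follows the description in this article: 1 = at least one positive result, 0 = only negative results, NA = never tested.

```r
# Made-up test records: one row per test, result 1 = positive, 0 = negative.
tests <- data.frame(
  eid = c(1, 1, 2, 2, 3),
  result = c(0, 1, 0, 0, 1)
)
all_ids <- 1:4   # participant 4 has no test record

# Any positive test makes the participant a case.
ever_pos <- tapply(tests$result, tests$eid, max)

# Positive vs negative: untested participants stay NA.
pos_neg <- as.vector(ever_pos)[match(all_ids, as.integer(names(ever_pos)))]

# Positive vs rest of the cohort: untested participants count as controls.
pos_ppl <- ifelse(is.na(pos_neg), 0, pos_neg)

print(data.frame(eid = all_ids, pos_neg, pos_ppl))
```

The two vectors differ only in how untested participants are handled, which is the distinction between the test-based and population-based control definitions.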
Based on the World Health Organization (WHO) ordinal scale for clinical improvement, we classify severity into four levels. These levels are defined as 1) hospitalisation: individuals admitted to hospital with their primary diagnosis recorded as COVID-19. 2) critical care level 2: individuals who required basic treatment in a critical care unit, such as non-invasive ventilation and continuous positive airway pressure, and with their primary diagnosis recorded as COVID-19. 3) critical care level 3: individuals who required advanced treatment in a critical care unit, such as invasive ventilation and temporary tracheostomy, and with their primary diagnosis recorded as COVID-19. 4) mortality: individuals who died due to COVID-19. The critical care information was summarised from the HESIN_CRITICAL table and the HESIN_OPER table. The critical care level 2 cases are the COVID-19 patients who required at least one "Critical care level 2 days" in the HESIN_CRITICAL table or received basic respiratory support, such as E85.2 non-invasive ventilation NEC, in the HESIN_OPER table. The critical care level 3 cases are defined as the COVID-19 patients who required at least one "Critical care level 3 days" in the HESIN_CRITICAL table or received advanced respiratory support, such as E85.1 invasive ventilation, in the HESIN_OPER table. The commonly used GWAS tools, such as SAIGE and PLINK, do not support ordinal categorical phenotypes. Therefore, we converted this ordinal variable into four binary variables named "hospitalisation", "critical care", "advanced critical care" and "mortality" (Table 2).

Table 2. The COVID-19 related phenotypes output from the makePhenotypes function in the UKB.COVID19 R package.
1 = evidence of COVID-19, from one or more of: a) positive test result for SARS-CoV-2 infection; b) admitted to hospital with COVID-19; c) death with COVID-19. 0 = no evidence of COVID-19, due to consistently testing negative for SARS-CoV-2 infection. NA = no evidence of COVID-19, and no record of test result for SARS-CoV-2 infection.
pos.ppl: COVID-19 case vs the rest of the UKBB participants (binary variable). 1 = evidence of COVID-19, from one or more of: a) positive test result for SARS-CoV-2 infection; b) admitted to hospital with COVID-19; c) death with COVID-19. 0 = any individual not meeting the criteria for a COVID-19 case.
severity, hospitalisation: COVID-19 cases with hospitalisation vs the rest of COVID-19 cases (binary variable). 1 = evidence of COVID-19 severity level 1, from one or more of: a) admitted to hospital due to COVID-19; b) received basic critical care or advanced critical care due to COVID-19; c) death due to COVID-19. 0 = no evidence of COVID-19 severity level 1, even though testing positive for SARS-CoV-2 infection.
0 = no evidence of COVID-19 severity level 3, even though testing positive for SARS-CoV-2 infection.
mortality, mortality: COVID-19 cases who have died due to COVID-19 vs the rest of COVID-19 cases (binary variable). 1 = death due to COVID-19. 0 = any other COVID-19 cases.

Table 3. The current selected risk factors of COVID-19 in the UKB.COVID19 R package.

Table 4. The comorbidity categories. Comorbidity categories are generated using the block categories in the ICD10 code, as shown in the second column. We only included the blocks in chapters 1-14 and 17 and excluded several chapters, such as pregnancy, childbirth and consequences of external causes.
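A minimal sketch of assigning ICD-10 codes to block categories is shown below. It assumes a small hand-written lookup with three WHO ICD-10 blocks chosen purely for illustration; it is not the package's lookup table, and the helper names are hypothetical. Codes outside the listed blocks (for example, external-cause chapters) return NA.

```r
# Three example ICD-10 blocks; a real analysis would use the full block list.
blocks <- data.frame(
  block = c("E10-E14", "I10-I15", "N17-N19"),
  start = c("E10", "I10", "N17"),
  end   = c("E14", "I15", "N19"),
  stringsAsFactors = FALSE
)

# Convert the three-character category (letter + two digits) to a number so
# that range comparisons do not depend on locale-specific string ordering.
icd_key <- function(code) {
  cat3 <- substr(code, 1, 3)
  match(substr(cat3, 1, 1), LETTERS) * 100 + as.integer(substr(cat3, 2, 3))
}

# Hypothetical helper: return the block label for one code, or NA if none.
icd_to_block <- function(code) {
  key <- icd_key(code)
  hit <- blocks$block[key >= icd_key(blocks$start) & key <= icd_key(blocks$end)]
  if (length(hit) == 1) hit else NA_character_
}

codes <- c("I10", "E119", "N183", "S720")
assigned <- vapply(codes, icd_to_block, character(1), USE.NAMES = FALSE)
print(data.frame(code = codes, block = assigned))
```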

Each model adjusted for covariates such as age, sex, and BMI. The tested risk factors included socioeconomic status (SES), smoking status, blood type, ethnic background, and residence in aged care facilities. The logistic regression model for each risk factor was specified as follows: logit(COVID-19 phenotype) ~ risk factor + age + sex + BMI. Comorbidity associations were analyzed using similar multivariable logistic regression models, with COVID-19 phenotypes modeled as: logit(COVID-19 phenotype) ~ comorbidity category + age + sex + BMI + SES + smoking status + aged care status. Odds ratios (ORs) with 95% confidence intervals (CIs) were reported, and p-values were calculated to determine the significance of the associations.
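The risk factor model above can be sketched with the base R 'glm' function. The data are simulated and the variable names and effect sizes are invented for illustration; the confidence intervals are Wald intervals, which is one common choice and an assumption here rather than a statement of how the package computes them.

```r
# Simulated data only; none of these values come from UK Biobank.
set.seed(1)
n <- 2000
dat <- data.frame(
  age = rnorm(n, mean = 68, sd = 8),
  sex = rbinom(n, 1, 0.45),      # 1 = male (hypothetical coding)
  bmi = rnorm(n, mean = 27, sd = 4),
  smoker = rbinom(n, 1, 0.3)     # hypothetical risk factor
)
lin_pred <- -3 + 0.03 * (dat$age - 68) + 0.3 * dat$sex +
  0.05 * (dat$bmi - 27) + 0.4 * dat$smoker
dat$phenotype <- rbinom(n, 1, plogis(lin_pred))

# logit(phenotype) ~ risk factor + age + sex + BMI
fit <- glm(phenotype ~ smoker + age + sex + bmi, data = dat, family = binomial())

# Odds ratios with Wald 95% confidence intervals and p-values.
est <- coef(summary(fit))
z <- qnorm(0.975)
or_table <- data.frame(
  term = rownames(est),
  OR = exp(est[, "Estimate"]),
  CI_low = exp(est[, "Estimate"] - z * est[, "Std. Error"]),
  CI_high = exp(est[, "Estimate"] + z * est[, "Std. Error"]),
  P = est[, "Pr(>|z|)"],
  row.names = NULL
)
print(or_table)
```

The comorbidity model has the same form, with the comorbidity category as the tested term and the additional covariates added to the right-hand side of the formula.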
Second, we tested each potential risk factor individually with adjustment for age, sex, and BMI. Several publications have already reported that blood type groups are associated with COVID-19 susceptibility (Zhao et al., 2020; Zietz, Zucker, and Tatonetti, 2020), including genetic associations with the ABO blood group locus at 9q34.2 (The Severe Covid-19 GWAS Group, 2020).

Table 5. COVID-19 susceptibility and non-genetic risk factor association test results for all populations and white British. Cases are defined as participants who received at least one COVID-19 positive test result. Controls are those who received only negative results. We tested sex, age and body mass index (BMI) in a multivariable model first and then tested each other factor individually by adjusting for sex, age and BMI. SES stands for socioeconomic status. Odds ratios (OR) and p-values (P) are provided.

Table 6. The most genome-wide significant hits of COVID-19 susceptibility, hospitalisation and critical care genome-wide association studies.

Table 7. The genome-wide significant hits of COVID-19 susceptibility, hospitalisation and critical care genome-wide association studies.
February 2021, 15,666 UKBB participants received positive COVID-19 test results. 2,104 individuals had been admitted to the hospital due to COVID-19, 1,129 of these individuals received critical care treatments and 1,010 received advanced critical care treatments. The risk factor association test results are presented in Tables 8 and 9 for all populations and self-reported white individuals, respectively. Compared to white individuals, Black, Asian, and other minority ethnic groups are at a higher risk of severe COVID-19. Age, sex, BMI, SES, and smoking are also positively associated with COVID-19 severity.
The results from the GWAS are shown in the quantile-quantile (Q-Q) plots and Manhattan plots in Figures 2-4. The tested phenotypes are adjusted by age, sex, BMI, SES, smoking, if in an aged care home, array, and PC1-20. The results show that the locus at 3p21.31 is genome-wide significantly associated with COVID-19 hospitalisation and critical care (Tables

Table 8. COVID-19 severity and non-genetic risk factor association test results for all populations. Cases of hospitalisation include participants who were admitted to hospital and whose primary diagnosis was COVID-19, received critical care treatments, or died from COVID-19. Controls are the rest of the participants who received positive test results. Cases of critical care phenotype include those who received critical care treatments due to COVID-19 or died from COVID-19. Cases of advanced critical care are defined as participants who received advanced critical care treatments or died from COVID-19. We tested sex, age and body mass index (BMI) in a multivariable model first and then tested each other factor individually by adjusting for sex, age and BMI. SES stands for socioeconomic status. Odds ratios (OR) and p-values (P) are provided.

Table 9. COVID-19 severity and non-genetic risk factor association test results for white British. Cases of hospitalisation include participants who were admitted to hospital and whose primary diagnosis was COVID-19, received critical care treatments, or died from COVID-19. Controls are the rest of the participants who received positive test results. Cases of critical care phenotype include those who received critical care treatments due to COVID-19 or died from COVID-19. Cases of advanced critical care are defined as participants who received advanced critical care treatments or died from COVID-19. We tested sex, age and body mass index (BMI) in a multivariable model first and then tested each other factor individually by adjusting for sex, age and BMI. SES stands for socioeconomic status. Odds ratios (OR) and p-values (P) are provided.
The risk_factor function in UKB.COVID19 can be used to generate a covariate file with established risk factors and risk factors of interest by specifying the field code in UKBB main data.

Table 10. COVID-19 mortality and non-genetic risk factor association test results for all populations and white British. Cases of mortality include participants whose primary cause of death is COVID-19. Controls are the rest of the participants who received positive test results. We tested sex, age and body mass index (BMI) in a multivariable model first and then tested each other factor individually by adjusting for sex, age and BMI. SES stands for socioeconomic status. Odds ratios (OR) and p-values (P) are provided.

Table 11. The top 10 comorbidities associated with COVID-19 severity before COVID-19 testing in the UK. We divided the hospital inpatient diagnosis records into before and after the COVID-19 pandemic using the date 16 March 2020, when COVID-19 testing commenced. We performed association testing for each comorbidity using logistic regression models and adjusted COVID-19 severity (if the patient received critical care treatments) by sex, age, body mass index, socioeconomic status, smoking and aged care status. To show the comorbidities in individuals who had suffered from severe COVID-19, we ranked the p-values before 16 March 2020 and listed the top 10 comorbidities.

Table 12. The top 10 comorbidities associated with COVID-19 severity after COVID-19 testing in the UK. We divided the hospital inpatient diagnosis records into before and after the COVID-19 pandemic using the date 16 March 2020, when COVID-19 testing commenced. We performed association testing for each comorbidity using logistic regression models and adjusted COVID-19 severity (if the patient received critical care treatments) by sex, age, body mass index, socioeconomic status, smoking and aged care status. To show the top 10 conditions co-occurring with COVID-19, we ranked the p-values after 16 March 2020 and listed the top 10 comorbidities.
#> [1] "Outputting file: ~/UKB.COVID19/extdata/results/phenotype.txt"
Generating a comorbidity summary file. The comorbidity_summary function scans all the hospitalisation records within a given time period and generates a text file. The following example generates a comorbidity summary file that includes all the primary and secondary diagnoses in the hospital inpatient data after 16 March 2020.