The Data Use Ontology to streamline responsible access to human biomedical datasets

Summary
Human biomedical datasets that are critical for research and clinical studies to benefit human health also often contain sensitive or potentially identifying information of individual participants. Thus, care must be taken when they are processed and made available to comply with ethical and regulatory frameworks and informed consent data conditions. To enable and streamline data access for these biomedical datasets, the Global Alliance for Genomics and Health (GA4GH) Data Use and Researcher Identities (DURI) work stream developed and approved the Data Use Ontology (DUO) standard. DUO is a hierarchical vocabulary of human- and machine-readable data use terms that consistently and unambiguously represents a dataset’s allowable data uses. DUO has been implemented by major international stakeholders such as the Broad and Sanger Institutes and is currently used in annotation of over 200,000 datasets worldwide. Using DUO in data management and access facilitates researchers’ discovery and access of relevant datasets. DUO annotations increase the FAIRness of datasets and support data linkages using common data use profiles when integrating the data for secondary analyses. DUO is implemented in the Web Ontology Language (OWL) and, to increase community awareness and engagement, hosted in an open, centralized GitHub repository. DUO, together with the GA4GH Passport standard, offers a new, efficient, and streamlined data authorization and access framework that has enabled increased sharing of biomedical datasets worldwide.


In brief
The GA4GH Data Use Ontology (DUO) provides unambiguous, machine-readable standard language for consent forms and the data sharing policies they represent. Lawson et al. describe the DUO standard and implementations throughout the data access workflow to expedite data access while maintaining or improving compliant processes.

INTRODUCTION
To address global scientific challenges in health, human biomedical data must be shared and integrated worldwide. 1 To promote discovery and improve healthcare, researchers and clinicians need to be able to find, access, harmonize, and re-use data from diverse data sources. Data access for research is often facilitated by data repositories, and in a growing number of federated data environments 2 that aggregate datasets within or among themselves and make the results available to the research community. Challenges arise in the aggregation of datasets with varying ethical or regulatory conditions on data reuse. Different conditions may stem from different applicable data protection laws (e.g., limits on allowable purposes of processing, transfers to third countries), informed consents (e.g., specific vs. broad), policies (e.g., IRB data release authorizations), or data sharing agreements (e.g., within consortia). 3 Due to this heterogeneity of re-use conditions, it can be difficult for researchers to search and find appropriate datasets, methods of requesting and accessing those datasets vary, and there is no shared understanding of the allowable uses and/or downstream analyses of the data once access is approved.
Current processes to access sensitive human biomedical data can be cumbersome, time- and cost-intensive, and variable between repositories. In typical workflows, Data Access Committees (DACs) manually review data use terms; this process can be delayed by the need to interpret data use terms that are often described in inconsistent and ambiguous language. There can also be inconsistency in access determinations across DACs, particularly for broadly defined data use terms, such as "permitted use for a disease and related conditions." Similarly, language in a consent form prohibiting "commercial use" has been interpreted differently by DACs, ranging from not allowing commercial organizations access to the data to not allowing the data to be used for commercial purposes independently of the organization type. Finally, these interpretations can shift over time, increasing the risk that data are used in a way that does not reflect what the research participant originally agreed to and leading to inconsistent data sharing practices.
To address the needs for consistent terminology and reliable interpretations of allowable data uses, the GA4GH Data Use and Researcher Identities (DURI) work stream 4 developed a data authorization and access framework to streamline the process for granting researchers access to biomedical datasets based on their credentials and research purposes. A main component of this framework is the Data Use Ontology (DUO), a standard, machine-readable vocabulary of data use terms that enables direct matching between data use conditions and intended research use. DUO is complemented by the GA4GH Passport standard (see Voisin et al. in this issue), 5 which provides a machine-readable representation of a researcher's data access permissions. Together, the GA4GH DUO and Passport standards enable automated access by researchers to multiple datasets based on their authentication and authorization levels and have been deployed by various organizational members of the GA4GH DURI work stream. DUO is now the accepted GA4GH standard for data use terms, based on use cases from several GA4GH Driver Projects. 6 Over 200,000 datasets have been annotated with machine-readable DUO terms (Table 1). DUO has been successfully leveraged by software such as the Broad Data Use Oversight System (DUOS) to enable automated matching between access requests and DUO annotation on datasets (see Cabili et al. in this issue). 7 In this study, we report on the DUO standard, describe the curated structured vocabulary and hierarchies, and review use cases and considerations in implementing DUO for the management and access of biomedical datasets. DUO has been successfully used to annotate genomics datasets worldwide, and its usage is being expanded to direct mapping into consent forms and automated matching of requests to permissions by DACs. Future uses of DUO include annotation of different data types such as samples and integration within GA4GH Passport visas.

DESIGN
DUO is a structured vocabulary of standard human- and machine-readable data use terms. DUO's original list of terms was informed by review of common terminologies used by major international controlled-access genomic repositories (e.g., U.S. National Institutes of Health database for Genotypes and Phenotypes, NIH dbGaP, 8 and European Genome-Phenome Archive, EGA 9 ), as well as policy tools developed by the GA4GH Regulatory and Ethics Work Stream (REWS). 3,10 Contributors from those efforts joined to form the Data Use group, which met regularly both through videoconferences and face-to-face meetings. External efforts such as the Informed Consent Ontology (ICO) 11 were additionally reviewed for interoperability and synergistic evolution; DUO has been directly imported in ICO to describe data use conditions instead of duplicating its content. The DUO terms are intended to be a simple set of data use terms most often used or referenced in consent forms that include provisions for data sharing. DUO does not aim to represent all possible data use terms, consent phrases, or complex logical permutations of permissions, limitations, or requirements. Structurally, DUO contains 25 terms representing two types of data use terms, permissions and modifiers (Table S1). DUO is use-case driven, and requests for new data use terms in DUO must be supported by specific use cases that promote and facilitate data sharing. Each DUO term was developed based on contributions and reviews from community experts and implementers. Contributions to DUO are public and created by raising GitHub issues; 12 anyone may submit a request to add a new term or comment on an existing request. Requests are discussed by the DUO work stream leads and driver project implementers on the tracker, on the DUO mailing list, and during periodic teleconferences. Once approved, changes are open to the public for further discussion over a comment period of 2 weeks, as per the DUO governance policy. 13
DUO is implemented in the Web Ontology Language (OWL), 14 a World Wide Web Consortium standard. Development of DUO follows Open Biomedical Ontologies (OBO) development principles, 15 ensuring interoperability with other ontological resources, such as those describing disease entities. 16 As per OBO guidelines, DUO is built under the Basic Formal Ontology (BFO) 17 upper-level ontology. The DUO root terms "data use permission" and "data use modifier" are subclasses of "data item" (IAO:0000027), itself a type of "information artifact entity" (IAO:0000030) and "generically dependent continuant" (BFO:0000031). While BFO provides the framework for the DUO hierarchy, it proved confusing for most users. We consequently worked with the developers of the EMBL-EBI Ontology Lookup Service (OLS) 18 to design and implement a system allowing selection of suitable entry levels in the DUO hierarchy. The "preferred root" toggle shown in Figure 2 allows most users to browse only classes of interest, while expert ontologists can instead select the complete view. DUO terms are stable, with each DUO term having its own unique Uniform Resource Identifier, which can be browsed using the OLS. Most importantly, the meaning associated with a specific DUO ID is permanent; this guarantees consistency of the data use terms over time. Different versions of DUO are available through the GitHub repository, 19 including an editors' version that captures ongoing development and stable, released versions. Released versions of DUO are associated with permanent URLs (PURLs) for sustainability: 20 the most recent release is always available from http://purl.obolibrary.org/obo/duo.owl, while previous versions can be accessed through their date-based PURL, providing choice for users who prefer to use a specific historical view of the ontology 21,22 for stability while transitioning to the latest version.
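The stable identifiers described above can be illustrated with a small Python sketch. It converts between the compact form used in this paper (e.g., DUO:0000007) and the OBO-style PURL form (e.g., http://purl.obolibrary.org/obo/DUO_0000007). The function names and the validation pattern are illustrative assumptions, not part of the DUO standard or of any DUO tooling.

```python
import re

# Base of the OBO PURL scheme, as seen in the DUO term URLs quoted in this paper.
OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"


def curie_to_purl(curie: str) -> str:
    """Convert a compact ID such as 'DUO:0000007' into its OBO-style PURL.

    Hypothetical helper: the accepted pattern (letters, colon, digits)
    is an assumption made for this sketch.
    """
    match = re.fullmatch(r"([A-Za-z]+):(\d+)", curie)
    if match is None:
        raise ValueError(f"not a recognised compact ID: {curie!r}")
    prefix, local_id = match.groups()
    return f"{OBO_PURL_BASE}{prefix}_{local_id}"


def purl_to_curie(purl: str) -> str:
    """Convert an OBO-style PURL back into the compact 'PREFIX:ID' form."""
    if not purl.startswith(OBO_PURL_BASE):
        raise ValueError(f"not an OBO PURL: {purl!r}")
    prefix, sep, local_id = purl[len(OBO_PURL_BASE):].partition("_")
    if not sep or not prefix or not local_id:
        raise ValueError(f"cannot split prefix and ID in: {purl!r}")
    return f"{prefix}:{local_id}"


if __name__ == "__main__":
    print(curie_to_purl("DUO:0000007"))
    print(purl_to_curie("http://purl.obolibrary.org/obo/DUO_0000042"))
```

Because the meaning of a DUO ID is permanent, a mapping of this kind can be stored alongside dataset metadata without needing to be revisited when a new release of the ontology is published.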
Terms are positioned in the DUO hierarchy such that subclasses are more specific sets of instances than their parents. This allows for inference of new knowledge through the description logic underpinning OWL reasoners. 23 For example, when searching for datasets for a "disease-specific" research use (Figure 2), a researcher would see query results of datasets matching this use term and its parents, "health and biomedical research" (direct superclass) and "general research" (indirect superclass). The initial structure of the repository was generated using the ontology development kit, 24 which provides a way of creating an ontology project ready for pushing to GitHub. Development of the ontology follows a modular approach for greater flexibility both for developers of DUO and for its users. For example, the DUO Japanese translation is stored as a separate file from the main ontology. This file is merged in at release time via an automated script, allowing different files and features to remain independent until they are ready to be published and/or to be excluded at release time on demand, for example for users who do not require translations from English. The same script also executes SPARQL 25 queries to render CSV versions, again for easy human browsing in the GitHub repository. Finally, the script merges relevant subsets of external ontologies imported through the MIREOT method 26 to promote ontology re-use and consistent identification of ontology terms across resources.
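The hierarchy-based matching described above can be sketched in a few lines of Python. This is a minimal illustration, not the logic of an OWL reasoner or of any repository: the three-term parent map is a hand-written excerpt, the pairing of IDs with labels is our assumption, and the dataset names are hypothetical.

```python
# Hand-written excerpt of the DUO permission hierarchy (child -> parent).
# Assumption for this sketch: the ID/label pairings in the comments.
PARENT = {
    "DUO:0000007": "DUO:0000006",  # disease specific research -> health/medical/biomedical research
    "DUO:0000006": "DUO:0000042",  # health/medical/biomedical research -> general research use
    "DUO:0000042": None,           # general research use (top of this excerpt)
}


def self_and_ancestors(term):
    """Return the term followed by its direct and indirect superclasses."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def datasets_matching(research_use, datasets):
    """Names of datasets whose permission is the requested use or one of its parents."""
    allowed = set(self_and_ancestors(research_use))
    return sorted(name for name, permission in datasets.items() if permission in allowed)


# Hypothetical catalogue: dataset name -> DUO permission term.
CATALOGUE = {
    "cohort_A": "DUO:0000042",
    "cohort_B": "DUO:0000006",
    "cohort_C": "DUO:0000007",
}

if __name__ == "__main__":
    print(datasets_matching("DUO:0000007", CATALOGUE))
    print(datasets_matching("DUO:0000006", CATALOGUE))
```

A query for a disease-specific use returns all three hypothetical cohorts, whereas a query for the broader health and biomedical use returns only the two cohorts annotated at that level or above. A production system would read the hierarchy from the released OWL file rather than hard-coding it.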
To increase community awareness and engagement, DUO is hosted under an open, centralized GitHub repository. This enables tagging of versions and continuous integration tests to be run at each iteration via the Travis CI software. After each modification of the source file, the ELK reasoner 27 is run to ensure ongoing consistency of the ontology.

RESULTS
To ensure trustworthiness and sustainability of its technical standards, the GA4GH applies an open and consistent development and product approval process. 1 In 2019, DUO was unanimously approved as a GA4GH standard by the GA4GH Steering Committee, joining other products in the GA4GH Genomic Toolkit suite. 1 Figure 3 displays the current implementers of DUO.
DUO has been incorporated in several central aspects of the data access request process (Box 1). First, DUO terms are applied as dataset metadata to be stored alongside the data they describe in a repository, making it easier for data custodians to manage their datasets compliantly and facilitating researchers' querying of the datasets by their data use terms. Repositories can add DUO annotations to their dataset files, either retrospectively through curation of existing data or interactively at submission time. Users can search for datasets according to data use terms to determine what datasets are available for their purposes before requesting data access. This improved accessibility and interoperability of datasets increases their FAIRness. 28 For example, 2.6% of data requesters who applied for access to Sanger's Cancer Genome Project (CGP) datasets between April and October 2020 had used the EGA DUO search tool to find re-usable datasets compatible with their research purposes.
In a second use case, DUO terms have been leveraged by DACs to facilitate and, for the first time, automate parts of the data access request process. The use of DUO in electronic data access systems enables automated matching by software algorithm, leveraging the DUO hierarchy and logical structure. An implementation automating data access requests has been piloted for NIH and the Broad Institute through DUOS 7 and is now being extended to other databases. The DUOS software platform performs automated DUO-based data use oversight and provides interfaces to simplify the work of DACs. An empirical evaluation of the results demonstrates that DUO is broadly useful, matching 96% of consent terms in examined datasets, and that using DUOS to automate the process streamlines the review process while maintaining efficacy and consistency.
As a third use case, DUO terms are incorporated into the data sharing language in consent forms written during the study inception. 30,31 Incorporating DUO terms at this early stage is important to enable more effective and consistent data use management. This addresses current challenges in the common use of informed consent language that does not fully capture the scope and issues related to data sharing and secondary research purposes, resulting in uncertainty for participants regarding research expectations as well as for data providers and data stewards or DACs in assessing how datasets can be distributed. The consent clauses in the Machine-Readable Consent Guidance are accompanied by explanations and guidance for consistency, and to ensure prospective capture as machine-readable data use terms. This is currently undergoing evaluation and validation by IRBs, and we anticipate this becoming a recommendation that could be more broadly followed.

Figure 2. The DUO OWL file has been loaded in human-friendly browsers such as the Ontology Lookup Service (OLS). This enables interactive navigation through the hierarchy and display of additional properties such as definition, comment, or relations to other terms. For example, the "disease specific research" DUO term, http://purl.obolibrary.org/obo/DUO_0000007, clarifies that it should be used in conjunction with a term from a disease ontology. The "Preferred root terms" button (middle, active green checkbox) guides display of the top classes to be displayed to the user instead of presenting the complex upper-level BFO hierarchy (accessible by selecting "All terms").

DISCUSSION
Since its approval as a GA4GH standard, 1 DUO has been widely implemented across diverse biomedical projects worldwide. Beyond requests for and comments on new data use terms, DUO standard implementers have contributed by proposing translations in other languages, such as Japanese, or in "plain language," which has been shown to increase understanding and participation of research participants. 32 To this end, DUO was successfully extended for consent use as the Machine-Readable Consent Guidance described earlier, which was approved as a GA4GH standard in July 2020 33 and is being actively reviewed and implemented by IRBs and research studies. In addition, community members enthused by the success and simplicity of DUO aim to further extend its application beyond genomic datasets to resources such as biological specimens, imaging data, and public health data. The Finnish Institute for Health and Welfare biobank 34 has already implemented DUO in requiring sample depositors to describe sample/data use terms when depositing in their repositories. Indeed, nothing precludes developing applications or extensions of DUO for other scientific resources. Successful external extensions of the standard can be fed back to GA4GH, allowing for continual improvement in utility and function for the community.
DUO terms can also be used in healthcare settings and alongside complementary standards. Health Level Seven International (HL7)'s Fast Healthcare Interoperability Resources (FHIR) 35 Consent resource, 36 as well as other tools or standards, such as the Automatable Discovery and Access Matrix (ADA-M) or OASIS's LegalRuleML, 37 use logic for expressing more complex data use rules. The HL7 standard permits an implementer to adopt a default rule for a given use term (e.g., everything permitted by default, everything restricted by default) and then specify exceptions. LegalRuleML and ADA-M explicitly define if a rule for coded data use is a permission, prohibition, or condition. This approach requires users to ''translate'' their intuitive thinking into machine-based logic and can lead to complexity, confusion, and a greater risk of error.

OPEN ACCESS
Limitations of the study
The GA4GH DUO standard represents the data use terms commonly used by data management professionals for sharing of biomedical datasets, while minimizing the complexity of logical permutations of data use terms, which is essential to global interoperability and data sharing. 38 For example, DUO adopts the term "not-for-profit use only" rather than decomposing "profit" and whether it is "allowed," "forbidden," or "restricted" in specific instances, thus not requiring users to mix and match terms with potentially opposing meanings. DUO is not built to capture the entire spectrum of possible data use combinations, as pursuing a vocabulary to describe all possible combinations of data use would likely lead to an ever more complex model given the constant increase of possible terms and combination permutations. This intentional limitation of the DUO terminology space has been encouraged by researchers, in line with the DURI leadership's vision for DUO as a concise standard to facilitate compatibility of terms.
Arguments to the contrary portray DUO, and the aspiration for a limited vocabulary, as counter to the needs of specific participant communities. An example often used to justify this position, and one that amounts to a red herring, is that rare disease research participants often believe that DUO's limited scope cannot represent the unique, specific diseases they have, such as ataxia-telangiectasia or Diamond-Blackfan anemia. Yet this reflects an inversion of understanding, as unique, edge-case types of research are permissible under many of the existing DUO terms, particularly General Research Use and Health/Medical/Biomedical Use. Annotating those datasets with more general DUO terms also increases the probability of researchers reaching those disease-specific findings, possibly impacting scientific discoveries to prevent and treat such diseases. Ultimately, after engaging with the DUO team, representatives of the RARE-X rare disease community became strong proponents of DUO and advocate for its use among other rare disease participant groups. To help clarify this to future adopters of DUO, the DURI work stream is actively developing DUO implementation guidance and is also evaluating whether it would be feasible to provide a DUO-based software service to aid groups in choosing DUO terms that fit their needs.
Currently, the implementation and use of DUO may be limited by the need to retrospectively translate consent form language into DUO terms. This limits the number of dataset annotations possible and potentially generates variability in the mapping of legacy consent form conditions to DUO terms. To prospectively mitigate this issue, we have finalized the Machine-Readable Consent Guidance 29 to propose a consent form already mapped into DUO terms. DUO also supports DACs and data custodians with workshops and trainings on how to translate consent forms to DUO terms.

Conclusion
DUO has been adopted worldwide for use in annotation of over 200,000 datasets to describe data use conditions for human biomedical data (Table 1). The GA4GH DUO and Passport standards, part of a joint strategy to streamline access to data, have not yet been connected to enable a single process. As a next step, the DURI working group of GA4GH is planning to integrate DUO terms into Passport visas, combined with advocating for a policy shift toward approving access to groups of datasets by data use profile rather than to individual datasets. This will allow authenticated researchers to automatically access new and existing datasets matching their DAC-approved data use profile after sign-in. Further streamlining the access process will minimize the need for multiple consecutive requests as new data are released either for a specific project or in a new repository. Such an approach also sets a precedent for establishing trust between DACs and enhanced alignment in the approval process: we envision users' data use profiles could be shared across DACs. As biomedical datasets are produced in greater numbers, across diverse settings, reliance on DUO-based mechanisms is critical to streamline data access to enable scientific collaborations.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

Data donors (participants in trials and studies) agree to data use purposes described in consent forms. Consent forms are written by research teams in compliance with national, local, or institutional regulations and/or policies. To maintain stewardship and accessibility, these forms should adopt clear-language data use terms, and templates should be made publicly accessible. DUO standard data use terms can be embedded directly in the consent forms' clauses, following the GA4GH Machine-Readable Consent Guidance. 29 Organizations may add additional usage parameters beyond DUO, for example, to protect intellectual property.

STEP 2: DATASET ANNOTATION
Datasets hosted in controlled-access repositories are annotated with DUO terms denoting the data use terms that must be adhered to for approval for secondary data usage. The DUO terms can be added retrospectively by repository custodians for legacy datasets and/or prospectively by data depositors upon data submission.

STEP 3: DATASET DISCOVERY
A researcher can use DUO terms to search for datasets with relevant use conditions in a data repository. For example, they can search for all datasets consented for melanoma research. This returns only the list of datasets that would be permitted for use given this specific condition. Alternatively, the researcher can query a specific dataset for their use case, without needing to contact the DAC or other help resources. This process allows the researcher to streamline the process of identifying suitable datasets and avoid unnecessary data access request submissions.
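The discovery step above can be sketched as a simple filter over annotated datasets. The sketch below is illustrative only and makes several assumptions: the dataset names are hypothetical, the disease hierarchy is a hand-written stub in which `EXAMPLE:melanoma` and `EXAMPLE:asthma` are placeholder identifiers rather than real ontology terms, and modifiers such as not-for-profit restrictions are ignored.

```python
from dataclasses import dataclass
from typing import Optional

GRU = "DUO:0000042"  # general research use (label pairing assumed for this sketch)
HMB = "DUO:0000006"  # health/medical/biomedical research (assumed pairing)
DS = "DUO:0000007"   # disease specific research

# Stub disease hierarchy (child -> parent). MONDO:0004992 is the cancer term
# cited in the EGA example; the EXAMPLE:* identifiers are placeholders.
DISEASE_PARENT = {
    "EXAMPLE:melanoma": "MONDO:0004992",
    "MONDO:0004992": None,
    "EXAMPLE:asthma": None,
}


@dataclass(frozen=True)
class Annotation:
    dataset: str
    permission: str
    disease: Optional[str] = None  # only meaningful for disease specific research


def disease_covers(consented: str, requested: str) -> bool:
    """True if the requested disease is the consented disease or one of its subtypes."""
    current = requested
    while current is not None:
        if current == consented:
            return True
        current = DISEASE_PARENT.get(current)
    return False


def permits_disease_research(annotation: Annotation, disease: str) -> bool:
    """Would research on `disease` fall within this dataset's permission term?"""
    if annotation.permission in (GRU, HMB):
        return True
    if annotation.permission == DS and annotation.disease is not None:
        return disease_covers(annotation.disease, disease)
    return False


CATALOGUE = [
    Annotation("dataset_1", GRU),
    Annotation("dataset_2", DS, "MONDO:0004992"),
    Annotation("dataset_3", DS, "EXAMPLE:asthma"),
]

if __name__ == "__main__":
    hits = [a.dataset for a in CATALOGUE if permits_disease_research(a, "EXAMPLE:melanoma")]
    print(hits)
```

In this toy catalogue, a search for melanoma research returns the general-research dataset and the cancer-specific dataset but not the asthma-specific one. A real repository would resolve disease subtypes against a disease ontology and would also evaluate any modifiers attached to each dataset.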

STEP 4: DATA ACCESS REQUEST
A researcher requests access to relevant datasets and describes the research purpose using DUO terms. This enables efficient triaging by the DAC, either manually or using an automated matching algorithm. 7 The DAC reviews the access request to determine if the proposed research is consistent with the data use terms and if so, grants the researcher access to the datasets. The use of DUO terms facilitates a streamlined and standardized review by DACs.

Table S1. DUO terms as of February 23rd 2021. Each term has a stable identifier (column 1, "ID"), an optional shorthand (column 2) that can be used for visualisation purposes as shown in Figure 4, a label (column 3) and a textual definition (column 4).

Example implementation in the European Genome-Phenome Archive
1. Addition of DUO term(s) to EGA datasets
DUO can be added to datasets at the EGA by two routes.
The first, and currently most commonly used, method is for the submitter to choose appropriate DUO term(s) for their dataset(s) using the most up-to-date version of the ontology, always available from http://purl.obolibrary.org/obo/duo.owl. Once they have identified the most appropriate DUO term(s) for their dataset(s), they provide them to the EGA by emailing the helpdesk at helpdesk@ega-archive.org. The helpdesk team then runs an in-house script that assigns the DUO term(s) to the dataset(s), which in turn propagates through to our website.
The second method is carried out by the submitters themselves when they submit programmatically to the EGA using XML files, by adding the DUO term(s) to the policy XML. In this instance, the data use attribute in the XML stores the desired DUO term(s) along with the version, which references the ontology build used, e.g.,

<DATA_USES>
  <DATA_USE ontology="DUO" code="0000042" version="23-02-2021"/>
</DATA_USES>

In some instances, DUO can be used in combination with a modifier to allow more detailed description of how the data may be used. For example, DUO:0000007 (disease specific research) should be used in combination with terms from MONDO to describe specifically which disease research the data can be used for; e.g., DUO:0000007 MONDO:0000996 would mean that the data could only be used for research into prostate lymphoma. To submit this programmatically, the user would use a modifier attribute within the XML, e.g.,

<DATA_USE ontology="DUO" code="0000007" version="23-02-2021">
  <MODIFIER>
    <DB>MONDO</DB>
    <ID>0000996</ID>
  </MODIFIER>
</DATA_USE>

2. Rendering of DUO term(s) on the EGA website
On the EGA website, users can search using DUO terms and retrieve datasets of interest. For example, https://ega-archive.org/datasets/EGAD00001007598 shows a dataset consented for disease specific research (DUO:0000007) where the disease is further specified by using the MONDO term MONDO:0004992 (cancer).
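For submitters who prepare the policy XML programmatically, the data use fragments shown in the first part of this example could be generated with Python's standard library. This is a sketch under stated assumptions: the element and attribute names are copied from the fragments above, the helper function is hypothetical, placing several DATA_USE elements inside one DATA_USES wrapper is our assumption, and the output has not been validated against the EGA submission schema.

```python
import xml.etree.ElementTree as ET


def build_data_uses(entries, version):
    """Build a DATA_USES element from (code, modifier) pairs.

    `modifier` is None or a (db, term_id) tuple, e.g. ("MONDO", "0000996").
    Hypothetical helper; names mirror the XML fragments in this example.
    """
    root = ET.Element("DATA_USES")
    for code, modifier in entries:
        data_use = ET.SubElement(
            root, "DATA_USE", ontology="DUO", code=code, version=version
        )
        if modifier is not None:
            db, term_id = modifier
            modifier_el = ET.SubElement(data_use, "MODIFIER")
            ET.SubElement(modifier_el, "DB").text = db
            ET.SubElement(modifier_el, "ID").text = term_id
    return root


# The two data uses from the example above: a term without a modifier and
# disease specific research qualified by a MONDO term.
data_uses = build_data_uses(
    [("0000042", None), ("0000007", ("MONDO", "0000996"))],
    version="23-02-2021",
)

if __name__ == "__main__":
    print(ET.tostring(data_uses, encoding="unicode"))
```

Generating the fragment from structured input, rather than by string concatenation, keeps the XML well formed and makes it straightforward to record the ontology version alongside every term.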