Microbiome Metadata Standards: Report of the National Microbiome Data Collaborative’s Workshop and Follow-On Activities

ABSTRACT Microbiome samples are inherently defined by the environment in which they are found. Therefore, data that provide context and enable interpretation of measurements produced from biological samples, often referred to as metadata, are critical. Important contributions have been made in the development of community-driven metadata standards; however, these standards have not been uniformly embraced by the microbiome research community. To understand how these standards are being adopted, or the barriers to adoption, across research domains, institutions, and funding agencies, the National Microbiome Data Collaborative (NMDC) hosted a workshop in October 2019. This report provides a summary of discussions that took place throughout the workshop, as well as outcomes of the working groups initiated at the workshop.

The National Microbiome Data Collaborative (NMDC) is a pilot initiative that was launched in July 2019 and is funded by the Department of Energy (DOE) Office of Science, Biological and Environmental Research Program, to support microbiome data exploration and discovery through a collaborative, integrative data science ecosystem (1). The NMDC team is building an open-source, integrated data science ecosystem that leverages existing data standards, data resources, and infrastructure in the microbiome research space. The NMDC initiative embraces the FAIR (findable, accessible, interoperable, and reusable) data principles (2) by incorporating community-driven data standards and quality measures to enable data integration and access in its science gateway. Understanding the current landscape of data standards for the microbiome research community is an important first step toward achieving the aims of the NMDC pilot initiative.
Information that contextualizes samples, including sample collection, sample preparation, data processing methods, and data products (3) (Fig. 1), also known as "metadata," is essential for the interpretation of measurements produced from a biological sample. Metadata standardized with common terms, such as those from an ontology (a controlled vocabulary with logical links between its terms), are essential for data sharing, synthesis, and reuse, and can enable the discovery of new insights (4). The Genomic Standards Consortium (GSC) (5) and the Open Biological and Biomedical Ontologies (OBO) Foundry (6) have made important contributions to the development of community-driven sample metadata standards. Yet, it is unclear how much of the microbiome research community is applying metadata standards, or whether there remain barriers to adoption.
To understand how data standards support microbiome science across research domains, institutions, and funding agencies, the NMDC team hosted 50 experts in microbiome research, data standards, genome annotation, bioinformatics, and community engagement for a 4-day workshop in October 2019 at the Lawrence Berkeley National Laboratory (https://microbiomedata.org/nmdc-ontology-workshop/). The workshop goals were to review how standards are currently used, explore approaches for improving community adoption of and compliance with standards, build consensus around the importance of metadata, and establish a network of key stakeholders to advocate for standards across their organizations and communities.
The main sessions of the workshop included (i) perspectives from repositories, infrastructure projects, metadata resources, and standards organizations (https://microbiomedata.org/nmdc-ontology-workshop/); (ii) group discussions on best practices, remaining challenges, and paths forward; and (iii) the initiation of working groups to evaluate current standards and their adoption, enhance existing standards, and identify training needs. Here, we summarize the workshop discussions on addressing barriers in microbiome data standards, and share outcomes from several working groups formed at the workshop.

ADDRESSING BARRIERS IN MICROBIOME DATA STANDARDS
Throughout the workshop discussions, two cross-cutting areas for improvement related to microbiome data and standards emerged: (i) encourage a culture that shares microbiome data, and (ii) understand and reduce barriers to (meta)data submission. We present a summary of the workshop discussions in the context of these two key themes.
Encourage a culture that shares microbiome data. Success in science is often measured by high-impact publications (7), creating pressure to be the first to make important discoveries and receive credit for the published contribution. Waiting until findings are published before making data available to others is not uncommon and remains a significant barrier to the provision of data to the broader community (8,9). Even after publication, data sharing continues to be challenging due to a noted lack of time to prepare data for sharing and reuse, legal or privacy constraints, and concerns about misinterpretation or misuse of data (8,10). As a result, researchers often cannot find data (11), or spend 50 to 80% of their time wrangling data into a more usable form (12). The current data revolution highlights the need to explore other measures of success (13-15), as researchers are producing massive quantities of data that could provide valuable context for questions far beyond their original intent. While funding agencies are discussing ways to mandate data sharing (16), the sharing of high-quality, well-curated data should also be driven by incentives. Other considerations include a mechanism to request permission from the data owner(s) to use data sets prior to publication, as scientists would be more willing to share data with certain conditions on its use (8).
To encourage a culture that shares microbiome data, it is critical to develop incentives and promote ways to reward data stewardship. This workshop brainstormed several ways to encourage a culture that shares microbiome data, which the NMDC team is working to support.
(i) Establish digital object identifiers (DOIs) to enable data set citations. It has widely been reported that receiving credit through data set citations is important for data sharing (8,17). Providing a method for citing data sets in published articles opens the door for data set reuse to be quantified and, therefore, easily incorporated as a new metric in the research incentives structure. Journals that publish data set papers, such as Nature Scientific Data, Gigascience, and Microbiology Resource Announcements, are an important start, and other publishers have started these discussions (18). Several organizations are able to issue and register DOIs for data sets, but determining the granularity of DOI assignment at the individual data set or project level, as well as tracking mechanisms, remain challenging. Further coordination with funders and additional publishers will be critical for defining, establishing, and promoting data citations and accurate citation metrics.
(ii) Host data analysis competitions to support training on FAIR data for early career researchers. Early career researchers, including graduate students, are seen as critically important for catalyzing the cultural shift toward sharing well-curated microbiome data. While they may not get to decide when their data are shared, early career researchers are often responsible for the experiments, data collection, data management, data formatting, and efforts needed to make experimental data reusable and publicly accessible. Because of the inherent data access and transparency challenges (19), meta-analyses can serve as important training for early career researchers to (i) understand the challenges in finding, accessing, and preparing data sets for analysis; (ii) recognize and appreciate data sets that are well curated and accessible; and (iii) thus, be motivated to prepare and share their own data. Hosting data competitions (e.g., DREAM challenges, http://dreamchallenges.org/) to encourage meta-analyses can showcase data sharing and reproducible science, while also providing benefits for participants (training, professional development, funding) and making important contributions to science (20-23). Further, data competitions can showcase how aggregating multiple standardized, well-curated microbiome data sets can enable new discoveries (24) and, more importantly, forge new paths for optimizing data collection and applying data standards earlier in the research workflow.
FIG 1 Examples of different types of metadata along the workflow from environmental samples to data and analysis tables. Submitting data to central repositories typically requires sample and preparation metadata. Sample metadata include information about when, where, and what sample was collected; preparation metadata describe how the sample was processed and turned into data products; data processing and feature metadata are generated by the repository or analysis software. Refer to Text S1 in the supplemental material for additional information.
(iii) Celebrate the value added by impactful meta-analyses. When exploring how to address the current grand challenges in microbiome science, novel approaches using large-scale data science applications are no longer a goal, but a necessity (25). For example, the increased application of machine learning to biological problems (26) has begun to expand how we think about data and data sharing (27). Researchers who published work using someone else's published data were once labeled "data parasites" (28,29). Now, the Pacific Symposium on Biocomputing celebrates impactful meta-analyses through its annual Research Parasite Awards (https://researchparasite.com/), which highlight important contributions of secondary analyses. Well-curated and FAIR microbiome data sets will be necessary for our field to explore applications of machine learning, automation, and secondary analyses (30,31).
While making data accessible is an important first step, data sets with missing information, erroneous values, or inconsistent formats hinder reuse. The workshop participants also discussed ways to incentivize efforts for sharing reusable data.
(iv) Establish comprehensive and coordinated data management plan(s) in collaboration with funders, publishers, and research service centers. While funders and publishers have moved toward encouraging open access to data (32), the details of their data sharing policies vary (33,34), and there are insufficient resources for enforcement (35). Data access remains a challenge for reproducible science (11,34,36,37). A comprehensive data management plan that includes community standards should be supported by both funders and publishers, which would provide structure and guidelines for data management best practices throughout the scientific research process (38). In addition, a partnership with research service centers, such as sequencing and other omics centers, can provide an effective strategy for revisiting data management plans earlier in the data life cycle, before experimental data are generated.
(v) Provide training for a variety of learning styles. Data management best practices, data standards, and ontologies are powerful tools in support of the FAIR data principles. However, even seasoned scientists are often overwhelmed by guidelines and intimidated by ontologies. It is not enough to create a comprehensive data management plan. Making this material accessible to the diversity of individuals who participate in the research process will be critical for effective adoption. A "quick start" guide is often a more approachable entry point for a data management novice. Extensive, searchable documentation is key for veterans who just need a refresher. To allow understanding and exploration of these data types, access can be provided through interfaces that allow programmatic access and visual representation to support researchers with and without computational expertise. Further, the use of various formats, such as tutorial videos, interactive webinars, and in-person events, supports a diversity of learning styles and enables bidirectional communication, which is critical for improving and updating training materials.
(vi) Establish a certification of "compliance." Despite the significant efforts already invested in defining minimum standards for microbiome data, such as the Minimum Information about any (x) Sequence (MIxS) packages (39), important work remains to ensure that the various standards and ontologies are interoperable and easily accessible to the research community. This entails working with researchers to identify metadata attributes that are valuable for data reuse within their respective communities, and defining community-specific benchmarks. Establishing a "certification of compliance" based on these benchmarks would enable designation of data sets ready for reuse, which encourages inclusion in follow-up studies and enhances their citation metrics (see section i above).
Understand and reduce barriers to data submission. In addition to encouraging a culture that shares microbiome data, the workshop participants also discussed infrastructure challenges that impede sharing. Current data submission processes to primary data repositories or analytic platforms can be difficult to navigate, creating barriers even for good data stewards. The workshop participants suggested the following as a starting point to understand and reduce barriers to data/metadata submission.
(i) Understand how communities are currently using MIxS packages. MIxS packages are available for a variety of sample types and environments, but comparing their usage across data repositories is challenging. Are certain domains using them more or less often than others? For example, identifying research areas (e.g., domain, geographic location) that rarely use MIxS packages, submit data with the minimal required fields, or use null values to represent more than one meaning (e.g., missing versus not collected) enables a more targeted approach to training and outreach.
(ii) Explore ways to harmonize data submission processes across platforms. Data submission portals, such as those involved in the International Nucleotide Sequence Database Collaboration (INSDC) (40), each have unique requirements and interfaces, some having more robust manuals or training documents than others. Enabling coordination through community standards and appropriate training materials will greatly enhance the availability of FAIR microbiome data.
(iii) Validate sample metadata with immediate, informative feedback. Using ontologies or MIxS packages requires the use of specific formats for sample metadata attributes. Most communities manage data in spreadsheets without use of controlled vocabularies or data standards, and reformatting entries is error-prone. Sample metadata validators reduce the barriers to reformatting spreadsheets by providing immediate, informative, and targeted feedback (41). Efficient and effective data submission has a significant impact on researchers' likelihood to share well-curated data.
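As an illustration of the kind of immediate, targeted feedback described above, the following is a minimal Python sketch of a spreadsheet-row validator. It is not the validator cited in the text or any repository's implementation. The required field names, the date and coordinate patterns, and the list of accepted missing-value terms are assumptions chosen for illustration, loosely modeled on MIxS-style attributes. The sketch also flags ambiguous null values (for example, "NA") so that "missing" and "not collected" are not conflated.

```python
import re

# Hypothetical rules for illustration only; real MIxS packages define their own
# fields, formats, and missing-value vocabulary.
REQUIRED_FIELDS = ["sample_name", "collection_date", "lat_lon", "env_broad_scale"]

# Explicit missing-value terms that state *why* a value is absent (assumed list).
ALLOWED_NULLS = {"not collected", "not provided", "not applicable", "restricted access"}

# Placeholders that hide the reason a value is absent.
AMBIGUOUS_NULLS = {"", "na", "n/a", "null", "none", "unknown", "-"}

# Assumed formats: ISO 8601 date (year, year-month, or full date) and a
# decimal-degree "latitude longitude" pair.
PATTERNS = {
    "collection_date": re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$"),
    "lat_lon": re.compile(r"^-?\d{1,2}(\.\d+)? -?\d{1,3}(\.\d+)?$"),
}


def validate_row(row_number, row):
    """Return a list of human-readable messages for one spreadsheet row."""
    messages = []
    for field in REQUIRED_FIELDS:
        raw = row.get(field)
        value = "" if raw is None else str(raw).strip()
        lowered = value.lower()
        if lowered in ALLOWED_NULLS:
            continue  # an explicit, unambiguous missing-value term is acceptable
        if lowered in AMBIGUOUS_NULLS:
            messages.append(
                f"row {row_number}: '{field}' is empty or ambiguous ('{value}'); "
                f"use one of {sorted(ALLOWED_NULLS)}"
            )
            continue
        pattern = PATTERNS.get(field)
        if pattern is not None and not pattern.match(value):
            messages.append(
                f"row {row_number}: '{field}' value '{value}' does not match "
                f"the expected format"
            )
    return messages


if __name__ == "__main__":
    rows = [
        {"sample_name": "S1", "collection_date": "2019-10-01",
         "lat_lon": "37.87 -122.25", "env_broad_scale": "terrestrial biome"},
        {"sample_name": "S2", "collection_date": "10/01/2019",
         "lat_lon": "NA", "env_broad_scale": "not collected"},
    ]
    for number, row in enumerate(rows, start=1):
        for message in validate_row(number, row):
            print(message)
```

Reporting one message per field, with the row number and the offending value, lets a submitter correct a spreadsheet in a single pass instead of resubmitting repeatedly.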

OUTCOMES FROM THE WORKING GROUPS
During the workshop, working groups were formed and tasked with identifying ways that the microbiome research community could achieve tangible progress to advance FAIR data principles (2). Three areas were targeted as initial steps that the NMDC team, in collaboration with the working groups, could promote to improve sharing and adoption of standards: (i) expanding and enhancing existing community-driven standards; (ii) understanding the current use of standards across research communities; and (iii) outlining a strategy for training and adoption of standards by the community.
Expanding and enhancing standards. In collaboration with the data standards community, the NMDC initiative is expanding and enhancing existing sample metadata standards for microbiome data. These efforts include closely collaborating with the GSC to convert the MIxS standard into machine-readable formats (i.e., JSON-Schema, Web Ontology Language), reviewing and adding new terms for the next MIxS standard release (version 6), and engaging with new stakeholders to address domain-specific needs. While the NMDC pilot initiative does not currently support the migration of other packages or checklists to the MIxS standard, the team does encourage community-driven development of standards for emerging subfields through the GSC, such as an agricultural-focused metadata standard (42). The NMDC team is collaborating with the Environment Ontology (EnvO) (43) group to assist with the development of new terms, new relationships between terms, and training on EnvO, and is working with the team behind the Genomes Online Database (GOLD) (44), a manually curated metadata resource at the DOE Joint Genome Institute. As a result of these collaborative efforts, the NMDC initiative has established a schema (https://microbiomedata.github.io/nmdc-metadata/) for mapping core standards and ontologies to streamline the integration of diverse sample metadata spreadsheet formats. The NMDC metadata schema relies on Biosample information (https://microbiomedata.github.io/nmdc-metadata/) for linking complementary data originating from the same physical sample (e.g., 16S and metagenomes), consistent with the National Center for Biotechnology Information (NCBI) and the European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI).
While there are challenges in linking other data types beyond sequence data (e.g., geochemical analyses), the use of an International Geo Sample Number through the System for Earth Sample Registration (https://www.geosamples.org/overview) registry would support data linkages to unique biosamples and is being adopted by the NMDC.
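The idea of linking complementary data through a shared sample identifier can be sketched in a few lines of Python. This is not the NMDC schema or its API. The record layout, field names, and identifiers below are hypothetical and serve only to show how a common biosample identifier lets 16S, metagenome, and non-sequence data from the same physical sample be found together.

```python
from collections import defaultdict

# Hypothetical records; identifiers and field names are invented for illustration.
data_objects = [
    {"biosample_id": "biosample:0001", "data_type": "16S amplicon", "file": "s1_16s.fastq.gz"},
    {"biosample_id": "biosample:0001", "data_type": "metagenome", "file": "s1_mg.fastq.gz"},
    {"biosample_id": "biosample:0002", "data_type": "metagenome", "file": "s2_mg.fastq.gz"},
    {"biosample_id": "biosample:0001", "data_type": "geochemistry", "file": "s1_geochem.csv"},
]


def group_by_biosample(records):
    """Map each biosample identifier to the data types recorded for it."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["biosample_id"]].append(record["data_type"])
    return dict(grouped)


def samples_with_all(grouped, required_types):
    """Return the biosample identifiers that have every required data type."""
    required = set(required_types)
    return sorted(
        sample_id for sample_id, types in grouped.items() if required <= set(types)
    )


if __name__ == "__main__":
    grouped = group_by_biosample(data_objects)
    print(samples_with_all(grouped, ["16S amplicon", "metagenome"]))
```

The same grouping works for non-sequence measurements only if they carry the same identifier, which is why a registered sample identifier matters for linking data types beyond sequence data.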
Use of standards across research communities. In collaboration with representatives from NCBI and EMBL-EBI, this working group gathered MIxS environmental package usage data from the Sequence Read Archive (SRA) and European Nucleotide Archive (ENA), respectively. Examining the overall number of samples registered with MIxS environmental packages reveals similar rates of adoption across SRA and ENA (Fig. S1 in the supplemental material) (counts represent distinct samples submitted to each respective repository, and mirrored data are not double counted). Further evaluation of whether the MIxS packages are being applied as expected (Table S1) shows noticeable differences between the two repositories (Fig. 2), which likely reflect distinct user communities. In ENA, usage of MIxS packages is higher across studies than across samples, suggesting that smaller studies are more regularly using MIxS. In SRA, human-associated packages are prominent, likely reflecting projects funded by the National Institutes of Health. While these statistics focus on baseline usage for MIxS packages, the use of other checklists/packages, such as the "default ENA checklist" or the "NCBI metagenome package," is not necessarily incorrect, nor does it indicate poorly curated sample metadata. Some non-MIxS checklists/packages provide extensive metadata descriptors (e.g., the ENA sewage checklist), which may be unique to certain types of samples. The NMDC team will use these data as a baseline for assessing metadata standards adoption across communities, and to inform areas for targeted training or feedback collection. The NMDC team, in collaboration with the GSC, will report updates on MIxS standards usage in ENA and SRA, and incorporate this information into forthcoming training modules.
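The difference between counting by sample and counting by study can be made concrete with a short Python sketch. It follows the counting rule stated in this report, in which a study is counted once for each unique metagenome organism name among its samples. It does not reproduce the actual ENA or SRA queries, and the sample records and field names are hypothetical.

```python
from collections import Counter

# Hypothetical sample registrations; accessions and values are invented.
samples = [
    {"sample": "S1", "study": "P1", "organism": "soil metagenome"},
    {"sample": "S2", "study": "P1", "organism": "soil metagenome"},
    {"sample": "S3", "study": "P1", "organism": "air metagenome"},
    {"sample": "S4", "study": "P2", "organism": "air metagenome"},
]


def count_samples_per_organism(records):
    """Count distinct samples under each metagenome organism name."""
    pairs = {(r["sample"], r["organism"]) for r in records}
    return Counter(organism for _, organism in pairs)


def count_studies_per_organism(records):
    """Count each study once per unique organism name it contains.

    A study whose samples carry x unique organism names contributes x
    to the overall total.
    """
    pairs = {(r["study"], r["organism"]) for r in records}
    return Counter(organism for _, organism in pairs)


if __name__ == "__main__":
    print(count_samples_per_organism(samples))
    print(count_studies_per_organism(samples))
```

In this toy input there are two distinct studies, but the per-organism study counts sum to three because study P1 contains samples with two organism names, which is why study totals under this rule can exceed the number of distinct studies.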
Training and adoption of standards by the community. In collaboration with international partners affiliated with the GO FAIR initiative (https://www.go-fair.org/), the NMDC team recently established the FAIR Microbiome Implementation Network, the first coordinated effort focused on FAIR data for the microbiome community (https://www.go-fair.org/implementation-networks/overview/fair-microbiome/). The Microbiome Implementation Network aims to promote discovery and reuse of microbiome data by formalizing core and domain-specific microbiome ontologies and establishing training on the NMDC data models. In addition, the NMDC team is building out a modular training strategy, in collaboration with the GSC and OBO Foundry, that will cover basic sample metadata, such as domain-specific characteristics (e.g., MIxS packages) and FAIR data best practices. As a high-level summary, this working group drafted Introduction to Metadata and Ontologies: Everything You Always Wanted to Know About Metadata and Ontologies (But Were Afraid to Ask) (Text S1).
Conclusions. The foundation for reusable data has been created by the standards community and data sharing is increasing throughout the microbiome community, but there are still barriers to making microbiome data truly FAIR. Workshop participants highlighted the need to encourage data sharing through changes in the incentive structure and research culture. They also stated the importance of providing researchers with sufficient tools, training, and infrastructure to lower the barriers to sharing well-curated, reusable data. The working groups provided valuable contributions to the NMDC initiative, which have fed into the development of the NMDC metadata schema linked to existing standards, evaluation metrics on the usage of the GSC MIxS environmental packages for targeted activities, and the design of training packages to complement available data standards. The NMDC pilot initiative will continue to work across the standards and microbiome research communities to reduce barriers to data sharing, recognize data contributions, and make microbiome data FAIR.
FIG 2 [...] Table S1 for details) in order to inform how the MIxS packages were used across communities. The standards were evaluated as follows: (i) "Expected MIxS checklist/package," the chosen checklist/package used for sample registration was the most appropriate MIxS option based on the metagenome organism name provided (Table S1); (ii) "Other checklist/package," the chosen checklist/package used for sample registration may not have been the most appropriate MIxS checklist/package or followed an alternative set of standards; or (iii) "ENA default checklist or NCBI metagenome package," the chosen checklist/package used for sample registration was the ENA/NCBI defined minimum for samples/metagenome samples and did not use a specific sample metadata standard.
Only public samples and their associated studies for raw read submissions of metagenomic and amplicon data (MIMS and MIMARKs survey) to ENA or SRA were included in the respective counts (counts reflect only submitted data to each repository and exclude mirrored data). Associated studies were counted once for each unique metagenome organism name represented in the study, and hence may have been counted more than once (i.e., a study associated with samples assigned with x unique metagenome organism names may be counted x times). Queries were run in fall 2020. ENA queries used the ENA Portal API with the respective taxon criteria and checklist ID (Table S1) (e.g., ENA sample counts with expected use of the Air MIxS checklist (https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=(sample_accession= %22SAMEA*%22%20OR%20sample_accession=%22ERS*%22)%20AND%20(tax_eq(655179)%20OR%20tax_eq(1708701)%20OR%20tax_eq (1643811)