ecocomDP: A flexible data design pattern for ecological community survey data

The idea of harmonizing data is not new. Decades of amassing data in databases according to community standards both locally and globally have been more successful for some research domains than others. It is particularly difficult to harmonize data across studies where sampling protocols vary greatly and complex environmental conditions need to be understood to apply analytical methods correctly. However, a body of long-term ecological community observations is increasingly becoming publicly available and has been used in important studies. Here, we discuss an approach to preparing harmonized community survey data by an environmental data repository, in collaboration with a national observatory. The workflow framework and repository infrastructure are used to create a decentralized, asynchronous model to reformat data without altering original data through cleaning or aggregation, while retaining metadata about sampling methods and provenance, and enabling programmatic data access. This approach does not create another data ‘silo’ but will allow the repository to contribute subsets of available data to a variety of different analysis-ready data preparation efforts. With certain limitations (e.g., changes to the sampling protocol over time), data updates and downstream processing may be completely automated. In addition to supporting reuse of community observation data by synthesis science, a goal for this harmonization and workflow effort is to contribute these datasets to the Global Biodiversity Information Facility (GBIF) to increase the data’s discovery and use.


Introduction
Primary environmental research data are being made publicly available based on two main premises. First, the practice will make research more transparent and back up results, and second, it will enable reusing the data in more than one research project (Heffernan et al., 2014). Specifically, the combination of many local-scale research results may reveal broader patterns, drivers, trajectories, and predictions of ecological systems, particularly in response to the current rapid and unprecedented environmental changes (Levy et al., 2014). Many research communities have recognized this potential and data repositories like the Environmental Data Initiative (EDI, https://EnvironmentalDataInitiative.org) hold thousands of diverse primary datasets from research studies in the ecological sciences. However, these data, although publicly available, still remain mostly locked away by their varied sampling methodologies, idiosyncratic formatting and nonstandardized terminology. Furthermore, these data can only be reused when the environmental context in which they were collected is fully understood and accounted for in the analytical approaches (Welti et al., 2021).
Given this situation, primary research datasets in ecology are often not easily combined or synthesized. Comprehending sampling and environmental conditions, resolving terminology, formatting, and aggregating data generally takes a large portion of research time (Lohr, 2014; Press, 2016; Wickham, 2014). A process of pre-harmonizing has been successful for some types of data in large community efforts. In some cases, the original investigators transform their data into a community-vetted, prescribed format using controlled terminology, such as Darwin Core-based contributions to the Global Biodiversity Information Facility (GBIF, 2021), or the observation model used by the Consortium of Universities for the Advancement of Hydrologic Science (CUAHSI, Tarboton et al., 2008). In other cases, data collection and formatting efforts are coordinated from the start (e.g., Baldocchi et al., 2001; Duffy et al., 2019; Fraser et al., 2013; Leray and Knowlton, 2015; Mulholland et al., 2001; Stokstad, 2011). Prescribed formats are more easily achieved for some types of regular monitoring (e.g., sensor data), and the concept of Analysis-Ready Data (ARD) is becoming prominent in the earth-observing field to reduce the burden of pre-processing on users (Dwyer et al., 2018). However, the idiosyncratic methods for collecting organismal data preclude most efforts to apply any single standard to spatial or taxonomic concepts, and standard data formats rarely find community acceptance because most cannot accurately capture complex environmental sampling conditions or other constraints particular to each research program (Kissling et al., 2018; Reichman et al., 2011). Furthermore, in many cases, incentives for the original researchers to transform their data are lacking. Ultimately, these barriers to synthesis of datasets inhibit collaboration and slow down potential scientific insights (Evans, 2016; Poisot et al., 2019).
Today, complex ecological datasets are becoming available from single locations where observations were collected consistently over long time periods. If combined appropriately, with the diversity in their sampling approaches overcome, these datasets become indispensable to understanding trends, testing ecological theory, and predicting changes in the numerous ecosystem services beneficial to society (Orth et al., 2020; Pereira et al., 2013). Research networks like the National Science Foundation's (NSF) Long Term Ecological Research (LTER) Network have met the expectation that their data are available in public repositories and permanently archived (Mayer, 2020; Servilla et al., 2016). These primary datasets are especially valuable and are increasingly being synthesized and reanalyzed to generate new knowledge (Collins et al., 2018; Dornelas et al., 2014; Record et al., 2021). This increased third-party use shows that datasets are now meeting some of the FAIR principles (Wilkinson et al., 2016), in that they are "Findable" and "Accessible". However, many would benefit from improvements to their interoperability and reusability, the "IR" of FAIR.
Here, we focus specifically on ecological community observation data and the collaboration among the Environmental Data Initiative (EDI) repository managers, data scientists from the National Ecological Observatory Network (NEON), and community ecologists from the LTER Network to recombine such data for reanalysis and improve their reusability. The need for this effort was prompted by community ecology synthesis working groups who noted that because pertinent datasets are formatted and described in a manner most appropriate to their unique original research objectives, they are not easily used in synthesis studies without major harmonization efforts. Multiple working groups typically use subsets of the same data independently and develop their own investigation-specific data cleaning, aggregation, and formatting procedures that do not translate across projects. This re-wrangling of datasets effectively duplicates large amounts of effort and impedes synthesis science insights, pointing to a need for a harmonization system for data collected at particular levels of biological organization (e.g., population, community, ecosystem; Record et al., 2021).
The harmonized format we present here is agnostic to the research question, adds specific metadata for improved discovery and reusability, and accommodates different types of measurements (e.g., count, percent cover, biomass), taxonomic resolutions, and nesting of sampling designs over space and time. Given use case requirements, the repository framework, and the need to emphasize the importance of sampling context, this model and workflow framework appeared to be the best compromise, and we look forward to feedback from users (e.g., https://github.com/EDIorg/ecocomDP/issues). Here, we report on the model itself, a library in the R language to assist with creation, access and exploration, metrics of the model's use to date, plus compatibility with a widely used biodiversity format, the Darwin Core Archive (DwC-A).

Methods
The project was carried out in three phases: Design, Implementation, and Maintenance. Design captures essential attributes of a science domain, considers past and present standardization efforts, and potential linkages to external authoritative systems to disambiguate meaning. The design phase leveraged the activities of science synthesis working groups and data management expertise to identify accurate and persistent data patterns. Implementation is accomplished through conversion of archived legacy data by data contributors or by EDI's data curation team, and is supported by data pattern documentation, best practices guides, and software libraries. Maintenance is achieved through programmatic workflows that automatically run when source data packages are updated.

Learning from existing approaches
We identified several ongoing or completed harmonization efforts using existing community observations and including datasets available from the EDI repository. All of these efforts used similar datasets from multiple sources, and all are one-time efforts with minimal plans for maintenance or updating harmonized data. In many cases, the resulting harmonized datasets were used to answer specific research questions and were then further changed or extended for additional uses. The abstract views of these datasets were potential models for general harmonization, and three in particular exemplify the need for a more broadly usable data model for observations, one that is also capable of structuring spatial information and taxonomy: 1) Popler, a database and R-libraries designed to analyze LTER population time series (Compagnoni et al., 2020); 2) CESTES, a global database for metacommunity ecology (Jeliazkov et al., 2020); and 3) BioTime, a global database of species abundances through time (Dornelas et al., 2014). In addition to the three research-focused models, we also considered the Darwin Core Archive (DwC-A) format used by the GBIF (Wieczorek et al., 2012). The GBIF system is arguably the largest aggregator of organismal occurrence and related data, holding over 1.5 billion records of species occurrences, taxonomic checklists, and sampling event or sample data from over 1500 institutions.
All three research-focused models implemented table structures and measurement types which do not accommodate the wide variety of raw data that capture complex environmental conditions during sampling and which are available in the original dataset. Only one (Popler) allows spatial nesting and taxon authority referencing. None of these databases accommodates references to external measurement dictionaries or ontologies. For all, access is somewhat limited by the choices of storage (i.e., Excel, or relational databases which require a custom interface or code). Temporal sampling is generally limited to observation dates, and CESTES includes text fields to describe nuances of temporal or other sampling. Compiled harmonization efforts such as these are highly valuable, as they represent considerable scientific knowledge and hours (possibly days) of thorough, manual checking and reformatting. Computing cannot supplant that scientific knowledge, but a comprehensive intermediate format can streamline some of the reformatting tasks.
GBIF's DwC-A came closest to meeting the requirements for broad reuse; these are self-contained datasets composed of text tables plus a file describing table organization. Table columns are labeled using the Darwin Core vocabulary (DwC) for indexing. A large fraction of GBIF records are simple organism occurrences; however, DwC-A extensions allow for inclusion of other aspects such as contributor-defined measurements (e.g., abundance or cover), which are common for ecosystem studies of the type housed by EDI and data products published by NEON. The DwC also includes fields for external taxon references. Missing from the DwC-A were explicit site nesting and external measurement references (see Discussion). Interestingly, some of the structures created by scientists for their own synthesis can be strikingly similar to DwC-A tables (Walter et al., 2021) with features added (e.g., the aforementioned nested sampling sites).

Identifying requirements
Consistent with the goals to support a synthesis workflow that will reduce data preparation efforts for answering new research questions and minimize impact on data producers, we developed requirements based on three main considerations (see also discussion in Sholler et al., 2019): 1) the expectations of data contributors and the original data; 2) the repository framework; and 3) the needs of the data reusers. The scope is defined as ecological community data, in which observations are abundances of co-occurring groups of organisms in an area, as opposed to population or demographic data (where observations are made at the level of individuals within a species). We recognize that some original data will contain both types of information, and ideally, while the harmonized intermediate may not contain the original population-level information, the framework should make that original readily available. Our short name for a model for the flexible intermediate for ecological community data is "ecocomDP", for "ecological community data design pattern".

Data contributors and the original data.
Original data are available in the EDI repository as text tables (usually ASCII) formatted to best suit the original research questions, with collection methods that are adapted to the environment and community of interest (e.g., aquatic, forest, grassland). In many cases the datasets are updated regularly. The data contributors (data managers or scientists) are intimately familiar with local conditions, which is vital to creating high-quality data packages. As mentioned above, there is no incentive for the data contributor to format their data in any other way, and so it was essential that the harmonization process did not interfere with a data contributor's formatting for their original research questions. The challenges presented by the data themselves included the large number of different parameters measured (e.g., number of individuals, cover, biomass, catch per unit effort), taxonomic resolution and consistency (e.g., family, genus, species), environmental or experimental conditions essential to interpretation (e.g., fertilization, harvest, simulated disturbance), the nesting of sampling units over space (e.g., site, transect, plot, subplot, depth) and time (e.g., date, season, year), plus changes to the sampling protocol over time (e.g., the addition of new sampling locations or changes in the taxonomic resolution of sampling).
Additionally, NEON publishes a variety of data products on its portal that provide biodiversity data on sentinel taxonomic groups from 81 field sites located across the United States (https://data.neonscience.org/). Many of these data products were designed with input from and for use by population and community ecologists (Thorpe et al., 2016; Utz et al., 2013). These products offer organismal data that can be mapped to the ecocomDP model, used in research, and derived data packages can then be archived in the EDI repository (e.g., Li et al., 2021).

The repository framework.
In the EDI repository the granule is a "data package", composed primarily of a metadata record (Ecological Metadata Language, EML) and one or more data entities (i.e., ASCII tables). The repository supports metadata and data immutability, revision control, DOI assignment and event subscriptions to track updates to data. Repository staff, although experienced data specialists, lack specific local knowledge for every dataset.

The data users.
Aside from a standard data format and nomenclature, scientists attempting to use these existing data were mostly concerned with data discovery, i.e., the ability to identify data that best suited their needs in a repository. A few types of searches were common to all reuse (e.g., number of taxonomic units in study, duration of study and frequency of sampling, and the size and arrangement of sampling areas), and so needed to be supported. Those who are reformatting data to this model must understand the original data well, and so the model's associated code should include checks for certain features, like uniqueness and typing.
Our solution to these requirements is the development of a flexible domain-specific intermediate model in a lightweight, distributed workflow framework, in which data repositories handle some of the preparation work typically done by end users. The original data are not aggregated or otherwise changed, only normalized to a standard format that can be more readily accessed and used. This reformatting is accomplished by automated workflows which allow data products to be repeatedly synchronized when original data are updated. This process increases the value of the data by implementing standard quality checks and can provide feedback to contributors to inform them of aspects of data and metadata that are the most important during reuse, and of arrangement or presentation choices that function well.

Implementation
During the implementation phase, pertinent datasets in the EDI and NEON repositories were identified. For each EDI dataset, an R script was developed to convert the data into the ecocomDP model. This effort incrementally led to tuning of the model itself and associated documentation. It also served to outline necessary functions for building data packages and accessing NEON data. Lastly, to test both the data format and the entire workflow, we used ecocomDP formatted data to generate DwC-A for submission to GBIF. This last step has the added benefit of making EDI holdings available for GBIF users. Fig. 1 depicts the general workflow which was implemented and will be followed for updates. The "level" designations and terminology are adapted from NASA's Earth Observing System Data and Information System (EOSDIS) (Price et al., 1994) with L0 being the original data; L1 is the same data transformed to the ecocomDP model, and made available as a data package in the EDI repository. L2 has been further transformed or aggregated as needed for a particular synthesis research question or other use (such as a DwC-A).

Maintenance
The maintenance phase focuses on developing robust R scripts for continued conversion when source data (L0) are updated and converting new datasets as they are submitted to the EDI repository. Maintenance of the R package includes adaptations for the NEON endpoints as these evolve. The EDI infrastructure supports the execution of external workflows through its API and event notification service to automate routine data management tasks. Upload of an L0 revision triggers execution of its conversion script. The system is ideal for a series of data packages, as it simplifies and accelerates creation of continuously updated synthetic data packages (Servilla et al., 2016).

The ecocomDP data model
The model (Fig. 2) is composed of eight related data tables in an extended star schema (Seyed-Abbassi and Madesi, 2015) and implements database-style principles of foreign keys and normalization, along with attribute/value style tables to accommodate a wide range of measurements. Three data tables are required: the central "observation" table and two supporting dimensional tables, "sampling_location", and "taxon". The "dataset_summary" is automatically created and populated based on the observations. The three primary tables are each extended with an optional table for ancillary information to accommodate additional measurements important to understand and use specific sampling conditions for analysis. The optional eighth table maps variables to external dictionaries.
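To make the table relationships concrete, the following is a minimal sketch of the three required tables, written in Python with the standard-library sqlite3 module purely for illustration. It is independent of the ecocomDP R library. The column names are assumptions chosen for readability and are not the published field list; the authoritative definitions are those shown in Fig. 2 and in the project documentation.

```python
import sqlite3

# Illustrative schema only: table and column names are assumptions,
# not the authoritative ecocomDP field definitions.
SCHEMA = """
CREATE TABLE location (
    location_id        TEXT PRIMARY KEY,
    location_name      TEXT,
    parent_location_id TEXT REFERENCES location(location_id)  -- self-reference for nesting
);
CREATE TABLE taxon (
    taxon_id   TEXT PRIMARY KEY,
    taxon_rank TEXT,
    taxon_name TEXT NOT NULL
);
CREATE TABLE observation (
    observation_id TEXT PRIMARY KEY,
    location_id    TEXT NOT NULL REFERENCES location(location_id),
    taxon_id       TEXT NOT NULL REFERENCES taxon(taxon_id),
    datetime       TEXT NOT NULL,
    variable_name  TEXT NOT NULL,  -- the attribute, e.g. 'count' or 'percent_cover'
    value          REAL,           -- the value recorded for that attribute
    unit           TEXT
);
"""


def build_database():
    """Create an in-memory database with foreign-key enforcement switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce keys by default
    conn.executescript(SCHEMA)
    return conn
```

With this layout, a new measurement type needs no new column: it is simply another value of variable_name in the central table, which is the attribute/value flexibility described above.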

Observations
The central "fact" table holds the actual ecological community observations (Fig. 2, e.g., abundances or densities of a taxon).

Locations
The nesting of sampling locations (e.g., plots within transects within areas, or depths or heights of a profile) is accomplished using a self-referencing table, in which a location may have a 'parent' which is itself a sampling location in the same table. This mechanism allows observations to be associated with a location at any level, and observations can be aggregated under groups of locations.
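The parent mechanism can be illustrated with a short Python sketch. The location identifiers and the helper names (ancestry, total_under) are hypothetical and serve only to show how a self-referencing table lets values recorded at any level be grouped under a higher-level location.

```python
# Hypothetical location rows: each location id maps to its parent id (None = top level).
PARENT = {
    "area_N": None,
    "transect_1": "area_N",
    "plot_1a": "transect_1",
    "plot_1b": "transect_1",
    "transect_2": "area_N",
    "plot_2a": "transect_2",
}


def ancestry(location_id, parent=PARENT):
    """Return the chain from a location up to its top-level ancestor."""
    chain, seen = [], set()
    current = location_id
    while current is not None:
        if current in seen:  # guard against a malformed, circular table
            raise ValueError(f"cycle in location nesting at {current!r}")
        seen.add(current)
        chain.append(current)
        current = parent[current]
    return chain


def total_under(group_id, values, parent=PARENT):
    """Sum values recorded at group_id or at any location nested beneath it."""
    return sum(v for loc, v in values.items() if group_id in ancestry(loc, parent))
```

For example, ancestry("plot_1a") walks from the plot through its transect to the area, and total_under("transect_1", counts) groups the counts of both plots on that transect. Whether summing is appropriate depends on the measurement and sampling design, which is a judgment left to the analyst.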

Taxonomy
The taxonomy table does not attempt to describe all aspects of a taxon, but rather holds basic information such as name and rank (e.g., family, genus, species), with the option to refer to a taxonomic name authority system. Although a taxonomic name may be reused in different kingdoms and a hierarchy may be required for full understanding, the model deliberately does not encode taxonomic hierarchies, as these are somewhat fluid and no single system applies to all organisms. Instead, that information can be held by the authority system, and accessed with readily available software tools, or it can be recorded in the taxon_ancillary table.

Summary table
A one-row table summarizes information in the Observation, Location, and Taxonomy tables. It represents the information most frequently needed by scientists as they evaluate a dataset for use, mainly to understand the taxonomic, temporal, and spatial coverage.
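As a rough illustration of how such a one-row summary can be derived from the other tables, the Python sketch below computes temporal, taxonomic, and spatial coverage. The output keys are hypothetical and do not reproduce the actual dataset_summary fields; dates are assumed to be ISO 8601 strings beginning with a four-digit year.

```python
def summarize(observations, locations):
    """Derive a one-row coverage summary from observation and location rows.

    observations: list of dicts with 'datetime', 'taxon_id', 'location_id'
    locations:    dict of location_id -> {'latitude': ..., 'longitude': ...}
    """
    # Assumes ISO 8601 date strings, so the first four characters are the year.
    years = sorted({int(o["datetime"][:4]) for o in observations})
    used = {o["location_id"] for o in observations}
    lats = [locations[loc]["latitude"] for loc in used]
    lons = [locations[loc]["longitude"] for loc in used]
    return {
        "first_year": years[0],
        "last_year": years[-1],
        "number_of_years_sampled": len(years),
        "number_of_taxa": len({o["taxon_id"] for o in observations}),
        # (south, west, north, east) corners of the sampled locations
        "bounding_box": (min(lats), min(lons), max(lats), max(lons)),
    }
```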

Ancillary tables
Each primary table has an optional table for additional information. Also designed as attribute/value, these ancillary tables provide a place for environmental conditions (e.g., air temperature, observation uncertainties), organism characteristics (e.g., biomass, traits, morphotype, phylogenetic information), or experimental conditions (e.g., fertilization). Date fields are included for taxon_ancillary and location_ancillary as these may have been recorded at different times than the primary observation. The observation_ancillary table might contain specific sampling-event data, such as volume cleared by a plankton tow or single depth (when not part of a profile). These are data typically included with the community observation data to ensure that data users are aware of conditions and can judiciously subset and aggregate original observations.

Accommodating measurement term disambiguation
An optional "variable_mapping" table allows unambiguous term definition using external vocabularies and ontologies by documenting the system used and a unique identifier for the term (i.e., a URI or URL). It is intended for the content of fields titled 'variable_name' in the observation and optional ancillary tables.

Supporting code
Fig. 1.
Level 1 (L1) data packages (also in the repository) are formatted according to a predefined model, in this case, ecocomDP. Researchers are able to use L1 as inputs with its code to speed their analyses and generate Level 2 (L2) data. An archive of the L2 data package in the same repository is recommended. Data sources and sinks may be a repository (e.g., EDI), another data provider (e.g., NEON), or an aggregator (e.g., GBIF).

We developed an open-source code library in the R statistical language to support common tasks for creating, checking and using ecocomDP data packages. To assist with conversion to ecocomDP from EML-described data packages, R functions are available to harvest EML metadata from the L0 dataset preserving essential high-level elements (e.g., abstract, methods and personnel), with additional text and EML elements to clarify that this (L1) is a derived data product: a provenance link to the L0 dataset, additional abstract and title text, and keywords (e.g., "ecocomDP"). L0 variable names and descriptions are transferred to coded value lists in L1 EML. To promote discovery, some ecocomDP table content is elevated to metadata, such as full taxonomic hierarchies including common names and external identifiers, and EML annotations created from the variable_mapping table. The R library also supports quality control to ensure that tables are model-compliant, confirming presence of required fields, referential integrity between tables, and uniqueness of identifiers. Taxon IDs are added with the taxize R library (Chamberlain et al., 2020).
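The three kinds of quality checks named in this section (required fields, uniqueness of identifiers, and referential integrity) can be sketched as follows. This is an illustrative Python re-expression under assumed column names, not the validation code of the ecocomDP R library; the REQUIRED mapping and the validate function are hypothetical.

```python
# Assumed required columns per table; the real model's list may differ.
REQUIRED = {
    "observation": ["observation_id", "location_id", "taxon_id",
                    "datetime", "variable_name", "value"],
    "location": ["location_id"],
    "taxon": ["taxon_id", "taxon_name"],
}


def validate(tables):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for name, columns in REQUIRED.items():
        rows = tables.get(name)
        if rows is None:
            problems.append(f"missing required table: {name}")
            continue
        # 1) Required fields must be present and non-empty.
        for i, row in enumerate(rows):
            for col in columns:
                if row.get(col) in (None, ""):
                    problems.append(f"{name} row {i}: missing {col}")
        # 2) Identifiers must be unique within their table.
        ids = [row.get(f"{name}_id") for row in rows]
        for dup in sorted({x for x in ids if x is not None and ids.count(x) > 1}):
            problems.append(f"{name}: duplicate id {dup!r}")
    # 3) Foreign keys in observation must resolve to a row in the referenced table.
    for key, target in (("location_id", "location"), ("taxon_id", "taxon")):
        known = {row.get(key) for row in tables.get(target, [])}
        for i, row in enumerate(tables.get("observation", [])):
            if row.get(key) is not None and row[key] not in known:
                problems.append(f"observation row {i}: unknown {key} {row[key]!r}")
    return problems
```

Returning a list of messages rather than stopping at the first failure is a design choice of this sketch: it lets a person converting a dataset see every issue in one pass.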
The ecocomDP R library provides functions to search data and metadata on free text, taxonomic names, geographic area, and summary features (from the dataset_summary table, Fig. 2), which is improved over typical repository searches on metadata alone. Analysis workflows are supported through functionality for programmatically accessing and reading the data and metadata; merging datasets; transposing ecocomDP tables into the "wide" format (e.g., each column representing a taxon or variable) preferred by many scientists; and for creating plots of basic features to evaluate fitness for use (see below). As we have already stated, preparing data for analysis can still be complex, and these tools will not replace ecological understanding of fitness for use of data in a particular analysis. However, they will help streamline the process considerably.
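The transposition from the long attribute/value layout to a "wide" layout can be sketched in a few lines of Python. This is not the R library's implementation: the function name to_wide is hypothetical, a single measured variable is assumed, and a taxon that was not recorded at a location and date is left as None rather than zero, because absence of a record does not by itself establish absence of the taxon.

```python
def to_wide(observations):
    """Pivot long observation rows into one row per (location, datetime).

    Each taxon becomes a column. Duplicate (location, datetime, taxon)
    combinations raise an error instead of being silently aggregated.
    """
    cells = {}
    for o in observations:
        key = (o["location_id"], o["datetime"])
        row = cells.setdefault(key, {})
        if o["taxon_id"] in row:
            raise ValueError(f"duplicate observation for {key} and {o['taxon_id']!r}")
        row[o["taxon_id"]] = o["value"]
    taxa = sorted({o["taxon_id"] for o in observations})
    return [
        {"location_id": loc, "datetime": dt, **{t: row.get(t) for t in taxa}}
        for (loc, dt), row in sorted(cells.items())
    ]
```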

Using the ecocomDP format
The R library described above was developed and tested as we processed original, incoming data through the Fig. 1 workflow, first converting them to the ecocomDP model (L1; Fig. 1, Step 1), followed by a) plotting general characteristics as might be required by synthesis and b) conversion to publication-ready DwC-A (an example of L2). Those processes and summary metrics from conversions are detailed here. As incoming datasets are nearly always unique, the conversion to the ecocomDP format (L1) requires an understanding of the study design, measurement methods and data types; the R library helps to ensure full understanding and appropriate use, and accelerates the technical steps. Because all L1 packages share a standard format, further processing can be streamlined and often automated.

Fig. 2.
The ecocomDP model shown with relational database notation for foreign keys and relationships (e.g., lines ending in crows-foot indicate 1:many relationships). Semi-transparent tables are optional. Medium green fields in each table are the primary key. Yellow/hashed fields are a combined unique constraint. IDs (suffixed "_id") must be unique within a table, as in a relational database. Full documentation (e.g., optional fields and definitions) can be found in the Git repository (EDI, n.d.). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Converting original data to ecocomDP (L0 to L1)
To date, we have created 70 ecocomDP data packages from EDI holdings of LTER, Long Term Research in Environmental Biology (LTREB), and other projects. Our approach to conversion of these original (L0) datasets is to assemble each package's data into a single wide table, which helps maintain referential integrity in the derived tables. Issues arising at this step are best resolved in collaboration with the original data creators and may provide valuable feedback to them. The next step is to extract data from the L0-wide table for the core ecocomDP tables (i.e., observation, taxon, location; Fig. 2) followed by the optional ancillary tables. The ecocomDP R library supports common steps for scripting the entire process, including programmatic reading of the L0 package. We recommend scripting this entire step for two reasons: the script serves as documentation of the process, and if the L0 data package is updated (e.g., new data added), subsequent conversions can be automated.
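The extraction of core tables from a single wide table can be sketched as below. The input column names (site, date, species, count), the generated identifier scheme, and the function name split_wide_table are all hypothetical; real L0 tables differ from package to package, which is why each conversion is scripted individually.

```python
def split_wide_table(rows):
    """Split wide L0-style rows into observation, location, and taxon tables.

    Because every derived row comes from the same wide table, each
    location_id and taxon_id used in an observation is guaranteed to exist
    in its dimension table.
    """
    locations, taxa, observations = {}, {}, []
    for row in rows:
        # Reuse the id if this site or species was seen before, else mint a new one.
        loc_id = locations.setdefault(row["site"], f"loc_{len(locations) + 1}")
        tax_id = taxa.setdefault(row["species"], f"tax_{len(taxa) + 1}")
        observations.append({
            "observation_id": f"obs_{len(observations) + 1}",
            "location_id": loc_id,
            "taxon_id": tax_id,
            "datetime": row["date"],
            "variable_name": "count",
            "value": row["count"],  # copied unchanged: no cleaning or aggregation
            "unit": "number",
        })
    return {
        "observation": observations,
        "location": [{"location_id": i, "location_name": n} for n, i in locations.items()],
        "taxon": [{"taxon_id": i, "taxon_name": n} for n, i in taxa.items()],
    }
```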
When the original data format is well controlled, reformatting to the ecocomDP model is more straightforward. NEON exposes its corpus of datasets of organism data for integration with EDI's holdings, using code created by NEON with scientists from the NEON Science Summit Meeting (Boulder, CO, 2019). R functions pull data from the NEON share point using the neonUtilities R library and convert them from a NEON data product to the ecocomDP data pattern. As of this writing, functions are available in the ecocomDP R library to deliver data for NEON terrestrial organisms ( As NEON data products are continent-wide, these were divided into individual field sites for analysis to make them spatially compatible with EDI holdings. For both NEON and EDI data, summary information, identifiers and DOIs if applicable can be found in the dataset (O'Brien et al., 2021). Spatial, temporal, and taxonomic coverage for a total of 530 NEON and EDI datasets are shown in Fig. 3, comprising over nine million observations. The NEON data are broken out by sites (83 total sites) as that unit was more similar in structure to the data packages available from EDI, which come from site-based research groups such as the LTER Network. Data in harmonized format clearly illustrate the differences between the data collection strategies of NEON and the EDI holdings from individual place-based sampling programs. NEON's targeted biological collections focus on nine groups of species (by taxonomic or other attributes) over relatively narrow spatial extents within sites (but a large spatial extent among sites), and over shorter, evenly spaced time periods (collections began in 2013 with full operations in 2019). Coverage plotted from EDI data holdings, on the other hand, shows a wide diversity for all three coverage elements and reflects the diversity of research programs. Durations range from a few years to over six decades, with somewhat less even sampling, a broader spatial extent (up to 10^5 km^2), and many general taxonomic groups represented.

Fig. 3.
Temporal, spatial and taxonomic coverage of datasets available in the ecocomDP model. Data source: Black, EDI; Gray, NEON. A) Temporal coverage (years), B) Temporal evenness (years), C) Spatial extent, D) group. An asterisk indicates that two groups (Tick, Mosquito) are specifically targeted by NEON. When these taxa occur in EDI datasets, they are plotted here with Arthropods.

Working with ecocomDP formatted (L1) datasets
The principles of a central observation table linked to additional information and the attribute/value pattern that underlies the ecocomDP model are common approaches for managing heterogeneous data due to their flexibility and storage efficiency (Wieczorek et al., 2012). We used the formatted data to demonstrate two outcomes: first, the ease of creating common plots for scientific evaluation, and second, a mechanism to create DwC-A for GBIF.
As with the coverage plots (Fig. 3), a common format enables other common plots to be created. The ecocomDP R library supports plotting of features commonly requested by scientists to evaluate a dataset's suitability for use. Fig. 4 shows four aspects: number of taxa over time, spatio-temporal sampling effort, species accumulation, and species shared among sites. These examples, plotted from L1 data, represent features of interest to synthesis working groups and are based on their input (Record et al., 2021; Walter et al., 2021). Community ecologists often use data on taxon presence or abundance to generate evidence that quantifies the strength of species interactions such as competition, predation, or mutualism, or responses to shared environmental conditions. For example, Record et al. (2021) used the L1 output to explore spatial and temporal representativeness of several LTER datasets to assess the suitability of LTER community datasets for addressing questions of how spatiotemporal scales influence insights from metacommunity analyses. Likewise, Jarzyna et al. (2021) used the L1 output of NEON data to explore temporal dynamics in animal communities at a continental scale. Walter et al. (2021) synthesized the spatial synchrony of biodiversity across 20 marine and terrestrial communities. The ability to quickly create the common plots shown in Fig. 4 for many datasets was instrumental in streamlining the data-discovery phase of each of these syntheses.
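Because every L1 dataset shares the same columns, the quantities behind such plots can be computed by one routine for all datasets. As a minimal sketch (in Python, not the R library's plotting code), the function below computes a species accumulation series over sampling dates; the function name accumulation is hypothetical, accumulating over dates rather than over sampling effort is a simplification, and ISO 8601 date strings are assumed so that text sorting equals chronological sorting.

```python
def accumulation(observations):
    """Return (date, cumulative number of distinct taxa) in chronological order."""
    taxa_by_date = {}
    for o in observations:
        taxa_by_date.setdefault(o["datetime"], set()).add(o["taxon_id"])
    seen, curve = set(), []
    for dt in sorted(taxa_by_date):  # ISO 8601 strings sort chronologically
        seen |= taxa_by_date[dt]
        curve.append((dt, len(seen)))
    return curve
```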
In addition to supporting reuse of community observation data by synthesis science, a goal for this harmonization effort is to contribute these datasets to the holdings of the Global Biodiversity Information Facility (GBIF) to increase the data's discovery and use. Although the ecocomDP model is more extensive than the DwC-A, their similarities make a scripted process straightforward. Both the DwC-A and ecocomDP models are star schemas with attribute/value tables and both use EML for metadata. Information loss is minimized by mapping to DwC-A's Event Core layout (GBIF, 2021). Our approach makes use of ecocomDP R functions for manipulating datasets, followed by mapping to the DwC-A terms and adding required metadata elements. Several types of external identifiers are included in the DwC-A tables. For taxa, we include ids (DC: taxonID) with named authority (DC: nameAccordingTo) and Life Science Identifiers (LSIDs) in the DC scientificNameID field. We also make use of the recently added EML annotation field (Jones et al., 2019) to include measurement URIs in the DwC-A extension field measurementTypeID.
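The general shape of such a mapping can be sketched as follows. The Darwin Core term names used as output keys (eventID, eventDate, occurrenceID, scientificName, taxonID, nameAccordingTo, measurementType, measurementValue, measurementUnit, measurementTypeID) are standard terms, but the input column names, the way identifiers are composed, and the function name to_dwca are assumptions of this Python sketch and should not be read as the mapping implemented in the R library.

```python
def to_dwca(observations, locations, taxa):
    """Map ecocomDP-style rows to event, occurrence, and measurement rows.

    locations: dict of location_id -> {'latitude': ..., 'longitude': ...}
    taxa:      dict of taxon_id -> {'taxon_name': ..., optional authority fields}
    """
    events, occurrences, measurements = {}, {}, []
    for o in observations:
        # Assumption: one sampling event per location and date.
        event_id = f"{o['location_id']}_{o['datetime']}"
        loc = locations[o["location_id"]]
        events[event_id] = {
            "eventID": event_id,
            "eventDate": o["datetime"],
            "decimalLatitude": loc["latitude"],
            "decimalLongitude": loc["longitude"],
        }
        tax = taxa[o["taxon_id"]]
        occurrence_id = f"{event_id}_{o['taxon_id']}"
        occurrences[occurrence_id] = {
            "occurrenceID": occurrence_id,
            "eventID": event_id,
            "scientificName": tax["taxon_name"],
            "taxonID": tax.get("authority_taxon_id"),
            "nameAccordingTo": tax.get("authority_system"),
        }
        # The attribute/value pair carries over directly to the measurement extension.
        measurements.append({
            "eventID": event_id,
            "occurrenceID": occurrence_id,
            "measurementType": o["variable_name"],
            "measurementValue": o["value"],
            "measurementUnit": o.get("unit"),
            "measurementTypeID": o.get("mapped_id"),  # URI from a variable mapping, if any
        })
    return {
        "event": list(events.values()),
        "occurrence": list(occurrences.values()),
        "measurement": measurements,
    }
```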
With the conversion from original (L0) data to ecocomDP-formatted (L1) data to DwC-A (L2) data fully automated, updating long-term observational datasets is simplified. As of this writing, we are working with GBIF on the technical aspects of the contribution mechanism. In the interim, all DwC-A packages can be found in the EDI data portal via the keyword "Darwin Core Archive". In addition to the original dataset, researchers will soon have several options for accessing these data: the ecocomDP-formatted and DwC-A packages, both archived at EDI, and queries of individual values through GBIF systems.

Discussion
Decades of harmonizing data from diverse studies and developing community data standards at multiple scales indicate that a substantial upfront cost is incurred. These laborious efforts must be justified by benefits such as importance to meta-analyses, reduced expense of obtaining and preparing data for analysis, or even commercial value.
Further, it appears that harmonization efforts generally lead to a certain loss of information, which can be acceptable during analysis if balanced by sufficient volume (e.g., Pollet et al., 2015). As a result, highly complex, multidimensional data have largely eluded harmonization. Ecological community observations, although irreplaceable and highly valued for understanding environmental change (LTERnet.edu, n.d.), have highly variable sampling methods and high dimensionality that continue to make synthesis across studies difficult (Welti et al., 2021). A level of pre-harmonization is essential if the community is to avoid each synthesis group expending significant effort repeatedly wrangling data into similar formats, and to promote more rapid and reproducible synthesis efforts.
Given these experiences, requirements, and use cases, our new data model minimizes information loss while meeting most of the needs of meta-analysis, and uses a workflow system that also accounts for regular updates to the datasets. The reformatted data (ecocomDP format) are maintained as independent packages in the EDI repository to take advantage of its general functionality of search and access, hence avoiding another database 'silo'. Further, specific discoverability is improved by the addition of standardized metadata to aid the process of selecting relevant datasets. Any synthesis effort will still have the significant step of determining if a dataset is fit for a particular analysis, which is typically performed by examining the sampling methods, constraints, and other facets of data collection. That task can be further assisted by disambiguating semantics through linkages to external dictionaries, which is accommodated in the ecocomDP data model as well as the EML metadata standard. Li et al. (2021) detail the decisions made while converting NEON data to ecocomDP. Some of the checks available in our R package resulted from that work; however, additional dependencies or checks may become evident that help ensure that scientists fully understand the data as they convert it into the ecocomDP format.
Although extensive reusable R programming functionality was developed, the conversion from original data formats (L0) to ecocomDP format (L1) still requires a moderate investment in time and some ecological understanding for every new dataset, a significant task taken on primarily by the repository, EDI. Future reuse of these data will determine the value of such a reformatting service and the likelihood of its continuation. An advantage of the workflow system is that after the initial effort, the scripts generating ecocomDP data packages from the original data can be fully automated and repeated when the original data are updated. The generation of downstream data products can also be automated, and our creation of DwC-A for submission to GBIF serves as a model for generating submissions to other systems, such as Popler, CESTES, BioTIME or VegBank (Peet et al., 2012). In addition to supporting short-term synthesis research, we envision these important data supporting the needs of ecological forecasting studies (e.g., Dietze et al., 2018) and being used to calculate indices for Essential Biodiversity Variables (EBV; Pereira et al., 2013; GEO-BON, 2013), the community-managed state variables that stand between primary observations and indicators, or even for higher-level indicators such as the Ocean Health Index (Halpern et al., 2012, 2015; Schmeller et al., 2015).
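The rerun-on-update logic described above can be sketched as a staleness check over package revisions. This Python sketch is a simplified assumption, not the workflow code used at EDI: it supposes that package identifiers take the form scope.identifier.revision, that each derived level records the L0 package it was last built from, and that the conversion steps are supplied as callables. All function names and identifiers are hypothetical.

```python
def parse_package_id(package_id):
    """Split 'scope.identifier.revision' into (scope, identifier, revision)."""
    scope, identifier, revision = package_id.rsplit(".", 2)
    return scope, int(identifier), int(revision)


def is_stale(derived_from_id, latest_source_id):
    """True when a derived package was built from an older source revision."""
    built_from = parse_package_id(derived_from_id)
    latest = parse_package_id(latest_source_id)
    if built_from[:2] != latest[:2]:
        raise ValueError("package ids refer to different source datasets")
    return latest[2] > built_from[2]


def rerun_chain(latest_source_id, provenance, steps):
    """Rerun each conversion step whose recorded source is out of date.

    provenance maps a level name ("L1", "L2") to the L0 package id it was
    last derived from; steps maps the same names to conversion callables.
    Tracking both levels against the L0 id is a simplification.
    """
    rerun = []
    for level in ("L1", "L2"):
        if is_stale(provenance[level], latest_source_id):
            steps[level](latest_source_id)
            provenance[level] = latest_source_id
            rerun.append(level)
    return rerun
```

In a subscription-based setup, a notification that a new source revision exists would trigger a call like rerun_chain with the new identifier; a second notification for the same revision would then rerun nothing.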
The flexible attribute/value data format used for ecocomDP has been widely used in other data harmonization approaches (e.g., Tarboton et al., 2008; Wieczorek et al., 2012). It saves space and allows an unlimited number of attributes, hence accommodating any type of measurement. However, description and control of aspects such as data typing, precision, or text definitions are not built in, and compared to the detailed data table descriptions common in the original data packages, this may result in some metadata loss. The ecocomDP project mitigates such losses by retaining as much metadata as possible, by quality checking, and by implementing a workflow system that includes a provenance trace in derived data (L1, L2; Fig. 1) so that original data can be accessed if necessary.
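The trade-off described above can be made concrete with a small Python sketch that reshapes one wide record into attribute/value rows. The function, field names, and sample record are hypothetical and only loosely follow ecocomDP column naming; they are meant to show the pattern, not the package's code.

```python
def wide_to_long(record, id_field, attributes, units):
    """Reshape one wide record into attribute/value rows.

    Attributes with no value are skipped, so sparse data take no space.
    """
    rows = []
    for name in attributes:
        value = record.get(name)
        if value is None:
            continue
        rows.append(
            {
                "id": record[id_field],
                "variable_name": name,
                "value": value,
                "unit": units.get(name),  # None when no unit was declared
            }
        )
    return rows


# Hypothetical wide record with a missing measurement.
wide_record = {"plot": "p1", "temperature": 12.5, "cover_class": "dense", "depth": None}
long_rows = wide_to_long(
    wide_record,
    "plot",
    ["temperature", "cover_class", "depth"],
    {"temperature": "celsius"},
)
```

Adding a new measurement needs no new column, only new rows. The single value column, however, now mixes a number and a text category, which is why typing, precision, and definitions have to be carried in accompanying metadata rather than in the table structure itself.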
The semantic parity between ecocomDP and the DwC-A model is strong, especially for concepts like Observation and Taxon. The GBIF and Darwin Core systems work quite well for observations of individuals but less well for measures of abundance; the ecocomDP model helps fill that gap. The functionality of ecocomDP's ancillary tables is aligned with ExtendedMeasurementOrFact, and together these features helped to streamline our conversion to DwC-A. Although that conversion was relatively straightforward, there are significant differences between the two formats. First, the DwC vocabulary and GBIF model do not explicitly support the kind of site nesting needed to understand a sampling design. The Event class (which includes locations) can be leveraged for this use (De Pooter et al., 2017), although examples and recommendations are not well established in the community. Therefore, ecocomDP explicitly includes a site-nesting feature, similar to other models used by scientists (e.g., Popler; Compagnoni et al., 2020). Our conversion scripts can be adapted in the future as the use of the DwC-based models evolves. Second, inclusion of external dictionary references for measurements is not currently an established part of the DwC vocabulary (which determines column headings for DwC-A). Our L2 DwC-A already includes the proposed extension field measurementTypeID (to hold URIs in external measurement dictionaries) and will serve as an example as adoption of this extension increases. Those differences, and the ease with which our ecocomDP datasets can be converted to DwC-A, make the ecocomDP intermediate valuable both for detailed scientific syntheses and for large-scale querying by aggregators like GBIF.
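One common way to represent the site nesting mentioned above is a location table in which each row points to its parent. The Python sketch below assumes such a self-referencing layout with a parent_location_id field; the table contents and the function name are hypothetical, and the code is an illustration rather than the ecocomDP implementation.

```python
# Hypothetical location table keyed by location_id.
locations = {
    "site_1": {"location_name": "Site 1", "parent_location_id": None},
    "plot_1a": {"location_name": "Plot A", "parent_location_id": "site_1"},
    "sub_1a1": {"location_name": "Subplot 1", "parent_location_id": "plot_1a"},
}


def nesting_path(location_id, table):
    """Return location ids from the outermost level down to location_id."""
    path, seen = [], set()
    current = location_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle in parent_location_id references")
        seen.add(current)
        path.append(current)
        current = table[current]["parent_location_id"]
    return list(reversed(path))
```

Resolving the full path for each observation's location lets an analyst aggregate at any level of the sampling design (subplot, plot, or site) without consulting the original methods text.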
The use of ecocomDP to promote discovery, reusability, and integration of data is an exciting step towards harmonization of data across coordinated research networks, which advances collating in-situ ecological community observation data at global extents to support broad concepts such as EBVs. This EDI-NEON collaboration also reveals the value of synergies between networks by integrating the deep, long-term, place-based knowledge of the LTER Network with the broad spatial coverage of the NEON Observatory. Just as harmonization of data helps synthesis scientists avoid "reinventing the wheel" for each research project, collaboration among groups such as NEON, LTER, and EDI promotes communication between repository staff and scientists to share insights and pitfalls about data. Furthermore, although NEON data are extremely well documented and encapsulate standardized collection protocols, the level of detail surrounding slight nuances in data collection over time (e.g., reductions in sampling events) or abbreviations used (e.g., "sp." and "spp.") may elude users. The oversight of data wrangling in collaboration with NEON staff for the ecocomDP model assures users that these idiosyncrasies have been considered. End users will still need to recognize that the ecocomDP data are intended to be used for community ecology analyses rather than for demographic analyses, although the original data may contain that information. For instance, to access NEON's small mammal mark-recapture information (e.g., to estimate occupancy for population models), users would need to return to the original data product.

Conclusion
Many important primary data are ongoing research-grade time series, and access to these trusted, up-to-date data sources is highly desired by synthesis scientists, managers, and policy and decision makers, yet easy access is seldom realized. Data harmonization is not a new idea. But typically, harmonization projects for organismal data are designed for specific research questions or types of queries, which tend to drive data preparation decisions. Unfortunately, those formatting or aggregation choices often reduce the potential for other types of use.
Our workflow-based model makes both the original data and the harmonized version easy to discover and access, and takes advantage of existing repository functionality. Furthermore, heterogeneous data become available in a manner consistent and interoperable with current and emerging trends in other biological fields. The harmonized intermediate has basic formatting applied and accommodates standardized measurement semantics and taxonomy. Using event subscriptions to track updates to the source data and rerun processing code keeps derived products current without manual intervention, and provides a template for a process that can be reused in other scientific domains.

Declaration of Competing Interest
None.