Approaches and tools for user-driven provenance and data quality information in spatial data infrastructures

ABSTRACT Geospatial data are fundamental in most global-change and sustainability-related domains. However, information on data quality and provenance is often missing or hard to access for users due to technical or perceptual barriers, for example unstructured metadata or missing references. Within an interdisciplinary process encompassing the perspectives of data users, data producers, and software developers, we identified major needs to facilitate effective fitness-for-use assessments by data users and developed approaches to address these. We provide a stylized analysis of large-scale land use data to showcase selected approaches. To support data users, interoperable quality and provenance information needs to be meaningfully represented. Data producers need efficient workflows and tools that support them in creating high-quality, structured, and detailed quality and provenance information. Our newly developed approaches to increase the availability of structured metadata synthesize new and existing tools to extract metadata or to generate provenance data during processing. Within our approaches to improve interoperability and accessibility, we present novel tools to support (i) the creation of curated and linked registers of data quality indicators and thematic terms, and (ii) the linked visualization of data quality and provenance information. Following our approaches increases transparency, facilitates fitness-for-use assessments, and ultimately improves research quality.


Introduction
During the past decades, the availability and heterogeneity of geospatial data have rapidly increased (Balbi et al. 2022). This opens vast opportunities for the development of well-informed management strategies in earth system modelling, particularly regarding the interplay between land use and conservation (Dornelles et al. 2022; Rounsevell et al. 2014). Land use and land cover data play a central role in understanding socio-environmental system feedbacks and in climate-change assessments (Prestele et al. 2016; Verburg, Neumann, and Nol 2011). However, the usefulness of any data-driven analysis ultimately depends on the 'fitness for use' or 'fitness for purpose' of the input data for a given assessment (Whitfield 2012). While assessing the fitness for use of input data is essential to ensure good scientific practice and to provide reliable information for decision-making, it ultimately depends on the availability and accessibility of relevant and curated metadata on quality and provenance (Tilmes, Yesha, and Halem 2010; Wüest et al. 2020). Here, we broadly define accessibility as the absence of technical or perceptual barriers. This includes openness, findability, understandability, interoperability, and machine-readability, encompassing the first three FAIR guiding principles (Wilkinson et al. 2016). Information on data quality is indispensable for data users to decide whether certain data fit the analysis in the context of their research question (Peng et al. 2021). This is also stressed by several initiatives, such as the Global Community Guidelines for Documenting, Sharing, and Reusing Quality Information of Individual Digital Datasets (Peng et al. 2022). Provenance information provides relevant details on the genesis of a dataset, including inputs, processes, parameters, and involved actors (Jiang et al. 2018; Magagna et al. 2020).
The availability of quality and provenance information on geospatial data is rudimentary at best (Bernard et al. 2014; Bielecka 2015; Lush, Lumsden, and Bastin 2018). Quality information is often missing entirely, incomplete, or not corresponding to any standard (Anderson et al. 2020; Devillers, Bédard, and Jeansoulin 2005). For biodiversity data, metadata about the uncertainty of primary biodiversity records with regard to both taxonomic identification and geo-reference are rarely provided, although inaccuracies are well known (Anderson et al. 2020; Meyer, Weigelt, and Kreft 2016; Moudrý and Devillers 2020). Moreover, the availability of particularly relevant quality information is perceived as low to medium (Box 1). A systematic review on the publication of metadata shows that only about 20% of metadata records contain quality indicators, rarely including information on thematic accuracy, temporal accuracy, and provenance (or lineage) (Yang et al. 2013).
Box 1. Survey on relevance, availability, and accessibility of data quality information in earth system science data. We developed a survey to better understand the needs of geodata users regarding data quality and provenance information (Fischer, Egli, and Henzen 2022). The survey is mainly structured around the data quality elements of ISO 19157 (ISO 2013), addressing availability and accessibility on several levels of detail. Provenance information is included as it is key to understanding the data quality of a product with respect to its development. The survey ran from November 2021 to January 2022 and was distributed among international earth system science experts and data users. In total, 33 respondents completed the survey. Twenty-six participants responded that provenance information is relevant for their fitness-for-use assessments. Further, thematic accuracy and completeness were named most often as being relevant (a = 15). The availability and accessibility of relevant quality information (a ≥ 5) was rated medium, except for provenance information, which was rated as less available or accessible (Figure B1). Many participants currently obtain data quality and provenance information from associated publications or reports (a = 24), yet this is the preferred source of information for fewer respondents (a = 16) (Figure B2). Most participants prefer the presentation of quality information in tables instead of retrieving this information from text sections (a = 23). To determine fitness for use, users tend to evaluate multiple data quality elements for a certain dataset (a = 17) or multiple datasets for a certain quality element (a = 14). Moreover, most participants download the data and perform some tests (a = 20). Survey results indicate that for certain quality indicators spatially, temporally, or thematically specific information is needed, e.g. for positional accuracy or thematic accuracy. Respondents further stressed the usefulness of concepts that link data quality and provenance information and visualize them accordingly (Fischer, Egli, and Henzen 2022).
Limited accessibility can also constrain fitness-for-use assessments. Metadata are often hard to find or difficult to understand without discipline-specific knowledge because they are not (explicitly) referenced and documented (Bielecka 2015; Devillers, Bédard, and Jeansoulin 2005; Ivánová et al. 2013). The accessibility of particularly relevant data quality and provenance information is often rated medium or low (Box 1). Many data users obtain data quality and provenance information from associated publications or reports, yet the majority would prefer to extract this information (automatically) from metadata to reduce effort (Box 1).
Uncertainties or missing information regarding data quality and provenance pose various challenges and can even lead to data misuse and misinterpretation (Goodchild 2007; Wentz and Shimizu 2018; Whitfield 2012), particularly in interdisciplinary projects, which frequently include data and concepts across different disciplines. The term forest, for example, has been defined differently depending on region, institution, and perspective (Table 1), which can lead to inconsistencies among datasets (Chazdon et al. 2016; Sexton et al. 2016). Combining datasets with different concepts may then bias the results in certain contexts (Verburg, Neumann, and Nol 2011). Lack of information on omission and commission errors of land-cover classes can result in under- or overestimation of areas (Nol et al. 2008; Wentz and Shimizu 2018), whereas lack of information on temporal inconsistencies between data time series (e.g. regarding the methods used or land-cover class definitions) can result in under- or overestimation of areal changes (Nedd et al. 2021). Moreover, a lack of provenance information could suggest a false level of detail if data have been disaggregated (Verburg, Neumann, and Nol 2011), or lead to circular reasoning in downstream analyses if users are unfamiliar with underlying concepts and data (Leyk et al. 2019). Especially for data users with insufficient expertise or time constraints, the mentioned shortcomings can severely affect the quality of their assessments (Goodchild 2007; Wentz and Shimizu 2018).
While there is a good overview of the major challenges related to the availability and accessibility of data quality and provenance information in fitness-for-use assessments, a comprehensive synthesis of needs and, most importantly, of specific ways to overcome them that integrates the perspectives of data users, data producers, and software developers is missing. Since earth system sciences are a strongly data-driven and data-producing research field, researchers need clear approaches and guidelines to foster the curation of research data. Although such approaches partly exist, an overview is needed to facilitate their selection and use. Data curation provides a methodological and technological basis for improving data management, data quality, and the (re-)usability of datasets, including transparency and reproducibility (Freitas and Curry 2016), throughout the full life cycle of a data product.
We address this gap by (i) summarizing major needs regarding the availability and accessibility of data quality and provenance information from both the data user and the data producer perspective (see also Box 1), (ii) presenting a comprehensive overview of newly developed approaches, including new as well as existing tools for data curation, that address these needs and help avoid common pitfalls and challenges in downstream analyses, in particular regarding the documentation, visualization, and complementation of data quality and provenance information, and (iii) showcasing selected approaches in an exemplary use case analysis of agricultural yield determinants. While we mainly refer to the land use and land cover domain, the identified aspects are relevant for all domains of geospatial data.

Materials and methods
In the following, we describe (i) the conceptual framework based on which we organized our work, (ii) how we identified the needs of data users and producers, (iii) how we derived approaches to address those needs, and (iv) the use case we devised to iteratively refine our understanding of the needs and approaches.

Conceptual frame to co-develop tools, produce, and use data
We followed an interdisciplinary and collaborative approach (i.e. scientists with different disciplinary backgrounds iteratively refined needs and approaches) with respect to all phases of the research data lifecycle and major roles. Thereby, we aimed at providing approaches, including workflows, processes, and tools, with a specific focus on the availability and accessibility of geospatial data quality and provenance information. We understand a tool as a specific piece of software built for a certain purpose and with a clearly delimited use, a process as a course of data operations (e.g. using one or more tools), and a workflow as a well-defined sequence of processes that leads to a specified outcome.
Our interdisciplinary group of scientists included (i) users of existing land use and land cover data for analyses, (ii) data producers that combine a wide range of data to derive harmonized global land use data, and (iii) software developers that develop concepts and tools meeting users' and producers' needs (henceforth referred to as 'users', 'producers', and 'developers', respectively) (Figure 1). Here, we used a simplified model envisioning that both data users and data producers act as domain experts. Using these three perspectives, we aimed at (i) raising producers' awareness of user needs, (ii) supporting producers in improving information and metadata provision, and (iii) providing suitable tools to improve the representation of geodata for users.

Identification of needs and development of approaches
First, we combined different approaches to identify relevant needs. With respect to existing standards, we reviewed the ISO 19157 quality elements, including their subclasses (Table 2), regarding their relevance in the land use and land cover domain by rating each subclass from 1 (low relevance) to 5 (high relevance) based on our experience as users. We then systematically assessed the availability and accessibility of the most relevant quality element subclasses in 15 exemplary datasets depicting different land use and land cover aspects, based on information provided in the datasets' metadata as well as in associated publications, reports, and supplements. By considering the results of our survey, we identified metadata gaps and needs among a wider community of users (Box 1). Since the number of survey participants was limited, we combined the results with findings from expert interviews conducted in the project and with experiences from our activities in international working groups. Moreover, we specified user needs in the context of a use case (see below), particularly related to the availability and accessibility of metadata of several datasets that are supposed to be combined in downstream analyses.
Second, we evaluated, from a data producer perspective, the ability to address the user needs identified in the previous steps and specified the respective needs regarding software and curation tools.
Table 1. Selected definitions of the term forest.

| Term | Definition | Source |
| --- | --- | --- |
| Forest | "A vegetation community dominated by trees and other woody shrubs, growing close enough together that the tree tops touch or overlap, creating various degrees of shade on the forest floor. It may produce benefits such as timber, recreation, wildlife habitat, etc." | GEMET (2021) |
| Forests | "Land spanning more than 0.5 hectares with trees higher than 5 meters and a canopy cover of more than 10 percent, or trees able to reach these thresholds in situ. It does not include land that is predominantly under agricultural or urban land use." | AGROVOC (2022) |
| Forest | "dense collection of trees covering a relatively large area" | WIKIDATA (2022) |
| Forests | "Generally, an ecosystem characterized by a more or less dense and extensive tree cover. More particularly, a plant community predominantly of trees and other woody vegetation, growing more or less closely together." | NAL Agricultural Thesaurus (2019) |

Third, based on the identified needs of both users and producers, we developed and implemented approaches and tools to facilitate the creation, provision, and visualization of data quality and provenance information from a developer perspective. We then tested and reviewed these approaches in the context of the use case (see below) and refined them iteratively. To derive requirements and recommendations, we used knowledge from the survey results, project-specific interviews, recent literature, and our programming experience. This allowed us to take different perspectives and scales into account, e.g. concrete use case-specific aspects synthesized with the results from the internationally distributed survey. However, we acknowledge that future iterations could integrate additional roles, e.g. data stewards to support interaction and mediation.

Use case: global determinants of agricultural yield
We performed a fitness-for-use evaluation, data selection, and stylized analysis regarding global determinants of agricultural yields. We described yield as a function of agricultural management, climate, soil, and pollination:

Yield = f(agricultural management, climate, soil, pollination)

For simplicity, we focus on showcasing the fitness-for-use evaluation of data on yield (Monfreda, Ramankutty, and Foley 2008; Yu et al. 2020), agricultural management, and pollination (Schulp, Lautenbach, and Verburg 2014). Thereby, we focused on yield data of rapeseed as a crop modestly dependent on pollination (Klein et al. 2007) and selected irrigation (Portmann, Siebert, and Döll 2010) as one aspect of agricultural management.
We selected these geodatasets, extracted relevant metadata from repositories, specific websites, associated publications, and supplements, and integrated them into the project-specific data management software CKAN. We used the statistical software package R 4.1.3 (https://www.R-project.org) in RStudio (http://www.rstudio.com) to assess, process, and analyse these data and to upload new metadata from our analyses to the CKAN instance. The software project including the relevant R scripts is available as a GitHub repository (https://zenodo.org/account/settings/github/repository/legli/GeoKur_UseCase1).

Table 2. Summarized overview of openness, FAIRness, and data maturity, and detailed overview of the quality of yield data evaluated in the use case (Monfreda, Ramankutty, and Foley 2008; Yu et al. 2020). Openness is high for both datasets. FAIRness and data maturity are higher for MapSPAM, but for both datasets data quality and provenance information is missing in the metadata, and the evaluated data quality information was derived from the dataset and associated publications.

Needs
The availability of data quality and provenance information is still limited for many geospatial data products. If available, the information is often published in scientific publications or associated reports, which makes it difficult to quickly extract the relevant parts and use them in human-readable or machine-readable form. Given that this information is indispensable to enable users to select the most fitting data product, increasing the availability and accessibility of structured quality and provenance metadata of spatial data remains the major need from the user perspective. Thereby, metadata needs to include structured and interoperable data quality and provenance information explicitly linked to the data. This also includes information on underlying thematic terms and concepts, which should ideally be harmonized for the specific use case to enable proper data use across different domains and to prevent semantic misinterpretation. Users need an improved and spatially explicit representation of metadata that enables them to rapidly understand data quality and provenance, which is particularly lacking for gridded data products. Further, tools are needed that allow for a systematic comparison of different datasets with respect to a certain data quality aspect and for defining specific characteristics (e.g. threshold values).
Producers need workflows and tools that support them in creating high-quality, structured, and detailed quality and provenance information for spatial data while reducing the effort to do so. Thereby, producers should be able to integrate the respective tools into their existing analyses and scripts, while the amount of required installation or code adaptation needs to be minimized. Here, (partially) automated processes increase efficiency and allow quality information to be updated once new information is available.
Finally, developers need to learn about producers' workflows and software environment as well as users' specific requirements on metadata and how it should be presented to fit their evaluation needs.

Approach 1: Support the creation of provenance metadata
Producers should provide a curated selection of provenance information, which is needed to understand the genesis of a data product. The ISO 19115-2 lineage extensions (ISO 2019) and PROV-O (https://www.w3.org/TR/prov-o/) can serve as a starting point to capture detailed provenance information either on dataset or object level. Closa et al. (2017) describe how to capture provenance on dataset, feature, and attribute level and propose a mapping between the ISO 19115 lineage elements and PROV-O. Intermediate datasets should be made available to foster reuse and trust in the final data product. Further, the source code of the applied processes should be provided so that users can inspect scripts, detect errors, foster reproducibility, and create provenance graphs.
Developers should provide tools that support producers in creating provenance information within their respective working environment by integrating provenance tracking into the analysis scripts. Thereby, provenance metadata (i) is always up to date with the current version of the analysis script, (ii) can be generated automatically (i.e. with less effort), and (iii) can be produced at different levels of detail as needed for specific cases. We developed the package provr (https://github.com/GeoinformationSystems/provr), which allows producers to create PROV-O-conform provenance graphs with little effort during script execution.
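To illustrate the kind of output such provenance tracking produces, the following minimal Python sketch serializes one processing step as a PROV-O-conform graph in Turtle. It does not reflect the provr API; the activity, dataset, and agent names as well as the example namespace are hypothetical.

```python
# Illustrative sketch (not the provr API): emit a minimal PROV-O-conform
# provenance graph in Turtle for one processing step, standard library only.
def prov_step(activity, inputs, output, agent):
    """Serialize one script step (activity, used/generated entities, agent)."""
    lines = [
        "@prefix prov: <http://www.w3.org/ns/prov#> .",
        "@prefix ex:   <https://example.org/provenance/> .",
        "",
        f"ex:{activity} a prov:Activity ;",
        f"    prov:wasAssociatedWith ex:{agent} ;",
    ]
    # Each input dataset becomes a prov:Entity that the activity 'used'.
    lines += [f"    prov:used ex:{i} ;" for i in inputs]
    lines[-1] = lines[-1].rstrip(" ;") + " ."   # close the statement block
    lines += [
        "",
        f"ex:{output} a prov:Entity ;",
        f"    prov:wasGeneratedBy ex:{activity} .",
        "",
        f"ex:{agent} a prov:Agent .",
    ]
    return "\n".join(lines)

graph = prov_step("cropMasking", ["yieldRaster", "irrigationRaster"],
                  "maskedYieldRaster", "analysisScript")
print(graph)
```

Generating such a snippet per processing step and concatenating the results yields a provenance graph that stays in sync with the script that produced the data.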
Larger or data-intensive projects should use a research data infrastructure to efficiently manage data and related processing and analysis workflows. Developers should support researchers by integrating provenance generation support into the geodata infrastructure. He et al. (2015) describe how to integrate provenance management at different levels of granularity in geoprocessing environments. We aimed to support provenance capture in rather loosely coupled environments (e.g. researchers using different scripting languages to create data on local machines, which is subsequently published on a common data management platform).
The data management software (DMS) CKAN (https://ckan.org/), for example, can be used for managing, publishing, and searching geodata, allows implementing and using several metadata schemas, and comes with several geospatial extensions. Other DMS like Dataverse (https://dataverse.org/), DSpace (https://dspace.lyrasis.org/), or Invenio (https://inveniosoftware.org/) provide similar systems with options for several schemas and APIs, but lack specific libraries to be used from the analysis scripts or such geospatial extensions. We developed metadata schemas (GeoKur metadata profile based on GeoDCAT: https://zenodo.org/record/4916698) for processes and datasets, which both contain metadata fields to describe their provenance step-wise. The captured provenance information is serialized into PROV-O-conform provenance graphs and made available at a SPARQL endpoint (https://www.w3.org/TR/sparql11-query/), which facilitates semantic querying for fitness-for-use evaluation or for inclusion in data analysis scripts. The CKAN metadata can be managed via the package ckanr (https://cran.r-project.org/package=ckanr), which serves as a wrapper for the CKAN API. Thus, provenance information can directly be updated alongside other metadata from within researchers' working environment (Rümmler, Figgemeier, and Henzen 2022).
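As a sketch of what such an update looks like at the API level (ckanr wraps the same endpoints), the following Python snippet prepares a package_patch request against CKAN's action API. The instance URL, API key, dataset id, and the 'prov_graph' field are hypothetical; which custom fields are accepted depends on the metadata profile configured in the CKAN instance.

```python
# Sketch of updating dataset metadata via CKAN's action API from a
# processing script. Only the listed fields are changed by package_patch.
import json
from urllib import request

def build_package_patch(base_url, api_key, dataset_id, fields):
    """Prepare (but do not send) a package_patch request."""
    url = f"{base_url}/api/3/action/package_patch"
    body = json.dumps({"id": dataset_id, **fields}).encode("utf-8")
    headers = {"Authorization": api_key,
               "Content-Type": "application/json"}
    return request.Request(url, data=body, headers=headers, method="POST")

req = build_package_patch(
    "https://ckan.example.org", "my-api-key", "masked-yield-raster",
    {"notes": "Rapeseed yield masked to irrigated areas",
     "prov_graph": "https://example.org/provenance/cropMasking"})
# urllib.request.urlopen(req) would send the request to the instance.
print(req.full_url)
```

Keeping such a call at the end of an analysis script means the published metadata is refreshed whenever the data are regenerated.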

Approach 2: Support the generation of data quality metadata
When building a spatial dataset, producers should generate spatial data quality information throughout the whole data lifecycle.To support this, developers should provide concepts and the technical base for structured and transparent quality assurance (QA) that consider the requirements of both producers and users and allow for their participation in the QA design process.These concepts include for instance suggestions on how to collect and update which quality information with respect to existing metadata standards, the given software environment, and data creation process.
We propose the implementation of a QA workflow (Wagner and Henzen 2022) along the whole data lifecycle, which adapts existing concepts for openness (https://5stardata.info/en/) and FAIRness of data (Wilkinson et al. 2016), data maturity (Höck, Toussaint, and Thiemann 2020), and data quality (ISO 2013). We derived a data quality matrix for spatial data, similar to a data maturity matrix (Wagner and Henzen 2022). For each lifecycle phase, a set of mandatory maturity and quality measures is defined and complemented with use case-specific openness requirements. Our QA workflow is implemented as a web-based interactive questionnaire resulting in a custom report (see GitHub project: https://github.com/GeoinformationSystems/RDMOCatalogBuilder). By publishing the results, users get a brief overview of the quality of a dataset and its openness, FAIRness, and maturity, which ultimately increases the accessibility of information (Percivall 2010) (Table 2).
To complement missing metadata for existing datasets, developers should provide tools to derive certain quality indicators automatically, e.g. by analysing the files based on their geodata file type/format. We developed the tool MetadataFromGeodata (Wagner, Henzen, and Müller-Pfefferkorn 2021; published on GitHub: https://github.com/GeoinformationSystems/MetadataFromGeodata), which allows extracting several geodata quality indicators compliant with ISO 19115-1 (ISO 2014) and ISO 19157-1 (ISO 2013), including the number or rate of missing items per attribute and various parameters on representativeness.
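The rate of missing items mentioned above is one indicator that is straightforward to derive automatically. The following Python sketch (with invented records and attribute names) computes this completeness measure per attribute:

```python
# Minimal sketch of the ISO 19157 completeness measure 'rate of missing
# items', computed per attribute over a set of records.
def missing_item_rates(records, attributes):
    """Return the share of records with a missing value per attribute."""
    rates = {}
    for attr in attributes:
        missing = sum(1 for rec in records if rec.get(attr) is None)
        rates[attr] = missing / len(records)
    return rates

# Invented example records for a yield dataset.
records = [
    {"crop": "rapeseed", "yield_t_ha": 2.1, "irrigated": True},
    {"crop": "rapeseed", "yield_t_ha": None, "irrigated": False},
    {"crop": "rapeseed", "yield_t_ha": 1.8, "irrigated": None},
    {"crop": "rapeseed", "yield_t_ha": 2.4, "irrigated": True},
]
print(missing_item_rates(records, ["yield_t_ha", "irrigated"]))
# {'yield_t_ha': 0.25, 'irrigated': 0.25}
```

Reporting such rates in structured metadata lets users judge completeness without downloading and inspecting the data themselves.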
Given the high variety of available geodata, producers should suggest measures that support users working at different spatial and temporal resolutions and scopes in selecting the most fit-for-purpose products for their specific applications. One way to support users, besides publishing spatially and temporally disaggregated validation results and following standardized guidelines for validation assessments (Olofsson et al. 2012; Stehman and Foody 2019), is to consider several measures that maximize the products' comparability. Firstly, producers can use a unique set of validation samples of known quality and quantity to allow comparing several geodata at the exact same locations. Here, we developed a validation database of forest samples to compare the correctness of forest mapping in three well-known global land use/cover time series in sub-Saharan Africa (Figure 2). Moreover, producers should assure full transparency and reproducibility of their data quality assessments by publishing information on validation samples collected from the literature or by providing their own sample data as open/FAIR datasets, linking them to digital object identifiers (DOIs) of public repositories.
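A shared validation sample makes accuracy measures directly comparable across products. As a minimal illustration with invented counts, omission and commission errors for a forest class follow directly from a confusion matrix:

```python
# Sketch of deriving thematic-accuracy measures from validation samples:
# omission and commission error for the class 'forest'. Counts are invented.
def class_errors(tp, fn, fp):
    """tp: samples mapped and referenced as forest;
    fn: forest missed by the map; fp: non-forest mapped as forest.
    Omission error = missed forest / reference forest;
    commission error = false forest / mapped forest."""
    omission = fn / (tp + fn)
    commission = fp / (tp + fp)
    return omission, commission

om, com = class_errors(tp=420, fn=80, fp=60)
print(round(om, 2), round(com, 2))
# 0.16 0.12
```

Computing these errors from the same sample locations for each candidate product is what allows the kind of side-by-side correctness comparison shown in Figure 2.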
As the confidence of reported thematic accuracies directly depends on the representativeness of the validation samples (Foody 2009), it will increase as more data holders publish and integrate validation samples into domain-specific, open-access repositories. Especially for domains where the amounts of open-access validation and topical samples are rapidly growing (e.g. for biodiversity occurrences via GBIF (www.gbif.org), for soil profiles via WoSIS (www.isric.org/explore/wosis)), regular re-validations of data products will thus be needed, not only to improve the reliability of data quality metadata but also to keep them comparable across products. Producers with the technical capacities to do so should implement routines to regularly revalidate their own products against the latest standardized validation records, e.g. on cloud-based geocomputation platforms, and update the associated metadata, while all other producers should support such re-assessments by third parties.

Approach 3: Provide flexible and interoperable metadata profiles
For producers, the provision of rich metadata is time-consuming, and generic metadata schemes (e.g. ISO 19115) may be overwhelming due to their complexity. However, the use of generic schemes is crucial to provide interoperable metadata. This conflict can be resolved by developing metadata profiles, i.e. less complex subsets of a metadata scheme that can be tailored towards user needs.
Our approach to developing metadata profiles is to reduce, restrict, adapt, and comply with the original scheme (Henzen, Rümmler, and Wagner 2021). Reduce includes a critical review of the optional fields in the original schemes, i.e. only keeping fields that are meaningful for the description of the respective datasets. In the context of earth system science data, we kept 24 of the originally 49 metadata fields included in the GeoDCAT dataset profile. Restrict includes the change of obligations from optional to mandatory where the information is necessarily needed (e.g. an identifier for internal processing) or can be provided with guarantee (e.g. a contact point), the reduction of cardinality, and format restrictions (e.g. machine-readable contents only). Adapt accounts for community needs regarding terminologies by changing the labels of the metadata fields. By using well-known and accepted terminologies that are driven by, e.g., best practices or quasi standards, we facilitate the understanding of which information should be provided for a certain metadata field. Comply refers to extensions that can be integrated into the structure of the original scheme, for example to facilitate the description of different data quality indicators.
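A reduced and restricted profile can be treated as a small machine-checkable specification. The following Python sketch encodes a few kept fields with adapted labels and changed obligations and checks a metadata record for missing mandatory entries; the field names loosely follow GeoDCAT but are illustrative only:

```python
# Hypothetical sketch of a metadata profile as a reduced, restricted
# subset of a generic scheme, plus a check for mandatory entries.
PROFILE = {
    "identifier":        {"label": "Dataset ID", "obligation": "mandatory"},
    "contactPoint":      {"label": "Contact",    "obligation": "mandatory"},
    "spatialResolution": {"label": "Grid size",  "obligation": "optional"},
}

def validate(metadata):
    """Return the mandatory profile fields missing from a record."""
    return [f for f, spec in PROFILE.items()
            if spec["obligation"] == "mandatory" and f not in metadata]

record = {"identifier": "geokur-yield-1"}
print(validate(record))
# ['contactPoint']
```

Encoding the profile this way lets the same definition drive both entry forms (via the adapted labels) and automated validation.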

Approach 4: Increase interoperability for metadata and thematic terms
A common understanding of terms is crucial for collaboration between project partners. Furthermore, knowledge dissemination to potential users or providers with different backgrounds relies on a clear and accessible description of focal terms. Any project-specific system to manage and represent domain-specific knowledge that is used to organize and describe the concepts of a project, such as vocabularies, taxonomies, or ontologies, should be published in human- and machine-readable form, and its contents should be linked to existing concepts. Finally, if datasets refer to terms or definitions in a certain field of the metadata scheme (e.g. tags or CRS), these fields should always link to proper terms that are available in an openly accessible register. We (i) developed the R package ontologics (https://cran.r-project.org/web/packages/ontologics) to support scientists in the development and setup of use-case-specific ontologies, whose terms are well-defined, harmonized, and linked to terms in other knowledge organization systems, (ii) developed and published an ontology of land-use/land-cover concepts (Ehrmann, Rümmler, and Meyer 2022) (Figure 3), and (iii) published an extendable register (https://geokur-dmp.geo.tudresden.de/quality-register) of geodata quality indicators that fosters managing descriptions of quality indicators and providing them for reference in data quality descriptions and assessments.
The package ontologics enables the guided development of ontologies from within R by providing a range of assisting utility functions that prevent inconsistencies within the ontology. The package makes use of the Simple Knowledge Organization System (SKOS, https://www.w3.org/TR/skos-reference/) to define the linkages between registered terms. Consequently, the ontology can be exported as an RDF document. A common understanding is crucial for thematic terms as well as for quality indicators. Therefore, we built a register of geodata quality indicators. We initially expressed a subset of the geodata quality indicators described in ISO 19157 (ISO 2013) with the Data Quality Vocabulary (DQV) and made them available in a triplestore database. A dynamic web page queries the triplestore and presents the register in human-readable form. The page is hosted in our CKAN instance, including visualizations and a form to provide feedback or to propose new indicators.
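The SKOS-based linking of a harmonized term to external knowledge organization systems can be sketched as follows in Python. This is not the ontologics API; all URIs except the W3C SKOS namespace are hypothetical.

```python
# Illustrative sketch: register one harmonized concept and link it to
# external terms via SKOS matching properties, serialized as Turtle.
def skos_concept(local_id, label, exact=(), close=()):
    """Serialize one concept with its external links as Turtle."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "@prefix ex:   <https://example.org/lulc/> .",
        "",
        f"ex:{local_id} a skos:Concept ;",
        f'    skos:prefLabel "{label}"@en ;',
    ]
    lines += [f"    skos:exactMatch <{uri}> ;" for uri in exact]
    lines += [f"    skos:closeMatch <{uri}> ;" for uri in close]
    lines[-1] = lines[-1][:-1] + "."   # end the statement block
    return "\n".join(lines)

ttl = skos_concept(
    "forest", "forest",
    exact=["https://example.org/agrovoc/c_3055"],
    close=["https://example.org/gemet/1744"])
print(ttl)
```

Distinguishing exactMatch from closeMatch is what allows downstream tools to decide whether two datasets using different source vocabularies can be combined without semantic reinterpretation.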

Approach 5: Provide user-friendly tools to visualize data provenance and quality information
To support the efficient evaluation of fitness for use for geospatial datasets, developers should provide easy-to-use tools to visualize both data quality and provenance. Accordingly, we developed the Geodashboard, a user interface that links provenance information, data quality information, and general metadata on several levels of detail (Figgemeier, Henzen, and Rümmler 2021). The Geodashboard (https://github.com/GeoinformationSystems/Geodashboard) uses standardized semantic geospatial metadata requested via SPARQL from a triplestore (Figgemeier, Rümmler, and Henzen 2022). It supports queries on well-defined quality indicators managed in the above-mentioned quality register (see approach 4). Several linked dashboard widgets provide an overview of selected datasets, allowing user-specific selection of quality information on different levels of detail, supported by charts and a map view. For users, the Geodashboard facilitates assessing whether potential datasets fulfil user-defined requirements, for instance regarding their spatial resolution, geodata quality, and provenance, e.g. to avoid circular reasoning (Figure 4). Heterogeneous data quality information can be mapped in a spatially explicit way, allowing users to quickly determine whether the quality of a dataset is adequate in the region of interest. The embedded visualization of provenance information allows the selection of specific datasets and shows which data have been processed.
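The kind of query such a dashboard widget could send to the triplestore can be sketched with the W3C Data Quality Vocabulary; the metric URI and threshold below are hypothetical, and the exact graph layout depends on how the register models its indicators.

```python
# Sketch of a SPARQL query a dashboard widget could send to shortlist
# datasets by a quality indicator from the register (DQV vocabulary).
def quality_query(metric_uri, max_value):
    """Select datasets whose DQV measurement for a metric is below a cap."""
    return f"""
PREFIX dqv:  <http://www.w3.org/ns/dqv#>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?dataset ?value WHERE {{
  ?dataset a dcat:Dataset ;
           dqv:hasQualityMeasurement ?m .
  ?m dqv:isMeasurementOf <{metric_uri}> ;
     dqv:value ?value .
  FILTER (?value <= {max_value})
}}
ORDER BY ?value"""

q = quality_query("https://example.org/quality/missingItemRate", 0.05)
print(q)
```

Because the quality register and the dataset metadata share one triplestore, a single query like this can drive both the table and the map widgets.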

Value of presented approaches and remaining challenges
Our paper synthesizes approaches across single workflows and tools to support producers and users in the provision, management, and usage of relevant and accessible metadata for geospatial data at various stages (Table 3). For example, the quality-assurance workflow guides producers through the entire research data lifecycle, while the suggested approaches and related tools facilitate the generation of provenance information, the automated extraction of data quality information during processing and analyses, and the provision of standardized, transparent, and reproducible quality information. Metadata profiles, ontologies, and data quality registers facilitate the structured publication and archiving of metadata and the link between different communities. Once this information is available, visualization will support users in the reuse of data. From a producer perspective, following these approaches will increase transparency, accessibility, and usage of data products. From a user perspective, fitness-for-use assessments will be facilitated, which will improve the adequate usage of data for downstream analyses and ultimately research quality. Furthermore, the usage of data from other domains will be eased, which will help to answer new research questions and foster research in general.
Although the developed approaches and tools are designed for easy re-use and implementation, a certain effort is needed to integrate them into work routines. The creation of provenance metadata during processing, for example, still requires manual input from producers. Adopting the questionnaire within the quality assurance workflow might be challenging for producers who rely on script-based processes to handle large amounts of data. Greater awareness of the crucial importance of metadata for meaningful and successful geodata usage is thus needed, together with a genuine willingness of the community of producers as well as developers to improve metadata provision. Likewise, awareness and knowledge among users regarding the importance of metadata on data quality and provenance need to be fostered via corresponding training on responsible data use in higher-education programmes.

Outlook
Currently, metadata on data quality of geodata are mainly captured, structured, and described from a producer-centric view and may fail to provide suitable information for users (Zabala et al. 2021). Flexible user-centric approaches are needed that enable users to provide feedback based on their own data usage, to annotate data with respect to, e.g., data quality or fitness for use for a certain application, and to make this knowledge accessible for other users (Anderson et al. 2020; Yang et al. 2013; Zabala et al. 2021). Storing user feedback in an authorized and reviewed way remains an open task; however, first attempts to do so have already been made (Vahidnia and Vahidi 2021).
We demonstrated the necessity to interlink ontologies of the same domain and suggested approaches to support this process (approach 4), e.g. by mapping novel to existing terms as a prominent feature in an ontology development tool ('domain knowledge descriptions'), or by managing an integrative register of data quality indicators that allows different descriptions of quality indicators to be defined for different research domains and relations between them to be specified ('data quality descriptions'). While both types have intersections, e.g. different domains use different descriptions for the same data quality concept, there is a lack of concepts that leverage these intersections to generate knowledge and support both producers and users. Thus, the development of concepts interfacing ontologies of different types has high potential. Assuming that information on a dataset's thematic scope is given by linking to terms in an ontology, such interfaces could infer the overarching domain of the thematic terms and provide the user with recommendations of data quality indicators that are commonly used in this domain.
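The inference step proposed above can be sketched minimally: map a dataset's thematic terms to their overarching domain, then collect the quality indicators commonly used there. All term, domain, and indicator mappings below are invented placeholders; a real interface would traverse the linked ontologies instead of lookup tables.

```python
# Minimal sketch of the proposed ontology interface: infer the overarching
# domain from a dataset's thematic terms and recommend quality indicators
# commonly used in that domain. All mappings are illustrative placeholders.

TERM_TO_DOMAIN = {
    "cropland": "land use",
    "pasture": "land use",
    "species occurrence": "biodiversity",
}

DOMAIN_TO_INDICATORS = {
    "land use": ["thematic accuracy", "completeness"],
    "biodiversity": ["positional accuracy", "temporal validity"],
}

def recommend_indicators(thematic_terms: list[str]) -> set[str]:
    """Collect indicators for every domain the dataset's terms map to."""
    domains = {TERM_TO_DOMAIN[t] for t in thematic_terms if t in TERM_TO_DOMAIN}
    return {ind for d in domains for ind in DOMAIN_TO_INDICATORS[d]}

print(sorted(recommend_indicators(["cropland", "pasture"])))
```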
The use of a unique set of validation samples can foster thematic accuracy comparison across several geodata products by checking their accuracy at the exact same locations. Still, further developments are needed. Comparing products with different spatial resolutions via a unique set of validation samples requires methodologies that account for the scale mismatch in the sample-to-product assessment. Data collection guidelines developed by the community will help with future data integration, but not with making use of validation samples that have already been developed. Therefore, the integration of various data sources into a comprehensive validation database requires methodologies that can remove or quantify the bias introduced by the different data collection protocols used (cf. Ehrmann, Seppelt, and Meyer 2020).
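The core idea of the shared validation set can be sketched as follows: every product is checked at the same reference locations, so accuracies become directly comparable. The sample locations and class labels below are invented for illustration.

```python
# Sketch of comparing thematic accuracy across products with one shared set
# of validation samples: each product is evaluated at the exact same
# locations. Sample data and product labels are invented for illustration.

def thematic_accuracy(reference: dict, product: dict) -> float:
    """Fraction of validation locations where the product matches the reference."""
    matches = sum(1 for loc, ref_class in reference.items()
                  if product.get(loc) == ref_class)
    return matches / len(reference)

# One shared reference set: location id -> reference land-use class
reference = {1: "cropland", 2: "forest", 3: "urban", 4: "cropland"}
product_a = {1: "cropland", 2: "forest", 3: "urban", 4: "pasture"}
product_b = {1: "cropland", 2: "cropland", 3: "pasture", 4: "cropland"}

print(thematic_accuracy(reference, product_a))  # 0.75
print(thematic_accuracy(reference, product_b))  # 0.5
```

This sketch deliberately ignores the scale-mismatch problem described above; relating a point sample to a coarse grid cell would require an additional aggregation or support-matching step.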
Given the needs to continuously improve the reliability of data quality metadata and to keep metadata comparable across data products, an important frontier lies in developing generic tools and workflows for dynamic quality-(re)assurance. A showcase illustrating the potential of dynamic quality-(re)assurance for enhancing the reliability of downstream scientific and policy applications is implemented within GlobES (www.globesdata.org). Here, an automated workflow periodically re-validates the thematic accuracies of time series on ∼70 natural and artificial ecosystems (incl. different land-use classes) against global GBIF-facilitated occurrence records of plant and animal species known to rely on the respective ecosystem class as habitat. The rapidly increasing GBIF data volumes will thus regularly change the spatiotemporal patterns of classification uncertainties in the gridded products. Propagating these further into the uncertainty bars of aggregated ecosystem-change indicators also improves the precision and reliability of progress-tracking towards global policy targets.
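The re-validation idea can be sketched as a simple agreement check: for each occurrence record, does the grid cell's mapped class match the habitat the species is known to rely on? The data structures and species-to-habitat assignments below are illustrative; the actual GlobES workflow queries GBIF and gridded time series at scale.

```python
# Hedged sketch of the GlobES-style re-validation idea: check mapped
# ecosystem classes against occurrence records of species known to rely on
# that class as habitat. All inputs are illustrative placeholders.

def revalidate(occurrences: list, mapped_class: dict, habitat_of: dict) -> float:
    """Share of occurrences whose grid cell is mapped as the species' habitat."""
    hits = sum(1 for cell, species in occurrences
               if mapped_class.get(cell) == habitat_of[species])
    return hits / len(occurrences)

# species -> ecosystem class it is known to rely on as habitat
habitat_of = {"Alauda arvensis": "grassland", "Dryocopus martius": "forest"}
# grid cell id -> ecosystem class in the product being re-validated
mapped_class = {"c1": "grassland", "c2": "forest", "c3": "cropland"}
# (grid cell, species) pairs derived from occurrence records
occurrences = [("c1", "Alauda arvensis"), ("c2", "Dryocopus martius"),
               ("c3", "Alauda arvensis")]

print(round(revalidate(occurrences, mapped_class, habitat_of), 3))  # 0.667
```

Re-running this check whenever new occurrence records arrive yields the periodically updated accuracy estimates described above.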
Realistically, few producers may currently have the capacity to implement and finance such dynamic quality-(re)assurance on their institutional hardware. Easy-to-use tools (e.g. R packages) to support setting up similar automated routines on cloud-based geocomputation platforms are as needed as solutions for financing the curation of, and continued use of computational resources by, such routines beyond the funding periods of the projects that originally developed the data products. To ensure reproducibility and provide permanent links to data published alongside a paper, standards and regulations for data publishing have increased, and new journals dedicated to data publishing have emerged. However, so far, increased requirements regarding documentation or transparency are mainly borne by producers, while publishers provide few services regarding, e.g., metadata generation or updates. Thus, beyond the perspectives of users, producers, and developers integrated here, additional actors need to be considered. Efforts from all actors involved are essential to ensure the implementation of the approaches presented here and to widely improve the availability and accessibility of data quality and provenance information.

Figure B1. Mean availability (blue) and mean accessibility (green) of relevant data quality and provenance information (0 = low, 5 = high). Only quality information mentioned at least five times (a) was included.

Figure B2. Current (blue) and preferred (green) source of data quality and provenance information. Twenty-six participants responded that provenance information is relevant for their fitness-for-use assessments. Further, thematic accuracy and completeness were named most often as being relevant (a = 15). The availability and accessibility of relevant quality information (a ≥ 5) were rated medium, except for provenance information, which was rated as less available or accessible (Figure B1). Many participants currently obtain data quality and provenance information from associated publications or reports (a = 24), yet this is the preferred source of information for fewer respondents (a = 16) (Figure B2). Most participants prefer the presentation of quality information in tables instead of retrieving this information from text sections (a = 23). To determine fitness for use, users tend to evaluate multiple data quality elements for a certain dataset (a = 17) or multiple datasets for a certain quality element (a = 14). Moreover, most participants download the data and perform some tests (a = 20). Survey results indicate that for certain quality indicators spatially, temporally, or thematically specific information is needed, e.g. for positional accuracy or thematic accuracy. Respondents further stressed the usefulness of concepts that link data quality and provenance information and visualize them accordingly (Fischer, Egli, and Henzen 2022).

Figure 1. Roles of data user, data producer, and software developer and their interactions within the process of improving the availability and accessibility of provenance and data quality information. Numbers inside parentheses represent the sequence of steps of this process.

Figure 3. Visual representation of the ontology developed in this project with the ontologics R-package. Circles that are nested into bigger circles show hierarchically narrower concepts. Commodities are only partially visualized for clarity, but can be found in the online version of the ontology (Ehrmann, Rümmler, and Meyer 2022).

Figure 4. Dashboard showing provenance information, general metadata, and data quality for the MapSPAM (Yu et al. 2020) dataset. The provenance graph shows that input data included irrigation. Since irrigation is also used as a predictor in the downstream analysis, this could bias statistical parameter estimates and cause circular reasoning. Information was generated based on the methodological descriptions in associated publications and supplements.

Table 3. Overview of approaches to increase the availability and accessibility of metadata for geospatial datasets and related tools.