Publishing NextGEOSS data on the GEOSS Platform

ABSTRACT This paper is the second of a series that describes some of the main dataset resources presently shared through the GEOSS Platform. The GEOSS Platform was created as the technological tool to implement interoperability across the Global Earth Observation System of Systems (GEOSS); it is a brokering infrastructure that presently brokers more than 190 autonomous data catalogs and information systems. This paper focuses on the NextGEOSS datasets and describes the data publishing process from NextGEOSS to the GEOSS Platform. In particular, both the administrative registration and the technical registration are taken into consideration. The NextGEOSS datasets are among the most important data shared through the GEOSS Platform; the present study provides some insights into how GEOSS users search for NextGEOSS data.


Introduction
This is the second publication of a series of manuscripts presenting some significant datasets that are currently published and accessible through the GEOSS Platform. GEOSS (the Global Earth Observation System of Systems) is a social and software ecosystem sharing independent and open Earth observation (EO) data, information, and processing services. The GEOSS Platform (formerly called GCI: GEOSS Common Infrastructure) was created as the technological tool to implement interoperability among the ecosystem of enterprise systems and to be the cornerstone around which GEOSS is implemented (Boldrini, Hradec, Craglia, & Nativi, 2021; Craglia, Hradec, Nativi, & Santoro, 2017; Nativi et al., 2015). Recently, the GEOSS concept has been evolving towards the Digital Twin pattern enabled by a flexible and scalable digital ecosystem (Guo et al., 2020; Nativi, Mazzetti, & Craglia, 2021; Santoro, Mazzetti, & Nativi, 2020). Furthermore, the new GEOSS Platform will enable model sharing, as demonstrated at the GEO Plenary held in Canberra in 2019 (Ollier, 2019).
GEOSS was started and is operated by the Group on Earth Observations (GEO). GEO is an intergovernmental partnership working to improve the availability, access, and use of open Earth observations, including satellite imagery, remote sensing, and in situ data, to impact policy and decision-making in a wide range of sectors (GEO, 2022). Established in 2005, GEO is today a partnership of more than 100 national governments and more than 100 Participating Organizations. GEO, GEOSS, and the GEOSS Platform were fully introduced in the first manuscript of the series (Roncella et al., 2022).

Becoming a provider of the GEOSS ecosystem and publishing datasets
The enterprise organizations contributing to the GEOSS ecosystem share their EO data and information on the GEOSS Platform, where users and software clients can discover and access them. The organizations must follow a rather simple procedure to join the ecosystem and become a data/service Provider. The provision process consists mainly of two steps implementing the administrative and technological registration to the ecosystem platform, as depicted in Figure 1 (GEOSS Infrastructure Development Task Team, 2017).
The two registration steps provide the necessary (administrative, political, and technological) information to implement the required interoperability arrangements among the many different providers and the ecosystem itself. The GEOSS Platform accomplishes the following interoperability functions: (a) internet protocol interfaces mediation for data discovery and access services; (b) data models/encodings mediation and harmonization; (c) metadata models/encodings mediation and harmonization; (d) policy models harmonization.
New (as well as already registered) Providers must interface with the GEO Secretariat (GEOSec) to execute the two registrations and get the necessary support and guidelines. In keeping with the GEOSS data sharing principles (GEO, 2015), the two-step registration process (GEOSS Infrastructure Development Task Team, 2017) was introduced in detail in the first article of the series (Roncella et al., 2022).
Building on how GEO Data Providers have addressed publishing and accessibility challenges, the authors discuss the issue of sharing datasets, including the use of the services and the APIs offered by a complex System of Systems Platform like GEOSS. The manuscript also tackles issues that involve policy, social, and technological aspects.
The next section introduces the NextGEOSS datasets. Section three then discusses the implementation of the two registration steps introduced above. Finally, a concluding section discusses the significant contribution made by the NextGEOSS Data Provider, based on an analysis of relevant usage statistics, and the possible enhancements to improve data discoverability and accessibility for NextGEOSS datasets.

NextGEOSS datasets
The NextGEOSS Catalogue is a European data hub for Earth Observation data, based on the Comprehensive Knowledge Archive Network (CKAN), an open-source data management system, which leverages existing data infrastructures in Europe by providing a single access point to the available data. For each data infrastructure, datasets are harvested, providing a link to the original data at the source, while harvesting relevant general and topic-specific metadata that allow querying and easy discovery of those datasets from the catalogue.
Access to data is, most of the time, the main need of scientific communities and application developers. The main goals of the NextGEOSS Catalogue are to facilitate data access and to make data easily discoverable and reachable by these communities. By cataloguing data in a platform like NextGEOSS, Data Providers and the scientific communities aim to increase the visibility of their data.
The NextGEOSS Catalogue follows a modular design, containing multiple CKAN Docker instances running in different nodes of a cloud-based cluster. The distributed solution of the NextGEOSS Catalogue helps to:
• Ensure scalability: to add new data coming from new Data Providers, it is a simple process to add new cluster nodes and integrate them in the catalogue system.
• Reduce downtime: to perform maintenance procedures on a specific node, only that node needs to be stopped. The other nodes and the catalogue keep working and delivering results to the end users.
• Support massive harvesting: it is possible to scale and spread harvesters across the different nodes and collect data for specific areas of interest or specific time intervals.
At the date of this paper, the NextGEOSS Catalogue contains metadata (and provides access to the data in the original data sources) from 42 Data Providers and 181 data collections. It exposes mainly Earth Observation data acquired by satellites, but it also exposes in-situ data, data generated by scientific models, and statistics. Multiple data connectors were developed to allow harvesting metadata from different Data Providers, which expose their data via standard interfaces such as OpenSearch, OGC CSW, OGC WFS, OGC WCS, DHuS, and REST APIs. A powerful backend handles all the queries performed by end users, who can discover data via a user-friendly and intuitive geoportal (https://catalogue.nextgeoss.eu) or via the OpenSearch interface. The OpenSearch interface follows a two-step search approach: the user first searches for a specific data collection and then adds other filters to search for the datasets inside that collection (area of interest, time interval, free text search, and others).
The OpenSearch interface made available by the NextGEOSS Catalogue (applying the OGC standards) helps to increase interoperability with other data catalogues. The NextGEOSS OpenSearch interface was used to publish the NextGEOSS datasets and (mainly) for the brokering of the NextGEOSS system in the GEOSS Platform, which is the main subject of this paper.
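The two-step OpenSearch approach can be illustrated with a minimal Python sketch that only builds the request URLs. The base URL, the endpoint paths, and the parameter names (`q`, `collection_id`, `bbox`, `start`, `end`) are hypothetical and chosen for illustration; the actual names are given by the catalogue's OpenSearch description document.

```python
from urllib.parse import urlencode

# Hypothetical base URL: an assumption for illustration, not the real endpoint.
BASE_URL = "https://catalogue.example.org/opensearch"


def collection_search_url(free_text):
    """Step 1: build a query that looks for data collections."""
    return f"{BASE_URL}/collection_search.atom?" + urlencode({"q": free_text})


def dataset_search_url(collection_id, bbox=None, start=None, end=None, free_text=None):
    """Step 2: build a query for datasets inside one collection.

    Optional filters are added only when they are provided.
    """
    params = {"collection_id": collection_id}
    if bbox is not None:
        # Assumed order: west, south, east, north
        params["bbox"] = ",".join(str(v) for v in bbox)
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    if free_text is not None:
        params["q"] = free_text
    return f"{BASE_URL}/search.atom?" + urlencode(params)


# Example: pick a collection first, then narrow by area and time interval.
step1 = collection_search_url("sea ice")
step2 = dataset_search_url(
    "SENTINEL2_L1C",  # hypothetical collection identifier
    bbox=(-10, 35, 5, 45),
    start="2021-01-01",
    end="2021-01-31",
)
```

Keeping the two steps as separate functions mirrors the interface design: the collection identifier returned by the first query becomes a mandatory parameter of the second.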

Administrative (yellow page) registration
The administrative registration aims to provide Data Providers with a clear, simple, and transparent online form to fill in (see supplemental Annex A); the NextGEOSS Data Provider completed it in November 2020. Table 1 shows the required information and the contents entered by the NextGEOSS Data Provider.
NextGEOSS builds upon the European vision of GEOSS data exploitation for innovation and business and the concept of system of systems. This catalogue is operational, regularly enriched with fresh data, and monitored. The objectives of the NextGEOSS catalogue are:
a) provide tools and support to data or applications providers to catalogue their valuable data and applications assets to NextGEOSS for a wider reach.
b) provide a user feedback loop from the data consumers to the data providers. This user feedback mechanism can, with certain conditions, collect and redistribute the feedback made from several portals.
Protected personal data (contact point names and emails) are not shown, in accordance with the General Data Protection Regulation (GDPR). The service endpoint is omitted for policy reasons.

Interoperability (GEO DAB) registration
The GEO DAB component is middleware that interconnects the heterogeneous and distributed capacities contributing to GEOSS. It implements a brokering process through an interoperability registration and produces an abstract and harmonized view of the diverse data/metadata. This is possible by mapping the different data and metadata models into a flexible and extensible general model based on the ISO 19115 standard. ISO 19115 is a rich and extensible metadata model containing more than 400 metadata elements to describe geographical datasets in depth, allowing the addition of new concepts and related attributes through an extension mechanism. The GEO DAB periodically harvests Data Provider services to fetch the original dataset metadata, harmonizes them into the internal model, and stores the information in a central database. This mechanism allows harmonized and efficient discovery and access of records across the many heterogeneous GEOSS Data Providers. The GEO DAB users are typically software agents such as web-based (e.g. GEOSS Portal) or desktop client applications. To simplify the development of applications and clients making use of the GEO DAB, a set of high-level client-side Open APIs has been designed and developed to discover and access GEOSS resources via the GEOSS Platform; these APIs are described in the first paper of this series (Roncella et al., 2022). The brokering process can be used to set up a connection with the remote NextGEOSS system, both in terms of discoverability and accessibility.
During the first phase of the interoperability registration, the GEO DAB team and the NextGEOSS technical team discussed the best options in order to connect the GEOSS Platform to the NextGEOSS Catalogue: in fact, NextGEOSS features CKAN and OpenSearch, which are two different standard interfaces supported by the GEO DAB. Following the suggestion of the Data Provider, the GEO DAB team opted for the OpenSearch interface to communicate with the remote NextGEOSS system. The NextGEOSS OpenSearch service has two different endpoints: one for searching the dataset Collections and one for retrieving products/granules.
The GEO DAB follows a two-step search approach:
(1) During the first step, the dataset collections (i.e. series) matching the user query are retrieved. Each collection is an aggregation of many individual datasets (i.e. granules) that can be further searched.
(2) During the second step, the user selects one or more collections from the first step. A second query is issued and granules matching the user query are retrieved.
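The two steps above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the GEO DAB implementation: the collection records, the `fetch_granules` callable, and all identifiers are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical collection records, standing in for the results of a
# first-step discovery query.
COLLECTIONS = [
    {"id": "S2_L1C", "title": "Sentinel-2 Level-1C"},
    {"id": "L8_OLI", "title": "Landsat 8 OLI"},
    {"id": "BUOYS", "title": "Multiparametric buoys"},
]


def search_collections(collections: List[Dict[str, str]], keyword: str) -> List[Dict[str, str]]:
    """Step 1: return the collections whose title matches the keyword."""
    kw = keyword.lower()
    return [c for c in collections if kw in c["title"].lower()]


def search_granules(
    selected: List[Dict[str, str]],
    query: Dict[str, str],
    fetch_granules: Callable[[str, Dict[str, str]], List[str]],
) -> Dict[str, List[str]]:
    """Step 2: issue one granule query per collection selected by the user."""
    return {c["id"]: fetch_granules(c["id"], query) for c in selected}


def fake_fetch(collection_id: str, query: Dict[str, str]) -> List[str]:
    """Stand-in for a remote granule query; returns made-up granule names."""
    return [f"{collection_id}-granule-{i}" for i in range(2)]


matches = search_collections(COLLECTIONS, "sentinel")
granules = search_granules(matches, {"start": "2021-01-01"}, fake_fetch)
```

Passing the granule query as a callable keeps the second step independent of how the remote system is actually contacted.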
Technically, collections are harvested periodically by the GEO DAB (once a month) and their metadata are stored in a central database to optimize first-step queries; granule metadata are instead not stored in the central database and are only retrieved at query time by executing distributed queries to the NextGEOSS system. At the date of this paper, the GEO DAB publishes 154 dataset Collections from the different EO satellite data and in-situ data, including more than 10 million datasets representing the products/granules. All the metadata records present in the NextGEOSS data system use the Atom (Web standard) format, based on XML, and are served over HTTPS. The main metadata fields coming from NextGEOSS included in each entry in the results feed are:
• atom:title: identifies the title of the resource;
• atom:id: represents the identifier of the Collection or product/granule;
• georss:polygon: indicates the spatial coordinates;
• atom:content: indicates the abstract description of the dataset Collection;
• atom:summary: indicates the abstract description of the product/granule;
• atom:published: represents the published date of the dataset;
• atom:updated: identifies the last revision date of the dataset;
• dc:date: represents the time range;
• atom:link: contains online information to access or download the resources, including links to the thumbnail previews.
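As an illustration of how a client might read the fields listed above, the following Python sketch parses a small Atom feed with the standard library. The sample feed is invented for this example (titles, identifiers, dates, link relations, and URLs are assumptions), and a real NextGEOSS response may contain additional or differently structured elements.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for Atom, GeoRSS Simple, and Dublin Core elements.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample feed, used for illustration only.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <entry>
    <title>Sample granule</title>
    <id>sample-granule-001</id>
    <published>2021-01-15T10:00:00Z</published>
    <updated>2021-01-16T08:30:00Z</updated>
    <summary>Illustrative product description</summary>
    <dc:date>2021-01-15T10:00:00Z/2021-01-15T10:05:00Z</dc:date>
    <georss:polygon>35 -10 35 5 45 5 45 -10 35 -10</georss:polygon>
    <link rel="enclosure" href="https://data.example.org/sample-granule-001.nc"/>
    <link rel="icon" href="https://data.example.org/sample-granule-001.png"/>
  </entry>
</feed>
"""


def parse_polygon(text):
    """Turn a GeoRSS Simple polygon (space-separated lat lon pairs) into tuples."""
    values = [float(v) for v in text.split()]
    return list(zip(values[0::2], values[1::2]))


def parse_entries(feed_xml):
    """Extract the main metadata fields from each atom:entry of a feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", NS):
        polygon = entry.findtext("georss:polygon", default=None, namespaces=NS)
        entries.append({
            "title": entry.findtext("atom:title", default=None, namespaces=NS),
            "id": entry.findtext("atom:id", default=None, namespaces=NS),
            "summary": entry.findtext("atom:summary", default=None, namespaces=NS),
            "published": entry.findtext("atom:published", default=None, namespaces=NS),
            "updated": entry.findtext("atom:updated", default=None, namespaces=NS),
            "date": entry.findtext("dc:date", default=None, namespaces=NS),
            "polygon": parse_polygon(polygon) if polygon else [],
            "links": [
                {"rel": link.get("rel"), "href": link.get("href")}
                for link in entry.findall("atom:link", NS)
            ],
        })
    return entries


entries = parse_entries(SAMPLE_FEED)
```

Fields that are absent from an entry are returned as `None` (or an empty list for the polygon and links), so the same function can be applied to both Collection entries and product/granule entries.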
In terms of discoverability, the GEOSS Platform allows users to retrieve datasets through queries based on different constraints typical of geoportals: geographical coverage, temporal extent, keywords, and others. Figure 2 shows the NextGEOSS dataset collections available in the GEOSS Portal.
In terms of accessibility, the GEOSS Platform allows users to download the NextGEOSS data through the HTTP/HTTPS protocol. The NextGEOSS Catalogue provides the necessary information to download data for both Collection entries and products. Initially, some Collections contained placeholders for some metadata elements (i.e. http://example.com for the link metadata element) and GEO DAB implemented a specific solution to ignore them.
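A placeholder filter of the kind mentioned above could look like the following Python sketch. It is not the solution implemented in the GEO DAB, whose details are not described here; the set of placeholder hosts and the function names are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hosts treated as placeholders: an assumption for this example.
PLACEHOLDER_HOSTS = {"example.com", "www.example.com"}


def is_placeholder(url):
    """Return True when the URL has no host or points to a placeholder host."""
    host = urlsplit(url).hostname
    return host is None or host in PLACEHOLDER_HOSTS


def usable_links(links):
    """Keep only the links that can plausibly be used for access or download."""
    return [url for url in links if not is_placeholder(url)]


links = [
    "http://example.com",
    "https://data.example.org/products/sample.nc",  # made-up download URL
    "",
]
kept = usable_links(links)
```

Comparing the parsed host name, rather than searching for a substring, avoids discarding legitimate URLs that merely contain the placeholder text somewhere in their path.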
In some cases, due to policy agreements, direct data download from the GEOSS Platform may not be possible. In these cases, users are redirected to the Provider data system or to third-party systems. Sometimes the NextGEOSS data system requires the use of credentials (username and password) to download products or access the thumbnails. Several products can be directly downloaded as Network Common Data Form (NetCDF) files or Hierarchical Data Format (HDF) files, depending on the different remote resources. For some resources, the NextGEOSS data system also provides an OPeNDAP URL, widely used in Earth science, to optimize the retrieval of gridded data.

Discussion and considerations
The analysis of the end-user query requests to the GEOSS Platform makes it possible to assess some discovery trends of NextGEOSS datasets and to provide a few hints about possible enhancements to improve their findability and accessibility. The analysis focuses on user requests (i.e. human interaction statistics expressed as query requests to the GEOSS Portal). In addition, it is acknowledged that automatic tools (i.e. software clients), managed by approved organizations (notably, WMO WIS GISC), make regular harvesting requests to periodically gather the entire GEOSS Platform metadata content, including the NextGEOSS datasets.
The analyzed indicators cover a time period from January 2021 (i.e. when the NextGEOSS catalogue was added to the GEOSS Platform) to May 2022. During this period, GEOSS Portal users made 326 queries that returned one or more NextGEOSS records as results, an average of about 22 queries per month. Figure 3 shows the search keywords most used by the users to retrieve NextGEOSS datasets.
Notably, many matching requests (about 23%) have no keyword indicated by the user. An in-depth analysis shows that all these requests (with missing keywords) actually indicated NextGEOSS as the target catalog to search; thus, this search clause acts, in effect, as a keyword. The most popular keywords explicitly expressed by the users are Sentinel 2 and Landsat 8; together they represent about 28% of the matching queries. Most of the users who obtained some datasets from NextGEOSS as a result seem to be interested in satellite data. There is also significant interest in studying the current pandemic situation (covid keyword). These searches were performed mainly in the first half of 2021. NextGEOSS indeed includes two dataset Collections related to the COVID-19 theme: ECDPC COVID-19 and Air Quality Megacities Pollution and Covid19 Timeseries. Figure 4 illustrates the most popular dataset collections among the NextGEOSS shared resources. Remote sensing resources appear amongst the most returned ones (mostly Sentinel, but also EUMETSAT and Landsat). Collections of in-situ observations are also popular (e.g. data from multiparametric buoys).
The IP addresses of the request originators (i.e. the users) were further analyzed to understand who the consumers of NextGEOSS resources are. The "whois" program was employed to retrieve information on each IP address from the public registries, in particular the associated organization and country. It should be noted that such identification is not always possible and sometimes does not give meaningful results. Indeed, most IP addresses are associated with Internet Service Providers (ISPs), which provide dynamic IP addresses to their users; two factors may contribute to this: employees working remotely (also due to the COVID lockdowns) and the interest of citizens. For the cases where the identification was possible, the main users that emerge are the European Commission, academia, and some research centers. Table 2 shows the top countries from which the user requests originated. The top five countries are Italy, France, Portugal, Germany, and Poland.
From the GEOSS Platform point of view, the NextGEOSS catalog ranks in the top third of Data Providers by popularity. Drawing on the experience of NextGEOSS and similar Data Providers, some recommendations are drafted here to try to increase their popularity:
• To implement the FAIR (Findability, Accessibility, Interoperability, and Reuse) principles.
• To align with the major international directives and guidelines on metadata and data (e.g. INSPIRE, ISO 19115, NetCDF-CF).
• To include (wherever possible) the use of vocabularies and linked data instead of free text.
• To test the implementation (for example, trying to search their data on the GEOSS Platform) to verify that the query system is giving back the expected results.
• To hold periodic follow-ups between the GEOSS Platform Operation Team (GPOT) and the Data Provider after the brokering of the source has been completed. Feedback to and from the Data Providers is fundamental to publish metadata and data to the GEOSS Platform in the most effective way.
• To be active at the Community level. Hopefully, GEO will soon revamp the GEOSS Platform Community workshops (GEO Data Provider Workshop, GEO Data Technology Workshop, etc.).

His research interests deal with multidisciplinary interoperability, designing and developing infrastructures and services for geo-spatial resources, with particular focus on semantic discovery and environmental models interoperability. He is responsible for the Geospatial Artificial Intelligence and Information Sharing (GAINS) Working Group of the CNR-IIA. He participated in several research projects and initiatives funded by EC (FP7, H2020), US NSF, and National R&I frameworks. He coordinates the design and development of the VLab framework. He is responsible for the GEO Discovery and Access Broker (GEO DAB) operational environment, member of GEOSS Common Infrastructure (GCI) Operations Foundational Task, GEOSS Platform Operations Team and of the GEOSS Infrastructure Development Task Team.
Paolo Mazzetti is Head of the Division of Florence of CNR-IIA. He holds a degree in Electronic Engineering and he taught "Telematics" at the University of Florence in Prato for the degree in Information Engineering for seven years. He has more than fifteen years of experience in the design and development of infrastructures and services for geo-spatial data sharing in the context of national, European (FP7, CIP, H2020) and global initiatives. He is Principal Representative of Italy in the GEO Programme Board. He is the Coordinator of the GEO DAB activities in the GEOSS Common Infrastructure (GCI) Operations Foundational Task, and member of the GEOSS Infrastructure Development Task Team. He has been a member of the GEO Secretariat Expert Advisory Group (EAG) and of the GEO Institutions Development Implementation Board (IDIB). He is a member of the EuroGEO Coordination Group representing Italy. He is a member of the National Working Group on Land Degradation Neutrality. He is the project coordinator of the NewLife4Drylands LIFE Preparatory Project.
João Andrade is a senior software engineer of the EO Ground Segment Systems business unit at DEIMOS, experienced in projects involving processing platforms, data archives, orchestration of processors and Big Data, namely SenSyF, Co-ReSyF, Hydrology-TEP, SimOcean, Marine-EO and NextGEOSS. He was involved in the development phase of the SenSyF project, more specifically, in the development of a Service Development Kit (SDK) which was integrated in the project's framework. He was involved in the integration of WOIS (Water Observation and Information System) on the ESA's Hydrology-TEP Platform and the setup of the CKAN catalogue and harvesters (data connectors) to get and store data within the SIMOcean project. He has been the Systems Engineer of NextGEOSS and responsible for the development and maintenance of the NextGEOSS distributed Data Catalogue and all the Data Connectors to collect data from many European providers through a wide range of interfaces such as OpenSearch, OGC CSW, OGC WFS and many others. He is involved in other projects containing catalogue and archive solutions for Big Data, such as CHEOPS Archive, Marine-EO, MELOA and NAOS-DC. He was involved in the architecture design of a Data infrastructure for the Instituto Hidrográfico in Portugal. He is also responsible for the deployment of the PDGS in the FSSCAT project (archive and orchestration of processors) and recently product owner for the Archive4EO and Chain4EO (software products developed by Deimos to catalogue and archive data and orchestrate data processors). He is currently involved in the design, development and deployment of the Deimos Services4EO. He holds an M.Sc. degree in Electrical and Computer Engineering from Instituto Superior Técnico in Lisbon.
Nuno Catarino is the Head of PDGS Division in the Ground Segment Systems Business Unit at DEIMOS Engenharia. As Head of Division, Nuno is responsible for coordination of the operational projects team, project control and Business Development. Under his leadership, the PDGS division has significantly increased activity in the past years, currently accounting for about a fourth of the total company revenue. Within PDGS and the broader Space Systems and Earth Observation fields, Nuno's activities range from the development of remote sensing systems, data processing, cloud-based big data handling and service creation systems for Earth Observation, to the development of intelligent process management tools, among others. Within these new activities, much of the focus has been devoted to exploring the development of new services and products, in particular for Remote Sensing of the Oceans and Coastal regions. Nuno has led several large (i.e. >2M€) international consortia in software and hardware development for the European Space Agency and Earth Observation markets, involving industrial, institutional and academic partners. He has a first degree in Physics from the major Engineering school in Portugal (IST), and a PhD in Applied Mathematics from one of the UK top schools (Univ. Warwick) and has supervised several MSc students in recent years in activities related to Remote Sensing.