Virtual earth cloud: a multi-cloud framework for enabling geosciences digital ecosystems

ABSTRACT Humankind is facing unprecedented global environmental and social challenges in terms of food, water and energy security, resilience to natural hazards, etc. To address these challenges, international organizations have defined a list of policy actions to be achieved in a relatively short- to medium-term timespan. The development and use of knowledge platforms is key in helping policy makers take sound decisions (providing the best available knowledge) and avoid potentially negative impacts on society and the environment. Such knowledge platforms must build on the recent and emerging digital technologies that have transformed society, including the science and engineering sectors. Big Earth Data (BED) science aims to provide the methodologies and instruments to generate knowledge from numerous, complex, and diverse data sources. BED science requires the development of Geoscience Digital Ecosystems (GDEs), which bank on the combined use of fundamental technology units (i.e. big data, learning-driven artificial intelligence, and network-based computing platforms) to enable the development of more detailed knowledge to observe and test planet Earth as a whole. This manuscript contributes to the BED science research domain by presenting the Virtual Earth Cloud: a multi-cloud framework to support GDE implementation and generate knowledge on environmental and social sustainability.


Introduction
Humankind is facing unprecedented global environmental and social challenges in terms of food, water and energy security, resilience to natural hazards, population growth and migrations, pandemics of infectious diseases, sustainability of natural ecosystem services, poverty, and the development of a sustainable economy (Nativi, Mazzetti, and Craglia 2021). Addressing these challenges is crucial for the preservation of our planet and the future development of human society. To this aim, international organizations have defined a list of policy actions to be achieved in a relatively short- to medium-term timeframe. Notably, the United Nations (UN) defined 17 Sustainable Development Goals (SDGs) 1 along with an implementation agenda. Such an effort is supported by other relevant international initiatives and programmes, including the UNFCCC process and meetings 2 (see for example the 2015 Conference of Parties on Climate: COP21) 3 and the Sendai Framework for Disaster Risk Reduction 4 overseen by the United Nations Office for Disaster Risk Reduction (UNDRR). Most of these frameworks require the development and use of a knowledge platform, which must build on the recent and emerging digital technologies that have transformed society, including the science and engineering sectors. With this in mind, for example, the EU developed a European growth model underpinned by the 'twin transitions', 5 i.e. the green transition must go hand in hand with the digital one (Muench et al. 2022).
A knowledge platform supports policy makers in taking sound decisions (providing the best available knowledge) and avoiding potentially negative impacts on society and the environment, also considering the connections between local and global processes (Mazzetti et al. 2022; Guo et al. 2020; Nativi, Mazzetti, and Craglia 2021). Sustainable development must be guided by a science that transitions from focusing on one problem at one scale at a time, to truly exploring the complexity and systemic nature of the simultaneous and mutually interacting cause-and-effect chains in the interdependent human-nature and people-planet reality of the twenty-first century (Rockström, Bai, and deVries 2018).
Understanding the impacts and interrelationships between humans as a society and natural Earth system processes requires a significant effort in collecting, analyzing, and sharing relevant information at different spatial and temporal scales. With the advent of the Digital Transformation, the interconnection between the physical and the digital world has become almost complete: economic, industrial, and social relationships have moved to the 'cyber-physical' world (i.e. the realm where hybrid digital-physical systems operate, thanks to transformative technologies for managing the interconnection between their physical assets and computational capabilities (Lee and Kao 2015)), where all the relevant stakeholders are included more easily and can intensively cooperate in generating the knowledge required for addressing a given purpose (Nativi, Mazzetti, and Craglia 2021). The Digital Earth concept (Gore 1998) represents an overarching effort in that direction to address the scientific and technological challenges which must be faced to enhance our understanding of the Earth system. As recognized by the International Society on Digital Earth (ISDE), 'Digital Earth is a multidisciplinary collaborative effort that engages different societies to monitor, address and forecast natural and human phenomena' (ISDE 2019).
In this context, Big Earth Data (BED) science (Guo et al. 2020) aims to provide the methodologies and instruments to generate knowledge from numerous, complex, and diverse data sources, which are essential to develop a sustainable society and preserve planet Earth. BED science requires the development of Geoscience Digital Ecosystems (GDEs) (Nativi and Mazzetti 2020), which bank on the combined use of Big Data, AI data-driven instruments, and online highly scalable computing platforms to observe and test planet Earth as a whole (Guo et al. 2020; Nativi, Mazzetti, and Craglia 2021). For GDEs, these fundamental technology units (i.e. big data, learning-driven artificial intelligence, and network-based computing platforms) enable the development of more detailed knowledge.
A key tool for transforming the huge amount of data currently available into knowledge is represented by scientific models, either theory-driven or data-driven. This manuscript contributes to the BED science research domain by presenting a multi-cloud framework to support GDE implementation and execute scientific models for the generation of knowledge on environmental and social sustainability. The next section (2) introduces the concept of GDE as defined by (Nativi and Mazzetti 2020). Section 3 presents the resources that must be handled by a GDE. In section 4, the concept of Virtual Earth Cloud is defined and a possible architecture is presented. Section 5 describes a proof-of-concept, implemented in an international framework. Finally, section 6 draws some conclusions and discusses future work.

Geosciences digital ecosystems
A Geoscience Digital Ecosystem (GDE) is defined in (Nativi and Mazzetti 2020) as a 'system of systems that applies the digital ecosystem paradigm to model the complex collaborative and competitive social domain dealing with the generation of knowledge on the Earth planet'.
The Digital Ecosystem (DE) paradigm stems from the concept of natural ecosystems (Blew 1996). DEs focus on a holistic view of diverse and autonomous entities (i.e. the many heterogeneous and autonomous online systems, infrastructures, and platforms that constitute the bedrock of a digitally transformed society) which share a common environment. In search of their own benefit, such entities interact and evolve, developing new competitive or collaborative strategies and, in the meantime, modifying the environment (Nativi, Mazzetti, and Craglia 2021).
In the geosciences domain, DEs are called to enable the coevolution (i.e. the complex interplay between competitive and cooperative business strategies) of geosciences public and private organizations around the new opportunities and capacities offered by the digital transformation of society: the Internet, big data, and computing virtualization processes represent some of the main engines of innovation, giving rise to an entirely new type of geosciences ecosystem (Nativi and Mazzetti 2020).
It is worth noting that the Digital Ecosystem (and therefore the Geosciences Digital Ecosystem) approach differs from the approach adopted by currently available cloud platforms for geospatial data processing (e.g. Google Earth Engine (Google 2022), Microsoft Planetary Computer (Microsoft 2022), etc.). Such platforms are highly optimized tools for developing and executing scientific models on top of geospatial data made available on the platform itself, utilizing the computational resources of the underlying cloud platform. Google Earth Engine is built on top of a collection of enabling technologies that are available within the Google data center environment (Gorelick et al. 2017); a similar approach is implemented by Microsoft Planetary Computer. The advantages of this approach are well-known and essentially stem from the control of the entire (end-to-end) technological stack which makes up the platform. A Digital Ecosystem approach, instead, focuses on how to build added-value services on top of existing and autonomous systems (i.e. systems which are operated and governed autonomously from the others) without being able to control the end-to-end technological stack. To this aim, as far as Geosciences Digital Ecosystems are concerned, a set of principles, development patterns, and governance styles were recognized for an effective GDE implementation framework (Nativi and Mazzetti 2020):
- Evolvability and Resilience: a GDE operates in a highly dynamic environment in which technology, policies, and user needs are in constant evolution.
- Emergent Behavior of the GDE as a whole: for the enterprise systems belonging to an ecosystem, the aim is to create a value that is greater than (or different from) the value that they would have without being part of the ecosystem.
- Enterprise Systems Dispersion: the enterprise systems constituting a GDE are generally dispersed (i.e. geographically distributed and heterogeneous) and cope with big data. Therefore, the resulting ecosystem must deal with the challenges characterizing the 'big data' cyber-physical realm and implement appropriate strategies to manage the volume, velocity, variety, and value of data from multiple sources in a scalable way.
- Governance: the enterprise systems constituting a GDE are existing and autonomous structures, managed by different organizations. The governance of a GDE must define and apply the set of rules and principles that will help steer the ecosystem's evolution and effectiveness through the many changes occurring in the political, social, cultural, scientific, and technological environment where it operates.
A high-level architecture for a GDE according to the above principles is designed in (Nativi and Craglia 2021). This high-level architecture recognizes the need to leverage the existing heterogeneous and autonomous systems, which provide the functionalities/resources needed by the digital ecosystem to fulfill its objective. That is, the GDE implementation should apply a System of Systems (SoS) approach to connect and orchestrate the contributing systems in order to provide added-value services and functionalities.
Figure 1 shows a set of enterprise systems, i.e. large complex computing systems which handle large volumes of data and enable organizations to integrate and coordinate their business processes (IGI Global 2022); these are the actual digital ecosystem components. Each enterprise system can share resources (data resources, analytics resources, and computational resources; see section Geosciences Digital Ecosystem Resources) utilizing Web APIs (see next subsection) to interact with the other components of the distributed environment, which is implemented by the ecosystem. The emerging (virtual) platform of the digital ecosystem connects to the enterprise systems to exploit their resources; this is where new components and functionalities are implemented to provide digital ecosystem added-value services. Finally, Figure 1 shows the metasystem, i.e. the governance and cybernetic framework; this is included here for completeness but is out of the scope of this work.
This work focuses on the analysis of the emerging (virtual) platform of the digital ecosystem. In particular, we capture and analyze the requirements and the solutions (conceptual and technological) to enable the execution of heterogeneous analytical software (models) in a GDE framework.

Ecosystem elements: enterprise systems and their interoperability mechanisms
Digital ecosystems must leverage existing heterogeneous and autonomous systems. These provide their functionalities/resources utilizing the Web technologies which are most suitable for their own mandates (they are autonomous). For the objective of this work, the key feature characterizing the enterprise systems contributing to the digital ecosystem is that they expose their resources and functionalities utilizing Web technologies; the specific technology utilized (e.g. Web APIs, Web Services, or Microservices) is of secondary importance.
Web APIs stem from the notion of APIs, which (Shnier 1996) defines as 'the calls, subroutines, or software interrupts that comprise a documented interface so that an application program can use the services and functions of another application, operating system, network operating system, driver, or another low-level software program'. From a software engineering point of view, APIs constitute the interfaces of the various building blocks that a developer can assemble to create an application (Santoro et al. 2019; Vaccari et al. 2021). The expression Web APIs indicates APIs operating over the Web. In the remainder of this paper, the terms Web APIs and APIs are used generically to refer to the mechanisms an enterprise system implements to expose its resources and functionalities on the Web.
Several definitions exist for a Web service, essentially describing it as a service that is offered over the web, irrespective of the usage of specific protocols and message formats (Santoro et al. 2019; OASIS 2006; IBM 2020). The main difference between a Web Service and a Web API stems from their offering. A Web Service provides a service interface (i.e. an interface aimed at offering access to 'high-level' functionalities for end-users). A Web API offers a programming interface (i.e. a set of low-level functionalities that can be used and combined by software developers to deliver a higher-level service). Thus, Web Services and Web APIs differ at the design level but not necessarily at the technological level (Santoro et al. 2019).
Microservices deal specifically with how an application is structured internally; the National Institute of Standards and Technology (NIST) (Karmel, Chandramouli, and Iorga 2016) defines a microservice as 'a basic element that results from the architectural decomposition of an application's components into loosely coupled patterns consisting of self-contained services that communicate with each other using a standard communications protocol and a set of well-defined APIs, independent of any vendor, product or technology'. Microservices can be described as a way of structuring a Web application into loosely coupled, independently deployable components that communicate over the web utilizing lightweight interfaces (Santoro et al. 2019; Karmel, Chandramouli, and Iorga 2016; Newman 2015).
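The distinction between a programming interface (Web API) and a service interface (Web Service) can be sketched in code. The following is an illustrative example with hypothetical function names and a mock in-memory catalogue: two fine-grained "API" calls are composed by one higher-level "service" operation aimed at an end-user result.

```python
# Hypothetical sketch: a Web API offers low-level building blocks;
# a Web Service composes them into a high-level, end-user operation.

def api_search_datasets(keyword):
    # low-level API call: return identifiers of datasets matching a keyword
    catalog = {"ndvi-2021": "vegetation", "lst-2021": "temperature"}
    return [ds for ds, topic in catalog.items() if keyword in topic]

def api_fetch_dataset(dataset_id):
    # low-level API call: return the (mock) dataset payload
    return {"id": dataset_id, "values": [0.2, 0.5, 0.7]}

def service_vegetation_report(keyword):
    # "service"-level operation: combines the two API calls into a report
    datasets = [api_fetch_dataset(d) for d in api_search_datasets(keyword)]
    return {d["id"]: sum(d["values"]) / len(d["values"]) for d in datasets}
```

As the text notes, the difference is at the design level: both layers could well be exposed with the same Web technology.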

Geosciences digital ecosystem resources
As outlined in section 2, the main objective of a GDE is the generation of knowledge about the Earth planet. A key tool for transforming the huge amount of data currently available into knowledge is represented by scientific models, either theory-driven or data-driven. The former, also referred to as physical models, encode the mathematical model of a scientific theory, e.g. numerically solving the set of equations that represent alleged physical laws (Prada et al. 2018). The latter, also termed empirical models, aim at building a model of data by raw data reduction and fitting, with the only objective of empirical adequacy. With the increased availability of data and the development of advanced modeling techniques like Machine and Deep Learning, data-driven modeling is gaining importance (Hofmann et al. 2019; Nisbet, Elder, and Miner 2009).
In order to execute a scientific model (implemented as analytical software), there is the need to discover and utilize different types of resources, which can be classified into three broad categories: data resources, infrastructural resources, and analytical resources. Such resources are provided by the different enterprise systems belonging and contributing to a digital ecosystem and are shared utilizing different APIs, according to the specific implementations of each enterprise system. It is worth noting that an enterprise system can share resources belonging to different categories (e.g. it can share both data and infrastructural resources). Providing such APIs enables an enterprise system to contribute to the overall digital ecosystem, allowing the implementation of new components which utilize the APIs to offer added-value services exploiting the shared resources.
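The three-category classification can be made concrete with a minimal data model. The following sketch (all class, provider, and endpoint names are hypothetical) shows how an ecosystem-level registry might record shared resources, each tagged with its category and the Web API through which it is exposed; note that one enterprise system may appear in several categories.

```python
# Illustrative registry of ecosystem resources (hypothetical entries).
from dataclasses import dataclass

@dataclass
class SharedResource:
    name: str
    category: str      # "data" | "infrastructural" | "analytical"
    provider: str      # enterprise system sharing the resource
    api_endpoint: str  # Web API through which the resource is exposed

REGISTRY = [
    SharedResource("sentinel-2-l2a", "data", "esa-hub", "https://example.org/data"),
    SharedResource("gpu-cluster", "infrastructural", "cloud-a", "https://example.org/iaas"),
    SharedResource("flood-model", "analytical", "univ-lab", "https://example.org/models"),
    # the same enterprise system can share resources of different categories
    SharedResource("era5-subset", "data", "cloud-a", "https://example.org/data2"),
]

def resources_by_category(category):
    return [r.name for r in REGISTRY if r.category == category]
```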
The following sections describe the different categories of resources and their main characteristics.

Data resources
Data to be processed or generated by computational models (e.g. model input and output) belong to this category. Geospatial datasets are characterized by a high level of variety in terms of spatial and temporal characteristics, coordinate reference systems, encoding formats, etc. The resulting landscape is therefore highly heterogeneous. Traditionally, several (standard) data schemas and data access protocols exist for data sharing (Santoro et al. 2018; Craglia et al. 2011), including metadata and data typologies, access service interfaces, and APIs. This heterogeneity addresses the need of handling a great variety of data resources, but it generates a high complexity for working with multi-disciplinary data, requiring specific expertise. Often, domain scientists (including modelers), but also application developers, do not have such expertise. Typically, a domain expert (or application developer) works with a limited set of data types and protocols, belonging to her/his realm.
An enterprise system sharing data resources must provide one or more Web APIs, possibly complying with the FAIR principles (Wilkinson et al. 2016), allowing at least the discovery and access of shared data. In the remainder of this manuscript, such APIs are referred to as Data APIs.
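The two minimal operations a Data API must support, discovery and access, can be sketched as follows. The in-memory catalogue, dataset identifiers, and function names are hypothetical stand-ins for a real enterprise system's endpoints (e.g. an OGC or STAC-style catalogue).

```python
# Hedged sketch of a minimal Data API: discovery + access (mock catalogue).
CATALOGUE = {
    "ndvi-europe-2021": {"format": "GeoTIFF", "crs": "EPSG:4326",
                         "url": "https://example.org/ndvi-europe-2021.tif"},
    "precip-global-2021": {"format": "NetCDF", "crs": "EPSG:4326",
                           "url": "https://example.org/precip-2021.nc"},
}

def discover(keyword):
    """Discovery: return identifiers of datasets matching a keyword."""
    return sorted(k for k in CATALOGUE if keyword in k)

def access(dataset_id):
    """Access: return the metadata needed to retrieve the dataset."""
    if dataset_id not in CATALOGUE:
        raise KeyError(f"unknown dataset: {dataset_id}")
    return CATALOGUE[dataset_id]
```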

Infrastructural resources
This category of resources includes networking, storage, computing, and other fundamental infrastructural resources, which are commonly used to execute a scientific model/workflow. Often, a scientific model run requires infrastructure scalability, i.e. heavy computing capabilities and substantial data storage. Even when a single execution does not require a significant amount of resources, the opportunity to invoke multiple parallel executions might.
Today, empowered by digital transformation technologies, different solutions exist to provide scalable infrastructure resources. Although such solutions differ in terms of technical capabilities and philosophical approaches (e.g. resource availability, costs, privacy, and property rights conditions), they can all be characterized as IaaS (Infrastructure as a Service) solutions (Mell and Grance 2011).
An enterprise system sharing infrastructural resources must provide one or more Web APIs, which allow at least the discovery and instantiation of the resources. In the remainder of this manuscript, such APIs are referred to as IaaS APIs.

Analytical resources
These resources represent the implementation(s) and encoding of scientific models to process one or more datasets. In general, three different approaches for scientific model sharing can be recognized, according to three implementation traits: (a) openness, (b) digital portability, and (c) client-interaction style (Nativi, Mazzetti, and Craglia 2021):
- Model-as-a-Tool (MaaT): users interact with a software tool internally exploiting scientific models, but they cannot interact directly with the models themselves.
- Model-as-a-Service (MaaS): a given implementation of the scientific model runs on a specific server, but this time APIs are exposed for interacting with the model.
- Model-as-a-Resource (MaaR): the source code (or the executable binary) of a scientific model is shared and can be accessed through a resource-oriented interface, i.e. an API.
In the case of MaaT, the user utilizes a dedicated Graphical User Interface to configure and launch the model execution, which is then run on a specific computational resource. A high level of control is ensured in this case as far as how the model is used and executed; however, this methodology is essentially not interoperable (no machine-to-machine interaction is possible). Besides, due to the limitations of the computational resource where the model is executed, scalability of the model is strongly limited as well. Finally, with MaaT the model cannot be automatically moved to different computational resources.
With the MaaS approach the level of interoperability is increased due to the availability of APIs to configure and launch the model execution, i.e. MaaS enables a machine-to-machine interaction. However, also in this case, the scalability limitations described for MaaT still apply and it is not possible to automatically move the model to different computational resources.
Finally, in the case of MaaR, the interoperability level can be considered the same as for any other shared digital resource, e.g. data. With MaaR it is possible to automatically move the model and launch its execution on different computational resources, e.g. allowing selection of the most appropriate one according to the specific needs of the single run. While the scalability and interoperability benefits are clear in this case, this approach requires addressing some challenges, in particular as far as automatically executing all necessary steps for the configuration of the execution environment and the triggering of the model. This manuscript will consider the MaaS and MaaR approaches, which allow a machine-to-machine (M2M) interaction for the execution of a given scientific model. In the remainder of this manuscript, the expression Model APIs refers to the APIs utilized to share a model resource, according to either the MaaR or MaaS approach (in those contexts where the difference is relevant, the MaaR or MaaS qualification is explicitly stated).
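The operational difference between the two M2M approaches can be sketched in a few lines (all names are hypothetical and the "model" is a trivial stand-in). Under MaaS, the caller invokes an API while the model stays on the provider's server; under MaaR, the caller retrieves the model code itself and executes it on an infrastructure of its own choosing, after configuring the execution environment.

```python
# MaaS vs MaaR, as a toy sketch. The "model" is a trivial mean computation.

def maas_run(inputs):
    # MaaS: execution happens behind the provider's API; the caller
    # cannot relocate the model to another computational resource.
    return sum(inputs) / len(inputs)   # stands in for the provider-side run

# MaaR: the model is shared as a resource (here, its source code).
MAAR_SOURCE = "def model(inputs):\n    return sum(inputs) / len(inputs)\n"

def maar_fetch_and_run(inputs):
    # MaaR: retrieve the code, configure an execution environment
    # (here a bare namespace), then trigger the model locally.
    namespace = {}
    exec(MAAR_SOURCE, namespace)
    return namespace["model"](inputs)
```

The MaaR path is what requires the environment-configuration machinery discussed in the Containerization section: in practice the "environment" is far more than an empty namespace.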

Virtual earth cloud architecture
A Virtual Cloud can be defined as a 'customized cloud by aggregating resources and services of different clouds and aims to provide end users with a specific cloud working environment' (An et al. 2017). The use of different clouds is beneficial for several reasons, including cost efficiency, avoidance of vendor lock-in, performance optimization, resilience to service outages, diversity of geographical locations, etc. It is worth noting that the provided definition of Virtual Cloud does not imply any specific approach for the use of different clouds, focusing instead on the fact that the end user is provided with a specific cloud working environment, i.e. the underlying use of different clouds is transparent for the user.
Soon after the emergence of Cloud Computing and of different Cloud Service Providers, the issue of how to enable the use of different clouds emerged as a central topic in the field of Cloud Computing research, also considering that, despite the tremendous development of Cloud Computing, it still suffers from a lack of standardization (Chauhan et al. 2019). Many papers address this topic, analyzing, discussing and classifying the different architectural approaches which can be applied to obtain an effective use of multiple clouds (often referred to as Inter-Cloud), according to different use cases, needs and constraints. In (Grozev and Buyya 2014) the first broad level of architectural classification differentiates between Independent and Volunteer Inter-Cloud environments, defining the concept of Multi-Cloud as 'the usage of multiple, independent clouds by a client or a service', whereas a (cloud) Federation 'is achieved when a set of cloud providers voluntarily interconnect their infrastructures to allow sharing of resources among each other'. The Multi-Cloud strategy fits with the Digital Ecosystem approach, where autonomy of the enterprise systems (including cloud providers) is key.
In the Multi-Cloud approach, cloud brokering plays an important role. Already in 2011, the NIST Cloud Computing Reference Architecture listed the Cloud Broker as one of the five major actors of the architecture (Liu et al. 2011) and defined it as 'An entity that manages the use, performance and delivery of cloud services, and negotiates relationships between Cloud Providers and Cloud Consumers'. Several organizations active in the cloud technology area have identified cloud service brokerage as an important architectural challenge and a key concern for future cloud technology development and research (Fowley, Pahl, and Zhang 2014). A classification of cloud brokering solutions is proposed in (Fowley et al. 2018), where a list of cloud broker capabilities is also provided. Particularly relevant for this manuscript is the Broker Integration Capability, defined as 'building independent services and data into a combined offering, often as an integration of a vertical cloud stack or data/process integration within a layer through transformation, mediation and orchestration' (Fowley et al. 2018).
We define the Virtual Earth Cloud as a Multi-Cloud integration brokering framework for Big Earth Data analytics. It is possible to identify different types of users interacting with the Virtual Earth Cloud, directly or indirectly, according to the following roles:
- Resource Provider: the person/organization providing resources (data, infrastructural, analytical) to the Virtual Earth Cloud;
- Application Developer: the intermediate user developing applications for end-users, exploiting the functionalities made available by the Virtual Earth Cloud;
- End-user: an end-user interacts indirectly with the Virtual Earth Cloud, through an application created by an Application Developer; examples of end-users include policy-makers, decision-makers, and citizens.
Figure 2 depicts the high-level system architecture of the Virtual Earth Cloud framework. Based on the well-known layered architecture style (Richards 2015), the architecture includes three functional layers (which can be implemented as a three-tier architecture): (1) the Presentation layer, which is responsible for handling all user interface components (i.e. Client Applications). The Virtual Earth Cloud does not provide any user interface for end-users; instead it provides APIs which can be invoked by user interfaces to exploit its functionalities (see section Use-cases). (2) The Business Logic layer, which is responsible for implementing all necessary functionalities to satisfy requests from Client Applications, exploiting the available digital resources shared by the enterprise systems which belong to the ecosystem. This layer provides the actual Multi-Cloud Framework functionalities and is comprised of several components (see section Main Components of the Virtual Cloud), which form a Virtual Earth Cloud. This, based on the virtual cloud paradigm (An et al. 2017), creates another level of abstraction, providing users (Client Applications) with a unified perspective on services from a range of heterogeneous providers (the enterprise systems constituting the ecosystem).
(3) The Digital Resources layer, which provides all the required resources to define and execute scientific models (see section Geosciences Digital Ecosystem Resources). Different and autonomous enterprise systems (represented by different colors) contribute their resources. The set of contributing enterprise systems is dynamic, i.e. it is possible for new enterprise systems to join the Virtual Earth Cloud (providing their resources) as well as for contributing enterprise systems to leave it.

Virtual Earth Cloud requirements
The main goal of the Virtual Earth Cloud infrastructure is to allow the execution of analytical software (e.g. scientific models) on the most appropriate of the different underlying enterprise systems, in a seamless way for the requester (i.e. users via Client Applications). To this aim, the Virtual Earth Cloud must provide a common entry-point (e.g. a set of Web APIs), which can be used to request the execution of a scientific model, along with a set of input data. The model is then executed on one or more of the available infrastructures; the infrastructure selection is the result of an optimization metric, which considers parameters such as computational resource availability, latency time, data availability, legal obligations, and execution cost. Finally, results are returned to requesters via the Virtual Earth Cloud Web APIs.
To achieve its goal, the Virtual Earth Cloud infrastructure must provide the following high-level functionalities: (1) Implementation of the workflow required for model execution: configuring the environment (programming languages, software libraries, etc.), ingesting input data, etc.; (2) Provisioning of computational resources from the available underlying enterprise systems; (3) Discovery of and access to the necessary data and model resources, from the available underlying enterprise systems; (4) Optimization of the model execution, e.g. based on availability of computational resources, latency time, and required data.
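The fourth functionality, optimizing where a model runs, can be illustrated with a toy weighted-score selector. All candidate systems, attributes, and weights below are illustrative assumptions, not the framework's actual metric: the point is only that availability, latency, data co-location, and cost can be combined into a single comparable score.

```python
# Toy infrastructure-selection sketch (hypothetical candidates and weights).
CANDIDATES = {
    "cloud-a": {"free_vcpus": 64, "latency_ms": 120, "has_data": True,  "cost": 0.9},
    "cloud-b": {"free_vcpus": 16, "latency_ms": 20,  "has_data": False, "cost": 0.5},
}

def score(c):
    # Higher is better: reward free capacity and data co-location
    # (avoiding data transfer), penalize latency and execution cost.
    return (c["free_vcpus"] / 64
            + (2.0 if c["has_data"] else 0.0)
            - c["latency_ms"] / 200
            - c["cost"])

def select_infrastructure(candidates):
    """Return the name of the highest-scoring enterprise system."""
    return max(candidates, key=lambda name: score(candidates[name]))
```

Here data co-location dominates, so the slower but data-holding "cloud-a" wins; a real metric would also encode legal obligations as hard constraints rather than weights.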
Considering the multi-disciplinary nature of the geosciences domain, the Virtual Earth Cloud must address the high heterogeneity which characterizes each of the resource types to be utilized (e.g. dataset schemas and metadata, scientific model descriptions and code, and computing and storage infrastructures) by supporting the diverse resource and service protocols and APIs exposed by the different enterprise systems which constitute the digital ecosystem.
To support the MaaR approach for scientific model sharing, the Virtual Earth Cloud must be able not only to discover and access model implementations, but also to execute the implementing code. For a given model, several implementations may exist (developed in different software environments or simulation frameworks). Avoiding imposing constraints on model providers is key to keeping interoperability requirements to a minimum (Santoro, Nativi, and Mazzetti 2016; Bigagli et al. 2015). Therefore, the Virtual Earth Cloud must be able to supply the appropriate execution environment for the model, rather than pushing a change in the utilized software environment or simulation framework.

Conceptual approach for Virtual Earth Cloud design
This section introduces the three conceptual approaches the Virtual Earth Cloud design is based on, namely: containerization, orchestration, and brokering.

Containerization
Scientific models are developed in many different programming environments (e.g. Python, Java, R, MATLAB) or simulation frameworks (e.g. NetLogo, Simulink). Therefore, to be able to execute a model, it is not sufficient to access its source code. The retrieved code must be executed in an environment that supports the required programming language and the software libraries utilized by the model source code.
Today, containerization (or container-based virtualization) (Soltesz et al. 2007) is commonly used to address this requirement. Essentially, containerization is the packaging together of software code with all its necessary components (e.g. libraries, frameworks, and other dependencies). This creates a container, i.e. a single fully packaged and portable executable, which can be run on any infrastructure compatible with the specific containerization technology (i.e. container engine). With respect to traditional virtualization approaches based on the creation of full Virtual Machines (VMs), containerization offers several benefits. In particular, containers are typically lighter than VMs and require less start-up time. The result is that containerization allows the use of the same type of computational resources for any software (i.e. model implementation). The only requirement is the availability of the specific container engine, since all other model-specific dependencies are packaged in the container itself.
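The property that matters for the Virtual Earth Cloud can be captured in a small sketch: because all model-specific dependencies travel inside the image, the only compatibility check against a host is the container engine itself. The `ContainerImage`/`ComputeNode` names and the "oci" engine label are hypothetical; a real implementation would delegate to an actual engine such as Docker.

```python
# Hedged sketch of container portability (hypothetical types and names).
from dataclasses import dataclass, field

@dataclass
class ContainerImage:
    model_code: str
    runtime: str                      # e.g. "python:3.11"
    dependencies: list = field(default_factory=list)  # packaged inside

@dataclass
class ComputeNode:
    name: str
    container_engine: str             # the node's only relevant property

def can_run(image, node, required_engine="oci"):
    # No per-model check is needed: the image carries its own runtime
    # and libraries, so only engine compatibility matters.
    return node.container_engine == required_engine

flood_image = ContainerImage("flood_model.py", "python:3.11",
                             ["numpy", "xarray"])
```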

Orchestration
Orchestration is the automated configuration, management, and coordination of computer systems, applications, and services (Red Hat 2021c). The Virtual Earth Cloud infrastructure must provide orchestration functionalities at three different levels: scientific model, container, and computational resources.
Model orchestration coordinates the execution of the steps which compose the execution workflow of a given scientific model. The high-level steps for this orchestration include:
- Management of input data access/ingestion.
- Configuration of model execution.
- Triggering of model execution.
- Storage of the generated output.
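The four steps above can be sketched as a single orchestration function; the broker and storage objects below are hypothetical, in-memory stand-ins for the Data and Analytical SW Broker and the Web Storage service, not the actual Virtual Earth Cloud APIs:

```python
class _MockBroker:
    """Illustrative stand-in for the Data and Analytical SW Broker."""
    def access_data(self, data_id):
        return f"payload-of-{data_id}"
    def execute_model(self, run_config):
        return f"output-of-{run_config['model']}"

class _MockStorage:
    """Illustrative stand-in for the Web Storage ancillary service."""
    def __init__(self):
        self.saved = []
    def save(self, result):
        self.saved.append(result)
        return f"stored-item-{len(self.saved)}"

def orchestrate_model_run(model_id, input_ids, broker, storage):
    # 1. Management of input data access/ingestion.
    inputs = [broker.access_data(d) for d in input_ids]
    # 2. Configuration of model execution.
    run_config = {"model": model_id, "inputs": inputs}
    # 3. Triggering of model execution.
    result = broker.execute_model(run_config)
    # 4. Storage of the generated output.
    return storage.save(result)

ref = orchestrate_model_run("eodesm", ["s2a", "s2b"],
                            _MockBroker(), _MockStorage())
```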
Container orchestration automates the deployment, management, scaling, and networking of containers (Red Hat 2021b). In the case of the Virtual Earth Cloud, this orchestration is needed when a container is submitted for execution to an enterprise system belonging to the ecosystem. The orchestration:
- selects which computing node to use, according to the capacities (memory, CPU, etc.) required by the container;
- executes all container-level configurations (e.g. links to persistent storage, if requested);
- triggers the container;
- monitors the resource allocations and the state of the containers.
Computational resources orchestration deals with all aspects related to the instantiation/removal of computational resources in an enterprise system. When new computational resources are requested on a specific enterprise system, it is necessary to coordinate and invoke the instantiation of the processing, storage, networks, and other fundamental computing resources. Finally, the newly instantiated computational resources must be properly configured to support the containerized execution of models.

Brokering approach for System of Systems
The notion of 'System of Systems' (SoS) and 'System of Systems Engineering' (SoSE) emerged in many fields of application. 'Systems of systems are large-scale integrated systems that are heterogeneous and consist of sub-systems that are independently operable on their own, but are networked together for a common goal' (Jamshidi 2008).
In the brokered approach (Nativi, Craglia, and Pearlman 2013), no common model is imposed on the systems taking part in a SoS. Participating systems can adopt or maintain their preferred interfaces, metadata, and data models. Interoperability is implemented by dedicated components (the brokers) that are in charge of connecting to the participant systems, implementing all the required mediation and harmonization artifacts. The only interoperability agreement is the availability of documentation describing the published interfaces, metadata, and data models, i.e. openness. The brokered approach was successfully applied to develop SoS in contexts characterized by a high level of heterogeneity, e.g. the Global Earth Observation System of Systems (GEOSS) (Nativi et al. 2015; Craglia et al. 2017), the CODATA system (International Science Council 2021), and the WMO Hydrology Observing System (WHOS) (WMO 2021).

Main components of the Virtual Earth Cloud
The main software components of the Virtual Earth Cloud infrastructure are discussed in this section. For each component, the main requirements and functionalities are described.
Figure 3 depicts the internal components of the Virtual Earth Cloud and their interaction with the enterprise systems participating in the ecosystem (for simplicity, only one enterprise system is depicted in the figure). Figure 3 also depicts a set of Ancillary Services. These are generic services that modern cloud infrastructures typically offer to developers. Such services provide general-purpose functionalities, which can be used by application developers according to the SaaS (Software as a Service) paradigm (Mell and Grance 2011). The Web Storage service allows providers to store and retrieve large amounts of data in a web-accessible storage system. The Message Queue service provides cloud-based hosting of message queues, making it possible to create and interact with a message queue from different distributed components.

Data and analytical SW broker
This component enables the discoverability and access of data and analytical resources (models), which are provided by the enterprise systems participating in the digital ecosystem. It provides the other Virtual Earth Cloud components with a unique entry point to discover and access such resources; i.e. it exposes a set of APIs which can be used to discover and access the (data and model) resources from the different enterprise systems.
Based on the brokering approach (Nativi, Craglia, and Pearlman 2013), the Data and Analytical SW Broker implements all the interoperability arrangements that are necessary to interoperate with the heterogeneous Data and Model APIs utilized by the ecosystem enterprise systems to share their resources. To provide a harmonized view of the shared resources, this broker component must implement three main functionalities: mediation, distribution, and harmonization. Mediation interconnects different components by adapting their technological (protocol), logical (data model), and semantic (concepts and behavior) models (Nativi, Craglia, and Pearlman 2013). Distribution makes it possible to 'view' all contributing enterprise systems as if they were a single provider. Harmonization allows consumer applications (generally clients, in this case the other Virtual Earth Cloud internal components) to discover and access available resources according to the same protocol and data model.
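The three functionalities can be illustrated with a minimal sketch; the two provider classes and their record schemas are hypothetical examples of heterogeneous catalogue protocols, not the interfaces actually brokered by the GEO DAB:

```python
class CSWProvider:
    """Hypothetical provider speaking a catalogue-style (CSW-like) protocol."""
    def get_records(self, text):
        return [{"dc:title": "Sentinel-2 L2A", "dc:identifier": "s2-l2a"}]

class STACProvider:
    """Hypothetical provider speaking a STAC-like protocol."""
    def search(self, query):
        return [{"id": "dem-eu", "properties": {"title": "EU DEM"}}]

class DataBroker:
    """Mediates heterogeneous provider APIs into one harmonized data model."""
    def __init__(self, providers):
        self.providers = providers

    def discover(self, text):
        results = []
        for p in self.providers:            # distribution: query every system
            if isinstance(p, CSWProvider):  # mediation: adapt each protocol
                for r in p.get_records(text):
                    results.append({"id": r["dc:identifier"],
                                    "title": r["dc:title"]})
            elif isinstance(p, STACProvider):
                for r in p.search(text):
                    results.append({"id": r["id"],
                                    "title": r["properties"]["title"]})
        return results                      # harmonization: one record schema

broker = DataBroker([CSWProvider(), STACProvider()])
hits = broker.discover("elevation")
```

The consumer sees a single entry point and one record schema, regardless of how many systems were queried or which protocols they speak.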
Based on the above functionalities, four main operations are provided by the Data and Analytical SW Broker APIs: data discovery, data access, model discovery, and model access. While the first three operations are quite straightforward, the model access operation is worth detailing further: two separate use cases must be considered, depending on the model sharing approach (MaaS or MaaR). Although both approaches enable a machine-to-machine interaction, accessing a model using the MaaS approach requires different functionality than using the MaaR approach. The distinction stems from the different paradigms they apply: SOA (Service Oriented Architecture) versus ROA (Resource Oriented Architecture). In the MaaS approach, the exposed service is the model execution. With ROA, the provider exposes the model as a digital resource, i.e. a logical entity that is exposed for direct interactions (Overdick 2007). Thus, with MaaR, the exposed resource is a given model implementation (e.g. the source code or a binary executable). In both the MaaS and MaaR use cases, the broker component must implement typical mediation and distribution functionalities to interoperate with the different MaaS/MaaR APIs and properly distribute the access requests. However, in the MaaS case the broker access request triggers an actual model execution, while in the MaaR case the access request simply retrieves the model implementation; the model execution must then be handled separately.
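The SOA/ROA distinction can be made concrete with two hypothetical endpoint classes; their method names and return shapes are illustrative assumptions, not actual MaaS/MaaR API definitions:

```python
class MaaSEndpoint:
    """SOA: the exposed service IS the model execution."""
    def execute(self, inputs):
        return {"status": "completed", "output": f"result({inputs})"}

class MaaREndpoint:
    """ROA: the exposed resource is the model implementation itself."""
    def retrieve(self):
        return {"kind": "source-code",
                "uri": "git+https://example.org/model"}

def access_model(endpoint, inputs=None):
    if isinstance(endpoint, MaaSEndpoint):
        # MaaS: the access request triggers an actual execution.
        return endpoint.execute(inputs)
    # MaaR: the access request only retrieves the implementation;
    # execution must be orchestrated separately by the caller.
    return endpoint.retrieve()

maas_result = access_model(MaaSEndpoint(), inputs=["s2-l2a"])
maar_result = access_model(MaaREndpoint())
```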

Infrastructure Orchestrator and Broker
This component is in charge of connecting to the enterprise systems which share computational resources, in order to discover and allocate the computational resources required for the execution of a model.
To this aim, this component acts as a computational resource broker, interacting with the specific and heterogeneous IaaS (Infrastructure as a Service) APIs exposed by the enterprise systems. Therefore, it must implement all the interoperability arrangements required to interoperate with them. In addition, this component implements orchestration functionalities at two different levels: computational resources and container executions (see section Orchestration).
Through the Infrastructure Orchestrator and Broker component, the discovery of computational resources is provided to the other Virtual Earth Cloud components. The discovery functionality must provide two different levels of granularity: enterprise systems and computational resources. The first level (enterprise systems) enables the discovery of the enterprise systems which are currently (i.e. at the moment the request is received) connected to the Virtual Earth Cloud and share computational resources. The second level of granularity (computational resources) provides more details about the computational resources available from each enterprise system. At this level, the component provides information about the current availability of computational resources of an enterprise system. This includes, at least, the following items:
- Total resources: the maximum resources instantiable for the system.
- Instantiable units: although modern cloud solutions enable the instantiation of computational resources according to combinations of CPU and RAM, not all enterprise systems will support all combinations; this information must provide the list (or the range) of possible combinations supported by the enterprise system.
- Available resources: the currently instantiated resources (and their units) which are not in use.
- Resources in use: the currently instantiated resources (and their units) which are in use.
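The availability items above could be captured in a simple report structure; the field names and example figures below are assumptions for illustration, not the actual VECloud schema:

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    """A CPU/RAM combination, the smallest instantiable granule."""
    cpus: int
    ram_gb: int

@dataclass
class AvailabilityReport:
    """Per-enterprise-system computational resource availability."""
    system_id: str
    total: Unit                    # maximum instantiable resources
    instantiable_units: list = field(default_factory=list)  # supported combos
    available: Unit = None         # instantiated but not in use
    in_use: Unit = None            # instantiated and in use

report = AvailabilityReport(
    system_id="eo-cloud-a",
    total=Unit(cpus=64, ram_gb=256),
    instantiable_units=[Unit(2, 8), Unit(4, 16), Unit(8, 32)],
    available=Unit(cpus=12, ram_gb=48),
    in_use=Unit(cpus=20, ram_gb=80),
)
```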
The orchestration functionalities implemented by this component enable the provision (and removal) of the required computational resources and the execution of a containerized model. When new computational resources are requested on a specific enterprise system, different types of fundamental computing resources must be instantiated (e.g. processing, storage, networks, etc.), in the correct order, and associated with the final computational resource, usually a virtual machine ready to be used. Additionally, a newly instantiated computational resource must be properly configured for its use in the Virtual Earth Cloud framework, i.e. it must support containerization and container orchestration. Depending on the utilized technologies, this support might require additional actions to be performed. An example of such additional actions is the registration of the computational resource with the container orchestration framework. The same applies to the removal process, i.e. the fundamental computing resources must be properly eliminated, in the correct order, to become available for other instantiations. Furthermore, this component must allow the definition of a set of (configurable) rules which are used to automatically trigger the removal of unused computational resources from the underlying enterprise systems.
As far as the execution of a containerized model is concerned, this component must expose the APIs which can be used to submit the execution of a container, and implement all the actions necessary to configure and run the containerized model on a computational resource of the selected enterprise system. Three main phases can be identified:
- Scheduling of container deployment to a particular host: based on the resources available on the different hosts (e.g. CPU, memory, etc.) and/or other constraints defined in the execution request.
- Provisioning of required resources: according to the definition of the container, in this phase the component must provision (link) the container environment to the host environment (network connections, persistent storage, etc.).
- Lifecycle management: starting the container, monitoring its state, etc.
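The scheduling phase, in particular, reduces to a capacity-matching decision; the following is a deliberately minimal sketch (real orchestrators such as Kubernetes apply far richer scoring), with host records and the smallest-fit heuristic as illustrative assumptions:

```python
def schedule(container_req, hosts):
    """Pick a host whose free capacity satisfies the container's requests.

    container_req: dict with 'cpus' and 'ram_gb' requests.
    hosts: list of dicts with 'name', 'free_cpus', 'free_ram_gb'.
    Hosts are tried smallest-first so large hosts stay free for large jobs.
    """
    for host in sorted(hosts, key=lambda h: h["free_cpus"]):
        if (host["free_cpus"] >= container_req["cpus"]
                and host["free_ram_gb"] >= container_req["ram_gb"]):
            return host["name"]
    return None  # no host can fit the container

hosts = [{"name": "node-1", "free_cpus": 8, "free_ram_gb": 32},
         {"name": "node-2", "free_cpus": 2, "free_ram_gb": 4}]
picked = schedule({"cpus": 4, "ram_gb": 16}, hosts)
```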

Task Optimizer
The Task Optimizer component is responsible for selecting the optimal enterprise system for the execution of the specified model. In order to perform the selection, the Task Optimizer requires the following information from the requester:
- The computational resources required for the model execution (i.e. CPU and/or RAM).
- The identifiers of the input data.
It is worth noting that the required computational resources are provided by the requester and are not retrieved automatically through the Data and Analytical SW Broker model discovery operation. This is because the same model might require different computational resources depending on the specific execution (e.g. the size of the input data) or the specific context of the execution (e.g. more CPU might be required for high-priority executions). Provided with this information, the Task Optimizer retrieves the availability of data and computational resources from the Data and Analytical SW Broker and the Infrastructure Orchestrator and Broker components, respectively. After obtaining all the necessary information, the Task Optimizer first excludes the infrastructures where it is not possible to execute the task (e.g. because the model requires an amount of RAM which cannot be allocated on a single node in one or more infrastructures). Finally, the remaining execution infrastructures are sorted, by applying a set of configurable rules, and returned to the requester. Figure 4 shows a sequence diagram of the steps which the Task Optimizer performs to carry out its task.
Different optimization strategies can be applied, depending on what must be optimized (e.g. data transfer, computational resource usage, etc.). The same execution request might generate different selections based on different strategies.
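The filter-then-sort logic could be sketched as follows; the system records and the two strategy names ('data_locality', 'resource_usage') are illustrative assumptions, not the PoC's actual rule set:

```python
def select_systems(required, input_ids, systems, strategy="data_locality"):
    """Exclude systems that cannot run the task, then sort the rest."""
    feasible = [s for s in systems
                if s["max_node_ram_gb"] >= required["ram_gb"]
                and s["free_cpus"] >= required["cpus"]]
    if strategy == "data_locality":
        # Prefer systems already hosting more of the input data,
        # minimizing data transfer.
        key = lambda s: -len(set(input_ids) & set(s["hosted_data"]))
    else:  # "resource_usage": prefer the least loaded system
        key = lambda s: -s["free_cpus"]
    return sorted(feasible, key=key)

systems = [
    {"id": "a", "max_node_ram_gb": 16, "free_cpus": 8,
     "hosted_data": ["s2-l2a"]},
    {"id": "b", "max_node_ram_gb": 64, "free_cpus": 4,
     "hosted_data": ["s2-l2a", "dem"]},
    {"id": "c", "max_node_ram_gb": 64, "free_cpus": 16,
     "hosted_data": []},
]
ranked = select_systems({"ram_gb": 32, "cpus": 4}, ["s2-l2a", "dem"], systems)
```

System "a" is excluded because no single node can provide 32 GB of RAM; the remaining systems are then ordered by data locality, so "b" (hosting both inputs) is preferred over "c".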

Model Execution Orchestrator
The Model Execution Orchestrator component is the main entry point of the Virtual Earth Cloud, i.e. it publishes the APIs which can be invoked to request the execution of a model. The request must specify the identifier of the model to be executed and the list of input data identifiers.
Upon a model execution request, this module implements the business logic needed to execute a model; that is, (i) it orchestrates the data access/ingestion, (ii) configures the model execution, (iii) invokes the execution, and (iv) saves the outputs.
While the high-level steps are the same for both MaaS and MaaR approaches, their implementations and the interactions with other Virtual Earth Cloud internal components are different.
In the case of MaaS, the first step executed by the orchestrator (management of input data access/ingestion) is implemented by generating proper data access requests for each of the input data. To do this, the Data and Analytical SW Broker is queried to retrieve the metadata of each input dataset.
The metadata provides the access information which is used to create the actual data access request for each input dataset. The second step (configuration of model execution) consists of generating a request for the model execution, specifying the model to be executed and the data access request for each input dataset. The model execution request is then submitted to the Data and Analytical SW Broker, which in turn creates a proper request (according to the specific MaaS APIs utilized for the model sharing) and distributes the request to the MaaS APIs provider. This implements the third step (triggering of model execution). Finally, the fourth step (storage of the generated output) is implemented by retrieving the execution result (again via the Data and Analytical SW Broker), retrieving the generated output data, and storing it in the Virtual Earth Cloud Web Storage.
In the case of MaaR, the orchestrator retrieves the model description via the Data and Analytical SW Broker. This is used to create a proper request to the Task Optimizer, which returns the necessary information about the enterprise system selected for the execution. Then, the Model Execution Orchestrator executes a data discovery request for all input data. This operation is necessary to discover whether the requested input data is already available (hosted) on the enterprise system which was selected for the execution. This information is necessary for the data ingestion phase. In this phase, the Model Execution Orchestrator submits a series of containerized jobs (one for each input dataset) via the Infrastructure Orchestrator and Broker. These jobs ingest the required data to the node where the model will be executed. This ingestion is achieved either via the Data and Analytical SW Broker (in case the data is not hosted on the execution enterprise system) or directly using the data hosted on the execution enterprise system (if available). After completing the data ingestion, the actual model is retrieved from the MaaR APIs via the Data and Analytical SW Broker and the model container is defined. At this point, it is possible to submit the model container execution via the Infrastructure Orchestrator and Broker. Finally, after waiting for the completion of the execution, the generated output data is stored in the Virtual Earth Cloud Web Storage.
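The MaaR flow above can be condensed into a short sketch; all four collaborating interfaces are hypothetical in-memory stand-ins (built here with `SimpleNamespace`), not the actual broker, optimizer, or infrastructure APIs:

```python
from types import SimpleNamespace as NS

def run_maar(model_id, input_ids, broker, optimizer, infra, storage):
    desc = broker.model_description(model_id)      # model metadata
    system = optimizer.select(desc, input_ids)     # pick enterprise system
    hosted = {d for d in input_ids if broker.is_hosted(d, system)}
    for data_id in input_ids:                      # one ingestion job per input
        infra.submit_ingestion(system, data_id, local=data_id in hosted)
    image = broker.model_access(model_id)          # retrieve implementation
    result = infra.run_container(system, image)    # containerized execution
    return storage.save(result)                    # persist the output

# Minimal mocks: only one input ("s2-l2a") is already hosted locally.
calls = []
broker = NS(model_description=lambda m: {"id": m},
            is_hosted=lambda d, s: d == "s2-l2a",
            model_access=lambda m: f"image:{m}")
optimizer = NS(select=lambda desc, ids: "system-b")
infra = NS(submit_ingestion=lambda s, d, local: calls.append(("ingest", d, local)),
           run_container=lambda s, img: f"output-of-{img}")
storage = NS(save=lambda r: f"stored:{r}")

out = run_maar("eodesm", ["s2-l2a", "dem"], broker, optimizer, infra, storage)
```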

Proof-of-concept implementation
In October 2020, the DG JRC of the European Commission, in collaboration with ESA, ECMWF, and EUMETSAT, and with the support of CNR, implemented a Proof-of-Concept (PoC) of the Virtual Earth Cloud (Nativi and Craglia 2021;Santoro and Rovera 2021).
This implementation utilizes the Docker (Docker Inc. 2021) technology to realize the containerization approach. The Discovery and Access Broker (Nativi et al. 2015) technology developed by CNR provides data brokering functionalities, while the Virtual Earth Laboratory (VLab) framework (Santoro, Mazzetti, and Nativi 2020), also developed by CNR, provides model brokering and orchestration functionalities for MaaR. The container orchestration is provided by the well-known and widely-used Kubernetes technology (Linux Foundation 2021) and, finally, the Cluster API (Kubernetes 2021b) and Cluster Autoscaler (Kubernetes 2021a) technologies are utilized for computational resources orchestration.
Figure 5 shows the software packages which compose the Virtual Earth Cloud PoC technological implementation. The Discovery and Access Broker and the Virtual Earth Laboratory are instances of existing technological frameworks, namely the GEO DAB (Nativi et al. 2015) and the VLab framework (Santoro, Mazzetti, and Nativi 2020). The remaining two packages, Simple Optimizer and VECloud Infrastructure Orchestrator and Broker, implement the Task Optimizer and the Infrastructure Orchestrator and Broker, respectively. The Simple Optimizer applies the following set of rules, in order of preference, to select the enterprise system:
(1) Select enterprise systems where both the required data and computational resources are available;
(2) Select enterprise systems where the required computational resources are available;
(3) Select enterprise systems where additional computational resources can be instantiated.
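The three rules form a tiered preference; a minimal sketch of that cascade follows, with the system record fields ('hosted_data', 'has_free_resources', 'can_scale') as illustrative assumptions about the information the Simple Optimizer receives:

```python
def rank(systems, input_ids):
    """Order candidate systems by the Simple Optimizer's tiered rules."""
    def tier(s):
        has_data = set(input_ids) <= set(s["hosted_data"])
        if has_data and s["has_free_resources"]:
            return 1   # rule 1: data and resources both available
        if s["has_free_resources"]:
            return 2   # rule 2: resources available, data must be moved
        if s["can_scale"]:
            return 3   # rule 3: resources can still be instantiated
        return 4       # not usable at all
    return sorted((s for s in systems if tier(s) < 4), key=tier)

systems = [
    {"id": "x", "hosted_data": [], "has_free_resources": True,
     "can_scale": True},
    {"id": "y", "hosted_data": ["s2"], "has_free_resources": True,
     "can_scale": True},
    {"id": "z", "hosted_data": [], "has_free_resources": False,
     "can_scale": False},
]
ordered = rank(systems, ["s2"])
```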
The VECloud Infrastructure Orchestrator and Broker (Figure 6) is composed of two main software components: the VECloud Core and the Multi-Cloud Agent.
The Multi-Cloud Agent package is composed of a set of technologies which implement the orchestration functionalities at the container and computational resources levels. Specifically, container orchestration is provided by Kubernetes (Linux Foundation 2021). When deployed on a set of 'nodes' (a cluster), Kubernetes provides all the necessary functionalities and APIs to orchestrate the execution of containerized applications (models, in this case) and to discover available computational resources on its cluster (i.e. the nodes it is deployed on). On its own, Kubernetes does not provide any native way to automatically instantiate new nodes and add them to an existing Kubernetes cluster. To this aim, the Multi-Cloud Agent package includes the Cluster API (Kubernetes 2021b) framework. When utilized with Kubernetes, Cluster API is able to leverage a set of IaaS APIs to automate the tasks of instantiating new nodes and adding them to the Kubernetes cluster. Cluster API currently supports different IaaS APIs, including OpenStack, AWS EC2, Google Compute Engine, Azure VM, etc. If an enterprise system exposes IaaS APIs which are not supported by Cluster API, it is still possible to utilize a version of the Multi-Cloud Agent where the Cluster API framework is disabled. This results in a less flexible environment, where the number of computational resources is static and cannot be adjusted according to the needs. For the PoC implementation, the Multi-Cloud Agent was deployed by creating and utilizing a set of Ansible (Red Hat 2021a) scripts which automate the deployment and configuration of the technology stack composing the Multi-Cloud Agent.
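The essence of the node-scaling decision that Cluster API-style tooling automates can be hedged into a few lines; the function, its parameters, and the figures below are illustrative assumptions, not Cluster API or Cluster Autoscaler logic:

```python
def nodes_to_add(pending_cpu, free_cpu_per_node, node_size_cpu, max_new=5):
    """Decide how many new nodes to request from the underlying IaaS.

    pending_cpu: CPUs requested by containers that cannot yet be scheduled.
    free_cpu_per_node: free CPUs currently available on each cluster node.
    node_size_cpu: CPUs provided by one newly instantiated node.
    """
    deficit = pending_cpu - sum(free_cpu_per_node)
    if deficit <= 0:
        return 0                           # current cluster absorbs the load
    needed = -(-deficit // node_size_cpu)  # ceiling division
    return min(needed, max_new)            # respect an upper scaling bound

# 20 CPUs requested by pending containers, 6 free across the cluster,
# new nodes come in 4-CPU units.
extra = nodes_to_add(pending_cpu=20, free_cpu_per_node=[4, 2], node_size_cpu=4)
```

With the Cluster API framework disabled, this decision simply never fires and the cluster size stays static, as noted above.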
The Multi-Cloud Agent package exposes the Kubernetes APIs which are utilized by the VECloud Core component. The VECloud Core, implemented for this PoC, is in charge of collecting information about computational resource availability, current model executions, and the Kubernetes cluster configuration. In addition, the VECloud Core package exposes the APIs which are utilized by the other modules of the Virtual Earth Cloud component (see section Infrastructure Orchestrator and Broker).
The deployment of the VECloud Infrastructure Orchestrator and Broker is depicted in Figure 7. The Multi-Cloud Agent package is deployed on each enterprise system. The VECloud Core package, instead, is deployed on a Central Cloud Infrastructure, collecting all required information from the distributed instances of the Multi-Cloud Agent.

Use-cases
This section describes two significant use cases to exemplify how the described PoC works, i.e. to demonstrate that the presented architecture is able to satisfy the identified requirements. From a user perspective, the following steps are executed in both use cases:
(1) The user discovers an analytical software (model).
(2) The user selects input data for the model execution.
(3) The user launches the model execution.
(4) The user retrieves/visualizes the output of the execution.
The first use case utilizes the Earth Observation Data for Ecosystem Monitoring (EODESM) model (Lucas and Mitchell 2017). It classifies land covers and changes according to the Land Cover Classification System (LCCS) of the Food and Agriculture Organization (FAO). As input data, EODESM requires two Copernicus Sentinel-2 Level 2A products covering the same area of interest at two different points in time. First, the model processes the two products to generate land cover maps. Then, it calculates the difference between the two land cover maps, generating a third output which visually captures the identified changes.
The user interacts with the Virtual Earth Cloud through a dedicated Web GUI (Graphical User Interface) which utilizes the APIs exposed by the Virtual Earth Cloud. Through the Web GUI, the user discovers the model (through VLab, which in the PoC provides model discovery and access functionalities, in addition to the model orchestration ones) and the input data (through the DAB).
Once the input data is defined, the user requests to launch the model. In turn, the GUI sends the request to VLab, which essentially implements the orchestration steps for MaaR (see section Model Execution Orchestrator), saving the output to a Web Storage. Finally, the user retrieves the output data associated with the execution via VLab (where execution information is stored, including the location of the generated outputs). Figure 8 shows a screenshot of the GUI, displaying the output of the computation over the Gran Paradiso protected area in Italy.
The second use case calculates the UN SDG 15.3.1 indicator on Land Degradation (Giuliani et al. 2020). For this use case, a different GUI is utilized to interact with the Virtual Earth Cloud APIs: the experimental version of the GEOSS Portal developed by the European Space Agency (ESA). Figure 9 displays the calculated Land Degradation map over Europe.

Conclusions
This manuscript presented the Virtual Earth Cloud concept, a multi-cloud framework for the generation of knowledge from Big Earth Data analytics.The Virtual Earth Cloud allows the execution of analytical software to process and extract knowledge from Big Earth Data, in a multi-cloud environment, to enable a Geosciences Digital Ecosystem (GDE): a system of systems that applies the digital ecosystem paradigm.
Scientific models are a key tool for transforming the huge amount of data currently available into knowledge. To execute a scientific model (implemented as an analytical software), the GDE needs to discover and utilize different types of resources, which can be classified into three broad categories: data resources, infrastructural resources, and analytical resources. Such resources are provided by the different enterprise systems belonging and contributing to a digital ecosystem and are shared utilizing Web technologies (e.g. Web APIs, Web Services, etc.).
To enable the GDE, the Virtual Earth Cloud provides the following high-level functionalities: (i) implementation of the workflow required for model execution, (ii) provisioning of computational resources from contributing enterprise systems, (iii) discovery of and access to the necessary data and analytical software resources, and (iv) optimization of the model execution.
The design of the Virtual Earth Cloud is based on three conceptual approaches: containerization, orchestration, and brokering. Containerization enables the execution of analytical software developed in different programming languages and environments (e.g. libraries, frameworks, and other dependencies). Orchestration (i.e. the automated configuration, management, and coordination of computer systems, applications, and services) is necessary in the Virtual Earth Cloud at three different levels: scientific model (input ingestion, model launch, etc.), container (computing node selection, container-level configuration, etc.), and computational resources (coordinating the instantiation of the processing, storage, network, and other fundamental computing resources). Brokering allows the realization of a System of Systems by minimizing interoperability requirements for participating systems, implementing interoperability through dedicated components (the brokers) that are in charge of connecting to the participant systems. Considering the multi-disciplinary nature of the geosciences domain, this approach is key to addressing the high heterogeneity which characterizes each of the resource types to be utilized (e.g. dataset schemas and metadata, scientific descriptions and code, and computing and storage infrastructures), by supporting the diverse resource and service protocols and APIs exposed by the different participating systems which constitute the digital ecosystem.
The main Virtual Earth Cloud architectural components are defined in terms of their specific requirements and functionalities. The Data and Analytical SW Broker enables the discoverability and access of data and analytical resources. The Infrastructure Orchestrator and Broker connects to the enterprise systems which share computational resources, in order to discover and allocate the computational resources required for the execution of a model. The Task Optimizer is responsible for selecting the optimal enterprise system for the execution of the specified model. Finally, the Model Execution Orchestrator is the main entry point of the Virtual Earth Cloud component (i.e. it publishes the Web APIs which can be invoked to request the execution of a model) and implements the business logic needed to execute a model; that is, (i) it orchestrates the data access/ingestion, (ii) configures the model execution, (iii) invokes the execution, and (iv) saves the outputs.
The described architecture is demonstrated with a Proof-of-Concept (PoC) implementation of the Virtual Earth Cloud, based on the prototype demonstrated in October 2020 by JRC (Nativi and Craglia 2021), in collaboration with ESA, ECMWF, and EUMETSAT, and with the support of CNR. Finally, two use cases are described to show how the presented architecture and PoC work and enable a GDE.
In the present era, characterized by the digital transformation of society and the affirmation of the cyber-physical domain, there is a clear need to abandon the traditional data exchange paradigm and embrace a more effective and sustainable information- and knowledge-centric approach. This key innovation is enabled by the recent technological leap linked to the ubiquitous connectivity process, the new AI spring, and the availability of public and highly scalable online computing services. On the other hand, this same innovation has brought with it a new set of issues, not only technological, but also political and social. The most relevant challenges deal with the requirement of extending interoperability from the mere domain of data (e.g. data encoding, schema, and semantic issues) to the more demanding areas of scientific modeling and cloud computing infrastructures (e.g. multi-cloud and virtual cloud services). The already complex subject of governing data system-of-systems has become much more complicated once new stakeholders must be added to provide online data analytics and computing capacities. We demonstrated that technological interoperability is possible by using open solutions. However, according to our experience, policy and governance interoperability still requires innovative styles and practices.
Future work will mainly deal with: (i) consolidation/enhancement of the presented PoC (e.g. extending the number and typology of participating enterprise systems), and (ii) extension of the presented architecture to support more advanced use cases, including:
- Support of real-time capabilities: this is part of planned future developments, starting with example implementations in the hydrology modeling domain.
- Knowledge Base integration: capturing scientific experts' knowledge about a sound process for knowledge generation (e.g. the choice of appropriate datasets to be used as inputs for existing models, which model to use for a specific use case, etc.) is a key element to build a GDE in line with the Open Science principles of reproducibility, replicability, and re-usability; this knowledge should be formalized and consolidated in a Knowledge Base in order to be shared and utilized in different contexts (Nativi et al. 2020). The presented Virtual Earth Cloud does not make use of any Knowledge Base; future work will investigate how to integrate such a component, e.g. building on existing Knowledge Base experimentations (Mazzetti et al. 2022).
- Data quality/uncertainty: while not in the scope of this manuscript, data quality/uncertainty is of paramount importance; future enhancements will investigate how to include this aspect, also considering its implications in possible chaining of models.
- Digital Twins of the Earth (Nativi, Mazzetti, and Craglia 2021): extension of the presented architecture focusing on specific support for the implementation of Digital Twins of natural environments.

Figure 2. Virtual Earth Cloud High-level System Architecture.
Figure 8. Output of an EODESM computation over the Gran Paradiso National Park, visualized on a dedicated GUI developed using the Virtual Earth Cloud APIs.

Figure 9. Output of UN SDG 15.3.1 (Land Degradation) calculation over Europe, visualized on the GEOSS Test Portal using the Virtual Earth Cloud APIs.