Replicating File Segments between Multi-Cloud Nodes in a Smart City: A Machine Learning Approach

The design and management of smart cities and the IoT is a multidimensional problem. One of those dimensions is cloud and edge computing management. Due to the complexity of the problem, resource sharing is one of its vital components: when resource sharing is enhanced, the performance of the whole system is enhanced. Research in data access and storage in multi-clouds and edge servers can broadly be classified into data centers and computational centers. The main aim of data centers is to provide services for accessing, sharing and modifying large databases, whereas the aim of computational centers is to provide services for sharing resources. Present and future distributed applications need to deal with very large, multi-petabyte datasets and increasing numbers of associated users and resources. The emergence of IoT-based multi-cloud systems as a potential solution for large computational and data management problems has initiated significant research activity in the area. Due to the considerable increase in data production and data sharing within scientific communities, the need for improvements in data access and data availability cannot be overlooked. It can be argued that the current approaches to large dataset management do not solve all problems associated with big data and large datasets. The heterogeneity and veracity of big data require careful management, and one of the issues in managing big data in a multi-cloud system is the scalability and extensibility of the system under consideration. Data replication ensures server load balancing, data availability and improved data access time. The proposed model minimises the cost of data services by minimising a cost function that takes storage cost, host access cost and communication cost into consideration. The relative weights between the different components are learned from history and differ from one cloud to another.
The model ensures that data are replicated in a way that increases availability while at the same time decreasing the overall cost of data storage and access time. Using the proposed model avoids the overheads of traditional full replication techniques. The proposed model is mathematically proven to be sound and valid.


Introduction
There is no unique, agreed-upon definition of a smart city [1]. As for any big concept, definitions are set based on the objective of using the concept. Due to the complexity of smart city concepts, many definitions are in use among the research community. While researchers adopt different definitions of smart city for different objectives, they all agree on the components that form a smart city [2]. Some of these components are cloud computing, edge computing, communication infrastructure, and data access [3]. Cloud computing is an approach that ensures resource sharing among users with minimal management. Resources could be networks, servers, storage, applications, services or data [4]. Cloud resource management is a research field that has been studied for the last two decades, and most of its problems have been solved. Two problems are still hot topics: edge computing [5] and multi-cloud computing [6]. Edge computing is concerned with policies and techniques to migrate some computation and decision making from the cloud to edge servers, in order to minimise communication cost and to offload some of the work handled by cloud servers, so that cloud servers are left with the tasks that cannot be performed on edge servers. Since the majority of edge servers are limited in resources, edge computing is still striving for contributions to solve many of its problems [5]. Multi-cloud computing is the approach of having multiple clouds cooperate as a distributed system to solve problems that cannot be solved in a single cloud, as defined in [6,7].
Scientific applications typically involve high-throughput experiments, such as satellite surveys [8], supercomputer simulations [9,10] and sensor networks [11], which generate petabytes of scientific data, in addition to the massive data generated by the internet every second. For example, the data production of a radiology department in a hospital in an industrialised country, such as the United States or Western Europe, is on the order of 10 terabytes a year [12]. In European countries, the total data produced are on the order of petabytes per year, and the total medical data of Europe or the United States can be estimated at thousands of petabytes. Furthermore, the largest astronomy surveys at present produce around 20 terabytes per night. Data generation nowadays is estimated to be 44 times greater than in 2009 [13]. At present, healthcare data worldwide are measured in terabytes (10^12 bytes), and they are expected to grow to zettabytes (10^21 bytes) or yottabytes (10^24 bytes) [14]. Storing, accessing and analysing such huge data sets requires a means of efficiently organising, handling and manipulating high-volume data, and systems that address these fundamental issues are a focus of current research. As mentioned earlier, cloud, edge server and IoT systems must have the capability to deal with a huge number of resources and users at the same time. Increased size sometimes causes performance degradation [15]; therefore, such systems should support adaptability, scalability and extensibility to avoid such degradation. The proposed partial replication algorithm allows users to replicate parts of files instead of replicating the full file. Hence, when a user submits a task that requires one or more files, or a segment of a file, to be executed, the replica service uses the new partitioning algorithm to divide a file into segments and transfers and saves only the segment(s) required by the task to the user's resources.
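The segment-level transfer described above can be sketched as follows. This is a minimal illustration that assumes fixed-size byte-range segments; the function names are not from the paper:

```python
def split_into_segments(file_size, segment_size):
    """Divide a file of file_size bytes into fixed-size byte-range segments."""
    segments = []
    start = 0
    while start < file_size:
        end = min(start + segment_size, file_size)
        segments.append((start, end))  # half-open byte range [start, end)
        start = end
    return segments

def segments_for_task(file_size, segment_size, needed_ranges):
    """Return only the segments that overlap the byte ranges a task needs."""
    required = []
    for seg_start, seg_end in split_into_segments(file_size, segment_size):
        if any(seg_start < r_end and r_start < seg_end
               for r_start, r_end in needed_ranges):
            required.append((seg_start, seg_end))
    return required
```

For example, a 10 MB file split into 1 MB segments, of which a task reads only the first 1.5 MB, requires transferring just the first two segments rather than the full file.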
The new replication system is treated as an optimisation solution that minimises the sum of the data access costs and achieves good system utilisation. This paper defines a multi-cloud edge hybrid system where resources are being shared and nodes cooperate together to ensure the availability of data with a minimum overall cost as will be seen later in this paper. The goal of a data cloud is to provide services for accessing, sharing and modifying large databases [16]. However, as the number of resources contributing to the cloud, edge servers and IoT grows, the complexity of managing these resources increases [17]. This complexity leads to larger databases and longer delays in task execution due to the need to locate multiple files stored in different sites [18]. To address this issue, intelligent management of terabyte data transfer over wide area networks is necessary to cope with current and future data [19,20].
There are a number of aims of the work presented here. The first is to build the base model that supports task submission (i.e., requests of data files from single and multiple users). The second is to consider existing techniques and retain the "best" of these approaches. The third is to develop a partial replication technique and investigate its effect on the system. The proposed system can access the relevant segment of a replica in minimum response time for a given task under execution. The performance of the new system is better than that of a similar system using full dynamic replication. The partial replication algorithm has a significant impact on system performance, in particular on the operation of accessing distributed files, enabling overall task turnaround times and resource consumption to be decreased. In this paper, an algorithm is proposed to find the best replication candidate with the minimum routing cost; the best candidate is chosen so that the cost of the whole system is minimised. A closed-form theorem that sets the constraints of system soundness is proposed and proved.
This paper is organised as follows: Section 1 introduces the paper and clarifies the different topics that are discussed throughout it. Section 2 defines the problem that is solved in this paper. Section 3 presents the literature and previous work that contributed to solving similar problems. Section 4 presents the reference model of the full replication scheme. Section 5 demonstrates the proposed model and its components. Section 6 demonstrates the file replication cost and its components. Section 7 explains the simulation and shows the results of different experiments. Finally, Section 8 concludes the work performed and summarises the results obtained from the simulation.

Problem Statement
Increasing the performance of cloud responses to user requests is a major requirement for a better and smarter city [21]. The problem that this paper solves can be summarised as follows: given a multi-cloud system whose clouds are interconnected, with multiple edge servers connected to clouds and/or other edge servers directly or indirectly, and with file resources held by some clouds or edge servers, it is required to replicate parts of files as needed so that they are accessible and reachable by edge servers with minimum cost. As will be seen in Section 5, the cost has different components with different weights depending on whether a node is a cloud or an edge server. Mathematically, the problem is defined as follows. The system S is defined as:

S = (C, E, F, Ξ)    (1)

where:

C = {c_1, c_2, …, c_n}

is the set of clouds in the multi-cloud system, n = |C|, and c_i is cloud i where 1 ≤ i ≤ n;

E = {e_1, e_2, …, e_m}

is the set of edge servers connected to this multi-cloud system, m = |E|, and e_j is edge server j where 1 ≤ j ≤ m;

F = {f_1, f_2, …, f_l}

is the set of files shared in the multi-cloud system, l = |F|, and f_k is distinct file k where 1 ≤ k ≤ l; and Ξ is the topology that describes the inter-connectivity of the system. It is required to replicate segments of the shared files so that the overall cost of file sharing is minimal. The cost of replicating a file segment s so that a certain node n_d can access it is composed of:
1. The transmission cost from the source node n_i to the replica node n_j;
2. The hosting cost at the replica node n_j;
3. The transmission cost from the replica node n_j to n_d.

The total cost η is defined to be:

η = β_1 π + β_2 ρ + β_3 τ

where π is the data hosting cost, ρ is the load on the hosting node, and τ is the cost of data transfer delay. The difference between π and ρ is that π represents the cost of replicating a certain file or file segment on a certain node, whereas ρ is the cost of the added load on that node. Every node has its own weighting for π, ρ and τ, and therefore every node learns β_1, β_2 and β_3 from historic data. The final objective is to minimise the overall cost, defined as:

cost = min_{n_j} η(n_j)    (5)

Note that the unit costs for every independent variable in Equation (5) are determined by the host and the weights are learned from historical data. When replicating a file f ∈ F, the cost of data transfer and hosting depends on how big the file is. This means that replicating parts of the file can reduce both hosting and data transfer costs.
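A minimal sketch of this selection rule follows; the numeric inputs, weight values and function names are illustrative assumptions, not the paper's implementation:

```python
def total_cost(pi, rho, tau, betas):
    """eta = beta1*pi + beta2*rho + beta3*tau for one candidate replica node.

    pi:    hosting cost of the segment on this node
    rho:   load cost this replica adds to the node
    tau:   data-transfer delay cost (source -> replica -> destination)
    betas: (beta1, beta2, beta3), weights each node learns from its history
    """
    b1, b2, b3 = betas
    return b1 * pi + b2 * rho + b3 * tau

def best_replica_node(candidates):
    """Pick the candidate node that minimises eta.

    candidates: list of (node_id, pi, rho, tau, betas) tuples.
    """
    return min(candidates,
               key=lambda c: total_cost(c[1], c[2], c[3], c[4]))[0]
```

Because each node carries its own learned weights, the same raw costs can rank candidates differently depending on whether the host is a cloud or an edge server.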

Related Work
The authors in [22] proposed a replica creation and selection model using dynamic replica creation based on access tendency (DRC-AT) and dynamic replica selection based on response time (DRS-RT). The model calculates the file access tendency, which drives the replica creation decision, and analyses the user's request time to select the best node holding the data requested by the user's submitted tasks. The authors in [23] proposed a replica replacement algorithm that predicts the future usage of a replica using weight values and normalisation calculated from three factors: replica cost, frequency and number of requests. The model uses a prediction function that computes the relative worth of a replica from its file access history and then determines the priority of replica placement. The authors in [24] proposed an algorithm for accessing a subset of a spatial replica using a greedy approach that chooses replica subsets allowing fast data access to maximise performance; replica subsets are stored in descending order of a goodness value calculated for each subset. The model aims to provide load balancing by taking hardware performance, the filesystem and file storage prefetching into consideration. The authors in [25] investigate the role of replication in Distributed Transactional Memory (DTM), comparing a full replication scheme with partial replication approaches. The study shows that partial replication approaches deliver an obvious improvement in scalability by reducing the amount of data transferred to and stored at each node. To ensure data consistency, the study mentions several existing techniques such as the single copy model [26,27], the Distributed Multiversioning (DMV) system model [28], history-based multiversioning [29] and clock validation [30,31]. The study also investigates latency issues of partial replication in the context of DTM.
The authors in [32] proposed a nonlinear integer programming model for data replication in cloud storage, aiming to achieve low cost and high availability. The proposed low-cost, failure-resilient replication scheme handles both non-correlated and correlated machine failures. The model shows an improvement in data availability and replica consistency cost compared to Random Replication (RR) [33], Copyset Replication [33] and Replication Degree Customisation (RDC) [34]. The scheme assigns a portion of the replicas to each data object and considers replica popularity when handling the aforementioned failures; the popularity of any two data objects is analysed and compared to reduce the replication cost. The authors in [35] investigate the use of an object storage system to store and retrieve unstructured data in a cloud computing environment. An object storage system processes data as objects instead of blocks and files, in contrast to conventional storage systems. All data types can be stored in an object, such as records, files, databases, medical records, video, audio and images, or an object store can be dedicated to a single type of data. The paper analyses the different types of storage systems used in cloud computing environments, and the authors claim that object storage is the most suitable storage system for unstructured static data in cloud storage environments due to its scalability, flexibility and security features.
The authors in [36] proposed a replica placement model based on evaluating the comprehensive performance value of each node. The model builds on the Hadoop Distributed File System (HDFS), which divides data into small blocks as the basic storage units to be stored on different distributed nodes. The model calculates the weighted evaluation value of a set of indicators, such as memory size, disk space, CPU and the read-write speed of the disk, and then selects the best set of replica nodes using the comprehensive evaluation value.
The authors in [37] proposed a hybrid data replication system for edge servers and cloud infrastructure that reduces the latency perceived by read and update operations by locating replicas near the end user. The paper proposes a replica convergence algorithm that keeps replicas in both the cloud and the edge server, combining Conflict-free Replicated Data Types (CRDTs) and Operational Transformation (OT) to achieve a consistency model. The model is suitable for microservice-based applications as it follows a hierarchical architecture with a master replica that broadcasts updates received from a particular replica. The authors in [38] proposed a partial storage strategy for cloud data centers based on partitioned, dual-direction downloads of files from different cloud storage servers. The model improves on the dual-direction file transfer protocol (DDFTP), which is used for file retrieval from cloud servers. The algorithm divides the data into blocks; each block is then assigned to two cloud servers, selected using the download history of these blocks, which download the data from opposite directions. The download proceeds in parallel by handling the forward and backward block assignments, and the proposed partitioning technique removes some blocks from each replica based on past download experience. The model does not take into consideration factors such as memory speed and size, server failure and fault tolerance, which affect the effectiveness and efficiency of the system.
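The dual-direction idea cited above can be illustrated with a small sketch. This is a hypothetical block-assignment plan, not DDFTP's actual implementation: one server serves blocks from the front of the file while the other serves blocks from the back, until every block is claimed.

```python
def dual_direction_plan(num_blocks):
    """Assign the blocks of one file to two servers downloading from
    opposite ends.

    Server A takes blocks 0, 1, 2, ... (forward); server B takes blocks
    n-1, n-2, ... (backward); assignment stops when the two meet.
    """
    forward, backward = [], []
    lo, hi = 0, num_blocks - 1
    while lo <= hi:
        forward.append(lo)
        lo += 1
        if lo <= hi:
            backward.append(hi)
            hi -= 1
    return forward, backward
```

Because both servers work in parallel on disjoint block sets, the whole file is covered with no duplicated transfer, and a faster server naturally ends up claiming more blocks in an adaptive variant.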
In [39] the authors proposed an improved cache utilisation model that identifies user hot-spots, where fit clients are selected for partial caching. The model uses a location tracking and prediction method to identify hot-spot locations to be used later by the service's subscribers. The client nodes in this model work as service providers and are coupled with subscribers' latency. The authors in [40] proposed a collaborative fog-to-fog communication algorithm that allows fogs to communicate with each other to process incoming tasks. In this model, a threshold is set for the maximum allowed delay. If the delay of the assigned fog reaches the threshold, the fog checks the list of available candidate fogs that can service the task and delegates it to such a candidate. In [41] the authors proposed a fog resource selection algorithm (FResS) that enables automated fog selection and allocation; the performance data of each fog are stored in a standard format as execution logs. For an incoming job, these logs are used to predict its run-time and obtain a real-time estimate for the best fog selection. In [42] the authors proposed a neural network prediction model that predicts replica locations using files' access profiles. The run-time prediction model generates file location predictions for incoming tasks using historical executions; it utilises data clustering techniques to separate related tasks from the history and generates a prediction of a file's location using a mean predictor. In contrast, the model proposed in this paper minimises a cost function combining storage cost, host access cost and communication cost to achieve the minimum cost of the data service. In addition, the proposed model calculates the cost difference between replicating a whole file and a segment of a file on a certain node.

Reference Models: Full Replication
The replica management service is responsible for initiating data replication when needed, in addition to creating or deleting copies of files, or replicas, stored in a specific storage system. The design of the replica management service is modular, with several independent services (e.g., task scheduling, resource discovery) interacting via the Replica Manager (RM). The RM coordinates the interaction between all components of the replica management system; for example, if a new storage location offers better performance and availability for access to or from a particular location, the replica manager will create a replica at the new location [43]. In addition, cloud, edge server and IoT environments are highly dynamic, whereby resource availability and performance change constantly. Therefore, the replica management service is responsible for discovering new replicas, which may be added to or deleted from different locations.
Replica creation and/or selection is the second representative high-level service provided by cloud storage. Cloud storage technology was developed to share data between different organisations across distributed geographical locations in an efficient way. Cloud storage uses data replication to move data closer to users, thereby improving data access performance [44]. The replica selection service is responsible for finding the best replica that will minimise the transfer time, i.e., finding the nearest copy to the user, so the selection process uses an absolute performance metric (e.g., speed, cost or security). Cloud Information Services (CIS) is a cloud service that provides information about network performance. The replica selection service uses this information together with the information provided by the metadata repository (i.e., file size, location, etc.) to determine which storage will yield the fastest data access [45].
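The selection step can be sketched as follows, estimating access time from the metadata repository's file size and CIS-style network figures; all names and numbers here are illustrative assumptions:

```python
def estimated_transfer_time(file_size_mb, replica):
    """Estimate access time for one replica location from the file size
    (metadata repository) and latency/bandwidth (CIS-style network info)."""
    return replica["latency_s"] + file_size_mb / replica["bandwidth_mb_s"]

def select_replica(file_size_mb, replicas):
    """Return the replica location with the smallest estimated transfer time."""
    return min(replicas, key=lambda r: estimated_transfer_time(file_size_mb, r))
```

Note that the "nearest" copy in network terms is not necessarily the geographically closest one: a farther site with higher bandwidth can win for large files, which is exactly why the estimate combines latency and bandwidth.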
As previously mentioned, the size of the data that need to be accessed daily on the cloud, edge servers and IoT is at present in the order of thousands of petabytes, and by 2025 the amount of data generated globally will reach 175 ZB (zettabytes), with IoT applications and devices as the main source [46,47]. Ensuring efficient access to such vast and widely distributed data is a serious challenge for network and cloud designers. Replication is a widely accepted technique in distributed environments that stores data at more than one site. If a data site fails, the system can continue to operate using the replicated data, increasing availability and fault tolerance. At the same time, as the data are stored at multiple sites, a request can find the data close to the site where it originated, increasing the efficiency of the system, lowering bandwidth consumption and improving the scalability of the overall system [48,49].
Figure 1 presents a replication environment in a simple manner. Storage resources 1, 2, 3, …, n are distributed resource locations connected through a middleware infrastructure. A file holding the data, File X, is stored in storage resource 2, and all other resources replicate File X. In this example the benefits of replication are clear: as storage resources 1 and 3 are closer to the user than storage resource 2, where the file was originally stored, the access cost of files can be decreased, improving performance, and availability is preserved even if three out of four storage resources are down.

Using Machine Learning for Partial Replication and Selecting the Candidate Node
Multivariable regression is a statistical method used in machine learning to model the relationships between multiple independent variables and a dependent variable. It is a powerful tool for analysing complex datasets and is used in a wide range of applications, including finance, healthcare, marketing and social sciences. In simple linear regression, a single independent variable is used to predict the value of a dependent variable. However, in many real-world scenarios, a single independent variable may not be enough to fully explain the variability in the dependent variable. For example, in predicting the price of a house, factors such as location, size, number of rooms and age of the house may all be important, and a multivariable regression model can capture the influence of each of these factors on the house price.
Multivariable regression models can take different forms, including linear regression, logistic regression and polynomial regression. Linear regression is the most commonly used form and involves finding a line that best fits the data points by minimising the sum of the squared errors between the predicted values and the actual values. Logistic regression is used when the dependent variable is binary, such as predicting whether a customer will buy a product or not. Polynomial regression is used when the relationship between the independent and dependent variables is nonlinear.
One of the advantages of multivariable regression is its ability to control confounding variables. Confounding variables are variables that are correlated with both the independent and dependent variables and can lead to spurious associations. Multivariable regression allows the inclusion of these variables in the model, thereby providing more accurate estimates for the effects of the independent variables on the dependent variable.
Multivariable regression is a valuable tool for data analysis and can provide insights into the relationships between variables and help in making predictions about future outcomes. However, it is important to ensure that the assumptions of the model are met and the model is not overfitting the data, as this can lead to inaccurate predictions. Careful data cleaning and feature selection are essential to ensure that the model is robust and generalisable to new data.
A multivariable regression can be represented by a general equation:

Y = β_0 + β_1 X_1 + β_2 X_2 + … + β_k X_k + ε

Here, Y is the dependent variable and X_i represents the i-th independent variable. The coefficients β_i are the weights assigned to each independent variable, indicating the strength of its relationship with the dependent variable.
The coefficient β_0 is the intercept, also known as the constant term, which represents the expected value of the dependent variable when all independent variables are zero. The error term ε captures the overall effect of any unmeasured factors or errors in the model.
In this paper, the aim is to select the most efficient server for replication based on several parameters. Since the task's execution time depends on multiple variables, such as the server's processing power, memory size and network bandwidth, we employ multivariable linear regression. This technique allows us to create a model that can predict the duration required to execute a task on each server, given the various parameters.
To train our model, we use historical data that include information about previous tasks and the servers used to complete them, as well as the corresponding execution times. Once the model is trained, we can use it to predict the duration required to execute a new task on each available server.
The model we use for prediction is represented by Equation (54). This equation takes into account the various parameters that influence the task's execution time, such as the file size, disk size and the server's processing power, memory size and network bandwidth. By substituting in the relevant values for each server, we can determine which server offers the shortest duration to execute the task that accesses the replicated file segment. Utilising multivariable linear regression and Equation (54) enables us to select the most efficient server for a given task based on multiple parameters, leading to a significant reduction in task execution time and increased system performance.
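A minimal sketch of this approach using ordinary least squares follows. It is not the paper's Equation (54); the feature set, the historical records and all names here are illustrative assumptions:

```python
import numpy as np

# Historical records, one row per executed task:
# [file_size, disk_size, cpu_power, memory, bandwidth]
X_hist = np.array([
    [2.0, 10.0, 3.2, 16.0, 100.0],
    [8.0, 50.0, 2.4,  8.0,  40.0],
    [1.0,  5.0, 3.6, 32.0, 200.0],
    [5.0, 20.0, 2.8, 16.0,  80.0],
])
y_hist = np.array([4.1, 21.5, 1.8, 10.2])  # observed execution times (s)

# Fit Y = b0 + b1*X1 + ... + b5*X5 by least squares (prepend intercept column).
A = np.hstack([np.ones((X_hist.shape[0], 1)), X_hist])
beta, *_ = np.linalg.lstsq(A, y_hist, rcond=None)

def predict_time(features):
    """Predicted execution time for one (task, server) feature vector."""
    return float(beta[0] + np.dot(beta[1:], features))

def pick_server(candidate_features):
    """Return the index of the server with the smallest predicted time."""
    return min(range(len(candidate_features)),
               key=lambda i: predict_time(candidate_features[i]))
```

Substituting each candidate server's parameters into the fitted model and taking the minimum mirrors the selection rule described above.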
The topology Ξ is presented in Equation (1), where it represents the interconnections between nodes in a system. These nodes are either clouds or edge servers, and the possible alternatives for node communication are as follows:
1. Communication between two clouds,
2. Communication between two edge servers,
3. Communication between a cloud and an edge server.
Note that in this paper it is assumed that graphs are bi-directional; in other words, the communication cost from node n_1 to node n_2 is the same as the cost from node n_2 to node n_1. As mentioned above, topology Ξ represents the connections among all nodes, regardless of whether they are clouds or edge servers, and it is composed of three sub-topologies. Sub-Topology Φ represents the interconnections among the clouds:

Φ = C × C    (9)

which is the cross product of the set of clouds with itself; this gives an n × n square matrix where n = |C|. Sub-Topology matrix Φ is filled as follows:

Φ_{i,j} = 1 if i = j; 1 if c_i and c_j are directly connected; 0 otherwise.    (10)

Equation (10) describes how matrix Φ is filled. The first case is a diagonal element of the matrix; such an element is always 1, since every cloud is reachable from itself. The second case is when two clouds c_i and c_j are directly connected, in which case the element is set to 1. The third and final case is when the element is off the diagonal and there is no direct connection linking the two clouds intersecting at this element; the element is then set to 0. The second sub-topology is Ω, which represents the connections between edge servers. This is an m × m square matrix where m = |E|. Sub-Topology Ω can be formulated as follows:

Ω = E × E    (11)

Sub-Topology matrix Ω is filled as follows:

Ω_{i,j} = 1 if i = j; 1 if e_i and e_j are directly connected; 0 otherwise.    (12)

Equation (12) describes how matrix Ω is filled. The first case is a diagonal element of the matrix; such an element is always 1, since every edge server is reachable from itself. The second case is when two edge servers e_i and e_j are directly connected, in which case the element is set to 1. The third and final case is when the element is off the diagonal and there is no direct connection linking the two edge servers intersecting at this element; the element is then set to 0. The third sub-topology is Ψ, which represents the connections between clouds and edge servers; this is an n × m matrix where n = |C| and m = |E|.
Sub-Topology Ψ is formulated as follows:

Ψ = C × E    (13)

Note here that, since this part of the topology is represented as a bipartite graph, the order matters as it defines the flow direction. Sub-Topology matrix Ψ is filled as follows:

Ψ_{i,j} = 1 if cloud c_i and edge server e_j are directly connected; 0 otherwise.    (14)

Equation (14) describes how matrix Ψ is filled. The first case is when a cloud c_i and an edge server e_j are directly connected; the element is then set to 1. The second and final case is when c_i and e_j are not connected; the element is then set to 0. The overall topology Ξ is formulated as follows:

Ξ = | Φ    Ψ |
    | Ψ^T  Ω |    (15)

where Ψ^T is the transpose of matrix Ψ, and Ξ is an (n + m) × (n + m) square matrix with n = |C| and m = |E|.
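The assembly of Ξ from Φ, Ω and Ψ^T can be sketched with numpy. This is a small illustration; the placeholder matrices in the usage below are assumptions, not a system from the paper:

```python
import numpy as np

def build_topology(phi, omega, psi):
    """Assemble Xi = [[Phi, Psi], [Psi^T, Omega]] as an (n+m)x(n+m) matrix.

    phi:   n x n cloud-to-cloud adjacency (diagonal all ones)
    omega: m x m edge-to-edge adjacency (diagonal all ones)
    psi:   n x m cloud-to-edge adjacency
    """
    top = np.hstack([phi, psi])
    bottom = np.hstack([psi.T, omega])
    return np.vstack([top, bottom])
```

Because Φ and Ω are symmetric and the bottom-left block is Ψ^T, the assembled Ξ is symmetric, matching the bi-directional graph assumption.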

Example of Topology Formulation
As shown in Figure 2, edge servers are used to provide content and services closer to the user, while cloud servers are used to store and manage data and applications in a remote location. Edge servers are typically located at the edge of the network, while cloud servers are located in a centralised data center. Edge servers are used to reduce latency and improve performance, while cloud servers provide scalability and cost savings. Edge nodes can serve as intermediate nodes between two or more clouds in communication. In a distributed cloud architecture, there may be multiple cloud nodes that are geographically distributed across different regions or even countries. In such a scenario, it may be inefficient to transmit data directly between the clouds, especially if the data need to travel long distances or cross international borders. In this case, edge nodes can be used as intermediate nodes between the clouds: an edge node located near the source cloud can receive the data and process them locally before transmitting them to another edge node located near the destination cloud. This approach can help to reduce the overall latency and bandwidth requirements of the communication. In addition, edge nodes can also perform other functions such as caching frequently accessed data, filtering or preprocessing data before they are transmitted to the cloud, and providing additional security measures such as encryption and access control [50,51]. To demonstrate the idea, consider the topology shown in Figure 2, which shows a system S with three clouds and seven edge servers. Figure 2 demonstrates the problem to be solved, where users are trying to access some file segments on the cloud and, because of the constraints discussed and defined in Section 2, the cost of accessing those files is high.
The system shown in Figure 2 is described as follows. From Equations (9) and (10), Φ for the system shown in Figure 2 is:

Φ =
| 1 1 0 |
| 1 1 1 |
| 0 1 1 |    (16)

Equation (16) shows that c_1 and c_2 are directly connected, and c_2 and c_3 are also directly connected. Topology matrices can be seen as reachability matrices, which means they tell which node is reachable from which. This matrix represents the reachability of clouds from other clouds and, since any cloud is reachable from itself, the diagonal is always 1. Since the graph is bi-directional, the matrix is symmetric. From Equations (11) and (12), Ω for the system shown in Figure 2 is:

Ω =
| 1 0 0 0 0 0 1 |
| 0 1 1 1 0 0 0 |
| 0 1 1 1 0 0 0 |
| 0 1 1 1 1 0 0 |
| 0 0 0 1 1 0 0 |
| 0 0 0 0 0 1 1 |
| 1 0 0 0 0 1 1 |    (17)

Equation (17) shows that e_1 is directly connected with e_7; e_2 is directly connected with both e_3 and e_4; e_3 is directly connected with both e_2 and e_4; e_4 is directly connected with e_2, e_3 and e_5; e_5 is directly connected with e_4; e_6 is directly connected with e_7; and, finally, e_7 is directly connected with both e_1 and e_6. This matrix represents the reachability of edge servers from edge servers and, since any edge server is reachable from itself, the diagonal is always 1. Since the graph is bi-directional, the matrix is symmetric. From Equations (13) and (14), Ψ for the system shown in Figure 2 is:

Ψ =
| 1 1 1 1 1 0 0 |
| 1 0 0 1 1 1 1 |
| 0 0 0 1 1 1 0 |    (20)

Equation (20) shows that c_1 is connected with e_1, e_2, e_3, e_4 and e_5; c_2 is connected with e_1, e_4, e_5, e_6 and e_7; and, finally, c_3 is connected with e_4, e_5 and e_6. This matrix represents the reachability of edge servers from clouds (and vice versa). Moreover, the matrix does not have to be square, since the number of clouds and the number of edge servers need not be the same. From Equations (15)-(17) and (20), the overall topology Ξ is represented as the block matrix given in Equation (21). The top left corner of the matrix in Equation (21) is the matrix in Equation (16).
The top right corner of the matrix in Equation (21) is the matrix in Equation (20). The bottom left corner of the matrix in Equation (21) is the transpose of the matrix in Equation (20). Finally, the bottom right corner of the matrix in Equation (21) is the matrix in Equation (17). Removing the separation lines between the blocks yields the plain matrix of Equation (22).
Topology Ξ in Equation (22) is represented as a square matrix of size (n + m) × (n + m), where n = |C| = 3 and m = |E| = 7.
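To make the construction concrete, the block assembly of Ξ from Φ, Ψ and Ω can be sketched in Python. The matrix values below are transcribed from the textual description of Figure 2; the function name is ours.

```python
# Reachability matrices for the Figure 2 topology (1 = directly connected;
# the diagonal is always 1 since every node reaches itself).
Phi = [[1, 1, 0],
       [1, 1, 1],
       [0, 1, 1]]                      # clouds c1..c3, Equation (16)

Omega = [[1, 0, 0, 0, 0, 0, 1],
         [0, 1, 1, 1, 0, 0, 0],
         [0, 1, 1, 1, 0, 0, 0],
         [0, 1, 1, 1, 1, 0, 0],
         [0, 0, 0, 1, 1, 0, 0],
         [0, 0, 0, 0, 0, 1, 1],
         [1, 0, 0, 0, 0, 1, 1]]       # edge servers e1..e7, Equation (17)

Psi = [[1, 1, 1, 1, 1, 0, 0],
       [1, 0, 0, 1, 1, 1, 1],
       [0, 0, 0, 1, 1, 1, 0]]         # cloud-to-edge links, Equation (20)


def overall_topology(phi, psi, omega):
    """Assemble the (n+m) x (n+m) block matrix Xi of Equations (21)/(22)."""
    n, m = len(phi), len(omega)
    xi = [[0] * (n + m) for _ in range(n + m)]
    for i in range(n):
        for j in range(n):
            xi[i][j] = phi[i][j]            # top-left block: Phi
        for j in range(m):
            xi[i][n + j] = psi[i][j]        # top-right block: Psi
            xi[n + j][i] = psi[i][j]        # bottom-left block: Psi transposed
    for i in range(m):
        for j in range(m):
            xi[n + i][n + j] = omega[i][j]  # bottom-right block: Omega
    return xi


Xi = overall_topology(Phi, Psi, Omega)
```

Because all links are bidirectional, the assembled Ξ is symmetric, which gives a quick sanity check on the transcription.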

Topology Operator
Operator ←→ is a proposed mathematical binary operator that, when applied, finds the paths between two nodes: x ←→ y gives all possible paths between x and y. The convention followed when using this operator is as follows:
1. x ←→ y denotes the path when x and y are directly connected. We call this a 0th-degree path.
2. x ←→^1 y denotes the set of all paths when x and y are indirectly connected and there is only one node in the middle. We call paths that belong to this set 1st-degree paths.
3. x ←→^n y denotes the set of all paths when x and y are indirectly connected with n nodes in the middle. We call paths that belong to this set n-th degree paths.
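The semantics of the degree operator can be sketched in Python as follows. This is a minimal illustration using 0-based node indices over a nested-list reachability matrix; the function name is our own.

```python
def paths_of_degree(adj, x, y, k):
    """All simple paths x <-->^k y: paths from x to y with exactly k
    intermediate nodes, over reachability matrix adj (the diagonal
    entries adj[i][i] = 1 act as self-loops and are skipped)."""
    found = []

    def extend(path):
        last = path[-1]
        for nxt in range(len(adj)):
            if nxt in path or not adj[last][nxt]:
                continue                    # avoid revisits and absent links
            if nxt == y:
                if len(path) == k + 1:      # exactly k nodes in the middle
                    found.append(path + [y])
            elif len(path) < k + 1:         # room for more intermediates
                extend(path + [nxt])

    extend([x])
    return found


# Phi from Figure 2: c1-c2 and c2-c3 are the only direct cloud links.
Phi = [[1, 1, 0],
       [1, 1, 1],
       [0, 1, 1]]
```

For example, c1 ←→ c2 yields the direct path, c1 ←→ c3 yields nothing at degree 0, and c1 ←→^1 c3 yields the single path through c2.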

Proof of Soundness
This section proves the formal soundness of the operator s ←→^n d through the node set N, where N is the set of nodes between source node s and destination node d. The soundness of the operator is claimed through the closed-form theorem. The operator is sound if and only if both conditions of the theorem hold: the first condition ensures valid connectivity and the second ensures the nonexistence of cycles in the graph. We first prove that if the operator is sound then the conditions hold. Next, we prove that if all the conditions hold then the operator is sound.

Example of Using the Topology Operator
In this section, an example is given of how the operator is used. It is applied to the topology shown in Figure 3. We start with phase 1: Φ paths of 1st degree. The result of calculating c1 ←→^1_Φ c2 in Equation (23) shows that c1 and c2 are directly connected but there is no third node that immediately connects them together.
The result of calculating c1 ←→^1_Φ c5 in Equation (26) shows that c1 and c5 are not directly connected and there is no 1st-degree connection that binds them together. We assume that connectivity is bidirectional; in other words, if ci is reachable from cj then cj is reachable from ci. Based on this assumption, there is no need to test a node ci against any node with a lower index, which is why for c2 we start testing from c3.
The result of calculating c2 ←→^1_Φ c4 in Equation (28) shows that c2 and c4 are directly connected and there are no intermediate nodes that bind them together.
The result of calculating c2 ←→^1_Φ c5 in Equation (29) shows that c2 and c5 are not directly connected but are connected through the intermediate node c4. The results for paths starting at c3 are calculated in the same way. Now we continue with phase 2: Φ paths of 2nd degree. In this phase, only those nodes connected in the first phase are used. In other words, the path segment between c1 and c5 in Equation (26) will be ignored, since there is no 1st-degree connectivity between the two nodes.
Applying the 2nd degree of the above operation distinguishes between several cases, including the case where two nodes are connected and there is no other connection path. Case 3 is the only case that will propagate to further phases. The generation of 2nd-degree paths proceeds as follows: the result tells us that path (c1 c2) will not be considered in further phases since it does not add any information, while there is a direct path between c1 and c3.
Path (c1 c2) will be used in the 3rd phase since it adds a path to c2 through c4.
Note that (c1 ←→^1_Φ c5) is ignored since it is 0. The 2nd-degree path to c3 from c1 is found as follows: the results indicate that path (c1 …) continues. The 2nd-degree path to c4 from c1 is found as follows: the results indicate that c4 is reachable, which means that path (c1 …) will be considered in the 3rd phase. The 2nd-degree path to c5 from c1 is found as follows: this means that path (c1 …) will not be used in the 3rd phase.
This will be repeated for paths that start with c2, c3, c4 and c5. Only successful paths will continue to phase 3. Note that the maximum number of phases is n, where n is the number of clouds connected in the topology Φ. The stopping criterion of the algorithm is the failure to produce any vector that can continue to the next phase.
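The phase-by-phase procedure with its stopping criterion can be sketched as follows. Since the full Figure 3 matrix is not reproduced here, the sketch is illustrated on the Figure 2 cloud topology Φ; the function name is ours.

```python
def all_paths(adj, src, dst):
    """Enumerate simple paths from src to dst phase by phase: phase k
    extends the surviving partial paths by one node, and the procedure
    stops when no path can continue (at most n phases for n nodes)."""
    found = []
    frontier = [[src]]                      # partial paths still alive
    for _ in range(len(adj)):               # maximum number of phases is n
        nxt_frontier = []
        for path in frontier:
            last = path[-1]
            for j in range(len(adj)):
                if j in path or not adj[last][j]:
                    continue                # no revisits, no absent links
                if j == dst:
                    found.append(path + [j])
                else:
                    nxt_frontier.append(path + [j])
        if not nxt_frontier:                # stopping criterion: nothing to extend
            break
        frontier = nxt_frontier
    return found


# Phi as in the earlier sketch (Figure 2 cloud topology).
Phi = [[1, 1, 0],
       [1, 1, 1],
       [0, 1, 1]]
```

On this topology the only route from c1 to c3 is through c2, which the enumeration finds in its second phase.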

Sub-Topologies
The previous demonstration for Φ applies equally to Ω, since both Φ and Ω are square matrices. Ψ represents the connectivity between clouds and edge servers, which means it is not necessarily a square matrix. If we want to find the connectivity between two clouds through one or more intermediate edge servers, the algorithm still applies unchanged. However, if we need to find the connectivity between two edge servers through two or more clouds, we need to use Ψ^T.

Overall Algorithm
The algorithm finds paths that include a diversity of node types. In other words, a path is a mix of links that bind clouds, edge servers, or clouds and edge servers together. The formula for the algorithm is defined in Equation (47), which means that there are paths such that every path has a minimum of one connection that could be:

1. a connection from a cloud to a cloud;
2. a connection from an edge server to an edge server;
3. a connection from a cloud to an edge server;
4. a connection from an edge server to a cloud.

Mathematically, let us assume that we have a path from cloud c1 to cloud c2, then to edge server e1, then to edge server e2 and finally to cloud c4. This path is represented by chaining entries of the corresponding sub-topology matrices (Φ, then Ψ, then Ω, then Ψ^T). While the matrices involved have different dimensionality, what matters is the dimensional compatibility of every two consecutive matrices, since every operation verifies whether there is a path or not.
Note that the simplest way is to use Ξ directly in the format given in Equations (21) and (22): because it is a square matrix, the application of the algorithm and the operator becomes very straightforward, without the need for Equation (47), which deals with the heterogeneous, non-uniform matrix cases.
Algorithm 1 takes the incidence matrix Ξ, the source node s and the destination node d as inputs and returns all possible paths from s to d. Note that every path carries the information of the transmission cost, in seconds per bit, between every two successive nodes. This means that the longer the data to be transmitted are, the higher the cost is. The transmission cost differs from node to node, and the total cost when transmitting b bits from source s to destination d through path p is:

    cost(p, b) = b × Σ_{c_i ∈ p} τ(c_i, c_i•)

where c_i• is the successor of c_i. After acquiring the different paths, we pick the path with the lowest cost; note that the cost here is not just the transmission cost but also the replication cost. The minimum cost is then taken over the candidate hosting nodes, where t is the number of times node c_d will communicate with node c_j. The proposed algorithm finds the cost of different paths from one node to another. It is divided into two phases: the topology recognition phase and the dynamic behaviour phase. The topology phase builds the graph, while the dynamic behaviour phase keeps watching costs and finds the minimum path accordingly. Comparing this algorithm to traditional Dijkstra graph minimisation [52], we can see the following:

1. The objective of Dijkstra is to learn the path with the minimum cost, where the cost is considered static; if the cost changes, the algorithm is applied again from the starting point until the destination is reached. The objective of the proposed algorithm is to learn the topology and then assign dynamic costs that vary from one point in time to another; according to the changes that occur, file segments are replicated to or deleted from certain nodes.

2. While the Dijkstra algorithm works on undirected graphs, it is not guaranteed to be sound when weights are negative. In the proposed algorithm this cannot happen, since nodes that have been visited before cannot be visited again. This is guaranteed by the algorithm itself and verified by the theorem in Section 5.3.

3. The proposed algorithm learns the topology; when costs change, only the affected nodes are notified and the service requester therefore moves to another path. This also means that replicating a file segment to a node, or removing a segment from a node, is handled by the proposed algorithm.

The core of Algorithm 1 enumerates candidate paths as follows (fragment):

    3:  for every ξ ∈ Ξ do
    4:      let index ← index + 1
    5:      let path_t ← CurrentPath
    6:      if ξ ∉ path then
    7:          if index(ξ) == index(d) then
    8:              let path_t ← path_t ∪ {d}
    9:              if path_t ∉ AllPaths then
    10:                 let AllPaths ← AllPaths ∪ {path_t}
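The dynamic-behaviour phase then prices each discovered path. A sketch of the per-bit cost sum and the minimum-cost selection follows, assuming `tau[i][j]` holds the per-bit delay of a direct link, with infinity for absent links; the function names and the toy delay values are ours.

```python
INF = float("inf")


def path_cost(path, tau, bits):
    """Total cost of transmitting `bits` bits along `path`: the sum of
    per-bit delays between every node and its successor, times b."""
    return bits * sum(tau[a][b] for a, b in zip(path, path[1:]))


def cheapest_path(paths, tau, bits):
    """Dynamic-behaviour step: re-evaluate costs and keep the minimum."""
    return min(paths, key=lambda p: path_cost(p, tau, bits))


# Toy three-node per-bit delay matrix (illustrative values only).
tau = [[0, 2, INF],
       [2, 0, 3],
       [INF, 3, 0]]
```

When costs change, only `path_cost` needs re-evaluating over the already-learned paths, which is the point of separating topology recognition from the dynamic phase.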

File Replication
Back to Equation (1): the third component of the system is F, the set of files described in Equation (4), which are accessed in the system Ξ. The cost of accessing a file f ∈ F depends on many factors, including how far the file is from the destination node α ∈ Ω and whether α is a cloud or an edge server. In this paper, we take the total cost η of file replication to be the data transfer delay that will be saved, τ, in addition to the data processing cost π and the load on the hosting node ρ. These factors are weighted, and the weights differ from one server to another.
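The weighted combination above can be sketched directly. The weights w_τ, w_π, w_ρ are the per-server learned values; the numbers in the example are placeholders, not values from the paper.

```python
def replication_cost(tau_saved, pi, rho, weights):
    """Total replication cost eta: the weighted sum of the saved data
    transfer delay tau, the processing cost pi and the hosting load rho.
    `weights` = (w_tau, w_pi, w_rho), learned per hosting server."""
    w_tau, w_pi, w_rho = weights
    return w_tau * tau_saved + w_pi * pi + w_rho * rho
```

Because the weights are learned per server, the same (τ, π, ρ) triple can yield different η on different hosts, which is what drives the replica choice later on.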
The unit costs for every independent variable in Equation (51) are determined by the host, and the weights are learned from historical data. When replicating a file f ∈ F, the cost of data transfer and hosting depends on how big the file is, which means that replicating segments of a file can reduce both hosting and data transfer costs. A file is segmented according to accessibility. For a region s of a large file f, if this region is frequently accessed, then s becomes a separate segment, which means the file is divided into three segments: segment s_b, the portion before the target segment; segment s, the required one; and segment s_a, the part of the file after the required segment. This means that file f after partitioning is f = {s_b, s, s_a}. A segment can be partitioned into several sub-segments in the same way. The details of how this is carried out are handled in a future study; in this paper, we assume that files are already partitioned.
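The three-way split around a frequently accessed region can be sketched as follows; the byte offsets `start` and `end` delimiting the hot region are our own parameters.

```python
def partition(file_bytes, start, end):
    """Split file f into {s_b, s, s_a} around the frequently accessed
    region [start:end): the part before it, the region itself, and the
    part after it. Empty leading/trailing parts are dropped."""
    s_b = file_bytes[:start]     # portion before the target segment
    s = file_bytes[start:end]    # the required (hot) segment
    s_a = file_bytes[end:]       # portion after the target segment
    return [seg for seg in (s_b, s, s_a) if seg]
```

Applying the same function to one of the resulting segments yields sub-segments, matching the recursive partitioning described above.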

Simulation and Results
In our simulation, we apply the proposed framework to the system shown in Figure 3 with the setup in Tables 1 and 2. We assume three different files f1, f2, f3, divided into segments as follows. Table 1 shows the allocation of the different file segments across clouds and edge servers. Segment s11 ∈ f1 is located in three different nodes, namely c1, c2 and e7. File segment s12 ∈ f1 is located only in c1. File segments s21, s22 ∈ f2 are located only in c2, while file segment s23 ∈ f2 is located in both c2 and c4. File segment s31 ∈ f3 is located only in c3. File segment s32 ∈ f3 is located in both clouds c3 and c5. Last but not least, file segments s33, s34 ∈ f3 are both located in cloud c3. From the data given in Table 1, it is easy to conclude that f1 originally existed in c1, since replication does not delete original segments and s12 exists only in c1; this means that s11 was replicated to both e7 and c2. The same applies to f2: it was originally located in c2, and s23 was replicated to c4, since the whole file f2 = {s21, s22, s23} exists in c2 and s23 is the only segment of f2 that exists in c4, as seen in Table 1. Investigating f3, it originally existed in cloud c3 and s32 was then replicated to c5, since the whole file f3 = {s31, s32, s33, s34} exists in c3 and s32 is the only segment of f3 that exists in c5, as seen in Table 1. Table 2 shows the assumed file segment lengths, where the length of a file is the sum of the lengths of its segments, κ(f) = Σ_{s ∈ f} κ(s), with κ being the length in bits. This means that κ(f1) = κ(s11) + κ(s12) = 100 + 150 = 250, κ(f2) = κ(s21) + κ(s22) + κ(s23) = 60 + 110 + 210 = 380 and κ(f3) = κ(s31) + κ(s32) + κ(s33) + κ(s34) = 50 + 20 + 200 + 120 = 390. Matrix τ in Equation (53) shows the assumed communication delays in milliseconds.
τ is a square matrix that defines the communication delays between every two nodes. If two nodes are not directly connected, the cost is ∞, since τ follows the topology matrix and recognises only direct connectivity between nodes; if two nodes are directly connected, their delay is given. Note that the diagonal of the matrix is 0, since there is no delay from any node to itself.
For e9 to access s11, we have three different paths. The cost for e9 to access s11 n times through path p1, when hosting at an intermediate node c5, is computed with Equation (51). The following experiment shows the results of node e9 requesting file segment s11. This segment exists in three different locations according to Table 1, namely c1, c2 and e7, and the paths from each source are studied below. The costs of the different paths from c1 to e9 are shown in Figure 4; among the paths that start at node c1, the minimum-cost path is identified. Figure 4 shows the lowest-cost paths from c1 to e9. It is obvious that the total cost differs from one path to another. Note that costs within the same path can also differ depending on which node is selected for replication.
The costs of the different paths from c2 to e9 are shown in Figure 5. Among the paths that start at node c2, the minimum-cost path is identified. Figure 5 shows the lowest-cost paths from c2 to e9; again, the total cost differs from one path to another, and costs within the same path can differ depending on which node is selected for replication. The costs of the different paths from e7 to e9 are shown in Figure 6. Among the paths that start at node e7, the minimum-cost path is identified. Figure 6 shows the lowest-cost paths from e7 to e9; again, the total cost differs from one path to another, and costs within the same path can differ depending on which node is selected for replication. The data in Figures 4–6 illustrate the different path costs from source to destination. Every bar represents the total cost of a path. These costs are calculated using Equation (51): the segment lengths are shown in Table 2, the communication delays in Equation (53), and the hosting cost is calculated from the segment hosting shown in Table 1. A comparison of the three minima resulting from the different starting points is shown in Figure 7; it shows that replicating s11 from node c1 gives the minimum cost. Given the chosen path, the replication could occur at c4 or c5. Figure 8 shows the cost of replicating s11 to c5 versus replicating the whole file. The figure shows the base cost, which is the cost of moving the data from source c1 to c5; it is low because it is paid only once. The segment replication cost increases with a low slope, since we only replicate what is requested by e9. The third line has a high slope, since the whole file is accessed every time e9 sends a request. Figure 9 shows the cost of replicating s11 to c4 versus replicating the whole file.
The figure shows the base cost, which is the cost of moving the data from source c1 to c4. It is low because it is paid only once, and the segment replication cost increases with a low slope since we only replicate what is needed by e9. The third line has a high slope, since the whole file is accessed every time e9 sends a request. Figure 10 shows the difference in replication cost between c4 and c5. The results show that the more iterations there are, the lower the cost of replicating to c5 becomes.
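Using the Table 2 lengths κ(s11) = 100 and κ(f1) = 250, the two cost curves compared in Figures 8 and 9 can be sketched as follows. The per-bit delay values in the example are illustrative placeholders, not the paper's figures.

```python
SEG_BITS, FILE_BITS = 100, 250       # kappa(s11) and kappa(f1) from Table 2


def segment_replication_cost(n, base_delay, hop_delay):
    """One-off base cost of moving the segment to the replica node, plus
    n per-request deliveries of just the segment (the low-slope line)."""
    return base_delay * SEG_BITS + n * hop_delay * SEG_BITS


def full_file_cost(n, path_delay):
    """Streaming the whole file on every one of n requests
    (the high-slope line)."""
    return n * path_delay * FILE_BITS
```

Even when the one-off base cost is paid, the segment line overtakes the full-file line after a handful of requests, which is the crossover behaviour the figures show.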
This study employs a multivariate regression model as a machine learning technique to identify the optimal replica. The approach involves calculating the cost associated with each node if it were to run the process and factoring in the total cost of replication, taking into account the size of the file's segments to be replicated. By utilising this approach, we aim to determine the optimal choice of replica in a cost-effective manner. As a sample of the data, Table 3 showcases a small subset of the training data employed in this study.
The total number of rows in the dataset was 10,000, which is considered sufficient to achieve a mean square error below our predefined minimum threshold. The first column of the table represents the processor used for a given task; the discrete values in this column represent three different types of processors, with 1 indicating the slowest and 3 the fastest. The second column displays the log_2 of the memory size in gigabytes for the task. The third column shows the percentage of memory used for the given task; the value of the used memory is displayed as the log_2 of the memory used in gigabytes.
The fourth column represents the size of the task to be executed, where the numbers represent the log_2 of the task length in gigabytes. Note that in this context, the term "task" refers to the entire activity required to execute and is not related to the definition used in operating systems engineering. Furthermore, the fifth column shows the log_2 of the disk size in gigabytes, while the sixth column displays the percentage of disk space used for the given task; the value of the used disk space is also displayed as the log_2 of the disk space used in gigabytes.


The seventh column of the table represents the log_2 of the segment size in gigabytes. Finally, the last column represents the duration in microseconds required to process the given task that accesses the given file segment. It is important to note that this is only a small sample of the complete training dataset, which was used to train a multivariate regression model. This model was used to determine the optimal choice of replica, based on the cost associated with each node if it were to run the process and the total cost of replication. The outcome of the training process is summarised in Table 4, and the model with the learned coefficients can be presented accordingly. It is noteworthy that the first independent variable in Table 3 is denoted X_1, which corresponds to the Processor column; the remaining independent variables continue in the same sequence up to X_7, which represents the segment size. The dependent variable, duration, is denoted Y and is located in the last column of Table 3. Once the time is predicted, the process will need to be executed on a certain node; the total cost is predicted as the communication cost plus the processing cost. The communication cost is already known, as mentioned before, and the processing cost is calculated using Equation (54).
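A sketch of how the learned model is applied follows. The coefficients below are placeholders standing in for the Table 4 values, which are not reproduced here, and the two candidate nodes are hypothetical.

```python
# Hypothetical coefficients standing in for Table 4's learned values:
# intercept b0, then one coefficient per X1..X7 (Processor .. segment size).
BETA = [5.0, -1.2, 0.8, 0.5, 2.1, -0.3, 0.4, 1.5]


def predict_duration(x):
    """Regression prediction in the style of Equation (54):
    Y = b0 + b1*X1 + ... + b7*X7."""
    assert len(x) == len(BETA) - 1
    return BETA[0] + sum(b * xi for b, xi in zip(BETA[1:], x))


def pick_replica(candidates):
    """Choose the candidate node with the lowest predicted duration."""
    return min(candidates, key=lambda name: predict_duration(candidates[name]))


# Two hypothetical candidate nodes differing only in processor type
# (1 = slowest, 3 = fastest), so the faster one should be preferred.
candidates = {"node_a": [1, 4, 0.5, 3, 6, 0.4, 2],
              "node_b": [3, 4, 0.5, 3, 6, 0.4, 2]}
```

In practice the predicted processing duration would be combined with the known communication cost before the minimum is taken, as the text describes.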
The node with the lowest cost is the one chosen for replication and task execution. The model built by machine learning helps choose the optimal candidate for file segment replication based on historical data. The cost calculated by the regression is represented by π in Equation (51).

Performance Comparison between Using ML and Picking a Random Server
In this subsection, a comparison is conducted between the performance of the system when utilising machine learning and when selecting a server randomly. Table 5 presents various configurations to choose from, with values that are interpreted similarly to the explanation provided above. The experiment focuses on a process that has a burst of 18 and uses a segment with a size of 5. To predict the duration required to execute a task with these specifications, Equation (54) is utilised. The resulting predicted costs of each server are demonstrated in Figure 11.
As depicted in Figure 11, server 2 offers the shortest duration to execute a task with a burst of 18 and access a file segment of 5. Without utilising machine learning, a random selection of servers would have been made, possibly resulting in a server such as server 12 being chosen. However, executing the same task on server 12 would take almost 4.5 times the duration required on server 2.
The error in server selection is demonstrated in Figure 12. In this context, the error of server s is defined as the difference between the duration required to execute a task on server s and the minimum possible duration achievable with the available servers. As server 2 provides the shortest duration to execute the given task, its error is zero.
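The error metric of Figure 12 can be written down directly. The server names and durations below are illustrative, chosen to echo the roughly 4.5x gap noted above.

```python
def selection_error(durations, chosen):
    """Error of picking `chosen`: its duration minus the minimum
    achievable duration over all available servers (zero for the best)."""
    return durations[chosen] - min(durations.values())


# Illustrative durations: server12 takes ~4.5x as long as server2.
durations = {"server2": 10.0, "server12": 45.0}
```

Under this definition the best server always scores zero, so the metric directly measures the penalty of a random (or otherwise uninformed) choice.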

Conclusions
This paper proposes a machine learning approach for optimising file replication in a multi-cloud system in a smart city. Edge servers often communicate with clouds to request data in IoT settings and smart cities in general. A large amount of data transfer is generated by smart sensors and devices, as they request significant amounts of data from different clouds in order to use them in tasks such as interpolation, prediction and decision making. This paper defines a framework for multi-cloud and edge server-cloud collaboration in which the nodes build a virtual workflow to share files. In this paper, a file is divided into segments, and segments can be replicated; in other words, there is no need to replicate the whole file if only certain segments are requested in some geographical areas while others are not. A mathematical operator is proposed and applied on a predefined topology. Based on this topology, and by applying the operator the needed number of times, the minimum route between a source and a destination is found. A closed-form theorem is proposed and proved. The outcome of applying this operator is an acyclic, sound path leading from source to destination. The proposed algorithm finds all the nodes (sources), whether clouds or edge servers, that host the requested segment. For every source, the paths are found and their costs are calculated. A path is a series of nodes connected together in an acyclic sequence, as mentioned earlier. A node in a path is internal if it belongs to the path and lies between the source and the destination. Every internal node is evaluated as a hosting candidate for the requested segment based on experience (node history), and the node replies with accept, neutral or reject. The accepting node with the minimum total cost is selected; if there are no accepting nodes, then the neutral node that guarantees the lowest cost is selected.
If there are no neutral nodes, then the system will not replicate and the requesting node will stream from one of the sources. This still guarantees better performance than full replication, provided the performance of full replication is measured the same way the algorithm measures costs. Simulation shows the improvement in system performance when applying the proposed model compared to full replication. The worst-case cost of optimising any selected path, by replicating to a node that belongs to this path, is equivalent to the best-case cost of no replication and definitely better than the worst-case cost of replicating the whole file.