A new fuzzy optimal data replication method for data grid

Article history: Received October 1, 2012; Received in revised format January 22, 2013; Accepted January 23, 2013; Available online January 24, 2013

Abstract

These days there are several applications that deal with large data sets, which have become an important part of common resources in different scientific areas. In fact, many applications handle huge amounts of information, measured in terabytes or even petabytes. Many scientists work with vast amounts of data distributed geographically around the world through advanced computing systems. This huge volume of data and computation has created new problems in accessing, processing, and distributing data. The challenges of the data management infrastructure become very difficult under large amounts of data, wide geographical distribution, and complicated calculations. The Data Grid is a remedy to all of these problems. In this paper, a new method of dynamic optimal data replication in data grids is introduced; it reduces total job execution time and increases the locality of accesses by detecting and exploiting the factors that influence data replication. The proposed method is composed of two main phases. The first phase is the file-request and replication phase, in which we evaluate three factors influencing data replication and determine whether the requested file should be replicated or accessed remotely. In the second phase, the replacement phase, the proposed method investigates whether there is enough space in the destination to store the requested file. In this phase, the proposed method also chooses the replica with the lowest value for deletion, considering three replica factors, in order to increase the performance of the system. Simulation results indicate the improved performance of the proposed method compared with the other replication methods implemented in the OptorSim simulator. © 2013 Growing Science Ltd. All rights reserved.


Introduction
These days, huge amounts of data and computation are geographically distributed around the world, and this has created challenges in accessing, processing, and distributing data. In fact, it is a tedious task to handle large amounts of data in complicated calculations. The Data Grid is an alternative solution for these kinds of problems: an architecture for the management and analysis of large scientific data sets (Chervenak et al., 2000). The Data Grid concept is a development of the "Grid", an infrastructure for integrating distributed computational elements. In an advanced distributed environment such as a grid, the important metrics include data access time, response time, access cost, bandwidth utilization, reliability, and scalability. Data replication is a technique used to handle the challenges of high accessibility. With this method, we increase fault tolerance and improve scalability and response time through high data availability and low bandwidth consumption (Amjad et al., 2012; Foster, 2002; Ranganathan et al., 2001; Lamehamedi & Szymanski, 2002; Ranganathan et al., 2002; Rahman et al., 2005; Vazhkudai et al., 2001; Stockinger et al., 2002).
When data is replicated, replicas of the data files are created at various sites in the data grid. Replication consumes storage resources by occupying data storage on each site, but it saves a large amount of bandwidth compared with keeping the data on only a single site. Therefore, to provide fast data access at all times, data replication is an effective trade-off between storage availability and bandwidth availability (Yuan et al., 2007). The main idea is to keep the data near the user for effective and quick access. A data replication algorithm should answer critical questions such as which data need to be duplicated and where the copies must be kept. The dynamic behavior of grid users makes it difficult to make data replication decisions that achieve maximum accessibility (Schintke & Reinefeld, 2003).
This paper introduces a new method of dynamic data replication in the data grid. The introduced method reduces the total execution time of jobs on grid sites by taking into account the factors that influence data replication. Therefore, the primary objectives are to reduce data access time, data access latency, job execution time, and occupied bandwidth, to increase the locality of accesses, and to replicate data on the grid optimally.

Zhao et al. (2010) proposed a Dynamic Optimal Replication Strategy (DORS), which includes a replacement algorithm for new data replicas based on access history, file size, and network conditions. Replicas with the lowest value are removed so that the local node has sufficient capacity to store the new file. The algorithm is designed for dynamic data grid environments and can increase the effectiveness of data replica access; the strategy operates in two stages. In the first stage, a replication strategy is applied: based on the number of replicas of the requested file, it decides whether the file should be kept in the local storage system. In the second stage, a replacement method for new replicas is applied, in which the values of the files in the local storage system are computed. Files with high value contribute to the efficiency of data access and should therefore be protected.

Chaprada et al. (2010) suggested a strategy for dynamic data grids that increases file availability in order to improve response time and reduce bandwidth consumption. The strategy is evolutionary; in other words, it helps achieve load balance across the whole grid in a desired amount of time. It consists of the following steps: selecting the best candidate files for replication, determining the best sites to host the files selected in the previous step, and choosing the best replica. The first step, detecting candidate files for replication, is based on the following parameters: the number of times a file F is requested, and the number of available replicas of F in the whole grid system. The strategy replicates a file F if it is in demand and not enough replicas of F are available on the grid. After selecting the best candidate files for replication, the best sites to host them are determined. To make this selection, the strategy takes the following parameters into account: the number of requests for each file F from each site S, and the utilization of each site S within the grid when a file F is requested for replication or to satisfy a request. Selecting the best replica depends not only on the bandwidth but also on the utilization function of the grid sites. In fact, to transfer a file F from a site S1 to a site S2, the transfer time can be reduced if S1 is near S2. A site that is always connected to the grid does not have the same utilization as a site that is frequently connected and disconnected. For these reasons, the strategy uses these two parameters (bandwidth and utilization) jointly.

Abdurrab and Xie (2010) introduced the FIRE strategy. Consider a data grid consisting of n sites, denoted by the set S = {s1, ..., si, ..., sn}. Moreover, let F = {f1, ..., fi, ..., fm} be the set of files in the data grid, and let a job j require the files D = {d1, ..., di, ..., dw}, where w <= m. For each file di in D, FIRE checks whether it is locally available and, if not, tests whether the local SE has space to accommodate di. If so, FIRE replicates di from the site with minimum cost based on an auction protocol (AP). If not, FIRE compares the number of remote accesses to di issued from the site sk against the number of local accesses to the least locally accessed file on sk. If the former is not lower than the latter, the least locally accessed file is deleted and di is replicated onto sk using the AP. Otherwise, the job j accesses di remotely; in this case, the replica manager of the remote site holding di updates its access table to record this remote access event. Finally, the replica manager of site sk updates its own file access table to record the local and remote accesses generated by the job j.
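As an illustration, the following minimal sketch captures the FIRE decision flow just described; the function name, the dict-based bookkeeping, and the return labels are assumptions for exposition, not code from the original paper.

    def fire_handle_request(local_files, free_space_mb, d_name, d_size_mb,
                            remote_accesses_on_d, local_access_counts):
        # local_files:          dict of file name -> size on the local storage element.
        # remote_accesses_on_d: remote accesses to d issued from this site so far.
        # local_access_counts:  dict of file name -> local access count.
        if d_name in local_files:
            return "local"                         # already replicated here
        if free_space_mb >= d_size_mb:
            return "replicate"                     # from the cheapest source via the AP
        # Otherwise compare remote accesses to d with the least locally accessed file.
        victim = min(local_files, key=lambda f: local_access_counts.get(f, 0))
        if remote_accesses_on_d >= local_access_counts.get(victim, 0):
            return f"evict {victim}, then replicate"
        return "remote"                            # the remote site records this access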

A new fuzzy optimal data replication method
The primary objective of the proposed method is to identify the factors that play an effective role in increasing grid efficiency, decreasing response time, and reducing bandwidth occupation, and to incorporate these factors properly into the replication algorithm. An important point in the proposed method is that the effective replication factors must be found and the effectiveness of each of them determined. Another important point is how to use these factors simultaneously in the replication algorithm, in other words, how to weight them.

Architecture of the proposed method
The architecture of the Fuzzy Optimal Replication Method (FORM) is shown in Fig. 1 (Saadat & Rahmani, 2012).

Grid sites are located at the lowest level of this architecture; each site is composed of storage as well as computational elements. A virtual organization (VO) is composed of multiple sites. Each virtual organization has a local server (LS), which is responsible for maintaining the sequence of data accesses inside the virtual organization; the list of replicas is also kept in this component. Note that the speed of data access within a virtual organization is higher than the speed of data access across virtual organizations. A Regional Server (RS) sits at the level above the virtual organizations: each RS monitors one or more virtual organizations, and the RSs are connected through the Internet, so the data transfer rate among them is lower than the data transfer rate within the virtual organizations.
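The following minimal sketch models this hierarchy as data structures; the class and field names are illustrative assumptions, not identifiers from FORM.

    from dataclasses import dataclass, field

    @dataclass
    class Site:
        name: str
        storage: dict = field(default_factory=dict)        # file name -> size held locally

    @dataclass
    class LocalServer:
        # One LS per virtual organization: keeps the VO's data-access sequence
        # and the list of replicas available inside the VO.
        access_sequence: list = field(default_factory=list)
        replica_list: dict = field(default_factory=dict)   # file name -> hosting sites

    @dataclass
    class VirtualOrganization:
        sites: list
        local_server: LocalServer = field(default_factory=LocalServer)

    @dataclass
    class RegionalServer:
        # An RS monitors one or more VOs; RSs communicate over slower Internet links.
        organizations: list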

Description of FORM Strategy
The proposed FORM method is composed of two phases:

First Phase: File Request and Replication
In this phase, we first consider the factors that determine whether the requested file should be replicated or accessed remotely. These factors are: the available storage space remaining on the local site, the replication cost of the target file, and the number of available replicas of the requested file in the whole grid. In fact, given that replication is an expensive operation, we need to know at every stage whether replication is cost effective. This investigation is performed in the proposed method and leads to savings in storage time and space while avoiding repeated replication operations.
First factor: the remaining available storage space on the local site. In FORM, before any replication, it is checked whether the local site has sufficient capacity to store the requested file. If it does, the file is replicated onto the site and the replica is recorded on the local site. Otherwise, if the local site does not have sufficient space, we need to decide whether the requested file should be accessed remotely by the local site or replicated onto it. In our method, this decision is made by evaluating the two other factors, described below.
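A minimal sketch of this first-factor check, assuming a simple dict-based storage model and a hypothetical LS replica catalogue:

    def try_direct_replication(local_storage, capacity_mb, f_name, f_size_mb, replica_list):
        # local_storage: dict of file name -> size already stored on the local site.
        # replica_list:  the LS catalogue, mapping file name -> list of hosting sites.
        used_mb = sum(local_storage.values())
        if capacity_mb - used_mb >= f_size_mb:             # factor 1: enough room
            local_storage[f_name] = f_size_mb              # store the new replica
            replica_list.setdefault(f_name, []).append("local-site")
            return True
        return False                                       # fall through to factors 2 and 3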
The important point is that if replication is performed blindly, without considering the conditions, files that could be accessed remotely at minimal cost are replicated and, due to the lack of storage space, displace important and popular files.
Second factor: the cost of accessing the target file, which plays a key role in every replication algorithm. In fact, the objective of replication is to reduce job execution time: when the cost of transferring the necessary files from the source site to the destination site is reduced, the execution time of jobs is obviously reduced as well. In FORM, when storage space is limited, we investigate whether a file requested by a local site can be accessed remotely. If the local site does not have sufficient space to store the requested file, the associated replication cost must be paid. If the cost of accessing the requested file is larger than a predetermined threshold, access to the file from a remote location will be time consuming. Moreover, given that the sites in a virtual organization have similar interests, and that the bandwidth among the sites within a virtual organization is much higher than the bandwidth between organizations, the proposed method adopts the policy of keeping at least one replica of the requested file inside each virtual organization. As a result, if another site in the virtual organization needs to access the file in the near future and the file is not available locally, it can be replicated very easily. Given these considerations, if the access cost of a file is above the threshold value, the file is replicated for the local site, because it is likely that this site and its neighboring sites will request the file again; we therefore keep it in a place nearer to the sites of that virtual organization than its previous location. However, if the access cost is below the threshold value, the algorithm investigates the third factor for the final replication decision.
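A sketch of the second-factor test follows, under an assumed cost model (transfer time plus propagation delay, consistent with the quantities listed later for the replacement phase); the threshold parameter is illustrative.

    def access_cost(size_mb, bandwidth_mbps, propagation_delay_s):
        # Remote-access cost: transfer time plus propagation delay.
        return size_mb * 8 / bandwidth_mbps + propagation_delay_s

    def must_replicate_by_cost(size_mb, bandwidth_mbps, propagation_delay_s, threshold_s):
        # Factor 2: above the threshold, remote access is too slow, so replicate
        # and keep a copy inside the requesting virtual organization.
        return access_cost(size_mb, bandwidth_mbps, propagation_delay_s) > threshold_s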
Third factor: the number of available replicas of the requested file in the whole grid. At this stage, the proposed method investigates how many replicas of the requested file exist in the whole grid and seeks to establish an appropriate number. This factor is applied in the replication decision because if the number of replicas on grid sites exceeds a pre-specified limit, valuable storage space is occupied uselessly, valuable and popular files no longer have sufficient space to be replicated, and replacement is inevitably performed frequently. Conversely, if the number of replicas of a file is less than the required limit, the cost of replication increases. For example, if no replica of a file is available in a virtual organization, or even in the neighboring virtual organizations, access to the closest replica will be very costly and time consuming for the sites of that virtual organization.
If the number of available replicas of the requested file is more than the threshold, there is no need to replicate the file and the local site can access it remotely. Otherwise, if the number of available replicas is less than the threshold value, the local site replicates the requested file from the closest available location. This ends the replication phase for a local site and its required file. The replication algorithm of FORM is shown in Fig. 2; in its final steps, the local server decides, based on the fuzzy algorithm, to delete some replicas on the local site g in order to make space for the requested file f.
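Putting the three factors together, the first phase can be sketched as follows; the argument and threshold names are illustrative assumptions, and Fig. 2 holds the authoritative pseudocode.

    def form_phase_one(has_local_copy, has_free_space, access_cost_s,
                       cost_threshold_s, grid_replica_count, replica_threshold):
        if has_local_copy:
            return "local"                              # nothing to do
        if has_free_space:                              # factor 1: room on the local site
            return "replicate"
        if access_cost_s > cost_threshold_s:            # factor 2: remote access too costly
            return "replicate-after-fuzzy-replacement"  # second phase frees space first
        if grid_replica_count >= replica_threshold:     # factor 3: enough copies on the grid
            return "remote"                             # use the file from a distance
        return "replicate-after-fuzzy-replacement"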

Second Phase: The Replacement Phase
The second phase focuses on replica replacement. The file to be removed should be one that is unlikely to be requested in the future, whose replication cost for the site would not be significant if it were requested again, and whose replicas are sufficiently available elsewhere. Three factors are considered when deciding whether or not to remove a replica in FORM. These three factors determine the value of a replica; a replica with a lower value is the first candidate for replacement. The factors are as follows. The number of accesses to the replica in the past: if a replica has been accessed frequently in the past, it is more likely to be used in the future; thus, the value of that replica is higher and it is not a good candidate for deletion.
The cost of replication: an expensive replica is not an appropriate option for deletion because, if it is needed in the future, a high cost must be paid for its replication. Thus, the higher the replication cost, the higher the value of the replica. The number of available replicas of the file: as the quantities listed below indicate, a file of which many copies already exist on other sites is a less valuable replica and a better candidate for deletion.

The quantities used to compute the replica value (RV) are:

• Replication Cost(r): the cost of performing the replication for replica r;
• bandwidth_ag: the bandwidth between site a and site g, where site a is the site containing replica r and site g is the destination (requester) site;
• Replica-Numbers(r): the number of available replicas of file r on other sites;
• Propagation delay time_ag: the delay for sending the replica from the source site (a) to the destination site (g);
• RV: the replica value.
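As a rough illustration of how these quantities can combine into a single RV score, the following sketch uses an assumed cost model, illustrative normalization bounds, and an equal-weight average standing in for the paper's fuzzy rules.

    def replication_cost(size_mb, bandwidth_mbps, propagation_delay_s):
        # Cost of re-creating replica r on requester g from hosting site a.
        return size_mb * 8 / bandwidth_mbps + propagation_delay_s

    def replica_value(access_count, cost_s, replica_numbers,
                      max_accesses=100.0, max_cost_s=60.0):
        # Each factor is normalized to [0, 1] before aggregation.
        accesses = min(access_count / max_accesses, 1.0)   # factor 1: past accesses
        cost = min(cost_s / max_cost_s, 1.0)               # factor 2: replication cost
        scarcity = 1.0 / (1 + replica_numbers)             # factor 3: fewer copies, higher value
        # Frequently used, expensive, scarce replicas score high and are kept.
        return (accesses + cost + scarcity) / 3.0

    # The replica with the lowest RV on the local site is deleted first, e.g.:
    # victim = min(replicas, key=lambda r: replica_value(r.hits, r.cost_s, r.copies))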

Simulation and its results
The simulator package "OptorSim" has been used for simulation and evaluating the efficiency of proposed method in this paper ( OptorSim -A Replica Optimiser Simulation).OptorSim (Bell et al, 2002) is a simulator package written in Java.The package has been developed for investigating the impact of replica optimizing algorithms in the data grid environment (EDG -The European Data Grid Project) and for demonstrating the real structure of European Data Grid.In this simulator, it is assumed that the grid is composed of multiple sites, each consists of zero or the large number of computational elements and zero or the large number of storage elements.Computational elements execute jobs by processing the files and these files are stored in the storage elements.Server controls the resource of job scheduling operations and allocating them to the computational resources according to the scheduling algorithm.Each site manages the content of its own files by using the Replica Manager (RM).At the heart of the replica administrator, there is the Replica Optimizer (RO), which includes the replication algorithm that is responsible for automatic creation and deletion of replicas.The Algorithm FORM has been compared with five algorithms of No Replication, LRU, LFU, EcoModel, and EcoModel Zipf-Like distribution.The CMS Data Challenge 2002 is the architecture and topology implemented in the simulation.This architecture is shown in the Fig. 4 ( Cameron et al., 2004).General parameters of simulation are presented in Table 1 and it should be noted that in this simulation the data are read-only. .
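To illustrate how a replication strategy plugs into this layering, here is a toy sketch; the class names mirror the concepts described above (RM, RO), not OptorSim's actual Java API, and the LRU behavior shown is a simplified placeholder.

    class ReplicaOptimizer:
        # RO: hosts the replication algorithm (FORM, LRU, LFU, EcoModel, ...).
        def on_file_request(self, site, file_name):
            raise NotImplementedError

    class LRUOptimizer(ReplicaOptimizer):
        def on_file_request(self, site, file_name):
            # Baseline behavior: always replicate, evicting least recently used files.
            return "replicate-evicting-lru"

    class ReplicaManager:
        # RM: one per site; manages the site's files through its optimizer.
        def __init__(self, optimizer):
            self.optimizer = optimizer
        def request(self, site, file_name):
            return self.optimizer.on_file_request(site, file_name)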