CORE – An Optimal Data Placement Strategy in Hadoop for Data Intensive Applications Based on Cohesion Relation

The tremendous growth of data being generated today makes storage and computing a mammoth task. With its distributed processing capability, Hadoop provides an efficient solution for such large data. Hadoop's default data placement strategy places data blocks randomly across the nodes without considering execution parameters, resulting in several lacunae such as increased execution time and query latency. Moreover, most of the data required for a task's execution may not be locally available, which creates a data-locality problem. Hence we propose an innovative data placement strategy based on the dependency of data blocks across the nodes. Our strategy dynamically analyses the history log and establishes the relationship between the various tasks and the blocks required for each task through a Block Dependency Graph (BDG). Our CORE algorithm then re-organizes the HDFS layout by redistributing the data blocks to give an optimal data placement, resulting in improved performance for big data sets in a distributed environment. The strategy was tested in a 20-node cluster with different real-world MapReduce (MR) applications. The results show that the proposed strategy reduces query execution time by 23% and improves data locality by 50.7% compared to the default strategy.


INTRODUCTION
In this data era, massive volumes of data are being generated every second in a variety of domains such as geoscience, the social web, finance, e-commerce, health care, climate modelling, physics, astronomy and government sectors. Big Data is the term applied to such large-volume data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time [1]. Big data management further poses many challenges due to big data diversity, big data reduction, big data integration and cleaning, big data indexing and query, and big data analysis and mining [2]. This situation has led to a rapidly increasing use of parallel and distributed frameworks such as Hadoop to analyze and gain insights from the data. The distributed processing of large data sets across clusters of computers using simple programming models has been facilitated through the Apache Hadoop software library [3][4]. The inherent parallelization, synchronization and fault-tolerance offered by the model make it ideal for highly parallel data-intensive applications [5]. Local computation and storage are achieved through the two major components of Hadoop, namely MapReduce (MR) and the Hadoop Distributed File System (HDFS). The fundamental concept of HDFS and MR is to distribute data among many nodes and process it in parallel.
HDFS [6][7] is a distributed file system designed to run on commodity hardware, capable of storing large files across multiple machines. HDFS follows a master-slave architecture, consisting of one Name-Node and multiple Data-Nodes. The Name-Node is the coordinator of HDFS and maintains the metadata, i.e., the size, location and replicas of the data blocks. The Data-Nodes hold the actual data. When a file is written into HDFS, it is broken into fixed-size blocks, and the blocks are stored on the various Data-Nodes in the Hadoop cluster. HDFS creates several replicas of the data blocks and distributes them in the cluster in a way that is reliable and allows fast retrieval. Each Data-Node periodically reports the blocks it stores to the Name-Node, thereby updating the metadata. When a client executes a query, it first contacts the Name-Node to get the file's metadata and then reaches out to the Data-Nodes to fetch the actual data blocks. The most important aspect of Hadoop is that HDFS and MR are designed with each other in mind and are co-deployed on a single cluster, which provides the ability to move computation to the data rather than the other way around [8]. Thus the storage system is not physically separate from the processing system, and the placement of data on the Data-Nodes is crucial for efficient processing. Hence there is a huge need for optimal data placement in HDFS, as shown in Fig. 1.
Data placement strategies mainly focus on two areas: (1) placing data across the data centers to increase parallel execution [9][10][11], resulting in improved performance, viz. reduced query latency, reduced query execution time, improved data locality and improved read/write performance and throughput; (2) placing correlated data together to reduce resource utilization [12][13][14], viz. reduced energy requirements for powering the computing equipment, a minimized average query span, reduced network bandwidth, a reduced carbon footprint and reduced operational cost. The type of strategy to be adopted depends on whether the solution to the query is time dependent or cost dependent. We focus on a data placement strategy for queries that must be answered in the earliest possible time, enabling quick decision making as well as maximum utilization of resources. The real value of analyzing Big Data lies in accelerating the time-to-answer, especially in the case of streaming data, where an immediate response for better decision making is highly desired.
Hadoop's default data placement strategy randomly places data across the Data-Nodes without considering their storage capacity [15]; this can be mitigated by running the Load Balancer utility. The Load Balancer [16] redistributes data based on storage capacity, but there is no guarantee that the data required for the execution of any task is evenly distributed across the nodes so as to ensure local map-task execution. During parallel processing in a distributed environment, if the data a node needs to process is not available locally, the map task of that node will either sit idle or access the needed data remotely from another node where the data is available, which severely reduces MR performance [17]. Hence the focus is on achieving maximum parallel execution through an innovative data placement strategy. Several works have addressed data placement in HDFS considering various metrics. In this paper, we focus on the inter- and intra-dependency of the data blocks in a node for a task. Non-dependent data blocks in a node do not contribute to the efficiency of a map task, since they are not involved in any execution. So we aim at evenly spreading the dependent data blocks across the nodes, which results in maximum parallel execution. We term the concentration of such dependent data blocks in a node for a task the Relation cohesion of the node for that task.
In our approach, historical query workload logs are traced over a period of time and represented as a Block Dependency Graph, in which data blocks are nodes and tasks are represented as edges incident on those nodes. A Cohesive Matrix is constructed from the graph for the estimation of Relation cohesion and weighted Relation cohesion, which serve as input for the proposed CORE algorithm. Our algorithm reorganizes the HDFS layout by redistributing the data blocks to give an optimal data placement with a higher degree of parallel execution, resulting in improved performance. The results show that the CORE algorithm efficiently reduces query execution time and improves local map-task execution; a significant improvement in read/write performance is also achieved. The rest of this paper is organized as follows: Section 2 describes the need for a new data placement algorithm, along with related works and the problem definition; a motivating example is also explained in detail in this section. Our proposed CORE algorithm is explained in detail in Section 3. Section 4 presents the experimental results and analysis. Finally, Section 5 concludes the paper with possible future research directions.

Cohesion and Coupling measurement
In our context, the term cohesion refers to the tightness with which "related" data blocks are grouped together in a Data-Node. Coupling represents the amount of relationship between elements belonging to different Data-Nodes, as illustrated in Fig. 2. Consider a modular system S consisting of different modules M. Let $R_c(S)$ be the number of internal relations of the system, $R_i(S)$ the number of input relations, $R_0(S)$ the number of output relations and $R(S)$ the total number of relations in the system. Then the cohesion of the system, $C_H(S)$, is expressed as the ratio between the number of internal relations $R_c(S)$ and the total relations $R(S)$:

$$C_H(S) = \frac{R_c(S)}{R(S)}$$
The coupling of the system, $C_P(S)$, is the ratio between the number of external relations $R_i(S) + R_0(S)$ and the total relations of the system $R(S)$:

$$C_P(S) = \frac{R_i(S) + R_0(S)}{R(S)}$$

It is normally assumed that the better the designer is able to encapsulate related program features together, the more reliable and maintainable the system is [20]. For a good software design, "high cohesion and low coupling" is required. Our work, however, focuses on reducing cohesion in order to achieve maximum parallel execution and thereby improve system performance. Cohesion as defined in this paper can be termed the density of related data blocks located in a node that are required for the execution of a particular task. Since we focus mainly on reducing the concentration of dependent data blocks in highly dense Data-Nodes, it suffices to take only the cohesion (intra-dependency) into account. This measure can also be shown to satisfy the four important properties defined in Table 1. The estimation of the cohesion measurements is explained in detail in Section 3.
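To make the two ratios concrete, the following minimal sketch (ours, not from the paper) computes $C_H(S)$ and $C_P(S)$ for a toy system; the relation counts are assumed purely for illustration.

```python
# Minimal sketch (illustrative): cohesion and coupling ratios for a
# modular system, given its relation counts.

def cohesion(r_internal: int, r_total: int) -> float:
    """C_H(S) = R_c(S) / R(S): fraction of relations internal to modules."""
    return r_internal / r_total

def coupling(r_input: int, r_output: int, r_total: int) -> float:
    """C_P(S) = (R_i(S) + R_0(S)) / R(S): fraction crossing module boundaries."""
    return (r_input + r_output) / r_total

# Toy system: 6 internal, 2 input and 2 output relations (10 in total).
print(cohesion(6, 10))     # 0.6
print(coupling(2, 2, 10))  # 0.4  -> C_H + C_P = 1 for this definition
```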

Related Works
Several works have been done in the field of data placement for large data in a distributed environment. Yuan et al. [21] give a matrix-based k-means clustering strategy to address data placement issues in scientific cloud workflows. The related datasets are placed in appropriate data centers based on dependencies established at runtime. A dependency matrix is constructed based on the access pattern of the set of tasks over the datasets. The datasets are partitioned by transforming the dependency matrix into a clustered dependency matrix using the Bond Energy Algorithm (BEA). This strategy guarantees a balanced distribution of data and reduced data movement. The drawback of this approach is the use of the Bond Energy Algorithm to cluster the dependency matrix, since the time complexity of finding permutations of all rows at every step of BEA is high. Jun Wang et al. (2014) [9] proposed an optimal data placement strategy based on grouping semantics. This strategy reduces query execution time and improves data locality compared to the default strategy, and it improves the parallel execution of data sets having interest locality. However, most real-world applications are without interest locality, and in such cases this strategy proves ineffective. Chia-Wei Lee et al. (2014) [10] proposed a strategy that distributes the data blocks based on the computing capacity of a Data-Node instead of its storage capacity, so that faster nodes are provided with more data blocks and HPC applications can be solved with reduced execution time. However, there is no mechanism to ensure that the data required for execution is present in the faster (higher-computing) nodes, thereby defeating the purpose of data placement.

Table 1 Properties characterizing the cohesion of a module / modular system

The cohesion of a module m = <Em, Rm> of a modular system MS (respectively, of the modular system MS as a whole) is a function Cohesion(m) (respectively, Cohesion(MS)) characterized by the following properties.

S. No. | Property | Definition
1 | Non-negativity & Normalization | The cohesion of a module or modular system belongs to a specified interval and is normalized, so that the measure is independent of size.
2 | Null Value | The cohesion of a module or modular system is null if its set of intra-module edges is empty.
3 | Monotonicity | Adding intra-module relationships does not decrease module (or modular system) cohesion.
4 | Cohesive Modules | The cohesion of a module (or modular system) obtained by putting together two unrelated modules is not greater than the maximum cohesion of the two original modules (respectively, the cohesion of the original modular system).
An optimal data placement that co-locates correlated data items together in order to reduce resource consumption and query span (the minimum number of machines required to process a query) is suggested by Ashwin Kumar et al. (2013) [14]. Their work mainly focuses on reducing cost, but keeping the available resources under-utilized is not a viable solution, since the real value of analyzing Big Data lies in accelerating the time-to-answer for better decision making. Lili Sun et al. (2013) [22] suggested a strategy that takes into account the disk-space utilization and computing capacity of each node to give efficient load balancing. Though efficient load balancing is achieved with minimum movement of data across the nodes, it does not ensure data locality, which in turn may reduce local map execution.
Several research papers in the literature also identify metrics to measure inter- and intra-dependency through software measurements such as cohesion and coupling. Lionel Briand et al. [18] suggested a mathematical model which defines several measurement concepts (size, length, cohesion, and coupling) that can be used for software design abstractions. However, specific measurement frameworks for particular product abstractions, e.g., control flow graphs or data dependency graphs, are not defined. Edward B. Allen et al. [19] give a new approach for measuring inter- and intra-modular relations which exhibits finer discrimination than counting-based measurement, but the usefulness of the cohesion measure is not validated. Mirjana et al. (2014) [23] present a method to establish a set of relationships between particular software metrics and corresponding measures from complex-network theory. A complex-network measure and its related software-metric measure are considered for defining a relation, and the defined relation is tested for establishing formalized metrics.

Motivating example
In order to validate the proposed work mathematically, the principle was tested with several miniature examples. One motivating example satisfying the requirements is explained in Fig. 3. It comprises a cluster with four nodes (DN1, DN2, DN3, and DN4) on which three different tasks (T1, T2, and T3) over 24 data blocks (B1–B24) are executed.

Problem definition
For a given set of data blocks B to be processed in a cluster having N Data-Nodes, the default strategy distributes the blocks randomly among the N nodes. Let the blocks required for the execution of a particular task T′ be B′, where B′ ⊆ B. Even if B is evenly distributed across the cluster, it does not guarantee that the blocks required for any particular task are evenly distributed, resulting in reduced system performance. Further, ensuring an even distribution of B′ alone will not be an optimal solution, since the task being executed in the system will not be unique. Hence there is a need for an optimal data placement strategy.

Proposed Strategy
In this paper, CORE, an optimal data placement strategy, is proposed; it ensures an even distribution of related data blocks across the nodes to improve the efficiency of the system in a distributed environment. Accordingly, data blocks are placed in such a way that the concentration of dependent data blocks in a node is balanced, so as to improve the degree of parallel execution. The proposed strategy is elaborated in detail below; the flow diagram of the entire work is shown in Fig. 4. This section consists of five parts. In the first part, the User History Log exploits the system log files and Name-Node meta information to construct the Task Frequency table and the network topology. In the second part, the Block Dependency Graph (BDG) depicts the relationship among the dependent blocks. In the third part, the Cohesive Matrix is constructed to capture the internal cohesion within the Data-Nodes. The estimation of the cohesion strength for a Data-Node and for the modular system is done in the fourth part. Finally, CORE, the proposed optimal data placement algorithm, is presented in the last part.

Figure 3 Example showcasing the effectiveness of the CORE algorithm.

User History Log
Log files are typically large and contain a wealth of information about the data, which has to be processed to extract useful knowledge.
Analyzing the characteristics of the cluster for various workloads is the key to making optimal placement decisions. Log files are usually semi-structured. Every MapReduce application executed in the cluster saves its task execution details as a log, which consists of two files: (i) the Job Configuration XML file, which contains the job configuration as specified when the job is launched; and (ii) the Job Status file, which contains the task ID, status, start and end time, etc. for each job executed on the machine. Using these as input, the log files are processed to construct the Task Frequency table containing the list of different tasks executed (Ti), the frequency of occurrence (Tf) and the blocks required (Br) for each task, as shown in Fig. 3(c). The Name-Node contains the metadata from which the network topology is constructed, identifying the different Data-Nodes present in the cluster and the data blocks present in each Data-Node (Fig. 3(a)). The metadata can be traced via the dfs.namenode.name.dir configuration property located in hdfs-site.xml. From the above information, the Block Dependency Graph is constructed.
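As an illustration, the sketch below aggregates already-parsed history records into such a Task Frequency table. The record layout (a task id plus the set of required block ids) is an assumption for illustration only; the actual Job Status file format varies across Hadoop versions and is not reproduced here.

```python
# Minimal sketch (assumed record layout): build the Task Frequency table
# {task_id: (frequency Tf, required blocks Br)} from traced history records.

def build_task_frequency_table(history_records):
    table = {}
    for task_id, blocks in history_records:
        freq, known_blocks = table.get(task_id, (0, set()))
        table[task_id] = (freq + 1, known_blocks | set(blocks))
    return table

# Example trace: T1 executed twice, T2 once.
records = [("T1", {"B1", "B5", "B9", "B13"}),
           ("T1", {"B1", "B5", "B9", "B13"}),
           ("T2", {"B2", "B6", "B10"})]
print(build_task_frequency_table(records))
```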

Block Dependency Graph
The computations of parallel processing can be carried out efficiently when the dependency of tasks on blocks is mapped through a Block Dependency Graph (BDG). It is an undirected graph which depicts the inter- and intra-dependency of data blocks for each task executed. The BDG of a system S is a 3-tuple BDG = <B, R, N>, where B represents the set of blocks, R is a binary relation on B (R ⊆ B × B) representing the relationships between blocks, and N is a collection of nodes of S such that ∀ b ∈ B (∃ n ∈ N (n = <Bn, Rn> and b ∈ Bn)) and ∀ n1, n2 ∈ N (n1 = <Bn1, Rn1>, n2 = <Bn2, Rn2> and Bn1 ∩ Bn2 = ∅). In the graph, the edges within a Data-Node connecting its blocks exhibit the degree of intra-dependency of the blocks for a task, whereas an edge (task) connecting blocks located in two different nodes is a measure of the inter-dependency of blocks for a task. Since this paper focuses on improving parallel execution by reducing internal cohesion, only the intra-dependency among the blocks of a node is taken into consideration.
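A minimal sketch of how such a graph could be materialized from the Task Frequency table: every task induces an edge between each pair of blocks it requires. The helper name and data layout are ours, not the paper's.

```python
# Minimal sketch: each task contributes undirected edges among all pairs of
# blocks it requires, labelled with the inducing task.
from itertools import combinations

def build_bdg(task_table):
    """task_table: {task_id: (frequency, block_set)} from the history log."""
    edges = set()
    for task_id, (_, blocks) in task_table.items():
        for b1, b2 in combinations(sorted(blocks), 2):
            edges.add((b1, b2, task_id))
    return edges

task_table = {"T1": (2, {"B1", "B5", "B9"}), "T2": (1, {"B5", "B13"})}
print(sorted(build_bdg(task_table)))
```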

Cohesive Matrix
From the Block Dependency Graph (BDG) and the information available in the Task Frequency table (Tf), the Cohesive Matrix is constructed. It is a binary matrix, or (0, 1)-matrix, as it contains only two types of elements, 0 and 1. From the matrix, the following properties are inferred:
1. The number of ones in each column (block) equals the degree (i.e., the number of edges incident on it) of the corresponding graph node.
2. A column with all zeros represents an isolated node (an unutilized block).
3. The sum of the entries in row Tm indicates the number of blocks required for that task.
4. The sum of the entries in column Bn indicates the number of tasks for which the block is required.
The incidence structure is space efficient when there are many more nodes than edges. Fig. 3(d) depicts the presence of a data block in a node relating to a task in binary form: if a task requires a particular block present in the Data-Node for its execution, the corresponding value is one, else zero. This is indicative of the total concentration of the blocks required for a particular task in every Data-Node.
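A minimal sketch (ours, under the assumed task-table layout from the previous sketches) of building the binary matrix; the row and column sums then yield properties 3 and 4 above.

```python
# Minimal sketch: a (0,1)-matrix with one row per task and one column per
# block; entry 1 means the block is required by the task.

def build_cohesive_matrix(task_table, all_blocks):
    blocks = sorted(all_blocks)
    col = {b: j for j, b in enumerate(blocks)}
    matrix = {}
    for task_id, (_, required) in task_table.items():
        row = [0] * len(blocks)
        for b in required:
            row[col[b]] = 1
        matrix[task_id] = row
    return blocks, matrix

task_table = {"T1": (2, {"B1", "B5"}), "T2": (1, {"B5", "B9"})}
blocks, m = build_cohesive_matrix(task_table, {"B1", "B5", "B9", "B13"})
print(blocks)                 # ['B1', 'B13', 'B5', 'B9']
print(m["T1"], sum(m["T1"]))  # [1, 0, 1, 0] 2 -> no. of blocks T1 requires
```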

Estimating cohesive strength for Data-Node and Modular system
The concentration of data blocks in a node $N_i$ that is required for the execution of a particular task $T_i$, with reference to the total blocks required for that task, is termed the Relation cohesion of that node, $R_c\langle T_i, N_i\rangle$:

$$R_c\langle T_i, N_i\rangle = \frac{\text{No. of blocks executed in node } N_i \text{ for task } T_i}{\text{Total no. of blocks } B_r \text{ required for task } T_i}$$

This Relation cohesion for each task in a particular node is standardized with reference to the frequency of each task, and the weighted Relation cohesion $Rc_{wa}$ is calculated for every Data-Node.
For ideal parallel execution, the data blocks required for a task have to be evenly distributed across the cluster of Data-Nodes. Hence the optimal Relation cohesion $Rc_{op}$ desired for every node of a cluster can be estimated from the cluster configuration:

$$Rc_{op} = \frac{1}{n}$$

where $n$ is the number of nodes in the cluster.
An ideal situation may not be practically achievable while redistributing the data blocks; hence, to allow an in-built flexible redistribution, a configurable threshold limit (default 10%) is fixed and the allowable Relation cohesion $Rc_{al}$ for each node is defined as

$$Rc_{al} = Rc_{op} \pm \text{Threshold}$$

$Rc_{al}$ is the parameter used for categorizing the nodes as over-cohesive or under-cohesive, i.e., as requiring to be balanced. For balancing, blocks have to be migrated from over-cohesive to under-cohesive nodes. Every addition or deletion of a related data block in a node (a block involved in any task execution) results in a change in the $Rc_{wa}$ of that node. This change in $Rc_{wa}$ is the block cohesive value $B_{cv}$, which is used to estimate the number of blocks that have to be moved out of an over-cohesive node and the number of blocks that an under-cohesive node can receive in order to be balanced (Fig. 3(e)).
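The following sketch (our illustration, with names mirroring the paper's notation) computes $R_c$, $Rc_{wa}$ and the allowable band; the threshold is applied as 10% of $Rc_{op}$, which reproduces the worked-example band 0.225–0.275 for a four-node cluster.

```python
# Minimal sketch of the cohesion estimates (the exact weighting scheme is
# our reading of the text).

def relation_cohesion(task_blocks, node_blocks):
    """Rc<Ti,Ni>: fraction of task Ti's blocks residing in node Ni."""
    return len(task_blocks & node_blocks) / len(task_blocks)

def weighted_relation_cohesion(task_table, node_blocks):
    """Rc_wa: frequency-weighted average of Rc over all tasks."""
    total = sum(freq for freq, _ in task_table.values())
    return sum(freq * relation_cohesion(blocks, node_blocks)
               for freq, blocks in task_table.values()) / total

def allowable_band(num_nodes, threshold=0.10):
    """Rc_op = 1/n; Rc_al = Rc_op +/- threshold (taken as 10% of Rc_op)."""
    rc_op = 1.0 / num_nodes
    return rc_op - threshold * rc_op, rc_op + threshold * rc_op

print(allowable_band(4))  # (0.225, 0.275), as in the worked example
```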

CORE: Proposed Data Placement Algorithm
Our work focuses on balancing the cohesive strength of the nodes by moving the dependent data blocks from higher cohesive strength nodes to lower cohesive strength nodes. It ensures even distribution of required data blocks across available Data-Nodes.

[Algorithm listing, closing lines:]
    Move the victim block Vb from SN to RN
  Until UC-Set[ ] and OC-Set[ ] are empty   // repeat the iteration until the cluster is cohesively balanced
End For
According to our optimal data placement algorithm, a higher cohesive strength of a Data-Node indicates the existence of a larger number of inter-dependent blocks in that Data-Node for a particular task. The algorithm starts with the initial estimation of the Relation cohesion $R_c$ for each task in a Data-Node and the weighted Relation cohesion $Rc_{wa}$ of each Data-Node based on the frequency of the tasks. Each Data-Node is then categorized as over-cohesive, under-cohesive or balanced based on $Rc_{wa}$ and $Rc_{al}$. For our example, $Rc_{op}$ is 0.25 and hence $Rc_{al}$ ranges from 0.225 to 0.275. From Fig. 3(d), according to the default data block placement, DN1 is over-cohesive, DN2 is balanced and DN3 and DN4 are under-cohesive. The default block cohesive value $B_{cv}$ (the change in $R_c$ for the addition or deletion of any data block in the cluster) is estimated as 0.062, as shown in Fig. 3(e).
For the first iteration, DN1 is the Source Node (SN) and DN4 the Receiving Node (RN), since DN1 has the highest $Rc_{wa}$ among the over-cohesive nodes and DN4 the least $Rc_{wa}$ among the under-cohesive nodes. The number of blocks that must be moved out of DN1 to balance it is 2; the number of blocks that DN4 can receive for balancing is 1 (refer Fig. 3). Hence the number of blocks that can be moved in this iteration is 1 (min(2, 1)). The victim block to be moved out is then identified from the Source Node DN1. In Fig. 3(d), for the SN (DN1), the row having the maximum sum (indicating the task having maximum cohesion) is identified; in our example, row 1, relating to task T1, has the maximum sum of 4. A victim block list [ ] is then created containing all blocks associated with the task corresponding to the maximum row sum, i.e., B1, B5, B9, B13. Next, the difference in $R_c$ between the SN and RN for each task is found and tabulated as <Ti, Rc_diff>. The task with the lowest difference is identified (T3), the blocks associated with that task in DN1 are deleted from the victim block list (B1, B5, B9), and the task and its difference are removed from the table. The iterations continue until only the required number of blocks to be moved, NBi, remains in the victim block list. In our example, B13 is the victim block for the first iteration; it is moved from the SN (DN1) to the RN (DN4). After the movement of data blocks from SN to RN, $R_c$ and $Rc_{wa}$ change in the Cohesive Matrix. The algorithm then re-categorizes the nodes and repeats the above steps. The process continues until the over-cohesive set (OC-Set[ ]) and under-cohesive set (UC-Set[ ]) become empty, at which point all nodes are cohesively balanced with reference to Relation cohesion ($R_c$), as in Fig. 3(f).
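A condensed, runnable sketch of this balancing loop follows. It is our simplification, not the paper's listing: the victim is picked directly from the blocks of the most cohesive task on the source node, omitting the $Rc_{diff}$-based pruning of the victim list described above.

```python
# Minimal sketch of the CORE balancing loop (simplified victim selection).

def core_balance(layout, task_table, rc_wa, lo, hi, max_moves=1000):
    """layout: {node: set(block_ids)}; rc_wa(blocks) -> weighted cohesion;
    (lo, hi) is the allowable band Rc_al."""
    for _ in range(max_moves):
        over = [n for n in layout if rc_wa(layout[n]) > hi]    # OC-Set
        under = [n for n in layout if rc_wa(layout[n]) < lo]   # UC-Set
        if not over or not under:
            break  # cluster is cohesively balanced
        sn = max(over, key=lambda n: rc_wa(layout[n]))   # source node
        rn = min(under, key=lambda n: rc_wa(layout[n]))  # receiving node
        # Victim: a block on SN belonging to the task with maximum cohesion.
        task = max(task_table,
                   key=lambda t: len(task_table[t][1] & layout[sn]))
        victim = min(task_table[task][1] & layout[sn])
        layout[sn].discard(victim)
        layout[rn].add(victim)
    return layout
```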
The maximum number of simultaneous map tasks on each node is limited by the hardware capacity; currently it is ≤ 2 in most clusters. Taking the number of simultaneous map tasks at each Data-Node as 2, the utilization percentage of local map tasks at each Data-Node is calculated for every task. The average utilization percentage for every task is then calculated as

$$U_{T_i} = \frac{\sum(\text{utilization \% at each node for task } T_i)}{\text{Total no. of nodes}}$$

The average utilization for the cluster is also calculated by giving weightage to the frequency of each task.

For the running example, the average utilization of map tasks (%) is

$$\frac{(75 \times 8) + (87.5 \times 7) + (87.5 \times 1)}{16} = 81.25$$

The CORE algorithm iteratively changes the default block placement in each Data-Node to an optimized location, which in turn proves more effective for local task execution. Fig. 6 and Fig. 7 show the local map tasks for the initial and final data layouts; the results tabulated in Fig. 7 show that the local task execution percentage is increased by 18.75%. The CORE algorithm does not guarantee 100% local map-task execution every time, since variation is possible depending on the size of the dataset, the data block size, and the need to satisfy load balancing and rack awareness; however, CORE always produces an improved result over the default data placement strategy, as verified with several examples.
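As a quick check of the weighted-average arithmetic (frequencies 8, 7 and 1 over 16 executions, as in the example's Task Frequency table):

```python
# Quick check of the weighted-average utilization quoted above.
freq_util = [(8, 75.0), (7, 87.5), (1, 87.5)]  # (frequency, utilization %)
total_freq = sum(f for f, _ in freq_util)       # 16 executions in total
average = sum(f * u for f, u in freq_util) / total_freq
print(average)  # 81.25
```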

EXPERIMENTAL RESULTS AND ANALYSIS
Our proposed CORE algorithm was tested on a 20-node cluster placed in a single rack, with Hadoop-1.2.1 installed on every node. One node is configured as the Name-Node and the remaining 19 nodes are Data-Nodes. The Data-Nodes are given different configurations so as to create a heterogeneous environment; the detailed cluster configuration is shown in Table 2. The implementation of the CORE algorithm dynamically reorganizes the HDFS data layout to present an optimal placement for execution; the program is launched as a utility to be executed manually as and when required.
The dataset used in our experiments is a collection of daily weather measurements collected by the National Climatic Data Center (NCDC); it is a public data set available for download from Amazon S3 [25]. The data are collected every hour for about 18 meteorological elements (e.g., temperature, wind speed, humidity) from over 9000 weather stations located globally, for the period from 1929 to 2009. The data is strictly ASCII, with a mixture of character data, real values, and integer values. Data files are organized according to the date and the location of the weather station: the records for every year are placed in a directory containing a separate gzipped file (.gz) for the data relating to each weather station. The size of the dataset is about 20 GB, distributed as blocks across the 19 Data-Nodes for our experiment.
The dataset required for analysis is uploaded in bulk at one instance. The default strategy randomly distributes the dataset as even-sized blocks across the available Data-Nodes; it does not consider the nature of the queries likely to be executed on the system. Even though the available dataset is unique, the queries executed exhibit interest localities which cover only a part of the big data. The interest locality may differ for different domain analysts based on geographical location, meteorological element, etc. For example, meteorological scientists belonging to a country will be interested in the data relating to their country alone, in which case the queries executed will have a common dependency among the data related to that country. Similarly, a domain analyst working on rainfall forecasting will have an interest domain restricted to rainfall particulars, which sweep only a small part of this huge dataset. Since such interest localities are not taken into consideration, the dependent blocks required for the execution of an interest-based query are likely to be concentrated within a few nodes, initiating non-local map tasks and resulting in poor performance. CORE measures the density of such dependent blocks within a node and aims at evenly distributing the related data blocks for queries across the nodes.
The experiment was conducted by executing various tasks on the weather dataset. The tasks T1 to T6 were chosen such that each has specific dependent blocks within the entire 320 blocks (20 GB); for example, finding the minimum temperature during a particular period of years, finding the maximum rainfall in a certain region, or finding the day of maximum temperature for a particular station. The results are listed in Tables 3, 4 and 5, which show the percentage improvement in local map execution for data placement based on the CORE algorithm over the default random placement strategy.
The improvement in reducing the total execution time is also listed. From Table 3, out of the 328 maps required for execution, the default random placement strategy has 130 maps executed locally (i.e., 39.6%), whereas with CORE 196 maps are executed locally (i.e., 59.8%). Hence the overall improvement in local map-task execution is 50.7% ((196−130)/130). The CORE algorithm was tested with various tasks and has always shown an improvement over the default data placement; the results for only 6 sample cases are listed in Table 3. When tested in the worst case, where no interest locality exists or all data blocks are required for the execution of the task, CORE shows the same efficiency as the default.
To strengthen the case for the CORE algorithm and to study its behavior at various cluster sizes, the experiment was conducted for a sample task while varying the number of nodes. The task was executed on the same dataset and the results are tabulated in Table 4. With an increase in Data-Nodes, the maximum number of simultaneous map tasks increases, which naturally increases the degree of parallel execution. Our experiment gives an interesting result: the CORE algorithm is more efficient for bigger clusters, i.e., the incremental improvement in efficiency is higher when the number of nodes in the cluster is larger. The results shown in the graph (Fig. 8) prove that the efficiency of the CORE algorithm increases with the size of the cluster. Our proposed algorithm also significantly reduces the map and reduce completion times, as shown in Fig. 11.

CONCLUSIONS AND FUTURE WORK
Hadoop's default data placement strategy places the data blocks evenly across the Data-Nodes. Even though blocks are evenly distributed across the cluster, this does not guarantee that the blocks required for execution are evenly distributed, which severely drags down system performance during MapReduce tasks. So we have proposed an innovative data placement strategy which distributes the related data blocks, i.e., the blocks required for execution, evenly across the Data-Nodes to ensure maximum parallel execution. This has been experimentally tested in a 20-node cluster using different MapReduce applications. The results strengthen our proposed algorithm, proving it more efficient for massive datasets by reducing query execution time by 23% and significantly improving data locality by 50.7% compared to Hadoop's default data placement strategy. Even though the results are most promising, our experiment was conducted in a single-rack topology without replicas. We know that additional data copies (replicas) will certainly improve performance and reduce overheads, but the behavior in a cross-rack environment has to be studied further for improved performance. Also, due to the migration of data blocks, the cluster may ultimately become imbalanced. Hence a new and efficient load balancing with optimal data placement, considering the dependency of data blocks, is the focus of future work.