Hadoop Based Data Intensive Computation on IAAS Cloud Platforms



Introduction
According to the National Institute of Standards and Technology (NIST), cloud computing can be defined as "A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction" (Mell et al., 2011). The cloud provides three service models:
1. Infrastructure-as-a-Service (IaaS), where the consumer is provided with the capability to provision storage, processing, and networks and to run arbitrary software. In this model, the consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, and deployed applications.
2. Platform-as-a-Service (PaaS), where the consumer is provided with the capability to deploy applications on the cloud using the provider's tools, libraries, and languages. In this model, the provider controls the infrastructure, and the consumer can only deploy applications and change configuration settings related to deployment.
3. Software-as-a-Service (SaaS), where the consumer is provided with the capability to use the provider's applications running on the cloud. The applications are accessible through either a web interface or a program interface. In this model too, the provider controls the cloud infrastructure.

Cloud Platforms
Amazon Elastic Compute Cloud (EC2)
Amazon Elastic Compute Cloud (Amazon EC2) is an IaaS cloud platform that provides a web-service-based API for provisioning, managing, and de-provisioning virtual servers inside the Amazon cloud. Applications residing anywhere on the Internet can launch virtual servers in the Amazon cloud, and users can launch as many virtual servers as they want. Amazon EC2 also allows users to configure security, networking, and scaling based on business requirements. Amazon EC2 instances can store data in Amazon S3 buckets or on Amazon EBS (Elastic Block Store) volumes. Amazon S3 is an online file storage web service offered by Amazon Web Services (Amazon, 2014).
Amazon EC2 purchasing options include On-Demand Instances, where the user pays for computing capacity by the hour; Reserved Instances (Light, Medium, and Heavy Utilization), where the user makes a one-time payment for each instance to be reserved and receives a discounted hourly rate on that instance; and Spot Instances, where users bid on unused EC2 capacity and their instances run for as long as the bid exceeds the current Spot price.
Instance types vary in memory capacity, number of virtual cores, storage capacity, and I/O performance. Users can choose an instance type based on their application needs.

Amazon Elastic Map Reduce (EMR)
Amazon EMR groups multiple EC2 instances, each of which is a virtual server, into a cluster and processes large amounts of data by splitting the computational work across those instances. An Amazon EMR cluster is managed by an open source Hadoop distribution (Noll, 2011). Cluster performance can be monitored using Amazon CloudWatch. To run a job on Amazon EMR, users create an Amazon EMR job flow and execute it on the number of cluster nodes they need. Amazon EMR is suitable for large-scale cloud computing because instances can easily be added or removed while custom code is running.
Amazon EC2 provides stand-alone instances, whereas Amazon EMR provides a cluster of EC2 instances. On Amazon EC2, the user performs cluster management on each instance; on Amazon EMR, cluster management is automated. The two services also differ in cost: Amazon EC2 is cheaper than Amazon EMR, because Amazon EMR pricing is the cost of running the underlying Amazon EC2 instances plus a charge for cluster management. Given these differences, it is critical to establish benchmarks on both services so that users can determine whether to choose Amazon EC2 or Amazon EMR for data intensive cloud computing.

Data Intensive Computation
Data intensive applications involve high CPU usage and process large volumes of data, typically hundreds of gigabytes, terabytes, or petabytes in size. It has therefore become critical that cloud providers offer on-demand computing instances and on-demand computing capacity. Clouds that provide either of these, such as Amazon EC2 and Amazon EMR, can support any computing model compatible with loosely coupled clusters. MapReduce, together with Hadoop, has become the dominant programming model for data intensive computing on clouds that provide on-demand computing capacity.

Hadoop
Apache Hadoop is an open source software project that enables distributed processing of large data sets across clusters of commodity servers (Hadoop, 2013). It uses a master-slave system architecture (Hedger, 2011). Apache Hadoop is driven by two main components:
1. MapReduce - The framework that understands and assigns work to the nodes in a cluster.
2. Hadoop Distributed File System (HDFS) - The file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
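
The replication idea behind HDFS can be illustrated with a minimal sketch. The round-robin placement below is an assumption chosen for brevity and is not Hadoop's actual placement policy; the function names and node names are hypothetical, and the 64 MB block size and replication factor of 3 are the usual Hadoop 1.x defaults rather than values reported in this study.

```python
from typing import Dict, List


def place_blocks(file_size_mb: int, nodes: List[str],
                 block_size_mb: int = 64,
                 replication: int = 3) -> Dict[int, List[str]]:
    """Split a file into blocks and assign each block to `replication` nodes.

    Placement is simple round-robin (an illustrative assumption).
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    placement: Dict[int, List[str]] = {}
    for block in range(num_blocks):
        # Consecutive node indices guarantee distinct nodes per block.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


def survives_failure(placement: Dict[int, List[str]], failed: str) -> bool:
    """True if every block still has a replica on a node other than `failed`."""
    return all(any(node != failed for node in replicas)
               for replicas in placement.values())


if __name__ == "__main__":
    layout = place_blocks(200, ["n1", "n2", "n3", "n4"])
    for block, replicas in layout.items():
        print(block, replicas)
    print("readable after losing n1:", survives_failure(layout, "n1"))
```

Because every block lives on more than one node, the loss of any single node leaves all blocks readable, which is the reliability property described above.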

MapReduce Programming Model
MapReduce is a programming model and software framework first developed by Google (MapReduce, 2014). It supports processing large amounts of data in parallel on large clusters in a reliable, fault-tolerant manner. The model has two fundamental steps. The first is the Map() function: the master node splits the input data set into smaller sets whose individual elements are broken down into tuples (key-value pairs). These tuples are distributed to the slave nodes, and each slave node applies the Map() function to its input list to produce an output list. The second is the Reduce() function: the master node takes the output of each worker node and combines it in a predefined way to produce the final output. MapReduce requires a "driver" method to initialize a job; the driver defines the locations of the input and output files and controls the MapReduce process. Each node in a MapReduce cluster is unaware of the other nodes, and nodes do not communicate with each other except during the shuffle phase.
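
The Map, shuffle, Reduce, and driver roles described above can be sketched in a few lines. This is a single-process illustration of the programming model only, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and word counting is used merely as a familiar example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Map(): emit a (word, 1) key-value pair for every word in one record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share a key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce(): combine the values for one key into a single result."""
    return key, sum(values)


def run_job(records: List[str]) -> Dict[str, int]:
    """Driver: wire the phases together for an in-memory input."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["hadoop on ec2", "hadoop on emr"]))
```

In a real cluster each call to `map_phase` and `reduce_phase` would run on a different node, and the grouping done by `shuffle` is the only point at which data moves between nodes.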

Motivation and Related Work
Currently, no established set of benchmarks and experiments exists for evaluating the performance of Amazon EC2 and Amazon EMR from the perspective of data intensive computing, although Hadoop benchmarks have been run on local machines and clusters. Some studies have also benchmarked Amazon cloud services, particularly Amazon EC2, against other cloud platforms such as Rackspace, as discussed below.
According to Huang et al., the HiBench suite is essential for the community to properly evaluate and characterize Hadoop, because its workloads not only represent a wide range of large-scale data analysis using Hadoop but also exhibit very diverse behaviors in terms of data access patterns and resource utilization.
According to the benchmark study by Sarda et al., 'Cloud Performance Benchmark - Amazon EC2 vs. RackSpace' (cloud-based VPS), Rackspace's 512 MB instance was more than twice as fast as Amazon's micro instance (Sarda et al., 2011). The study benchmarked relative CPU performance, I/O read and write, the number of requests Apache can handle, and processing power. The authors concluded that Rackspace is 3 times faster than Amazon EC2 in processing power, can handle 5.5 times more requests when using the Apache HTTP server, can write 7.6 times more data per second, and is 2.3 times faster in CPU performance.
As discussed above, various benchmarks compare the performance of Amazon EC2 with that of other clouds, but no benchmark studies compare Amazon EC2 with Amazon EMR for data intensive computing using Hadoop, which is the focus of the present experiments.

Hardware Configuration
The hardware configuration is critical for Hadoop-based data intensive benchmarks. For this study, the M3 General Purpose Double Extra Large (m3.2xlarge) instance type was chosen for both Amazon EC2 and Amazon EMR. Amazon M3 instance types provide a balance of memory, compute, and network resources; their most prominent features are SSD-based storage for very fast I/O performance and high-frequency Intel Xeon E5-2670 v2 (Ivy Bridge) processors.

Software Configuration
The software was configured as follows:
1. Install the Amazon Linux AMI on the workstations.
2. Install version 1.7 of the Java JDK.
3. Install Hadoop version 1.0.3 on both Amazon EC2 and Amazon EMR.
4. Install HiBench 2.2 on the Amazon EC2 and Amazon EMR Hadoop installations.
5. Configure SSH on all the nodes on Amazon EC2 and Amazon EMR for communication between the name node and all the data nodes.
6. Install Python on Amazon EC2, which is a prerequisite for installing StarCluster.
7. Use the StarCluster open source toolkit to create the cluster on Amazon EC2 (StarCluster, 2011).
8. Create the cluster on Amazon EMR using the Amazon UI.

Results and Discussion
The study evaluates and compares the performance of the Amazon EC2 and Amazon EMR cloud services using the HiBench benchmark suite, which includes micro benchmarks (Sort, WordCount, TeraSort), a web search benchmark (PageRank), and analytical query benchmarks (Hive Join and Hive Aggregation).
The built-in T-TEST function of Microsoft Excel 2010 was used for statistical analysis of the benchmark results from Amazon EC2 and Amazon EMR. The function took two data sets as input, the first from Amazon EC2 and the second from Amazon EMR, and the p-value was computed. A p-value exceeding 0.05 was taken to indicate no statistically significant difference between the two data sets, while a p-value not exceeding 0.05 was taken to indicate a statistically significant difference (Tables 4 - 15).
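
The decision rule above can be reproduced outside Excel with a short sketch. The text does not state which T-TEST variant was used, so the code below assumes a two-tailed, two-sample test with equal variances; the function names are hypothetical and the response times in the example are made-up placeholders, not measurements from this study. The p-value is obtained by numerically integrating the Student's t density so that only the standard library is needed.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t: float, df: float, steps: int = 20000) -> float:
    """Two-tailed p-value, using Simpson's rule (steps must be even)."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_mass = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * half_mass)


def pooled_t_test(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, p-value), assuming equal variances in both samples."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, two_tailed_p(t, df)


if __name__ == "__main__":
    ec2_times = [412.0, 398.5, 405.2, 420.1]  # hypothetical response times (s)
    emr_times = [431.7, 440.3, 425.9, 436.4]  # hypothetical response times (s)
    t_stat, p_value = pooled_t_test(ec2_times, emr_times)
    verdict = "significant" if p_value <= 0.05 else "not significant"
    print(f"t = {t_stat:.3f}, p = {p_value:.4f} ({verdict})")
```

If the original analysis used the paired or unequal-variance variant instead, the degrees of freedom and the standard error in `pooled_t_test` would need to change accordingly.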
For each benchmark, the response time (in seconds) and throughput (in megabytes per second) were measured as the number of nodes increased from 1 to 8. Graphs were then plotted to compare the performance of the Amazon EC2 and Amazon EMR cloud services on each benchmark in the HiBench suite, with the data set size varied (1 GB, 10 GB, 100 GB) to represent data intensive computation using Hadoop. For each graph, the y-axis represents the response time or throughput achieved during the tests, and the x-axis represents the number of nodes tested (Figures 1 - 18).
Table 3 gives a basic overview of the pricing of Amazon EMR and Amazon EC2 for an m3.2xlarge instance. Amazon EC2 has a base price of $0.560/hr per instance, whereas Amazon EMR pricing is the cost of the Amazon EC2 instance ($0.560/hr) plus the $0.140/hr that Amazon charges for cluster management, for a total of $0.700/hr per instance. As the number of nodes increases, the difference becomes more significant, as shown in Figure 24. The difference grows further because the total cost is the number of nodes multiplied by the number of hours and the price per instance.
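
The way this price gap scales can be made concrete with a small calculation. The hourly rates are the ones quoted above for an m3.2xlarge instance; the function name, the constant names, and the 10-hour run length are illustrative assumptions.

```python
EC2_RATE = 0.560        # USD per instance-hour for m3.2xlarge
EMR_SURCHARGE = 0.140   # USD per instance-hour charged for cluster management
EMR_RATE = EC2_RATE + EMR_SURCHARGE  # USD per instance-hour on Amazon EMR


def cluster_cost(nodes: int, hours: float, rate: float) -> float:
    """Total cost = number of nodes x number of hours x price per instance-hour."""
    return nodes * hours * rate


if __name__ == "__main__":
    hours = 10  # hypothetical run length
    for nodes in (1, 2, 4, 8):
        ec2 = cluster_cost(nodes, hours, EC2_RATE)
        emr = cluster_cost(nodes, hours, EMR_RATE)
        print(f"{nodes} node(s): EC2 ${ec2:.2f}, EMR ${emr:.2f}, "
              f"difference ${emr - ec2:.2f}")
```

The absolute difference grows linearly with both the node count and the run time, while the relative premium for Amazon EMR stays fixed at 0.140 / 0.560 = 25% at these rates.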

Conclusions
The Amazon EC2 and Amazon EMR cloud services were tested using the HiBench benchmark suite while the number of nodes (1 to 8) and the size of the data set (1 GB, 10 GB, and 100 GB) were varied. Overall, Amazon EC2 appeared well suited to less data intensive applications with data sizes below 100 GB, as the results for the 1 GB and 10 GB data sets on m3.2xlarge instances showed. For the larger 100 GB workloads, Amazon EMR performed better than Amazon EC2. This can be attributed to the fact that the Amazon EMR installation of Hadoop contains patches and improvements to Apache Hadoop that make it work effectively on AWS, including better compression codecs, fixes that improve the combining and splitting of input files, and better performance tuning of running clusters. The Hadoop configuration settings used for an Amazon EMR cluster are optimized for scalability and for more data intensive applications, which explains why Amazon EMR performed better than Amazon EC2 on larger data sets.
For the Sort, TeraSort, PageRank, and Hive Aggregation benchmarks, the differences in response time and throughput between Amazon EC2 and Amazon EMR were more significant than for the WordCount and Hive Join benchmarks, because the former involve more data intensive and I/O operations than the latter.
One advantage of Amazon EMR over Amazon EC2 is that it suits large-scale data processing that would otherwise involve a lot of setup and configuration work, because Amazon takes that work off the customer. With Amazon EMR, Amazon also takes care of cluster monitoring, resource management, cluster start-up and shutdown, and even security group management. Tuning the performance of running clusters is usually hard, but Amazon EMR handles performance tuning while a job or workload is running. Amazon EMR also makes Hadoop itself simpler to use. Certain benefits of Amazon EMR are:
1. Elastic: Amazon EMR uses as many or as few EC2 instances as needed, and spins up large or small job flows in minutes.
2. Easy to use: Jobs can be run quickly using the web console. No detailed configuration is required.
3. Reliable: A fault-tolerant service built on top of the Amazon Web Services (AWS) infrastructure.
4. Cost effective: Amazon monitors the progress of each job flow and turns off the resources when the job flow is done.
From a scaling and cost perspective, for higher workloads and large numbers of nodes, it is better to opt for Amazon EMR than Amazon EC2, even though Amazon EMR costs more. Amazon EMR automatically takes care of performance tuning of running clusters, cluster monitoring, resource management, and security group management. It is also fault tolerant and automatically retries failed tasks, whereas on Amazon EC2 all of this has to be done manually. Amazon EMR therefore involves less management overhead than Amazon EC2; for small data sets and applications that do not need much scalability and must operate at low cost, however, Amazon EC2 is the better option.

Future Work
This study is limited to benchmarking two cloud services provided by Amazon, Amazon EC2 and Amazon EMR, to evaluate the performance of Hadoop on data intensive applications while varying the workload and the number of nodes in the cluster. Extensions to this study include evaluating the performance of these cloud services at big data scale, that is, with data sizes of up to terabytes. This may help researchers evaluate the performance pattern of Hadoop on each node for both cloud services, thus supporting further analysis.
In this study, we used the m3.2xlarge instance type provided by Amazon, which offers a balance of compute, memory, and network resources. Further studies could be conducted on other instance types provided by Amazon, such as compute optimized instances (C3 instances), storage optimized instances (I2 instances), and graphics optimized instances (G2 instances), to further explore benchmarking on the Amazon cloud platform. Another avenue for further research is new benchmarks for evaluating performance. This research uses the HiBench benchmark suite, which is a set of Hadoop benchmarks; new benchmarks could be introduced to experiment with Hadoop performance on data intensive applications.

Table 2. Benchmarks and Metrics

HiBench is a representative and comprehensive benchmark suite for Hadoop. The suite consists of a set of Hadoop programs including both synthetic micro-benchmarks and real-world applications. These benchmarks are used intensively for Hadoop benchmarking, tuning, and optimization. The categories of benchmarks used for this research are Micro benchmarks (Sort, WordCount, and TeraSort), which work mostly with unstructured data; Web Search benchmarks (PageRank), which work mostly with semi-structured data; and Analytical Query benchmarks (Hive Join, Hive Aggregation), which work with structured data (Wang, 2014).

Micro Benchmarks
1. Sort: This workload sorts its text input data, which are generated using the Hadoop RandomTextWriter program. The sorting is done automatically during the Shuffle and Merge stage of the MapReduce programming model. This is an I/O bound function. The input workload for the Sort benchmark is the data size to be generated.
2. WordCount: This workload counts the occurrences of each word in the input data, which are generated using the Hadoop RandomTextWriter program. The job extracts a small amount of information from a large data source, hence this is a CPU bound function. The input workload for the WordCount benchmark is the data size to be generated.
3. TeraSort: The input data for this benchmark are generated by the Hadoop TeraGen program, which by default creates 1 billion lines of 100 bytes each. The data are then sorted by TeraSort, which provides its own input and output formats as well as its own Partitioner, which ensures that the keys are equally distributed among all nodes. TeraSort is thus an improved Sort program that balances the load across all the nodes during the test. As a result, it is CPU bound in the Map stage and I/O bound in the Reduce stage. The input workload for the TeraSort benchmark is the data size to be generated.

Web Search Benchmark
PageRank: This workload contains an implementation of the PageRank algorithm on Hadoop. PageRank is a link analysis algorithm used widely in web search engines. This is a CPU bound function. The input workload for the PageRank benchmark is the number of Wikipedia pages.

Analytical Query Benchmark
Hive Join and Hive Aggregation: This workload contains queries that correspond to the usage profile of business analysts and other database users. Two tables are created, the User Rankings table and the User Visits table. Once the data source has been generated, two Hive requests are performed, a Join and an Aggregation. These tests are I/O bound functions. The input workload for the Analytical Query benchmark is the number of records to be inserted into the User Rankings and User Visits tables.

An overview of the benchmarks, their categories, and the metrics captured is shown in Table 2.
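
The load balancing that a TeraSort-style Partitioner aims for can be sketched as range partitioning over sampled keys. This is an illustration of the general idea under stated assumptions, not the Hadoop implementation: the function names are hypothetical, and choosing split points from a sorted sample is assumed here as one common way to obtain evenly loaded, globally ordered partitions.

```python
import bisect
import random
from collections import Counter
from typing import List


def choose_split_points(sample: List[str], num_reducers: int) -> List[str]:
    """Pick num_reducers - 1 boundaries at evenly spaced ranks of the sorted sample."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // num_reducers]
            for i in range(1, num_reducers)]


def partition(key: str, split_points: List[str]) -> int:
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    keys = ["".join(rng.choice("abcdefghij") for _ in range(10))
            for _ in range(10000)]
    splits = choose_split_points(rng.sample(keys, 500), num_reducers=4)
    load = Counter(partition(key, splits) for key in keys)
    print("split points:", splits)
    print("keys per reducer:", dict(sorted(load.items())))
```

Because every key sent to reducer i sorts before every key sent to reducer i + 1, concatenating the sorted reducer outputs yields a globally sorted result, and the sampled boundaries keep the reducers roughly equally loaded.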