An adaptive framework for utility-based optimization of scientific applications in the cloud

Cloud computing plays an increasingly important role in realizing scientific applications by offering virtualized compute and storage infrastructures that can scale on demand. This paper presents a self-configuring adaptive framework that optimizes resource utilization for scientific applications on top of Cloud technologies. The proposed approach relies on the concept of utility, i.e., a measure of usefulness, and leverages a well-established principle from autonomic computing, namely the MAPE-K loop, in order to adaptively configure scientific applications. Therein, the process of maximizing the utility of specific configurations takes into account the whole Cloud stack: the application layer, the execution environment layer, and the resource layer, which is supported by the defined Cloud stack configuration model. The proposed framework self-configures these layers by evaluating monitored resources, analyzing their state, and generating an execution plan on a per-job basis. The evaluation of configurations is based on historical data and a utility function that ranks them according to the costs incurred. The proposed adaptive framework has been integrated into the Vienna Cloud Environment (VCE), and an evaluation by means of a data-intensive application is presented herein.


I. INTRODUCTION
Executing scientific applications in a Cloud-based environment requires allocation of computing resources, provisioning of the underlying programming environment, and Cloud-enabling the applications. Cloud computing [1], [2] offers researchers the illusion of virtually infinite resources which can be allocated on demand. Nevertheless, researchers usually have to pay for the utilized resources, or, in a private Cloud, resources are not available for other experiments. As a consequence, a common goal for both service providers and clients is to optimize resource utilization while keeping costs or the runtime of applications low. Researchers usually want to obtain results within a given period of time, or they want to spend as little money as possible. Cloud providers aim at serving as many researchers as possible to increase earnings and therefore strive to optimize the utilization of resources.
Within this work we present an approach for optimizing resource utilization of scientific Cloud applications based on the concept of utility [3] known from economics. In our approach utility takes into account utilization of Cloud computing resources as well as the runtime of an application. For maximizing the utility we consider the configuration of all three layers of the Cloud stack: (1) the application layer where applications are realized based on the Software-as-a-Service (SaaS) concept, (2) the programming and execution environment layer, also referred to as Platform-as-a-Service (PaaS) layer, and (3) the resource layer, also known as Infrastructure-as-a-Service (IaaS) layer.
The runtime of an application depends on the hardware characteristics and the amount of computing resources available (e.g., the number of processors used). The programming environment layer (e.g., the MapReduce programming model [4]) provides different sets of configuration parameters which may affect the runtime. Finally, the configuration of an application may impact the runtime as well.
We present the design of an adaptive framework for improving the utility of Cloud-based scientific applications on the basis of autonomic computing concepts [5], [6], in particular the MAPE-K loop, as well as on utility functions [3]. By relying on knowledge about historic jobs, the adaptive framework aims at optimizing the configuration of the different Cloud layers for individual application jobs.
The remainder of this paper is organized as follows: Section II presents our model for describing the different layers of the Cloud stack. Section III delineates the design of the adaptive framework based on autonomic computing concepts. First results with a MapReduce application are presented in Section IV. Finally, we briefly discuss related work in Section V followed by a conclusion in Section VI.

II. CLOUD STACK CONFIGURATION MODEL
Our adaptive Cloud stack configuration framework relies on a generic model for the specification of layer-specific parameters to be taken into account during the optimization process. Our model of the Cloud stack (see Figure 1) comprises three layers: the application layer, the execution environment layer, and the resource layer. For each layer, an XML descriptor captures a set of configuration parameters that might impact the resource requirements and runtime of an application. The definition of the concrete set of parameters at each layer, which should be configured adaptively, has to be done by domain experts, specifically by the application or service provider.

Figure 1. Layered Model for Adaptive Framework: Each layer (application, execution environment, resources) is represented by an XML descriptor comprising a set of configuration parameters. [7]

A. Application Descriptors
The purpose of the application descriptor is to explicitly describe a set of application-specific parameters that should be tuned by the adaptive framework. Many applications provide a large set of configuration parameters enabling the tuning of an application for specific hardware resources. Moreover, experiments often rely on the execution of parameter studies which can be configured in different ways. Depending on the input parameters and the available resources, it may be appropriate to change the configuration of an application. The application descriptor has to be specified by the application provider by defining a set of parameters.

B. Environment Descriptors
The execution of an application may require different execution environments depending on the application characteristics and the underlying resources. On HPC systems and clusters, usually batch submission systems are utilized for allocating the computing resources. In case of a virtual cluster in the Cloud, a batch submission system can be configured on demand for scheduling the jobs depending on the virtual execution appliances available. Thus, scheduling-system-specific parameters that have to be set at job submission time can be defined via the environment descriptor.
For scientific applications, usually a parallel execution environment such as MPI, OpenMP, or MapReduce is utilized. Most of these environments provide a huge set of configuration parameters that may impact the runtime of an application. For example, the Apache Hadoop framework allows, amongst many other parameters, specifying the number of map and reduce tasks and the configuration of the Java Virtual Machines. With our framework, those parameters that should be taken into account by the adaptive framework are defined in the environment descriptor and set upon job submission.

Figure 2. Cloud Stack Descriptors: Each layer can include multiple elements, each describable by an XML descriptor. Each XML descriptor includes a list of parameters with a name and a value. Application, environment, and resource descriptors can be generalized as descriptors.

C. Resource Descriptors
Within Cloud environments, often virtual clusters consisting of a variable set of computing resources with different CPUs and memory sizes are utilized during job execution. HPC systems provide different hardware architectures to consumers (e.g. multi-core CPUs, GPUs, ...) suitable for the execution of different applications. Resource descriptors enable an explicit description of the compute resources to be considered for the execution of a specific job.
Additionally, many applications require the processing of large data sets. Storage environments, such as, for example, the Hadoop Distributed File System (HDFS), provide a huge set of configuration parameters affecting the system's behavior. Adjusting these parameters is often not possible for single jobs because their configuration needs a lot of time (e.g., adjusting the block size or the replication factor of files in HDFS). Nevertheless, these parameters affect the runtime of the application and have to be considered during the job configuration. The resource descriptor enables the specification of these parameters.

D. Job Descriptors
The purpose of the adaptive framework is to adaptively configure a job upon job submission time on the basis of the application, the environment, and the resource descriptor(s). Therefore, a job descriptor comprises application, environment, and resource descriptors.

E. Representation of Descriptors
All descriptors are defined on the basis of XSD schemas which include generic key/value pairs enabling the definition of parameters. Additionally, the XSD schema enables the definition of the scope of each parameter to be considered during job configuration. This has to be done by a domain expert. By following this approach, different applications, execution environments, and resources can be easily supported. The model of the descriptors is depicted in Figure 2; a minimal sketch of the descriptor model is given below.
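To make the descriptor model concrete, the following minimal Java sketch illustrates how the generic key/value model could be represented; all type names are assumptions of this illustration and are not taken from the VCE implementation.

// Minimal sketch of the generic descriptor model behind the XSD schemas
// (type names are assumptions). A serialized environment descriptor could,
// for instance, look like:
//   <descriptor layer="environment">
//     <parameter name="mapred.reduce.tasks" value="8"/>
//   </descriptor>
import java.util.List;

// A single key/value parameter; `scope` restricts the value range the
// analyzer may enumerate during job configuration (set by a domain expert).
record Parameter(String name, String value, List<String> scope) {}

// One descriptor per layer element (application, environment, or resource).
record Descriptor(String layer, List<Parameter> parameters) {}

// A job descriptor combines the descriptors of all three layers (Section II-D).
record JobDescriptor(Descriptor application, Descriptor environment,
                     List<Descriptor> resources) {}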

III. DESIGN OF THE ADAPTIVE FRAMEWORK
On top of these descriptors representing the Cloud stack, an adaptive framework for configuring all three layers has been designed. The adaptive framework supports the configuration of the layers upon job submission time depending on the job characteristics as well as the available resources. The main objective of the adaptive framework is the optimization of the utility for a specific job, which is achieved by adaptive configuration of these layers. Therefore, the design of the framework follows the MAPE-K loop [6], a well-established concept in autonomic computing. The MAPE-K loop comprises a monitoring, an analyzing, a planning, an execution, and a knowledge element. In our framework, the planning element relies on a utility function. The adaptive framework itself acts as the autonomic manager of the different layers of the Cloud stack. The design of the framework is shown in Figure 3; a skeletal sketch of the loop is given below.
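The following Java sketch outlines how such a MAPE-K loop could be wired for per-job configuration; the interface names are assumptions (reusing the JobDescriptor type from the sketch in Section II), not the actual VCE design.

// Skeletal MAPE-K loop for per-job configuration (interface names assumed;
// JobDescriptor is the record from the descriptor sketch in Section II).
import java.util.List;

interface Monitor   { JobDescriptor sense(); }                                 // current configuration
interface Analyzer  { List<JobDescriptor> candidates(JobDescriptor actual); } // feasible configurations
interface Planner   { JobDescriptor best(List<JobDescriptor> candidates); }   // highest utility
interface Executor  { void run(JobDescriptor plan); }                         // configure layers, submit job

final class AutonomicManager {
    private final Monitor monitor; private final Analyzer analyzer;
    private final Planner planner; private final Executor executor;

    AutonomicManager(Monitor m, Analyzer a, Planner p, Executor e) {
        monitor = m; analyzer = a; planner = p; executor = e;
    }

    // One pass of the loop, triggered upon job submission; the knowledge
    // base is consulted by the planner and updated by the executor.
    void submitJob() {
        JobDescriptor current = monitor.sense();
        executor.run(planner.best(analyzer.candidates(current)));
    }
}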

A. Managed Resources
Our adaptive framework has been designed to manage resources at all three layers involved. The framework provides sensor and effector interfaces for obtaining current information about resources and their utilization and for changing the state of the resources.
Multiple execution environments may be involved during the job execution including the scheduling system and the programming environment (e.g. MPI, MapReduce).
The management of the resource layer provides an interface to the computing and storage resources. Computing resources may be HPC resources and clusters managed by a batch scheduling system. In case of the OGE (Oracle Grid Engine, http://www.oracle.com/us/products/tools/oracle-grid-engine-075549.html), information about the allocatable resources can be retrieved via the DRMAA API [8], as sketched below. In case of private or public Cloud environments, the management can be done via the Cloud environment's API. Storage resources include distributed file systems, such as, for example, the Hadoop Distributed File System, and Cloud storage solutions as provided by Amazon (Amazon S3).
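As an illustration of resource-layer access, the following sketch submits a job through the DRMAA 1.0 Java binding [8] against an OGE-managed cluster; the job script path and the parallel environment name are hypothetical.

// Hedged sketch of job submission via the DRMAA Java binding (org.ggf.drmaa);
// the script path and parallel environment name are assumptions.
import org.ggf.drmaa.*;

public class DrmaaSubmit {
    public static void main(String[] args) throws DrmaaException {
        Session session = SessionFactory.getFactory().getSession();
        session.init(null);                         // connect to the default cell
        JobTemplate jt = session.createJobTemplate();
        jt.setRemoteCommand("/path/to/job.sh");     // hypothetical job script
        jt.setNativeSpecification("-pe mpi 8");     // request 8 slots (assumed PE name)
        String jobId = session.runJob(jt);
        session.wait(jobId, Session.TIMEOUT_WAIT_FOREVER); // block until completion
        session.deleteJobTemplate(jt);
        session.exit();
    }
}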

B. Knowledge
To realize a framework capable of adaptively configuring the application, the execution environment, and the resources, there is a need to integrate knowledge gained from previous configurations of the system. Following the concept of the MAPE-K loop, this knowledge is made available to the framework via the knowledge component and used in the process of decision-making.
Various concepts for managing knowledge are established and could be utilized (e.g., Concept of Utility, Reinforcement Learning, Bayesian Techniques) [6]. In our approach, the concept of utility, describing the usefulness of a specific configuration, is utilized for representing knowledge. This enables the representation of varying goals from different stakeholders by defining different utility functions. For example, a specific configuration may have a different utility for the researcher or for the service provider depending on their goals. The knowledge itself is captured within a database system which stores application, environment, and resource descriptors of previous application executions.
In our framework we utilize an HSQL database system for the knowledge base. For each job, the runtime of the job, the utility of the job, and the estimated values for runtime and utility (computed during the planning phase) are stored and made available to the framework. A parameter table is utilized to store parameters specific to the managed resources. After a job has finished, the utility of this job is calculated based on the measured runtime, and both values are added to the knowledge base. A hypothetical sketch of such a schema is given below.
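The following JDBC sketch indicates what such a knowledge base could look like with HSQLDB; the table and column names, as well as the inserted estimate values, are assumptions of this illustration, not the actual VCE schema (the 10852 s runtime is taken from the case study in Section IV).

// Hypothetical knowledge-base schema in HSQLDB (names are assumptions).
import java.sql.*;

public class KnowledgeBase {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection("jdbc:hsqldb:mem:knowledge", "sa", "");
             Statement s = c.createStatement()) {
            // One row per executed or planned job.
            s.execute("CREATE TABLE job_history (job_id IDENTITY, " +
                      "runtime_s DOUBLE, utility DOUBLE, " +
                      "est_runtime_s DOUBLE, est_utility DOUBLE)");
            // Parameters specific to the managed resources, per job and layer.
            s.execute("CREATE TABLE job_parameter (job_id BIGINT, " +
                      "layer VARCHAR(16), name VARCHAR(64), param_value VARCHAR(256))");
            // After a job has finished, the executor recalculates the utility
            // from the measured runtime and appends both values (illustrative data).
            s.execute("INSERT INTO job_history " +
                      "(runtime_s, utility, est_runtime_s, est_utility) " +
                      "VALUES (10852, 0.7, 10500, 0.72)");
        }
    }
}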

C. Monitor
The monitoring component is responsible for sensing the involved managed resources and for providing this information to the autonomic manager. Sensing the resources results in the generation of up-to-date application, environment, and resource descriptors. These descriptors reflect the current configuration of the managed resource (e.g., OGE). The adaptive framework has to monitor multiple resources at the different layers (IaaS, PaaS, SaaS). Therefore, the monitor relies on a component-oriented architecture, which enables the simple integration of new monitor components for monitoring specific resources (e.g., a different set of resources).

D. Analyzer
The analyzer is responsible for evaluating the current configuration of all layers involved. The analyzer adopts a component-based architecture, as depicted in Fig. 4, and can be composed of multiple specific analyzer components for analyzing the configuration of specific resources. The analyzer executes all specific analyzer components sequentially. The basic execution chain of analyzer components is bottom-up, starting from the resource layer, followed by the execution environment, and finally the application layer. The execution chain can be changed by the service provider if required.
Each analyzer component examines the layer-specific parameters and generates a set of corresponding job descriptors. For example, if the resource descriptor includes a virtual cluster consisting of ten virtual appliances, the resource analyzer component creates ten different job configurations describing the utilization of one up to ten virtual appliances. In order to limit the number of different configurations, the range of the different parameters has to be restricted by the service provider. The analyzing phase results in a set of feasible job configurations, each specifying a different configuration of the resources on each layer. One challenge with this approach is to balance the number of parameters against the accuracy of the approach. On the one hand, we try to minimize the number of generated job configurations by utilizing as few parameters as possible at each layer. On the other hand, the utilized parameters have to be chosen by domain experts to retrieve appropriate and accurate results. A sketch of such an analyzer component is given below.
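As an illustration, a resource analyzer component could expand the node-count scope of a virtual cluster as follows; the enumeration strategy and parameter names are assumptions of this sketch.

// Illustrative resource analyzer component: expands the node-count scope of a
// virtual cluster into one candidate configuration per feasible node count,
// as in the ten-appliance example above (enumeration strategy assumed).
import java.util.*;

public class ResourceAnalyzer {
    static List<Map<String, String>> expand(Map<String, String> base, int maxNodes) {
        List<Map<String, String>> candidates = new ArrayList<>();
        for (int nodes = 1; nodes <= maxNodes; nodes++) {
            Map<String, String> cfg = new HashMap<>(base); // copy the fixed parameters
            cfg.put("nodes", Integer.toString(nodes));     // vary the node count
            candidates.add(cfg);
        }
        return candidates;
    }
}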

E. Planner
Within the MAPE-K loop, the planning component is responsible for choosing the job configuration with the highest utility. This is done by means of knowledge and a planning strategy on the basis of the concept of utility, which is used in economics to describe the measurement of usefulness that a consumer obtains from any good [9].
The planner utilizes a utility function for estimating the utility of a specific job configuration. The utility function is based on the parameter set included in the job descriptor. The parameter set changes due to different types of involved resources, execution environments, or applications. Therefore, the utility function has to be adapted according to the specific application scenario. Additionally, the utility function takes into account the runtime of the application. Thus, an application-specific performance model is needed to estimate the runtime.
The planner ranks all job descriptions on the basis of their utility and chooses the job configuration with the highest utility. The basic design of the planner, including a utility calculator and an application-specific performance model, is depicted in Fig. 5.

Figure 5. Design of Planner: The design of the planner includes an application-specific performance model and a utility calculator. The utility calculation relies on a utility function. The planner calculates the utility of each provided job description and ranks them afterwards by means of the utility.
1) Utility Calculator: The utility calculator computes the utility of a job descriptor taking into account the parameters within the application, environment, and resource descriptors as well as the runtime estimated by the performance model. In more detail, the utility is defined as follows [10]: the utility U for a given configuration is defined as U(A, E, R), where A = {a_1, ..., a_n}, E = {e_1, ..., e_m}, and R = {r_1, ..., r_k} represent the parameter sets at the application layer, the execution environment layer, and the resource layer, respectively. Different configurations are ranked on the basis of their utility: if U(C) > U(C'), then configuration C is ranked higher than configuration C'. The configuration with the highest utility is chosen for job execution. The utility of a configuration is normalized to the range [0, 1]. The utility function itself is defined as a Sigmoid function

U(A, E, R) = 1 / (1 + e^{f(A, E, R)})    (1)

depending on a function f(A, E, R). This function f is scenario-specific and has to be defined by domain experts.

A Sigmoid function has been chosen because it highlights a small set of "good" configurations, slopes promptly, and settles at a low value. Hence, the function (1) fits the problem of accentuating good job configurations.
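For illustration, a minimal utility calculator in Java could look as follows; the concrete cost function f, its weights, and the reference configuration are assumptions of this sketch, since f is scenario-specific and has to be defined by domain experts.

// Illustrative utility calculator following Equation (1); the cost function f
// and its weights are assumptions of this sketch, not the paper's definition.
public class UtilityCalculator {

    // Scenario-specific cost f(A, E, R): here a weighted sum of the estimated
    // runtime and the node count, expressed relative to a reference
    // configuration so that the Sigmoid is centered around f = 0.
    static double f(double estRuntimeS, int nodes,
                    double refRuntimeS, int refNodes) {
        return 4.0 * (estRuntimeS / refRuntimeS - 1.0)
             + 1.0 * ((double) nodes / refNodes - 1.0);
    }

    // U = 1 / (1 + e^f): normalized to [0, 1] and decreasing in the cost f,
    // so cheap, fast configurations approach 1 and expensive ones approach 0.
    static double utility(double cost) {
        return 1.0 / (1.0 + Math.exp(cost));
    }
}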
2) Performance Model: Predicting the accurate runtime of an application is usually a complex, often unrealistic task [11]. Similarly, a complete multidimensional regression including all parameters involved requires a large number of test cases to deliver appropriate results. For these reasons, we propose the utilization of a history-based performance model. Within a case study, we retrieved accurate results on different computing resources by utilizing historic information and parameter-specific regression functions, as sketched below.
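A minimal sketch of such a model, assuming a runtime = a + b/nodes shape fitted by ordinary least squares over historic jobs (the concrete regression functions of the case study are not specified here; the model shape is an assumption):

// History-based performance model (sketch): fit runtime = a + b / nodes by
// ordinary least squares over historic jobs; the model shape is an assumption.
public class PerformanceModel {
    private double a, b;

    void fit(int[] nodes, double[] runtimeS) {
        int n = nodes.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double x = 1.0 / nodes[i];        // regress on the inverse node count
            sx += x; sy += runtimeS[i];
            sxx += x * x; sxy += x * runtimeS[i];
        }
        b = (n * sxy - sx * sy) / (n * sxx - sx * sx); // least-squares slope
        a = (sy - b * sx) / n;                         // least-squares intercept
    }

    double estimate(int nodes) {
        return a + b / nodes;
    }
}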

F. Executor
The task of the executor is to configure the three layers according to the chosen parameter configuration and execute the application job. First, the executor reserves and configures the computing and storage resources (or initializes them in case of a virtual cluster within a Cloud environment).
After job completion, the executor evaluates the execution (runtime of the job and recalculated utility) and stores the gained information in the knowledge base so that it can be utilized for planning forthcoming job submissions.

IV. CASE STUDY: CLOUD-ENABLING A MAPREDUCE APPLICATION
A prototype of the adaptive framework including all involved components has been implemented within the Vienna Cloud Environment (VCE) [12]. In the following we report on experiments with a MapReduce application from the domain of molecular systems biology [10], [13]. The framework has been successfully applied for optimizing the utility of MapReduce jobs by adaptively configuring the involved resources.
At the application layer, support for the execution of the MoSys application [13], matching tryptic peptide fragmentation mass spectra data against the ProMEX database [14], has been implemented. The application supports the execution of parameter studies, in particular the comparison of hundreds of files against the database. The framework adaptively splits or combines these input files into multiple jobs and schedules their execution on different sets of computing resources.
At the platform layer, the adaptive framework supports the configuration of the Apache Hadoop framework. HDFS parameters (e.g., the block size) and Hadoop parameters are monitored on demand, and the Hadoop job configuration, e.g., the number of Map or Reduce tasks to be executed, is adaptively set on the basis of these parameters, as sketched below.
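For instance, applying environment-descriptor parameters to a Hadoop job could look as follows with the classic org.apache.hadoop.mapred API; the descriptor lookup is omitted here, and the JVM options are an assumed example.

// Sketch: applying adaptively chosen parameters to a Hadoop job configuration
// (classic mapred API; parameter values would come from the job descriptor).
import org.apache.hadoop.mapred.JobConf;

public class HadoopConfigurator {
    static JobConf configure(int mapTasks, int reduceTasks, String jvmOpts) {
        JobConf conf = new JobConf();
        conf.setNumMapTasks(mapTasks);        // a hint; the actual count follows the input splits
        conf.setNumReduceTasks(reduceTasks);
        conf.set("mapred.child.java.opts", jvmOpts); // e.g., "-Xmx1024m"
        return conf;
    }
}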
At the infrastructure layer, the prototype supports scheduling jobs on virtual clusters within a private Cloud environment as well as on a local cluster system utilizing OGE.
Preliminary results of the prototype show how the case study application scales on private Cloud and cluster resources [10]. Additionally, the estimated runtime, calculated on the basis of the history-based performance model, has been compared with the jobs' actual execution times, showing accurate results.
The planning process relies on knowledge gained from historic jobs; at least one job has to be executed beforehand. In the following, consider a test job comparing 1000 files against a 500 GB database. In a first run, we executed a job comprising four sub-jobs with 250 files each, and each sub-job has been executed on two nodes of the cluster. The total runtime of this job was 10852 seconds. Utilizing the results of this first run, our framework generates and evaluates different job configurations, taking into account the following limited set of parameters: the number of compute nodes, the estimated runtime of the job, the size of the database (HDFS), and the job configuration (i.e., the number of sub-jobs and input files per sub-job). On the cluster up to eight nodes and on the private Cloud up to twelve nodes, all with eight cores, can be utilized. In total, 340 different job descriptors have been evaluated. The impact of the HDFS block size and the number of map/reduce tasks has been evaluated separately and is not modified in this experiment.
For each created job descriptor the planner estimates the performance and calculates its utility on the basis of the defined utility function. Within the test case we chose a utility function weighting the four job-specific parameters with the main objective of optimizing the runtime. As a result of applying our framework, the job configuration utilizing eight cluster nodes for executing all 1000 input files within a single job has the highest utility and is therefore chosen for execution (estimated runtime: 8030 seconds). Table I shows sample job descriptors from our test scenario together with the measured runtime, the estimated runtime of the job, and the estimated utility. As can be seen, the performance model provides quite accurate results.

V. RELATED WORK
In economics, utility functions are used to measure the relative satisfaction of customers with goods. This principle can be applied to Cloud computing models where users access infrastructure on a rental basis. In [15], utility functions in autonomic systems are used to continually optimize the utilization of managed computational resources in a dynamic, heterogeneous environment. The authors describe a system that is able to self-optimize the allocation of resources to different application environments. In contrast to their approach, we try to reduce the costs (resource usage, runtime) for one application. This is achieved by automatically configuring the environment.
In [16], an adaptive resource control system is presented. It adjusts the resources assigned to individual tiers in order to meet application-level Quality of Service (QoS) goals, thereby increasing the resource utilization in a data center. The premier objective of our framework, in contrast, is to reduce the amount of resources utilized for individual jobs.
The Automat toolkit [17] is a community testbed architecture that targets research into mechanisms and policies for autonomic computing that are under closed-loop control.
The toolkit enables researchers to instantiate self-managed virtual data centers and to define the controllers that govern resource allocation, selection, and dynamic migration.

VI. CONCLUSION
In this paper we presented the design of an adaptive framework for optimizing scientific applications in the Cloud based on autonomic computing concepts. Our approach takes into account the configuration of all three Cloud layers: the infrastructure layer (IaaS), the execution environment layer (PaaS), and the application layer (SaaS). For describing the configuration parameters on these three layers, generic XML descriptors are utilized. On top of these descriptors, an adaptive framework has been designed that realizes a MAPE-K loop, known from autonomic computing, and relies on the concept of utility for optimizing the configuration of the Cloud stack on a per-job basis. Within a case study, a prototype of the adaptive framework has been developed for a MapReduce application to be executed on private Cloud and cluster resources. First results show that the method of adaptively configuring the Cloud stack can lead to good results for MapReduce applications.
Future work will include the evaluation of the framework with other applications, execution environments, and resource types regarding scale and heterogeneity. Utilizing the framework in public Clouds will lead to challenges regarding the uncertainties of the execution context. We also plan to extend our work by taking into account energy consumption [18]. Another aspect is to develop and evaluate heuristics for the analysis of the parameter ranges to minimize the number of generated job descriptors. Moreover, we will investigate the use of machine learning algorithms within the planning phase.