On the Effect of Applying the Task Clustering for Identical Processor Utilization to Heterogeneous Systems



Introduction
Task execution models over networked processors, e.g., cluster, grid, and utility computing, have been studied and developed to maximize system throughput by utilizing computational resources. One of the major trends in task execution is to divide the required data into several pieces and then distribute them to workers, as in the "master-worker model". In contrast to such data intensive jobs, how to divide a computation intensive job into several execution units for parallel execution is still under discussion from a theoretical point of view. If we take task parallelization into account in a computational grid environment, an effective task scheduling strategy should be established. In combining task scheduling concepts with grid computing methodologies, heterogeneity with respect to processing power, communication bandwidth, and so on should be incorporated into the task scheduling strategy. If we assume a situation where multiple jobs are submitted to an unknown number of computational resources over the Internet, the following objective functions can be considered: (i) minimization of the schedule length of each job, (ii) minimization of the completion time of the last job, and (iii) maximization of the degree of contribution of each computational resource to the total speedup ratio. As one solution for these three objective functions, in the literature (Kanemitsu, 2010) we proposed a method for minimizing the schedule length of one job with a small number of computational resources (processors) for a set of identical processors. The objective of the method is "utilization of computational resources". The method is based on "task clustering" (A. Gerasoulis, 1992), in which tasks are merged into one "cluster" as an execution unit for one processor. As a result, several clusters are generated, each of which becomes one assignment unit. The method imposes a lower bound on every cluster size to limit the number of processors. The literature then theoretically derived the near-optimal lower bound for minimizing the schedule length.
However, which processor should be assigned to a cluster is not discussed there, because the proposal assumes identical processors. If we use one of the conventional cluster assignment methods such as CHP (C. Boeres, 2004), Triplet (B. Cirou, 2001), or FCS (S. Chingchit, 1999), almost all processors may be assigned to clusters, because these methods try to achieve the maximum task parallelism in order to minimize the schedule length. Thus, the third objective function may not be achieved by those cluster assignment strategies.
In this chapter, we propose a method for deriving the lower bound of the cluster size in heterogeneous distributed systems, together with a task clustering algorithm. From the results of experimental simulations, we discuss the applicability of the proposal for obtaining better processor utilization.
The remainder of this chapter is organized as follows. Sec. 2 presents other conventional approaches related to task clustering for heterogeneous distributed systems, and Sec. 3 presents our assumed model. The lower bound of the cluster size is derived in Sec. 4. Sec. 5 presents a task clustering algorithm which adopts the lower bound shown in Sec. 4. Experimental results are shown in Sec. 6, and finally we present the conclusion and future works in Sec. 7.

Related works
In a distributed environment where each processor is completely connected, task clustering (A. Gerasoulis, 1992; T. Yang, 1994; J.C. Liou, 1996) has been known as one of the task scheduling methods. In task clustering, two or more tasks are merged into one cluster, by which communication among them is localized, so that each cluster becomes one assignment unit to a processor. As a result, the number of clusters becomes the number of required processors. On the other hand, if we try to perform task clustering in a heterogeneous distributed system, the objective is to find an optimal processor assignment, i.e., which processor should be assigned to each cluster generated by the task clustering. Furthermore, since the processing time and the data communication time depend on the performance of each assigned processor, each cluster should be generated with that issue taken into account. As related works for task clustering in heterogeneous distributed systems, CHP (C. Boeres, 2004), Triplet (B. Cirou, 2001), and FCS (S. Chingchit, 1999) are known. CHP (C. Boeres, 2004) first assumes "virtual identical processors" whose processing speed is the minimum among the given set of processors. Then CHP performs task clustering to generate a set of clusters. In the processor assignment phase, the cluster which can be scheduled at the earliest time is selected, and the processor which can make the cluster's completion time the earliest among all processors is selected. Then the cluster is assigned to the selected processor. This procedure is iterated until every cluster is assigned to a processor. In the CHP algorithm, an unassigned processor can be selected as the next assignment target because it has no waiting time. Thus, each cluster is assigned to a different processor, so that many processors are required for execution, and therefore CHP cannot lead to processor utilization.
In the Triplet algorithm (B. Cirou, 2001), task groups, each of which consists of three tasks and is named a "triplet", are formed according to the data size to be transferred among tasks and the out-degree of each task. Then a cluster is generated by merging two triplets according to their execution time and data transfer time on the fastest processor and the slowest processor. On the other hand, processors are grouped as a function of their processing speed and communication bandwidth, so that several processor groups are generated. As a final stage, each cluster is assigned to a processor group according to the processor group's load. The processor assignment policy in Triplet is that one cluster is assigned to a processor group composed of two or more processors. Thus, such a policy does not match the concept of processor utilization.
The FCS algorithm (S. Chingchit, 1999) defines two parameters, i.e., β, the ratio of total task size to total data size for each cluster (where task size means the time unit required to execute one instruction), and τ, the ratio of processing speed to communication bandwidth for each processor. While task merging steps are performed, if β of a cluster exceeds τ of a processor, the cluster is assigned to the processor. As a result, the number of clusters depends on each processor's speed and communication bandwidth. Thus, there is a possibility that a "very small cluster" is generated, and therefore FCS does not match the concept of processor utilization.

Job model
We assume that a job to be executed among distributed processor elements (PEs) is a Directed Acyclic Graph (DAG), which is one kind of task graph. Let $G^s_{cls} = (V_s, E_s, V^s_{cls})$ be the DAG, where $s$ is the number of task merging steps (described in Sec. 3.2), $V_s$ is the set of tasks after $s$ task merging steps, $E_s$ is the set of edges (data communications among tasks) after $s$ task merging steps, and $V^s_{cls}$ is the set of clusters, each of which consists of one or more tasks, after $s$ task merging steps. The $i$-th task is denoted as $n^s_i$. Let $w(n^s_i)$ be the size of $n^s_i$, i.e., $w(n^s_i)$ is the sum of unit times taken for being processed by the reference processor element. We define the data dependency and the direction of data transfer from $n^s_i$ to $n^s_j$ as $e^s_{i,j}$, and $c(e^s_{i,j})$ is the sum of unit times taken for transferring data from $n^s_i$ to $n^s_j$ over the reference communication link.
One constraint imposed by a DAG is that a task cannot start execution until all data from its predecessor tasks have arrived. For instance, $e^s_{i,j}$ means that $n^s_j$ cannot be started until the data from $n^s_i$ arrives at the processor which will execute $n^s_j$. Let $pred(n^s_i)$ be the set of immediate predecessors of $n^s_i$, and $suc(n^s_i)$ be the set of immediate successors of $n^s_i$. If $pred(n^s_i) = \emptyset$, $n^s_i$ is called a START task, and if $suc(n^s_i) = \emptyset$, $n^s_i$ is called an END task. If there are one or more paths from $n^s_i$ to $n^s_j$, we denote such a relation as $n^s_i \prec n^s_j$.
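The job model above can be illustrated with a small data structure. The following is only a minimal sketch: the class name `TaskGraph`, its method names, and the four-task example graph are assumptions introduced for illustration and do not come from the chapter.

```python
from dataclasses import dataclass, field


@dataclass
class TaskGraph:
    """A job as a DAG: task sizes w(n_i) and edge data sizes c(e_ij)."""
    w: dict[int, float]  # task id -> size on the reference PE
    c: dict[tuple[int, int], float] = field(default_factory=dict)  # (i, j) -> data size

    def pred(self, i):
        """Immediate predecessors of task i."""
        return {a for (a, b) in self.c if b == i}

    def suc(self, i):
        """Immediate successors of task i."""
        return {b for (a, b) in self.c if a == i}

    def start_tasks(self):
        """Tasks whose predecessor set is empty (START tasks)."""
        return {i for i in self.w if not self.pred(i)}

    def end_tasks(self):
        """Tasks whose successor set is empty (END tasks)."""
        return {i for i in self.w if not self.suc(i)}

    def precedes(self, i, j):
        """True if at least one path from task i to task j exists."""
        stack, seen = list(self.suc(i)), set()
        while stack:
            k = stack.pop()
            if k == j:
                return True
            if k not in seen:
                seen.add(k)
                stack.extend(self.suc(k))
        return False


# Hypothetical example graph: 1 -> 2 -> 4 and 1 -> 3 -> 4.
g = TaskGraph(w={1: 2.0, 2: 3.0, 3: 1.0, 4: 4.0},
              c={(1, 2): 1.0, (1, 3): 2.0, (2, 4): 1.0, (3, 4): 3.0})
```

In this example, task 1 is the only START task, task 4 is the only END task, and tasks 2 and 3 are independent of each other because no path connects them.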

Task clustering
We denote the $i$-th cluster in $V^s_{cls}$ as $cls_s(i)$. If $n^s_k$ is included in $cls_s(i)$ by "the $(s+1)$-th task merging", we formulate one task merging as $cls_{s+1}(i) \leftarrow cls_s(i) \cup \{n^s_k\}$. If any two tasks, i.e., $n^s_i$ and $n^s_j$, are included in the same cluster, they are assigned to the same processor. Then the communication between $n^s_i$ and $n^s_j$ is localized, so that we define $c(e^s_{i,j})$ to become zero. Task clustering is a set of task merging steps which is finished when certain criteria have been satisfied.
Throughout this chapter, we say that $cls_s(i)$ is "linear" if and only if $cls_s(i)$ contains no independent tasks (A. Gerasoulis, 1993). Note that if a cluster is linear, at least one path exists between any two tasks in the cluster and the task execution order is unique.
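A task merging step, the localization of communication, and the linearity condition can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary-based representation of clusters and edges, and the example graph are all hypothetical.

```python
from itertools import combinations


def reachable(edges, i, j):
    """True if a path from task i to task j exists in the DAG."""
    stack, seen = [i], set()
    while stack:
        k = stack.pop()
        for (a, b) in edges:
            if a == k and b not in seen:
                if b == j:
                    return True
                seen.add(b)
                stack.append(b)
    return False


def merge_task(clusters, i, k):
    """One task merging step: cluster i absorbs task k. Returns a new clustering."""
    new = {cid: set(tasks) - {k} for cid, tasks in clusters.items()}
    new[i] = new[i] | {k}
    # Drop clusters that became empty after losing task k.
    return {cid: tasks for cid, tasks in new.items() if tasks}


def comm_cost(edges, clusters, i, j):
    """c(e_ij) after clustering: zero when both endpoints share a cluster."""
    same = any(i in tasks and j in tasks for tasks in clusters.values())
    return 0.0 if same else edges[(i, j)]


def is_linear(edges, tasks):
    """A cluster is linear iff it contains no pair of independent tasks."""
    return all(reachable(edges, a, b) or reachable(edges, b, a)
               for a, b in combinations(tasks, 2))


# Hypothetical example: 1 -> 2 -> 4 and 1 -> 3 -> 4, one task per cluster.
edges = {(1, 2): 1.0, (1, 3): 2.0, (2, 4): 1.0, (3, 4): 3.0}
clusters = {1: {1}, 2: {2}, 3: {3}, 4: {4}}
step1 = merge_task(clusters, 1, 2)   # cluster 1 becomes {1, 2}
step2 = merge_task(step1, 1, 3)      # cluster 1 becomes {1, 2, 3}
```

After the first merging step the edge between tasks 1 and 2 is localized and the cluster {1, 2} is linear. After the second step the cluster {1, 2, 3} is no longer linear, because tasks 2 and 3 are independent.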

System model
We assume that each PE is completely connected to the other PEs, with non-identical processing speeds and communication bandwidths. The set of PEs is expressed as $P = \{P_1, P_2, \ldots, P_m\}$, and let the set of processing speeds be $\alpha$, i.e., $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$. Let the set of communication bandwidths be $\beta$, i.e., $\beta_{i,j}$ means the communication bandwidth from $P_i$ to $P_j$. The processing time in the case that $n^s_k$ is processed on $P_i$ is expressed as $t_p(n^s_k, \alpha_i) = w(n^s_k)/\alpha_i$. The data transfer time of $e^s_{k,l}$ over $\beta_{i,j}$ is $t_c(e^s_{k,l}, \beta_{i,j}) = c(e^s_{k,l})/\beta_{i,j}$. This means that neither the processing time nor the data transfer time changes with time. We also suppose that the data transfer time within one PE is negligible.
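The two cost formulas of the system model translate directly into code. The sketch below assumes a dictionary representation for the speeds and bandwidths; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
def processing_time(w, alpha_i):
    """t_p(n, alpha_i) = w(n) / alpha_i."""
    return w / alpha_i


def transfer_time(c, beta, i, j):
    """t_c(e, beta_ij) = c(e) / beta_ij; negligible (zero) within one PE."""
    if i == j:
        return 0.0
    return c / beta[(i, j)]


# Hypothetical two-PE system: P2 is twice as fast as P1, links are asymmetric.
alpha = {1: 1.0, 2: 2.0}
beta = {(1, 2): 4.0, (2, 1): 2.0}

t_proc = processing_time(6.0, alpha[2])    # task of size 6 on P2
t_fwd = transfer_time(8.0, beta, 1, 2)     # data of size 8 from P1 to P2
t_back = transfer_time(8.0, beta, 2, 1)    # same data from P2 to P1
t_local = transfer_time(8.0, beta, 1, 1)   # localized transfer
```

Because the bandwidths are indexed by an ordered pair of PEs, the same data may take a different time in each direction, as the two transfer values show.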

Processor utilization

4.1 The indicative value for the schedule length
The schedule length depends on many factors, i.e., the execution time of each task, the communication time of each data item exchanged among tasks, the execution order after task scheduling, the processing speeds, and the communication bandwidths. Furthermore, whether a data transfer can be localized or not depends on the cluster structure. In the proposed method, a cluster is generated after the lower bound of the cluster size (the total execution time of every task included in the cluster) has been derived. The lower bound is decided such that the indicative value for the schedule length is minimized. In this chapter, the indicative value is defined as $sl_w(G^s_{cls}, \phi_s)$, i.e., the indicative value for the schedule length after $s$ task merging steps, where $\phi_s$ is the set of mappings between PEs and clusters after $s$ task merging steps. $sl_w(G^s_{cls}, \phi_s)$ is the maximum value of the execution path length, which includes both task execution time and data transfer time, provided that each task is scheduled as late as possible and all data from its immediate predecessors have arrived before its scheduled (start) time. Table 1 shows the notations and definitions for deriving $sl_w(G^s_{cls}, \phi_s)$. In the table, the PEs assigned to $cls_s(i)$ and $cls_s(j)$ are $P_p$ and $P_q$, respectively, and we suppose $n^s_k \in cls_s(i)$ and $n^s_l \in cls_s(j)$. In Table 1, $S(n^s_k, i)$ in particular means the degree of increase of the execution time caused by tasks independent of $n^s_k$. Therefore, the smaller $S(n^s_k, i)$ is, the earlier $n^s_k$ can be scheduled. The task $n^s_k$ which dominates $sl_w(G^s_{cls}, \phi_s)$ (i.e., the case of $sl_w(G^s_{cls}, \phi_s) = level(n^s_k)$) is the one for which the schedule length may be maximized if it is scheduled as late as possible.
Example 1. Fig. 1 shows one example of deriving $sl_w(G^s_{cls}, \phi_s)$ for $s = 5$. In the figure, there are two PEs, i.e., $P_1$ and $P_2$. The DAG has two clusters, i.e., $cls_5(1)$ and $cls_5(4)$, after 5 task merging steps. In (a), the numerical values on tasks and edges mean the time units to be processed on the reference PE and the time units to be transferred among reference PEs over the reference communication bandwidth, respectively. On the other hand, (b) corresponds to the state in which $cls_5(1)$ and $cls_5(4)$ have been assigned to $P_1$ and $P_2$, respectively. The bottom area shows the derivation process of $sl_w(G^5_{cls}, \phi_5)$. From the derivation process, it is shown that the schedule length may be maximized if $n^5_2$ is scheduled as late as possible.
4.2 Relationship between $sl_w(G^s_{cls}, \phi_s)$ and the schedule length
Our objective is to minimize $sl_w(G^s_{cls}, \phi_s)$ while maintaining a certain size for each cluster for the sake of processor utilization. Since the schedule length cannot be known before every task has been scheduled, we must estimate it by using $sl_w(G^s_{cls}, \phi_s)$. Thus, it must be shown that $sl_w(G^s_{cls}, \phi_s)$ affects the schedule length. In this section, we show that minimizing $sl_w(G^s_{cls}, \phi_s)$ leads to minimizing the schedule length to some extent. Table 2 shows the notations used for describing the characteristics of $sl_w(G^s_{cls}, \phi_s)$. In an identical processor system, provided that every processing speed and communication bandwidth is 1, no processor assignment policy is needed. Thus, let $sl_w(G^s_{cls}, \phi_s)$ in an identical processor system be denoted as $sl_w(G^s_{cls})$. In the literature (Kanemitsu, 2010), it is proved that minimizing $sl_w(G^s_{cls})$ leads to minimizing the lower bound of the schedule length as follows.

Lemma 1. In an identical processor system, let $\Delta sl^{s-1}_{w,up}$ be a value which satisfies $sl_w(G^s_{cls}) - cp \le \Delta sl^{s-1}_{w,up}$ and is derived before $s$ task merging steps. Then we obtain [...], where $cp$ and $g_{min}$ are defined in Table 2, and $sl(G^s_{cls})$ is the schedule length after $s$ task merging steps.

$\Delta sl^{s-1}_{w,up}$ is defined in the literature (Kanemitsu, 2010). Furthermore, it can be proved by the following lemma that the upper bound of the schedule length can be reduced by reducing $sl_w(G^s_{cls})$.

Lemma 2. In an identical processor system, if $sl(G^s_{cls}) \le cp$, then we obtain [...].

Proof. In $seq^{\prec}_s$, some edges are localized and others may not be localized. Furthermore, edges in $seq^{\prec}_s$ do not always belong to the critical path. Then we have the following relationship: [...]. Also, only in the case of $sl(G^s_{cls}) \le cp$, we have the following relationship: [...].

From Lemmas 1 and 2, it is concluded that in an identical processor system the schedule length can be minimized if $sl_w(G^s_{cls})$ is minimized.
As a next step, we show the relationship between $sl_w(G^s_{cls}, \phi_s)$ and the schedule length in a heterogeneous distributed system. The following lemma is proved in the literature (Sinnen, 2007).

Lemma 3. In an identical processor system, we have [...].

In a heterogeneous distributed system, we assume the state shown in Fig. 2, i.e., at the initial state every task is assigned to a processor with the fastest processing speed and the widest communication bandwidth (let this processor be $P_{max}$). In Fig. 2 (a), each task belongs to its own processor. Furthermore, we virtually assign $P_{max}$ to each task in order to decide the processing time of each task and the data transfer time between any two tasks. Let this mapping be $\phi_0$. Under this situation, we have the following corollary.

Corollary 1. In a heterogeneous distributed system, let $cp_w(\phi_0)$ be the one with the mapping $\phi_0$ in Table 2. Then we have [...].

As for the relationship between $cp$ and $cp_w$, the following is proved in the literature (Sinnen, 2007).
Table 2. Parameter definitions used in the analysis of $sl_w(G^s_{cls}, \phi_s)$ (entries as far as they can be recovered).
[...]: One path in which every task belongs to $seq_s$.
$seq^{\prec}_s(i)$: Set of subpaths in each of which every task in $cls_s(i)$ belongs to $seq^{\prec}_s$.
$proc(n^s_k)$: The processor to which $n^s_k$ has been assigned.
[...]: $t_c(c(e^s_{k,l}), \beta_{p,q})$, where $n^s_k$ and $n^s_l$ are assigned to $P_p$ and $P_q$.
$cp_w$: $\max$ [...].
Lemma 4. In an identical processor system, by using $g_{min}$ defined in Table 2, we have [...].

By using Lemma 4, the following is derived for a heterogeneous distributed system.

Corollary 2. In a heterogeneous distributed system, we have [...].

From Corollary 1, the following is derived.

Corollary 3. In a heterogeneous distributed system, we have [...].

From Corollaries 2 and 3, the following theorem is derived.

Theorem 4.1. In a heterogeneous distributed system, let the DAG after $s$ task merging steps be $G^s_{cls}$, and assume that every cluster in $V^s_{cls}$ is assigned to a processor in $P$. Let the schedule length be $sl(G^s_{cls}, \phi_s)$. If we define $\Delta sl$ [...].

Proof. From the assumption and Corollary 2, we have [...]. Also, from Corollary 3, we obtain $cp_w(\phi_0) \le sl(G^s_{cls}, \phi_s)$. Thus, if this is applied to (12), we have [...] $\Leftrightarrow$ [...]. (14)
Assume that $\Delta sl^{s-1}_{w,up}$ is the value which is decided after $s-1$ task merging steps. Since [...], this value is an upper bound of the increase in $sl_w(G^s_{cls}, \phi_s)$ and can be defined under any policy, e.g., the slowest processor is assigned to each cluster, and so on. However, $\Delta sl^{s-1}_{w,up}$ must at least be decided before $s$ task merging steps.

From the theorem, it can be said that reducing $sl_w(G^s_{cls}, \phi_s)$ leads to a reduction of the lower bound of the schedule length in a heterogeneous distributed system.
As for the upper bound of the schedule length, the following theorem is derived.

Theorem 4.2. [...] $t_c(e^s_{k,l}, \beta_{i,j}) - t_c(e^0_{k,l}, \ldots)$ [...], where $proc(n^0_k)$ is defined in Table 2. That is, $\zeta$, $\lambda$, and $\mu$ are derived by scanning every path in the DAG.

Proof. After $s$ task merging steps, there may be both localized edges and non-localized edges which compose $sl_w(G^s_{cls}, \phi_s)$. Although we obviously have $sl_w(G^0_{cls}, \phi_0) = cp(\phi_0)$, such edges do not always belong to $cp(\phi_0)$. Therefore, the lower bound of $sl_w(G^s_{cls}, \phi_s) - cp(\phi_0)$ can be derived from three factors, i.e., the decrease of the data transfer time by localization in one path, the increase of the processing time by task merging steps (from $\phi_0$ to $\phi_s$), and the increase of the data transfer time for each unlocalized edge (from $\phi_0$ to $\phi_s$). The localized data transfer time is derived by taking the sum of the localized data transfer times for one path. On the other hand, if the increase of the processing time is derived by taking the minimum, over all paths, of the sum of the increases of task processing time from $\phi_0$ to $\phi_s$, this value is $\lambda$ or more. The unlocalized data transfer time is expressed as $\mu$. Then we have [...]. If $sl(G^s_{cls}, \phi_s) \le cp(\phi_0) = sl(G^0_{cls}, \phi_0)$, we obtain [...].

Theorem 4.2 is true if we adopt a clustering policy such that $sl(G^s_{cls}, \phi_s) \le sl(G^{s-1}_{cls}, \phi_{s-1})$. From Theorems 4.1 and 4.2, it can be concluded that reducing $sl_w(G^s_{cls}, \phi_s)$ leads to a reduction of the schedule length in a heterogeneous distributed system. Thus, the first objective of our proposal is to minimize $sl_w(G^s_{cls}, \phi_s)$.
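The proof above relies on the critical path of the initial state, in which every task is virtually assigned to $P_{max}$. As a rough illustration, the sketch below computes the longest path length of a DAG under that mapping. Since Table 2 is only partially available, two things are assumptions here: that the critical path including data transfer time corresponds to $cp_w(\phi_0)$, and that the flag `include_comm=False` gives the computation-only variant. The function name, the parameters `alpha_max` and `beta_max`, and the example graph are hypothetical.

```python
from functools import lru_cache


def critical_path_length(w, c, alpha_max, beta_max, include_comm=True):
    """Longest path length when every task runs on its own copy of P_max.

    w: task id -> task size, c: (i, j) -> data size.
    alpha_max, beta_max: processing speed and bandwidth of P_max (assumed names).
    """
    suc = {}
    for (i, j) in c:
        suc.setdefault(i, []).append(j)

    @lru_cache(maxsize=None)
    def longest_from(i):
        # Processing time of task i plus the longest continuation, if any.
        tails = [
            (c[(i, j)] / beta_max if include_comm else 0.0) + longest_from(j)
            for j in suc.get(i, [])
        ]
        return w[i] / alpha_max + max(tails, default=0.0)

    return max(longest_from(i) for i in w)


# Hypothetical example: 1 -> 2 -> 4 and 1 -> 3 -> 4.
w = {1: 2.0, 2: 3.0, 3: 1.0, 4: 4.0}
c = {(1, 2): 1.0, (1, 3): 2.0, (2, 4): 1.0, (3, 4): 3.0}
with_comm = critical_path_length(w, c, alpha_max=1.0, beta_max=1.0)
without_comm = critical_path_length(w, c, 1.0, 1.0, include_comm=False)
```

In this example the path through task 3 is the longest when data transfer is counted (2 + 2 + 1 + 3 + 4 = 12), whereas the path through task 2 is the longest when only processing time is counted (2 + 3 + 4 = 9). The recursion is memoized, so each task is evaluated once; for very deep DAGs an iterative traversal in topological order would avoid the recursion limit.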

The lower bound of each cluster size
To achieve processor utilization, satisfying only "$sl_w(G^s_{cls}, \phi_s)$ minimization" is not enough, because this value does not guarantee the size of each cluster. Thus, in this section we present how large each cluster should be. In the literature (Kanemitsu, 2010), the lower bound of each cluster size in an identical processor system is derived as follows.
[...] (24)
Eq. (24) is the lower bound of each cluster size with which $sl_w(G^R_{cls})$ can be minimized, provided that every cluster size is above a certain threshold $\delta$. $R$ corresponds to the number of merging steps at which every cluster size is $\delta_{opt}$ or more. If the initial state of the DAG in a heterogeneous system is taken into account, $\delta_{opt}$ is expressed by $\delta_{opt}(\phi_0)$ as follows.
[...] (25)
By imposing $\delta_{opt}(\phi_0)$, it can be said that at least $sl_w(G^0_{cls}, \phi_0)$ can be minimized. However, for $s \ge 1$, $sl_w(G^s_{cls}, \phi_s)$ cannot always be minimized by $\delta_{opt}(\phi_0)$, because the mapping between clusters and processors changes, and then $sl_w(G^s_{cls}, \phi_s)$ is not equal to $sl_w(G^0_{cls}, \phi_0)$. In this chapter, one heuristic of our method is to impose the same lower bound, $\delta_{opt}(\phi_0)$, on every cluster generated by the task clustering.

Overview of the algorithm
In the previous section, we presented how large each cluster size should be set for processor utilization. In this section, we present the task clustering algorithm, which incorporates the following two requirements.
1. Every cluster size is $\delta_{opt}(\phi_0)$ or more.
2. Minimize $sl_w(G^R_{cls}, \phi_R)$, where $R$ is the total number of merging steps until the first requirement is satisfied.
Fig. 3 shows the task clustering algorithm. At first, the mapping $\phi_0$ is applied to every task. Then $\delta_{opt}(\phi_0)$ is derived. Before the main procedures, two sets are defined, i.e., $UEX_s$ and $RDY_s$. $UEX_s$ is the set of clusters whose size is smaller than $\delta_{opt}(\phi_0)$, and $RDY_s$ is defined as follows.
[...] (26)
$RDY_s$ is the set of clusters whose preceding cluster sizes are $\delta_{opt}(\phi_0)$ or more. That is, the algorithm tries to merge clusters in a top-to-bottom manner.
The algorithm proceeds while $UEX_s \ne \emptyset$, i.e., while at least one cluster exists in $UEX_s$. At line 3, one processor is selected by a processor selection method, e.g., by CHP (C. Boeres, 2004) (in this chapter, we do not present processor selection methods). At line 4, one cluster is selected as $pivot_s$, which corresponds to "the first cluster for merging". Once $pivot_s$ is selected, "the second cluster for merging", i.e., $target_s$, is needed. Thus, during lines 5 to 7, the procedures for selecting $target_s$ and merging $pivot_s$ and $target_s$ are performed. After those procedures, at line 7, $RDY_s$ is updated to become $RDY_{s+1}$, and $pivot_s$ is also updated to become $pivot_{s+1}$. The procedures at lines 6 and 7 are repeated until the size of $pivot_s$ is $\delta_{opt}(\phi_0)$ or more. The algorithm in Fig. 3 has parts in common with that of the literature (Kanemitsu, 2010), i.e., both algorithms use $pivot_s$ and $target_s$ for merging two clusters until the size of $pivot_s$ exceeds a lower bound of the cluster size. However, one difference between them is that the algorithm in Fig. 3 keeps the same $pivot_s$ during merging steps until its size exceeds $\delta_{opt}(\phi_0)$, while the algorithm in (Kanemitsu, 2010) selects a new $pivot_s$ in every merging step. The reason for keeping the same $pivot_s$ is to reduce the time complexity of selecting $pivot_s$, which requires scanning every cluster in $RDY_s$. As a result, the number of scans of $RDY_s$ can be reduced compared to that of (Kanemitsu, 2010).
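The control flow described above (maintain $UEX_s$ and $RDY_s$, select a pivot, and keep merging targets into the same pivot until its size reaches the lower bound) can be sketched as follows. This is not the algorithm of Fig. 3 itself but a simplified skeleton under several assumptions: $RDY_s$ is taken to be the undersized clusters none of whose preceding clusters are undersized, following the verbal description; the pivot rule (smallest cluster id) and the target rule (adjacent cluster with the largest exchanged data size) are placeholders for the LV-based rules; and the processor selection at line 3 is omitted. The function name `cluster_tasks` and the example graph are hypothetical.

```python
def cluster_tasks(w, c, delta):
    """Merge clusters until every cluster size is at least delta, when possible.

    w: task id -> size, c: (i, j) -> data size, delta: lower bound of cluster size.
    """
    clusters = {i: {i} for i in w}   # initially one task per cluster
    owner = {i: i for i in w}        # task id -> cluster id
    stuck = set()                    # undersized clusters that cannot grow further

    def size(cid):
        return sum(w[t] for t in clusters[cid])

    def traffic(a, b):
        """Total data size exchanged between clusters a and b."""
        return sum(v for (i, j), v in c.items()
                   if {owner[i], owner[j]} == {a, b})

    def preds(cid):
        return {owner[i] for (i, j) in c if owner[j] == cid and owner[i] != cid}

    def neighbours(cid):
        succs = {owner[j] for (i, j) in c if owner[i] == cid and owner[j] != cid}
        return succs or preds(cid)   # prefer successors (top-to-bottom merging)

    while True:
        uex = {cid for cid in clusters if size(cid) < delta and cid not in stuck}
        if not uex:
            break
        # RDY: undersized clusters whose preceding clusters are all large enough.
        rdy = {cid for cid in uex if not (preds(cid) & uex)} or uex
        pivot = min(rdy)             # placeholder for "maximum LV value"
        # Keep the same pivot until its size reaches delta.
        while size(pivot) < delta:
            cand = neighbours(pivot)
            if not cand:
                stuck.add(pivot)
                break
            # Placeholder for "the cluster which dominates the LV value of pivot".
            target = max(sorted(cand), key=lambda t: traffic(pivot, t))
            for t in clusters.pop(target):
                clusters[pivot].add(t)
                owner[t] = pivot
    return clusters


# Hypothetical example: 1 -> 2 -> 4 and 1 -> 3 -> 4.
w = {1: 2.0, 2: 3.0, 3: 1.0, 4: 4.0}
c = {(1, 2): 1.0, (1, 3): 2.0, (2, 4): 1.0, (3, 4): 3.0}
result = cluster_tasks(w, c, delta=3.0)
```

With a lower bound of 3, only tasks 1 and 3 are undersized; task 1 is the only ready cluster, and it absorbs task 3, with which it exchanges the most data. The inner loop mirrors the design choice explained above: the pivot is fixed until it is large enough, so the ready set is rescanned only once per generated cluster.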

Processor assignment
In the algorithm presented in Fig. 3, the processor assignment is performed before selecting $pivot_s$. Suppose that a processor $P_p$ is selected before the $(s+1)$-th merging step. Then we assume that $P_p$ is assigned to every cluster to which $P_{max}$ is assigned, i.e., to which no actual processor has been assigned. By doing so, such unassigned clusters are regarded as being assigned to "an identical processor system by $P_p$" in order to select $pivot_s$. Fig. 4 shows an example of the algorithm. In the figure, (a) is the state of $\phi_2$, in which the size of $cls_2(1)$ is $\delta_{opt}(\phi_0)$ or more. Thus, [...]. The communication bandwidth from $P_1$ to $P_{max}$ is set to $\min_{1 \le q \le m,\, q \ne 1} \beta_{1,q}$ in order to regard the communication bandwidth between an actual processor and $P_{max}$ as the bottleneck in the schedule length. In (b), it is assumed that every cluster in $UEX_2$ is assigned to $P_p$ after $P_p$ is selected. Bandwidths among $P_p$ are set to $\min_{1 \le q \le m,\, q \ne p} \beta_{p,q}$ to estimate the worst case of $sl_w(G^2_{cls}, \phi_2)$. Therefore, $pivot_2$ (in this case, $cls_2(3)$) is selected by deriving the LV value of each cluster in $RDY_2$ under this mapping state. After (b), if the size of $cls_3(3)$ is smaller than $\delta_{opt}(\phi_0)$, every cluster in $UEX_3$ remains assigned to $P_p$ to maintain the mapping state. Once the cluster size exceeds $\delta_{opt}(\phi_0)$, the mapping is changed, i.e., clusters in $UEX_4$ are assigned to $P_{max}$ in order to select the new $pivot_4$ for generating the next cluster.
Fig. 3. The task clustering algorithm (recoverable steps: define $UEX_s$ as the set of clusters whose size is under $\delta_{opt}(\phi_0)$; define $RDY_s$ as the set of clusters which satisfies eq. (26)).

Selection for pivot s and target s
As mentioned in 5.1, one objective of the algorithm is to minimize $sl_w(G^R_{cls}, \phi_R)$. Therefore, $pivot_s$ should be the cluster with the maximum LV value (defined in Table 1) in $RDY_s$, because such a cluster may dominate $sl_w(G^s_{cls}, \phi_s)$, and then $sl_w(G^{s+1}_{cls}, \phi_{s+1})$ after the $(s+1)$-th merging may become lower than $sl_w(G^s_{cls}, \phi_s)$. The heuristic behind the algorithm is that this policy for selecting $pivot_s$ can contribute to minimizing $sl_w(G^R_{cls}, \phi_R)$. The same requirement holds for the selection of $target_s$, i.e., $target_s$ should be the cluster which dominates the LV value of $pivot_s$. In Fig. 4 (b), $cls_2(3)$, which has the maximum LV value in $RDY_2$, is selected. Then $n^2_6$, i.e., $cls_2(6)$, which dominates $LV_2(3)$, is selected as $target_2$. Similarly, in (c), $n^3_5$, i.e., $cls_3(5)$, which dominates $LV_3(3)$, is selected as $target_3$.

Merging pivot s and target s
After $pivot_s$ and $target_s$ have been selected, the merging procedure, i.e., $pivot_{s+1} \leftarrow pivot_s \cup target_s$, is performed. This procedure means that every task in $target_s$ is included in $pivot_{s+1}$.
Then $pivot_s$ and $target_s$ are removed from $UEX_s$ and $RDY_s$. After this merging step has been performed, the clusters satisfying the requirements for $RDY_{s+1}$ (in eq. (26)) are included in $RDY_{s+1}$. Furthermore, the LV value of every cluster is updated for selecting $pivot_{s+1}$ and $target_{s+1}$ before the next merging step.

Experiments
We conducted experimental simulations to confirm the advantages of our proposal. We compared it with other conventional methods from the following points of view.
We also defined the Parallelism Factor (PF) as $\rho$, taking values of 0.5, 1.0, and 2.0 (H. Topcuoglu, 2002). By using PF, the depth of the DAG is defined as [...]. The simulation environment was developed with JRE1.6.0_0; the operating system is Windows XP SP3, the CPU is an Intel Core 2 Duo 2.66 GHz, and the memory size is 2.0 GB.
A. At first, the lower bound of the cluster size is derived as $\delta_{opt}(\phi_0)$. Then the task clustering algorithm in Fig. 3 is performed, while the processor assignment policy is based on CHP (C. Boeres, 2004).
B. The lower bound of the cluster size is derived as $\delta_{opt}(\phi_0)$. Then the task clustering policy is based on "load balancing" (J.C. Liou, 1997), while the processor assignment policy is based on CHP (C. Boeres, 2004), in which the merging steps for generating one cluster proceed until the cluster size exceeds $\delta_{opt}(\phi_0)$.
C. The lower bound of the cluster size is derived as $\delta_{opt}(\phi_0)$. Then the task clustering policy is random, i.e., two clusters smaller than $\delta_{opt}(\phi_0)$ are selected randomly and merged into one larger cluster, while the processor assignment policy is based on CHP (C. Boeres, 2004), in which the merging steps for generating one cluster proceed until the cluster size exceeds $\delta_{opt}(\phi_0)$.
The difference among A, B, and C is how clusters are merged, while they share the same lower bound for the cluster size and the same processor assignment policy. We compared $sl_w(G^R_{cls}, \phi_R)$ and the schedule length by averaging them over 100 DAGs. Tables 3 and 4 show the comparison results in terms of $sl_w(G^R_{cls}, \phi_R)$ and the schedule length. The former is the result in the case of random DAGs, and the latter is the result in the case of FFT DAGs. In both tables, α corresponds to the max-min ratio of the processing speeds in $P$, and β corresponds to the max-min ratio of the communication bandwidths in $P$. "$sl_w(G^R_{cls}, \phi_R)$ Ratio" and "$sl(G^R_{cls}, \phi_R)$ Ratio" correspond to the ratios to "A", i.e., a value larger than 1 means that $sl_w(G^R_{cls}, \phi_R)$ or $sl(G^R_{cls}, \phi_R)$ is larger than that of "A". In Table 3, it can be seen that both $sl_w(G^R_{cls}, \phi_R)$ and $sl(G^R_{cls}, \phi_R)$ in "A" are better than in "B" and "C" as a whole. In particular, the larger CCR becomes, the better both $sl_w(G^R_{cls}, \phi_R)$ and $sl(G^R_{cls}, \phi_R)$ in "A" become. No noteworthy characteristics of $sl_w(G^R_{cls}, \phi_R)$ and $sl(G^R_{cls}, \phi_R)$ can be seen when the degree of heterogeneity (i.e., α and β) is varied. The same results hold for Table 4. From those results, it can be concluded that minimizing $sl_w(G^R_{cls}, \phi_R)$ leads to minimizing the schedule length, as theoretically proved by Theorems 4.1 and 4.2.
In this experiment, we confirmed how optimal the lower bound of the cluster size, $\delta_{opt}(\phi_0)$ derived by eq. (25), is. The comparison targets in this experiment are based on "A" in Sec. 6.2, but only the lower bound of the cluster size is changed, i.e., $\delta_{opt}(\phi_0)$, $0.2\delta_{opt}(\phi_0)$, $0.5\delta_{opt}(\phi_0)$, $1.5\delta_{opt}(\phi_0)$, and $2.0\delta_{opt}(\phi_0)$. The objective of this experiment is to confirm the range of applicability of $\delta_{opt}(\phi_0)$, since $\delta_{opt}(\phi_0)$ is not necessarily the value with which $sl_w(G^s_{cls}, \phi_s)$ can be minimized for $s \ge 1$. Fig. 5 shows the comparison results in terms of the optimality of $\delta_{opt}(\phi_0)$. (a) corresponds to the case where the degree of heterogeneity is (α, β) = (5, 5), and (b) corresponds to (10, 10). From (a), it can be seen that $\delta_{opt}(\phi_0)$ yields the best schedule length among all cases while CCR ranges from 0.1 to 5.0. However, when CCR is 7 or more, $1.5\delta_{opt}(\phi_0)$ yields the best schedule length. This is because $\delta_{opt}(\phi_0)$ may be too small for a data intensive DAG. Thus, it can be said that $1.5\delta_{opt}(\phi_0)$ is a more appropriate size than $\delta_{opt}(\phi_0)$ when CCR exceeds a certain value. On the other hand, in (b), the larger CCR becomes, the better the schedule length in the case of $1.5\delta_{opt}(\phi_0)$ becomes. However, while CCR is less than 3.0, $\delta_{opt}(\phi_0)$ can be the best lower bound of the cluster size. As for the other lower bounds, $2.0\delta_{opt}(\phi_0)$ has a local maximum of the schedule length ratio when CCR ranges from 0.1 to 2.0 in both figures. For larger CCR, the schedule length ratio then decreases, because such a size becomes more appropriate for a data intensive DAG. On the other hand, in the case of $0.25\delta_{opt}(\phi_0)$, the schedule length ratio increases with CCR. This means that $0.25\delta_{opt}(\phi_0)$ becomes too small for a data intensive DAG as CCR increases.
From those results, it can be said that the lower bound of the cluster size should be derived according to the mapping state. For example, if the lower bound can be adjusted as a function of each assigned processor's ability (e.g., the processing speed and the communication bandwidth), a better schedule length may be obtained. In this chapter, the lower bound is derived by using the mapping state $\phi_0$. However, by using another mapping state, we may obtain a better schedule length. To do this, it must be considered which mapping state has a good effect on the schedule length. This point is an issue for future work.

Conclusion and future works
In this chapter, we presented a policy for deciding the size of the assignment unit for a processor and a task clustering algorithm for processor utilization in heterogeneous distributed systems. We defined the indicative value for the schedule length in heterogeneous distributed systems. Then we theoretically proved that minimizing the indicative value leads to minimization of the schedule length. Furthermore, we defined the lower bound of the cluster size by assuming the initial mapping state. From the experimental results, it is concluded that minimizing the indicative value has a good effect on the schedule length. However, we found that the lower bound of the cluster size should be adjusted by taking the ability of each assigned processor into account.
As future work, we will study how to adjust the lower bound of the cluster size to obtain a better schedule length and more effective processor utilization.

References
A. Gerasoulis and T. Yang, A Comparison of Clustering Heuristics for Scheduling Directed Acyclic Graphs on Multiprocessors, Journal of Parallel and Distributed Computing, Vol. 16, pp. 276-291, 1992.
Grid Computing - Technology and Applications, Widespread Coverage and New Horizons, www.intechopen.com
Fig. 5. Optimality for the Lower Bound of the Cluster Size.
Comparison of $sl_w(G^R_{cls}, \phi_R)$ and the schedule length
In this experiment, we compared $sl_w(G^R_{cls}, \phi_R)$ and the schedule length to confirm the validity of Theorems 4.1 and 4.2. The comparison targets are as follows.

Table 3. Comparison of $sl_w(G^R_{cls}, \phi_R)$ and the Schedule Length in Random DAGs ($|V_0| = 1000$).