Approximation Algorithm for Scheduling a Chain of Tasks for Motion Estimation on Heterogeneous Systems MPSoC

Co-design of embedded systems is an important step in the design of digital vehicles and aircraft. Multicore and multiprocessor systems-on-chip (MPSoC) started a new computing era. They are increasingly used because they give designers many more opportunities to meet specific performance targets. Designing embedded systems includes two main phases: (i) HW/SW partitioning, performed from high-level functional and architecture models (written in C/C++ under Eclipse or in Python for machine learning and deep learning, with virtual and real prototypes), and (ii) software design, performed with significantly more detailed models using task scheduling and partitioning algorithms based on the Directed Acyclic Graph (DAG) and on GGEN, Generation Graph Estimation Nodes (an automatic DAG algorithm). Partitioning decisions are made according to performance assumptions that should be validated on the more refined software models for the motion estimation (ME) block and the GGEN algorithm. In this paper, we focus on optimizing the execution time and improving the video quality by scheduling and partitioning the tasks of a video codec. We show how the test video sequences can be modeled from the frame height and width (three models of task scheduling on four processors). The DAG and GGEN models are partitioned on different platforms in OVP (partitioning, SW design). This lets us evaluate the optimization of energy consumption and execution time on SoC and MPSoC platforms.


Introduction
Multi-core and multi-processor architectures (SoC and MPSoC) started a new computing era. They are increasingly used because they give designers new opportunities to meet the requirements of embedded systems in different applications and domains. Multimedia and telecommunication streaming applications are now widely used in several domains such as video conferencing, networking, video encryption and compression, surveillance, medical services, military imaging and telecommunication applications. These applications are characterized by stringent task delay constraints. The model of task scheduling on 4 CPUs for Motion Estimation "ME" with the DAG algorithm is illustrated in Figure 1.
The motion estimation module is one of the most important and complex modules of a video codec. MPSoCs have the most suitable architecture to meet real-time high-definition "HD" encoding requirements. Real-time HD coding is based on a block matching algorithm "BMA", which locates matching blocks in a sequence of digital video frames. This technique is used to discover temporal redundancy in the video sequence, increasing the effectiveness of inter-frame video compression. The scheduling of tasks is an important step to accelerate the motion estimation process. The objective is to minimize the execution time of this algorithm by distributing the tasks that describe it over the various cores of an MPSoC. Various job scheduling and task partitioning methods exist for video codec applications, but they are not adapted to our motion estimation block problem.
In this project, we choose the second method, based on DAG and GGEN, to schedule our algorithm, for several reasons. This method reduces the complexity of the NP-complete task scheduling problem. It is automatic, generic and periodic. It refines the granularity of the tasks in the ME block. To obtain true task parallelism on SoC and MPSoC systems, we start with 4 CPUs as in [1,2]; the approach can then be extended to 8, 16, …, 1024 CPUs. Some existing task scheduling works are periodic and acyclic. These methods give an optimal solution for complex task scheduling and partitioning on MPSoC systems, as in [3,4]. In this case, we choose to parallelize tasks and instructions in the video codec blocks, which is a very interesting step for task scheduling and partitioning on embedded platforms such as FPGA, GPU and DSP targets, as in [5,6]. H 264/AVC is a video codec standard released by ITU-T and ISO/IEC [7,8]. This standard gives very good results in bit rate, image quality and other video codec criteria compared with other standards [2,7,8,9].
This paper is organized as follows. In Section II, we detail the algorithm applied to block motion estimation with the task scheduling and partitioning method on SoC and MPSoC systems; this work is designed with an acyclic algorithm (DAG-TPG) and co-designed on the OVP platform. We then present the ME module and its importance in video codec standards. In Section III, we describe a new task scheduling approach and its co-design on the OVP platform (partitioning). We then summarize the results in Section IV and finish with our conclusion.

ME in the H 264 video codec and task scheduling approach
High-quality video encoding imposes unprecedented performance requirements on real-time mobile devices. To address the competing requirements of high performance and real-time operation, manufacturers of embedded mobile multimedia devices have recently adopted MPSoC (multiprocessor system-on-chip) architectures. Despite the advances in digital mobile devices, computer systems and the iPad, the execution time, energy consumption and image quality on SoC and MPSoC systems still call for parallelism and task scheduling applied to the H 264/AVC video codec. This scheduling approach can eliminate problems in the ME blocks and the artifacts that remain in the image. First, we minimize the time delays and the execution time in the ME blocks of the video codec. Second, the video codec blocks, which are important in both structure and function, require predictive coding blocks. These structures and functions can be defined and modeled with an acyclic approach (semi-automatic or automatic): directed acyclic graphs (DAG) and generated graphs (GGEN). This approach gives a good solution for the evaluation criteria of a video codec, such as execution time and task time delays [3,4,9]. Finally, we describe the task delays on several CPUs (SoC and MPSoC targets), which are considered real-time applications [3,4,9]. In addition, video frames and video sequences should meet their deadlines and their estimates, and the image quality is very important. Several task scheduling approaches on SoC and MPSoC targets exploit the execution time and other evaluation criteria of the video codec; see [9,10,11,12].
In Table 1, we classify these representative solutions based on their optimization horizons, application models, complexity models, scheduling granularities, and considered sources of execution time. The task scheduling approach was chosen for its efficiency and implementation simplicity for the H 264 video codec. The main idea of task scheduling in ME is to decompose a video sequence into frames and each frame into macro-blocks (MB); the tasks are then scheduled and partitioned in parity order, as described in our method [9,10,11,12].

Scheduling and partitioning algorithm tasks
In previous works, scheduling and partitioning algorithms for MPSoC systems have been formulated heuristically because of the inherent complexity of the multi-machine scheduling problem. Such algorithms constitute a very important step in multimedia applications, and the underlying problem is an NP-complete step of co-design. A schedule is an assignment of each task to a machine. The load of machine $M_i$ is the total processing time of the jobs assigned to it, and the schedule length is the load on the busiest machine. We want to find a schedule of minimum length, as in [3,13]. There are three distinct approaches to these scheduling problems: queueing theory for network scheduling, deterministic scheduling and software engineering scheduling. The study of approximation algorithms for NP-complete scheduling and partitioning problems on multiprocessor systems started with the work of Graham in 1966, who analyzed a simple algorithm. When designing an approximation algorithm for such NP-complete problems, its performance can be evaluated in different ways. One option is an algorithm with a performance guarantee, which bounds the worst-case deviation from the optimum value, as in [3,4]. However, most scheduling and partitioning algorithms make assumptions on the relationship between the task dependencies, which allows scheduling algorithms to be classified. Scheduling, data partitioning and parallelism identification are three key points for an efficient application deployment. There are several approaches to manage scheduled tasks for real-time applications. The most important ones are the cyclic and acyclic approaches. Cyclic approaches have been used for Static Data Flow "SDF", Cyclo-Static Data Flow "CSDF" and Petri nets [4,13]. Acyclic approaches have used DAG, Unified Directed Acyclic Graph "UDAG", Weighted Directed Acyclic Graph "WDAG", etc.
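The definitions of load and schedule length given above, and the kind of simple greedy rule analyzed by Graham, can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the algorithm used in this paper: the function name `list_schedule` and the job durations are hypothetical, the machines are assumed identical, and the tasks are assumed independent (no precedence constraints).

```python
import heapq

def list_schedule(job_times, num_machines):
    """Greedy list scheduling: each job goes to the least-loaded machine.

    Assumptions (not from the paper): identical machines, independent jobs,
    jobs taken in the order in which they are listed.
    """
    # Heap of (current load, machine index); ties go to the lowest index.
    heap = [(0, m) for m in range(num_machines)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_machines)]
    for job, duration in enumerate(job_times):
        load, machine = heapq.heappop(heap)
        assignment[machine].append(job)
        heapq.heappush(heap, (load + duration, machine))
    loads = [0] * num_machines
    for load, machine in heap:
        loads[machine] = load
    # The schedule length is the load on the busiest machine.
    return assignment, loads, max(loads)

# Hypothetical job durations on four machines.
assignment, loads, length = list_schedule([3, 5, 2, 7, 4, 6, 1, 8], 4)
print(assignment, loads, length)
```

For this greedy rule on m identical machines, the schedule length is known to be at most (2 - 1/m) times the optimum, which is the kind of worst-case guarantee discussed above.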
In this paper, we consider the problem of partitioning and scheduling tasks with a Task Precedence Graph "TPG", which is given as a DAG. A solution based on the DAG representation has the advantage of being generic, simple and usable with a refined granularity. A node in the DAG is a task, that is, a set of instructions to be executed sequentially without preemption on the same processor. The modeling, the terminology and the mathematical formalism of the DAG algorithm are presented in [3,4].
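A task precedence graph of this kind can be sketched in Python as a dictionary of successors with one weight per node. The sketch below is only an illustration: the node names, the weights and the helper names `topological_order` and `critical_path_length` are hypothetical and are not taken from the formalism of [3,4]. It checks that the graph is acyclic and computes the longest weighted path, which is a lower bound on any schedule length.

```python
from collections import deque

def topological_order(succ):
    """Kahn's algorithm; raises ValueError if the graph has a cycle.

    `succ` maps every node to the list of its successors.
    """
    indegree = {n: 0 for n in succ}
    for n in succ:
        for s in succ[n]:
            indegree[s] += 1
    ready = deque(n for n in succ if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for s in succ[n]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(order) != len(succ):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order

def critical_path_length(succ, weight):
    """Longest weighted path in the DAG (lower bound on the schedule length)."""
    finish = {}
    for n in reversed(topological_order(succ)):
        finish[n] = weight[n] + max((finish[s] for s in succ[n]), default=0)
    return max(finish.values())

# Hypothetical four-task graph: one load task, two SAD tasks, one selection task.
succ = {"load": ["sad_a", "sad_b"], "sad_a": ["select"],
        "sad_b": ["select"], "select": []}
weight = {"load": 2, "sad_a": 5, "sad_b": 3, "select": 1}
print(topological_order(succ), critical_path_length(succ, weight))
```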

DAG algorithm
The DAG algorithm is based on the asynchronous message passing paradigm. Parallel architectures are increasingly popular, but scheduling is very difficult because the data and the program must be partitioned and distributed to processors [3,17,18]. The methodology for MPSoC in embedded applications is the Adequacy-Algorithm-Architecture "AAA". The following issues are of major importance for distributed memory architectures:
• Scheduling, data partitioning and parallelism identification.
• Data mapping and program architecture.
• Scheduling and partitioning the execution of the task.

Motion estimation "ME": application
The ME block is very important in the H 264 and H 265 video codecs. In the various standards, ME requires very complex tasks and instructions and accounts for the largest part of the video codec [2,10,11,12,19].

Principle of the ME block
The principle of ME is the following: for an MB in the current frame, we define a search window in the reference frame. There are several evaluation criteria, such as MSE, SAD, BBM, MAD and NCF. Within this window we seek the best MB using the Sum of Absolute Differences "SAD" distortion criterion given by Eq. (1), as in [2,19,20,21]; the SAD function is described in Figure 2. To reduce the complexity of the ME module, several fast search algorithms have been defined in the literature, such as DS, FS, TSS, HDS, PMVFAST, LDPS and the block-matching algorithm "BMA". Inter-coding consists in finding, in a reference frame, a block similar to the current block. This process is performed by a BMA. The general principle of BMA is to exploit the temporal redundancies between consecutive frames.
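The SAD criterion named above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `sad`, the nested-list frame layout and the use of x as a row index are assumptions. The code uses the usual form of the criterion, the sum over the block of |F(x + i, y + j) - R(i, j)|, with R the reference MB and F the current frame.

```python
def sad(current, reference_mb, x, y, n):
    """Sum of absolute differences between the n x n reference MB `reference_mb`
    and the n x n block of `current` whose top-left corner is at (x, y).

    Assumption: frames are lists of rows of integer pixel values.
    """
    return sum(
        abs(current[x + i][y + j] - reference_mb[i][j])
        for i in range(n)
        for j in range(n)
    )

# Hypothetical 3 x 3 frame and 2 x 2 reference block.
frame = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
block = [[5, 6],
         [8, 9]]
print(sad(frame, block, 1, 1, 2), sad(frame, block, 0, 0, 2))
```

A SAD of zero means the two blocks are identical; the larger the value, the poorer the match.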

Block matching algorithm "BMA"
Inter-coding consists in finding a block similar to the current block of a reference frame. This process is performed by a block-matching algorithm. The general principle of the BMA is to exploit existing temporal redundancies between consecutive frames. This method involves searching, for each point of the frame of interest $I_t$, the point of the frame $I_{t+1}$ which maximizes a correlation score. The search is performed in a search block [7,10,11,12,19]. The principle of this method is described in Figure 3.
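The block-matching search described above can be sketched as an exhaustive (full) search. The sketch makes several assumptions that are not taken from the paper: the name `full_search`, the nested-list frames, a square search window of plus or minus `p` pixels, and a minimum-SAD criterion used in place of the maximum correlation score mentioned in the text.

```python
def full_search(cur, ref, bx, by, n, p):
    """Return (best_cost, (dx, dy)) for the n x n block of `cur` at (bx, by).

    Every displacement (dx, dy) with |dx|, |dy| <= p is tested, provided the
    candidate block stays inside the reference frame `ref`.
    """
    rows, cols = len(ref), len(ref[0])
    best = None
    for dx in range(-p, p + 1):
        for dy in range(-p, p + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + n > rows or ry + n > cols:
                continue  # candidate block would leave the frame
            cost = sum(
                abs(cur[bx + i][by + j] - ref[rx + i][ry + j])
                for i in range(n)
                for j in range(n)
            )
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best

# Hypothetical 6 x 6 reference frame with distinct pixel values.
ref = [[10 * r + c for c in range(6)] for r in range(6)]
# Current frame: the 2 x 2 block at (2, 2) is a copy of the reference block at (3, 1).
cur = [[0] * 6 for _ in range(6)]
for i in range(2):
    for j in range(2):
        cur[2 + i][2 + j] = ref[3 + i][1 + j]
print(full_search(cur, ref, 2, 2, 2, 2))
```

The cost of this search grows with the square of the window size, which is why the fast search algorithms listed in the literature test only a subset of the displacements.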
An object of interest is detected when the number of corresponding blocks in the previous and current frames is higher than a certain threshold. The threshold value is obtained experimentally [22]. The SAD function is defined in Eq. (1), where $R(i, j)$ denotes the pixels of the reference MB and $F(x + i, y + j)$ denotes the pixels of the current MB. Figure 4 shows the flow chart of the ME block of the H 265 video codec. We use a padding method in order to enforce a whole number of packets: the padding is added to the end of a packet. The scheduling methodology is defined in Figure 5, which illustrates the flow chart of the MB (16 * 16) for the H 265 video codec. The new idea in the H 265 flow chart is to work with the padding technique rather than with the affinity of the granularity at (1/2), (1/4) and (1/8) pixel.

DAG applied to ME
In this section, we describe our new DAG-based approach for task scheduling. First, the method must be generic, so that it can be validated for any frame size. We then chose to schedule the tasks of the ME blocks according to block parity. Four combinations are possible: odd-odd, odd-even, even-odd and even-even. The test sequences used in our work are modeled with three models for the task scheduling method. The generic frame size is ("X = N", "Y = M"), where "X" is the number of pixel rows and "Y" the number of pixel columns. After scheduling with the three models, a problem appears at the border of the image or frame. To solve it, we apply a padding method that adds rows and columns of empty pixels to the frame. Note that "N" must be an even value divisible by 4. Figure 6 shows the workflow of an entire frame of size (N*M).
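The padding step and the four parity classes described above can be sketched as follows. This is a sketch under assumptions, not the paper's procedure: the helper names `pad_frame` and `parity_groups`, the zero-valued "empty" pixels, the 16-pixel macro-block size as default and the 0-based macro-block indices are all hypothetical choices.

```python
def pad_frame(frame, mb=16, fill=0):
    """Append rows and columns of `fill` so both dimensions are multiples of mb."""
    rows, cols = len(frame), len(frame[0])
    new_rows = -(-rows // mb) * mb  # round up to the next multiple of mb
    new_cols = -(-cols // mb) * mb
    padded = [row + [fill] * (new_cols - cols) for row in frame]
    padded += [[fill] * new_cols for _ in range(new_rows - rows)]
    return padded

def parity_groups(num_mb_rows, num_mb_cols):
    """Split macro-block coordinates into four groups by index parity.

    Assumption: macro-blocks are indexed from 0, so index 0 counts as even.
    """
    names = {(1, 1): "odd-odd", (1, 0): "odd-even",
             (0, 1): "even-odd", (0, 0): "even-even"}
    groups = {name: [] for name in names.values()}
    for r in range(num_mb_rows):
        for c in range(num_mb_cols):
            groups[names[(r % 2, c % 2)]].append((r, c))
    return groups

# Hypothetical 5 x 7 frame padded to a multiple of 4 in each dimension.
padded = pad_frame([[1] * 7 for _ in range(5)], mb=4)
print(len(padded), len(padded[0]))
print(parity_groups(len(padded) // 4, len(padded[0]) // 4))
```

With four processors, each parity group can be mapped to one processor, which matches the four groups mentioned in the text.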
For each node in the task graph, one must compute the start and end times of its execution cycles by using the weight of each node. The Gantt chart represents the scheduling of tasks on the processors over time and gives the order of the tasks of the ME block. This Gantt chart is illustrated in Figure 7. Clustering is the placement of the various tasks of our application on different clusters. It depends on the number of processors in the implementation platform. In our case, we chose a platform with four processors. We therefore need four groups and four instants of time, and we obtain four outputs.
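The computation of start and end times from node weights, and the resulting Gantt-style placement on processors, can be sketched as below. The sketch rests on assumptions that are not from the paper: the function name `schedule_dag`, an earliest-start placement rule, zero communication cost between processors, and a hypothetical five-node graph (four parity nodes followed by one output node).

```python
def schedule_dag(order, pred, weight, num_procs):
    """Place the nodes of a DAG on processors and compute their times.

    order: nodes listed in a topological order
    pred:  node -> list of predecessor nodes
    Returns (ts, tf, proc_of): start time, end time and processor per node.
    """
    proc_free = [0] * num_procs  # time at which each processor becomes idle
    ts, tf, proc_of = {}, {}, {}
    for n in order:
        # A node can start only after all of its predecessors have finished.
        ready = max((tf[p] for p in pred[n]), default=0)
        # Choose the processor allowing the earliest start (lowest index on ties).
        proc = min(range(num_procs),
                   key=lambda q: (max(proc_free[q], ready), q))
        ts[n] = max(proc_free[proc], ready)
        tf[n] = ts[n] + weight[n]  # end time = start time + node weight
        proc_free[proc] = tf[n]
        proc_of[n] = proc
    return ts, tf, proc_of

# Hypothetical graph: four parity nodes of weight 4 feeding one output node.
order = ["n_ii", "n_ij", "n_ji", "n_jj", "out"]
pred = {"n_ii": [], "n_ij": [], "n_ji": [], "n_jj": [],
        "out": ["n_ii", "n_ij", "n_ji", "n_jj"]}
weight = {"n_ii": 4, "n_ij": 4, "n_ji": 4, "n_jj": 4, "out": 1}
ts, tf, proc_of = schedule_dag(order, pred, weight, 4)
for n in order:
    print(n, "processor", proc_of[n], "start", ts[n], "end", tf[n])
```

Printing one line per node in this way gives the textual equivalent of a Gantt chart: each node has a processor, a start time and an end time.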

Parameters setting
For a set P of processors, each node must be assigned to a single processor. Of all the addressable memory, we give a portion of the same space to each case. When setting up the parameters, we have to take into consideration the time constraints and the constraints of the HW/SW targets. The classical assumptions made on the target MPSoC are:
• The tasks are non-preemptive.
• Local communication is costless.
The following equations show the task scheduling of a classic frame. First, we treat the odd tasks and the odd nodes in the TPG graph, with $t_s(n_{ii})$ and $w(n_{ii})$ denoting the start time and the weight of the odd nodes.
Finally, we compute the equations of the even tasks, where $t_s(n_{jj})$ is the start time of the node $n_{jj}$ and $w(n_{jj})$ is the weight of this node:
$$t_f(n_{jj}) = t_s(n_{jj}) + w(n_{jj})$$
This gives the end time "tf" of the node $(n_i, n_j)$. The task sequence of the ME blocks is defined in Figure 8.
In general, the nodes and tasks in the TPG graph are made of the nodes $n_{ii}$, $n_{ij}$, $n_{ji}$, $n_{jj}$. They are illustrated in Eqs. (4)-(7).
Figure 7. Gantt chart of the ME block.
The execution time for the node of the test sequence is given by Eq. (8) below.
For the nodes or the tasks executed on different processors, the execution period is computed in the output frame for the current reconstruction frame. The nodes are spread by parities over each processor. In the first level, we have 16 nodes. In Level II, we add nodes to the neighbor to reconstruct the macro-blocks. The same is done for all levels, the goal being to reconstruct the frame. In our case, we have four groups since both architectures test four processors. We compute the execution period in Eq. (9) below.
Hence, the set of end times is computed in the three models for scheduling and partitioning tasks. This algorithm is applied to the ME blocks for the test video sequence in the H 265 video codec: $t_f(n_{ii}, n_{ii})$, $t_f(n_{ij}, n_{ij})$, $t_f(n_{jj}, n_{jj})$, $t_f(n_{ji}, n_{ji})$. It is applied to all nodes in the graph found for the task with the ME block using the test sequence "Akiyo". Two processors $p_i$ and $p_j$ are isomorphic if their read times are equal. Then the weights of the odd nodes or tasks $w(n_i)$ are equal to the weights of the nodes or tasks $w(n_j)$. Thus, the set of nodes is $n_{ii}$, $n_{ij}$, $n_{jj}$, $n_{ji}$ and the set of tasks is $T_{ii}$, $T_{ij}$, $T_{jj}$, $T_{ji}$. Succeeding in scheduling and partitioning the tasks with DAG is equivalent to the following conditions:
• $Pred(n_i) = Pred(n_j)$
• $W(n_i) = W(n_j)$
• $Succ(n_i) = Succ(n_j)$
In our case, the platforms contain four processors, therefore we seek other processors satisfying the conditions. Based on the assumptions above, we can deduce other parameters. The schedule length of the task scheduling and partitioning problem on $(P + 1)$ processors is always less than or equal to that on $P$ processors.
Where $n_{xx}$ denotes the set of nodes $n_{ii}$, $n_{ij}$, $n_{ji}$, $n_{jj}$. This equation is applied to the three models, where "SL(S)" is the schedule length of a schedule S and "Sopt(P)" is the optimal schedule on P processors. For our application, we have four processors. To apply the hypothesis, we must know the maximum execution time over all nodes ("X" nodes = "X" tasks).
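The three conditions on Pred, W and Succ stated above can be checked mechanically. The following Python sketch groups the nodes of a graph into classes of nodes that share the same predecessors, the same weight and the same successors. The function name `group_equivalent`, the node names and the weights are hypothetical, and the grouping is only one possible reading of how the conditions are used.

```python
from collections import defaultdict

def group_equivalent(nodes, pred, succ, weight):
    """Group nodes n_i, n_j such that Pred, W and Succ are all equal."""
    classes = defaultdict(list)
    for n in nodes:
        # Two nodes fall in the same class exactly when this key is identical.
        key = (frozenset(pred[n]), weight[n], frozenset(succ[n]))
        classes[key].append(n)
    return list(classes.values())

# Hypothetical graph: src -> four parity nodes -> out.
nodes = ["src", "n_ii", "n_ij", "n_ji", "n_jj", "out"]
parity = ["n_ii", "n_ij", "n_ji", "n_jj"]
pred = {"src": [], "out": parity, **{n: ["src"] for n in parity}}
succ = {"src": parity, "out": [], **{n: ["out"] for n in parity}}
weight = {"src": 1, "n_ii": 4, "n_ij": 4, "n_ji": 4, "n_jj": 5, "out": 1}
print(group_equivalent(nodes, pred, succ, weight))
```

In this hypothetical example, three parity nodes satisfy the conditions pairwise, while the node with a different weight ends up in a class of its own.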

Applied DAG algorithm with ME in video codec
The flow chart of the global steps is illustrated in Figure 9. The scheduling steps are summarized in Table 2.

How to compute the execution time?
The processing time of an instruction must be calculated first, since the message transfer time depends on the computation of the instructions executed by a processor. Each instruction requires several clock cycles and is executed in as many steps. Sequential microprocessors execute the next instruction only when they have finished the previous one. With instruction parallelism, the microprocessor can process several of these steps at the same time for several different instructions, because different internal resources are used. In other words, the processor executes instructions that are at various stages of completion in parallel. This execution queue is called a pipeline. This mechanism was implemented for the first time in the 1960s by IBM. Figure 10 describes the canonical example of this type of pipeline, a RISC processor with five stages. Sequencing instructions in a processor with a 5-stage pipeline needs 9 cycles to execute 5 instructions. At t = 5, all stages of the pipeline are busy, and 5 operations occur simultaneously.
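The cycle count quoted above (9 cycles for 5 instructions on a 5-stage pipeline) follows from a simple formula, which the Python sketch below reproduces. It assumes an ideal pipeline, with one instruction issued per cycle and no stalls or hazards; the function names are hypothetical.

```python
def pipeline_cycles(num_instructions, num_stages):
    """Cycles needed by an ideal pipeline: fill time plus one cycle per instruction."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1

def sequential_cycles(num_instructions, num_stages):
    """Cycles needed when each instruction must finish before the next starts."""
    return num_instructions * num_stages

def stage_occupancy(num_instructions, num_stages):
    """Number of busy stages at each cycle t = 1, 2, ..., total."""
    total = pipeline_cycles(num_instructions, num_stages)
    busy = []
    for t in range(1, total + 1):
        first = max(0, t - num_stages)    # instructions that have already left
        last = min(num_instructions, t)   # instructions issued so far
        busy.append(last - first)
    return busy

print(pipeline_cycles(5, 5), sequential_cycles(5, 5))
print(stage_occupancy(5, 5))
```

The occupancy list rises to its maximum at t = 5, when all five stages are busy, and then drains as the last instructions complete.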

How to compute the transfer time?
Figure 10. The canonical example of this type of pipeline is that of a RISC processor (5 stages).
Assume that the data transfer time is proportional to the size of the data exchanged between processors; this transfer time is the time needed to send the data of tasks between processors. "Tt" is the transfer time and "Td" is the data size; $k(p_1, p_2)$ is the transfer duration between p1 and p2. Tt is defined by Eq. 11:
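Eq. 11 itself is not reproduced in this text. Under the proportionality assumption stated above, one plausible reading is Tt = k(p1, p2) * Td, with no cost when both tasks run on the same processor (local communication is assumed costless). The Python sketch below encodes this reading only; the function name `transfer_time`, the table `k` and its values are hypothetical.

```python
def transfer_time(data_size, p1, p2, k):
    """Assumed model: Tt = k(p1, p2) * Td, and 0 for a local transfer.

    k maps a pair of processors (p1, p2) to a transfer time per data unit.
    """
    if p1 == p2:
        return 0.0  # local communication is assumed to be costless
    return k[(p1, p2)] * data_size

# Hypothetical per-unit transfer costs between processors 0 and 1.
k = {(0, 1): 0.5, (1, 0): 0.5}
print(transfer_time(16 * 16, 0, 1, k), transfer_time(16 * 16, 0, 0, k))
```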

Model of sequences
The execution time of an instruction, the message transfer time between processors, and the tasks processed by the ME block on the different processors are presented in Table 3.
Virtual prototyping with OVP environment
We work with a virtual prototype project in the OVP simulator. The project strategies and the simulations in OVP are illustrated in [23]. The co-simulation is described in Figure 11, which shows the implementation and the prototype in OVP (with the thread method on the CPUs).
Affinity scheduling granularity of the ME algorithm
The following diagram shows the flow chart of the (16 * 16) mode for video coding in H 264/AVC. The difference from the flow chart of the H 264/AVC standard is the interpolation technique, used instead of the affinity method with a granularity of 1/2, 1/4 and 1/8 pixel. Using the interpolation technique minimizes the number of jobs and the number of levels in the task graph. As a result, the execution time of the ME block is optimized for the HEVC/H 265 video codec standard compared with the older H.264/AVC standard. Thus, we have a one-level TPG-DAG graph. Both graphs improve accuracy and frame quality. In this section, we consider the communication times between different tasks, treated as independent tasks on the same processor or on different processors, as shown in Figure 12. We define the communication delays between tasks and the schedule lengths. Figure 12 presents the partitioning and scheduling algorithm for the ME block in the H 265/HEVC video codec, together with the different communications and mappings for the different processors in the OVP co-simulations. We also see that our scheduling and partitioning algorithm is optimal within the approximation and modeling formalisms considered.

The results of the ME block applied to scheduling tasks approach
Some experimental curves of the execution time of the ME block scheduled with the DAG algorithm are presented in Figure 13; they were obtained in OVP for SoC and MPSoC platforms based on the ARM Cortex A9MP. The most important metric in our work is the execution time. After simulating the execution time, we chose co-simulation in OVP. This virtual platform lets us try a variety of SoC and MPSoC targets. The execution time is improved compared with other results in the literature with respect to the number of frames in the test video sequences. For more details, see [9]; in our work we present different values for the three models of task scheduling and partitioning co-designed on the OVP platform (SW/HW). Our DAG scheduling algorithm applied to the ME block of the video codec is optimal because $(0 \le C \le 0.5)$. $T_1$ is the execution time for one processor and $T_2$ is the execution time for many processors. The gain for the application is computed in Eq. 12.
Also, our parameters for the DAG and GGEN algorithms applied to the ME blocks of the video codec are significant when compared with other task scheduling approaches. In Figure 13, we give the formulations for scheduling dependent tasks onto homogeneous multiprocessor architectures for an arbitrary DAG and GGEN, taking into account communication delays. The execution time is in seconds. $T_1$ is the execution time for simulations in C/C++. $T_2$ is the execution time for simulations on the SoC platform in OVP. $T_3$ and $T_4$ are the execution times when scheduling and partitioning tasks on the MPSoC system in OVP. $P_1$ is the Versatile ARM Cortex A9MP*4 platform and $P_2$ is the Ukernel ARM Cortex A9MP*4 platform.
We observe that the execution times obtained in simulation are much higher than those obtained in co-simulation on the OVP virtual platform, which shows a significant reduction. Figure 14 shows the results obtained for the gain over the entire test frame with the DAG algorithm. From Figure 14, we can see that our technique reduces the execution time of the ME block. The experimental results illustrate the substantial improvement in execution time. The DAG task scheduling and partitioning algorithm gives true parallelism with 4 processors on SoC and MPSoC targets. For these comparisons, the metric is the execution latency "t" for the SoC system. We selected one SoC and two MPSoC platforms (the MPSoC models are: a platform including the ARM Cortex-A9MPx4 to run the ARM MPCore Sample Code, and Versatile Express booting Linux on Cortex-A9MP Single, Dual and Quad Core in OVP). We note that the execution time decreases when changing the platform, from SoC to MPSoC systems. We deduce that the scheduling methodology has beneficial results for the execution times in OVP for the various test video sequences.
In this part, we evaluate the criteria of the application treated in our research work. We calculate the complexity of the ME block of the H 264 video codec in Eq. (14). Table 5 presents the results of the function F(sc) for the ME blocks. F(sc) is the function for calculating the complexity of the ME block, where "(w, h)" is the frame size of the test sequence, "p" is the maximum authorized displacement, "N" is the size of the MB processed, "h" is the factor and "w" is the width.
We notice that the sequences used in our research are very complex, so we need an optimal, precise, generic and automatic approach. With it, we obtain good results for the whole H 264/AVC video codec.

Conclusion
In this paper, we presented a new DAG task scheduling and partitioning algorithm applied to ME for MPSoC platforms in OVP. The main contributions of our approach are the following: semi-automatic task scheduling and partitioning, performance with respect to granularity, high quality, accuracy and a short execution time for the ME blocks of the H 265 video codec. Stemming from complexity and profiling analysis, the DAG algorithm with ME blocks is presented in OVP for three platform architectures (SoC and MPSoC systems). The results of the SoC and MPSoC prototypes in OVP highlight that the processors have interesting performance, complexity and execution times compared with other published solutions. This is visible in the tables and figures. The high-level HW/SW co-design is also presented for SoC and MPSoC systems in OVP, with the IP of the DAG algorithm applied to the ME blocks of the H 265 video codec. Our DAG scheduling and partitioning algorithm is able to handle the execution time efficiently with very limited resources, dropping the appropriate tasks in order to reduce the deadline time and the execution time in the ME blocks. It adapts semi-automatically to the characteristics of each application and does not require any offline profiling data. The DAG model is a low-complexity solution applied to the ME blocks of H 265, compared with other approaches. Besides, our results for the H 265 video codec have demonstrated that our proposed low-complexity solution for the general DAG model reduces the execution time by up to 70 %. The execution times computed theoretically are almost the same as those found in OVP. We have also shown how our solution efficiently adapts to the DAG type, and scales well with the number of cores and the number of deadlines considered in the buffer.
The design of the bus and the interface of the SoC and MPSoC platforms based on the ARM Cortex A9MP is also described, allowing direct integration of the IP cores with the on-chip communication used in the MPSoC for the H 265 video codec. In future work, we will try to minimize other metrics in the H 265 video codec. We will use an automatic DAG scheduling and partitioning algorithm. We will also implement this work on a real target based on the ARM Cortex A9MP with four processors, named Embest SABER Lite (SABER Lite-i.MX 6 Quad development target).