
Performance Estimation of Task Graphs Based on Path Profiling

Published in: International Journal of Parallel Programming

Abstract

Correctly estimating the speed-up of a parallel embedded application is crucial to efficiently compare different parallelization techniques, task graph transformations, or mapping and scheduling solutions. Unfortunately, especially in the case of control-dominated applications, task correlations may heavily affect the execution time of the solutions, and this is usually not properly taken into account during performance analysis. We propose a methodology that combines a single profiling of the initial sequential specification with the different decisions in terms of partitioning, mapping, and scheduling in order to better estimate the actual speed-up of these solutions. We validated our approach on a multi-processor simulation platform: experimental results show that our methodology, by effectively identifying the correlations among tasks, significantly outperforms existing approaches for speed-up estimation. Indeed, we obtained an average absolute error below 5 %, even when compiling the code with different optimization levels.



Author information

Corresponding author

Correspondence to Marco Lattuada.

Appendix 1: Example of Application of Task Graph Estimation Technique based on Path Profiling

This appendix shows how the proposed methodology is applied to estimate the performance of the example presented in Sect. 2 when SolB is considered: Task1 and Task3 are assigned to \(CPU_\alpha\), while Task2a and Task2b are assigned to \(CPU_\beta\). The resulting \(\overline{HTG}\) is shown in Fig. 7: the edge \(<Task1,Task3>\) is added to represent the scheduling order, as discussed in Sect. 5.2. The estimation starts by applying the Hierarchical Path Profiling (HPP) on the host machine, whose results are reported in Table 11. For readability, we also report the sequence of basic blocks that composes each path, even if this information is equivalent to the one provided by the corresponding CRP. The order of the Control Dependence Regions in a Control Region Path is not relevant, since the basic blocks are interleaved during the execution. The table shows how HPP is able to profile the paths and to collect correlations about the execution of basic blocks before and after a loop, even when the loop is executed, with a representation that can be easily mapped onto the HTG.
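To make this representation concrete, the following minimal sketch shows one possible way to accumulate such a hierarchical profile during a sequential run: each loop keeps a counter per Control Region Path, stored as an unordered set of Control Dependence Regions. It is an illustration only; all names (LoopProfile, record_iteration) are hypothetical and not taken from the paper's toolchain.

```python
from collections import Counter

class LoopProfile:
    """Per-loop (hierarchical) profile: CRP -> observed frequency."""
    def __init__(self, loop_id):
        self.loop_id = loop_id
        self.crp_frequencies = Counter()  # frozenset of CDR ids -> count

    def record_iteration(self, executed_cdrs):
        # A CRP is kept as an unordered set of Control Dependence Regions:
        # the order is irrelevant because basic blocks interleave at run time.
        self.crp_frequencies[frozenset(executed_cdrs)] += 1

# Illustration: recording one execution of the L5 path that traverses
# the regions B, E and I of the running example.
profile_L5 = LoopProfile("L5")
profile_L5.record_iteration({"B", "E", "I"})
```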

Fig. 7: \(\overline{HTG}\) created from the HTG considering SolB. a Task Graph of \(L_0\). b Task Graph of \(L_5\)

Table 11 Results of applying the Hierarchical Path Profiling Technique to the example of Fig. 1

Before estimating the execution time (\(HTC_0\)) of func_0, \(HTC_5\) is estimated as follows:

  1. the contribution \(BC_{i,t}\) of each basic block is computed (line 2 of Algorithm 1): the results are reported in Table 12a (e.g. \(BC_{9,6} = f(o_{13}) = 1\) since \(o_{13}\) is the only statement of \(BB_9\));

  2. the contribution \(\overline{BC}_{i,t}\) of each basic block, including nested loops, is computed (lines 4 and 6; Table 13a; e.g. \(\overline{BC}_{7,6} = BC_{7,6}\) since Task6 is simple);

  3. the contribution \(CC_{c,t}\) is computed by summing the contributions of the single basic blocks (line 9; Table 14a; e.g. \(CC_{E,6} = \overline{BC}_{6,6} + \overline{BC}_{9,6} = 3\) since \(CDR_E\) is composed of \(BB_6\) and \(BB_9\));

  4. the contributions of the single Control Dependence Regions are summed to compute the contributions \(TPC_{p,t}\) (line 13; Table 15a; e.g. for the path composed of \(CDR_B\), \(CDR_E\) and \(CDR_I\), \(TPC_{p,6} = CC_{B,6} + CC_{E,6} + CC_{I,6} = 6\));

  5. the overhead for the task management is added to \(TPC_{p,t}\) to compute \(\overline{TPC}_{p,t}\); since there is no overhead cost in this task graph, \(\overline{TPC}_{p,t} = TPC_{p,t}\) (line 15; Table 16a; e.g. \(\overline{TPC}_{p,6} = TPC_{p,6}\) since Task6 has no overhead cost);

  6. the start and end times of each task are computed (lines 20 and 21; Table 17a; e.g. \(START_{p,6} = STOP_{p,Entry_5}\) since \(Entry_5\) is the only predecessor of Task6, and \(STOP_{p,6} = START_{p,6} + \overline{TPC}_{p,6}\)); the execution times of the two paths are then computed as the end time of the task Exit (line 25; last line of Table 17a; e.g. \(PC_{p} = STOP_{p,Exit_5}\));

  7. the estimation of the whole \(HTG_5\) can be computed (line 27) as the frequency-weighted average of the path costs, scaled by the number of loop iterations \(N_5\); a sketch of this computation is shown right after the list:

    $$\begin{aligned} HTC_5 = N_5 \cdot \frac{\sum_{p} PC_{p} \cdot f_{p}}{\sum_{p} f_{p}} = 10 \cdot \frac{105 \cdot 100 + 6 \cdot 0}{100 + 0} = 1050 \end{aligned}$$
    (13)
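As a concrete illustration of step 7, the following sketch reproduces Eq. (13) with the numbers of the example; the helper name loop_contribution is hypothetical and not part of the authors' implementation.

```python
def loop_contribution(n_iterations, path_costs, path_frequencies):
    """Frequency-weighted average of per-CRP costs, scaled by the iteration count."""
    weighted = sum(pc * f for pc, f in zip(path_costs, path_frequencies))
    return n_iterations * weighted / sum(path_frequencies)

# Numbers of the running example: two CRPs of L5 with costs 105 and 6 cycles,
# observed 100 and 0 times, and N_5 = 10 iterations.
HTC_5 = loop_contribution(10, [105, 6], [100, 0])
assert HTC_5 == 1050
```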

After \(HTC_5\) has been estimated, \(HTC_0\) can be estimated in the same way; Fig. 8 shows how the different contributions are combined. These contributions are:

  1. the contribution of each basic block \(BC_{i,t}\) (line 2), obtained from the clock cycles of Table 1; the results are reported in Table 12b;

  2. the contribution of each basic block including nested loops \(\overline{BC}_{i,t}\) (lines 4 and 6); the results are reported in Table 13b; note in particular that \(\overline{BC}_{5,2a}=BC_{5,2a} +HTC_5=1+1050\);

  3. the contribution of each Control Dependence Region \(CC_{c,t}\) (line 9); the results are reported in Table 14b;

  4. the contribution of each path to each task \(TPC_{p,t}\) (line 13); the results are reported in Table 15b;

  5. the contribution of each path to each task, along with the overhead cost, \(\overline{TPC}_{p,t}\) (line 15); the creation cost (50) is added to Task1 and Task2a, while the synchronization and destruction cost (10) is added to Task3 and Task2b; the results are reported in Table 16b;

  6. \(START_{p,t}\) and \(STOP_{p,t}\) (lines 20 and 21); the results are reported in Table 17b, where the selected topological order is \(Entry_0\)-Task0-Task1-Task2a-Task2b-Task3-Task4-\(Exit_0\) (a sketch of this computation is shown right after the list);

  7. the contribution of each path \(PC_{p}\) (line 25); the results are reported in the last line of Table 17b;

  8. \(HTC_0\) in the two cases presented in Sect. 2:

  • in the first case, given the two CRPs executed, the execution time estimated for the parallel version is:

    $$\begin{aligned} HTC_{0} = \frac{\sum_{p} PC_{p} \cdot f_{p}}{\sum_{p} f_{p}} = \frac{4171 \cdot 5 + 1126 \cdot 5}{5 + 5} = 2648.5 \end{aligned}$$
    (14)

  • in the second case, given the two CRPs executed, the execution time estimated for the parallel version is:

    $$\begin{aligned} HTC_{0} = \frac{\sum_{p} PC_{p} \cdot f_{p}}{\sum_{p} f_{p}} = \frac{2122 \cdot 5 + 2122 \cdot 5}{5 + 5} = 2122 \end{aligned}$$
    (15)
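The following sketch illustrates steps 6 and 7 for a single CRP: START and STOP times are propagated over the task graph in topological order, and the path cost \(PC_p\) is the end time of the Exit task. The function and data-structure names are illustrative assumptions, not the paper's code.

```python
def path_cost(topo_order, predecessors, task_path_cost):
    """START/STOP propagation for one CRP over the extended task graph."""
    start, stop = {}, {}
    for task in topo_order:
        preds = predecessors.get(task, [])
        # START_{p,t}: a task can start only after all its predecessors
        # (data dependences plus scheduling edges such as <Task1, Task3>) end.
        start[task] = max((stop[q] for q in preds), default=0)
        # STOP_{p,t} = START_{p,t} + TPC-bar_{p,t} (0 if the task is not on this path).
        stop[task] = start[task] + task_path_cost.get(task, 0)
    # PC_p is the end time of the Exit task, the last node in the order.
    return stop[topo_order[-1]]
```

\(HTC_0\) is then obtained as the frequency-weighted average of the resulting \(PC_p\) values over the executed CRPs, as in Eqs. (14) and (15).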
Fig. 8: Composition of the contributions to produce \(\textit{HTC}_0\) when c1 and c2 always have the same values; the contributions (rounded rectangles) are computed from the top to the bottom of the graph using the operations described in the proposed methodology (rhombuses); the corresponding levels are reported on the left of the figure

Table 12 Contribution \(BC_{i,t}\)
Table 13 Contribution \(\overline{BC}_{i,t}\)
Table 14 Contribution \(CC_{c,t}\)
Table 15 Contribution \(TPC_{p,t}\)
Table 16 Contribution \(\overline{TPC}_{p,t}\)
Table 17 Starting and ending times of tasks

Finally, the speed-up for the two situations presented in Sect. 2 can be computed. The execution time of the sequential specification is 3123 cycles in both cases, so the estimated speed-ups are 1.18 and 1.47, respectively.
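For completeness, these speed-ups are simply the ratios between the sequential execution time and the parallel execution times estimated in Eqs. (14) and (15):

$$\begin{aligned} S_{1} = \frac{3123}{2648.5} \approx 1.18, \qquad S_{2} = \frac{3123}{2122} \approx 1.47 \end{aligned}$$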


Cite this article

Lattuada, M., Pilato, C. & Ferrandi, F. Performance Estimation of Task Graphs Based on Path Profiling. Int J Parallel Prog 44, 735–771 (2016). https://doi.org/10.1007/s10766-015-0372-7
