Abstract
Heterogeneous embedded systems that combine multi-core CPUs with graphics processing units (GPUs) are increasingly common, and they make it challenging to exploit pipeline, task, and data-level parallelism well enough to meet the throughput requirements of digital signal processing applications. Moreover, under system-level memory constraints, hand-optimizing code to satisfy these requirements is inefficient and error prone, and can therefore greatly lengthen development time or leave processing resources highly underutilized. In this article, we present vectorization and scheduling methods that exploit multiple forms of parallelism to optimize throughput on hybrid CPU-GPU platforms while conforming to system-level memory constraints. The methods operate on synchronous dataflow representations, which are widely used in the design of embedded systems for signal and information processing. We show that our novel methods can significantly improve system throughput compared to previous vectorization and scheduling approaches under the same memory constraints. In addition, we present a practical case study in which our methods significantly improve the throughput of an orthogonal frequency division multiplexing receiver system for wireless communications.
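To make the trade-off between vectorization and memory concrete, the following is a minimal sketch and not the method of this article. It assumes a connected synchronous dataflow graph without initial tokens, a single vectorization degree `b` applied uniformly to all actors, and a simple block schedule in which each edge buffers every token produced during one vectorized iteration. Memory is counted in tokens rather than bytes. The graph, the names `repetitions_vector`, `buffer_tokens`, and `max_vectorization_degree`, and the budget value are all hypothetical.

```python
from fractions import Fraction
from math import lcm  # requires Python 3.9+


def repetitions_vector(actors, edges):
    """Solve the SDF balance equations q[src] * prod == q[dst] * cons.

    edges is a list of (src, dst, prod, cons) tuples. Returns the minimal
    positive integer firing counts per actor for one graph iteration.
    """
    rate = {actors[0]: Fraction(1)}
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, dst, prod, cons = edge
            if src in rate and dst in rate:
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent sample rates")
            elif src in rate:
                rate[dst] = rate[src] * prod / cons
            elif dst in rate:
                rate[src] = rate[dst] * cons / prod
            else:
                continue  # neither endpoint reached yet; retry later
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    scale = lcm(*(r.denominator for r in rate.values()))
    return {actor: int(r * scale) for actor, r in rate.items()}


def buffer_tokens(q, edges, b):
    """Total tokens buffered when every actor fires b * q[actor] times
    back to back (assumed block schedule, no buffer sharing)."""
    return sum(b * q[src] * prod for src, _dst, prod, _cons in edges)


def max_vectorization_degree(q, edges, budget_tokens):
    """Largest uniform degree b whose buffers fit the budget (0 if none)."""
    return budget_tokens // buffer_tokens(q, edges, 1)


# Hypothetical three-actor chain: A -(2:3)-> B -(1:2)-> C
actors = ["A", "B", "C"]
edges = [("A", "B", 2, 3), ("B", "C", 1, 2)]
q = repetitions_vector(actors, edges)
best_b = max_vectorization_degree(q, edges, budget_tokens=40)
```

Under these assumptions, buffer memory grows linearly with `b`, so a larger degree exposes more data parallelism per firing at the cost of a proportionally larger footprint. A practical method would also have to account for actor-to-processor mapping, CPU-GPU transfer costs, and per-actor vectorization degrees, all of which this sketch ignores.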
Index Terms
Memory-Constrained Vectorization and Scheduling of Dataflow Graphs for Hybrid CPU-GPU Platforms