ABSTRACT
FPGA acceleration of large irregular dataflow graphs is often limited by the long-tail distribution of parallelism on fine-grained overlay dataflow architectures. In this paper, we show how to overcome these limitations by exploiting criticality information along compute paths, both statically during graph pre-processing and dynamically at runtime. Statically, we reassociate high-fanin dataflow chains so that late-arriving inputs get faster routes to the output. We also perform fanout decomposition and selective node replication to distribute serialization costs across multiple PEs. Dynamically, we modify the dataflow firing rule in hardware to prefer critical nodes when multiple nodes are ready for evaluation. Together, these transformations shorten the tail of the parallelism profile for these large-scale graphs. Across a range of dataflow benchmarks extracted from Sparse LU factorization, we demonstrate up to 2.5× (mean 1.21×) improvement with static pre-processing alone, up to 2.4× (mean 1.17×) improvement with dynamic optimizations alone, and up to 2.9× (mean 1.39×) improvement overall when both static and dynamic optimizations are enabled. These improvements come on top of the 3--10× speedups over CPU implementations achieved without our transformations enabled.
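The static reassociation idea can be illustrated with a small sketch. It is not the paper's algorithm: it assumes an associative operator, known input arrival times, and a fixed per-operation latency (`op_latency`, a hypothetical parameter), and it uses a greedy Huffman-style pairing of the earliest-ready operands so that the latest input passes through as few operations as possible.

```python
import heapq


def chain_finish_time(arrivals, op_latency=1):
    """Finish time of a left-to-right chain ((a0 op a1) op a2) ...

    arrivals[i] is the (assumed known) time at which input i is ready.
    """
    t = arrivals[0]
    for a in arrivals[1:]:
        # Each step waits for both the running result and the next input.
        t = max(t, a) + op_latency
    return t


def reassociated_finish_time(arrivals, op_latency=1):
    """Finish time after greedy reassociation (illustrative heuristic only).

    Repeatedly combine the two earliest-ready operands, so late-arriving
    inputs are merged near the root of the resulting tree.
    """
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, max(a, b) + op_latency)
    return heap[0]


if __name__ == "__main__":
    # One late input (time 9) at the head of an 8-input chain.
    arrivals = [9, 0, 0, 0, 0, 0, 0, 0]
    print(chain_finish_time(arrivals))         # sequential chain
    print(reassociated_finish_time(arrivals))  # reassociated tree
```

For this hypothetical input, the sequential chain finishes at time 16, while the reassociated tree finishes at time 10, because the late input feeds only the final operation.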
Index Terms
- FPGA Acceleration of Irregular Iterative Computations using Criticality-Aware Dataflow Optimizations (Abstract Only)