Abstract
In recent years the computing landscape has seen an increasing shift towards specialized accelerators. Field programmable gate arrays (FPGAs) are particularly promising for implementing these accelerators, as they offer significant performance and energy improvements over CPUs for a wide class of applications and are far more flexible than fixed-function ASICs. However, FPGAs are difficult to program. Traditional programming models for reconfigurable logic fall into two groups: low-level hardware description languages like Verilog and VHDL, which produce very efficient designs but have none of the productivity features of modern software languages, and low-level software languages like C and OpenCL coupled with high-level synthesis (HLS) tools, which typically produce far less efficient designs. Functional languages with parallel patterns are a better fit for hardware generation because they provide high-level abstractions to programmers with little experience in hardware design and avoid many of the problems faced when generating hardware from imperative languages. In this paper, we identify two important optimizations for using parallel patterns to generate efficient hardware: tiling and metapipelining. We present a general representation of tiled parallel patterns, and provide rules for automatically tiling patterns and generating metapipelines. We demonstrate experimentally that these optimizations result in speedups of up to 39.4× on a set of benchmarks from the data analytics domain.
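To make the two optimizations concrete, the following Python sketch illustrates the general idea only; it is not the representation or the rules from the paper. It assumes a map-reduce pattern whose combine function is associative with `zero` as its identity, so the input can be split into tiles that are reduced independently. It also assumes a simplified cost model in which a metapipeline overlaps coarse-grained stages (for example load, compute, store) across successive tiles. All function names and cycle counts are hypothetical.

```python
from functools import reduce


def map_reduce(xs, f, combine, zero):
    """Untiled pattern: apply f to every element, then fold with combine."""
    return reduce(combine, (f(x) for x in xs), zero)


def tiled_map_reduce(xs, f, combine, zero, tile_size):
    """Tiled pattern: an inner map-reduce per tile, then an outer reduce.

    Each slice stands in for copying one tile into a small on-chip buffer.
    Correct only if combine is associative and zero is its identity.
    """
    partials = []
    for start in range(0, len(xs), tile_size):
        tile = xs[start:start + tile_size]
        partials.append(map_reduce(tile, f, combine, zero))
    return reduce(combine, partials, zero)


def sequential_cycles(stage_cycles, n_tiles):
    """Assumed cost model: every tile runs all stages back to back."""
    return n_tiles * sum(stage_cycles)


def metapipelined_cycles(stage_cycles, n_tiles):
    """Assumed cost model: stages overlap across tiles, so after the
    pipeline fills, one tile completes per slowest-stage interval."""
    return sum(stage_cycles) + (n_tiles - 1) * max(stage_cycles)


data = list(range(10))
square = lambda x: x * x
add = lambda a, b: a + b

untiled = map_reduce(data, square, add, 0)
tiled = tiled_map_reduce(data, square, add, 0, tile_size=4)

stages = [100, 80, 60]  # hypothetical load, compute, store cycle counts
seq = sequential_cycles(stages, n_tiles=3)
meta = metapipelined_cycles(stages, n_tiles=3)
```

In this sketch the tiled and untiled results agree because addition is associative, and the modeled benefit of overlapping stages grows with the number of tiles. Real hardware would also need buffering between stages, which the model ignores.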
Generating Configurable Hardware from Parallel Patterns
ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems