Generating Configurable Hardware from Parallel Patterns

Authors:
Raghu Prabhakar (Stanford University, Stanford, CA, USA)
David Koeplinger (Stanford University, Stanford, CA, USA)
Kevin J. Brown (Stanford University, Stanford, CA, USA)
HyoukJoong Lee (Stanford University, Google, Stanford, CA, USA)
Christopher De Sa (Stanford University, Stanford, CA, USA)
Christos Kozyrakis (Stanford University, EPFL, Stanford, CA, USA)
Kunle Olukotun (Stanford University, Stanford, CA, USA)

ACM SIGPLAN Notices, Volume 51, Issue 4, April 2016, pp. 651–665. https://doi.org/10.1145/2954679.2872415

Published: 25 March 2016

Abstract

In recent years the computing landscape has seen an increasing shift towards specialized accelerators. Field programmable gate arrays (FPGAs) are particularly promising for the implementation of these accelerators, as they offer significant performance and energy improvements over CPUs for a wide class of applications and are far more flexible than fixed-function ASICs. However, FPGAs are difficult to program. Traditional programming models for reconfigurable logic fall into two camps: low-level hardware description languages like Verilog and VHDL, which produce very efficient designs but have none of the productivity features of modern software languages, and low-level software languages like C and OpenCL coupled with high-level synthesis (HLS) tools, which typically produce far less efficient designs. Functional languages with parallel patterns are a better fit for hardware generation because they provide high-level abstractions to programmers with little experience in hardware design and avoid many of the problems faced when generating hardware from imperative languages. In this paper, we identify two important optimizations for using parallel patterns to generate efficient hardware: tiling and metapipelining. We present a general representation of tiled parallel patterns, and provide rules for automatically tiling patterns and generating metapipelines. We demonstrate experimentally that these optimizations result in speedups of up to 39.4× on a set of benchmarks from the data analytics domain.
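
To make the idea of tiling a parallel pattern concrete, the following is a minimal software sketch, not the representation or the rewrite rules from the paper. It assumes a dot product written as a map followed by a reduce, and strip-mines it into fixed-size tiles. The function names `dot_untiled` and `dot_tiled`, and the "load", "compute", and "combine" stage labels in the comments, are hypothetical names chosen for illustration.

```python
from functools import reduce
from operator import add


def dot_untiled(xs, ys):
    """Dot product as a flat map-then-reduce pattern over the whole input."""
    return reduce(add, (x * y for x, y in zip(xs, ys)), 0)


def dot_tiled(xs, ys, tile_size):
    """Same computation, strip-mined into tiles of at most tile_size elements.

    The outer loop walks over tiles. The inner pattern touches only a
    tile-sized slice, which stands in for data copied into a small,
    fast buffer (an assumption made for illustration).
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    total = 0
    for start in range(0, len(xs), tile_size):
        # Hypothetical "load" stage: fetch one tile of each input.
        x_tile = xs[start:start + tile_size]
        y_tile = ys[start:start + tile_size]
        # Hypothetical "compute" stage: map-reduce over the tile only.
        partial = reduce(add, (x * y for x, y in zip(x_tile, y_tile)), 0)
        # Hypothetical "combine" stage: fold the partial result into the total.
        total += partial
    return total


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5]
    b = [5, 4, 3, 2, 1]
    print(dot_untiled(a, b), dot_tiled(a, b, tile_size=2))
```

The sketch runs the three per-tile stages one after another. In hardware, a coarse-grained pipeline could in principle overlap them across successive tiles, so that one tile is being loaded while the previous one is being computed; that overlap is not modeled here.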


Published in
ACM SIGPLAN Notices, Volume 51, Issue 4 (ASPLOS '16)
April 2016, 774 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2954679
Editor: Andy Gill, University of Kansas, Lawrence, KS

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
March 2016, 824 pages
ISBN: 9781450340915
DOI: 10.1145/2872362
General Chair: Tom Conte, Georgia Tech, USA
Program Chair: Yuanyuan Zhou, University of California, San Diego, USA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
FPGAs
hardware generation
metapipelining
parallel patterns
reconfigurable hardware
tiling
Qualifiers
- research-article