research-article
Public Access

Generating Configurable Hardware from Parallel Patterns

Published: 25 March 2016

Abstract

In recent years, the computing landscape has shifted increasingly towards specialized accelerators. Field programmable gate arrays (FPGAs) are particularly promising for implementing these accelerators: they offer significant performance and energy improvements over CPUs for a wide class of applications and are far more flexible than fixed-function ASICs. However, FPGAs are difficult to program. Traditional programming models for reconfigurable logic fall into two groups. Low-level hardware description languages like Verilog and VHDL produce very efficient designs but have none of the productivity features of modern software languages. Low-level software languages like C and OpenCL, coupled with high-level synthesis (HLS) tools, typically produce designs that are far less efficient. Functional languages with parallel patterns are a better fit for hardware generation because they provide high-level abstractions to programmers with little experience in hardware design and avoid many of the problems faced when generating hardware from imperative languages. In this paper, we identify two important optimizations for using parallel patterns to generate efficient hardware: tiling and metapipelining. We present a general representation of tiled parallel patterns and provide rules for automatically tiling patterns and generating metapipelines. We demonstrate experimentally that these optimizations result in speedups of up to 39.4× on a set of benchmarks from the data analytics domain.
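
To make the idea of tiling a parallel pattern concrete, the following is a minimal software sketch, not the paper's representation or compiler output. It assumes a simple map-then-reduce pattern (sum of squares) and a hypothetical `tile_size` parameter standing in for the capacity of on-chip memory. The function names are illustrative only. The per-tile steps marked in the comments are the kind of stages that a pipeline across tiles could overlap in hardware; here they simply run one after another.

```python
from functools import reduce


def untiled_sum_of_squares(xs):
    # Parallel-pattern style: a map followed by a reduce over the whole input.
    return reduce(lambda a, b: a + b, map(lambda x: x * x, xs), 0)


def tiled_sum_of_squares(xs, tile_size):
    # Outer pattern iterates over tiles; inner pattern works on one tile.
    # tile_size is a hypothetical parameter, not a value from the paper.
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    partials = []
    for start in range(0, len(xs), tile_size):
        tile = xs[start:start + tile_size]           # step 1: load one tile
        partials.append(sum(x * x for x in tile))    # step 2: map and reduce within the tile
    return sum(partials)                             # step 3: combine the per-tile partial results


data = list(range(10))
print(untiled_sum_of_squares(data), tiled_sum_of_squares(data, 4))
```

Because addition is associative, the tiled version produces the same result as the untiled one for any positive tile size, which is the property that makes this kind of rewrite safe for reductions.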

                    Published in

                      ACM SIGPLAN Notices, Volume 51, Issue 4
                      ASPLOS '16
                      April 2016
                      774 pages
                      ISSN: 0362-1340
                      EISSN: 1558-1160
                      DOI: 10.1145/2954679
                      Editor: Andy Gill

                      ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
                      March 2016
                      824 pages
                      ISBN: 9781450340915
                      DOI: 10.1145/2872362
                      General Chair: Tom Conte
                      Program Chair: Yuanyuan Zhou

                      Copyright © 2016 ACM

                      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

                      Publisher

                      Association for Computing Machinery

                      New York, NY, United States

                      Qualifiers

                      • research-article
