ABSTRACT
Irregular data structures, as exemplified by sparse matrices, have proved essential in modern computing. Numerous sparse formats have been investigated to improve the overall performance of sparse matrix-vector multiplication (SpMV). In this work we instead take a fundamentally different approach: we automatically build sets of regular sub-computations by mining the irregular data structure for regular sub-regions. Our approach leads to code that is specialized to the sparsity structure of the input matrix but no longer needs any indirection arrays, thereby improving SIMD vectorizability. We particularly focus on small sparse structures (below 10M nonzeros), and demonstrate substantial performance improvements and compaction capabilities compared to a classical CSR implementation and Intel MKL IE's SpMV implementation, evaluating on 200+ different matrices from the SuiteSparse repository.
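To make the contrast concrete, the sketch below shows a classical CSR SpMV next to a hand-written piecewise-regular variant for one hypothetical sparsity structure (a matrix whose nonzeros lie on the main diagonal and the first superdiagonal). This is an illustration of the general idea only, not the paper's code generator: the function names and the chosen structure are assumptions for the example.

```c
#include <assert.h>

/* Classical CSR SpMV: the inner loop reads x through the indirection
 * array col[], which makes the access pattern opaque to the compiler
 * and hinders SIMD vectorization. */
static void spmv_csr(int n, const int *rowptr, const int *col,
                     const double *val, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            s += val[k] * x[col[k]];
        y[i] = s;
    }
}

/* Piecewise-regular variant for a bidiagonal sparsity structure
 * (hypothetical example): each mined regular region becomes a plain
 * affine loop with no indirection, so the compiler can vectorize it
 * directly. */
static void spmv_bidiag(int n, const double *diag, const double *super,
                        const double *x, double *y) {
    for (int i = 0; i < n; i++)        /* regular region 1: diagonal */
        y[i] = diag[i] * x[i];
    for (int i = 0; i < n - 1; i++)    /* regular region 2: superdiagonal */
        y[i] += super[i] * x[i + 1];
}
```

Both routines compute the same product; the second trades generality for regular, indirection-free loops specialized to one matrix's structure, which is the trade-off the paper automates.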