透過您的圖書館登入
IP:3.17.74.227
  • 學位論文

多GPU架構之Strassen矩陣乘法

Strassen’s Matrix Multiplication on Multi-GPU

指導教授 : 李哲榮

摘要


裝配有大量平行處理單元的通用圖形處理器已經成為高效能計算領域中不可或缺的設備。為了能更有效率地處理更大尺度的問題,多GPU架構也開始被使用在高效能的機器上。因此,設計良好的演算法以符合多GPU平台在現今已是重要研究課題。 在本論文中,我們在多GPU架構實作了高效能的Strassen乘法演算法。Strassen矩陣乘法演算法具有遞迴性質,在每個步驟中,輸入矩陣將被分解為2*2的子矩陣,遞迴進行7次的子矩陣乘法以及18個子矩陣加法,再組合出最後的結果。比起傳統的O(n^3)矩陣乘法,Strassen演算法能夠達到低於立方次的時間複雜度。 我們提出3種多GPU平台上的快速矩陣乘法實作: 傳統方法:使用傳統的block形式矩陣乘法將工作分配給多個GPU,然後再於GPU上運行cublas4.2的O(n^3)矩陣乘法sgemm。 Hybrid方法:使用傳統的block形式矩陣乘法將工作分配給多個GPU,然後再於GPU上運行快速矩陣乘法。 Strassen方法:使用Strassen的子矩陣分割方式遞迴展開2次,然後再於GPU上運行快速矩陣乘法。 我們在裝配有四張Tesla C2070運算卡的平台上測試,相較於O(n^3)的平行sgemm,我們的Strassenz方法獲得了28.5%的效能進展;以單一GPU的sgemm作為比較對象,則可以達到4.37的speedup。

並列摘要


General Purpose Graphics Processing Unit (GPGPU) that equips massively parallel processing units has become an indispensible building block in High Performance Computing (HPC). To achieve better scalability, the multi-GPU architecture is used in modern HPC machines. Therefore, designing efficient algorithms for multi-GPU platforms has been an important research topic already. In this thesis, we presented the high performance implementations of Stassen’s matrix multiplication algorithm on multi-GPU. Strassen’s algorithm is recursive defined. In each level, the matrices are evenly partitioned into 2x2 submatrices, and Strassen’s algorithm uses 7 matrix multiplications and 18 matrix additions on those submatrices to assemble the final result. Comparing to the general O(n^3) matrix multiplication algorithms, Strassen’s algorithm can achieve subcubic time complexity. Three implementations of matrix multiplication for multi-GPU are presented. Traditional method: which uses the block matrix multiplication algorithm to partition the multiplications of submatrices to multiple GPUs, and uses sgemm on each single GPU. Hybrid method: which uses the block matrix multiplication algorithm to partition the multiplications of submatrices to multiple GPUs, and uses fast matrix multiplication on each single GPU. Strassen method: which applies 2 times Strassen’s partition(which gives 49 sub-multiplications and 126 sub-additions) to distribute the workload for multiple GPU, and uses fast matrix multiplication on each single GPU. Those implementations were experimented on the platform equipped with four Tesla C2070 GPUs, and compared with the sgemm kernel for single GPU in cublas 4.2. About 4.37 times speedup can be obtained by using the Strassen implementation。

並列關鍵字

matrix multiplication Strassen multi-GPU GPGPU GPU

參考文獻


[1] Volker Strassen. Gaussian Elimination is not Optimal. NUMERISCHE MATHEMATIK Volume 13, Number 4, 354-356.
[2] Jacques Cohen and Martin Roth. On the implementation of Strassen's fast multiplication algorithm. ACTA INFORMATICA Volume 6, Number 4, 341-355.
[3] David H. Bailey. Extra-High Speed Matrix Multiplication on the Cray-2. SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3 , pg. 603 607.
[4] Craig C. Douglas , Michael Heroux , Gordon Slishman , Roger M. Smith , Roger M. A. GEMMW: Portable Level 3 Blas Winograd Variant Of Strassen's Matrix-Matrix Multiply Algorithm. Yale University, Department of Computer Scuence.
[5] Steven Huss-Lederman, Elaine M. Jacobson, Jeremy R. Johnson, Anna Tsao, Thomas Turnbull. Implementation of Strassen's Algorithm for Matrix Multiplication.

被引用紀錄


林志樺(2012)。垂直整合對營運資金管理與財務績效之影響-以我國紡織業為例〔碩士論文,淡江大學〕。華藝線上圖書館。https://doi.org/10.6846/TKU.2012.00528
郭玉娟(2011)。科技專案績效管理—平衡計分卡之應用〔碩士論文,淡江大學〕。華藝線上圖書館。https://doi.org/10.6846/TKU.2011.01376
鄧佳宜(2011)。聚乳酸/氮化鋁導熱高分子複材之製備與性質研究〔碩士論文,淡江大學〕。華藝線上圖書館。https://doi.org/10.6846/TKU.2011.01165
郭貞君(2008)。論外資購併本國銀行機構問題之探討〔碩士論文,淡江大學〕。華藝線上圖書館。https://doi.org/10.6846/TKU.2008.01213
何文慧(2008)。以國家競爭力評比作政府政策制定之分析〔碩士論文,淡江大學〕。華藝線上圖書館。https://doi.org/10.6846/TKU.2008.00275

延伸閱讀