ABSTRACT
The landscape of modern computers is undoubtedly heterogeneous, as virtually every computing platform integrates multiple types of processing units and hardware accelerators. However, entrenched programming models focus on dispatching each code region to only the most efficient processing unit, underutilizing the aggregate processing power within heterogeneous computers.
This paper proposes simultaneous and heterogeneous multithreading (SHMT), a programming and execution model that enables opportunities for “real” parallel processing across heterogeneous processing units. In contrast to conventional models, SHMT can utilize heterogeneous types of processing units concurrently for the same code region. SHMT further provides an abstraction and a runtime system to facilitate this parallel execution. More importantly, SHMT must additionally address the heterogeneity in data precision that different processing units support in order to ensure the quality of the result.
This paper implements and evaluates SHMT on an embedded system platform with a GPU and an Edge TPU. SHMT achieves up to 1.95× speedup and 51.0% energy reduction compared with a GPU-only baseline.
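To make the execution model concrete, the following is a minimal, hypothetical sketch (not the paper’s actual API) of the core SHMT idea: the same data-parallel kernel is partitioned across two heterogeneous “processing units” that run concurrently, simulated here with threads. One worker computes at full precision (GPU-like); the other computes at low precision via int8 quantization (Edge-TPU-like), illustrating why the runtime must account for precision heterogeneity when checking result quality. All names and the quantization scheme are illustrative assumptions.

```python
import threading
import numpy as np

def gpu_worker(x, out, sl):
    # Full-precision execution of the kernel (here, y = 2x) on its partition.
    out[sl] = np.float32(2.0) * x[sl].astype(np.float32)

def tpu_worker(x, out, sl, scale=127.0):
    # Low-precision execution: quantize inputs to int8, compute in integer
    # arithmetic, then dequantize. Introduces bounded quantization error.
    q = np.clip(np.round(x[sl] * scale), -128, 127).astype(np.int8)
    out[sl] = (q.astype(np.int32) * 2) / scale

x = np.linspace(-1.0, 1.0, 1000)
out = np.empty_like(x)
half = len(x) // 2

# Both heterogeneous workers execute the SAME code region concurrently,
# each on its own partition of the data.
threads = [
    threading.Thread(target=gpu_worker, args=(x, out, slice(0, half))),
    threading.Thread(target=tpu_worker, args=(x, out, slice(half, None))),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Quality check: the low-precision partition deviates only within the
# quantization error bound (~2 * 0.5 / 127 here).
err = np.max(np.abs(out - 2.0 * x))
print(f"max abs error: {err:.4f}")
```

A real runtime would additionally balance partition sizes against each unit’s throughput and fall back to full precision when the observed error exceeds a quality threshold; this sketch only shows the concurrent split and the precision gap.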