ABSTRACT
The landscape of modern computers is undoubtedly heterogeneous, as virtually every computing platform integrates multiple types of processing units and hardware accelerators. However, entrenched programming models focus on dispatching each code region to only the most efficient processing unit, underutilizing the aggregate processing power within heterogeneous computers.
This paper proposes simultaneous and heterogeneous multithreading (SHMT), a programming and execution model that enables opportunities for “real” parallel processing across heterogeneous processing units. In contrast to conventional models, SHMT can utilize heterogeneous types of processing units concurrently for the same code region. SHMT further provides an abstraction and a runtime system to facilitate this parallel execution. More importantly, SHMT must additionally address the heterogeneity in data precision that different processing units support in order to ensure the quality of the result.
This paper implements and evaluates SHMT on an embedded system platform with a GPU and an Edge TPU. SHMT achieves up to 1.95× speedup and 51.0% energy reduction compared with a GPU-only baseline.
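To make the execution model concrete, the following is a minimal, hypothetical sketch (not the paper’s actual API) of the core SHMT idea: the same data-parallel kernel is partitioned across two heterogeneous “processing units” that run concurrently, simulated here with threads. One worker computes at full precision (GPU-like); the other computes at low precision via int8 quantization (Edge-TPU-like), illustrating why the runtime must account for precision heterogeneity when checking result quality. All names and the quantization scheme are illustrative assumptions.

```python
import threading
import numpy as np

def gpu_worker(x, out, sl):
    # Full-precision execution of the kernel (here, y = 2x) on its partition.
    out[sl] = np.float32(2.0) * x[sl].astype(np.float32)

def tpu_worker(x, out, sl, scale=127.0):
    # Low-precision execution: quantize inputs to int8, compute in integer
    # arithmetic, then dequantize. Introduces bounded quantization error.
    q = np.clip(np.round(x[sl] * scale), -128, 127).astype(np.int8)
    out[sl] = (q.astype(np.int32) * 2) / scale

x = np.linspace(-1.0, 1.0, 1000)
out = np.empty_like(x)
half = len(x) // 2

# Both heterogeneous workers execute the SAME code region concurrently,
# each on its own partition of the data.
threads = [
    threading.Thread(target=gpu_worker, args=(x, out, slice(0, half))),
    threading.Thread(target=tpu_worker, args=(x, out, slice(half, None))),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Quality check: the low-precision partition deviates only within the
# quantization error bound (~2 * 0.5 / 127 here).
err = np.max(np.abs(out - 2.0 * x))
print(f"max abs error: {err:.4f}")
```

A real runtime would additionally balance partition sizes against each unit’s throughput and fall back to full precision when the observed error exceeds a quality threshold; this sketch only shows the concurrent split and the precision gap.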