
Simultaneous and Heterogenous Multithreading

Published: 08 December 2023

ABSTRACT

The landscape of modern computers is undoubtedly heterogeneous, as all computing platforms integrate multiple types of processing units and hardware accelerators. However, the entrenched programming models focus on using only the most efficient processing units for each code region, underutilizing the processing power within heterogeneous computers.

This paper presents simultaneous and heterogenous multithreading (SHMT), a programming and execution model that enables opportunities for "real" parallel processing using heterogeneous processing units. In contrast to conventional models, SHMT can utilize heterogeneous types of processing units concurrently for the same code region. Furthermore, SHMT provides an abstraction and a runtime system to facilitate parallel execution. More importantly, SHMT must additionally address the heterogeneity in data precision that various processing units support, in order to ensure the quality of the result.
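The abstract describes SHMT only at the model level; the paper itself defines the actual abstraction and runtime. As a rough illustration of the idea, the minimal sketch below splits a single logical operation across two heterogeneous "devices" concurrently and then checks result quality against a full-precision reference. The `run_on_gpu`/`run_on_tpu` backends, the tile size, and the round-robin dispatch are all illustrative assumptions, not the paper's runtime API; the low-precision backend merely mimics an integer-only accelerator pipeline.

```python
# Hypothetical sketch of the SHMT idea: one code region (an element-wise
# vector operation) is partitioned into tiles that execute *concurrently*
# on heterogeneous processing units with different data precisions.
import concurrent.futures
import numpy as np

TILE = 4096  # assumed tile size for work partitioning


def run_on_gpu(tile: np.ndarray) -> np.ndarray:
    """Stand-in for a full-precision (fp32) GPU kernel computing 2x + 1."""
    return tile.astype(np.float32) * 2.0 + 1.0


def run_on_tpu(tile: np.ndarray) -> np.ndarray:
    """Stand-in for a low-precision unit: quantize to int8, compute in the
    integer domain, then dequantize, mimicking an Edge TPU-style pipeline."""
    scale = float(np.abs(tile).max()) / 127.0 or 1.0
    q = np.clip(np.round(tile / scale), -128, 127).astype(np.int8)
    out_q = q.astype(np.int32) * 2 + round(1.0 / scale)  # quantized 2x + 1
    return out_q.astype(np.float32) * scale


def shmt_map(x: np.ndarray) -> np.ndarray:
    """Dispatch tiles of one logical operation to both devices at once."""
    tiles = [x[i:i + TILE] for i in range(0, len(x), TILE)]
    backends = [run_on_gpu, run_on_tpu]  # round-robin over heterogeneous units
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(backends[i % 2], t) for i, t in enumerate(tiles)]
        return np.concatenate([f.result() for f in futures])


if __name__ == "__main__":
    x = np.random.rand(1_000_000).astype(np.float32)
    y = shmt_map(x)
    ref = x * 2.0 + 1.0  # full-precision reference (ref >= 1, so division is safe)
    # Quality check: mean absolute percentage error of the mixed-precision result
    mape = np.mean(np.abs((y - ref) / ref)) * 100
    print(f"MAPE vs. fp32 reference: {mape:.4f}%")
```

The fixed round-robin tiling here only stands in for a scheduling decision; a real SHMT runtime would presumably make quality-aware choices about which work goes to which unit, which is exactly why the precision heterogeneity noted above must be managed rather than ignored.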

This paper implements and evaluates SHMT on an embedded system platform with a GPU and an Edge TPU. SHMT achieves up to 1.95× speedup and 51.0% energy reduction compared to the GPU baseline.

