ABSTRACT
Migrating memory systems from x86 to ARM can expose weak memory ordering bugs because the two architectures have different memory models. Memory barriers must be added to prevent these bugs. However, current automatic barrier insertion approaches fail to identify all locations where WMM (weak memory model) bugs may occur, and they often insert unnecessary barriers. To address this, we propose Hawkeyes, an approach that combines dynamic memory access conflict detection with instruction windows to locate out-of-order memory access issues in multi-threaded programs. Hawkeyes instruments programs at compile time to locate all memory conflicts at run time, then analyzes the micro-instructions together with the instruction window to identify out-of-order instruction intervals. By comparing these intervals across threads, Hawkeyes determines the locations where order must be maintained. We validate the correctness of Hawkeyes on open-source libraries and evaluate it on public benchmarks. The results show that Hawkeyes pinpoints all locations where WMM bugs may appear and achieves high accuracy in barrier insertion.
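As an illustration of the kind of ordering problem the abstract refers to, the following is a minimal sketch of the classic message-passing pattern, not code from Hawkeyes itself. The function name `run_message_passing` and the variables `data`, `ready`, and `observed` are hypothetical. Code of this shape that relied on x86's stronger hardware ordering can read a stale payload on ARM unless the two accesses in each thread are ordered; here that ordering is expressed with C++ release and acquire operations, which a compiler typically lowers to barriers or ordered instructions on ARM.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical message-passing example; all names are illustrative only.
int run_message_passing() {
    int data = 0;                    // plain (non-atomic) payload
    std::atomic<bool> ready{false};  // flag that publishes the payload
    int observed = -1;

    std::thread producer([&] {
        data = 42;
        // Release ordering keeps the store to data before the store to ready.
        ready.store(true, std::memory_order_release);
    });

    std::thread consumer([&] {
        // Acquire ordering keeps the load of data after the load of ready.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = data;
    });

    producer.join();
    consumer.join();
    assert(ready.load());  // sanity check: the flag was published
    return observed;
}
```

With the release and acquire orderings in place, the consumer is guaranteed to see the payload written before the flag. Replacing both with `std::memory_order_relaxed` would remove that guarantee and make the access to `data` a data race, which is the class of location a barrier insertion tool needs to find.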