Hawkeyes: Addressing Weak Memory Order in Program Migration Based on Instruction Windows

Authors:
Zhangqi Zhu, Alibaba Group, Hangzhou, China (ORCID 0009-0001-1958-1189)
Yuhui Cai, Xiamen University, Xiamen, China (ORCID 0009-0007-0760-9707)
Binbin Xu, Alibaba Group, Hangzhou, China (ORCID 0009-0003-5067-5751)
Pingchao Yang, Alibaba Group, Hangzhou, China (ORCID 0009-0002-9273-3430)
Jicheng Chen, Alibaba Group, Hangzhou, China (ORCID 0009-0002-6825-2987)
Zhirong Shen, Xiamen University, Xiamen, China (ORCID 0000-0003-2673-5868)

CHEOPS '24: Proceedings of the 4th Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems, April 2024, Pages 17–22
https://doi.org/10.1145/3642963.3652204

Published: 14 May 2024

ABSTRACT

Migrating memory systems from x86 to ARM can expose weak memory order issues because the two architectures have different memory models, so memory barriers must be added to prevent them. However, current automatic memory barrier insertion approaches fail to identify all locations where WMM (Weak Memory Model) bugs may occur, and they often insert unnecessary memory barriers. To address this issue, we propose Hawkeyes, an approach that combines dynamic memory access conflict detection with instruction windows to locate out-of-order memory access issues in multi-threaded programs. Hawkeyes uses compile-time instrumentation to locate all memory conflicts at run time, then analyzes the micro-instructions together with the instruction window to identify out-of-order instruction intervals. By comparing these intervals across threads, Hawkeyes determines the locations where order must be maintained. We validate the correctness of Hawkeyes on open-source libraries and evaluate it on public benchmarks. We demonstrate that Hawkeyes not only pinpoints all the locations where WMM bugs may appear but also achieves high accuracy in barrier insertion.
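To make the class of bug concrete, the following is a minimal sketch of the classic message-passing pattern, written with C++11 atomics. It is not taken from the paper and is not Hawkeyes output; the names `payload`, `ready`, `producer`, `consumer`, and `run_message_passing` are hypothetical. On x86, the hardware does not reorder a store with an earlier store, so code that omits the ordering on the flag often appears to work there, whereas ARM permits that reordering. The release/acquire pair below marks the two places where order has to be maintained.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical shared state for illustration only.
int payload = 0;                 // plain (non-atomic) shared data
std::atomic<bool> ready{false};  // flag that publishes the data

// Writer thread: write the data, then publish it.
void producer() {
    payload = 42;
    // Release ordering keeps the payload store ahead of the flag store.
    // This is the kind of location where a barrier is required on ARM.
    ready.store(true, std::memory_order_release);
}

// Reader thread: wait for the flag, then read the data.
int consumer() {
    // Acquire ordering keeps the payload load after the flag load.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one writer and one reader and returns the value the reader saw.
int run_message_passing() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer(producer);
    reader.join();
    writer.join();
    // With the release/acquire pair in place the reader must observe 42.
    assert(seen == 42);
    return seen;
}
```

Only the flag accesses carry ordering constraints here; the payload access stays plain. That reflects the goal stated in the abstract of ordering just the locations that need it rather than adding barriers everywhere.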

Published in
CHEOPS '24: Proceedings of the 4th Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems
April 2024
38 pages
ISBN: 9798400705380
DOI: 10.1145/3642963
General Chair: Shadi Ibrahim (Inria, France)
Program Chairs: Suren Byna (The Ohio State University, USA), Amelie Chi Zhou (Hong Kong Baptist University, Hong Kong)
Copyright © 2024 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 6 of 8 submissions, 75%