ABSTRACT
Dynamic binary translation (DBT) is the cornerstone of many important applications, yet developing and maintaining a real-world DBT system takes tremendous effort. To reduce this engineering effort, DBT developers frequently rely on helper functions. Although helper functions greatly simplify DBT development, calling them incurs substantial performance overhead. To solve this problem, this paper presents a novel approach to inlining helper functions in DBT systems. The proposed approach addresses several unique technical challenges. As a result, it reduces the performance overhead introduced by helper function calls while preserving the benefits of helper functions for DBT development. We have implemented a prototype of the approach in a popular DBT system, QEMU. Experimental results on benchmark programs from the SPEC CPU 2017 benchmark suite show an average speedup of 1.2x. Moreover, the translation overhead introduced by inlining helper functions is negligible.
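To make the contrast between calling and inlining a helper concrete, the following is a minimal toy sketch, not the paper's algorithm and not QEMU's actual API. It assumes a hypothetical guest instruction `add_sat` (unsigned 32-bit saturating add), a hypothetical helper `helper_add_sat`, and an invented tuple-based intermediate representation. The translator either emits a single `call` operation or splices an IR version of the helper body into the translated code.

```python
# Toy illustration only. All names (add_sat, helper_add_sat, the tuple IR,
# translate, execute) are hypothetical and do not come from QEMU or the paper.

MASK32 = 0xFFFFFFFF


def helper_add_sat(a, b):
    """Hypothetical helper: unsigned 32-bit saturating add."""
    s = a + b
    return MASK32 if s > MASK32 else s


HELPERS = {"add_sat": helper_add_sat}

# Assumed IR form of the same helper body, ready to be spliced in by an inliner.
# Formal names: inputs "a" and "b", result "ret", temporary "tmp0".
HELPER_IR = {
    "add_sat": [
        ("add", "tmp0", "a", "b"),       # tmp0 = a + b
        ("min", "ret", "tmp0", MASK32),  # ret = min(tmp0, MASK32)
    ],
}


def translate(guest_op, inline):
    """Translate one guest op (name, dst, src1, src2) into a list of IR ops."""
    name, dst, src1, src2 = guest_op
    if not inline:
        # Helper-call translation: easy to write, but pays for a call each time.
        return [("call", dst, name, src1, src2)]
    # Inlined translation: copy the helper body, renaming formals to actuals.
    rename = {"a": src1, "b": src2, "ret": dst}
    return [
        (op, rename.get(d, d), rename.get(x, x), rename.get(y, y))
        for op, d, x, y in HELPER_IR[name]
    ]


def execute(ir, regs):
    """Run IR ops against a register dict; return the number of helper calls."""
    def val(v):
        # Strings name registers; anything else is an immediate value.
        return regs[v] if isinstance(v, str) else v

    calls = 0
    for op in ir:
        if op[0] == "call":
            _, dst, name, x, y = op
            regs[dst] = HELPERS[name](val(x), val(y))
            calls += 1
        elif op[0] == "add":
            _, dst, x, y = op
            regs[dst] = val(x) + val(y)
        elif op[0] == "min":
            _, dst, x, y = op
            regs[dst] = min(val(x), val(y))
        else:
            raise ValueError(f"unknown IR op: {op[0]}")
    return calls


if __name__ == "__main__":
    guest = ("add_sat", "r0", "r1", "r2")
    for inline in (False, True):
        regs = {"r1": 0xFFFFFFF0, "r2": 0x20}
        calls = execute(translate(guest, inline), regs)
        print(f"inline={inline}: r0={regs['r0']:#x}, helper calls={calls}")
```

In this sketch both translations leave the same value in `r0`, and only the non-inlined one performs a helper call. A real DBT system would also have to handle issues this toy ignores, such as helpers with side effects on guest state, control flow inside helpers, and code size growth.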
Index Terms
- Helper function inlining in dynamic binary translation