Abstract
To improve system performance, operating systems (OSes) often undertake activities that require modification of virtual-to-physical address translations. For example, the OS may migrate data between physical pages to manage heterogeneous memory devices. We refer to such activities as page remappings. Unfortunately, page remappings are expensive. We show that a large part of this cost arises from address translation coherence, particularly on systems employing virtualization. In response, we propose hardware translation invalidation and coherence, or HATRIC, a readily implementable hardware mechanism that piggybacks translation coherence atop existing cache coherence protocols. We perform detailed studies using KVM-based virtualization, showing that HATRIC achieves up to 30% performance and 10% energy benefits, for per-CPU area overheads of 0.2%. We also quantify HATRIC's benefits on systems running Xen and find up to 33% performance improvements.
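The idea of piggybacking translation coherence on a cache-coherence-style mechanism can be illustrated with a toy model. The sketch below is not HATRIC's actual design; the class name `ToyTranslationCoherence`, its methods, and the sharer-tracking scheme are all hypothetical and chosen only for illustration. It assumes that each cached translation is tracked in a directory-like sharer set, so that a page remapping invalidates only the CPUs that actually hold the stale translation, rather than interrupting every CPU.

```python
from collections import defaultdict


class ToyTranslationCoherence:
    """Toy model (hypothetical, not the paper's design).

    Each CPU has a private TLB. A directory-like sharer set records which
    CPUs cached the translation for a virtual page, so a remapping can
    invalidate exactly those CPUs.
    """

    def __init__(self, num_cpus):
        self.page_table = {}                       # virtual page -> physical page
        self.tlbs = [dict() for _ in range(num_cpus)]
        self.sharers = defaultdict(set)            # virtual page -> CPUs caching it

    def map_page(self, vpage, ppage):
        """Install an initial translation in the page table."""
        self.page_table[vpage] = ppage

    def translate(self, cpu, vpage):
        """Look up a translation, filling the CPU's TLB on a miss."""
        tlb = self.tlbs[cpu]
        if vpage not in tlb:
            tlb[vpage] = self.page_table[vpage]    # simulated page walk
            self.sharers[vpage].add(cpu)           # record the sharer
        return tlb[vpage]

    def remap(self, vpage, new_ppage):
        """Change a translation and invalidate only the CPUs that cached it.

        Returns the sorted list of CPUs that received an invalidation.
        """
        self.page_table[vpage] = new_ppage
        targets = self.sharers.pop(vpage, set())
        for cpu in targets:
            self.tlbs[cpu].pop(vpage, None)
        return sorted(targets)
```

In this model, remapping a page cached by CPUs 0 and 2 on a four-CPU system sends invalidations to those two CPUs only, and later lookups on any CPU observe the new physical page.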
Hardware Translation Coherence for Virtualized Systems
ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture