ABSTRACT
Computing in virtualized environments has become common practice for many businesses. Hosting companies typically lower operational costs by keeping host machines highly utilized and maintaining just enough machines to meet demand. In this scenario, virtual machine context switches are frequent, which increases TLB miss rates (often by more than 5X when the number of contexts is doubled) and triggers expensive page walks. Because each TLB miss in a virtualized environment initiates a 2D page walk, the data caches fill with page table entries (often more than 50% of cache contents), evicting potentially more useful data.
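To make the cost of a 2D page walk concrete, the following minimal sketch counts worst-case memory references for a single TLB miss. It assumes radix page tables and no hits in page-walk caches or nested TLBs. The function names are illustrative and do not come from the paper.

```python
def native_walk_refs(levels: int) -> int:
    """Memory references for a native (1D) walk: one per page-table level."""
    return levels


def nested_walk_refs(guest_levels: int, host_levels: int) -> int:
    """Worst-case memory references for one 2D (nested) page walk.

    Every guest page-table pointer is a guest-physical address, so reading
    each guest level costs a full host walk plus one reference for the guest
    entry itself. The final guest-physical address of the data needs one
    more host walk.
    """
    per_guest_level = host_levels + 1
    return guest_levels * per_guest_level + host_levels


# 4-level guest and 4-level host tables (x86-64 style): 24 references,
# compared with 4 for a native walk.
print(nested_walk_refs(4, 4))  # 24
print(native_walk_refs(4))     # 4
```

Each of these references is an ordinary memory access that can allocate a line in the data caches, which is how page table entries come to compete with application data.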
In this work, we propose CSALT, a Context-Switch Aware Large TLB, to address the problem of increased TLB miss rates and their adverse impact on data caches. First, we demonstrate that the CSALT architecture copes effectively with the demands of increased context switches through its capacity to store a very large number of TLB entries. Next, we show that CSALT mitigates data cache contention caused by conflicts between data and translation entries by employing a novel TLB-Aware Cache Partitioning scheme. On 8-core systems that switch between two virtual machine contexts executing multi-threaded workloads, CSALT achieves an average performance improvement of 85% over a baseline with conventional L1-L2 TLBs and 25% over a baseline with a large L3 TLB.
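The abstract does not describe how the TLB-Aware Cache Partitioning scheme works, so the sketch below is not CSALT's algorithm. It only illustrates the general idea of splitting cache ways between data and translation entries, under assumptions of my own: each class has a per-stack-position hit histogram (the kind an LRU stack-distance profiler would produce), and the split that maximizes total hits is chosen. All names and numbers are hypothetical.

```python
from itertools import accumulate


def best_partition(data_hits_by_pos, tlb_hits_by_pos, total_ways):
    """Choose how many cache ways to give data versus translation entries.

    data_hits_by_pos[i] / tlb_hits_by_pos[i] hold the hits observed at LRU
    stack position i for that class (hypothetical profiler output). With k
    ways, a class keeps the hits of its first k stack positions. Returns
    (data_ways, tlb_ways, total_hits) for the best split, giving each class
    at least one way.
    """
    data_cum = [0] + list(accumulate(data_hits_by_pos))
    tlb_cum = [0] + list(accumulate(tlb_hits_by_pos))
    best = None
    for data_ways in range(1, total_ways):
        tlb_ways = total_ways - data_ways
        hits = (data_cum[min(data_ways, len(data_hits_by_pos))]
                + tlb_cum[min(tlb_ways, len(tlb_hits_by_pos))])
        if best is None or hits > best[2]:
            best = (data_ways, tlb_ways, hits)
    return best


# Made-up histograms for an 8-way cache set.
data_hits = [50, 30, 10, 5, 3, 1, 1, 0]
tlb_hits = [40, 35, 30, 20, 2, 1, 0, 0]
print(best_partition(data_hits, tlb_hits, 8))  # (4, 4, 220)
```

In this made-up example the translation entries still gain hits at deeper stack positions, so an even split beats giving most ways to data. A real scheme would recompute the split periodically as the workload mix changes.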