ABSTRACT
Architectural heterogeneity is increasing: numerous products and studies have proven the benefits of combining cores and accelerators with varying ISAs into a single system. However, an underappreciated barrier to unlocking the full potential of heterogeneity is the need to specify and to reconcile differences in memory consistency models across layers of the hardware-software stack and among on-chip components.
This paper presents ArMOR, a framework for specifying, comparing, and translating between memory consistency models. ArMOR defines memory ordering specification tables (MOSTs), an architecture-independent and precise format for specifying the semantics of memory ordering requirements such as preserved program order or explicit fences. MOSTs allow any two consistency models to be compared directly and algorithmically, and they help avoid many of the pitfalls of traditional consistency model analysis. As a case study, we use ArMOR to automatically generate translation modules, called shims, that dynamically translate code compiled for one memory model so that it executes correctly on hardware implementing a different model.
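To make the idea of a table-based, algorithmic comparison concrete, here is a minimal sketch. It is not ArMOR's actual MOST format: it assumes a heavily simplified table with only two access types and a boolean "program order preserved" entry per cell, and the names `SC`, `TSO`, `is_at_least_as_strong`, and `missing_orderings` are hypothetical, chosen only for illustration.

```python
from itertools import product

# Assumption: only two access types and boolean cells. A real specification
# table would carry richer entries than True/False.
ACCESS_TYPES = ("load", "store")

# Each table maps (earlier access, later access) -> True when program order
# between the two accesses is preserved. Textbook simplifications:
# sequential consistency preserves every pair; TSO relaxes store -> load.
SC = {pair: True for pair in product(ACCESS_TYPES, repeat=2)}
TSO = dict(SC)
TSO[("store", "load")] = False


def is_at_least_as_strong(a, b):
    """Return True if table `a` enforces every ordering that `b` enforces."""
    return all(a[pair] or not b[pair] for pair in b)


def missing_orderings(target, source):
    """List orderings the source model relies on that the target lacks."""
    return sorted(pair for pair in source if source[pair] and not target[pair])


if __name__ == "__main__":
    print(is_at_least_as_strong(SC, TSO))
    print(is_at_least_as_strong(TSO, SC))
    print(missing_orderings(TSO, SC))
```

In this simplified picture, the orderings returned by `missing_orderings` are the ones a translation layer would have to restore, for example by inserting fences, when code written for the source model runs on the target model.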
ArMOR: defending against memory consistency model mismatches in heterogeneous architectures, ISCA'15