Lock Cohorting: A General Technique for Designing NUMA Locks

Published: 18 February 2015

Abstract

Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machine's nonuniform memory and caching hierarchy, ever more important. This article presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful.

Lock cohorting allows one to transform any spin-lock algorithm, with minimal nonintrusive changes, into a scalable NUMA-aware spin-lock. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability.
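To make the transformation concrete, the following is a minimal sketch of one way a cohort lock can be composed from two ordinary ticket locks. It is an illustration under stated assumptions, not the authors' implementation. The sketch assumes one global lock plus one local lock per NUMA node, assumes the caller supplies its own node index, and assumes a bound on consecutive local handoffs. The names `TicketLock`, `CohortLock`, `has_waiters`, `max_handoffs`, and `run_demo` are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Classic ticket lock. Any thread may call unlock(), which is what lets a
// cohort pass the global lock from one thread to another on the same node.
class TicketLock {
    std::atomic<unsigned> next_{0};     // next ticket to hand out
    std::atomic<unsigned> serving_{0};  // ticket currently allowed in
public:
    void lock() {
        const unsigned my = next_.fetch_add(1);
        while (serving_.load() != my) std::this_thread::yield();
    }
    void unlock() { serving_.fetch_add(1); }
    // Call only while holding the lock: true if a thread is queued behind us.
    bool has_waiters() const { return next_.load() - serving_.load() > 1; }
};

// Hypothetical cohort lock: one global lock plus one local lock per node.
class CohortLock {
    struct Node {
        TicketLock local;
        bool owns_global = false;  // guarded by `local`
        int handoffs = 0;          // guarded by `local`
    };
    TicketLock global_;
    std::vector<Node> nodes_;
    const int max_handoffs_;  // assumed fairness bound, not from the source
public:
    explicit CohortLock(std::size_t num_nodes, int max_handoffs = 64)
        : nodes_(num_nodes), max_handoffs_(max_handoffs) {}

    // `node` is the caller's NUMA node index (supplied by the caller here).
    void lock(std::size_t node) {
        assert(node < nodes_.size());
        Node& n = nodes_[node];
        n.local.lock();
        if (!n.owns_global) {  // global lock was not inherited from a neighbor
            global_.lock();
            n.owns_global = true;
        }
    }

    void unlock(std::size_t node) {
        Node& n = nodes_[node];
        if (n.local.has_waiters() && n.handoffs < max_handoffs_) {
            ++n.handoffs;          // keep the global lock on this node
        } else {
            n.owns_global = false;
            n.handoffs = 0;
            global_.unlock();      // let another node's cohort proceed
        }
        n.local.unlock();
    }
};

// Small demo: threads alternate between two pretend nodes and bump a counter.
long run_demo(int threads, int iters) {
    CohortLock lock(2);
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters, t] {
            const std::size_t node = static_cast<std::size_t>(t % 2);
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;
                lock.unlock(node);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

In this sketch the underlying ticket lock is unchanged apart from the added `has_waiters` query, which is the sense in which the changes are nonintrusive. A real deployment would also need to map each thread to its actual NUMA node and pad per-node state to avoid false sharing, both of which are omitted here.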

We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, a real-world key-value store application (memcached), and the libc memory allocator. Our results demonstrate that cohort locks perform as well as or better than known locks when the load is low and significantly outperform them as the load increases.

      • Published in

        ACM Transactions on Parallel Computing  Volume 1, Issue 2
        Special Issue on PPOPP 2012
        January 2015
        224 pages
        ISSN:2329-4949
        EISSN:2329-4957
        DOI:10.1145/2737841
        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 February 2015
        • Accepted: 1 September 2014
        • Revised: 1 July 2014
        • Received: 1 April 2013

        Qualifiers

        • research-article
        • Research
        • Refereed
