Lock Cohorting: A General Technique for Designing NUMA Locks

Authors:
David Dice (Oracle Labs), Virendra J. Marathe (Oracle Labs), Nir Shavit (MIT)

ACM Transactions on Parallel Computing, Volume 1, Issue 2, Article No. 13, pp. 1–42. https://doi.org/10.1145/2686884

Published: 18 February 2015

Abstract

Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machine's nonuniform memory and caching hierarchy, ever more important. This article presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful.

Lock cohorting allows one to transform any spin-lock algorithm, with minimal nonintrusive changes, into a scalable NUMA-aware spin-lock. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability.
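The abstract does not spell out the construction, so the following is only a minimal C++ sketch of one plausible two-level arrangement: a single global lock plus one local lock per NUMA node, with the global lock kept inside a node while local waiters exist. Everything named here (`TicketLock`, `CohortLock`, `has_waiters`, `kMaxHandoffs`, `run_demo`) is hypothetical and chosen for illustration, the handoff bound of 64 is an arbitrary assumption, and the caller is assumed to supply its own node index rather than the code querying the machine topology.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Plain ticket lock (FIFO spin lock). Any thread may call unlock(), and the
// holder can ask whether other threads are already queued behind it.
class TicketLock {
    std::atomic<std::uint32_t> next_{0};     // next ticket to hand out
    std::atomic<std::uint32_t> serving_{0};  // ticket currently admitted
public:
    void lock() {
        const std::uint32_t my = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != my)
            std::this_thread::yield();
    }
    void unlock() { serving_.fetch_add(1, std::memory_order_release); }
    // Only meaningful when called by the current holder.
    bool has_waiters() const {
        return next_.load(std::memory_order_relaxed) -
               serving_.load(std::memory_order_relaxed) > 1;
    }
};

// Hypothetical two-level lock: one global lock, one local lock per node.
class CohortLock {
    struct Node {
        TicketLock local;
        bool owns_global = false;  // protected by `local`
        int handoffs = 0;          // protected by `local`
    };
    static constexpr int kMaxHandoffs = 64;  // assumed bound on local handoffs
    TicketLock global_;
    std::vector<Node> nodes_;
public:
    explicit CohortLock(std::size_t node_count) : nodes_(node_count) {}

    void lock(std::size_t node) {
        assert(node < nodes_.size());
        Node& n = nodes_[node];
        n.local.lock();
        if (!n.owns_global) {      // global lock was not handed to us locally
            global_.lock();
            n.owns_global = true;
        }
    }

    void unlock(std::size_t node) {
        assert(node < nodes_.size());
        Node& n = nodes_[node];
        if (n.local.has_waiters() && n.handoffs < kMaxHandoffs) {
            ++n.handoffs;          // keep the global lock within this node
        } else {
            n.handoffs = 0;
            n.owns_global = false;
            global_.unlock();      // let other nodes compete
        }
        n.local.unlock();
    }
};

// Hypothetical demo: workers are spread round-robin over pretend nodes and
// increment a plain (non-atomic) counter under the lock.
long run_demo(int nodes, int threads, int iters) {
    CohortLock lock(nodes);
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            const std::size_t node = static_cast<std::size_t>(t % nodes);
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;
                lock.unlock(node);
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling `run_demo(2, 4, 2000)` should return 8000 when mutual exclusion holds. The handoff bound is there so that one node cannot keep the global lock indefinitely; a tuned implementation would also likely pad each per-node record to its own cache line, which this sketch omits for brevity.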

We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, a real-world key-value store application (memcached), and the libc memory allocator. Our results demonstrate that cohort locks perform as well as or better than known locks when the load is low and significantly outperform them as the load increases.

Index Terms

Lock Cohorting: A General Technique for Designing NUMA Locks
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Concurrent programming languages

Published in
ACM Transactions on Parallel Computing Volume 1, Issue 2
Special Issue on PPOPP 2012
January 2015
224 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/2737841
Editor:
Phillip B. Gibbons
Intel Labs, Pittsburgh, USA
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 February 2015
- Accepted: 1 September 2014
- Revised: 1 July 2014
- Received: 1 April 2013
Author Tags
Concurrency
NUMA
hierarchical locks
locks
multicore
mutex
mutual exclusion
spin locks
Qualifiers
- research-article
- Research
- Refereed