Abstract
Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machine's nonuniform memory and caching hierarchy, ever more important. This article presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful.
Lock cohorting allows one to transform any spin-lock algorithm, with minimal nonintrusive changes, into a scalable NUMA-aware spin-lock. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability.
We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, a real-world key-value store application (memcached), and the libc memory allocator. Our results demonstrate that cohort locks perform as well as or better than known locks when the load is low and significantly outperform them as the load increases.
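The abstract does not spell out the construction, so the following is only a minimal sketch of the general shape such a transformation can take, not the authors' code. It assumes a cohort lock is built from one global lock shared by all NUMA nodes plus one local lock per node, and that a releasing thread may hand the global lock to a waiting neighbor on the same node instead of releasing it. The names `TicketLock`, `CohortLock`, `alone`, `max_passes`, and `run_demo` are hypothetical, and the `node` argument stands in for whatever mechanism a real system uses to find the caller's NUMA node.

```python
import threading
import time


class TicketLock:
    """Local ticket spin-lock; alone() reports whether anyone is queued."""

    def __init__(self):
        self._guard = threading.Lock()  # stands in for atomic fetch-and-add
        self._next = 0                  # next ticket to hand out
        self._serving = 0               # ticket currently allowed to run

    def acquire(self):
        with self._guard:
            ticket = self._next
            self._next += 1
        while self._serving != ticket:
            time.sleep(0)               # yield instead of burning the GIL

    def release(self):
        self._serving += 1              # only the holder writes this field

    def alone(self):
        # Called by the holder: True if no other thread holds a later ticket.
        return self._next == self._serving + 1


class _Node:
    def __init__(self):
        self.local = TicketLock()
        self.owns_global = False        # protected by self.local
        self.passes = 0                 # consecutive local hand-offs


class CohortLock:
    """Sketch: global lock + per-node local locks (assumed structure)."""

    def __init__(self, nodes, max_passes=64):
        # threading.Lock may be released by a thread other than the acquirer,
        # which is what lets one cohort member release on behalf of another.
        self._global = threading.Lock()
        self._nodes = [_Node() for _ in range(nodes)]
        self._max_passes = max_passes   # bound on hand-offs, for fairness

    def acquire(self, node):
        n = self._nodes[node]
        n.local.acquire()
        if not n.owns_global:           # predecessor did not pass ownership
            self._global.acquire()
            n.owns_global = True

    def release(self, node):
        n = self._nodes[node]
        if not n.local.alone() and n.passes < self._max_passes:
            n.passes += 1               # keep the global lock on this node
        else:
            n.passes = 0
            n.owns_global = False
            self._global.release()      # let another node's cohort in
        n.local.release()


def run_demo(nodes=2, threads_per_node=2, iterations=200):
    """Increment a shared counter under the cohort lock; return its value."""
    lock = CohortLock(nodes)
    counter = [0]

    def worker(node):
        for _ in range(iterations):
            lock.acquire(node)
            try:
                counter[0] += 1
            finally:
                lock.release(node)

    threads = [threading.Thread(target=worker, args=(n,))
               for n in range(nodes) for _ in range(threads_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Calling `run_demo()` should return `nodes * threads_per_node * iterations`, since every increment happens while the caller holds both its node's local lock and, through its cohort, the global lock. The `max_passes` bound is an illustrative choice: without some limit, one node could keep the global lock indefinitely under sustained load.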