Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks, and much time is spent on the contention that arises at lock hand-off. To be serialized, requests to the same cache line can be either bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During the hand-off, only the requests from the winning processor contribute to the progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller that keeps the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent in every execution phase of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and, as a consequence, speeds up the whole parallel computation. The mechanism requires no compiler or programmer support and no ISA or coherence protocol changes. By simulating a 32-processor system, we show that request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in the remaining four applications, which have a large amount of synchronization activity, we see reductions in execution time ranging from 14% to 39% and in lock stall time ranging from 52% to 71%. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, and observe a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
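
The idea of request bypassing can be illustrated with a small software sketch. This is a toy model under stated assumptions, not the hardware implementation presented in the paper: the class `LineQueue`, its `winner` field, and the helper `requests_served_before_release` are hypothetical names introduced here, and the model simply assumes the controller already knows which processor won the lock, which the abstract does not detail.

```python
from collections import deque


class LineQueue:
    """Toy model of the requests buffered for one cache line at a coherence controller."""

    def __init__(self, bypass_enabled):
        self.bypass_enabled = bypass_enabled
        self.pending = deque()  # buffered (processor_id, kind) requests, FIFO order
        self.winner = None      # assumption: the controller knows the lock winner

    def set_winner(self, processor_id):
        self.winner = processor_id

    def enqueue(self, processor_id, kind):
        request = (processor_id, kind)
        if self.bypass_enabled and processor_id == self.winner:
            # Request bypassing: the winner's request goes ahead of buffered ones.
            self.pending.appendleft(request)
        else:
            # Plain buffering: requests are serialized in arrival order.
            self.pending.append(request)

    def service_order(self):
        """Return the buffered requests in the order they would be serviced."""
        order = list(self.pending)
        self.pending.clear()
        return order


def requests_served_before_release(bypass_enabled, losers=4):
    """Count requests serviced before the winner's release reaches the lock line."""
    queue = LineQueue(bypass_enabled)
    queue.set_winner(0)
    for processor_id in range(1, losers + 1):
        queue.enqueue(processor_id, "acquire")  # burst from the losing processors
    queue.enqueue(0, "release")                 # the winner's request arrives last
    return queue.service_order().index((0, "release"))
```

In this sketch, the winner's release waits behind every buffered request from the losing processors when bypassing is disabled, and is serviced first when it is enabled. The model captures only the ordering effect described in the abstract and says nothing about the timing or the performance figures reported there.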

Author information

Corresponding author

Correspondence to Benjamín Sahelices.

About this article

Cite this article

Sahelices, B., de Dios, A., Ibáñez, P. et al. Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers. J. Comput. Sci. Technol. 27, 75–91 (2012). https://doi.org/10.1007/s11390-012-1207-2

  • DOI: https://doi.org/10.1007/s11390-012-1207-2
