Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers

Sahelices, Benjamín; de Dios, Agustín; Ibáñez, Pablo; Viñals-Yúfera, Víctor; Llabería, José María

doi:10.1007/s11390-012-1207-2

Regular Paper
Published: 09 January 2012

Volume 27, pages 75–91 (2012)

Journal of Computer Science and Technology

Benjamín Sahelices¹,
Agustín de Dios¹,
Pablo Ibáñez²,
Víctor Viñals-Yúfera² &
José María Llabería³

Abstract

Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks, and much time is lost to contention at lock hand-off. To be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During the hand-off, only the requests from the winning processor contribute to the progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller that keeps the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent in every execution phase of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and, as a consequence, speeds up the whole parallel computation. The mechanism requires no compiler or programmer support and no changes to the ISA or the coherence protocol. By simulating a 32-processor system, we show that request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in the remaining four, which have a large amount of synchronization activity, execution time and lock stall time fall by 14% to 39% and by 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, and observe a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
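
The ordering effect that the abstract attributes to request bypassing can be illustrated with a toy queue model. This is only a sketch under stated assumptions, not the hardware implementation evaluated in the paper: the names `service_order` and `holder_wait` are hypothetical, the model is simply told which processor holds the lock (the abstract does not say how the controller identifies the winning processor), the holder is assumed to have no request already buffered, and every request is assumed to cost the same to service.

```python
from collections import deque


def service_order(buffered, incoming, holder, bypass):
    """Order in which a toy controller services requests to one lock line.

    buffered: processor ids whose requests are already queued (oldest first)
    incoming: processor id of the request that has just arrived
    holder:   processor id of the winning processor (assumed known here)
    bypass:   if True, a request from the holder skips the buffered requests
    """
    queue = deque(buffered)
    if bypass and incoming == holder:
        # Request bypassing: the winner's request goes to the front.
        queue.appendleft(incoming)
    else:
        # Plain buffering: requests are serviced in arrival order.
        queue.append(incoming)
    return list(queue)


def holder_wait(buffered, holder, bypass):
    """Number of requests serviced before the holder's new request.

    Assumes the holder has no earlier request in `buffered`.
    """
    return service_order(buffered, holder, holder, bypass).index(holder)


if __name__ == "__main__":
    waiting = [1, 2, 3]  # contenders already buffered at the controller
    print(holder_wait(waiting, holder=0, bypass=False))
    print(holder_wait(waiting, holder=0, bypass=True))
```

In this model the holder's request waits behind every buffered contender when the controller is plain FIFO and behind none when bypassing is enabled, while requests from other processors keep their arrival order in both cases. The model says nothing about actual latencies or about the speedups reported in the abstract.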

Author information

Authors and Affiliations

Computer Science Department and HiPEAC European Network of Excellence, University of Valladolid, Valladolid, Spain
Benjamín Sahelices & Agustín de Dios
Computer Science and Systems Engineering Department, I3A Research Institute and HiPEAC European Network of Excellence, University of Zaragoza, Zaragoza, Spain
Pablo Ibáñez & Víctor Viñals-Yúfera (Member, ACM, IEEE)
Computer Architecture Department and HiPEAC European Network of Excellence, Polytechnic University of Cataluña, Barcelona, Spain
José María Llabería

Corresponding author

Correspondence to Benjamín Sahelices.

About this article

Cite this article

Sahelices, B., de Dios, A., Ibáñez, P. et al. Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers. J. Comput. Sci. Technol. 27, 75–91 (2012). https://doi.org/10.1007/s11390-012-1207-2

Received: 30 October 2010
Revised: 23 August 2011
Published: 09 January 2012
Issue Date: January 2012
DOI: https://doi.org/10.1007/s11390-012-1207-2
