Inferential Queueing and Speculative Push

International Journal of Parallel Programming

Abstract

Communication latencies within critical sections are a major bottleneck in some classes of emerging parallel workloads. In this paper, we argue for two mechanisms that reduce these latencies: Inferentially Queued Locks (IQLs) and Speculative Push (SP). With IQLs, the processor infers the existence and extent of a critical section from the use of synchronization instructions and joins a queue of lock requestors, reducing synchronization delay. The SP mechanism extracts information about program structure by observing IQLs. SP allows the cache controller, when responding to a request for a cache line that likely contains a lock variable, to predict the data sets the requestor will modify within the associated critical section. The controller then pushes these lines from its own cache to the requestor's cache and also writes them back to memory. Overlapping the transfer of the protected data with that of the lock can substantially reduce communication latencies within critical sections. By pushing data in exclusive state, the mechanism can collapse a read-modify-write sequence within a critical section into a single local cache access. Because the pushed lines are also written back to memory, the receiving cache is free to ignore the push. Neither mechanism requires programmer or compiler support or any instruction set changes. Our experiments demonstrate that IQLs and SP can improve the performance of applications that synchronize frequently.
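The latency argument in the abstract can be illustrated with a toy cost model. The sketch below is not the authors' simulator or methodology: the cycle counts, the `Latencies` type, and both function names are hypothetical, and it assumes an idealized case in which pushed lines arrive fully overlapped with the lock transfer and each unpushed read-modify-write costs one remote read miss plus one remote ownership upgrade.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Latencies:
    """Illustrative cycle counts (assumed, not taken from the paper)."""
    remote: int = 100  # one cache-to-cache transfer or ownership upgrade
    local: int = 2     # one local cache hit


def baseline_cycles(num_lines: int, lat: Latencies) -> int:
    """Critical-section communication cost without Speculative Push.

    The lock line is transferred first; each protected line then pays a
    remote read miss followed by a remote upgrade for the write.
    """
    lock_transfer = lat.remote
    per_line = 2 * lat.remote  # read miss + upgrade for the write
    return lock_transfer + num_lines * per_line


def speculative_push_cycles(num_lines: int, predicted: int, lat: Latencies) -> int:
    """Cost when `predicted` of the protected lines are pushed with the lock.

    Pushed lines arrive in exclusive state, overlapped with the lock
    transfer (idealized), so each read-modify-write on them is a single
    local access. Lines that were not predicted pay the baseline cost.
    """
    if not 0 <= predicted <= num_lines:
        raise ValueError("predicted must be between 0 and num_lines")
    lock_transfer = lat.remote
    pushed = predicted * lat.local
    not_pushed = (num_lines - predicted) * 2 * lat.remote
    return lock_transfer + pushed + not_pushed


if __name__ == "__main__":
    lat = Latencies()
    for predicted in (0, 2, 4):
        print(predicted, speculative_push_cycles(4, predicted, lat))
    print("baseline", baseline_cycles(4, lat))
```

Under these assumed numbers, the model reduces to the baseline when nothing is predicted, and the benefit grows with the fraction of protected lines that the controller predicts correctly. Real savings depend on the interconnect, the coherence protocol, and prediction accuracy, none of which this sketch captures.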

Cite this article

Rajwar, R., Kägi, A. & Goodman, J.R. Inferential Queueing and Speculative Push. International Journal of Parallel Programming 32, 225–258 (2004). https://doi.org/10.1023/B:IJPP.0000029274.45582.a8
