research-article

Fence-free work stealing on bounded TSO processors

Authors:
Adam Morrison, Technion – Israel Institute of Technology, Haifa, Israel
Yehuda Afek, Tel Aviv University, Tel Aviv, Israel

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, February 2014, Pages 413–426
https://doi.org/10.1145/2541940.2541987

Published: 24 February 2014

ABSTRACT

Work stealing is the method of choice for load balancing in task parallel programming languages and frameworks. Yet despite considerable effort invested in optimizing work stealing task queues, existing algorithms issue a costly memory fence when removing a task, and these fences are believed to be necessary for correctness.

This paper refutes this belief, demonstrating work stealing algorithms in which a worker does not issue a memory fence for microarchitectures with a bounded total store ordering (TSO) memory model. Bounded TSO is a novel restriction of TSO, capturing mainstream x86 and SPARC TSO processors, that bounds the number of stores a load can be reordered with.

Our algorithms eliminate the memory fence penalty, improving the running time of a suite of parallel benchmarks on modern x86 multicore processors by 7%-11% on average (and up to 23%), compared to the Cilk and Chase-Lev work stealing queues.
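For context, the fence the abstract refers to sits in the worker's task-removal path: the worker publishes its decremented bottom index and must then read the thieves' top index, and under TSO that store-then-load pair can be reordered unless a fence separates them. The following is a minimal sketch of that baseline, assuming a fixed-capacity Chase-Lev-style deque written with C++17 atomics. The names `FencedDeque`, `push`, `take`, and `steal` are hypothetical, and this is not the paper's fence-free algorithm, only an illustration of where the conventional fence goes.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Illustrative only: fixed-capacity Chase-Lev-style deque (hypothetical names).
// T must be trivially copyable so it can live in std::atomic<T>.
template <typename T, std::size_t Capacity>
class FencedDeque {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");

    std::atomic<std::int64_t> top_{0};     // next index a thief steals from
    std::atomic<std::int64_t> bottom_{0};  // next index the worker pushes to
    std::atomic<T> slots_[Capacity];

    static std::size_t idx(std::int64_t i) {
        return static_cast<std::size_t>(i) & (Capacity - 1);
    }

public:
    // Worker only: add a task at the bottom.
    void push(T task) {
        std::int64_t b = bottom_.load(std::memory_order_relaxed);
        std::int64_t t = top_.load(std::memory_order_acquire);
        assert(b - t < static_cast<std::int64_t>(Capacity) && "deque full");
        slots_[idx(b)].store(task, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);
        bottom_.store(b + 1, std::memory_order_relaxed);
    }

    // Worker only: remove the most recently pushed task (LIFO).
    std::optional<T> take() {
        std::int64_t b = bottom_.load(std::memory_order_relaxed) - 1;
        bottom_.store(b, std::memory_order_relaxed);
        // The costly fence: keeps the store to bottom_ ordered before the
        // load of top_, which TSO would otherwise allow to be reordered.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        std::int64_t t = top_.load(std::memory_order_relaxed);
        if (t > b) {  // deque was empty: undo the decrement
            bottom_.store(b + 1, std::memory_order_relaxed);
            return std::nullopt;
        }
        T task = slots_[idx(b)].load(std::memory_order_relaxed);
        if (t == b) {  // last task: race against thieves with a CAS on top_
            bool won = top_.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed);
            bottom_.store(b + 1, std::memory_order_relaxed);
            if (!won) return std::nullopt;
        }
        return task;
    }

    // Any thread: remove the oldest task (FIFO).
    std::optional<T> steal() {
        std::int64_t t = top_.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        std::int64_t b = bottom_.load(std::memory_order_acquire);
        if (t >= b) return std::nullopt;  // nothing to steal
        T task = slots_[idx(t)].load(std::memory_order_relaxed);
        if (!top_.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed)) {
            return std::nullopt;  // lost the race to another thief or the worker
        }
        return task;
    }
};
```

In this sketch the fence in `take` runs on every removal, including the common case where no thief is present, which is the per-task cost the abstract describes. On x86 a sequentially consistent fence typically compiles to `mfence` or a locked instruction.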

Index Terms

- Computing methodologies → Concurrent computing methodologies → Concurrent programming languages
- Software and its engineering → Software notations and tools → General programming languages → Language types → Concurrent programming languages

Published in

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
February 2014, 780 pages
ISBN: 9781450323055
DOI: 10.1145/2541940
General Chairs: Rajeev Balasubramonian (University of Utah), Al Davis (University of Utah)
Program Chair: Sarita Adve (University of Illinois at Urbana-Champ)

ACM SIGARCH Computer Architecture News, Volume 42, Issue 1 (ASPLOS '14)
March 2014, 729 pages
ISSN: 0163-5964
DOI: 10.1145/2654822
Editor: Doug DeGroot

ACM SIGPLAN Notices, Volume 49, Issue 4 (ASPLOS '14)
April 2014, 729 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2644865
Editors: Mark W. Bailey (Hamilton College, Clinton, NY), Rajeev Balasubramonian (University of Utah), Al Davis (University of Utah), Sarita Adve (University of Illinois at Urbana-Champ)

Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
memory fences
tso
work stealing
Acceptance Rates
ASPLOS '14 paper acceptance rate: 49 of 217 submissions, 23%. Overall acceptance rate: 535 of 2,713 submissions, 20%.