DOI: 10.1145/2541940.2541987
research-article

Fence-free work stealing on bounded TSO processors

Published: 24 February 2014

ABSTRACT

Work stealing is the method of choice for load balancing in task parallel programming languages and frameworks. Yet despite considerable effort invested in optimizing work stealing task queues, existing algorithms issue a costly memory fence when removing a task, and these fences are believed to be necessary for correctness.
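To make the fence in question concrete, here is a minimal sketch of a Chase-Lev-style task queue written with C++ atomics. It illustrates the conventional fenced design described above, not the fence-free algorithms this paper contributes. The class name `WorkStealingDeque`, the fixed power-of-two capacity, and the specific memory orderings are assumptions made for illustration. A production queue would also resize its buffer.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical, simplified work-stealing deque (fixed capacity, no resizing).
// The owner pushes and takes at the bottom; thieves steal from the top.
template <typename T, std::size_t Capacity = 1024>
class WorkStealingDeque {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");

    std::atomic<std::int64_t> top_{0};
    std::atomic<std::int64_t> bottom_{0};
    std::atomic<T> buffer_[Capacity];

    static std::size_t slot(std::int64_t i) {
        return static_cast<std::size_t>(i) & (Capacity - 1);
    }

public:
    // Owner only: add a task at the bottom.
    void push(T task) {
        std::int64_t b = bottom_.load(std::memory_order_relaxed);
        std::int64_t t = top_.load(std::memory_order_acquire);
        assert(b - t < static_cast<std::int64_t>(Capacity) && "deque full");
        buffer_[slot(b)].store(task, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);
        bottom_.store(b + 1, std::memory_order_relaxed);
    }

    // Owner only: remove the most recently pushed task.
    std::optional<T> take() {
        std::int64_t b = bottom_.load(std::memory_order_relaxed) - 1;
        bottom_.store(b, std::memory_order_relaxed);
        // The costly fence: the store to bottom_ must become visible
        // before top_ is read, or a thief and the owner could both
        // claim the same task.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        std::int64_t t = top_.load(std::memory_order_relaxed);
        if (t > b) {  // deque was empty; restore bottom_
            bottom_.store(b + 1, std::memory_order_relaxed);
            return std::nullopt;
        }
        T task = buffer_[slot(b)].load(std::memory_order_relaxed);
        if (t == b) {  // last task: race against thieves with a CAS
            bool won = top_.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed);
            bottom_.store(b + 1, std::memory_order_relaxed);
            if (!won) return std::nullopt;
        }
        return task;
    }

    // Any thread: remove the oldest task.
    std::optional<T> steal() {
        std::int64_t t = top_.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        std::int64_t b = bottom_.load(std::memory_order_acquire);
        if (t >= b) return std::nullopt;  // nothing to steal
        T task = buffer_[slot(t)].load(std::memory_order_relaxed);
        if (!top_.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed))
            return std::nullopt;  // lost the race to another thread
        return task;
    }
};
```

In this sketch the owner pays for the sequentially consistent fence on every `take()`, even when no thief is active. That per-removal cost is the one the abstract refers to.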

This paper refutes this belief, demonstrating work stealing algorithms in which a worker does not issue a memory fence for microarchitectures with a bounded total store ordering (TSO) memory model. Bounded TSO is a novel restriction of TSO (one that captures mainstream x86 and SPARC TSO processors) that bounds the number of stores a load can be reordered with.

Our algorithms eliminate the memory fence penalty, improving the running time of a suite of parallel benchmarks on modern x86 multicore processors by 7%-11% on average (and up to 23%), compared to the Cilk and Chase-Lev work stealing queues.


      • Published in

        ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
        February 2014
        780 pages
        ISBN:9781450323055
        DOI:10.1145/2541940

        Copyright © 2014 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        ASPLOS '14 Paper Acceptance Rate: 49 of 217 submissions, 23%. Overall Acceptance Rate: 535 of 2,713 submissions, 20%.
