ABSTRACT
Work stealing is the method of choice for load balancing in task parallel programming languages and frameworks. Yet despite considerable effort invested in optimizing work stealing task queues, existing algorithms issue a costly memory fence when removing a task, and these fences are believed to be necessary for correctness.
This paper refutes this belief, demonstrating work stealing algorithms in which a worker does not issue a memory fence for microarchitectures with a bounded total store ordering (TSO) memory model. Bounded TSO is a novel restriction of TSO, capturing mainstream x86 and SPARC TSO processors, that bounds the number of stores a load can be reordered with.
Our algorithms eliminate the memory fence penalty, improving the running time of a suite of parallel benchmarks on modern x86 multicore processors by 7%-11% on average (and up to 23%), compared to the Cilk and Chase-Lev work stealing queues.
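For context, the fence the abstract refers to sits on the worker's removal path of a conventional work stealing deque. The following is a minimal sketch in the style of the Chase-Lev queue, not the fence-free algorithm this paper proposes. The class name `WorkStealingDeque`, the method names `put`, `take`, and `steal`, and the fixed capacity `N` are assumptions made for illustration, and the sketch omits resizing and overflow checks.

```cpp
#include <atomic>
#include <cstddef>

// Illustrative fixed-capacity deque in the Chase-Lev style.
// Hypothetical names; no resizing and no overflow check.
template <typename T, std::size_t N = 1024>
class WorkStealingDeque {
    std::atomic<long> top_{0};     // index thieves steal from
    std::atomic<long> bottom_{0};  // index the owner pushes to
    std::atomic<T> buf_[N];

    static std::size_t slot(long i) { return static_cast<std::size_t>(i) % N; }

public:
    // Owner only: add a task at the bottom.
    void put(T task) {
        long b = bottom_.load(std::memory_order_relaxed);
        buf_[slot(b)].store(task, std::memory_order_relaxed);
        bottom_.store(b + 1, std::memory_order_release);
    }

    // Owner only: remove the most recently added task.
    bool take(T& out) {
        long b = bottom_.load(std::memory_order_relaxed) - 1;
        bottom_.store(b, std::memory_order_relaxed);
        // The costly fence: without it, TSO hardware may satisfy the load
        // of top_ while the store to bottom_ is still in the store buffer.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        long t = top_.load(std::memory_order_relaxed);
        if (t > b) {  // deque was empty; undo the decrement
            bottom_.store(b + 1, std::memory_order_relaxed);
            return false;
        }
        T task = buf_[slot(b)].load(std::memory_order_relaxed);
        if (t == b) {  // last task: compete with thieves for it
            bool won = top_.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed);
            bottom_.store(b + 1, std::memory_order_relaxed);
            if (!won) return false;
        }
        out = task;
        return true;
    }

    // Any thread: remove the oldest task.
    bool steal(T& out) {
        long t = top_.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        long b = bottom_.load(std::memory_order_acquire);
        if (t >= b) return false;  // nothing to steal
        T task = buf_[slot(t)].load(std::memory_order_relaxed);
        if (!top_.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed))
            return false;  // lost the race to another thief or the owner
        out = task;
        return true;
    }
};
```

In this sketch the owner pays for the fence on every `take`, even when no thief is active, which is the per-task cost the abstract describes. On x86 a sequentially consistent fence typically compiles to an `mfence` or a locked instruction.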
Fence-free work stealing on bounded TSO processors
ASPLOS '14