A Lock-Free Dynamic-Sized Work-Stealing Algorithm with Steal-Half and Work Sharing

Abstract—In the paper Thread Scheduling for Multiprogrammed Multiprocessors, Arora, Blumofe, and Plaxton (ABP) implement a work-stealing algorithm that uses a double-ended queue (or deque) to solve the task-scheduling problem [1]. In this paper, we begin with an overview of work-stealing in general and the advantages and disadvantages of their approach. Our focus is to address and rectify a few key disadvantages of their static-sized deque implementation: the potential inefficiency that can occur when multiple stealers attempt to take work from the same thread, the inability of the deque to grow, and the lack of efficient work-balancing. Before explaining our algorithm (and our new data structure), we explain how our solution solves a different problem than the one addressed in A Dynamic-Sized Nonblocking Work Stealing Deque, how it improves on the ABP scenario mentioned above, how it provides an alternative to the work-balancing algorithm (called WorkSharingThread) in The Art of Multiprocessor Programming, and how it differs from the algorithm in the paper Non-Blocking Steal-Half Work Queues. Following this, we delve into the details of our algorithm, informally proving how it can effectively distribute a workload when there are multiple stealers. Finally, we show the results of our experiments on various workloads, comparing the performance to that of ABP.


I. INTRODUCTION
A. Background on Work-Stealing

As expressed by Blumofe and Leiserson in Scheduling Multithreaded Computations by Work Stealing, the work-stealing algorithm is often used to "address the problem of scheduling multithreaded computations" [3]. Modern computers have more than one processor, so more than one thread of execution can be running at any instant. The scheduler portion of the operating system assigns different tasks to the threads [4]. Some threads, however, could be assigned too many tasks or be given jobs that take much longer than those assigned to other threads. This can result in a situation where one thread has no work while other threads still have a substantial amount of work to complete.
In [1], the authors represent each thread's workload as a deque (double-ended queue), where each thread adds and takes the tasks it needs to execute on one side of the deque (referred to as the bottom). Work-stealing only occurs when a thread has completed all of its own tasks; it is at this point that it attempts to take from other threads. In their fixed-size implementation, they randomize the victim from which a thread steals to ensure that there is little contention when taking a task from another thread's workload [1]. Stealers reduce the number of tasks another thread has to complete by removing tasks from the opposite end of the double-ended queue (referred to as the top).
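The bottom/top discipline can be illustrated with a sequential sketch (using java.util.ArrayDeque purely for illustration; the real ABP deque synchronizes the top with a CAS, and the names below are ours):

```java
import java.util.ArrayDeque;

// Sequential illustration of the work-stealing deque discipline:
// the owner pushes and pops at the bottom; a stealer removes from the top.
// This sketch is NOT thread-safe; ABP synchronizes the top with CAS.
class DequeSketch {
    static ArrayDeque<String> deque = new ArrayDeque<>();

    static void ownerPush(String task) { deque.addLast(task); }    // bottom
    static String ownerPop()           { return deque.pollLast(); }  // bottom
    static String stealerTake()        { return deque.pollFirst(); } // top

    public static void main(String[] args) {
        ownerPush("t1"); ownerPush("t2"); ownerPush("t3");
        System.out.println(stealerTake()); // oldest task, taken from the top: t1
        System.out.println(ownerPop());    // newest task, taken from the bottom: t3
    }
}
```

Because the owner works at the bottom and stealers work at the top, the two parties only contend when the deque is nearly empty.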

B. Overview of Our Solution
Although this approach works well in many cases, we found a few issues with it that we hope to improve:
1) Not Growable: It uses fixed-size arrays, which causes problems when tasks spawn other tasks [1]. This can eventually lead to an overflow of the array, which their solution does not handle [2]. Our implementation is growable, since we use a linked list of arrays to hold the tasks for a single thread.
2) Does Not Rebalance: [1] does not re-balance the number of tasks assigned to a thread. Because of this, one thread may contain many tasks that it has to complete, while another may always be forced to steal. Our solution solves this by forcing a thread with no work to take half of another thread's work so that the work is split evenly between them.
3) Unnecessary Competition for Resources: Competition for resources occurs when the randomization formula is ineffective and repeatedly selects the same thread to steal from. This is costly in ABP's implementation because if several stealers attempt to take from the same thread at the same time, only one succeeds (the CAS operation ensures exclusive access to the top of the deque). When a stealer attempts to steal (PopTop() in their implementation) and fails, it still runs all the way to the end of the operation before aborting [1]. In our solution, we also allow only one stealer at a time, but each stealer has its own assigned queue per thread that it can take from. As a result, stealers can take from multiple different threads without contention.
4) The Expensive Random Function: Finally, the random function is an expensive operation in many programming languages, so selecting which thread to steal from can unnecessarily take many clock cycles. Our algorithm divides the workload using a deterministic ordering, and thus we avoid the random function entirely.
We will concentrate on the problems mentioned in I-B.

II. RELATED WORKS
The existing works on work-stealing often build directly off of [1] or solve work-balancing in general. Ours builds off of the existing solutions in order to solve the work-stealing problem in a different way. In those papers, each processor owns a deque, and threads lie inside the deques. In our implementation, however, we refer to the objects that lie inside our data structure as threads and tasks, since we do not actually have access to processes in our Java program. To avoid confusion, when explaining the related works, we will use the terminology from our own implementation.
A. Using a Dynamic Deque to Solve Work-Stealing

1) The Solution by Hendler, Lev, Moir, and Shavit: [2] draws heavily from [1], which was the motivation for our solution. A high-level overview of the ABP algorithm [1] is given in the Background on Work-Stealing section. As stated earlier, threads can have many tasks assigned to them. If the number of tasks were known before scheduling began, then an array implementation would be sufficient. However, tasks can spawn other tasks, and [1] does not have an elegant solution to this.
In A Dynamic-Sized Nonblocking Work Stealing Deque, the authors use a doubly-linked list of DequeNodes (which they call a DynamicDeque) [2]. Each DequeNode contains an array of tasks for the thread to complete, just like the ABP solution. The top and bottom also work similarly; in ABP, these two "pointers" were represented as indices into a thread's task array [1]. However, in [2], they are represented by two values: a pointer to the DequeNode that points to the bottom or the top and an index into the array represented by the DequeNode. These two characteristics, in unison, make the dynamic-sized implementation follow fairly easily from ABP.
Since low-level memory allocation affects the scalability of parallel computations, this related work creates both a shared and a thread-local pool of allocated DequeNodes that DynamicDeques can later draw from. Each thread first tries to draw from the pool that is local to it. If that fails, it will try to access the shared pool, using a compare-and-swap to serialize the accesses [2].
2) Comparing Our Implementation to Theirs: Our solution will be presented more thoroughly later, but like the aforementioned, it is also growable: tasks can be created by other tasks, and our algorithm accommodates these additions easily. However, the solution in [2] is still only able to steal one task at a time, and only one stealer at a time can steal from a given thread [2]. Ours also allows only one thread to steal from one victim, but it lets a thread redistribute the load of busy threads by splitting the work with another thread that has no work.
In addition, since we used Java for our implementation, we have no control over the destruction of the objects that we create. As a result, our comparison will be inexact. Since our goal is to compare to the original ABP algorithm, we have implemented the original ABP code in Java to control for this factor.
Finally, their implementation still uses randomness to decide which thread a stealer should take from [2]. Our implementation is more deterministic.

B. Non-Blocking Steal-Half Work Queues
1) The Solution in [5]: While the original ABP algorithm [1] and the dynamic-sized deque solution [2] only allow one item to be stolen at a time, the solution by Hendler and Shavit allows half of the tasks in a thread's workload to be stolen. Just like [1] and [2], the data structure used is a deque. If it were a simple extension of ABP in which the synchronization of each steal simply happened one after the other, it would be very inefficient. According to the authors of [5], "in order to steal more than one item at a time...one must overcome a much greater uncertainty." In the specific scenario of the half-steal described in the paper, this means that the distance between the bottom and top pointers cannot be less than half the size of the work-stealing deque [5]. If it were, then the top and bottom pointers could overlap, and the functionality of the deque would no longer be clear. Solving this problem "requires consensus to be performed for half the items in the deque," which ensures agreement on the number of items to be stolen in a single CAS operation [5].
To remedy the uncertainty and the numerous synchronization operations that would seem to be required, the authors of Non-Blocking Steal-Half Work Queues perform the consensus operation when the number of tasks left is a power of two. The details behind this method are complex, but they were able to show "that missing the counter update can only affect locations beyond the next power-of-two at any given point" [5].
2) Comparing Our Implementation to Theirs: In Non-Blocking Steal-Half Work Queues, there are stark performance differences between successful and unsuccessful stealing. If a steal fails in [5], then the pointer of each item that was expected to be stolen by the other thread is redundantly copied. This is worse than the ABP solution, where the steal attempt is simply aborted for the one unsuccessful item transfer. In our algorithm, threads never compete with one another to steal-half from a single victim: only one thread at a time is able to steal-half from a given thread's workload. Thus, we avoid this potential inefficiency.
When a steal in [5] succeeds, only one CAS operation needs to be done in order to steal the batch of tasks. This is similar to how our algorithm works, except that we have an outer data structure that ensures exclusive access. In our solution, if a thread wants to do a steal-half, it will always be able to, because it holds an entry in the aforementioned data structure and, as a result, has exclusive access to every task queue whose number matches its mask [5].

C. WorkSharingThread: A Work-Balancing Implementation
1) The Solution Presented in [4]: Work-stealing is described in [2] as a technique in which "each process tries to work on its newly created threads locally, and attempts to steal threads from other processes only when it has no local threads to execute." In Chapter 16 of The Art of Multiprocessor Programming, the authors present a different way to solve work-stealing that gave us the inspiration needed to fix the original ABP algorithm and its growable counterpart. In their implementation, threads with few tasks balance their workloads with another, more heavily taxed thread [4].
In WorkSharingThread, which is their Java implementation of the algorithm, the tasks are organized into queues, and the re-balancing is done if the number of victim tasks exceeds some predefined value. Like the ABP solution and the dynamic-sized deque, the algorithm presented by Herlihy and Shavit chooses the thread to steal from at random. Additionally, in order to re-balance, one thread locks its own queue, as well as the queue of the other thread that will be used in the re-balancing [4].
2) Comparing Our Implementation to Theirs: [4]'s implementation and ours both use re-balancing in order to solve work-stealing. However, our re-balancing operation is lock-free, while theirs must acquire two locks. This limits the progress a system can make, since neither task queue can do anything while the re-balancing takes place. We also use different data structures than Herlihy and Shavit, but this will be explained in more detail in the explanation of our implementation.

III. OUR ALGORITHM

A. Details of the Data Structure
Our solution improves upon the related works mentioned in the previous sections. It uses a complex data structure that allows it to be growable and have simultaneous stealing. In addition, it does not use the costly random function in order to steal. Finally, we have a lock-free re-balancing solution.
The following diagrams illustrate this data structure, and we will use the details of these figures to help explain the technical minutiae of our implementation. Note that each thread has an array of task queues (the red boxes in figure 1) and that the size of a thread's array is equal to the number of threads. Each slot in the array holds a task queue destined for one thread: with four threads, thread one's array has four slots, where the first slot contains a task queue for thread one, the second a task queue for thread two, and so on. Thread two's array is identical in structure to thread one's, as are thread three's and thread four's. Essentially, every thread has a task queue for every thread.
Each task queue (the purple box in figure 1) is actually a linked list of arrays, which is similar to the dynamic-sized deque implementation in [2]. Each node within the task queue contains an array of tasks; tasks are represented as yellow squares in figure 1. If new tasks are being created and the number of tasks exceeds the size of the array in the last node of the linked list, we create a new node. Indices into the task queue are used for adding and stealing tasks. Since each task queue has at most one producer and one consumer at a time, these indices are very simple and do not require atomic operations. In addition to an array of tasks and a pointer to the next task node (initially null), each task node contains a consumer pointer and a producer pointer, each of which is an index into the array of tasks.
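A minimal sketch of this node layout, assuming a single producer and a single consumer per queue (class names and the 100-slot capacity are illustrative; our actual code appears in the figures):

```java
// Sketch of the task-queue layout described above: a linked list of
// fixed-size arrays with per-node producer and consumer indices.
// With one producer and one consumer, plain int indices suffice.
class TaskNode {
    static final int CAPACITY = 100;           // illustrative array size
    final Runnable[] tasks = new Runnable[CAPACITY];
    int producer = 0;                          // next free slot to write into
    int consumer = 0;                          // next slot to take a task from
    TaskNode next = null;                      // initially null
}

class TaskQueue {
    TaskNode head = new TaskNode();
    TaskNode tail = head;

    // Single producer: grow by linking a fresh node when the tail fills up.
    void add(Runnable task) {
        if (tail.producer == TaskNode.CAPACITY) {
            tail.next = new TaskNode();
            tail = tail.next;
        }
        tail.tasks[tail.producer++] = task;
    }

    // Single consumer: advance past exhausted nodes; null means empty.
    Runnable take() {
        while (head.consumer == head.producer) {
            if (head.next == null) return null;
            head = head.next;
        }
        return head.tasks[head.consumer++];
    }

    public static void main(String[] args) {
        TaskQueue q = new TaskQueue();
        for (int i = 0; i < 250; i++) q.add(() -> { /* task body */ });
        int taken = 0;
        while (q.take() != null) taken++;
        System.out.println("took " + taken + " tasks"); // took 250 tasks
    }
}
```

Adding 250 tasks forces the queue to grow across three nodes, demonstrating the growability property without any atomic operations.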
Finally, there is a global array of "masks" that every thread can access, and the size of this mask array is equal to the number of threads. The masks are mutually exclusive, as only one thread can own a mask at a time, and they are numbered the same as the threads (n threads numbered from one to n means n masks numbered from one to n). Each mask is really just a pointer to a thread, and when a mask is unused, it points to null. In the current implementation, the mask array is actually implemented as a linked list, but it functions the same as an array implementation would.
B. Basics of the Algorithm

1) Spawning Tasks: Each thread can only spawn tasks in its own array. Therefore, if thread one wants to spawn a task for thread four, thread one adds a task to the task queue located at the fourth slot of thread one's array. We use round-robin to decide who should get a newly spawned task. Each thread starts by pointing to its own index in its task-queue array and gives the first spawned task to itself; it gives the second spawned task to itself + 1, the third to itself + 2, and so on. This means that thread one would give itself its first spawned task, then give the next spawned task to thread two, then thread three, then thread four, and finally give the next task to itself again.
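The round-robin assignment described above could be sketched as follows (names are ours; the actual addTask and enqueueTasks methods appear in figures 5 and 6):

```java
// Illustrative round-robin assignment: thread t gives its k-th spawned
// task to thread (t + k) mod n, starting with itself.
class RoundRobinSpawner {
    final int numThreads;
    int next; // index of the thread that receives the next spawned task

    RoundRobinSpawner(int myId, int numThreads) {
        this.numThreads = numThreads;
        this.next = myId; // the first spawned task goes to the spawner itself
    }

    // Returns which thread's queue (in this thread's own array) gets the task.
    int assignNext() {
        int target = next;
        next = (next + 1) % numThreads;
        return target;
    }

    public static void main(String[] args) {
        RoundRobinSpawner s = new RoundRobinSpawner(0, 4);
        // Thread one's assignment order with four threads: 0 1 2 3 0
        for (int i = 0; i < 5; i++) System.out.print(s.assignNext() + " ");
    }
}
```

Because the assignment order is fixed, no random number generation is needed when distributing newly spawned work.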
The code for this functionality can be seen in figure 5 (the addTask method) and in figure 6 (the enqueueTasks method).

2) Stealing Tasks: In order for a thread to get a new task, it must be wearing a mask. The number of the mask determines which task lists this thread can access: if a thread is wearing mask one, it can access everyone's task queue for thread one. If a thread gets its own mask (the one that matches its thread number) and has work in one of its own task queues, it will take one of its own tasks. If a thread cannot get its own mask, it does not have enough information to know whether it has run out of work, so it will try to steal one task from someone else (this can change if the thread loops around, gets back to its own mask, and finds that it has run out of work). If a thread gets its own mask and has no work, it will try to steal half of someone else's work (this can change if the thread loops around, gets back to its own mask, and finds that it has work). This way, threads steal only one task if they either have work or do not know whether they have work, and they split someone else's workload in half if they know they have run out of work.
Our code for stealing can be referenced in figure 4.
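The three-way decision described above can be summarized in a sketch (names hypothetical; our actual stealing code is in figure 4):

```java
// Sketch of the stealing decision: given which mask a thread holds and
// whether it has work of its own, decide whether to take its own task,
// steal one task, or steal half of a victim's workload.
enum StealAction { TAKE_OWN, STEAL_ONE, STEAL_HALF }

class StealPolicy {
    static StealAction decide(int myId, int maskHeld, boolean iHaveWork) {
        if (maskHeld == myId) {
            // Own mask: the thread knows its true state with certainty.
            return iHaveWork ? StealAction.TAKE_OWN : StealAction.STEAL_HALF;
        }
        // Someone else's mask: the thread cannot tell whether it is out
        // of work, so it conservatively steals a single task.
        return StealAction.STEAL_ONE;
    }

    public static void main(String[] args) {
        System.out.println(decide(1, 1, false)); // STEAL_HALF
    }
}
```

Note that a steal-half is only ever attempted by a thread that has confirmed, via its own mask, that it is out of work.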

3) Details of the Mask:
The acquisition of masks is a key part of the algorithm, and it works because of the preferences of each of the threads. For example, thread one favors mask one the most, thread two favors mask two the most, and so on. If a particular thread fails to get its own mask, its second preference is the mask whose number is one greater than its own (wrapping around the array when the number exceeds the array size). For example, thread one favors mask two if it cannot get mask one, thread two favors mask three if it cannot get mask two, and so on. Essentially, every mask is favored the most by exactly one thread, the second most by exactly one thread, the third most by exactly one thread, and so forth, so that all masks have equal average priority.
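With zero-based numbering, this preference order reduces to simple modular arithmetic (a sketch with names of our choosing):

```java
// Deterministic mask preference: on its k-th attempt, thread t tries
// mask (t + k) mod n, so each thread starts at its own mask and wraps
// around. Every mask is some thread's first choice, some thread's
// second choice, and so on, giving all masks equal average priority.
class MaskPreference {
    static int preferred(int threadId, int attempt, int numMasks) {
        return (threadId + attempt) % numMasks;
    }

    public static void main(String[] args) {
        // Preference order for thread 2 out of 4 threads: 2 3 0 1
        for (int k = 0; k < 4; k++) {
            System.out.print(preferred(2, k, 4) + " ");
        }
    }
}
```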
Our code for the mask can be referenced in figure 4.

4) Steal-Half: As explained in III-B1, task spawning is deterministic in the sense that a producer always follows the same order when assigning newly spawned tasks to threads. This means that if some threads are faster than others, there will be an imbalance of work. The multistealing in this algorithm is designed to fix such imbalances. Consider an example with three threads, where threads one and two are very fast and have no work, while thread three has 40 tasks. In this example, either thread one or thread two would end up stealing 20 tasks in order to redistribute the workload. The new workloads following a multisteal could be: thread one has 20 tasks, thread two has 0 tasks, and thread three has 20 tasks. Thread two could now steal from either thread one or thread three. After this, the end result would be task-list sizes of 10, 10, and 20, which is much better than 0, 0, and 40. Note that this adjustment of workloads took only two multisteal operations. Also note that this multistealing technique reduces contention on task queues: after a multisteal occurs, there is one more non-empty task queue from which stealers can take.
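The arithmetic of the example above can be replayed directly on task counts (this sketch tracks counts only and ignores the mask protocol):

```java
import java.util.Arrays;

// Replaying the worked example: workloads {0, 0, 40} and two multisteals.
class RebalanceExample {
    // Move half of the victim's tasks to the thief.
    static void stealHalf(int[] load, int thief, int victim) {
        int taken = load[victim] / 2;
        load[thief] += taken;
        load[victim] -= taken;
    }

    public static void main(String[] args) {
        int[] load = {0, 0, 40};
        stealHalf(load, 0, 2); // thread one takes 20 from thread three
        stealHalf(load, 1, 0); // thread two takes 10 from thread one
        System.out.println(Arrays.toString(load)); // [10, 10, 20]
    }
}
```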
Our code for steal-half can be referenced in figure 4.

1) No Randomization:
There is no randomization in our algorithm, because newly spawned tasks are given to the threads in a round-robin order, and mask acquisition follows the same deterministic ordering. Both of these orderings could be made random, but it would be unnecessary.
2) Task Lists Do Not Need to Use Atomic Operations for Adding and Removing Tasks: An underlying feature of this algorithm is that work is assigned to specific threads, which eliminates contention on the task queues. A thread can only produce work into the queues in its own array, and it can only acquire work from a specific set of queues per mask.
3) Task Queues Are Growable: This allows for more flexibility and memory efficiency. We could add some way of pooling task-queue nodes, but since we stated that garbage collection is outside the scope of our algorithm, this may be unnecessary.
4) Re-balancing Is Fast: Before re-balancing a workload between two threads, our implementation first calculates the total size of a thread's task list, which is O(num_threads × max nodes in a thread's task queue for this mask). This may seem to be a costly operation, but the number of threads is always fixed, and even when there are many threads, the maximum number of nodes in a thread's task queue remains small due to the large array sizes that we use for the task nodes.
5) Non-Blocking: According to [4], a blocking algorithm is one in which "an unexpected delay by one thread can prevent others from making progress." Our solution is non-blocking due to the implementation of the mask. This can best be shown through an example.
Assume that a thread is holding mask 2. In addition, assume that another thread is trying to access mask 2. This new thread's CAS will fail, but it will not block itself until mask 2 is released. Instead, it moves on and tries to access mask 3.
6) Lock-Free: In [4], an algorithm is defined as lock-free "if it guarantees that infinitely often some method call finishes in a finite number of steps." This algorithm is lock-free because spawning a new task can always be done in O(1) time, and acquiring a new task always allows at least one thread to make progress.
The mask acquisition can be reduced to a scenario where there are some number of mutually exclusive shared resources, and there are an equal number of threads. A resource is free if it points to null, and a resource is taken if it points to a thread.
The algorithm can best be explained in 4 steps. First, look at one of the shared resources. Second, try to CAS this resource to point to me, where I expect it to point to null. Third, if the CAS succeeded, do a finite number of computational steps, then set the resource to point to null again. Fourth, if the CAS failed, try the next resource (so that I continually circle around the resource list looking for an open one).
The previous paragraph explains why re-balancing is also lock-free: no thread ever blocks, and when a thread is able to use a resource, it does so in a finite number of steps before giving it back. When a thread cannot use a resource, it is because another thread is making progress. Since spawning tasks, stealing tasks, and re-balancing are the components that comprise our algorithm, our solution is lock-free.
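The four-step acquisition loop can be sketched over an array of CAS-protected resources (using java.util.concurrent.atomic.AtomicReferenceArray; class and method names are ours):

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of the four-step loop: an array of mutually exclusive resources,
// each slot either null (free) or pointing to its owning thread.
class ResourcePool {
    final AtomicReferenceArray<Thread> owners;

    ResourcePool(int n) { owners = new AtomicReferenceArray<>(n); }

    // Circle the resources until a CAS from null to `me` succeeds.
    // A failed CAS means some other thread owns that slot and is making
    // progress, which is why the scheme is lock-free but not wait-free.
    int acquire(Thread me) {
        int i = 0;
        while (!owners.compareAndSet(i, null, me)) {
            i = (i + 1) % owners.length();
        }
        return i;
    }

    // After a finite number of steps, set the resource back to null.
    void release(int i) { owners.set(i, null); }

    public static void main(String[] args) {
        ResourcePool pool = new ResourcePool(4);
        int slot = pool.acquire(Thread.currentThread());
        System.out.println("acquired resource " + slot); // acquired resource 0
        pool.release(slot);
    }
}
```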
7) Not Wait-Free: In [4], a method (or algorithm) is defined as wait-free "if it guarantees that every call finishes its execution in a finite number of steps." While our algorithm fulfills the progress conditions mentioned in the last two subsections, it is not wait-free, because it is possible for a thread to get stuck in the phase where it is acquiring a task. This can happen for either of two reasons: every time a thread gets a mask there is no work left, or a thread never gets a mask because it is always beaten to one by other threads.
8) Avoids the ABA Problem: In [4], the ABA problem (17) is described as one that often occurs in "dynamic memory algorithms that use conditional synchronization operations." It can occur when CAS operations are used to determine which thread performs an operation first. The problem arises when previous alterations of a data structure make it seem as if a pointer to a dynamic structure is what is expected, when in fact it may have been removed from the data structure and inserted back in with a new value. The CAS operation will then succeed when it should not. As [4] explains, this is often fixed by adding a time stamp to the data structure.
However, our solution avoids this because we do not reuse old nodes. We simply reallocate a new task node and let the JVM take care of the old task nodes (whose tasks have already been completed). Although garbage collection is not lock-free, it is outside the scope of this paper. It is an implementation detail of Java that we cannot avoid.

IV. EXPERIMENTAL RESULTS

A. Machine Specs
All of our tests were done on a Lenovo Y410p computer running the 64-bit Windows 8.1 operating system. It contains an i7-4700MQ processor that runs at 2.40 GHz. The machine also has 8 GB of RAM and 4 physical cores (8 logical).

B. How We Judge Performance
The performance testing is based on the experimentation methodology presented by Izadpanah et al. (18). For performance testing, we compared the execution times of various workloads for our implementation versus the original ABP solution. Both solutions were implemented in Java, and therefore we do not explicitly handle garbage collection. In addition, we do not compare the memory usage of ABP and our algorithm.
For each task workload, we measured the execution time when it was divided among 2, 4, 8, 16, 32, 64, and 128 threads. For each of these thread counts, we ran our algorithm 5 times and averaged the execution times before placing the result as a single point on our graph. Our goal was to have our implementation perform better than ABP when there are many threads.
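A sketch of this measurement loop (the workload here is a placeholder Runnable; our real tests ran the two schedulers at each thread count):

```java
// Measurement harness sketch: run a workload several times per thread
// count and average the wall-clock times, as described above.
class Benchmark {
    static double averageMillis(Runnable workload, int runs) {
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload.run();
            total += System.nanoTime() - start;
        }
        return (total / (double) runs) / 1_000_000.0;
    }

    public static void main(String[] args) {
        for (int threads : new int[]{2, 4, 8, 16, 32, 64, 128}) {
            // Placeholder workload; a real test would run the scheduler
            // under measurement with `threads` worker threads.
            double ms = averageMillis(() -> { }, 5);
            System.out.println(threads + " threads: " + ms + " ms");
        }
    }
}
```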

C. Performance Results
To make the tests easier to reference, we have assigned each of the 4 tests a name that succinctly describes its functionality. The test whose results appear in figure 7 is called "SmallVolumeHardSingleSpawner," the one referencing figure 8 is called "EvenMixEasyHard," the test behind figure 9 is called "HighVolumeEasySingleSpawner," and the one for figure 10 is called "UnevenMixEasyHard."

Fig. 7. The "SmallVolumeHardSingleSpawner" test has a single spawner produce 1,000 long tasks (each of which is a dummy function that takes O(n^3) time, where n = 2000) among the available threads.

Fig. 8. The "EvenMixEasyHard" test contains three types of tasks. Task 1 is an easy task that does not spawn any child tasks and takes a short amount of time. Task 2, on the other hand, is a hard task that spawns many instances of task 1 and takes a long time. Finally, task 3 is the outer task, which spawns an approximately even mix of task 1 and task 2.

1) Explanation of Graph Results for SmallVolumeHardSingleSpawner Test: For our algorithm, as the number of threads increases, the likelihood of yielding the CPU increases. Even though there is more context switching and memory usage for the additional threads, the number of tasks each thread must execute decreases. This means that each thread can finish its own workload more quickly and steal half of another thread's workload.
For the ABP algorithm, as the number of threads increases, the likelihood that a thread will randomly choose a victim with work decreases. Similar to our algorithm, as the number of threads increases, the likelihood of yielding the CPU increases. This is because there will be more failed steal attempts, each of which causes a yield. While each thread has less work to do, the single spawning thread acts as a bottleneck for this test case, and ours distributes the work evenly and more efficiently for a sufficiently large number of threads.
2) Explanation of Graph Results for EvenMixEasyHard Test: For this case, the performance of both algorithms is comparable. This can be explained by the fact that there are many tasks that spawn other tasks. For the ABP algorithm, this means there is virtually no stealing and therefore no bottlenecks. For our algorithm, there is also virtually no stealing and no bottlenecks. In both algorithms, threads can work in parallel, and the stealing features unique to each algorithm become irrelevant.
3) Explanation of Graph Results for HighVolumeEasySingleSpawner Test: For this test case, each task takes virtually no time to complete, which means that most of the time in both algorithms is spent acquiring more tasks. In both algorithms, there is only a single spawner and therefore a large amount of contention for tasks. This explains the overall increase in execution time as the number of threads increases. For our algorithm specifically, there is a significant amount of object creation and deletion, as there are 20,000 total tasks yet only 100 slots in each node's array. We think Java's object creation and deletion contributes significantly to our performance degradation.

4) Explanation of Graph Results for UnevenMixEasyHard Test: This test case incorporates features of all the previous test cases. A small portion of the tasks take a long time to execute and spawn more tasks. A much larger portion of the tasks take a short amount of time to execute and do not spawn more tasks. There is plenty of work to go around, but some threads may be preoccupied with more difficult tasks than others, which creates an uneven load balance among the threads.
Our algorithm uses a round robin approach so that spawned work is always evenly distributed. When this round robin approach breaks down due to some threads completing work faster than other threads, the lock-free steal-half functionality is able to re-balance everyone's workload. Since work is distributed among many threads, steal-half is successful most of the time.
The ABP algorithm only allows spawned tasks to be placed in the spawning thread's queue. In contrast to our algorithm, this means stealing is often unsuccessful, as work is concentrated in a few locations instead of being spread throughout all of the threads' queues.

V. CONCLUSION
In this paper, we presented an alternative to the work-stealing deque for load distribution. In our implementation, we give each thread an array of task queues, where the size of a thread's array is equal to the number of threads. The mask array helps ensure that only one thread of execution at a time has access to all the tasks that a particular thread is assigned.
As can be seen from our graphs, our algorithm seems to work better than the ABP solution in the cases where contention is more likely and there are many threads. The ABP solution seems to work better for a smaller number of threads and when there are short tasks for the threads to complete.