Accelerating the XGBoost algorithm using GPU computing

We present a CUDA-based implementation of a decision tree construction algorithm within the gradient boosting library XGBoost. The tree construction algorithm is executed entirely on the graphics processing unit (GPU) and shows high performance with a variety of datasets and settings, including sparse input matrices. Individual boosting iterations are parallelised, combining two approaches. An interleaved approach is used for shallow trees, switching to a more conventional radix sort-based approach for larger depths. We show speedups of between 3× and 6× using a Titan X compared to a 4 core i7 CPU, and 1.2× using a Titan X compared to 2× Xeon CPUs (24 cores). We show that it is possible to process the Higgs dataset (10 million instances, 28 features) entirely within GPU memory. The algorithm is made available as a plug-in within the XGBoost library and fully supports all XGBoost features including classification, regression and ranking tasks.


INTRODUCTION
Gradient boosting is an important tool in the field of supervised learning, providing state of the art performance on classification, regression and ranking tasks. XGBoost is an implementation of a generalised gradient boosting algorithm that has become a tool of choice in machine learning competitions. This is due to its excellent predictive performance, highly optimised multicore and distributed machine implementation and the ability to handle sparse data.

Despite good performance relative to existing gradient boosting implementations, XGBoost can be very time consuming to run. Common tasks can take hours or even days to complete. Building highly accurate models using gradient boosting also requires extensive parameter tuning. In this process the algorithm must be run many times to explore the effect of parameters such as the learning rate and L1/L2 regularisation terms on cross validation accuracy. By far the most time-consuming part of the XGBoost algorithm is the construction of decision trees within each boosting iteration. This paper describes and evaluates a GPU-based algorithm for this tree construction step.

Graphics processing units (GPUs) have recently been used to accelerate compute intensive tasks in machine learning and many other fields through the utilisation of their specialised SIMD architecture.

Figure 1 shows an example decision tree that can predict whether or not an individual owns a house. The decision is based on their age and whether or not they have a job. The tree correctly classifies all instances from Table 1.

Decision tree algorithms typically expand nodes from the root in a greedy manner in order to maximise some criterion measuring the value of the split. For example, decision tree algorithms from the C4.5 family (Quinlan, 2014), designed for classification, use information gain as the split criterion. Information gain describes a change in entropy H from some previous state to a new state. Entropy is defined as

$$H(T) = -\sum_{y \in Y} P(y) \log_2 P(y)$$

where T is a set of labelled training instances, y ∈ Y is an instance label and P(y) is the probability of drawing an instance with label y from T. Information gain is defined as

$$IG(T, T_{left}, T_{right}) = H(T) - \frac{n_{left}}{n_{total}} H(T_{left}) - \frac{n_{right}}{n_{total}} H(T_{right})$$
Here T_left and T_right are the subsets of T created by a decision rule, and n_total, n_left and n_right refer to the number of examples in the respective sets.

Many different criteria exist for evaluating the quality of a split. Any function can be used that produces some meaningful separation of the training instances with respect to the label being predicted. In order to find the split that maximises our criterion we can enumerate all possible splits on the input instances for each feature. In the case of numerical features, and assuming the data has been sorted, this can be done in a single pass over the feature values, considering a split point between each pair of adjacent values.

Another consideration when building decision trees is applying some form of regularisation to prevent overfitting. Overfitting on training data leads to poor model generalisation ability and poor performance on test data. Given a sufficiently large decision tree it is possible to generate unique decision rules for every instance in the training set such that each training instance is correctly labelled. This results in a model that fits the training data perfectly but generalises poorly.

Gradient boosting is a variation on boosting which represents the learning problem as gradient descent on some arbitrary differentiable loss function that measures the performance of the model on the training set. More specifically, the boosting algorithm executes M boosting iterations to learn a function F(x) that outputs predictions ŷ = F(x) minimising some loss function L(y, ŷ). At each iteration we add a new estimator f(x) to try to correct the prediction of y for each training instance.
We can correct the model by setting f(x) to:

$$f(x) = y - F_m(x)$$

This fits the model f(x) for the current boosting iteration to the residuals y − F_m(x) of the previous iteration. In practice, we approximate f(x), for example by using a depth limited decision tree.

This iterative process can be shown to be a gradient descent algorithm when the loss function is the squared error:

$$L(y, F(x)) = \frac{1}{2}(y - F(x))^2$$

The loss over all training instances can be written as

$$J = \sum_i L(y_i, F(x_i))$$

We seek to minimise J by adjusting F(x_i). The gradient for a particular instance x_i is given by

$$\frac{\partial J}{\partial F(x_i)} = F(x_i) - y_i$$

We can see that the residuals are the negative gradient of the squared error loss function:

$$y_i - F(x_i) = -\frac{\partial J}{\partial F(x_i)}$$

By adding a model that approximates this negative gradient to the ensemble we move closer to a local minimum of the loss function, thus implementing gradient descent.

Here we derive the XGBoost algorithm following the explanation in (Chen and Guestrin, 2016). XGBoost is a generalised gradient boosting implementation that includes a regularisation term, used to combat overfitting, as well as support for arbitrary differentiable loss functions.

Instead of optimising plain squared error loss, an objective function with two parts is defined, a loss function over the training set as well as a regularisation term which penalises the complexity of the model:

$$Obj = \sum_i L(y_i, \hat{y}_i) + \sum_k \Omega(f_k) \tag{1}$$

L(y_i, ŷ_i) can be any convex differentiable loss function that measures the difference between the prediction and true label for a given training instance. Ω(f_k) describes the complexity of tree f_k and is defined in the XGBoost algorithm (Chen and Guestrin, 2016) as

$$\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \|w\|^2$$

where T is the number of leaves of tree f_k and w are the leaf weights (i.e., the predicted values stored at the leaf nodes).
When Ω(f_k) is included in the objective function we are forced to optimise for a less complex tree that simultaneously minimises L(y_i, ŷ_i). This helps to reduce overfitting. The γT term provides a constant penalty for each additional tree leaf and the λ‖w‖² term penalises extreme weights. γ and λ are user configurable parameters.

Given that boosting proceeds in an iterative manner we can state the objective function for the current iteration m in terms of the prediction of the previous iteration ŷ_i^(m−1) adjusted by the newest tree f_k:

$$Obj^{(m)} = \sum_i L\left(y_i, \hat{y}_i^{(m-1)} + f_k(x_i)\right) + \Omega(f_k)$$

We can then optimise to find the f_k which minimises our objective.

Taking the Taylor expansion of the above function to the second order allows us to easily accommodate different loss functions:

$$Obj^{(m)} \simeq \sum_i \left[ L(y_i, \hat{y}_i^{(m-1)}) + g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i) \right] + \Omega(f_k)$$

Here, g_i and h_i are the first and second order derivatives respectively of the loss function for instance i:

$$g_i = \frac{\partial L(y_i, \hat{y}_i^{(m-1)})}{\partial \hat{y}_i^{(m-1)}}, \qquad h_i = \frac{\partial^2 L(y_i, \hat{y}_i^{(m-1)})}{\partial \left(\hat{y}_i^{(m-1)}\right)^2}$$

Note that the model ŷ_i^(m−1) is left unchanged during this optimisation process. The simplified objective function with constants removed is

$$Obj^{(m)} = \sum_i \left[ g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i) \right] + \Omega(f_k)$$

We can also make the observation that a decision tree predicts constant values within a leaf. f_k(x) can then be represented as w_{q(x)}, where w is the vector containing scores for each leaf and q(x) maps instance x to a leaf.

The objective function can then be modified to sum over the tree leaves and the regularisation term from Equation 1:

$$Obj^{(m)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$

Here, I_j refers to the set of training instances in leaf j. The sums of the derivatives in each leaf can be defined as follows:

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$

Also note that w_{q(x)} is a constant within each leaf and can be represented as w_j. Simplifying we get

$$Obj^{(m)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T \tag{2}$$

The weight w_j for each leaf minimises the objective function at

$$\frac{\partial Obj^{(m)}}{\partial w_j} = G_j + (H_j + \lambda) w_j = 0$$

The best leaf weight w_j given the current tree structure is then

$$w_j = -\frac{G_j}{H_j + \lambda}$$

Using the best w_j in Equation 2 the objective function for finding the best tree structure then becomes

$$Obj^{(m)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{3}$$

Equation 3 is used in XGBoost as a measure of the quality of a given tree.

Given that it is intractable to enumerate through all possible tree structures we greedily expand the tree from the root node. In order to evaluate the usefulness of a given split we can look at the contribution of a single leaf node j to the objective function from Equation 3:

$$Obj_{leaf} = -\frac{1}{2} \frac{G_j^2}{H_j + \lambda} + \gamma$$

We can then consider the contribution to the objective function from splitting this leaf into two leaves:

$$Obj_{split} = -\frac{1}{2} \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} \right) + 2\gamma$$

The improvement to the objective function from creating the split is then defined as Gain = Obj_leaf − Obj_split, which yields

$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \tag{4}$$

The quality of any given split separating a set of training instances is evaluated using the gain function. To evaluate every possible split for a feature, its values are processed in sorted order. A running sum of G_L and H_L is kept as we move from left to right, as shown in Table 2. G_R and H_R are inferred from this running sum and the node total.
Table 2 shows an example set of instances in a leaf. We can assume we know the sums G and H within this node as these are simply the G_L or G_R from the parent split. Therefore we have everything we need to evaluate Gain for every possible split within these instances and select the best.

If f2 is the feature to be predicted, then an input training pair (x_i, y_i) takes the form shown in Table 3. Representing input data using sparsity in this way has implications for how splits are calculated. XGBoost's default method of handling missing data when learning decision tree splits is to find the best "missing direction" for each split, i.e., the default branch taken by instances with a missing value for the split feature. The XGBoost algorithm handles this by performing two scans over the input data, the second being in the reverse direction. The algorithm then selects the best split from either the forwards or backwards scan.
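To make this split enumeration concrete, the following serial sketch evaluates the gain formula above over one pre-sorted feature. The names, the simplified handling of duplicate values and the omission of the forwards/backwards missing-value scans are our own simplifications for illustration, not XGBoost's actual code.

#include <cstddef>
#include <vector>

struct Split { float gain = 0.0f; float value = 0.0f; };

// g, h: first and second order gradients in the same (sorted) order as values.
// G, H: gradient sums for the whole node. lambda, gamma: regularisation terms.
Split find_best_split(const std::vector<float>& values,
                      const std::vector<float>& g,
                      const std::vector<float>& h,
                      float G, float H, float lambda, float gamma) {
  Split best;
  float GL = 0.0f, HL = 0.0f;                        // running left sums
  for (std::size_t i = 1; i < values.size(); ++i) {
    GL += g[i - 1];
    HL += h[i - 1];
    float GR = G - GL, HR = H - HL;                  // right sums inferred from node totals
    float gain = 0.5f * (GL * GL / (HL + lambda) + GR * GR / (HR + lambda)
                         - G * G / (H + lambda)) - gamma;
    if (gain > best.gain) best = {gain, values[i]};  // split placed before item i
  }
  return best;
}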

The purpose of this paper is to describe how to efficiently implement decision tree learning for XGBoost on the GPU.

There is no global synchronisation between thread blocks within a kernel: if a global synchronisation barrier is required within a kernel, the kernel must be separated into two distinct kernels where synchronisation occurs between the kernel launches.

CUDA exposes three primary tiers of memory for reading and writing: device wide global memory, thread block accessible shared memory and thread local registers.

• Global memory: Global memory is accessible by all threads and has the highest latency. Input data, output data and large amounts of working memory are typically stored in global memory.

Global memory can be copied from the device (i.e., the GPU) to the host computer and vice versa.

• Shared memory: 48KB of shared memory is available to each thread block. Shared memory is accessible by all threads in the block. Shared memory has a significantly lower latency than global memory and is typically used as working storage within a thread block. It is sometimes described as a "programmer managed cache".

GPU primitives are small algorithms used as building blocks in massively parallel algorithms. While many data parallel tasks can be expressed with simple programs without them, GPU primitives may be used to compose more complicated algorithms while retaining high performance, readability and reliability.

Understanding which specific tasks can be achieved using parallel primitives and the relative performance of GPU primitives as compared to their CPU counterparts is key to designing effective GPU algorithms.

Parallel Reduction
A parallel reduction reduces an array of values into a single value using a binary associative operator.

Given a binary associative operator ⊕ and an array of elements [a_0, a_1, ..., a_{n−1}], the reduction returns (a_0 ⊕ a_1 ⊕ ... ⊕ a_{n−1}).

Note that floating point addition is not strictly associative. This means a sequential reduction operation will likely result in a different answer to a parallel reduction (the same applies to the scan operation described below). This is discussed in greater detail in Section 2.5. The reduction operation is easy to implement in parallel by passing partial reductions up a tree, taking O(log n) iterations given n input items and n processors. This is illustrated in Figure 3.

In practice, GPU implementations of reductions do not launch one thread per input item but instead perform parallel reductions over "tiles" of input items, then sum the tiles together sequentially. The size of a tile varies according to the optimal granularity for a given hardware architecture. Reductions are also typically tiered into three layers: warp, block and kernel. Individual warps can very efficiently perform partial reductions over 32 items using shuffle instructions, introduced from Nvidia's Kepler GPU architecture onwards. The thread block can then combine these warp reductions to complete a tile of input.
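As an illustration of the warp tier, the following CUDA sketch sums 32 values held by the threads of a warp using shuffle instructions. It assumes a full warp of active threads and a CUDA 9 or later toolkit (older toolkits use the unsynchronised __shfl_down); it is an illustrative sketch rather than code from the XGBoost plug-in.

// Warp-level partial reduction over 32 items using shuffle instructions.
// After the loop, lane 0 holds the sum of the values passed in by all 32 lanes.
__device__ float warp_reduce_sum(float val) {
  for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);
  return val;
}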

The thread block can iterate over many input tiles sequentially, summing the reduction from each. When all tiles have been processed, the per-block results are combined at the kernel level. Reductions are highly efficient operations on GPUs. An implementation is given in (Harris, 2007) that approaches the maximum bandwidth of the device tested.

Parallel Prefix Sum (Scan)
The prefix sum takes a binary associative operator (most commonly addition) and applies it to an array of elements. Given a binary associative operator ⊕ and an array of elements [a_0, a_1, ..., a_{n−1}], the prefix sum returns [a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_{n−1})]. A prefix sum is an example of a calculation which seems inherently serial but has an efficient parallel algorithm: the Blelloch scan algorithm.

Given that a sequential scan performs only n addition operations, the simple parallel scan is not work efficient. A work efficient parallel algorithm will perform the same number of operations as the sequential algorithm and may provide significantly better performance in practice. A work efficient algorithm is described in (Blelloch, 1990). The algorithm is separated into two phases, an "upsweep" phase similar to a parallel reduction, and a "downsweep" phase that distributes the partial sums back down the tree.


A scan may also be implemented using warp intrinsics to create fast 32 item prefix sums based on the simple scan in Figure 5. Code for this is shown in Listing 3. Although the simple scan algorithm is not work efficient, we use this approach for small arrays of size 32.

A radix sort pass uses a scan to compute scatter addresses: an exclusive scan over the "0" digit flags gives the new position of all zero digits. All "1" digits must be placed after all "0" digits, therefore the final positions of the "1"s can be calculated as the exclusive scan of the "1"s plus the total number of "0"s. The exclusive scan of "1" digits does not need to be calculated as it can be inferred from the array index and the exclusive scan of "0"s. For example, at index 5 (using 0 based indexing), if our exclusive scan shows a sum of 3 "0"s, then there must be two "1"s because a digit can only be 0 or 1.
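The scatter address arithmetic can be illustrated with a small serial sketch of a single binary-digit split. The function below only illustrates the positioning rule described above; it is not the GPU radix sort implementation used by the library.

#include <cstdint>
#include <vector>

// Partition keys by one bit, keeping relative order (one pass of an LSD radix sort).
std::vector<uint32_t> split_by_bit(const std::vector<uint32_t>& keys, int bit) {
  const std::size_t n = keys.size();
  std::vector<std::size_t> zeros_before(n);   // exclusive scan of the "0" digit flags
  std::size_t running = 0;
  for (std::size_t i = 0; i < n; ++i) {
    zeros_before[i] = running;
    if (((keys[i] >> bit) & 1u) == 0u) ++running;
  }
  const std::size_t total_zeros = running;
  std::vector<uint32_t> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    std::size_t pos = ((keys[i] >> bit) & 1u) == 0u
        ? zeros_before[i]                          // "0" digits: scan of the "0" flags
        : total_zeros + (i - zeros_before[i]);     // "1" digits: inferred from the index
    out[pos] = keys[i];
  }
  return out;
}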

The basic radix sort implementation only sorts unsigned integers, but this can be extended to correctly sort signed integers and floating point numbers by applying an order-preserving transformation to the keys.

Tables 5, 6 and 7 show that GPU primitive performance improves relative to the CPU algorithm as the input size is increased, beginning to plateau at very large sizes as the GPU becomes saturated with work.

The relatively poor performance at small sizes is due to the overhead of launching GPU kernels. GPU kernel launch times are profiled in (Boyer, 2016) and found to cost between 3 and 14 microseconds. Note that the 1024 element reduction in Table 5 takes approximately 10 microseconds. At small sizes execution time is dominated by kernel launch overhead, making GPU algorithms impractical for processing small batches of data sequentially. Radix sort on the GPU still outperforms std::sort for 1024 elements, despite the small input size. This is because much more work is being done compared to scan or reduction. The kernel overhead is therefore less significant. At large sizes GPU radix sort shows dramatic performance improvements over std::sort, up to two orders of magnitude.

Variations on scan and reduce consider multiple sequences contained within the same input array and identified by key flags. This is useful for building decision trees as the data can be repartitioned into smaller and smaller groups as we build the tree.

We will describe an input array as containing either "interleaved" or "segmented" sequences. Table 8 shows an example of two interleaved sequences demarcated by flags. Its values are mixed up and do not reside contiguously in memory. This is in contrast to Table 9, with two "segmented" sequences: the values of each sequence reside contiguously in memory.

A scan can be performed on the sequences from Table 9 using the conventional scan algorithm described in Section 2.3.2 and modifying the binary associative operator to accept key value pairs. Listing 6 shows an example of a binary associative operator that performs a segmented summation. It resets the sum when the key changes.
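Listing 6 is not reproduced here; the sketch below shows one possible form of such an operator, with illustrative names. It assumes that all elements of a segment are contiguous, which holds for the segmented layout of Table 9.

// Key/value element for a segmented scan: the key identifies the segment
// (e.g. the node bucket) and the value carries the running sum.
struct KeyValue {
  int   key;
  float value;
};

struct SegmentedSum {
  __host__ __device__ KeyValue operator()(const KeyValue& a, const KeyValue& b) const {
    // Reset the sum whenever the right-hand element belongs to a new segment.
    if (a.key != b.key) return b;
    return KeyValue{b.key, a.value + b.value};
  }
};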

A segmented reduction can be implemented efficiently by applying the segmented scan described above and collecting the final value of each sequence. This is because the last element of an inclusive scan is equivalent to a reduction over the entire sequence.

A scan operation performed on interleaved sequences is commonly described as a multiscan operation.

A multiscan may be implemented, like multireduce, by passing a vector of sums as input to the binary associative operator. This increases the local storage requirements proportionally to the number of buckets.
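A minimal sketch of this "vector of sums" idea is shown below for a compile-time number of buckets. The per-element storage grows with the bucket count, which is why the approach only remains practical for a small number of nodes; the types are illustrative rather than the plug-in's actual data structures.

// Each scan element carries one partial sum per node bucket. An input item
// contributes its value to the slot of its own bucket and zero elsewhere.
template <int N_NODES>
struct MultiSum {
  float sums[N_NODES];
};

template <int N_NODES>
struct MultiSumOp {
  __host__ __device__ MultiSum<N_NODES> operator()(const MultiSum<N_NODES>& a,
                                                   const MultiSum<N_NODES>& b) const {
    MultiSum<N_NODES> out;
    for (int i = 0; i < N_NODES; ++i)
      out.sums[i] = a.sums[i] + b.sums[i];   // elementwise sums keep the operator associative
    return out;
  }
};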

General purpose multiscan for GPUs is discussed in (Eilers, 2014) with the conclusion that "multiscan cannot be recommended as a general building block for GPU algorithms". However, highly practical multiscans can be formulated when the number of interleaved sequences is small, as is the case in the first few levels of a decision tree.

Consumer Nvidia GPUs have comparatively few double precision units, commonly providing only 1/32 of the theoretical single precision throughput in double precision. Therefore an algorithm relying on double precision arithmetic will have severely limited performance on these GPUs.

We test the loss of precision from 32 bit floating point operations to see if double precision is necessary. We test 32 bit parallel and sequential summation by summing over a large array of random numbers.

Sequential double precision summation is used as the baseline, with the error measured as the absolute difference from the baseline. The experiment is performed over 10 million random numbers between -1 and 1, with 100 repeats. The mean error and standard deviation are reported in Table 11. The Thrust library is used for parallel GPU reduction based on single precision operations.
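The experiment can be reproduced in outline with the sketch below, where a recursive pairwise (tree) sum stands in for the parallel GPU reduction. Exact figures will differ from Table 11 depending on the random numbers used.

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Recursive pairwise summation, standing in for the parallel reduction of Figure 3.
float pairwise_sum(const std::vector<float>& v, std::size_t lo, std::size_t hi) {
  if (hi - lo == 1) return v[lo];
  std::size_t mid = lo + (hi - lo) / 2;
  return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid, hi);
}

int main() {
  std::mt19937 gen(0);
  std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
  std::vector<float> v(10000000);
  for (auto& x : v) x = dist(gen);

  double baseline = 0.0;   // sequential double precision baseline
  float  seq      = 0.0f;  // sequential single precision
  for (float x : v) { baseline += x; seq += x; }
  float pair = pairwise_sum(v, 0, v.size());

  std::printf("sequential 32 bit error: %g\n", std::fabs(seq  - baseline));
  std::printf("pairwise   32 bit error: %g\n", std::fabs(pair - baseline));
}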

The 32 bit parallel summation shows dramatically superior numerical stability compared to the 32 bit sequential summation. This is because the error of parallel summation grows proportionally to O(log n), as compared to O(n) for sequential summation (Higham, 1993). The parallel reduction algorithm from Figure 3 is commonly referred to as "pairwise summation" in literature relating to floating point precision.

The average error of 0.0007 over 10 million items shown in Table 11 is more than acceptable for the purposes of gradient boosting. This also suggests that the sequential summation within the original implementation could be replaced by parallel summation without a loss of accuracy.

An implementation that copies data back and forward between the host and device will not be able to achieve peak performance. Building the entire decision tree in device memory has the disadvantage that the capacity of device memory is often significantly less than host memory. Despite this, we show that it is possible to process some very large benchmark datasets entirely in device memory on a commodity GPU.

Thirdly, our algorithm implements the sparsity aware tree construction method introduced by XGBoost.

This allows it to efficiently process sparse input matrices in terms of run-time and memory usage. This is in contrast to all previous GPU tree construction algorithms.

Additionally, our implementation is provided as a part of a fully featured machine learning library, supporting classification, regression and ranking tasks.

The per feature value array below shows the initial state of the data layout, with all instances residing in the root node (node 0):

Node Id        0    0    0    0    0    0    0    0
Instance Id    0    2    3    3    2    0    1    3
Feature value  0.1  0.5  0.9  5.2  3.1  3.6  3.9  4.7

The first phase of the algorithm finds the best split for each leaf node at the current level.

Given the above data layout, notice that each feature resides in a contiguous block and may be processed independently by its own thread block.

Table 15. Interleaved node buckets
                 f0               f1          f2
Node Id          2    1    2      2    1      2    1    2
Instance Id      0    2    3      3    2      0    1    3
Feature value    0.1  0.5  0.9    5.2  3.1    3.6  3.9  4.7

Table 16. Sorted node buckets
                 f0               f1          f2
Node Id          1    2    2      1    2      1    2    2
Instance Id      2    0    3      2    3      1    0    3
Feature value    0.5  0.1  0.9    3.1  5.2    3.9  3.6  4.7

In order to calculate the best split for a given feature we evaluate Equation 4 at each possible split. The thread block moves from left to right across a given feature, consuming "tiles" of input. A tile here refers to the set of input items able to be processed by a thread block in one iteration. Table 14 gives an example of a thread block with four threads evaluating a tile with four items. For a given tile, gradient pairs are scanned and all splits are evaluated.

Each 32 thread warp performs a reduction to find the best local split and keeps track of the current best feature value and accompanying gradient statistics in shared memory. At the end of processing the feature, another reduction is performed over all the warps' best items to find the best split for the feature.
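The warp-level selection can be sketched with CUB's WarpReduce as below. The SplitCandidate struct and the assumption of eight warps per thread block are illustrative; the real code also carries the accompanying gradient statistics.

#include <cub/cub.cuh>

// Minimal candidate: the actual implementation also tracks gradient sums.
struct SplitCandidate {
  float gain;
  float fvalue;
};

struct MaxGain {
  __device__ SplitCandidate operator()(const SplitCandidate& a,
                                       const SplitCandidate& b) const {
    return a.gain >= b.gain ? a : b;
  }
};

// Returns the best candidate of the calling warp; the result is valid in lane 0.
__device__ SplitCandidate warp_best_split(SplitCandidate mine) {
  typedef cub::WarpReduce<SplitCandidate> WarpReduce;
  __shared__ typename WarpReduce::TempStorage temp[8];   // assumes 8 warps per block
  const int warp_id = threadIdx.x / 32;
  return WarpReduce(temp[warp_id]).Reduce(mine, MaxGain());
}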

So far the algorithm description only explains how to find a split at the first level, where all instances are bucketed into a single node. A decision tree algorithm must by definition separate instances into different nodes and then evaluate splits over these subsets of instances. This leaves us with two possible options for processing nodes. The first is to leave all data instances in their current location, keeping track of which node they currently reside in using an auxiliary array as shown in Table 15. When we perform a scan across all data values we keep temporary statistics for each node. We therefore scan across the array processing all instances as they are interleaved with respect to their node buckets. This is the method used by the CPU XGBoost algorithm. We also perform this method on the GPU, but only to tree depths of around 5. This interleaved algorithm is fully described in Section 3.1.6.

The second option is to radix sort instances by their node buckets at each level in the tree. This second option is described fully in Section 3.1.7. Briefly, data values are first ordered by their current node and then by feature values within their node buckets, as shown in Table 16. This transforms the interleaved scan ("multiscan") problem described above into a segmented scan, which has constant temporary storage requirements and thus scales to arbitrary depths in a GPU implementation.

In our implementation we use the interleaved method for trees of up to depth 5 and then switch to the sorting method for deeper levels.

In a decision tree algorithm, when enumerating through feature values to find a split, we should not choose a split that falls between two elements with the same value. This is because a decision rule will not be able to separate elements with the same value. For a value to be considered as a split, the corresponding item must be the leftmost item with that feature value for that particular node (we could also arbitrarily take the rightmost value).

Because the node buckets are interleaved, it is not possible to simply check the item to the left to see if the feature value is the same: the item to the left of a given item may reside in a different node. To check if an item with a certain feature value is the leftmost item with that value in its node bucket we can formulate a scan with a special binary associative operator. First, each item is assigned a bit vector x of length n + 1, where n is the number of buckets. If the item resides within bucket i then x_i will be set to 1.

If the item's feature value is distinct from the value of the item directly to the left (irrespective of bucket) then x_{n+1} is set to 1. All other bits are set to 0. We can then define a binary associative operator over these bit vectors that propagates this information along the scan.

The output of this algorithm contains the best splits for each leaf node for a given feature. Each thread block outputs the best splits for its assigned feature. These splits are then further reduced by a global kernel to find the best splits for any feature.

The sorting implementation of the split finding algorithm operates on feature value data grouped into node buckets. Given data sorted by node ID first and then feature values second, we can perform segmented scan/reduce operations over an entire feature while needing only a constant amount of temporary storage.

The segmented reduction to find gradient pair sums for each node is implemented as a segmented sum scan, storing the final element from each segment as the node sum. Another segmented scan is then performed over the input feature to get the exclusive scan of gradient pairs. After scanning each tile, the split gain is calculated using the scan and reduction as input and the best splits are stored in shared memory.
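Functionally, these two segmented operations correspond to the Thrust by-key primitives shown below. The actual implementation uses its own kernels and the operator of Equation 6, so this is only an equivalent sketch with illustrative names.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>

// node_id acts as the segment key; gradients stands in for the gradient pairs.
void node_statistics(const thrust::device_vector<int>&   node_id,
                     const thrust::device_vector<float>& gradients,
                     thrust::device_vector<int>&          node_keys,   // one entry per node
                     thrust::device_vector<float>&        node_sums,   // segmented reduction
                     thrust::device_vector<float>&        prefix)      // segmented exclusive scan
{
  // Gradient sums per node bucket (segmented reduction).
  thrust::reduce_by_key(node_id.begin(), node_id.end(), gradients.begin(),
                        node_keys.begin(), node_sums.begin());
  // Exclusive scan of gradients within each node bucket (segmented scan).
  thrust::exclusive_scan_by_key(node_id.begin(), node_id.end(), gradients.begin(),
                                prefix.begin());
}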

The segmented scan is formulated by performing an ordinary scan over key value pairs with a binary associative operator that resets the sum when the key changes. In this case the key is the current node bucket and the value is the gradient pair. The operator is shown in Equation 6.

The output of this process is the best split for each feature and each node. This is reduced by a global kernel to find the best splits for each node, of any feature.

To illustrate this with an example, Figure 9 shows the state of a decision tree after having calculated splits for level 1. The node positions in the data structure used for split finding (Table 17) must be updated before proceeding to calculate the splits for level 2. To do this we update the array in Table 18 that maps instances to a node.

First we update the node ID map in the missing direction. All instances residing in node 1 are updated in the right direction to node 4. Instances residing in node 2 are updated in the left direction to node 5.

Table 17. Per feature value array after the level 1 split
Node Id        1     2    2    1    1    2    2
Instance Id    0     2    1    3    0    1    2
Feature value  0.75  0.5  0.9  2.7  4.1  3.6  3.9

The node ID map now looks like Table 19. We now update the map again using the feature values from Table 17, overwriting the previous values.

Instance 0 resides in node 1 so we check if f0 < 0.8. This is true, so instance 0 moves down the left branch into node 3. Instance 1 moves into node 5 and instance 2 moves into node 6 based on their f1 values. Note that instance 3 has a missing value for f0. Its node position is therefore kept as the missing direction updated in the previous step. This process is shown in Figure 20.

The per instance node ID array is now up to date for the new level, so we write these values back into the per feature value array, giving Table 21.
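A serial sketch of this two-step update is given below. The SplitNode and Entry structures are illustrative, and in the actual implementation both steps are performed by parallel GPU kernels rather than loops.

#include <vector>

struct SplitNode {
  int   feature;        // feature tested by the node's decision rule
  float value;          // split value
  int   left, right;    // child node ids
  bool  missing_left;   // learned default direction for missing values
};

// One entry of the per feature value array: the node id is the instance's
// position at the previous level.
struct Entry { int instance, feature, node; float value; };

void update_node_map(std::vector<int>& node_id_map,             // instance -> node
                     const std::vector<SplitNode>& nodes,        // indexed by node id
                     const std::vector<Entry>& feature_values) {
  // Step 1: move every instance in the missing direction of its current node.
  for (std::size_t i = 0; i < node_id_map.size(); ++i) {
    const SplitNode& n = nodes[node_id_map[i]];
    node_id_map[i] = n.missing_left ? n.left : n.right;
  }
  // Step 2: overwrite using the feature values that are present, so only
  // instances missing the split feature keep the default direction.
  for (const Entry& e : feature_values) {
    const SplitNode& n = nodes[e.node];
    if (e.feature == n.feature)
      node_id_map[e.instance] = (e.value < n.value) ? n.left : n.right;
  }
}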

If the sorting version of the algorithm is being used, the feature values need to be sorted by node position.

If the interleaved version of the algorithm is being used (for example, in early tree levels) this step is unnecessary. Each feature value with its updated node position is sorted such that each node bucket resides in contiguous memory. This is achieved using a segmented key/value radix sort. Each feature represents a segment, the sort key is the node position and the feature value/instance ID tuple is the value. We use the segmented radix sort function from the CUB library. It delegates the sorting of each feature segment to a separate thread block. Note that radix sorting is stable, so the original sorted order of the feature values will be preserved within contiguous node buckets after sorting with node position as the key.

Where a dataset is too large to fit in device memory, a subset of rows from each dataset is taken in order to fit within device memory.

Datasets are described in Table 23 and parameters used for each dataset are shown in Table 24.

For the YLTR dataset we use the supplied training/test split. For the Higgs dataset we randomly select 500,000 instances for the test set, as in (Chen and Guestrin, 2016). For the Bosch dataset we randomly sample 10% of the instances for the test set and use the rest for the training set.

We use 500 boosting iterations for all datasets unless otherwise specified. This is a common real world setting that provides sufficiently long runtimes for benchmarking. We set η (the learning rate) and the other hyperparameters as listed in Table 24.

The interleaved version of our algorithm does not place splits at the halfway point between two training examples, instead choosing the split value as the right most training example minus some constant.

Differences also occur due to floating point precision, as discussed in Section 2.5.

In Figure 11 we show the performance of the Titan X from configuration #2 against configuration #3 (a high-end 24 core server) on the Yahoo dataset with 500 boosting iterations and varying numbers of threads. The Titan X outperforms the 24 core machine by approximately 1.2×, even if the number of threads for the 24 core machine is chosen optimally.

The accuracy shows some variance as the interleaved split finding algorithm records feature splits in a slightly different way as compared to the sorting algorithm. Both versions split the training set in exactly the same place, but the sorting version records the feature value for the split as halfway between two instances and the interleaved version records the split point as slightly less than the rightmost instance.

Because of this, when we use the model on the unseen test set, the results can be marginally different for the two versions.

Surprisingly, the interleaved algorithm is still faster than the sorting algorithm at level 5, despite the fact that the multiscan and multireduce operations must sequentially iterate over 2^5 = 32 nodes at each step. This shows that executing instructions on elements held in registers or shared memory carries a very low cost relative to uncoalesced reordering of elements in device memory, as is performed when radix sorting.

We show the device memory consumption in Table 30 for all three benchmark datasets. Each dataset can be fit entirely within the 12GB device memory of a Titan X card.

In Table 31