GPU-accelerated adjoint algorithmic differentiation ✩

Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memoryispotentiallyrequiredtostoretheintermediatepartialderivativesonadedicateddatastructure, referred to as the ‘‘tape’’. Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memoryconsumptioncouldbesubstantiallyreducedcomparedtothenaivereferenceimplementation,in somecasesevenbyanorderofcomplexity.Thevectorizationallowedusageofoptimizedparallellibraries during forward and reverse passes which resulted in high speedups


Algorithmic Differentiation
Algorithmic Differentiation (AD) is a technique for augmenting numerical simulation programs with the ability to compute first and higher mathematical derivatives. In sharp contrast with classical numerical differentiation by finite differences, AD delivers gradients, Jacobians, and Hessians with machine accuracy by avoiding truncation through analytic differentiation of individual statements within an arbitrarily complex code implemented, for example, in C/C++ or Fortran. The adjoint (also: reverse) mode of AD is of particular interest in the context of large-scale sensitivity analysis and nonlinear optimization. Gradients of arbitrary size n can be computed at a constant (typically between 1 and 100) relative computational cost, i.e., with respect to the original simulation. A finite difference approximation of the same gradient would have to perform at least n + 1 original simulations. Assuming a run time of the original simulation of 1 min, the computation of a gradient of size n = 10 6 would take between 1 and 100 min in adjoint AD mode compared to at least 10 6 + 1 min (almost two years) when using finite difference approximation or tangent (also: forward) mode AD. This run time will most likely turn out to be prohibitive, thus, rendering the use of gradient-based optimization techniques infeasible unless adjoint mode AD is available. Gradients of this size are very common, for example, in computational fluid dynamics simulations run frequently in the atmospheric sciences and in automotive or aircraft design. AD has been applied successfully to a constantly growing number of practically relevant problems in Computational Science, Engineering, Medicine, and Finance as illustrated by numerous publications, for example, in the proceedings of the international AD conference series [1][2][3][4][5][6]. Both mathematical and algorithmic foundations of AD as well as various advanced subtopics are covered by two text books [7,8]. The AD community runs a web site (www.autodiff.org) with links to research groups, software tools, and an extensive bibliography.
When a computer program evaluates a cost function, the process evaluates differentiable elemental functions and operations, such as sine and multiplication, respectively, and creates many intermediate variables. The computation induces a directed acyclic graph (DAG) which contains the intermediate variables as nodes. The local partial derivatives of the elemental functions with respect to their arguments attached to the edges of the DAG yield the linearized DAG (lDAG) as shown in Fig. 1(a). The gradient can be computed by one reverse sweep through the lDAG propagating adjoints backwards from the dependent outputs (here y) to the independent inputs (here x 1 and x 2 ) until the gradient is accumulated. The adjointv of a variable v is defined as the derivative of y with respect to v [7,8]. The adjoint of a variable is computed as a weighted sum of the adjoints of its successors in the lDAG, where weights are the local derivatives. Propagation of adjoints can be visualized as an adjoint DAG such as shown in Fig. 1(b). To perform this backpropagation, AD tools need to create and store some representation of the lDAG which is often referred to as the tape.
AD software tools take two alternative approaches: source code transformation (e.g. TAPENADE) or operator and function overloading (e.g. dco). Some AD tools implement a combination of both (NAG AD compiler). While source code transformation promises better performing derivative code, the tools typically fail to cover the latest language standards of Fortran and C/C++. Our target language is C++. Hence we use overloading to implement the ideas to be presented. See also [9][10][11][12].

High performance computing
Ideally it should be possible to compute the gradient in roughly twice the function evaluation time. Unfortunately, there are several reasons why this is rarely achieved in practice. Besides the arithmetic operations, many memory storage operations are required during creation and interpretation of the tape. The tape quickly becomes excessively large, because modern processors evaluate gigabytes Adjoint DAG with the appropriate incremental adjoint calculations on the edges. Execution of these operations from top to bottom results in the adjoints of the input variables (x 1 ,x 2 ) T , i.e., the gradient. Adjoints are initialized with zero. Table 1 Vector and matrix operations and their adjoint operations (mostly taken from [13,14]). Bold lower case letters are vectors (in R n ), where x i is the ith component of a vector x, andx its adjoint vector. Bold capitals are matrices (in R n×n , unless specified differently),X is the adjoint matrix of X, and v is a scalar. Matrices may be symmetric positive definite (indicated by s.p.d.). The element-wise application of a scalar function y := EW (x, op) is defined by y i := op(x i ), for all 1 ≤ i ≤ n, and the element-wise multiplication z := EW (x, y, ·) is defined by z i := x i · y i , for all 1 ≤ i ≤ n. And ∂_op is the derivative of the unary operator op.

Expression
Adjoint expressions Notes Matrix-vector product  [11,12]. Moreover, the lDAG can be pruned by elimination of edges or vertices while creating the tape [15]. Parallelization is another challenge for AD. While it may be relatively easy to speed up the cost function using OpenMP, overloading AAD tools struggle to do so due to the need for synchronization of tape accesses during the forward and backward phases [16]. The use of OpenMP has been shown to potentially slow down the gradient computations which can be improved by the use of expression templates [12].
Well-designed cost functions are likely to make use of highly optimized libraries such as BLAS or LAPACK to perform vector and matrix operations. Since these functions are typically linked in, and not available as source code, they are not easily automatically differentiable. Differentiable versions of these operations need to be implemented which typically cannot compete with the performance of highly optimized library functions. Alternatively, AD tools can treat these functions as intrinsics, by providing symbolic derivatives. A corresponding vector operation needs to be performed during the reverse pass in this case, which is often similar to the forward operation and can also be computed using a highly optimized library [13] (Table 1). This vectorization of the adjoints can lead to substantially smaller tapes because the memory overhead is reduced. Furthermore, it allows integration of iterative linear solvers, which may not be automatically differentiable otherwise [14,17].
GPUs provide massive computational power by running thousands of light-weight threads in parallel [18]. For this to be effective, special programming techniques are required, because the number of concurrent threads is limited by resources, such as registers and shared memory [19]. GPU memory is very fast but limited in size. Since the bandwidth between CPU and GPU memory is much lower, data transfers between them should be minimized as much as possible [20]. C or C++ code can be compiled for GPUs using frameworks such as CUDA [21] or OpenCL [22]. Fortunately, many optimized GPU-accelerated operations for vectors and matrices are available [23][24][25][26][27]. While features such as dynamic memory allocation, recursion, and function pointers are possible with the latest generation of GPUs [21], avoiding these features often results in better performance. For these reasons, implementation of general GPU-accelerated AAD tools is difficult and previous attempts focused on using source transformations inside shaders for special applications [28] or on batch computation of many small gradients which fit into shared memory [29].

Fluorescence-mediated tomography
This software was originally developed and applied for fluorescence-mediated tomography (FMT) [17,[30][31][32]. FMT is a noninvasive imaging technology to assess the three dimensional distribution of fluorescence in human fingers, breasts or, most commonly, laboratory mice [33]. The fluorescence can be provided by contrast agents or genetically transfected cells expressing fluorescent proteins [31]. Commercially available FMT devices operate with continuous wave illumination, e.g., with a servo-mounted laser, and acquire multiple diffuse transillumination images of the incident and reemitted light [34,35]. In the last years, FMT has become a broadly applicable tool for biomedical and pharmaceutical research, particularly in combination with an anatomical modality such as µCT which provides valuable information for reconstruction and image analysis [30,32,33,36]. Due to the high scattering in the near-infrared wavelengths, fluorescence reconstruction is a mathematically and computationally challenging problem. One important aspect is an accurate optical model containing information about the shape and heterogeneous scattering and absorption maps. In our previous study we derived a scattering map by performing automated segmentation of tissue classes (bones, lungs, fat, muscle, skin) based on the µCT data and assigned known scattering coefficients [17]. Subsequently, an absorption map was reconstructed by iterative nonlinear minimization of a cost function, expressing the difference between measured and predicted surface measurements. The absorption map turned out to be important for quantitative fluorescence reconstruction in several organs such as heart, liver, and kidneys [17]. The required gradients were computed using the proposed software for GPU-accelerated AD and we found the vectorization beneficial for computational speed but also necessary to integrate a GPU-accelerated sparse linear solver into the AD system [14,25]. We found the cost function for this problem to be too complicated to introduce the software, however, and therefore resort to more simple cost functions in the following.

Aim of the study
The aim of this study is to provide a software for GPU-accelerated AAD and show its value using performance measurements. Therefore, we introduce four increasingly complex cost functions in Section 2. In Section 3 we report performance measurements with respect to gradient computation times and memory consumption. To separate the improvements of vectorization and GPU acceleration, we first distinguish between naive and vectorized AAD for CPUs. Then, vectorized CPU and GPU versions are compared to assess the additional gains due to using GPUs. Section 4 provides a user guide for the software, which is available as supplemental material, and describes how to systematically extend the functionality for more complex cost functions. Finally, in Section 5, we discuss our results and future directions of research.

Cost functions
We implemented four different cost functions, which are described in the following. They all have a variable input size, which allows us to analyze the performance for increasing dimensions of the input domain.

Sum of sigmoids
The SumSigmoid cost function is defined as y =  n i=1 sigmoid(x i ), i.e., it applies the scalar sigmoid function to each element of the input vector followed by computing the sum of the results (see Fig. 2). We use this function as a simple introductory example. The scalar in the context of gradient-based methods for training neural networks [37]. We use the prefix ''∂_'' to denote the first derivative. The computational complexity of evaluating the value and the gradient of SumSigmoid is linear in the input size, i.e., O(n). For our experiments we use a default input size of n = 2 · 10 7 .

Robust linear least squares
Sparse least squares problems need to be solved for many applications. They occur when measurements b are acquired about an unknown state x, which is observed through a linear system, which is represented as a sparse matrix A. The least squares term ∥Ax − b∥ 2 is minimized to reconstruct x. Typically, iterative sparse least squares methods are used [38]. Due to the squared residual, outliers or high-valued measurements may dominate the solution. The problem can be overcome by replacing the squaring operator with the robust Huber function or another robust loss function [39]. The Huber function is defined as We arbitrarily set δ = 1.28. Its parabolic shape near the origin and a linear shape for inputs with higher magnitude yields a cost function ( Fig. 3) which is not quadratic anymore, however. This cost function can be minimized using iterative nonlinear methods, such as the nonlinear conjugate gradient method, which requires the gradient at each iteration [40]. For our experiments we assume that the sparse matrix A ∈ R m×n contains 6 random nonzero elements per row and that m = 5 · n. The constant A is stored inside the C++ functor that implements the cost function. The computational complexity is linear in terms of the input size, because there is a constant number of nonzeros per row. We use a default input size of n = 10 6 .

Maximum entropy models
Maximum entropy models, also called log-linear models, can be used as binary classifiers. The log-likelihood l is computed as the linear combination of a feature vector x with the parameter f of the log-linear model: l = x T f. Training of the classifier requires a large number of feature vectors. The cost term  n i=1 (− logit((Ax) i )) needs to be minimized, where A ∈ R m×n contains the features as rows [41] and the scalar logit() function is defined as logit(x) := log(x/(1 − x)).
For our experiments we assume m = 5 · n to avoid overfitting. The DAG is shown in Fig. 4. The computational complexity and the memory complexity are O(n 2 ). We use a default input size of n = 5000.

Cholesky solver
Symmetric positive definite (s.p.d.) matrices occur frequently, for example in the context of finite element or finite difference methods for partial differential equations [17,42]. Often, such a matrix is constructed from input parameters and then a linear system is solved during a forward function evaluation. AAD tools need to backpropagate the adjoints through the linear solver which may be problematic with iterative solvers [14]. Dense s.p.d. systems can be solved using Cholesky decomposition at computational complexity O(n 3 ). While  Table 1. A is constant data that does not depend on the input x. realistic problems often use sparse iterative solvers, we here use a dense Cholesky solver for the purposes of simplicity and illustration and because it is compatible with naive AAD methods.  In the SolveCholesky cost function (see Fig. 5), a matrix A = I + xx T ∈ R n×n is constructed based on the input x followed by solving system Aw = b for w. Then the squared L 2 -distance of w and a constant measurement vector is used as cost term in addition to a magnitude penalty applied on x for regularization. Similar cost functions occur for example in diffuse optical tomography [17,14,43].
The computation of the gradient ∂_SolveCholesky results in excessive use of tape memory due to the primal computational cost of O(n 3 ).
Fortunately, this can be improved using the concept of vectorization, because the adjoint operation requires solving a linear system using the same matrix (Table 1). Linear s.p.d. systems are self-adjoint; yielding an adjoint system with the same system matrix. The previously factorized matrix can be stored on the tape and reused during backpropagation. This reduces the memory usage for taping from O(n 3 ) to O(n 2 ). For our experiments we used a default input size of n = 900.

Performance measurements
For the performance measurements, we used four different cost functions, which are described in the previous section. First, the time for the function evaluations was measured to assess the speedup achieved by parallel CPU and GPU processing. Then we measured the memory required for taping during gradient computations. We compared the three mentioned versions, i.e., the naive and vectorized CPU versions and the vectorized GPU version. To achieve this, we queried the amount of memory assigned to the process before and after generation of the complete DAG. For the CPU-based versions the CPU memory was measured. For the GPU version, only the GPU memory was measured because this constitutes the main bottleneck and the used CPU memory is negligible in most cases. Furthermore, the time to compute the gradients was measured. We performed these measurements for certain default input sizes of the cost functions to allow reporting in tables. These input sizes were selected so that they require less than 16 GB memory for taping when using the naive version. Additionally, we performed all measurements for a range of input sizes, to assess the effect of the input size on the computation time. All function evaluations and gradient computations were performed using double precision floating-point arithmetic.

Function evaluation speed
First, we measured the effect of CPU parallelization using OpenMP. This resulted in a speedup of 3.1±0.6 (mean and standard deviation) compared to a single-threaded version ( Table 2). Usage of the Math Kernel Library (MKL), which is also internally parallelized, resulted in a notable further speedup only for the SolveCholesky function. Compared to the CPU-MKL version, the GPU-accelerated code resulted in an additional speedup of 5.8 ± 3.5 (Table 2).

Tape size
Measurements of the memory consumption showed that vectorization yields substantial reductions of the tape size (Table 3). For the first three functions, tape size reductions by factors between 4.2 and 12.0 were achieved. For the SolveCholesky cost function, the required tape memory was reduced by an order of complexity, i.e., from O(n 3 ) to O(n 2 ), which resulted in a dramatic reduction by the factor 815.7. The memory consumption was similar for the CPU-based and GPU-based vectorized versions. The memory reduction is particularly important for the GPU mode because GPUs are much more limited in the amount of available memory.

Gradient computation speed
During performance measurements, we found that OpenMP resulted in a performance loss for the naive CPU version (Table 4), because multiple threads have to synchronize when accessing the tape during generation of the compute graph. Turning off OpenMP resulted in a speedup of 1.5 ± 0.3. Vectorization resulted in a cumulative speedup of 24.9 ± 33.5 compared to the single-threaded naive version. Usage of OpenMP provided an additional speedup of 3.6 ± 1.8 for the vectorized version. This shows that vectorization is a suitable method to achieve a benefit from OpenMP for AAD. By using the highly optimized MKL library for the vector operations, another speedup of 1.6 ± 1.0 could be achieved. Finally, the GPU version resulted in a speedup of 7.5 ± 4.4 compared to the CPU-MKL version (Table 4).
However, for small input sizes, the CPU version performs better than the GPU version, because GPU operations require a certain amount of work to sufficiently utilize the GPU processors (Fig. 6).
We also compared the gradient computation times to the function evaluation times (Table 5). This shows that using the naive mode results in gradient computation times that are much higher, i.e., more than 100 times, than the function evaluation times. Therefore, this would become a major bottleneck for gradient descent methods where one gradient computation and a few function evaluations are required for the line search at each iteration. Using CPU and GPU vector modes, the ratios are substantially lower, showing that the gradient computation time would not dominate the total time during iterative gradient descent methods. For SolveCholesky, it is even below 2, because the expensive Cholesky decomposition can be reused during backpropagation.

Devices and software
A PC (Fujitsu Celsius M730) equipped with an Intel Xeon E5-1620 v2 (3.7 GHz) quad-core processor, 64 GB of DDR3 RAM (CAS latency 13, DRAM Frequency 933.2 MHz) and an Nvidia GeForce GTX Titan Black (15 Kepler cores, 2880 CUDA cores, 889 MHz, 6 GB memory at 1750 MHz) was used for performance measurements. The used operating system was Windows 7 (64-bit). The Intel MKL library 11.1 Update 3 was used for parallel host operations. The C++ code was compiled with Visual Studio 2012 Ultimate (Update 4). The CUDA Toolkit 7.0 was used for the measurements.

Gradient computations
To use our code for gradient computations, a cost function needs to be implemented as a C++ class providing a generic overload of the function call operator, taking a DeviceVector<T> as input (Listing 1). Then the function AdjointModelGradient is called to compute the gradient and function value at a given argument. This approach triggers template instantiation and compilation of the cost function class using the AdjointDouble class instead of the regular C++ double. An AdjointDouble contains a double as value and a reference to its adjoint which is stored on the tape because the adjoint is required beyond the lifetime of the AdjointDouble instance. The overloaded operators, such as *, +, and -, compute the local derivatives and store these on the tape. Furthermore, overloads are implemented for some scalar functions such as sin, cos, or sigmoid.
The tape consists of two arrays, one for the adjoints and one for the local partial derivatives. These correspond to nodes and edges of the lDAG, respectively. The arrays are generously preallocated, which is a O(1) operation because only the virtual address space is reserved and the actual memory is assigned to the process on demand, i.e., page by page by the operating system. Storage of each edge requires 24 bytes, i.e., one double for the local derivative and two pointers to the source and destination adjoints. The scalar operations perform the storage on the statically allocated tape in a thread-safe manner, using an atomic increment of the current position on the node and edge arrays, which is implemented using the std::atomic functionality available with C++ 11. After completion of the function evaluation, the entire lDAG is stored on the tape and the adjoints are propagated backwards to compute the gradient.

GPU-accelerated operations
To enable vector-based programming we implemented generic C++ classes to represent vectors and matrices (DeviceVector<> and DeviceMatrix<>). For these classes, only the elements, e.g., doubles, are stored on the GPU, while the memory required for the class structure is stored in CPU memory. A DeviceVector can represent an original memory block or point to a subvector of another vector or a row or column of a matrix. Similarly, a DeviceMatrix can represent an original 2D array or point to a submatrix of another DeviceMatrix<>. This is achieved by storing a stride, i.e., gap, between subsequent elements of vectors and between rows of matrices. Fortunately, this is compatible with the interface of the cuBLAS library [26]. Furthermore, these classes manage the memory in an exception-safe manner using C++11 smart pointers (std::shared_ptr), i.e., a memory block is automatically released when the last subvector, submatrix, row or column referring to it is destructed. The ability to use subvectors and submatrices is particularly useful to implement blocking schemes, as we did to provide a simple and self-contained Cholesky solver. We linked vector and matrix operations to the cuBLAS library wherever possible. More general reduction operations than available in cuBLAS were implemented using the Thrust library [24]. Furthermore, we implemented CUDA kernels for element-wise operations and to transpose and multiply sparse matrices in the CRS format [25]. Since the RobustLeastSquares cost function performs multiplication of a sparse matrix with a dense vector, we also implemented this operation on the GPU. To utilize the processing power of GPUs for AAD, we had to extended the naive adjoint AD tool with special functionality, simply because the AdjointDouble class and its overloads for scalar operations are not suitable for lightweight threads of GPUs. Modern GPUs run thousands of threads concurrently and the required synchronization when taping scalar operations would be prohibitively expensive.
To enable vectorized taping, which avoids these problems, both data structures and operations were adjusted. Using the rich feature set of C++ [44], we implemented template specializations for the templated classes DeviceVector<> and DeviceMatrix<> for the element type AdjointDouble. In contrast to the default template instantiation which would use a memory block to store AdjointDoubles continuously, i.e., values and adjoints in alternating succession, the template specialization stores values and adjoints in separate arrays. This structure-of-array pattern allows direct usage of these arrays for high-performance libraries such as cuBLAS [26]. The template specialization DeviceMatrix<AdjointDouble> provides basically the same functionality as the regular template instantiations, e.g., DeviceMatrix<double>, including the ability to create submatrices and fetch rows and columns.
To trigger vectorized taping during the forward pass, we implemented overloaded functions for all required vector and matrix operations, e.g., for matrix-vector multiplication in Listing 2. These function overloads apply the required vector operation on the value array. Furthermore, they tape a 'propagator' object which performs the corresponding vector operation to the adjoint arrays during the reverse pass ( Table 1). The propagator classes need to derive from the abstract base class IPropagator, i.e., need to implement a Propagate() method to be called during backpropagation of the adjoints, as well as a virtual destructor. The IPropagator objects are stored as special edges on the tape. It should be noted that scalar operations for AdjointDouble are still performed on the CPU and not on the GPU. This allows combination of the naive CPU version with the vectorized GPU version which is useful for incremental optimization of the adjoint code, e.g., by porting only the most expensive operations to the GPU. DeviceVectorMatrixMulPropagator ( DeviceVector <double > y, 12 DeviceVector <double > x, DeviceMatrix <double > A) 13 :y(y),A(A),x(x){} 14 15 virtual void Propagate (){

Extending the functionality
To apply the software to more complex problems, extension of the functionality may be necessary because we only provide overloads for the operations listed in Table 1. Then the required functions need to be implemented as exemplified in Listing 2. To ensure correct implementation we recommend to test each overload individually. This can be achieved by testing the gradient computation of a simple cost function using the new function. Tangent mode AD is very useful for this purpose because it also allows computation of the gradient. Since it may be prohibitively slow to compute entire gradients for cost functions with large input domains, we only compute a random subset of the gradient elements using the function TestRandomGradientPositions. Unfortunately this requires availability of differentiable GPU code which is usually not available for linked-in functions. Alternatively, the GPU code can be tested against the naive CPU AAD code that only requires a differentiable CPU version which may be considerably easier and faster to implement.

Discussion
We showed how the concept of vectorization benefits both CPU-and GPU-based AAD. The naive CPU-based version requires excessive amounts of memory for taping which is strongly improved by vectorization. This reduction of the memory consumption is particularly useful for GPU-accelerated AAD, because GPUs typically have much less memory than the host computers. Cholesky decomposition is an example where the tape size is reduced by an order of complexity, i.e., from O(n 3 ) to O(n 2 ), when switching to vector mode.
Gradient computation times should ideally be of the same order as the function evaluation times, because they require a similar amount of arithmetic operations [7]. However, the naive reverse mode results in run times that are orders of magnitudes slower than the function evaluations. This is due to three reasons. First, the cost function needs to be provided as differentiable code, because high-performance libraries that are linked in are not automatically differentiable. This code usually performs worse than the vendor-provided and highly optimized libraries such as MKL. Second, the taping causes many memory operations covering large amounts of memory, which hardly benefits from the caching architecture. Third, the usage of multi-threading, e.g., OpenMP, is counter-productive because the threads need to synchronize when accessing the tape, which is necessary at each arithmetic operation.
Vectorization solves or at least reduces these problems. It allows usage of highly optimized parallel vector operations during forward and backward passes of the gradient computation, which greatly improves the performance. Furthermore, differentiable versions of the vector operations are not required which reduces programming effort unless they are needed for the purpose of testing.
Vectorization simplifies the transition towards GPU processing, because chunky operations are ideal for GPUs and because existing optimized vector and matrix operations can be used. Our approach keeps the main data of the tape on the GPU, i.e., the vector elements. This is important because memory transfers between GPU and CPU memory are avoided. Data about the logical structure of the vectors and matrices is stored on CPU memory. Using this approach, we could show that GPU processing achieves substantial reductions for the gradient computation times compared to the vectorized CPU version, for sufficiently large problem sizes.
While tangent mode AD is hardly suitable to compute large gradients, because one function evaluation is required for each element of the gradient, we found it very useful for testing. To test a newly implemented function overload, we integrate it into a simple cost function and compute a random subset of the gradient entries using the tangent versions and compare it with the gradient computed in adjoint mode. This approach allows systematic and robust extension of functionality by adding more vector operations.

Limitations
Our naive AAD code is relatively simple and we used it to illustrate difficulties and possible improvements of AAD. There are other approaches that perform on line pruning during creation of the tape [15] or use C++ expression templates to combine a whole statement into one taped operation [11,12]. Furthermore, our method cannot be used directly for existing code. Instead, it requires that the cost function is re-implemented using our vector and matrix classes and operations. While vector-based processing has the advantage of using optimized vector operations, it may be less cache efficient because entire vectors may need to be channeled repeatedly through the caching architecture. While the vector operations reduce the tape size, the limited GPU memory may still not be sufficient for some applications. This could be overcome by taping data on the host memory once the GPU runs out of memory. Currently, we only have one static tape, i.e., multiple threads cannot create their own tapes. This could be improved by instantiating multiple instances of tapes, e.g., one per thread [12].

Conclusion
We showed how GPUs can be used for efficient gradient computations using adjoint algorithmic differentiation. The concept of vectorization reduces the tape size and simplifies the transition to many-core architectures. We believe that this method is comprehensible, extensible and broadly applicable for many problems in scientific computing that require gradient computations and we successfully applied this method in the context of fluorescence-mediated tomography; see [17,[30][31][32] for an in-depth discussion of this case study.