ACORNS: An Easy-To-Use Code Generator for Gradients and Hessians

The computation of first and second-order derivatives is a staple in many computing applications, ranging from machine learning to scientific computing. We propose an algorithm to automatically differentiate algorithms written in a subset of C99 code and its efficient implementation as a Python script. We demonstrate that our algorithm enables automatic, reliable, and efficient differentiation of common algorithms used in physical simulation and geometry processing.


INTRODUCTION
The calculation of derivatives is used in many applications, including machine learning, scientific computing, geometry processing, computer vision, natural language processing, and many more. The calculation of derivatives of large and complex computational graphs is a computational bottleneck for these applications. Thus the efficient computation of derivatives is an area of active research interest. Extensive research and engineering efforts have been spent in the last years to support applications in deep learning, that led to the development of libraries such as TensorFlow or PyTorch, which target applications with large and dense tensors. The availability of easy-to-use differentiation libraries is one of the reasons for the massive advancement in deep learning. They allow researchers to focus on the design of new network architectures and loss functions, while automating and ensuring correctness in the low-level tasks required to minimize these functionals. However, they are specialized for computing first derivatives only, and they are optimized for large, dense tensors, which are uncommon in applications outside of deep learning. The novelty in these libraries is not in the automatic differentiation technique used, but in the ease of use, cross-platform availability, and ease of integration in new research projects.
Our goal is to provide a similar solution for scientific applications requiring first and second derivatives of expressions involving small to medium tensors, using an approach based on symbolic differentiation. When paired with the optimization algorithms available in modern compilers, our approach generates code around an order of magnitude faster than existing methods while allowing users to write their functions directly in the C language. The algorithm is easy to integrate into existing build systems and produces dependency-free code.
The core parts of the library are implemented in Python using a C99 parser [4]. We perform symbolic differentiation with unrolled derivatives, stored as expression trees, and generate an efficient kernel to evaluate them using an existing C99 compiler, addressing the potentially long compilation times with a file-splitting approach. State-of-the-art compilers are capable of sophisticated optimizations such as loop and expression trees vectorization, elimination of redundancy in expressions, and reusing computations performed in previous iterations of loops. Our approach benefits from these features, and the performance of our method will likely increase as compiler technology progresses. The expression unrolling also allows us to trivially parallelize computations not only for different data points but also for a single data point over multiple CPU cores, which are common in modern architectures. Finally, we provide support efficient differentiation of algorithms with nested loops, arrays, and functions. A key difference with existing libraries, is that we create kernels to efficiently evaluate the expression itself, its gradient, and its Hessian, all optionally supporting parallel evaluation.
Our open-source implementation makes the integration of the algorithm straightforward in modern C and C++ applications, thanks to its ability to directly differentiate code and to produce standard, dependency-free C99 code that can be easily used in existing applications.
To demonstrate the practical impact of our approach, we integrated it into two applications, one in geometry processing and one in scientific computing. We show that our library provides important speedups in the overall running time of mesh parametrization algorithms and in the quasi-static simulation of non-linear elastic deformations.

RELATED WORK
Automatic (or algorithmic) differentiation has been introduced in the pioneering works of [28,33]. We briefly review the most relevant work and refer readers to various books and survey for further detail [5,11,18,20] and www.autodiff.org for a collection of implementations. We provide comparisons against representative methods for operator overloading and source transformation for both gradients and higher order derivatives in Section 4.

Operator Overloading
The most common way of implementing automatic differentiation is through defining a new object class, which associates the gradient information to the data. In forward mode, the gradient information is updated throughout the trace of the computation. And in reverse mode, a tape of computation is recorded, and the gradient information is then (back)propagated following the recorded tape. Due to the versatility and freedom of prototyping, operator overloading based approaches have gained a lot of attraction, and notable examples include [1,2,10,15,21,22]. However, the efficiency of the code often suffers due to the additional runtime overhead, which is especially noticeable when the same computation is executed repeatedly.

Source Transformation
Source transformation approaches parse the input source code (most commonly C and Fortran [14,31]) and generate a new algorithm that computes the derivatives of the input function. Compared to operator overloading approaches, obtaining the source code file a priori opens the doors to global data flow analysis and optimization, thus reducing the running time. Additionally, this approach enables the developers to directly debug the code for derivatives computation, a crucial feature to identify and address potential issues with numerical round-off or overflows. The reliance on compilers to compile the generated source-code makes these approaches future-proof, in the sense that the generated code will benefit from progress on compiler design and optimization.
A limitation of source transformation is that, since the flow is assumed fixed to allow for the above optimization, it is challenging to support conditionals or unbounded loops. In our algorithm, we restrict to a subset of C99 covering common use-cases in scientific computing, finite element analysis, and computer graphics.
A minor, yet practically relevant, drawback of source code transformation is the presence of the additional source files, posing challenges on the build system and version control of the evolving software. We show that our library, thanks to its permissive open-source license, minimal dependencies, and availability in the Conda package system, is effortless to integrate into a CMakebased build system [19]. We show how to automatically generate the required files as part of the build process, making the automatic differentiation transparent to the user (Section 3.5).

Higher Order Derivatives
Although valuable in scientific computing and numerical optimization, computing higher-order derivatives is challenging due to the need of memory management and sparsity pattern detection [9,32]. Therefore, only few existing libraries support second or higher-order derivatives computation. Our method supports dense Hessian computation, taking advantage of the symmetric structure to reduce computational times.

METHOD
The input of our algorithm is an algorithm written in a subset of C99 [23]: we support arrays, for loops, nested loops, binary assignments, functions, and variable declarations. The output is a set of self-contained, multi-threaded C99 functions to evaluate the function in the input file, its first derivative, and its second derivative. We designed our library to become part of a build system, seamlessly generating the derivative code as part of the build procedure of a software package (Section 3.5).
We overview our algorithm in Section 3.1, and show a complete execution on a prototypical expression in Section 3.2. We then discuss how to parallelize the evaluation in Section 3.3 and how to make our approach scale to large expressions in Section 3.4. We show the performance of the algorithm on synthetic expressions (Section 4), comparing it with state-of-the-art differentiation libraries, and integrate it into a physical simulation library for a realistic benchmark.

Overview
Primer on Automatic Differentiation. Given a target function f : R n → R m , the corresponding Jacobian matrix has n × m entries J i j = ∂f i ∂x j . We use the pyc parser [3] to parse the input C99 code to produce an Abstract Syntax Tree (AST) of the input algorithm. The leaves of the tree are either numerical constants or variables and the internal nodes follow a hierarchical structure built from the code. Let us consider an example, with f a composite function: From the example we see that the auto-differentiation of a function f breaks down the action of the Jacobian matrix on a vector into simple components, which are evaluated sequentially. We store the operations required for the calculation of the derivative of each of these pieces as a string. The final derivative is the accumulation of the piece-wise derivatives of the sub-operations following the chain rule.
For computing second order derivatives, we differentiate the tree twice. We further optimize over the number of calculations by taking advantage of the symmetry of the Hessian: we compute only the lower triangular part of the matrix and copy it over the upper triangular half.
Implementation Specific Details. The derivative formulas are unrolled in a sequence of assignments (no loops or conditional are used) and saved on a C file. The file is then compiled with full code optimization (-O3 with gcc [29] and clang [17]). This leverages the compiler's ability to perform redundancy elimination on trees, reusing computations (especially memory loads and stores), and automatically vectorizing the code when appropriate.
In the case of loops or conditional statements being present in the program, we perform an additional parsing step to expand these structures. We perform loop unrolling to produce a sequence of operations and eliminate other instructions that control the loop. Conditional statements are removed by evaluating their condition during parsing. The output of this phase is a list of assignment instructions, which are stored in an intermediate binary format. This file is then parsed to create the AST, differentiated, and the differentiated AST is exported as a C file.

Example
To further illustrate the mechanisms of ACORNS, we consider a typical example from statistics. We will compute the gradient of the cross entropy loss function, between two probability distributions a and b. The loss function can be given by the following piece of C99 code:

Parallelization
To accelerate the evaluation of large gradients or Hessians, we observe that the evaluation of the different entries is independent, and can thus be trivially parallelized using OpenMP [6] directives. We experimentally study the scaling of our generated code in Section 4.

Large Expressions
As the complexity of the expressions and/or the number of outputs grows, the size of the generated C file grows accordingly. While this opens more opportunities for the compilers to optimize the evaluation, it also increases memory consumption and compilation time. For large expressions, such as the Hessian of a cubic Lagrangian element, this might become impractical. We propose a simple, yet effective and practical, strategy to overcome this issue: we split the output into multiple C files that are compiled independently. This has the downside of reducing the optimization opportunities   of compilers. However, in our experiments, the performance drop is negligible and we thus use this option aggressively to reduce compilation times.
To evaluate the effect of the split on performances, we consider a function for calculating the Neo-Hookean Elasticity energy [24] for a linear, quadratic, and cubic Lagrange element (12, 30, and 60 variables respectively). The code generated by these function is large (29 GB for the function with 60 variables) and the compilation with gcc9 does not terminate in 160 hours.
We tested splitting into different number of files, measuring compilation time, evaluation time, and binary size after compilation.
As expected, the compilation time decreases as the files become smaller (Figure 2) while the size of the binary generated is not noticeably affected by the split (Figure 3). Surprisingly, the evaluation time of the generated code is also not noticeably affected by the split (Figure 4). As a default, we thus opt for splitting so that each file had a size of around 16 MB, which is a good compromise between number of files and compilation time.

Integration in a CMake Build System
One of the key design goal of our system is the easy and fast deployment in numerical based methods. We show how ACORNS can be integrated within a CMake build system for quick and easy deployment. In the following we describe the process and we refer to our github repository for the complete example https://github.com/deshanadesai/acorns. Let us consider a simple problem: we want to minimize the following function (saved in the file function.c) using gradient descent. This will generate the ders_0.h and ders_0.c file, which will be directly compiled by CMake and linked with the other files in the project. To use the derivatives it is then sufficient to include the generated header file. We show an example of using the computed derivative to find the root by using gradient descent. if ( fabs ( step ) <= 1e -5) break ; } printf ( " Minimum : ␣ % f \ n " , vals [0]); printf ( " Iterations : ␣ % d \ n " , iteration ); return 0; }

RESULTS
Our algorithm is implemented in Python. We run our experiments on a 2.35 GHz AMD EPYC™ 7452 running Ubuntu 19.10 GNU/Linux 5.3.0-29-generic x86_64. The reference implementation of ACORNS is freely available and can be easily obtained trough a conda-forge package 1 under MIT license. The source code and the scripts to generate the experiments are available at https: //github.com/deshanadesai/acorns.
We compare our generated code against 4 state-of-the-art differentiation algorithms: (1) Tapenade [14] shares many similarities with our system. It converts C code into a C program computing its derivatives and Hessians 2 .
(2) PyTorch [22] is a backward auto-differentation library tailored for machine learning. It provides both a CPU and a GPU version, we compare against the CPU version only. (3) Mitsuba Autodiff [12,15] compute automatic differentiation using operator overloading and depends on Eigen [13] for the linear algebra and storage. We run all our comparisons using only static stack-allocated matrices, since the dynamic mode is slower and unnecessary for the expressions we use for benchmarking. (4) Enoki [16] is a new library developed specifically to support differentiable rendering. Similarly to PyTorch, it supports both CPU and GPU evaluation, and we compare only with the CPU version and only for gradients, since it does not support Hessian computation.
Expression Types. We use three expression to evaluate different realistic scenarios where autodifferentiation is used: (1) An high-order polynomial, a non-polynomial function, and procedural expressions with large number of variables: (2) A scalar trigonometric function: (3) A vector-valued polynomial with s variables: For each expression, we randomly generate evaluation points and plot the average running times over 10 runs. ACORNS generates C code, Mitsuba and Enoki are both running in C++. All the C and C++ code is compiled with gcc9/g++9 with the flags -O3, -ffast-math, -flto and, only for C++, -std=c++11.
Gradient. In Figure 5 we report the timings, for different number of evaluation points, for the three functions. For the two polynomial functions we are roughly 5× faster than Mitsuba and Enoki and 2.9 times faster than PyTorch. For the trigonometric function, due to the increase in the cost of evaluating the trigonometric function, our advantage is slightly reduced. For gradients, our approach is very similar to Tapenade, and we obtain comparable timings.
In Figure 6 left we fix the number of evaluations and change the parameter s in (3), to evaluate how the methods scale with the number of variables. As expected, all methods scale linearly, but our method is still 3.97× 7.95×, and 3× faster than Mistuba, Enoki, and Pytorch respectively. As for Figure 5, Tapenade's code has very similar running time as ours.
Hessian. We repeat the same experiments as in previous section to compare ACORNS with Mitsuba, PyTorch, and Tapenade to compute Hessians. Figure 7 shows that our code is very efficient, 12.8× faster than Mitsuba and 1336× faster than PyTorch (which is not optimized to evaluate Hessians). Tapenade is faster than the other alternatives, but it is still around 5 times slower than  our code. Note also that while the Hessian generation is completely automatic with ACORN, it does require manual interaction with Tapenade, since Hessian computation is not fully supported. The scaling with respect to the number of variables for Hessian evaluation (Figure 6 right) has a similar trend as for the gradients.
Parallel Evaluation. We ran the function in (3) for s = 25 and computed the gradient for 100000 evaluation points allowing OpenMP to use a different number of threads. We observe an almost linear scaling up to 15 threads (Figure 8, right) on our workstation, and it then flattens as more threads are added. We speculate that this is due to saturation of the memory bandwidth of the workstation, since our task is inherently massively parallelizable.

APPLICATIONS
We integrated ACORNS in two applications in geometry processing and in physical simulation to evaluate the gain in performance in a realistic setting. In both these cases, the algorithm needs to assemble a dense gradient and a sparse Hessian. For both applications, ACORNS is used to compute small dense blocks, which are then assembled in a larger sparse Hessian matrix. Directly supporting the automatic construction of the sparse Hessian is an interesting avenue for future work.

Parametrization
Given a surface embedded in three dimensional space, an important problem in geometry processing is to assign parametric values to it, i.e. computing a flattening of the surface on a plane. We refer the readers to [7,26] for a comprehensive survey. In our application, we consider minimizing the non-convex symmetric Dirichlet energy [27] with projected Newton's method [30]: when the Hessian is not positive (semi-)definite, possibly at saddle points, we project it to the closest positive definite matrix through polar decomposition. The algorithm involves computing the full Jacobians and Hessians per element: our original implementation uses Mitsuba, and we replaced it with ACORNS. We summarize the statistics in Table 1. The autodiff time is reduce considerably, but in this application is not the bottleneck, thus resulting in a small reduction of the overall running time.

Finite Element Simulation
The finite element method aims at solving partial differential equations (PDEs) describing physical phenomena. A typical example is to simulate the deformation of an elastic body subject to some external forces (boundary conditions); the algorithm involves minimizing an elastic energy, and requires computing derivatives. We integrate ACORNS in PolyFEM [25] and compared, for different finite element discretization order (ranging from linear to cubic), the timings of a Neo-Hookean simulation using the Mitsuba autodiff and with ACORNS to compute Hessian and gradient of the energy. In the finite element method, all computations are performed locally on each tetrahedron. The different order will have different local number of degree of freedom (which determine the size of the gradient and Hessian): a linear tetrahedron has 12, a quadratic has 30, and a cubic has 60. Our method provides an overall speedup of 2 to 4 times, depending on the degree, over the entire simulation time (Table 2).

CONCLUDING REMARKS
We introduced ACORNS, a software library for automatic differentiation that generates efficient kernels for computing both gradients and Hessian and can be easily integrated in existing C or C++ projects. The core idea of our algorithm is to fully unroll the computational graph, and rely on modern compilers to optimize the resulting code. Compared with alternatives, our algorithm is faster at evaluation time, but slower during compilation: we believe that this tradeoff is very interesting for many scientific computing applications where the evaluation of small dense Hessian is repeated millions of times. The compilation time could be reduced by performing symbolic simplification on the code before generating the c files. In addition, we could avoid the parsing time of the compiler by directly generating bytecode (for example for the llvm compiler) and compiling it in memory, avoiding unnecessary disk access. Variable conditional are not support by ACORNS. Support for it could be added, but the performance will likely decrease, due to the reduced optimization options and more complex parallelization. We would like to extend ACORNS to directly support the generation of sparse Hessians [8,32] and also extend the code generation to target GPU accelerators.
Our open-source reference implementation and a set of examples on how to use it is available at https://github.com/deshanadesai/acorns, and it is also released as a Conda package on conda-forge to allow an easy installation on all major operating systems.