Generalization of partitioned Runge--Kutta methods for adjoint systems

This study is concerned with computing the gradient of a function of the numerical solution of an ordinary differential equation (ODE) with respect to the initial condition. The adjoint method computes the gradient approximately by solving the corresponding adjoint system numerically. In this context, Sanz-Serna [SIAM Rev., 58 (2016), pp. 3--33] showed that when the initial value problem is solved by a Runge--Kutta (RK) method, the gradient can be computed exactly by applying an appropriate RK method to the adjoint system. Focusing on the case where the initial value problem is solved by a partitioned RK (PRK) method, this paper presents a numerical method for the adjoint system that gives the exact gradient; the method can be seen as a generalization of PRK methods.


Introduction
We consider a system of ordinary differential equations (ODEs) of the form
\[
\frac{\mathrm{d}}{\mathrm{d}t} x(t;\theta) = f(x(t;\theta)) \tag{1}
\]
with the initial condition x(0; θ) = θ ∈ R^d, where f : R^d → R^d is assumed to be sufficiently smooth. We denote the numerical approximation at t_n = nh by x_n(θ) ≈ x(nh; θ), where h is the step size. Let C : R^d → R be a differentiable function. This paper is concerned with the computation of ∇_θ C(x_n(θ)), i.e. the gradient of C(x_n(θ)) with respect to the initial condition θ. Calculating this gradient is a fundamental task when solving optimization problems such as
\[
\min_{\theta} \sum_{n=1}^{N} C(x_n(\theta)). \tag{2}
\]
A simple method of obtaining an approximation to the gradient is to compare C(x_n(θ)) with the numerical approximation corresponding to a perturbed initial condition. For example, (C(x_n(θ + ∆e_i)) − C(x_n(θ)))/∆ may be regarded as an approximation of the ith component of the gradient, where ∆ is a small scalar constant and e_i is the ith column of the d-dimensional identity matrix. However, this approach becomes computationally expensive when the dimensionality d or the number of time steps N is large, since every component of the gradient requires an additional numerical integration. Further, the approximation accuracy can deteriorate significantly due to the discretization error. Alternatively, the adjoint method has been widely used in various fields such as variational data assimilation in meteorology and oceanography [4], inversion problems in seismology [5], optimal design in aerodynamics [6], and neural networks [2]. In this approach, the gradient is approximated by integrating the so-called adjoint system numerically. This approach is usually more efficient than the perturbation approach above; however, the approximation accuracy may still be limited when the accumulated discretization errors are large.
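To make the cost of this perturbation approach concrete, the following is a minimal sketch in Python; the integrator (explicit Euler) and all names are illustrative assumptions, not part of the methods discussed below.

    import numpy as np

    def integrate(f, theta, h, N):
        # Explicit Euler approximation of (1); returns x_N(theta).
        x = np.asarray(theta, dtype=float)
        for _ in range(N):
            x = x + h * f(x)
        return x

    def fd_gradient(C, f, theta, h, N, delta=1e-6):
        # Forward-difference approximation of the gradient of C(x_N(theta)):
        # one extra integration per component, hence d + 1 integrations in all.
        theta = np.asarray(theta, dtype=float)
        base = C(integrate(f, theta, h, N))
        grad = np.empty_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = delta
            grad[i] = (C(integrate(f, theta + e, h, N)) - base) / delta
        return grad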
Recently, Sanz-Serna showed that the gradient ∇_θ C(x_n(θ)) can be computed exactly when the original system is integrated by a Runge-Kutta (RK) method [14]: the intended exact gradient is obtained by applying an appropriate RK method to the adjoint system. Note that even if the numerical integration of the original system (1) is not especially accurate, a highly accurate approximation, or the exact computation, of the gradient ∇_θ C(x_n(θ)) is often required. We provide two examples.
• The forward propagation of several deep neural networks can be interpreted as a discretization of ODEs. For example, the Residual Network (ResNet), which is commonly used in pattern recognition tasks such as image classification, can be seen as an explicit Euler discretization [3] (see the sketch after this list). Also, neural network architectures based on symplectic integrators for Hamiltonian systems have been developed recently, which avoid numerical instabilities [8]. Since the output of such neural networks is not the exact solution of the ODE but a numerical approximation of it, their training is formulated as an optimization problem of the form (2). In other words, the backpropagation algorithm [7] is a special case of the adjoint method in this context.
• Let us consider solving an optimization problem of the form (2). If we apply the Newton method, we need to solve a linear system whose coefficient matrix is the Hessian of the objective function with respect to the initial state θ. When this linear system is solved by a Krylov subspace method such as the conjugate gradient (CG) method, Hessian-vector multiplications need to be computed. The adjoint method can be used to approximate these multiplications [17,18]. However, this approach usually amounts to applying the CG method to a non-symmetric linear system even though the exact Hessian is always symmetric, and consequently, its convergence is not always guaranteed. Symmetry is retained if the Hessian-vector multiplication is computed exactly [10]. We note that the need for computing the exact Hessian-vector multiplication also arises in the context of uncertainty quantification (see, e.g. [11,16]).
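To make the first example above concrete, the following is a minimal sketch of the ResNet-as-explicit-Euler viewpoint. The two-layer residual map, the shared weights across blocks, and all names are illustrative assumptions, not a prescribed architecture.

    import numpy as np

    def residual_block(x, W1, W2, h):
        # One residual block, x + h*F(x): formally an explicit Euler step of
        # size h for the ODE dx/dt = F(x), with F a small two-layer map.
        return x + h * (W2 @ np.tanh(W1 @ x))

    rng = np.random.default_rng(0)
    d = 4
    W1 = 0.1 * rng.normal(size=(d, d))
    W2 = 0.1 * rng.normal(size=(d, d))
    x = rng.normal(size=d)
    for _ in range(10):  # forward propagation through 10 weight-shared blocks
        x = residual_block(x, W1, W2, h=0.1)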
To the best of the authors' knowledge, an algorithm that systematically computes the exact gradient has been developed only for the case where the ODE system (1) is discretized by an RK method; similar algorithms should be developed for other classes of numerical methods. We note that while it may not be difficult to construct such an algorithm for a specific ODE solver, providing a recipe for a whole class of numerical methods is more useful for practitioners. Among such classes, we focus on partitioned RK (PRK) methods. A straightforward conjecture would be that the exact gradient is obtained by applying an appropriate PRK method to the adjoint system. Indeed, our previous report showed that this conjecture is true for the Störmer-Verlet method [13]. However, except for some special cases, the conjecture does not hold in general. In this study, we shall show that the exact gradient can be computed by a certain generalization of PRK methods.
In this paper, after reviewing the approach proposed by Sanz-Serna in Section 2, we present our main results in Section 3. Several examples are provided in Section 4, and concluding remarks are given in Section 5.

Preliminaries
In this section, we review the adjoint method and the approach proposed by Sanz-Serna [14].

Adjoint method
Let x(t) denote the solution to the system (1) with x(0) = θ, and let x̃(t) be the solution for the perturbed initial condition x̃(0) = θ + ε. By linearising the system (1) at x(t), we see that, as ε → 0, x̃(t) = x(t) + δ(t) + o(‖ε‖), where δ(t) solves the variational system
\[
\frac{\mathrm{d}}{\mathrm{d}t}\delta(t) = \nabla_x f(x(t))\,\delta(t), \qquad \delta(0) = \varepsilon. \tag{3}
\]
The corresponding adjoint system, which is often introduced by using Lagrange multipliers, is given by
\[
\frac{\mathrm{d}}{\mathrm{d}t}\lambda(t) = -\nabla_x f(x(t))^\top \lambda(t). \tag{4}
\]
Here, ∇_x f(x(t))^⊤ is the transpose of the Jacobian ∇_x f(x(t)). For any solutions to (3) and (4), we see that (d/dt)(λ(t)^⊤ δ(t)) = 0; thus, it follows that λ(t)^⊤ δ(t) = λ(0)^⊤ δ(0) for all t > 0, so in particular, at the final time T,
\[
\lambda(T)^\top \delta(T) = \lambda(0)^\top \varepsilon. \tag{5}
\]
Because of the chain rule and δ(T) = (∂x(T;θ)/∂θ) ε,
\[
\nabla_x C(x(T;\theta))^\top \delta(T) = \nabla_\theta C(x(T;\theta))^\top \varepsilon. \tag{6}
\]
By comparing (6) with (5), it is concluded that the solution to the adjoint system (4) with the final condition λ(T) = ∇_x C(x(T;θ)) satisfies λ(0) = ∇_θ C(x(T;θ)). In other words, solving the adjoint system (4) backwards in time from this final condition yields the gradient ∇_θ C(x(T;θ)) at t = 0. Now, we compute the gradient ∇_θ C(x_N(θ)); as a particular case of ∇_θ C(x_n(θ)), n = 1, ..., N, we here consider n = N. The above discussion indicates that approximating the solution to the adjoint system (4) with the final condition λ(t_N) = ∇_x C(x_N(θ)) gives an approximation of ∇_θ C(x_N(θ)) at t = 0. In general, the approximation accuracy depends on the accuracy of the numerical integrators for both the original system (1) and the adjoint system (4).
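As a concrete illustration of this continuous adjoint approach, the following sketch integrates (1) forward and (4) backward, both with explicit Euler. All names are illustrative assumptions; jac_f must return the Jacobian ∇_x f.

    import numpy as np

    def adjoint_gradient(f, jac_f, grad_C, theta, h, N):
        # Forward pass: store the trajectory x_0, ..., x_N of (1).
        xs = [np.asarray(theta, dtype=float)]
        for _ in range(N):
            xs.append(xs[-1] + h * f(xs[-1]))
        # Backward pass: start from lambda_N = grad_x C(x_N) and step the
        # adjoint system (4) backwards in time.
        lam = grad_C(xs[-1])
        for n in reversed(range(N)):
            lam = lam + h * jac_f(xs[n]).T @ lam
        return lam  # gradient of C(x_N(theta)) with respect to theta

Incidentally, for this particular pairing the backward recursion reproduces the exact discrete gradient of the explicit Euler solution (cf. Theorem 1 below); with other choices of integrator for (4), the result is only an approximation whose accuracy depends on both discretizations.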

Exact gradient calculation
In the following, we assume that x_n(θ) is the approximation obtained by applying an RK method to the original system (1). In this case, Sanz-Serna showed in [14] that the gradient ∇_θ C(x_N(θ)) can be computed exactly (up to round-off in practical computation) by applying an appropriate RK method to the adjoint system (4). Below, we briefly review the procedure. The argument (θ) will often be omitted for brevity.
Suppose that the original system (1) has been discretized by an s-stage RK method characterized by the coefficients {a_ij} and {b_i}:
\[
X_i = x_n + h \sum_{j=1}^{s} a_{ij} f(X_j), \quad i = 1, \dots, s, \qquad
x_{n+1} = x_n + h \sum_{i=1}^{s} b_i f(X_i),
\]
with the initial condition x_0 = θ. The pair of {a_ij} and {b_i} will simply be denoted by (a, b). Correspondingly, the variational system (3) is discretized by the same RK method:
\[
\Delta_i = \delta_n + h \sum_{j=1}^{s} a_{ij} \nabla_x f(X_j)\,\Delta_j, \quad i = 1, \dots, s, \qquad
\delta_{n+1} = \delta_n + h \sum_{i=1}^{s} b_i \nabla_x f(X_i)\,\Delta_i.
\]
We discretize the adjoint system (4) with an s-stage RK method, which may differ from the RK method (a, b), characterized by the coefficients (A, B):
\[
\Lambda_i = \lambda_n - h \sum_{j=1}^{s} A_{ij} \nabla_x f(X_j)^\top \Lambda_j, \quad i = 1, \dots, s, \qquad
\lambda_{n+1} = \lambda_n - h \sum_{i=1}^{s} B_i \nabla_x f(X_i)^\top \Lambda_i,
\]
with the final condition λ_N = ∇_x C(x_N(θ)); note that the Jacobians are evaluated at the stage values X_i of the forward method.
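Since the stage equations of the adjoint scheme above are linear in the stages Λ_i, one step can be computed with a single linear solve. The following is a minimal sketch under that observation; the names are illustrative, and Js[i] stands for ∇_x f(X_i) at the forward stage values.

    import numpy as np

    def adjoint_rk_step(lam, Js, A, B, h):
        # One step lam_n -> lam_{n+1} of the adjoint RK scheme; in practice
        # the recursion is solved from lambda_N backwards in n.
        s, d = len(Js), lam.size
        # Stage equations: Lambda_i + h * sum_j A_ij * J_j^T Lambda_j = lam_n.
        M = np.eye(s * d)
        for i in range(s):
            for j in range(s):
                M[i*d:(i+1)*d, j*d:(j+1)*d] += h * A[i, j] * Js[j].T
        Lam = np.linalg.solve(M, np.tile(lam, s)).reshape(s, d)
        # Update: lam_{n+1} = lam_n - h * sum_i B_i * J_i^T Lambda_i.
        return lam - h * sum(B[i] * Js[i].T @ Lam[i] for i in range(s))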
Combining the chain rule ∇_θ C(x_N(θ)) = ∇_θ x_N(θ)^⊤ ∇_x C(x_N(θ)) with δ_n = (∇_θ x_n(θ)) δ_0, we see that ∇_x C(x_N(θ))^⊤ δ_N = ∇_θ C(x_N(θ))^⊤ δ_0. Therefore, if the numerical solution of the adjoint system (4) with the final condition λ_N = ∇_x C(x_N(θ)) satisfies λ_N^⊤ δ_N = λ_0^⊤ δ_0, then λ_0 = ∇_θ C(x_N(θ)). The theory of geometric numerical integration [9] tells us that if an RK method is chosen for the adjoint system such that the pair of RK methods for the original and adjoint systems satisfies the condition for a partitioned RK (PRK) method to be canonical, the property λ_N^⊤ δ_N = λ_0^⊤ δ_0 is guaranteed, and the gradient ∇_θ C(x_N(θ)) is obtained exactly, as shown in Theorem 1. We note that canonical numerical methods are well known in the context of geometric numerical integration. For more details, we refer the reader to [9, Chapter VI] (for PRK methods, see also [1,15]).
Theorem 1 ([14]). Assume that the original system (1) is solved by an RK method (a, b) and that the coefficients (A, B) of the RK method for the adjoint system (4) satisfy
\[
B_i = b_i, \quad i = 1, \dots, s, \qquad b_i A_{ij} + B_j a_{ji} = b_i B_j, \quad i, j = 1, \dots, s.
\]
Then, by solving the adjoint system with the final condition λ_N (= λ(t_N)) = ∇_x C(x_N(θ)), we obtain the exact gradient λ_0 = ∇_θ C(x_N(θ)).

Remark 1. The conditions in Theorem 1 indicate that
\[
A_{ij} = b_j - \frac{b_j a_{ji}}{b_i}, \quad i, j = 1, \dots, s,
\]
which makes sense only when every weight b_i is nonzero. However, for some RK methods one or more of the weights b_i vanish. In such cases, the above conditions cannot be used to find an appropriate RK method for the adjoint system. We refer the reader to the Appendix of [14] for a workaround.
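As a quick numerical sanity check (our own illustration, not taken from [14]), the formula in Remark 1 can be applied to any RK method whose weights are all nonzero, and the conditions of Theorem 1 can then be verified directly; here for the classical fourth-order RK method:

    import numpy as np

    # Butcher tableau of the classical 4th-order RK method (all b_i nonzero).
    a = np.array([[0.0, 0.0, 0.0, 0.0],
                  [0.5, 0.0, 0.0, 0.0],
                  [0.0, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
    b = np.array([1/6, 1/3, 1/3, 1/6])

    # Coefficients for the adjoint system: A_ij = b_j - b_j * a_ji / b_i, B = b.
    A = b[None, :] - b[None, :] * a.T / b[:, None]
    B = b.copy()

    # Check the conditions of Theorem 1: b_i A_ij + B_j a_ji - b_i B_j = 0.
    residual = b[:, None] * A + B[None, :] * a.T - np.outer(b, B)
    assert np.allclose(residual, 0.0) and np.allclose(B, b)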
The following theorem shows that the gradient can be computed exactly by an appropriate generalized partitioned RK (GPRK) method.
Hence, if the conditions in Theorem 2 are satisfied, it follows that λ_{n+1}^⊤ δ_{n+1} = λ_n^⊤ δ_n.

Remark 2. We note that if the coefficients of the PRK method applied to the system (7) satisfy b^{[1]} = b^{[2]}, then the GPRK method uniquely determined from the conditions in Theorem 2 satisfies (13). Therefore, in this case, the GPRK method falls into the category of PRK methods.

Examples
We consider several classes of PRK methods and show corresponding GPRK methods that are uniquely determined from the conditions in Theorem 2.
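As a point of reference for these examples, the following is a minimal sketch of the Störmer-Verlet method for a separable Hamiltonian system dq/dt = p, dp/dt = -∇V(q), a representative member of the PRK class; the setting and names are our illustrative assumptions.

    import numpy as np

    def stormer_verlet(grad_V, q0, p0, h, N):
        # Stoermer-Verlet for dq/dt = p, dp/dt = -grad V(q): a PRK method
        # that advances the (q, p) partition with two different RK tableaux.
        q = np.asarray(q0, dtype=float)
        p = np.asarray(p0, dtype=float)
        for _ in range(N):
            p_half = p - 0.5 * h * grad_V(q)
            q = q + h * p_half
            p = p_half - 0.5 * h * grad_V(q)
        return q, p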

Concluding remarks
In this paper, we have shown that the gradient of a function of numerical solutions to ODEs with respect to the initial condition can be computed systematically and efficiently when the ODE is solved by a partitioned Runge-Kutta method. The key idea is to apply a certain generalization of partitioned Runge-Kutta methods to the corresponding adjoint system. The proposed method is applicable, for example, to the estimation of initial conditions or system parameters of ODE models and to the training of deep neural networks.
It is an interesting problem to construct a similar recipe for other types of ODE solvers such as linear multistep methods and exponential integrators. We are now working on this issue.