Instantons for rare events in heavy-tailed distributions

Large deviation theory and instanton calculus for stochastic systems are widely used to gain insight into the evolution and probability of rare events. At its core lies the realization that rare events are, under the right circumstances, dominated by their least unlikely realization. Their computation through a saddle-point approximation of the path integral for the corresponding stochastic field theory then reduces an inefficient stochastic sampling problem into a deterministic optimization problem: finding the path of smallest action, the instanton. In the presence of heavy tails, though, standard algorithms to compute the instanton critically fail to converge. The reason for this failure is the divergence of the scaled cumulant generating function (CGF) due to a non-convex large deviation rate function. We propose a solution to this problem by"convexifying"the rate function through nonlinear reparametrization of the observable, which allows us to compute instantons even in the presence of super-exponential or algebraic tail decay. The approach is generalizable to other situations where the existence of the CGF is required, such as exponential tilting in importance sampling for Monte-Carlo algorithms. We demonstrate the proposed formalism by applying it to rare events in several stochastic systems with heavy tails, including extreme power spikes in fiber optics induced by soliton formation.


I. INTRODUCTION
In many situations of physical relevance, rare events are tremendously important despite their infrequent occurrence: Heat waves, stock market crashes, or earth quakes all occur with small probability but devastating consequences. Unfortunately, due to their rareness, these events are hard to observe in experiment or numerical simulation, and require special treatment. Rare event algorithms [1] are typically based on one of the two following ideas: Either to increase the rate of occurrence of the rare event by biasing the underling system (importance sampling), or to substitute all possible ways of observing a rare event by its most common realization (large deviations/instanton theory). Under the hood both are connected to the exponentially tilted measure and the cumulant generating function. As we will see, when naively implementing standard schemes, both become ill-defined when the underlying probability densities become heavytailed.
Here, we focus on numerical algorithms connected to instanton theory and its rigorous counterpart, large deviation theory, to recover the tails of probability distributions in a stochastic system. A large deviation principle (LDP) states that the probability of rare events decays exponentially, and its exponential scaling is given by the minima of the corresponding rate function I [2,3]. More precisely, let P ε be a family of probability measures on a suitable space X . We say P ε satisfies an LDP with rate function I : Ω → R if for all Ω ⊂ X , we have [4], where denotes log-asymptotic equivalence in the limit ε → 0. In a physical sense, the probability P ε (Ω) can formally be written as a path integral, and the estimate (1) be-comes a saddle point approximation or Laplace method. In large deviation theory, the Gärtner-Ellis theorem [5] provides a direct formula for the rate function I. Roughly speaking, if the limiting behavior of a scaled CGF is welldefined, then its Legendre-Fenchel (LF) transform is the rate function of the LDP of the process under consideration. The LF transform of a real-valued function f (x) defined on R n is defined as where x, y = x T y is the inner product on R n . Let z ε be a sequence of random variables in R n , with probability measures P ε , and assume that its scaled CGF, defined as the limit, exists for each λ ∈ R n and is differentiable in λ. Then, the Gärtner-Ellis theorem states that the family of probability measures {P ε } satisfy an LDP, where the rate function I is the LF transform of G, I = G * . Intuitively, one can interpret λ as a Lagrange multiplier to condition on an outcome z. Crucially, though, if the probability measure P ε is super-exponential, or has even heavier tails, the expectation in equation (3) diverges and the CGF is no longer defined. Notably this does not mean that the corresponding rare events are special in any way, but merely that the duality between the parameter λ and rare event observable z is broken. As a consequence, no tilt exists to realize an outcome z, and standard rare events algorithms fail.
In what follows, we will show how a nonlinear tilt allows to modify the connection between tilt λ and outcome z, so that heavy tails can be probed regardless of the non-convexity of their rate function. We will concentrate specifically on the case of small noise sample-path large arXiv:2012.03360v1 [cond-mat.stat-mech] 6 Dec 2020 deviations for stochastic differential equations (SDEs), which will we introduced in section II. Here, we focus on the numerical computation of the instanton in section II A, and highlight the connection to a change of measure in path-space in II B. We demonstrate the problem in the heavy-tailed case by reviewing the convex analysis for the Gärtner-Ellis theorem in section III, and propose a solution modifying the instanton computation to yield finite outcomes for heavy tails and non-convex rate functions in section III A. To demonstrate the applicability of our approach, we show several examples of instantons for heavy-tailed distributions in section IV: Toy models with super-exponential (section IV A) and powerlaw (section IV B) tails, and a banana-shaped potential (section IV C), and finally high-amplitude events in fiberoptics described by the focusing nonlinear Schrödinger equation (section IV D). We summarize our findings in section V.

II. INSTANTONS AND FREIDLIN-WENTZELL THEORY
Consider a stochastic system, where X ε t ∈ R n is a family of random processes indexed by the noise strength ε. The deterministic drift b : R n → R n satisfies the Lipschitz condition, dW t is an n-dimensional Brownian increment, and the noise covariance χ = σσ T is assumed to be invertible for σ ∈ R n×n . Intuitively, equation (4) describes the temporal evolution of a system perturbed by stochasticity, where we later assume the fluctuations to be small, ε 1. This situation is ubiquitous in many application areas, where for example ε plays the role of the temperature in a chemical reaction, or the inverse number of particles in a thermodynamic system.
With vanishing noise, ε = 0, the solution x ∈ R n of the unperturbed (deterministic) systeṁ converges to one of its attractors for long times. For example, consider a point attractor, or asymptotically stable fixed points,x ∈ R n , with basin of attraction B, such that x(t) →x for t → ∞ for all initial conditions x 0 in B. Solutions of the stochastic system (4) converge to solutions of the deterministic system (5) in probability, [4]. This is an instance of the law of large numbers, stating that for small noise and large times we expect solutions of the stochastic system to end up near the attractors of the deterministic one.
Nevertheless, for any non-zero ε 1 there is a small but non-vanishing probability of finding the system far from the attractor. This can only happen if the noise conspires in just the right way to overcome the deterministic dynamics, and is consequently a rare event. Concretely, consider any domain D ⊂ R n attracted tox, i.e. D ⊂ B. We are interested in the chance of trajectories X ε t departing fromx and eventually leaving D. These trajectories belong to the set and we want to quantify the probability which is a question about the probability of large deviations. Under the stated conditions, there is a trajectory ϕ * ∈ A z such that the probability measure over A z ac- In other words, in the small noise limit we will almost surely find our sample trajectory close to ϕ * , such that max t0≤t≤t1 |X ε t − ϕ * t | ≤ δ, for an arbitrary small δ. In order to find this most likely trajectory ϕ * , Freidlin-Wentzell theory [4] states that ϕ * is actually the minimizer of large deviation rate function S(ϕ) associated with the stochastic system (4), given by where the integral exists, and S(ϕ) = ∞ otherwise. The norm f 2 χ = f, χ −1 f is induced by the noise covariance χ. With this rate function we can quantify the probability (7) of departing the domain D as where The problem of finding the rare event probability is now reduced to finding the minimizer ϕ * . In analogy to the principle of least action in classical mechanics or quantum mechanics, the rate function is often termed action, and the corresponding minimizer ϕ * is called instanton. The integrand of S can be understood as a Lagrangian, so that the maximum likelihood pathways leaving the attractors of (4) correspond to semi-classical trajectories of the field theory defined by L.

A. Instanton equations and large deviation Hamiltonian
It is helpful, both to increase understanding, and to simplify the numerical implementation, to rephrase the optimization problem (11) into the corresponding Hamiltonian formulation. To this end, we introduce the large deviation Hamiltonian H(ϕ, ϑ) as the Legendre transform of the Lagrangian L (ϕ,φ), which, for the Lagrangian (12), corresponds to Here ϑ = ∂L/∂φ is the conjugate momentum of ϕ [6]. Now, the minimizer ϕ * can also be expressed as the solution of Hamilton's equations, with boundary conditions, Equations (15) and (16) are often termed instanton equations.
The fact that we are looking only for trajectories that will eventually leave the attractor, ϕ * ∈ A z , implies that the optimization problem (11) is a constrained one, i.e. we are looking only for solutions of the instanton equations conditioned on the endpoint z. Practically, this constrained optimization problem can be transformed into an unconstrained one, by using a Lagrange multiplier λ ∈ R n to enforce the final constraint [7]. Note that the variation of this unconstrained action, results in the same instanton equations, (15), but with different boundary conditions We can solve the instanton equations (15) iteratively with these conditions, by solving the ϕ-equation forward in time, and using the result to solve the ϑ-equation backward in time, until convergence [8,9]. Note that this choice of temporal direction of integration is not only the one suggested by the boundary conditions, but is further the numerically stable choice of direction for the drifts b and ∇b T .
As we will see below, the mapping between Lagrange multipliers λ and final points z is nontrivial, and it is not clear a priori how to choose the correct λ to obtain a final configuration z = ϕ * (t 1 ). If we are interested in p ε (z) for a whole range of z, we can instead choose to simply solve the instanton equations for a range of λ to cover a range of z without specifically needing to know the duality mapping λ(z). Exactly this procedure is often used in the literature to work out probability distributions of stochastic systems, from Burgers [8,10] or Ginzburg-Landau [11] equations to the Kardar-Parisi-Zhang [12] and Kipnis-Marchioro-Presutti model [13] Crucially, however, the existence of a corresponding dual λ for a given final point z = ϕ * (t 1 ) is not necessarily guaranteed, as the next section clarifies. As a consequence, the above methodology might fail, in particular in situations with heavy tails.

B. Exponentially tilted measures
Interestingly, there is a probabilistic interpretation of the introduction of the Lagrange multiplier λ to the optimization problem (18) in the form of the exponentially tilted measure [14,15], as for example employed in importance sampling for Monte-Carlo estimators [1]. Intuitively, by the procedure of tilting, one replaces the original random process (4) by a modified one, under which the rare events under consideration become more likely, while correcting for this modification a posteriori when computing their probability. As a consequence, with tilting, the rare event probability, or expectations over its realizations, can be estimated more efficiently, and with possibly smaller variance.
To be more precise, we denote by p λ the measure exponentially tilted towards the outcome z, defined for our purposes as where Ep [.] is the expectation under the original measure p. In equation (20), the probability measure p λ of events resulting in z, i.e. trajectories in A z , have been awarded extra weight by the Radon-Nikodym derivative of p λ with respect to p, which is: Rearranging equation (20) gives: where is the scaled CGF, G : R n → R. This change of measure is optimal, in the sense of maximizing the tilted probability p λ , at the choice λ = λ z with This can be seen, given the Gärtner-Ellis theorem, where the last line is just the definition of the LDP, It is in this sense that the optimal tilting parameter of the end-point distribution corresponds to the Lagrange multiplier in the instanton equations constraining the endpoint to z.

III. CONVEX ANALYSIS AND THE GÄRTNER-ELLIS THEOREM
In order to use the described methodology to find the instanton for a rare outcome z, or equivalently make sense of the corresponding exponentially tilted measure p λ (z), we must demand that the mapping z → λ(z) is a bijection: For every outcome z there must be a unique tilt λ such that the instanton ϕ, solution of (15) with boundary conditions (19) have a unique solution with ϕ(t 1 ) = z. If that is the case then we can estimate the probability The precise properties of the duality mapping between tilting parameter λ and outcome z can be understood by the interplay between the Gärtner-Ellis theorem and convex analysis. We have, where the solution z (λ) of the form does only hold when the rate function is strictly convex. If instead the rate function is not strictly convex (i.e. has concave, and/or affine linear regions or is even just asymptotically linear), the LF transform is applied only to the region at which I (z) admits supporting hyperplanes. If there exists λ ∈ R n such that [16], then we say I admits a supporting hyperplane at z, where the slope of the supporting hyperplane is λ. In this sense, we can define non-convex regions to be the ones that do not admit any supporting hyperplane, so do not have any corresponding λ. Note that the absence of these hyperplanes can affect the LF transform I * (z) = G (λ) in two different ways, Case I: Having an asymptotically linear part of I (z) leads to a divergent LF transform G (λ).
Case II: Having a concave or affine linear part of I (z) leads to an existent but nondifferentiable LF transform G (λ).
Both cases will be discussed specifically in the applications in section IV.
Assuming for now there are supporting hyperplanes (i.e. existent λ) for all z ∈ R n , then equation (28) leads to, i.e. ∇I must be invertible for z(λ) to be so. Up to a choice of sign, this implies that ∇I is strictly monotonically increasing (SMI), which is equivalent to I being a strictly convex function [14]. Also note that if z (λ) is invertible, this implies that G (λ) is a differentiable function, since equation (27) gives, where equation (28) is used. What we have demonstrated above is nothing but the well-known fact that the LF transform of a convex, differentiable function G(λ) is strictly convex. This perspective, though, makes it clear that the existence of a tilting parameter (Lagrange multiplier) λ to enforce an outcome z depends on the finiteness and differentiability of the scaled CGF. In other words, both exponential tilting, and finding a Lagrange parameter to constrain the endpoint to z, depends on the rate function being strictly convex. Since in the low noise limit we have p(z) ∼ exp(−ε −1 I(z)), it is easy to construct cases where the rate function I(z) is not strictly convex. In fact, every situation where the tails of p(z) are fat, i.e. exponential (I(z) ∼ z), stretched exponential (I(z) ∼ z α , α < 1), or even algebraic (I(z) ∼ α log(z), α < 0) tails will break the above assumption. Examples of fat tailed distributions are ubiquitous in physical systems of relevance, such as the energy dissipation in fluid turbulence and the phenomenon of intermittency [17], wealth distributions in economies [18,19] or stock price changes in finance [20,21].
The main contribution of this paper is the realization that the introduction of a nonlinear map F : R n → R n allows us to loosen the restriction of the convexity of I(z). The idea is to define a nonlinear tilt through F via exp(ε −1 λ, F (z) ), such that the map λ → z(λ) is replaced by a new map λ → F • z(λ). We are free to choose any appropriate F . As we will see next, this allows us to suitably reparametrize the space of outcomes, so that the effective rate function I • F −1 is strictly convex.

A. Nonlinear tilt
In analogy to equation (22) and the description in sections II B and III, we can now define the nonlinearly tilted measure where the nonlinearly tilted CGF is given by (compare equation (27)) which is assumed to be finite and differentiable. Its gradient fulfills where the last equality is due to z (λ) being the solution of ∇I (z) = λ T ∇F (z) in equation (33). This proposed remapping can be chosen to overcome the above problem by creating a new function G F (λ), which plays the role of the CGF, while simultaneously being a bounded and differentiable function. At the same time, I • F −1 (y) can be understood as the effective rate function, since equation (33) can be written as, Obviously, the right choice of F depends on the nature of the tail scaling at hand. We will derive the necessary properties of F (.) next.

B. Properties of the reparametrization and the nonlinearly tilted instanton
In the following, we denote by y ∈ R n the reparametrized outcome, y = F (z). Our goal is to choose F such that F •z(λ) = y(λ) is a bijection. From above, it is clear that Following the same argument as in section III, y(λ) is bijective if • F is a diffeomorphism, and Assuming these conditions on F implies that the gradient of the reparametrized rate function I • F −1 (y), given by λ • F −1 (y), is an SMI function, implying that it is invertible. The desired bijective mapping then becomes λ → F (z (λ)) = ∇G F (λ) (compare equation (34)). For a non-convex I, intuitively, F must be chosen to suitably reparametrize the space of outcomes for the effective rate function to become strictly convex. For example for the n = 1 case with heavy tails, one might imagine a strong enough compression of the observable z such that a fat tail becomes non-fat.
To harness this nonlinear tilt in the computation of instantons for distributions with heavy tails, we need to modify the approach outlined in section II A as follows: The variation of the unconstrained action (18) now reads Consequently, the boundary conditions of the instanton equations are modified to which will yield an instanton trajectory ϕ * that reaches z, ϕ(t 1 ) = z, despite the fact that the rate function I(z) = S(ϕ * ) is not convex around z. Since F is continuous, the probability measure P ε • F −1 (y) in the limit ε → 0 is the same as P ε (z) for a continuous F , according to the contraction principle [3,4]. Note that this reparametrization through F is introduced solely to adequately define the tilted measure, or equivalently numerically compute the instanton without encountering divergences. Afterwards, the reparametrization can be reverted to obtain the probability distribution in the original coordinates z.
As additional remark, methods that compute the instanton by solving the global optimization problem, for example by solving the associated Euler-Lagrange equations instead of integrating the instanton equations [22,23], do not require the above treatment: The tilting parameter disappears in these cases as the boundary conditions are fixed in the field variable instead of the conjugate momentum. Therefore, in principle, these methods can be chosen in the non-convex case. The solution of the instanton equations, though, is generally preferred [24] due to numerically efficiency, and sometimes even required (such as when the noise covariance in (4) is not invertible).

IV. APPLICATIONS
We will now consider a number of examples that show how to compute tail probabilities in stochastic systems.
To demonstrate the wide applicability of our approach, we consider several cases that highlight different complications. We start with two toy models that feature stretched exponential (section IV A) or powerlaw (section IV B) tails. Then, in section IV C, we consider a two-dimensional system with a bent ("banana-shape") potential, where the non-convexity is not due to heavy tails, but due to the shape of the unimodal invariant probability density. Lastly, we demonstrate the practical applicability of our method by considering an example motivated from fiber optics in section IV D. Here we compute the probability of measuring extreme power spikes at the end of extended optical fibers, where the probability distribution of the input signal is known. Due to soliton formation, this distribution features heavy tails for long fiber lengths (L 10 m), so in order to compute probabilities via an instanton approach, our corrections are necessary.

A. Stretched exponential
Consider the stochastic gradient flow, The potential U : R n → R determines completely the stationary probability distribution function (PDF) with normalization constant Z. We further assume that U has a unique minimum, i.e. we are only considering unimodal distributions. For large times, t 1 − t 0 = T → ∞, the distribution of endpoints of X ε t1 = z will converge to ρ ∞ (z). From the perspective of large deviation theory (LDT), comparing (26) to (41), the rate function for the final point distribution is equivalent to the potential, I(z) = U (z), and lim ε→0 ε log p(z) = − U (z) .
Therefore, in order to approximate the tails of the stationary distribution, we can compute the instanton ϕ * ending at z and estimate ρ ∞ (z) ≈ exp(−ε −1 S(ϕ * )). We choose n = 1 and consider the non-convex potential, which corresponds to a stretched exponential stationary distribution: At the tails, the dominant exponent is α, and ρ ∞ (z) ≈ exp(−ε −1 |z| α ) for large z, as shown in figure 1. For this distribution, E[exp(ε −1 λz)] diverges for large λ as in case I, and hence numerical methods to find the instanton fail in the tail.
This can be seen in figure 2: We employ the numerical scheme by Chernykh-Stepanov [8,10] to compute the instanton starting at ϕ * (t 0 ) = 0 and ending at ϕ * (t 1 ) = z.  Stationary density with exponential tails (equation (43) for α = 1). Probing the tails with the traditional instanton method (light blue) leads to numerical divergence around the non-convex tail region. Reparametrizing the observable via F (z) = sign(z) log |z| convexifies the tail, so that the instanton (dark blue) correctly predicts the exact tail probabilities (solid black).
The iterative algorithm converges towards the minimizer of the action, and once converged, we can estimate the probability of reaching z by p(z) ≈ exp(−ε −1 S(ϕ * )). As expected, though, computing the instanton fails beyond the inflection points at z ≈ ±1.26, where the tails become stretched exponentials (light blue dots): No choice of λ leads to endpoints z of the instanton beyond these, as the linear tilt diverges and the CGF is undefined. Instead, we need to choose a non-linear tilt, such as for which even in the tails the reparametrized expectation E[exp(ε −1 λF (X ε t1 ))] < ∞ remains bounded. Consequently, the derivative of the CGF G F (λ) is a bijection, so that every value of λ has a corresponding z. This map is explicitly given by 45) (for z > 0, and negative for z < 0). With this choice, the instanton prediction for the stationary PDF is almost exact far into the heavy tails (dark blue dots vs black solid in figure 2). Here, we again employ the iterative instanton computation, but are solving the instanton equations with the boundary condition (39) instead. The underlying reason for convergence is that the reparametrization with F convexifies the rate function, i.e. creating supporting lines with slopes λ for all the domain of I • F −1 . As shown in figure 3, while the second derivative of the rate function becomes negative beyond the inflection points, the second derivative of the nonlinearly tilted rate function remains positive throughout.
For the numerics in this example, we chose α = 1, with N t = 10 3 timesteps, and a time interval of T = 6.

B. Powerlaw distribution
Even heavier tails are given by power law distributions, which are associated with a multitude of phenomena in wide areas of science, in part due to their connection to scale invariance, self-similarity, universality classes and criticality in phase transitions. Here, we construct a simple SDE in n = 1 dimensions which has a powerlaw invariant density. Consider  It can easily be shown that the invariant density for the process (47) is given by where Z is a normalization constant. For z 1 and ε = 1, this takes the limiting form (46), but is regularized for small z. Again, we are interested in computing tail probabilities for this toy model, by computing the instanton ϕ * realizing large values of z, which yields the respective probability by evaluating the corresponding action.
As in section IV A, the LDT rate function, given here by does not admit supporting lines in the tails, and consequently its LF transform is undefined (case I as well). This is reflected in the fact that the moment generating function diverges. We can convexify the rate function (49) by reparametrizing via which is an even more drastic tail compression than needed for the stretched exponential. Note that here we only convexify in the tails, |z| > 1, where the problem occurs, and do not attempt to find a global map. Indeed, for this choice of F , the reparametrized rate function I • F −1 (z) is convex in the tails, as shown in figure 4: Its second derivative remains positive in the tails, which is not true for the original rate function I(z). We can explicitly map λ to z via (for z > 1 and negative for z < −1). Numerically, as before, we solve the optimization problem posed by the instanton equations with tilted boundary conditions (39) to obtain instantons ϕ * for events with large z. The corresponding action, S(ϕ * ), yields the tail probability. This computation is shown in figure 5: The naive instanton computation (light blue) leads to numerically diverging results in the tail region, |z| > 1, which are captured accurately by the reparametrized instanton (dark blue). Parameters are β = 2, ε = 0.25, N t = 10 3 , and T = 10.

C. Banana potential
In higher dimensions, non-convexity can manifest in more subtle ways than in 1D. Consider for example the 2D system, where, This system is a gradient flow for the potential U (x) = x 2 1 + (x 2 − x 2 1 ) 2 , so that again we have that the rate function for the stationary distribution is equivalent to this potential, I(z) = U (z), i.e.
The system has a unique stable fixed point at the origin, which is the deepest point of a banana-shaped valley (the set of points (x 1 , x 2 ) | x 2 = x 2 1 ) of the potential, as can be seen in figure 6 (left). The rate function I (z) (55) does not admit supporting hyperplanes (29) at the region z = (a, b) | b > a 2 , leading to no tilt variables λ ∈ R 2 to reach an outcome z within that region. Unlike the previous examples, the challenge in this case is therefore not the far tails of the stationary density, but actually probing the core of the distribution. The nonconvexity of the rate function of the previous examples amounted to the divergence of its LF transform, the CGF (case I), while here the non-convexity leads to the nondifferentiability of the CGF (case II).
To fix this non-differentiability of the CGF G(λ), we propose a nonlinear reparametrization that satisfies the criteria of (III B). Consider This reparametrization "straightens the banana", i.e. it deforms the space of outcomes such that the rate function becomes strictly convex, as figure 6 (right) shows. Using this reparametrization produces a continuous and differentiable CGF of the observable F , resulting from the LF transform of I • F −1 (y), which allows Lagrange multipliers to reach any outcome z, in particular ones within the nonconvex region.
As numerical experiment, we choose to look at the marginal stationary distribution µ in z 1 direction for a fixed value of z 2 = 3 2 , i.e. µ(z) = ρ ∞ (z, 3 2 ). Since at any fixed value z 2 > 0 we cut through the non-convex region z 2 > z 2 1 , the marginal distribution µ(z) looks like a double-well potential. We stress, though, that the whole system indeed has only a single fixed point. We then solve  3 2 ) (as denoted by the blacked dashed line in figure 6 (left)). Instantons with linear tilt (light blue) fail to reach the region −1 < z < 1 without supporting hyperplanes of the rate function I (z) (55). Performing a reparametrization using the observable (56) produces an strictly convex effective rate function I • F −1 (x) that admits supporting hyperplanes everywhere. The corresponding instanton actions successfully capture the non-convex region (dark blue). the optimization problem posed by the instanton equations with linear tilt, and compare to the minimization problem with nonlinear tilt. As shown in figure 7, the linearly tilted instanton computation produces acceptable results in the tails of the probability density (light blue dots), it fails to converge within the non-convex region −1 < z < 1: Since in that region there are no supporting planes of the rate function (55), there is no λ ∈ R 2 corresponding to the slope of the supporting plane at that z, and consequently no tilt exists to produce the desired outcome z.
For the reparametrized observable (56), on the other hand, the effective rate function I • F −1 (x) is convexified and admits supporting planes at every z. Indeed, as demonstrated in figure 7, the reparametrized optimization problem leads instanton trajectories reaching outcomes (shown as dark blue dots) within the non-convex region −1 < z < 1.
As a final remark, convexifying the rate function by the above method, even though it guarantees the existence of a tilt for every outcome, might nevertheless lead to numerical convergence issues. For example, in regions where the original rate function was convex, the rescaled optimization problem might be harder to solve, or necessitate more iterations or smaller time-steps. Similarly, even in the convexified region, the problem might become ill-posed, for example for z 1 1 and z 2 > 2, where the observable is approximately linear.

D. Nonlinear Schrödinger equation
As a practical example for our proposed method, we consider the formation of extreme events in nonlinear wave equations [25][26][27][28]. In the field of nonlinear optics and photonics it has been established that heavy tailed statistics frequently occur [29]. Physical mechanisms such as soliton formation [30,31] and nonlinear amplification [32] are responsible for the emergence of extreme power spikes out of incoherent, Gaussian initial conditions, and have been subject to investigation by a multitude of rare event algorithms [33,34].
Here, we consider the one-dimensional propagation of an optical pulse along a fiber, described by the nonlinear Schrödinger equation (NLS) The input signal is considered random, with a Gaussian distribution of known energy spectrum. Specifically, we are mimicking an experimental setup such as [35] of a partially coherent light source, where the input signal is designed as a Gaussian shape in frequency space with covariance with spectral bandwidth 1/∆ν and truncation frequency ω N , so that the input signal is given by where ξ n are i.i.d. mean zero, unit variance complex Gaussian random variables. For this setup, we are interested in the probability of measuring large spikes in the optical power A(x, t) = |ψ(x, t)| 2 at the fiber end, x = L. Within the presented instanton formalism, this can be achieved by tilting the distribution of initial conditions towards a high-power outcome at the fiber end, and estimating the tail probability by its most likely ("instantonic") realization. The corresponding LDP is given by for a power spike of size z taken arbitrarily at the center of the temporal domain, t = T /2. Due to the Gaussianity of the initial conditions, the rate function I(z) simply is [34] I(z) = inf Here, ξ determines the source signal ψ 0 (t) through (60), while, λ(z) can be interpreted as a Lagrange multiplier enforcing the power constraint |ψ| 2 = z at the end of the fiber. Equation (62) is therefore simply saying that the rate function is given by the most likely random configuration ξ that determines a source signal with high power output. Note that, similar to the examples above, the tilt in (62) is linear, and we can therefore expect the expectation over light source signals to diverge for fiber lengths L long enough for solitons to emerge and for the tails of the power distribution to become fat. Since the probability of high power output signals at the fiber end is not known analytically, the only option we have to get comparison data is to perform Monte-Carlo (MC) simulations to sample the power distribution. To this end, we simulate the evolution of a wave packet along the fiber with a random input signals with energy spectrum (59) by numerically integrating equation (58). This equation is non-dimensionalized, with x, t and ψ normalized by characteristic parameters L 0 , T 0 and P 0 respectively, such that wherex,t andψ are the corresponding dimensional variables. These parameters L 0 , T 0 and P 0 also determine the dispersion β 2 and nonlinearity γ properties of the optical fiber via We chose these parameters according to the experimental setup in [35], where they are given as L 0 = 160.3 m, T 0 = 1.8778 ps, P 0 = 2.6 W .
Therefore, the optical fiber has dispersion parameter β 2 = 0.022 ps 2 /m and nonlinearity constant γ = 0.0024 (Wm) −1 . The spectral bandwidth 1/∆ν is taken to be (∆ν = 0.5 THz). We pick fiber lengths between 10 m and 60 m, periodic boundary conditions in time treated pseudo-spectrally, and integrate with a secondorder Runge-Kutta exponential time differencing method (ETDRK2) [36] in the spatial variable. The discretization is ∆x = 6.24 × 10 −3 , ∆t = 1.3 × 10 −2 , T = 106, and frequency cut-off N = 45. As expected, and shown in figure 8, the tails of the PDF of optical power become heavier with increasing fiber length L. Its samples number is 10 6 , where the property of A being statistically homogeneous in time is used to improve the statistics.
To compare these brute-force sampling estimates to the instanton prediction, we have to solve the optimization problem (62). This is done by defining the cost functional for a given λ and performing gradient descent, where the gradient is given by with Jacobian J(x, t) = dψ(x, t)/dξ. This gradient can be evaluated by simultaneously integrating the NLS equation (58) and the evolution equation of the Jacobian, (whereā is the complex conjugate of a ∈ C). The iterative gradient descent algorithm yields the optimal choice ξ * that will lead to the desired outcome of the final power exceeding the power threshold z. As can be seen in figure 8 (left), the corresponding prediction for the probability, exp(−I(z)) (from equation (61)) correctly describes the tail decay of high power events at the fiber end, but crucially only as long as the the rate function admits supporting lines, i.e. remains convex. Therefore, the instanton prediction is basically useless for optical fibers longer than L = 30 m. As a side note, the gradient computation could instead be performed in the adjoint formalism, leading to two coupled forward-backward equations similar in spirit to the instanton equations (15), but identically yielding meaningful results only in the convex region of the rate function. Now, applying the idea from above, we can instead nonlinearly tilt the probability distribution of input signals towards high power outcomes. For this, we choose instead the nonlinearly tilted rate function with F (z) = log log |z| , |z| > 1 .
For this tilt, the cost functional becomes E(ξ) = 1 2 |ξ| 2 − λF (ψ(L, T /2)) (72) and the gradient is dE/dξ = ξ − λJ(L, T /2)dF (ψ(L, T /2))/dψ , instead. The results of this are shown in figure 8 (right) for the four longest fiber lengths of 30 − 60 m in 10 m increments, where the tails are fattest. In the revised formalism (dark color), the nonlinearly tilted instanton prediction is able to reach far into the stretched tail and gives the right order of magnitude for the probability of power spikes obtained from sampling (light color). The end of the region of convergence for the naive instanton is shown for comparison (vertical markers). Note that due to the choice of reparametrization F in (71), the nonlinearly tilted instanton prediction is restricted to the region of normalized power |ψ| 2 > 1, but of course this is exactly the tail region that we care about.

V. CONCLUSION
Estimating the probability of tail events can efficiently be done via large deviation theory and instanton calculus, which transforms an inefficient sampling problem into a deterministic optimization problem. Unfortunately, for systems with heavy tails, or more generally non-convex rate functions, standard mechanisms of exponentially tilting the measure, or numerically solving the optimization problem, fail. The reason is the absence of a bijective map between Lagrange multiplier (tilting parameter) and desired outcome, caused by the breakdown of their Lagrange duality, or equivalently by the non-convexity of the rate function.
We put forward the idea of a nonlinear tilt that reparametrizes the output space, effectively convexifying the rate function of the observed probability distribution. We discuss the necessary conditions required for this reparametrization to yield a unique outcome variable and ensure a bijective mapping between tilt and outcome: It needs to be a diffeomorphism chosen such that its composition with the rate function is strictly convex. Note further that the reparametrization can be chosen locally, i.e. the conditions on the nonlinear observable need only apply in a subdomain of the events of interest.
Finding such nonlinear observable can be subtle, especially when the system is highly nonlinear, influencing the rate function landscape. However, drawing inspiration from toy problems with stretched exponential and algebraic tails, which can be treated analytically, yields candidate reparametrizations for physically relevant problems. We show the applicability to real-world problems by demonstrating how instantons determine the probability in extreme optical power events in a fiber optical cable, where solitons lead to a heavy-tailed power distribution at the fiber end.