Taming numerical imprecision by adapting the KL divergence to negative probabilities

The Kullback-Leibler (KL) divergence is frequently used in data science. For discrete distributions on large state spaces, approximations of probability vectors may result in a few small negative entries, rendering the KL divergence undefined. We address this problem by introducing a parameterized family of substitute divergence measures, the shifted KL (sKL) divergence measures. Our approach is generic and does not increase the computational overhead. We show that the sKL divergence shares important theoretical properties with the KL divergence and discuss how its shift parameters should be chosen. If Gaussian noise is added to a probability vector, we prove that the average sKL divergence converges to the KL divergence for small enough noise. We also show that our method solves the problem of negative entries in an application from computational oncology, the optimization of Mutual Hazard Networks for cancer progression using tensor-train approximations.


I. INTRODUCTION

A. Motivation
A notion of distance between probability distributions is required in many fields, e.g., information and probability theory, approximate Bayesian computation [1,2], spectral (re)construction [3][4][5], or compression using variational autoencoders [6]. In mathematical terms, this notion can be phrased in terms of divergence measures.
One of the most commonly used divergence measures is the Kullback-Leibler (KL) divergence [7], which for two discrete probability vectors p = (p_1, . . ., p_n) and q = (q_1, . . ., q_n) has the form

D_KL(p∥q) = Σ_{i=1}^n p_i log(p_i/q_i).  (1)

The KL divergence includes the logarithm, which is well-defined in the usual case of probability vectors with nonnegative entries. However, in practical applications, situations can arise that lead to negative entries, for which the logarithm is not defined. Let us give a few examples. In high-dimensional spaces, approximations are often needed to keep calculations tractable and to avoid runtime explosion due to the curse of dimensionality [8,9]. Such approximations can lead to negative entries when the exact entry is close to zero, as is frequently the case for probability vectors. Another source of negative entries are rounding errors that occur, e.g., when the probability vector is obtained as the solution of a linear system of equations [2,10]. Also, assumptions of a theory can lead to negative entries in the approximate probability distribution [4,11,12]. If negative entries arise, one is faced with the choice of (a) giving up, (b) reformulating the theory, model, or approximation such that negative entries cannot occur, or (c) devising a workaround that leads to correct results in a suitable limit. In Section I B we review approaches from categories (b) and (c) to address the problem of negative entries [4,13-19]. The methods we are aware of are typically tailored to the problem at hand or lead to reduced run-time performance.
Here, we suggest a method in category (c) that is generically applicable and does not suffer from decreased performance. Specifically, we propose the shifted Kullback-Leibler (sKL) divergence as a substitute divergence measure for the standard KL divergence in the case of approximate probability vectors with negative entries. It is parameterized by a vector of shift parameters. The sKL divergence is a modification of the KL divergence and retains many of the latter's useful properties while handling negative entries when they arise. To give theoretical support to our method, we consider a simple example in which i.i.d. Gaussian noise is added to the entries q_i in Eq. (1) so that negative entries can occur. For this example we prove that the difference between the KL divergence and the expected sKL divergence under the noise is quadratic in the standard deviation of the noise, provided that the shift parameters are suitably chosen. Therefore the sKL divergence converges to the KL divergence in the limit of small i.i.d. Gaussian noise. We also show this convergence numerically.
In a concrete application, we show that the sKL divergence enables efficient learning of a Mutual Hazard Network (MHN), a model of cancer progression as an accumulation of certain genetic events [2,20,21]. In an MHN, cancer progression is modeled as a continuous-time Markov chain whose transition rates are parameters of the model that are learned such that the model explains a given patient-data distribution. When considering ≳ 25 possible genetic events, exact model construction becomes impractical and requires efficient approximation techniques [9]. We show that even though negative entries occur in the approximate probability vectors, meaningful models can still be constructed when the sKL divergence is employed.

B. Related work
Various approaches to handling or avoiding negative entries in approximate probability distributions exist in the literature.
In earlier work [22], we considered introducing a threshold ε > 0 and replacing entries of the probability vector that are smaller than this threshold by ε. This method fails to preserve probability mass and leads to a loss of gradient information when used in an optimization task. Additionally, the comparison function obtained in this way no longer meets the requirements of a statistical divergence [23], i.e., none of the required properties in Theorem 1 below are satisfied in this case.
For high-dimensional probability distributions, several nonnegative tensor decompositions have been proposed in order to break the curse of dimensionality while avoiding negative entries [13,15,16,18,19]. Typically, this change of format leads to a significant increase in the run time of the algorithms involved, as other important properties for a time-efficient approximation have to be omitted.
In situations where negative entries occur due to assumptions of the theory, a common approach is to modify the divergence measure [4,14,17]. Such modifications are typically problem-specific and therefore not applicable in general.
Our approach is in a similar spirit, but our modification of an established divergence measure is not problem-specific, as it does not depend on special properties of a particular application. Furthermore, since we do not need specific formats to preserve nonnegativity, optimization tasks in high dimensions can be completed efficiently.

This paper is structured as follows. In Section II we define the sKL divergence, discuss its properties and give suggestions for how to choose the shift parameters. In Section III we show how the sKL divergence can be applied in optimization tasks on approximate probability vectors. In Section IV we summarize our findings and give an outlook on future research topics. In Appendices A and B we provide proofs of three theorems on the sKL divergence and mathematical details of the approximation we employ.

A. The sKL divergence
For two probability vectors p and q on a finite set of n elements, the Kullback-Leibler (KL) divergence of p from q is defined as [7]

D_KL(p∥q) = Σ_{i=1}^n p_i log(p_i/q_i),

with the convention 0 log(0/x) = 0. In the context of statistical modeling, the vector p (with p_i ≥ 0) is typically given by the data, while the vector q is a theoretical model. This model can be fitted to the data by minimizing the KL divergence of p from q. Approximations in the calculation of q can lead to negative entries q_i < 0. In this case, the KL divergence is no longer defined and the optimization cannot be done. For this scenario, we propose the shifted Kullback-Leibler (sKL) divergence, a modification of the Kullback-Leibler divergence that allows for negative entries in q. For a parameter vector ε ∈ R^n_{≥0}, we define it as

D_sKL(p∥q) = Σ_{i=1}^n (p_i + ε_i) log((p_i + ε_i)/(q_i + ε_i)),  (2)

with the same convention as above. The sKL divergence is well defined for q_i > −ε_i. It is a modification of the KL divergence that reduces to the KL divergence for ε = 0. Figure 1 shows the behavior of both the KL and the sKL divergence for probability vectors on a set with two possible outcomes. We can see that the domain on which the sKL divergence is defined now also includes approximate probability vectors with small negative entries. In the following two subsections we show that the sKL divergence retains many important properties of the KL divergence and discuss how to best choose the parameters ε_i.
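As a minimal sketch of these definitions in code (assuming the sKL form Σ_i (p_i + ε_i) log((p_i + ε_i)/(q_i + ε_i)) stated above; the function names are ours):

```python
import numpy as np

def kl_div(p, q):
    # D_KL(p||q) = sum_i p_i log(p_i / q_i), with the convention 0*log(0/x) = 0.
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def skl_div(p, q, eps):
    # Shifted KL: sum_i (p_i + eps_i) log((p_i + eps_i) / (q_i + eps_i)),
    # well-defined for q_i > -eps_i; terms with p_i + eps_i = 0 contribute 0.
    ps, qs = p + eps, q + eps
    m = ps > 0
    return float(np.sum(ps[m] * np.log(ps[m] / qs[m])))

p = np.array([0.6, 0.4, 0.0])
q = np.array([0.62, 0.39, -0.01])  # small negative entry from an approximation
eps = np.array([0.0, 0.0, 0.02])   # shift only where needed (eps_i > -q_i)
print(skl_div(p, q, eps))          # finite despite the negative entry in q
```

With ε = 0 and strictly positive q, skl_div reduces to kl_div, matching the reduction property stated above.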

B. Properties
The KL divergence belongs to the class of f-divergences [24]. While the sKL divergence does not, it shares many important properties with the KL divergence: it is positive semidefinite, only zero for p = q, and locally a metric. Thus, it still satisfies the definition of a statistical divergence, see Theorem 1.
Theorem 1. Let p, q and ε be three vectors in R^n whose entries satisfy p_i, q_i > −ε_i and Σ_i p_i = Σ_i q_i. The sKL divergence of p from q, with parameter vector ε, is a statistical divergence, i.e., it satisfies the following properties [23]:

(1) D_sKL(p∥q) ≥ 0,
(2) D_sKL(p∥q) = 0 if and only if p = q,
(3) the second-order expansion of D_sKL(p∥q) around q = p defines a positive-semidefinite quadratic form, i.e., the sKL divergence is locally a metric.

The first assumption of the theorem can always be satisfied by a suitable choice of the shift parameters ε_i, while the second assumption can always be satisfied by a rescaling of the q_i. Additionally, we show in Theorem 2 that the sKL divergence is convex in the pair of its arguments, like the KL divergence [25].
This property is of particular importance as it is often needed for well-behaved optimization [26]. The proofs of the two theorems are given in Appendix A.

C. Parameter choice
In the sKL divergence, we introduced a parameter vector ε = (ε_1, . . ., ε_n). The properties of the sKL divergence discussed in the previous subsection (Theorems 1 and 2) hold for any choice of the shift parameters ε_i. However, in applications, their values can have a large influence on the quality of the results, such as speed of convergence or accuracy of the learned model. In this section we therefore aim to provide a guide for choosing suitable values of the parameters ε_i for the case that p is an exact probability vector and q is an approximate one. In the KL divergence, the terms in the sum only need to be evaluated for the indices i for which p_i ≠ 0. This property reduces the computational cost greatly if many entries of p are zero. This situation is encountered, for example, when optimizing an MHN [2]. In order to preserve this advantage when working with the sKL divergence, one can simply choose ε_i = 0 if p_i = 0. To constrain possible values for ε_i, we first note that if q_i < 0, the sKL divergence is only well-defined if one chooses ε_i > −q_i. Second, we compute the gradient of the sKL divergence with respect to q and notice that it approaches −1 for large values of ε_i.
Most optimization algorithms use gradient information to minimize a loss function [26]. Therefore it is important to retain the gradient information, which implies that ε_i should not be too large.
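The per-term q-gradient behind this guideline can be checked directly. A small sketch, assuming the sKL term (p_i + ε_i) log((p_i + ε_i)/(q_i + ε_i)), whose derivative with respect to q_i is −(p_i + ε_i)/(q_i + ε_i):

```python
import numpy as np

def skl_term_grad_q(p_i, q_i, eps_i):
    # d/dq_i [(p_i + eps_i) * log((p_i + eps_i) / (q_i + eps_i))]
    #   = -(p_i + eps_i) / (q_i + eps_i)
    return -(p_i + eps_i) / (q_i + eps_i)

# Small eps: the gradient still reflects the ratio p_i/q_i ...
g_small = skl_term_grad_q(0.3, 0.1, 0.0)   # approx. -3.0
# ... large eps: the gradient flattens toward -1, losing that information.
g_large = skl_term_grad_q(0.3, 0.1, 1e6)
```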
Using these guidelines, we provide two particular choices of ε. First, in Section II C 1, we discuss a natural, static choice of ε and examine its shortcomings. In Section II C 2, we then suggest a dynamic choice of ε and prove in Theorem 3 that the resulting sKL divergence is a good substitute for the KL divergence in the presence of small i.i.d. Gaussian noise. When possible, one should therefore prefer the dynamic choice of ε, as we will also demonstrate in Section III C.

1. Static choice
We first suggest a choice of ε that can be used with any optimization algorithm. This includes higher-order algorithms such as BFGS or conjugate-gradient methods, which make use of approximations of the Hessian matrix of the loss function [26]. For higher-order methods, the objective function must not change during optimization. This requires us to choose a suitable ε a priori, which can be a difficult task. A simple approach is to evaluate q once before starting the optimization and to set

ε_i = ε if p_i > 0 and ε_i = 0 if p_i = 0,  (3)

where ε is fixed to a number slightly larger than max_{p_j>0}(−q_j). If during optimization ε turns out to be too small, i.e., negative entries of q_i + ε are encountered, the optimization has to be stopped early. The best possible result can then be found by tuning ε. This parameter choice allows us to choose a wide variety of optimizers. However, in addition to the requirement of tuning ε, the static choice leads to an unnecessarily large loss of gradient information. This is because gradient information is reduced for all i, even though q_i < 0 is rarely encountered in practice.
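A possible implementation of this a priori rule (a sketch; the helper name static_eps and the 10% safety margin are our own choices):

```python
import numpy as np

def static_eps(p, q, margin=1.1):
    # Evaluate q once and fix eps slightly above the largest negative
    # entry among indices with p_i > 0; keep eps_i = 0 where p_i = 0.
    worst = float(np.max(-q[p > 0], initial=0.0))
    eps = margin * worst
    return np.where(p > 0, eps, 0.0)

p = np.array([0.7, 0.3, 0.0])
q = np.array([0.72, -0.005, 0.285])
eps = static_eps(p, q)   # shifts all entries with p_i > 0 by the same eps
```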

2. Dynamic choice
We now suggest a second choice that can be used only if the objective function is allowed to change in every iteration of the optimizer. This does not pose a problem when using first-order optimization algorithms such as gradient descent [26] or Adam [27], as these do not use information about the Hessian matrix of the loss function. In this scenario, we can choose a new ε for every new q during the optimization process. Our proposed choice is

ε_i = 0 if q_i ≥ 0 and ε_i = −q_i + f(−q_i) if q_i < 0,  (4)

where f : R_{≥0} → R_{≥0} is a nonnegative function. This choice of ε avoids a large loss in gradient information while extending the domain on which the sKL divergence is defined where necessary. A concrete choice of f, which we use in Section III, is given by f(x) = δ·x with a constant δ > 0, but many other choices are possible.
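In code, the dynamic rule might look as follows (a sketch assuming ε_i = −q_i + f(−q_i) for q_i < 0 and ε_i = 0 otherwise, with f(x) = δ·x):

```python
import numpy as np

def dynamic_eps(q, delta=0.1):
    # eps_i = 0 where q_i >= 0; otherwise eps_i = -q_i + delta*(-q_i),
    # so that the shifted entry q_i + eps_i = delta*|q_i| stays positive.
    return np.where(q >= 0.0, 0.0, -q + delta * (-q))

q = np.array([0.55, -0.02, 0.47])
eps = dynamic_eps(q)   # recomputed for every new q during optimization
```

Only the entries that are actually negative get shifted, which is what preserves most of the gradient information.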

D. Negative entries due to Gaussian noise
To further motivate Eq. (4), let us consider a simple example in which negative entries of q_i are caused by small Gaussian noise. In this case, we show that the difference between the KL divergence and the average of the sKL divergence over the Gaussian noise is quadratic in the standard deviation of the noise, see Theorem 3, the proof of which is provided in Appendix A. To keep the presentation simple, we restrict ourselves to nonnegative functions f that are finite sums of power laws, i.e., f(x) = Σ_k a_k x^{b_k} with at least one a_k ≠ 0.

Theorem 3. Let p and q be two probability vectors on a finite set of n elements, with q_i > 0 whenever p_i > 0. Let further x be a vector of i.i.d. Gaussian random variables with mean 0 and standard deviation σ. If the ε_i are chosen according to Eq. (4) with the restriction on f stated above, the average of the sKL divergence of p from q + x is given by

⟨D_sKL(p∥q + x)⟩_x = D_KL(p∥q) + σ² Σ_i p_i/(2q_i²) + o(σ²).

In fact, the restriction on f can be relaxed significantly so that a much larger class of functions is admissible, see Appendix A 3 for details.
In this example, we therefore see explicitly that the sKL divergence can be used as a substitute divergence measure for the KL divergence, provided that the condition σ² Σ_i p_i/(2q_i²) ≪ D_KL(p∥q) is satisfied. To validate Theorem 3 numerically, we first consider a generic toy model where p, q ∈ R^10 are probability vectors whose entries are i.i.d. uniform random variables. As per the assumptions of Theorem 3, we consider i.i.d. Gaussian random variables added to q, and we choose the ε_i according to Eq. (4) with the simple choice f(x) = δ·x. In order for the σ² correction to be small, we restrict ourselves to probability vectors p and q that satisfy σ² Σ_i p_i/(2q_i²) ≪ D_KL(p∥q) for all σ considered. Figure 2 shows the convergence of the sKL divergence to the KL divergence in the limit of small σ of the noise. The plot shows averages of 200 pairs of p and q. We can see that in this case Theorem 3 gives a good prediction for the relative difference between the sKL divergence and the KL divergence. Interestingly, for large values of δ, the relative difference is significantly lower than predicted. Therefore, large values δ ≥ 10 should be preferred in this toy model, but this may be different for other applications, in particular when the negative entries are not due to Gaussian noise. Finally, we observe that the parameter δ in the function f has only little effect on the result when σ is sufficiently small. In many applications, the strength of the noise can be controlled by parameters that determine the numerical accuracy of the approximation. If the assumption of i.i.d. Gaussian noise is satisfied in a particular application, Theorem 3 is directly applicable. In other applications, the noise might follow a different distribution, for which one would have to do a similar analysis.

FIG. 2. Relative difference between the KL divergence and the sKL divergence with the dynamic choice of ε given by Eq. (4) with f(x) = δ·x, for different noise levels σ. For each σ, the dashed line shows the prediction from Theorem 3 truncated at order σ², the solid line shows the result obtained from numerical integration of Eq. (A1), and the points show the result obtained from simulation.
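A toy Monte Carlo experiment along these lines can be sketched as follows (the probability vectors and sample count are our own choices; the sKL definition and the dynamic ε rule are written out explicitly as assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, delta = 1e-3, 1.0

# Fixed, well-conditioned probability vectors (entries bounded away from 0),
# so the sigma^2 correction term stays small.
p = np.array([0.10, 0.12, 0.08, 0.11, 0.09, 0.10, 0.12, 0.08, 0.11, 0.09])
q = np.array([0.12, 0.10, 0.10, 0.09, 0.11, 0.08, 0.10, 0.12, 0.08, 0.10])

def skl(p, q, eps):
    # Assumed sKL form: sum_i (p_i+eps_i) log((p_i+eps_i)/(q_i+eps_i)).
    ps, qs = p + eps, q + eps
    return float(np.sum(ps * np.log(ps / qs)))

kl = float(np.sum(p * np.log(p / q)))
vals = []
for _ in range(2000):
    x = rng.normal(0.0, sigma, p.size)                   # i.i.d. Gaussian noise
    qx = q + x
    eps = np.where(qx >= 0.0, 0.0, -qx + delta * (-qx))  # dynamic choice
    vals.append(skl(p, qx, eps))
avg = float(np.mean(vals))
# avg approaches kl as sigma -> 0; the predicted leading correction is
# sigma**2 * np.sum(p / (2 * q**2)), tiny at this noise level.
```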
In the concrete application we consider in Section III, we have seen numerically that the assumption of i.i.d. Gaussian noise is not satisfied. Nevertheless, as we increase the numerical accuracy of our approximation, we observe that the sKL divergence with the choice of Eq. (4) converges to the KL divergence computed without approximations. Also, the model results obtained from optimizing the sKL divergence with the choice of Eq. (4) converge to the model result obtained from the KL divergence without approximations.

III. APPLICATION

A. Mutual Hazard Networks
As a real-world application, we consider the modeling of cancer progression using Mutual Hazard Networks (MHNs) [2]. Cancer progresses by accumulating genetic events such as mutations or copy-number aberrations [28]. As each event can be either absent or present, the state of an event is represented by 0 (absent) or 1 (present). We consider d such events, thus representing a tumor as a d-dimensional vector x ∈ {0, 1}^d. An MHN aims to infer promoting and inhibiting influences between single events. The data used for the learning process are patient data of observed tumors. The data distribution p is thus a discrete probability distribution on the state space S = {0, 1}^d of possible tumors, which has 2^d elements, i.e., n = 2^d in the notation of Section II.
An MHN models the progression as a continuous-time Markov chain under the assumptions that at time t = 0 no tumor has active events, that events occur only one at a time, that events are not reversible, and that transition rates follow the Proportional Hazard Assumption [29]. If we have two states x, x^{+i} ∈ S that differ only by x_i = 0 and x^{+i}_i = 1, the transition rate from state x to state x^{+i} is modeled as

Q_{x^{+i},x} = e^{θ_ii} Π_{j: x_j=1} e^{θ_ij},

where e^{θ_ii} is the base rate of event i and e^{θ_ij} is the multiplicative influence event j has on event i; all other off-diagonal transition rates are zero by assumption of the MHN. An MHN with d events can thus be described by a parameter matrix θ ∈ R^{d×d}. Figure 3 shows all allowed transitions and their rates for d = 3. The transition-rate matrix can efficiently be written as a sum of d Kronecker products,

Q = Σ_{i=1}^d ⊗_{j=1}^d Q_{ij},  (5)

with Q_ii = [[−e^{θ_ii}, 0], [e^{θ_ii}, 0]] and Q_ij = diag(1, e^{θ_ij}) for j ≠ i. Starting from the initial distribution q_∅ = (1, 0, 0, . . .) ∈ R^{2^d}, the probability distribution at time t ≥ 0 is given by Markov-chain theory as q(t) = e^{tQ} q_∅,
FIG. 3. Allowed transitions, along with transition rates, for an MHN with three possible events and parameter matrix θ ∈ R^{3×3}.
where e^{tQ} denotes the matrix exponential. Since tumor age is not known in the data, MHNs assume that the age of tumors in p is an exponentially distributed random variable with mean 1. If we marginalize q(t) over t accordingly, we obtain

q_θ = ∫_0^∞ dt e^{−t} q(t) = (I − Q)^{−1} q_∅,  (6)

where I denotes the identity.
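The construction up to this point can be sketched for a tiny MHN (assuming the standard Kronecker factorization of Q, with diag(1, e^{θ_ij}) factors for j ≠ i and a "flip" factor for event i; the factor ordering is a convention, and q_θ = (I − Q)^{−1} q_∅ is the exponential-time marginalization described above):

```python
import numpy as np
from functools import reduce

def mhn_rate_matrix(theta):
    # Q = sum_i Q_i, each Q_i a Kronecker product of d 2x2 factors:
    # the factor for event i moves it from 0 to 1 with base rate e^{theta_ii};
    # every other factor j multiplies that rate by e^{theta_ij} if event j
    # is already present.
    d = theta.shape[0]
    Q = np.zeros((2**d, 2**d))
    for i in range(d):
        factors = []
        for j in range(d):
            if j == i:
                e = np.exp(theta[i, i])
                factors.append(np.array([[-e, 0.0], [e, 0.0]]))
            else:
                factors.append(np.diag([1.0, np.exp(theta[i, j])]))
        Q += reduce(np.kron, factors)
    return Q

rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, (3, 3))
Q = mhn_rate_matrix(theta)

# Time-marginalized distribution q_theta = (I - Q)^{-1} q_empty.
q_empty = np.zeros(2**3); q_empty[0] = 1.0
q_theta = np.linalg.solve(np.eye(2**3) - Q, q_empty)
```

For small d one can verify directly that Q is a valid generator (columns sum to zero, nonnegative off-diagonals) and that q_θ is a probability distribution.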
A parameter matrix θ that best fits the data distribution p can now be obtained by minimizing a divergence measure of p from q_θ. For small d (i.e., d ≲ 25), the KL divergence can be used since the calculation of q_θ can be done without approximation. For larger d, this is no longer possible because of the exponential complexity (recall that the dimension of the state space is 2^d), and the approach has to be modified [9]. One possible modification is the use of low-rank tensor methods [30] in order to keep calculations tractable. In particular, we use the tensor-train (TT) format [31], which controls the approximation error in the Euclidean norm. Specifically, when approximating a tensor a by ã, a maximum Euclidean distance Δ = ‖a − ã‖₂ can be specified. Thus, small negative entries can occur in ã if the corresponding entry in a is smaller than Δ. In this case, the KL divergence is no longer defined. To be able to perform the optimization, we switch to the sKL divergence as our objective function. In Section III B, we explain the basics of the TT format. Section III C shows how we can find suitable θ matrices by use of the sKL divergence even when negative entries are encountered.

B. Tensor Trains
For a large number d of events, storing q (and p) as a 2^d-dimensional vector is computationally infeasible. This storage requirement can be alleviated through use of the TT format [31]. A d-dimensional tensor a ∈ R^{n_1×···×n_d} with mode sizes n_k ∈ N can be approximated in the TT format as

ã_{i_1,...,i_d} = Σ_{k_1,...,k_{d−1}} a^{(1)}_{1,i_1,k_1} a^{(2)}_{k_1,i_2,k_2} · · · a^{(d)}_{k_{d−1},i_d,1}

for all i_1, . . ., i_d, with tensor-train cores a^{(k)} ∈ R^{r_{k−1}×n_k×r_k}, where the r_k are tensor-train ranks. In order to represent a scalar by the right-hand side, the condition r_0 = r_d = 1 is required. The quality of approximation can be controlled through the TT ranks r_k. In particular, it can be shown that choosing the TT ranks large enough gives an exact representation of a [31].
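A minimal sketch of evaluating a tensor train (contracting the cores back into the full tensor; tt_to_full is our own helper name):

```python
import numpy as np

def tt_to_full(cores):
    # Contract TT cores a^(k) of shape (r_{k-1}, n_k, r_k) into the full tensor.
    out = cores[0]                       # shape (1, n_1, r_1)
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))     # drop the boundary ranks r_0 = r_d = 1

rng = np.random.default_rng(2)
d, n, r = 4, 2, 3
cores = [rng.normal(size=(1 if k == 0 else r, n, r if k < d - 1 else 1))
         for k in range(d)]
a = tt_to_full(cores)                    # full 2x2x2x2 tensor
```

Each entry a[i_1, ..., i_d] equals the chain of matrix products cores[0][:, i_1, :] @ ... @ cores[d-1][:, i_d, :], which is exactly the TT contraction defined above.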
The TT format not only allows for efficient storage of high-dimensional tensors, but also supports many basic operations, e.g., addition, inner products, or operator-by-tensor products [31]. Furthermore, there are efficient algorithms for solving linear equations [32,33].
By modifying the shape of the objects in Eq. (5) and changing the Kronecker products to tensor products, the transition-rate matrix Q ∈ R^{S×S} can be written in the tensor-train format [34] (for details see Appendix B). This leads to a tensor train with all TT ranks r_1, . . ., r_{d−1} equal to d. Thus, we can approximately solve Eq. (6) in the TT format using [32]. Similar techniques can be used for the gradient calculation [22].
Details of the algorithms involved [32] lead to the conditions r k+1 ≤ n k+1 r k and r k−1 ≤ n k r k on the TT ranks of q θ .As a result, the first and last TT ranks are r 1 ≤ n 1 and r d−1 ≤ n d .Towards the middle of the tensor train, ranks increase until they level off at the specified maximum rank.In our simulations, we specify a maximum TT rank r q and choose the TT ranks r k ≤ r q to be as large as possible given the constraints on the r k .

C. Simulations
We test how well an MHN can learn a probability distribution p when optimizing the sKL divergence instead of the classical KL divergence. We use simulated data for d = 20 events. This relatively small value of d allows for exact calculations so that we can compare results obtained with and without the use of tensor trains. In the following, we describe how the data were generated, how the MHNs were learned, and how their quality was assessed.
Every dataset was generated from a ground-truth model described by θ_GT ∈ R^{d×d}. The diagonal entries of θ_GT were drawn from a Gaussian distribution exp(−(x − µ)²/(2σ²))/(σ√(2π)) with µ = −1 and σ = 1. A random set of 10% of the off-diagonal entries of θ_GT was drawn from a Laplace distribution exp(−|x − µ|/b)/(2b) with µ = 0 and b = 1, and the remaining entries were set to 0. These choices were made to mimic MHNs obtained from real data we studied. The data distribution p was obtained from θ_GT by drawing 1000 samples from its time-marginalized probability distribution, as defined in Eq. (6).
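The data-generation recipe for θ_GT can be sketched as follows (a hypothetical helper; the seed and the sampling order are our own choices):

```python
import numpy as np

def sample_theta_gt(d=20, frac=0.1, seed=3):
    # Diagonal entries ~ N(mu=-1, sigma=1); a random 10% of the
    # off-diagonal entries ~ Laplace(mu=0, b=1); the rest stay 0.
    rng = np.random.default_rng(seed)
    theta = np.zeros((d, d))
    theta[np.diag_indices(d)] = rng.normal(-1.0, 1.0, d)
    off = [(i, j) for i in range(d) for j in range(d) if i != j]
    chosen = rng.choice(len(off), size=int(frac * len(off)), replace=False)
    for k in chosen:
        i, j = off[k]
        theta[i, j] = rng.laplace(0.0, 1.0)
    return theta

theta_gt = sample_theta_gt()
```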
Given a dataset, MHNs were learned by optimizing the sKL divergence. We indirectly control the magnitude of the largest negative entries of q_θ through our choice of the maximum possible TT rank r_q of q_θ. Ten datasets were generated, and for each of them, MHNs were learned for specific choices of r_q and ε. The results shown below are arithmetic means from these 10 runs. To avoid overfitting, an L1 penalty term of the form λ Σ_{i≠j} |θ_ij| was added to the objective function. The factor λ was not made part of the optimization but set to a constant value of 10⁻³ for all simulation runs to make comparison of different models easier.
A learned MHN's quality was assessed using the KL divergence of the ground-truth model's time-marginalized probability distribution from the time-marginalized probability distribution of the learned MHN, see Eq. (6). This KL divergence was calculated without the use of the TT format to ensure that no negative entries can occur.

1. Static choice
First, we consider the static choice of ε given in Eq. (3). In this case, higher-order optimizers can be used for faster convergence to an optimum. However, for a fair comparison with the dynamic choice, we used gradient descent (a first-order optimizer) for all optimizations. Table I and Fig. 4 show the results for various combinations of r_q and ε. In the last column ("exact"), optimization using the sKL divergence was also done when the TT format was not used, even though the KL divergence is well-defined and the introduction of a positive ε is not necessary in this case. The additional entries are written in grey to indicate this.
For increasing values of r_q, the TT approximation improves, and accordingly the approximate results tend towards the exact result, although the convergence is quite slow. If q_θ,i + ε < 0 occurred in any iteration of the optimization procedure, the optimization was stopped and the last θ matrix was returned. The colors indicate the percentage of runs where this happened.
The first row of Table I shows the results of optimization using the KL divergence. It can clearly be seen that, when the TT format is used, optimization using the sKL divergence leads to MHNs that describe the data more closely. The optimal ε value is between 10⁻⁷ and 10⁻⁶ across all cases where the TT format is used. The results also clearly show that, as explained in Section II C, if ε is chosen too small or too large, we obtain poor optimization results. From the colors we can see that early stopping due to negative entries of q_θ,i + ε does not have a big influence on the optimization results. This is because early stopping often occurred late in the optimization procedure, i.e., after a good estimate for the optimum was already found.

TABLE I. KL divergence from the ground-truth model θ_GT for MHNs obtained by optimizing the sKL divergence with the static choice of ε given by Eq. (3). If q_θ,i + ε < 0 occurred during optimization, the last available MHN was used to calculate the KL divergence from the ground truth. The cell color indicates the percentage of simulations where this occurred (a darker color corresponds to a higher percentage). In the column "exact", the TT format was not used during optimization, thus no negative entries occurred here.

FIG. 4. Visualization of the data in Table I. The dashed line indicates the result when optimizing the KL divergence without the use of the TT format.

2. Dynamic choice
Next, we similarly investigate the dynamic choice of ε given in Eq. (4), choosing the function f as

f(x) = δ·x  (7)

with δ > 0. In Theorem 3 we showed that the average of the sKL divergence converges to the KL divergence in the limit of small i.i.d. Gaussian noise. In our particular application, numerical data show that the assumption of i.i.d. Gaussian noise is violated. Nevertheless, as r_q increases and thus the quality of the approximation improves, we observe in Fig. 5 that the sKL divergence converges to the KL divergence, as already mentioned at the end of Section II C 2. This statement is true for all values of the parameter δ, although the details of the convergence may depend on δ.

FIG. 5. Relative difference between the KL divergence and the sKL divergence with the dynamic choice of ε given by Eqs. (4) and (7).

TABLE II. KL divergence from the ground-truth model θ_GT for MHNs obtained by optimizing the sKL divergence with the dynamic choice of ε given by Eqs. (4) and (7). The colors indicate the number of negative entries q_θ,i in the final results, for indices i with p_i > 0 (a darker color corresponds to a larger number of negative entries). In the column "exact", the TT format was not used during optimization, thus no negative entries occurred here.
We now discuss the results obtained from optimizing the sKL divergence, similar to Section III C 1. Since optimization using higher-order optimizers is not possible with the dynamic choice, simple gradient descent was used for optimization. The dynamic choice of ε ensures q_θ,i + ε_i > 0, so optimization was always done until a stopping criterion was satisfied. In Table II and Fig. 6 we show numerical results for various combinations of δ and r_q. As r_q is increased, we again observe a convergence towards the exact result, but now at a much faster rate than for the static choice. Table II also shows that the results are quite stable with respect to the parameter δ. For the exact calculation without the TT format, δ played no role, since it is only important when encountering negative entries in q_θ.

FIG. 6. Visualization of the data in Table II.
Comparing the static and dynamic choice of ε, it can be seen clearly that dynamically choosing ε generally leads to MHNs that are closer to the exact results at the same level of approximation.This is because usually only a few entries of q θ are negative.Therefore, dynamically choosing ε leads to an objective function that closely resembles the KL divergence, while the static choice introduces a shift for all entries even when the shift is not needed.

IV. SUMMARY AND OUTLOOK
We have introduced a new method to handle negative entries in approximate probability vectors. This method is very general and can be used in a wide variety of applications (see Section I B for examples). Moreover, it does not come with significant computational overhead.
We showed that the sKL divergence shares many desirable properties with the KL divergence. We discussed two possible choices of the shift parameters that occur in the sKL divergence. The static choice allows for the use of higher-order optimizers, but it requires tuning of the parameters and leads to a large loss of gradient information. In contrast, the dynamic choice restricts us to first-order optimizers, but it offers more freedom in the choice of the parameters and preserves most of the gradient information. For the dynamic choice we showed that, when negative entries occur due to i.i.d. Gaussian noise, the difference between the KL divergence and the average sKL divergence is quadratic in the strength of the noise and thus goes to zero for small noise.
In this work, we only considered the sKL divergence in the context of approximate discrete probability distributions.The investigation of possible use cases in other contexts is left for future work.
We applied our method to a real-world application, the modeling of cancer progression by Mutual Hazard Networks, where the use of tensor-train approximations can lead to negative entries. We showed that the sKL divergence and the corresponding model results converge to the KL divergence and the exact model results, respectively, when the parameter that controls the quality of the approximation is increased. We also showed that the dynamic choice of the shift parameters leads to a faster convergence to the exact results than the static choice, as expected from the theoretical considerations in Section II.
When using the sKL divergence as an objective function, first-order optimizers are desirable because they allow for more freedom when choosing the parameters of the sKL divergence.So far, we only used a standard gradient-descent optimizer.In future work, we will investigate the effect of different first-order optimizers, including stochastic and momentum-based optimizers.
Regarding MHNs, it was shown in [22] that the computational complexity of model construction can be reduced from exponential to cubic in the number of events using low-rank tensor formats. The remaining problem of negative entries in the approximate probability vectors is solved by the current work, which thus enables the construction of large Mutual Hazard Networks with ≫ 30 events. Expected applications will include as many as 800 events. Furthermore, modifications of the MHN have been conceived which allow for more realistic modeling of tumor progression, but which are currently limited to a very small number of events. We will apply the techniques described in this work to these large and extended MHNs in future work.

Appendix A: Proofs

To show that q = p is the only extremum of the sKL divergence under the constraint Σ_i q_i = Σ_i p_i, we use a Lagrange multiplier λ to find the extremal points of

L_{D_sKL}(q, λ) = D_sKL(p∥q) + λ (Σ_i p_i − Σ_i q_i).

By taking derivatives w.r.t. q_i and λ, we find that the only local extremum of L_{D_sKL}(q, λ) is at q = p and λ = −1.
For fixed i, we now use the log sum inequality [35],
\[
\Bigl(\sum_j a_j\Bigr) \log \frac{\sum_j a_j}{\sum_j b_j} \le \sum_j a_j \log \frac{a_j}{b_j}
\quad \text{for } a_j, b_j > 0,
\]
to complete the proof.
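A quick numerical sanity check of the log sum inequality on random positive vectors (values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    a = rng.uniform(0.1, 2.0, size=8)
    b = rng.uniform(0.1, 2.0, size=8)
    # log sum inequality: (sum a) log(sum a / sum b) <= sum a log(a/b)
    lhs = a.sum() * np.log(a.sum() / b.sum())
    rhs = np.sum(a * np.log(a / b))
    assert lhs <= rhs + 1e-12  # small tolerance for floating-point rounding
```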

Proof of Theorem 3
The probability distribution of i.i.d. Gaussian noise x = (x_1, ..., x_n) with mean 0 and standard deviation σ is
\[
P(x) = \prod_i c\, e^{-x_i^2/(2\sigma^2)}.
\]
The average of the sKL divergence over the Gaussian noise is given by
\[
\langle D_{\rm sKL}(p\|q+x) \rangle_x = \int dx\, P(x)\, D_{\rm sKL}(p\|q+x).
\]
We therefore consider, for a single pair of components (suppressing the index i), the integral
\[
I = c \int_{-\infty}^{\infty} dx\, e^{-x^2/(2\sigma^2)}\, (p+\varepsilon) \log \frac{p+\varepsilon}{q+x+\varepsilon},
\]
where c = 1/√(2πσ²) is the normalization factor of the Gaussian and the shift ε is determined by Eq. (4). To expand this integral in σ², it is convenient to perform a saddle-point analysis (also known as the method of steepest descent or stationary-phase approximation) [36]. This is a useful method to deal with integrals of the form
\[
\int dx\, e^{-N g(x)}\, h(x),
\]
where N is a large parameter and g and h are functions of x. The saddle-point method yields a systematic expansion of such integrals in powers of 1/N. In our case, the large parameter is N = 1/σ², and g(x) = x²/2. The remaining part of the integrand is identified with h(x).
We first split the integral I into two contributions I_1 and I_2, corresponding to the integration intervals (−∞, −q] and [−q, ∞), respectively. For q + x ≥ 0, Eq. (4) gives ε = 0, so that I_2 takes the form
\[
I_2 = c \int_{-q}^{\infty} dx\, e^{-x^2/(2\sigma^2)}\, p \log \frac{p}{q+x}.
\]
The saddle point, i.e., the location of the minimum of g(x), is at x = 0, which is within the integration interval.
A standard saddle-point analysis up to next-to-leading order then yields
\[
I_2 = p \log \frac{p}{q} + \frac{\sigma^2}{2} \frac{p}{q^2} + O(\sigma^4).
\]
In the remaining region, where q + x < 0, Eq. (4) gives
\[
\varepsilon = -(q+x) + f\bigl(-(q+x)\bigr).
\]
Substituting x = −q − z, we obtain
\[
I_1 = c \int_0^\infty dz\, e^{-(q+z)^2/(2\sigma^2)}\, \bigl(p + z + f(z)\bigr) \log \frac{p + z + f(z)}{f(z)}.
\]
Before we can expand this integral in σ², we first need to make sure that it exists. This depends on the choice of the (nonnegative) function f. In particular, I_1 exists if f is a sum of power laws, as assumed in Theorem 3. In fact, I_1 only diverges for rather exotic choices of f that go to zero too rapidly. For example, it diverges at the lower end for f(z) = exp(−1/z) and at the upper end for f(z) = exp(−exp(z⁴)). In the following we restrict ourselves to functions f for which I_1 exists.
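The next-to-leading-order expansion I_2 ≈ p log(p/q) + (σ²/2) p/q² can be checked by direct quadrature; the values of p, q, and σ below are arbitrary, and the grid is truncated at ±8σ, where the Gaussian factor is negligible:

```python
import numpy as np

p, q, sigma = 0.3, 0.5, 0.02
c = 1.0 / np.sqrt(2.0 * np.pi * sigma**2)

# Trapezoidal quadrature on a grid confined to a few sigma around x = 0,
# which lies well inside the integration interval [-q, infinity).
x = np.linspace(-8.0 * sigma, 8.0 * sigma, 200_001)
integrand = c * np.exp(-x**2 / (2.0 * sigma**2)) * p * np.log(p / (q + x))
i2_num = np.sum((integrand[1:] + integrand[:-1]) / 2.0) * (x[1] - x[0])

i2_nlo = p * np.log(p / q) + 0.5 * sigma**2 * p / q**2
assert abs(i2_num - i2_nlo) < 1e-5  # remaining error is O(sigma^4)
```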
To extract the small-σ behavior of I_1, we again perform a saddle-point analysis. We still have N = 1/σ², but now g(z) = (q + z)²/2. In this case, we encounter two complications that require modifications of the saddle-point method. The first complication arises because the saddle point obtained from minimizing g(z) is at z = −q, which is outside the integration interval. Hence the integral is not dominated by the region near a saddle point but by the region near the minimum of g(z) within the integration interval, which is at the lower bound z = 0, where g′(0) = q > 0. The standard result for this case is obtained by expanding g(z) about zero, which yields
\[
\int_0^\infty dz\, e^{-N g(z)}\, h(z) = \frac{e^{-N g(0)}\, h(0)}{N g'(0)} \bigl(1 + O(1/N)\bigr) \tag{A2}
\]
if h(0) is well-defined. Applied to our case we find
\[
I_1 = c\, e^{-q^2/(2\sigma^2)}\, \frac{\sigma^2}{q}\, \bigl(p + f(0)\bigr) \log \frac{p + f(0)}{f(0)}\, \bigl(1 + O(\sigma^2)\bigr),
\]
which is well-defined if f(0) > 0. By assumption, f is nonnegative, but if f(0) = 0 a second complication arises since log f(0) is not defined. In this case, let us assume that the leading behavior near z = 0 is f(z) = a z^b with a, b > 0. If we split the logarithm in the integrand of I_1, the integral involving log(p + z + f(z)) is of the form of Eq. (A2) with h(0) = p log p. The integral involving log f(z) is not covered by Eq. (A2), but it is still dominated by the region near z = 0. For this part, we need the integral (valid for Re(α) > −1)
\[
\int_0^\infty dz\, e^{-N(q+z)^2/2}\, z^\alpha \log z \approx e^{-N q^2/2}\, \frac{\Gamma(\alpha+1)}{(N q)^{\alpha+1}} \bigl[\Psi(\alpha+1) - \log(N q)\bigr],
\]
where Γ and Ψ denote the gamma and digamma function, respectively, and the ≈ sign indicates that we have only kept the leading behavior for large N (corresponding to small σ below). Applying this to our case (with α = 0, 1, and b) and collecting terms, we now find the leading behavior for small σ,
\[
I_1 \approx c\, e^{-q^2/(2\sigma^2)}\, \frac{\sigma^2}{q}\, p \Bigl[\log \frac{p}{a} + b \Bigl(\gamma + \log \frac{q}{\sigma^2}\Bigr)\Bigr],
\]
where γ = −Ψ(1) is Euler's constant. Incidentally, our choice of f in Section III C 2 corresponds to a = δ and b = 1. We conclude that for the functions f we admit, the integral I_1 is suppressed for small σ by a factor of exp(−q²/(2σ²)) and thus goes to zero faster than any power of σ. (This statement remains true if the leading behavior of f near zero is f(z) = exp(−a/z^b) with a > 0 and b < 1.) Adding the two integrals and summing over the components i, we obtain
\[
\langle D_{\rm sKL}(p\|q+x) \rangle_x
= \sum_i p_i \log \frac{p_i}{q_i} + \frac{\sigma^2}{2} \sum_i \frac{p_i}{q_i^2} + O(\sigma^4)
= D_{\rm KL}(p\|q) + \frac{\sigma^2}{2} \sum_i \frac{p_i}{q_i^2} + O(\sigma^4),
\]
which completes the proof.
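The small-σ expansion ⟨D_sKL(p∥q+x)⟩_x ≈ D_KL(p∥q) + (σ²/2) Σ_i p_i/q_i² can also be verified by a brute-force Monte Carlo average. In the sketch below all numerical values are illustrative, and the shift function f(z) = δz corresponds to the a = δ, b = 1 case (although for this small σ, negative entries essentially never occur):

```python
import numpy as np

def skl(p, q, delta=1e-3):
    # shifts eps_i = -q_i + f(-q_i) with f(z) = delta*z on entries with q_i < 0
    eps = np.where(q < 0.0, (1.0 + delta) * (-q), 0.0)
    return np.sum((p + eps) * np.log((p + eps) / (q + eps)), axis=-1)

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.35, 0.25])
sigma = 0.01

x = rng.normal(0.0, sigma, size=(1_000_000, 3))   # i.i.d. Gaussian noise samples
mc_average = skl(p, q + x).mean()

prediction = np.sum(p * np.log(p / q)) + 0.5 * sigma**2 * np.sum(p / q**2)
assert abs(mc_average - prediction) < 1e-4  # agreement within Monte Carlo error
```

With 10⁶ samples the Monte Carlo standard error is of order 10⁻⁵, well below the σ² correction being tested.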
[Table caption fragment] …, for different TT ranks r_q.
TABLE II. Average KL divergence (without approximation).