THE EFFECT OF INTERSPIKE INTERVAL STATISTICS ON THE INFORMATION GAIN UNDER THE RATE CODING HYPOTHESIS

The question, how much information can be theoretically gained from variable neuronal firing rate with respect to constant average firing rate is investigated. We employ the statistical concept of information based on the Kullback-Leibler divergence, and assume rate-modulated renewal processes as a model of spike trains. We show that if the firing rate variation is sufficiently small and slow (with respect to the mean interspike interval), the information gain can be expressed by the Fisher information. Furthermore, under certain assumptions, the smallest possible information gain is provided by gammadistributed interspike intervals. The methodology is illustrated and discussed on several different statistical models of neuronal activity.


1.
Introduction. Since the classical works of [2,53,43], the challenges of understanding the principles of neuronal coding have attracted an increasing number of scientists from different fields. It is generally accepted that neurons communicate using series of action potentials (spike trains) via chemical and electrical synapses. Currently, there are two main hypotheses that describe the representation of information in neuronal signal. In the first, denoted as the rate (or frequency) coding hypothesis, information is represented by the neuronal firing rate. In the second hypothesis, denoted as the temporal coding, features of the spiking activity beyond the firing rate are employed [54]. In this paper we are concerned only with the rate coding point of view.
A fundamental mathematical framework for the theoretical approach to the problem of information processing in neuronal systems is provided by information theory [52], e.g., in the works by [10,9,53] and by the statistical estimation theory [32,40], e.g., in [24,22,23]. Along the latter approach, Koyama (2013) recently introduced the Kullback-Leibler (KL) divergence for rate-modulated renewal processes to investigate how much information on time-varying firing rates is carried by spike trains. It was shown that the KL divergence determines a lower bound for detectability of rate fluctuations with a Bayesian rate estimator [37]. It was also found that the information, as well as the lower bound, could significantly depend on the dispersion 64 SHINSUKE KOYAMA AND LUBOMIR KOSTAL properties of neuronal firing, the effect of which, however, has not been investigated systematically.
In this paper, we employ the approximation of KL divergence in terms of Fisher information, which essentially relates the information gain and the dispersion of neuronal firing in a single formula. Consequently, we investigate the effect of the interspike interval (ISI) distribution dispersion on the overall encoding performance. In particular, we show that among all scale-family ISI distributions that share the same coefficient of variation, the gamma distribution attains the minimum information gain. We also illustrate typical behavior of the Fisher information by using generalized inverse Gaussian and lognormal distributions.

2.
Methods. In this manuscript we employ the general concept of information arising in statistics and introduced by [39], following the works of [52], [20] and [50]. In particular, assume that H 1 is the hypothesis that the random variable X follows probability distribution f 1 (x), and let H 2 denote the hypothesis X ∼ f 2 (x). The information contained (on average) in the observation X = x for discrimination in favor of H 1 against H 2 is defined as the Kullback-Leibler (KL) divergence D(f 1 ||f 2 ) (see [39] for more details) The integral defining D(f 1 ||f 2 ) always exists (although it may be infinite), and "0 = 0 ln 0" as follows by taking the limits. The units of information are either "bits" (for the base-2 logarithm) or "nats" for the natural logarithm. The definition of information in Eq. (1) is apparently distinct from the Shannon's measures of selfinformation and mutual information, although there is no contradiction as can be demonstrated by a particular choice of H 1 , H 2 ; the mutual information, for example, is obtained by taking H 1 and H 2 to be joint density and the product of the marginal densities, respectively [39, pp. 7-8]. See also [18] for further details. Another measure of "information", employed especially in the theory of statistical estimation of continuously varying parameters [45], is the Fisher information. Let f (x; θ) be a parametric family of probability densities, then is denoted as the Fisher information about parameter θ contained in a single observation of r.v. X, since J(θ|X) determines how well the value of θ can be estimated from observing X (roughly stated, see e.g., [56] for details). In particular, for any unbiased estimatorθ(X) of parameter θ holds V ar(θ(X)) ≥ 1/J(θ|X) provided that f (x; θ) satisfies certain technical conditions [45]. The Fisher information has been used for measuring the encoding accuracy in the context of neural coding [8,51,57]. Eqns. (1) and (2) differ obviously, nonetheless, many fundamental relationships between these information measures have been found over the years, thus showing the depth of correspondence between information theory, theory of point estimation and statistics in general, see e.g., [5,10,33,39,45,49] among others. An asymptotic relation between Fisher information and KL divergence is also employed in this paper.
2.1. KL divergence for rate-modulated spike trains. We consider a ratemodulated renewal process as a model of spike trains. Let N (t) be the number of spikes that have already occurred at time t. A point process is generally defined by the conditional intensity function: where H(t) represent the history of spikes up to time t. Using this, the probability density of a sequence of spikes {t i } := {t 1 , . . . , t n } in an interval [0, T ] is expressed as [16,31] Here, the exponential factor describes the probability of no spikes between the spike times. (Intuitively, this arises from the product of 1 − r(t; H(t))∆t as ∆t goes to 0 and the number of terms in the product goes to infinity.) Consider a renewal process whose ISI density is f (x) with unit mean. The conditional intensity, or hazard function, of this process is given by where s * is the spike time preceding s. A renewal process with arbitrary mean firing rate can be generated from Eq. (5) by rescaling the time axis. First, consider that the mean firing rate is given by a constant µ. The conditional intensity function is, then, obtained by simply rescaling the time t = s/µ as Substituting Eq. (6) into Eq. (4), the probability density of a spike train {t i } in the interval [0, T ] is obtained as where p 1 (t 1 ; µ) is the probability density of the first spike occurring at t 1 , and P ((t n , T ]; µ) is the probability of no spikes being observed on (t n , T ]. The second line of Eq. (7) is derived in Appendix A. This transformation can be generalized to time-dependent firing rate [3,4,15,38,44,47]. Consider next that the firing rate λ(t) is given as a function of t. By defining the transformation Λ(t) := t 0 λ(u)du and rescaling the time t = Λ −1 (s) (Figure 1), the conditional intensity function is obtained from Eq. (5) as

SHINSUKE KOYAMA AND LUBOMIR KOSTAL
In the same way as Eq. (7), the probability density of {t i } is obtained as This rate-modulated renewal process is a generalization of both the inhomogeneous Poisson process (if f (x) is the exponential density) and the renewal process (if λ(t) is constant).
rescaled time (unitless) Figure 1. The time-rescaling transformation. A renewal spike train {s i }, whose ISI distribution is given by f (s i − s i−1 ), is transformed to {t i } via t i = Λ −1 (s i ). Accordingly, the transformed spike train {t i } has the instantaneous firing rate λ(t). Figure 2 depicts four probability densities f (x) with C V = 1 (left) and sample spike trains derived from the rate-modulated renewal processes (9) with the sinusoidal rate process (17) (right). These models will be used for examples in the following section.
Note that the unit is "nats/spike". Eq. (10) is generally difficult to be analyzed because it is a functional of λ(t) and f (x), the effects of which are not separated from each other in the formula, and contains a high-dimensional integration. As shown in the section 3.1, it can be approximated to more tractable form if temporal variation of the rate is sufficiently slow and small. For the case of N identically and independent trials, or population of N neurons, which may be more practical situation in estimating the firing rate, the KL divergence is multiplied by the factor N due to the additivity of information gain. Hence, it is enough to consider the case of single spike trains.

2.2.
Fisher information for scale family of probability densities. We consider the Fisher information for a scale family of probability densities p(t; λ) that is generated from a probability density f (x) as where λ > 0 is a scale constant. p(t; λ) has the same shape as f (x) but a different scale. Here, f (x) is a distribution with unit mean, and λ is interpreted as the mean firing rate. Inserting Eq. (11) into Eq. (2), it is found that the Fisher information about the scale parameter has the scaling property: where J(1|T ) is dimensionless and is uniquely determined by the shape of the density f (x). Therefore, J(1|X) can be interpreted as a kind of "dispersion" measure of random variable X ∼ f (x). In the following, we use a notation I[f ] := J(1|T ) to indicate that it is a functional of f (x). I[f ] is expressed as where the second line is obtained by integration by parts.

3.
Results. We first show that the KL divergence (10) can be approximated by the Fisher information (13) for slow and small rate variation. From this point of view, all the properties of the Fisher information play important role. In particular, we show that the minimum of the Fisher information I[f ], given the coefficient of variation, is achieved by the gamma distribution. Typical behavior of the Fisher information is illustrated by using generalized inverse Gaussian and lognormal distributions.
3.1. Relation between the KL and Fisher information.
be the mean of and variance of λ(t). We also let τ be the characteristic time scale of λ(t) 1 . In the limit of T → ∞ and under the conditions of τ µ 1 (slow variation) and σ/µ 1 (small variation), the KL divergence (10) is approximated as The details are given in Appendix B.
It must be noticed that the rhs of (14) is analytically more tractable than Eq. (10), because i) I[f ] is defined for single ISIs while D[p(·; {λ})||p(·; µ)] is defined for whole spike trains, from which we can avoid performing the high-dimensional integration, and ii) I[f ] is separated from the effect of rate variation.
By using this approximation, we can study the effect of dispersion of firing, which is described by I[f ], on the information gain, separably from the effect of the rate variation. For this purpose, it is necessary to verify that the leading term in the rhs of Eq. (14) dominates the KL divergence under reasonable range of parameter values. This is done by computing the KL divergence (10) numerically, and comparing it with the approximation (14). The KL divergence is computed by simulating a large number of spikes {t i }, and averaging the log likelihood ratio, ln p({t i }; {λ(t)}) − ln p({t i }; µ), over the sample spikes: The spike trains are simulated by generating spikes {s i } from the renewal process with ISI density f (x), and then by applying the time-rescaling transformation, In the simulation, we use the gamma density function with unit mean for f (x): where is the gamma function, and a sinusoidally varying rate: Figure 3 depicts the KL divergence, and its approximation (14) as a function of σ/µ. Note that the difference between these two quantities corresponds to the error term O( σ τ µ 2 ). For instance, this error term is relatively small even if the rate fluctuation 1 For a stochastic process λ(t) whose correlation function is φ(u) = lim T →∞ For a periodic process such as a sinusoidal function (17) used as an example below, τ is given by the period.
is relatively large (σ/µ . = 1) for τ µ = 3, with which 3 spikes on average appear in a cycle of the sinusoidal firing rate. it is therefore confirmed that the expression of the KL divergence (14) with I[f ] provides a good approximation.  Figure 3. The information gain due to sinusoidally varying firing rate and gamma distribution of interspike intervals with shape parameter κ = 2. The value of σ/µ is the ratio of the amplitude to the mean of the driving sinusoid (the "fluctuation rate"). The number τ µ determines the average number of spikes per sinusoidal period, here shown for τ µ = 1 (dash-dotted line), τ µ = 3 (dashed line) and τ µ = 5 (dotted line). The solid line is the approximation using the Fisher information. Note that the approximation holds well even for relatively high fluctuation rates (σ/µ . = 1).

The density function that achieves the minimum Fisher information.
Given the mean and variance, or the coefficient of variation C V , the minimum of the Fisher information is achieved by the gamma distribution. This has been known in the literature [30]. In this paper, we provide another proof. The proof is done by showing that the Fisher information I[f ] is bounded as and then, showing that only the gamma distribution achieves the minimum of the Fisher information, C −2 V . The details are given in Appendix C. 3.3. Examples. We use two specific models, generalized inverse Gaussian (GIG) family and lognormal distribution to illustrate behavior of the Fisher information (13).
The GIG family may be used as a statistical descriptor of ISIs for several reasons [28]. First, for certain values of the parameters of this family, its members are the first passage time distributions of certain diffusion processes to a constant boundary. It also contains gamma and inverse Gaussian distributions as sub-families, and thus unifies commonly used distributions. Gamma distribution is one of the most frequent statistical descriptors of ISIs [17,41,46]. Note that for C V = 1 gamma distribution becomes exponential, resulting in the neuronal firing described by the Poisson process [55]. The inverse Gaussian distribution [11] is often used to describe neural activity [19,55] and fitted to experimentally observed ISIs [41,46]. This distribution results from the Wiener process with positive drift (the depolarization has a linear trend to the threshold) as the stochastic perfect integrate-and-fire neuronal model [55].
The lognormal distribution of ISIs, with some exceptions [6], is rarely presented as a result of a neuronal model. However, it represents quite a common descriptor in ISI data, see e.g. [46,55] and references therein. Furthermore, a mixture of two lognormal distributions has been used recently [7] as an statistical ISI descriptor.
3.3.1. Generalized inverse Gaussian distribution. The GIG distribution has a density function: where K a (w) is the modified Bessel function of the second kind with index a ∈ (−∞, ∞) [1], and η ≥ 0 and w ≥ 0 represent a scale and concentration parameters, respectively. This becomes the inverse Gaussian distribution (for a = −1/2) and gamma distribution (16) (for a = κ > 0, w/(2η) = κ and w → 0) as special cases. See [29] for statistical properties of the GIG distribution. The mean and variance of X are, respectively, given by and from which the square of the coefficient of variation is obtained as

I[f ] of the GIG is calculated as
Using Eq. (70), I[f ] is obtained as In the following, we analyze the behavior of Eqs. (22) and (24) in asymptotic cases.
1. Limit of w → 0 for a > 0. The gamma distribution is obtained in this limit. Using the asymptotic formula: and which is consistent with the result in the section 3.2, that is, I[f ] of the gamma distribution corresponds to the minimum value 1/C 2 V . 2. Limit of w → 0 for a < 0. Using the asymptotic formula for this limit: and where γ is the Euler constant, C 2 V and I[f ] are obtained as and Particularly, I[f ] for a < −2 is expressed as It is worth noting that the GIG for a < 0, wη = 2β and w → 0 becomes the reciprocal gamma distribution: 3. Limit of w → ∞. Using the asymptotic formula in this limit: we obtain and Thus, the Fisher information diverges as C 2 V → 0.

Upper and lower bounds of I[f ]
. As shown in section 3.2, the gamma distribution gives the lower bound of I[f ] (among all ISI densities), which is obtained in the limit w → 0 for a > 0. On the other hand, the GIG becomes the reciprocal gamma distribution in w → 0 for a < 0, whose C 2 V and I[f ] are obtained as Eqs. (30) and (31). Taking into account that I[f ] and C −2 V monotonically increase as the concentration parameter w is increased, and the mapping from (w, a) to (C −2 V , I[f ]) is one-to-one and smooth, it can be shown that the reciprocal gamma distribution (32) gives the upper bound of I[f ] among the GIG family. 5. Case of a = −1/2. The GIG becomes the inverse Gaussian distribution for this case, the density function of which is given by Using the formula: and K −a (w) = K a (w), or calculating directly from Eq. (37), we obtain C 2 V = 1/w and 3.3.2. Lognormal distribution. Next, we examine the lognormal distribution. The density function has a form: f (x; µ, σ 2 ) = 1 The mean and variance are given by E(X) = exp(µ + σ 2 2 ) and V ar(X) = (e σ 2 − 1) exp(2µ + σ 2 ), respectively. Using Eq. (13), I[f ] is obtained as .   , for the generalized inverse Gaussian family (gray region) and lognormal distribution ("logn") of interspike intervals, as a function of CV . The "gm", "r-gm" and "invg" represent the gamma, reciprocal gamma and inverse Gaussian distributions, respectively. The value of I[f ] plays a key role in the approximate expression for the information gain due to variable firing rate.

Summary of the Fisher information.
4. Discussion. In this paper, we studied how much information can be gained from variable neuronal firing rate with respect to constant firing rate. For this purpose, we employed the KL divergence and its approximation in terms of the Fisher information. It was shown that the KL divergence, defined for rate-modulated spike trains, can be reduced to the Fisher information for single ISIs in the limit of slow (τ µ 1) and small (σ/µ 1) rate variation. The numerical study quantitatively verified that the Fisher information approximates the KL divergence reasonably well (Figure 3).
We stress at this point, that the overall methodology presented in this paper is not restricted to the scale-family probability distributions. The approximate relation between KL divergence and Fisher information holds for an arbitrary parameterization [39, p.26]. Here, however, we restrict ourselves to the scale parameterization due to its relationship to the rate coding scheme (see e.g., [27,42]) and due to tractable properties of the Fisher information [26,30]. In particular, we justify that among all scale-family ISI distributions that share the same coefficient of variation, the gamma distribution reaches the minimum information gain.
The KL divergence employed in this paper is different from the Shannon mutual information. The mutual information between the firing rate λ(t) and spike train {t i } can be defined by introducing the probability distribution of λ(t). Asymptotic relations between Fisher information and mutual information has also been well investigated, analogously to the relation of KL divergence with Fisher information [5,10,33,34,39,45,49]. In our case, it is easily speculated that under the same condition as the KL divergence (i.e., slow and small rate variation), the Fisher information I[f ] will appear in the approximate mutual information.
As the definition, I[f ] quantifies the change in the shape of the probability density under infinitesimally small changes in the scale. Therefore, "smooth" density shapes attain lower I[f ] than, e.g., multimodal ones. Similarly, [35] employed a dispersion measure based on the Fisher information for the location-family class, and analyzed its statistical properties. It may be interesting to investigate the relation among these measures.
As shown in Figure 4, I[f ] generally has a monotonic relationship with C −2 V , but reflects additional statistical properties beyond the second moment of ISIs. I[f ] can take different values among various probability distributions that share the same value of C −2 V , and vice versa. I[f ] is equal to C −2 V , i.e., the minimum value, only for the gamma distribution. An interesting property of I[f ] is that it is directly connected with the information-theoretic measure (the KL divergence) on rate coding, while C V is not. Therefore, characterizing the dispersion of ISI with I[f ] could give some new information if the rate reflected the information processing in neuronal systems (e.g., see [14,25,48]), and statistical properties of spike trains were significantly deviated from the gamma statistics. It would be interesting to examine if these were the case for biological spike trains.
In order to estimate the Fisher information I[f ] from experimentally observed spike trains, It would be preferable to perform nonparametric inference. Recently, [36] introduced a Fisher information estimator for the location family, based on the maximum penalized likelihood estimation of the probability density function [21]. Nonparametric inference of Fisher information is generally not straightforward because it contains the derivative of density, which is sensitive to estimates of the density. It remains for a future work to develop a reliable method for estimating I[f ] and examine how much information can be gained from real spike trains.
is the probability density of the first spike occurring at t 1 2 , and is the probability of no spikes being observed on (t n , T ]. Consider the following derivative: from which we obtain Using this, the conditional intensity function (6) is written as Substituting this into Eq. (42), we obtain Eq. (7).

SHINSUKE KOYAMA AND LUBOMIR KOSTAL
Inserting Eq. (51) into Eq. (10), and taking it into account that p 1 (t 1 |{λ(t)}) and P ((t n , T ]|{λ(t)}) are negligible in the limit of T → ∞ if the mean firing rate is nonzero µ > 0, the KL divergence is rewritten as where is the KL divergence between the two ISI densities. Thus, the KL divergence of the spike trains is reduced to the KL divergence of the single ISIs.
Since the mean of λ(t) is given by µ, λ i may be expressed by λ i = µ + δ i with δ i ∼ O(σ). Using the scaling property of D s (λ||µ) (i.e., D s (λ||µ) = D s (cλ||cµ), c > 0), we obtain where ξ i = −δ i /µ + (δ i /µ) 2 − · · · . The first and second terms in the above equation vanish, and the coefficient of the third term is computed as Thus, the KL divergence of the single ISIs is expanded with respect to δ i /µ as Substituting Eq. (57) into Eq. (53), the KL divergence is obtained as If we further assume that the rate fluctuation is ergodic with a limiting density p(λ), the summation can be replaced as where δ = λ − µ. Taking into account δ i = δ + O(σ/(τ µ)) from Eq. (50) (60) Appendix C. Proof of minimum Fisher information. Consider the scale family of densities p(t; λ) given by Eq. (11). The mean is given by E(T ) = 1/λ if the mean of f (x) is unity. Then, applying the Cramér-Rao inequality [45] leads to from which we obtain where C V = V ar(T )/E(T ) is the coefficient of variation of p(t; λ) as well as f (x). Thus, given the coefficient variation C V = 1/ √ κ, the density function f (x) that achieves the minimum fisher information satisfies which holds if f (x) satisfies The solution of the differential equation (64) is found to be f (x) = exp[(κ−1) log x− c 1 x + c 2 ], where c 1 = κ and c 2 = κ log κ − log Γ(κ) are the constants of integration that are determined from the normalization condition ∞ 0 f (x)dx = 1 and the mean of f (x). Therefore, it is shown that the gamma density function (16) attains the minimum of I[f ].
Let f (x) and g(x) be probability densities having fixed mean and the variance. Then, it is easily proven that, for 0 ≤ θ ≤ 1, θf (x) + (1 − θ)g(x) also has the same mean and variance. Therefore, the set of distributions with fixed mean and variance is convex. Since I[f ] is a strictly convex functional [12], only the gamma density function achieves the minimum Fisher information.
Using Eq. (20) and E(1/X) is obtained as Another expression of E(1/X) is obtained by taking the derivative of log f with respect to η, Using Eq. (20), E(1/X) is obtained as From Eqs. (67) and (69), E(1/X) is also expressed as