A Maximum Entropy Test for Evaluating Higher-Order Correlations in Spike Counts

Evaluating the importance of higher-order correlations of neural spike counts has been notoriously hard. A large number of samples are typically required in order to estimate higher-order correlations and resulting information theoretic quantities. In typical electrophysiology data sets with many experimental conditions, however, the number of samples in each condition is rather small. Here we describe a method that allows to quantify evidence for higher-order correlations in exactly these cases. We construct a family of reference distributions: maximum entropy distributions, which are constrained only by marginals and by linear correlations as quantified by the Pearson correlation coefficient. We devise a Monte Carlo goodness-of-fit test, which tests - for a given divergence measure of interest - whether the experimental data lead to the rejection of the null hypothesis that it was generated by one of the reference distributions. Applying our test to artificial data shows that the effects of higher-order correlations on these divergence measures can be detected even when the number of samples is small. Subsequently, we apply our method to spike count data which were recorded with multielectrode arrays from the primary visual cortex of anesthetized cat during an adaptation experiment. Using mutual information as a divergence measure we find that there are spike count bin sizes at which the maximum entropy hypothesis can be rejected for a substantial number of neuronal pairs. These results demonstrate that higher-order correlations can matter when estimating information theoretic quantities in V1. They also show that our test is able to detect their presence in typical in-vivo data sets, where the number of samples is too small to estimate higher-order correlations directly.

Text S1 1 Optimization of the Nuisance Parameters The nuisance parameters (rates of the marginals and correlation coefficients) of the distributions are unknown. We can just obtain estimates of these parameters based on the available data set. However, our goal is to construct a test which is applicable even when the number of samples is very small. In this case, the estimates of the nuisance parameters are generally unreliable. Instead of relying on specific estimates of the nuisance parameters, we maximize the p-value over all possible nuisance parameters.
The cost function of the optimization is given by the negative p-value that is obtained for a specific set of nuisance parameters of the distributions.
In order to compute the entropy and the mutual information of a particular distribution we restricted the support of the spike counts to the set {0, 1, . . . , b}, where b was selected such that the residual mass of the distribution was less than 0.1% of the total mass: The sets of possible nuisance parameters are continuous. There are many algorithms for optimizing a nonlinear function over continuous data sets. Grid search [1] is one option for optimizing the nuisance parameters globally. However, when the mutual information is used for the divergence measure three nuisance parameters have to be considered for every conditional probability distribution P (X 1 , X 2 |θ), i.e. for every stimulus θ of the stimulus set. This made it unfeasible to maximize the p-value using grid search when the mutual information is used as the divergence measure, since the number of parameter combinations is too big. Therefore, we applied simulated annealing [2] to maximize the p-value in all of the analyzes of the main article. For the initial values of the simulated annealing optimization we chose the sample means for the Poisson rate parameters and the sample correlation coefficients for the correlations of the maximum entropy distributions.
We used the default simulannealbnd implementation from the MATLAB R2011b Global Optimization Toolbox. The initial temperature T init was set to 100. The temperature was updated according to were generated with uniform random directions and steps of length T (k). If the new objective function value was less than the old, the new point was always accepted. Otherwise, the new point was accepted with probability 1 where ∆ is the difference in objective function values. Rates were bounded by zero (lower bound) and two times the maximum sample rate (upper bound) and correlation coefficients were bounded by -1 (lower bound) and 1 (upper bound). Points outside of the boundary region were never accepted. The algorithm terminated after the average change in value of the objective function in 10 iterations was less than 10 −6 .
Maximization problems are harder to solve if many local maxima are present, because simulated annealing can converge to a local maximum instead of the global maximum. This is likely to happen if the global maximum is peaked with small support. In order to make sure that simulated annealing is appropriate for this maximization task we plotted p-values as a function of the nuisance parameters for the entropy difference as the divergence measure and for input data sampled from a maximum entropy distribution showing an acceptance ( Figure S1A, B) and from a higher-order mixture distribution showing a rejection ( Figure S1C, D). The p-values are symmetric with respect to the rate parameters of the Poisson marginals. In both cases, the plateau areas are localized in a relatively small region of the parameter space with the empirical parameter estimates being close to the maxima.
Note that the true distribution does not necessarily have the highest p-value, because a low value reflects strong divergence while the highest value does not necessarily reflect the greatest similarity. The landscape also depends on the data sample that the p-values are based on. For very low and high rates the distributions are very dissimilar in terms of the entropy difference. Figure 2C of the main article shows the rejection rates of the maximum entropy test applied to the data from the mixture model P M1/2 . A similar test application is shown in Figure S2. The only difference are the rates of the Poisson marginals: the rates are set to λ = 5 (corresponding to 50 Hz for 100 ms bins). The results are resembling those of Figure 2C and thus demonstrate that the maximum entropy test successfully detects higher-order correlations for different rates.   Figure 2A). (C) Same for P M2 (ρ = 0.2, cf. Figure 2B)). Simulated annealing was applied to maximize the p-value. Number N MC of Monte Carlo samples was 1000. process [3]. The interspike intervals were transformed by

Poisson Goodness-of-fit Tests
where λ i is the rate of the process in the time interval of the interspike interval and the τ k are interspike intervals, corrected for discrete time by an analytical procedure of discrete time rescaling [4]. We applied Spike times were discretized at 1 ms, because the spike trains were sampled with a rate of 1 kHz. Just like in the application of the maximum entropy test, we divided the spike trains into non-overlapping time intervals locked to the start of a grating presentation at varying latencies (later referred to as "time bins"). The firing rate of each bin was estimated over all trials. We then collected all interspike intervals that were confined to a single bin from each trial separately and concatenated these interspike intervals.
This was done separately for the orientation of the grating stimulus, the neuron, the time bin and the two experimental conditions. For small bin sizes, interspike intervals are almost never confined to a single bin, because it is unlikely to observe more than one spike within a very small bin. We therefore used bin  Note that the number of samples for the Kolmogorov-Smirnov tests increases with increasing bin size, Figure S4. Results of the maximum entropy test for a subset of data recorded from area V1 of anesthetized cat that is not rejected by the Kolmogorov-Smirnov tests. Orientation, time bin and neuron pair combinations with at least one marginal Poisson rejection for 400 ms bins (cf. Figure S3B) were excluded from the analysis. (A) Fraction of neuron pairs rejected by the Monte Carlo maximum entropy test with the entropy difference as the divergence measure (α = 5%) and for different bin sizes. (B) Change in rejection rates compared to the full data set including data that were rejected by the Kolmogorov-Smirnov tests (cf. Figure 6A).
whereas the number of spike count samples for the maximum entropy tests do not depend on the bin size.
To investigate whether the maximum entropy rejections (cf. Figure 6) can be linked to rejections of the Poisson process hypothesis, we excluded maximum entropy rejections that were also rejected by the Kolmogorov-Smirnov tests. For an orientation and time bin, a neuron pair was excluded if a Kolmogorov-Smirnov test rejected at least one of the neurons for 400 ms bins. The rejection rates of the maximum entropy test for this subset are shown in Figure S4A. The rejection rates for the different bin sizes are similar to those for the complete data set that are shown in Figure 6A. A closer examination of the change in rejection rates that is caused by the excluded data ( Figure S4B) reveals that the shift in all rejection rates is less than 5%. This result is in line with the results of the two spike count tests. We conclude that the rejections of the maximum entropy hypothesis for the entropy difference as the divergence measure cannot be explained by a violation of the Poisson process hypothesis. The change in rejection rates for the mutual information as the divergence measure was not investigated, because we would have to exclude a time bin and neuron pair whenever there was a rejection for any of the orientations.
We emphasize that the Kolmogorov-Smirnov test was applied to test a hypothesis that is a sufficient condition but not a necessary condition for the spike count Poisson hypothesis. A discrete Poisson distribution of the spike counts does not necessarily imply an underlying Poisson spike generating process.
Spike counts can follow a discrete Poisson distribution even if the Kolmogorov-Smirnov test rejects the 8 Poisson process hypothesis entirely.

Alternative Marginal Distributions
There are other discrete marginal distributions that can be applied if the spike counts at hand are not Poisson distributed.
One example is the negative binomial distribution, which allows to model spike count distributions that have variance greater than mean. Here, Γ is the gamma function, λ is the mean spike count, and υ is a positive parameter which controls the degree of overdispersion. The smaller the value of υ, the greater is the Fano factor, and as υ approaches infinity, the negative binomial distribution converges to the Poisson distribution.
Another example is the binomial distribution, which allows to model spike count distributions that have a variance λ(1 − λ/n) that is smaller than the mean spike count λ. As n approaches infinity, the binomial distribution converges to the Poisson distribution.
The most appropriate distribution can be selected based on the Fano factor (defined as variance over mean). If the Fano is greater than 1 then the negative binomial distribution is promising. If it is smaller Mixture models can be applied to construct even more complex marginal distributions in case of all three marginal distributions being rejected [5].
In principle, marginal distributions with any number of parameters can be applied in the test. Note, however, that the required time of the simulated annealing optimization increases with the number of parameters. On current desktop computers with reasonable computation time, the number of parameters is therefore limited to less than 10 parameters per marginal distribution. In any case, empirical marginals are not feasible, because they have an unbounded number of parameters.

Alternative Divergence Measures
In this study, we applied the entropy difference and the mutual information difference as divergence measures. The proposed maximum entropy test, however, is more general and makes it possible to apply other divergence measures D as well. In the following, we list a couple of alternative divergence measures.
A very commonly applied divergence measure in statistical testing is Pearson's X 2 statistic where P (1) is the empirical distribution and P (2) is the reference distribution (also called null distribution). This divergence measure is primarily motivated by its asymptotic characteristics. The statistic is asymptotically χ 2 -distributed when the number of samples approaches infinity [6]. In the proposed Monte Carlo test the asymptotic distribution is of no concern, because the test is specifically designed to be applicable when the number of samples is small. Nevertheless, the statistic quantifies the distribution difference on a point-by-point basis. The X 2 statistic is part of the power divergence family of goodness-of-fit statistics [7] which provides many choices for divergence measures.
The Kullback-Leibler divergence (KL divergence) is a natural choice for quantifying divergence between distributions: KL divergence can be interpreted as the expected extra message length per data point that must be communicated if a code that is optimal for a given distribution P (2) instead of a true distribution P (1) is used [8]. The measure is also related to the mutual information by I(X; Y ) = D KL (P (X, Y ), P (X)P (Y )).
The divergence measure quantifies the divergence between two distributions. Any function that quantifies such a divergence can be applied in the test. The test then assesses the necessity of higher-order correlations in terms of the divergence measure. The choice of this measure should therefore depend on the task at hand. If the task is to evaluate the necessity of higher-order correlations in terms of the mutual information, then the divergence measure should also be based on the mutual information.

Modeling Higher-order Correlations
Suppose that the right marginal distributions were selected and that the maximum entropy hypothesis is rejected by the proposed test. In that case, a model that includes higher-order correlations is required.
Generalized linear models (GLMs) have become very popular in the neuroscience literature [9][10][11][12] and include higher-order dependencies. Here, we will briefly describe the basic framework of GLMs and relate it to the maximum entropy models which we applied in the proposed test.
In the GLM framework for spike trains, spiking statistics are based on conditional intensity functions.
For two coupled neurons Y and Z, the following conditional intensity functions are defined [12]: where k y , k z , h yy , h zz , h yz and h zy are filters, · denotes the dot product and h · y [t−τ,t) = t−∆t i=t−τ h i y i denotes the filtered activity of Y in the time interval [t − τ, t). The distributions P (y t |λ yt ) and P (z t |λ zt ) of the activities conditioned on the intensity functions can, for instance, be Poisson or Bernoulli distributed [10]. Conditional independence of these distributions is assumed. Thus, the full likelihood function of the model is given by P (Y, Z|X, k y , k z , h yy , h zz , h yz , h zy ) = t P (y t |λ yt )P (z t |λ zt ). k y , h yy and h yz are referred to as stimulus filter, history filter and coupling filter of neuron Y , respectively. Typically, these filters are parametrized by basis functions like, for instance, cosine bumps [11].
The coupling filters introduce dependencies between the neurons. Even more complex dependencies can be modeled by hidden neurons [12]. Note, however, that changing the dependencies also changes the marginal distributions of the neurons. There are efficient fitting procedures for the parameters of the model [11]. Moreover, GLMs can model non-stationary data by their history dependent terms. For a more detailed description of GLMs we refer the reader to previous studies [9][10][11][12].
In contrast to GLMs, the second-order maximum entropy models that we apply in the test do not have history dependence. Instead, separate models are generated for each point in time. The models therefore do not make any assumptions about non-stationarity which could be confounded with effects of higher-order dependencies. History effects within a single spike count bin can be modeled by the marginal distributions. If there are autocorrelations present then the marginal spike count distributions will typically deviate from discrete Poisson distributions (cf. Section "Impact of Autocorrelations" of the main article). We describe alternative marginal distributions in Section "Alternative Marginal Distributions" for modeling such data.

Subpopulation Structure of Recorded Neurons
The proposed maximum entropy test provides a method to check whether significant divergences between the second-order model and data are present, but it does not provide quantitative differences. To get an idea of the magnitude of these differences we applied two estimators. (1) An estimator that is based on the second-order model. We used the maximum likelihood second-order maximum entropy model and calculated the mutual information based on this distribution. (2) A state-of-the-art bias correction estimator working directly on the data. Here we applied the shuffle estimator [13] together with the best upper bound (BUB) estimator [14]. The BUB estimator does not rely on the asymptotic sampling regime and therefore might work in the undersampled regime. In practice, however, the number of samples in each condition should be larger than the number of relevant responses in order to obtain an unbiased estimate of the mutual information [15]. For small Poisson rates where the domain can be restricted to 15 spike counts the number of relevant responses is 225. Thus, at least 225 samples per stimulus are required for an unbiased estimate, whereas only 42 samples per stimulus are present in our data set.
Nevertheless, the relation between the bias correction estimator and the second-order maximum entropy estimator may illustrate the order of the impact of higher-order correlations.
The maximum entropy test examines the sufficiency of the second-order model in terms of the mutual Figure S5. Pairwise mutual information estimates. The mutual information was averaged over all pairs and subsequent bins. (A) Difference between the mutual information estimated by the maximum entropy distribution and the (higher) mutual information estimated by the BUB shuffle estimator for the same pairs. (B) Histogram over the information rate estimated by the maximum entropy distribution and the information rate estimated by the mutual information BUB shuffle estimator. The information rates are normalized to the respective maximum information rate of the estimator. Control and adaptation conditions are mixed.
information difference. Therefore, we expected the differences of the mutual information estimations to reflect the rejection rates of the maximum entropy test. The differences between estimator 1 and estimator 2 are shown in Figure S5A. For all but the 200 ms and 400 ms bins the differences between the estimates are smaller for the control condition than for the adaptation condition. Differences between control and adapted condition are not significant though. The total differences between the mutual information estimates are on the order of 0.3 to 4 bit. This amounts to almost two times the mutual information of the second-order model and suggests a strong impact of higher-order correlations on the mutual information.
We tested all pairs of the recorded neurons but there could be subpopulations of neurons with strong higher-order correlations. To investigate whether this is the case we examined the empirical distributions of the information rates. Histograms over the information rates of the maximum entropy estimator and of the BUB shuffle estimator are shown in Figure S5B. We normalized the information rates to their respective maximum to make the distributions easily comparable. The distribution of the BUB shuffle estimator has a remarkable bimodal form which suggests subpopulations with (1) low and (2) high pairwise mutual information. We applied Hartigan's dip test of unimodality [16] with 500 bootstrap samples in order to verify the multimodal form of the latter distribution. The hypothesis of unimodality can be rejected for the distribution of the BUB shuffle estimator (p < 1%), whereas it cannot be rejected for the distribution of the maximum entropy estimator (p = 77.6%). The recorded neurons are presented Figure S6. The graph of the neurons that were recorded from anesthetized cat V1 during an adaptation experiment. The nodes represent the neurons. The numbers in the nodes are the indices of the neurons. Gray values visualize preferred orientations of the cells. Edges are shown in red if the information rate of the pair as estimated by the BUB shuffle estimator exceeds 1.7 bit/s. as a graph in Figure S6. The edges of group 2 are shown in red. All neurons in this group (1,2,4,6,7 and 8) are connected by red lines and thus form a subpopulation with high pairwise mutual information.
This separation coincides with a thresholding of the overall firing rate of the individual neurons at 10 Hz ( Figure 7A of the main article).