Adaptive modeling and inference of higher-order coordination in neuronal assemblies: A dynamic greedy estimation approach

Central to the study of population codes, coordinated ensemble spiking activity is widely observed in neural recordings, with hypothesized roles in robust stimulus representation, interareal communication, and learning and memory formation. Model-free measures of synchrony characterize coherent pairwise activity but not higher-order interactions, a limitation transcended by statistical models of ensemble spiking activity. However, existing model-based analyses often impose assumptions about the relevance of higher-order interactions and require repeated trials to characterize dynamics in the correlational structure of ensemble activity. To address these shortcomings, we propose an adaptive greedy filtering algorithm based on a discretized marked point-process model of ensemble spiking, together with a corresponding statistical inference framework to identify significant higher-order coordination. In the course of developing a precise statistical test, we show that confidence intervals can be constructed for greedily estimated parameters. We demonstrate the utility of our proposed methods on simulated neuronal assemblies. Applied to multi-electrode recordings from human and rat cortical assemblies, our proposed methods provide new insights into the dynamics underlying localized population activity during transitions between brain states.


A Problem Formulation
This section provides detailed formulations of the MkPP model and the adaptive estimation problem to complement their overviews in the main text.

A.1 Discretized marked point process likelihood model
Because multivariate point processes, as defined in the literature [2], do not permit simultaneous events at arbitrarily small time scales, point process models of ensemble spiking treat neurons as conditionally independent elements of a multivariate process. To avoid this assumption, Solo introduced a theoretical model for simultaneous spiking events based on MkPPs that explicitly models each possible ensemble spiking event [3]. Owing to the convenience of its disjoint decomposition, MkPP representations have been used by Kass et al. [4] and Ba et al. [5] to model simultaneous spiking in neuronal assemblies.
We utilize a discrete-time MkPP representation of ensemble neuronal activity [5]. For an assembly of $C$ neurons, the $C$-variate spiking process, binned with small bin size $\Delta$, at time bin index $t$ is denoted by $n_t := [n_t^{(1)}, \ldots, n_t^{(C)}]^\top$, where each component is the spiking process of one neuron. Conventional discrete point process models treat the components as conditionally independent Bernoulli observations. Given our interest in simultaneous spikes, we instead treat $n_t$ as a multivariate Bernoulli observation. The spiking process $n_t$ is mapped to a $C^*$-variate process $n^*_t := [n^{*(1)}_t, n^{*(2)}_t, \ldots, n^{*(C^*)}_t]^\top$, the binned observations of a marked point process whose marks index exactly one of $C^* := 2^C - 1$ disjoint non-zero spiking events; we refer to $n^*_t$ as the marked Bernoulli process, distinguishing it from the multivariate Bernoulli process $n_t$. We define the mark space $\mathcal{K} := \{1, \ldots, C^*\}$ [2]. At each time $t_j$ such that $n_{t_j} \neq 0$, the sole non-zero element of $n^*_{t_j}$ indicates the mark. We also define the binned ground process $n^{(g)}_t$ that takes value 1 at each such $t_j$ and is zero otherwise [2]; the ground process indicates the occurrence of any spiking event and, by disjointness, is represented by $n^{(g)}_t = \sum_{m=1}^{C^*} n^{*(m)}_t$. The marked process representation is not uniquely defined, but can be chosen in a convenient fashion: treating the components of $n_t$ as the bits of a $C$-bit binary number, the mark indexed by the decimal equivalent of a particular realization of $n_t$ corresponds to that realization. By the disjointness of the marked representation, the spiking process of the $c$-th neuron can be recovered as the sum of all marked processes whose index, in binary, takes value 1 at the $c$-th bit. The conditional intensity functions (CIFs) of $n_t$ and $n^*_t$ are approximated by the probabilities of observing an event at time bin $t$ given the ensemble spiking history $H_t$. That is,
$$\lambda^{(c)}_t \Delta \approx \mathbb{P}\big(n^{(c)}_t = 1 \,\big|\, H_t\big), \qquad \lambda^{*(m)}_t \Delta \approx \mathbb{P}\big(n^{*(m)}_t = 1 \,\big|\, H_t\big),$$
for $c = 1, \ldots, C$ and $m = 1, \ldots, C^*$. We can relate the two sets of CIFs through the binary encoding: $\lambda^{(c)}_t \Delta = \sum_{m : m_c = 1} \lambda^{*(m)}_t \Delta$, where $m_c$ denotes the $c$-th bit of $m$. The marked process permits a generative description of simultaneous spiking events: an ensemble spiking event occurs according to the ground process, with probability $\lambda^{*(g)}_t \Delta$; the event is then assigned to the $m$-th mark (i.e., the $m$-th simultaneous spiking outcome) with conditional probability $\lambda^{*(m)}_t \Delta \big/ \lambda^{*(g)}_t \Delta$. Thus, at time $t$ the likelihood of the ensemble event $n^*_t$ is given by:
$$\mathcal{L}(n^*_t) = \big(1 - \lambda^{*(g)}_t \Delta\big)^{1 - n^{(g)}_t} \prod_{m=1}^{C^*} \big(\lambda^{*(m)}_t \Delta\big)^{n^{*(m)}_t}. \tag{2}$$

S1 Appendix 2/28

The likelihood in Eq. (2) is used to form a multinomial generalized linear model (mGLM) with a multinomial logistic link function, of which we consider two versions.
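The binary mark encoding described above is easy to sketch in code. The following minimal helpers (function names are ours, purely illustrative) map a spike pattern to its mark index and back:

```python
def mark_of(n_t):
    """Map a binary spike vector [n^(1), ..., n^(C)] to its mark index:
    treat n^(c) as the c-th least significant bit of a C-bit number."""
    return sum(bit << c for c, bit in enumerate(n_t))

def spikes_of(mark, C):
    """Invert the encoding: recover the spike vector of a given mark
    (the disjoint decomposition makes this well-defined)."""
    return [(mark >> c) & 1 for c in range(C)]
```

For $C = 3$ neurons, the mark space is $\{1, \ldots, 7\}$, and neuron $c$'s spiking is recovered by summing the marked processes whose index has bit $c$ set.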
The first, more general version utilizes the ensemble history as covariates in the mGLM. Letting the covariate vector $x_t$ be the ensemble history up to some fixed lag at time $t$ (augmented by a constant element of 1), the model is parameterized by $\omega_t := [\omega^{(1)\top}_t, \omega^{(2)\top}_t, \ldots, \omega^{(C^*)\top}_t]^\top$, where the parameter vector for the $m$-th mark, $\omega^{(m)}_t := [\mu^{(m)}_t, \theta^{(m)\top}_t]^\top$, consists of the baseline firing parameter $\mu^{(m)}_t$ and an ensemble history-modulation vector $\theta^{(m)}_t$. The CIF of the $m$-th mark is hence defined as:
$$\lambda^{*(m)}_t \Delta = \frac{\exp\big(\omega^{(m)\top}_t x_t\big)}{1 + \sum_{j=1}^{C^*} \exp\big(\omega^{(j)\top}_t x_t\big)};$$
equivalently, $\omega^{(m)\top}_t x_t$ is the log-odds of the $m$-th mark occurring versus no spiking event occurring:
$$u^{(m)}_t := \log \frac{\lambda^{*(m)}_t \Delta}{1 - \lambda^{*(g)}_t \Delta} = \omega^{(m)\top}_t x_t.$$
The second version of the mGLM makes the simplifying assumption that there is no history dependence. The resulting model depends only on contemporaneous spiking, permitting compact parameterization by the baseline firing parameters $\mu_t := [\mu^{(1)}_t, \ldots, \mu^{(C^*)}_t]^\top$. In the history-independent model, the log-odds of the $m$-th mark occurring versus no spiking event occurring are:
$$u^{(m)}_t = \log \frac{\lambda^{*(m)}_t \Delta}{1 - \lambda^{*(g)}_t \Delta} = \mu^{(m)}_t.$$
The log-likelihood can be expressed in a fashion resembling the maximum entropy model [6, 7]. For the history-independent case, the log-likelihood of $n^*_t$ is
$$\log \mathcal{L}(n^*_t) = \sum_{m=1}^{C^*} n^{*(m)}_t \mu^{(m)}_t - \log\Big(1 + \sum_{j=1}^{C^*} \exp\big(\mu^{(j)}_t\big)\Big),$$
while the history-dependent formulation admits a similar expression by simply replacing $\mu^{(m)}_t$ with $\omega^{(m)\top}_t x_t$.
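As a concrete illustration of the multinomial logistic link in the history-independent model, the following sketch (helper names are ours, not the paper's) computes mark probabilities and the per-bin log-likelihood:

```python
import math

def mark_probs(mu):
    """Given baseline parameters mu[m-1] for marks m = 1..C*, return
    [P(no event), P(mark 1), ..., P(mark C*)] under the multinomial
    logistic link: P(mark m) = exp(mu_m) / (1 + sum_j exp(mu_j))."""
    z = 1.0 + sum(math.exp(u) for u in mu)
    return [1.0 / z] + [math.exp(u) / z for u in mu]

def log_likelihood_bin(mu, n_star):
    """Log-likelihood of one binned observation n_star: a one-hot vector
    over the C* marks (all zeros when no spiking event occurred)."""
    m = n_star.index(1) + 1 if 1 in n_star else 0
    return math.log(mark_probs(mu)[m])
```

A quick sanity check of the link is the log-odds identity $\log\big(\mathbb{P}(\text{mark } m)/\mathbb{P}(\text{no event})\big) = \mu^{(m)}$.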

A.2 Adaptive estimation of marked point process models
First, recall that since the parameters of the MkPP mGLM are allowed to change in time, we use adaptive algorithms to capture the dynamics of the history-dependent and history-independent models; and that, for tractability, we employ a thresholding rule similar to [8], thus considering only "reliable interactions", i.e., the subset of the mark space $\widetilde{\mathcal{K}} = \{m \in \mathcal{K} : \sum_t n^{*(m)}_t > N_{\mathrm{thr}}\}$ for some pre-defined constant $N_{\mathrm{thr}} > 0$. For generality and clarity in notation, the following derivations are in terms of the full mark space $\mathcal{K}$.
The parameters of the history-dependent discretized MkPP model are adaptively estimated by solving a sequence of maximum likelihood problems. We assume that the parameters $\omega_t$ admit piece-wise constant dynamics, remaining constant over consecutive windows of length $W$; additionally, observations are assumed conditionally independent across time bins. The ensemble history up to lag $p$ defines the covariates as $x_t := [1, n^\top_{t-1}, \ldots, n^\top_{t-p}]^\top$. The set of history covariate vectors in the $i$-th window is denoted by $X_i = [x_{(i-1)W+1}, \ldots, x_{iW}]$. Note that since the mapping from $n^*_t$ to $n_t$ is injective, the influence of past spiking activity can be equivalently captured by defining history covariates in terms of either; however, using $n_t$ reduces the dimensionality of $\omega_t$ and quantifies the influence of past spiking activity directly rather than through categorical variables. Let $n^{*(m)}_i = [n^{*(m)}_{(i-1)W+1}, \ldots, n^{*(m)}_{iW}]^\top$ denote the sequence of outcomes of the $m$-th mark in the $i$-th window. The log-likelihood of the $i$-th window is thus given by
$$\ell_i(\omega) = \sum_{t=(i-1)W+1}^{iW} \Bigg[ \sum_{m=1}^{C^*} n^{*(m)}_t \omega^{(m)\top} x_t - \log\Big(1 + \sum_{j=1}^{C^*} \exp\big(\omega^{(j)\top} x_t\big)\Big) \Bigg]. \tag{6}$$
Motivated by the RLS objective function [9], a forgetting factor mechanism is utilized to combine the log-likelihoods up to the $k$-th window, capturing the dynamics in each mark's rates. For a forgetting factor $0 \le \beta < 1$, the adaptively-weighted log-likelihood at window $k$ is thus defined as:
$$\ell^\beta_k(\omega_k) := (1 - \beta) \sum_{i=1}^{k} \beta^{k-i} \ell_i(\omega_k). \tag{7}$$
Parameter estimation is hence performed by solving the sequence of maximum likelihood problems:
$$\widehat{\omega}_k = \arg\max_{\omega_k} \ \ell^\beta_k(\omega_k), \qquad k = 1, 2, \ldots \tag{8}$$
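The exponentially weighted objective of Eq. (7) admits the simple online recursion $\ell^\beta_k = \beta\, \ell^\beta_{k-1} + (1-\beta)\, \ell_k$, which a few lines of Python confirm (a toy check with illustrative names, not the paper's code):

```python
def weighted_loglik_direct(ells, beta):
    """(1 - beta) * sum_i beta^(k-i) * ell_i, with i = 1..k."""
    k = len(ells)
    return (1.0 - beta) * sum(beta ** (k - 1 - i) * ells[i] for i in range(k))

def weighted_loglik_recursive(ells, beta):
    """Same quantity computed online: L_k = beta * L_{k-1} + (1 - beta) * ell_k."""
    L = 0.0
    for ell in ells:
        L = beta * L + (1.0 - beta) * ell
    return L
```

The recursion is what makes the forgetting-factor objective usable in a streaming setting: only the previous accumulated value, not the full window history, must be stored.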
To efficiently solve the sequence of problems in Eq. (8) in an online fashion, we use the Adaptive Orthogonal Matching Pursuit (AdOMP) [10], an adaptive version of the Orthogonal Matching Pursuit (OMP) [11, 12]. The AdOMP and OMP both iteratively identify a parameter support set (the non-zero components) of fixed size (a hyperparameter for which we cross-validate) to encourage sparsity, thus capturing the inherent sparsity of network interactions based on past ensemble activity [13–15]. However, AdOMP permits the support set to change between windows. Moreover, greedy estimation over a sparse subset of parameters mitigates the intractability of the estimation problem for large neuronal assemblies, where regularization-based constraints still require optimization over all parameters. The AdOMP relies on efficient evaluation of the gradient $\nabla_\omega \ell^\beta_k(\omega_k)$ at the $l$-th iterate $\widehat{\omega}_{(l),k}$, both to determine the next addition to the parameter support set and to solve the new maximization problem via gradient descent. Hence, its recursive computation is crucial for the algorithm to operate in an online fashion. To this end, we utilize a recursive update rule to compute the gradient at the $k$-th window, generalizing the adaptive filtering techniques employed in [16] for Bernoulli observations to a multivariate setting. A complete derivation of the recursive update rule and its utilization in AdOMP is provided in Section B.1.
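For readers unfamiliar with matching pursuit, the following toy sketch illustrates the greedy support-growth rule on a least-squares objective; AdOMP applies the same add-the-largest-gradient-coordinate-then-refit loop to the weighted log-likelihood of Eq. (8). Everything here is illustrative, not the paper's implementation:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def omp(X, y, s_max):
    """Greedy OMP: grow the support by the coordinate with the largest
    absolute gradient of the least-squares objective, then refit on it."""
    d, T = len(X[0]), len(y)
    support, omega = [], [0.0] * d
    for _ in range(s_max):
        resid = [y[t] - sum(X[t][j] * omega[j] for j in range(d)) for t in range(T)]
        grads = [abs(sum(X[t][j] * resid[t] for t in range(T))) for j in range(d)]
        j_star = max((j for j in range(d) if j not in support), key=lambda j: grads[j])
        support.append(j_star)
        # Refit restricted to the current support (normal equations).
        G = [[sum(X[t][a] * X[t][b] for t in range(T)) for b in support] for a in support]
        h = [sum(X[t][a] * y[t] for t in range(T)) for a in support]
        w_s = solve(G, h)
        omega = [0.0] * d
        for a, w in zip(support, w_s):
            omega[a] = w
    return support, omega
```

The refit step is what distinguishes OMP from plain matching pursuit: after each greedy selection, all coefficients on the support are re-estimated jointly.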
The sequence of maximum likelihood problems that must be solved to obtain the history-independent model takes a similar form as in Eqs. (6)–(8), under the same assumption of piece-wise constant dynamics of $\mu_t$. Defining $\bar{n}^*_i := \frac{1}{W} \sum_{j=(i-1)W+1}^{iW} n^*_j$, the equivalent of the $i$-th window log-likelihood in Eq. (6) is given by
$$\ell_i(\mu) = W \Bigg[ \bar{n}^{*\top}_i \mu - \log\Big(1 + \sum_{j=1}^{C^*} \exp\big(\mu^{(j)}\big)\Big) \Bigg]. \tag{9}$$
Hence, the adaptively-weighted log-likelihood at the $k$-th window is:
$$\ell^\beta_k(\mu_k) := (1 - \beta) \sum_{i=1}^{k} \beta^{k-i} \ell_i(\mu_k). \tag{10}$$
The sequence of maximum likelihood estimates $\widehat{\mu}_k = \arg\max_{\mu_k} \ell^\beta_k(\mu_k)$ is obtained by gradient descent. The gradient of the history-independent log-likelihood in Eq. (9), while involving a weighted summation of the terms $\bar{n}^*_i$ for which a simple recursion is derived, can be computed directly. The procedure for computing the maximum-likelihood estimates of the history-independent model is detailed in Section B.2.

A.3 Statistical inference of higher-order coordination
Here, we expound on the formulation of a nested hypothesis test for $r$-th-order coordination of ensemble spiking activity presented in the main text. Recall that the significance of $r$-wise simultaneous spiking for $r \ge 2$ is tested by considering the two alternatives:

$H_0$: $r$-th-order simultaneous spikes occur as frequently as they would between independent units, given ensemble spiking history;

$H_1$: $r$-th-order simultaneous spikes occur at a significantly different rate than they would between independent units, given ensemble spiking history. (11)

We estimate a reduced model that assumes $r$-th-order interactions are chance occurrences by constraining the base rate parameters for each $r$-th-order mark. For the $m$-th mark, let $u^{(m)}_t$ denote the log-odds of the $m$-th mark occurring versus no spiking event occurring, and let $u^{(m)}_{0,t}$ denote its value under the null hypothesis. We decompose the base rate parameter as
$$\mu^{(m)}_k = \mu^{(m)}_{0,k} + \gamma^{(m)}_k,$$
where $\mu^{(m)}_{0,k}$ is the base rate under the null hypothesis and $\gamma^{(m)}_k$ is analogous to the additional multiplicative factor in [4] that captures potential exogenous effects after conditioning on ensemble spiking history.
The reduced model is estimated by solving the sequence of maximum likelihood problems $\widehat{\omega}^{(R)}_k = \arg\max_{\omega_k} \ell^\beta_k(\omega_k)$ with the base rate parameters of $r$-th-order events constrained to their values under the null hypothesis. Let
$$\mathcal{K}_r := \Big\{ m \in \mathcal{K} : \sum_{c=1}^{C} m_c = r \Big\},$$
where $m_c$ is the $c$-th least significant bit of $m$ in binary. Then, to obtain the reduced model, we fix $\mu^{(m)}_k$ to $\mu^{(m)}_{0,k}$ for each $m \in \mathcal{K}_r$ and optimize the remaining parameters. To explicitly obtain the constraints, first recall that $u^{(m)}_t$ is the log-odds of $n^{*(m)}_t = 1$ versus $n^{(g)}_t = 0$ given ensemble spiking history. Under the assumption that the neurons are independent, the probabilities of each event are given, respectively, by
$$\mathbb{P}\big(n^{*(m)}_t = 1\big) = \prod_{c=1}^{C} \big(\lambda^{(c)}_t \Delta\big)^{m_c} \big(1 - \lambda^{(c)}_t \Delta\big)^{1 - m_c}, \qquad \mathbb{P}\big(n^{(g)}_t = 0\big) = \prod_{c=1}^{C} \big(1 - \lambda^{(c)}_t \Delta\big).$$
Evaluating the log-ratio of these two probabilities at the full model estimate $\widehat{\omega}_k$, we obtain
$$\widehat{u}^{(m)}_{0,t} = \sum_{c \,:\, m_c = 1} \log \frac{\widehat{\lambda}^{(c)}_t \Delta}{1 - \widehat{\lambda}^{(c)}_t \Delta}.$$
Attributing the difference between $u^{(m)}_t$ and $u^{(m)}_{0,t}$ only to exogenous factors, we estimate the exogenous factor at the $k$-th window as the within-window average
$$\widehat{\gamma}^{(m)}_k = \frac{1}{W} \sum_{t=(k-1)W+1}^{kW} \big(\widehat{u}^{(m)}_t - \widehat{u}^{(m)}_{0,t}\big).$$
Thus, for the reduced model, the value of $\mu^{(m)}_k$ is fixed to its null value $\mu^{(m)}_{0,k}$ for each $m \in \mathcal{K}_r$. The hypotheses are then quantitatively stated as:
$$H_0 : \mu^{(m)}_k = \mu^{(m)}_{0,k} \ \ \forall m \in \mathcal{K}_r \qquad \text{vs.} \qquad H_1 : \mu^{(m)}_k \neq \mu^{(m)}_{0,k} \ \ \text{for some } m \in \mathcal{K}_r.$$
To control possible abrupt variations of $\widehat{\gamma}^{(m)}_k$ across windows, we apply Kalman forward/backward smoothing to the exogenous factor and use the smoothed values, $\bar{\gamma}^{(m)}_k$, in lieu of $\widehat{\gamma}^{(m)}_k$. The procedure is detailed in Section C.
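The independence constraint reduces the null log-odds to a sum of per-neuron log-odds over the neurons participating in the mark, which is easy to verify numerically (a sketch with illustrative names; `p[c]` stands in for $\lambda^{(c)}_t \Delta$):

```python
import math

def null_log_odds(mark, p):
    """u_0 for mark m under independence, given per-neuron spike
    probabilities p[c]:
      log[ prod_c p_c^{m_c} (1-p_c)^{1-m_c} / prod_c (1-p_c) ]
        = sum over {c : m_c = 1} of log(p_c / (1 - p_c))."""
    return sum(math.log(p[c] / (1.0 - p[c]))
               for c in range(len(p)) if (mark >> c) & 1)
```

For example, for a two-neuron assembly with mark 3 (both neurons spike), the null log-odds equals the direct ratio $\log\big(p_1 p_2 / ((1-p_1)(1-p_2))\big)$.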

B Estimating MkPP Models
This section states the Adaptive Orthogonal Matching Pursuit (AdOMP) algorithm proposed in [10] and provides the derivations of recursive update rules for the log-likelihood and its gradient, which are utilized to efficiently implement AdOMP and to compute the adaptive de-biased deviance difference (Section B.1). The section also restates the recursive algorithm for maximum likelihood estimation of the history-independent model (Section B.2), proposed in [17].
B.1 The History-Dependent Model: Recursive Gradient Computation and AdOMP

We use AdOMP [10] to greedily estimate the parameters of the history-dependent MkPP model, capturing the inherent sparsity of network interactions [13–15] while maintaining tractability. The algorithm uses the gradient of the log-likelihood at each iteration of the matching pursuit to determine the next addition to the parameter support set and to solve the new maximization problem with a gradient descent algorithm. We derive a recursive update to compute the gradient at the $k$-th window. The gradient of the objective subsumes the gradients with respect to each $\omega^{(m)}$.
That is,
$$\nabla_{\omega^{(m)}} \ell^\beta_k(\omega) = \sum_{i=1}^{k} \beta^{k-i} X_i^\top \big( n^{*(m)}_i - \lambda^{*(m)}_i(\omega) \Delta \big),$$
where the $m$-th mark's CIF over the $i$-th window is denoted by $\lambda^{*(m)}_i(\omega)\Delta := [\lambda^{*(m)}_{(i-1)W+1}\Delta, \ldots, \lambda^{*(m)}_{iW}\Delta]^\top$ and the common factor $(1 - \beta)$ is dropped for convenience, as it does not affect the optimal solution. Observe that the gradient, evaluated at the $l$-th iterate $\widehat{\omega}_{(l),k}$, can equivalently be written as
$$\nabla_{\omega^{(m)}} \ell^\beta_k\big(\widehat{\omega}_{(l),k}\big) = \beta \sum_{i=1}^{k-1} \beta^{k-1-i} X_i^\top \big( n^{*(m)}_i - \lambda^{*(m)}_i(\widehat{\omega}_{(l),k}) \Delta \big) + X_k^\top \big( n^{*(m)}_k - \lambda^{*(m)}_k(\widehat{\omega}_{(l),k}) \Delta \big),$$
suggesting a recursive update. However, this form requires the value of the gradient (and consequently, the CIFs) of previous windows evaluated at $\widehat{\omega}_{(l),k}$, which is unavailable in an online setting. However, the values $\lambda^*_i(\widehat{\omega}_i)\Delta$ have been evaluated at previous windows; thus, each $\lambda^*_i(\widehat{\omega}_{(l),k})\Delta$ is approximated by its first-order Taylor approximation around $\widehat{\omega}_i$. In the following, we consider only the $m$-th mark, noting that each component of the gradient may be obtained by applying the procedure to each mark in parallel.
First, the aforementioned Taylor approximation is computed for the CIFs of the $m$-th mark at the $i$-th window. Defining the diagonal matrices
$$\Lambda^{*(m,j)}_i := \operatorname{diag}\Big( \lambda^{*(m)}_i(\widehat{\omega}_i)\Delta \odot \big( \mathbb{1}\{m = j\} - \lambda^{*(j)}_i(\widehat{\omega}_i)\Delta \big) \Big),$$
with diagonal terms given by the indicated point-wise products, we find the first-order Taylor approximation to be
$$\lambda^{*(m)}_i\big(\widehat{\omega}_{(l),k}\big)\Delta \approx \lambda^{*(m)}_i(\widehat{\omega}_i)\Delta + \sum_{j=1}^{C^*} \Lambda^{*(m,j)}_i X_i \big( \widehat{\omega}^{(j)}_{(l),k} - \widehat{\omega}^{(j)}_i \big).$$
Note that evaluating the approximation of the CIF for the $m$-th mark requires the CIFs of the other marks through $\Lambda^{*(m,j)}_i$. Making the simplifying assumption that $\lambda^{*(m)}_i\Delta \, \lambda^{*(j)}_i\Delta \approx 0$ for $m \neq j$, the cross-terms are rendered negligible and the parallelized CIF approximation is given by
$$\lambda^{*(m)}_i\big(\widehat{\omega}_{(l),k}\big)\Delta \approx \lambda^{*(m)}_i(\widehat{\omega}_i)\Delta + \Lambda^{*(m,m)}_i X_i \big( \widehat{\omega}^{(m)}_{(l),k} - \widehat{\omega}^{(m)}_i \big).$$
Substituting this approximation, and defining
$$b^{(m)}_k := \sum_{i=1}^{k} \beta^{k-i} X_i^\top \big( n^{*(m)}_i - \lambda^{*(m)}_i(\widehat{\omega}_i)\Delta + \Lambda^{*(m,m)}_i X_i \widehat{\omega}^{(m)}_i \big), \qquad B^{(m)}_k := \sum_{i=1}^{k} \beta^{k-i} X_i^\top \Lambda^{*(m,m)}_i X_i,$$
the gradient evaluated at $\widehat{\omega}_{(l),k}$ becomes $\nabla_{\omega^{(m)}} \ell^\beta_k(\widehat{\omega}_{(l),k}) \approx b^{(m)}_k - B^{(m)}_k \widehat{\omega}^{(m)}_{(l),k}$; the parallelized, fully recursive gradient update rule for the $m$-th mark is thus
$$b^{(m)}_k = \beta b^{(m)}_{k-1} + X_k^\top \big( n^{*(m)}_k - \lambda^{*(m)}_k(\widehat{\omega}_k)\Delta + \Lambda^{*(m,m)}_k X_k \widehat{\omega}^{(m)}_k \big), \qquad B^{(m)}_k = \beta B^{(m)}_{k-1} + X_k^\top \Lambda^{*(m,m)}_k X_k.$$
The AdOMP algorithm, and the utility of the recursive gradient update, is detailed in Algorithm A. Let the initial support set of size $r_0$ be denoted by $S^{(0)}$, and the maximum support size by $s^*$. The maximum support size can be obtained by cross-validation. We assume here that $S^{(0)}$ is non-empty and reasonably initialized; it is then necessary to solve the preliminary problem in Step 3. Alternatively, we may initially set $S^{(0)} = \emptyset$ and $\widehat{\omega}_{(0),k} = 0$.
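At a fixed evaluation point, the weighted gradient obeys exactly the $g_k = \beta g_{k-1} + (\text{current-window term})$ structure used above. The sketch below checks this for a toy residual sequence (names are illustrative; `r[i][t]` stands in for $n^{*(m)}_t - \lambda^{*(m)}_t\Delta$):

```python
def grad_direct(X, r, beta):
    """sum_i beta^(k-i) X_i^T r_i over windows i = 1..k;
    X[i][t] is the covariate vector of bin t in window i."""
    d, k = len(X[0][0]), len(X)
    g = [0.0] * d
    for i in range(k):
        w = beta ** (k - 1 - i)
        for x_t, r_t in zip(X[i], r[i]):
            for j in range(d):
                g[j] += w * x_t[j] * r_t
    return g

def grad_recursive(X, r, beta):
    """Online form: g <- beta * g + X_k^T r_k at each new window."""
    d = len(X[0][0])
    g = [0.0] * d
    for X_i, r_i in zip(X, r):
        g = [beta * gj for gj in g]
        for x_t, r_t in zip(X_i, r_i):
            for j in range(d):
                g[j] += x_t[j] * r_t
    return g
```

In the actual AdOMP the residuals of past windows depend on the current iterate, which is why the Taylor approximation around $\widehat{\omega}_i$ is needed before this recursion applies.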
Additionally, the log-likelihoods of the full and reduced models can be computed in an online fashion. We use the Taylor approximation of $\ell_i(\widehat{\omega}_k)$ around $\widehat{\omega}_i$ to obtain the recursion
$$a_k = \beta a_{k-1} + \ell_k(\widehat{\omega}_k),$$
where the variable $a_k$ is defined as
$$a_k := \sum_{i=1}^{k} \beta^{k-i} \ell_i(\widehat{\omega}_i),$$
and $b_k$, $B_k$ have been previously defined; the weighted log-likelihood $\ell^\beta_k(\widehat{\omega}_k)$ is then approximated in terms of $a_k$, $b_k$, and $B_k$. The bias term can also be computed recursively. The recursive computations of the log-likelihood and bias terms enable the adaptive de-biased deviance difference to be computed efficiently.

B.2 The History-Independent Model: Maximum Likelihood Estimation
The maximum likelihood estimates of the history-independent discretized MkPP model's parameters are obtained by gradient descent; Algorithm B, below, details the implementation proposed in [17] and utilized here. Noting that the gradient at window $k$ involves the weighted summation of the terms $\bar{n}^*_i$, a simple recursion can be used to compute it efficiently at each window. The log-likelihoods of the full and reduced history-independent models are similarly straightforward to compute using the same recursive variables. The parameters $I_{\max}$ and $\kappa$ are the maximum number of gradient-descent iterations and the step size, respectively. In Step 4, the term $\exp(\widehat{\mu}_k)$ is understood to be an element-wise operation.
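Since the gradient of Eq. (9) with respect to $\mu^{(m)}$ is proportional to $\bar{n}^{*(m)} - \lambda^{*(m)}\Delta$, a plain gradient-ascent sketch recovers the closed-form fixed point $\lambda^{*(m)}\Delta = \bar{n}^{*(m)}$ (illustrative code under our own naming, not Algorithm B itself):

```python
import math

def fit_mu(nbar, iters=5000, kappa=0.5):
    """Gradient ascent for the history-independent model on a single
    window: nbar[m] is the empirical frequency of mark m+1."""
    mu = [0.0] * len(nbar)
    for _ in range(iters):
        z = 1.0 + sum(math.exp(u) for u in mu)
        lam = [math.exp(u) / z for u in mu]   # lambda^(m) * dt at current mu
        mu = [u + kappa * (nb - l) for u, nb, l in zip(mu, nbar, lam)]
    return mu
```

At the maximum, $\mu^{(m)} = \log\big(\bar{n}^{*(m)} / (1 - \sum_m \bar{n}^{*(m)})\big)$, i.e., the empirical log-odds of mark $m$ against no event, which provides a convenient correctness check for the iteration.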

Algorithm B ML Estimation of History-Independent Model
C Forward-Backward Smoothing of Exogenous Factors

In our proposed test for $r$-th-order coordination, the null hypothesis that $r$-th-order spiking occurs at the same rate as between $r$ independent neurons is quantified by estimating a reduced model for which $\mu^{(m)}_k$ is fixed to its null value, as stated in the main text. However, despite the parameters being piece-wise constant, the mark CIFs in the history-dependent model (hence $u^{(m)}_t$ and $u^{(m)}_{0,t}$) can vary abruptly between adjacent time bins due to the ensemble spiking history covariates; in turn, this propagates to the estimated exogenous factors. Hence, we apply Kalman forward/backward smoothing [18] for stability in the estimated exogenous factors over time.
The procedure is summarized in Algorithm C for the estimated exogenous factors of the $m$-th mark, $\{\widehat{\gamma}^{(m)}_k\}_{k=1}^{K}$. The forward-backward smoothing algorithm is applied independently for each $m \in \mathcal{K}_r$. The model is stated in Eq. (29), below; dropping the mark index from the notation for clarity, we assume that $\{\widehat{\gamma}_k\}_{k=1}^{K}$ are noisy observations of a linear Gaussian state-space model with state transition coefficient $\rho \in (0, 1]$:
$$\gamma_k = \rho \gamma_{k-1} + w_k, \qquad \widehat{\gamma}_k = \gamma_k + v_k. \tag{29}$$
The noise samples $w_k$ and $v_k$ are independent and normally distributed, with respective variances $\sigma^2_w$ and $\sigma^2_v$. Steps 3–8 of Algorithm C describe the forward filter, while backward smoothing is described in Steps 9–12. Additionally, an Expectation-Maximization algorithm is used to update the values of the noise variances $\sigma^2_w$ and $\sigma^2_v$ over $L$ iterations. In simulations, we observed that in $L = 10$ iterations, the noise variances converged to stable solutions after being initialized to $\sigma^2_{w,(0)} = \sigma^2_{v,(0)} = 10 \times \frac{1}{2} \operatorname{Var}\big(\{\widehat{\gamma}_k\}_{k=1}^{K}\big)$; this initialization scheme and choice of $L$ were also used in real data applications. The backward-smoothed estimated exogenous factors, $\{\bar{\gamma}_k\}_{k=1}^{K}$, were utilized to compute the null values of the baseline rate parameters.
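A minimal scalar forward filter / backward (Rauch-Tung-Striebel) smoother for the model in Eq. (29), with fixed noise variances, can be sketched as follows (the EM variance updates of Algorithm C are omitted; all names are ours):

```python
def kalman_smooth(y, rho, var_w, var_v, x0=0.0, P0=1.0):
    """Scalar forward Kalman filter followed by RTS backward smoothing
    for the AR(1)-plus-observation-noise model of Eq. (29)."""
    K = len(y)
    xf, Pf, xp, Pp = [], [], [], []
    x, P = x0, P0
    for yk in y:
        x_pred = rho * x                       # predict
        P_pred = rho * rho * P + var_w
        gain = P_pred / (P_pred + var_v)       # update
        x = x_pred + gain * (yk - x_pred)
        P = (1.0 - gain) * P_pred
        xp.append(x_pred); Pp.append(P_pred); xf.append(x); Pf.append(P)
    xs, Ps = xf[:], Pf[:]                      # backward (RTS) pass
    for k in range(K - 2, -1, -1):
        J = Pf[k] * rho / Pp[k + 1]
        xs[k] = xf[k] + J * (xs[k + 1] - xp[k + 1])
        Ps[k] = Pf[k] + J * J * (Ps[k + 1] - Pp[k + 1])
    return xs
```

With a small state-noise variance, the smoother suppresses bin-to-bin fluctuations while tracking the slow component of the sequence, which is precisely the stabilization sought for the exogenous factors.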
D Asymptotic Normality of the De-Sparsified AdOMP Estimate

Precisely characterizing the limiting behavior of the adaptive de-biased deviance difference, our test statistic, requires the ability to construct confidence intervals around parameter estimates. While such procedures are well-established for unrestricted or unregularized linear regression models, there is a notable gap in the existing literature on greedily-estimated high-dimensional sparse models. However, a set of elegant results [1, 19, 20] on $\ell_1$-regularized maximum-likelihood estimation have established techniques to de-sparsify parameter estimates and construct confidence intervals. In this section, we extend the de-sparsification technique of [1] to the greedy high-dimensional setting (Section D.1) and prove that the de-sparsified AdOMP estimate also asymptotically behaves like the maximum likelihood estimate (Section D.2).

D.1 De-Sparsifying the AdOMP Estimate
We show that the AdOMP parameter estimates can be de-sparsified as in [1] by inspecting the Karush-Kuhn-Tucker (KKT) conditions of the optimization problem. First, note from Algorithm A that $\widehat{\omega}_k = \widehat{\omega}_{(s^*),k}$ solves the maximization problem
$$\widehat{\omega}_{(s^*),k} = \arg\max_{\omega \,:\, \operatorname{supp}(\omega) \subseteq S^{(s^*)}} \ell^\beta_k(\omega).$$
The support constraint is equivalent to requiring that $\omega_j = 0$ for all $j \notin S^{(s^*)}$. Collectively, these linear equality constraints can be expressed as $A\omega = 0$ for a suitable selection matrix $A$, and the KKT conditions (stationarity, primal feasibility, and dual feasibility, respectively) are straightforward to derive. Of particular relevance is the stationarity condition:
$$\nabla \ell^\beta_k\big(\widehat{\omega}_k\big) + A^\top \nu = 0.$$
Substituting for $\ell^\beta_k(\omega_k)$ with its quadratic approximation around $\widehat{\omega}_{(s^*),k}$ in the stationarity condition and rearranging terms, the de-sparsified parameters are defined as
$$\widehat{w}_k := \widehat{\omega}_k - \widehat{\Theta}_k \nabla \ell^\beta_k\big(\widehat{\omega}_k\big),$$
in the same fashion as van de Geer et al. in [1].
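The effect of the correction is easiest to see in a toy quadratic problem with orthonormal design, where the inverse Hessian is exactly $-I$ and the one-step (Newton) correction maps any support-restricted estimate onto the unrestricted maximizer. This is only an illustration under our own conventions (here $\ell(\omega) = -\tfrac{1}{2}\|y - X\omega\|^2$, and the sign of the correction follows from taking the inverse Hessian; sign conventions for $\widehat{\Theta}_k$ differ across references):

```python
def desparsify(omega_hat, X, y):
    """One-step Newton correction for l(w) = -0.5 * ||y - X w||^2 with
    orthonormal design (X^T X = I), so the Hessian is -I and
      w_hat = omega_hat - Hessian^{-1} grad = omega_hat + X^T (y - X omega_hat)."""
    d = len(omega_hat)
    resid = [yi - sum(X[t][j] * omega_hat[j] for j in range(d))
             for t, yi in enumerate(y)]
    grad = [sum(X[t][j] * resid[t] for t in range(len(y))) for j in range(d)]
    return [w + g for w, g in zip(omega_hat, grad)]
```

Starting from a sparse estimate supported on one coordinate, the corrected estimate coincides with the full least-squares solution $X^\top y$, illustrating why de-sparsified estimates inherit the limiting behavior of the unrestricted maximizer.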

D.2 Asymptotic Normality
Next, we establish the asymptotic normality of the de-sparsified AdOMP estimate. While a similar result was established in [1], the independence of each realization of the covariates and observations was assumed there; additionally, several technical conditions were tailored to $\ell_1$-regularized maximum likelihood estimation. Hence, the result is formally stated based on conditions revised for our setting, enumerated below.
(C5) The covariates satisfy $\|\bar{X} \widehat{\Theta}_k^\top\|_\infty = O_P(c_1)$, where the $T \times d$ matrix $\bar{X}$ is obtained by tiling $X$ horizontally $C^*$ times.
(C6) The covariates satisfy $\|\bar{X}(\widehat{\omega}_k - \omega_k)\|_\infty = O_P(c_2)$, with $c_1 c_2 = o(1)$.

Theorem 1. Consider the maximization of the total data log-likelihood $\ell^\beta_k(\omega_k)$ at the $k$-th window, where the true parameter $\omega_k \in \mathbb{R}^d$ is $(s, \xi)$-compressible with $\xi < \frac{1}{2}$. Let $\omega^0_k$ be the maximum likelihood estimate and $\widehat{\omega}_k$ be the AdOMP estimate after $O(s \log s)$ iterations. Under conditions (C1)–(C6), the de-sparsified AdOMP estimate $\widehat{w}_k$ satisfies
$$\sqrt{\tfrac{1+\beta}{1-\beta}} \big( \widehat{w}_k - \omega^0_k \big) \xrightarrow{d} \mathcal{N}\big(0, \Sigma^{-1}_k\big) \quad \text{as } \beta \to 1,$$
where $\widehat{\Theta}_k$ converges to $\Sigma^{-1}_k$, the inverse of the Fisher information matrix.
The complete proof is provided in Section F.2, but we summarize it here. Theorem 1 is established by decomposing the difference $\widehat{w}_k - \omega^0_k$ into two components, one of which is negligible, using conditions (C1)–(C6). The rate of decay of the negligible terms is explicitly derived, but summarily treated as $o_P(1)$. Then, modified versions of the Central Limit Theorem and the Law of Large Numbers are used to show that the non-negligible component is asymptotically normal. Based on Theorem 1, confidence intervals for the AdOMP estimates can be constructed by adapting the recursive procedure of [16] to our setting.

E Limiting Behavior of the Adaptive De-biased Deviance Difference

In this section, we characterize the limiting distributions of the adaptive de-biased deviance difference under both the null and alternative hypotheses using the results in Section D. We first address the main result for testing $r$-th-order coordination using history-dependent models in Section E.1. Recalling that the history-independent models are special cases of the history-dependent models analyzed in our main result, we specifically address the former in a corollary result (Section E.2). Finally, in a second corollary result, we extend the main result to tests of arbitrary nested hypotheses (Section E.3).

E.1 Testing for r-th-Order Coordination with History-Dependent Models
First, the limiting distributions of the adaptive de-biased deviance difference for history-dependent MkPP models are characterized, under both the null hypothesis of chance $r$-th-order spiking and the alternative hypothesis of coordinated $r$-th-order spiking, in Theorem 2, below.
Theorem 2. Let $\widehat{\omega}_k$ and $\widehat{\omega}^{(R)}_k$ respectively be the full and reduced greedily-estimated parameters of the history-dependent model at window $k$, where $\widehat{\omega}^{(R)}_k$ assumes conditionally independent $r$-th-order simultaneous spiking. Then, as $\beta \to 1$:

i) if $r$-th-order coordination matches independent $r$-th-order interactions given ensemble spiking history, then the adaptive de-biased deviance difference converges in distribution to $\chi^2\big(M^{(r)}\big)$, i.e., chi-square; and

ii) if $r$-th-order coordination differs from independent $r$-th-order interactions given ensemble spiking history, and the difference between the estimated base rate parameters of $r$-th-order interactions and their restricted values scales at least as $O\big(\sqrt{\tfrac{1-\beta}{1+\beta}}\big)$, then the adaptive de-biased deviance difference converges in distribution to $\chi^2\big(M^{(r)}, \nu^{(r)}_k\big)$, i.e., non-central chi-square,

where $\nu^{(r)}_k$ is the non-centrality parameter at window $k$ and the degrees of freedom $M^{(r)} := |\mathcal{K}_r|$ is the difference in the cardinalities of the full and reduced support sets.
The complete proof is presented in Section F.3. It is based on Davidson and Lever's treatment in [22] and is closely related to similar analyses for inferring adaptive Granger causality [23].

E.2 Special Case: History-Independent Models
The statistical inference procedure described in the main text for the history-dependent MkPP model can be specialized to history-independent analysis, as has been previously demonstrated [17]. In this case, the two hypotheses,

$H_0$: $r$-th-order simultaneous spikes occur as frequently as they would between independent units;

$H_1$: $r$-th-order simultaneous spikes occur at a significantly different rate than they would between independent units, (33)

are quantified in the same manner as in the history-dependent case, excepting the history covariates. The distinction between the history-independent and history-dependent tests is that the former seeks only to determine whether observed rates of simultaneous spiking events are facilitated or suppressed, while the latter seeks to determine whether observed rates of simultaneous spiking events are attributable to unobserved or neglected processes.
Estimates of the history-independent model parameters are unbiased, as no sparsity constraints are imposed. Hence, the adaptive deviance difference, $D_k$, is used as the test statistic. Here, we characterize the asymptotic distributions of the adaptive deviance difference under the null and alternative hypotheses in a corollary to Theorem 2.

Corollary 2.1. Let $\widehat{\mu}_k$ and $\widehat{\mu}^{(R)}_k$ respectively be the full and reduced maximum-likelihood estimates of the history-independent model parameters at window $k$, where $\widehat{\mu}^{(R)}_k$ assumes conditionally independent $r$-th-order simultaneous spiking. Then, as $\beta \to 1$:

i) if $r$-th-order synchrony matches independent $r$-th-order interactions, the adaptive deviance difference converges in distribution to $\chi^2\big(M^{(r)}\big)$; and

ii) if $r$-th-order synchrony differs from independent $r$-th-order interactions, and the difference between the estimated base rate parameters of $r$-th-order interactions and their restricted values scales at least as $O\big(\sqrt{\tfrac{1-\beta}{1+\beta}}\big)$, the adaptive deviance difference converges in distribution to $\chi^2\big(M^{(r)}, \nu^{(r)}_k\big)$,

where $\nu^{(r)}_k$ is the non-centrality parameter at window $k$ that depends only on the true parameters, and $M^{(r)} = |\mathcal{K}_r|$ is the difference in the cardinalities of the full and reduced support sets.
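Given the chi-square limit, a p-value for the deviance difference follows from the chi-square survival function; for even degrees of freedom it has the closed form used below (a sketch with illustrative names; for odd degrees of freedom one would substitute a library routine such as `scipy.stats.chi2.sf`):

```python
import math

def chi2_sf_even(x, df):
    """Survival function P(X > x) of a chi-square with even df:
    exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    assert df % 2 == 0 and df > 0
    return math.exp(-x / 2.0) * sum((x / 2.0) ** j / math.factorial(j)
                                    for j in range(df // 2))

def deviance_test(loglik_full, loglik_reduced, df, alpha=0.05):
    """Deviance difference D = 2 (l_full - l_reduced), referred to chi2(df)."""
    D = 2.0 * (loglik_full - loglik_reduced)
    p = chi2_sf_even(D, df)
    return D, p, p < alpha
```

For instance, with two degrees of freedom the survival function reduces to $e^{-x/2}$, so a deviance difference of $2\log 20 \approx 5.99$ corresponds exactly to $p = 0.05$.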
The proof of Corollary 2.1 is presented in Section F.4, with emphasis on the points of departure from the proof of Theorem 2.

E.3 General Case: Testing Nested Hypotheses with History-Dependent Models
The statistical tests proposed in this work have been developed specifically to determine whether $r$-th-order coordinated spiking occurs by chance in ensemble activity. However, our proposed methods may be adjusted to test other nested hypotheses about ensemble activity, i.e., tests of the form
$$H_0 : \omega_{1,k} = \omega^0_{1,k} \qquad \text{vs.} \qquad H_1 : \omega_{1,k} \neq \omega^0_{1,k},$$
where the reduced model $\widehat{\omega}^{(R)}_k$ is estimated while restricting an arbitrary subset of $M$ parameters, $\omega_{1,k}$, to fixed values. For example, it might be of interest to inspect $r$-th-order interactions amongst subsets of the neuronal assembly (such as all events involving a particular neuron) rather than amongst the entire assembly. Alternatively, one could in principle characterize the significance of simultaneous spiking events individually rather than as groups.
Corollary 2.2, below, states the limiting distributions of the adaptive de-biased deviance difference for arbitrary nested hypotheses; it follows almost immediately from Theorem 2, as expounded in Section F.5.

Corollary 2.2. As $\beta \to 1$: i) under the null hypothesis, the adaptive de-biased deviance difference converges in distribution to $\chi^2(M)$; and ii) under the alternative hypothesis, assuming the difference between the restricted parameters and their full model estimates scales at least as $O\big(\sqrt{\tfrac{1-\beta}{1+\beta}}\big)$, it converges in distribution to $\chi^2(M, \nu_k)$, where $\nu_k$ is the non-centrality parameter at window $k$ and the degrees of freedom $M$ is the number of restricted parameters.
An important consequence of Corollary 2.2 is its connection to adaptive Granger causality analysis for point process models of neuronal spiking. The discretized point process GLM utilized in [23] is a special case of the history-dependent MkPP mGLM in which $C^* = 1$. Hence, Corollary 2.2 establishes that sparse Granger-causal networks can be adaptively inferred using greedy estimation as an alternative to $\ell_1$-regularized estimation, and thus more tractably applied to larger neuronal assemblies.

F Proofs
This section contains proofs of the theoretical results presented in Sections D and E. Preliminary results, namely the limiting behaviors of the Hessian matrix, $\nabla^2 \ell^\beta_k(\omega^0_k)$, and of the gradient of the exponentially-weighted total data log-likelihood, $\nabla \ell^\beta_k(\omega^0_k)$, are established in Section F.1. The proof of Theorem 1 is provided in Section F.2, followed by the proof of Theorem 2 in Section F.3. Proofs of the corollaries to Theorem 2 are presented next; Corollary 2.1 is addressed in Section F.4 and Corollary 2.2 in Section F.5.

F.1 Preliminary Results
The limiting behaviors of $\nabla^2 \ell^\beta_k(\omega^0_k)$ and $\nabla \ell^\beta_k(\omega^0_k)$ are characterized in Propositions 1 and 2, respectively. Limits are evaluated over a monotonically increasing sequence $\{\beta_l\}_{l=1}^\infty$ that converges to 1; for notational convenience, the shorthand $\beta \to 1$ is used throughout. For simplicity and without loss of generality, we analyze the case in which the windows over which the model is piece-wise constant are of size $W = 1$ and indexed by $t = 1, \ldots, k$.

Proposition 1 characterizes the limit of the Hessian: as $\beta \to 1$, $\nabla^2 \ell^\beta_k(\omega^0_k) \to -\Sigma_k$, the negative of the Fisher information matrix.

Proof. First, note the decomposition of the Hessian into a sum of per-bin terms weighted by $\beta^{k-t}$. Using the sequence $c_\beta := \frac{1}{1-\beta}$, which tends to $\infty$ as $\beta \to 1$, a version of the Law of Large Numbers for $\varphi$-mixing random variables [24] is invoked to characterize the limit of this weighted sum. The limiting behavior of the Hessian then follows, and from the Fisher information equality the limit equals $-\Sigma_k$. Next, we establish the asymptotic normality of the score function.
Proof. The result is established by invoking a version of the Central Limit Theorem for martingales [24, 25]. First, recall the form of the gradient with respect to the parameters of the $m$-th mark, and that the covariates $x_t$ comprise the ensemble spiking history $\{n^{(c)}_j\}_{c=1, j=t-p}^{C, t-1}$. Let $\mathcal{F}_t$ be the $\sigma$-field generated by the spiking history up to time $t$; the score function is then a sum of martingale differences with respect to the filtration $\{\mathcal{F}_t\}_{t=1}^k$. Employing the Cramer-Wold device, it suffices to characterize the limiting distribution of $z^\top \nabla \ell^\beta_k(\omega^0_k)$ for an arbitrary unit vector $z$; denote this scalar by $Z^\beta_k$ and its variance by $s^2_\beta$. Thus, we must establish that $Z^\beta_k / s_\beta \xrightarrow{d} \mathcal{N}(0, 1)$ as $\beta \to 1$. The version of the CLT for martingales in [25] stipulates this result if two conditions hold: (i) convergence of the normalized conditional variances, and (ii) a Lindeberg-type condition, where $\mathbb{1}\{\cdot\}$ denotes the indicator function. We first address condition (i). Substituting for $s_\beta$ and the squared martingale differences, and letting $\rho = \beta^2$, the condition may be equivalently expressed in terms of the weighted sums analyzed previously; using the Fisher information equality, this condition is implied by monotone convergence and the result of Proposition 1.

F.2 Proof of Theorem 1
Proof. To begin, we analyze the de-sparsified estimates $\hat{w}_k = \hat{\omega}_k - \hat{\Theta}_k \nabla \ell_k^\beta(\hat{\omega}_k)$ element-wise, attending particularly to the latter term of the right-hand side. Based on the expansion in Eq. (43), the difference between the de-sparsified and maximum-likelihood estimates (for the $j$-th component) decomposes into two terms, $\Delta_1$ and $\Delta_2$, where $e_j$ denotes a standard basis vector. In the following, we derive uniform upper bounds on both $|\Delta_1|$ and $|\Delta_2|$ for all $j = 1, \ldots, d$. We reintroduce two notations from conditions (C1) and (C5) to proceed. Recall that the collection of covariate matrices over all windows is denoted by $X = [X_1, \ldots, X_k]$; letting $\bar{X}_i$ denote the matrix $X_i$ tiled horizontally $C^*$ times, we also define the $T \times d$ matrix $\bar{X} = [\bar{X}_1, \ldots, \bar{X}_k]$. Thus, for $i = 1, \ldots, k$, we define $\hat{z}_i := \bar{X}_i \hat{\omega}_k$, and the log-likelihood function for the $i$-th window can be written accordingly. For the $i$-th window, the mean-value form of the expansion holds for some $\bar{z}_i$ such that $\|\bar{z}_i - \hat{z}_i\|_2 \leq \|\hat{z}_i - z_i^0\|_2$; using condition (C1) and the chain rule for derivatives, a uniform upper bound for $\{\Delta_1\}_{j=1}^{d}$ is derived as follows. The residual is expanded, and the $i$-th term in the summation, neglecting $\beta^{k-i}$, is addressed first: the first inequality is a consequence of Hölder's inequality, the second is due to the equivalence of norms, and the third follows by invoking condition (C1), as in Eq. (48). Using the fact that $\beta < 1$, the bound on $|\Delta_1|$ follows. The term $|\Delta_2|$ is upper-bounded analogously; invoking conditions (C5) and (C6), we conclude that $\Delta = o_P(1)$. This conclusion follows immediately from Eq. (52) if $\nabla^2 \ell_k^\beta$ is invertible, as per the first part of condition (C3). Thus, we have shown that $\hat{\Sigma}_k \to \Sigma_k$; invoking condition (C3), it follows that $\hat{\Theta}_k \to \Sigma_k^{-1}$. Recalling that this is a sequential limit in $\beta$: if $\hat{\Sigma}_k$ is invertible for each $\beta$, then $\hat{\Theta}_k \to \Sigma_k^{-1}$ by the continuous mapping theorem; alternatively, if $\hat{\Theta}_k$ is not invertible, the statement follows directly from the second part of condition (C3). Proposition 2 establishes the asymptotic normality of $\nabla \ell_k^\beta(\omega_k^0)$; in conjunction with the aforementioned limiting behavior of $\hat{\Theta}_k$ and Slutsky's theorem, we thus conclude that $\sqrt{\tfrac{1+\beta}{1-\beta}}\,(\hat{w}_k - \omega_k^0)$ is asymptotically normal with zero mean and the covariance matrix stated in Theorem 1.
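A numerical illustration of the de-sparsifying step may help. The sketch below uses a toy Gaussian linear model with squared-error loss rather than the paper's point-process likelihood, and a hard-thresholded pilot estimate stands in for the greedily estimated sparse parameter vector (both purely illustrative). It shows that when $\hat{\Theta}$ equals the exact inverse Hessian, the correction $\hat{w} = \hat{\omega} - \hat{\Theta}\,\nabla\ell(\hat{\omega})$ recovers the unpenalized maximum-likelihood (here, OLS) solution exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
omega_true = np.array([1.0, -0.5, 0.0, 0.0, 2.0])
y = X @ omega_true + 0.1 * rng.standard_normal(n)

# Sparse pilot estimate: hard-thresholded OLS stands in for the
# greedily estimated, sparse parameter vector (illustrative only).
ols = np.linalg.solve(X.T @ X, X.T @ y)
omega_hat = np.where(np.abs(ols) > 0.3, ols, 0.0)

# Gradient of the loss l(w) = ||y - X w||^2 / (2n) at the pilot estimate.
grad = -X.T @ (y - X @ omega_hat) / n

# Theta: inverse of the empirical Hessian X^T X / n.
Theta = np.linalg.inv(X.T @ X / n)

# De-sparsified (de-biased) estimate: w = omega_hat - Theta @ grad.
w_debiased = omega_hat - Theta @ grad

# With the exact inverse Hessian, de-biasing undoes the sparsification
# and returns the OLS solution; the identity holds for any pilot estimate.
assert np.allclose(w_debiased, ols)
```

In high-dimensional settings the exact inverse Hessian is unavailable and $\hat{\Theta}$ is only an approximate inverse, which is precisely why the residual terms $\Delta_1$ and $\Delta_2$ must be controlled in the proof.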

F.3 Proof of Theorem 2
Proof. We begin with the adaptive de-biased deviance statistic for the estimated parameters $\hat{\omega}_k$ and the true parameters $\omega_k$. To write the deviance in a more convenient form, we first note the quadratic expansion of $\ell_k^\beta(\omega_k)$. Substituting into Eq. (54) and expressing the bias-correction term $B_k$ explicitly, the deviance is equivalent to a quadratic form in which $\bar{\omega}_k$ lies between $\hat{\omega}_k$ and $\omega_k$. By rearranging terms and recalling the definition of the de-sparsified estimate $\hat{w}_k = \hat{\omega}_k - \hat{\Theta}_k \nabla \ell_k^\beta(\hat{\omega}_k)$, the deviance can be compactly expressed in quadratic form. We now explicitly define the null and the sequence of local alternative hypotheses in order to adapt Davidson and Lever's treatment [22]. Recall first that limits in $\beta$ are understood to be sequential; i.e., they are evaluated along a sequence $\{\beta_l\}_{l=1}^{\infty}$ with $\beta_l \to 1$ as $l \to \infty$. Then, for window $k$, we test the null hypothesis $H_{0,k}\!: \omega_k^0 = (\omega_{0,k}, \omega_{1,k})$ against the sequence of local alternative hypotheses $H_{1,k}^{\beta_l}\!: \omega_k^{\beta_l} = (\omega_{0,k}, \omega_{1,k}^{\beta_l})$, where $\omega_{1,k}^{\beta_l} = \omega_{1,k} + \sqrt{\tfrac{1-\beta_l}{1+\beta_l}}\,\delta_k$; the limiting true parameter vector under the alternative is denoted by $\omega_k^*$. The partition of the $d$-dimensional parameter vector corresponds to the free parameters under the null hypothesis ($\omega_{0,k}$) and the restricted sub-vector ($\omega_{1,k}$). Thus, the statistical test seeks to detect local perturbations of order $O\bigl(\sqrt{\tfrac{1-\beta}{1+\beta}}\bigr)$ from the null hypothesis. For the statistical inference of $r$-th-order coordination, $\omega_{1,k}$ corresponds to the base-rate parameters of $r$-th-order events, the number of which is denoted by $M^{(r)}$; however, since a similar partition can be made for any nested hypothesis test, the result shown here generalizes.
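The compact quadratic form of the de-biased deviance noted earlier in this proof can be sketched as follows; this is a schematic consistent with the $\chi^2$ limits derived below (up to the remainder terms handled in the proof), not a reproduction of the exact display of Eq. (57):

```latex
D_{k,\beta}(\hat{\omega}_k, \omega_k)
  = \frac{1+\beta}{1-\beta}\,
    (\hat{w}_k - \omega_k)^{\top}\, \hat{I}_k\, (\hat{w}_k - \omega_k) + o_P(1).
```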
We now establish the limiting behavior of the de-biased deviance difference under the null hypothesis. By Proposition 1, $\nabla^2 \ell_k^\beta(\omega_k) \to -I_k$. Thus, recalling the definition of the de-sparsified estimate $\hat{w}_k$ and invoking Slutsky's theorem, $\sqrt{\tfrac{1+\beta}{1-\beta}}\,(\hat{w}_k - \omega_k) \xrightarrow{d} \mathcal{N}(0, I_k^{-1})$, and consequently $[D_{k,\beta}(\hat{\omega}_k, \omega_k) \mid H_{0,k}] \xrightarrow{d} \chi^2(d)$; that is, the deviance asymptotically follows a $\chi^2$ distribution with $d$ degrees of freedom. Finally, invoking classical results for likelihood-ratio tests of nested models [26, 27], we conclude that under the null hypothesis the adaptive de-biased deviance difference converges in distribution to a $\chi^2$ random variable with $M^{(r)}$ degrees of freedom. It remains to establish the limiting behavior of the adaptive de-biased deviance difference under the sequence of alternative hypotheses. Though Proposition 1 establishes that $\nabla^2 \ell_k^\beta(\omega_k^*) \to -I_k^*$, the asymptotic normality of $\sqrt{\tfrac{1+\beta}{1-\beta}}\,(\hat{w}_k - \omega_k^*)$ does not follow as readily. To establish it, we first derive an alternative expression for the difference $(\hat{w}_k - \omega_k^*)$ using two Taylor expansions, writing the gradient of the log-likelihood evaluated at $\hat{\omega}_k$ and at $\omega_k^\beta$, respectively, as expansions around $\omega_k^*$; the remainder terms are asymptotically negligible, which can be shown based on the conditions of Theorem 1. In contrast to the null case, the limiting normal distribution has non-zero mean; we therefore employ Davidson and Lever's approach [22] to establish the limiting behavior of the adaptive de-biased deviance difference.
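The passage from asymptotic normality of $\hat{w}_k$ to the $\chi^2(d)$ limit of the deviance is the standard Wald-type quadratic-form argument: if $e \sim \mathcal{N}(0, I_k^{-1})$, then $e^\top I_k e \sim \chi^2(d)$. A small Monte Carlo check of this fact (the matrix playing the role of $I_k$ is arbitrary and illustrative, not derived from the model):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# An arbitrary positive-definite matrix standing in for the Fisher
# information I_k (illustrative values only).
A = rng.standard_normal((d, d))
I_k = A @ A.T + d * np.eye(d)

# Draw errors e ~ N(0, I_k^{-1}) via the Cholesky factor of the covariance.
L = np.linalg.cholesky(np.linalg.inv(I_k))
e = (L @ rng.standard_normal((d, 20000))).T

# Wald-type quadratic form e^T I_k e; each draw is exactly chi^2(d).
q = np.einsum('ij,jk,ik->i', e, I_k, e)

# chi^2(d) has mean d and variance 2d; check both empirically.
assert abs(q.mean() - d) < 0.2
assert abs(q.var() - 2 * d) < 0.8
```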
To proceed, we first decompose the Fisher information matrix in accordance with the partition of the parameter vector introduced earlier. The parameters $\omega_k^*$ were partitioned into $\omega_{0,k}^*$, corresponding to the parameters free under the null hypothesis, and $\omega_{1,k}^*$, the parameters restricted under the null hypothesis; $I_k^*$ is decomposed into the corresponding blocks $I_{0,0,k}^*$, $I_{0,1,k}^*$, $I_{1,0,k}^*$, and $I_{1,1,k}^*$, where $I_{0,1,k}^* = (I_{1,0,k}^*)^\top$. Utilizing the quadratic form of the de-biased deviance in Eq. (57) and invoking Proposition 1, the adaptive de-biased deviance difference is expressed as a difference of quadratic forms. In the following steps, an equivalent expression for the reduced-model deviance is derived in terms of $\hat{w}_k$, so that the second term of Eq. (61) can be rewritten accordingly. Note that the partial gradient is a sub-vector of the full gradient, with $I_{0,\bullet,k}^* = [\,I_{0,0,k}^*,\; I_{0,1,k}^*\,]$; Eq. (62) can then be expressed in terms of these blocks. Thus, defining $\tilde{I}_k^*$, the adaptive de-biased deviance difference simplifies to a single quadratic form, and the remaining convergence follows from earlier arguments.
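In Davidson and Lever's framework, the matrix $\tilde{I}_k^*$ defined above is, in the standard form of such results, the Schur complement of the free-parameter block, and the non-centrality parameter is the induced quadratic form in $\delta_k$. The generic expressions, stated here under that assumption as a sketch:

```latex
\tilde{I}^{*}_{k} = I^{*}_{1,1,k} - I^{*}_{1,0,k}\,\bigl(I^{*}_{0,0,k}\bigr)^{-1} I^{*}_{0,1,k},
\qquad
\nu^{(r)} = \delta_k^{\top}\, \tilde{I}^{*}_{k}\, \delta_k .
```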
Using this fact, we conclude that, under the sequence of local alternatives $H_{1,k}^\beta$, the adaptive de-biased deviance difference converges to a non-central $\chi^2$ random variable with $M^{(r)}$ degrees of freedom and non-centrality parameter $\nu^{(r)}$.

F.4 Proof of Corollary 2.1
Proof. Maximum-likelihood estimation of the parameters eliminates the need to de-bias the deviance, since the gradient evaluated at the maximum-likelihood estimate (and consequently, the bias term) is exactly zero. Hence, the test statistic reduces to the adaptive deviance difference. As in the course of proving Theorem 2, we begin with the deviance between the estimated and true parameters. To proceed, we first argue that asymptotic results analogous to Propositions 1 and 2 hold. To determine the limiting behavior of the Hessian, note that it can be written in closed form and is not a function of the simultaneous spiking observations; hence, its limit is deterministic. The normality of the gradient, i.e., that $\sqrt{\tfrac{1+\beta}{1-\beta}}\,\nabla \ell_k^\beta(\mu_k) \xrightarrow{d} \mathcal{N}(0, I_k)$, is a consequence of Proposition 2. Note that, trivially, $\mathbb{E}[\nabla \ell_t(\mu_k) \mid \mathcal{F}_{t-1}] = 0$, where the filtration $\{\mathcal{F}_t\}_{t=0}^{k}$, as previously defined, corresponds to past spiking observations. Hence, the martingale CLT of Proposition 2 is applicable and the required conditions can be shown to hold. Alternatively, because $\nabla \ell_k^\beta(\mu_k)$ is a sum of independent but not identically distributed random variables, the Lindeberg–Feller CLT may be invoked to the same effect.
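The first claim above can be made explicit: at the maximum-likelihood estimate the score vanishes, so the de-sparsifying correction of Theorem 1 (with $\hat{\Theta}_k$ as defined there) is identically zero and the de-biased and plain estimates coincide:

```latex
\nabla \ell_k^{\beta}(\hat{\mu}_k) = 0
\quad \Longrightarrow \quad
\hat{w}_k = \hat{\mu}_k - \hat{\Theta}_k\, \nabla \ell_k^{\beta}(\hat{\mu}_k) = \hat{\mu}_k ,
```

so the bias-correction term $B_k$ vanishes and the de-biased deviance reduces to the plain adaptive deviance difference.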
The limiting distribution of the adaptive deviance difference can now be established. First, we address the null hypothesis $H_{0,k}\!: \mu_k^0 = (\mu_{0,k}, \mu_{1,k})$, where the parameters are partitioned into the free ($\mu_{0,k}$) and restricted ($\mu_{1,k}$) subsets as before. Through a series of arguments similar to those presented in proving Theorem 2, the limiting behaviors of $\nabla^2 \ell_k^\beta(\mu_k)$ and $\sqrt{\tfrac{1+\beta}{1-\beta}}\,\nabla \ell_k^\beta(\mu_k)$ imply that $D_{k,\beta}(\hat{\mu}_k, \mu_k) \xrightarrow{d} \chi^2(d)$, where $d$ is the dimensionality of the parameter vector; it likewise follows that the adaptive deviance difference is asymptotically $\chi^2$-distributed with $M^{(r)}$ degrees of freedom, where, as before, $M^{(r)}$ is the dimensionality of the sub-vector $\mu_{1,k}$. Next, we address the limiting distribution of the adaptive deviance difference under the sequence (in $\beta$) of local alternatives $H_{1,k}^\beta\!: \mu_k^\beta = \bigl(\mu_{0,k}^*,\; \mu_{1,k} + \sqrt{\tfrac{1-\beta}{1+\beta}}\,\delta\bigr)$, where $\beta$ converges to unity and $\mu_k^*$ is the limiting true parameter vector. The deviance considered in this case is between the estimate and the limiting true parameter, $D_{k,\beta}(\hat{\mu}_k, \mu_k^*)$. The characterization of the limiting behavior of the Hessian remains true under the sequence of alternatives, but the asymptotic normality of $\sqrt{\tfrac{1+\beta}{1-\beta}}\,(\hat{\mu}_k - \mu_k^*)$ must be established. To this end, we derive an equivalent expression based on Taylor expansions of the gradient at $\hat{\mu}_k$ and $\mu_k^\beta$ around $\mu_k^*$. By rearranging the terms of these two expansions, and by noting that $\nabla \ell_k^\beta(\hat{\mu}_k) = 0$, the desired equivalence is derived; that the remainder terms are negligible follows from previous arguments. We thus conclude that, in the special case of the contemporaneous-event model, the adaptive deviance difference converges to a non-central $\chi^2$-distributed random variable with $M^{(r)}$ degrees of freedom and non-centrality parameter $\nu^{(r)}$.

F.5 Proof of Corollary 2.2
Proof. Recall from the proof of Theorem 2 that the partition of the parameter vector, $\omega_k^0 = (\omega_{0,k}, \omega_{1,k})$, corresponds to the free parameters under the null hypothesis ($\omega_{0,k}$) and the restricted sub-vector ($\omega_{1,k}$), which are
also free when estimating the full model for the alternative hypothesis. For the statistical inference of $r$-th-order coordination, $\omega_{1,k}$ corresponded to the base-rate parameters of $r$-th-order events; for the arbitrary nested hypotheses in Corollary 2.2, however, $\omega_{1,k}$ corresponds to the arbitrary sub-vector that was restricted under the null hypothesis to estimate the reduced model. Denoting the limiting $\chi^2$ distributions' degrees of freedom by $M$ and the non-centrality parameter by $\nu^{(k)}$, the result is established, mutatis mutandis, by the same proof as that of Theorem 2.
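The non-central $\chi^2$ limits established in these proofs translate directly into asymptotic power curves for the proposed tests: reject when the deviance difference exceeds the central $\chi^2$ quantile, and evaluate the non-central survival function at that threshold. A brief numerical sketch (the values of $M$, $\alpha$, and the non-centrality are illustrative; `scipy` is assumed available):

```python
from scipy.stats import chi2, ncx2

# Level-alpha test: reject when the deviance difference exceeds the
# (1 - alpha) quantile of the central chi^2 with M degrees of freedom.
M, alpha = 3, 0.05
threshold = chi2.ppf(1 - alpha, df=M)

def power(nu: float) -> float:
    """Asymptotic power under a local alternative with non-centrality nu."""
    return ncx2.sf(threshold, df=M, nc=nu)

# Power grows monotonically with the non-centrality parameter, and
# strong deviations from the null are detected with high probability.
assert power(1.0) < power(5.0) < power(30.0)
assert power(30.0) > 0.95
```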
Recall the statement of Corollary 2.2: let $\hat{\omega}_k^{(F)}$ and $\hat{\omega}_k^{(R)}$ denote the full and reduced greedily-estimated parameters of the history-dependent model at window $k$, where an arbitrary subset of the reduced parameters is restricted. Then, as $\beta \to 1$: (i) under the null hypothesis, $D_{k,\beta}\bigl(\hat{\omega}_k^{(F)}, \hat{\omega}_k^{(R)}\bigr) \xrightarrow{d} \chi^2(M)$; and (ii) under the sequence of local alternatives, the statistic converges to a non-central $\chi^2$ random variable with $M$ degrees of freedom and non-centrality parameter $\nu^{(k)}$.

$$\sqrt{\frac{1+\beta}{1-\beta}}\,(\hat{\mu}_k - \mu_k^*) \xrightarrow{\;d\;} \mathcal{N}\!\bigl(\delta_k,\, (I_k^*)^{-1}\bigr).$$
Recalling the partition of the parameter vector into the subsets of free and restricted parameters under the null hypothesis, and the corresponding decomposition of the Fisher information matrix $I_k^*$, we consider the deviance difference between the full and reduced models, focusing on the latter term. The first of the Taylor expansions in Eq. (73) implies that $\nabla \ell_k^\beta(\mu_k^*) = I_k^*(\hat{\mu}_k - \mu_k^*) + o_P(1)$, and so the reduced-model deviance can be written as
$$\frac{1+\beta}{1-\beta}\, \nabla_{\mu_{0,k}} \ell_k^\beta(\mu_k^*)^{\top} \bigl(I_{0,0,k}^*\bigr)^{-1} \nabla_{\mu_{0,k}} \ell_k^\beta(\mu_k^*) + o_P(1). \quad (76)$$
Following a series of arguments analogous to those presented in the proof of Theorem 2, the deviance difference can be shown to be equivalent to $D$