Influence of Uncertainty and Surprise on Human Corticospinal Excitability during Preparation for Action

Summary

Actions are guided by prior sensory information [1–10], which is inherently uncertain. However, how the motor system is sculpted by the trial-by-trial content of current sensory information remains largely unexplored. Previous work suggests that conditional probabilities, learned under a particular context, can be used preemptively to influence the output of the motor system [11–14]. To test this, we used transcranial magnetic stimulation (TMS) to read out corticospinal excitability (CSE) during preparation for action in an instructed delay task [15, 16]. We systematically varied the uncertainty about an impending action by changing the validity of the instructive visual cue. We used two information-theoretic quantities to predict changes in CSE, prior to action, on a trial-by-trial basis: entropy (average uncertainty) and surprise (the stimulus-bound information conveyed by a visual cue) [17–19]. Our data show that during preparation for action, human CSE varies according to the entropy and surprise conveyed by visual events guiding action. CSE increases on trials with low entropy about the impending action and low surprise conveyed by an event. Commensurate effects were observed in reaction times. We suggest that motor output is biased according to contextual probabilities that are represented dynamically in the brain.


Model Specification and Estimation
To evaluate $p_k$, where $k = 1, \ldots, K$, we assume that subjects learn $p_k$ as follows. The joint likelihood of $X = \{x_1, \ldots, x_N\}$ is

$$P(X \mid p) = \prod_{k=1}^{K} p_k^{N_k} \qquad (1)$$

where $N_k = \sum_i \delta(x_i = k)$ is the number of instances of the $k$-th trial type. The conjugate prior of a multinomial is a Dirichlet distribution,

$$P(p \mid \alpha) \sim D(\alpha_1, \ldots, \alpha_K) = \frac{1}{Z(\alpha)} \prod_{k=1}^{K} p_k^{\alpha_k - 1} \qquad (2)$$

with hyperparameters $\alpha = [\alpha_1, \ldots, \alpha_K]^T$ and normalization constant $Z(\alpha)$. Given the likelihood and prior (Equations 1 and 2), we can compute the posterior distribution, which is also Dirichlet (because the prior is conjugate):

$$P(p \mid X, \alpha) \sim D(N_k + \alpha_k). \qquad (3)$$

We assume the initial prior has hyperparameters $\alpha_k = 1$, i.e., is uniform. This assumes that at the beginning of a block participants start with the prior that all events are equally likely.
Equation 3 gives the conditional density of the multinomial parameters we require. The expectation of event $k$ given $N$ observations is given by the predictive distribution,

$$p_k = \left\langle p_k \right\rangle_{P(p \mid X, \alpha)} = \frac{N_k + \alpha_k}{N + \sum_{k'} \alpha_{k'}}. \qquad (4)$$

These values were computed sequentially throughout a block and entered into Equations 3 and 4 from the main text. This modeling rests on ideal observer assumptions; several studies show that human observers can compute the predictability of sensory events and perform accordingly, close to the performance of an ideal observer [S7–S12]. Note also that in this analysis, we discarded any information about trial type and used the information-theoretic estimates of contextual uncertainty that were updated dynamically, based on what the subject actually observed (see Figure 1A and the Experimental Procedures for details). This approach thus encodes online learning of the order within a sequence, assuming that each event is sampled from a discrete probability distribution.
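The sequential update scheme above can be sketched as follows. This is a minimal illustrative sketch, not the original analysis code; it assumes events are coded as integers 0 … K−1, and the function and variable names are our own:

```python
import numpy as np

def sequential_entropy_surprise(events, K):
    """Sequentially update a Dirichlet-multinomial ideal observer over K
    event types, returning the predictive entropy (average uncertainty,
    in bits) and the Shannon surprise of each observed event, both
    evaluated before the trial's outcome is incorporated."""
    alpha = np.ones(K)                  # uniform initial prior, alpha_k = 1
    H, I = [], []
    for x in events:
        p = alpha / alpha.sum()         # predictive distribution (Equation 4)
        H.append(float(-np.sum(p * np.log2(p))))  # entropy of the prediction
        I.append(float(-np.log2(p[x])))           # surprise of observed event x
        alpha[x] += 1                   # conjugate posterior update (Equation 3)
    return np.array(H), np.array(I)
```

With repeated presentations of the same event, both the predictive entropy and the surprise of that event decrease over trials, as the observer's belief concentrates on the frequent event type.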
Our model assumes that subjects treat contingencies as stationary and unchanging within an experimental block. Within the context of the present task, this assumption matches the known generative distribution. The contribution of this paper is to relax the assumption, implicit in a categorical model, that a subject's response to an event type does not change with experience. For this, we used a stationary model to learn the probability of event types sequentially within a block, assuming the block structure of the true generative distribution is known. A model that learns transitions between different blocks could be implemented by using a parameterized function to weight past observations; this function could take a number of forms, characterized, for example, by its duration and rate of decay into the past. This will be the subject of future work. However, we illustrate the issue by including two additional models that differ in the maximal window length of past observations on which predictions are based. These represent the extreme scenarios of minimal and (near) maximal forgetting. Maximal forgetting corresponds to a prior that is constant on every trial (i.e., it is never updated); if this prior is uniform, then every event is equally likely, i.e., carries the same "surprise" throughout the experiment. We therefore compared three models in which the maximal number of past observations comprised one block, all blocks (i.e., all past observations from previous blocks), and the four most recent trials. Examples of regressors, assuming these different window lengths into the past, are shown in Figure S2.
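The effect of the window length can be sketched by bounding the observation history that feeds the predictive distribution. This is an illustrative sketch under our own naming, assuming a uniform Dirichlet prior as above:

```python
import numpy as np
from collections import deque

def windowed_surprise(events, K, window=None):
    """Shannon surprise when predictions are based on at most `window`
    past observations. window=None retains all past trials (minimal
    forgetting); window=0 never updates the uniform prior (maximal
    forgetting), so every event carries the same surprise, log2(K)."""
    history = deque(maxlen=window)      # forgetting = bounded history
    I = []
    for x in events:
        counts = np.bincount(list(history), minlength=K)
        p = (counts + 1.0) / (counts.sum() + K)  # uniform prior alpha_k = 1
        I.append(float(-np.log2(p[x])))
        history.append(x)
    return np.array(I)
```

Under maximal forgetting the surprise regressor is flat, whereas with a short window (e.g., four trials) the predictive counts saturate quickly and surprise stabilizes after a few trials; with no forgetting it continues to track the cumulative event frequencies.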

General Linear Model of Data
To test the hypothesis that surprise and entropy could explain behavioral and physiological responses, we used empirical Bayes to estimate a regression model. There were T trials per subject and S subjects. Data from all subjects were concatenated in a vector, Y, of length $T \times S$. This was fitted using a three-level hierarchical model:

$$Y = Z_1 w_1 + e_1$$
$$w_1 = Z_2 w_2 + e_2$$
$$w_2 = e_3$$

Parameter weights $\{w_1, w_2\}$ scale each column of the design matrices $\{Z_1, Z_2\}$, and hyperparameters $\{\lambda_1, \lambda_2, \lambda_3\}$ control the precision (inverse variance) of the noise at each level, $\{e_1, e_2, e_3\}$; these correspond to within-subject error, between-subject error, and shrinkage priors on the group parameters, $w_2$. The first-level design matrix, $Z_1$, was block diagonal, with dimensions $TS \times PS$ and P regressors per subject. These regressors correspond to subject-specific explanatory variables (i.e., surprise, entropy, error trials, and a constant term). The second-level design matrix, $Z_2 = 1_S \otimes I_P$, models between-subject differences in the parameter weights, where $1_S$ is a column of ones of length S and $I_P$ a $P \times P$ identity matrix. Posterior densities over model parameters and hyperparameters were optimized using standard techniques [S13]. The posterior density represents the degree of belief in the parameter values given the data, e.g., reaction time or CSE.
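The structure of the two design matrices can be sketched as follows. This is a minimal sketch that constructs only the design matrices (not the empirical Bayes optimization, which requires the scheme in [S13]), assuming equal trial counts T per subject; names are illustrative:

```python
import numpy as np

def build_design_matrices(X_subjects):
    """Assemble the first- and second-level design matrices of the
    hierarchical model: Z1 is block diagonal over subjects (TS x PS),
    and Z2 = 1_S (Kronecker product) I_P maps the P group-level weights
    onto the P*S subject-level weights."""
    S = len(X_subjects)
    T, P = X_subjects[0].shape
    Z1 = np.zeros((T * S, P * S))
    for s, Xs in enumerate(X_subjects):
        Z1[s * T:(s + 1) * T, s * P:(s + 1) * P] = Xs  # one block per subject
    Z2 = np.kron(np.ones((S, 1)), np.eye(P))           # 1_S kron I_P
    return Z1, Z2
```

By construction, multiplying $Z_2$ by a vector of P group-level weights replicates those weights for every subject; the between-subject error $e_2$ then captures each subject's deviation from the group.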
The model evidence $p(y \mid M_i)$ for the i-th model, $M_i$, was approximated by the marginal likelihood, computed after optimizing the model [S13]. Because the model parameters are integrated out of this expression, the evidence includes a model complexity term [S13]. This evidence was used to compare competing models, defined in terms of the explanatory variables in $Z_1$. Note that the evidence does not depend on the parameters or their number and therefore properly accounts for model complexity when used for Bayesian model comparison. In brief, the log evidence comprises two terms: the accuracy of a model, or the expected log-likelihood, and the complexity, which is a function of the number of, and uncertainty about, the free parameters. Bayesian model comparison therefore furnishes the most accurate and parsimonious model (see Penny et al., 2004, for details [S14]).
Models were compared using the difference between the log marginal likelihoods, $F_i$ and $F_j$, of models i and j. This difference relates to the model evidence ratio as follows:

$$F_i - F_j = \ln p(y \mid M_i) - \ln p(y \mid M_j) = \ln \frac{p(y \mid M_i)}{p(y \mid M_j)}.$$

This means that a difference of +3 corresponds approximately to 20:1 odds, i.e., $\exp(3) \approx 20$, in favor of model i over j, whereas −3 corresponds to 20:1 odds in favor of model j. In the present case, positive values reflect stronger evidence in favor of the model containing entropy (Ĥ) and surprise (î), whereas negative values would indicate stronger evidence for the alternative models tested.

Figure S2. Information-theoretic quantities computed from models assuming no forgetting and near-maximal forgetting. Top: entropy and surprise for a block sequence generated from a 55%–45% valid–invalid CS block, in which only the four most recent previous trials are used. Bottom: entropy and surprise when previous trials from all previous blocks are used. The ensuing time series were used as predictors for modeling corticospinal excitability (CSE) and reaction times (RT) across the entire series of trials for each participant.
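The odds interpretation of log evidence differences can be sketched numerically (a minimal sketch; function names are illustrative):

```python
import math

def evidence_odds(F_i, F_j):
    """Bayes factor (model evidence ratio) implied by a difference in
    log marginal likelihoods: p(y|M_i) / p(y|M_j) = exp(F_i - F_j)."""
    return math.exp(F_i - F_j)

def posterior_model_prob(F_i, F_j):
    """Posterior probability of model i under equal model priors,
    i.e., the softmax of the two log evidences."""
    return 1.0 / (1.0 + math.exp(F_j - F_i))
```

For example, a log evidence difference of +3 yields odds of exp(3) ≈ 20:1, equivalent to a posterior probability of about 0.95 for model i under equal priors.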