Learning predictive statistics from temporal sequences: Dynamics and strategies

Human behavior is guided by our expectations about the future. Often, we make predictions by monitoring how event sequences unfold, even though such sequences may appear incomprehensible. Event structures in the natural environment typically vary in complexity, from simple repetition to complex probabilistic combinations. How do we learn these structures? Here we investigate the dynamics of structure learning by tracking human responses to temporal sequences that change in structure unbeknownst to the participants. Participants were asked to predict the upcoming item following a probabilistic sequence of symbols. Using a Markov process, we created a family of sequences, from simple frequency statistics (e.g., some symbols are more probable than others) to context-based statistics (e.g., symbol probability is contingent on preceding symbols). We demonstrate the dynamics with which individuals adapt to changes in the environment's statistics—that is, they extract the behaviorally relevant structures to make predictions about upcoming events. Further, we show that this structure learning relates to individual decision strategy; faster learning of complex structures relates to selection of the most probable outcome in a given context (maximizing) rather than matching of the exact sequence statistics. Our findings provide evidence for alternate routes to learning of behaviorally relevant statistics that facilitate our ability to predict future events in variable environments.

First, we describe the context length modeling, which considers participant's responses as a weighted combination of multiple Markov processes. This modelling approach enables us to track participants' learning in a unified framework. When participants are first exposed to the sequences, we reason that their responses will be driven by random guesses, which corresponds to a special setting of the zero-order (memory-less) Markov model. We reason that participants' responses will be later refined after having observed that some symbols are presented more frequently than others in a given context. At trial t, our model tracks the context-length learning using three components corresponding to no memory (zero-order Markov model M 0 t using empty context ∅ of zero length), shallow memory (first-order Markov model M 1 t using context C 1 t of length one) and deeper memory (variable length Markov model of maximum order two M 2 t using context C 2 t of length of one or two): where w k t is the mixture coefficient at trial t of the component model M k t and represents the probability of using M k t for prediction at trial t. The mixture coefficients w k t can be thought of as expressing the "strength" at trial t of individual components M k t in the overall mixture model. By tracking the mixture coefficients over trials, we dynamically assess whether participants' responses can be accounted for by the different Markov structures. Note that in the previous sections, we use i to index consecutive symbols in a Markov sequence or, equivalently, stimuli in a continuous stimulus stream. Here, t is used to index consecutive trials which consist of 8-14 stimuli each.
Given the history of last seen symbols, model (1) gives the following probability to a target symbol s: The predictive contingency distribution allows us to quantify how the participants' responses compare to the context-conditional probabilities of the generating Markov model.
To track a participant's responses over time, mixture coefficients as well as the mixture components themselves are updated after each response. We calculate whether the participant's response is more likely to be driven by a particular mixture component (e.g. Markov model of order one), and update the weight for the components accordingly. This updating is implemented in Bayesian terms: given the participant's response at trial t, each mixture coefficient w k t−1 is updated proportionally to both the current coefficient value ('strength' of M k t−1 ) and the likelihood of M k t−1 given the participant's response. Considering w k t−1 and w k t as the prior and posterior for the model order k, Bayesian update of w k t reads: Similarly, by applying the Bayes rule, the context-conditional predictive distribution is obtained: Here we introduce a noise model δ representing uncertainty in the participant's prediction. If we assumed that the participant is certain about their prediction, δ would be a noise-less model (delta-function), As the participants learn during the sequence presentation, we expect that modeling of the participant's responses changes smoothly over time. This is represented by a partial adaptation of the modeling parameters towards the ideal Bayesian updates (3) and (4): and where 0 < η p < 1 and 0 < η w < 1. Here we introduced tuning parameters η p , η w (i.e. the adaptation rate) to calibrate the response-tracking modeling and ensure smooth learning curves. The case of η p = η w = 1 corresponds to ideal Bayesian updating. From the learning dynamics point of view, for η p , η w < 1, the change of M t over t is effectively smoothed. In our study, we assume the adaptation rates attain the same value, η p = η w = η, and set η to 0.04. We tested whether the results of our modeling change with different adaptation parameters η. The principal results remain unchanged with a high tolerance to the tuning parameter (η in the range from 0.03 to 0.1), validating the robustness of our modeling approach.
We hypothesize three likely sequence structures (i.e. level-0, level-1 and level-2) and track whether these structures account for the participants' responses. It should be noted that we introduced a special structure for level-0 data, where participants' responses are driven by target probability, rather than context-target contingencies. Unlike the general three-component modeling described above, for level-0 we only employ two mixture components related to zeroorder memory: a fixed random guess mixture component (all 4 symbols have equal probability of 25%) and a flexible memory-free process where the probabilities of the 4 symbols are treated as free variables and can be updated based on participants' responses.
To initialize the response-tracking modeling, we imposed a prior reflecting a possible memory structure used by the participants at the start of training. We initialized the mixture coefficients controlling the degree of a participant's initial preference for one mixture component over other(s) as follows: • At level-0 we monitor whether the participants switch from fixed random guess to an appropriate level-0 model. As the participants' performance is initially guided by random guesses, we set the initial coefficient of random guess and flexible level-0 model to w and 1 − w, respectively. We give a clear preference to random guesses by setting w to 0.8.
• At level-1 we monitor how the participants switch from a level-0 to a level-1 model. To reflect the participants' initial preference for a simpler model, we set the initial coefficient of level-0 and level-1 model to w = 0.8 and 1 − w = 0.2, respectively.
• For level-2 preceded by training at level-1 (Group 0 and 1), we set the initial coefficients as follows: As the participants had learnt level-1 sequences, we set the coefficient w of the level-1 model to w = 0.8. The remaining small weight 1 − w = 0.2 is distributed among the level 0 and 2 in the proportion 2 3 and 1 3 , respectively, reflecting the intuition that the more complex level-2 model is a-priori less likely. Hence the initial coefficients for level-0, level-1, and level-2 models are set to 0.13, 0.8 and 0.07, respectively. When participants perform the task at level-2 without preceding training at level-1 (Group 2), the coefficients are set to w = 0.8, 2 3 · (1 − w), and 1 3 · (1 − w), i.e. 0.8, 0.13, 0.07, respectively. i.e., the level 0 model is assumed to be naturally preferred.

Context-length analysis
When using the dynamic mixture modeling approach (1)-(6) to the response data from a single participant, we obtain time series of mixture coefficients {w k t } and the context-target contingencies {P k t }, k = 0, 1, 2, indexed by the trial index t. The initial settings described above may produce a small systematic variation on the mixture coefficients w k t . Therefore, to deal with uncertainty about the participants' initial preferences, we vary the initial mixture coefficients by setting w to a range of 5 values: 0.7, 0.75, 0.8, 0.85, 0.9. For every data level, participant and mixture component k, we then average the resulting 5 trajectories of w k t into a single mixture coefficient curve w k t . We followed the same method to produce the averaged context-target contingency curves {P k t }.
For each participant and data level k 0 , we are interested in the dynamics across all training blocks for that level k 0 (model order used to generate the data). We denote the series {w k 0 t } by { w t }. The w t curves thus represent the dynamics of participants' learning of the correct memory structure when producing predictions for the target stimuli. As the w t curves have a sigmoid structure in most cases, we represent them through a parametrized sigmoid function where W 0 is the initial mixture coefficient, δ W is the difference between the initial and final mixture coefficients, ξ is the learning rate and τ 0 is the transition point. For further analysis, we characterize the w t curves through the parameter vector Θ = (W 0 , δ W , ξ, τ 0 ).

Simulation-based validation
We formulated a response-tracking analysis to derive the participants' learning curves. We use these curves to identify prototypical profiles of fast and slower learners. To validate this approach, we designed a synthetic learner for which we controlled (i) the speed of learning and (ii) memory-order transition time. Our logic was to control the parameters of the synthetic learner, and then test whether we can recover information about these parameters using response-tracking modeling.
The synthetic learner was presented with sequences viewed by human participants, and was tasked with predicting the upcoming symbol. The learner purposefully had a different architecture from the Markov models used to generate the stimulus sequences. In particular, the learner extracted n-grams; i.e., blocks of consecutive symbols of length L = 1 (unigram), 2 (bigram) or 3 (trigram). During training, it learnt the n-gram probability distribution: where U denotes n-grams, t s is the stimulus index, U, is the observed n-gram at time t s , γ ∈ (0, 1) is a learning rate, and δ(·, ·) is the delta function 1 .
For all levels, the initial n-gram probabilities were uniformly distributed. At any given time the synthetic learner learnt a distribution of n-grams of a fixed length L, however the n-gram length changed over the course of training. For level-0 sequences, learning was initially driven by random guessing (all length-1 n-grams were equally probable); as training proceeded (after N 0 blocks of level-0 sequences) the synthetic learner started to learn n-gram distribution of length L = 1. For level-1 sequences, learning was initially driven by length L = 1 n-grams; as training proceeded (after N 1 blocks of level-1 sequences) the synthetic learner started to learn n-gram distribution of length L = 2. For level-2 sequences, learning was initially driven by length L = 2 n-grams; as training proceeded (after N 2 blocks of level-2 sequences) the synthetic learner started to learn n-gram distribution of length L = 3.
These n-gram distributions were used to generate predicted symbols as follows: If the model had no memory (L = 1; i.e. context length = 0), the synthetic learner drew a random sample from the n-gram distribution. If the context is first or second order (L > 1), after presenting L − 1 symbols (context C) before the target, there were four possible n-grams Us of length L, depending on the target symbol prediction s ∈ {A, B, C, D}. The n-gram C containing the target prediction s was then chosen in a probabilistic manner: That is, the target was predicted to be symbol s with probability P (s|C).
We applied response-tracking modeling to the responses of the synthetic learner. In particular, we fixed the memory-order transition times N 0 , N 1 and N 2 to 1, 1 and 3 blocks, respectively, and varied the learning rate. We then applied the response-tracking modeling to the synthetic learner's responses and obtained the mixture coefficient curves w t for the correct memory order. To investigate the relation between learning rate and the mixture coefficient curves w t , we computed the mean curve w t (γ) averaged across all stimulus sequences of the same level (i.e. memory order). Fig S3 shows that for all levels (level-0, level-1, and level-2), the slopes of the mean curves w t (γ) increase monotonically with training. Moreover, similar to our participants, some synthetic learners extract the correct context length for a given level faster than others (given an appropriate change in the synthetic learner's learning rate). These simulations validate our response-tracking modeling, by demonstrating that using this modeling approach we were able to recover memory-order (i.e. context length) from synthetic data.

Predictive contingency
In addition to learning the context length, the participants need to learn the context-target contingencies. To test to how participants learn context-target contingencies, we used predictive contingency modeling (2). For each level, we compared: (i) the true underlying Markov model used to generate sequences (i.e. level-0, level-1 and level-2) and (ii) an alternative, less complex model (i.e. a random guessing model for level-0, an approximate marginalized level-0 model for level-1 and an approximate marginalized level-1 model for level-2).
To quantify the difference at trial t between a baseline model Q and the predictive contingency model P t obtained by tracking the participants' responses with the mixture model (2), we used the expected Kullback-Leiber (KL) divergence from Q to P t : The overall difference between Q and P t is thus computed as a weighted average of the contextspecific differences (KL-divergences) over all contexts in the baseline model. Here, Q(s|C) is the context-conditional probability of the next symbol following context C in Q, P t (s|C) is the participant's predictive contingency for that context at trial t, and P (C) is the theoretical distribution of contexts for Q obtained from the data generating model. The smaller the divergence between the predictive contingency model and a baseline model, the greater their similarity (in information theoretic terms). Recall that for level-0 stimulus sequences, empty context ∅ (of zero length) is the only context in the baseline model. This implies P (∅) = 1 and Q(s|∅) = Q(s), which leads to KL t = KL Q(s)||P t (s|∅) .
We then computed the difference between the expected KL divergence from the less complex model to the predictive contingency model and the expected KL divergence from the true underlying Markov model to the predictive contingency model. We refer to this quantity as ∆KL, measuring which of the two baseline models is closer (in the information theoretic sense) to the participant's predictive contingency model. Positive values of ∆KL indicate that participant responses approximated the true Markov model used to generate sequences, while negative values indicate responses were based on a less complex model.

Extracting temporal structure using an entropy-based approach
Our response-tracking modeling approach is based on applying different order Markov models to the participants' responses. A potential concern with this approach is that it may overconstrain our understanding of the learning process. As an alternative less constrained approach, we adapted and applied a model for temporal sequence learning [1] based on acquiring a library of symbol sequences (i.e., 'engrams'). The original Jensen et al model acquired a library of engrams using a pruning process based on the entropy of the next-symbol distributions, conditional on the context provided by the preceding symbols. In particular, their algorithm monitored the stream of symbols and computed the entropy of engrams up to seven symbols long (i.e. context order 7). If an engram had low entropy (i.e., the last symbol was predicted almost perfectly by the preceding symbols) it was stored in the dictionary, while contexts with high entropies (i.e., unpredicted by the previous symbols) were pruned out. In this way, the algorithm implemented by Jensen et al was able to use its acquired dictionary to predict upcoming symbols without formal understanding of the generative process that produced the sequence.
Directly applying this method to the sequences in our study, however, was unsuccessful. The Jensen et al model was designed and tested using sequences that were more deterministic than those used in our study. Having minimal stochasticity provides a ready means of pruning the acquired library of contingencies; in our case this fails because our probabilistic sequences were more random, meaning that a simple entropy rule did not provide a ready basis for pruning.
We therefore modified the entropy-based approach to fit the context of our study that uses probabilistic sequences and unsupervised learning. In doing this, we were limited in the context length we could consider. Our human participants learned while making relatively few responses (i.e., we have a sparse dataset), meaning that an attempt to extract context lengths of >2 would lead to insufficient model regularization. As the Jensen et al work did not test human learning, this problem was obviated: their simulations used 1000 item sequences repeated 100 times.
We considered unigrams and bigrams preceding the target as contexts. For all contexts C in the library, we tracked the next-symbol entropies H(C) across training blocks. Using these entropy time series, we derived an entropy-based measure for each context. For each immediate context (e.g. level-1 context A), there exists a number of higher-order contexts (i.e. level-2 contexts AA, BA, CA, and DA). To assess whether a level-1 context A is appropriate, we computed the following measure:

∆H(A) = H(A) -( H(AA) + H(BA) + H(CA) + H(DA)) / 4. If ∆H(A)
is positive, the next-order symbol distribution of context A is less informative than the average information level given by its extended level-2 contexts AA, BA, CA, and DA. Hence higherorder contexts are more appropriate. If ∆H(A) is negative, context A is sufficiently informative and should be preferred to higher-order contexts.
Implementing this entropy-based model for level-2 data (Fig S4; Experiment 1, Group 0) indicates that the participants used the correct contexts for the prediction tasks (i.e. the first-order contexts A and D; second-order contexts AB, XB, BC, and YC). This reproduces the results produced by our response-tracking modeling using a less-constrained approach. However, as the entropy-based model requires a larger number of trials to extract learning strategies from participant responses, its use in tracking learning dynamics is limited. In particular, we could only apply the entropy-based method in the relatively saturated learning regime after the 10 th block of participants' responses; that is, there were insufficient participant responses to reliably estimate entropy in the first 10 blocks.

Functional clustering analysis
Inspecting Fig 3a suggested prototypical differences in mixture coefficient curves w t across participants. To identify prototypical patterns, we used k-means clustering. We first initialized the clustering algorithm by generating n cluster prototypes using randomly selected n learning curves and fitting a sigmoid function to each cluster. To partition the w t curves, we defined the Euclidean distance D ij between a w t -curve w j t and a sigmoid prototype W i t : where T j is the length of w j t , W i t stands for the value of sigmoid function specified by parameter Θ i and evaluated at t. The cluster index for w j t is then i * j = arg min i∈{1,...,n} D ij .
Using these initialized cluster prototypes, the clustering algorithm reassigned individual learning curves into one of the clusters, and then updated the prototypes by re-estimating their parameters Θ i . The parameter Θ i is obtained by minimizing j∈J i D ij , where J i = {j|i * j = i}. Note that this assumes that the differences between w t -curves and the sigmoid prototype are Gaussian distributed with constant variance. This is equivalent to assuming that the noise model on the prototypes is homoeostatic and Gaussian. To monitor the clustering quality, the Sum of Squared Errors (SSE) is used to measure how well the w t -curves within a cluster are represented by the prototype We searched for prototypical sigmoidal functions that give rise to the smallest average SSE. As the clustering process is stochastic, we repeated initialization ten times and selected the solution with minimal SSE. We also repeated the process of prototype re-estimation fifty times to ensure convergence. To determine the number of clusters needed to capture the data, we evaluated how SSE decreases with increasing n. Specifically, the rate of SSE reduction should drop sharply when the number of clusters exceeds the optimal number of clusters n opt . Therefore, n opt can be identified as the 'knee' position in the n-SSE curve [2]. We contrasted the SSE against the number of clusters and assessed the statistical significance using a Bayesian Information Criterion (BIC) [3] that penalised large number of clusters: Note that 4n is the number of free parameters in n sigmoid prototypes, while n i=1 |J i | j=1 T j is the number of observations in the data. Such regularisation favors a smaller number of clusters, subject to sufficiently small SSE, giving n opt = arg min n BIC(n). Using this approach, we found a knee at n = 2 in the SSE curves for level-0, level-1, and level-2 (Fig S2), indicating that two clusters best described the data. against number of clusters n. The optimal number of clusters is identified as the minimum position in the n-BIC curve. In particular, we calculated the overall SSE for different number of clusters n by computing the 'optimal' n (nopt), and plotting the curve of SSE against n. A knee at nopt indicates that clustering accuracy (as measured by SSE) initially increases quickly (with n less than nopt), while it becomes insignificant when the cluster number exceeds nopt. Results from this analysis were confirmed using an additional model selection method, Bayesian Information Criterion (BIC) to determine the optima number of clusters. Statistical significance of nopt clusters is quantified by the difference between BIC(n = nopt +1) and BIC(n = nopt), denoted as ∆BIC. ∆BIC > 6 indicates statistical significance [4]. Thus, nopt = 2 was statistically significant across levels: level-0: ∆BIC=17.1; level-1: ∆BIC=26.9; level-2: ∆BIC=8.5.       : learning 1st-vs. 2nd-order context for level-2 sequences. Probability of using level-1 (dashed lines) and level-2 (solid lines) models for level-2 sequences that are determined by 1st-vs. 2nd-order contexts. Mean data are shown for fast learners (red) vs. slower learners (blue) in Group 0, Group 1 and Group 2, for 1st-and 2nd-order contexts. For level-2, we separated 1st-from 2nd-order trials based on the context order preceding the target in each trial. We then derived a context length model for each set of trials and computed the probability of using level-1 vs. level-2 models for each trial. This analysis showed that for 1st-order trials, there was no difference between individuals with different learning profiles (as characterized by functional clustering analyses) in Groups 0 and 1. In contrast, the responses of Group 2 participants could be explained by a 1st-order model; that is, participants used 1st-order context to predict stimuli that were preceded 2nd-order contexts during the initial phase of learning. Interestingly, fast learners started using 2nd-order contexts earlier than slower learners who continued to rely on 1st-order contexts. Further, this analysis showed that learning 1st-order contexts intermixed with 2nd-order contexts (as in the variable context length model used for level-2) is more difficult than learning 1st-order contexts presented alone (as in Group 1, level-1). That is, participants extracted the correct 1st-order context length earlier in the training when they were exposed only to level-1 sequences (Group 1, level-1, Fig S5a) than when they were presented with variable context length sequences (Group 2, level-2, Fig S6). This suggests that learning variable context length sequences is more difficult not only because of the complexity of higher-order structures but also because participants have to extract more than one context length.  Performance index is shown across training blocks and levels. Fitted data is shown for participants who improved during training (black circles, level-0, n = 8; level-1, n = 6; level-2, n = 3). There were a small number of participants who did not improve during training (grey circles, level-0, n = 1; level-1, n = 1; level-2, n = 1). The random guessing baseline is indicated by dashed lines.