Neural mechanisms underlying the temporal organization of naturalistic animal behavior

Naturalistic animal behavior exhibits a strikingly complex organization in the temporal domain, with variability arising from at least three sources: hierarchical, contextual, and stochastic. What neural mechanisms and computational principles underlie such intricate temporal features? In this review, we provide a critical assessment of the existing behavioral and neurophysiological evidence for these sources of temporal variability in naturalistic behavior. Recent research converges on an emergent mechanistic theory of temporal variability based on attractor neural networks and metastable dynamics, arising via coordinated interactions between mesoscopic neural circuits. We highlight the crucial role played by structural heterogeneities as well as noise from mesoscopic feedback loops in regulating flexible behavior. We assess the shortcomings and missing links in the current theoretical and experimental literature and propose new directions of investigation to fill these gaps.


Introduction
Naturalistic animal behavior exhibits a striking amount of variability in the temporal domain ( Figure 1A). An individual animal's behavioral variability can be decomposed across at least three axes: hierarchical, contextual, and stochastic. The first source of variability originates from the vast hierarchy of timescales underlying self-initiated, spontaneous behavior ranging from milliseconds to minutes in animals (and to years for humans). On the sub-second timescale, animals perform fast movements varying from tens to hundreds of milliseconds. In rodents, these movements include whisking, sniffing, and moving their limbs. On the slower timescales of seconds, animals concatenate these fast movements into behavioral sequences of self-initiated actions, such as exploratory sequences (moving around an object while sniffing, whisking, and wagging their noses) or locomotion sequences (coordinating limb and head movements to reach a landmark). These sequences follow specific syntax rules (Berridge et al., 1987) and can last several seconds. On the timescales of minutes or longer, mice may repeat the 'walk and explore' behavioral sequence multiple times, when engaged in some specific activity, such as foraging, persisting toward their goal for long periods. In these simple examples, a freely moving mouse exhibits behavior whose temporal organization varies over several orders of magnitudes simultaneously, ranging from the sub-second scale (actions), to several seconds (behavioral sequences), to minutes (goals to attain). A leading theory to explain the temporal organization of naturalistic behavior is that behavioral action sequences arrange in a hierarchical structure (Tinbergen, 1951;Dawkins, 1976;Simon, 1962), where actions are nested into behavioral sequences that are then grouped into activities. This hierarchy of timescales is ubiquitously observed across species during naturalistic behavior.
The second source of temporal variability stems from the different contexts in which a specific behavior can be performed. For example, reaction times to a delivered stimulus can be consistently faster when an animals is expecting it compared to slower reaction times when the stimulus is unexpected (Jaramillo and Zador, 2011). Contextual sources can be either internally generated (modulations of brain state, arousal, expectation) or externally driven (pharmacological, optogenetic or genetic manipulations; or changes in task difficulty or environment).
After controlling for all known contextual effects, an individual's behavior still exhibits a large amount of residual temporal variability across repetitions of the same behavioral unit. This residual variability across repetitions can be quantified as a property of the distribution of movement durations, which is typically very skewed with an exponential tail (Wiltschko et al., 2015;Findley et al., 2021;Recanatesi et al., 2022), suggesting a stochastic origin. We define as stochastic variability the fraction of variability in the expression of a particular behavioral unit, which cannot be explained by any readily measurable variables.
These three aspects of temporal variability (hierarchical, contextual, stochastic) can all be observed in the naturalistic behavior of a single individual during their lifetime. Other sources of behavioral variability have also been investigated. Variability in movement execution and body posture (Dhawale et al., 2017) may originate from the motor periphery, such as noise in force production within muscles (van Beers et al., 2004), or from the deterministic chaotic dynamics arising from neural activity (Costa

A B
Left turn Figure 1. Neural mechanisms underlying the temporal organization of naturalistic animal behavior. (A) Three sources of temporal variability in naturalistic behavior: hierarchical (from fast movements, to behavioral action sequences, to slow activities and long-term goals), contextual (reaction times are faster when stimuli are expected), and stochastic (the distribution of 'turn right' action in freely moving rats is right-skewed). (B) Neural mechanisms underlying each source of temporal variability: hierarchical variability may arise from recurrent networks with a heterogeneous distribution of neural cluster sizes; contextual modulations from neuronal gain modulation; stochastic variability from metastable attractor dynamics where transitions between attractors are driven by low-dimensional noise, leading to right-skewed distributions of attractor dwell times. Panel (A) adapted from Figure 2 of Jaramillo and Zador, 2011. Panel (A, bottom) reproduced from Figure 6 of Findley et al., 2021Findley et al., . et al., 2021 or the biomechanical structure of an animal's body (Loveless et al., 2019). One additional source is individual phenotypic variability across different animals, an important aspect of behavior in ethological and evolutionary light (Honegger and de Bivort, 2018). Although we will briefly discuss some features of variability in movement execution and phenotypic individuality, the main focus of this review will be on the three main axes of variability described above, which pertain to the behavior of a single animal. From this overview, we conclude that the temporal structure of naturalistic, self-initiated behavior can be decomposed along at least three axes of temporal variability ( Figure 1A): a large hierarchy of simultaneous timescales at which behavior unfolds, ranging from milliseconds to minutes; contextual modulations affecting the expression of behavior at each level of this hierarchy; and stochasticity in the variability of the same behavioral unit across repetition in the same context. These three axes may or may not interact depending on the scenarios and the species. Here, we will review recent theoretical and computational results establishing the foundations of a mechanistic theory that explains how these three sources of temporal variability can arise from biologically plausible computational motifs. More specifically, we will address the following questions: 1. Neural representation: How are self-initiated actions represented in the brain and how are they concatenated into behavioral sequences? 2. Stochasticity: What neural mechanisms underlie the temporal variability observed in behavioral units across multiple repetitions within the same context? 3. Context: How do contextual modulations affect the temporal variability in behavior, enabling flexibility in action timing and behavioral sequence structure? 4. Hierarchy: How do neural circuits generate the vast hierarchy of timescales from milliseconds to minutes, hallmark of naturalistic behavior?
The mechanistic approach we will review is based on the theory of metastable attractors ( Figure 1B), which is emerging as a unifying principle expounding many different aspects of the dynamics and function of neural circuits . We will first establish a precise correspondence between behavioral units and neural attractors at the level of self-initiated actions (the lowest level of the temporal hierarchy). Then we will show how the emergence of behavioral sequences originates from sequences of metastable attractors. Our starting point is the observation that transitions between metastable attractors can be driven by the neural variability internally generated within a local recurrent circuit. This mechanism can naturally explain the action timing variability of stochastic origin. We will examine biologically plausible mesoscopic circuits that can learn to flexibly execute complex behavioral sequences. We will then review the neural mechanisms underlying contextual modulations of behavioral variability. We will show that the average transition time between metastable attractors can be regulated by changes in single-cell gain. Gain modulation is a principled neural mechanism mediating the effects of context, which can be induced by either internal or external perturbations and supported by different neuromodulatory and cortico-cortical pathways; or by external pharmacological or experimental interventions. Finally, we will review how a large hierarchy of timescales can naturally and robustly emerge from heterogeneities in a circuit's structural connectivity motifs, such as neural clusters with heterogeneous sizes. Although most of the review is focused on the behavior of Table 1. Definitions of behavioral units (see Appendix 1 for how to extract these features from videos).

Action
The simplest building blocks of behavior are at the lowest level of the hierarchy. This definition depends on the level of granularity (spatiotemporal resolution) required and the scientific questions to be investigated (see Appendix 1). A widely used definition of action is a short stereotypical trajectory in posture space (synonyms include movemes, syllables [Anderson and Perona, 2014;Brown and de Bivort, 2018]), operationally defined as a discrete latent trajectory of an autoregressive state-space model fit to pose-tracking time-series data ( Figure 2; Wiltschko et al., 2015;Findley et al., 2021). An alternative definition is in terms of short spectrotemporal representations from a time-frequency analysis of videos (Berman et al., 2016;Marshall et al., 2021). Examples include poking in or poking out of a nose port, waiting at a port, pressing a lever.
individual animals, we discuss how recent results on multianimal interactions and social behavior may challenge existing theories of naturalistic behavior and brain function. Appendices 1-3 provide guides to computational methods for behavioral video analyses and modeling; theoretical and experimental aspects of attractors dynamics; and biologically plausible models of metastable attractors.

The stochastic nature of naturalistic behavior
The study of naturalistic behavior based on animal videos has recently undergone a revolution due to the spectacular accuracy and efficiency of computational methods for animal pose tracking (Stephens et al., 2008;Berman et al., 2014;Hong et al., 2015;Mathis et al., 2018;Pereira et al., 2019;Nourizonoz et al., 2020;Segalin et al., 2020); see Appendix 1 and Table 1 for details). These new methods work across species and conditions, ushering a new era for computational neuroethology (Anderson and Perona, 2014;Brown and de Bivort, 2018;Datta et al., 2019). They have led to uncovering a quantitative classification of self-initiated behavior revealing a stunning amount of variability both in its lexical features (which actions to choose, in which order) and in its temporal dimension (when to act) (Berman et al., 2016;Wiltschko et al., 2015;Markowitz et al., 2018;Marshall et al., 2021;Recanatesi et al., 2022). Different ways to characterize the behavioral repertoire on short timescales have been developed, depending on the underlying assumptions of whether the basic units of behavior are discrete or continuous (Appendix 1). One can define stereotypical behaviors or postures by clustering probability density maps of spectrotemporal features extracted from behavioral videos, as in the example of fruit flies (Berman et al., 2014;Berman et al., 2016;(Figure 2A) or rats (Marshall et al., 2021; Figure 7B). Alternatively, one can capture actions or postures as discrete  Figure 2. The complex spatiotemporal structure in naturalistic behavior. (A) Identification of behavioral syllables in the fruit fly. From left to right: raw images of Drosophila melanogaster (1) are segmented from background, rescaled, and aligned (2), then decomposed via principal component analysis (PCA) into a low-dimensional time series (3). A Morlet wavelet transform yields a spectrogram for each postural mode (4), mapped into a twodimensional plane via t-distributed stochastic neighbor embedding (t-SNE) (5). A watershed transform identifies individual peaks from one another (6). states of a state-space model based on Markovian dynamics, each state represented as a latent state autoregressive trajectory accounting for stochastic movement variability ( Figure 2). In the case of discretization of the behavioral repertoire, the definition of the smallest building block of behavior depends on the level of granularity that best suits the scientific question being investigated. Within this discrete framework, state-space models were applied successfully to Caenorhabditis elegans (Linderman et al., 2019), Drosophila (Tao et al., 2019), zebrafish , and rodents (Wiltschko et al., 2015;Markowitz et al., 2018). At the timescale of actions and postures, transition times between consecutive actions are well described by a Poisson process (Killeen and Fetterman, 1988), characterized by a right-skewed distribution of inter-action intervals ( Figure 3). In this and the next section, we will focus on short timescales (up to a few seconds) where state-space models can provide parsimonious accounts of behavior. However, these models fail to account for longer timescale and non-Markovian structure in behavior, and in later sections we will move to a more data-driven approach to investigate these aspects of behavior (see Appendix 1).

Self-initiated actions and ensemble activity patterns in premotor areas
We begin the investigation of naturalistic behavior starting at the lowest level of the hierarchy and examining the neural mechanisms underlying self-initiated actions. How does the brain encode the intention to perform an action? How are consecutive actions concatenated into a behavioral sequence? Unrestrained naturalistic behavior, as it has been recently characterized in the foundational studies mentioned above, revealed a vast repertoire of dozens to hundreds of actions (Berman et al., 2014;   In the Marshmallow experiment, a freely moving rat poked into a wait port, where tone 1 signaled the trial start. The rat could poke out at any time (after tone 2, played at random times, in patient trials; or before tone 2, in impatient trials) and move to the reward port to receive a water reward (large and small for patient and impatient trials, respectively  Wiltschko et al., 2015;Markowitz et al., 2018;Marques et al., 2018;Costa et al., 2019;Schwarz et al., 2015) (although the repertoire depends on the coarse-graining scale of the behavioral analysis), leading to a combinatorial explosion in the number of possible action sequences. To tame the curse of dimensionality, typical of unconstrained naturalistic behavior, a promising approach is to control for the lexical variability in behavioral sequences and design a naturalistic task where an animal performs a single behavioral sequence of self-initiated actions, where each action retains its unrestrained range of temporal variability. Murakami et al., 2014, Murakami et al., 2017, and Recanatesi et al., 2022 adopted this strategy to train freely moving rats to perform a self-initiated task (the rodent version of the 'Marshmallow task'; Mischel and Ebbesen, 1970, Figure 3A), where a specific set of actions had to be performed in a fixed order to obtain a reward. The many repetitions of the same behavioral sequence yielded a large sample size to elucidate the source of temporal variability across trials. Action timing retained the temporal variability hallmark of naturalistic behavior, characterized by skewed distributions of action durations ( Figure 3A). What is the neural mechanism generating the large variability in action timing? A large number of studies implicated the secondary motor cortex (M2) in rodents as part of a distributed network involved in motor planning in head-fixed mice (Li et al., 2016;Barthas and Kwan, 2017) and controlling the timing of self-initiated actions in freely moving rats (Murakami et al., 2014;Murakami et al., 2017). Ensemble activity recorded in M2 during the Marshmallow task unfolded via sequences of multineuron-firing patterns, each one lasting hundreds of milliseconds to a few seconds; within each pattern, neurons fired at an approximately constant rate ( Figure 3B). Such long dwell times, much longer than typical single-neuron time constants, suggest that the observed metastable patterns may be an emergent property of the collective circuit dynamics within M2 and reciprocally connected brain regions. Crucially, both neural and behavioral sequences were highly reliable yet temporally variable, and the distribution durations of action and neural pattern durations were characterized by a right-skewed distribution. This temporal heterogeneity suggests that a stochastic mechanism, such as that found in noise-driven transitions between metastable states, could contribute to driving transitions between consecutive patterns within a sequence (see below). A dictionary between actions and neural pattern could be established, revealing that the onset of specific patterns reliably preceded upcoming self-initiated actions ( Figure 3A and B;e.g., the onset of the red pattern reliably precedes the poke out movement). The dictionary trained on the rewarded sequence generalized to epochs where the animal performed erratic nonrewarded behavior, where pattern onset predicted upcoming actions as well. The use of state-space models with underlying Markovian dynamics as generative models, capturing both naturalistic behavior (Wiltschko et al., 2015;Batty et al., 2019;Johnson et al., 2020;Parker et al., 2021;Findley et al., 2021) and the underlying neural pattern sequences (Maboudi et al., 2018;Linderman et al., 2019;Recanatesi et al., 2022), is a powerful tool to bridge the first two levels of the temporal hierarchy in naturalistic behavior: a link from actions to behavioral sequences. This generative framework further revealed fundamental aspects of neural coding in M2 ensembles, such as their distributed representations, dense coding, and single-cell multistability (Recanatesi et al., 2022;Mazzucato et al., 2015).

Action timing variability from metastable attractors
Neural patterns in M2 may represent attractors of the underlying recurrent circuit dynamics. Attractors are a concept in the theory of dynamical systems, defined as the set of states a system evolves to starting from a large set of initial conditions. The central feature of an attractor is that, if the system is perturbed slightly away from an attractor, it tends to return to it. Within the context of neural activity, an attractor is a persisting pattern of population activity where neurons fire at an approximately constant rate for an extended period of time (see Appendix 1 for models of attractor dynamics). Foundational work in head-fixed mice showed that licking preparatory activity in M2 during a delayed response task is encoded by choice-specific discrete attractors (Inagaki et al., 2019). Attractors can be stable (Amit and Brunel, 1997), as observed in monkey IT cortex in working memory tasks (Fuster and Jervey, 1981;Miyashita and Chang, 1988). Attractors can also be metastable (see Appendix 3), when they typically last for hundreds of milliseconds and noise fluctuations spontaneously trigger transitions to a different attractor (Deco and Hugues, 2012;Litwin-Kumar and Doiron, 2012).
Metastable attractors can be concatenated into sequences, which can either be random, as observed during ongoing periods in sensory cortex (Mazzucato et al., 2015), or highly reliable, Mazzucato encoding the evoked response to specific sensory stimuli (Jones et al., 2007;Miller and Katz, 2010;Mazzucato et al., 2015), or underlying freely moving behavior (Maboudi et al., 2018;Recanatesi et al., 2022). In particular, M2 ensemble activity in the Marshmallow experiment was consistent with the activity generated by a specific sequence of metastable attractors ( Figure 3). The main hypothesis underlying this model is that the onset of an attractor drives the initiation of a specific action as determined by the action/pattern dictionary ( Figure 3A and B) and the dwell time in a given attractor sets the inter-action-interval. The dynamics of the relevant motor output and the details of variability in movement execution are generated downstream of this attractor circuit (see Figure 5 and 'Discussion' section). The main features observed in the M2 ensemble dynamics during the Marshmallow task (i.e., long-lived patterns with a right-skewed dwell-time distribution, concatenated into highly reliable pattern sequences) can be explained by a two-area mesoscopic network where a large recurrent circuit (representing M2) is reciprocally connected to a small circuit lacking recurrent couplings (a subcortical area likely representing the thalamus [Guo et al., 2018;Guo et al., 2017;Inagaki et al., 2022], see Figure 5C). In this biologically plausible model, metastable attractors are encoded in the M2 recurrent couplings, and transitions between consecutive attractors are driven by low-dimensional noise fluctuations arising in the feedback projections from the subcortical nucleus back to M2. As a consequence of the stochastic origin of the transitions, the distribution of dwell times for each attractor is right-skewed, closely matching the empirical data. This model's prediction was confirmed in the data, where the ensemble fluctuations around each pattern were found to be low-dimensional and oriented in the direction of the next pattern in the sequence ( Figure 3). This model presents a new interpretation for low-dimensional (differential) correlations: although their presence in sensory cortex may be detrimental to sensory encoding (Moreno-Bote et al., 2014), their presence in motor circuits seems to be essential for motor generation during naturalistic behavior (Recanatesi et al., 2022).

Open issues
The hypothesis that preparatory activity for upcoming actions is encoded in discrete attractors in M2 has been convincingly demonstrated using causal manipulations in head-fixed preparations in mice (see Appendix 1), although a causal test of this hypothesis in freely moving animals is currently lacking. A shortcoming of the metastable attractor model of action timing in Recanatesi et al., 2022 is the unidentified subcortical structure where the low-dimensional variability is originating. Thalamus and basal ganglia are both likely candidates as part of a large reciprocally connected mesoscopic circuit underlying action selection and execution (see Figure 5B), and more work is needed to precisely identify the origin of the low-dimensional variability. The metastable attractor model assumes the existence of discrete units of behavior at the level of actions, although large variability in movement execution originating downstream of cortical areas may blur the distinction between the discrete behavioral units. The extent to which behavior can be interpreted as a sequence of discrete behavioral units or, rather, a superposition of continuously varying poses (see Appendix 1 for an in-depth discussion of this issue) is currently open for debate.
At higher levels of the behavioral hierarchy, repetitions of the same behavioral sequence (such as a jump attempt, an olfactory search trial, or a waiting trial, Figures 2 and 3) exhibit large temporal variability as well, characterized by right-skewed distributions (Lottem et al., 2018). It remains to be examined whether temporal variability in sequence duration may originate from a hierarchical model where sequences themselves are encoded in slow-switching metastable attractors in a higher cortical area or a distributed mesoscopic circuit (see Figure 5C and Figure 9).

A theory of metastable dynamics in biologically plausible models
A variety of neural circuit models have been proposed to generate metastable dynamics (see Appendix 3). However, a full quantitative understanding of the metastable regime is currently lacking. Such theory is within reach in the case of recurrent networks of continuous rate units. In circuits where metastable dynamics arises from low-dimensional correlated variability (Recanatesi et al., 2022), dynamic mean field methods could be deployed to predict the statistics of switch times from underlying biological parameters. In biologically plausible models based on spiking circuits, it is not known how to quantitatively predict switching times from underlying network parameters. Phenomenological birth-death processes fit to spiking network simulations can give a qualitative understanding of the on-off cluster dynamics (Huang and Doiron, 2017;Shi et al., 2022), and it would be interesting to derive these models from first principles. Mean field methods for leaky-integrate-and-fire networks can give a qualitative prediction of the effects of external perturbations on metastable dynamics, explaining how changes in an animal's internal state can affect circuit dynamics Wyrick and Mazzucato, 2021). These qualitative approaches should be extended to fully quantitative ones. A promising method, deployed in random neural networks, is based on the universal colored-noise approximation to the Fokker-Planck equation, where switch times between metastable states can be predicted from microscopic network parameters such as neural cluster size (Stern et al., 2021). Finally, a crucial direction for future investigation is to improve the biological plausibility of metastable attractor models to incorporate different inhibitory cell types. Progress along this line will open the way to quantitative experimental tests of the metastable attractor hypothesis using powerful optogenetic tools.

Stochastic individuality
One important source of behavioral variability is phenotypic variability across different individuals with identical genetic profile, an important aspect of behavior in ethological and evolutionary light (Honegger and de Bivort, 2018). Stochastic individuality is defined as the part of the phenotypic variability in nonheritable effects that cannot be predicted from measurable variables such as learning or other developmental conditions -such as behavioral differences in identical twins reared in the same environment. Signatures of stochastic individuality have been found in rodents (Kawai et al., 2015;Marshall et al., 2021) and flies (Tao et al., 2019). Existing theoretical models have not examined which neural mechanisms may underlie this individuality. This variability may arise from differences in developmental wiring of brain circuits related to axonal growth. The metastable attractor model in Recanatesi et al., 2022 may naturally accommodate some stochastic individuality. The location of each attractor in firing rate space is drawn from a random distribution, so that across-animal variability may stem from different random realization of the attractor landscape with the same underlying hyperparameters (i.e., mean and covariance of the neural patterns).

Variability in movements and body posture
A large source of variability in behavior originates from variability in movement execution (Dhawale et al., 2017). In a delayed response task in overtrained primates, preparatory neural activity in premotor areas could only account for half of the trial-to-trial variability in movement execution (Churchland et al., 2006). This fraction of motor variability originating from preparatory neural activity can be potentially explained by the mesoscopic attractor model in Figure 3D. The remaining fraction may originate from the motor periphery, such as noise in force production within muscles (van Beers et al., 2004), which is not accounted for by our attractor models. In humans, trial-to-trial motor variability can be interpreted as a means to update control policies and motor output within a reinforcement learning paradigm (Wu et al., 2014). It is challenging to dissect this finer movement variability from the state-space model approach to behavioral classification as it aims at capturing discrete stereotypical movement features. Variability in postural behavior can also stem from purely deterministic dynamics, which have been modeled using state-space reconstruction based on delay embedding of multivariate time series (Costa et al., 2019;Costa et al., 2021;Costa et al., 2021). These methods are fundamentally different from state-space models: while the latter assume an underlying probabilistic origin of variability, the former assume that variability arises from deterministic chaos. In order to capture the finer scale of movements and body posture in C. elegans (Schwarz et al., 2015;Szigeti et al., 2015), a data-driven state-space reconstruction revealed that locomotion behavior exhibits signature of chaos and can be explained in terms of unstable periodic orbits . A completely different approach posits that variability in movements emerges from an animal's body mechanics in the absence of any neural control. Strikingly, in a mechanical model of the Drosophila larvae body, the animal's complex exploratory behaviors were shown to emerge solely based on the features of the body biomechanics in the absence of any neural dynamics (Loveless et al., 2019). In particular, behavioral variability stems from an anomalous diffusion process arising from the deterministic chaotic dynamics of the body.

Contextual modulation of temporal variability
The second source of temporal variability in naturalistic behavior arises from contextual modulations, which can be internally or externally driven. When internally driven, they may arise from changes in brain or behavioral state such as arousal, expectation, or task engagement. When externally driven, they may arise from changes in environmental variables, from the experimenter's imposed task conditions, or from manipulations such as pharmacological, optogenetic, or genetic ones. Contextual modulations can affect several qualitatively different aspects of behavioral units at each level of the hierarchy (actions, sequences, activities): average duration; usage frequency; and transition probabilities between units. Moreover, context may also change the motor execution of a behavioral unit, for example, by improving the vigor of a certain movement upon learning or motivation. For each type of modulation, we will give several examples and review computational mechanisms that may explain them.

Action timing
The distribution of self-initiated action durations typically exhibits large variability, whose characteristic timescale can be extracted from their average duration ( Figure 3). The average timing of an action is strongly modulated by contextual factors, both internally and externally driven. Examples of internally generated contextual factors include expectation and history effects. When events occur at predictable instants, anticipation improves performance such as reaction times. This classic effect of expectation was documented in an auditory two-alternative choice task (Jaramillo and Zador, 2011), where freely moving rats were rewarded for correctly discriminating the carrier frequency of a frequencymodulated target sound immersed in pure-tone distractors ( Figure 4A). The target could occur early or late within each sound presentation, and temporal expectations on target timing were modulated by changing the ratio of trials with early or late targets within each block. When manipulating the expectations about sound timing, valid expectations accelerated reaction times and improved detection accuracy, showing enhanced perception. The auditory cortex is necessary to perform this task, and firing rates in auditory cortex populations are modulated by temporal expectations. Contextual changes in brain state may also be induced by varying levels of neuromodulators. In the self-initiated waiting task of Figure 3A, optogenetic activation of serotonergic neurons in the dorsal raphe nucleus selectively prolonged the waiting period, leading to a more patient and less impulsive behavior, but did not affect the timing of other self-initiated actions (Fonseca et al., 2015). Internally generated contextual modulations include history effects, which can affect the timing of self-initiated actions. In an operant conditioning task, where freely moving mice learned to press a lever for a minimum duration to earn a reward (Fan et al., 2012), the distribution of action timing showed dependence on the outcome of the preceding trial. After a rewarded trial, mice exhibited longer latency to initiate the next trial, but shorter press durations; after a failure, the opposite behavior occurred, with shorter latency to engage and longer press durations. Trial history effects are complex and actionspecific and depend on several other factors, including prior movements, and wane with increasing inter-trial intervals. Inactivation experiments showed that these effects rely on frontal areas such as medial prefrontal and secondary motor cortices (Murakami et al., 2014;Murakami et al., 2017;Schreiner et al., 2021).
Although the contextual modulations considered so far occur on a fast timescale of a few trials, they may also be the consequence of associative learning. In a lever press task, mice learned to adjust the average duration of the lever press to different criteria in three different conditions where lever presses were always rewarded regardless of duration or only rewarded if longer than 800 or 1600 ms. Within each of the three conditions, the distribution of action timing exhibited large temporal variability, yet the average duration was starkly different between the three criteria as the mice learned the different criteria (Fan et al., 2012;Schreiner et al., 2021). Reaction times to sensory stimuli and self-initiated waiting behavior, in the form of long lever presses or nose-pokes, have emerged as a fruitful approach to test hypotheses on contextual modulations and decision-making in naturalistic scenarios.

Controlling action timing via neuronal gain modulation
The paramount role of contextual modulations in regulating action timing during naturalistic scenarios has been well documented. However, the neural mechanisms underlying these effects remain elusive. Results from head-fixed preparations revealed some possible explanations, which have the potential to generalize to the freely moving case. Within the paradigm of metastable attractors (see Appendix 3), the speed at which cortical activity encodes incoming stimuli can be flexibly controlled in a statedependent manner by transiently changing the baseline level of afferent input currents to a local cortical circuit. These baseline changes may be driven by top-down projections from higher cortical areas or by neuromodulators. In a recurrent circuit exhibiting attractor dynamics, changes in baseline levels modulate the average transition times between metastable attractors Wyrick and Mazzucato, 2021). In these models, attractors are represented by potential wells in the network's energy landscape, and the height of the barrier separating two nearby wells determines the probability of transition between the two corresponding attractors (lower barriers are easier to cross and lead to faster transitions, Figure 4B). Changes in input baseline that decrease (increase) the barrier height lead to faster (slower) transitions to the coding attractor, in turn modulating reaction times.
Although it is not possible to measure potential wells directly in the brain, using mean field theory one can show that the height of these potential wells is directly proportional to the neuronal gain as measured by single-cell transfer functions (for an explanation, see Appendix 3). In particular, a decrease (increase) in pyramidal cells gain can lead to faster (slower) average action timing. This biologically plausible computational mechanism was proposed to explain the acceleration of sensory coding observed when gustatory stimuli are expected compared to when they are delivered as a surprise (Samuelsen et al., 2012;Mazzucato et al., 2019); and the faster encoding of visual stimuli observed in V1 populations during locomotion periods compared to when the mouse sits still . Transition rates between attractors may also be modulated by varying the amplitude and color of the fluctuations in the synaptic inputs (see Appendix 3), while keeping the barrier heights fixed. Changes in E/I background synaptic inputs were also shown to induce gain modulation and changes in integration time (Chance et al., 2002).
How can a neural circuit learn to flexibly adjust its responses to stimuli or the timing of self-initiated actions? Theoretical work has established gain modulation as a general mechanism to flexibly control network activity in recurrent network models of motor cortex (Stroud et al., 2018). Individual modulation of each neuron's gain can allow a recurrent network to learn a variety of target outputs through reward-based training and combine learned modulatory gain patterns to generate new movements. After learning, cortical circuits can control the speed of an intended movement through gain modulation and affect the shape or the speed of a movement. Although the model in Stroud et al., 2018 could not account for the across-repetition temporal variability in action timing, it is tempting to speculate that a generalization of this learning framework to incorporate the metastable attractor models of Mazzucato et al., 2019 could allow a recurrent circuit to learn flexible gain modulation via biologically plausible synaptic plasticity mechanisms (Litwin-Kumar and Doiron, 2014). This hypothetical model could potentially explain the contextual effects of learning on action timing observed in Fan et al., 2012;Schreiner et al., 2021, and the acceleration of reaction times in the presence of auditory expectations ( Figure 4A; Jaramillo and Zador, 2011). Although this class of models has not been directly tested in freely moving assays, we believe that pursuing this promising direction could lead to important insights.

Behavioral sequences
Contextual modulations of temporal variability may affect other aspects of natural behavior beyond average action timing, such as the frequency of occurrence of an action ('state usage') or the transition probabilities between actions within a behavioral sequence ( Figure 5A). These effects can be uncovered by analyzing freely moving behavioral videos with state-space models, such as the autoregressive hidden Markov model (HMM) (Wiltschko et al., 2015;Figures 2B and 5A and Appendix 1). Differences in state usage and transition probabilities between conditions may shed light on the computational strategies animals deploy to solve complex ethological tasks, such as those involving sensorimotor integration. For example, in a distance estimation task where mice were trained to jump across a variable gap, a comparison of monocular and binocular mice revealed the different visually guided strategies mice may use to perform a successful jump. Mice performed more vertical head movements under monocular conditions compared to control ( Figure 5A, states 2 and 3 occur more frequently in the monocular condition), revealing a reliance on motion parallax cues (Parker et al., 2021).
During ongoing periods, in the absence of a task, animal behavior features a large variety of actions and behavioral sequences ( Figure 2). Experimentally controlled manipulations can lead to strong changes in ongoing behavior reflected both in changes of state usage and of transition probabilities between actions, resulting in different repertoires of behavioral sequences. Examples of manipulations include exposing mice to innately aversive odors and other changes in their surrounding environment; optogenetic activation of corticostriatal pathways (Wiltschko et al., 2015); and pharmacological treatment (Wiltschko et al., 2020). In the latter study, a classification analysis predicted with high accuracy which drug and specific dose was administered to the mice from a large panel of compounds at multiple doses. Comparison of state usage and transition rates can also reveal subtle phenotypical changes in the structure of ongoing behavior in genetically modified mice compared to wildtype ones. This phenotypic fingerprinting has led to insights into the behavior of mouse models of autism spectrum disorder (Wiltschko et al., 2015;Wiltschko et al., 2020;Klibaite et al., 2021).

Computational mechanisms underlying flexible behavioral sequences
Recent studies have begun to shed light on the rules that may control how animals learn and execute behavioral sequences. These studies revealed various types of contextual modulations such as changes in the occurrence of single actions or in the transition probabilities between pairs of actions and proposed potential mechanisms underlying these effects. Biologically plausible models of mesoscopic neural circuits can generate complex sequential activity (Logiaco et al., 2021;Murray and Escola, 2017). In a recent model of sequence generation (Logiaco and Escola, 2020;Logiaco et al., 2021), an extensive library of behavioral motifs and their flexible rearrangement into arbitrary sequences relied on the interaction between motor cortex, basal ganglia, and thalamus. In this model ( Figure 5B), the basal ganglia sequentially disinhibit motif-specific thalamic units, which in turn trigger motif preparation and execution via a thalamocortical loop with the primary motor cortex (M1). Afferent inputs to M1 set the initial conditions for motif execution. This model represents a biologically plausible neural implementation of a switching linear dynamical system (LDS) (Linderman et al., 2017;Nassar et al., 2018), a class of generative models whose statistical structure can capture the spatiotemporal variability in naturalistic behavior (see Appendix 1).
How do animals learn context-dependent behavioral sequences? Within the framework of corticostriatal circuits, sequential activity patterns can be learned in an all-inhibitory circuit representing the striatum (Murray and Escola, 2017). Learning in this model is based on biologically plausible synaptic plasticity rules, consistent with the decoupling of learning and execution suggested by lesion studies showing that cortical circuits are necessary for learning, but that subcortical circuits are sufficient for expression of learned behaviors (Kawai et al., 2015). This model can achieve contextual control over temporal rescaling of the sequence speed and facilitate flexible expression of distinct sequences via selective activation and concatenation of specific subsequences. Subsequent work  uncovered a new computational mechanism underlying how motor cortex, thalamus, and the striatum coordinate their activity and plasticity to learn complex behavior on multiple timescales (Murray and Escola, 2020). The combination of fast cortical learning and slow subcortical learning may give rise to a covert learning process through which control of behavior is gradually transferred from cortical to subcortical circuits, while protecting learned behaviors that are practiced repeatedly against overwriting by future learning.

Open issues
It remains to clarify the exact extent to which the stochastic and contextual variability are independent sources. In principle, one could hypothesize that detailed knowledge and control of all experimental variables and behavioral state and their history might explain part of the trial-to-trial variability as contextual variability conditioned on these variables. For example, in the Marshmallow task about 10% of the temporal variability in waiting times could be attributed to a trial history effect, which relied on an intact medial prefrontal cortex and was abolished with its inactivation (Murakami et al., 2017). Moreover, in a subset of sessions this fraction of variability could be predicted by the activity of transient neurons before trial onset (Murakami et al., 2014;Murakami et al., 2017). The remaining 90% of the unexplained across-repetition variability was attributed to a stochastic mechanism, likely originating from metastable dynamics (Recanatesi et al., 2022). In general, it is an open question to investigate whether what we think of as noise driving trial-to-trial variability could just be another name for a contextual variable that we have not yet quantified. Alternatively, stochastic variability could be genuinely different from contextual variability and originate from noise inherent in neural spiking or other activity-dependent mechanisms.

A mesoscopic circuit for flexible action sequences
The computational mechanisms discussed so far can separately account for some specific features of contextual modulations, but none of the existing models can account for all of them. Metastable attractor models, where an attractor/action dictionary can be established, explain how the large variability in action timing may arise from correlated neural noise driving transitions between attractors ( Figure 3C). These models can explain how contextual modulations of action timing may arise from neuronal gain modulation ( Figure 4B). However, it is not known whether such models can explain the flexible rearrangement of actions within a behavioral sequence ( Figure 5A). Conversely, models of flexible sequence execution and learning (Logiaco et al., 2021;Murray and Escola, 2017;Stroud et al., 2018) can explain the latter effect, but do not incorporate temporal variability in action timing across repetition. We would like to propose a hypothetic circuit model that combines the cortex-basal ganglia-thalamic model of Figure 5B together with the metastable attractor model of preparatory activity of Figure 3C and D to provide a tentative unified model of temporally variable yet flexible behavioral sequences. In the original model of Figure 5B, the movement motif A in M1 is executed following a specific initial condition x A , representing an external input to M1. In the larger mesoscopic circuit of Figure 5C, we can interpret x A as the preparatory activity for the motif A, dynamically set by an M2 attractor, which encodes the upcoming motif, and mediated by an M2-to-M1 projection.
In the model of Figure 5B, the motif A in M1 is selected by the activation of a thalamic nucleus t A , which is gated in by basal ganglia-to-thalamus projections. In the larger mesoscopic circuit of Figure 5C, the thalamic nuclei are also part of a feedback loop with M2 (already present in the model of Figure 3D) responsible for sustaining the metastable attractors encoding the preparatory activity for motif A . Presynaptic noise in the thalamus-to-M2 projections implement the variability in action timing via noise-driven transitions between M2 attractors. Another prominent source of behavioral variability, the one in movement execution, can potentially be incorporated into this model. Specifically, fluctuations in neural preparatory activity (which account for up to half of the total variability in movement execution; Churchland et al., 2006) can be naturally incorporated as trial-totrial variability in the initial conditions x A,B,C for motifs in M1, originating from firing rate variability in M2 attractors upstream and in the M2-to-M1 projections. This model represents a direct circuit implementation of the state-space models of behavior based on Markovian dynamics: discrete states representing actions/motifs correspond to discrete M2 attractors, whereas the continuous latent trajectories underlying movement execution correspond to low-dimensional trajectories of M1 populations driving muscle movements. Moreover, this model can provide a natural and parsimonious neural implementation for the contextual modulations of behavioral sequences ( Figure 5A). Two different neural mechanisms can drive this behavioral flexibility: either a variation in the depth of the M2 attractor potential wells mediated by gain modulations ( Figure 4B) or a change in the noise amplitude originating at thalamus-to-M2 projections. These contextual modulations induce changes in transition times between consecutive actions/attractors. As a consequence, the temporal dynamics are only Markovian when conditioned on each context and the across-context dynamics exhibits non-Markovian features such as context-specific transition rates. This class of models can serve as a useful testing ground for generating mechanistic hypotheses and guide future experimental design.

Flexible sequences in birdsongs
A promising model system to investigate the neural mechanisms of flexible motor sequences is songbirds. Canary songs consist of repeated syllables called phrases, and the ordering of these phrases follows long-range rules in which the choice of what to sing depends on the song structure many seconds prior (Markowitz et al., 2013). These long-range contextual modulations are correlated to the activity of neurons in the high vocal center (HVC, a premotor area) (Cohen et al., 2020). Seminal work has established that motor variability can be actively generated and controlled by the brain for the purpose of learning (Olveczky et al., 2005;Kao et al., 2005). Variability in song production is not simply due to intrinsic noise in motor pathways but is introduced into robust nucleus of the arcopalium (RA, a motor cortex analog) by a dedicated upstream area lateral magnocellular nucleus of the anterior nidopallium (LMAN, analog to a premotor area), which is required for song learning. Theoretical modeling further revealed that this motor variability could rely on topographically organized projections from LMAN to RA for amplifying correlated neural fluctuations (Darshan et al., 2017). Strikingly, this mechanism provides universal predictions for the statistics of babbling shared by songbirds and human infants. Timing variability may allow animals to explore the temporal aspects of a given sequence of behavior independently of the choices of actions. Animals could learn proper timing of an action sequence by a search in timing space independent of action selection and vice versa. Future work should explore the advantages of temporal variability in driving learning of precise timing. Contextual modulations of behavior may occur on longer timescales as well, including those induced by circadian rhythms, hormones, and other changes in internal states; and they may span the whole lifetime of an individual, such as homeostasis and development. Social behavior provides another source of complex contextual modulations. We briefly address these different aspects of contextual modulations below.

Variability on longer timescales
Internal states, such as hunger (Corrales-Carvajal et al., 2016), and circadian rhythms (Patke et al., 2020) induce daily modulations of several aspects of behavior. In the fruit fly, different clusters of clock neurons are implicated in regulating rhythmic behaviors, including wake-sleep cycles, locomotion, feeding, mating, courtship, and metabolism. Activation of circadian clock neurons in different phases of the cycle drives expression of specific behavioral sequences, and targeted manipulations of particular clusters of clock neurons are sufficient to recapitulate those sequences artificially (Dissel et al., 2014;Yao and Shafer, 2014). In mammals, the circadian pacemaker is located in the hypothalamus, interacting with a complex network of neuronal peripheral signals downstream of it (Hastings et al., 2018;Schibler et al., 2015). Other sources of contextual variability include variations in the levels of hormones and neuromodulators, which modulate behavioral sequences on longer timescales . Although hormones and neuromodulators may not directly drive expression of specific behaviors, they typically prime an animal to elicit hormone-specific responses to particular stimuli in the appropriate behavioral context, as observed prominently during social behavior such as aggression and mating (Schibler et al., 2015). Statistical models based on Markovian dynamics, such as state-space models, whose building blocks are sub-second movements syllables, are challenged by contextual modulations occurring on very long timescales. Introducing by hand a hierarchical structure in these models might mitigate this issue (Tao et al., 2019). However, approaches based on discrete states, such as HMMs, are not well suited to capture long-term modulation of continuous nature as in the case of circadian rhythms. Fitting a discrete model to slow continuous dynamics would lead to the proliferation of a large number of discrete states tiling the continuous trajectory (Alba et al., 2020;Recanatesi et al., 2022). Data-driven models may provide a more promising way Mazzucato to reveal slow behavioral modulations, although they have not been tested in this realm yet (Marshall et al., 2021). Alternative methods based on continuous latents, such as Gaussian processes (Yu et al., 2008), might also provide useful alternative approaches. In trial-based experiments, tensor component analysis was shown to capture slow drifts in latent variables as well (Williams et al., 2018). Although the behavioral effects of these long-term sources of contextual modulations have not been included in current models of neural circuits, the theoretical framework based on attractor networks in Figure 5C could be augmented to account for them. An afferent higher-order cortical area, recurrently connected to M2, may toggle between different behavioral sequences and control long-term variations in their expression, for example, via gain modulation. A natural candidate for this controller area is the medial prefrontal cortex, which is necessary to express long-term biases in the waiting task (Figure 3; Murakami et al., 2017). In this augmented model, a top-down modulatory input to the higher-order controller, representing afferent inputs from circadian clock neurons or neuromodulation, may modulate the expression of different behavioral sequences implementing these long-term modulations.

Aging
Other sources of contextual modulation of behavior may act on even longer timescales spanning the whole lifetime of an individual (Churgin et al., 2017), such as homeostasis and development.
The interplay between these two mechanisms may explain how the motor output of some neural circuits maintains remarkable stability in the face of the large variability in neural activity observed across a population or in the same individual during development (Prinz et al., 2004;Bucher et al., 2005). Across the entire life span, different brain areas develop, mature, and decline at different moments and to different degrees (Sowell et al., 2004;Yeatman et al., 2014). Connectivity between areas develops at variable rates as well (Yeatman et al., 2014). It would be interesting to investigate whether these different mechanisms, unfolding over the life span of an individual, can be accounted for within the framework of multiarea attractor networks presented above.

Social behavior and contextual modulations
Although most of the experiments discussed in this review entail the behavior of individual animals, contextual modulations of behavior are prominently observed in naturalistic assays comprising the interaction between pairs of animals, such as hunting and social behaviors. In the prey-capture paradigm, a mouse pursues, captures, and consumes live insect prey (Hoy et al., 2016;Michaiel et al., 2020). Prey-capture behavior was found to strongly depend on context. Experimental control of the surrounding environment revealed that mice rely on vision for efficient prey-capture. In the dark, the hunting behavior is severely impaired: only at close range to the insect is the mouse able to navigate via auditory cues. Another remarkable example of context-dependent social behavior was demonstrated during male-female fruit fly mating behavior (Calhoun et al., 2019). During courtship bouts, male flies modulate their songs using specific feedback cues from their female partner such as their relative position and orientation. A simple way to model the relationship between sensory cues and the choice of a specific song in terms of linear 'filters' (Coen et al., 2014), where a common assumption is that the sensorimotor map is fixed. Relaxing this assumption and allowing for momentto-moment transition between more than one sensorimotor map via a hidden Markov model with generalized linear model emissions (GLM-HMM), Calhoun et al., 2019 uncovered latent states underlying the mating behavior corresponding to different sensorimotor strategies, each strategy featuring a specific mapping from feedback cues to song modes. Combining this insight with targeted optogenetic manipulation revealed that neurons previously thought to be command neurons for song production are instead responsible for controlling the switch between different internal states, thus regulating the courtship strategies. Finally, a tenet of naturalistic behavior is vocal communication, which combines aspects of sensory processing and motor generation in the realm of complex social interactions. A particularly exciting model system is the marmoset, where new techniques to record neural activity in freely moving animals during social behavior and vocalization (Nummela et al., 2017) together with newly developed optogenetic tools (MacDougall et al., 2016) and multi-animal pose tracking algorithms (Pereira et al., 2020b;Lauer et al., 2021) hold the promise to push the field into entirely new domains (Eliades and Miller, 2017).

Hierarchical structure
The temporal organization of naturalistic behavior exhibits a clear hierarchical structure (Tinbergen, 1951;Dawkins, 1976;Simon, 1962), where actions are nested into behavioral sequences that are then grouped into activities. Higher levels in the hierarchy emerge at longer timescales: actions/ movements occur on a sub-second scale, behavioral sequences span at most a few seconds and activities last for longer periods of several seconds to minutes. The crucial aspect of this behavioral hierarchy is its complexity: an animal's behavior unfolds along all timescales simultaneously. What is the organization of this vast spatiotemporal hierarchy? What are the neural mechanisms supporting and generating this nested temporal structure?

A case study: C. elegans locomotor behavior
Remarkable progress in both behavioral and neurophysiological aspects of the hierarchy has been made in the worm C. elegans ( Figure 6). The worm locomotor behavior exhibits a clear hierarchical organization with three distinct timescales: sub-second body-bends and head-casts, second-long crawling undulations, and slow reverse-forward cycles (Gomez-Marin et al., 2016). At the neuronal level, this hierarchy is generated by nested neuronal dynamics where upper-level motor programs are supported by slow activity spread across many neurons, while lower-level behaviors are represented by fast local dynamics in small multifunctional populations (Kaplan et al., 2020). Persistent activity driving higher-level behaviors gates faster activity driving lower-level behaviors, such that, at lower levels, neurons show dynamics spanning multiple timescales simultaneously (Hallinen et al., 2021). Specific lower-level behaviors may only be accessed via switches at upper levels, generating a nonoverlapping, tree-like hierarchy, in which no lower-level state is connected to multiple upper-level states.
At the top of this motor hierarchy, we find a much longer-lasting organization of states in terms of exploration, exploitation, and quiescence. In contrast to the strict, tree-like structure observed in the motor hierarchy, lower-level motor features are shared across these states, albeit with different  Figure 6. Behavioral and neural hierarchies in C. elegans. Top: behavioral hierarchy in C. elegans. A 0.05Hz cycle drives switches between forward and reverse crawling states, with intermediate level 0.5Hz crawling undulations, and lower level 1Hz head-casts. Bottom: slow dynamics across whole-brain circuits reflect upper-hierarchy motor activity; fast dynamics in motor circuits drive lower-hierarchy movements. Slower dynamics tightly constrain the state and function of faster ones. Adapted from Figure 8 of Kaplan et al., 2020. frequency of occurrence. Whereas the motor hierarchy is directly generated by neuronal activity, this state-level hierarchy may rely on neuromodulation (Ben Arous et al., 2009;Flavell et al., 2013)

Behavioral hierarchies in flies and rodents
How much of this tight correspondence between behavioral and neural hierarchies generalizes from worms to insects and mammals? Are action sequences organized as a chain? Alternatively, is there a hierarchical structure where individual actions, intermediate subsequences, and overall sequences can be flexibly combined? A chain-like organization would require a single controller concatenating actions, but could be prone to error or disruption. A hierarchical structure could be error-tolerant and flexible at the cost of requiring controllers at different levels of the hierarchy (Geddes et al., 2018). We will first examine compelling evidence from the fruit fly and then discuss emerging results in rodent experiments.
By analyzing videos of ground-based fruit fly during long sessions of spontaneous behavior ( Figure 2B), Berman et al., 2014 andBerman et al., 2016 mapped the entire behavioral space of   freely moving flies, identifying a hundred stereotyped, frequently reoccurring actions, interspersed with bouts of nonstereotyped behaviors. Recurring behavioral categories emerged as peaks in the behavioral space probability landscape, labeled as walking, running, head grooming, and wing grooming ( Figure 7A). In order to uncover the organization of behavior at different timescales, the authors estimated which behavioral representations (movements, sequences, or activities) could optimally predict the fly's future behavior on different temporal horizons (from 50 ms to minutes), applying the information bottleneck method (Tishby et al., 2000). This predictive algorithm revealed multiple timescales in the fly behavior, organized into a hierarchical structure reflecting the underlying behavioral programs and internal states. The near future could be optimally predicted by segmenting behavior according to actions at the fastest level of the hierarchy. The optimal representation of behavior that could optimally predict the distant future up to minutes away was based on slower, coarser groups of actions grouped into activities. These longer timescales manifest as nested blocks in the transition probability matrix, implying that a strongly non-Markovian structure emerges on longer timescales (Alba et al., 2020;Figure 7A). A hallmark of the fly behavior emerging from this analysis is that the different branches of this hierarchy are segregated into a tree-like, nonoverlapping structure: for example, actions occurring during grooming do not occur during locomotion. This multiscale representation was leveraged to dissect the descending motor pathways in the fly. Optogenetic activation of single neurons during spontaneous behavior revealed that most of the descending neurons drove stereotyped behaviors that where shared by multiple neurons and often depended on the behavioral state prior to activation (Cande et al., 2018). An alternative statistical approach based on a hierarchical HMM revealed that, although all flies use the same set of low-level locomotor features, individual flies vary considerably in the high-level temporal structure of locomotion, and in how this usage is modulated by different odors (Tao et al., 2019). This behavioral idiosyncrasy of individual-toindividual phenotypic variability has been traced back to specific genes (Ayroles et al., 2014) regulating neural activity (Buchanan et al., 2015) in the central complex of the fly brain. This series of studies suggest that the fly behavior is organized hierarchically in a tree-like structure from actions to sequences to activities. This hierarchy seems to lack flexibility, such that behavioral units at lower levels are segregated into different hierarchical branches and cannot be rearranged at higher levels. Although a more flexible structure seems to emerge during social interactions (see above and Calhoun et al., 2019).
What is the structure of the behavioral hierarchy in mammals? We will start by considering the first two levels of the hierarchy, namely, how actions concatenate into sequences. A recent experimental tour de force demonstrated the existence of a hierarchical structure in the learning and execution of heterogeneous sequences in a freely moving operant task (Geddes et al., 2018). Mice learned to perform a 'penguin dance' consisting of a sequence of two or three consecutive left lever presses (LL or LLL) followed by a sequence of two or more right lever presses (RR or RRR) to obtain a reward. Mice acquired the sequence hierarchically rather than sequentially, a learning scheme that is inconsistent with the classic reinforcement learning paradigm, which predicts sequence learning occurs in the reverse order of execution (Sutton and Barto, 2018). Using closed-loop optogenetic stimulation, the authors revealed the differential roles played by striatal direct and indirect pathways in controlling, respectively, the expression of a single action (either L or R), or a fast switch from one subsequence to the next (from the LL block to RR, and from RR to the reward approach).
What is the structure of the rodent behavioral hierarchy at slower timescales? A multiscale analysis of ongoing behavior was performed using rat body piercing, which allowed for tracking of the three-dimensional pose with high accuracy ( Figure 7B; Marshall et al., 2021). A watershed algorithm applied to behavioral probability density maps revealed the emergence of hierarchical behavioral categories at different temporal scales. Examining the behavioral transition matrix at different timescales, signature of non-Markovian dynamics peaked at 10-100 s timescales. On 15 s timescales, pattern sequences emerged featuring sequentially ordered actions, such as grooming sequences of the face followed by the body. On minute-long timescales, states of varying arousal or task engagement emerged, lacking stereotyped sequential ordering. Although the results of Marshall et al., 2021 seem to suggest the presence of a flexible, overlapping structure in the rat hierarchy, consistent with Geddes et al., 2018 and different from the tree-like hierarchy in flies (Berman et al., 2016), the analysis of Marshall et al., 2021 lacked the information bottleneck step, which was crucial to capture the fly behavior on long timescales. Future work may shed light on the nature of behavioral hierarchies in mammals using the information bottleneck.
Manipulation experiments could help clarify the structure of the rodent behavioral hierarchy and whether it allows for flexible behavioral sequences. Earlier studies of contextual modulations of rodent freely moving behavior by pharmacological or genetic manipulations were mostly confined to comparison of usage statistics of single actions (Wiltschko et al., 2015). However, more recent studies (Marshall et al., 2021;Klibaite et al., 2021), which mapped the rodent behavioral hierarchy, allowed a multiscale assessment of contextual effects. In a rat model of the fragile X syndrome, while mutant and wildtype rats had similar locomotor behavior, the former showed abnormally long grooming epochs, characterized by different behavioral sequences and states compared to wildtypes ( Figure 7C). This study highlights the advantages of multiscale comparative taxonomy of naturalistic behavior to investigate behavioral manifestations of complex conditions such as autism spectrum disorder and classify the behavioral effects of different drugs. Another benefit of this multiscale approach is to allow the investigation of the individuality of rodent behavior, revealing that, although all animals drew from a common set of temporal patterns, differences between individual rats were more pronounced at longer timescales compared to short ones (Marshall et al., 2021). It would be interesting to further explore the features of individuality across species as a way of testing ecological ideas such as 'bet-hedging' (Honegger and de Bivort, 2018

Multiple timescales of neural activity
What are the neural mechanisms generating the hierarchy of timescales observed during naturalistic behavior? Is there evidence that neural activity is simultaneously varying over multiple timescales? Although no studies directly addressed these questions yet, a number of experimental and theoretical approaches have provided evidence for multiple timescales of neural activity. Some evidence for temporal heterogeneity in neural activity was reported in restrained animals during stereotyped behavioral assays. A heterogeneous distribution of timescales of neural activity was found in the brainstem of the restrained larval zebrafish by measuring the decay time constant of persistent firing across a population of neurons comprising the oculomotor velocity-to-position neural integrator ( Figure 8A; Miri et al., 2011). The decay times varied over a vast range 0.5-50 s across cells in individual larvae. This heterogeneous distribution of timescales was later confirmed in the primate oculomotor brainstem (Joshua et al., 2013), suggesting that timescale heterogeneity is a common feature of oculomotor integrators conserved across phylogeny. Single-neuron activity may also encode a long memory of task-related variables. In head-fixed monkeys performing a competitive game task, temporal traces of the reward delivered in previous trials were encoded in single-neuron spiking activity in frontal areas over a wide range of timescales, obeying a power law distribution up to 10 consecutive trials (Bernacchia et al., 2011).
The timescale of intrinsic fluctuations in spiking activity can be also estimated from single-neuron spike autocorrelation functions ( Figure 8B). In awake head-fixed primates, during periods of ongoing activity, the distribution of intrinsic autocorrelation timescales within the same cortical circuit was found to be right-skewed and approximately lognormal ( Figure 8B), in the range from 10 ms to 1 s (Cavanagh et al., 2016). Moreover, comparison of the population-averaged autocorrelations during ongoing periods revealed a hierarchical structure across cortical areas, varying from 50 to 350 ms along the occipital-to-frontal axis (Murray et al., 2014; Figure 8C).
While all these results were obtained in restrained animals, it is an open question whether neural activity during naturalistic behavior exhibits temporal hierarchies similar to those observed in behavior. Current evidence from freely moving rodents engaged in waiting tasks (Murakami et al., 2017;Schreiner et al., 2021;Recanatesi et al., 2022) revealed the presence of multiple timescales in single-neuron activity from secondary motor and prefrontal cortices. These timescales range from the sub-second scale (single attractors), to a few seconds (attractor sequences), to tens of seconds or minutes (trial history dependence). It is tantalizing to speculate that even longer timescales may be present for neural activity to be able to generate the vast hierarchy of timescales observed in naturalistic behavior (Figure 7). Evidence from associative learning tasks in rodents found that population activity in hippocampal CA1 and CA3 encodes multiplexed information about several aspects of the task occurring on multiple trials such as context, place, value, and objects (McKenzie et al., 2014). Although an explicit analysis of temporal correlations was not carried out in this study, these results suggest that a hierarchy of timescales may be present in the hippocampus and emerge during learning.

Computational mechanisms underlying heterogeneous distributions of timescales
What are the computational mechanisms underlying this hierarchical temporal structure featuring a wide distribution of timescales? Single-cell biophysical properties such as differences in membrane time constant across cell types or synaptic time constants across receptors (e.g., faster AMPA and GABA vs. slower NMDA receptors) may generate multiple timescales in sub-second range, from a few milliseconds to a few hundred milliseconds (Gjorgjieva et al., 2016). Although this class of singlecell mechanisms might explain the temporal hierarchy observed in the posterior-anterior cortical axis (Murray et al., 2014) by relying on systematic differences in cell-type-specific features, they are unlikely to explain the much wider temporal hierarchy observed across neurons within the same area ( Figure 8A and B), or the even larger hierarchy observed in behavior (Figure 7).
Theoretical studies highlighted the central role played by recurrent synaptic couplings within a local circuit for generating long timescales emerging from the recurrent dynamics. Recurrent networks with random recurrent couplings can generate time-varying neural activity whose timescale may be tuned to very slow values at a critical point (Toyoizumi and Abbott, 2011;Magnasco et al., 2009). Long timescales may also emerge in random neural networks with an excess of symmetric couplings (Martí et al., 2018). In these examples, all cells share the same timescale. Although this common timescale can be tuned to arbitrarily long values, these circuits are incapable of giving rise to temporal heterogeneity either at the single cell or at the population level.
Two biologically plausible ingredients were shown to be sufficient to generate long timescales with temporal heterogeneity in their distributions: recurrent couplings realizing local functional neural clusters and heterogeneity in synaptic couplings. The first requirement of functional assemblies is a connectivity motif ubiquitously observed in biological circuits, namely, the fact that the strength of recurrent couplings decays with spatial distance between pairs of neurons, leading to the emergence of local functional assemblies of strongly coupled neurons (Song et al., 2005;Perin et al., 2011;Lee et al., 2016;Kiani et al., 2015;Miller et al., 2014). In recurrent linear network models, functional assemblies can generate neural activity exhibiting slow relaxation times following stimulation. In order to generate a heterogeneity in the distribution of relaxation times, functional assemblies can be coupled by heterogeneous long-range connections, arranged along a spatial feedforward gradient ( Figure 8D; Miri et al., 2011;Joshua et al., 2013;Chaudhuri et al., 2014). Linear networks featuring both local functional assemblies and spatial heterogeneity in long-range couplings were able to reproduce the heterogeneous decay times found in the brainstem oculomotor integrator circuit in zebrafish and primates (Miri et al., 2011;Joshua et al., 2013) and in the reward integration times in the macaque cortex (Bernacchia et al., 2011). It remains an open question whether these integrator models could be generalized to explain hierarchical activity in motor generation, perhaps stacking multiple layers of them. A shortcoming of these models is the fact that they require finetuning of synaptic couplings and do not generate attractor dynamics, which recent experimental evidence suggests is the basic building block of preparatory motor activity (Inagaki et al., 2019;Recanatesi et al., 2022). Other network models generating multiple timescales of activity fluctuations were proposed based on self-tuned criticality with anti-Hebbian plasticity (Magnasco et al., 2009) or multiple block-structured connectivity (Aljadeff et al., 2015).
An alternative robust and biologically plausible way to generate a vast hierarchy of timescales was proposed based on the ingredients of recurrent functional clusters and heterogeneity in synaptic couplings ( Figure 8E; Stern et al., 2021), two common features observed across cortical circuits (Marshel et al., 2019;Kiani et al., 2015;Miller et al., 2014;Song et al., 2005;Perin et al., 2011;Lee et al., 2016), supported by theoretical evidence (Litwin-Kumar and Doiron, 2012;Wyrick and Mazzucato, 2021). In this model, excitatory and inhibitory neurons are arranged in clusters of heterogeneous sizes, generating metastable activity whose typical timescale is measured by the on-off cluster switching time (see Appendix 3). In this model, a cluster's timescale is proportional to its size and larger clusters exhibit longer timescales, yielding a heterogeneous distribution of timescales in the range observed in cortex ( Figure 8A-C). An appealing feature of this model is that it could be generalized to other domains that exhibit fluctuations simultaneously varying over a large range of timescales, such as spin glasses (Bouchaud, 1992), metabolic networks of Escherichia coli (Almaas et al., 2004;Emmerling et al., 2002), and yeast cultures (Roussel and Lloyd, 2007;Aon et al., 2008). It remains an open question whether the relationship between a neural cluster's size and its intrinsic timescale is realized in cortical circuits and whether it can explain the origin of the hierarchical variability in naturalistic behavior.

Future directions: Neural mechanisms generating temporal hierarchies
Future studies should address how the hierarchy of timescales found in behavior may emerge from computational mechanisms. There are several missing links along this path. First, although temporally heterogeneous neural activity was found in several brain areas and species under different experimental conditions, no study to date investigated the presence of temporal hierarchies in neural activity during naturalistic behavior. All the necessary tools are available: recent advances in neurotechnology demonstrate the feasibility of chronic recordings of large neural populations during freely moving behavior (Juavinett et al., 2019;van Daal et al., 2021); simultaneously performing behavioral classification analyses from pose tracking software (Mathis et al., 2018;Marshall et al., 2021;Pereira et al., 2019). This first piece of the puzzle is thus within reach.
On the theory side, although recurrent dynamics can generate heterogeneous distributions of timescales ( Figure 8E), this model needs to be extended to a fully realistic framework for explaining nested temporal hierarchies in naturalistic behavior such as the proposed mesoscopic circuit in Figure 5C. Here, we propose a simple roadmap to bridge this gap, building on some of the theoretical ideas we reviewed above. This theory is based on a hierarchical structure that, for lack of a better term, we will denote 'Attractors all the way down' (Figure 9). We start from the observation that selfinitiated actions within a behavioral sequence are represented as metastable attractors in secondary motor cortex (Inagaki et al., 2019;Recanatesi et al., 2022;Figure 3). We then extend this observation to a general principle positing that behavioral units at each level of the temporal hierarchy (not only actions, but also sequences, activities) are represented by metastable attractors. The stochasticity in behavioral unit duration can be achieved by generating transitions between attractors at each level via low-dimensional variability, arising from mesoscopic feedback loops involving cortex and subcortical areas (Recanatesi et al., 2022; Figure 3D). The average transition time between behavioral units at each level will then depend on the barrier height separating the corresponding attractors (Cao et al., 2016;Mazzucato et al., 2019;Wyrick and Mazzucato, 2021). As one moves up the hierarchy from actions (fast transitions = lower barriers) to sequences and activities (slow transitions = higher barriers), the potential wells separating the corresponding attractors become deeper and the basins of attraction wider. This increase in timescales can be achieved in a biological manner by assuming that increasingly larger neural populations encode for slower behavioral features (Stern et al., 2021; see Figure 8E), which can be realized in a biologically plausible way as a hierarchy of clusters within clusters (Schaub et al., 2015). Contextual modulations in average duration of a behavioral unit at each level of the hierarchy can be implemented by top-down changes of the barrier heights between attractors via gain modulation ( Figure 4C). The new theoretical ingredient required for this theory to work is the presence of a nested structure in the attractor landscape, whereas the basin of attraction of lower-level attractors (actions) is contained within the basins of higher level attractors (sequences) all the way up to the largest basins representing long-term activities. The architecture of this model provides some immediate predictions for neural activity. First, neurons exhibit conjunctive selectivity to multiple variables at different levels of the hierarchy, which has been observed, for example, in rodent hippocampus during freely moving tasks (McKenzie et al., 2014) and in primate prefrontal cortex during complex decision-making tasks (Rigotti et al., 2013). Second, single-cell activity is modulated by many attractors, in agreement with the multistable activity observed in sensory (Mazzucato et al., 2015) and frontal areas (Recanatesi et al., 2022) where neuronal representations of attractors are dense rather than sparse. Third, fluctuations in the activity of single neurons should vary over multiple timescales simultaneously, encoding information about multiple levels of the hierarchy; this is the structure that was observed in C. elegans where neurons representing movements exhibit both fast and for slow fluctuations correlated to multiple levels of the behavioral hierarchy (Kaplan et al., 2020;Figure 6). Which network architectures can implement a hierarchical attractor landscape? The behavioral hierarchy discovered in flies (Berman et al., 2016) can be approximated as a tree: behavioral units at a lower level of the hierarchy only occur during a specific unit at a higher level of the hierarchy (Figure 7). For example, movements occurring during the fruit fly locomotion activities never occur during idle activities. This structure is typical of phylogenetic classification in taxonomy, where the distance from the root to the leaves is the same for every leaf and the tree is called ultrametric (Rammal et al., 1986). It is not known whether behavioral trees are ultrametric; in order to clarify this structure, a notion of distance in behavioral space will have to be introduced and examined. A classic model of hierarchical attractors, which could potentially capture the tree-based hierarchy, is an extension of the Hopfield network where the stored pattern is hierarchically arranged into a tree (Parga and Virasoro, 1986;Figure 9). This class of models was originally proposed to explain the effect of word semantic categorization during memory encoding, storage, and retrieval. A contamination between the domains of naturalistic behavior and natural language processing has been proposed early on (Dawkins, 1976) and recently applied to describe C. elegans (Gomez-Marin et al., 2016;Gupta and Gomez-Marin, 2019) and larval zebrafish behavior (Reddy et al., 2022). It is still an open question to determine the precise structure of the behavioral hierarchy in mammals. The results of Marshall et al., 2021 seem to suggest that the hierarchy in rats may not be tree-like as the one observed in flies (Berman et al., 2016). However, the analysis of the rodent behavior in Marshall et al., 2021 lacked the information bottleneck step, which was crucial to infer the structure of the fly hierarchy at long timescale. On general grounds, it would be interesting to investigate whether the mammalian hierarchical motor system allows for flexible and adaptable behaviors, by utilizing the same motor elements across different sequences in a hierarchical manner, as represented in the mesoscopic circuit models of Murray and Escola, 2017;Logiaco and Escola, 2020;Logiaco et al., 2021 (see Figure 5) and suggested by experimental evidence in Geddes et al., 2018. This flexible hierarchy could be realized by overlapping branches and would be different from the tree-like structure found in the fly.
Are the computational mechanisms generating the tree-like hierarchy sufficient to generate flexible hierarchies? Recent theoretical work highlighted the importance of a neural circuit architecture that segregates the motor preparation and execution in different areas. In the cortico-thalamic-basal ganglia model of Figure 5B, flexible sequences can be generated by rearranging existing motifs as well as by learning new motifs without interfering with previous ones (Logiaco et al., 2021). Moreover, flexible sequences can be learned using biologically plausible Hebbian plasticity in the striatum (Murray and Escola, 2017). These theoretical models provide an exciting blueprint for further investigation into learning and expression of flexible behavioral hierarchies.
What are potential neural implementations of this network architecture? In the case of mammals, at the lowest level of the hierarchy, neural populations in the rodent secondary motor cortex (M2) generate metastable attractors representing upcoming actions (Inagaki et al., 2019;Recanatesi et al., 2022). Other subcortical areas, such as the basal ganglia or the thalamus, are likely involved in the generation of complex behavioral sequences (Murray and Escola, 2017;Logiaco et al., 2021;Recanatesi et al., 2022). How are higher-order behavioral units, such as sequences and activities, encoded? One possibility is that they could also be encoded in M2 attractors as recent evidence suggests M2 populations encode for a wide range of behavioral variables (Cazettes et al., 2021). We hypothesize an architecture featuring nested assemblies where larger populations, whose activity varies over progressively slower timescales (Stern et al., 2021;Schaub et al., 2015), hierarchically encode for slower features of behavior (from actions to sequences to activities). Experimental evidence shows that single neurons in M2 are selective to multiple actions (Recanatesi et al., 2022), suggesting that mixed selectivity (Rigotti et al., 2013) could play an important role in generating the high-dimensional hierarchical attractor landscape necessary to capture the complexity of naturalistic behavior. An alternative architecture might involve a distributed cortical circuit where the neural representations of behavioral units at different levels of the hierarchy are encoded in multiple frontal cortical areas, such as action sequences in M2, and activities in prefrontal cortex, which is known to control behavior on longer timescales such as trial history effects (Murakami et al., 2014;Murakami et al., 2017;Schreiner et al., 2021). In both scenarios, we hypothesize that while behavioral units are encoded as cortical attractors, transitions between attractor rely on feedback loops involving cortico-subcortical circuits (Murray and Escola, 2017;Logiaco et al., 2021;Recanatesi et al., 2022).

Behavioral energy landscape
The structure of naturalistic behavior emerging on long timescales in the fruit fly is a hierarchical tree-like organization ( Figure 7A), which is consistent with an underlying neural circuit architecture based on hierarchical basins of attractions (Figure 9). Can we derive a representation of behavior as a complex energy landscape directly from the behavioral data itself? A promising approach is given by the probability density map approach (Berman et al., 2016). One could define the log-probability of the density map as an energy potential, where the transition rates between behavioral features yield a probability flux along this potential landscape. It is tantalizing to speculate that using the combination of energy gradient and probability flux one could derive a data-driven nonlinear dynamical system describing behavior (Wang, 2015). Other approaches to infer a potential energy landscape directly from data have been successfully applied to spike trains (Genkin and Engel, 2020;Duncker et al., 2019), although in a regime where the energy landscape has only a few minima. Alternatively, in chemical kinetics or fluid flow, methods based on the transfer operator formalism have been applied to find effective free energy landscapes and metastable states from experimental data or simulations (Bowman et al., 2013), and have been recently applied to animal behavior .

Neural mechanisms of flexible behavioral hierarchies
The model of nested basins of attractions proposed in Figure 9 can explain a tree-like behavioral hierarchy, where behavioral units at lower levels (e.g., actions) are not shared by different units at a higher level (e.g., sequences), as observed in fruit flies (Berman et al., 2016). Although the tree-like hierarchical organization emerged from assays where individual animals were monitored in isolation, recent studies of social behavior seem to challenge this structure. A large variety of qualitatively new behaviors arise from social interactions, including fighting, mating, and others. While some of these behaviors involve specialized behavioral units, others involve simultaneous execution of multiple behavioral units leading to multitasking. For example, during courtship, a male fly can simultaneously approach a female fly (locomotion) and sing, two behaviors that would be mutually exclusive in the absence of a female fly. As a consequence, the simple tree-like hierarchical organization of behavior observed in isolated individuals ( Figure 8) might break down during social interactions and lead to a flexible organization where actions are shared between multiple sequences and activities.

Appendix 1 From behavioral videos to behavioral features
Several video analysis tools are available to extract behavioral features from videos, ranging from standard computer vision algorithms to deep learning-based methods. These algorithms are fast and reliable and allow to track animal movements at different levels of spatiotemporal resolution, typically yielding three different kinds of time-series outputs. Deep learning algorithms with humanannotated video frames yield a set of coordinates in ambient space representing tracked points on an animal's body (Mathis et al., 2018;Segalin et al., 2020;Pereira et al., 2019). Unsupervised algorithms yield a set of low-dimensional variables obtained from dimensionally reducing the image pixel space via principal component analysis or other nonlinear compression methods (Wiltschko et al., 2015;Batty et al., 2019). Spectral decomposition of the videos obtained from a time-frequency analysis yield probability density maps (Berman et al., 2014;Szigeti et al., 2015). Translating these low-dimensional observables into a meaningful representation of behavior for subsequent analyses requires further assumptions on the nature of feature representations, leading to difficult choices that should be carefully evaluated case by case.

Discrete vs. continuous features
The first choice regards the definition of what constitutes a fundamental unit of behavior for a particular model system and whether this unit is discrete or continuous. Is a 'slow locomotion' bout distinct from a 'fast locomotion' bout? Are they just different manifestation within the variability range of the same 'locomotion' motif? This choice is related to the timescale at which behavior is analyzed to test a particular hypothesis and is based on the level of perceived stereotypy present in the observed behavior (Brown and de Bivort, 2018;Berman, 2018;Pereira et al., 2020a). At the sub-second timescale, movements can be represented as continuous low-dimensional trajectories of body parts. A continuous representation of behavior assumes that behavioral time series can be represented as a superposition of continuous behavioral motifs or 'eigenshapes' and has been fruitful in quantifying the behavior of the worm C. elegans and the Drosophila larva (Stephens et al., 2008;Szigeti et al., 2015) (although discrete representations have been applied to the same systems; Brown et al., 2013;Vogelstein et al., 2014;Schwarz et al., 2015). Continuous trajectories of movements can be interpreted and investigated as complex dynamical systems as well . In the zebrafish swimming, clear peaks in distribution of kinematic parameters suggest categorical distinctions (Marques et al., 2018), although for some movements the kinematics vary more continuously, allowing the fish to perform a smooth range of swim types. The approach where units of behavior are discrete and can be separated in time leads to representations based on ethograms, namely, graphs where nodes are discrete actions and edges transition rates between them (Bateson and Hinde, 1976). Ethograms implicitly assume that variability across repetitions of the same action ('stereotypy') is smaller than the variability between different actions. In order to bridge the continuous nature of movements with its inherent variability to a discrete interpretation in terms of actions, two successful approaches have been used. When using state-space models (Figures 2A and 5B), the fundamental units of behavior can be captured by stereotyped, lowdimensional, autoregressive trajectories or LDS. The behavioral time series unfolds as a sequence of piecewise LDS trajectories, where each trajectory represents a discrete action, yet it accommodates for large variability in its continuous expression. This approach has been successfully applied both in rodents (with autoregressive HMMs; Wiltschko et al., 2015) and worms (with switching LDS; Linderman et al., 2019). Two of its advantages are the fact that it provides a generative model that can be used to simulate synthetic behavior, and the fully Bayesian implementation of inference. One limitation is that the number of behavioral units must be specified as a hyperparameter, raising the thorny issue of model selection. In freely moving rodents, up to a dozen states were sufficient to capture animal behavior when performing a task (Findley et al., 2021;Parker et al., 2021), but many more states (up to a hundred) were found during spontaneous behavior (Wiltschko et al., 2015;Figure 2A). Sample size and the bias-variance tradeoff are important issues to consider in this class of models. An alternative data-driven approach to segmentation of behavior avoids having to choose the number of units and their dynamical features a priori, but rather lets the structure emerge from the data at different timescales. Using density-based clustering to obtain a distribution of time points in feature space, one can identify peaks in the density map via watershed transforms ( Figure 2B). This approach has been successfully used in the adult Drosophila (Berman et al., 2014) and rodents (Marshall et al., 2021). The advantage of the watershed transform is that the segmentation into discrete units depends on the level of coarse-graining or the timescale at which the behavior is analyzed, allowing greater flexibility compared to state-space models. Both approaches combine continuous representations of movements with discrete segmentation into behavioral units. A different data-driven approach used locally linear models to extract behavioral states from posture dynamics (Costa et al., 2019;Costa et al., 2021). Here, behavioral states emerge at different coarse-graining levels through a hierarchical clustering approach combining short-time movements into slower interpretable behaviors, thus surpassing some of the limitations of state-space models, since the number of states is not imposed as a hyperparameter but emerges from the data.

Non-Markovianity and long timescales
Discrete behavioral representations based on state-space models (such as HMM or SLDS) are convenient and effective for capturing naturalistic behavior with a limited number of actions, represented by discrete hidden states. These models are based on the Markovian assumption that the state of the system at time t only depends on the state at time t − 1 . A large number of studies used these models to capture a single sub-second timescale related to fast movements emerging from the data, as in the examples of Figures 2B, 3B and 5A; while a fast and a slow timescale were found in an olfactory search task, the sub-second duration of single movements and the secondlong duration of the investigation/approach sequences (Findley et al., 2021). In principle, a Markov model with N states can capture up to N − 1 different timescales, represented by the nontrivial eigenvalues of the transition probability matrix. In these models, transition probabilities between states over longer intervals are generated by powers of the elementary transition probability matrix. However, naturalistic behavior typically exhibits strong deviations from this Markovian structure (Tinbergen, 1951;Dawkins, 1976;Simon, 1962;Marshall et al., 2021), whereby at long intervals much slower timescales emerge, compared to the ones predicted by the Markovian dynamics. A simple workaround is to introduce an ad hoc hierarchical structure in the state-space model (capturing timescale of a few seconds; Tao et al., 2019), although the hierarchical architecture has to be imposed a priori and may reflect human prejudices. An alternative data-driven approach is provided by the information bottleneck, whereby a non-Markovian structure at long timescales can be inferred directly from the data. Using this approach in the fruit fly, a hierarchical representation of actions naturally emerged when aiming to optimally predict the future behavioral state of the fly using the information bottleneck (Berman et al., 2016). An efficient way to model the slow non-Markovian structure is by compressing the behavioral repertoire to just two states and using maximum entropy models to capture the interactions between states. This statistical physics approach revealed that interactions between states at time τ decay as slow as 1/τ 2 . If one wanted to fit an HMM to data with such slowly decaying interactions, an infinite number of discrete states would be required (Alba et al., 2020). On the other hand, it would be interesting to fit Markov models (such as HMM or SLDS) to behavior for classifying the fast movement motifs and then apply the information bottleneck for compressing them and revealing the underlying slow timescales. This non-Markovian structure in behavior poses an interesting challenge to existing frameworks based on discrete states and calls for new theoretical models to explain the dynamical evolution of naturalistic behavior over long timescales.

Appendix 2 Attractors dynamics in neural circuits
Attractors are a concept in the theory of dynamical systems. An attractor is defined as the set of states a system evolves to starting from a large set of initial conditions, defined as its basin of attraction. The main feature of an attractor is that if the system is perturbed slightly, it tends to return to it. In the context of recurrent neural networks (RNN, Appendix 2- figure 1A), attractors are configurations of the network activity that exhibit these features and they can be either timeindependent (point attractors, Appendix 2- figure 1A), time-varying but periodic (limit cycles), or time-varying but nonperiodic (chaotic attractors). The classic example of an attractor neural network exhibiting point attractors is the Hopfield network (Hopfield, 1982;Amit et al., 1985), which was introduced as a model of auto-associative memory. The Hopfield network represents an RNN with N units x i (for i = 1, . . . , N ) evolving according to the dynamics τẋ i = −x i + ∑ N j=1 w ij ϕ(x j ) , where τ and ϕ(x) are single-cell time constant and transfer function, the latter typically chosen with a saturating nonlinearity in neuroscience models (Hopfield, 1984). In a Hopfield network, the synaptic couplings . After initializing the network activity at a random value x 0 i (empty circle in Appendix 2- figure 1A), the network temporal dynamics quickly converge to the attractor M µ , which is closest to x 0 (black circle in Appendix 2- figure 1A). The set of initial conditions that converge to a particular attractor defines its basin of attraction, represented as potential wells in an energy landscape (in Appendix 2- figure 1A, the basin of the attractor of M 2 is represented by the values of x such that a < x < b). If we interpret this model as an auto-associative memory circuit, then each attractor M µ represents a stored memory; when presenting the network with a hint ( x 0 ) that is close to the memory M 2 , the network will quickly retrieve the full memory. A crucial feature of attractor dynamics is that when we slightly perturb the network activity away from the attractor (blue lightning in Appendix 2- figure 1A), then if the new configuration (empty blue circle) still lies within the basin of attraction of M 2 , the network will quickly converge back to the attractor.
It is possible to generalize these Hopfield-like stable attractors to biologically plausible networks of spiking neurons with cell-type-specific connectivity (Amit and Brunel, 1997). These models explain the features of neural activity observed in several areas of the association cortex (inferotemporal cortex [Fuster and Jervey, 1981;Miyashita and Chang, 1988]; prefrontal cortex [Funahashi et al., 1989;Wilson et al., 1993]; posterior parietal cortex [Koch and Fuster, 1989]). In these experiments featuring a delay-response task, a stimulus is presented and then withdrawn, followed by a delay period at the end of which the monkey performs a stimulus-specific saccade to obtain a reward. Neural activity in association areas exhibits stimulus-selective persistent activity during the delay period following the removal of the stimulus, which can be interpreted as originating from attractors in the underlying neural circuit. Such self-sustained persistent activity preserves an active memory of a visual stimulus after it is removed. Long-lasting persistent activity was also reported in the fruit fly following optogenetic stimulation and connectome reconstruction, suggesting that this activity may be supported by strong recurrent couplings between persistently active neurons (Deutsch et al., 2020). Recent advances with optogenetics in rodents further confirmed that delay persistent activity is stable against perturbations, a crucial feature of attractor dynamics. In a delay-response task (Inagaki et al., 2019;Finkelstein et al., 2021;Appendix 2- figure 1B), neural activity recorded in the anterolateral motor cortex (a part of the secondary motor cortex) showed persistent choiceselective activity during the delay period (dashed lines in Appendix 2- figure 1B). In trials where bilateral optogenetic perturbations transiently silenced the local circuit at the beginning of the delay epoch (full lines in Appendix 2- figure 1B), stimulus-selective population activity quickly recovered to the unperturbed values (full lines in Appendix 2- figure 1B), suggesting that delay activity in this area can be interpreted as originating from attractors. In an attractor neural network, the activity of three representative units starting from initial values x 0 rapidly converges to the closest attractor ( M 2 ). A small perturbation displaces the activity away from the attractor (blue circle) and the dynamics quickly returns to the attractor. Bottom: attractors are represented as potential wells in an energy landscape. Initial conditions or perturbations within the basin of attraction of the attractor M 2 (defined by a < x < b) quickly converge back to the attractor. (B) In a head-fixed delay-response task (top), one of two tones was presented during the sample epoch and the mouse reported its decision after the delay epoch by directional licking (randomized delay duration). The anterolateral motor cortex was photoinhibited bilaterally during the first 0.6s of the delay epoch (cyan bar) and quickly returned to its unperturbed level. Mean spike rate of lick-right preferring neurons (blue and red thick lines: unperturbed lick-right and -left trials; dashed lines: perturbed trials). Panel (B) adapted from Figures 1, 6, and S6 of Inagaki et al., 2019. can replace the Kramers theory (Hänggi et al., 1990;Huang and Doiron, 2017;Litwin-Kumar and Doiron, 2012) with the universal colored noise approximation (UCNA) to the Fokker-Planck equation, which provides a good approximation of the escape process in this regime (Stern et al., 2021). Here, both the strength of the noise amplitude and the barrier height are proportional to a product of the noise color and the recurrent coupling strength. This theory shows that the overall effect of colored noise is to slow down the average transition rate. Although UCNA has been applied to networks of rate units, a full account of the spiking network dynamics is still an open problem. In summary, in models with metastable attractors, we can find a separation of timescales between the fast fluctuations around a particular attractor and the slow fluctuations involving switching between different attractors. Moreover, network simulations suggest that the details of the network architecture might play an important role in generating metastable dynamics over a large range of parameters. Whereas in networks with only excitatory but not inhibitory clusters the metastable regime might require fine-tuning (Litwin-Kumar and Doiron, 2012;Mazzucato et al., 2019), an architecture with coupled excitatory and inhibitory clusters allows metastable dynamics to persist over a wide range of synaptic coupling strength and cluster sizes (Rostami et al., 2020;Wyrick and Mazzucato, 2021). This E/I clustered architecture can lead to the emergence of a heterogeneous distribution of timescales within the same circuit, as explained in Figure 8E (Stern et al., 2021).

Top-down modulation of network metastable dynamics
A central challenge in experimentally testing computational models of metastable attractors is that the reconstruction of the attractor landscape and the height of its potential wells relies on the knowledge of the network's structural connectivity including its coupling strength and connection probability, which is out of reach of current neurotechnology. However, one can use mean field theory methods to obtain an alternative formulation of the attractor landscape only involving quantities directly accessible to experimental observation (Mascaro and Amit, 1999;Mattia et al., 2013;Mazzucato et al., 2019;Wyrick and Mazzucato, 2021). The double potential well representing the two attractors can be directly mapped to the effective transfer function of a neural population representing a single cluster (Appendix 3- figure 1C). Using this correspondence, one can map changes in the barrier height ∆ separating metastable attractors to changes in the slope (or 'gain') of the intrinsic transfer function estimated during ongoing periods. This map provides a direct relationship between changes in cluster activation timescale and single-cell gain modulation, which can be inferred from either intracellular or extracellular recordings (Lim et al., 2015;Wyrick and Mazzucato, 2021). These computational tools can be used to test mechanistic hypothesis on the effect that circuit perturbations and manipulations exert on a network metastable dynamics Wyrick and Mazzucato, 2021). In particular, perturbations inducing steeper gain increase well depths and barrier heights, and thus increase the cluster timescale, and vice versa. This approach assumes that fluctuations in synaptic inputs are constant and additive, such that we can interpret the changes in first passage times as coming from changes in the transfer function. Transition rates between attractors may also be modulated by varying the levels of amplitude and color of the synaptic inputs, which can arise either from recurrent dynamics or from external projections. An increase in the noise amplitude can naturally arise in balanced E/I networks by simultaneously increasing both the E and I activity or E and I external inputs, leading to noiseinduced gain modulation and a faster integration time in conductance-based models (Chance et al., 2002). Variations in multiplicative synaptic noise within thalamocortical loops (e.g., in the model of Recanatesi et al., 2022) may also modulate average transition times between metastable attractors, although this mechanism has not been explored yet. Variations in the noise color were also shown to affect the transition rate between potential wells (Stern et al., 2021), where a slower noise timescale led to slower transition times compared to white noise.
An exciting new direction aims to infer the dynamical system underlying the observed spiking activity directly from the data. Assuming that high-dimensional neural activity is generated by a lowdimensional set of latent variables evolving according to Langevin dynamics, one can directly infer the effective energy potential underlying their dynamics from the data (Genkin and Engel, 2020;Duncker et al., 2019). A potential limitation of these new methods may be their reliance on lowdimensional latent manifolds, whereas the structure of neural patterns in motor areas during freely moving behavior seems to be high dimensional (Recanatesi et al., 2022).