On the dimensionality of behavior

Significance: How do we characterize animal behavior? Psychophysics started with human behavior in the laboratory, and focused on simple contexts, such as the decision among just a few alternative actions in response to sensory inputs. In contrast, ethology focused on animal behavior in the natural environment, emphasizing that evolution selects potentially complex behaviors that are useful in specific contexts. New experimental methods now make it possible to monitor animal and human behaviors in vastly greater detail. This "physics of behavior" holds the promise of combining the psychophysicist's quantitative approach with the ethologist's appreciation of natural context. One question surrounding this growing body of data concerns the dimensionality of behavior. Here I try to give this concept a precise definition.


I. INTRODUCTION
Even large and complex animals have relatively small numbers of muscles or joints. In some sense the complexity of behavior is limited by this number of independent degrees of freedom. Efforts to tame the complexity of brains and behavior have led to interest in a stronger notion, namely that the limited set of output degrees of freedom implies that the dimensionality of behavior is limited, and that correspondingly we should expect the dynamics of the neural networks that drive behavior to be confined to low-dimensional spaces.
There are many examples where we have direct evidence that motor behaviors are described by low-dimensional models, in organisms from the worm C elegans to humans and nonhuman primates [1–7]. The argument that low dimensionality of behavior implies low dimensionality of neural activity has been made most explicitly for the case of C elegans [8, 9], but the search for low-dimensional manifolds in neural activity is more widespread [10, 11]. Note that in the mammalian brain, with ∼10^5 neurons in one cortical column, dimensionality could be reduced dramatically yet still be very large.
There are many reasons to be suspicious of the argument that a small number of behavioral degrees of freedom implies low dimensionality of behavior, which in turn implies low dimensionality of neural activity. There are, for example, ∼100 muscles involved in human speech. Does this mean that our linguistic behavior is ∼100 dimensional? Should we be searching for 100-dimensional dynamics in the patterns of neural activity that govern language production? Should we be worried that there are ∼80 muscles in the fruit fly thorax [12, 13], which would mean that the potential dimensionalities of fly behavior and human language are not so different? In fact even C elegans has 95 body wall muscles [14]; the claim that the dynamics of worm behavior is low-dimensional rests on observations of the behavior itself, not on limits set by the anatomy.
Perhaps the comparison of flies and human language highlights the need for a more precise definition of the "dimensionality of behavior." This is made more urgent by the explosive growth of methods for more quantitative measurements of behavior [5, 15–21]. If these data can be reduced to low-dimensional descriptions, then we have achieved an enormous simplification, with practical consequences for further analysis. We might also have theoretical predictions about the dimensionality of behavior (either low or high), and then measuring dimensionality would provide decisive tests of these theories.
As a preface to the discussion, it should be remembered that the current wave of quantitative approaches to the analysis of behavior integrates several very different intellectual traditions, including classical ethology, physiology, theoretical physics, engineering, and computer science. Different ideas and methods will seem obvious or opaque to these very different communities. In the interest of clarity I take the risk of saying some things that will be well known to some audiences, and focus on what I hope are simple versions of general ideas.

II. TWO EXAMPLES
To work toward a more precise definition, let's start with the case in which the behavior we observe is just a single function of time x(t). I will assume that this is a completely autonomous behavior, and that across the time windows we consider the statistical structure of the behavior is stationary; both of these are common but possibly unrealizable idealizations. As is familiar from the now classical literature on dynamical systems [22, 23], it is possible to tease out of this single time series a higher dimensional description of the underlying dynamics, so that the apparent dimensionality of the data is not a bound on the dimensionality of the dynamics.
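This point can be made concrete with a toy example, not from the original text: an AR(2) process, the discrete-time analog of a noisy damped oscillator, has a two-dimensional underlying state even though we observe only a scalar series. A minimal sketch in Python (all parameter choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# AR(2) process: x_t = a1*x_{t-1} + a2*x_{t-2} + noise.
# The underlying state (x_{t-1}, x_{t-2}) is two dimensional,
# but we observe only the scalar series x_t.
a1, a2 = 1.2, -0.5          # stable, oscillatory choice of coefficients
N = 200_000
x = np.zeros(N)
eps = rng.normal(size=N)
for t in range(2, N):
    x[t] = a1 * x[t - 1] + a2 * x[t - 2] + eps[t]

def residual_var(k):
    """Variance of the best linear prediction error using k past values."""
    X = np.column_stack([x[k - j - 1 : N - j - 1] for j in range(k)])
    y = x[k:N]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.var(y - X @ coef)

v = {k: residual_var(k) for k in (1, 2, 3)}
print(v)  # one past value is not enough; two suffice; a third adds nothing
```

Prediction from one past value leaves excess error; two past values reach the noise floor, and a third adds nothing, so the scalar series reveals two underlying dimensions.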
On the other hand, suppose that a complete description of the observed behavior were given by

dx/dt = −x/τ + η(t),   (1)

where η(t) is white noise,

⟨η(t)η(t′)⟩ = 2D δ(t − t′).   (2)

I think most people would agree that if this is a complete description of the dynamics, then the system really is one dimensional. It is important that the noise source is white; non-white noise sources, which themselves are correlated over time, are equivalent to having hidden degrees of freedom that carry these correlations.
The observable consequences of the dynamics in Eqs (1, 2) are well known: x(t) will be a Gaussian stochastic process, with the two-point correlation function

⟨x(t)x(t′)⟩ = Dτ e^{−|t−t′|/τ}.   (3)

We recall that for a Gaussian process, once we specify the two-point function there is nothing else to say about the system. Importantly, we can turn this around: if the observed behavior is a Gaussian stochastic process, and the correlations decay exponentially as in Eq (3), then Eqs (1, 2) are a complete description of the dynamics.

Suppose the real dynamics involve not only the observable x(t) but also an internal variable y(t),

dx/dt = −x/τ + J y + η_x(t)
dy/dt = J x − y/τ + η_y(t),   (4)

where the driving noises are white and independent,

⟨η_i(t)η_j(t′)⟩ = 2D δ_ij δ(t − t′).   (5)

Notice that since y is hidden, the units of this variable are arbitrary, which allows us to have the strength of the noise driving each variable be the same without loss of generality, while the choice to give each variable the same correlation time is just for illustration, as is the symmetry of the dynamical matrix. Looking at these equations, it seems easy to agree that the system is two dimensional. Again x(t) is Gaussian, but the correlation function has two exponential decays,

⟨x(t)x(t′)⟩ = A₊ e^{−|t−t′|/τ₊} + A₋ e^{−|t−t′|/τ₋},   (6)

with 1/τ± = 1/τ ∓ J. We see that a one dimensional system generates behavior with a correlation function that has one exponential decay, while a two dimensional system generates a correlation function with two exponential decays. We would like to turn this around, and say that if we observe certain structure in the behavioral correlations, then we can infer the underlying dimensionality.
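The one-dimensional prediction is easy to check numerically. Below is a minimal sketch in Python (parameters are illustrative choices, not from the text) that integrates Eq (1) by the Euler–Maruyama method and compares the measured variance and autocorrelation of x(t) against the single exponential of Eq (3):

```python
import numpy as np

rng = np.random.default_rng(1)

# Euler-Maruyama integration of dx/dt = -x/tau + eta(t),
# with <eta(t) eta(t')> = 2 D delta(t - t').
tau, D = 1.0, 1.0
dt, N = 0.01, 400_000
x = np.zeros(N)
noise = rng.normal(scale=np.sqrt(2 * D * dt), size=N)
for t in range(1, N):
    x[t] = x[t - 1] - (dt / tau) * x[t - 1] + noise[t]

# Stationary predictions: <x^2> = D*tau, and C(t) = D*tau*exp(-|t|/tau).
burn = 10_000
xs = x[burn:]
var = np.var(xs)                       # should be close to D*tau = 1
lag = int(tau / dt)                    # correlation at delay tau
C_tau = np.mean(xs[:-lag] * xs[lag:])  # should be close to D*tau*exp(-1)
print(var, C_tau / var)                # ratio should be near exp(-1) ~ 0.37
```

Adding a second hidden variable as in Eq (4) and repeating the measurement would show the two-exponential correlation instead.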

III. GAUSSIAN PROCESSES MORE GENERALLY
Trying to analyze the structure of correlations by constructing explicit dynamical equations, as in Eqs (1) or (4), may not be the best approach. In particular, if there are hidden dimensions, then there is no preferred coordinate system in the space of unmeasured variables, and hence no unique form for the dynamical equations. Let us instead focus on the probability distribution of trajectories x(t). For Gaussian processes this has the form

P[x(t)] = (1/Z) exp[ −(1/2) ∫dt ∫dt′ x(t) K(t − t′) x(t′) ],   (8)

where the integrals run over the interval of our observations, which should be long. The kernel K(τ) is inverse to the correlation function,

∫dt′ K(t − t′) ⟨x(t′)x(t′′)⟩ = δ(t − t′′).

We can divide the full trajectory x(t) into the past, with t ≤ 0, and the future, with t > 0. Schematically,

−ln P[x(t)] = (terms in the past alone) + (terms in the future alone) + ∫_{−∞}^{0} dt ∫_{0}^{∞} dt′ x(t) K_pf(t, t′) x(t′),

where K_pf couples the past and future. More explicitly, if this coupling has a finite rank D,

K_pf(t, t′) = Σ_{n=1}^{D} λ_n u_n(t) v_n(t′),

then everything that we can predict about future behavior given knowledge of past behavior is captured by D features,

F_n = ∫_{−∞}^{0} dt u_n(t) x(t), with n = 1, 2, …, D.   (13)

Equation (13) is telling us that the features {F_n} provide "sufficient statistics" for making predictions. We recall that in a dynamical system with D variables, predicting the future (t > 0) requires specifying D initial conditions (at t = 0). In this precise sense, the number of variables that we need to achieve maximum predictive power is the dimensionality of the dynamical system. To complete the argument, we need to show that K_pf has finite rank when correlations decay as a finite combination of exponentials; see Appendix A.
In the case of Gaussian stochastic processes we thus arrive at a recipe for defining the dimensionality of the underlying dynamics. We estimate the correlation function, take its inverse to find the kernel, and isolate the part of this kernel which couples past and future. If this past-future kernel is of finite rank, then we can identify this rank with the dimensionality of the system.
This discussion still refers only to Gaussian processes, but we see that the search for low-dimensional descriptions could fail, qualitatively. It is possible that the past-future kernel K_pf is not of finite rank; more generally, if we analyze signals in a window of size T then the rank can grow with T. This happens, for example, if behavioral correlations decay as a power of time,

⟨x(t)x(t′)⟩ ∝ |t − t′|^{−α}.   (16)

Under these conditions the system is infinite dimensional.
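On a discrete time grid the recipe can be carried out directly: build the correlation matrix, invert it, and count the singular values of the block that couples past to future. A sketch in Python (grid size, correlation parameters, and the rank threshold are all my illustrative choices):

```python
import numpy as np

def pastfuture_rank(C_func, n=40, tol=1e-8):
    """Numerical rank of the past-future block of the kernel K = C^{-1}."""
    t = np.arange(n)
    C = C_func(np.abs(t[:, None] - t[None, :]))  # stationary correlation matrix
    K = np.linalg.inv(C)
    K_pf = K[: n // 2, n // 2 :]                 # block coupling past to future
    s = np.linalg.svd(K_pf, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# One exponential: a one-dimensional (AR(1)-like) process.
r1 = pastfuture_rank(lambda d: 0.8 ** d)
# Two exponentials: two-dimensional hidden dynamics.
r2 = pastfuture_rank(lambda d: 0.9 ** d + 0.5 * 0.5 ** d)
# Power-law decay: no finite-dimensional description.
r3 = pastfuture_rank(lambda d: (1.0 + d) ** -0.5)
print(r1, r2, r3)  # expect 1, 2, and something much larger
```

The single exponential gives a rank-one coupling (the inverse of an AR(1) covariance is tridiagonal), two exponentials give rank two, and the power-law correlations give a rank that keeps growing with the window size.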

IV. BEYOND GAUSSIANS
What emerges from the analysis of Gaussian stochastic processes is that dimensionality can be measured through the problem of prediction. Let us see how we can make this more general, beyond the Gaussian case.
Let us break a very long observation of a time series into many examples of a time window −T < t < T. Within each window, the trajectory x(t < 0) defines the past x_past, and x(t > 0) defines the future x_fut. Across a large ensemble of these windows we can define the joint probability distribution P_T(x_past, x_fut). To characterize the possibility of making predictions we can measure the mutual information between past and future, or the "predictive information" [24],

I_pred(T) = ⟨ log₂ [ P_T(x_past, x_fut) / (P_T(x_past) P_T(x_fut)) ] ⟩.

The predictive information can have very different qualitative behaviors as T becomes large [24].¹ For a time series that can be captured by a finite state Markov process, or more generally described by a finite correlation time, I_pred(T) is finite as T → ∞. On the other hand, for Gaussian processes with correlation functions that decay as a power, as in Eq (16), the predictive information diverges logarithmically, I_pred(T → ∞) ∝ log T.
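For a two-state Markov process the saturation of I_pred(T) can be seen exactly by enumerating sequences. A sketch in Python (the transition probabilities are arbitrary illustrative choices):

```python
import itertools
import numpy as np

# Two-state Markov chain; P[i, j] = Prob(next state j | current state i).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
# Stationary distribution from the left eigenvector with eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

def seq_prob(seq):
    """Probability of a state sequence under the stationary chain."""
    p = pi[seq[0]]
    for a, b in zip(seq[:-1], seq[1:]):
        p *= P[a, b]
    return p

def I_pred(T):
    """Mutual information (bits) between T past and T future symbols."""
    joint = {}
    for seq in itertools.product((0, 1), repeat=2 * T):
        key = (seq[:T], seq[T:])
        joint[key] = joint.get(key, 0.0) + seq_prob(seq)
    p_past, p_fut = {}, {}
    for (a, b), p in joint.items():
        p_past[a] = p_past.get(a, 0.0) + p
        p_fut[b] = p_fut.get(b, 0.0) + p
    return sum(p * np.log2(p / (p_past[a] * p_fut[b]))
               for (a, b), p in joint.items() if p > 0)

vals = [I_pred(T) for T in (1, 2, 3, 4)]
print(vals)  # constant in T: for a Markov chain I_pred saturates immediately
```

Because the chain is Markovian, all of the predictive information is carried by the single most recent symbol, so I_pred(T) is the same for every T ≥ 1.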
¹ If we observe a continuous variable x(t) in continuous time, then smoothness across t = 0 generates a formal divergence in the mutual information between past and future. Modern analyses of behavior typically begin with video data, with time in discrete frames, evading this problem. Alternatively, if measurements of x(t) include a small amount of white noise, then the predictive information becomes finite even without discrete time steps. Thanks to A Frishman for emphasizing the need for care here.

In the example of a dynamical system with D variables, as in Eq (15), all the predictive power available will be realized if we can specify D numbers, which are the initial conditions for integrating the differential equations. Thus we consider mappings of the past into d features,

x_past → F = M_d(x_past), F ∈ ℝ^d.

For any choice of features we can compute how much predictive information has been captured, and then we can maximize over the mapping, resulting in

I_pred(T; d) = max over M_d of I(F; x_fut),

which is the maximum predictive information we can capture with d features in windows of duration T.
If the system truly is D dimensional, then D features of the past are sufficient to capture all of the available predictive information. This means that a plot of I_pred(T; d) vs d will saturate. To be precise we are interested in what happens at large T, so we can define

f(d) = lim as T → ∞ of I_pred(T; d)/I_pred(T).

If we find that f(d ≥ D) = 1, then we can conclude that the behavior has dimensionality D.
In words, the dimensionality of behavior is the minimum number of features of the past needed to make maximally informative predictions about the future, and to be precise the past should be taken to be of long duration. While very general, it should be admitted that this definition is much more complex than the analysis of correlation functions that works in the Gaussian case. To use this definition we have to search over all possible mappings M_d of long past trajectories into d-dimensional feature spaces, and we have to estimate the mutual information between these d variables and some representation of the future trajectory. Both of these steps are challenging.
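In the Gaussian case the search simplifies, because the mutual information between jointly Gaussian blocks is a ratio of determinants, I = (1/2) ln[det Σ_pp det Σ_ff / det Σ]. The following sketch (with illustrative parameters) checks, for a one-dimensional AR(1) process, that a single feature of the past, its most recent value, already captures all of the predictive information:

```python
import numpy as np

def gauss_mi(Sigma, idx_a, idx_b):
    """Mutual information (nats) between two blocks of a Gaussian vector."""
    Saa = Sigma[np.ix_(idx_a, idx_a)]
    Sbb = Sigma[np.ix_(idx_b, idx_b)]
    idx = idx_a + idx_b
    S = Sigma[np.ix_(idx, idx)]
    return 0.5 * (np.linalg.slogdet(Saa)[1]
                  + np.linalg.slogdet(Sbb)[1]
                  - np.linalg.slogdet(S)[1])

# AR(1) correlations C(t) = r^|t| over m past and m future time points.
m, r = 8, 0.7
t = np.arange(2 * m)
Sigma = r ** np.abs(t[:, None] - t[None, :])

past = list(range(m))            # the m past time points
future = list(range(m, 2 * m))   # the m future time points
I_full = gauss_mi(Sigma, past, future)     # all of the past
I_last = gauss_mi(Sigma, [m - 1], future)  # one feature: the latest value
print(I_full, I_last)  # equal: one number from the past suffices
```

For this Markovian example f(d = 1) = 1 exactly; for a genuinely D-dimensional process one would need D such features before the two quantities agree.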

V. DISCRETE STATES
In many cases it is natural to describe animal behavior as moving through a sequence of discrete states. We do this, for example, when we transcribe human speech to text, and when we describe a bacterium as running or tumbling [25]. This identification of discrete states is not just an arbitrary quantization of continuous motor outputs, nor should it be a qualitative judgement by human observers. Discrete states should correspond to distinguishable clusters, or resolvable peaks in the distribution over the natural continuous variables, and the dynamics should consist of movements in the neighborhood of one peak that are punctuated by relatively rapid jumps to another peak (e.g., Ref [16]). A "mechanism" for such discreteness is the existence of multiple dynamical attractors, with jumps driven by noise (e.g., Refs [5, 6]).
When behavioral states are discrete, how do we define dimensionality? Once again it is useful to think about the simplest case, where there are just two behavioral states, perhaps "doing something" and "doing nothing," and time is marked by discrete ticks of a clock. We can represent the two states at each time t by an Ising variable σ_t = ±1. If the sequence of behavioral states were Markovian, then σ_t depends only on σ_{t−1}, and because σ² = 1 the only possible stationary probability distribution for the sequences is

P({σ_t}) = (1/Z) exp[ h Σ_t σ_t + J Σ_t σ_t σ_{t+1} ],

which is the one-dimensional Ising model with nearest neighbor interactions. Importantly, if we measure the correlations of the fluctuations in behavioral state around its mean, we find that these correlations decay exponentially,

C(t − t′) ≡ ⟨σ_t σ_{t′}⟩ − ⟨σ⟩² ∝ e^{−|t−t′|/τ_c},

where we can express τ_c in terms of h and J [26]. This reminds us of the exponential decays in the continuous case with Gaussian fluctuations, where they provide a signature of low dimensionality. Suppose that we have only two states, but observe correlations that do not decay as a single exponential. Then the probability distribution P({σ_t}) must have terms that describe explicit dependences of σ_t on σ_{t′} with t − t′ > 1. This can be true only if there are some hidden states or variables that carry memory across the temporal gap t − t′. A sensible definition for the dimensionality of behavior then refers to these internal variables.
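The single-exponential decay can be verified exactly for a two-state Markov chain, working directly with the transition matrix rather than with (h, J): the correlation function is C(t) = C(0) λ₂^t, where λ₂ is the second eigenvalue of the transition matrix, so τ_c = −1/ln λ₂. A sketch with illustrative numbers:

```python
import numpy as np

# Two-state Markov chain on sigma = +/-1; P[i, j] = Prob(j | i).
P = np.array([[0.85, 0.15],
              [0.30, 0.70]])
sigma = np.array([+1.0, -1.0])

# Stationary distribution and second eigenvalue.
pi = np.array([P[1, 0], P[0, 1]]) / (P[0, 1] + P[1, 0])
lam2 = np.trace(P) - 1.0   # eigenvalues of a 2x2 stochastic matrix: 1 and tr-1

def C(t):
    """Exact covariance <sigma_0 sigma_t> - <sigma>^2 by matrix power."""
    Pt = np.linalg.matrix_power(P, t)
    mean = pi @ sigma
    joint = (pi[:, None] * Pt) * sigma[:, None] * sigma[None, :]
    return joint.sum() - mean ** 2

tau_c = -1.0 / np.log(lam2)
ratios = [C(t) / (C(0) * lam2 ** t) for t in range(1, 6)]
print(tau_c, ratios)  # each ratio is 1: a single exponential, C(t) = C(0) e^{-t/tau_c}
```

Any departure of the measured ratios from unity would signal memory beyond one time step, i.e., hidden variables.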
Imagine that we observe the mean of the behavioral variable, ⟨σ⟩, and the correlation function C(t − t′). What can we say about the probability distribution P({σ_t})? There are infinitely many models that are consistent with measurements of just the (two-point) correlations, but there is one that stands out as having the minimal structure required to match these observations [27]. Said another way, there is a unique model that predicts the observed correlations but otherwise generates behavioral sequences that are as random as possible. This minimally structured model is the one that has the largest possible entropy, and it has the form

P({σ_t}) = (1/Z) exp[ h Σ_t σ_t + (1/2) Σ_{t,t′} J(t − t′) σ_t σ_{t′} ],   (24)

where the parameter h must be adjusted so that the model predicts the observed mean behavior ⟨σ⟩, and the function J(t − t′) must be adjusted so that the model predicts the observed correlation function C(t − t′).
Maximum entropy models have a long history, and a deep connection to statistical mechanics [27]. As applied to temporal sequences, maximum entropy models sometimes are referred to as maximum caliber [28]. The Boltzmann distribution, which describes a system in thermal equilibrium, can be derived as the maximum entropy distribution over microscopic states of the system that is consistent with its mean energy, and this sometimes leads to the impression that maximum entropy models only describe equilibrium systems, but this isn't correct. In this discussion we are explicitly using the maximum entropy idea to describe distributions of sequences or trajectories, not distributions over states as with the Boltzmann distribution. In the case of just two states, if we only constrain the two-point correlations C(t − t′) we cannot distinguish the arrow of time, but as soon as we have more states, or constrain higher-order correlations, the maximum entropy model can break time-reversal invariance or detailed balance. For biological systems there has been interest in the use of maximum entropy methods to describe amino acid sequence variation in protein families [29–31], patterns of electrical activity in populations of neurons [32–36], velocity fluctuations in flocks of birds [37, 38], and more. There have been more limited attempts to use these ideas in describing temporal sequences, either in neural populations [39] or flocks [40–42].
The maximum entropy model in Eq (24) can be rewritten exactly as a model in which the behavioral state at time t depends only on some internal variable x(t):

P({σ_t}) = ∫ dx Q({x_t}) Π_t p(σ_t | x_t), with p(σ_t | x_t) = e^{(h + x_t)σ_t} / [2 cosh(h + x_t)],

and the distribution of the internal variable is

Q({x_t}) ∝ [Π_t 2 cosh(h + x_t)] exp[ −(1/2) Σ_{t,t′} x_t K(t − t′) x_{t′} ],   (28)

where K(t) is the matrix inverse of the function J(t),

Σ_{t′} K(t − t′) J(t′ − t′′) = δ_{t t′′}.

Notice that since J(0) multiplies σ_t σ_t = 1, its value can't change the observable statistics of behavior, so we have some freedom in writing the model this way.² Starting from discrete binary states we thus are led back to an underlying continuous variable, and we can carry over our definitions of dimensionality. Although x(t) is not Gaussian, the only coupling of past and future is through a kernel K(t), just as in the Gaussian case, as we see by comparing Eqs (8) and (28). This kernel is not the inverse of the observed behavioral correlations, but of the effective interactions between states at different times, J(τ). But, importantly, we are considering quantities that are determined by the correlation function, and hence the problem is conceptually similar to the Gaussian case: we analyze the correlations to derive a kernel, and the dimensionality of behavior is the rank of this kernel. The maximum entropy model plays a useful role because it is the least structured model consistent with the observed correlations.
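The freedom in J(0) is easy to verify by brute force for a short sequence: adding a constant to the diagonal of the interaction matrix changes the normalization, but not the distribution over sequences. A sketch (the window length, field, and interaction shape are illustrative choices):

```python
import itertools
import numpy as np

n = 6      # a short window of n time steps
h = 0.3
# A symmetric interaction J(t - t') with, say, exponential decay.
t = np.arange(n)
J = 0.4 * 0.6 ** np.abs(t[:, None] - t[None, :])

def distribution(J):
    """Exact distribution over all 2^n sequences of sigma = +/-1."""
    states = np.array(list(itertools.product((-1, 1), repeat=n)), dtype=float)
    energy = h * states.sum(axis=1) \
        + 0.5 * np.einsum('si,ij,sj->s', states, J, states)
    w = np.exp(energy)
    return w / w.sum()

p1 = distribution(J)
p2 = distribution(J + 5.0 * np.eye(n))  # shift J(0) by an arbitrary constant
print(np.max(np.abs(p1 - p2)))          # numerically zero: J(0) is unobservable
```

This same freedom is what lets us shift the diagonal until J is positive definite, so that the rewriting in terms of the internal Gaussian-kernel variable is well defined.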
If x(t) is one-dimensional, then the interactions J(t) ∼ J₀ e^{−|t|/τ}. The correlations C(t) are predicted to decay exponentially at large |t|, but by the time this limit is reached the correlations may be so weak that this is hard to measure convincingly. At the more accessible intermediate times the behavior of C(t) can be complicated even though the underlying dynamics are one-dimensional. At the opposite extreme, if x(t) has effectively infinite dimensionality, then we can have J(t) ∼ J₀|t|^{−α}, as in Eq (16). Ising models with such power-law interactions are the subject of a large literature in statistical physics; the richest behaviors are at α = 2, where results presaged major developments in the renormalization group and topological phase transitions [44–47]. It would be fascinating if these models emerged as effective descriptions of strongly non-Markovian sequences in animal behavior.

² Models where the observed degrees of freedom depend on hidden or latent variables, but not directly on one another, are sometimes set in opposition to statistical physics models, where it is more natural to think about direct interactions. But this example shows that these pictures can be mathematically equivalent; see also the Supplementary material of Ref [43].

VI. PERSPECTIVES
The explosion of quantitative data on animal behavior is exciting in part because these essentially macroscopic behaviors, rather than their microscopic mechanisms, are what first strike us as being interesting about living systems. Behaviors have been selected by evolution for their utility, and as we observe them it is difficult not to think of them as purposeful or intelligent. Understanding the phenomena of life means explaining how these behaviors arise, ultimately from interactions among molecules that obey the same laws of physics as in inanimate (and unintelligent) matter. But what is it, exactly, that we are trying to explain?
If we want to explain why we look like our parents, a qualitative answer is that we carry copies of their DNA. But if we want to understand the reliability with which traits are passed from generation to generation,³ then talking about DNA structure is not enough: the energy differences between correct and incorrect base pairing are not sufficient to explain the reliability of molecular copying if the reactions are allowed to come to thermal equilibrium, and this problem arises not just in DNA replication but in every step of molecular information transmission. Cells achieve reliability by holding these reactions away from equilibrium, allowing for proofreading or error correction [49, 50]. In the absence of proofreading, the majority of proteins would contain at least one incorrect amino acid, and roughly 10% of our genes would be different from those carried by either parent; these error rates are orders of magnitude larger than observed. These quantitative differences are so large that life without proofreading would be qualitatively different.
The example of proofreading highlights the importance of starting with a quantitative characterization of the phenomena we are trying to explain. In an era of highly mechanistic biology, this emphasis on phenomenological description may seem odd. But quantitative phenomenology has been foundational, certainly in physics and also in the mainstream of biology. Mendel's genetics was a phenomenological description of the patterns of inheritance, and the realization that genes are arranged linearly along chromosomes came from a more refined quantitative analysis of these same patterns [51]. The work of Hodgkin and Huxley led to our modern understanding of electrical activity in terms of ion channel dynamics, but explicitly eschewed mechanistic claims in favor of phenomenology [52]. The idea that transmission across a synapse depends on transmitter molecules packaged into vesicles emerged from the quantitative analysis of voltage fluctuations at the neuromuscular junction [53].
Even when we are searching for microscopic mechanisms, it is not anachronistic to explore macroscopic descriptions. Time and again, the scientific community has leaned on phenomenology to imagine the underlying mechanisms, often taking literally the individual terms in a mathematical description as representing the actual microscopic elements for which we should be searching, whether these are genes, ion channels, synaptic vesicles, or quarks [54–56]. What is anachronistic, in the literal sense of the word, is to believe that microscopic mechanisms were discovered by direct microscopic observations without guidance from phenomenology on a larger scale.
The idea that quantitative phenomenology would provide a foundation for understanding brains and minds took hold very early in the modern era. In the second half of the nineteenth century, many people were trying to turn observations on seeing and hearing into quantitative experiments, creating a subject that would come to be called psychophysics [65]. By ∼1910, these experiments were sufficiently well developed that Lorentz could look at data on the "minimum visible" and suggest that the retina is capable of counting single photons [66], and Rayleigh could identify the conflict between our ability to localize low-frequency sounds and the conventional wisdom that we are "phase deaf" [67]. Both of these essentially theoretical observations, grounded in quantitative descriptions of human behavior, would drive experimental efforts that unfolded over more than fifty years. Also ∼1910, von Frisch was doing psychophysics experiments to demonstrate that bees could, in fact, discriminate among the beautiful colors of the flowers that they pollinate [68]. But he took these experiments in a very different direction, focusing not on the discrete choices made by individual bees, but on how these individuals communicated their sensory experiences to other residents of the hive. This work led to the discovery of the "dance language" of bees. While von Frisch often used simplified stimuli, and counted whether bees arrived at a destination or not, it was crucial that the intermediate behaviors (the dance) were unconstrained and fully natural.
What grew out of the work by von Frisch and others was the field of ethology [69], which emphasizes the richness of behavior in its natural context, the context in which it was selected for by evolution. Because ethologists wrestle with complex behaviors, they often resort to verbal description. In contrast, psychophysicists focus on situations in which subjects are constrained to a small number of discrete alternative behaviors, so it is natural to give a quantitative description by estimating the probabilities of different choices under various conditions.
The emergence of a quantitative language for the analysis of psychophysical experiments was aided by the focus on constrained behaviors, but was not an automatic consequence of this focus. For photon counting in vision, the underlying physics suggests how the probability of seeing vs. not seeing will depend on light intensity [70], but the observation that human observers behave as predicted points to profound facts about the underlying mechanisms [71]. During World War II, attempts to formalize the problems of radar operators and communication with pilots led to a more general view of the choices among discrete alternative behaviors as being discriminations among signals in a background of noise [72]. In the 1950s and 60s this view was exported to experimental psychology and developed further into the modern form of signal detection theory [65]. Much of this now seems like an exercise in probability and statistics, something obviously correct, but the early literature records considerable skepticism about whether this (or perhaps any) mathematization of human behavior would succeed.
Much has been learned through both the ethological and the psychophysical approaches, yet it is easy for advocates of the two approaches to talk past one another. Still, it does not seem unfair to note that the traditional ethological approach is missing the drive for quantification, while the traditional psychophysical approach achieves quantitative sophistication by excluding much of what impresses us about behavior. The challenge is not to find what each tradition is lacking, but to find a way of combining the best from both, and this brings us back to the questions asked at the outset.
A quantitative characterization of naturalistic behaviors requires that we attach comparable numbers to very different kinds of time series. Dimensionality is a candidate for this sort of characterization. When we do psychophysics, we characterize behaviors with numbers that are meaningfully comparable across situations and across species. To give but one example, we can discuss the accumulation of evidence for decisions that humans and non-human primates make based on visual inputs, and we can use the same mathematical language to discuss decisions made by rodents based on auditory inputs [73]. Perhaps the dimensionality of behavior will provide part of the needed unifying mathematical language for more natural behaviors.