Understanding frequency distributions of path-dependent processes with non-multinomial maximum entropy approaches

Path-dependent stochastic processes are often non-ergodic, and observables can no longer be computed within the ensemble picture. The resulting mathematical difficulties pose severe limits to the analytical understanding of path-dependent processes. Their statistics is typically non-multinomial in the sense that the multiplicities of the occurrence of states are not given by a multinomial factor. The maximum entropy principle is tightly related to multinomial processes, non-interacting systems, and the ensemble picture; it loses its meaning for path-dependent processes. Here we show that an equivalent to the ensemble picture exists for path-dependent processes, such that the non-multinomial statistics of the underlying dynamical process is, by construction, captured correctly in a functional that plays the role of a relative entropy. We demonstrate this for self-reinforcing Pólya urn processes, which explicitly generalize multinomial statistics. We demonstrate the adequacy of this constructive approach towards non-multinomial entropies by computing frequency and rank distributions of Pólya urn processes. We show how the microscopic update rules of a path-dependent process allow us to explicitly construct a non-multinomial entropy functional that, when maximized, predicts the time-dependent distribution function.


Introduction
'It seems questionable whether the Boltzmann principle alone, meaning without a complete [...] mechanical description or some other complementary description of the process, can be given any meaning.' Einstein's famous critical comment on the completeness of the Boltzmann entropy [1] is still thought provoking. For ergodic systems over a well defined set of states, e.g. [2], this critique has turned out to be of minor relevance. Here we demonstrate how Einstein's observation becomes relevant again when dealing with non-ergodic, path-dependent systems or processes, i.e. processes where ensemble and time averages cease to yield identical results and the ensemble description of a process fails to describe the dynamics of a particular realization (e.g. compare [3]).
Moreover, for path-dependent systems we have to specify what we mean by 'entropy', since no unique generalization of entropy from equilibrium to non-equilibrium systems exists. However, Boltzmann's principle is grounded in the idea that in large systems the most likely samples we may draw from a process, the so-called maximum configuration, also characterize the typical samples, while it becomes very unlikely to draw atypical samples. In fact we will demonstrate the possibility to directly construct 'entropic functionals' from the microscopic properties determining the dynamics of a large class of non-ergodic processes, using the maximum-configuration framework. In this approach we identify relative entropy (up to a multiplicative constant) with the logarithm of the probability to observe a particular macrostate (which typically is represented by a histogram over a set of observable states), compare e.g. [4]. By construction, maximization of the resulting entropy functionals leads to adequate predictions of the statistical properties of non-ergodic processes in the maximum configuration.
For ergodic processes it is possible to replace time averages of observables by their ensemble averages, which leads to a tremendous simplification of computations. In particular, this is true for systems composed of independent particles or for Bernoulli processes, i.e. processes where samples are drawn independently, and the states of the independent components or observations collectively follow a multinomial statistics. The multinomial statistics of such a system with W observable states $i = 1, \ldots, W$ is captured by a functional that coincides with Shannon entropy [5], $H(p) = -\sum_i p_i \log p_i$, where $p = (p_1, \ldots, p_W)$ is the empirical relative frequency distribution of observing states $i$ in an experiment of drawing from the process $N$ times, i.e. $p = k/N$ is the normalized histogram of the experiment in which state $i$ has been drawn $k_i$ times. Clearly, $\sum_i k_i = N$. In this context $H(p)$ can be understood as the rescaled logarithm of the multinomial factor, i.e. $\frac{1}{N}\log\binom{N}{k} \sim -\sum_i p_i \log p_i$ (e.g. compare [6]). Maximization of Shannon entropy under constraints therefore is a way of finding the most likely relative frequency distribution function (normalized histogram of sampled events) one will observe when measuring a system, provided that it follows a multinomial statistics. Constraints represent knowledge about the system. Bernoulli processes with multinomial statistics are characterized by the prior probabilities $q = (q_1, \ldots, q_W)$. In general, we denote the set of parameters characterizing a process by θ; in the multinomial case $\theta \equiv q$.
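The relation $\frac{1}{N}\log\binom{N}{k} \sim -\sum_i p_i \log p_i$ can be checked numerically; the following is a minimal sketch (the histogram values and state-space size are arbitrary illustrative choices):

```python
import math

def log_multinomial(k):
    """Logarithm of the multinomial factor N! / (k_1! ... k_W!)."""
    N = sum(k)
    return math.lgamma(N + 1) - sum(math.lgamma(ki + 1) for ki in k)

def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i (natural logarithm)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Histogram k over W = 4 states with N = 10^6 samples.
k = [500000, 250000, 150000, 100000]
N = sum(k)
p = [ki / N for ki in k]

# (1/N) log(multiplicity) approaches the Shannon entropy of p for large N.
lhs = log_multinomial(k) / N
rhs = shannon_entropy(p)
print(lhs, rhs)  # agree up to O(log(N)/N) corrections
```

The residual difference between the two values is the usual Stirling correction, of order $\log(N)/N$.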
Denoting the probability to measure a specific histogram by $P(k|\theta, N)$, the most likely histogram $\hat{k}$, which maximizes $P(k|\theta, N)$, is the optimal predictor, the so-called maximum configuration. For a multinomial distribution one finds $\frac{1}{N}\log P(k|q, N) \sim H(p) + \sum_i p_i \log q_i$, which is (up to a sign) the relative entropy or Kullback-Leibler divergence [7]. The term $H(p)$ coincides with Shannon entropy; the term that depends on $q$ is called the cross-entropy and is a linear functional in $p$. By re-parametrizing $\log q_i = -\beta\epsilon_i$, one obtains the form familiar from statistical physics, where the constants $\epsilon_i$ typically correspond to energies and β to the so-called inverse temperature of a system. Maximization of this functional with respect to p yields the most likely empirical distribution function; this is sometimes called the maximum entropy principle. Clearly, systems composed of independent components follow a multinomial statistics. Note that a multinomial statistics is also a direct consequence of working with ensembles of statistically independent systems. In this case the multinomial distribution function reflects the ensemble property and is not necessarily a property of the system itself. Therefore H(p) only has physical relevance for systems that consist of sufficiently independent elements. For path-dependent processes, where ensemble and time averages typically yield different results, H(p) remains the entropy of the ensemble picture, but ceases to be the 'physical' entropy that captures the time evolution of a path-dependent process. Obviously, it is nonsensical to assume that the entropy functional H, which is consistent with an underlying multinomial statistics, is in general also adequate for characterizing path-dependent processes that are inherently non-multinomial (that break the multinomial symmetry).
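The maximum entropy principle in this multinomial setting can be made concrete: maximizing $H(p) + \sum_i p_i \log q_i$ under normalization and a fixed mean energy $\sum_i p_i \epsilon_i = U$ yields $p_i \propto q_i\, e^{-\beta\epsilon_i}$, with β fixed by the constraint. A minimal sketch (the priors, energies, and target mean energy are illustrative choices), solving for β by bisection:

```python
import math

def maxent_distribution(q, eps, beta):
    """p_i proportional to q_i * exp(-beta * eps_i): the maximizer of
    H(p) + sum_i p_i log q_i - beta * sum_i p_i eps_i under normalization."""
    w = [qi * math.exp(-beta * e) for qi, e in zip(q, eps)]
    Z = sum(w)
    return [wi / Z for wi in w]

def mean_energy(p, eps):
    return sum(pi * e for pi, e in zip(p, eps))

q = [0.25] * 4              # uniform priors (illustrative)
eps = [0.0, 1.0, 2.0, 3.0]  # 'energies' (illustrative)
U_target = 1.0              # imposed mean energy

# Bisection on the inverse temperature beta: mean energy decreases with beta.
lo, hi = -50.0, 50.0
for _ in range(200):
    beta = 0.5 * (lo + hi)
    if mean_energy(maxent_distribution(q, eps, beta), eps) > U_target:
        lo = beta  # energy still too high: increase beta
    else:
        hi = beta
p = maxent_distribution(q, eps, beta)
print(beta, p)
```

Since the target mean energy lies below the unconstrained mean, the solver returns a positive β and a distribution that decays with energy, as expected.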
Surprisingly, the possibility that non-multinomial max-ent functionals can be constructed for path-dependent processes seems to have attracted only little attention. In [4] it was noticed that a particular class of non-Markovian random walks with strongly correlated increments can be constructed, where the multiplicity of event sequences is no longer given by the multinomial factor, and the max-ent entropy functional of the process class violates the composition axiom of Khinchin [8]. The general method of constructing a relative entropy principle for a particular process class does not inherently depend on the validity of particular information-theoretic axioms, which opens a way to a general treatment of path-dependent and non-equilibrium processes. We demonstrate this by constructing the max-ent entropy of multi-state Pólya urn processes [9, 10].
In multi-state Pólya processes, once a ball of a given color is drawn from an urn, it is returned to the urn together with δ additional balls of the same color, see figure 1. They represent self-reinforcing, path-dependent processes that display the 'rich-get-richer' and the 'winner-takes-all' phenomena. Pólya urns are related to the beta-binomial distribution, Dirichlet processes, the Chinese restaurant problem, and models of population genetics. Their mathematical properties were studied in [11, 12]; extensions and generalizations of the concept are found in [13, 14], applications to limit theorems in [15-17]. Pólya urns have been used in a wide range of practical applications, including response-adaptive clinical trials [18], tissue growth models [19], institutional development [20], computer data structures [21], resistance to reform in EU politics [22], aging of alleles and Ewens's sampling formula [23, 24], image segmentation and labeling [25], and the emergence of novelties in evolutionary scenarios [26, 27]. A notion of Pólya divergence was recently defined in [28] in the context of Sanov's theorem [29]. That work characterizes Pólya urns in a regime of weak reinforcement. More precisely, the Pólya divergence is derived for situations where the ratio between N, the number of samples drawn from the Pólya urn, and the number $A_0$ of balls initially contained in the urn is asymptotically fixed. So even if the number δ of balls added to the urn at each trial is large, the number of balls initially contained in the urn is much larger. In this regime of weak reinforcement Pólya urns behave similarly to Bernoulli processes. Our constructive approach allows us to access strong reinforcement parameters $\gamma > 0$, and the transition of Pólya urn dynamics from a Bernoulli-process-like behavior to a winner-takes-all type of dynamics can be studied.

Non-multinomial max-ent functionals
The general aim is to construct a max-ent functional for a path-dependent process, which allows us to infer the maximum configuration, i.e. the most likely sample we may draw from a process of interest. From a given class of processes X we select a particular process $X(\theta)$, specified by a set of parameters θ. Running the process $X(\theta)$ for N consecutive iterations produces a sequence of observed states $x = (x_1, \ldots, x_N)$, where each $x_n$ takes a value from W possible states. As before, we assume the existence of a most likely histogram $\hat{k}$ that maximizes the histogram probability $P(k|\theta, N)$. To construct a max-ent functional for X, one has to conveniently rescale $\log P(k|\theta, N)$; the rescaled quantity $\psi(p|\theta, N)$ represents (up to a sign) a functional providing us with a notion of relative entropy (information divergence) for the process class X. If this process class X is the class of Bernoulli processes, such that $P(k|q, N)$ is the multinomial distribution, then asymptotically one recovers (up to a sign) the Kullback-Leibler divergence. In the following we compute $\psi(p|\theta, N)$ for Pólya urn processes.

Max-ent functional for Pólya urns
In urn models the observable states $i$ are represented by the colors of the balls contained in the urn. The likelihood of drawing a ball of color $i$ is determined by the number of balls of that color contained in the urn. Initially the urn contains $a_i$ balls of color $i = 1, \ldots, W$, so that the initial prior probability to draw a ball of color $i$ is $q_i = a_i/A_0$, where $A_0 = \sum_i a_i$ is the total number of balls initially in the urn. Balls are drawn sequentially from the urn. Whenever a ball of color $i$ is drawn, it is put back into the urn and another δ balls of the same color are added. This defines the multi-state Pólya process [9]. A particular Pólya process is fully characterized by the parameters $\theta = (a_1, \ldots, a_W; \delta)$. Drawing without replacement is the hypergeometric process; drawing with replacement ($\delta = 0$) is the multinomial process.

Figure 1. Balls are drawn sequentially from the urn. Whenever a ball is drawn, it is returned together with δ additional balls of the same color (here $\delta = 2$); then the next ball is drawn, and the process is repeated for N iterations. This reinforcement creates a history-dependent dynamics. The configurations obtained after successive iterations have a non-multinomial structure.
If $k = (k_1, \ldots, k_W)$ denotes the histogram of the colors drawn in the first N steps, the probability to draw a ball of color $i$ in the $(N+1)$th step is $p(i|k, \theta) = \frac{a_i + \delta k_i}{A_0 + \delta N}$, which depends on the history of the process in terms of the histogram k.
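This update rule can be simulated directly from the histogram, without tracking individual balls; a minimal sketch (parameter values are arbitrary illustrative choices):

```python
import random

def polya_urn_sample(a, delta, N, rng):
    """Sample a histogram k of N draws from a multi-state Polya urn with
    initial ball counts a and reinforcement delta, using the update rule
    p(i | k) = (a_i + delta * k_i) / (A_0 + delta * n) at step n + 1."""
    W = len(a)
    k = [0] * W
    for n in range(N):
        # Unnormalized weights; their total is A_0 + delta * n by construction.
        weights = [a[i] + delta * k[i] for i in range(W)]
        i = rng.choices(range(W), weights=weights)[0]
        k[i] += 1
    return k

rng = random.Random(42)
k = polya_urn_sample(a=[1, 1, 1, 1], delta=2, N=1000, rng=rng)
print(k)  # a single path; strongly reinforced runs concentrate on few colors
```

Individual runs differ wildly (the process is path-dependent), which is exactly why the histogram statistics, rather than single paths, is the object of interest.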
Starting from the empty sequence, the probability of sampling a sequence $x = (x_1, \ldots, x_N)$ can be computed iteratively from the conditional probabilities $p(x_{n+1}|k, \theta)$. The probability of observing a particular histogram $k$ after $N$ trials becomes $P(k|\theta, N) = \binom{N}{k} \prod_{i=1}^{W} a_i^{(\delta, k_i)} / A_0^{(\delta, N)}$, where $a^{(\delta, n)} = a(a+\delta)(a+2\delta)\cdots(a+(n-1)\delta)$ denotes the generalized rising factorial. This is almost of multinomial form: it is a multinomial factor times a term depending on θ. One might conclude that the max-ent functional for Pólya processes is Shannon entropy in combination with a generalized cross-entropy term that depends on θ. However, this turns out to be wrong, since the contributions from the generalized powers cancel the multinomial factor. Expanding the generalized powers, which is valid for sufficiently small $y = a_i/\delta$, i.e. for sufficiently large δ, and using the notation $\gamma \equiv \delta/A_0$, we obtain $\psi(p|q, \gamma, N)$, where $k = pN$. Following the construction discussed above, we identify the max-ent functional, which no longer scales explicitly with N; here $\phi(N) = 1$ ($c = 0$), so that $\psi = \Psi$. Convenient choices for λ are the following: $\lambda = 1$ recovers ψ as in equation (9); alternatively, one may choose λ so that the functional is normalized for a uniform initial distribution, $q_i = 1/W$, of balls in the urn. The finite-size Pólya entropy, equation (10), yields a well defined entropy even if some states i have vanishing probability $p_i = 0$.
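The histogram probability $P(k|\theta, N) = \binom{N}{k} \prod_i a_i^{(\delta, k_i)} / A_0^{(\delta, N)}$, with $a^{(\delta, n)} = a(a+\delta)\cdots(a+(n-1)\delta)$, follows from the exchangeability of Pólya urn draws. A sketch that evaluates this expression and checks that it sums to one over all histograms (small parameter values for tractability; the specific values are arbitrary):

```python
import math
from itertools import product

def rising(a, delta, n):
    """Generalized rising factorial a^(delta, n) = a (a+delta) ... (a+(n-1)delta)."""
    out = 1.0
    for m in range(n):
        out *= a + m * delta
    return out

def polya_histogram_prob(k, a, delta):
    """P(k | theta, N): multinomial factor times the rising-factorial ratio."""
    N = sum(k)
    A0 = sum(a)
    multinom = math.factorial(N)
    for ki in k:
        multinom //= math.factorial(ki)
    num = 1.0
    for ki, ai in zip(k, a):
        num *= rising(ai, delta, ki)
    return multinom * num / rising(A0, delta, N)

a, delta, N = [1, 2, 3], 2, 6
# The probabilities of all histograms with sum(k) = N must add up to 1.
total = sum(
    polya_histogram_prob(k, a, delta)
    for k in product(range(N + 1), repeat=len(a))
    if sum(k) == N
)
print(total)  # ≈ 1.0
```

The normalization check confirms that the multiplicity and the history-dependent weights together account for every possible path.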
To simplify the following analysis we consider the limit $N \to \infty$ of this functional, where the notion of 'information divergence' for Pólya processes essentially reduces to $\psi(p|q, \gamma) = \sum_i \left(\frac{q_i}{\gamma} - 1\right) \log p_i$, up to terms of order $1/N$ and terms that do not explicitly depend on $p_i$ or $q_i$. In this limit the asymptotic Pólya 'entropy' is given by $H_{Polya}(p) = -\sum_i \log p_i$. We observe that one cannot derive $H_{Polya}(p)$ from the multiplicity of the system, which gets canceled by counter terms, as we have seen above. In addition, we note that the q-dependent terms, $\frac{1}{\gamma}\sum_i q_i \log p_i$ in equation (12), play the role of the Pólya 'cross-entropy', which is no longer linear in p.
Maximizing $\psi(p|q, \gamma)$ with respect to p under the constraint $\sum_i p_i = 1$ either leads to the solution $p_i = \frac{1}{\zeta}\left(\frac{q_i}{\gamma} - 1\right)$, equation (14), or, if this cannot be satisfied, to boundary solutions $p_i = 0$; ζ is a normalization constant. There exist three scenarios. (A) If $\gamma < \min_i q_i$, equation (14) is the max-ent solution for all i (no boundary solutions); the limit $\gamma \to 0$ provides the correct multinomial limit $p_i \to q_i$.
If $\gamma > \min_i q_i$, ψ becomes maximal when those i with $q_i > \gamma$ follow the solution of equation (14), while for those i with $q_i < \gamma$ equation (14) becomes negative and unstable and is replaced by a boundary solution $p_i = 0$: cases (B) and (C). The Pólya max-ent not only allows us to predict $p_i$ from the initial prior probabilities $q_i$, it also identifies γ as the crucial parameter that distinguishes between the three regimes of Pólya urn dynamics (see footnote 5). For sufficiently large but finite N, the analysis above is more involved but solvable.
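Assuming the stationary solution $p_i \propto q_i/\gamma - 1$ for $q_i > \gamma$ with boundary solutions $p_i = 0$ otherwise (equation (14)), the max-ent prediction can be evaluated directly; a sketch with illustrative priors:

```python
def polya_maxent(q, gamma):
    """Max-ent prediction for the asymptotic Polya frequency distribution:
    p_i proportional to q_i/gamma - 1 where q_i > gamma, p_i = 0 otherwise."""
    raw = [qi / gamma - 1.0 if qi > gamma else 0.0 for qi in q]
    zeta = sum(raw)  # normalization constant
    return [r / zeta for r in raw]

q = [0.5, 0.25, 0.15, 0.10]  # illustrative priors

# Weak reinforcement (gamma below min q): all states survive, p -> q as gamma -> 0.
p_weak = polya_maxent(q, gamma=1e-6)

# Strong reinforcement (gamma above gamma_crit = min q): low-prior states freeze out.
p_strong = polya_maxent(q, gamma=0.2)
print(p_weak)    # close to q
print(p_strong)  # only states with q_i > 0.2 retain nonzero frequency
```

The threshold behavior at $\gamma_{crit} = \min_i q_i$ is visible directly: below it all colors keep a finite asymptotic frequency, above it the low-prior colors are predicted to effectively never be drawn.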
Assuming uniformly distributed priors, $q_i = 1/W$ for all i, the max-ent result equation (14) correctly predicts the uniform distribution $p_i = 1/W$, while observed distributions p may strongly deviate from this prediction. This result reflects the fact that although the Pólya urn process is inherently unstable (e.g. winner-takes-all), with little chance of predicting who in particular will win, i.e. which color of balls will dominate the others, over many repetitions of the experiment every color of balls has the same chance to win (or chances biased according to the priors q). This discrepancy between ensemble average and time average makes it impossible to predict who in particular will win or lose in the course of time. However, using detailed information about the process one can predict how winners win. In particular one can (i) predict the onset of instability, i.e. the emergence of colors i that will effectively never be drawn, at $\gamma_{crit} = \min_i q_i$ (compare figure 2), and (ii) construct a maximum entropy functional for predicting the time-dependent frequency distribution of a process, i.e. the probability of observing a state i exactly n times. As a consequence, one can also derive the rank distributions of the process, i.e. the frequencies of observing balls of some color after ranking those frequencies according to their magnitude.

Footnote 5: Note that a Pólya urn $U_1$ that initially contains $A_0$ balls, with $\gamma = \delta/A_0$, and has evolved for N steps, can be regarded as another Pólya urn, $U_2$, in its initial state, containing $A_0 + \delta N$ balls.

Rank and frequency distributions of Pólya urns
With the presented max-ent approach we now compute frequency distribution functions. Results for the frequency distributions for $a_i = 1$, $W = 100$, and $\delta = 2$ are shown in figure 3, together with a numerical simulation of the same process. The inset shows the rank distribution. The Pólya max-ent predicts both the frequency and the rank distribution extremely well.
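The empirical frequency and rank distributions against which the max-ent prediction is compared can be obtained directly from simulation; a sketch using the quoted parameter setting ($a_i = 1$, $W = 100$, $\delta = 2$; the sample size N is an arbitrary choice):

```python
import random
from collections import Counter

def polya_urn_histogram(a, delta, N, rng):
    """Histogram of N draws from a multi-state Polya urn, using the update
    rule p(i | k) = (a_i + delta * k_i) / (A_0 + delta * n)."""
    W = len(a)
    k = [0] * W
    for n in range(N):
        weights = [a[i] + delta * k[i] for i in range(W)]
        k[rng.choices(range(W), weights=weights)[0]] += 1
    return k

rng = random.Random(7)
W, N = 100, 10000
k = polya_urn_histogram([1] * W, delta=2, N=N, rng=rng)

# Rank distribution: color frequencies sorted in decreasing order.
rank_dist = sorted(k, reverse=True)

# Frequency distribution: number of colors that were drawn exactly n times.
freq_dist = Counter(k)
print(rank_dist[:5], freq_dist.most_common(3))
```

Averaging such histograms over many independent runs gives the simulation curves against which the max-ent frequency and rank distributions are compared.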
The above results were all derived under the assumption that $\gamma > 0$ is sufficiently large. By numerical simulation we find that the solution of equation (23) also works remarkably well for very small values of γ, if the value of γ in equation (23) is appropriately renormalized. In particular, for $\gamma = 0$ (the multinomial process) one samples the Poisson distribution function, equation (20), which the renormalized Pólya max-ent solution recovers extremely well. In this sense the Pólya max-ent remains adequate in the limit of small γ.

Discussion
Pólya urns offer a transparent way to study self-reinforcing systems with explicit path-dependence. They behave similarly to Bernoulli processes if the reinforcement is weak, i.e. if the number of balls initially contained in the urn is large in comparison to the number of balls added to the urn at each trial. This weak reinforcement regime has been studied in [28].
If reinforcement gets stronger, Pólya urns start to behave differently and the Pólya divergence derived in [28] no longer applies. Based on the microscopic rules of the process, we constructively derive the generalized information divergence or relative entropy $-\psi$ for strongly reinforcing Pólya urns. The functional ψ acts as the corresponding non-multinomial max-ent functional. This provides us with an alternative to the ensemble approach for path-dependent processes that enables us to predict the statistics of the process. The maximization of the functional leads to an equivalent of the classical maximum configuration approach, which by definition predicts the most likely distribution function. In this sense maximum configuration predictions are optimal, and can be used to understand even details of the statistics of path-dependent processes, such as their frequency and rank distributions. It is interesting to note that the functional playing the role of the entropy in the Pólya processes violates at least two of the four classic information-theoretic (Shannon-Khinchin) axioms which determine Shannon entropy [8]. Even more, for the finite-size Pólya entropy, three of the four axioms are violated. This indicates that the classes of generalized entropy functionals that are useful for a max-ent approach may be even larger than expected [30, 31]. One might speculate that in this sense the classic information-theoretic axioms are too restrictive when it comes to characterizing information flow and phase space structure in non-stationary, path-dependent processes. The observation that each particular class of non-multinomial processes requires a matching max-ent functional, which can in principle be constructed from the generative rules of a process, opens the applicability of max-ent approaches for a wide range of complex systems in a meaningful way. The generalized max-ent approach in this sense responds naturally to Einstein's comment on Boltzmann's principle.
Finally we note the implications for statistical inference with data from non-multinomial sources, which implicitly involves the estimation of the parameters θ that determine the process generating the data. In a max-ent approach this is done by fitting classes of curves to the data that are consistent with the max-ent approach. To do this, the nature of the process, i.e. its class, needs to be known. For path-dependent processes, which are non-multinomial by nature, the entropy will no longer be Shannon entropy H, and the information divergence will no longer be the Kullback-Leibler divergence.