Infinitely large, randomly wired sensors cannot predict their input unless they are close to deterministic

Building predictive sensors is of paramount importance in science. Can we make a randomly wired sensor “good enough” at predicting its input simply by making it larger? We show that infinitely large, randomly wired sensors are nonspecific for their input, and therefore nonpredictive of future input, unless they are close to deterministic. Nearly deterministic, randomly wired sensors can capture ∼ 10% of the predictive information of their inputs for “typical” environments.


Introduction
Prediction is thought to be fundamental to organism functioning, e.g. see Ref. [1] and references therein. By better predicting the world around them, organisms can better choose actions that will maximize their reaped reward. On the other hand, prediction of the future of a time series from its past drives much of science, e.g. the realization that one could predict future particle positions from some parameters (such as particle masses and charges) and current particle positions and velocities.
So how should one build predictive time series models? One common trick is to feed the data we wish to predict into a recurrent network, the state of which can contain enough memory of the past in order to be predictive of the future. Sometimes, these networks are trained so that the parameters defining the network's dynamics yield optimal predictions of the input time series, e.g. as in Ref. [2] and references therein. However, much utility has been gained from using randomly connected networks (or reservoirs) [3][4][5][6], where one merely trains the readout of the network. Such networks can have nearly maximal predictive power if the networks are large enough [7]. Such a finding might also imply that perhaps, evolution need not work so hard to build predictive networks (or sensors) in organisms; rather, randomly wiring large biological sensors could lead to sufficiently good predictive performance.
Here, we identify a key sensor property without which the sensor has little predictive power: determinism. Determinism means that the present sensor state and the present input state uniquely determine the future sensor state. To be clear, the sensors studied in Refs. [3,4,6] are deterministic as long as there is no added noise, and so the effect of nondeterminism on the predictive capabilities of recurrent networks/reservoirs is somewhat unstudied. Interestingly, we find that the detrimental effects of nondeterminism are compounded rather than mitigated by large sensor size when the sensor has recurrent connections.
We find numerical and analytical evidence that nondeterminism greatly limits the ability of a (recurrent) randomly wired sensor to be predictive of its inputs, due to the weak law of large numbers. For some nondeterministic, randomly wired sensors, there is a finite optimal sensor size at which predictive power is maximized. This optimal sensor size seems to balance the larger predictive capacity of larger sensors against a trend towards nonspecificity for input demanded by the weak law of large numbers.
When the sensor's connections are nearly deterministic, larger sensors are on average more predictive. Large, deterministic, randomly wired sensors can capture ∼ 10% of the total predictive information possible. This is comparable to the ∼ 20% of predictive information captured by sensors whose weights have been (locally) optimized via the BFGS algorithm.

Setup
First, we discuss the model of the environment, which is given by the output of a unifilar Hidden Markov model. Then, we discuss the model of the sensor, specifying notation for the dynamics of a conditionally Markovian discrete-valued sensor. Finally, we describe two metrics for sensor performance: memory and predictive information captured.
Throughout what follows, we characterize time series and the relations between them via entropy and mutual information. The entropy of a random variable $X$ with realizations $x$ and probability distribution $\Pr(X = x)$ is given by
$$H[X] = -\sum_{x} \Pr(X = x) \log \Pr(X = x),$$
while the mutual or shared information between a random variable $X$ and a random variable $Y$ with realizations $x$ and $y$, respectively, and joint probability distribution $\Pr(X = x, Y = y)$ is given by
$$I[X;Y] = \sum_{x,y} \Pr(X = x, Y = y) \log \frac{\Pr(X = x, Y = y)}{\Pr(X = x)\Pr(Y = y)}.$$
One can think of entropy as a measure of the uncertainty of a random variable. Maximally uncertain random variables have a uniform distribution over values, and this maximizes entropy; minimally uncertain random variables are singly supported, and this minimizes entropy. One can think of mutual information as a measure of the nonlinear dependency of two random variables. From the identity $I[X;Y] = H[X] - H[X|Y]$, mutual information is the reduction in uncertainty about a random variable $X$ that comes from knowing a potentially related random variable $Y$. Operational meanings of entropy and mutual information come from Shannon's source coding and noisy channel coding theorems [8,9].
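As a small illustration of these two quantities, the sketch below computes entropy and mutual information in nats for distributions given as plain Python lists. The function names `entropy` and `mutual_information` and the three toy distributions are our own illustrative choices, not part of the original analysis.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a probability vector p."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def mutual_information(joint):
    """Mutual information in nats of a joint distribution joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal over x
    py = [sum(col) for col in zip(*joint)]    # marginal over y
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0.0
    )

# Toy distributions (assumed for illustration only).
uniform = [0.5, 0.5]                          # maximally uncertain binary variable
independent = [[0.25, 0.25], [0.25, 0.25]]    # X and Y share no information
copied = [[0.5, 0.0], [0.0, 0.5]]             # Y is an exact copy of X

print(entropy(uniform))
print(mutual_information(independent))
print(mutual_information(copied))
```

In the copied case the mutual information equals the entropy of either variable, which matches the identity $I[X;Y] = H[X] - H[X|Y]$ with $H[X|Y] = 0$.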

Model of environment
To test the predictive capabilities of sensors, we wish to construct a non-Markovian environment that is at least somewhat predictable. The non-Markovianity of the environment will force a predictive sensor to remember longer and longer pasts, while the predictability of the input will guarantee that remembering the "right" things about environmental pasts leads to measurable predictive gains. Ideally, it would also be difficult to infer the right predictive features, so that the environments provide a challenge for sensors that desire to be predictive of their input. Environments generated by random minimal unifilar Hidden Markov models (or ε-machines [10]) as specified below turn out to satisfy all these requirements.
We consider randomly generated binary-alphabet environments, that is, environments generated by a randomly drawn unifilar Hidden Markov model. Here, $S_t$ is the random variable for the hidden state at time $t$, while $X_t$ is the random variable for the observed symbol at time $t$. An edge-emitting discrete-time Hidden Markov model is specified by a set of internal "hidden" states $\sigma \in \mathcal{S}$, a set of possible observables $x \in \mathcal{X}$, and a labeled transition dynamic $\Pr(S_{t+1} = \sigma', X_t = x \mid S_t = \sigma)$. For the purposes of this paper, $\mathcal{X} = \{0, 1\}$. When $\mathcal{S}$ is countable, we can represent the labeled transition dynamic by a set of labeled transition matrices $T^{(x)}$ with elements
$$T^{(x)}_{\sigma',\sigma} = \Pr(S_{t+1} = \sigma', X_t = x \mid S_t = \sigma).$$
A unifilar Hidden Markov model is one such that $\Pr(S_{t+1} \mid X_t = x, S_t = \sigma)$ is singly supported, i.e. one for which the current state and symbol emitted uniquely specify the next state. This is a version of determinism in that the next state is uniquely determined, but the next symbol is not uniquely determined.
To randomly generate environments, we randomly generate labeled transition matrices $T^{(x)}$ as follows. In each hidden state $\sigma$, we choose $\sum_{\sigma'} T^{(0)}_{\sigma',\sigma}$ (the probability of emitting a 0) uniformly at random from the unit interval; this specifies, for binary-alphabet processes, $\sum_{\sigma'} T^{(1)}_{\sigma',\sigma}$ (the probability of emitting a 1). We then randomly choose the state to which one transitions after seeing a 0 and the state to which one transitions after seeing a 1. The fact that there is only one state to which one transitions after seeing a 0 or a 1 implies unifilarity of the resulting Hidden Markov model. The states of a minimal unifilar Hidden Markov model are called causal states [10].
We wish to understand how predictable the output of these Hidden Markov models is and how much memory is required to predict it as well as possible. Let $\overleftarrow{X}_t$ stand for the past environmental inputs, $\ldots, X_{t-2}, X_{t-1}, X_t$. The memory required is characterized by the statistical complexity $C_\mu = H[S_t]$ [10], while the predictability is characterized by the total correlation rate $\rho_\mu = I[\overleftarrow{X}_t; X_{t+1}]$ [11], also known as a particular value of the predictive information [12,13]. As the underlying model grows larger, $\rho_\mu$ seems to tend in probability to ∼ 0.2 nats. These unifilar Hidden Markov models generate infinite-order Markov processes, that is, processes for which the next symbol depends to some extent on all previous symbols. However, a process that is technically infinite-order Markov can still be approximately Markovian [14]. A sensor which perfectly stores the present observed symbol and nothing else (a Markov model) would capture a predictive information of $I[X_t; X_{t+1}]$. A typical value of $I[X_t; X_{t+1}]$ for these environments is 0.002 nats when the environment is of size $|\mathcal{S}| = 30$, and so a typical value of $I[X_t; X_{t+1}]/\rho_\mu$ for these environments is 0.01. In other words, these environments tend to be strongly non-Markovian. Finally, the statistical complexity $C_\mu$ of these environments tends to be between 2 and 3 nats, indicating that these environments provide a predictive challenge for sensors.
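The generation procedure for random environments can be sketched in a few lines. The code below is a minimal illustration using only the Python standard library; the function names, the seed, and the lazy power iteration used to approximate the stationary state distribution are our own assumptions rather than the procedure actually used for the reported numbers. The entropy of the stationary distribution equals the statistical complexity only when the drawn model happens to be minimal and has a single recurrent component, which this sketch does not check.

```python
import math
import random

def random_unifilar_hmm(n_states, rng):
    """Return T[x][s_next][s] for a binary-alphabet unifilar HMM."""
    T = [[[0.0] * n_states for _ in range(n_states)] for _ in range(2)]
    for s in range(n_states):
        p0 = rng.random()                              # probability of emitting a 0 from state s
        T[0][rng.randrange(n_states)][s] = p0          # single successor after a 0
        T[1][rng.randrange(n_states)][s] = 1.0 - p0    # single successor after a 1
    return T

def stationary_distribution(T, n_iter=2000):
    """Approximate a stationary state distribution by lazy power iteration."""
    n = len(T[0])
    # P[j][i] = Pr(next state = j | current state = i), summed over emitted symbols
    P = [[T[0][j][i] + T[1][j][i] for i in range(n)] for j in range(n)]
    p = [1.0 / n] * n
    for _ in range(n_iter):
        step = [sum(P[j][i] * p[i] for i in range(n)) for j in range(n)]
        # Averaging with the previous iterate avoids oscillation for periodic chains.
        p = [0.5 * (a + b) for a, b in zip(p, step)]
    return p

rng = random.Random(0)                                 # arbitrary seed
T = random_unifilar_hmm(30, rng)
p_ss = stationary_distribution(T)
state_entropy = -sum(q * math.log(q) for q in p_ss if q > 0.0)
print(state_entropy)                                   # in nats; at most log(30)
```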

Model of sensor
Fix a realization of the environment, $x_1, x_2, \ldots$. Let $R_t$ be the random variable denoting the sensor's state at time $t$. We consider conditionally Markovian sensors with a finite number of states and with state space $\mathcal{R}$, whose probability evolves according to
$$\Pr(R_{t+1} = r') = \sum_{r} \Pr(R_{t+1} = r' \mid R_t = r, X_t = x_t) \Pr(R_t = r),$$
which we represent in matrix-vector notation as
$$\vec{p}_{t+1} = M^{(x_t)} \vec{p}_t,$$
where $M^{(x)}_{r',r} = \Pr(R_{t+1} = r' \mid R_t = r, X_t = x)$. Given its relation to the conditional probability distribution, $M^{(x_t)}$ is a transition matrix whose columns must sum to 1. See Fig 1, in which the arrows between sensor states indicate transition probabilities and in which the arrow from input to sensor states indicates a dependence of transition probabilities on input. The present sensor state depends on both the previous environmental symbol and the previous sensor state, and the recursion leads to the present sensor state depending on arbitrarily long environmental pasts. By construction, the sensor only understands the future of the environment through the past of the environment. In other words, the Markov relation $R_t \to \overleftarrow{X}_t \to X_{t+1}$ holds [9], meaning that
$$\Pr(X_{t+1} \mid \overleftarrow{X}_t, R_t) = \Pr(X_{t+1} \mid \overleftarrow{X}_t).$$

Sensor metrics: Memory and predictive information
We define memory $I_{\mathrm{mem}}$ as
$$I_{\mathrm{mem}} = I[R_t; S_t],$$
which is an achievable minimal coding cost of pasts (that is, the minimal amount of space needed to write down information about said pasts) to retain information about the future [15]. In terms of the steady-state probability distribution over input causal states and sensor states $p_{ss}(r, \sigma)$, we have
$$I_{\mathrm{mem}} = \sum_{r,\sigma} p_{ss}(r,\sigma) \log \frac{p_{ss}(r,\sigma)}{p_{ss}(r)\, p_{ss}(\sigma)},$$
where $p_{ss}(r) = \sum_{\sigma} p_{ss}(r, \sigma)$ and $p_{ss}(\sigma) = \sum_{r} p_{ss}(r, \sigma)$. The memory $I_{\mathrm{mem}}$ is 0 only when $p_{ss}(r, \sigma) = p_{ss}(r) p_{ss}(\sigma)$, that is, when the sensor is nonspecific for the predictive features of its input. Note, from a standard information theory identity,
$$I_{\mathrm{mem}} = H[S_t] - H[S_t \mid R_t] \le H[S_t] = C_\mu,$$
and so one's memory is upper-bounded by the statistical complexity, which is calculable from $p_{ss}(\sigma)$.
We define instantaneous predictive information (which we call predictive information for brevity) $I_{\mathrm{pred}}$ as
$$I_{\mathrm{pred}} = I[R_t; X_{t+1}].$$
This predictive information corresponds (under some assumptions) to the increase in expected log growth rate of an asexually reproducing population attainable with a given memory, and is always an upper bound on the increase in expected log growth rate attainable with a given memory [16]. In terms of the steady-state distribution $p_{ss}(r, x)$ over sensor states and future inputs, we have
$$I_{\mathrm{pred}} = \sum_{r,x} p_{ss}(r,x) \log \frac{p_{ss}(r,x)}{p_{ss}(r)\, p_{ss}(x)}.$$
Earlier, causality gave us $R_t \to \overleftarrow{X}_t \to X_{t+1}$; and as the hidden states of a unifilar Hidden Markov model are minimal sufficient statistics of prediction [10], we also have $\overleftarrow{X}_t \to S_t \to X_{t+1}$. Together, this gives $R_t \to S_t \to X_{t+1}$, and the Data Processing Inequality [9] therefore reveals
$$I_{\mathrm{pred}} \le I_{\mathrm{mem}}.$$
That is, the predictive information captured never exceeds one's memory. Another application of the Data Processing Inequality reveals
$$I_{\mathrm{pred}} \le I[\overleftarrow{X}_t; X_{t+1}] = \rho_\mu,$$
and so the total correlation rate is an achievable upper bound on the predictive information.
To achieve this upper bound, one needs the sensor states to uniquely determine all causal states $\sigma$. We also define, for reasons that become apparent later on,
$$I'_{\mathrm{mem}} = \sum_{r,\sigma} p_{ss}(r,\sigma) \log \frac{p_{ss}(r,\sigma)}{\frac{1}{N}\, p_{ss}(\sigma)},$$
where $N = |\mathcal{R}|$ is the number of sensor states. This is a measure of the deviation of the steady-state distribution over sensor states and input causal states, $p_{ss}(r, \sigma)$, from the distribution over sensor states and input causal states that would result were the sensor nonspecific for its input and were each sensor state equally likely. If the sensor is nonspecific for its input but the distribution over sensor states is nonuniform, then $I'_{\mathrm{mem}}$ will be nonzero. As discussed in Section B in S1 File,
$$I_{\mathrm{mem}} \le I'_{\mathrm{mem}}.$$
As a result, if we can argue that $I'_{\mathrm{mem}}$ tends to 0, we will also have argued that $I_{\mathrm{mem}}$ and thus $I_{\mathrm{pred}}$ tend to 0.
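The relation between the two memory measures can be checked in a few lines. The derivation below is a sketch that assumes the primed memory is the Kullback-Leibler divergence between $p_{ss}(r,\sigma)$ and the reference distribution $p_{ss}(\sigma)/N$ with $N$ the number of sensor states, which is how the text describes it in words; the treatment in Section B in S1 File may be organized differently.

```latex
\begin{align}
I'_{\mathrm{mem}}
  &= \sum_{r,\sigma} p_{ss}(r,\sigma) \log \frac{p_{ss}(r,\sigma)}{\tfrac{1}{N}\, p_{ss}(\sigma)} \\
  &= \sum_{r,\sigma} p_{ss}(r,\sigma) \log \frac{p_{ss}(r,\sigma)}{p_{ss}(r)\, p_{ss}(\sigma)}
   + \sum_{r} p_{ss}(r) \log \frac{p_{ss}(r)}{1/N} \\
  &= I_{\mathrm{mem}} + \log N - H[R_t] \;\ge\; I_{\mathrm{mem}} .
\end{align}
```

The last step uses $H[R_t] \le \log N$, with equality only when the sensor states are equally likely. Under this assumed definition, the gap between the two measures is exactly the shortfall of the sensor-state entropy from its maximum.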

Methods
We wish to calculate the aforementioned sensor metrics. One could simulate sequences of $\sigma_t$, $r_t$, and $x_{t+1}$ by using $T^{(x)}$ and $M^{(x)}$ to randomly choose future sensor and environmental states, and then count the number of occurrences of a particular combination of $\sigma_t$, $r_t$, $x_{t+1}$. In the case of a large sensor state space, one would ideally use an entropy estimator such as the NSB entropy estimator [17] to calculate the predictive information, so as to avoid prohibitively long simulations.
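However, we pursue a different approach that leads to an easier calculation of sensor metrics and to plausibility arguments that underscore the generality of our results. To calculate $I_{\mathrm{mem}}$, we wish to find $p(r_t, \sigma_t)$. To calculate instantaneous predictive information $I_{\mathrm{pred}}$, we wish to find $p(r_t, x_{t+1})$. We can get the latter probability distribution, $p(r_t, x_{t+1})$, from the former probability distribution, $p(r_t, \sigma_t)$, by exploiting the Markov chain $R_t \to S_t \to X_{t+1}$:
$$p(r_t, x_{t+1}) = \sum_{\sigma_t} p(x_{t+1} \mid \sigma_t)\, p(r_t, \sigma_t).$$
To find $p(r_t, \sigma_t)$, we can set up a Chapman-Kolmogorov equation:
$$p(r_{t+1}, \sigma_{t+1}) = \sum_{r_t, \sigma_t} \left( \sum_{x_t} M^{(x_t)}_{r_{t+1},r_t} T^{(x_t)}_{\sigma_{t+1},\sigma_t} \right) p(r_t, \sigma_t). \tag{20}$$
Eq 20 defines a vector with $|\mathcal{R}||\mathcal{S}|$ elements and a corresponding transition matrix. The normalized eigenvector of eigenvalue 1 corresponds to the desired $p_{ss}(r, \sigma)$, and then one can calculate the instantaneous predictive information directly from $p(r_t, x_{t+1})$ via
$$I_{\mathrm{pred}} = \sum_{r_t, x_{t+1}} p(r_t, x_{t+1}) \log \frac{p(r_t, x_{t+1})}{p(r_t)\, p(x_{t+1})}.$$
This serves as an alternative to calculating $p_{ss}(r, \sigma)$, and thus $I_{\mathrm{mem}}$ and $I_{\mathrm{pred}}$, through simulation.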
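The eigenvector approach can be illustrated on a toy problem. The sketch below is not the code used for the reported results: it approximates the eigenvector of eigenvalue 1 by lazy power iteration instead of a dense eigensolver, it uses a hand-built two-state environment (the Golden Mean process, chosen by us for illustration), and all function and variable names are hypothetical. The index convention, with the sensor state paired against the causal state that emits the next symbol, is also our assumption.

```python
import math

def mutual_information(joint):
    """Mutual information in nats of a joint distribution joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for a, row in enumerate(joint) for b, p in enumerate(row) if p > 0.0)

def joint_stationary(M, T, n_iter=500):
    """Approximate steady-state p[r][s] of the joint sensor-environment chain."""
    n_x, n_r, n_s = len(T), len(M[0]), len(T[0])
    # K[r2][s2][r][s] = sum_x M[x][r2][r] * T[x][s2][s], the joint transition matrix
    K = [[[[sum(M[x][r2][r] * T[x][s2][s] for x in range(n_x))
            for s in range(n_s)] for r in range(n_r)]
          for s2 in range(n_s)] for r2 in range(n_r)]
    p = [[1.0 / (n_r * n_s)] * n_s for _ in range(n_r)]
    for _ in range(n_iter):
        step = [[sum(K[r2][s2][r][s] * p[r][s] for r in range(n_r) for s in range(n_s))
                 for s2 in range(n_s)] for r2 in range(n_r)]
        # Averaging with the previous iterate keeps periodic chains from oscillating.
        p = [[0.5 * (a + b) for a, b in zip(row, new)] for row, new in zip(p, step)]
    return p

def sensor_metrics(M, T):
    """Return (memory, predictive information) in nats."""
    p = joint_stationary(M, T)
    n_x, n_s = len(T), len(T[0])
    # emit[x][s] = probability that causal state s emits symbol x next
    emit = [[sum(T[x][s2][s] for s2 in range(n_s)) for s in range(n_s)] for x in range(n_x)]
    p_rx = [[sum(emit[x][s] * row[s] for s in range(n_s)) for x in range(n_x)] for row in p]
    return mutual_information(p), mutual_information(p_rx)

# Golden Mean environment: state 0 emits 0 or 1 with probability 1/2; state 1 always emits 1.
T = [[[0.0, 0.0], [0.5, 0.0]],      # T[0][s_next][s]
     [[0.5, 1.0], [0.0, 0.0]]]      # T[1][s_next][s]
# Deterministic sensor that mirrors the environment's state transitions.
M_copy = [[[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]]
# Fully random sensor whose transitions ignore the input.
M_uniform = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]

i_mem_copy, i_pred_copy = sensor_metrics(M_copy, T)
i_mem_uniform, i_pred_uniform = sensor_metrics(M_uniform, T)
print(i_mem_copy, i_pred_copy)
print(i_mem_uniform, i_pred_uniform)
```

In this toy case the mirroring sensor reaches the upper bounds on both metrics, while the input-independent sensor captures nothing. For large state spaces a sparse eigensolver would be the more practical choice.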

Results
There are two parameters that one can vary when designing our randomly wired sensors: sensor size, given by the number of sensor states $N$; and the method by which connections between sensor states are randomly generated. Using a sensor of size $N$ corresponds to clustering pasts into $N$ clusters: an input past $\overleftarrow{x}$ leads to sensor state $r$ with probability $p(r \mid \overleftarrow{x})$. When the sensor is not deterministic, these clusters are soft clusters; that is, pasts are assigned probabilistically to clusters. When sensor connections are closer to deterministic, these clusters "harden". From Section A in S1 File, we then expect the memory and predictive information captured by sensors to increase (but saturate) with increasing sensor size and increasing determinism.
In other words, specification of a randomly wired sensor corresponds to a random soft clustering of pasts into predictive features, though the mapping from the sensor specification to the clustering of pasts is nontrivial. Despite this indirection, specification of a sensor has one important advantage over specification of a probabilistic clustering of pasts: it is a finite description of a potentially complicated clustering. More concretely, it is more economical to specify the optimal predictor (the ε-machine [10]) by input-dependent state transitions, $M^{(x)}$, instead of the conditional probability distribution of hidden states given input histories, $p(r \mid \overleftarrow{x})$.
First, we consider fully nondeterministic, randomly wired sensors for which the columns of $M^{(x)}$ are independent and identically distributed (i.i.d.) draws from a Dirichlet distribution with concentration parameter $\vec{\alpha} = \alpha \vec{1}$. We start with a strange numerical fact. As such sensors grow in size, when $\alpha$ is sufficiently large, both memory and predictive information tend to decrease. See Fig 2. In fact, as $N$ grows, $p(r \mid \sigma)$ appears to tend to $1/N$; that is, the sensor appears to become nonspecific for its input. In other words, the capacity of a sensor to remember its input increases as the sensor increases in size, but this capacity is not at all used by fully connected, randomly wired sensors.
We give a plausibility argument for this nonspecificity, which comes down to a statement about the eigenvector of eigenvalue 1 of the transition matrix between pairs of causal states and sensor states, $(\sigma, r)$, with elements $\sum_{x_t} M^{(x_t)}_{r_{t+1},r_t} T^{(x_t)}_{\sigma_{t+1},\sigma_t}$ as described in Methods. This argument is generalized and expanded upon in Section B in S1 File. There is a unique eigenvector of eigenvalue 1 for these transition matrices by the Perron-Frobenius theorem. We will argue that this eigenvector is given in the large $N$ limit by $p(\sigma, r) = \frac{1}{N} p_{ss}(\sigma)$, where $p_{ss}(\sigma) = \mathrm{eig}_1\left(\sum_x T^{(x)}\right)_\sigma$. If $p(\sigma, r) = \frac{1}{N} p_{ss}(\sigma)$, then
$$\frac{1}{N} p_{ss}(\sigma_{t+1}) = \frac{1}{N} \sum_{x_t} \left( \sum_{r_t} M^{(x_t)}_{r_{t+1},r_t} \right) \left( \sum_{\sigma_t} T^{(x_t)}_{\sigma_{t+1},\sigma_t}\, p_{ss}(\sigma_t) \right).$$
To understand whether or not this is plausible, we focus on simplifying the right-hand side of this equation. Recall that a Dirichlet distribution in which the concentration parameter vector takes the form $\vec{\alpha} = \alpha \vec{1}$ can be generated by drawing realizations of independent and identically distributed (i.i.d.) Gamma random variables and normalizing them by their sum. Hence, we have
$$M^{(x_t)}_{r',r_t} = \frac{Y^{(x_t)}_{r',r_t}}{\sum_{r''} Y^{(x_t)}_{r'',r_t}},$$
where the $Y^{(x_t)}_{r',r_t}$ have been drawn i.i.d. from a distribution with probability density function $\frac{1}{\Gamma(\alpha)} y^{\alpha - 1} e^{-y}$. Then, the sum $\sum_{r_t} M^{(x_t)}_{r_{t+1},r_t}$ is the ratio of two roughly independent Gamma-distributed random variables, both with shape parameter $N\alpha$ and scale 1. The ratio of two such Gamma-distributed random variables is highly peaked at 1 for large $N\alpha$. Thus $\sum_{r_t} M^{(x_t)}_{r_{t+1},r_t}$ tends to 1, which then implies that
$$\frac{1}{N} \sum_{x_t} \left( \sum_{r_t} M^{(x_t)}_{r_{t+1},r_t} \right) \left( \sum_{\sigma_t} T^{(x_t)}_{\sigma_{t+1},\sigma_t}\, p_{ss}(\sigma_t) \right) \to \frac{1}{N} \sum_{x_t} \sum_{\sigma_t} T^{(x_t)}_{\sigma_{t+1},\sigma_t}\, p_{ss}(\sigma_t) = \frac{1}{N} p_{ss}(\sigma_{t+1}),$$
as desired, using $\sum_{x_t} \sum_{\sigma_t} T^{(x_t)}_{\sigma_{t+1},\sigma_t}\, p_{ss}(\sigma_t) = p_{ss}(\sigma_{t+1})$. In other words, when $N\alpha \gg 1$, $p(r \mid \sigma) = \frac{1}{N}$ appears to be a reasonable guess for the steady state distribution $p_{ss}(r \mid \sigma)$. This, in turn, suggests that $I_{\mathrm{mem}}$ tends to 0 with probability 1, which would give $I_{\mathrm{pred}} \to 0$ from $0 \le I_{\mathrm{pred}} \le I_{\mathrm{mem}}$. An additional argument given in Section B in S1 File suggests that $I'_{\mathrm{mem}}$ is $O\left(\frac{1}{N\alpha}\right)$, though we emphasize that this is a plausibility argument rather than a sketch of a proof.

Fig 2. Large, fully nondeterministic, randomly wired sensors are nonspecific for their inputs. As described in the main text, the concentration parameter $\alpha$ controls the distribution from which transition probabilities in the sensor are drawn. We show $\langle I_{\mathrm{mem}} \rangle$ (blue) and $\langle I_{\mathrm{pred}} \rangle$ (red) for three values of $\alpha$ (1, 3, and 10) as a function of $\alpha N$. Both information quantities decrease at varying rates with $N$ and $\alpha$: roughly $0.1/(\alpha N)$ for $\langle I_{\mathrm{mem}} \rangle$, and roughly $0.01/(\alpha N)^2$ for $\langle I_{\mathrm{pred}} \rangle$. Corresponding lines in black dashes are drawn to guide the eye. The environment has total correlation rate $\rho_\mu = 0.198$ nats and statistical complexity $C_\mu = 2.205$ nats, so the total memory and predictive information captured by the sensor is maximally three orders of magnitude smaller than the total possible memory and predictive information captured. https://doi.org/10.1371/journal.pone.0202333.g002
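The concentration step in the plausibility argument is easy to probe numerically. The following sketch draws one random transition matrix with Dirichlet columns (built from normalized Gamma draws, as in the argument) and measures how tightly the sums over the current sensor state cluster around 1. The sensor sizes, the value of the concentration parameter, the seed, and the function names are arbitrary choices of ours for illustration.

```python
import random
import statistics

def dirichlet_column(n, alpha, rng):
    """One symmetric Dirichlet sample via normalized Gamma(alpha, 1) draws."""
    y = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(y)
    return [v / total for v in y]

def row_sum_spread(n, alpha, rng):
    """Standard deviation over r_next of sum_r M[r_next][r] for one random M."""
    columns = [dirichlet_column(n, alpha, rng) for _ in range(n)]   # columns[r][r_next]
    row_sums = [sum(col[r_next] for col in columns) for r_next in range(n)]
    return statistics.pstdev(row_sums)

rng = random.Random(1)                                              # arbitrary seed
spreads = {n: row_sum_spread(n, 3.0, rng) for n in (10, 100, 400)}
for n, spread in spreads.items():
    print(n, spread)
```

The mean of these sums is exactly 1 by construction, so the spread alone indicates how good the approximation is. A heuristic variance calculation suggests the spread shrinks roughly like one over the square root of the product of sensor size and concentration parameter, consistent with the argument above.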
In other words, infinitely large fully-connected randomly-wired sensors are nonspecific for their inputs, no matter the input. This is true even when the concentration parameter $\vec{\alpha}$ defining the sensor stochasticity is input-dependent and state-dependent, as an extension of the plausibility argument given above holds, as detailed in Section B in S1 File. For numerical evidence, see Fig 3. More generally, we find that the memory and predictive information captured by random sensors is governed by two trends. Both trends can be described in terms of their effect on the clusters $p(r \mid \overleftarrow{x})$ that describe how specifically one can determine an input past $\overleftarrow{x}$ from a sensor state $r$ and vice versa. According to the first trend, information captured tends to increase with the number of clusters, as detailed by a null model in Section A in S1 File. According to the second trend, the exponential explosion in possible paths between sensor states given any particular input past $\overleftarrow{x}$ yields increasing nonspecificity. For small enough $\alpha$, as shown in Fig 4, the behavior of memory and predictive information captured by randomly wired sensors appears to be a balance of these two trends. However, the latter trend for fully nondeterministic sensors always wins, and so infinitely large, fully nondeterministic, randomly wired sensors are completely nonspecific for their inputs even though the sensor dynamics are input-dependent.
How might a large system with unavoidable randomness in its connections avoid the seemingly inevitable march towards nonspecificity for its inputs? After all, large sensors have, in principle, a large capacity to predict, which is harnessed in Refs. [3,4]. Is there no way in which randomness in wiring can be constrained so that this capacity can be harvested?
A clue is provided by consideration of the minimal optimal predictive sensor of the inputs [10], the ε-machine, of size $N = |\mathcal{R}| = |\mathcal{S}|$. This optimal sensor is constructed so that each $r$ corresponds to a different causal state $\sigma$, with transitions $M^{(x)}_{r',r} = \Pr(S_{t+1} = r' \mid X_t = x, S_t = r)$. (When $\Pr(X_t = x \mid S_t = r)$ is zero, a particular input word is forbidden, and any $M^{(x)}_{r',r}$ can be chosen.) Note that for this optimally predictive sensor, given a particular input $x$, a particular sensor state $r$ can only transition to one other sensor state $r'$. Minimal optimal predictive sensors, therefore, have a great deal of structure: they are deterministic. Many transitions between sensor states are forbidden.
Indeed, sensors optimized so as to maximize predictive information using the L-BFGS algorithm are also nearly deterministic. These (locally) optimal sensors tend to make Markov models of the input, capturing 0.04 nats or 20% of the total predictive information capturable with $|\mathcal{R}| = 10$ states.
Perhaps unsurprisingly, then, large randomly wired sensors turn out to be specific for their inputs when the sensor is very close to deterministic. Then, $\sum_{r_t} M^{(x_t)}_{r_{t+1},r_t}$ will not be highly concentrated about 1, and the plausibility arguments given above for nonspecificity will fail. To illustrate this, we consider the case of a sensor in which $M^{(x)}_{r,r'}$ is, for each initial sensor state $r'$, nonzero only for $k$ randomly drawn sensor states $r$; and in which the probability distribution over the $k$ nonzero $M^{(x)}_{r,r'}$ is Dirichlet with concentration parameter $\alpha$. We see from Fig 5 that the greater the determinism (i.e. the smaller the $k$), the higher the average predictive information; and that, no matter the $k$, the predictive information captured actually increases with the size of the sensor. Large, nearly deterministic, randomly wired sensors have variable values of predictive information, unlike the large, fully nondeterministic, randomly wired sensors, as discussed in Section C in S1 File.
The average predictive information captured appears to saturate with increasing sensor size, and as such, the quality of the sensor grows with $N$ but is fundamentally limited by $k$. The same trends hold for average memory $\langle I_{\mathrm{mem}} \rangle$, shown in Fig 6. Note, however, that if $k$ grows with $N$ such that $\lim_{N \to \infty} k$ is infinite, then the plausibility arguments given before will apply and the infinitely large, randomly wired sensor will be nonspecific for its input. Hence, the sparsity of sensor connections (the number of connections divided by the number of possible connections) must tend to 0 as the sensor size increases if an infinitely large randomly-wired sensor is to be at all specific for its input.

Fig 3. Large, nondeterministic, randomly wired sensors are nonspecific for their inputs even when the stochasticity is input- and state-dependent. As described in the main text, the concentration parameter $\vec{\alpha}$ describes the random wiring of the sensors. Here, $\alpha(x, r)$ is drawn uniformly at random in the interval [0, 10] for each $x$. Both average memory and average predictive information decrease with sensor size $N$. The environment again has $|\mathcal{S}| = 30$ but with $\rho_\mu = 0.19$ nats and $C_\mu = 3.1$ nats. The 50% confidence intervals in memory and predictive information decrease with increasing sensor size $N$, implying that the larger sensors are increasingly likely to be nonspecific for their input. https://doi.org/10.1371/journal.pone.0202333.g003
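The nearly deterministic ensemble described above can be generated as follows. This is a minimal sketch under our own assumptions: the function name, the parameter values, and the choice to draw the $k$ successor states without replacement are illustrative, and the source does not specify these implementation details.

```python
import random

def sparse_random_sensor(n_states, k, alpha, n_symbols, rng):
    """Return M[x][r_next][r] with exactly k nonzero entries in each column."""
    M = [[[0.0] * n_states for _ in range(n_states)] for _ in range(n_symbols)]
    for x in range(n_symbols):
        for r in range(n_states):
            targets = rng.sample(range(n_states), k)                 # k distinct successor states
            weights = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
            total = sum(weights)
            for r_next, w in zip(targets, weights):
                M[x][r_next][r] = w / total                          # Dirichlet weights over the k targets
    return M

rng = random.Random(2)                                               # arbitrary seed
N, k = 50, 2
M = sparse_random_sensor(N, k, alpha=3.0, n_symbols=2, rng=rng)
density = k / N    # fraction of possible connections that are present
print(density)
```

Setting `k = 1` yields a fully deterministic sensor, since the single nonzero entry in each column is then exactly 1. Holding `k` fixed while `N` grows drives `density` to 0, which is the regime in which the text argues that large sensors can remain specific for their input.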

Conclusion
Evolution and scientists have a difficult task: that of creating predictive sensors of time series input that may contain long-range correlations. One can approach this problem by designing learning rules that adjust sensor connection weights so as to optimize sensor predictive capabilities. Nondeterministic sensors can result. Alternatively, one can approach this problem by randomly wiring a sensor and optimizing the sensor readout. We have shown that success of the latter approach requires determinism or something close to it, in that given a present sensor state and present input, one should be able to transition to only a few sensor states in the next time step.
One way to conceptualize the sensor's predictive computation is to view each sensor state as a cluster of pasts; that is, given a particular input past $\overleftarrow{x}$, there is a probability $p(r \mid \overleftarrow{x})$ of ending up in sensor state $r$, determined in some nonlinear fashion by the dynamics of the input and the dynamics of the sensor. When a sensor is nondeterministic, these clusters are soft clusters, meaning that a number of sensor states $r$ are possible given a particular input past $\overleftarrow{x}$. More sensor states correspond to more clusters, which, as shown in Section A in S1 File, tends to increase memory and predictive information captured. However, an increase in the number of sensor states provides one with exponentially more pathways from one sensor state to the next for a particular input past, which on average have equivalent total weights for each input past. These two competing trends can lead to a nonmonotonic dependence of memory and predictive information on sensor size. At large enough sensor sizes, the latter trend always dominates. Hence, fully nondeterministic, randomly wired sensors become increasingly nonspecific for their input as they grow. To break the trend towards nonspecificity, one needs to restrict the number of possible paths between sensor states with determinism.

Fig 5. Nearly deterministic, randomly wired sensors capture more predictive information than nondeterministic randomly wired sensors. Nearly deterministic sensors are randomly generated as described in the main text with $\alpha = 3$, $k$ as given in the legend, and $N$ as given by the x-axis. 100 sensors are generated at each possible sensor size $N = |\mathcal{R}|$, and their predictive informations $I_{\mathrm{pred}}$ are averaged to give $\langle I_{\mathrm{pred}} \rangle$. 50% confidence intervals in $\langle I_{\mathrm{pred}} \rangle$ are given by the error bars. The environment has $|\mathcal{S}| = 30$, $\rho_\mu = 0.23$ nats, and $C_\mu = 2.3$ nats, and so these randomly-wired sensors are still only capturing ∼ 10% of the total predictive information possible. https://doi.org/10.1371/journal.pone.0202333.g005

Fig 6. Nearly deterministic, randomly wired sensors have more memory than nondeterministic randomly wired sensors. Nearly deterministic sensors are randomly generated as described in the main text with $\alpha = 3$, $k$ as given in the legend, and $N$ as given by the x-axis. 100 sensors are generated at each possible sensor size $N = |\mathcal{R}|$, and their memories $I_{\mathrm{mem}}$ are averaged to give $\langle I_{\mathrm{mem}} \rangle$. 50% confidence intervals in $\langle I_{\mathrm{mem}} \rangle$ are given by the error bars. The environment has $|\mathcal{S}| = 30$, $\rho_\mu = 0.23$ nats, and $C_\mu = 2.3$ nats, and so these randomly-wired sensors are only capturing ∼ 25% of the total memory possible. https://doi.org/10.1371/journal.pone.0202333.g006

One may well ask what we have learned about optimal sensors from the model presented here. After all, it is well-known in signal estimation problems that sensor stochasticity is detrimental. The interesting twist in our story is that when discussing recurrent sensors, the detrimental effects of sensor stochasticity are compounded rather than mitigated by increasing sensor size. Feedforward sensors correspond to the null model studied in Section A in S1 File, while recurrent sensors correspond to the eigenvector analysis of the main text.
In essence, we have asked what kinds of random sensor ensembles yield greater predictive power. This is certainly not the first time that such a question has been asked, e.g. Refs. [7,18,19]. Previously studied reservoirs were all deterministic, or rather, the variable that evolved in a conditionally Markovian manner (e.g., membrane voltages as opposed to neural activities) in the reservoir computing applications evolved deterministically. As such, our report on the extremely detrimental effects of nondeterminism on recurrent sensors is new, if perhaps unsurprising. When sensors are deterministic, one can study the effect of the spectral radius of the weight matrix used to evolve the sensor state, the sparsity of the aforementioned weight matrix, the function applied to the weight matrix multiplied by the sensor state, and sensor size (as studied here), among other things.
True determinism in physical systems is impossible [20,21]. Sometimes, noise can be beneficial to the functioning of biological systems [16,20,22,23], but our work here suggests that this noise must be tightly controlled when one wants to remember or predict input. For instance, in a chemical reaction network, too many different possible reactions for a given environmental forcing (nondeterminism) will lead to a nonspecific response, e.g. sparsity in random reaction networks was key to the results of Ref. [24]. Our results suggest that given a particular inherent sensor stochasticity, there is an optimal finite sensor size at which functionality (prediction) is maximized. The size of sensors in biological systems, then, might not always be governed by resource constraints [25][26][27] but instead governed by degradation of functionality due to unavoidable noise.

Supporting information
S1 File. Section A: Random soft clusters in the information bottleneck method. Section B: Plausibility argument for nonspecificity of large randomly-wired sensors. Section C: Fluctuations in memory and prediction. (PDF)
On the x-axis is $|\mathcal{R}|$, or $N$, and on the y-axis is the interquartile range (IQR) of $I_{\mathrm{pred}}$. The environment has $\rho_\mu = 0.147$ nats and $C_\mu = 2.36$ nats, but these results seemed to hold qualitatively regardless of particular environment. (TIFF)