Network Inference with Hidden Units

We derive learning rules for finding the connections between units in stochastic dynamical networks from the recorded history of a ``visible'' subset of the units. We consider two models. In both of them, the visible units are binary and stochastic. In one model the ``hidden'' units are continuous-valued, with sigmoidal activation functions, and in the other they are binary and stochastic like the visible ones. We derive exact learning rules for both cases. For the stochastic case, performing the exact calculation requires, in general, repeated summations over an number of configurations that grows exponentially with the size of the system and the data length, which is not feasible for large systems. We derive a mean field theory, based on a factorized ansatz for the distribution of hidden-unit states, which offers an attractive alternative for large systems. We present the results of some numerical calculations that illustrate key features of the two models and, for the stochastic case, the exact and approximate calculations.


Introduction
Recent interest in network identification problems has been motivated by the advent of multi-electrode neural recordings and other large-scale biological data [1,2,3,4].Current inference methods, however, do not take into account the effects of units in the networks that are not recorded, though they are almost always present.This problem can be serious: For example, in cortical neural data, almost all recorded cells are excitatory, though inhibitory cells are essential in the network dynamics.In this paper we extend previous methodology to include "hidden units", presenting algorithms for inferring the strengths of connections to, from and among them.
There is a long history of work of problems of this sort.Perhaps the best know is that on "Boltzmann machines" [5].These are symmetrically coupled networks of stochastic binary units.Their states are updated, one randomly chosen unit at a time, with the probability of being in a particular one of its two possible states given by a logistic sigmoid function of the net input from other units.Because of the symmetric coupling matrix, their dynamics satisfies detailed balance, so their equilibrium distributions are of Gibbs-Boltzmann form Z −1 exp(−E), where E is a quadratic form.This fact that simplifies their analysis considerably.The problem has also been studied in networks where the unit outputs are continuous sigmoidal functions of their inputs, for both continuous-time (asynchronous-update) and discrete-time (simultaneous-update) dynamics, extending the back-propagation algorithm used earlier for layered networks.
Applying either of these kinds of models to multineuron spike data is problematic.Real biological networks do not have symmetric connections, invalidating the first kind, while the nature of synaptic transmission and neuronal spiking calls for a stochastic binary representation, ruling out the second.In this paper we treat models in which the recorded neurons are stochastic and binary, and there is no symmetry requirement on the connections in the network.They obey a discrete-time kinetic Ising (Glauber) dynamics [6], and a value +1 represents an action potential.We study two kinds of models, in which the hidden units are deterministic or stochastic, respectively.We employ, for convenience, a discrete-time dynamics [7], though it should be straightforward to extend the treatment to continuous-time models.

Continuous, deterministic hidden units
We examine first the deterministic case, taking the output of a hidden unit to be a sigmoidal function of its input.Though it is a big simplification of a real spiking-neuron network, this kind of model can be practical for analyzing neural data.One cannot hope to model the detailed dynamics of all the unrecorded neurons in the network of interest, because they vastly outnumber the recorded ones.What one can hope to do, at least as a first approximation, is to describe the effect of unrecorded populations of neurons, for example, of inhibitory neurons when only excitatory neurons have been recorded.The values of the hidden units in our model here represent the firing rates of those populations.While this representation throws out many details, it enables one to capture some essential features of the dynamics, even using only one or a few hidden units.
We draw here on work in learning in analog neural networks a couple decades ago, under the names "back-propagation in time" and "recurrent back-propagation" [8,9,10,11].Our treatment differs from that work in having stochastic visible units and a likelihood-based objective function.

Model
We denote the states of the visible units by s i (t), where i labels the unit and t the time bin.They can take the values ±1.(We assume the recorded spikes have been sorted into time bins small enough that there is no more than one spike per bin.)We denote the hidden unit values by µ a (t), 1 ≤ µ a (t) ≤ 1.To make our equations a little more transparent, we use indices i, j, • • • for visible units and a, b, • • • for hidden ones.Our model is defined by the stochastic evolution rule with All s i (t + 1) are assumed independent, conditional on {s j (t)}, {µ b (t)}.The model is pictured in Fig. 1.We do not write constant bias terms in H or B here; they can be included by adding input units which are always +1.We will denote the number of visible units by N v and the number of hidden ones by N h .

Objective function and learning rules
We assume we are given the data {s i (t)} and that we know the number of hidden units.The task is to learn the connections {J ij }, {K ia }, {L aj }, and {M ab }, and our objective function is the log likelihood of the observed visible history: We consider the simplest form of gradient-based learning, where the parameters are adjusted proportional to the derivative of the log likelihood with respect to them.For {J ij } and {K ia }, this is straightforward: with ǫ k (t + 1) = s i (t) − tanh H i (t), the observed error on unit i at t + 1 under the model with the current parameters, given its state at t.This is standard error × input learning, as in networks without hidden units [8,3].
For the connections that lead to hidden units, the derivatives of H i (t) with respect to {L aj } and {M ab } are through its dependence on the µ b (t); as in Furthermore, the derivatives of the µ j (t) have terms proportional to derivatives of all the µs at the previous time step: These equations can be iterated starting from the initial condition ∂µ c (0)/∂L al = 0.The solution can be written relatively compactly: X ab (t) = (1 − µ 2 a (t))δ ab (11) and we make the convention that the product over r is equal to 1 when q = 0.The learning rule for L al can then be written as Exactly the same procedure for the derivative with respect to M ab gives (13) which differs from (12) only in the last factor.This all has a nice graphical interpretation.The effective error is the sum over all paths starting at future visible units (time t + 2 + q) and propagating back through the hidden units at intermediate times until it reaches the receiving unit a at time t + 1.For each such path, we pick up a factor ǫ i (t + q + 2) at the visible error source, a factor of a K ib for backpropagating from the source unit to a hidden unit b, factors of elements of M for the hiddento-hidden connections on the path, and factors of X cc = 1 − µ 2 c at every hidden unit c that it passes through.This is just the standard prescription for back-propagation of errors in layered networks.Fig. 2 shows a typical path for q = 2.

Numerical results
In this calculations reported in this paper we restrict ourselves to networks with no hidden-to-hidden connections (M = 0).This simplifies the learning algorithm considerably: There are no backpropagation paths longer than two steps.Fig. 3 shows an example of learning for a network with 18 visible and 2 hidden units, based on 10000 time steps of data.The top left panel shows how the cost function (the negative log-likelihood of the data) falls smoothly to a minimum.The top right panel shows the evolution of the errors in the couplings J ij , K ib and L aj under learning.The apparent poor performance can be understood by comparing the middle panels, which show the coupling matrix elements of the model that generated the data (left) and the inferred couplings (right), respectively.It is apparent that the input connection strengths L 2j to the second hidden unit (unit 20 in these Figure 2: Back-propagation of errors from the future through the hidden units.The example path here starts at a visible unit i where the output error ǫ i (t + 4) is measured.It is then propagated back in time, first to a hidden unit at time t + 3, then through another hidden unit at t + 2 and finally to the one at t + 1 which is the receiving unit on the connection being evaluated.It gives a change in that connection strength equal to the product of ǫ i (t + 4), all the connection strengths on the path, and factors of 1 − µ 2 b (t) for each hidden unit on the path.The total connection strength change is a sum over all such paths from all visible units in the future.plots) are negatives of each other in the two panels.The same is true of the outgoing connections K i2 from that unit, though it is hard to see in these graphs.These two inversions have no effect on the visible units, so the "true" model and the one with the flipped signs of L 2j and K i2 are equivalent: there is no way we can know from the visible data alone which one was the true model.The bottom left panel shows how, if we bias the initial random values of the couplings to have the right sign, results close to the true model are obtained.The equivalence of the two inferred models is apparent from the fact (bottom right panel) that the final values of the cost function are exactly the same.
In this example it was easy to see the relation between the inferred and true connections.However, in general there is a 2 N h × N h !-fold degeneracy (the signs of the connections to and from every hidden unit could be flipped, and the labels on the hidden units can be permuted arbitrarily.)Thus, for large N h , most likely one will infer one of the models equivalent to the true one, but not the true one itself.

Stochastic hidden units
The case where all units in the model, including the hidden ones, are stochastic is more difficult, but it is the more interesting one from a theoretical point of view.Denoting the hidden units by σ a (t), the dynamics are now given by We restrict our treatment to networks with weak dense random connections, The likelihood of the history of the full system is and the likelihood of the visible history is The distribution of the σ, conditional on the observed data, is This has the form of a Gibbs distribution and (Z s also depends on all the model parameters {J ij , K ib , L aj , M ab }, but to save some space we do not write that explicitly.)To show the nature of the interactions in the energy E s [σ], we write it out explicitly: The first term is just a constant (independent of the σs), the next two are like external fields acting on the σ a (t) from the visible data s i (t ± 1) one time step in the future and past, respectively, and the fourth term represents interactions between σs at successive time steps.The final two terms are interactions among all the σs at one time (but these terms do not couple σs at different times).Their non-polynomial form leads to important features in this problem that are not present in Boltzmann machines.

Exact learning algorithm
Just as for Boltzmann machines, we can derive an exact learning algorithm for the model parameters by gradient ascent on Z s , the log likelihood of the visible history.It can be written The averages • • • σ|s are over all hidden histories σ(t), weighted by the probability } that they produce the known visible history s.In each learning rule, the first term comes from differentiating the terms in the first two lines of ( 22) and the second from differentiating one of the log 2 cosh terms.
When there are no hidden-to-hidden connections M ab , P [σ|s] becomes a product of independent terms, one for each t.The averages over P [σ|s] in (23-26) then involve sums over 2 N h terms, where N h is the number of hidden units.For small networks, they can be computed exactly in a reasonable time.

Numerical results
In Fig. 4 we show, for a model with N h = N v = 10 (and, again, no hiddento-hidden connections), how the log-likelihood of the data converges to its asymptotic value as the number of steps in the data set is increased.All the couplings in this example were i.i.d. and normal with variance 0.1.In addition to the cost function −L evaluated on the training data, we also plot it evaluated on an independently-generated test data set.We also plot the values of the Akaike and Bayesian information criteria, based on the training cost function.The Akaike information criterion penalizes the estimated log likelihood (i.e., increases the cost) by the number of parameters N, and the Bayesian information criterion penalizes it by N log N.  [12]).Cyan: Bayesian information criterion (BIC [13]).Purple: T → ∞ limiting value.
For networks larger than ∼ 10, one has to resort to Monte Carlo to estimate the averages.When there are hidden-to-hidden connections, the number of states to sum over becomes 2 N h T , where T is the number of time steps in the data.In this case, exact calculations are never possible, even for just one hidden unit, and even Monte Carlo becomes impractical for moderate numbers of hidden units.

Mean field theory for stochastic hidden units
An attractive approximate alternative is mean field theory.It can be formulated variationally [14]: One seeks the best approximation to P [σ|s] that factorizes over the different σ a (t).Each such factor is parametrized by a single number: the probability that σ a (t) = +1.Equivalently (and conventionally), one can use the "magnetization", denoted µ a (t), which is the difference between the probabilities to be +1 and −1.The entire factorizable distribution is then parametrized by the set of magnetizations {µ a (t)}.The learning proceeds in an EM fashion [15,16], iterating the two steps: (1) For given coupling parameters, find the µ a (t) that maximize the factorized log Z s , and (2), for these µ a (t), improve the estimates of the coupling parameters as in rules (23-26) but with the averages computed under the factorized approximate P [σ|s].

Derivation of mean-field theory
Under the factorizability assumption, the likelihood of the visible data {s i (t)}, given σ a (t) = µ a (t), is where 28) is the entropy: the average log of the probability of magnetizations µ a (t).In (27) we indicate explicitly that A depends on the parameters {J ij , K ib , L aj , M ab } (through E s ).Thus, the EM learning procedure involves repeatedly maximizing over µ for fixed parameters (the "E-step") and taking uphill steps on A (equivalently, downhill steps on E s ) in parameter space for fixed µ (the "M-step").
The prescription for obtaining the average energy by the replacement σ a (t) → µ a (t) in E s is based on the independence of different σ a (t) under the factorized distribution.For example, if E s contains a term like σ a (t)σ b (t), then σ a (t)σ b (t) = σ a (t) σ b (t) = µ a (t)µ b (t).Thus, one might think that we get the E s [µ] to use in (27) by simply substituting µ for σ in (22).Then maximizing A[µ] would lead to the equations for the µ a (t).This equation has a nice interpretation: The first two terms are just the inputs from visible and hidden units, respectively, at the previous time step, and the last two terms are just the back-propagated errors from visible and hidden units one time step later.However appealing this equation looks, it is wrong.One has to be careful in the log 2 cosh terms in E s [σ].Expanding it in powers of the K ib , we get a second order term proportional to ab K ia K ib σ a σ b .The double sum includes terms with a = b, and for these terms we should make the replacement σ 2 a = 1, not σ 2 a = µ 2 a .(This situation does not arise for the usual Ising energy − i<j J ij s i s j , since the i = j term is explicitly excluded from the sum.)The same problem comes up in all higher-order terms in the expansion whenever there are repeated indices in the sums over hidden unit indices.This problem was noticed already by Saul et al [17], who tried to deal with it by introducing an extra set of variational parameters.Here, we make a treatment for a particular ensemble of models that is exact (within the factorization approximation) in the limit of large N h .In these models, all the couplings are zero-mean independent random numbers with variances proportional to 1/N.This makes the net inputs H i (t) and B a (t) Gaussian (for large N), with variances of order unity.
Writing the nth order term in the expansion of log 2 cosh H (we drop the visible unit index i temporarily here, for simplicity) as consider first the terms in every term of (30) where two of the indices are equal.There are n(n − 1)/2 such pairs, so the correction to this subset of the nth order terms is (31) because naive substitution of µ a for σ a would have given µ 2 a instead of 1.But what multiplies the sum on a here is just half the n − 2nd term in the expansion of the second derivative of log 2 cosh H, i.e., 1 − tanh 2 H.So we can sum all such terms over n, yielding a correction Thus, at this level of approximation, we should use an energy (now with the substitution σ → µ in the first term).This looks like the TAP term in the free energy for the usual Ising model, but with the opposite sign.
The same argument applies to the log 2 cosh B term in E s , which should be replaced by These corrections will lead to new terms in the MF equations for µ a (t) and in the learning rule for the K ia and M ab .Note also that, for the models we are considering, these correction terms are of order 1 (per visible or hidden unit, respectively) since they contain sums of N h terms and each term is of order 1/N h .We can also sum up terms with 2 pairs of indices equal, 3 pairs of indices, equal, etc.Consider first the terms where two pairs of indices are equal.In the nth order term α n (30), there are n!/[4!(n − 4)!] ways of picking the 4 indices and 3 ways to pair them.The correction is The sum over n of these terms is just Like (32), ( 35) is of order 1.
Extending this argument to the general term with j/2 pairs of coincident indices, in the nth order term, there are n!/[j!(n − j)!] ways to pick our the j indices, and the number of ways to pair them is (j−1)!! ≡ (j−1)(j−3) Thus, we get a correction Again, all these terms are all of order 1.
On the other hand, terms we have not considered, with 3 or more indices equal, are negligible in the mean-field limit N h → ∞. (Consider terms with p equal indices.They involve the sum a K p a , which is of order N 1−p/2 h and therefore negligible for p > 2 as N h → ∞. Now we can sum all the E j over j, exploiting the fact that (j − 1)!! is the jth moment of a zero-mean univariate normal distribution.The result of all these manipulations is simply the replacement where Dx means (2π) −1/2 e −x 2 /2 dx and Thus, the effect of all these corrections can be described in terms of an effective Gaussian noise.The same arguments apply to the log 2 cosh B term, with the final result that the effective energy can be written, exactly in the limit N h → ∞, as We note that this form could have been motivated heuristically: In ( 15) and ( 16), the σ b (t) are fluctuating variables of variance 1 − µ 2 b (t).Since K ib and M ab are assumed to be independent random variables, H i (t) and B a (t) are normally distributed with variances ∆ 2 i (t) and Γ 2 a (t) given by ( 39) and (41), respectively.

Learning algorithm
The resulting equations for the E-step are then They differ from the naive equations (29) in that the tanh terms in the second and fourth lines are averaged over the Gaussian noises and in the presence of the new terms on the third and fifth lines.The latter have the form of cavity field corrections [18]: The effect of µ a itself on the expected s i (t + 1) and µ b (t + 1) should not be counted in calculating the tanh H terms in the second and fourth lines.
For the M-step, the learning rules for J ij and L aj are differing from those we would find in the naive mean field theory only in the averaging of the tanh's over the Gaussian noises.The rules for K ib and M ab , have extra terms that come from the dependence of ∆ i (t) and Γ a (t) on K ia and M ab in (39) and (41), respectively.For small K ia and M ab (i.e., at the level of the corrections (33) and (34), the E-step equations reduce to and the learning rules are A few final remarks are in order.The reader might notice that the lowestorder corrections in (33), (34), and (47) resemble Thouless-Anderson-Palmer (TAP) corrections in spin glasses [19].However, there the TAP equations come from the first corrections to the factorized-distribution approximation, whereas ours here come from evaluating the average energy within that approximation.We expect that for our model here, as for spin glasses, to get an exact theory for large N h , TAP corrections analogous to theirs should also be included.We do not try to do that here, working entirely within the factorized-distribution ansatz.In problems like ours for networks without hidden units, this is sometimes called "naive mean field theory" [3].

Numerical results
We have carried out mean-field inference computations for some models with no hidden-to-hidden connections (M ab = 0), using the lowest-order meanfield equations (47-51).Fig. 5 shows how the mean square errors of J ij , K ib and L aj depend on the data set length T for two networks with 80 visible units.The left-hand panel shows the case where the number of hidden units N h = 80, and the right-had panel shows the case where N h = 20.For the smaller N h , all three mean square errors fall off like 1/T , as we would expect to find if we could do this calculation exactly.However, for the larger N h , while the errors on the visible-to-visible couplings also fall off with T in this way, the errors on the couplings to and from the hidden units are larger and fall off much more slowly.
We can get a little insight into this behavior by doing the mean-field calculations for small N h , where it is also possible to do the exact calculations, as described in Sect.2.3.Fig. 6 shows the results of both kinds of calculations for N v = N h = 5 and 8.In these cases we can see that for small T the meanfield and exact calculations nearly coincide.The T -dependence is in this region is qualitatively like that for the mean-field results at N v = N h = 80.However, at larger T , the mean-field errors all fall less rapidly.At the same time, the exact calculation gives errors on the Js which continue to fall off like 1/T , and those on the Ks and Ls also start to fall more rapidly at the largest T s studied.This behavior is consistent with the expectation that, as for models with no hidden units, all exact-method errors should fall off asymptotically like 1/T , while the mean-field errors should approach limits ∝ 1/N h [3].However, apparently one has to go to very large data sets (roughly T > 10 3 N h ) to see this.

Discussion
We have derived learning rules for two kinds of stochastic binary networks with hidden units.These networks differ from Boltzmann machines in that (1) the units in them are updated synchronously rather than asynchronously, and (2) the connection strengths are allowed to be asymmetric.Because of these differences, the usual kind of Gibbs equilibrium does not hold, and a new kind of treatment is required.
The first kind of network has deterministic, continuous-valued hidden units.The learning rules for it are very similar to those in the back-propagationin-time approach for recurrent networks where all the units are deterministic and continuous-valued.
Units in the second kind of network are binary and stochastic, like the visible units.Here the learning problem is harder, but we have showed that one can always put it into the form of an equilibrium statistical mechanical problem with a non-polynomial energy function.The learning rules involve averages over the Gibbs distribution for this problem.For small systems and in the absence of hidden-to-hidden couplings, the problem can be solved exactly numerically, but otherwise one must resort to Monte Carlo methods or other approximations.We explored in detail one such approximation: mean field theory.A careful analysis revealed that the naive way one might write this theory was wrong, but we were able to construct a version of mean field theory that was exact for weak, dense connectivity in the limit of a large number of hidden units (the analog of the Sherrington-Kirkpatrick model of spin glasses [20]).
We also performed some numerical calculations to illustrate and to begin to explore some of the features of the different kinds of networks and learning rules.A general feature is that when the number of hidden units is large (i.e., comparable to the number of visible units), the errors in determining the couplings to and from the hidden units are much larger than those on the couplings among the visible units.This is true for both kinds of networks and for both exact learning algorithms and mean field theory.This should not be surprising, since the information about the connections to and from hidden units is only available indirectly, through the statistics of the visible units.On the other hand, it is noteworthy that even a rather poor estimation of the connections to and from the hidden units does not spoil the good estimation of the couplings among the visible ones.
Another point worth mentioning is that for small data lengths mean field theory is as good as doing the full exact calculation, which would take prohibitively long for N h much bigger than 10 or so.For large N h the errors on connections to and from hidden units can be rather large and fall off very slowly with T , but the results on small systems seem to show that doing the exact calculation instead of mean field theory (even if this were feasible), would not help except at very large T .
We have only scratched the surface of this problem in our numerical calculations.It would be useful to know, for example, what the asymptotic errors on the Ks and Ls are for the mean-field algorithm in the limit of large data sets and at what T the approach to these values begins, as functions of N h and N v .We leave this and other questions to future work.The theory presented here provides a foundation for those investigations and, we hope, will point the way toward other questions that will be interesting to study.

Figure 1 :
Figure 1: Schematic picture of the model.(Color online) White squares represent visible units s i ; blue ones, hidden units µ a (or σ a when they are stochastic).Visible-visible connections J ij are black, hidden-to-visible ones K ib are red, visible-to-hidden ones L aj are blue, and hidden to hidden ones M ab are green.Rows represent time steps.

Figure 3 :
Figure 3: Learning example: network of 18 visible and 2 hidden units (no hidden-hidden connections).Top left: iterative minimization of the cost function −L.Top right: rms errors on J ij , K ib , and L aj as functions of the number of iterations of the learning algorithm when it is started at small random values of the couplings.Middle panels: true (left) and inferred coupling strengths.The hidden units are number 19 and number 20.Bottom panels: rms errors (left) and cost function (right) when the initial parameter values have the correct signs.