Asymptotic Capacity of a Random Channel

We consider discrete memoryless channels with input alphabet size $n$ and output alphabet size $m$, where $m=\lceil\gamma n\rceil$ for some constant $\gamma>0$. The channel transition matrix consists of entries that, before being normalized, are independent and identically distributed nonnegative random variables $V$ such that $E[(V \log V)^2]<\infty$. We prove that in the limit as $n\to \infty$ the capacity of such a channel converges to $\mathrm{Ent}(V) / E[V]$ almost surely and in $L^2$, where $\mathrm{Ent}(V):= E[V\log V]-E[V]\log E[V]$ denotes the entropy of $V$. We further show that the capacity of these random channels converges to this asymptotic value exponentially in $n$. Finally, we present an application in the context of Bayesian optimal experiment design.


INTRODUCTION
A discrete memoryless channel (DMC) comprises a finite input alphabet $\mathcal{X} = \{1, 2, \ldots, n\}$, a finite output alphabet $\mathcal{Y} = \{1, 2, \ldots, m\}$, and a conditional probability mass function expressing the probability of observing the output symbol $y$ given the input symbol $x$, denoted by $W_{x,y}$. Any DMC can be represented by a stochastic matrix $W = (W_{x,y})_{x\in\mathcal{X}, y\in\mathcal{Y}} \in [0,1]^{n\times m}$, whose rows are normalized, i.e., $\sum_{y\in\mathcal{Y}} W_{x,y} = 1$ for all $x\in\mathcal{X}$. In his seminal 1948 paper [1], Shannon proved that the channel capacity of a DMC $W$ is given by

$$C(W) = \max_{p\in\Delta_n} I(p, W), \tag{1}$$

where $I(p, W) := \sum_{x\in\mathcal{X}} p(x)\, D(W_{x,\cdot}\,\|\,(pW)(\cdot))$ denotes the mutual information and $\Delta_n := \{x\in\mathbb{R}^n \mid \sum_{i=1}^n x_i = 1,\ x_i \ge 0 \text{ for all } i\}$ the $n$-simplex. Here $W_{x,y} = P[Y=y\,|\,X=x]$ describes the channel law, $(pW)(\cdot)$ is the probability distribution of the channel output induced by $p$ and $W$, given by $(pW)(y) := \sum_{x\in\mathcal{X}} p(x) W_{x,y}$ for $y\in\mathcal{Y}$, and $D(\cdot\|\cdot)$ denotes the relative entropy, defined as $D(W_{x,\cdot}\,\|\,(pW)(\cdot)) := \sum_{y\in\mathcal{Y}} W_{x,y} \log\frac{W_{x,y}}{(pW)(y)}$.

In this paper we are interested in a particular class of DMCs, characterized by the property that the entries of their channel matrix are i.i.d. random variables before the rows are normalized. Two different scenarios are considered. First, we assume that each entry of the channel transition matrix, before being normalized, is a nonnegative i.i.d. random variable $V$ and that $m = \lceil\gamma n\rceil$ for some constant $\gamma > 0$. Using duality of convex optimization, we prove in Theorem 2.4 that as $n\to\infty$ the capacity of such a DMC converges to $\frac{\mu_2}{\mu_1} - \log\mu_1$ almost surely and in $L^2$, where $\mu_1 := E[V] > 0$ and $\mu_2 := E[V\log V]$. Second, we consider a more general setup under slightly different model assumptions, where each entry $V_{x,y}$ of the channel transition matrix, before being normalized, is independent and distributed on the nonnegative real line such that for all $x\in\mathcal{X}$ and all $y\in\mathcal{Y}$ we have $\mu_{1,n} := \frac{1}{m}\sum_{y\in\mathcal{Y}} E[V_{x,y}] = \frac{1}{n}\sum_{x\in\mathcal{X}} E[V_{x,y}]$ and $\mu_{2,n} := \frac{1}{m}\sum_{y\in\mathcal{Y}} E[V_{x,y}\log V_{x,y}]$. In Theorem 3.2 we show that the capacity of such a random DMC converges exponentially in $n$ to its asymptotic value $\lim_{n\to\infty}\big(\frac{\mu_{2,n}}{\mu_{1,n}} - \log\mu_{1,n}\big)$ in probability.
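To make these quantities concrete, the following minimal Python sketch (ours, not from the paper) evaluates $I(p,W)$ and approximates the maximization in (1) with the classical Blahut-Arimoto iteration; all function names are our own.

```python
# Sketch: mutual information I(p, W) in bits and a plain Blahut-Arimoto
# iteration approximating C(W) = max_p I(p, W) for a small DMC.
import numpy as np

def mutual_information(p, W):
    """I(p, W) = sum_x p(x) D(W_{x,.} || (pW)(.)); rows of W are pmfs."""
    q = np.maximum(p @ W, 1e-300)                     # output law (pW)(y)
    kl = (W * np.log2(np.where(W > 0, W, 1.0) / q)).sum(axis=1)
    return float(p @ kl)

def blahut_arimoto(W, iters=200):
    """Classical alternating maximization converging to the capacity."""
    n = W.shape[0]
    p = np.full(n, 1.0 / n)                           # start from uniform
    for _ in range(iters):
        q = np.maximum(p @ W, 1e-300)
        d = (W * np.log2(np.where(W > 0, W, 1.0) / q)).sum(axis=1)
        p = p * np.exp2(d)                            # reweight by 2^{D(W_x,. || q)}
        p /= p.sum()
    return mutual_information(p, W)

W = np.array([[0.9, 0.1], [0.2, 0.8]])                # toy binary channel
print(blahut_arimoto(W))                              # capacity in bits
```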
In the literature there exists a variety of extensively studied channel models described by random constructions, where one observes that, in the limit as the blocklength tends to infinity, the capacity converges to a deterministic value. This is sometimes viewed as a manifestation of diversity [2,3]. A common model studied in [2,3] is of the form $y = Hx + n$, where $x$ is an $n$-dimensional input vector and $y$ represents an $m$-dimensional output vector; $H$ is modeled as a random matrix (the simplest example being one with i.i.d. entries) and $n$ denotes additive noise. To the best of our knowledge, the random channel model considered in this article has not been addressed directly in the literature.
Notation. The logarithm with base 2 is denoted by $\log(\cdot)$ and the natural logarithm by $\ln(\cdot)$. We consider DMCs with an input alphabet $\mathcal{X} = \{1,2,\ldots,n\} =: [n]$ and an output alphabet $\mathcal{Y} = \{1,2,\ldots,m\} =: [m]$. The channel law is summarized in a stochastic matrix $W \in \mathcal{M}_{n,m}$, where $W_{i,j} := P[Y=j\,|\,X=i]$ and $\mathcal{M}_{n,m}$ denotes the set of all stochastic $n\times m$ matrices. The input and output probability mass functions are denoted by the vectors $p\in\Delta_n$ and $q\in\Delta_m$, where we define the standard $n$-simplex as $\Delta_n := \{x\in\mathbb{R}^n \mid \sum_{i=1}^n x_i = 1,\ x_i\ge 0 \text{ for all } i\}$. For a probability mass function $p\in\Delta_n$ we denote the Shannon entropy by $H(p) := -\sum_{i=1}^n p_i\log p_i$. It is convenient to introduce an additional variable $r\in\mathbb{R}^n$ for the conditional entropy of $Y$ given $X$, where $r_i := -\sum_{j=1}^m W_{i,j}\log W_{i,j}$. We denote the maximum (resp. minimum) of $a$ and $b$ by $a\vee b$ (resp. $a\wedge b$) and by $\lceil\cdot\rceil$ the ceiling function. Given a nonempty set $A\subset\mathbb{R}$, its Borel $\sigma$-algebra is denoted by $\mathcal{B}(A)$. The uniform distribution with support $A$ is denoted by $\mathcal{U}(A)$ and the exponential distribution with rate parameter $\lambda>0$ by $\mathcal{E}(\lambda)$. The Dirichlet distribution on the $n$-simplex with concentration parameter $\alpha\in\mathbb{R}^n_{\ge0}$ is denoted by $\mathrm{Dir}(\alpha_1,\ldots,\alpha_n)$ and the lognormal distribution with parameters $z\in\mathbb{R}$ and $\sigma>0$ by $\ln\mathcal{N}(z,\sigma)$. The Dirac delta distribution is denoted by $\delta(\cdot)$. By convention, when referring to sets or functions, measurable means Borel measurable. Let $U$ be a nonnegative real-valued integrable random variable. The entropy of $U$ is defined as $\mathrm{Ent}(U) := E[U\log U] - E[U]\log E[U]$.

Structure. In Section 2 the asymptotic capacity of random DMCs of the form explained above is determined. Section 3 proves the exponential rate of convergence for the capacity of such random DMCs. Section 4 contains a numerical simulation of a random DMC constructed using a uniform distribution for the channel entries before normalization. As a second numerical example, we simulate the capacity of a DMC whose rows are uniformly distributed over the $n$-simplex. An application of the asymptotic capacity in terms of optimal design of experiments is presented in Section 5.
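As a small numerical companion to the entropy functional defined in the Notation paragraph (a sketch of ours, not part of the paper), $\mathrm{Ent}(U)$ can be estimated from i.i.d. samples by plain Monte Carlo averages:

```python
# Sketch: Monte Carlo estimate of Ent(U) = E[U log U] - E[U] log E[U], in bits.
import numpy as np

def ent(u):
    u = np.asarray(u, dtype=float)
    m1 = u.mean()                                        # estimate of E[U]
    m2 = (u * np.log2(np.where(u > 0, u, 1.0))).mean()   # estimate of E[U log U]
    return m2 - m1 * np.log2(m1)

rng = np.random.default_rng(0)
print(ent(rng.exponential(1.0, 10**6)))  # ~0.61 = (1 - kappa)/ln 2 for E(1)
```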

ASYMPTOTIC CAPACITY
Consider a probability space $(\Omega, \mathcal{A}, P)$ and let $(V_{x,y})_{x\in[n], y\in[m]}$ be a sequence of i.i.d. nonnegative random variables on $\Omega$. We define the channel transition matrix $W^{(V,n)} := (W^{(V,n)}_{x,y})_{x\in[n], y\in[m]}$ by $W^{(V,n)}_{x,y} := V_{x,y}/\sum_{y'\in[m]} V_{x,y'}$, so that $\sum_{y\in[m]} W^{(V,n)}_{x,y} = 1$ for all $x\in[n]$. We impose the following assumption on the random variables $V_{x,y}$.

Assumption 2.1 (Moments). $\mu_1 := E[V_{x,y}] > 0$ and $E[(V_{x,y}\log V_{x,y})^2] < \infty$.
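A minimal sketch of this construction (ours; the function name and the exponential default are illustrative choices, and any nonnegative law satisfying Assumption 2.1 can be plugged in):

```python
# Sketch: the random channel W^{(V,n)} with m = ceil(gamma * n) outputs and
# i.i.d. nonnegative entries V_{x,y}, normalized row by row.
import numpy as np

def random_channel(n, gamma=1.0, rng=None):
    rng = rng or np.random.default_rng()
    m = int(np.ceil(gamma * n))
    V = rng.exponential(1.0, size=(n, m))    # any nonnegative i.i.d. law works
    return V / V.sum(axis=1, keepdims=True)  # rows now sum to one

W = random_channel(500, gamma=0.5)
assert np.allclose(W.sum(axis=1), 1.0)
```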
Note that Assumption 2.1 implies that $E[V_{x,y}^2] < \infty$. We first show that the capacity $C(W^{(V,n)})$ of such a (random) DMC, as well as the optimal input distribution, are random variables.

Lemma 2.2 (Measurability). For a channel constructed as explained above, the mapping $C : \mathcal{M}_{n,m} \to \mathbb{R}_{\ge0}$ given by $C(W^{(V,n)}) = \max_{p\in\Delta_n} I(p, W^{(V,n)})$ is measurable. Furthermore, the (set-valued) mapping $p^\star : \mathcal{M}_{n,m} \rightrightarrows \Delta_n$, $p^\star(W^{(V,n)}) = \arg\max_{p\in\Delta_n} I(p, W^{(V,n)})$, describing the optimal input distribution, is measurable.
Proof. Note that we have $C(W^{(V,n)}) = -\min_{p\in\mathbb{R}^n}\{-I(p, W^{(V,n)}) + \delta_{\Delta_n}(p)\}$, where $\delta_{\Delta_n}(\cdot)$ denotes the indicator function of the $n$-simplex. Since the mapping $p\mapsto I(p, W^{(V,n)})$ is concave and continuous for almost any $W^{(V,n)}$, $-I$ is a normal integrand [4, Proposition 14.39]. Then, as shown in [4, Example 14.32], $-I(p, W^{(V,n)}) + \delta_{\Delta_n}(p)$ is a normal integrand and as such the measurability of the mappings $W^{(V,n)}\mapsto C(W^{(V,n)})$ and $W^{(V,n)}\mapsto p^\star(W^{(V,n)})$ follows by [4, Theorem 14.37]; see [4, Definition 14.1] for a definition of measurability of a set-valued mapping.
The mapping $\Omega\ni\omega\mapsto W^{(V,n)}(\omega)\in\mathcal{M}_{n,m}$ constructing the channel clearly is measurable and therefore, invoking Lemma 2.2, the channel capacity $C(W^{(V,n)})$ is a function from $\Omega$ to $\mathbb{R}_{\ge0}$ that is $(\mathcal{A}, \mathcal{B}(\mathbb{R}_{\ge0}))$-measurable and hence a random variable. The following assumption provides a relation between the input and output alphabet sizes that is required for the main theorem.
Assumption 2.3 (Alphabet sizes). There is a positive constant $\gamma\in\mathbb{R}_{>0}$ such that the output alphabet size is given by $m = \lceil\gamma n\rceil$.

Theorem 2.4 (Asymptotic capacity). Under Assumptions 2.1 and 2.3, the capacity of the random DMC $W^{(V,n)}$ satisfies $\lim_{n\to\infty} C(W^{(V,n)}) = \frac{\mu_2}{\mu_1} - \log\mu_1$ almost surely and in $L^2$, where $\mu_1 = E[V_{x,y}]$ and $\mu_2 = E[V_{x,y}\log V_{x,y}]$.

Remark 2.6. In terms of the entropy functional introduced in the Notation paragraph, the asymptotic capacity in Theorem 2.4 can be expressed as $\mathrm{Ent}(V)/E[V] = \frac{\mu_2}{\mu_1} - \log\mu_1$.
Remark 2.7 (Properties of the asymptotic capacity). The asymptotic capacity described in Theorem 2.4

(i) is nonnegative by Jensen's inequality, since $\mathbb{R}_{\ge0}\ni x\mapsto x\log x\in\mathbb{R}$ is a convex function; indeed, Jensen gives $\mu_2 = E[V\log V] \ge E[V]\log E[V] = \mu_1\log\mu_1$, so that $\frac{\mu_2}{\mu_1} - \log\mu_1 \ge 0$.
(iv) admits the homogeneity property $\lim_{n\to\infty} C(W^{(\alpha V,n)}) = \lim_{n\to\infty} C(W^{(V,n)})$ for any $\alpha>0$. This follows by Remark 2.6, as

$$\lim_{n\to\infty} C(W^{(\alpha V,n)}) = \frac{\mathrm{Ent}(\alpha V)}{E[\alpha V]} = \frac{\alpha\,\mathrm{Ent}(V)}{\alpha\,E[V]} = \lim_{n\to\infty} C(W^{(V,n)}),$$

where the second equality uses $\mathrm{Ent}(\alpha V) = \alpha\,\mathrm{Ent}(V)$ [6, Remark 3.3.1].

Example 2.9 (Exponential distribution). Consider a DMC as defined above using an exponential distribution with rate parameter $\lambda>0$. Then for $n\to\infty$ its capacity converges to $\frac{1-\kappa}{\ln 2}$ almost surely and in $L^2$, where $\kappa$ denotes Euler's constant. This follows directly from Theorem 2.4, since for $V_{x,y}\sim\mathcal{E}(\lambda)$ we have $\mu_1 = E[V_{x,y}] = \frac{1}{\lambda}$ and $\mu_2 = E[V_{x,y}\log V_{x,y}] = \frac{1-\kappa-\ln\lambda}{\lambda\ln 2}$. The fact that the asymptotic capacity is constant (i.e., independent of $\lambda$) is a direct consequence of the homogeneity property in Remark 2.7, since $\alpha V_{x,y}\sim\mathcal{E}(\frac{\lambda}{\alpha})$ for any $\alpha>0$.

Example 2.10 (Uniform distribution on the $n$-simplex). Consider a DMC described by an $n\times n$ channel transition matrix whose rows $W^{(V,n)}_{x,\cdot}$ are independent random variables on the $n$-simplex. More precisely, let the rows $W^{(V,n)}_{x,\cdot}$ be i.i.d. random variables according to the symmetric Dirichlet distribution $\mathrm{Dir}(\lambda,\ldots,\lambda)$ with concentration parameter $\lambda = 1$, which is equivalent to the uniform distribution over the $n$-simplex. It is known [7, Theorem 4.1, p. 594] that for $n$ exponentially distributed i.i.d. random variables $V_{x,1},\ldots,V_{x,n}\sim\mathcal{E}(\eta)$ with $\eta>0$, the multivariate random variable $W^{(V,n)}_{x,\cdot} := V_{x,\cdot}/\sum_{y\in[n]} V_{x,y}$ admits a uniform distribution over the $n$-simplex. Hence, by Example 2.9 the capacity of a channel $W^{(V,n)}$ with i.i.d. uniformly distributed rows converges to $\frac{1-\kappa}{\ln 2}$ almost surely and in $L^2$ as $n\to\infty$, where $\kappa$ denotes Euler's constant.

Example 2.11 (Lognormal distribution). Consider a DMC (with $n=m$) as defined above using a lognormal distribution $\ln\mathcal{N}(z,\sigma)$ with parameters $z\in\mathbb{R}$ and $\sigma>0$. Then for $n\to\infty$ its capacity converges to $\frac{\sigma^2}{2\ln 2}$ almost surely and in $L^2$. This follows directly from Theorem 2.4, since for $V_{x,y}\sim\ln\mathcal{N}(z,\sigma)$ we have $\mu_1 = e^{z+\sigma^2/2}$ and $\mu_2 = \frac{(z+\sigma^2)\,e^{z+\sigma^2/2}}{\ln 2}$. We note that $\alpha V_{x,y}\sim\ln\mathcal{N}(z+\ln\alpha, \sigma)$ for positive $\alpha$, which by the homogeneity property (cf. Remark 2.7) implies that the asymptotic capacity cannot depend on $z$.
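The closed-form moments in these examples are easy to sanity-check; the sketch below (ours) estimates $\frac{\mu_2}{\mu_1}-\log\mu_1$ by Monte Carlo for the exponential case and compares it with the predicted constant $\frac{1-\kappa}{\ln 2}$ across several rates $\lambda$.

```python
# Sketch: for V ~ E(lambda) the quantity mu2/mu1 - log2(mu1) should equal
# (1 - kappa)/ln 2 independently of lambda (homogeneity, Remark 2.7).
import numpy as np

KAPPA = 0.5772156649015329                  # Euler-Mascheroni constant
target = (1 - KAPPA) / np.log(2)
rng = np.random.default_rng(1)
for lam in (0.1, 1.0, 10.0):
    V = rng.exponential(1.0 / lam, size=10**6)   # scale = 1/rate
    mu1 = V.mean()
    mu2 = (V * np.log2(V)).mean()
    print(f"lambda={lam:5.1f}  estimate={mu2 / mu1 - np.log2(mu1):.4f}  "
          f"target={target:.4f}")
```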
Three more examples considering the gamma, chi-squared and beta distributions can be found in Appendix A. Before we provide a rigorous proof of Theorem 2.4 in the next section, let us sketch an informal motivation that may provide some intuition for the proof.
Let us assume that the i.i.d. random variables $V_{x,y}$ take values in a finite set $[k]$, for some $k\in\mathbb{N}$. Statistically, as the input and output alphabets get larger (i.e., $n, m\gg k$), the channel matrix $W^{(V,n)}$ resembles a weakly symmetric channel (i.e., every row is a permutation of every other row and all the column sums are equal). It is known [8, Theorem 7.2.1] that the capacity of a weakly symmetric channel $W^{(V,n)}$ is given by $\log m - H(W^{(V,n)}_{x,\cdot})$ for $x\in[n]$ and that the uniform input distribution is capacity achieving. In Section 2 A, to prove Theorem 2.4, we derive analytical upper and lower bounds for the capacity and show that in the limit $n\to\infty$ they coincide at the value predicted by Theorem 2.4. The upper bound is shown to be $\log m - \min_{x\in[n]} H(W^{(V,n)}_{x,\cdot})$ and the lower bound $I(\bar p, W^{(V,n)})$, where $\bar p$ is the uniform distribution on $[n]$.
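This heuristic is easy to visualize numerically. In the sketch below (ours, with exponential entries as an illustrative choice), the per-row quantity $\log m - H(W^{(V,n)}_{x,\cdot})$ is nearly identical across rows of a large sampled channel and close to the predicted limit.

```python
# Sketch: for a large random channel, log m - H(row) barely varies across
# rows and is close to (1 - kappa)/ln 2 ~ 0.61 for exponential entries.
import numpy as np

rng = np.random.default_rng(2)
n = m = 2000
V = rng.exponential(1.0, size=(n, m))
W = V / V.sum(axis=1, keepdims=True)
H_rows = -(W * np.log2(W)).sum(axis=1)      # entropy of each row, in bits
print(np.log2(m) - H_rows.max(), np.log2(m) - H_rows.min())
```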
A. Proof of Theorem 2.4

To keep the notation simple we denote the channel transition matrix $W^{(V,n)}$ by $W$. We reformulate the problem (1) by introducing an additional decision variable $q\in\Delta_m$ representing the output distribution of the channel, together with the coupling constraint $W^\top p = q$. Whereas the Lagrange dual problem to (1) can only be implicitly expressed through the solution of a system of linear equations (as reported in [9,10]), introducing the new decision variable $q$ allows us to derive an explicit and simple Lagrange dual problem. It can be shown (see e.g. [11, Lemma 1]) that the optimization problem (1) is equivalent to the primal program

$$C(W) = \max_{p\in\Delta_n,\ q\in\Delta_m} \big\{ H(q) - r^\top p \;:\; W^\top p = q \big\}, \tag{2}$$

whose Lagrange dual program reads

$$\min_{\lambda\in\mathbb{R}^m} \big\{ G(\lambda) + F(\lambda) \big\}, \tag{3}$$

where $G, F : \mathbb{R}^m\to\mathbb{R}$ are given by $G(\lambda) := \max_{q\in\Delta_m}\{H(q) - \lambda^\top q\}$ and $F(\lambda) := \max_{p\in\Delta_n}\{(W\lambda - r)^\top p\}$. Note that since the coupling constraint $W^\top p = q$ in the primal program (2) is affine, the set of optimal solutions to the dual program (3) is nonempty [12, Proposition 5.3.1] and as such the optimum is attained. As shown in [11, Section 2], $G$ and $F$ admit the analytical expressions

$$G(\lambda) = \log\Big(\sum_{j=1}^{m} 2^{-\lambda_j}\Big) \tag{4}$$

and

$$F(\lambda) = \max_{i\in[n]} \big\{ (W\lambda)_i - r_i \big\}. \tag{5}$$

Lemma 2.12. Strong duality holds between (2) and (3).
Proof. The proof follows by a standard strong duality result of convex optimization, see [12, Proposition 5.3.1].
Weak duality of convex programming implies that the dual is always an upper bound to the primal problem, i.e., for every $p\in\Delta_n$ and every $\lambda\in\mathbb{R}^m$, $I(p, W) \le G(\lambda) + F(\lambda) =: C^{(\lambda)}_{\mathrm{UB}}(W)$. By following the proof of Lemma 2.2, one can show that the mapping $\mathcal{M}_{n,m}\ni W\mapsto C^{(\lambda)}_{\mathrm{UB}}(W)\in\mathbb{R}_{\ge0}$ is measurable for any $\lambda\in\mathbb{R}^m$, and as such $C^{(\lambda)}_{\mathrm{UB}}(W^{(V,n)})$ is a random variable. In the following we work with the choice $\lambda = 0$ and write $C_{\mathrm{UB}}(W) := C^{(0)}_{\mathrm{UB}}(W)$.

Lemma 2.13 (Upper bound). As $n\to\infty$, $C_{\mathrm{UB}}(W^{(V,n)})$ converges to $\frac{\mu_2}{\mu_1}-\log\mu_1$ almost surely and in $L^2$.

Proof. According to (4) and (5) we have $C_{\mathrm{UB}}(W) = \log m + \max_{i\in[n]}\sum_{j=1}^m W_{i,j}\log W_{i,j}$. By definition of our channel and Lemma B.1 we know that for every $i\in[n]$, as $n\to\infty$, $\sum_{j=1}^m W_{i,j}\log W_{i,j} + \log m$ converges to $\frac{\mu_2}{\mu_1}-\log\mu_1$ almost surely and in $L^2$, which proves the assertion.

Lemma 2.14 (Lower bound). As $n\to\infty$, the mutual information $I(\bar p, W^{(V,n)})$ for the uniform input distribution $\bar p$ converges to $\frac{\mu_2}{\mu_1}-\log\mu_1$ almost surely and in $L^2$.

Proof. The mutual information for a uniform input distribution, i.e., $p_i = \frac{1}{n}$ for all $i\in[n]$, can be written as (6). Consider the upper bound (8) for some $\bar x\in[n]$, where $\varepsilon_n := \lceil\gamma n\rceil - \gamma n\in[0,1)$ for all $n$. According to Lemma B.1, the right-hand side of (8) converges to $\frac{\mu_2}{\mu_1}-\log\mu_1-\log\gamma$ almost surely and in $L^2$ as $n\to\infty$. We can also bound the same term from below as in (9), again for some $\bar x\in[n]$; according to Lemma B.1, the right-hand side of (9) converges to $\frac{\mu_2}{\mu_1}-\log\mu_1-\log\gamma$ almost surely and in $L^2$ as $n\to\infty$. Thus for $n\to\infty$, (6) converges to $\frac{\mu_2}{\mu_1}-\log\mu_1$ in $L^2$, which proves the assertion.
Lemmas 2.13 and 2.14 complete the proof of Theorem 2.4, as $I(\bar p, W^{(V,n)}) \le C(W^{(V,n)}) \le C_{\mathrm{UB}}(W^{(V,n)})$ and both bounds converge to $\frac{\mu_2}{\mu_1}-\log\mu_1$ almost surely and in $L^2$.
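Numerically, the two bounds close in on each other quickly. The sketch below (ours) evaluates the lower bound $I(\bar p, W)$ for uniform $\bar p$ and, assuming the constant dual choice $\lambda = 0$ as in the proof above, the upper bound $\log m - \min_{x} H(W_{x,\cdot})$, on sampled channels of growing size.

```python
# Sketch: lower bound I(p_bar, W) with uniform p_bar, and the dual upper
# bound log m - min_x H(W_{x,.}) obtained from the choice lambda = 0.
import numpy as np

def capacity_bounds(W):
    n, m = W.shape
    p = np.full(n, 1.0 / n)
    q = np.maximum(p @ W, 1e-300)
    lower = float(p @ (W * np.log2(W / q)).sum(axis=1))   # I(p_bar, W)
    upper = np.log2(m) - (-(W * np.log2(W)).sum(axis=1)).min()
    return lower, upper

rng = np.random.default_rng(3)
for n in (100, 500, 2000):
    V = rng.exponential(1.0, size=(n, n))   # entries are positive a.s.
    W = V / V.sum(axis=1, keepdims=True)
    print(n, capacity_bounds(W))            # both approach ~0.61 bits
```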

CONVERGENCE RATE
This section addresses how fast the capacity of a channel of the form introduced in Section 2 converges to the asymptotic value predicted by Theorem 2.4. In addition, we consider a different model for the channel construction compared to Section 2. Let $(V_{x,y})_{x\in[n], y\in[m]}$ be a sequence of independent nonnegative random variables such that the following assumption holds.

Assumption 3.1. There exist positive numbers $K$ and $T$ such that for all $n, m\in\mathbb{N}$ ...
The main difference between the random channel model considered in Section 2 and the one in this section is that here we assume that the random variables $(V_{x,y})_{x\in[n], y\in[m]}$ are independent and satisfy Assumption 3.1, whereas in Section 2 we assume that they are independent and identically distributed and satisfy Assumption 2.1. Assumption 3.1 is clearly stronger than Assumption 2.1, which allows us to state a rate of convergence.

Remark 3.3 (Exponential convergence). Note that $\mu_{1,n}$ is strictly larger than zero for any $n\in\mathbb{N}$ by Assumption 3.1(iv). Moreover, Assumption 3.1(i) implies that there exists a constant $S$ such that $\mu_{2,n}\le S$ for any $n\in\mathbb{N}$. Therefore, the parameters $\alpha_t$ and $\beta_t$ in Theorem 3.2 can be bounded from below independently of $n$. Assumption 3.1(iv) further ensures that the parameter $a$ can be bounded from below independently of $n$, and as such the parameter $L$ is bounded from above and below independently of $n$. Hence, Theorem 3.2 clearly implies exponential convergence in $n$.
Proof. Follows directly from Theorem 3.2.
Since the exponential concentration provided in Theorem 3.2 is summable, a direct application of the Borel-Cantelli Lemma yields almost sure convergence.

The structure of the proof is that we prove convergence rates separately for the lower and upper bounds of Section 2 (Propositions 3.9 and 3.13, respectively). The claim follows since the capacity is sandwiched between the upper and lower bounds, and hence converges at the worse of the two rates. To prove Proposition 3.9 we need a few preparatory lemmas.

Lemma 3.5. Let $X_1,\ldots,X_n$ be independent real-valued random variables. Assume that there exist positive numbers $K$ and $T$ such that ...

Lemma 3.6. Let $X$ and $Y$ be two random variables, and $\eta_1$ and $\eta_2$ two real constants such that ...
We next derive a few preparatory lemmas that are used to prove Proposition 3.13.
Lemma 3.10. Using the notation introduced above, for all $y\in[m]$, ... for $f(\cdot,\cdot)$ and $\beta_t$ as defined in (10) and the theorem. The same argument can be applied to bound $P\big[U_{y,\mathrm{UB}} - \frac{n}{m} \ge t\big]$, which then proves the assertion.

Lemma 3.11. Using the notation introduced above, for every $y\in[n]$, ...

Proof. Follows directly from Lemmas 3.8 and 3.10.
Lemma 3.12. Using the notation introduced above, ...

Proof of Proposition 3.13. ... where the final inequality uses similar steps as in the derivation of (21), together with Lemma 3.6 and Lemma 3.12.
Theorem 3.2 follows directly from Propositions 3.9 and 3.13, since by definition $I(\bar p, W) \le C(W) \le C_{\mathrm{UB}}(W)$ almost surely.

SIMULATION RESULTS
In this section we compute the capacity of the DMCs introduced in Section 2 for finite alphabet sizes. For the computation we use a recently introduced method [11] which allows us to efficiently compute close upper and lower bounds on the capacity.

Example 4.1 (Exponential distribution). We consider a channel given by the stochastic matrix $W = (W_{x,y})_{x,y\in[n]}$ with $W_{x,y} = V_{x,y}/\sum_{y'\in[n]} V_{x,y'}$, where the $V_{x,y}$ are i.i.d. $\mathcal{E}(\eta)$ random variables with $\eta = \frac{1}{10}$ for all $x, y\in[n]$. As explained in Example 2.10, with this channel construction the rows $W_{x,\cdot}$ admit a uniform distribution on the $n$-simplex for all $x\in[n]$. Figure 1 depicts the capacity of $W$ for varying alphabet sizes. We perform five independent experiments for each value of $n$. One can observe that as $n\to\infty$ the capacity approaches the asymptotic limit determined in Example 2.9. In addition, one can see that the variance of the capacity across the independently drawn channels decreases with increasing alphabet size.
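The experiment is easy to reproduce in a few lines. The sketch below (ours, not the bounding method of [11]) uses NumPy's Dirichlet sampler for the rows and the `blahut_arimoto` sketch from the Introduction to estimate each capacity.

```python
# Sketch of Example 4.1: Dirichlet(1,...,1) rows (uniform on the n-simplex),
# five independent channels per alphabet size; the mean approaches
# (1 - kappa)/ln 2 and the spread across channels shrinks as n grows.
import numpy as np

KAPPA = 0.5772156649015329
rng = np.random.default_rng(4)
for n in (25, 100, 400):
    caps = []
    for _ in range(5):                          # five independent channels
        W = rng.dirichlet(np.ones(n), size=n)   # each row uniform on simplex
        caps.append(blahut_arimoto(W))          # sketch from the Introduction
    print(n, np.mean(caps), np.std(caps), (1 - KAPPA) / np.log(2))
```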

APPLICATION IN BAYESIAN EXPERIMENT DESIGN
The main objective of optimal experiment design is, based on prior knowledge, to select a most informative experiment, where we restrict attention to a certain notion of information that traces back to Shannon [1]; see [14] for a comprehensive survey.
Let the random variable $X\in\mathcal{X} := [n]$ describe a parameter to be determined, with prior probability distribution $p\in\Delta_n$, and let the random variable $Y\in\mathcal{Y} := [m]$ denote an observation. Furthermore, consider a family of experiments $(W^{(\lambda,n)})_{\lambda\in\Lambda}$, where $\Lambda\subset\mathbb{R}^d$ and each experiment $W^{(\lambda,n)}\in\mathcal{M}_{n,m}$ is characterized by the conditional probabilities $W^{(\lambda,n)}_{x,y} := P^{(\lambda)}[Y=y\,|\,X=x]$ for all $x\in\mathcal{X}$ and $y\in\mathcal{Y}$. The task of optimal experiment design is, given a prior distribution $p\in\Delta_n$, to find the experiment that provides the highest average amount of information, as described by the mutual information between the parameter and the observation [14, Definition 2]; i.e., the goal is to find $\lambda^\star\in\Lambda$ such that $I(p, W^{(\lambda^\star,n)}) \ge I(p, W^{(\lambda,n)})$ for all $\lambda\in\Lambda$. This requires one to compute

$$\sup_{\lambda\in\Lambda} I\big(p, W^{(\lambda,n)}\big). \tag{22}$$
The optimization problem (22) is in general difficult to solve. Moreover, an evaluation of the objective function, the mutual information, for a given $\lambda$ has a computational complexity of $O(nm)$, and as such, for large sets $\mathcal{X}$ and $\mathcal{Y}$, even solving (22) to local optimality can be computationally demanding.
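When $\Lambda$ is finite and moderately sized, however, direct enumeration is a reasonable baseline, at cost $O(nm)$ per candidate. The following minimal Python sketch (ours) does exactly that; `make_channel` is a hypothetical stand-in for the application-specific family $(W^{(\lambda,n)})_{\lambda\in\Lambda}$.

```python
# Sketch: brute-force experiment selection over a finite grid Lambda.
# `make_channel(lam)` is a hypothetical stand-in returning W^(lambda, n).
import numpy as np

def mutual_information(p, W):
    q = np.maximum(p @ W, 1e-300)
    return float(p @ (W * np.log2(np.where(W > 0, W, 1.0) / q)).sum(axis=1))

def best_experiment(p, Lambda, make_channel):
    """Return (lambda*, I(p, W^(lambda*))) maximizing the information gain."""
    scores = [mutual_information(p, make_channel(lam)) for lam in Lambda]
    best = int(np.argmax(scores))
    return Lambda[best], scores[best]
```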
The task of designing optimal experiments has recently attracted interest in the context of biological systems, where understanding of the underlying biological mechanisms emerges through iterations of modelling and experiments. Since experiments are expensive, an effective selection of informative experiments is essential; see [15]. We will show in Section 5 A that the asymptotic capacity formula given in Theorem 2.4 allows us to derive upper bounds on the expected information gain of an experiment for certain classes of (random) experiments. In addition, Theorem 2.4 provides an efficient method to select suboptimal experiments that are close to optimal in our numerical example; see Example 5.3. Let $(V_{x,y})$ ...

A. Upper bound on maximum expected information gain
In the limit, as n → ∞, we can establish the following upper bound on the maximum expected information gain by an experiment.
Proposition 5.1 (Upper bound on maximum expected information gain). For the family of channels $(W^{(\lambda,V,n)})_{\lambda\in\Lambda}$ introduced above that satisfy Assumptions 2.3 and 3.1, we have with high probability

$$\lim_{n\to\infty}\,\sup_{\lambda\in\Lambda} I\big(p, W^{(\lambda,V,n)}\big) \;\le\; \sup_{\lambda\in\Lambda}\,\lim_{n\to\infty} C\big(W^{(\lambda,V,n)}\big), \tag{23}$$

where the inner limit is the asymptotic capacity given by Theorem 2.4. The upper bound provided by Proposition 5.1 is particularly useful when it admits a closed-form expression, whereas the optimal information gain $\sup_{\lambda\in\Lambda} I(p, W^{(\lambda,V,n)})$ is difficult to compute (see Example 5.3 for more details).
Before proving Proposition 5.1 we state a preliminary standard result. The first inequality of (24) is trivial and the last equality follows by Theorem 2.4. Therefore it remains to prove that the first equality in (24) holds almost surely. Note first that the following property holds:

(i) the capacity of the channel $W^{(\lambda,V,n)}$ converges uniformly in $\Lambda$ to its asymptotic capacity in probability, i.e., for all $\varepsilon>0$, $\lim_{n\to\infty} P\big[\sup_{\lambda\in\Lambda}\big|C(W^{(\lambda,V,n)}) - \lim_{k\to\infty} C(W^{(\lambda,V,k)})\big| > \varepsilon\big] = 0$.

This holds because, by Theorem 3.2, we know that for each $n$ there exist $M_n < \infty$ and $N_n \le 1$ as well as $\Omega_n\subset\Omega$ with $P[\Omega_n]\ge N_n$ such that the capacity deviates from its asymptotic value by at most $M_n$ on $\Omega_n$. Moreover, $M_n\to0$ and $N_n\to1$ as $n\to\infty$, and hence property (i) holds. Note also that the following property holds trivially, since $C(W)\le\log(n\wedge m)$ for any channel matrix $W\in\mathcal{M}_{n,m}$:

(ii) $\sup_{\lambda\in\Lambda} C(W^{(\lambda,V,n)}) < \infty$ almost surely for all $n\in\mathbb{N}$.
Let us consider the case where the prior distribution is uniform. In this case the upper bound (23) is tight, by following the proofs of Theorem 2.4 and Proposition 5.4. Figure 2 depicts, for different alphabet sizes $n$, in (a) the empirical mean of the maximum expected information gain (blue line) over 500 experiments, which in general is difficult to compute, in particular for higher-dimensional examples than Example 5.3. The red line represents the empirical mean of the suboptimal expected information gain, obtained by evaluating the mutual information at the parameters that are optimal for the asymptotic capacity, as derived in Proposition 5.4; this is computationally much cheaper. The empirical variance of the maximum expected information gain (blue line) and the empirical variance of the suboptimal expected information gain (red line) are depicted in (b).

CONCLUSION AND DISCUSSION
In this article we studied the capacity of discrete memoryless channels whose channel transition matrix consists of entries that are nonnegative i.i.d. random variables $X$ before being normalized. It was shown that, under some mild assumptions on the distribution of these random variables, the capacity of such a channel converges, as the dimension goes to infinity, to the asymptotic capacity $\frac{\mu_2}{\mu_1}-\log\mu_1$ almost surely and in $L^2$, where $\mu_1 := E[X]$ and $\mu_2 := E[X\log X]$. Interestingly, for some distributions, e.g., the uniform and exponential distributions, the asymptotic capacity is a constant. Furthermore, we have shown that the capacity of these random channels converges exponentially in the dimension to its asymptotic value in probability. Finally, we provided an interpretation of the asymptotic capacity as an upper bound on the maximum expected information gain in the context of Bayesian optimal experiment design.
For future work we aim to investigate whether the asymptotic capacity of a random channel determined by Theorem 2.4 has an operational meaning in other scenarios, e.g., in the setup of fading channels or in Bayesian estimation. Furthermore, it would be interesting to study the variance of the capacity of such random channels and its decay rate.

APPENDIX A

Example A.2 (Chi-squared distribution). Consider a DMC as defined in Section 2 using a chi-squared distribution with $k$ degrees of freedom. For $n\to\infty$ its capacity converges to $1 + \frac{1}{\ln 2}\psi\big(1+\frac{k}{2}\big) - \log k$ almost surely and in $L^2$, where $\psi(\cdot)$ denotes the digamma function. This is a direct consequence of Theorem 2.4, since for $V_{x,y}\sim\chi^2(k)$ we have $\mu_1 = E[V_{x,y}] = k$ and $\mu_2 = E[V_{x,y}\log V_{x,y}] = k + \frac{k}{\ln 2}\psi\big(1+\frac{k}{2}\big)$.

Example A.3 (Beta distribution). Consider a DMC as defined in Section 2 using a beta distribution with shape parameters $\alpha, \beta > 0$. For $n\to\infty$ its capacity converges to $\frac{H_\alpha - H_{\alpha+\beta}}{\ln 2} - \log\frac{\alpha}{\alpha+\beta}$ almost surely and in $L^2$, where $H_n$ denotes the $n$-th harmonic number. This is a direct consequence of Theorem 2.4, using that for $V_{x,y}\sim\mathrm{beta}(\alpha,\beta)$ we have $\mu_1 = E[V_{x,y}] = \frac{\alpha}{\alpha+\beta}$ and $\mu_2 = E[V_{x,y}\log V_{x,y}] = \frac{\alpha}{(\alpha+\beta)\ln 2}(H_\alpha - H_{\alpha+\beta})$.
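The closed forms above are straightforward to verify numerically; the following sketch (ours) checks the chi-squared case by Monte Carlo, with $\psi(\cdot)$ supplied by `scipy.special.digamma`.

```python
# Sketch: Monte Carlo check of Example A.2 for V ~ chi-squared(k):
# mu2/mu1 - log2(mu1) should approach 1 + psi(1 + k/2)/ln 2 - log2(k).
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(5)
for k in (1, 4, 10):
    V = rng.chisquare(k, size=10**6)    # samples are positive a.s.
    mu1 = V.mean()
    mu2 = (V * np.log2(V)).mean()
    pred = 1 + digamma(1 + k / 2) / np.log(2) - np.log2(k)
    print(k, mu2 / mu1 - np.log2(mu1), pred)
```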