Computing Classical-Quantum Channel Capacity Using Blahut–Arimoto Type Algorithm: A Theoretical and Numerical Analysis

Based on Arimoto's work in 1972, we propose an iterative algorithm for computing the capacity of a discrete memoryless classical-quantum channel with a finite input alphabet and a finite-dimensional output, which we call the Blahut–Arimoto algorithm for the classical-quantum channel; an input cost constraint is also considered. We show that, to reach ε accuracy, the iteration complexity of the algorithm is upper bounded by (log n)/ε, where n is the size of the input alphabet. In particular, when the output states {ρ_x}_{x∈X} are linearly independent in complex matrix space, the algorithm converges geometrically. We also show that the algorithm reaches an ε-accurate solution with a complexity of O(m³ (log n)/ε) in general, and O(m³ log(D(p*||p^{N₀})/ε) / log(1/(1 − δ))) in the special case, where m is the output dimension, D(p*||p^{N₀}) is the relative entropy of two distributions, and δ is a positive number. Numerical experiments were performed, and an approximate solution for the binary two-dimensional case was analysed.


Introduction
The computation of channel capacity has always been a core problem in information theory. The well-known Blahut–Arimoto algorithm [1,2] was proposed in 1972 to compute the capacity of a discrete memoryless classical channel. Inspired by this algorithm, we propose an algorithm of Blahut–Arimoto type to compute the capacity of a discrete memoryless classical-quantum channel. The classical-quantum channel [3] can be viewed as a mapping x → ρ_x from an input alphabet X = {1, 2, ..., |X|} to a set of quantum states in a finite-dimensional Hilbert space H. The state of a quantum system is given by a density operator ρ, a positive semi-definite operator with trace equal to one. Let D_m denote the set of all density operators acting on a Hilbert space H of dimension m. If the source emits a letter x with probability p_x, the output is ρ_x; thus the outputs form an ensemble {p_x : ρ_x}_{x∈X}.
In 1998, Holevo showed [4] that the classical capacity of the classical-quantum channel is the maximization of a quantity called the Holevo information over all input distributions. The Holevo information χ of an ensemble {p_x : ρ_x}_{x∈X} is defined as

χ({p_x : ρ_x}_{x∈X}) = H(∑_x p_x ρ_x) − ∑_x p_x H(ρ_x), (1)

where H(·) is the von Neumann entropy, defined on positive semi-definite matrices:

H(ρ) = − Tr(ρ log ρ). (2)

Due to the concavity of the von Neumann entropy [5], the Holevo information is always non-negative. The Holevo quantity is concave in the input distribution [5], and thus the maximization of Equation (1) over p is a convex optimization problem; however, it is not a straightforward one. In 2014, Sutter et al. [6] proposed an algorithm based on the duality of convex programming and smoothing techniques [7] with a complexity of O((n ∨ m) m³ (log n)^{1/2}/ε), where n ∨ m = max{n, m}. For discrete memoryless classical channels, the capacity can be computed efficiently by the Blahut–Arimoto (BA) algorithm [1,2,8]. In 1998, H. Nagaoka [9] proposed a quantum version of the BA algorithm. In his work, he considered the quantum-quantum channel, and this problem was proved to be NP-complete in 2008 [10]. Despite the NP-completeness, an example is given in [11] of a qubit quantum channel that requires four inputs to maximize the Holevo capacity. Further research on Nagaoka's algorithm was presented in [12], where the algorithm was implemented to check the additivity of quantum channels. In [9], Nagaoka also mentioned an algorithm concerning the classical-quantum channel; however, its speed of convergence was not studied there, and the details of the proof were not presented either. In this paper, we show that, with proper manipulations, the BA algorithm can be applied efficiently to computing the capacity of a classical-quantum channel with an input constraint. The remainder of this article is structured as follows. In Section 2, we propose the algorithm and show how it works.
In Section 3, we provide the convergence analysis of the algorithm. In Section 4, we present numerical experiments to see how well the BA algorithm performs. In Section 5, we propose an approximate solution for a special case: the binary-input, two-dimensional-output case.
Notations: The logarithm with base 2 is denoted by log(·). The space of all Hermitian operators of dimension m is denoted by H_m. The set of all density matrices of dimension m is denoted by D_m := {ρ ∈ H_m : ρ ≥ 0, Tr ρ = 1}. Each letter x ∈ X is mapped to a density matrix ρ_x; thus the classical-quantum channel can be represented as a set of density matrices {ρ_x}_{x∈X}. The set of all probability distributions of length n is denoted by Δ_n := {p ∈ R^n : p_x ≥ 0, ∑_{x=1}^n p_x = 1}. The von Neumann entropy of a density matrix ρ is denoted by H(ρ) = − Tr[ρ log ρ]. The relative entropy between p, q ∈ Δ_n is D(p||q) = ∑_x p_x (log p_x − log q_x) if supp(p) ⊆ supp(q), and +∞ otherwise. The relative entropy between ρ, σ ∈ D_m is D(ρ||σ) = Tr[ρ(log ρ − log σ)] if supp(ρ) ⊆ supp(σ), and +∞ otherwise.
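For concreteness, these entropic quantities are easy to evaluate numerically. The following Python sketch (the helper names are ours, and NumPy is assumed) computes H(ρ), D(ρ||σ), and the Holevo information of Equation (1), all in bits:

```python
import numpy as np

def von_neumann_entropy(rho):
    """H(rho) = -Tr[rho log rho] in bits; zero eigenvalues contribute nothing."""
    w = np.linalg.eigvalsh(rho)
    w = w[w > 1e-12]
    return float(-np.sum(w * np.log2(w)))

def relative_entropy(rho, sigma):
    """D(rho||sigma) = Tr[rho(log rho - log sigma)] in bits; assumes supp(rho) in supp(sigma)."""
    w, V = np.linalg.eigh(sigma)
    log_sigma = V @ np.diag(np.log2(np.maximum(w, 1e-300))) @ V.conj().T
    return float(-von_neumann_entropy(rho) - np.real(np.trace(rho @ log_sigma)))

def holevo_information(p, rhos):
    """chi({p_x : rho_x}) = H(sum_x p_x rho_x) - sum_x p_x H(rho_x), as in Equation (1)."""
    rho_avg = sum(px * rx for px, rx in zip(p, rhos))
    return von_neumann_entropy(rho_avg) - sum(px * von_neumann_entropy(rx)
                                              for px, rx in zip(p, rhos))
```

For example, two orthogonal pure qubit states with equal weights give χ = 1 bit.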

Blahut-Arimoto Algorithm for Classical-Quantum Channel
First, we write down the primal optimization problem:

Primal: maximize χ({p_x : ρ_x}_{x∈X}) over p ∈ Δ_n subject to s^T p ≤ S, (3)

where ρ_x ∈ D_m, s ∈ R^n is a positive real vector, and S > 0. We denote the maximal value of Equation (3) by C(S). In this optimization problem, we maximize the Holevo quantity with respect to the input distribution {p_x}_{x∈X}. In practice, preparing different signal states x incurs different costs, represented by s = (s_1, s_2, ..., s_n). We would like to bound the expected cost of the resource by some quantity, which is expressed by the inequality constraint in Equation (3).

Lemma 1 ([6]). Let G := arg max_{p∈Δ_n} χ({p_x : ρ_x}_{x∈X}) and S_max := min_{p∈G} s^T p. Then, if S ≥ S_max, the inequality constraint in the primal problem is inactive; and, if S < S_max, the inequality constraint in the primal problem is equivalent to s^T p = S.

Now, we assume that min{s_x}_{x∈X} ≤ S ≤ S_max. The Lagrange dual problem of Equation (3) is

Dual: min_{λ≥0} max_{p∈Δ_n} [χ({p_x : ρ_x}_{x∈X}) − λ(s^T p − S)]. (4)

Define

F(λ) := max_{p∈Δ_n} [χ({p_x : ρ_x}_{x∈X}) − λ s^T p], (5)

f_λ(p, q) := ∑_x p_x (log q_x − log p_x) + ∑_x p_x D(ρ_x||σ) − λ s^T p, (6)

where σ = ∑_x q_x ρ_x. Note that f_λ(p, p) = ∑_x p_x D(ρ_x||ρ) − λ s^T p = χ({p_x : ρ_x}_{x∈X}) − λ s^T p, where ρ = ∑_x p_x ρ_x.
Lemma 3. For any p, q ∈ Δ_n, f_λ(p, q) ≤ f_λ(p, p); that is, the maximum of f_λ(p, ·) over its second argument is attained at q = p.

Proof. Actually, we can prove a stronger lemma (the following lemma was proposed in [9], but no proof was given, perhaps due to the space limitation). We now restate the lemma in [9] and give the proof.

Lemma 4. For p, q ∈ Δ_n and sets of states {ρ_x}_{x∈X}, {σ_x}_{x∈X}, we have

∑_x p_x (log p_x − log q_x) + ∑_x p_x D(ρ_x||σ_x) ≥ D(ρ||σ), (7)

where ρ = ∑_x p_x ρ_x and σ = ∑_x q_x σ_x.
Proof. Considering Equation (7), we have

∑_x p_x (log p_x − log q_x) + ∑_x p_x D(ρ_x||σ_x) = D(ρ_XB||σ_XB),

where ρ_XB = ∑_x p_x |x⟩⟨x|_X ⊗ ρ_x and σ_XB = ∑_x q_x |x⟩⟨x|_X ⊗ σ_x are classical-quantum states [5]. Let the quantum channel N be the partial trace over the X system; then, by the monotonicity of quantum relative entropy ([5], Theorem 11.8.1), we have D(ρ_XB||σ_XB) ≥ D(N(ρ_XB)||N(σ_XB)) = D(ρ||σ). Notice that, if we let σ_x = ρ_x in Equation (7), with some calculation, Equation (7) becomes Lemma 3. Thus, Lemma 3 is a straightforward corollary of Lemma 4.

Theorem 1. The dual problem in Equation (4) is equivalent to

min_{λ≥0} F(λ) + λS. (8)

Proof. It follows from Equation (5) and Lemma 3 that the inner maximization in Equation (4) equals F(λ) + λS for each λ. Hence, the dual problem reduces to min_{λ≥0} F(λ) + λS.

The BA algorithm is an alternating optimization algorithm; i.e., to optimize f_λ(p, p'), each iteration step fixes one variable and optimizes the function over the other. Now, we use the BA algorithm to find F(λ). The iteration procedure is

p^{t+1} = arg max_p f_λ(p, p^t), (9)

where ρ^t = ∑_x p^t_x ρ_x. To get p^{t+1}, we can use the Lagrange function of this maximization and set the gradient with respect to p_x to zero. Combining this with the normalization condition, we have (taking the natural logarithm for convenience)

p^{t+1}_x = r^t_x / ∑_{x'} r^t_{x'}, (10)

where

r^t_x = p^t_x exp(D(ρ_x||ρ^t) − λ s_x). (11)

Thus, we can summarize the procedure as Algorithm 1 below.
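A minimal numerical sketch of this alternating update, assuming NumPy (the helper names are ours): the exponent uses the natural logarithm, as in the derivation above, and the returned value is converted back to bits.

```python
import numpy as np

def _logm(mat):
    """Natural-base matrix logarithm of a Hermitian matrix via eigendecomposition."""
    w, V = np.linalg.eigh(mat)
    return V @ np.diag(np.log(np.maximum(w, 1e-300))) @ V.conj().T

def _rel_ent(rho, sigma):
    """D(rho||sigma) in nats; assumes supp(rho) contained in supp(sigma)."""
    return float(np.real(np.trace(rho @ (_logm(rho) - _logm(sigma)))))

def ba_iterate(rhos, s=None, lam=0.0, iters=200):
    """BA iteration: r_x = p_x exp(D(rho_x||rho_t) - lam*s_x), then normalize.

    Starts from the uniform distribution; returns (f_lam(p, p) in bits, p)."""
    n = len(rhos)
    s = np.zeros(n) if s is None else np.asarray(s, dtype=float)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        rho_t = sum(px * rx for px, rx in zip(p, rhos))
        d = np.array([_rel_ent(rx, rho_t) for rx in rhos])
        r = p * np.exp(d - lam * s)   # unnormalized update r^t
        p = r / r.sum()               # normalized update p^{t+1}
    rho_t = sum(px * rx for px, rx in zip(p, rhos))
    val = sum(px * _rel_ent(rx, rho_t) for px, rx in zip(p, rhos)) - lam * float(s @ p)
    return val / np.log(2), p
```

With λ = 0 (no input constraint), the returned value is the Holevo quantity χ(p^t), which approaches the capacity as t grows.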

Lemma 5. Let p*(λ) = arg max_p f_λ(p, p) for a given λ; then, s^T p*(λ) is a decreasing function of λ.
Proof. Suppose λ_1 < λ_2 but s^T p*(λ_1) < s^T p*(λ_2); then, by the definition of p*(λ), we have χ(p*(λ_1)) − λ_1 s^T p*(λ_1) ≥ χ(p*(λ_2)) − λ_1 s^T p*(λ_2) and χ(p*(λ_2)) − λ_2 s^T p*(λ_2) ≥ χ(p*(λ_1)) − λ_2 s^T p*(λ_1). Adding the two inequalities yields (λ_2 − λ_1)(s^T p*(λ_1) − s^T p*(λ_2)) ≥ 0, which contradicts the assumption that p*(λ_2) has strictly larger cost. Thus, s^T p*(λ) must be decreasing in λ.

We do not need to solve the optimization problem in Equation (8) directly, because from Lemma 1 the statement "p* is an optimal solution" is equivalent to "s^T p* = S and p* maximizes χ(p) − λ(s^T p − S)", which is in turn equivalent to "s^T p* = S and p* maximizes f_λ(p, p)". Thus, if for some λ ≥ 0 a distribution p maximizes f_λ(p, p) and s^T p = S, then the capacity is C(S) = F(λ) + λS. Such a λ is easy to find, since s^T p is a decreasing function of λ: to reach ε accuracy, we need O(log(1/ε)) steps of the bisection method.
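The bisection over λ can be sketched as follows (our own naming; `optimal_p` stands for any inner solver returning the maximizer p*(λ) of f_λ(p, p), e.g. the BA iteration of Algorithm 1):

```python
import numpy as np

def find_lambda(optimal_p, s, S, lo=0.0, hi=50.0, tol=1e-8):
    """Bisection on the multiplier lambda.

    By Lemma 5, s^T p*(lambda) is decreasing in lambda, so we can bisect:
    if the expected cost s^T p*(lambda) exceeds the budget S, raise lambda."""
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if float(s @ optimal_p(lam)) > S:
            lo = lam   # cost still too high: increase the price
        else:
            hi = lam
    return 0.5 * (lo + hi)
```

Reaching tolerance ε takes O(log((hi − lo)/ε)) evaluations of the inner solver.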

Convergence Analysis
Next, we show that the algorithm indeed converges to F(λ) and then provide an analysis of the speed of the convergence.

The Convergence Is Guaranteed
Corollary 1. f_λ(p^{t+1}, p^t) = max_{p∈Δ_n} f_λ(p, p^t) = ln ∑_x r^t_x.
The first equality comes from a manipulation of Equation (5). The second equality follows from Equation (10). The last equality follows from Equation (9).

Corollary 2.
For an arbitrary distribution {p_x}_{x∈X}, we have

f_λ(p^{t+1}, p^t) ≥ f_λ(p, p) − D(p||p^t) + D(p||p^{t+1}).

Indeed, one computes D(p||p^t) − D(p||p^{t+1}) = f_λ(p, p) − f_λ(p^{t+1}, p^t) + D(ρ||ρ^t), where ρ = ∑_x p_x ρ_x: the first equality follows from Equation (9), the second from Corollary 1, the third from Equation (10), and the last from Equation (1). Since the relative entropy D(ρ||ρ^t) is always non-negative [5], the claimed inequality follows.
Proof. Let p* be an optimal solution that achieves F(λ); then, we have the following chain of inequalities. The first equality follows from Equations (5), (6), and (1); the first inequality follows from Corollary 2; and the last inequality follows from the non-negativity of relative entropy.
Notice that we take the initial p^0 to be the uniform distribution, so the right-hand side of Equation (15) is finite. Combined with the fact that f_λ(p^{t+1}, p^t) is a non-decreasing sequence, this means f_λ(p^{t+1}, p^t) converges to F(λ).
Proof. Removing the summation over t in Equations (13) and (14), we obtain the corresponding per-step inequality. Now, since the sequence {p^t}_{t=0}^∞ is bounded, there exists a convergent subsequence {p^{t_k}}_{k=0}^∞; say it converges to p̄. Then, clearly, we have f_λ(p̄, p̄) = F(λ) (otherwise f_λ(p^{t+1}, p^t) would not converge to F(λ)). Substituting p* = p̄ in Equation (16), we see that the sequence {D(p̄||p^t)}_{t=0}^∞ is decreasing, and there exists a subsequence {D(p̄||p^{t_k})}_{k=0}^∞ that converges to zero. Therefore, {D(p̄||p^t)}_{t=0}^∞ converges to zero, which means {p^t}_{t=0}^∞ converges to p̄.

The Speed of Convergence
Theorem 4. To reach ε accuracy to F(λ), the algorithm needs at most (log n)/ε iterations.
Proof. From the proof of Theorem 2, we know that F(λ) − f_λ(p^{t+1}, p^t) ≤ D(p*||p^0)/(t + 1) ≤ (log n)/(t + 1), since p^0 is uniform; thus (log n)/ε iterations suffice.

Next, we show that in some special cases the algorithm has a better convergence performance, namely a geometric speed of convergence.

Proof. Notice that the von Neumann entropy in Equation (2) is strictly concave [14]; thus, for distributions p ≠ p', Assumption 1 gives ρ = ∑_x p_x ρ_x ≠ ∑_x p'_x ρ_x = ρ'. This means H(ρ) is strictly concave in p, so the Holevo quantity in Equation (1) is strictly concave in p, which means the optimal solution p* is unique.
We also need the following theorem:

Theorem 6 ([15]). The relative entropy satisfies a quadratic estimate in the difference of its arguments.

Now, we state the theorem of convergence:

Theorem 7. Suppose we start from some initial point p^0. Then, under Assumption 1, the algorithm converges to the optimal point p*, and p^t converges to p* at a geometric speed; i.e., there exist N_0 and δ > 0, both independent of t, such that, for any t > N_0, D(p*||p^{t+1}) ≤ (1 − δ)D(p*||p^t).

Proof. Let d_x = p^t_x − p*_x and write the real vector |d⟩ = (d_1, d_2, ..., d_n)^T. Using a Taylor expansion, we have D(p*||p^t) = (1/2)⟨d|P^{−1}|d⟩ + o(⟨d|d⟩), where P = diag(p*_1, p*_2, ..., p*_n). Now, p^t converges to p*, i.e., |d⟩ converges to zero; thus there exists an N_0 such that, for any t > N_0, the higher-order terms are negligible. From Theorem 6, we obtain a quadratic lower bound on the per-step decrease in terms of a matrix M ∈ R^{n×n}. From Equation (18), we know that, under Assumption 1, M is positive definite. Thus, there exists a δ > 0 such that the per-step decrease is at least δ D(p*||p^t). Hence, for any t > N_0, it follows from Equations (17) and (18) that D(p*||p^{t+1}) ≤ (1 − δ)D(p*||p^t), and from Equation (12) this contraction holds for every t > N_0.

Remark 2 (Complexity). Denote by n and m the size of the input alphabet and the dimension of the output states, respectively. A closer look at Algorithm 1 reveals that each iteration requires computing one matrix logarithm log ρ^t; the rest are multiplications of matrices and of numbers. The matrix logarithm can be computed with complexity O(m³) [16]; thus, by Theorem 4 and Equation (11), the complexity of reaching an ε-close estimate of the true capacity using Algorithm 1 is O(m³ (log n)/ε). Under the extra condition on the channel {ρ_x}_{x∈X} given by Assumption 1, the complexity of reaching an ε-close solution (i.e., D(p*||p^t) < ε) using Algorithm 1 is O(m³ (N_0 + log(D(p*||p^{N_0})/ε)/log(1/(1 − δ)))). Usually, ε need not be too small (no smaller than 10^{−6}); thus, in either case, the complexity is better than the O((n ∨ m) m³ (log n)^{1/2}/ε) of [6] when n ∨ m is big, where n ∨ m = max{n, m}.

Numerical Experiments on BA Algorithm
We only performed experiments on the BA algorithm with no input constraint (the BA algorithm with an input constraint is a combination of the unconstrained BA algorithm with the bisection over λ described in Section 2). We studied the relation between the iteration complexity and n, m (i.e., the input size and output dimension) when the algorithm reaches a certain accuracy. Since we do not know the true capacity of a given channel, we used the following theorem to bound the error of the algorithm.
Proof. Following from Algorithm 1, Corollary 1, and Theorem 3, we have

lim_{t→∞} p^{t+1}_x / p^t_x = exp(D(ρ_x||ρ*) − λ s_x − F(λ)),

where ρ* = ∑_x p*_x ρ_x and p* is an optimal distribution. The limit above is 1 if p*_x > 0 and does not exceed 1 if p*_x = 0. Thus, for every x ∈ X, D(ρ_x||ρ*) − λ s_x ≤ F(λ), with equality if p*_x > 0. This proves the first claim. For any p^t and any optimal distribution p*, we have max_x {D(ρ_x||ρ^t) − λ s_x} ≥ ∑_x p*_x (D(ρ_x||ρ^t) − λ s_x) ≥ F(λ). The first inequality is immediate, and the second requires some calculation and follows since p* is an optimal distribution. This means max_x {D(ρ_x||ρ^t) − λ s_x} converges to F(λ) from above.
Thus, our accuracy criterion was the following: for a given classical-quantum channel, we ran the BA algorithm (with no input constraint) until max_x {D(ρ_x||ρ^t)} − f(p^t, p^t) was less than 10^{−k}, and recorded the number of iterations. At this point, the accuracy was of order 10^{−(k+1)} at most, since max_x {D(ρ_x||ρ^t)} and f(p^t, p^t) converge to the true capacity from above and below, respectively.
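This stopping criterion is straightforward to implement. A sketch (our helper names, NumPy, no input constraint, i.e. λ = 0) returning the lower bound f(p^t, p^t) = χ(p^t) and the upper bound max_x D(ρ_x||ρ^t) on the capacity:

```python
import numpy as np

def _logm2(mat):
    """Base-2 matrix logarithm of a Hermitian matrix via eigendecomposition."""
    w, V = np.linalg.eigh(mat)
    return V @ np.diag(np.log2(np.maximum(w, 1e-300))) @ V.conj().T

def _rel_ent2(rho, sigma):
    """D(rho||sigma) in bits; assumes supp(rho) contained in supp(sigma)."""
    return float(np.real(np.trace(rho @ (_logm2(rho) - _logm2(sigma)))))

def capacity_bounds(p, rhos):
    """Lower bound chi(p) = f(p, p) and upper bound max_x D(rho_x||rho_t) on C."""
    rho_t = sum(px * rx for px, rx in zip(p, rhos))
    d = [_rel_ent2(rx, rho_t) for rx in rhos]
    lower = float(sum(px * dx for px, dx in zip(p, d)))
    upper = float(max(d))
    return lower, upper
```

One then iterates the BA update until `upper - lower < 10**-k`.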
We performed the following numerical experiments: for given values of the input size n, output dimension m, and accuracy, we generated 200 classical-quantum channels randomly, recorded the numbers of iterations, and then calculated the average number of iterations over these 200 experiments. The results are shown in Figure 1. Note that an accuracy of 10^{−k} in Figure 1 means we ran the BA algorithm until max_x {D(ρ_x||ρ^t)} − f(p^t, p^t) was less than 10^{−k}, so the error between the true capacity and the computed value was of order 10^{−(k+1)} at most. We can see in Figure 1 that the iteration complexity scales better as the accuracy and input dimension increase. We can also see that, for a given input size n and accuracy, the output dimension has very little influence on the iteration complexity, which means the iteration complexity also scales well as the output dimension m increases. Compared with our theoretical analysis in Theorem 4 (to reach ε accuracy, (log n)/ε iterations suffice), the numerical experiments showed that the number of iterations was far smaller than (log n)/ε, whether the output quantum states were linearly independent or not (the cases (n, m) = (6, 2), (10, 2)). The reason is that the inequalities in the proof of Theorem 4 are quite loose, so Theorem 4 only provides a very loose upper bound on the iteration complexity. One may conjecture that the relation in Equation (20) holds generally, although we cannot prove it yet. Next, we examine the running time of the BA algorithm. There are three methods to compute the classical-quantum channel capacity: the BA algorithm, the duality and smoothing technique [6], and the method of Hamza Fawzi et al. [17]. In [17], a Matlab code package called CvxQuad is provided, which accurately approximates the relative entropy function via semidefinite programming, so that many quantum capacity quantities can be computed using the convex optimization tool CVX.
Here, we compared the running times of the above three methods. For each input size n and output dimension m, we generated a classical-quantum channel randomly, computed the channel capacity using the three methods, and recorded the running time of each. The results are shown in Figure 2, where we can see that the BA algorithm was the fastest method. The duality and smoothing method was rather slow, and we did not record its running time for n = 30 because it took too long. We can also notice that the running time of the CvxQuad method was extremely sensitive to the output dimension, which is not a surprise because CVX relies on a second-order method. Thus, our BA algorithm is significantly faster than the other two methods when n and m become large.
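For reproducibility, random classical-quantum channels of the kind used in such experiments can be generated, for instance, from Ginibre matrices (a sketch; this particular sampling scheme is our assumption, not necessarily the one used for Figures 1 and 2):

```python
import numpy as np

def random_density_matrix(m, rng):
    """Sample G G^dagger / Tr[G G^dagger] with a complex Gaussian (Ginibre) G."""
    g = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
    rho = g @ g.conj().T
    return rho / np.real(np.trace(rho))

def random_cq_channel(n, m, seed=0):
    """A random classical-quantum channel: n input letters, m-dimensional output states."""
    rng = np.random.default_rng(seed)
    return [random_density_matrix(m, rng) for _ in range(n)]
```

Each sampled matrix is Hermitian, positive semi-definite, and has unit trace, i.e., a valid element of D_m.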

An Approximate Solution of p in the Binary Two-Dimensional Case
In this section, we provide an approximate optimal input distribution for the case in which the input size and output dimension are both 2: {p_1 : ρ_1; p_2 : ρ_2}, p_1 + p_2 = 1, ρ_1, ρ_2 ∈ D_2.

Use Bloch Sphere to Get an Approximate Solution
Any two-dimensional density matrix can be represented as a point in the Bloch sphere [5]: a vector starting from the origin. Suppose ρ_1, ρ_2 are represented by r_1, r_2, respectively, as shown in Figure 3; then, their eigenvalues are 0.5 ± r_1/2 and 0.5 ± r_2/2, respectively, where r_i = |r_i|. Extending r_1, we get two intersections with the surface of the Bloch sphere; these two intersections represent the two eigenvectors of ρ_1 (points on the surface of the sphere represent pure states, and interior points represent mixed states). A probabilistic combination of ρ_1, ρ_2 is represented as p_1 ρ_1 + p_2 ρ_2 = p_1 r_1 + p_2 r_2 ([5], Exercise 4.4.13). Any point on the surface of the Bloch sphere can be written as (sin α cos φ, sin α sin φ, cos α), where α is the angle to the Z-axis and φ is the angle from the X-axis to the projection of the point onto the X-Y plane. By symmetry, the Holevo quantity depends only on r_1, r_2, θ, and p_1, where θ is the angle between r_1 and r_2. One interesting observation is that the angle θ has very little influence on p*, the optimal distribution that maximizes the Holevo quantity. If we know λ_1, λ_2 (the larger eigenvalues of ρ_1, ρ_2, respectively), θ, and p_1, then the Holevo quantity can be written in terms of the binary entropy S(x) = −(x log x + (1 − x) log(1 − x)) and r_i = 2λ_i − 1.

Numerical Experiments on the Approximate Solution p̂_1
It is obvious that the maximum of the Holevo quantity depends only on r_1, r_2, and θ; thus, without loss of generality, we let ρ_1 lie on the Z-axis and ρ_2 in the X-Z plane, which means the angle between ρ_1 and ρ_2 (i.e., between r_1 and r_2) is θ [5].
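With this parameterization, the Holevo quantity of the binary ensemble reduces to a scalar function of (λ_1, λ_2, θ, p_1). A sketch (function names ours), working directly with Bloch vectors in the X-Z plane and using the fact that a qubit with Bloch radius r has entropy S((1 + r)/2):

```python
import numpy as np

def binary_entropy(x):
    """S(x) = -(x log x + (1 - x) log(1 - x)) in bits."""
    x = np.clip(x, 1e-15, 1 - 1e-15)
    return float(-(x * np.log2(x) + (1 - x) * np.log2(1 - x)))

def holevo_binary(lam1, lam2, theta, p1):
    """chi of {p1 : rho_1; p2 : rho_2} with larger eigenvalues lam1, lam2.

    rho_1 lies on the Z-axis, rho_2 in the X-Z plane at Bloch angle theta;
    r_i = 2*lam_i - 1 is the Bloch radius of rho_i."""
    r1, r2 = 2 * lam1 - 1, 2 * lam2 - 1
    v1 = np.array([0.0, r1])                                  # (x, z) components of r_1
    v2 = np.array([r2 * np.sin(theta), r2 * np.cos(theta)])   # (x, z) components of r_2
    r_avg = np.linalg.norm(p1 * v1 + (1 - p1) * v2)           # Bloch radius of the average state
    return (binary_entropy((1 + r_avg) / 2)
            - p1 * binary_entropy(lam1) - (1 - p1) * binary_entropy(lam2))
```

Scanning θ for fixed λ_1, λ_2 shows that the maximizing p_1 barely moves, matching the observation above.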
The approximate solution is an interesting phenomenon: the reason the angle θ has so little influence on the maximizer p* of the Holevo quantity remains unclear.

Discussion
In this paper, we provided an algorithm that computes the capacity of a classical-quantum channel. We analyzed the speed of convergence theoretically, and numerical experiments showed that our algorithm outperforms the existing methods [6,17]. We also provided an approximate method to compute the capacity of a binary two-dimensional classical-quantum channel, which shows high accuracy. As mentioned in the Introduction, for a general quantum-quantum channel, maximizing the Holevo quantity with respect to both the input distribution and the output quantum states is NP-complete, and this is not a convex optimization problem, because the Holevo quantity is concave with respect to the input distribution but convex with respect to the output quantum states. Thus, it remains open whether there exists an efficient algorithm for this problem. However, for a classical-quantum channel, the Holevo quantity has an upper bound; thus, one direction for future work is to maximize the Holevo quantity with respect to the output states for a given input distribution. It also remains open whether alternating optimization algorithms, in particular of Blahut–Arimoto type, can be given for other optimization problems involving quantum entropy.