Quantum Codes from Neural Networks

We examine the usefulness of applying neural networks as a variational state ansatz for many-body quantum systems in the context of quantum information-processing tasks. In the neural network state ansatz, the complex amplitude function of a quantum state is computed by a neural network. The resulting multipartite entanglement structure captured by this ansatz has proven rich enough to describe the ground states and unitary dynamics of various physical systems of interest. In the present paper, we initiate the study of neural network states in quantum information-processing tasks. We demonstrate that neural network states are capable of efficiently representing quantum codes for quantum information transmission and quantum error correction, supplying further evidence for the usefulness of neural network states to describe multipartite entanglement. In particular, we show the following main results: a) Neural network states yield quantum codes with a high coherent information for two important quantum channels, the generalized amplitude damping channel and the dephrasure channel. These codes outperform all other known codes for these channels, and cannot be found using a direct parametrization of the quantum state. b) For the depolarizing channel, the neural network state ansatz reliably finds the best known codes given by repetition codes. c) Neural network states can be used to represent absolutely maximally entangled states, a special type of quantum error-correcting codes. In all three cases, the neural network state ansatz provides an efficient and versatile means as a variational parametrization of these highly entangled states.


Introduction
The exponential growth of the Hilbert space dimension in the number of particles is both a blessing and a curse for quantum science: on the one hand, it is crucial to the widely believed computational advantage of quantum computers over classical ones; on the other hand, it renders many questions about properties of many-body systems intractable. Yet we know that the "physical" corner of this Hilbert space has to be small: local Hamiltonians with highly entangled ground states only require a polynomial number of parameters to describe, as do quantum circuits of polynomial depth.
This fact motivates the use of variational representations of quantum states to solve a large class of problems. At the heart of any variational ansatz is the idea to preserve as much information about the quantum state as possible, while discarding irrelevant features. Quantum mechanical properties of a state are fundamentally dictated by its entanglement, which captures quantum correlations between its subsystems.
For instance, correlation length in many-body spin systems is tightly linked to the existence of a spectral gap [Has07; GH16]. For gapped one-dimensional systems (which follow an entanglement entropy area law), one can use matrix product states (MPS) with polynomial bond dimension to efficiently represent ground states [FNW92; LVV15; Ara+13]. The MPS ansatz has further proven useful e.g. in the study of critical systems [Pir+12] or in the continuum limit [Cue+17]. Other tensor network states include MERA and higher-dimensional variants such as PEPS, which have been applied e.g. in the context of renormalization [Vid07; VC04] and have proven similarly successful as part of numerical techniques [Orú14].
A relatively recent development is the use of neural network states as a variational ansatz, where the network is used as a function to calculate the state amplitudes [CT17]. There are many possible neural network architectures to choose from: one proposed model is to use restricted Boltzmann machines (RBMs) to represent e.g. the ground states and unitary dynamics of a transverse-field Ising model and the antiferromagnetic Heisenberg model [CT17], volume-law entanglement and the ground state of even long-range Hamiltonians [DLD17], as well as ground states of various stabilizer Hamiltonians, including the surface code [Jia+18]. While there exist local Hamiltonians that cannot be represented efficiently with shallow RBM architectures, it has been shown that deep RBM networks can in fact represent most physical states, which includes those that can be created by poly-depth quantum circuits, or ground states of local Hamiltonians with a $1/\mathrm{poly}$ spectral gap [GD17].
Apart from describing the physics of many-body systems, entanglement also plays a crucial role in information-processing tasks: teleportation [Ben+93], superdense coding [BW92], and entanglement-assisted classical [Ben+99] and quantum [DHW04] communication all build on bipartite entanglement as a resource. In contrast, for certain tasks such as quantum information transmission through many uses of a quantum channel, or the encoding of quantum information in quantum error correction codes, the crucial property is multipartite entanglement, which encapsulates correlations among all the constituents of the system simultaneously [Has07].

Main Results
We demonstrate that neural network states with only polynomially many parameters in the system size (which, in this context, we call efficient) are capable of representing quantum codes for quantum information transmission and quantum error correction. In particular, we show the following:
• The neural network state ansatz finds new quantum codes with a high coherent information (CI) that outperform all previously known codes for two channel models, the generalized amplitude damping channel and the dephrasure channel. For the generalized amplitude damping channel, the new codes also increase the threshold of the channel, i.e., the boundary of the interval in the parameter space with positive quantum capacity. For both channels, the new codes cannot be found with 'traditional' numerical methods, i.e., a direct parametrization of the complex amplitudes of the quantum state.
• For the depolarizing channel, neural network states can efficiently represent the best known codes. We carry out a detailed comparison of different network architectures, showing that feed-forward (FF) networks converge faster than RBMs with comparable parameter counts in almost all tested cases. Furthermore, we constructively prove that the best known codes (repetition codes, and products thereof) can be obtained efficiently with both an RBM and an FF architecture.
• Neural network states can be used to parametrize so-called "absolutely maximally entangled" (AME) states. These AME states, defined on $n$ systems of local dimension $d$ each, are examples of quantum error-correcting codes with the property that they are completely mixed after tracing out at least half of the systems. Besides their quantum error correction capabilities, AME states are useful in multi-user information-theoretic tasks such as open-destination teleportation, secret sharing or entanglement swapping that require maximal entanglement across different choices of bipartitions [HC13; Hel+12].
The properties of both quantum codes with high coherent information and AME states are the result of the non-trivial multipartite entanglement present in these states. The main finding of this paper is that for both high-CI states and AME states, a neural network state ansatz is able to faithfully represent this multipartite entanglement, which we demonstrate empirically for small problem instances. We furthermore provide numerical evidence that the variational ansatz vastly outperforms a full state parametrization for the respective learning tasks.

Structure of this Paper
This paper is structured as follows. In Sec. 2 we introduce the quantum capacity of a channel and state the corresponding coding theorem which expresses the quantum capacity as a regularized formula in terms of an entropic quantity called the coherent information.
We then discuss how lower bounds on the quantum capacity can be obtained by solving an entropic optimization problem. In Sec. 3 we review neural network states based on restricted Boltzmann machines and feed-forward nets. We then present our main results about representing quantum codes with neural network states. In Sec. 4 we discuss the generalized amplitude damping channel and the dephrasure channel. We show that the neural network state ansatz finds new quantum codes providing the strongest lower bounds to date on the quantum capacities of these channels. Moreover, we demonstrate that these new codes are not found using a "direct" parametrization of quantum states. We then show in Sec. 5 for the depolarizing channel how tensor products of repetition codes (i.e., the best known codes for $k \leq 9$ uses of this channel) can be efficiently represented using FF and RBM networks, and comment on the trainability of our chosen network architectures. Finally, in Sec. 6 we demonstrate how known examples of AME states can be efficiently represented using neural networks, and we comment on the trainability of the network architectures that we used. We conclude in Sec. 7 with a discussion of our results and open problems.
In the appendices, we give more details about certain aspects of the paper. In App. A and App. B we state explicit formulas for the coherent information of weighted repetition codes for the generalized amplitude damping channel and dephrasure channel, respectively, which serve as benchmarks for our quantum codes from neural networks. We also supply additional data obtained in our numerical investigations. In App. C we give an overview of the best known codes for the depolarizing channel, and provide an analytical construction of these codes for neural networks with various architectures. In App. D we provide some background information on absolutely maximally entangled states, and prove a useful bound on a trace distance parameter indicating how close a state is to being absolutely maximally entangled. In App. E we discuss possible encodings of $d$-ary input strings to neural networks. In App. F we comment on the role of activation functions for quantum codes; furthermore, we propose a novel NN Schmidt decomposition ansatz, which we benchmark against a full NN parametrization for the depolarizing channel. In App. G we give a high-level explanation of the global derivative-free numerical optimization techniques used in our paper. Finally, we provide additional numerical data for some of our results in App. H. We encourage researchers to adopt our methods by providing full access to our code (in C++ and MATLAB) that was used to obtain the numerical results of this paper. These code files can be found in the "Ancillary files" section of the arXiv post of this paper [Anc].

Figure 1: Left: Restricted Boltzmann machine (RBM) with five input nodes and five hidden nodes. Right: Feed-forward neural network with five input nodes, two (real-valued) output nodes, and three fully-connected hidden layers of size five each. Each line represents one real value being propagated forward from node to node; the $f_i$ are non-linear activation functions (e.g. sigmoid, ReLU, cos, see Sec. F for a discussion) applied to an affine transformation of the node inputs (see Eq. 8).

The Quantum Capacity of a Quantum Channel
A point-to-point communication link between quantum systems can be modeled by a quantum channel. For quantum systems $A$ and $B$ with underlying (finite-dimensional) Hilbert spaces $\mathcal{H}_A$ and $\mathcal{H}_B$, respectively, a quantum channel $\mathcal{N}\colon A \to B$ is a linear, completely positive, trace-preserving map between the algebras of linear operators $\mathcal{B}(\mathcal{H}_A)$ and $\mathcal{B}(\mathcal{H}_B)$. A quantum state $\rho_A$ on $A$ is a linear positive semidefinite operator with unit trace. A quantum state $\psi_A$ with rank 1 is called pure, and can be identified with a normalized vector $|\psi\rangle_A \in \mathcal{H}_A$ such that $\psi_A = |\psi\rangle\langle\psi|_A$.
The communication capabilities of a quantum channel are characterized by various capacities, depending on what kind of information one attempts to transmit faithfully through the channel. The quantum capacity Q(N ) of a quantum channel N : A → B characterizes the optimal rate of faithful quantum information transmission through the channel. Q(N ) can be defined in terms of the operational task of entanglement generation as follows.
Suppose Alice, the sender, prepares a pure state $\psi_{RA^n}$ in her laboratory and sends the $A^n$-part to Bob through $n$ independent uses of the quantum channel $\mathcal{N}$. Upon receiving the quantum systems from Alice, Bob applies some decoding operation $\mathcal{D}_n\colon B^n \to R$ to the output, yielding the final state $(\mathrm{id}_R \otimes \mathcal{D}_n \circ \mathcal{N}^{\otimes n})(\psi_{RA^n})$. The quantum capacity $Q(\mathcal{N})$ is the optimal rate at which entanglement can be generated in this way with vanishing error as $n \to \infty$. It is given by the regularized formula
$$ Q(\mathcal{N}) = \lim_{n \to \infty} \frac{1}{n} Q^{(1)}(\mathcal{N}^{\otimes n}), \tag{1} $$
where the channel coherent information $Q^{(1)}(\mathcal{N})$ is defined as
$$ Q^{(1)}(\mathcal{N}) := \max_{\psi_{RA}} Q^{(1)}(\psi_{RA}, \mathcal{N}), \qquad Q^{(1)}(\psi_{RA}, \mathcal{N}) := S(\mathcal{N}(\psi_A)) - S((\mathrm{id}_R \otimes \mathcal{N})(\psi_{RA})), \tag{2} $$
with the von Neumann entropy $S(\rho) := -\operatorname{tr}(\rho \log \rho)$. Eq. 1 for the quantum capacity involves the evaluation of the channel coherent information $Q^{(1)}(\cdot)$ over an (in principle) unbounded number of channel copies. If the channel coherent information is weakly additive, $Q^{(1)}(\mathcal{N}^{\otimes n}) \leq n Q^{(1)}(\mathcal{N})$, then the regularization disappears and Eq. 1 becomes $Q(\mathcal{N}) = Q^{(1)}(\mathcal{N})$. Weak additivity of the channel coherent information is only known to hold for certain classes of channels such as degradable channels [DS05]. Moreover, there are examples of quantum channels for which the channel coherent information is strictly superadditive, $Q^{(1)}(\mathcal{N}^{\otimes n}) > n Q^{(1)}(\mathcal{N})$ for some $n$, rendering the regularization over $n$ in the quantum capacity formula Eq. 1 necessary in general [DSS98]. However, for so-called low-noise channels that are close in diamond norm to a noiseless channel, the effect of superadditivity of coherent information cannot be too large, and the single-letter coherent information is essentially the right answer [LLS18b; Sut+17]. In this paper, we are interested in the high-noise regime where superadditivity of channel coherent information typically occurs.
An important part of the quantum capacity theorem in Eq. 1 is the fact that the channel coherent information is an achievable rate [Llo97; Sho02; Dev05]:
$$ Q(\mathcal{N}) \geq Q^{(1)}(\mathcal{N}). \tag{3} $$
Using block codes, this can be generalized to $Q(\mathcal{N}) \geq \frac{1}{n} Q^{(1)}(\mathcal{N}^{\otimes n})$ for all $n \in \mathbb{N}$. The rough proof idea of Eq. 3 is the following: Assume that $|\psi\rangle_{RA}$ is a pure state with strictly positive coherent information, $Q^{(1)}(\psi, \mathcal{N}) > 0$. Once Alice and Bob share $k$ copies of the state $\sigma_{RB} = (\mathrm{id}_R \otimes \mathcal{N})(\psi_{RA})$ (which they can achieve by Alice sending the $A^k$ part of the state $\psi_{RA}^{\otimes k}$ to Bob through $\mathcal{N}^{\otimes k}$) for a sufficiently large $k$, there is a protocol defined in terms of the typical subspaces of $\sigma_{RB}^{\otimes k}$ that allows Alice and Bob to generate entanglement between them at a rate of $r - \delta$ for arbitrarily small $\delta \in (0, r)$, where $r$ is equal to the coherent information of the state $\sigma$, that is,
$$ r = S(\sigma_B) - S(\sigma_{RB}) = Q^{(1)}(\psi_{RA}, \mathcal{N}). $$
In this operational picture, we can think of $\psi_{RA}$ as the inner code, whereas the (1-LOCC assisted) distillation protocol manipulating $\sigma_{RB}^{\otimes k}$ is the outer code. The rate at which the full protocol generates entanglement is solely determined by the (strictly positive) coherent information of the inner code $\psi_{RA}$. Hence, in this paper we refer to the inner code $\psi_{RA}$ simply as a quantum code.
The main objective of this paper is to find quantum codes $|\psi\rangle_{RA^n}$ that achieve high coherent information $\frac{1}{n} Q^{(1)}(\psi_{RA^n}, \mathcal{N}^{\otimes n}) > 0$. To find such quantum codes, we use the neural network state ansatz introduced in [CT17]. In the next section, we review different variants of this ansatz.

Neural Network States
For simplicity we consider in the following a system consisting of $n$ qubits, that is, a collection of $n$ two-dimensional quantum systems each described by a Hilbert space isomorphic to $\mathbb{C}^2$. The state space of the $n$ qubits is described by the tensor space $(\mathbb{C}^2)^{\otimes n}$ with the "computational basis" $\{|0\rangle, |1\rangle\}^{\otimes n}$, and a general pure normalized quantum state $|\psi\rangle \in \mathcal{H}^{\otimes n}$ can be written as
$$ |\psi\rangle = \frac{1}{C} \sum_{i^n \in \{0,1\}^n} \psi(i^n)\, |i^n\rangle. \tag{4} $$
Here, $C$ is a normalization constant ensuring $\langle\psi|\psi\rangle = 1$, the set of binary strings of length $n$ is denoted by $\{0, 1\}^n$, and for a string $i^n = (i_1, \dots, i_n) \in \{0, 1\}^n$ we define $|i^n\rangle := |i_1\rangle \otimes \dots \otimes |i_n\rangle$. Evidently, a full description of the quantum state $|\psi\rangle$ consists of a list of the $2^n$ complex amplitudes $\psi(i^n)$, corresponding to $2 \cdot 2^n - 1$ real degrees of freedom.
For a neural network state ψ, the amplitude function ψ(i n ) in Eq. 4 is computed from the input string i n using a neural network. There are different network architectures that can be used, and we describe a few common choices in the following subsections.
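As a minimal sketch of how an amplitude function determines a state, the following Python snippet enumerates all binary strings of length $n$, evaluates a user-supplied amplitude function on each, and normalizes the result. The names `build_state` and `ghz_amplitude` are illustrative assumptions and not part of the paper's code; any callable mapping a bit tuple to a complex number could take the place of the neural network.

```python
import itertools
import math


def build_state(amplitude, n):
    """Return {bit string: normalized amplitude} for an n-qubit state."""
    strings = list(itertools.product((0, 1), repeat=n))
    raw = [complex(amplitude(s)) for s in strings]
    # Normalization constant ensuring <psi|psi> = 1
    norm = math.sqrt(sum(abs(z) ** 2 for z in raw))
    if norm == 0:
        raise ValueError("amplitude function vanishes on every basis string")
    return {s: z / norm for s, z in zip(strings, raw)}


def ghz_amplitude(bits):
    """Hypothetical amplitude function: weight 1 on 00...0 and 11...1."""
    return 1.0 if len(set(bits)) == 1 else 0.0


state = build_state(ghz_amplitude, 3)
```

The dictionary `state` has $2^n$ entries, which makes the exponential cost of a direct parametrization explicit; the point of the ansatz is that `amplitude` itself is specified by far fewer parameters.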

Restricted Boltzmann States
The first architecture, and one of the most well-studied ones (see e.g. [Gla+18] for an excellent review), is the restricted Boltzmann machine (RBM). RBMs have proven particularly fruitful as a variational ansatz for representing various ground states of local Hamiltonians [CT17], in some cases reaching notably higher fidelity than other neural network architectures.
A Boltzmann machine has visible and hidden nodes (see Fig. 1). A set of complex variables is assigned to each node; we denote the visible units with $i_1, \dots, i_n$, and the hidden units with $h_1, \dots, h_m$. Each link between nodes corresponds to an Ising-type coupling, which defines an energy function (which one can think of as a Hamiltonian)
$$ E(i^n, h^m) = -\sum_{j=1}^{n} a_j i_j - \sum_{k=1}^{m} b_k h_k - \sum_{k=1}^{m} \sum_{j=1}^{n} h_k W_{kj} i_j. \tag{5} $$
The two vectors $a \in \mathbb{C}^n$ and $b \in \mathbb{C}^m$ define a bias over the visible and hidden nodes, respectively, while the matrix $W \in \mathbb{C}^{m \times n}$ defines the coupling between the two layers. The energy of the system allows us to define a complex probability distribution over the vectors $(i^n, h^m)$,
$$ p(i^n, h^m) = \frac{1}{Z} \exp\bigl(-E(i^n, h^m)\bigr), \qquad Z = \sum_{i^n, h^m} \exp\bigl(-E(i^n, h^m)\bigr). \tag{6} $$
To extract a weight $\psi(i^n)$ used to assemble a state via Eq. 4, we simply trace out the hidden nodes of the RBM, which yields a marginal probability distribution over the input nodes. We obtain
$$ \psi(i^n) = \sum_{h^m} p(i^n, h^m). \tag{7} $$
If we take all parameters $a$, $b$ and $W$ to be real-valued, the resulting state will only have real non-negative weights. In order to retain full generality in the RBM ansatz, the network weights are typically chosen to be complex [CT17].
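Tracing out the hidden nodes can be illustrated with a short Python sketch. It rests on assumptions that are not fixed by the text above: hidden units take values in $\{0, 1\}$, the Boltzmann weight of a configuration is $\exp(a \cdot i + b \cdot h + h^T W i)$, and the overall normalization is dropped because the state is normalized afterwards. Under these assumptions the sum over hidden configurations factorizes into a product over hidden units, so the amplitude can be evaluated in time linear in $m$ rather than $2^m$. The function names and parameter values are hypothetical.

```python
import cmath
import itertools


def rbm_amplitude_bruteforce(bits, a, b, W):
    """Unnormalized amplitude: explicit sum over all hidden configurations."""
    n, m = len(bits), len(b)
    total = 0j
    for h in itertools.product((0, 1), repeat=m):
        exponent = sum(a[j] * bits[j] for j in range(n))
        exponent += sum(b[k] * h[k] for k in range(m))
        exponent += sum(h[k] * W[k][j] * bits[j]
                        for k in range(m) for j in range(n))
        total += cmath.exp(exponent)
    return total


def rbm_amplitude(bits, a, b, W):
    """Same quantity in closed form: exp(a.i) * prod_k (1 + exp(b_k + (W i)_k))."""
    n = len(bits)
    value = cmath.exp(sum(a[j] * bits[j] for j in range(n)))
    for k in range(len(b)):
        theta = b[k] + sum(W[k][j] * bits[j] for j in range(n))
        value *= 1 + cmath.exp(theta)
    return value


# Hypothetical complex parameters for n = 2 visible and m = 3 hidden units
a = [0.1 + 0.2j, -0.3j]
b = [0.5, -0.2 + 0.1j, 0.3j]
W = [[0.4, -0.1j], [0.2 + 0.3j, 0.1], [-0.5, 0.25j]]
amplitudes = {bits: rbm_amplitude(bits, a, b, W)
              for bits in itertools.product((0, 1), repeat=2)}
```

With hidden units in $\{-1, +1\}$ instead, each factor $1 + e^{\theta_k}$ would become $2\cosh\theta_k$; the factorization argument is unchanged.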

Deep Boltzmann States
While RBM states struggle to represent e.g. ground states for local Hamiltonians with even mildly-decaying spectral gap, adding links between the nodes within each layer yields a model with vastly greater representative power [GD17; Gla+18]: deep Boltzmann machines (DBMs, see Fig. 2).
Figure 2: Deep Boltzmann machine (DBM) with five input nodes and five hidden nodes. The architecture resembles that of an RBM (see Fig. 1), but where the nodes within each layer are cross-linked.
[GD17] showed that the model with connections within a layer is equivalent to one with more than two inter-connected layers but no connections within each layer.
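In analogy to Eq. 5, we can define an energy function for a DBM by introducing additional coupling matrices $D \in \mathbb{C}^{m \times m}$ and $C \in \mathbb{C}^{n \times n}$ for the hidden and visible nodes, respectively. This yields an overall Hamiltonian
$$ E(i^n, h^m) = -\sum_{j} a_j i_j - \sum_{k} b_k h_k - \sum_{k,j} h_k W_{kj} i_j - \sum_{j,j'} i_j C_{jj'} i_{j'} - \sum_{k,k'} h_k D_{kk'} h_{k'}. $$
A state is obtained from a DBM in the same way as from an RBM.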

Feed-Forward Network States
The third architecture is obtained by using the most prominent neural network model to date, feed-forward nets, to represent quantum states. This has proven successful in a number of cases [CL18; Sai17]. A feed-forward network consists of a visible layer $v = i^n$ with input nodes $i_1, \dots, i_n$, a fixed number $H$ of hidden layers $h^{(j)}$ of width $M_j$, and an output layer $o$ with two output nodes $o_1$ and $o_2$ (see Fig. 1). Each hidden neuron $h^{(j)}_k$ carries a bias $b^{(j)}_k$ for $j \in [H]$ and $k \in [M_j]$. Here, we use the notation $[n] := \{1, \dots, n\}$ for $n \in \mathbb{N}$. The interactions between two hidden layers $h^{(j-1)}$ and $h^{(j)}$ are mediated by weight matrices $(W^{(j)})_{kl}$ with $k \in [M_j]$ and $l \in [M_{j-1}]$. The weight matrix $W^{(1)}$ mediates between the visible layer and the first hidden layer, and the weight matrix $W^{(H+1)}$ mediates between the last hidden layer $h^{(H)}$ and the output layer $o$ with bias $b^{(H+1)}$. In each hidden layer $h^{(j)}$ the state of the neurons is processed with a non-linear activation function $f_j$. In the following, we interpret the visible layer $v$, the hidden layers $h^{(j)}$, and the output layer $o$ as column vectors, and functions are evaluated component-wise. Given the input $v = i^n$, the amplitude function $\psi(i^n)$ is computed as follows:
$$ h^{(1)} = f_1\bigl(W^{(1)} v + b^{(1)}\bigr), \quad h^{(j)} = f_j\bigl(W^{(j)} h^{(j-1)} + b^{(j)}\bigr) \ \ (2 \leq j \leq H), \quad o = W^{(H+1)} h^{(H)} + b^{(H+1)}, \quad \psi(i^n) = o_1 + \mathrm{i}\, o_2. \tag{8} $$
A network architecture is specified by the data $(H, \{M_j, f_j\}_{j \in [H]})$. Common choices for the activation functions are the sigmoid function $\sigma(x) := (1 + \exp(-x))^{-1}$, the hyperbolic tangent $\tanh$, or the rectified linear unit $\mathrm{ReLU}(x) = \max\{0, x\}$, which are depicted in Fig. 11. From a theoretical point of view these choices are all equivalent, since feed-forward networks as described above are universal: with a single hidden layer, they can approximate any given function to arbitrary precision provided the activation function is non-constant and the number of hidden neurons is sufficiently large [Kol61; Hor91]. However, in practice the choice of activation functions has to be tailored to the problem at hand to achieve good numerical results. In App. F, we elaborate on the heuristics of choosing activation functions for neural network states; of particular interest in this context is that periodic activation functions such as cosine seem to be able to capture more of the structure of various quantum states [CL18]. We prove analytically in App. C.2 that periodic activation functions are also beneficial in representing good quantum codes.
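The forward pass described above can be sketched in a few lines of Python. The sketch assumes a plain affine output layer with two real nodes combined as $o_1 + \mathrm{i}\, o_2$ (a Cartesian output), a bias on every layer, and $2k$ input bits for $k$ channel copies; the helper names and the random initialization are hypothetical. Under these assumptions, four hidden layers of width $2k$ give $16k^2 + 12k + 2$ real parameters, which reproduces the counts 90/182/306 for $k = 2, 3, 4$ quoted in the caption of Fig. 6.

```python
import math
import random


def init_params(n_inputs, widths, rng):
    """Random (weights, biases) per layer: hidden layers plus a 2-node output."""
    sizes = [n_inputs] + list(widths) + [2]
    return [
        ([[rng.uniform(-1, 1) for _ in range(sizes[j])]
          for _ in range(sizes[j + 1])],
         [rng.uniform(-1, 1) for _ in range(sizes[j + 1])])
        for j in range(len(sizes) - 1)
    ]


def count_params(params):
    """Total number of real weights and biases."""
    return sum(len(W) * len(W[0]) + len(b) for W, b in params)


def affine(W, b, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]


def ff_amplitude(bits, params, activations):
    """Amplitude psi(bits) = o_1 + i*o_2 from a feed-forward pass."""
    x = [float(v) for v in bits]
    for (W, b), f in zip(params[:-1], activations):
        x = [f(y) for y in affine(W, b, x)]
    o1, o2 = affine(*params[-1], x)
    return complex(o1, o2)


rng = random.Random(0)
param_counts = []
for k in (2, 3, 4):
    params = init_params(2 * k, [2 * k] * 4, rng)
    param_counts.append(count_params(params))

# cos in the first hidden layer, tanh in the subsequent ones (k = 4 network)
activations = [math.cos] + [math.tanh] * 3
amp = ff_amplitude((0, 1, 1, 0, 1, 0, 0, 1), params, activations)
```

For comparison, a direct parametrization of the $2^{2k}$ complex amplitudes needs $2 \cdot 2^{2k}$ real numbers, i.e. 32/128/512 for $k = 2, 3, 4$, so the network is already the smaller parametrization at $k = 4$.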

Generalized Amplitude Damping Channel
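The first quantum channel for which we investigate the neural network state ansatz is the generalized amplitude damping channel (GADC) $\mathcal{A}_{\gamma,N}$. It is defined in terms of two parameters $\gamma, N \in [0, 1]$ and acts on a qubit state $\rho$ as
$$ \mathcal{A}_{\gamma,N}(\rho) = \sum_{i=1}^{4} A_i \rho A_i^\dagger, \tag{9} $$
with the Kraus operators
$$ A_1 = \sqrt{1-N}\,\bigl(|0\rangle\langle 0| + \sqrt{1-\gamma}\,|1\rangle\langle 1|\bigr), \quad A_2 = \sqrt{\gamma(1-N)}\,|0\rangle\langle 1|, \quad A_3 = \sqrt{N}\,\bigl(\sqrt{1-\gamma}\,|0\rangle\langle 0| + |1\rangle\langle 1|\bigr), \quad A_4 = \sqrt{\gamma N}\,|1\rangle\langle 0|. $$
The GADC models the dynamics of a qubit in contact with a thermal bath at temperature $N$ and transition probability $\gamma$ between the ground state $|0\rangle$ and the excited state $|1\rangle$. This quantum channel is a realistic noise model in various physical processes such as relaxation processes of spin systems, superconducting quantum computers, and loss processes in linear optical systems [Mya+00; Tur+00; CB08; Zou+17]. Furthermore, for $N = 0$ the GADC reduces to the well-known amplitude damping channel modeling energy dissipation of a qubit.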
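The action of the GADC on a single qubit can be checked numerically with a short Python sketch. It assumes the commonly used four-operator Kraus representation of the GADC, in which the ground state $|0\rangle$ is the fixed point for $N = 0$; the function names are hypothetical, and density matrices are plain nested lists to keep the example free of dependencies.

```python
import math


def gadc_kraus(gamma, N):
    """Assumed Kraus operators of the GADC as 2x2 nested lists."""
    s = math.sqrt
    return [
        [[s(1 - N), 0.0], [0.0, s((1 - N) * (1 - gamma))]],
        [[0.0, s(gamma * (1 - N))], [0.0, 0.0]],
        [[s(N * (1 - gamma)), 0.0], [0.0, s(N)]],
        [[0.0, 0.0], [s(gamma * N), 0.0]],
    ]


def apply_channel(kraus, rho):
    """Return sum_i K_i rho K_i^dagger for 2x2 matrices."""
    out = [[0j, 0j], [0j, 0j]]
    for K in kraus:
        for r in range(2):
            for c in range(2):
                out[r][c] += sum(
                    K[r][x] * rho[x][y] * complex(K[c][y]).conjugate()
                    for x in range(2) for y in range(2)
                )
    return out


gamma, N = 0.44035, 0.1
excited = [[0.0, 0.0], [0.0, 1.0]]  # |1><1|
out = apply_channel(gadc_kraus(gamma, N), excited)
```

For the excited input state, the population transferred to $|0\rangle$ is $\gamma(1 - N)$, and the output trace stays equal to one, as expected for a trace-preserving map.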
While the quantum capacity of the amplitude damping channel $\mathcal{A}_{\gamma,0}$ is equal to its (additive) single-letter coherent information for all $\gamma \in [0, 1]$ and can be computed efficiently [GF05], the quantum capacity of the more general noise model $\mathcal{A}_{\gamma,N}$ with $N \in (0, 1)$ is unknown. Various upper bounds on $Q(\mathcal{A}_{\gamma,N})$ have been computed in the recent work [KSW19], but so far achievable rates (i.e., lower bounds on $Q(\mathcal{A}_{\gamma,N})$) improving upon the single-letter coherent information $Q^{(1)}(\mathcal{A}_{\gamma,N})$ have not been studied extensively. We prove in this section that for $N \in (0, 1)$ and particular intervals of $\gamma$ the channel coherent information $Q^{(1)}(\mathcal{A}_{\gamma,N})$ of the GADC is superadditive. As shown in the discussion below and in Fig. 3, superadditivity is achieved by, e.g., weighted repetition codes
$$ |\phi^\lambda_k\rangle = \sqrt{\lambda}\,|0\rangle_R \otimes |0\rangle_A^{\otimes k} + \sqrt{1-\lambda}\,|1\rangle_R \otimes |1\rangle_A^{\otimes k}. \tag{10} $$
A compact formula for the coherent information of this code in terms of an optimization over the weight $\lambda \in [0, 1]$ and arbitrary blocklength $k$ can easily be derived (see App. A).
Note that the optimal single-letter coherent information $Q^{(1)}(\mathcal{A}_{\gamma,N})$ is achieved by Eq. 10 with $k = 1$ and optimized weight parameter $\lambda \in [0, 1]$ [GP+09]. We will show in this section that the neural network state ansatz finds superadditive codes for the GADC that substantially outperform weighted repetition codes. In the following, we restrict our attention to the interval $N \in [0, 1/2]$, as $\mathcal{A}_{\gamma,N}$ and $\mathcal{A}_{\gamma,1-N}$ are unitarily equivalent and hence their channel coherent informations (and quantum capacities) coincide [KSW19].
In the optimization procedure we consider the values $N \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$ and identify intervals of $\gamma$ in which weighted repetition codes are superadditive, that is, they yield a higher coherent information than the optimal single-letter coherent information. For $k = 3, 4, 5$ copies of $\mathcal{A}_{\gamma,N}$, we search for neural network codes using a feed-forward architecture as described in Fig. 1 with four hidden layers of width $2k$ each. We choose cos as the activation function in the first layer, the hyperbolic tangent function tanh as the activation function in the subsequent layers, and a Cartesian output layer (see Eq. 8).
In contrast to the more common gradient-based optimization techniques in machine learning, we choose to optimize the neural network parameters using stochastic gradient-free techniques. In particular, we use a particle swarm optimization algorithm followed by pattern search. We motivate our choice to use these algorithms in App. G, which also contains high-level explanations of these techniques.
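To make the gradient-free approach concrete, the following Python sketch implements a textbook global-best particle swarm optimizer. It is not the implementation used for the results in this paper: the hyperparameters (`inertia`, `c_self`, `c_swarm`), the initialization range, and the toy objective are assumptions chosen for illustration, and the subsequent pattern search refinement is omitted. To maximize a coherent information, one would pass its negative as the objective `f`, with the network parameters as the search vector.

```python
import random


def particle_swarm_minimize(f, dim, n_particles=20, n_steps=100,
                            inertia=0.7, c_self=1.5, c_swarm=1.5, seed=0):
    """Minimize f over R^dim; returns (best position, best value, history)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]          # best position seen by each particle
    best_val = [f(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    swarm_pos, swarm_val = best_pos[g][:], best_val[g]
    history = [swarm_val]                   # best value after each step
    for _ in range(n_steps):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull towards the particle's own best and the swarm's best
                vel[i][d] = (inertia * vel[i][d]
                             + c_self * r1 * (best_pos[i][d] - pos[i][d])
                             + c_swarm * r2 * (swarm_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < swarm_val:
                    swarm_pos, swarm_val = pos[i][:], val
        history.append(swarm_val)
    return swarm_pos, swarm_val, history


def sphere(x):
    """Toy objective standing in for the negative coherent information."""
    return sum(v * v for v in x)


best_x, best_f, history = particle_swarm_minimize(sphere, dim=4)
```

Only function evaluations are needed, which is convenient when the objective involves entropies of channel outputs whose gradients are awkward to compute near rank-deficient states.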
For all values $N \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$ we find neural network codes outperforming the weighted repetition codes Eq. 10, as shown in Fig. 3. For each $N \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$, the codes in Fig. 3 are obtained by first carrying out our optimization technique for a particular value of $\gamma$ close to the threshold of the best weighted repetition code. We then plot the best neural network code found in this manner for the entire interval of $\gamma$ where superadditivity occurs. As a benchmark, we evaluate weighted repetition codes for up to $k = 16$ channel copies using the formula derived in App. A; the codes $\phi_k$ for $1 \leq k \leq 5$ perform best and are shown in Fig. 3 for comparison.
We focus here on the neural network codes found for the values $(\gamma, N) = (0.44035, 0.1)$ and $k = 3, 4, 5$ copies of $\mathcal{A}_{\gamma,N}$, and note that the neural network codes for the other values of $(\gamma, N)$ are collected in App. A. In Tab. 1 we list the best codes (as plotted in Fig. 3) for each blocklength together with their coherent information. In Fig. 4 we plot the convergence of the particle swarm optimization (PSO) algorithm for $(\gamma, N) = (0.44035, 0.1)$ and $k = 3, 4, 5$ (FF), and compare its performance to a direct parametrization (RAW) of the $2^{2k}$ complex amplitudes in the quantum code $|\psi_n\rangle$, again optimized using PSO. Evidently, using comparable optimization parameters the raw ansatz is not able to find even trivial product codes with coherent information equal to zero. Note also that for $(\gamma, N) = (0.44035, 0.1)$ the weighted repetition codes in Eq. 10 do not yield positive coherent information up to at least $k = 16$. Hence, the neural network codes increase the threshold of the GADC substantially, as seen in Fig. 3. The threshold of a parametrized family of quantum channels is defined as the boundary of the region in which the channel has positive quantum capacity.

Dephrasure Channel
The neural network ansatz is also able to find new quantum codes for the dephrasure channel that was introduced recently in [LLS18a]. It is defined in terms of probabilities $p, q \in [0, 1]$ as
$$ \mathcal{N}_{p,q}(\rho) = (1-q)\bigl((1-p)\rho + p\, Z \rho Z\bigr) + q \operatorname{tr}(\rho)\, |e\rangle\langle e|, $$
where $Z = |0\rangle\langle 0| - |1\rangle\langle 1|$ is the Pauli $Z$-operator, and $|e\rangle$ is an erasure flag that is orthogonal to the input space. The name 'dephrasure' is derived from the fact that $\mathcal{N}_{p,q}$ first dephases an input state in the $Z$-basis with probability $p$, and then erases it with probability $q$.
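The qubit-to-qutrit structure of the dephrasure channel is easy to see in a small Python sketch. It assumes the description just given: dephasing with probability $p$ multiplies the off-diagonal entries of $\rho$ by $1 - 2p$, and erasure with probability $q$ moves the weight to a third basis state representing the flag $|e\rangle$. The function name and the ordering of the output basis ($|0\rangle, |1\rangle, |e\rangle$) are illustrative choices.

```python
def dephrasure(rho, p, q):
    """Map a 2x2 density matrix to a 3x3 one; index 2 is the erasure flag."""
    trace = rho[0][0] + rho[1][1]
    out = [[0j] * 3 for _ in range(3)]
    for r in range(2):
        for c in range(2):
            # (1-p) rho + p Z rho Z keeps the diagonal and scales the
            # off-diagonal entries by (1-p) - p = 1 - 2p
            damp = 1.0 if r == c else 1.0 - 2.0 * p
            out[r][c] = (1 - q) * damp * rho[r][c]
    out[2][2] = q * trace
    return out


plus = [[0.5, 0.5], [0.5, 0.5]]  # |+><+|
out = dephrasure(plus, p=0.08, q=0.4)
```

The block structure of the output (a damped qubit block plus a one-dimensional erasure block) is the reason why $k$ channel uses act on a $3^k$-dimensional output space, which makes the optimization costlier than for a qubit-to-qubit channel.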
Despite the fact that both dephasing and erasure noise are well-understood in terms of quantum information transmission, the dephrasure channel, a concatenation of the two, exhibits superadditivity of coherent information for as little as two uses of the channel [LLS18a]. As a result, the quantum capacity of the dephrasure channel is unknown for a large region in the parameter space.
As for the GADC in the previous section, superadditivity of coherent information for the dephrasure channel is again achieved by weighted repetition codes $\phi^\lambda_k$ as defined in Eq. 10. A compact formula for the coherent information $\frac{1}{k} Q^{(1)}(\phi^\lambda_k, \mathcal{N}^{\otimes k}_{p,q})$ of these codes was derived in [LLS18a], and we state it in App. B. Similar to the GADC in Sec. 4.1, we note that the optimal single-letter coherent information $Q^{(1)}(\mathcal{N}_{p,q})$ is achieved by Eq. 10 with $k = 1$. We show in this section that the neural network state ansatz finds new quantum codes demonstrating even larger superadditivity of coherent information for the dephrasure channel.
In the following, we focus our attention on the values $q \in \{0.1, 0.2, 0.3, 0.4\}$ of the erasure probability; for each $q$, we then investigate values of the dephasing probability $p$ for which weighted repetition codes achieve superadditivity. Since the dephrasure channel maps a qubit to a qutrit, optimizing its coherent information is computationally more costly than for the GADC, which forces us to restrict our attention to $k = 2, 3, 4$ copies of $\mathcal{N}_{p,q}$ (we refer to Sec. 7 for a discussion of these numerical limitations). We again use a feed-forward network as described in Fig. 1 with four hidden layers of width $2k$ each and cos as the activation function in the first layer. However, in contrast to Sec. 4.1 we use ReLU as the activation function in the remaining layers, and an exponential output layer corresponding to a polar parametrization instead of a Cartesian one. We found these choices to perform significantly better for the dephrasure channel. As in Sec. 4.1, the neural network parameters were optimized using the particle swarm optimization algorithm followed by pattern search (see App. G).
For all values $q \in \{0.1, 0.2, 0.3, 0.4\}$ we find neural network codes outperforming the weighted repetition codes Eq. 10, as shown in Fig. 5. For each $q \in \{0.1, 0.2, 0.3, 0.4\}$ and $k = 2, 3, 4$, the codes in Fig. 5 are obtained by first carrying out our optimization technique for a particular value of $p$ close to the threshold of the best weighted repetition code. We then plot the best neural network code (labeled $\nu_k$ for $k = 2, 3, 4$) found in this manner for an interval of $p$ where superadditivity occurs. We also individually optimized the coefficients of the basis strings $s^n$ with non-zero weight across the shown interval of $p$, yielding even better codes $\nu^*_k$. Curiously, such an additional optimization over coefficients gave no improvement for the neural network codes found for the GADC in Sec. 4.1. In contrast, for fixed $q$ there is a clear interplay between the dephasing probability $p$ in the dephrasure channel $\mathcal{N}_{p,q}$ and the coefficients of the neural network codes $\nu_k$, as seen in Fig. 5. As a benchmark, we evaluated weighted repetition codes $\phi_k$ for up to $k = 10$ channel copies using the formula in App. B; the maximum $\max_k \phi_k$ over these codes for $1 \leq k \leq 10$ is shown in Fig. 5 for comparison, along with the optimal single-letter code $\phi_1$.
We focus in the following on the neural network codes found for the values $(p, q) = (0.08, 0.4)$ and $k = 2, 3, 4$; the other neural network codes are listed in App. B. In Tab. 2 we list the best codes (as plotted in Fig. 5) for each blocklength together with their coherent information. In Fig. 6 we plot the convergence of the particle swarm optimization (PSO) algorithm for $(p, q) = (0.08, 0.4)$ and $k = 2, 3, 4$ (FF), and compare its performance to a direct parametrization (RAW) of the $2^{2k}$ complex amplitudes in the quantum code $|\psi_n\rangle$, again optimized using PSO. Similar to the GADC in Sec. 4.1, the raw ansatz is not able to find codes with coherent information rates as high as the neural network codes. However, in contrast to the GADC the raw ansatz is indeed able to find superadditive quantum codes. For $k = 2$, these codes found using the raw ansatz are optimal (as already observed in [LLS18a]), while for $k = 3, 4$ they are clearly outperformed by our neural network codes. Another observation of [LLS18a] is that the dephasing part of $\mathcal{N}_{p,q}$ suggests a Schmidt ansatz for quantum codes, a neural network state version of which is discussed in Eq. 16 in Sec. 5. However, in the high-noise regime investigated above, this Schmidt ansatz did not yield codes performing as well as the codes $\nu_k$ and $\nu^*_k$, respectively.

Representing the Best Known Codes for the Depolarizing Channel
The depolarizing channel is used as a model to describe qubit decoherence in a noisy environment. For a qubit in a state described by the density operator $\rho$, and for a real parameter $p \in [0, 4/3]$, the action of the channel is given by
$$ \mathcal{D}_p(\rho) = (1-p)\,\rho + p\,\frac{\mathbb{1}}{2}, $$
i.e. the original state $\rho$ is replaced by the maximally mixed state with 'probability' $p$ (for $p \leq 1$); in other words, if on the Bloch sphere $\rho$ has spin polarization vector $x$, the channel $\mathcal{D}_p$ shrinks $x$ by a factor $1 - p$.

Figure 5: Overview of quantum codes for the dephrasure channel $\mathcal{N}_{p,q}$ comparing the neural network codes $\nu_k$ (solid orange, magenta, and red lines) for $k = 2, 3, 4$ to the optimal single-letter code $\phi_1$ (grey dashed line) and the maximum over all weighted repetition codes $\phi_k$ (black dashed line) for $2 \leq k \leq 10$ defined in Eq. 10. We also plot the neural network codes $\nu^*_k$ with optimized parameters over the shown interval (dash-dotted lines), and the best code $\chi_3$ on three channel qubits found with a direct parametrization of the quantum state amplitudes (blue line). For each $q \in \{0.1, 0.2, 0.3, 0.4\}$, we plot the interval of $p$ in which superadditivity occurs. The neural network codes $\nu_k$ are listed in Tabs. 2, 8, 9, 10 for $q = 0.4, 0.3, 0.2, 0.1$, respectively.
100 200 300 400 500 4 · 10 −5 5 · 10 −5 6 · 10 −5 training step (FF) k = 4 0 100 200 300 400 500 1 · 10 −5 2 · 10 −5 3 · 10 −5 4 · 10 −5 training step (RAW) Figure 6: Training convergence of a particle swarm optimization algorithm maximizing the CI of k = 2, 3, 4 copies of the dephrasure channel N p,q with parameters (p, q) = (0.08, 0.4). The left column plots a feed-forward (FF) net representation with four hidden layers of width 2k each (see Sec. 4.2), having 90/182/306 real parameters for k = 2, 3, 4, respectively. The right column plots a direct parametrization (RAW) of the 2 (2k) complex amplitudes, resulting in 32/128/512 real parameters for k = 2, 3, 4, respectively. While the two parametrization find equivalent codes for k = 2, the feed-forward net representation finds strictly better codes for k = 3, 4 than the raw parametrization. For the depolarizing channel, the single-letter channel coherent information , and evaluates to [Wil16] Q (1) (D p ) remains positive up to the threshold at p = 0.25238 (the threshold is defined as the highest p for which Q (1) (D p ) > 0). The next highest thresholds are achieved for k = 3 and 5 channel copies and a k-repetition code for which the channel coherent information Q (1) (φ k , D ⊗k p ) reaches zero at p = 0.25350 and p = 0.25380, respectively. Both in terms of the rate and the threshold, these repetition codes are the best known codes up to 9 channel copies, which is discussed in more detail in App. C.1.
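The single-letter threshold quoted above can be checked numerically. The following is a minimal sketch, assuming the standard hashing-bound expression for the coherent information of a maximally entangled input under the parametrization in which $D_p$ replaces the state by the maximally mixed state with probability p (so that each Pauli error occurs with probability p/4). The function names and the bisection bracket are illustrative choices, not taken from the paper.

```python
from math import log2


def depolarizing_ci(p: float) -> float:
    """Coherent information of D_p(rho) = (1 - p) rho + p I/2 for a
    maximally entangled input (assumed to be the optimal single-letter code)."""
    x = 3.0 * p / 4.0  # total probability of a Pauli error
    if x <= 0.0:
        return 1.0
    return 1.0 + (1.0 - x) * log2(1.0 - x) + x * log2(x / 3.0)


def threshold(lo: float = 0.01, hi: float = 0.5, tol: float = 1e-10) -> float:
    """Bisection for the zero of depolarizing_ci on [lo, hi],
    where the function is strictly decreasing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if depolarizing_ci(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(f"Q1(D_p) at p = 0.2523: {depolarizing_ci(0.2523):.3e}")
    print(f"single-letter threshold: {threshold():.5f}")
```

The noise value p = 0.2523 used in the usage lines is the one from the caption of Fig. 7; it lies just below the single-letter threshold, so the coherent information there is small but still positive.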
We show in the following that a variational neural network ansatz achieves these codes for the depolarizing channel. We also contrast the various architectures (RBM, feed-forward, and their Schmidt variants) on an empirical level. To compute the amplitude function $\psi(i^n)$ in the tensor basis expansion we use both an RBM architecture and an FF architecture with a cos activation function in the first hidden layer and ReLU in the two subsequent hidden layers. This setup, which has been shown to perform well in the context of representing quantum states of local Hamiltonians [CL18], clearly outperformed a ReLU-only architecture in our numerical investigations of the GADC and the dephrasure channel in Sec. 4. Furthermore, we propose a Schmidt ansatz similar to Eq. 15, given for 2l qubits by

$$|\psi_{2l}\rangle = \frac{1}{C} \sum_{i^l \in \{0,1\}^l} \psi(i^l)\, |i^l\rangle_R \otimes |i^l\rangle_A. \qquad (16)$$

This approach greatly reduces the number of degrees of freedom required to parametrize $|\psi_{2l}\rangle$, but enforces the environment R to have the same dimension as the system A. Note that this may introduce redundancy, as e.g. a repetition code ordinarily only requires a single purifying qubit. The ansatz in Eq. 16 furthermore introduces a choice of basis for the channel input qubits, rendering it less general than the ansatz in Eq. 15.
Using an explicit construction, we show that both FF and RBM architectures can efficiently represent products of repetition codes (which are discussed in App. C.2): given k repetition codes on $n_1, \dots, n_k$ qubits, respectively, an RBM with $\sum_i n_i$ visible units and k hidden nodes can represent the corresponding state amplitudes, and an FF net whose first (cos) and second (ReLU) hidden layers each have width k, followed by a single final ReLU node, suffices.
Empirically, we contrast FF, RBM and their corresponding Schmidt variants as a variational ansatz $\psi_n$ (with n = 2k) to maximize $Q^{(1)}(\psi_n, D_p^{\otimes k})$; the FF architecture consists of three hidden layers of width n = 2k with cos-ReLU-ReLU activation functions and a Cartesian output layer. In comparison with a full state vector on n qubits with $2 \times 2^n$ real parameters, we see a significant improvement in convergence speed (see Fig. 7), both when the best known code is a single repetition code for three channel uses and when it is a three times one product repetition code (see App. C.1 for an explanation of this terminology). For both FF and RBM architectures, the Schmidt ansatz Eq. 16 surpasses the standard parametrization Eq. 15, which is likely due to the significantly reduced parameter count. FF networks further outperform RBM architectures with comparable parameter counts on three and four channel uses of a depolarizing channel, which we verified with various global derivative-free optimization techniques (see App. G for an overview) to reduce the likelihood of a systematic bias in our numerical findings. The numerical data for these findings is collected in App. H. We also note that a deep Boltzmann machine ansatz as described in Sec. 3.2 offered no advantage over an RBM ansatz, neither in terms of representability nor convergence speed.

Figure 7: Training convergence of a particle swarm algorithm maximizing the CI of three resp. four copies of the depolarizing channel $D_p$, with noise parameter p = 0.2523. Plotted are the best candidates of 80 threads with 100 particles each for every training step from 0 to 500. The final candidate distribution and the outcome of other optimization algorithms can be seen in App. H. For three channel copies, a three-repetition code maximizes the coherent information, whereas for four channel copies a product code of a three-repetition and a single-repetition code is optimal. Plotted are FF (feed-forward net, 140 resp. 234 real parameters; see Sec. 5 for the FF architecture), FF/Schmidt (Schmidt representation obtained from a feed-forward net, 40 resp. 65 real parameters), RBM (restricted Boltzmann machine with hidden layer width 9, 138 resp. 232 real parameters), RBM/Schmidt (Schmidt representation obtained from an RBM with hidden layer width 9, 39 resp. 64 real parameters), and raw (parametrizing the full state vector, 128 resp. 512 real parameters); note that the FF and RBM representations are in fact overspecified for three channel uses.
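The FF and raw parameter counts quoted in the captions of Figs. 6 and 7 can be reproduced with a short count. The sketch below assumes fully connected layers with one bias per node and a Cartesian output layer consisting of two real nodes (real and imaginary part of the amplitude); this reading is consistent with the quoted numbers but is not spelled out in the text, and the function names are illustrative.

```python
def ff_real_params(n_in: int, hidden: list[int], n_out: int = 2) -> int:
    """Number of real weights and biases of a dense feed-forward net.

    Assumption: every layer is fully connected with a bias per node,
    and the output layer has n_out = 2 real nodes (Re and Im)."""
    sizes = [n_in, *hidden, n_out]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


def raw_real_params(n_qubits: int) -> int:
    """Real parameters of a full state vector: 2 per complex amplitude."""
    return 2 * 2 ** n_qubits


if __name__ == "__main__":
    for k in (2, 3, 4):  # dephrasure channel: four hidden layers of width 2k
        n = 2 * k
        print(k, ff_real_params(n, [n] * 4), raw_real_params(n))
    for k in (3, 4):     # depolarizing channel: three hidden layers of width 2k
        n = 2 * k
        print(k, ff_real_params(n, [n] * 3), raw_real_params(n))
```

The count makes the scaling explicit: the FF ansatz grows quadratically in the number of qubits n, whereas the raw parametrization grows as $2^{n+1}$.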

Representing Absolutely Maximally Entangled States
Absolutely maximally entangled (AME) states are n-partite states having maximal correlation across any bipartition of the n parties into equal halves. These states are certain examples of quantum error-correcting codes, whose intricate multipartite entanglement structure mediates correlations between different subsets of the constituent systems. AME states can be used as a resource for multi-user information-theoretic tasks such as open-destination teleportation, secret sharing or entanglement swapping that require maximal entanglement across different choices of bipartitions [HC13; Hel+12]. In a holographic context, where AME states are referred to as perfect tensors, they provide examples of holographic error-correcting codes [LS15; Pas+15; Li+17]. More generally, an arbitrary AME state on n qudits of local dimension d can be interpreted as a $((n, 1, \lfloor n/2 \rfloor + 1))_d$ quantum error-correcting code, i.e., a code of distance $\lfloor n/2 \rfloor + 1$ encoding a 1-dimensional subspace in n physical qudits [Sco04].
To define absolutely maximally entangled (AME) states in a precise way, we consider a pure state $|\psi_{n,d}\rangle \in (\mathbb{C}^d)^{\otimes n}$ on n qudits of local dimension d. For a subset S ⊂ [n] := {1, . . . , n} of the n qudits we denote by $\rho_S = \operatorname{tr}_{S^c} \psi_{n,d}$ the marginal of $\psi_{n,d}$ on S. Then $\psi_{n,d}$ is AME if $\rho_S = d^{-|S|} I_S$ for every S ⊂ [n] with $|S| = \lfloor n/2 \rfloor$. We use the notation AME(n, d) for an AME state on n qudits of local dimension d.
Since an AME state is maximally entangled across all possible bipartitions into equal halves, monogamy of entanglement [CKW00] puts an obstruction on their existence. Furthermore, the fact that AME states are particular quantum error-correcting codes yields additional constraints via weight enumerator theory [SL97; Rai98]. Consequently, AME states do not exist for all (n, d) (see Sec. D). For example, it is known that there is no AME(4, 2) state [HS00]. On the other hand, an example of an AME(4, 3) state is

$$|\psi_{4,3}\rangle = \frac{1}{3} \sum_{i,j=0}^{2} |i\rangle |j\rangle |i+j\rangle |i+2j\rangle,$$

with addition taken modulo 3.

The property of $\psi_{n,d}$ being absolutely maximally entangled is related to the linear entropy

$$S_L(\rho) = 1 - \operatorname{tr} \rho^2.$$

Defining for $m = 1, \dots, \lfloor n/2 \rfloor$ the average linear entropy

$$Q_m(\psi_{n,d}) = \frac{d^m}{d^m - 1} \binom{n}{m}^{-1} \sum_{S \subset [n],\, |S| = m} S_L(\rho_S), \qquad (17)$$

a pure state $\psi_{n,d}$ is AME if and only if $Q_{\lfloor n/2 \rfloor}(\psi_{n,d}) = 1$ [Sco04]. Hence, to search for AME(n, d)-states $\psi_{n,d}$, we can use Eq. 17 with $m = \lfloor n/2 \rfloor$ as the objective function and optimize the parameters in an ansatz for $\psi_{n,d}$ such that $Q_{\lfloor n/2 \rfloor}(\psi_{n,d}) \approx 1$. As before, we use a neural network state ansatz for $\psi_{n,d}$ based on the following decomposition with respect to a given basis $\{|i\rangle\}_{i=0}^{d-1}$:

$$|\psi_{n,d}\rangle = \frac{1}{C} \sum_{i^n \in \{0,\dots,d-1\}^n} \psi(i^n)\, |i^n\rangle, \qquad (18)$$

where as before C is a normalization constant, and we use the notation $|i^n\rangle \equiv |i_1\rangle \otimes \dots \otimes |i_n\rangle$ for $i^n = (i_1, \dots, i_n)$. The amplitude function $\psi(i^n)$ is again computed by a neural network; since this is now a function from the set of all d-ary strings of length n into $\mathbb{C}$, there are multiple options for encoding $i^n$ as the input to a neural network. We discuss these options in detail in App. E. We demonstrate in Fig. 8 that parametrizing $\psi_{n,d}$ with a neural network state ansatz yields AME(n, d)-states for the pairs (n, d) = (3, 6), (4, 4), (4, 7), and (5, 6). For the numerical optimization, we use the artificial bee colony (ABC) algorithm, followed by pattern search and a final round of gradient search (see App. G). These choices of parameters are only exemplary, and the neural network state ansatz is capable of representing AME(n, d)-states also for other pairs (n, d) such as (3, 3), (4, 3), and (4, 5). In the last three cases, the convergence is remarkably fast and only takes a few iterations in optimization algorithms such as ABC or PSO to reach a value of $Q_{\lfloor n/2 \rfloor}$ sufficiently close to 1.

To assess our numerical results, we introduce an 'average trace distance' parameter

$$D_m(\psi_{n,d}) = \binom{n}{m}^{-1} \sum_{S \subset [n],\, |S| = m} \| \rho_S - \pi_S \|_1, \qquad (19)$$

where $\pi_S := d^{-|S|} I_S$ denotes the completely mixed state, and $\|X\|_1 = \operatorname{tr} \sqrt{X^\dagger X}$ is the trace norm of an operator X. The parameter $D_m(\psi_{n,d})$ measures the average trace distance of the marginals of a state $\psi_{n,d}$ on m subsystems to the completely mixed state. Clearly, $D_{\lfloor n/2 \rfloor}(\psi_{n,d}) = 0$ if and only if $\psi_{n,d}$ is AME. We prove in Sec. D that

$$D_m(\psi_{n,d}) \le \sqrt{2 \log\left( d^m - (d^m - 1)\, Q_m(\psi_{n,d}) \right)}, \qquad (20)$$

where log denotes the binary logarithm. This bound allows us to relate a value of $Q_m$ to how close (on average) in trace distance a state is to being AME (see Fig. 8).
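To make the objective function concrete, the following sketch evaluates the average linear entropy $Q_m$ for a small state and the corresponding upper bound on the average trace distance parameter. It assumes Scott's normalization $Q_m = \frac{d^m}{d^m-1}\bigl(1 - \text{mean}_S \operatorname{tr}\rho_S^2\bigr)$ and the bound in the form $\sqrt{2 \log_2(d^m - (d^m-1) Q_m)}$; with these assumptions the values quoted in the Discussion for (n, d) = (4, 6) and (7, 4) are reproduced up to rounding. The dictionary representation of the state, the function names, and the choice of the standard AME(4, 3) example state are illustrative.

```python
from collections import defaultdict
from itertools import combinations
from math import comb, log2, sqrt


def marginal_purity(state, subset):
    """tr(rho_S^2) for a normalized pure state given as {index tuple: amplitude}."""
    n = len(next(iter(state)))
    rest = [i for i in range(n) if i not in subset]
    groups = defaultdict(list)  # complement string -> [(string on S, amplitude)]
    for idx, amp in state.items():
        key = tuple(idx[i] for i in rest)
        groups[key].append((tuple(idx[i] for i in subset), amp))
    rho = defaultdict(complex)  # sparse matrix entries of rho_S
    for entries in groups.values():
        for a, x in entries:
            for b, y in entries:
                rho[(a, b)] += x * complex(y).conjugate()
    return sum(abs(v) ** 2 for v in rho.values())


def average_linear_entropy(state, n, d, m):
    """Q_m: normalized linear entropy averaged over all subsets of size m."""
    dm = d ** m
    purities = [marginal_purity(state, s) for s in combinations(range(n), m)]
    return dm / (dm - 1) * (1.0 - sum(purities) / comb(n, m))


def trace_distance_bound(q, d, m):
    """Assumed form of the bound on D_m in terms of Q_m (binary logarithm)."""
    dm = d ** m
    return sqrt(2.0 * max(0.0, log2(dm - (dm - 1) * q)))


# Standard AME(4, 3) example: sum_{i,j} |i, j, i+j, i+2j> / 3 (indices mod 3).
ame43 = {(i, j, (i + j) % 3, (i + 2 * j) % 3): 1 / 3
         for i in range(3) for j in range(3)}

if __name__ == "__main__":
    print(average_linear_entropy(ame43, n=4, d=3, m=2))
    print(trace_distance_bound(0.9956, d=6, m=2))
    print(trace_distance_bound(0.9962, d=4, m=3))
```

The purity is computed directly from the amplitudes, so the sketch avoids building density matrices; evaluating $D_m$ itself would additionally require the eigenvalues of each marginal and is not attempted here.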

Discussion
In this work, we have shown that quantum codes for noisy quantum communication and certain quantum error-correcting codes can be modeled efficiently with various neural network representations. In particular, we investigated quantum codes that yield high coherent information for the generalized amplitude damping channel (GADC), the dephrasure channel, and the depolarizing channel. For the GADC and the dephrasure channel, the neural network ansatz finds codes that outperform the best known codes found with traditional numerical methods. For k ≤ 6 copies of the depolarizing channel, we analyzed the representative power of neural network states with regard to the best known codes, repetition codes, and benchmarked how well they can be trained using a variety of global optimization algorithms. Finally, we demonstrated how neural network states can represent absolutely maximally entangled states on n qudits of local dimension d for an array of pairs (n, d).

An interesting question is, of course, whether a neural network state ansatz can be used to find better quantum codes for the depolarizing channel in the high noise regime: either in terms of a higher rate than, say, the 5-repetition code right below the noise threshold, or in terms of increasing the noise threshold itself. Our results indicate that in order to find such codes outperforming the repetition codes (or products thereof), one ought to increase the number of channel copies beyond 5, resulting in code states on 10 or more input qubits. While the (polynomial) scaling of the neural network ansatz in the number of input qubits is favorable, the calculation of the coherent information is the bottleneck here: the computation for a code on k qubits requires diagonalizing a dense $4^k \times 4^k$ matrix, which scales exponentially in runtime with the number of qubits. Due to these computational limitations, evaluating the coherent information for k ≳ 7 channel uses is infeasible, and we would need to find an alternative approach, e.g. by exploiting symmetry considerations, or an approximate cost function that is faster to compute (see e.g. [WBS14], with the added difficulty that the coherent information is the difference between two entropies). Furthermore, it could be possible that better quantum codes lie in maxima of measure almost zero, while the repetition code maxima dominate the potential landscape, making it difficult to find codes that surpass repetition codes. In fact, in all our simulations for k ≤ 6 copies of the depolarizing channel, the variational NN ansatz converges to product repetition codes. Our results might be seen as an indication that, among the states that can be represented using a neural network, repetition codes are in fact optimal for k ≤ 6 copies of the depolarizing channel. We note that our techniques of finding quantum codes using neural network states can also be applied to other channels such as generalized Pauli channels, which include the depolarizing channel. A thorough investigation of other channels in this class, such as the BB84 channel [BB84], is the subject of ongoing work.
We also applied our ansatz to search for AME(n, d)-states for values of (n, d) for which it is not yet known whether these states exist. The smallest-dimensional instances of these cases are (4, 6) and (7, 4) (see App. D). For (n, d) = (4, 6) the best value we obtained was $Q_2(\psi_{4,6}) \approx 0.9956$, which translates via Eq. 20 to a bound on the average trace distance parameter of $D_2(\psi_{4,6}) \lesssim 0.6429$. The state $\psi_{n,d}$ achieving these values is an RBM state with binary encoding and a hidden layer width of M = 12. For (n, d) = (7, 4), we obtained $Q_3(\psi_{7,4}) \approx 0.9962$, corresponding to $D_3(\psi_{7,4}) \lesssim 0.7870$, achieved by an FF state with binary encoding and hidden layers (14, 14, 14) with activation functions cos-ReLU-ReLU. These results suggest that, assuming AME states do exist in these cases, one has to tweak the neural network ansatz or the numerical methods, or both, in order to obtain numerical instances of AME states.

A. Codes for the Generalized Amplitude Damping Channel
In this section we provide an overview of the quantum codes for the GADC defined in Eq. 9 found using the neural network state ansatz. To benchmark these neural network quantum codes we use weighted repetition codes, whose simple structure allows for an efficient computation of the coherent information $Q^{(1)}(\varphi_k^\lambda, A_{\gamma,N}^{\otimes k})$. In the following, we first carry out this calculation, and then present the optimal neural network codes that we found for the GADC.

A.1. Formula for the Coherent Information of Repetition Codes
We first determine the action of the GADC $A_{\gamma,N}(\rho) = \sum_{i=1}^{4} A_i \rho A_i^\dagger$ with $A_i$ as defined in Eq. 9 on a single qubit: Setting which is a diagonal operator with eigenvalues with multiplicity $\binom{k}{m}$ for m = 0, . . . , k. Hence, For the state on the joint system, we have This operator can be written as Let r denote one of the eigenvalues of the state $\sigma^{(0)}$ in Eq. 33, let for m = 1, . . . , k − 1, and Then the entropy of $\sigma_{RB^k}$ equals The coherent information $Q^{(1)}(\varphi_k^\lambda, A_{\gamma,N}^{\otimes k}) = S(B^k)_\sigma - S(RB^k)_\sigma$ can now be efficiently computed using Eqs. 25 and 40 for blocklengths up to k = 20.

A.2. Neural Network Codes for the GADC
We list the best neural network codes found for the GADC $A_{\gamma,N}$ in the following tables. A comparison of these codes to weighted repetition codes is plotted in Fig. 3 in the main text.

B. Codes for the Dephrasure Channel
In the following, we give a summary of the results about the coherent information of the dephrasure channel N p,q (defined in Eq. 11) that were obtained in [LLS18a]. These results are concerned with the one-way quantum capacity, as defined in Sec. 2; for a discussion of two-way capacities, see [PLB19].

B.1. Formula for the Coherent Information of Repetition Codes
Superadditivity of the channel coherent information of the dephrasure channel can be achieved using a simple weighted repetition code $\varphi_k^\lambda$ with weight parameter λ ∈ [0, 1]. In [LLS18a], a formula for its channel coherent information is derived (Eq. 42), expressed in terms of the binary entropy (with respect to the binary logarithm) and $\operatorname{artanh}(x) := \frac{1}{2} \log \frac{1+x}{1-x}$. Moreover, it is shown in [LLS18a] that for k = 1 the formula in Eq. 42 maximized over λ ∈ [0, 1] is in fact the optimal single-letter channel coherent information. That is, $Q^{(1)}(N_{p,q})$ is optimized by states whose marginal on the system qubits is diagonal in the computational basis. Hence, the formula Eq. 42 can be used to find quantum codes that surpass the optimal code for a single copy of $N_{p,q}$, demonstrating superadditivity of coherent information. For q ∈ {0.1, 0.2, 0.3, 0.4} and the relevant intervals of p, the rates of the weighted repetition code $\varphi_k^\lambda$ (optimized over λ ∈ [0, 1]) for 1 ≤ k ≤ 5 are plotted in Fig. 5. The lines corresponding to $\varphi_1$ represent the codes achieving the optimal single-letter coherent information $Q^{(1)}(N_{p,q})$.

B.2. Neural Network Codes for the Dephrasure Channel
We list the best neural network codes found for the dephrasure channel $N_{p,q}$ in the following tables. A comparison of these codes to weighted repetition codes is plotted in Fig. 5 in the main text.

C. Codes for the Depolarizing Channel

C.1. Product Repetition Codes for the Depolarizing Channel
In this appendix, we discuss the known optimal codes for the depolarizing channel, which are given by repetition codes

$$|\varphi_k\rangle = \frac{1}{\sqrt{2}} \left( |0\rangle_R |0\rangle^{\otimes k} + |1\rangle_R |1\rangle^{\otimes k} \right). \qquad (44)$$

For p ≲ 0.2519, the single-letter coherent information Eq. 13 is optimal. For 0.2519 ≲ p ≲ 0.2533, the 3-repetition code $\varphi_3$ (defined in Eq. 44) is optimal, while for 0.2533 ≲ p ≲ 0.2538 the 5-repetition code $\varphi_5$ is optimal. The point p ≈ 0.2538 marks the highest threshold for a single repetition code. This threshold can be further extended using the concatenated codes of [SS07; FW08]. We summarize this in Fig. 9, where we compare the repetition codes and their rates and thresholds. The above codes are the best known information-theoretic codes, yielding the best lower bounds on the quantum capacity of the depolarizing channel by Eq. 3. However, in numerical investigations we are facing a slightly different problem of maximizing the k-coherent information $\frac{1}{k} Q^{(1)}(\psi, D_p^{\otimes k})$ over quantum codes ψ for fixed k, that is, solving

$$\max_{\psi} \frac{1}{k} Q^{(1)}(\psi, D_p^{\otimes k}). \qquad (45)$$

For k ≤ 9 channel uses, the optimization problem Eq. 45 is solved by products of repetition codes,

$$|\Phi_{\mathbf{k}}\rangle = \bigotimes_{i=1}^{l} |\varphi_{k_i}\rangle. \qquad (46)$$

Here, $\mathbf{k} = (k_1, \dots, k_l)$, and the resulting code $|\Phi_{\mathbf{k}}\rangle$ is a quantum code on $\sum_{i=1}^{l} k_i$ channel input qubits and l purifying qubits.
To illustrate this, consider 4 channel uses of the depolarizing channel, and recall that the single-letter coherent information Eq. 13 vanishes around p ≈ 0.2524. The respective thresholds for the 4-repetition code $\varphi_4$ and the 3-repetition code $\varphi_3$ on three input qubits are p ≈ 0.2532 and p ≈ 0.2535, respectively (see Fig. 9 and the file rep-codes-tabular.txt in [Anc]). Hence, for 0.2532 ≤ p ≤ 0.2535 it is clearly advantageous to "freeze" one input qubit to some fixed pure state, and use a 3-repetition code on the remaining 3 input qubits. Since pure input states can never establish coherent information between Alice and Bob, the frozen input does not contribute to the overall coherent information, and the resulting code incurs a penalty in the rate. However, this code inherits the same threshold as the 3-repetition code on three input qubits, thus outperforming the plain 4-repetition code. Similarly, one finds that for p ∈ [0.2519, 0.2524] the quantity $\frac{1}{4} Q^{(1)}(D_p^{\otimes 4})$ is maximized by a 3-repetition code tensored with a 1-repetition code (i.e., using three of the four input qubits with one purifying qubit for a repetition code, and maximally entangling the remaining input qubit with another purifying qubit). In Fig. 10 and Tab. 11, we provide an overview of the thresholds and rates of the optimal such combinations of repetition codes for k ≤ 10 uses of the depolarizing channel. For k ≥ 10 uses of the depolarizing channel, concatenated codes can surpass the best known repetition code thresholds [SS07; FW08].

Table 11: Intermediate product repetition code thresholds; before the first column at 0.25186 the best code is given by the single-letter coherent information.

C.2. Products of Repetition Codes as Benchmark for Depolarizing Noise
As a benchmark for finding quantum codes, we demand that the models we propose can at least achieve the product repetition codes described above; either because they can represent products of repetition codes directly, or because they achieve the target rates by some other means. In particular, this should serve as a sanity check for the models we propose, indicating whether we need to increase the width of a hidden layer, or the depth of the model. The relevant question for us is whether a state |Φ n as defined in Eq. 46 can always be represented accurately by the weights obtained from an RBM or an FF net.

C.2.1. RBM States
First observe how the Hamiltonian $H_{\mathrm{RBM}}$ describes a linear single-layer FF classifier (i.e., a linear function on the inputs $i_k$). Seen as a linear function on bit strings, the Hamiltonian can therefore represent a target state $|\psi\rangle$ as well as a linear model allows. For the simple case of products of repetition codes, where we subdivide the set of basis states into those of weight 0 and 1, respectively, this question is well-studied in the context of linear classifiers.

Figure (caption fragment): ... rate of the k-repetition code. The solid black line is the best achievable rate when only using product codes, e.g. for k = 3 and below p ≈ 0.252, a product of three single-channel repetition codes (1 × 1 × 1) is superior to one 3-repetition code. It is noteworthy that the segmentation of the best achievable rates is not clear a priori: For k = 4, the segments are 1 × 1 × 1 × 1 and then 3 × 1, where the extra kink at p ≈ 0.2524 signifies that the single-letter CI has now dropped to zero; for k = 6, the segments are 1 × . . . × 1, 3 × 3, and 5 × 1, the latter of which is just a single segment, as the single-letter CI is already zero.

A single k-repetition code has the form $|0 \cdots 0\rangle + |1 \cdots 1\rangle =: |a\rangle + |b\rangle$. Since the RBM uses a scaled encoding (see Tab. 12), the bit strings correspond to real entries in a k-dimensional vector, and thus $|a\rangle = 0 \cdot |b\rangle$; a linear function L therefore necessarily satisfies $L(|b\rangle) = L(|a\rangle) = 0$. If we let $|b\rangle$ be a basis state (unnormalized) and complete the basis with k − 1 arbitrary orthogonal vectors, it therefore suffices to define L in such a way as to have $\ker L = \operatorname{span}\{|b\rangle\}$.
Products of repetition codes always have the form $\bigotimes_{i=1}^{k} |\varphi_{n_i}\rangle$; since basis states are bit strings for the RBM classifier, the corresponding code is a direct sum of the individual repetition codes. We can thus construct a classifier for the overall code by writing $L_1 \oplus \dots \oplus L_k = L$, which is still linear.
Since $H_{\mathrm{RBM}}$ appears in an exponential in Eq. 6, we can use the spectral gap of $H_{\mathrm{RBM}}$ to obtain a lower bound on how close to zero an entry in the code can be set. For instance, if we were to represent a 3-repetition code $|000\rangle + |111\rangle$, we can require that $|111\rangle$ is the eigenvector corresponding to the smallest eigenvalue of $H_{\mathrm{RBM}}$; all other binary strings should have an energy that is as large as possible, such that the exponential function suppresses the corresponding weight. Consider the binary state $|011\rangle$, which has overlap 2/3 with $|111\rangle$ (assuming normalization). If ∆ is the spectral gap of $H_{\mathrm{RBM}}$ (i.e. the difference between the ground state energy and the second lowest eigenvalue), then $\langle 001| H_{\mathrm{RBM}} |001\rangle = 2\Delta/3$, yielding a lower bound between largest code weight and smallest code weight of exp(−∆). To get an empirical estimate, assume we flip a single bit, e.g. $i_1$, in Eq. 5. How large can the energy difference be? If all parameters are chosen (in magnitude) within a range [−M, M], then a simple estimate would be $\Delta \le M + M^2$; this is, of course, an upper bound to a lower bound. In practice we found that M = 10 is sufficient for our purposes.

C.2.2. DBM States
Eq. 7 introduces a quadratic term in the input. Since one can easily embed a 1-in-3-SAT instance into a quadratic polynomial (for three boolean variables $v_1, v_2, v_3$, where true = 1 and false = 0 are enforced by terms $v_i^2 - v_i = 0$, the equation $(v_1 + v_2 + v_3 - 1)^2 = 0$ holds if and only if exactly one of the $v_i$ is true; the existence or nonexistence of a root for the sum of all constraints thus answers the instance), it is clear that the discriminative power of DBM states should vastly outperform that of RBM states, albeit at a higher computational cost. As discussed in the introduction, for various ground states of local Hamiltonians this intuition has empirically been shown to be correct.
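The embedding of 1-in-3-SAT into a quadratic polynomial mentioned above can be illustrated by brute force on small instances. The sketch below is only an illustration under simplifying assumptions: clauses contain positive literals only (monotone 1-in-3-SAT), the booleanity constraints $v_i^2 - v_i = 0$ are imposed by enumerating 0/1 assignments rather than by penalty terms, and all names are hypothetical.

```python
from itertools import product


def clause_penalty(v1: int, v2: int, v3: int) -> int:
    """(v1 + v2 + v3 - 1)^2, which vanishes iff exactly one variable is 1."""
    return (v1 + v2 + v3 - 1) ** 2


def instance_penalty(assignment, clauses) -> int:
    """Sum of the quadratic clause constraints for a 0/1 assignment."""
    return sum(clause_penalty(*(assignment[i] for i in c)) for c in clauses)


def has_root(n_vars: int, clauses) -> bool:
    """True iff some 0/1 assignment is a root of the summed polynomial,
    i.e. iff the monotone 1-in-3-SAT instance is satisfiable."""
    return any(instance_penalty(a, clauses) == 0
               for a in product((0, 1), repeat=n_vars))


if __name__ == "__main__":
    print(has_root(4, [(0, 1, 2), (1, 2, 3)]))
    print(has_root(4, [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]))
```

Because every clause penalty is a non-negative square, the sum vanishes only if each clause is satisfied individually, which is the property the argument in the text relies on.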

C.2.3. Feed Forward Network States
It is easy to explicitly construct weights for an FF net that can represent any product repetition code. As a first step, consider a single repetition code $|\varphi_n\rangle$. We set up a three-layer neural network from n inputs, one hidden layer of width 1, and a single output node (for simplicity we disregard the imaginary part for the state output in Eq. 8). The weights and activation functions to be chosen are

$$x_i \longmapsto y := \cos\left( \frac{2\pi}{n} \sum_{i=1}^{n} x_i \right) \longmapsto z := \mathrm{ReLU}\left( \frac{y - \cos(2\pi/n)}{1 - \cos(2\pi/n)} \right), \qquad (47)$$

and one can verify that the output is one on the all-ones and all-zeros inputs, and zero otherwise: y = 1 exactly when the bit sum is 0 or n, and y ≤ cos(2π/n) for every other input. We refer to Sec. F for a more detailed discussion. For a product code given by some $\mathbf{n} = (n_1, \dots, n_k)$, we simply partition the input nodes into k subsets and dovetail those with a network given in Eq. 47; we obtain k outputs $z_1, \dots, z_k$. Since we know that a logic AND gate corresponds to all the $z_i = 1$, we can use a final $\mathrm{ReLU}\left(\sum_{i=1}^{k} z_i - k + 1\right)$ layer to enforce that the weights are 1 if all individual segments are valid repetition codes, and 0 otherwise; alternatively, one can merge all already existing ReLU nodes into one. Observe that we could always implement a single cos node in this function with two ReLU nodes, followed by another ReLU node to combine the outputs (as in Sec. F). This would increase the hidden layer width by a factor of two; we can incorporate addition of the individual outputs into the last existing ReLU layer, so the depth should remain constant.
One immediate consequence is that any product code of k repetition codes can always be represented by a network architecture where the first hidden layer has width k; and we in fact empirically found that the trained weights of the first layer are similar to those in Eq. 47.
A final note on the parameter range necessary for the argument: the largest coefficients in absolute value in Eq. 47 and its final AND node are $\max_i (1 - \cos(2\pi/n_i))^{-1}$, or k − 1, whichever is larger. Restricting the network's parameter range artificially below this threshold could result in worse representability of product repetition codes.
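The construction of this section can be checked numerically on small examples. The sketch below assumes that the first-layer node of each block computes $\cos(2\pi s/n)$ of the bit sum s over a block of n ≥ 2 bits, so that s = 0 and s = n are the only inputs mapped to 1; this period-n reading, the function names, and the handling of floating-point round-off through a tolerance are assumptions made for illustration.

```python
import math
from itertools import product


def relu(x: float) -> float:
    return max(0.0, x)


def repetition_node(bits) -> float:
    """cos node followed by a rescaled ReLU: 1 on all-zeros and all-ones
    inputs, 0 (up to round-off) on every other bit string."""
    n = len(bits)
    if n < 2:
        raise ValueError("block size must be at least 2 in this sketch")
    y = math.cos(2.0 * math.pi * sum(bits) / n)
    t = math.cos(2.0 * math.pi / n)  # largest value of y on non-code inputs
    return relu((y - t) / (1.0 - t))


def product_code_amplitude(bits, block_sizes) -> float:
    """AND of the block outputs via a final ReLU(sum(z) - k + 1) node."""
    if sum(block_sizes) != len(bits):
        raise ValueError("block sizes must add up to the input length")
    zs, pos = [], 0
    for n in block_sizes:
        zs.append(repetition_node(bits[pos:pos + n]))
        pos += n
    return relu(sum(zs) - len(zs) + 1.0)


def largest_coefficient(block_sizes) -> float:
    """Largest weight magnitude needed by the construction (see the note above)."""
    scale = max(1.0 / (1.0 - math.cos(2.0 * math.pi / n)) for n in block_sizes)
    return max(scale, len(block_sizes) - 1)


if __name__ == "__main__":
    blocks = (4, 2)
    for b in product((0, 1), repeat=sum(blocks)):
        if product_code_amplitude(b, blocks) > 0.5:
            print("".join(map(str, b)))
    print(largest_coefficient(blocks))
```

For block sizes (4, 2) the support consists of the four strings obtained by concatenating 0000 or 1111 with 00 or 11, as expected for a product of two repetition-type blocks.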

C.2.4. Schmidt Network States
The argument is similar to the one for feed-forward network states. Note that, in general, Schmidt codes will be redundant, since for e.g. four channel uses we are forced to use more than just a single purifying qubit. The fact that the neural net calculates Schmidt coefficients means that the repetition codes always use as many purifying dimensions as system dimensions.

D. Absolutely Maximally Entangled States
An AME(n, d)-state is a pure state $|\psi_{n,d}\rangle \in (\mathbb{C}^d)^{\otimes n}$ on n qudits with local dimension d ≥ 2 satisfying

$$\operatorname{tr}_{S^c} \psi_{n,d} = d^{-|S|} I_S \qquad (48)$$

for every S ⊂ [n] with $|S| = \lfloor n/2 \rfloor$. As mentioned in the main text, whether or not an AME(n, d)-state exists depends on n and d. Using weight enumerator theory [SL97; Rai98], Scott proved that an AME(n, d)-state can only exist if $n \le 2(d^2 - 1)$ for even n, and $n \le 2d(d + 1) - 1$ for odd n [Sco04]. This technique was recently extended by Huber et al. [Hub+18] to give further constraints on the existence of AME(n, d)-states. For fixed n an AME(n, d)-state always exists for sufficiently large local dimension d [HC13]. For example, AME(n, d)-states exist for d a prime power and n ≤ d [GBR04]. Recently, it was proved in [HGS17] that an AME state on seven qubits cannot exist. This result completely settled the case of qubit AME states: they exist for n = 2, 3, 5, 6, and only for these n. Here, we merely mention that it is unknown whether AME(n, d)-states exist for (n, d) = (4, 6) and (n, d) = (7, 4), (7, 6).
Scott [Sco04] proved that a multipartite state $|\psi_{n,d}\rangle \in (\mathbb{C}^d)^{\otimes n}$ is AME if and only if the average linear entropy $Q_m(\psi_{n,d}) = 1$ for $m = \lfloor n/2 \rfloor$, where $Q_m(\cdot)$ is defined in Eq. 17. Since we are searching for AME states by maximizing $Q_m(\cdot)$, we need to make sure that a state $\psi_{n,d}$ with $Q_m(\psi_{n,d}) \approx 1$ is also approximately AME. We determine the latter by introducing the average trace distance parameter $D_m$ defined in Eq. 19 that measures the average trace distance between the marginals of $\psi_{n,d}$ on m subsystems and the completely mixed state. The average trace distance parameter $D_m(\cdot)$ can be bounded from above in terms of $Q_m(\cdot)$, as stated in Eq. 20. We restate this bound here for the reader's convenience:

$$D_m(\psi_{n,d}) \le \sqrt{2 \log\left( d^m - (d^m - 1)\, Q_m(\psi_{n,d}) \right)}. \qquad (49)$$

To prove Eq. 49, we use the quantum version of Pinsker's inequality, $D(\rho \| \sigma) \ge \frac{1}{2} \| \rho - \sigma \|_1^2$, where $D(\rho \| \sigma) = \operatorname{tr}(\rho \log \rho) - \operatorname{tr}(\rho \log \sigma)$ is the quantum relative entropy. We also use the 2-relative Rényi entropy $D_2(\rho \| \sigma) = \log \operatorname{tr}(\rho^2 \sigma^{-1})$ [Pet86], and the well-known fact that $D(\rho \| \sigma) \le D_2(\rho \| \sigma)$.
Since AME states are defined on tensor products of d-dimensional Hilbert spaces, the input string i n to the neural network computing the amplitude ψ(i n ) in the ansatz Eq. 15 is a d-ary string. Depending on the local dimension, we use different encodings of this d-ary input string, as explained in App. E below.

E. Input Encoding of d-ary Strings for Neural Networks
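In order to parametrize quantum states on n qubits, it is rather straightforward to use the neural network ansatz described in Sec. 3. In the case of AME(n, d)-states with local dimension d > 2, we slightly tweak the neural network ansatz. To this end, we fix a basis $\{|i\rangle\}_{i=0}^{d-1}$ for $\mathbb{C}^d$, and express a general quantum state $\psi_{n,d} \in (\mathbb{C}^d)^{\otimes n}$ as

$$|\psi_{n,d}\rangle = \frac{1}{C} \sum_{i^n \in \{0,\dots,d-1\}^n} \psi(i^n)\, |i^n\rangle,$$

where C is again a normalization constant ensuring $\langle \psi_{n,d} | \psi_{n,d} \rangle = 1$, and we use the notation $|i^n\rangle \equiv |i_1\rangle \otimes \dots \otimes |i_n\rangle$ for $i^n = (i_1, \dots, i_n)$. The d-ary string $i^n$ is passed to the neural network using one of the following encodings:
1. Scaled encoding: Map each symbol to a single real-valued input, giving n inputs to the neural network (see Tab. 12).
2. Binary encoding: Encode each symbol as a binary string of length $\lceil \log d \rceil$ and use the resulting binary string of length $\lceil \log d \rceil n$ as the input to the neural network. Example: For d = 6, the encoding is 0 → 000, 1 → 001, . . . , 5 → 101.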
3. One-hot encoding: Encode each symbol in a 'one-hot' vector of length d and use the resulting binary string of length dn as the input to the neural network.
We have found that the performance of the specific encoding used in the neural network optimization depends on the local dimension d. For prime d, the neural network optimization using the scaled encoding converges quickly to known AME(n, d)-states such as AME(4, 7), as evident from Fig. 8 in the main text. On the other hand, for composite d the NN ansatz is more powerful using binary or one-hot encoding. Since binary encoding has a smaller overhead in terms of the 'physical' qubits used in the ansatz ($\lceil \log d \rceil n$ vs. dn), we use binary encoding for composite local dimension d. We summarize the different encodings in Tab. 12.
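The binary and one-hot encodings and their input sizes can be sketched as follows. This is an illustration only: the function names are hypothetical, the bit order (most significant bit first) is an assumption, and the scaled encoding is omitted because its precise scaling is specified in Tab. 12 rather than in the text.

```python
def binary_encode(symbols, d: int) -> list[int]:
    """Encode each d-ary symbol in ceil(log2 d) bits, most significant bit first."""
    width = max(1, (d - 1).bit_length())  # equals ceil(log2 d) for d >= 2
    bits = []
    for s in symbols:
        if not 0 <= s < d:
            raise ValueError(f"symbol {s} out of range for d = {d}")
        bits.extend((s >> k) & 1 for k in reversed(range(width)))
    return bits


def one_hot_encode(symbols, d: int) -> list[int]:
    """Encode each d-ary symbol as a length-d indicator vector."""
    bits = []
    for s in symbols:
        if not 0 <= s < d:
            raise ValueError(f"symbol {s} out of range for d = {d}")
        bits.extend(1 if j == s else 0 for j in range(d))
    return bits


if __name__ == "__main__":
    string = [0, 1, 5, 3]                     # a 6-ary string of length n = 4
    print(binary_encode(string, 6))           # 3 bits per symbol
    print(len(one_hot_encode(string, 6)))     # d * n inputs
```

For d = 6 and n = 4 the binary encoding uses 12 network inputs and the one-hot encoding 24, which is the overhead comparison made in the paragraph above.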

F. The Role of Activation Functions for Quantum Codes
In machine learning, the use of nonlinear activation functions is crucial to a neural network's performance; otherwise, the network is just a single affine transformation and not useful beyond linear regression. The overall network can have varying activation functions per neuron (see Fig. 1). In essentially all cases, the activation functions are the same within a layer. The operation of such a layer is thus to perform an affine transformation on the input vector and then, element-wise, apply the nonlinearity f. For a single neuron z depending on $x = (x_1, \dots, x_n)$, the mathematical operation can thus be written as $z = f\left( \sum_{i=1}^{n} w_i x_i + b \right)$ with weights $w_i$ and bias b.
Commonly used activation functions are e.g. ReLU, sigm or tanh, which are plotted in Fig. 11; in addition to some thorough studies [IS15; He+15; KSH17], there seems to be a lot of empirical understanding of which activation functions perform better in various scenarios [Phy]. One example is that sigm saturates (meaning the gradient vanishes for large or small values), whereas ReLU does not have the same problem. Furthermore, the general consensus seems to be that non-monotonic or periodic activation functions, such as sin, weaken the neural network's performance. We found conflicting evidence for this in the literature ([Sop99; GA16] and [GBC16, sec. 6.2.2]), suggesting that such periodic functions can indeed be useful for specific tasks, especially in the context of representing ground states for local Hamiltonians [CL18].
Figure 11: Various activation functions (bold lines) and their derivatives (thin lines); the panels show ReLU(x), tanh(x), and cos(x) as functions of x. tanh is an example of a sigmoid function; more commonly used, however, is $\mathrm{sigm}(x) = (1+\exp(-x))^{-1}$. Sigmoid functions suffer from a vanishing gradient problem at both ends of their input range. This can be countered either by switching to another activation function, such as a rectified linear unit ReLU (or its "leaky" version, i.e. one where the segment for $x < 0$ has a small but non-vanishing slope), or by using techniques such as batch normalisation [IS15]. Non-monotonic activation functions such as cos are rarely used in practice, but can be useful for certain specific tasks.
In one example of such a task, [CL18] use neural network states to approximate the ground states of certain Hamiltonians. They report good performance of feed-forward network architectures with a cosine activation function in the first layer for a 1D antiferromagnetic Heisenberg model, arguing that the cosine function is capable of handling the "sign problem" typically found in the analysis of Hamiltonians. We found that using cosine in the first hidden layer also performs well in finding good quantum codes for quantum channels such as the depolarizing channel defined in Eq. 12, or the dephrasure channel defined in Eq. 11. In the following, we want to give an intuition why a periodic activation function such as cos can be useful for learning quantum codes whose structure can be easily derived from the binary signature of their state vectors. To give an example, consider a repetition code on five qubits, given by $|00000\rangle + |11111\rangle$. A function $M : (\mathbb{C}^2)^{\otimes 5} \to \mathbb{C}$ with $M(|00000\rangle) = M(|11111\rangle) = 1$, and 0 elsewhere, is trivial to construct from elementary logic gates (i.e. either all bits are zero, or all bits are one).
For a feed-forward neural network, one could imagine adding up all bits within one neuron, and thresholding this value with a ReLU activator: $z = \mathrm{ReLU}\big(\sum_{i=1}^{5} x_i - 4\big)$, which equals 1 if all five bits are one and 0 otherwise. A similar gate with flipped signs can activate only when all bits are zero; the two outputs can then be combined using a final ReLU node.
We can achieve the same activation using a single cos neuron, dovetailed by a ReLU in the next layer: $z_1 = \cos\big(\tfrac{2\pi}{5} \sum_{i=1}^{5} x_i\big)$ and $z_2 = \mathrm{ReLU}\Big(\frac{z_1 - \cos(1/5)}{1 - \cos(1/5)}\Big)$.
While this looks like a more complicated version of the same calculation, it quickly becomes obvious that one can easily perform modular arithmetic using this technique: what we have in fact calculated is whether $\sum_i x_i \equiv 0 \pmod 5$. Why is this an advantage? As a slightly more complicated example, let us consider an (unnormalized) tensor code built from a 3-repetition code $|\phi_3\rangle = |000\;000\rangle + |111\;111\rangle$ and a 1-repetition code (or simply maximally entangled state) $|\phi_1\rangle = |0\;0\rangle + |1\;1\rangle$. In both cases, the first block of qubits (3 resp. 1) is sent through the channel, and the second block forms the purifying environment. On 4 qubits, the tensor code thus looks as follows (for visualization purposes we boldface the single channel repetition code): $|\phi_3\rangle \otimes |\phi_1\rangle = |000\mathbf{0}\;000\mathbf{0}\rangle + |000\mathbf{1}\;000\mathbf{1}\rangle + |111\mathbf{0}\;111\mathbf{0}\rangle + |111\mathbf{1}\;111\mathbf{1}\rangle$ (56)
Any tensor channel $\mathcal{N}^{\otimes n}$ is naturally covariant$^3$ with respect to permuting tensor factors, i.e., with respect to the unitary representation $\pi \mapsto U_\pi$ of the symmetric group $S_n$ on $(\mathbb{C}^2)^{\otimes n}$ defined by $U_\pi\, |e_1\rangle \otimes \dots \otimes |e_n\rangle = |e_{\pi^{-1}(1)}\rangle \otimes \dots \otimes |e_{\pi^{-1}(n)}\rangle$. Since the coherent information $I(A\rangle B)$ is furthermore invariant under local unitaries of the form $U_A \otimes U_B$, codes that are permutations of each other yield the same value for the coherent information. For example, the code $(U_{(14)} \otimes U_{(24)})(|\phi_3\rangle \otimes |\phi_1\rangle) = |0000\;0000\rangle + |1000\;0100\rangle + |0111\;1011\rangle + |1111\;1111\rangle$ (57) is obtained from $|\phi_3\rangle \otimes |\phi_1\rangle$ by swapping channel qubits 1 and 4 and environment qubits 2 and 4,$^4$ and is thus equivalent for quantum information transmission.$^5$ Hence, within each block of four qubits (either channel or environment) the code is characterized by the Hamming weight of the code vectors (0, 1, 3 and 4 in the example above), and ideally this is identified by the neural network. With modular arithmetic, we can have a cos neuron identifying 0 and 4 (e.g. all Hamming weights $\equiv 0 \pmod 4$), and another one identifying 1 and 3 (e.g. all odd Hamming weights). While it is conceivable that for simple codes such as Eq. 56 one can write down relatively simple circuits with non-periodic activation functions, it should be clear that we do save space within the neural network representation if we can perform calculations such as the ones above within a single neuron.
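The two constructions for the five-qubit repetition code can be compared directly in a few lines of Python. This is a minimal sketch under stated assumptions: the cosine neuron is taken to be $\cos\big(\tfrac{2\pi}{5}\sum_i x_i\big)$, which is our reading of the period-5 construction, the threshold $\cos(1/5)$ is one admissible choice among all values strictly between $\cos(2\pi/5)$ and 1, and the function names are hypothetical.

```python
import math
from itertools import product


def relu(x):
    return max(0.0, x)


def relu_detector(bits):
    """Two thresholded sums (all ones, all zeros) combined by a final ReLU."""
    n = len(bits)
    all_ones = relu(sum(bits) - (n - 1))  # 1 only if every bit is one
    all_zeros = relu(1 - sum(bits))       # 1 only if every bit is zero
    return relu(all_ones + all_zeros)


def cos_detector(bits, threshold=math.cos(1 / 5)):
    """One cosine neuron with period n in the Hamming weight, then a ReLU."""
    n = len(bits)
    z1 = math.cos(2 * math.pi * sum(bits) / n)
    return relu((z1 - threshold) / (1 - threshold))


support = [b for b in product((0, 1), repeat=5) if cos_detector(b) > 0.5]
print(support)  # [(0, 0, 0, 0, 0), (1, 1, 1, 1, 1)]
```

Both detectors fire exactly on the two code words, but the cosine variant needs a single first-layer neuron, since it only tests the Hamming weight modulo 5.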

$^4$ Note that in Eq. 57 the two tensor products on the left-hand side are with respect to different tensor factors. For the first tensor product, the two factors correspond to channel input and purifying qubits, respectively.
$^5$ We do not claim that optimal codes are in any way symmetric due to this permutation invariance.

G. Numerical Optimization Techniques
In most applications neural networks are trained using the backpropagation method, in which each network parameter is updated using the gradient of a loss or objective function with respect to that parameter. In our main application of neural networks, maximizing the coherent information of a quantum channel, the objective function is the coherent information itself. In the interesting case of a high-noise quantum channel (such as $\mathcal{D}_p$ for $p \gtrsim 0.2523$), a randomly selected quantum code (e.g., with respect to the Haar measure on pure states) has strictly negative coherent information with high probability, whereas a product state $|\psi_1\rangle_R \otimes |\psi_2\rangle_A$ always has vanishing coherent information, $I_c(\psi_1 \otimes \psi_2, \mathcal{N}) = S(\mathcal{N}(\psi_2)) - S(\psi_1 \otimes \mathcal{N}(\psi_2)) = 0$. Hence, the coherent information landscape is dominated by local maxima, and gradient-based optimization techniques are likely to get stuck in these local maxima.
This intuition was confirmed in our numerical search for good quantum codes for the depolarizing channel and the dephrasure channel. In the search for AME(n, d) states, the objective function is the function $Q_m(\psi)$ defined in Eq. 17. Here, numerical investigations showed that gradient-based optimization was again likely to get stuck in local minima.
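The vanishing coherent information of product states can be checked numerically for a single channel use. The sketch below assumes the single-qubit depolarizing convention $\rho \mapsto (1-p)\rho + p\,\mathbb{1}/2$, which may differ from the parametrization in Eq. 12; the helper names are hypothetical. It compares a maximally entangled input with a product input on either side of the threshold quoted above.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def entropy(rho):
    """Von Neumann entropy in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))


def coherent_information(psi_RA, p):
    """I_c = S(B) - S(RB) for a pure two-qubit state, channel acting on A.

    Assumed channel: rho -> (1 - p) rho + p * I/2, written with Pauli Kraus
    operators. This convention is an assumption of the sketch.
    """
    kraus = [np.sqrt(1 - 3 * p / 4) * I2] + [np.sqrt(p / 4) * P for P in (X, Y, Z)]
    rho = np.outer(psi_RA, psi_RA.conj())
    rho_RB = sum(np.kron(I2, K) @ rho @ np.kron(I2, K).conj().T for K in kraus)
    # Partial trace over the reference qubit R (first tensor factor).
    rho_B = np.trace(rho_RB.reshape(2, 2, 2, 2), axis1=0, axis2=2)
    return entropy(rho_B) - entropy(rho_RB)


bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
product_state = np.kron([1, 0], [np.cos(0.3), np.sin(0.3)]).astype(complex)
for p in (0.24, 0.27):
    print(p, coherent_information(bell, p), coherent_information(product_state, p))
```

Under this convention the maximally entangled input changes sign between $p = 0.24$ and $p = 0.27$, while the product input stays at zero up to numerical precision, so above the threshold a product state is a better code than this particular entangled one.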
The failure of gradient-based optimization methods in both scenarios led us to consider gradient-free, stochastic global optimization techniques instead. In the following, we give high-level explanations of four popular algorithms of this kind: particle swarm optimization, artificial bee colonization, pattern search (also known as direct search), and genetic evolution.

G.1. Particle Swarm Optimization
Particle swarm optimization (PSO) [KE95] is a meta-heuristic, derivative-free global optimization technique. The idea of PSO is to have multiple particles explore the landscape in search of a global minimum, and communicate their individual best value to the swarm. At the same time, each particle records its own history and stores the personal best value. In each iteration, the update of a particle's velocity vector is determined by the current velocity, attraction back to the location of the personal best function value, and attraction towards the location of the global best value.
More precisely, fix model parameters $\alpha, \beta, \gamma > 0$ and consider $N$ particles with random initial position $x_i^{(0)}$ and random initial velocity $v_i^{(0)}$ for $i \in [N]$. For each particle $i$, the variable $p_i$ stores the location of the personal best function value, while the variable $g$ stores the location of the global best function value among the whole swarm. In the $k$-th iteration, the velocity and position of a particle are updated according to $v_i^{(k)} = \alpha\, v_i^{(k-1)} + \beta\, r_\beta\, (p_i - x_i^{(k-1)}) + \gamma\, r_\gamma\, (g - x_i^{(k-1)})$ and $x_i^{(k)} = x_i^{(k-1)} + v_i^{(k)}$, where $r_\beta, r_\gamma \in [0, 1]$ are drawn uniformly at random. The parameter $\alpha$ is called inertia, while $\beta$ and $\gamma$ are usually called self-interaction and social interaction, respectively. A common modification of the particle swarm optimization is to limit the social interaction to neighborhoods of a certain size within the swarm, ensuring a more thorough exploration of the landscape by the swarm.
The MATLAB implementation of PSO, available in the Global Optimization Toolbox, uses the neighborhood modifications with variable neighborhood sizes and an adaptive adjustment of the inertia weight. We refer to the official documentation [Pso] for details of the algorithm, as well as the MATLAB files in [Anc] for the algorithm settings used in this paper. Furthermore, we used the "inertia weight" variant of PSO in Pagmo [Abc], with parameter settings as found in the C++ source files [Anc].
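For illustration, a basic global-best PSO with inertia, self-interaction, and social interaction can be sketched in Python as follows. This is not the MATLAB or Pagmo implementation referred to above: the default parameter values, the initial velocity scale, the absence of neighborhoods, and the choice to draw the random factors independently per coordinate are all assumptions of this sketch.

```python
import numpy as np


def pso(f, dim, n_particles=30, iters=200, alpha=0.7, beta=1.5, gamma=1.5,
        bounds=(-5.0, 5.0), seed=0):
    """Minimize f over R^dim with a plain global-best particle swarm."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))
    v = 0.1 * (hi - lo) * rng.uniform(-1.0, 1.0, size=(n_particles, dim))
    p = x.copy()                                  # personal best locations
    p_val = np.array([f(xi) for xi in x])
    best = int(np.argmin(p_val))
    g, g_val = p[best].copy(), float(p_val[best])  # global best location
    for _ in range(iters):
        # Random factors in [0, 1], drawn per coordinate (assumed convention).
        r_beta = rng.uniform(size=(n_particles, dim))
        r_gamma = rng.uniform(size=(n_particles, dim))
        v = alpha * v + beta * r_beta * (p - x) + gamma * r_gamma * (g - x)
        x = x + v
        vals = np.array([f(xi) for xi in x])
        improved = vals < p_val
        p[improved] = x[improved]
        p_val[improved] = vals[improved]
        best = int(np.argmin(p_val))
        if p_val[best] < g_val:
            g, g_val = p[best].copy(), float(p_val[best])
    return g, g_val


def sphere(x):
    return float(np.sum(x ** 2))


location, value = pso(sphere, dim=3)
print(location, value)
```

In the setting of this paper, `f` would be the negative coherent information (or $Q_m$) as a function of the network parameters; the sphere function is only a stand-in test objective.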

G.2. Artificial Bee Colonization
Artificial bee colonization (ABC) [Kar05] is another meta-heuristic, derivative-free global optimization technique based on the principle of swarm intelligence. The algorithm works as follows: The population consists of N employer bees and N onlooker bees. While the employer bees explore the neighborhood of randomly created 'food sources' (i.e., points in the landscape with a low objective function value for a minimization problem), the onlooker bees evaluate the food sources according to the promise given by the fitness of the food source, and join the employer bees in exploring the neighborhood of those food sources. If an employer bee cannot find any new food around its location for a certain number of iterations (i.e., it fails to find points in the neighborhood of the food source with a lower objective function value), it is converted into a scout bee and assigned to a new random food source.
In more detail, to minimize a function $f : \mathbb{R}^D \to \mathbb{R}$, an employer bee at site $x_i$ randomly explores the neighborhood of $x_i$ by probing the location $x_i'$, which differs from $x_i$ in exactly one randomly drawn component $j \in [D]$ according to $x'_{i,j} = x_{i,j} + r\,(x_{i,j} - x_{k,j})$, where $x_k \neq x_i$ is another randomly drawn food source, and $r \in [-1, 1]$ is a uniform random number. If $f(x_i') < f(x_i)$, the employer bee switches to $x_i'$ and continues exploring its neighborhood. The fitness of the food source $x_i$ is defined as $\mathrm{fit}_i := (1 + f(x_i))^{-1}$, and each onlooker bee reinforces the employer bee group by selecting a food source according to the probability distribution $\{\mathrm{fit}_i / \sum_i \mathrm{fit}_i\}_i$. We use the standard implementation of ABC found in the C++ optimization library Pagmo [Abc], as well as our own implementation of the standard algorithm in MATLAB (see [Anc]).
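The employer, onlooker, and scout phases can be sketched compactly in Python. This is a minimal illustration and neither the Pagmo nor the MATLAB implementation used in the paper: the population size, the stagnation limit `limit`, the search box, and the test objective are assumptions, and the fitness formula requires $f \geq 0$.

```python
import numpy as np


def abc_minimize(f, dim, n_sources=20, iters=500, limit=50,
                 bounds=(-5.0, 5.0), seed=0):
    """Minimize a non-negative function f with a basic bee-colony scheme."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_sources, dim))   # food sources
    fx = np.array([f(xi) for xi in x])
    trials = np.zeros(n_sources, dtype=int)          # failed attempts per source
    i0 = int(np.argmin(fx))
    best_x, best_f = x[i0].copy(), float(fx[i0])

    def explore(i):
        # Perturb one random component j relative to another source k != i.
        j = rng.integers(dim)
        k = rng.choice([m for m in range(n_sources) if m != i])
        cand = x[i].copy()
        cand[j] += rng.uniform(-1.0, 1.0) * (x[i, j] - x[k, j])
        fc = f(cand)
        if fc < fx[i]:
            x[i], fx[i], trials[i] = cand, fc, 0
        else:
            trials[i] += 1

    for _ in range(iters):
        for i in range(n_sources):                   # employer bees
            explore(i)
        fit = 1.0 / (1.0 + fx)                       # fitness, assumes f >= 0
        probs = fit / fit.sum()
        for i in rng.choice(n_sources, size=n_sources, p=probs):  # onlookers
            explore(i)
        i0 = int(np.argmin(fx))
        if fx[i0] < best_f:
            best_x, best_f = x[i0].copy(), float(fx[i0])
        for i in np.where(trials > limit)[0]:        # scouts: abandon source
            x[i] = rng.uniform(lo, hi, size=dim)
            fx[i] = f(x[i])
            trials[i] = 0
    return best_x, best_f


def sphere(x):
    return float(np.sum(x ** 2))


location, value = abc_minimize(sphere, dim=3)
print(location, value)
```

The best point is tracked separately because a scout may abandon the currently best food source once it has stagnated for more than `limit` attempts.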

G.3. Pattern Search
The third derivative-free optimization technique we use in this paper is called pattern search or direct search. To minimize a function $f : \mathbb{R}^D \to \mathbb{R}$, the algorithm takes as input a starting point $x_0 \in \mathbb{R}^D$ together with the objective function value $f(x_0)$, and creates a mesh of probing points around the starting point. In each iteration or poll, the objective function is evaluated at each mesh point. If for one of the mesh points, say $x_1$, the objective function value is lower than the current one (at $x_0$), the algorithm centers at $x_1$ and creates a new mesh.
There are different ways to create the mesh at a new center point. In a popular variant called generalized pattern search (GPS), the new probing points $y_i$ of the mesh are defined by a fixed set $S \subset \mathbb{R}^D$ of vectors. Common choices are $S_{2D} = \{\pm e_i\}_{i=1}^{D}$, where $e_i$ denotes the $i$-th standard basis vector, or $S_{D+1} = \{e_i\}_{i=1}^{D} \cup \{-(e_1 + \dots + e_D)\}$. In the $k$-th round with center point $x_{k-1}$, the points of the mesh are defined as $y_i = x_{k-1} + \Delta v_i$, where $v_i \in S$, and $\Delta$ is the mesh constant. In a successful poll (i.e., when a new point with a lower objective function value is found), the mesh constant for the new mesh is doubled. If the poll is unsuccessful, the center point remains the same and $\Delta$ is halved.
Another popular variant is called mesh adaptive direct search (MADS). Here, the set $R \subset \mathbb{R}^D$ of vectors for the new mesh points is randomly created after each successful poll. In analogy to the GPS variant above, common choices are $R_{2D} = \{\pm v_i\}_{i=1}^{D}$ and $R_{D+1} = \{v_i\}_{i=1}^{D} \cup \{-(v_1 + \dots + v_D)\}$, where in each case the $v_i$ are random vectors. The above variants of pattern search are available in the Global Optimization Toolbox of MATLAB [Psm]. We refer to the MATLAB files in [Anc] for the algorithm settings used in this paper.
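As an illustration of the GPS variant with the $2D$ directions $\pm e_i$, the following Python sketch polls all mesh points, moves to the best one if it improves the objective, and doubles or halves the mesh constant accordingly. It is not the MATLAB implementation; the stopping tolerance, the poll limit, and the quadratic test objective are assumptions of this sketch.

```python
import numpy as np


def pattern_search(f, x0, delta=1.0, tol=1e-8, max_polls=10_000):
    """Generalized pattern search with the fixed direction set {+e_i, -e_i}."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    dim = x.size
    directions = np.vstack([np.eye(dim), -np.eye(dim)])
    for _ in range(max_polls):
        if delta < tol:
            break
        mesh = x + delta * directions            # one probing point per direction
        vals = np.array([f(y) for y in mesh])
        i = int(np.argmin(vals))
        if vals[i] < fx:                         # successful poll
            x, fx = mesh[i], float(vals[i])
            delta *= 2.0
        else:                                    # unsuccessful poll
            delta *= 0.5
    return x, fx


def quadratic(x):
    return float((x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2)


location, value = pattern_search(quadratic, [3.0, 3.0])
print(location, value)
```

A MADS-style variant would differ only in how `directions` is generated, drawing fresh random vectors instead of reusing the standard basis.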

G.4. Simple Genetic Algorithm
The fourth derivative-free optimization algorithm is a genetic algorithm, which is related to evolutionary methods such as PSO and ABC, but motivated by the process of gene evolution.
Starting from a random selection of $N$ so-called "chromosomes" $x_i^{(0)}$ (where each vector component is called a "gene"), a traditional implementation follows four steps.
Selection. Pick random tuples of size s from the chromosome pool, and select the ones with the best function value within each tuple; this creates a selected chromosome pool of size less than N .
Crossover. Randomly select a parent tuple (can be more than two, and up to the entire selected pool). Merge the parents, e.g. by selecting a random chromosome, and replacing each gene (coordinate of x (0) i ) with some probability p by genes from other chromosomes. Continue creating child chromosomes until the new pool reaches size N .
Mutation. Randomize child genes within each chromosome according to some probability distribution $\mathcal{D}$ and mutation probability $m$. A popular variant is polynomial mutation, where $\mathcal{D} \sim 1/\mathrm{poly}$, which introduces a stronger bias towards creating children close to their parents.
Reinsertion. Merge the parent and child chromosome pools and select the $N$ fittest candidates.
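The four steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the paper: tournament selection with tuple size `s`, uniform crossover between two parents, and Gaussian mutation (swapped in here for polynomial mutation to keep the code short) are assumptions, as are all default parameter values.

```python
import numpy as np


def genetic_minimize(f, dim, pop_size=40, generations=200, s=3, p=0.5,
                     m=0.1, sigma=0.3, bounds=(-5.0, 5.0), seed=0):
    """Minimize f with selection, crossover, mutation and reinsertion."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    for _ in range(generations):
        vals = np.array([f(c) for c in pop])
        # Selection: keep the best chromosome of each random tuple of size s.
        n_sel = pop_size // 2
        groups = rng.integers(pop_size, size=(n_sel, s))
        winners = groups[np.arange(n_sel), np.argmin(vals[groups], axis=1)]
        selected = pop[winners]
        # Crossover: each child gene comes from parent b with probability p.
        a = selected[rng.integers(n_sel, size=pop_size)]
        b = selected[rng.integers(n_sel, size=pop_size)]
        children = np.where(rng.random((pop_size, dim)) < p, b, a)
        # Mutation: perturb each gene with probability m (Gaussian, assumed).
        mask = rng.random((pop_size, dim)) < m
        children = children + mask * rng.normal(0.0, sigma, size=(pop_size, dim))
        # Reinsertion: merge parents and children, keep the fittest pop_size.
        merged = np.vstack([pop, children])
        merged_vals = np.array([f(c) for c in merged])
        pop = merged[np.argsort(merged_vals)[:pop_size]]
    best = pop[0]  # the pool is sorted by fitness after reinsertion
    return best, float(f(best))


def sphere(x):
    return float(np.sum(x ** 2))


location, value = genetic_minimize(sphere, dim=3)
print(location, value)
```

Because reinsertion keeps the best of parents and children, the best objective value in the pool never increases from one generation to the next.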