Modeling Sequences with Quantum States: A Look Under the Hood

Classical probability distributions on sets of sequences can be modeled using quantum states. Here, we do so with a quantum state that is pure and entangled. Because it is entangled, the reduced densities that describe subsystems also carry information about the complementary subsystem. This is in contrast to the classical marginal distributions on a subsystem, in which information about the complementary system has been integrated out and lost. A training algorithm based on the density matrix renormalization group (DMRG) procedure uses the extra information contained in the reduced densities and organizes it into a tensor network model. An understanding of the extra information contained in the reduced densities allows us to examine the mechanics of this DMRG algorithm and study the generalization error of the resulting model. As an illustration, we work with the even-parity dataset and produce an estimate for the generalization error as a function of the fraction of the dataset used in training.


Introduction
In this paper, we present a deterministic algorithm for unsupervised generative modeling on strings using tensor networks. The algorithm is deterministic with a fixed number of steps, and the resulting model has a perfect sampling algorithm that allows efficient sampling from marginal distributions, or sampling conditioned on a substring. The algorithm is inspired by the density matrix renormalization group (DMRG) procedure [1,2,3]. This approach, at its heart, involves only simple linear algebra, which allows us to give a detailed "under the hood" look at the algorithm in action. Our analysis illustrates how to interpret the trained model and how to go beyond worst-case bounds on generalization errors. We work through the algorithm with an exemplar dataset to produce a prediction for the generalization error as a function of the fraction of data used in training, which closely approximates the generalization error observed in experiments.
The machine learning problem of interest is to learn a probability distribution on a set of sequences from a finite training set of samples. For us, an important technical and conceptual first step is to pass from Finite Sets to Functions on Finite Sets. Functions on sets have more structure than sets themselves, and we find that the extra structure is meaningful. Furthermore, well-understood concepts and techniques in quantum physics give us powerful tools to exploit this extra structure without incurring significant algorithmic costs [4]. We emphasize that it is not necessary that the datasets being modeled have any inherently quantum properties or interpretation. The inductive bias of the model can be understood as a kind of low-rank factorization hypothesis, a point we expand upon in this paper.
Reduced density operators play a central role in our model. In a happy coincidence, they play the central role in both the model's theoretical inspiration and the training algorithm. There is structure in reduced densities that inspires us to model classical probability distributions using a quantum model. The training algorithm amounts to successively matching reduced densities, a process which leads inevitably to a tensor network model, which may be thought of as a sequence of compatible autoencoders. We refer readers unfamiliar with tensor diagram notation to references such as [5,6,7]. This paper also builds on investigations of tensor networks as models for machine learning tasks. Tensor networks have been demonstrated to give good results for supervised learning and regression tasks [8,3,9,10,11,12,13,14]. They have also been applied successfully to unsupervised, generative modeling [15,16,17,18], including a study based on the parity dataset we use here [17]. This work focuses on the latter task, proposing and studying an alternative algorithm for optimizing MPS for generative modeling. The expressivity of models like the one considered in this paper has been studied [19]. In this paper, we focus on understanding how our training algorithm learns to generalize. We thank Evenbly, James Stokes, and Yiannis Vlassopoulos for helpful discussions, and are happy to acknowledge KITP Santa Barbara, the Flatiron Institute, and Tunnel for support and excellent working conditions.

Densities and reduced densities
For our purposes, the passage from classical to quantum can be thought of as the passage from Finite Sets to Functions on Finite Sets, which have a natural Hilbert space structure. We are interested in probability distributions on finite sets. The quantum version of a probability distribution is a density operator on a Hilbert space. The quantum version of a marginal probability distribution is a reduced density operator. The operation that plays the role of marginalization is the partial trace. In our setup, the reduced densities contain more information than the marginal distributions associated to them and much of our work concerns this extra information.
Given a finite set S, one has the free vector space V = C^S consisting of complex-valued functions on S, which is a Hilbert space with inner product

⟨ψ|φ⟩ := Σ_{s∈S} ψ(s)* φ(s),

where * denotes complex conjugation. The free vector space comes with a natural map S → C^S, which we recall in a moment. To avoid confusion, it is helpful to use notation that distinguishes between an element s ∈ S and its image in C^S, which is a vector. Commonly, the vector image of s is denoted with a boldface font or an overset arrow. We like the bra and ket notation, which is better when inner products are involved. For any s ∈ S, let |s⟩ denote the function S → C that sends s ↦ 1 and s′ ↦ 0 for s′ ≠ s. The set {|s⟩} is an orthonormal basis for V. If one chooses an ordering on the set S, say S = {s_1, …, s_d}, then |s_j⟩ is identified with the jth standard basis vector in C^d, thus defining an isometric isomorphism V ≅ C^d and a "one-hot" encoding S → C^d. More generally, we denote elements of V by ket notation |ψ⟩ ∈ V.
For any |ψ⟩ ∈ V, there is a linear functional in V* whose value on |φ⟩ ∈ V is the inner product ⟨ψ|φ⟩. We denote this linear functional by the succinct bra notation ⟨ψ| ∈ V*. Every linear functional in V* is of the form ⟨ψ| for some |ψ⟩ ∈ V. We have vectors |ψ⟩ ∈ V and covectors ⟨ψ| ∈ V*, and the map |ψ⟩ ↔ ⟨ψ| defines a natural isomorphism between V and V*. We have chosen to distinguish between vectors and covectors with bra and ket notation; we will not imbue upper and lower indices with any special meaning.
When several spaces V, W, … are in play, some tensor product symbols are suppressed. So, for instance, if |ψ⟩ ∈ V and |φ⟩ ∈ W, we will write |ψ⟩|φ⟩, or even |ψφ⟩, instead of |ψ⟩ ⊗ |φ⟩ ∈ V ⊗ W. An expression like |φ⟩⟨ψ| is an element of W ⊗ V*, naturally identified with an operator V → W. The expression |ψ⟩⟨ψ| is an element of End(V). Here, End(V) denotes the space of all linear operators on V and, in the presence of a basis, is identified with dim(V) × dim(V) matrices. If |ψ⟩ is a unit vector, then the operator |ψ⟩⟨ψ| is orthogonal projection onto |ψ⟩: it maps |ψ⟩ ↦ |ψ⟩ and maps every vector perpendicular to |ψ⟩ to zero.
A density operator, or just density for short, is a unit-trace, positive semidefinite linear operator on a Hilbert space. Sometimes a density is called a quantum state. If S is a finite set and V = C^S, then a density ρ : V → V defines a probability distribution π_ρ : S → R on S by the Born rule

(1) π_ρ(s) := ⟨s|ρ|s⟩.

Going the other way, there are multiple ways to define a density ρ : V → V from a classical probability distribution π on S so that π_ρ = π. One way is as a diagonal operator: ρ_diag := Σ_{s∈S} π(s) |s⟩⟨s|. Another way is to define the unit vector |ψ⟩ := Σ_{s∈S} √(π(s)) |s⟩ and set

(2) ρ_π := |ψ⟩⟨ψ|,

orthogonal projection onto |ψ⟩. There exist other densities that realize π via the Born rule, but think of the diagonal density and projection onto |ψ⟩ as two extremes: the density ρ_π has minimal rank and ρ_diag has maximal rank. In the language of quantum mechanics, a state is pure if it has rank one and is mixed otherwise. The degree to which a state is mixed is measured by its von Neumann entropy, −tr(ρ ln ρ), which ranges from zero in the case of ρ_π up to the Shannon entropy of the classical distribution π in the case of ρ_diag. In this paper, we always use the pure state ρ := ρ_π. To summarize, we associate to any probability distribution π : S → R the density ρ_π : V → V defined by Equation (2), which has the property that π_{ρ_π} = π. If a set S is a Cartesian product S = A × B, then the Hilbert space C^S decomposes as a tensor product C^S ≅ C^A ⊗ C^B. In this case, a density ρ : C^A ⊗ C^B → C^A ⊗ C^B is the quantum version of a joint probability distribution π : A × B → R. By an operation analogous to marginalization, ρ gives rise to two densities ρ_A : C^A → C^A and ρ_B : C^B → C^B, which we refer to as reduced densities. We now describe this operation, which is called the partial trace.
If X and Y are finite-dimensional vector spaces, then End(X ⊗ Y) is isomorphic to End(X) ⊗ End(Y). Using this isomorphism, there are maps determined by tr_Y(f ⊗ g) := f tr(g) and tr_X(f ⊗ g) := g tr(f) for f ∈ End(X) and g ∈ End(Y). The maps tr_Y and tr_X are called partial traces. The partial trace preserves both trace and positive semidefiniteness, and so the image of any density ρ ∈ End(X ⊗ Y) under partial trace defines reduced densities tr_Y ρ ∈ End(X) and tr_X ρ ∈ End(Y). It is worth noting that while we have maps End(X) ⊗ End(Y) → End(X) and End(X) ⊗ End(Y) → End(Y), there do not exist natural maps V ⊗ W → V or V ⊗ W → W for arbitrary vector spaces V and W; the partial trace is special: it is defined in the case that the factors are endomorphism spaces.
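To make the partial trace concrete, here is a minimal numpy sketch (our own illustration, not from the text; the function name `partial_trace` is ours): tracing out either factor of the pure entangled state (|00⟩ + |11⟩)/√2 yields the maximally mixed reduced density I/2.

```python
import numpy as np

def partial_trace(rho, dA, dB, keep):
    """Partial trace of a density matrix rho on C^dA (x) C^dB.
    keep='A' returns tr_B(rho) in End(C^dA); keep='B' returns tr_A(rho)."""
    r = rho.reshape(dA, dB, dA, dB)
    if keep == 'A':
        return np.einsum('ajbj->ab', r)  # trace out the second factor
    return np.einsum('iaib->ab', r)      # trace out the first factor

# |psi> = (|00> + |11>)/sqrt(2), a pure entangled state on C^2 (x) C^2
psi = np.zeros(4)
psi[0] = psi[3] = 1 / np.sqrt(2)
rho = np.outer(psi, psi)                 # rho = |psi><psi|

rho_A = partial_trace(rho, 2, 2, 'A')
rho_B = partial_trace(rho, 2, 2, 'B')
# Both reduced densities are maximally mixed, I/2,
# even though the joint state is pure.
print(np.allclose(rho_A, np.eye(2) / 2), np.allclose(rho_B, np.eye(2) / 2))
```

Note that the partial trace preserves the unit trace, as stated above.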
2.1. Reconstructing a pure state from its reduced densities. We now discuss the problem of reconstructing a pure quantum state ρ on a product X ⊗ Y from its reduced densities ρ X and ρ Y .
Using the isomorphism X ≅ X* that is available in any finite-dimensional Hilbert space, one can view any vector |ψ⟩ in a product of Hilbert spaces X ⊗ Y as an element of X* ⊗ Y, hence as a linear map M : X → Y. The singular value decomposition

M = Σ_i σ_i |f_i⟩⟨e_i|

expresses M in terms of orthonormal vectors {|e_i⟩} in X, orthonormal vectors {|f_i⟩} in Y, and positive singular values σ_i; the map M sends |e_i⟩ ↦ σ_i |f_i⟩ and maps the perpendicular complement of the span of the {|e_i⟩} to zero. Now, given a unit vector |ψ⟩ ∈ X ⊗ Y, we have the density ρ = |ψ⟩⟨ψ| ∈ X ⊗ Y ⊗ Y* ⊗ X* and the reduced densities ρ_X : X → X and ρ_Y : Y → Y. The reduced densities of ρ are related to the operator M : X → Y fashioned from |ψ⟩ as follows:

(4) ρ_X = M*M and ρ_Y = MM*,

as illustrated in Figure 2. The singular vectors {|e_i⟩} and {|f_i⟩} of M are precisely the eigenvectors of the reduced densities, with shared eigenvalues λ_i = σ_i². Therefore, the density ρ can be completely reconstructed from its reduced densities ρ_X and ρ_Y. One obtains |ψ⟩ = Σ_i √λ_i |e_i⟩|f_i⟩ by gluing the eigenvectors of the reduced densities along their shared eigenvalues (Figure 3). In the nondegenerate case that the eigenvalues are distinct, there is a unique way to glue the {|e_i⟩} and the {|f_i⟩} and |ψ⟩ is recovered perfectly.
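This gluing can be checked numerically. The sketch below (our own illustration; numpy assumed) forms the reduced densities of a unit vector |ψ⟩, eigendecomposes them, and reassembles Σ_i √λ_i |e_i⟩|f_i⟩; the leftover per-pair sign ambiguity from the eigensolver is fixed against the target state, which is harmless in the nondegenerate case.

```python
import numpy as np

# A unit vector |psi> in X ⊗ Y with X = Y = C^2, stored as a matrix
# with entries Psi[x, y] = <xy|psi>; reshaping gives M : X -> Y.
Psi = np.array([[2.0, 1.0],
                [0.0, 1.0]])
Psi /= np.linalg.norm(Psi)

rho_X = Psi @ Psi.T                 # rho_X = M*M  (real case)
rho_Y = Psi.T @ Psi                 # rho_Y = MM*

lamX, E = np.linalg.eigh(rho_X)     # columns of E: eigenvectors |e_i>
lamY, F = np.linalg.eigh(rho_Y)     # columns of F: eigenvectors |f_i>
E = E[:, np.argsort(lamX)[::-1]]    # sort both by shared eigenvalues,
F = F[:, np.argsort(lamY)[::-1]]    # descending
lam = np.sort(lamX)[::-1]

# Glue eigenvectors along shared eigenvalues. eigh returns each
# eigenvector only up to sign, so fix each term's sign by its overlap
# with the target (unique gluing, since the eigenvalues are distinct).
rec = np.zeros_like(Psi)
for i in range(2):
    term = np.sqrt(lam[i]) * np.outer(E[:, i], F[:, i])
    rec += term if np.sum(term * Psi) >= 0 else -term

print(np.allclose(rec, Psi))        # the pure state is recovered
```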

Reduced densities of classical probability distributions
Let π : S → R be a probability distribution and consider the density ρ_π as in Equation (2). Suppose S ⊂ A × B, and extend π by zero to all of A × B. Let ρ_A = tr_Y ρ and ρ_B = tr_X ρ denote the reduced densities where, as above, X = C^A, Y = C^B, and V = X ⊗ Y. Let us now interpret the matrix representations of these reduced densities. Since |ψ⟩ = Σ_{(a,b)∈A×B} √(π(a,b)) |a⟩|b⟩, where π(a,b) is the probability of the pair (a,b) when it lies in S and zero otherwise, we can understand the (a, a′) entry of the reduced density ρ_A as

⟨a′|ρ_A|a⟩ = Σ_{b∈B} √(π(a,b) π(a′,b)).

In particular, the diagonal entry (ρ_A)_{aa} is Σ_{b∈B} π(a,b) and we see the marginal distribution π_A : A → R along the diagonal of the reduced density ρ_A. We make the consistent observation that ρ_A has unit trace. The off-diagonal entries of ρ_A are determined by the extent to which a, a′ ∈ A have the same continuations in B. Note that ρ_A is symmetric. The reduced density on B is similarly given: ⟨b′|ρ_B|b⟩ = Σ_{a∈A} √(π(a,b) π(a,b′)). So the reduced densities of ρ contain all the information of the marginal distributions π_A and π_B, and more. Now let's take a look at the extra information carried by the reduced densities, which is entirely contained in the off-diagonal entries. Since the entire state, and therefore π itself, can be reconstructed from the eigenvectors and eigenvalues of ρ_A and ρ_B, we know from a high level that this spectral information encodes the conditional probabilities that are lost by the classical process of marginalization. En route to decoding this spectral information, let us describe how an arbitrary density τ is a classical mixture of pure quantum states. If |e_1⟩, …, |e_k⟩ is a basis for the image of a density τ consisting of orthonormal eigenvectors, then the corresponding eigenvalues λ_1, …, λ_k are nonnegative real numbers whose sum is one, and one has

τ = Σ_{i=1}^{k} λ_i |e_i⟩⟨e_i|.

The density τ defines a probability distribution on pure states: the probability of the pure state |e_i⟩⟨e_i| is λ_i. Then |e_i⟩⟨e_i| defines a probability distribution on the computational basis {|s⟩} via the Born rule: the probability of s is ⟨s|e_i⟩⟨e_i|s⟩ = |⟨e_i|s⟩|².
We're interested in the reduced densities of ρ = |ψ⟩⟨ψ|, and in this case there exists a one-to-one correspondence |e_i⟩ ↔ |f_i⟩ between eigenvectors of the reduced densities ρ_A := tr_Y(ρ) and ρ_B := tr_X(ρ) spanning their respective images, so that |ψ⟩ = Σ_i √λ_i |e_i⟩|f_i⟩, as outlined in Section 2.1.
Putting together the general picture of a density as a mixture of pure states with the reduced densities of a pure state leads one to the following paradigm. With probability λ_i, the prefix subsystem will be in a state determined by the corresponding eigenvector |e_i⟩ of ρ_A, and the corresponding suffix subsystem will be in a state determined by the eigenvector |f_i⟩. The vector |e_i⟩ = Σ_a γ_{ai} |a⟩ determines a probability distribution on the set of prefixes A: the probability of the prefix a is |γ_{ai}|². The vector |f_i⟩ = Σ_b β_{bi} |b⟩ determines a probability distribution on the set of suffixes B: the probability of the suffix b is |β_{bi}|². As a final remark, if we had begun with the diagonal density whose Born distribution is also π, then the matrices representing ρ_A and ρ_B would be diagonal, with the marginal distributions on A and B along the diagonals and all off-diagonal entries zero. The eigenvectors of ρ_A and ρ_B would simply be the prefixes |a⟩ and suffixes |b⟩ and would carry no further information; the process of computing reduced densities of ρ_diag is nothing more than the process of marginalization. We always use the pure state ρ = |ψ⟩⟨ψ|, ensuring that the reduced densities carry information about subsystem interactions. The eigenvectors of the reduced densities, which are linear combinations of prefixes and linear combinations of suffixes, interact through their eigenvalues and capture rich information about the prefix-suffix system. Let us summarize. Begin with a classical probability distribution π on a product set S = A × B. Form a density ρ_π on C^{A×B} by the formula in Equation (2). The reduced densities ρ_A and ρ_B on C^A and C^B contain the marginal distributions π_A and π_B on their diagonals, but they are not diagonal operators. The eigenvectors of these reduced densities encode information about prefix-suffix interactions. The prefix-suffix interactions are tantamount to conditional probabilities and carry sufficient information to reconstruct the density ρ.
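These observations can be checked numerically. The following minimal numpy sketch (our own; the toy distribution is hypothetical) builds ρ_π for a small distribution on A × B and confirms that the diagonal of ρ_A is the marginal π_A, while the off-diagonals measure shared continuations.

```python
import numpy as np

# Toy distribution pi on A x B with |A| = |B| = 2,
# supported on (a0,b0), (a0,b1), (a1,b0).
pi = np.array([[0.5, 0.25],
               [0.25, 0.0]])

# |psi> = sum over (a,b) of sqrt(pi(a,b)) |a>|b>, rho = |psi><psi|
psi = np.sqrt(pi).reshape(-1)
rho = np.outer(psi, psi)

# Partial trace over B gives rho_A (remaining axes are a, a')
rho_A = rho.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)

# Diagonal of rho_A = marginal pi_A
print(np.allclose(np.diag(rho_A), pi.sum(axis=1)))
# Off-diagonal: sum over b of sqrt(pi(a0,b) pi(a1,b)), nonzero because
# a0 and a1 share the continuation b0
print(rho_A[0, 1])
```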
3.1. Learning from samples. In the machine learning applications to come, the goal is to learn ρ_π defined in Equation (2) from a set {s_1, …, s_{N_T}} of samples drawn from a probability distribution π. Each sample s_i will be a sequence (x_1, …, x_N) of a fixed length N. The algorithm to learn the density ρ_π on the full set of sequences S is an inductive procedure.
One only works with a density ρ defined using the sample set since the density ρ π for the entire distribution π is unavailable. The procedure begins by computing the reduced density ρ A and its eigenvectors for a subsystem A consisting of short prefixes.
Step by step, the size of the subsystem A is increased until one reaches a point where the suffix subsystem B is small. In a final step, ρ is recombined from the collected eigenvectors of ρ A for all the prefix systems A and the eigenvectors and eigenvalues of ρ B . This procedure leads naturally to a tensor network approximation for ρ.
An important point is that the reduced density ρ A operates in a space whose dimension grows exponentially with the length of the prefix system A. So, instead of computing ρ A exactly, it is computed by a sequence of approximations that keep its rank small. The modeling hypothesis is that π is a distribution whose corresponding quantum state ρ π has low rank in the sense that the reduced densities ρ A and ρ B are low rank operators for all prefix-suffix subsystems A and B. The large rank of the density ρ witnessed from the empirical distribution drawn from π is regarded as sampling error.
Therefore, under the modeling hypothesis, the process of replacing the empirically computed reduced densities with low rank approximations should be thought of as repairing a state damaged by sampling errors. The low rank modeling hypothesis can lead to excellent generalization properties for the model.
Let us continue our analysis of the reduced densities as in the previous sections, using notation appropriate for the machine learning algorithm. Let T = {s_1, …, s_{N_T}} be a training set of samples; we use N_T for the number of training examples. Each sample s_i is a sequence of symbols from a fixed alphabet Σ, of a fixed length N. We designate a cut to obtain a prefix a_i and suffix b_i whose concatenation is the sample s_i. Let π be the resulting empirical distribution on T, so that π(s) = 1/N_T for each s ∈ T. Let us look at the empirical state |ψ⟩ = (1/√N_T) Σ_i |a_i⟩|b_i⟩, the empirical density ρ = |ψ⟩⟨ψ|, and its partial trace

(9) ρ_A = tr_Y ρ = (1/N_T) Σ_{i,j} ⟨b_j|b_i⟩ |a_i⟩⟨a_j|.

Here the sum is expressed in terms of the indices i, j, which range over the number of samples. Up to the overall factor of 1/N_T, the coefficient s(a_i, a_j) of |a_i⟩⟨a_j| is a nonnegative integer, namely the number of times that a_i and a_j have the same continuation b_i = b_j. It may be convenient to have some notation for shared continuations. For any pair a, a′ of elements of A, let

T_{a,a′} := {b ∈ B : (a, b) ∈ T and (a′, b) ∈ T}

be the subset of B consisting of shared continuations of a and a′. So the (a, a′) entry of the matrix representing ρ_A is the cardinality of the set T_{a,a′} multiplied by an overall factor of 1/N_T. A similar combinatorial description holds for the reduced density on B: up to the same factor, the (b, b′) entry is the number of common prefixes that b and b′ share.
The counting involved can be visualized with graphs. Every probability distribution π on a Cartesian product A × B uniquely defines a weighted bipartite graph: the two vertex sets are A and B, and the edge joining a and b is labeled by π(a, b). Here, because we assume the samples in T are distinct, the graph can be simplified, since π(a, b) is either 0 or 1/N_T. We draw an edge from a to b if (a, b) ∈ T, omit the edge if (a, b) ∉ T, and understand the probabilities to be obtained by dividing by N_T, which is the total number of edges in the graph.
In the example above, the total number of edges is the sample size N_T = 6. The probability of (a_1, b_1) is 1/6 and the probability of (a_1, b_2) is 0. Now we illustrate how to read off the entries of the reduced density ρ_A from the graph. There will be an overall factor of 1/N_T multiplying a matrix of nonnegative integers. The diagonal entries are d(a), the degree of vertex a. The (a, a′) entry is the number of shared suffixes, which equals the number of paths of length 2 between a and a′.
Given any graph with |A| = 2, such as the one above, the reduced density on the prefix subsystem is equal to

(11) ρ_A = (1/N_T) [ d(a_1), ℓ ; ℓ, d(a_2) ],

where d(a_i) is the degree of the vertex a_i and ℓ = |T_{a_1,a_2}| is the number of paths of length 2 joining a_1 and a_2.
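The counting rules above can be checked directly from an edge list. Below is a small pure-Python sketch (our own; the six-edge graph is hypothetical and not the figure from the text) in which diagonal entries of N_T·ρ_A are vertex degrees and off-diagonal entries are counts of length-2 paths.

```python
# A hypothetical training set as bipartite edges (prefix, suffix); N_T = 6
T = [('a1', 'b1'), ('a1', 'b3'), ('a2', 'b1'),
     ('a2', 'b2'), ('a2', 'b3'), ('a3', 'b2')]

def continuations(a):
    """Suffixes reachable from prefix a in the bipartite graph."""
    return {b for (p, b) in T if p == a}

def entry(a, a2):
    """(a, a2) entry of N_T * rho_A: the number of shared continuations,
    i.e. length-2 paths joining a and a2 (the degree when a == a2)."""
    return len(continuations(a) & continuations(a2))

# Diagonal entries are vertex degrees...
assert entry('a1', 'a1') == 2 and entry('a2', 'a2') == 3
# ...and off-diagonals count length-2 paths through shared suffixes.
print(entry('a1', 'a2'))  # a1, a2 share suffixes b1 and b3 -> 2
print(entry('a1', 'a3'))  # no shared suffix -> 0
```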

The Training Algorithm
Suppose that |ψ⟩ ∈ V_1 ⊗ ⋯ ⊗ V_N; we depict |ψ⟩ as a single tensor with N dangling legs. There are various sorts of decompositions of such a tensor that are akin to an iterated SVD. We will describe one decomposition that results in a factorization of |ψ⟩ into what is called a matrix product state (MPS) or, synonymously, a tensor train decomposition. The process defines a sequence of "bond" spaces {B_k} and operators {U_k : B_k → B_{k−1} ⊗ V_k}. The initial operator has the form U_1 : B_1 → V_1 and the final tensor has the form U_N ∈ B_{N−1} ⊗ V_N. We begin with B_1 = V_1 and set U_1 : B_1 → V_1 to be the identity. For k = 2, …, N − 1 we will define U_k inductively.
To describe the inductive process, first notice that for any k = 1, …, N − 1 one has the tensor factorization V_1 ⊗ ⋯ ⊗ V_N ≅ (V_1 ⊗ ⋯ ⊗ V_k) ⊗ (V_{k+1} ⊗ ⋯ ⊗ V_N), so that |ψ⟩ may be reshaped into an operator α_k : V_1 ⊗ ⋯ ⊗ V_k → V_{k+1} ⊗ ⋯ ⊗ V_N. The operators U_k compose (with identity factors suppressed) to give U_1 U_2 ⋯ U_k : B_k → V_1 ⊗ ⋯ ⊗ V_k, and one then has the composition α_k U_1 U_2 ⋯ U_k : B_k → V_{k+1} ⊗ ⋯ ⊗ V_N. The inductive hypothesis is that α_k U_1 U_2 ⋯ U_k U*_k ⋯ U*_2 U*_1 = α_k. In the penultimate step, one has the operator α_{N−1} U_1 U_2 ⋯ U_{N−1} : B_{N−1} → V_N. The final step is to define U_N as the adjoint of this operator: U_N := (α_{N−1} U_1 ⋯ U_{N−1})*. Therefore, the entire composition reduces nicely:

U_1 ⋯ U_{N−1} U_N = U_1 ⋯ U_{N−1} U*_{N−1} ⋯ U*_1 α*_{N−1} = α*_{N−1},

where the final equality follows from the adjoint of the inductive hypothesis. The outcome α*_{N−1} : V*_N → V_1 ⊗ ⋯ ⊗ V_{N−1}, after a minor reshaping, is the same as |ψ⟩.
To define the inductive step, assume the spaces B_1, …, B_{k−1} and operators U_1, …, U_{k−1} have been defined and satisfy the inductive hypothesis. Reshape the operator α_{k−1} U_1 ⋯ U_{k−1} : B_{k−1} → V_k ⊗ ⋯ ⊗ V_N into an operator β_{k−1} : B_{k−1} ⊗ V_k → V_{k+1} ⊗ ⋯ ⊗ V_N and take a singular value decomposition β_{k−1} = W_k D_k U*_k, which defines the next bond space B_k as the target of U*_k : B_{k−1} ⊗ V_k → B_k. The adjoint of this map is then defined to be U_k : B_k → B_{k−1} ⊗ V_k and becomes the next tensor in the MPS decomposition. To check that the inductive hypothesis is satisfied, note that β_{k−1} U_k U*_k = β_{k−1}, since β_{k−1} = W_k D_k U*_k and U*_k U_k = 1. In our application, the vector |ψ⟩ and the operators β_{k−1} : B_{k−1} ⊗ V_k → V_{k+1} ⊗ ⋯ ⊗ V_N operate in spaces of such high dimension that neither they, nor a direct SVD of them, is feasible. Nonetheless, the U_k operators can be obtained from an eigendecomposition of a reduced density operating in the effective space B_{k−1} ⊗ V_k: the effective reduced density β*_{k−1} β_{k−1} can be computed as a double sum over the training examples, so we can efficiently compute the tensors required for the inductive steps. Then in the final step, the complementary space is small, so the final map U_N D_N : B_{N−1} → V_N completes the reconstruction.
More specifically, to define the U_k we only need an eigenvector decomposition of β*_{k−1} β_{k−1}, which is given by a formula like the one in Equation (9). In general, when factoring an arbitrary vector as an MPS, the bond spaces B_k grow exponentially fast. Therefore, we may characterize datasets for which the MPS model is a good model by saying that |ψ⟩ as defined in Equation (2) has an MPS factorization whose bond spaces B_k remain small. Alternatively, one can truncate, or restrict the dimensions of, the spaces B_k, resulting in a low-rank MPS approximation of |ψ⟩. As a criterion for this truncation, one can inspect the singular values at each inductive step and discard those which are small according to a predetermined cutoff, together with the corresponding columns of U and W. In the even-parity dataset that we investigate as an example, we always truncate B_k to two dimensions.
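As a concrete illustration of the decomposition just described, here is a minimal numpy sketch (our own simplification: a plain iterated-SVD tensor-train factorization with a bond cutoff, not the density-matrix-based procedure above). On the even-parity state |E_4⟩, truncating every bond space to two dimensions is exact, as claimed.

```python
import numpy as np

def mps_decompose(psi, N, d=2, chi=2):
    """Factor psi (a vector of length d**N) into an MPS / tensor train,
    truncating every bond space to dimension at most chi."""
    cores, rest, r = [], psi.reshape(1, -1), 1
    for _ in range(N - 1):
        U, S, Vh = np.linalg.svd(rest.reshape(r * d, -1), full_matrices=False)
        k = min(chi, int(np.sum(S > 1e-12)))   # discard tiny singular values
        cores.append(U[:, :k].reshape(r, d, k))
        rest, r = S[:k, None] * Vh[:k], k
    cores.append(rest.reshape(r, d, 1))
    return cores

def mps_contract(cores):
    """Contract the cores back into a full state vector."""
    t = cores[0]
    for c in cores[1:]:
        t = np.tensordot(t, c, axes=([-1], [0]))
    return t.reshape(-1)

N = 4
# |E_4>: uniform superposition of the 8 even-parity bitstrings of length 4
psi = np.array([1.0 if bin(i).count('1') % 2 == 0 else 0.0
                for i in range(2**N)])
psi /= np.linalg.norm(psi)

cores = mps_decompose(psi, N)
print(np.allclose(mps_contract(cores), psi))   # exact at bond dimension 2
print([c.shape[2] for c in cores])             # bond dims stay <= 2
```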
To understand whether this kind of low-rank approximation is useful, recall that the eigenvectors and eigenvalues of the reduced densities carry the essential prefix-suffix interactions. Because the training algorithm emphasizes these eigenvalues and eigenvectors as the most important features of the data throughout training, the resulting model should be interpreted as capturing the most important prefix-suffix interactions. We view these prefix-suffix interactions as a proxy for the meaning of substrings within a language of larger strings.

Under the hood
With an in-depth understanding of the training algorithm, we aim to predict experimental results given only the fraction 0 < f ≤ 1 of training samples used. Such an under-the-hood analysis shows that each tensor within the MPS is composed of eigenvectors of a reduced density operator. The eigenvectors can be understood in terms of the reduced density's matrix representation, which contains information from errors accrued in the algorithm's prior steps, along with combinatorial information from the current step. We now describe these ideas in careful detail.
As an example, we perform an analysis of how well the algorithm learns the even-parity dataset. Let Σ = {0, 1} and consider the set Σ^N of bitstrings of a fixed length N. Define the parity of a bitstring (b_1, …, b_N) to be

(15) parity(b_1, …, b_N) := b_1 + b_2 + ⋯ + b_N mod 2.

The set Σ^N is partitioned into even and odd bitstrings: Σ^N = E_N ⊔ O_N. Consider the probability distribution π : Σ^N → R uniformly concentrated on E_N, so that π(s) = 2^{−(N−1)} if s ∈ E_N and π(s) = 0 otherwise. This distribution defines a density ρ_π = |E_N⟩⟨E_N|, where |E_N⟩ := 2^{−(N−1)/2} Σ_{s∈E_N} |s⟩ as in Equation (2). To begin our analysis of |ψ⟩, let us closely inspect the algorithm's second step. The ideas therein will generalize to subsequent steps.
In step 2, we view each sample s as a prefix-suffix pair (a, b) where a ∈ Σ² and b ∈ Σ^{N−2}. We visualize the training set T as a bipartite graph: vertices represent prefixes a and suffixes b, and there is an edge joining a and b if and only if (a, b) ∈ T. Since every sample is even, an even prefix is only ever paired with an even suffix and an odd prefix with an odd suffix, so the reduced density ρ_2 is block diagonal: an even block spanned by |00⟩ and |11⟩ and an odd block spanned by |01⟩ and |10⟩. Consider first the idealized scenario in which every even sequence is present in the training set. In this case, ρ_2 is a rank-2 operator with two eigenvectors, one from each block: the eigendecomposition ρ_2 = U_2 D_2 U*_2 is then determined by |E_2⟩ = (1/√2)(|00⟩ + |11⟩), the normalized sum of even prefixes of length 2, and |O_2⟩ = (1/√2)(|01⟩ + |10⟩), the normalized sum of odd prefixes of length 2. As a matrix, U_2 has |E_2⟩ and |O_2⟩ along its rows. We think of it as a "summarizer": it projects a prefix onto an axis that can be identified with either |E_2⟩ or |O_2⟩ according to its parity, perfectly summarizing the information of that prefix required to understand which suffixes it is paired with.
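The idealized step-2 computation can be verified numerically (our own numpy sketch, taking N = 4 so that the full training set is the 8 even bitstrings): the reduced density on length-2 prefixes has exactly two nonzero eigenvalues, both 1/2, with eigenvectors |E_2⟩ and |O_2⟩.

```python
import numpy as np

N = 4
# Training set = every even-parity bitstring of length N (the f = 1 case)
T = [s for s in range(2**N) if bin(s).count('1') % 2 == 0]

# |psi> as a (prefix, suffix) matrix: prefix = top 2 bits, suffix = bottom 2
psi = np.zeros((4, 2**(N - 2)))
for s in T:
    psi[s >> (N - 2), s & (2**(N - 2) - 1)] += 1
psi /= np.linalg.norm(psi)

rho2 = psi @ psi.T          # reduced density on prefixes, a 4 x 4 matrix
lam = np.linalg.eigvalsh(rho2)

# Prefix basis order: |00>, |01>, |10>, |11>
E2 = np.array([1, 0, 0, 1]) / np.sqrt(2)   # (|00> + |11>)/sqrt(2)
O2 = np.array([0, 1, 1, 0]) / np.sqrt(2)   # (|01> + |10>)/sqrt(2)

print(np.allclose(np.sort(lam)[-2:], [0.5, 0.5]))   # two eigenvalues 1/2
print(np.allclose(rho2 @ E2, 0.5 * E2),             # |E_2> and |O_2> are
      np.allclose(rho2 @ O2, 0.5 * O2))             # the eigenvectors
```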
More generally, however, if T ≠ E_N then the reduced density ρ_2 may be full rank. In this case we choose the eigenvectors |Ẽ_2⟩, |Õ_2⟩ that correspond to the two largest eigenvalues of ρ_2; we assume these eigenvectors come from distinct blocks. This defines the tensor U_2, which as a matrix has |Ẽ_2⟩ and |Õ_2⟩ along its rows, where

|Ẽ_2⟩ = cos θ_2 |00⟩ + sin θ_2 |11⟩ and |Õ_2⟩ = cos φ_2 |01⟩ + sin φ_2 |10⟩

for some angles θ_2 and φ_2. These angles can be computed following the expression in (13) for the eigenvectors. Write

(18) ρ_2 = (1/N_T) ( [e0, s_e ; s_e, o1] ⊕ [e1, s_o ; s_o, o0] ),

where the even block is written in the basis {|00⟩, |11⟩} and the odd block in the basis {|01⟩, |10⟩}, and let G_e := e0 − o1 and G_o := e1 − o0 denote the gaps between the diagonal entries in each block. The angles should be thought of as measuring the deviation from perfect learning in step 2: if f = 1 then G_e = G_o = 0, and so θ_2 = φ_2 = π/4, which implies |Ẽ_2⟩ = |E_2⟩ and |Õ_2⟩ = |O_2⟩. In this case, step 2 has worked perfectly. Note that this is not an if-and-only-if scenario. Even if f < 1, the reduced density may still have |E_2⟩ and |O_2⟩ as its eigenvectors. Indeed, this occurs whenever G_e = G_o = 0 and s_e, s_o ≠ 0. In that case, the eigenvectors of ρ_2 are the desired parity vectors |E_2⟩, |O_2⟩, and the summarizer U_2 obtained is a true summarization tensor. But if G_e or G_o is nonzero, then step 2 induces a summarization error, which we measure as the deviation of θ_2 and φ_2 from the desired π/4. The analysis described here is repeated at each subsequent step k = 3, …, N, with minor adjustments to the combinatorics. So let us now describe the general schema. In the kth step of the training algorithm, each sample is cut after the kth bit and viewed as a prefix-suffix pair s = (a, b) where a ∈ Σ^k and b ∈ Σ^{N−k}. Let |ψ_k⟩ ∈ C^{Σ^k} ⊗ C^{Σ^{N−k}} denote the sum of the samples after having completed step k − 1.
Let ρ_k := tr_{Σ^{N−k}} |ψ_k⟩⟨ψ_k| denote the reduced density on the prefix subsystem at step k. It is an operator on B_{k−1} ⊗ V_k, where B_{k−1} is a 2-dimensional space which may be identified with the span of the eigenvectors associated to the two largest eigenvalues of ρ_{k−1}. As a matrix, ρ_k is a direct sum of 2 × 2 blocks,

(19) ρ_k = (1/N_T) ( [e0, s_e ; s_e, o1] ⊕ [e1, s_o ; s_o, o0] ),

written in the ordered basis {|Ẽ_{k−1}⟩|0⟩, |Õ_{k−1}⟩|1⟩, |Ẽ_{k−1}⟩|1⟩, |Õ_{k−1}⟩|0⟩}. We postpone a description of the entries until Section 5.2. But know that, as in the case when k = 2, the upper and lower blocks contain combinatorial information about prefixes of even and odd parity, respectively. As before, we are interested in the top eigenvectors |Ẽ_k⟩, |Õ_k⟩ contributed by each block. They define the tensor U_k, which as a matrix has |Ẽ_k⟩ and |Õ_k⟩ along its rows, and can be understood inductively. The eigenvectors contain combinatorial information from step k along with data from step k − 1. Let |Ẽ_1⟩ := |0⟩ and |Õ_1⟩ := |1⟩. Then for k ≥ 2,

(13) |Ẽ_k⟩ = cos θ_k |Ẽ_{k−1}⟩|0⟩ + sin θ_k |Õ_{k−1}⟩|1⟩ and |Õ_k⟩ = cos φ_k |Ẽ_{k−1}⟩|1⟩ + sin φ_k |Õ_{k−1}⟩|0⟩,

where the angles satisfy

(20) tan 2θ_k = 2s_e / G_e and tan 2φ_k = 2s_o / G_o,

with branches chosen so that 0 ≤ θ_k, φ_k ≤ π/2. Again, the angles are a measurement of the error accrued in step k. Significantly, no error is accrued when the gaps G_e := e0 − o1 and G_o := e1 − o0 are zero and the off-diagonals s_e, s_o are nonzero, for then θ_k = φ_k = π/4. This outcome, or one close to it, is statistically favored for a wide range of training fractions.
As a matrix, U_k is the 2 × 4 matrix with rows |Ẽ_k⟩ and |Õ_k⟩, and so U_k is akin to a map B_{k−1} ⊗ V_k → B_k that combines previously summarized information from B_{k−1} with new information from V_k. It then summarizes the resulting data by projecting onto one of two orthogonal vectors, which may be identified with |E_k⟩ or |O_k⟩, in the new bond space B_k.
The true orientation of the arrows on U_k is down-left, rather than up-right. But the vector spaces in question are finite-dimensional, and our standard bases provide an isomorphism between a space and its dual. That is, no information is lost by momentarily adjusting the arrows for the purposes of sharing intuition. In summary, this template provides a concrete handle on the tensors U_k that comprise the MPS factorization of |ψ⟩.

5.1. High-level summary. We close by summarizing the high-level ideas present in this under-the-hood analysis. At the kth step of the training algorithm one obtains a 4 × 4 block diagonal reduced density matrix ρ_k. It is given in Equation (18) in the case when k = 2 and as in Equation (19) when k > 2. These matrices are obtained by tracing out the suffix subsystem from the projection |ψ_k⟩⟨ψ_k|, where |ψ_k⟩ is the sum of the samples in the training set after having completed step k − 1. Since |ψ_k⟩ depends on the error obtained in step k − 1, so does ρ_k. This error is defined by the angles θ_{k−1} and φ_{k−1}. As shown in Equation (20), these angles, and hence the error, are functions of the entries of the matrix representing ρ_{k−1}. So the kth level density takes into account the errors accrued at each preceding step as well as combinatorial information in the present step. A partial trace computation thus directly leads to the matrix representation for ρ_k given in Equation (19). Explicitly, the nonzero entries of the matrix are computed by Equations (23) and (24). With this, one has full knowledge of the matrix ρ_k and therefore of its eigenvectors |Ẽ_k⟩, |Õ_k⟩. Written in the computational basis, they are of the form shown in Equation (13). These two eigenvectors then assemble to form the rows of the tensor U_k, when viewed as a 2 × 4 matrix.
This analysis gives a thorough understanding of the error propagated at each step of the algorithm, as well as of the final MPS |ψ_MPS⟩. To measure the algorithm's performance, we begin by evaluating the inner product of this vector with an MPS decomposition of the target vector |E_N⟩.
The kth tensor comprising the decomposition of |E_N⟩ is equal to U_k with θ_k and φ_k evaluated at π/4. The contraction thus results in a sum of products of cos θ_k, sin θ_k, cos φ_k, sin φ_k for k = 2, …, N. More concretely, for each even bitstring s ∈ E_N the inner product ⟨s|ψ_MPS⟩ is the square root of the probability of s. For now, we'll refer to it as the weight w(s) := ⟨s|ψ_MPS⟩ associated to the sample s. For each s, its weight w(s) is a product of various cos θ_k, sin θ_k, cos φ_k, sin φ_k, the details of which are given in Section 5.2. The final overlap is then the sum

⟨E_N|ψ_MPS⟩ = 2^{−(N−1)/2} Σ_{s∈E_N} w(s).

Now, suppose the training set consists of a fraction f of the entire population. The entries of the reduced densities in (19) are described combinatorially, as detailed in the next section. This makes it possible to make statistical estimates for the gaps G_e and G_o and the off-diagonal entries s_e and s_o in (20). Therefore, we can make statistical predictions for the angles θ_k and φ_k, hence for the tensors U_k comprising the trained MPS, and hence for the resulting generalization error. The results are plotted in Figure 4, where we use the Bhattacharyya distance between the true population distribution and the one defined by either an experimentally trained MPS or our theoretical prediction as a proxy for generalization error. The theoretical curve could, in principle, be improved by making more accurate statistical estimates for the combinatorics involved.

5.2. Combinatorics of reduced densities. We now describe the entries of the kth-level reduced density in Equation (19). They depend on certain combinatorics at step k, as well as on error accumulated in the previous step. The latter has an inductive description. To start, observe that the parity of a prefix a ∈ Σ^k is determined by its last bit together with the parity of its first k − 1 bits. The set Σ^k thus partitions into four sets, E0, E1, O0, and O1, where the letter records the parity of the first k − 1 bits and the digit records the last bit. By viewing the training set as a bipartite graph, one has a visual understanding of these sets: E0 contains all prefixes of even parity whose last bit is 0; O1 contains all prefixes of even parity whose last bit is 1, and so on.
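The four-way partition above can be sketched in a few lines; the choice k = 3 is just an example.

```python
from itertools import product

# Sketch of the partition of Sigma^k: the label's letter is the parity of
# the first k-1 bits (E = even, O = odd) and its digit is the last bit.
def classify(prefix):
    head_parity = prefix[:-1].count("1") % 2
    return ("E" if head_parity == 0 else "O") + prefix[-1]

k = 3
buckets = {"E0": [], "E1": [], "O0": [], "O1": []}
for bits in product("01", repeat=k):
    a = "".join(bits)
    buckets[classify(a)].append(a)
print(buckets)  # E0 and O1 together hold the even-parity prefixes
```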
In the example below with k = 3, we use color to distinguish each set. As shown, each prefix also has a weight that records its contribution to the error accumulated in previous steps. Concretely, we assign to each prefix a ∈ Σ^k a weight w(a), which is a product of k − 2 factors. For 2 ≤ i ≤ k − 1, the ith factor of w(a) is defined to be
• cos θ_i if the parity of the first i − 1 bits is even and the ith bit is 0,
• sin θ_i if the parity of the first i − 1 bits is odd and the ith bit is 1,
• cos φ_i if the parity of the first i − 1 bits is even and the ith bit is 1,
• sin φ_i if the parity of the first i − 1 bits is odd and the ith bit is 0.
For example, if k = 3 then w(011) = cos φ_2. If k = 5 then w(01101) = cos θ_4 sin θ_3 cos φ_2. These weights are naturally associated to each tensor. For instance, recalling that each tensor U_k is akin to a summarizer, one sees w(01101) arise from the contraction of the corresponding tensors. We can now describe the entries of the reduced density defined in Equation (19). The first diagonal entry, e0, is given in Equation (23); the other diagonal entries are defined similarly. If perfect learning occurs, then e0 is, up to a normalizing constant, the sum of the squares of the degrees of each suffix with respect to E0. For example, in the graph below, e0 is proportional to 2² + 2² + 1² = 9. In general, though, the summands will not be integers but rather products of weights. The off-diagonal entry s_e in the even block of the reduced density is given in Equation (24). When perfect learning occurs, s_e counts the number of paths of length 2, where a path is comprised of one edge from E0 and one edge from O1. For example, in the graph below, s_e = 3.
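The weight w(a) can be sketched directly from the four cases above; the angle values here are placeholders, and the paper's 1-based indexing i = 2, . . . , k − 1 is kept.

```python
import numpy as np

# Sketch of the prefix weight w(a): a product of k-2 factors, the i-th
# determined by the parity of the first i-1 bits and the value of bit i.
def weight(a, theta, phi):
    """theta, phi: dicts mapping the (1-based) index i to angles."""
    w = 1.0
    for i in range(2, len(a)):           # i = 2, ..., k-1
        head_even = a[: i - 1].count("1") % 2 == 0
        bit = a[i - 1]
        if head_even and bit == "0":
            w *= np.cos(theta[i])
        elif not head_even and bit == "1":
            w *= np.sin(theta[i])
        elif head_even and bit == "1":
            w *= np.cos(phi[i])
        else:                            # odd head, bit == "0"
            w *= np.sin(phi[i])
    return w

# Placeholder angles; the algorithm would determine these from the data.
theta = {i: 0.3 for i in range(2, 6)}
phi = {i: 0.5 for i in range(2, 6)}
# Reproduces the worked examples: w(011) = cos(phi_2) and
# w(01101) = cos(theta_4) sin(theta_3) cos(phi_2).
print(weight("011", theta, phi))
print(weight("01101", theta, phi))
```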
In general, however, s_e will be a sum of products of weights. The expression for the off-diagonal entry s_o in the odd block is similar to that in Equation (24). In summary, the theory behind the reduced densities and their eigenvectors gives us an exact understanding of the error propagated through each step of the training algorithm. We may then predict the Bhattacharyya distance in (22) using statistical estimates of the expected combinatorics. This provides an accurate prediction based solely on the fraction f of training samples used and the length N of the sequences.
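The perfect-learning combinatorics can be checked on a toy bipartite training set. The edges below are chosen for illustration only (they happen to reproduce the suffix degrees 2, 2, 1 and the path count 3 quoted above); they are not taken from the paper's figure.

```python
from collections import defaultdict

# Perfect-learning case (all weights equal to 1): e0 is proportional to the
# sum over suffixes of the squared E0-degree, and s_e counts length-2 paths
# built from one E0 edge and one O1 edge.
def classify(prefix):
    """Label a prefix by the parity of its head (E/O) and its last bit."""
    head_even = prefix[:-1].count("1") % 2 == 0
    return ("E" if head_even else "O") + prefix[-1]

def e0_and_se(training):
    deg_E0, deg_O1 = defaultdict(int), defaultdict(int)
    for prefix, suffix in training:
        label = classify(prefix)
        if label == "E0":
            deg_E0[suffix] += 1
        elif label == "O1":
            deg_O1[suffix] += 1
    suffixes = set(deg_E0) | set(deg_O1)
    e0 = sum(deg_E0[s] ** 2 for s in suffixes)          # squared degrees
    se = sum(deg_E0[s] * deg_O1[s] for s in suffixes)   # length-2 paths
    return e0, se

training = [
    ("000", "00"), ("110", "00"),   # E0 edges: suffix degrees 2, 2, 1
    ("000", "01"), ("110", "01"),
    ("000", "10"),
    ("011", "00"), ("101", "10"),   # O1 edges
]
print(e0_and_se(training))  # e0 = 2^2 + 2^2 + 1^2 = 9, s_e = 2*1 + 1*1 = 3
```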

Experiments
The training algorithm was written using the ITensor library [20]; the code is available on GitHub. For a fixed fraction 0 < f ≤ 0.2 we run the algorithm on ten different datasets, each containing N_T = f · 2^{N−1} bitstrings of length N = 16. We then compare the average Bhattacharyya distance in Equation (22) to the theoretical prediction. To handle the angles θ_k and φ_k in the theoretical model, we make a few simplifying assumptions about the expected behavior of the combinatorics.
First, we assume θ_k = φ_k for all k, since the combinatorics of both blocks of the reduced densities ρ_k in (19) have similar behavior. We further assume the average angle θ_k is a function of the average off-diagonal s_e and the average diagonal gap G_e at the kth step, that is, E[θ_k(s_e, G_e)] = θ_k(E[s_e], E[G_e]) for all k. The expectation for s_e is experimentally determined to be independent of k, depending on the fraction f and the bitstring length N alone: E[s_e] = f · N_T/4 for all k. We approximate the expected gap G_e at the kth step by an experimentally determined function of f and the expected gap G_2 = |d_1 − d_2| of the diagonal entries of the reduced density defined at step 2 of the algorithm. Understanding the expected behavior of G_2 is similar to understanding the statistics of a coin toss: on average, one expects to flip the same number of heads and tails, and yet the expectation of the absolute difference is non-zero. The distribution for G_2 is similar, but a little different; here r = d_1 + d_2 = N_T/2 and n = 2^{N−3} is the number of even-parity bitstrings of length N − 2. The plots in Figure 4 compare the theoretical estimate against the experimental average.
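The coin-toss analogy can be checked numerically: for r fair flips the difference heads − tails has mean zero, yet its absolute value has expectation close to √(2r/π). The Monte Carlo below illustrates the analogy only; it is not the exact distribution for G_2 used in the theoretical model.

```python
import numpy as np

# Monte Carlo check of the coin-toss analogy for the gap |d_1 - d_2|:
# E|heads - tails| is non-zero even though E[heads - tails] = 0, and it
# approaches sqrt(2 r / pi) for large r.
rng = np.random.default_rng(0)
r, trials = 1000, 20000
heads = rng.binomial(r, 0.5, size=trials)
gap = np.abs(2 * heads - r)        # |heads - tails|
print(gap.mean(), np.sqrt(2 * r / np.pi))
```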

Conclusion
Models based on tensor networks open interesting directions for machine learning research. Tensor networks can be viewed as a sequence of related linear maps which, by acting together on a very high-dimensional space, allow the model to be arbitrarily expressive. The underlying linearity and powerful techniques from linear algebra allow us to pursue a training algorithm where we can look "under the hood" to understand each step and its consequences for the ability of our model to reconstruct a particular dataset, the even-parity dataset.
Our work also highlights the advantages of working in a probability formalism based on the 2-norm. This is the same formalism used to interpret the wavefunction in quantum mechanics; here we use it as a framework to treat classical data. Density matrices naturally arise as the 2-norm analogue of marginal probability distributions familiar from conventional 1-norm probability. Marginals still appear as the diagonal of the density matrix. Unlike marginals, the density matrices we use hold sufficient information to reconstruct the entire joint distribution. Our training algorithm can be summarized as estimating the density matrix from the training data, then reconstructing the joint distribution step-by-step from these density matrix estimates.
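The claim that marginals appear on the diagonal of the density matrix, while the off-diagonal entries retain joint information, can be verified on a toy joint distribution over two bits (the numbers below are arbitrary):

```python
import numpy as np

# For |psi> = sum_s sqrt(p(s)) |s>, the diagonal of the reduced density on
# one subsystem equals the classical marginal; the off-diagonal entries
# carry the extra information about the complementary subsystem.
p = np.array([0.1, 0.2, 0.3, 0.4])        # toy joint distribution on two bits
psi = np.sqrt(p).reshape(2, 2)            # axes: (bit 1, bit 2)
rho_1 = psi @ psi.T                       # trace out bit 2
marginal_1 = p.reshape(2, 2).sum(axis=1)  # classical marginal on bit 1
print(np.diag(rho_1), marginal_1)
```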
The theoretical predictions we obtained for the generative performance of the model agree well with the experimental results. Note that care is needed when comparing these results, since the theoretical approach involves averaging over all possible training sets to produce a single typical weight MPS, whereas the experiments produce a different weight MPS for each training-set sample. In the near future, we look forward to extending our approach to other measures of model performance and behavior, and certainly to other datasets as well.
More ambitiously, we hope this work points the way to theoretically sound and robust predictions of machine learning model performance based on empirical summaries of real-world data. If such predictions can be obtained for training algorithms that also produce state-of-the-art results, as tensor networks are starting to do, we anticipate this will continue to be an exciting program of research.