Machine learning by unitary tensor network of hierarchical tree structure

The resemblance between the methods used in quantum many-body physics and in machine learning has drawn considerable attention. In particular, tensor networks (TNs) and deep learning architectures bear striking similarities, to the extent that TNs can be used for machine learning. Previous results used one-dimensional TNs for image recognition, showing limited scalability and flexibility. In this work, we train two-dimensional hierarchical TNs to solve image recognition problems, using a training algorithm derived from the multi-scale entanglement renormalization ansatz. This approach introduces mathematical connections among quantum many-body physics, quantum information theory, and machine learning. By keeping the TN unitary during training, we define TN states that encode classes of images as quantum many-body states. We study the quantum features of these TN states, including quantum entanglement and fidelity, and find that these quantities can characterize the image classes as well as the machine learning tasks. Furthermore, the unitarity of the local mappings in our algorithm makes it possible to realize the machine learning with, e.g., quantum state tomography techniques or quantum computations.
Over the past years, we have witnessed rapid progress in applying quantum theories and technologies to realistic problems, including quantum simulators (1) and quantum computers (2-4). To some extent, the power of the "quantum" stems from the properties of quantum many-body systems. As one of the most powerful numerical tools for studying quantum many-body systems (5-8), tensor networks (TNs) have been attracting increasing attention. For instance, TNs have recently been applied to machine learning problems such as dimensionality reduction (9, 10) and handwriting recognition (11, 12). Just as TNs allow the numerical treatment of difficult physical systems by providing layers of abstraction, deep learning has achieved similarly striking advances in automated feature extraction and pattern recognition (13). The resemblance between the two approaches is more than superficial. At the theoretical level, there is a mapping between deep learning and the renormalization group (14), which in turn connects holography and deep learning (15, 16) and allows studying network design from the perspective of quantum entanglement (17). In turn, neural networks can represent quantum states (18-21).
In this work, we derive an efficient quantum-inspired learning algorithm based on the multi-scale entanglement renormalization ansatz (MERA) approach (22-25) and the hierarchical representation known as the tree TN (TTN) (26). As shown in Fig. 1, the idea is first to transform images into vectors living in a $d^N$-dimensional Hilbert space (11), and then to map these vectors (referred to as "vectorized images") through a TTN (denoted by Ψ) to predictions of the classification as outputs. Here, N is the number of pixels, d is the dimension of the vector mapped from each pixel, and D is the number of classes. A TTN suits the two-dimensional (2D) nature of images better than approaches based on a one-dimensional (1D) TN, e.g., matrix product states (11, 12). Furthermore, our scheme explicitly connects machine learning to quantum quantities, such as fidelity and entanglement. Additionally, we propose to use unitary mappings to construct the TTN. High accuracy is reached with small bond dimensions of the local gates, which may make it possible in the near future to implement our proposal with quantum simulations/computations (27).
The algorithm is tested on both the MNIST (handwriting recognition with binary images) and CIFAR (recognition of color images) databases. We obtain accuracies comparable to the performance of convolutional neural networks. More importantly, we can then define the TTN states that optimally encode each class of images as a quantum many-body state, which is akin to the study of a duality between probabilistic graphical models and TNs (28). Combined with t-SNE (29), we find that the level of abstraction changes in the same way as in a deep convolutional neural network (30) or a deep belief network (31), and the highest level of the hierarchy allows for a clear separation of the classes. Finally, we show that the valid information is actually located in a small subspace spanned by the (nearly) orthonormal TTN states that encode the different image classes; each state possesses a moderate entanglement entropy ($S \sim O(1)$), meaning the image classes can be efficiently captured by a TTN with small bond dimensions. Our implementation is available under an open source license at https://github.com/dingliu0305/Tree-Tensor-Networks-in-Machine-Learning.

Figure 1: (Color online) An image of "7" is first vectorized to a product state by mapping each pixel to a d-dimensional vector, and then fed to the TTN denoted by Ψ. The output is the vector obtained after contracting with the TTN. The accuracy is obtained by comparing the output with the vectorized label $|p\rangle$.

Power of representation and generalization
To verify the representation power of the TTN, we use the CIFAR-10 dataset (32), which consists of 10 classes with 50,000 RGB images in the training dataset and 10,000 images in the testing dataset. Each RGB image contains 32 × 32 pixels. We transformed the images to gray-scale to reduce the complexity of the training, a reasonable trade-off between the computational cost and the information retained.
Figs. 2(a) and (b) exhibit the relation between the representation power (learnability, or model complexity) and the bond dimensions of the TTN. The TTN Ψ gives a mapping that optimally projects the vectorized images from the $d^N$-dimensional space to the D-dimensional one. Thus, from the perspective of tensor algebra, the limit of the representation power of Ψ depends on the input dimension $d^N$. On the other hand, the TTN can be considered an approximation of such an exponentially large mapping, obtained by writing it as a contraction of small tensors. The dummy indices contracted inside the TTN are called virtual bonds, and their dimensions determine how closely Ψ can approach this limit.
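To make this concrete, the following is a minimal NumPy sketch of such a contraction for a toy 4 × 4 "image" with a two-layer TTN; the tensor shapes, patch ordering, and random values are illustrative assumptions, not the trained model. Each tensor has four downward indices (one 2 × 2 patch of vectors) and one upward index (a virtual bond of dimension b).

```python
import numpy as np

d, b, D = 2, 4, 10        # input, virtual, and output bond dimensions (toy values)
rng = np.random.default_rng(0)

# A 4x4 "image" after the feature map: 16 local d-dimensional vectors.
pixels = rng.random((16, d))
pixels /= np.linalg.norm(pixels, axis=1, keepdims=True)

# Layer 1: four tensors, each contracting a 2x2 patch of pixel vectors into one b-dim vector.
T1 = rng.random((4, b, d, d, d, d))
# Layer 2 (top tensor): contracts the four b-dim vectors into a D-dimensional output.
T2 = rng.random((D, b, b, b, b))

patches = [(0, 1, 4, 5), (2, 3, 6, 7), (8, 9, 12, 13), (10, 11, 14, 15)]  # 2x2 blocks
layer1 = [np.einsum('uabcd,a,b,c,d->u', T1[i], *(pixels[q] for q in patches[i]))
          for i in range(4)]
output = np.einsum('pabcd,a,b,c,d->p', T2, *layer1)  # D-dimensional prediction
```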
The sequence of convolutional and pooling layers in the feature extraction part of a deep learning network is known to arrive at higher and higher levels of abstraction that help separate the classes in a discriminative learner (13). This is often visualized by embedding the representation in two dimensions with t-SNE (29) and coloring the instances according to their classes. If the classes separate clearly in this embedding, the subsequent classifier can perform the classification at high accuracy with ease. We plot this embedding for each layer of the TN in Fig. 4 and observe the same pattern as in deep learning, with a clear separation at the highest level of abstraction. Furthermore, to test the generalization power of TTNs, we used the MNIST dataset, which is widely used for handwriting recognition. The training set consists of 60,000 gray-scale images of 28 × 28 pixels, with 10,000 testing examples. For simplicity of encoding, we rescaled them to 16 × 16 pixels so that the TTN can be built with four layers.
With the increase of the bond dimensions (both input and virtual), we find an apparent rise of the training accuracy, as shown in Fig. 3. At the same time, we observe a decline of the testing accuracy. Increasing the bond dimension sharply increases the number of parameters and, as a result, gives rise to overfitting and lowers the generalization performance, mirroring the principles of statistical learning theory. Therefore, one must pay attention to finding the optimal bond dimension; we can think of it as a hyperparameter controlling the model complexity. Balancing efficiency against overfitting, we use the minimal bond dimensions (Table 1) that reach a training accuracy of around 95%. Our results indicate that only small bond dimensions (O(1)) are needed.

The fidelity between two states is defined as $F_{pp'} = \langle \psi_p | \psi_{p'} \rangle$. It measures the distance between the two quantum states in the Hilbert space. Fig. 5(a) shows the fidelity between each pair of states $|\psi_p\rangle$ trained on the MNIST dataset. One can see that $F_{pp'}$ remains quite small in most cases, meaning that the $\{|\psi_p\rangle\}$ are almost orthonormal. Although the total dimension of the vectorized images is $d^N$, most of the relevant information gathers in a small corner spanned by the orthonormal states $\{|\psi_p\rangle\}$.
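As a minimal sketch of this near-orthonormality check, assuming the trained class states are available as dense vectors (in practice each overlap is computed by contracting two TTNs rather than forming $d^N$-dimensional vectors):

```python
import numpy as np

def fidelity_matrix(psi):
    """Gram matrix G[p, q] = <psi_p|psi_q> of the trained class states.

    psi: list of P state vectors; G is close to the identity matrix
    when the states are nearly orthonormal.
    """
    P = len(psi)
    G = np.empty((P, P))
    for p in range(P):
        for q in range(P):
            G[p, q] = np.vdot(psi[p], psi[q]).real
    return G
```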
In addition, the largest value of the fidelity is $F_{4,9} = 0.1353$. We speculate that this is closely related to the way the data instances are fed into and processed by the TTN. In our case, two image classes with similar shapes result in a larger fidelity, because the TTN essentially provides a real-space renormalization flow. In other words, the input vectors are initially arranged, and then renormalized layer by layer, according to their spatial locations in the image; each tensor renormalizes four nearest-neighboring vectors into one vector. Fidelity could potentially be used to build a network in which the nodes are classes of images and the weights of the connections are given by $F_{pp'}$. This might provide a mathematical model of how different classes of images are associated with each other. We leave these questions for future investigation.
Another important concept in quantum mechanics is (bipartite) entanglement, a quantum version of correlations (33). It is one of the key characteristics distinguishing quantum states from classical ones. Entanglement is usually characterized by a normalized positive-definite vector called the entanglement spectrum (denoted by Λ), and its strength is measured by the entanglement entropy $S = -\sum_a \Lambda_a^2 \ln \Lambda_a^2$. Fig. 5(b) shows the entanglement entropy of the $\{|\psi_p\rangle\}$ trained on the MNIST dataset. We compute two kinds of entanglement entropy, cutting the images in the middle along the x and y directions as shown in Fig. 1; the results are marked as up-down and left-right in Fig. 5(b). The former denotes the entanglement between the upper and lower parts of the image, the latter that between the left and right parts. With the TTN, the entanglement spectrum is simply given by the singular values of the matrix $M = T^{[K,1]} |p\rangle$, with $T^{[K,1]}$ the top tensor. This is because all the tensors in the TTN are orthogonal. Note that M has four indices, each representing the effective space renormalized from one quarter of the vectorized image. Thus, the chosen bipartition determines how the four indices of M are grouped into two larger indices before the SVD is taken.
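The following sketch implements this procedure in NumPy; the assumed index order of the top tensor (output, upper-left, upper-right, lower-left, lower-right) is an illustrative convention, not taken from the original code.

```python
import numpy as np

def entanglement_entropy(top_tensor, label, cut='up-down'):
    """S = -sum_a Λ_a^2 ln Λ_a^2 for the state M = T^{[K,1]}|p>.

    top_tensor: shape (D, b, b, b, b); the four b-dimensional legs are
    assumed ordered as (upper-left, upper-right, lower-left, lower-right).
    label: the D-dimensional label vector |p>.
    """
    M = np.einsum('pabcd,p->abcd', top_tensor, label)
    b = M.shape[0]
    if cut == 'up-down':   # group (ul, ur) vs (ll, lr)
        mat = M.reshape(b * b, b * b)
    else:                  # 'left-right': group (ul, ll) vs (ur, lr)
        mat = M.transpose(0, 2, 1, 3).reshape(b * b, b * b)
    lam = np.linalg.svd(mat, compute_uv=False)
    lam /= np.linalg.norm(lam)                 # normalized entanglement spectrum
    lam = lam[lam > 1e-12]                     # drop numerical zeros
    return -np.sum(lam**2 * np.log(lam**2))
```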
Two implications can be drawn from the entanglement entropy. Firstly, it is known from TN theory that the entanglement entropy determines the virtual bond dimensions needed to reach a given precision; in other words, the entanglement entropy characterizes the computational complexity of the classification with a TTN. Secondly, for a physical state with two subsystems, the entanglement entropy measures the amount of information about one subsystem that can be gained by measuring the other. Here, an important analogy is between knowing a part of the image and measuring the corresponding subsystem of the quantum state. Thus, we suggest that in our image recognition setting the entanglement entropy characterizes how much information about one part of the image can be gained by knowing the rest of it. In other words, if we only know part of an image and want to predict the rest according to the trained TTN state, the entanglement entropy measures how accurately this can be done. Moreover, we show that the $\{|\psi_p\rangle\}$ possess small entanglement, meaning that the TTN can efficiently capture and classify the images with relatively small virtual bond dimensions. Our results suggest that, among the digits, "0" is the easiest and "4" the hardest for predicting the missing part of an image given the rest.

Discussion
We continued the forays into using tensor networks for machine learning, focusing on hierarchical, two-dimensional tree tensor networks that we found to be a natural fit for image recognition problems. This provides a scalable approach of high precision. We conclude with the following observations:

• The limit of the representation power (learnability) of a TTN strongly depends on the input bond dimensions, and the virtual bond dimensions determine how well the TTN reaches this limit.
• A hierarchical tensor network exhibits the same increase in the level of abstraction as a deep convolutional neural network or a deep belief network.
• Our scheme naturally connects classical images to quantum states, permitting the use of quantum properties (fidelity and entanglement) to characterize the classical data and the computational tasks.
Moreover, our work contributes towards the implementation of machine learning by quantum simulations/computations. Firstly, since we propose to encode image classes into TTN states, the proposed machine learning could be realized by, e.g., quantum state tomography techniques (27). Secondly, arbitrary unitary gates can in principle be realized by so-called digital quantum simulators (34). Thanks to the unitary conditions of the local tensors, this offers another possible route to realizing our proposal by quantum simulation.

Feature map
Our approach to classifying image data begins by mapping each pixel x to a d-component vector $v_s(x)$. This feature map was introduced in (11) and is defined as

$$v_s(x) = \sqrt{\binom{d-1}{s-1}} \left( \cos\frac{\pi x}{2} \right)^{d-s} \left( \sin\frac{\pi x}{2} \right)^{s-1}, \qquad (1)$$

where s runs from 1 to d and x ∈ [0, 1] is the normalized pixel value. By using a larger d, the TTN has the potential to approximate a richer class of functions. With such a nonlinear feature map, we can project a gray-scale image from the space of scalars to a $d^N$-dimensional vector space, where the image is represented as a direct product state of N local d-dimensional vectors $\{|v_j\rangle\}$. The coefficients of $|v_j\rangle$ are given by the feature map [Eq. (1)] applied to the j-th pixel.
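A short sketch of this map, with a random array standing in for a rescaled gray-scale image:

```python
import numpy as np
from math import comb

def feature_map(x, d=2):
    """Map a normalized pixel value x in [0, 1] to a d-dimensional vector, Eq. (1)."""
    return np.array([np.sqrt(comb(d - 1, s - 1))
                     * np.cos(np.pi * x / 2) ** (d - s)
                     * np.sin(np.pi * x / 2) ** (s - 1)
                     for s in range(1, d + 1)])

image = np.random.rand(16, 16)                                 # stand-in for a 16x16 image
vectors = np.stack([feature_map(x) for x in image.flatten()])  # shape (256, d)
```

For d = 2 this reduces to $[\cos(\pi x/2), \sin(\pi x/2)]$, a unit vector for every pixel.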

MERA-inspired training algorithm
Ψ can be written as a hierarchical TN of K layers (see Fig. 1 for an example), whose coefficients are given by the contraction of all its tensors,

$$\Psi = \sum_{\{\text{virtual bonds}\}} \prod_{k=1}^{K} \prod_{n=1}^{N_k} T^{[k,n]}, \qquad (2)$$

where $N_k$ is the number of tensors in the k-th layer. The output for classifying the j-th sample is a D-dimensional vector obtained by contracting the vectorized image (denoted by $\langle v^{[j]}|$ for the j-th sample) with the TTN,

$$|\tilde{p}^{[j]}\rangle = \Psi^{\dagger} |v^{[j]}\rangle, \qquad (3)$$

where $|\tilde{p}^{[j]}\rangle$ acts as the predicted label of the j-th sample. Based on this, we derive a highly efficient training algorithm inspired by MERA (22). We choose the cost function to be minimized as the square error,

$$f = \sum_{j=1}^{J} \big\| |\tilde{p}^{[j]}\rangle - |p^{[j]}\rangle \big\|^{2}, \qquad (4)$$

with $|p^{[j]}\rangle$ the true label of the j-th sample and J the number of training samples. To proceed, we write the cost function in the form

$$f = \sum_{j=1}^{J} \left( \langle v^{[j]}|\Psi\Psi^{\dagger}|v^{[j]}\rangle - 2 \langle v^{[j]}|\Psi|p^{[j]}\rangle + 1 \right). \qquad (5)$$

The third term comes from the normalization of $|p^{[j]}\rangle$, and we assume the second term is always real. The dominant cost comes from the first term, and we borrow the idea of the MERA approach to reduce it. Mathematically speaking, the central idea is to impose that Ψ be orthogonal, i.e., $\Psi^{\dagger}\Psi = I$. Then Ψ is optimized with $\Psi\Psi^{\dagger} = I$ satisfied in the valid subspace that optimizes the classification. By satisfying it only in the subspace, we do not require $\Psi\Psi^{\dagger}$ to be the full identity, but only that

$$\langle v^{[j]}|\Psi\Psi^{\dagger}|v^{[j]}\rangle \simeq 1$$

for the training samples. In MERA, a stronger constraint is used. With the TTN, each tensor has one upward and four downward indices, which gives a non-square orthogonal matrix when the downward indices are grouped into a single large one. Such tensors are called isometries and satisfy $T^{\dagger}T = I$ after contracting all downward indices with the conjugate tensor. When all the tensors are isometries, the TTN gives a unitary transformation satisfying $\Psi^{\dagger}\Psi = I$; it compresses a $d^N$-dimensional space into a D-dimensional one.
In this way, the first term becomes a constant, and we only need to deal with the second term. The cost function becomes

$$f = -\sum_{j=1}^{J} \langle v^{[j]}|\Psi|p^{[j]}\rangle. \qquad (6)$$

Each term in f is simply a contraction of the tensor network, which can be computed efficiently. The tensors in the TTN are updated alternately to minimize Eq. (6). To update $T^{[k,n]}$, for instance, we fix all other tensors and define the environment tensor $E^{[k,n]}$, which is calculated by contracting everything in Eq. (6) after taking out $T^{[k,n]}$ (Fig. 1) (25). The cost function then becomes $f = -\mathrm{Tr}(T^{[k,n]} E^{[k,n]})$. Under the constraint that $T^{[k,n]}$ is an isometry, the optimum is given by $T^{[k,n]} = V U^{\dagger}$, where U and V are obtained from the singular value decomposition $E^{[k,n]} = U \Lambda V^{\dagger}$. At this point, we have $f = -\sum_a \Lambda_a$.
The update of one tensor thus reduces to the calculation of its environment tensor followed by a singular value decomposition. In the alternating process of updating all the tensors, some tricks are used to accelerate the computations; the idea is to save intermediate results and avoid repeated calculations by taking advantage of the tree structure. Another important detail is to normalize the vector obtained each time four vectors are contracted with a tensor.
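A minimal sketch of one such update, assuming the environment tensor has already been computed and flattened to a matrix (building the environment requires contracting the rest of the network and is omitted here):

```python
import numpy as np

def update_isometry(E):
    """Return the isometry T minimizing f = -Tr(T E).

    E: environment tensor flattened to shape (dim_up, dim_down), with the
    four downward indices grouped into the column index (dim_up <= dim_down).
    The returned T has shape (dim_down, dim_up) and satisfies T† T = I.
    """
    U, lam, Vh = np.linalg.svd(E, full_matrices=False)  # E = U Λ V†
    T = Vh.conj().T @ U.conj().T                        # T = V U†
    # At the optimum, Tr(T E) = sum_a Λ_a, so f = -sum_a Λ_a.
    return T
```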
The scaling of both the time and the space complexity is $O\big((b_v^5 + b_i^4 b_v)\, M N_T\big)$, where M is the dimension of the input vector, $b_v$ the dimension of a virtual bond, $b_i$ the dimension of an input bond, and $N_T$ the number of training inputs.

Multi-class classification
The strategy for building a multi-class classifier is the one-against-all scheme from machine learning. For each class we train one TTN, so that it recognizes whether an image belongs to that class or not. The output of Eq. (3) is then a two-dimensional vector, and we fix the label for a "yes" answer to $|\text{yes}\rangle = [1, 0]^{T}$. For the P image classes we accordingly have P TTNs $\{\Psi^{[p]}\}$, which define the states $|\psi_p\rangle = \Psi^{[p]} |\text{yes}\rangle$. To recognize the j-th sample, we then introduce a P-dimensional vector $F^{[j]}$ whose p-th element is the overlap between $|\psi_p\rangle$ and the vectorized image,

$$F^{[j]}_{p} = \langle \psi_p | v^{[j]} \rangle.$$

The position of its maximal element indicates the class to which the image belongs.
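Schematically, the prediction step looks as follows; the class states are again treated as dense vectors for clarity, whereas in practice each overlap is an efficient TTN contraction:

```python
import numpy as np

def classify(v_sample, psi_states):
    """One-against-all prediction: return the index p of the state |psi_p>
    with the largest overlap F_p = <psi_p|v> with the vectorized image."""
    F = np.array([np.vdot(psi_p, v_sample).real for psi_p in psi_states])
    return int(np.argmax(F))
```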

Figure 3: Training and test accuracy as a function of the bond dimensions on the MNIST dataset. The virtual bond dimensions are set equal to the input bond dimensions. The number of training samples is 1000 for each pair of classes.

Figure 5: (a) Fidelity $F_{pp'}$ between each pair of handwritten digits, ranging from −0.0032 to 1; the diagonal terms $F_{pp} = 1$ because the quantum states are normalized. (b) Entanglement entropy corresponding to each handwritten digit.