UvA-DARE (Digital Academic Repository)
Entangled q-convolutional neural nets

We introduce a machine learning model, the q-CNN model, sharing key features with convolutional neural networks and admitting a tensor network description. As examples, we apply q-CNN to the MNIST and Fashion-MNIST classification tasks. We explain how the network associates a quantum state to each classification label, and study the entanglement structure of these network states. In both our experiments on the MNIST and Fashion-MNIST datasets, we observe a distinct increase in both the left/right as well as the up/down bipartition entanglement entropy (EE) during training as the network learns the fine features of the data. More generally, we observe a universal negative correlation between the value of the EE and the value of the cost function, suggesting that the network needs to learn the entanglement structure in order to perform the task accurately. This supports the possibility of exploiting the entanglement structure as a guide to design machine learning algorithms suitable for given tasks.


Introduction
Convolutional neural networks (CNNs) have seen remarkable successes in various applications. At the same time, there are tasks with similar descriptions that can nevertheless not be solved with a CNN architecture³. Moreover, it is not always transparent which choices of hyperparameters work best, and why. More generally, we do not always have a precise explanation of why certain choices of machine learning architectures and hyperparameters work or do not work for a given task. This lack of a precise understanding is related to the curse of dimensionality, which prevents an explicit analysis. That said, the data of a given problem typically lie in a high-codimensional subspace. For instance, a typical point in the configuration space of all possible N-pixel grayscale pictures resembles a 'white noise' image, and looks nothing like a picture encountered in the relevant data set. This is very reminiscent of the situation in quantum many-body systems. The high dimensionality of the Hilbert space of quantum states makes it hard to find the desired state (e.g. the ground state of a given Hamiltonian) explicitly. Tensor networks [2,3] are among the most popular tools utilised in many-body quantum physics to overcome this problem. Abstractly speaking, they provide a way to approximate high-order tensors in terms of lower-order tensors, and by doing so greatly reduce the number of parameters needed to describe the relevant quantum states, circumventing the curse of dimensionality. This is possible because the physically relevant states lie in a tiny 'corner of the Hilbert space'. One can quantify this using the entanglement entropy (EE) of a quantum state with respect to a bipartition of the system, which measures the degree to which the quantum state is entangled between the two subsystems. While a typical element of the Hilbert space has an EE which scales like the volume of the sub-region, the physically relevant states tend to have entanglement entropies that scale like the boundary area (possibly with logarithmic corrections) of the sub-region. At the same time, the entanglement structure of a quantum state is precisely what constrains how effectively it can be approximated by a given tensor network architecture. See [4,5] for a review.
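The distinction between product states (zero EE) and entangled states can be made concrete in a few lines. The following numpy sketch (our illustration, not code from the paper) computes the EE of a pure state across a bipartition via the singular values of the reshaped state vector, i.e. its Schmidt decomposition:

```python
import numpy as np

def entanglement_entropy(psi, dim_left, dim_right):
    """Von Neumann entanglement entropy of a pure state |psi>
    across a left/right bipartition, obtained from the singular
    values of the reshaped state (Schmidt decomposition)."""
    m = psi.reshape(dim_left, dim_right)
    s = np.linalg.svd(m, compute_uv=False)
    p = s**2 / np.sum(s**2)   # Schmidt coefficients as probabilities
    p = p[p > 1e-12]          # drop numerical zeros before the log
    return -np.sum(p * np.log(p))

# A product state has vanishing EE; a maximally entangled
# Bell-like state of two qubits has EE = log 2.
product = np.kron([1.0, 0.0], [1.0, 0.0])
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
```

For larger systems the same computation, applied to sub-regions of growing size, distinguishes the 'volume law' scaling of generic states from the 'area law' scaling of physically relevant ones.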
The analogy between quantum many-body systems and machine learning prompts the following questions. Could we have a similar theoretical understanding, in the context of machine learning architectures, of how effective they are for a given task? Could we also understand the subspace of relevant data with tools similar to those used in the study of quantum many-body systems? Moreover, one might also hope that the analogy between quantum many-body systems and machine learning architectures can help the development of natural and effective quantum machine learning architectures [6][7][8].
Inspired by these questions, there have been increasing efforts to build a bridge between the two fields of machine learning and quantum many-body systems, and in particular tensor networks [9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25]. In this work, we continue to strengthen this bridge, focusing on CNNs. Specifically, we build a CNN-like architecture, which we call q-CNN, which admits a description as a tensor network. In particular, our architecture has the same weight sharing property as the usual CNN, and as a result the number of parameters grows only logarithmically with the system size. Subsequently, we apply q-CNN to the classification tasks on the MNIST and Fashion-MNIST datasets, obtaining maximum test accuracies of 97% and 89% respectively, comparable to [26].
With the confidence that our architecture has satisfactory performance, we then go ahead and explore its quantum mechanical properties. As mentioned before, a crucial probe of the qualitative features of a quantum state is its entanglement entropy. We compute the entanglement entropies of the network quantum states, with respect to bipartitions of the configuration space corresponding to the left/right and up/down partitions of an image. In the current context, the EE of the network given a specific bipartition of the image measures the degree to which the network captures the correlation between the two parts of the image. For instance, a vanishing EE implies that the network assigns a probability for the image to be in a certain category that can be expressed as the product of two probabilities corresponding to the two parts of the bipartition, which is clearly an undesirable outcome.
Quantum EE is well known to be an important quantity quantifying the entanglement attributes of a quantum system. As a result, it has been playing a crucial computational role in the design of tensor networks capable of performing computations pertaining to given quantum systems; too intricate an entanglement structure can pose a fundamental hurdle to computability via a given tensor network. See for instance [27] for a recent review. Similarly, from the analogy between quantum many-body systems and machine learning, it is reasonable to suspect that the concept of entanglement can play a similar role as a guiding principle for choosing the right architectures and hyperparameters capable of solving a given problem. As a first step, in this work we establish that entanglement is indeed a relevant structure that the network needs to learn before it can start performing the tasks we designate.
In both our experiments on the MNIST and Fashion-MNIST datasets, we initialize the network randomly, which renders a quantum state with no particular entanglement structure. Subsequently, we observe a distinct increase in EE during training as the network learns the fine features of the data. More generally, we observe a negative correlation between the value of the EE and the value of the cost function, across different initialisations and choices of hyperparameters of the network. This constitutes convincing evidence that one of the structures of the data that the network needs to 'learn' is the entanglement structure. It can also be read off that the entanglement needed for the (Fashion-)MNIST classification tasks is low, which could be viewed as a 'justification' of why a simple CNN is capable of performing these tasks.

Related work
The q-CNN architecture discussed in this work is based on the theoretical architecture, named the deep convolutional arithmetic circuit, introduced in [23,24]. The product pooling proposed in [23,24] presents a challenge in training the network, since it can easily lead to numerical instabilities such as underflow or overflow. We would like to render the network trainable in practice, and to do so without spoiling the analogy to quantum many-body systems. In [28] this architecture was trained using simnets [29], circumventing the numerical instability of product pooling by performing the calculations in logarithmic space. In q-CNN, we instead introduce additional batch normalisation layers, which can easily be incorporated into the tensor description of the network, compatible with the quantum analogy.
As opposed to other works aiming to study and/or practice machine learning in ways inspired by quantum many-body physics [26,30,31], we train the network as a usual neural network with PyTorch, instead of using optimization schemes common for tensor networks, such as DMRG. Also note that the number of parameters of our network grows merely logarithmically with the size of the image, as opposed to linearly as in the aforementioned approaches, since we retain the weight sharing feature of the usual CNN in our architecture. In [14], the EE of the final trained network was computed for a very different architecture directly related to tensor networks, but not its evolution during training or the correlation between entanglement and accuracy. In [32] and [24], the possibility was mentioned that the requirement of being capable of accommodating the entanglement could be used to guide the design of the network. We note that the values of the EE in our experiments on the MNIST and F-MNIST datasets are of the order of log 2. As a result, the entanglement is unlikely to be a bottleneck of the network performance for such tasks.

The q-CNN architecture
In this section we describe the architecture of q-CNN, as summarised in figure 1.
Consider a grayscale image with N pixels, corresponding to a point in the configuration space x = (x_1, . . ., x_N) ∈ [0, 1]^⊗N. For simplicity we have flattened the 2d image into a chain. We will take N = 2^L with integral L, using padding if necessary. The first layer of the q-CNN is a (non-linear) representation layer: explicitly, each pixel value is mapped to a feature vector built from the basis functions (19). Subsequently, there are L iterations of feature learning. Each iteration consists of three operations: batch normalisation, convolution, and pooling. In what follows we discuss them individually.
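As a concrete illustration of the representation layer, the following numpy sketch maps each pixel value to a vector of truncated Fourier modes; the precise choice of modes here is our assumption, standing in for the basis (19) of the paper:

```python
import numpy as np

def represent(x, n):
    """Representation layer sketch: map each pixel value x_p in
    [0, 1) to a (2n+1)-dimensional feature vector of truncated
    Fourier modes exp(2*pi*i*k*x_p), k = -n..n (an assumed but
    representative choice of basis)."""
    x = np.asarray(x, dtype=float)
    k = np.arange(-n, n + 1)
    # output shape: (number of pixels, 2n+1 channels)
    return np.exp(2j * np.pi * np.outer(x, k))
```

Each pixel thus carries a local feature vector of dimension 2n+1, matching the local Hilbert space dimension used in section 3.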

Batch normalisation:
In q-CNN, we employ a somewhat unusual product pooling to correlate information captured in different spatial locations of the image. This product operation is prone to numerical instabilities such as overflow and underflow. To remedy this, we use batch normalisation to standardise the input features of each layer such that they have a specific (learnable) mean µ(ℓ) and variance σ(ℓ) over the spatial and batch dimensions. In other words, in the ℓth iteration and for a given batch b, let µ(ℓ)_b and σ(ℓ)_b be the mean and variance of the input ζ(ℓ) over the spatial (denoted by p below) dimensions and over the data points in the batch. Then the batch normalisation layer amounts to an affine transformation ζ′(ℓ) = w(ℓ)ζ(ℓ) + b(ℓ) in components, where w(ℓ) = diag(w(ℓ)_1, . . ., w(ℓ)_{d_ℓ}) rescales each channel by a factor involving σ(ℓ)_b and a small regularizing constant ε.
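A minimal numpy sketch of this step, with our assumed axis convention (batch, spatial, channel) and scalar learnable parameters gamma and beta standing in for the per-channel scale and shift:

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Batch normalisation sketch: standardise features over the
    batch and spatial axes, then rescale to a learnable scale
    (gamma) and shift (beta). eps is the small regularizing
    constant that prevents division by zero."""
    mu = z.mean(axis=(0, 1), keepdims=True)    # over batch and spatial
    var = z.var(axis=(0, 1), keepdims=True)
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta
```

After this step each channel has mean beta and standard deviation (approximately) gamma over the batch and spatial dimensions, keeping the subsequent products numerically tame.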

Convolution:
We consider a convolution of window size 1 × 1. Note that this is nevertheless non-trivial, since we have d_ℓ > 1 channels. As a result the convolution mixes information carried in different channels, though not in different spatial locations. Writing the weight tensor in the ℓth layer as a matrix a(ℓ) of size d_{ℓ+1} × d_ℓ, we have ξ(ℓ)_p = a(ℓ)ζ′(ℓ)_p in components. Note that the tensor a(ℓ) does not depend on the spatial location p, a feature often referred to as weight sharing in the context of the usual CNN architecture.
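The channel-mixing nature of a 1 × 1 convolution is easy to see in code; this numpy sketch (ours) applies the same weight matrix at every spatial position:

```python
import numpy as np

def conv_1x1(a, zeta):
    """1x1 convolution sketch: the weight matrix a (d_out x d_in)
    mixes channels at each spatial position p independently, and is
    shared across all positions (weight sharing)."""
    # zeta: (positions, d_in)  ->  (positions, d_out)
    return np.einsum('ij,pj->pi', a, zeta)
```

Because a single matrix is reused at every position, the parameter count of this layer is d_out × d_in, independent of the spatial size.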

Product pooling:
Following each such convolution there is a product pooling operation, which performs the same-channel product of the corresponding (one-dimensional) features in non-overlapping spatial windows of size 2, thus reducing the spatial size of the feature map by a factor of 2. In other words, the feature ζ(ℓ+1)_p of the next iteration is the elementwise (same-channel) product of ξ(ℓ)_{2p−1} and ξ(ℓ)_{2p}.
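In numpy the product pooling is a one-liner; this sketch (ours) pairs up adjacent positions and multiplies them channel by channel:

```python
import numpy as np

def product_pool(xi):
    """Product pooling sketch: same-channel product over
    non-overlapping spatial windows of size 2, halving the spatial
    extent of the feature map."""
    # xi: (positions, channels), with an even number of positions
    return xi[0::2] * xi[1::2]
```

Iterating this L times on N = 2^L positions leaves a single feature vector, which is why the network has exactly L feature-learning iterations.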

Classification:
The last pooling layer is followed by a batch normalisation operation and a dense linear layer (trivially a convolution on a 1 × 1 feature map), which maps the remaining features to the classification space of |C| labels, with α = a(L)ζ′(L)_1 in components.

Tensor description
As mentioned before, without batch normalisation, the architecture consisting of the specific form of convolutional and product pooling layers described above is the convolutional arithmetic circuit proposed in [23], and further studied in [24,33]. It was also pointed out there that such an architecture implements a (hierarchical) Tucker decomposition of the network tensors. To have a trainable network we additionally apply batch normalisation. Naively, this destroys the description of the network as a tensor operation, since batch normalisation (3) is an affine instead of a linear transformation. However, this can easily be remedied by adding an additional dimension, corresponding to the 'constant term', in all layers. Concretely, we can equivalently describe the affine transformation (2) as a linear transformation acting on the features extended by a constant channel. At the same time, the convolution and product pooling layers can also be described in terms of the extended features ζ̃(ℓ)_p and ξ̃(ℓ)_p in a straightforward way. For instance, the batch normalisation and the convolutional layers can be described as a combined tensorial operation, ξ̃(ℓ)_p = ã(ℓ)ζ̃′(ℓ)_p, for ℓ = 0, 1, . . ., L − 1. In the final classification layer, we simply apply a(L) without the additional constant channel, since we do not need it in the final output. As a result, the q-CNN we defined above can also be described in terms of tensor networks, as we will further describe in the following section.
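The extension-by-a-constant-channel trick is the same device as homogeneous coordinates in graphics. A numpy sketch (ours, with a diagonal scale for simplicity, as in batch normalisation):

```python
import numpy as np

def affine_to_linear(w, b):
    """Embed the affine map z -> diag(w) z + b as a single linear
    map on vectors extended by a constant channel fixed to 1."""
    d = len(b)
    m = np.zeros((d + 1, d + 1))
    m[:d, :d] = np.diag(w)
    m[:d, d] = b          # the shift enters via the constant channel
    m[d, d] = 1.0         # the constant channel is preserved
    return m

z = np.array([2.0, 3.0])
w, b = np.array([1.0, 2.0]), np.array([5.0, -1.0])
z_ext = np.append(z, 1.0)   # extended feature vector
```

Since the extended map is linear, composing it with the convolution and pooling tensors keeps the whole network a tensor network.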

Quantum properties
In this section we discuss the description of the data and the q-CNN network in the language of quantum many-body systems, which then enables the definition and calculation of the EE of the network states.

A quantum description
To describe the neural network introduced in section 2, let us first describe our quantum Hilbert space in terms of a space of L²-integrable complex functions⁴. This space is equipped with the natural inner product ⟨f, g⟩ := ∫₀¹ dx f̄(x)g(x) (18), and an orthonormal basis is given by the Fourier basis⁵ f_k for k ∈ ℕ. To obtain a finite-dimensional Hilbert space, we restrict to a subspace by truncating the modes with frequency larger than a given n ∈ ℕ. We will identify this space with our local Hilbert space H_loc describing the state of a local pixel (or spin, in the physics analogy). Explicitly, corresponding to the orthonormal basis (19) for L²_n(S¹), we introduce an orthonormal basis |f_0⟩, . . ., |f_{2n}⟩ for H_loc, and it follows from the orthonormality that ⟨f_i|f_j⟩ = δ_{ij}. We also introduce ⟨x| ∈ H*_loc giving the evaluation map ⟨x|f⟩ = f(x). In other words, we can think of |x⟩ as the position eigenstates, and f(x) can be thought of as the wavefunction associated with the state |f⟩. Note also that ⟨x|x′⟩ is the so-called Dirichlet kernel on L²_n(S¹), and has the Dirac delta function δ(x − x′) as its limit when n → ∞. Now, consider a system with N pixels (or lattice sites, in the physics analogy), with total Hilbert space H = H_loc^⊗N. We will write the coordinates x = (x_1, . . ., x_N) for the N-torus T^N = [0, 1)^⊗N, and introduce the natural (orthonormal) basis |f_i⟩ = |f_{i_1}⟩ ⊗ ⋯ ⊗ |f_{i_N}⟩ for H, corresponding to the basis |f_{i_p}⟩ for the local Hilbert space of the pth pixel. Identifying a greyscale image with N pixels with a point in [0, 1]^⊗N, the 'position eigenstate' |x⟩ corresponds in our case to a specific image. Note that, by working with the space (26), we restrict ourselves to a periodic representation that is invariant under the action of swapping a zero with a one⁶. Just as in the single pixel/particle case, its expansion in the orthonormal basis (27) has coefficient functions⁷ labelled by i = (i_1, . . ., i_N). It then follows immediately that the final score function output from our neural network can be viewed as the value of the wavefunction corresponding to the 'network state'⁸

|Ψ_y⟩ = ∑_{i ∈ {0,1,...,2n}^⊗N} Ψ_{y;i} |f_i⟩ (30)

corresponding to the label y ∈ C: α_y(x) = ⟨x|Ψ_y⟩. At this point, it is tempting to associate to our wavefunction the usual probabilistic interpretation à la Born's rule. To do this, we introduce the normalised network states (32) and define the joint probability density function p(y, x). Building a (generative) network learning the states |Ψ_{y,0}⟩ that gives a good approximation to this probability density function is beyond the scope of the current paper; here we focus on the classification task and hence we can only trust p(y, x) in the subspace of H where x resembles the training data in some way. For the classification task at hand, the relevant probability is instead the conditional probability p(y|x) ∝ |⟨x|Ψ_y⟩|². As is manifest from the last expression, the conditional probability is insensitive to the normalisation (32) of the network state |Ψ_y⟩. This justifies our classification rule given the network output α_y(x) = ⟨x|Ψ_y⟩: we select the label y with the largest |α_y(x)|. Note that this is different from the more common ways of assigning probabilities to the outputs of such a classification network, such as through a softmax function.
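The Born-rule classification rule can be sketched in a few lines of numpy (ours); note that the conditional probability is indeed unchanged if all outputs are rescaled by a common factor:

```python
import numpy as np

def classify(alpha):
    """Born-rule sketch: from the (unnormalised) network outputs
    alpha_y(x) = <x|Psi_y>, form p(y|x) proportional to |alpha_y|^2
    and select the label with the largest |alpha_y|."""
    p = np.abs(alpha)**2
    p = p / p.sum()                      # conditional probabilities
    return p, int(np.argmax(np.abs(alpha)))
```

This differs from a softmax in that it is insensitive to both the overall scale and the signs of the outputs, which is exactly the invariance of the quantum state description.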

Entanglement entropy
Equipped with the interpretation described in section 3.1 of our architecture as quantum states, we are now ready to 'measure' the behaviour of the neural network using quantum mechanical tools. In particular, in this subsection we discuss the EE of the network.
As the name suggests, the EE measures the extent to which a quantum state is entangled across two subsystems, U and Ū. It is defined as the von Neumann entropy of the reduced density matrix. For instance, if the state is a product state of the two subsystems, then its EE (with respect to this bipartition) vanishes.
Here, we consider a bipartition (U, Ū) of the images that spatially splits the pixels x = (x_U, x_Ū) into two groups, with the corresponding Hilbert space decomposition H = H_U ⊗ H_Ū. To discuss the EE of the different network states corresponding to different classification labels y ∈ C, we normalise the network state as in (36). Note that this is a different normalisation from (32). The corresponding reduced density matrix, obtained by tracing the density matrix over the subspace H_Ū, is given in (37). In what follows we focus on the bipartition in which U and Ū each contain N/2 pixels (or lattice sites), corresponding to the two inputs of the top pooling tensor. Computationally, this is the bipartition whose EE is easiest to compute, corresponding to the fact that one needs to cut the least number of legs in the diagram. In the next section we flatten the input images in ways such that this bipartition corresponds to either the left/right or the up/down separation of the images. In this case, analogous to the basis (27) for the total Hilbert space H, we introduce the corresponding orthonormal basis |ϕ_I⟩_U for H_U (and similarly |ϕ_I⟩_Ū for H_Ū). The top pooling layer of the tensor decomposition then expresses |Ψ_y⟩ in terms of the |ϕ_I⟩_U and |ϕ_I⟩_Ū. Recall that |ϕ_I⟩_U = |ϕ_I⟩_Ū under the natural isomorphism H_U ≅ H_Ū; this is a consequence of the weight sharing feature of our architecture. Also note that the y-dependence in |Ψ_y⟩ comes entirely from the top layer tensor a(L).
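The partial trace over the complement and the resulting entropy can be sketched directly in numpy (our illustration of the general construction, not the paper's tensor network contraction):

```python
import numpy as np

def reduced_density_matrix(psi, dim_u, dim_ubar):
    """Partial trace sketch: rho_U = Tr_Ubar |psi><psi| for a
    normalised pure state, reshaped over the bipartition (U, Ubar)."""
    m = psi.reshape(dim_u, dim_ubar)
    return m @ m.conj().T

def von_neumann_entropy(rho):
    """EE from the eigenvalues m_i of the reduced density matrix:
    S = -sum_i m_i log m_i."""
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-12]       # drop numerical zeros before the log
    return -np.sum(ev * np.log(ev))
```

In the tensor network computation the same quantity is obtained by cutting the two legs feeding the top pooling tensor, which is why this bipartition is the cheapest one to evaluate.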
Putting it together, we obtain an explicit expression for the reduced density matrix (37), where the matrix elements are given in terms of the top layer tensor and the overlaps of the |ϕ_I⟩. From the eigenvalues m_i, i ∈ {0, 1, . . ., 2n}^⊗N/2, of this matrix, we readily compute the EE (38) of the network states as S(ρ_{y;U}) = −∑_i m_i log m_i.

Experiments
The number of parameters of the network can easily be calculated as a sum of two terms, where the first term comes from the convolutions (without bias), and the second from the batch normalisation layers. Note that, in the case that d_ℓ is independent of ℓ, the number of parameters grows like log 2 N with N.
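A sketch of this bookkeeping in Python; the exact per-layer counts (d_{ℓ+1}·d_ℓ convolution weights plus two batch-norm parameters per channel) are our reading of the construction, not a formula quoted from the paper:

```python
def n_params(channel_dims):
    """Parameter count sketch for a q-CNN-like stack: convolution
    weights (no bias) plus two batch-norm parameters per channel per
    iteration. The number of iterations is log2(N), so the total is
    independent of the spatial size N beyond that logarithm."""
    conv = sum(channel_dims[l + 1] * channel_dims[l]
               for l in range(len(channel_dims) - 1))
    bn = sum(2 * d for d in channel_dims[:-1])
    return conv + bn
```

With constant channel dimension d per iteration, the count is (d² + 2d) per iteration times log₂N iterations, reproducing the logarithmic scaling quoted in the text.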
Our main examples will be networks with the number of channels increasing with the depth, d_ℓ = d(ℓ + 1) for ℓ = 0, . . ., L, and d_{L+1} = |C| = 10 for the final layer. This results in a total number of parameters which scales quadratically with d and merely with (powers of) log 2 N, a result of the tree tensor network structure, which eliminates quadratic growth with N, as well as of weight sharing, which eliminates linear growth with N. We used values of d from 4 up to 40, resulting in a number of parameters in the range (1320, 391 200). Classification of an input x is done by choosing the label y for which the output |α_y(x)| is the largest (cf (35)), consistent with the quantum interpretation discussed in section 3.1. We used the square distance loss function (48), where l(x) is the correct label of x, the first sum is over the images in the batch and the second is over the different labels. Reducing this cost means that the output in the correct label channel moves closer to 1, while the outputs in the rest of the channels move closer to 0⁹. Optimization was done using the AdamW optimizer [37], with weight decay parameter 0.01 and learning rate 0.01, which was reduced by half every ten epochs. The batch size was chosen equal to 50 and we trained for a total of 90 epochs. Tensor network computations for evaluating the EE were done using the TensorNetwork library [38]. The corresponding code can be found here.
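A numpy sketch of the square distance loss as we read it (one-hot targets, summed over labels; the averaging over the batch is our normalisation choice, the paper's equation (48) may use a plain sum):

```python
import numpy as np

def square_distance_loss(alpha, labels, n_classes):
    """Square distance loss sketch: push the output in the correct
    label channel towards 1 and the remaining channels towards 0."""
    targets = np.eye(n_classes)[labels]          # one-hot encoding of l(x)
    # sum over label channels, average over the batch
    return np.mean(np.sum((alpha - targets)**2, axis=1))
```

Unlike a cross-entropy on softmax probabilities, this loss is not invariant under a sign change of α_y, which, as noted in footnote 9 of the results section, helps it admit a unique minimum.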

Results and discussion
The best (average) test accuracies achieved among all experiments were 97% for MNIST and 89% for Fashion-MNIST, which are comparable to the results obtained in [26]. Note that, since our main goal is the study of the entanglement structure of this type of network architecture rather than achieving the very best classification results, we did not perform an exhaustive hyperparameter search, so it is likely that the achieved accuracies can be improved further. (Footnote 9: given (34) and (35), … However, these choices all give worse training results than (48). One simple reason is that such a loss is invariant under a sign change of α_y and hence does not admit a unique minimum.) Figure 2 depicts a typical (as opposed to optimal in terms of attained accuracies) run of an experiment with d = 18 (81 000 parameters), for both datasets¹⁰. Notice that after epochs 30-50 the test accuracy and cost value start to slowly approach their final values. Figure 2(c) shows the trend (averaged over two epochs) of the average EE of all ten output channels, according to a left/right partition, as it develops during training. Upon initialization of the network the EE is practically zero, and during the first few epochs its value can vary widely among different initialization seeds, optimizers and choices of hyperparameters. However, after the accuracy and the cost value have stopped changing rapidly, which happens after around ten epochs, the EE starts to increase steadily. This signifies a rise in the degree of correlations built between the left/right parts of the network in order to classify the images with more precision. This is to be contrasted with the 'accidental' entanglement that appears during the first ten epochs in the MNIST case, when the network is not yet able to systematically classify the images correctly. Note that in the MNIST case the value of the EE seems to stabilise during the final epochs, whereas it appears to keep growing in the case of F-MNIST. A possible explanation
is that at an accuracy of ~88% there are many more misclassifications in the latter case compared to the former, and thus the minimum EE needed for near-optimal classification has not yet been reached. In support of this, we note that after around epoch 30 the F-MNIST plot resembles the region between epochs 10 and 40 of the MNIST plot, where the corresponding accuracy is still lower than the final achieved value.
Figure 2(d) shows the average EE trend for a different run with the same hyperparameters, but with a different flattening of the images such that the top pooling corresponds to an up/down division of the images. Note that the development trend of the up/down EE is very similar to that of the left/right EE. We believe that the heuristic explanation of the trend given above for the left/right EE also applies to the up/down EE.
As mentioned before, in [14] the (final value of the) EE of a rather different neural network has been measured in the context of the MNIST dataset. We note that the order of magnitude of the final EE we measured is similar to that reported there, suggesting that the EE is indeed a robust quantitative measure of certain key properties of the given tasks.
Finally, in figure 3(a) we depict the average EE versus the value of the cost function, for the same run throughout the training process, as recorded in figures 2(a)-(c). The data before the 30th epoch are discarded due to their large fluctuations, as noted earlier. We observe a distinct negative correlation between the EE and the value of the cost function, which is to say that higher values of the EE tend to appear when the network has lower values of the cost function. This can be seen more easily in figure 3(b), where the plots contain data from many different experiments, across different initialization seeds and choices of hyperparameters, which appear as 'islands' in a larger landscape that also exhibits some degree of negative correlation. This is consistent with our interpretation that the EE of the network starts increasing steadily only as the network starts learning the finer features of the data. Also note that this negative correlation is less evident in the test cost, here depicted by the red dots; we believe that this is because the test cost drops much more slowly than the training cost after a certain point during training. This suggests that the aforementioned increase of the EE is in part also tied to overfitting.
In this work, we have introduced the q-CNN architecture and demonstrated that it is capable of performing the classification tasks on the MNIST and Fashion-MNIST datasets. We have subsequently computed the entanglement entropies for the bipartitions corresponding to the left/right and up/down divisions of the images. The results of these experiments provide evidence that (1) the entanglement structure is necessary for the classification task, and (2) the network does learn this structure during training.
Potentially, the entanglement structure that a network needs to attain before it can start performing the designated task successfully can be a crucial hint for the architecture design and the choice of hyperparameters, as is prominently the case in the analogous context of designing tensor networks to represent specific quantum states. See also our comments in section 1.1. In case the knowledge of the necessary level of entanglement is unavailable, a reasonable proxy could be the (estimated) mutual information of the relevant dataset. In particular, lessons from quantum physics indicate that how entanglement varies with the length scale of the partition could be an important feature to consider. See [27] for a recent review.
It would be interesting to explore the entanglement structure of other bipartitions, corresponding for instance to shorter-ranged entanglement, apart from the left/right and up/down bipartitions. It would also be interesting to investigate the entanglement structure of a generative network approximating the probability density function (34), as such a network would have a more global knowledge of the Hilbert space H. Finally, it would be interesting to compute the EE of a network trained on physical data, for instance in the problem of phase classification of physical systems such as the Ising model, and compare that to the EE of generic quantum states in the corresponding physical system.


Figure 1 .
Figure 1. The neural network architecture used in this paper (left), and its tensorial description (right). The triangles represent the delta tensor implementing the same-channel product pooling (8). The circles represent the matrix multiplication as in (6).

Figure 2 .
Figure 2. (a), (b) Accuracy and cost during training for a typical run of an experiment with d = 18, for both datasets. (c) The corresponding left/right average EE trend, where we average over the classes with an averaging window of two epochs. (d) The up/down average EE trend for a different run with the same hyperparameters.

Figure 3 .
Figure 3. (a) The average EE trend versus the value of the cost function for a typical run of an experiment with d = 18, for both datasets. (b) The corresponding plots across multiple experiments.
which satisfies Tr_{H_U} ρ_{y;U} = ⟨Ψ_{y,*}|Ψ_{y,*}⟩ = 1. Then the EE of the network state |Ψ_{y,*}⟩ corresponding to the bipartition (U, Ū) is given by the von Neumann entropy S(ρ_{y;U}) = −Tr ρ_{y;U} log ρ_{y;U} (38). This quantity measures the extent to which the quantum state is entangled across the subsystems U and Ū, i.e. the degree to which the quantum state fails to be separable into two parts belonging to the two subsystems. For instance, if |Ψ_y⟩ is a product state |Ψ_y⟩ = |Ψ_y⟩_U ⊗ |Ψ_y⟩_Ū with |Ψ_y⟩_U ∈ H_U and |Ψ_y⟩_Ū ∈ H_Ū, then the probability (34) also takes the form of a product of contributions from U and Ū, and we have S(ρ_{y;U}) = 0.