$\beta$-Variational Autoencoder as an Entanglement Classifier

We use an architecture similar to the $\beta$-Variational Autoencoder ($\beta$-VAE) to determine, from measurement data, whether a quantum state is entangled or separable. We split the data into two sets: local measurements and correlated measurements. Using the latent space, which is a low-dimensional representation of the data, we show that local measurements alone do not suffice to distinguish between entangled and separable states. In contrast, when both correlated and local measurements are considered, the structure of the latent space alone attains an accuracy of over 80%.


I. INTRODUCTION
Entanglement is one of the most outstanding properties of quantum systems. It introduces correlations that are non-classical and may occur between otherwise noninteracting systems. It is useful in areas such as quantum information, quantum computation, quantum cryptography and quantum metrology, and it is crucial to the phenomenon of quantum teleportation [2]. Given this importance, it is no wonder that there is great interest in methods that can detect, classify and quantify entanglement [1,9,16,19].
On the other hand, deep learning (DL) techniques are becoming one of the most important assets in the physicist's toolbox, for instance by helping to uncover patterns with little or no bias from a previously established theoretical framework. DL techniques have mostly been used in computer science; successful examples include pattern recognition, especially computer vision [15,21], as well as applications that received media attention, such as AlphaGo [22]. Lately this approach has found its way into physics. Recent applications have emerged in condensed matter physics [4], quantum many-body physics [3,24] and molecular modeling [20]. DL techniques are even being applied to unveil how physical concepts emerge [10,17].
In this paper, we are concerned with the problem of distinguishing between entangled and separable states. This is known to be an NP-hard problem [6]; therefore, no known classical algorithm solves it efficiently.
Specifically, we analyze a method to encode the high-dimensional labeled data coming from measurements of two-qubit states, whose density matrices have 15 parameters, into a lower-dimensional representation that we call the latent space. We approach this problem with a neural network architecture similar to the so-called β-Variational Autoencoder [7], which we use as a tool to distinguish between entangled and separable states.
In section II we explain how we simulated and labeled the data. In section III we detail how to use the β-Variational Autoencoder architecture for our problem and specify the loss function used. We present and discuss the results of the model in section IV, and we finish with our conclusions and future work in section V.

II. DATA
There are several methods to distinguish between entangled and separable states, as summarized in Horodecki et al. [9]. Here we chose to study the case of bipartite entanglement of two-qubit states, applying the Positive Partial Transpose (PPT) criterion (sometimes called the Peres-Horodecki criterion), which was first proposed by Peres [18] and extended by Horodecki et al. [8], who showed that it provides a necessary and sufficient condition for the separability of two-qubit states.
In quantum theory the most general way to describe a quantum state is a density matrix, a linear operator with the following properties: (1) unit trace; (2) positivity.
The PPT criterion consists of applying the partial transpose to a density matrix. If the partially transposed two-qubit state is still positive, then the state is separable (SEP). Otherwise, it is entangled (ENT). Considering $\rho \in \mathbb{C}^2 \otimes \mathbb{C}^2$ as a density matrix of two qubits and $T$ as the transpose map, the PPT criterion can be summarized by the following expression:
$$(I \otimes T)(\rho) \geq 0 \iff \rho \text{ is separable (SEP)}, \qquad \text{otherwise } \rho \text{ is entangled (ENT)}. \tag{1}$$
arXiv:2004.14420v3 [quant-ph] 23 Aug 2021
Since we are going to use supervised learning, we need labeled data. To generate the data we used QuTiP [11,12] to simulate random density matrices and the PPT criterion to label them: we chose the label "1" for entangled states and the label "0" for separable states. These labels are one-hot encoded for use by the neural network. We then measure in the Pauli basis $\{\sigma_i \otimes \sigma_j\}$, where $i, j \in \{0, x, y, z\}$ and $\sigma_0 = I$. All measurements are labeled using the following convention:
$$M_{ij} = \mathrm{Tr}\left[\rho\,(\sigma_i \otimes \sigma_j)\right]. \tag{2}$$
For the two-qubit case we need 15 measurements for a complete tomography of the state, excluding $M_{00}$, which always equals 1 by the definition of the density matrix. One can split the tomographically complete measurements into two disjoint sets: the 9 correlated measurements, $M_{ij}$ with $i, j \neq 0$, and the 6 local measurements, $M_{ij}$ with $i = 0$ or $j = 0$.
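The labeling and measurement steps described above can be sketched as follows. This is a minimal illustration that uses NumPy instead of QuTiP; the helper names, the Ginibre-style random state and the tolerance `tol` are assumptions made for the example, not details taken from the original pipeline.

```python
import numpy as np

# Single-qubit Pauli matrices, indexed 0, x, y, z (sigma_0 is the identity).
PAULI = {
    "0": np.eye(2, dtype=complex),
    "x": np.array([[0, 1], [1, 0]], dtype=complex),
    "y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def random_density_matrix(rng, dim=4):
    """Random mixed state G G^dagger / Tr(G G^dagger) (assumed sampling scheme)."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def partial_transpose(rho):
    """Transpose only the second qubit of a two-qubit density matrix."""
    r = rho.reshape(2, 2, 2, 2)                    # indices (a, b, a', b')
    return r.transpose(0, 3, 2, 1).reshape(4, 4)   # swap b and b'

def ppt_label(rho, tol=1e-12):
    """Return 1 (entangled) if the partial transpose has a negative eigenvalue, else 0."""
    min_eig = np.linalg.eigvalsh(partial_transpose(rho)).min()
    return int(min_eig < -tol)

def pauli_measurements(rho):
    """Expectation values M_ij = Tr[rho (sigma_i x sigma_j)], split into the two sets."""
    correlated, local = {}, {}
    for i, s_i in PAULI.items():
        for j, s_j in PAULI.items():
            if i == "0" and j == "0":
                continue  # M_00 = Tr(rho) = 1 carries no information
            value = np.trace(rho @ np.kron(s_i, s_j)).real
            target = local if "0" in (i, j) else correlated
            target[i + j] = value
    return correlated, local

rng = np.random.default_rng(0)
rho = random_density_matrix(rng)
correlated, local = pauli_measurements(rho)
label = ppt_label(rho)  # 1 = entangled, 0 = separable
```

Here `correlated` holds the 9 values with both indices nonzero and `local` holds the 6 values with one index equal to 0; concatenating them gives the 15-dimensional tomographically complete input.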
Using this convention, we have three types of data, each with 5000 training samples and 3000 validation samples. For convenience, we call these three types the tomographic-complete dataset, the correlated measurements dataset and the local measurements dataset. The data is slightly unbalanced, with approximately 65% of the states belonging to the entangled class. A classifier that always predicts the most frequent class therefore reaches an accuracy of 65%; this "dummy" classifier is our baseline model.
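The baseline can be illustrated with a short sketch. The function name and the toy label array are hypothetical; only the 65% class proportion comes from the text.

```python
import numpy as np

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most frequent training label."""
    values, counts = np.unique(train_labels, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(np.asarray(eval_labels) == majority))

# Toy labels with 65% entangled (1) and 35% separable (0), mirroring the text.
labels = np.array([1] * 65 + [0] * 35)
baseline = majority_baseline_accuracy(labels, labels)
```

Any trained model should be compared against this value rather than against 50%, since the classes are not balanced.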

III. MODEL
The Variational Autoencoder (VAE) was proposed by Kingma and Welling [14] and is most commonly used for generative modeling. It was extended by Higgins et al. [7] into an architecture called the β-Variational Autoencoder (β-VAE), which creates unraveled (disentangled) representations in the latent space (usually a lower-dimensional representation of the data). Both models can be represented by the same graph, shown in figure 1.
The main principle behind the VAE, or β-VAE, is the use of a probabilistic latent space, a lower-dimensional representation of the data that is assumed to follow some prior distribution. The most common choice is the Gaussian distribution $N(0, 1)$, which is used in this work.
Our main result is that we can encode the high-dimensional input of measurements into a two-dimensional latent space that acts as a classifier of entanglement, using an architecture that resembles a β-VAE. This representation cannot be obtained with a feed-forward neural network because it would create a discontinuous latent space. Besides using the latent space as a classifier, we use it to discriminate between the kinds of measurements being made, thus showing that local measurements cannot determine whether a state is entangled or separable.
FIG. 1. The architecture used in this paper, which resembles a VAE (or β-VAE) architecture and consists of an encoder and a decoder with a probabilistic latent space. The nodes represent neurons of the neural network and the edges represent connections between nodes; all layers are fully connected. Only the input and output depend on the dataset; all other layers are independent of it. The encoder has layers of size 128, 64 and 32, the latent space is of size 2, and the decoder has layers of size 32, 64 and 128, followed by an output of size 2.
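To make the layer sizes in figure 1 concrete, the following is a minimal NumPy sketch of one forward pass with randomly initialized weights. It is not the Keras model used in this work: the separate mean and log-variance heads with the reparameterization trick follow the standard VAE construction, and the LeakyReLU slope `alpha`, the weight scale and the omission of Dropout are assumptions made for the example.

```python
import numpy as np

def leaky_relu(x, alpha=0.3):
    """LeakyReLU activation; the slope alpha is an assumed value."""
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def init_dense(rng, sizes):
    """Random (weight, bias) pairs for consecutive fully connected layers."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, enc, head_mu, head_logvar, dec, rng):
    h = x
    for w, b in enc:                      # encoder: 128 -> 64 -> 32
        h = leaky_relu(h @ w + b)
    mu = h @ head_mu[0] + head_mu[1]      # latent mean (size 2)
    log_var = h @ head_logvar[0] + head_logvar[1]
    # Reparameterization: sample z = mu + sigma * eps with eps ~ N(0, 1).
    z = mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)
    h = z
    for w, b in dec[:-1]:                 # decoder: 32 -> 64 -> 128
        h = leaky_relu(h @ w + b)
    w, b = dec[-1]                        # output layer of size 2 with softmax
    return softmax(h @ w + b), mu, log_var

rng = np.random.default_rng(0)
n_inputs = 15                             # 9 or 6 for the other two datasets
enc = init_dense(rng, [n_inputs, 128, 64, 32])
head_mu = init_dense(rng, [32, 2])[0]
head_logvar = init_dense(rng, [32, 2])[0]
dec = init_dense(rng, [2, 32, 64, 128, 2])
x = rng.normal(size=(4, n_inputs))        # placeholder batch of 4 samples
probs, mu, log_var = forward(x, enc, head_mu, head_logvar, dec, rng)
```

Only `n_inputs` changes between the three datasets; the two columns of `probs` play the role of the separable and entangled class probabilities.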
We trained the model end-to-end using backpropagation with a two-component loss function that consists of a categorization loss and a latent loss. Our choice of categorization loss $L_{cat}(y, \hat{y})$ is the categorical cross-entropy, where $y_i$ is the true label and $\hat{y}_i$ is the predicted label. For the latent loss $L_{KL}(\mu, \sigma)$ we chose the Kullback-Leibler (KL) divergence. The total loss is given by the weighted sum of these two losses:
$$L = r_{cat}\, L_{cat}(y, \hat{y}) + \beta\, L_{KL}(\mu, \sigma), \tag{3}$$
where $r_{cat}$ and β are weighting coefficients, which are hyperparameters of our model and should be optimized for our task.
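The two-component loss can be sketched in NumPy as follows. The explicit formulas are the standard categorical cross-entropy and the closed-form KL divergence between a diagonal Gaussian and $N(0, 1)$; averaging over the batch, the `log_var` parameterization and the small constant `eps` are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Batch mean of -sum_i y_i * log(yhat_i) for one-hot labels y_true."""
    return float(np.mean(-np.sum(y_true * np.log(y_pred + eps), axis=1)))

def kl_divergence(mu, log_var):
    """Batch mean of KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    per_sample = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1)
    return float(np.mean(per_sample))

def total_loss(y_true, y_pred, mu, log_var, r_cat=500.0, beta=1.0):
    """Weighted sum r_cat * L_cat + beta * L_KL."""
    return (r_cat * categorical_cross_entropy(y_true, y_pred)
            + beta * kl_divergence(mu, log_var))

# Example: two samples, uninformative predictions, latent code equal to the prior.
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.full((2, 2), 0.5)
mu = np.zeros((2, 2))
log_var = np.zeros((2, 2))
loss = total_loss(y_true, y_pred, mu, log_var)
```

The KL term vanishes exactly when the encoder output matches the prior (zero mean, unit variance), so with a large $r_{cat}$ relative to β the classification term dominates the optimization.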
For training, we adopt the Adam algorithm [13] as the optimizer, together with the 'reduce learning rate on plateau' callback of the Keras framework [5]. In all layers except the last, we used the LeakyReLU activation in conjunction with a Dropout layer [23] to avoid overfitting. For the last layer, we chose the softmax activation in order to capture the probability of the state being separable or entangled.
We trained the model for 100 epochs, starting with a learning rate of 0.005 and a batch size of 256, and using $r_{cat} = 500$ and β = 1 for each dataset. The tuning of these hyperparameters is discussed in section IV.

IV. RESULTS AND DISCUSSION
We trained and evaluated our model on the three datasets, as specified in sections II and III. Our model encodes the information of the input (15-dimensional for the tomographic-complete dataset) into a two-dimensional latent space, as represented in figure 3, where it is possible to see that the structure of the latent space depends on the dataset used. The training loss for the tomographically-complete set is shown in figure 2. For the other datasets, the loss behaves similarly, but the accuracy differs.
The latent space is divided into two distributions, one for the separable states and the other for the entangled states. These distributions can be separated by a line that depends on the initialization of the weights: because the initialization is random, the line can lie along the X or the Y axis. Thus, to find which one suits better, we plot the latent space for the known training data and choose the line that best cuts the plane into separable and entangled states.
For tomographic-complete measurements, we see a clear distinction between entangled and separable states; therefore, we can use the latent space as an entanglement classifier. For instance, if we choose all points with y > 0 to be separable, we find an accuracy of 80%, compared to 84% for the whole model. Therefore, the latent space alone is an effective discriminator of entanglement.
The same can be done for correlated measurements. Indeed, choosing y > 0 to be separable states, we find an accuracy of 80%, while the full model reaches 83%. On the other hand, for local measurements, choosing y > 0 in the latent space gives approximately the same accuracy as the whole model, namely 63%, which does not exceed the 65% baseline.
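The procedure of choosing a cut along one latent axis from the training data can be sketched as follows. The function names and the synthetic latent points are hypothetical and only illustrate the selection of the axis and of the side assigned to separable states; real latent coordinates would come from the trained encoder.

```python
import numpy as np

def fit_axis_threshold(z_train, labels_train):
    """Pick the latent axis and sign whose cut at 0 best separates the training labels."""
    best = None
    for axis in (0, 1):
        for sep_side in (+1, -1):
            # Points on the chosen side of the axis are predicted separable (0).
            pred = np.where(sep_side * z_train[:, axis] > 0, 0, 1)
            acc = np.mean(pred == labels_train)
            if best is None or acc > best[0]:
                best = (acc, axis, sep_side)
    return best[1], best[2]

def predict(z, axis, sep_side):
    """Label 0 (separable) on the chosen side of the cut, 1 (entangled) otherwise."""
    return np.where(sep_side * z[:, axis] > 0, 0, 1)

# Synthetic latent points: separable states (label 0) placed at y > 0.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
z = rng.normal(scale=0.3, size=(200, 2))
z[:, 1] += np.where(labels == 0, 1.0, -1.0)

axis, sep_side = fit_axis_threshold(z, labels)
accuracy = np.mean(predict(z, axis, sep_side) == labels)
```

In practice the axis and side would be fitted on the training latent points and the accuracy reported on the validation set.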
It is interesting to note, as stated before, that the latent space representation depends on the type of measurements being made. For the correlated measurements ($M_{ij}$ with $i, j \neq 0$) we see that the latent space still clusters into two different classes, just as in the case of tomographically-complete measurements. On the other hand, for local measurements ($M_{ij}$ with $i = 0$ or $j = 0$) it is not possible to distinguish between separable and entangled states using the latent space. This is a feature that becomes evident when using the architecture that resembles a β-VAE.
FIG. 3. The plot of the latent space for each validation dataset (examples not previously seen by the model). Yellow dots represent entangled states and black dots represent separable states. From left to right: tomographic-complete measurements, correlated measurements and local measurements. We see clustering for both tomographic-complete and correlated measurements, but no clustering for local measurements. This shows that the importance of each measurement for entanglement detection is different; therefore, we could use only the correlated measurements for detecting entanglement.
FIG. 4. Plot showing the dependence of the accuracy on $\beta/r_{cat}$ for tomographic-complete measurements. There are two regimes, defined by the ratio between β and $r_{cat}$. This happens because the β factor pushes the latent space distribution towards a Gaussian distribution $N(0, 1)$; if β is considerably smaller than $r_{cat}$, the latent space distribution does not need to be Gaussian.
To evaluate the hyperparameters of the model, we varied the β factor of the loss (Eq. 3), as shown in figure 4. As can be seen, when $\beta/r_{cat} > 0.3$ the accuracy of the model drops considerably.
As expected, the β factor multiplying the KL divergence forces the latent space distribution towards the Gaussian prior $N(0, 1)$. Analyzing the shape of the resulting distributions, mainly that of the entangled states, we see that it is not Gaussian at all; therefore, enforcing the KL divergence condition more strongly diminishes the accuracy of the model.

V. CONCLUSION
In this paper, we propose a novel way to use the latent space of a β-Variational Autoencoder to encode the information concerning the entanglement of the quantum state using a set of tomographically-complete measurements.
We divide this set into two disjoint sets, one for correlated measurements and the other for local measurements in order to analyze if there is any difference between these two types of measurements.
Applying our method to a tomographically-complete set of measurements of a two-qubit system, we can distinguish between entangled and separable states with high accuracy, both in the prediction of the model (84%) and using the latent space as an entanglement classifier (83%). In addition, for correlated measurements, of type $\sigma_{x,y,z} \otimes \sigma_{x,y,z}$, we can also distinguish between entangled and separable states, but with lower accuracy for the whole model (82%) and using only the latent space (80%) compared to the set of tomographically-complete measurements.
On the other hand, when applied to local measurements, the model is not able to learn any representation of entanglement in the latent space, showing that local measurements of type $\sigma_{x,y,z} \otimes I$ or $I \otimes \sigma_{x,y,z}$ cannot determine whether the state is entangled or separable.
This result is supported by quantum theory: entangled states are expected to show non-locality, revealed by correlated measurements, whereas separable states are expected to be characterized by local measurements. Our model provides a novel way to distinguish local from correlated measurements by using the latent space.
In the future, we intend to analyze whether this type of model can find an accurate description of bipartite or multipartite entanglement for more than two qubits.

FIG. 2. The plot of the loss function (left) and accuracy of the model (right) during training and evaluation on unseen data. The behavior of the loss function is the same for all datasets but with different accuracy, as specified in the text.