Neural networks for quantum inverse problems

Quantum inverse problem (QIP) is the problem of estimating an unknown quantum system from a set of measurements, whereas the classical counterpart is the inverse problem of estimating a distribution from a set of observations. In this paper, we present a neural-network-based method for QIPs, which has been widely explored for its classical counterpart. The proposed method utilizes the quantumness of the QIPs and takes advantage of the computational power of neural networks to achieve remarkable efficiency for the quantum state estimation. We test the method on the problem of maximum entropy estimation of an unknown state ρ from partial information both numerically and experimentally. Our method yields high fidelity, efficiency and robustness for both numerical experiments and quantum optical experiments.


Introduction
Learning quantum states is an essential task in quantum information processing [1][2][3]. Typically, performing measurements on a quantum system, getting readouts, and reconstructing the quantum states is the way to study the corresponding systems. In general, this process can be written as c = A(ρ) + , where ρ is the quantum state of the system, A is a function of ρ, and c is expectation values obtained from the measurements that are specified by the function A. The vetcor is the noise vector that is subject to a noise distribution π noise . If one knows the state ρ, then c can be obtained by getting the measurements specified by A, and we call this process the quantum forward problem (QFP); the opposite direction is then called the quantum inverse problem (QIP).
Despite of the contexts of quantum systems, when the operator ρ is a diagonal matrix, i.e. a classical probability distribution, the QIP reduces to its classical counterpart, i.e. the classical inverse problem (IP), which is known as one of the most important mathematical problems since it estimates parameters that are not directly measurable, a situation widely applicable in science and technology. From the analytic perspective, it is a problem of recovering system parameters (elements of diagonal ρ) from observables c according to a particular model; from the Bayesian inference viewpoint, the goal is to recover the classical probability distribution of a diagonal ρ subject to the given c [4].
IP has been studied for a long time in statistical inference and statistical learning [5,6]. One of the difficulties is that IPs are often ill-posed. That means the solution may not exist, not unique when it exists, or unstable in the sense that a slight change in the input may cause a significant change to the output. Deep neural networks are state-of-art tools for solving IP [4,[7][8][9][10][11][12], with several of them focused on the ill-posed This paper is organized in the following way: section 2 presents and discusses the deep learning method; section 3 demonstrates the example of giving the MEE ρ MEE of an unknown state ρ from partial information in detail, supported by both numerical and quantum optical experimental results; section 4 then follows with discussion and outlook.

The neural networks method
For a quantum physical system, the underlying mechanism is governing by the physical model where ρ ∈ X is the density matrix, ∈ Y is the noise vector (subject to a noise distribution π noise ), and c ∈ Y is the expectation values of measuring a fixed set of observables F = {F 1 , . . . , F m }. X is the interested set of d × d density matrices according to given prior information. For example, if we know the states ρ are pure states, X is then the set of rank-1 density matrices; X is the set of all density matrices if no prior information is provided. Y is the vector space R m . The physical model A F is always associated with a set F of observables to determine which space it maps into. For simplicity, we will use the notation A instead. Giving ρ to get c is the QFP, the reverse direction we call the QIP. When the measurements are not complete-that is, the number m of observables is not enough to uniquely determine the system, the QIP does not possess a unique solution. Prior information about the interested system decreases the degree of freedom. Under this prior information, a model A is indeed bijective or almost bijective ('almost' in this context means 'measure one', i.e. the function is bijection except for a measure zero set of exceptions.) That means the system can be reconstructed with fewer measurements if the prior information has been appropriately used. However, the challenge of how to efficiently encode the prior information into a practical scheme is demanding [26]. This turns out to be a blessing for our framework, which is straightforward to utilize such information. This point will be explained later.
The key observation is that the forward problem is almost always more straightforward than the inverse. For example, reconstructing an object from its projections is substantially more challenging than making the projections of a known object. Especially in quantum information theory, the forward problem is usually clear thanks to the study of quantum physics. Knowing the information of a quantum system, such as its Hamiltonian or state (density matrix), the measurement outcomes of this system according to a fixed set F are predictable. Compared to reconstructing information about the system from its measurement outcomes, the forward direction is significantly easier.
We take advantage of the complexity difference between the two directions and use the easier direction to help deal with the problematic side. Supervised learning is the relatively mature branch of machine learning techniques that finds the model between input and output pairs data with labelled examples. On the contrary, unsupervised learning techniques are normally used while lack access to training data or having trouble with labelling data [33]. In our problem, data resource for supervised learning is guaranteed. The QFP contributes to generating training and testing data for supervised learning. The next problem is the training data distribution. The forward model A contains the information about the landscape of c according to ρ. This information guides the training data sampling process, largely determining the training data distribution.
From another perspective, QIPs are also regression problems to fit given data pairs. Neural networks are incredibly versatile tools for regression problems. Even simple NNs only with one hidden layer are very expressive. With nonlinear activation functions between neurons, these 'vanilla' NNs can represent arbitrary functions [34]. Traditionally, regression problems require deliberately chosen techniques to achieve better performance. However, NNs are extremely flexible. They automatically adapt themselves to different regression techniques according to the particular scenario. This feature made NN a convenient tool for solving various QIPs.
Before implementation, we need to parametrize density matrices ρ. The parametrization function P : X → X is a bijection, where a ∈ X , X is a vector space. The choice of P is based on X. For example, if the set X is the Gibbs states of a class of Hamiltonian, P could be the map from Hamiltonian parameters to the Gibbs states. The training data set is denoted as where X train ⊂ X is the finite set of sampled parameters a train . The training data naturally implants the prior information contained in A. After data preparation, the network can be trained. A neural network is a tunable function Φ NN : Y → X . The training process uses training algorithms to tune the parameters embedded in Φ NN to minimize the distances between the NN output and desired output according to the chosen loss function, i.e. minimizing L( a train , Φ NN ( c train )), where L : X × X → R is the loss function. It is chosen to reverberate the parametrization P. The goal is to bring ρ 1 and ρ 2 closer by minimizing L( a 1 , a 2 ). The loss function L can be the mean square error or mean absolute error (MAE) if as need to be precisely approached on magnitudes. If P focus more on the direction of as, a loss function that minimizes angle (e.g. cosine similarity) will be a better option. L can also be a type of entropy when as are probability distributions. The choice of P and L, as well as the training data set T all reflect prior information of the problem. For testing data generated by the QFP, comparing the ideal ρ and the estimated ρ est can tell us the accuracy of the estimation. A reasonable question to ask is, given a data c unk with an unknown preimage, how we can know whether the NN estimation ρ est is acceptable. It turns out that QFP can serve as the mechanism of examining the estimation that come out from a trained NN. Choosing a metric f in Y, one can compare c unk and the image of NN output, Ideally, we want c unk and (A • P • Φ NN )( c unk ) to be identical, but numerical errors are inevitable in reality.
Bounding the value of equation (4) can bound the confidence of ρ est .
In the next section, we will provide an example to demonstrate our method. The task is to give a MEE based on noiseless partial information of an unknown state. The network takes incomplete measurements of the unknown quantum state ρ and returns the MEE of ρ. Compared to other algorithms, our method shows extraordinary efficiency without sacrificing much accuracy. It also shows a remarkable ability to tolerate experimental error.

Learning maximum entropy estimation from partial information
Maximum entropy inference is believed to be the best estimation to present the current knowledge when only part of the information about the system is provided [35,36]. The entropy is mostly Shanon entropy in classical physics and engineering, and is von Neumann entropy for the quantum counterpart.
In quantum system, given the set of incomplete measurement expectation values {c i |c i = tr(ρF i ), F i ∈ F} of an unknown state ρ for a fixed set of observables F, there may exist more than one quantum state with the same measurement outcomes. The incompleteness means that the measurements are insufficient for a full tomography (m < d 2 − 1). Denote the set of states as The unknown state ρ is one of the elements in P. The MEE ρ MEE of ρ can be represented as a thermal state where β is the reciprocal temperature of the system and a i 's are real coefficients [37][38][39][40]. At the mean time, ρ MEE should satisfy that it has the same measurement outcomes when measuring the same set of operator F (i.e. ρ MEE ∈ P). The thermal representation is unique [37]. The measurement results c = (c 1 , . . . , c m ) where c i = tr(ρF i ), F i ∈ F, therefore, possess a one-to-one correspondence with its MEE ρ MEE . An interesting special case is that P only has one element, then ρ = ρ MEE . A well-studied example of such case is when ρ is an unique ground state of H = − i a i F i , where F i ∈ F and a i ∈ R [41][42][43]. This situation means that if ρ is a unique ground state of H, ρ MEE not only has a one-to-one correspondence with c, but also is the actual state ρ.
In this particular QIP, the parametrization function P and the forward model A are as follows: A : ρ MEE → c Figure 2. (a) β and a are randomly generated. The Hermitian operator set F associated with the forward model A provides information for the distribution of β. Generated pairs of ( c, β a) are training data: c is the input of the neural network, β a is the ideal output. The left block is the parametrization function P, and the right block is the foreword model A for this problem.
(b) The trained network Φ NN is the approximation of P −1 • A −1 , which can produce a parameter estimation β a according to an input c. The right block is the parameterization function P that maps estimated parameters β a to ρ est .
where a = (a 1 , . . . , a m ) and ρ MEE is defined in equation (5). The noise is set to be zero, i.e. = 0. Supervised learning can train a network Φ NN to approach the inverse function More specifically, as shown in figure 2(a), we randomly generate many β's and a's, achieving corresponding measurement results c. These pairs of ( c, β a) are used as training data for the neural network. The trained network Φ NN is the approximation of the function where β and a i are NN estimations of the true values β and a i . To be noticed that when the unknown state ρ is not a thermal state, the operator −H = − i a i F i is not necessarily the real Hamiltonian of the system. We call H a pseudo Hamiltonian. We test our method numerically with two systems: (1) F has three 64 by 64 random generated Hermitian operators; (2) the five-qubit one-dimension lattice, is the set of Pauli operators with the 2 by 2 identity. The upper index i indicates the qubit which the operator acts on. Moreover, we test the method with experimental data of an optical set-up, which are derived from unique ground states associated with fixed Hermitian operator sets. Therefore the MEE estimations ρ est are also the estimation of the true states measured in our experiments.

Data preparation and network training
Training data preparation is the key to supervised learning since the learning outcome depends heavily on the training data set.
The data generating procedure is shown in figure 2(a). Parameter a is drawn from normal N (0, 1) distribution then normalized. The 'coldness' β is randomly sampled from (0, 100 ]. Generally, when β reaches ∼30, thermal states are almost pure. Here we allow β goes to 100 for some extreme cases. The distribution of β in the whole training data set is critical in this process. We will discuss this issue in depth later. Parameter a together with the fixed set of operators F set up the operator H = i βa i F i . The measurement results c = (tr(ρF 1 ), . . . , tr(ρF m )) come from trace the product of ρ = exp(H)/tr[exp(H)] and operator F i 's. Every pair of β a and c counts for a pair of training data.
It turns out that the distribution of β in the training data set is the key to our problem. By data distribution of β, we mean the proportion of β picked in a given interval I to the amount of data in the whole training data set. Intuitively, the network should be trained with more data in the place where the function changes more rapidly. Specifically, the network should see more data on where the slight change of β causes a significant change on ρ then on c in the relative sense. Despite the matrix exponential function, the property also depends on F. Luckily enough, since we know F, we have all the information we need. The function is steeper while β is small and is smooth while β is relatively large.
However, if we put significantly more data on the narrow steep region (e.g. β ∈ (0, 5)), that may confuse the network-the network will have bad performance on the wider smooth region since it does not see enough data. In order to achieve optimal overall performance, one needs to balance between fitting the rough region and giving enough data of other regions.
First of all, we need a way to measure the 'roughness' of the function in a given area according to the parameter β. We choose how far away the thermal state ρ = exp( i βa i F i )/tr[exp( i βa i F i )] is from being a pure state as the indicator (denote as λ). In other words, where λ 0 is the biggest eigenvalue of ρ.
We divide β into multiple intervals I i = (i, i + 1] where 0 i 99 and i ∈ Z. In each interval I i , 1000 data points have been sampled. 1000 βs are drawn from uniform distribution in I i while 1000 normalized a i is sampled from normal distribution. These βs, as and F together form 1000 Getting λ from each ρ, we calculate the average of these λs and denote it asλ i . The vector λ = (λ 1 , . . . ,λ N ) for all intervals is a characterization of the model according to the change of β (denote N as the number of intervals for generality). Let p i =λ i / iλ i and p β = (p 1 , . . . , p N ).
One may consider using p β to be the data distribution. However, it transpires that p β is not appropriate since it will concentrate the training data in the lower region.
Referring to our previous arguments, we need to balance the distribution. We take two flattened steps: (a) Take the average of the first 10 elements in λ and call itλ ten , then replace these first ten elements which are smaller thanλ ten with it; (b) Denote iλ i /N asλ avrg and then replace elements which are smaller thanλ avrg with it.
We normalize the resulting vector and denote it as p flat . It is the data distribution we use in this work. Three different data generating methods have been compared in detail in appendix A.
The neural networks used in this work are fully-connected feed-forward. It means that the neurons in one layer are fully connected to the neurons in the next layer, and the information is only passing forward. The input and output layers are determined by the given length of the measurement results (i.e. the cardinality of the fixed operator set F). The three random 64 by 64 operator cases have three input and three output neurons (we refer to it as case 1 later in this paper). The five-qubit 1D lattice example has 51 neurons (5 3 1 + 4 3 1 3 1 = 51) for input and output layers (we call it case 2). These two networks all have two hidden layers; each layer has 100 neurons.
The networks in this work are trained with Adam optimizer [44] which is a popular adaptive learning rate optimization algorithm designed for deep networks. The loss function we chose is MAE where e = a − a is the error vector between the true value a and the estimated value a . MAE performs better than mean squared error (MSE, L( e) = m i e 2 i /m, another commonly used loss function). It is because the parametrization function P (equation (6)) would require a to be as close to a as possible to bring the images close and the square in MSE will make small errors more indistinct.
For case 1, the number of training data pairs is 3010 470. The batch size is 40 000, and we train the network for 300 epochs. And we use 2005 584 pairs of training data for the five-qubit 1D lattice model. The batch size is 20 000, and the number of epochs is also 300. How the amount of training data affects the accuracy of NNs is analyzed in appendix B.

Numerical results
New data sets are generated to test the performance of trained neural networks. Similar to the procedure of producing training data in figure 2(a), the testing data are pairs of c and β a. β's are uniformly picked from (0, 100 ] and a's are normalized. The estimated MEE ρ est comes out from adopting the course in figure 2(b). We compare each ρ est with its true MEE ρ MEE by calculating the fidelity. The fidelity function we use is given as [20] f (ρ 1 , ρ 2 ) = tr √ ρ 1 ρ 2 √ ρ 1 .
For case 1, the average fidelity between true MEE ρ MEE and the estimated MEE ρ est is 99.0%. Figure 3(a) shows the fidelities of all tested data. The mini-figure is its boxplot [45], which is a graphical way to depict data through their quartiles. The orange line in the boxplot is the median value which is 99.5%. Statistically,   the circles in the boxplot are outliers which are data points notably different from others, hence lacking statistical significance. Similarly, figure 3(b) shows the fidelities of the whole testing data set for case 2. The average fidelity is 97.1%, and the median fidelity is 98.1% (table 1).

Experimental verification and the effect of error
To verify the performance of our well-trained neural network in processing actual experimental data and its robustness against experimental noise, we implement a qutrit photonic set-up capable of preparing different qutrit states and measuring arbitrary operators, as shown in figure 4. Particularly, when experimental data are generated by unique ground states of a pseudo-Hamiltonian, ρ MEE is the exact ground state measured, and the NN estimation ρ est is also the approximation of the real state. Therefore, we intentionally prepare ground states of a class of pseudo Hamiltonians and feed them into the measurement devices. By directly comparing the prepared ground states ρ exp with ρ est , we can reveal the network's performance in real experiments. In our experiment, we choose two set of operators F 1 and F 2 , and each contain three Hermitian operators. For each set, 300 ground states {ρ exp } of different pseudo Hamiltonian are randomly prepared by changing the setting angles of the two configurable half-wave plates (HWPs) in figure 4(b). Then the prepared states are input into the measurement part, which is constituted by wave plates, calcite beam displacers (BDs) and photon detectors, capable of projecting the input states into an arbitrary basis. From the measurement statistics, expectation values of different operators can be estimated. Thus by this preparation-and-measurement set-up, we obtain the experimental data set c exp = (c 1,exp , c 2,exp , c 3,exp ). See appendix C for details.
Before feeding experimental data into the neural networks, we have trained the networks individually for each operator set. 1010 196 and 1003 808 pairs of noiseless (meaning = 0) numerical data have been used to train networks for F 1 and F 2 , respectively. The network structure and other settings (e.g. training algorithm, loss function etc) are in similar fashion with the previous numerical cases. Figure 5(a) shows the numerical results of F 1 for 1000 random generated data. The average fidelity is 99.9%. Figure 5(c) shows the testing fidelities for F 2 , and the mean value is 99.8%.
The well-tuned neural networks are now ready for experimental data. Measurement outcomes c exp derived from the experiments are inputs of the networks. From the output parameter set β a , the estimated MEEs ρ est s can be derived. The fidelities between ρ exp and ρ est have been calculated and are shown in figure 5(b) (F 1 ) and figure 5(d) (F 2 ). The mean value of all 300 data points is 99.8% for F 1 , and 99.7% for F 2 .
In this experiment, the measurement outcomes suffer from different systematic errors, such as inaccuracies of wave plate setting angles, imperfect interference visibility and drifting of the interferometers, and statistical fluctuations. The average relative errors of different operators ((c i,exp − tr(ρ exp F i )/tr(ρ exp F i )) range in 0.79% ∼ 2.43% (see more details in appendix C). Even in this level of additional experimental errors, the networks show almost the same performance in processing the experimental data compared with the numerical data.

Comparison with other methods
The MEE for given c = (tr(ρF 1 ), . . . , tr(ρF m )) is an optimization problem. It is closely related to the field of information geometry, statistical inference, and machine learning. An iterative algorithm based on the information-geometry viewpoint is proposed in [37], which runs as follows. First, initialize the system Hamiltonian as an identity operator H = I, so the initial density matrix ρ ini = exp(I)/tr[exp(I)] is the maximum mixed state. The following task is to solve the equations tr(ρF i ) = tr(τ F i ) for each i, or, to be more precisely, find a density matrix τ to minimize i |tr(ρF i ) − tr(τ F i )|. This is done by iteratively update the Hamiltonian H by H + F i , so that the density matrix τ is updated as in which the parameter is something like a gradient and can be approximated as for each F i . Repeat the iteration for several rounds, and we can find a τ as closely to ρ as possible.
Another related method is based on the so-called quantum Boltzmann machine (QBM) [46]. The QBM uses a different loss function (or objective function) for optimization, i.e. the cross-entropy, with p i and p i are probability distributions: p i is the ideal case and p i depending on some parameters. The learning process of a QBM is to find certain parameters to minimize L. Take p i = C tr(ρF i ) and p i = C tr(τ F i ), where C and C i are normalization constants. The density matrix τ here can also be expressed as τ = exp(H)/tr[exp(H)]. Since H = i a i F i , the loss function is now a function of a i s. The loss function L reaches its minimum for p i = p i , so our goal is to optimize L over possible a i s.
We can use the same method that the QBM uses to learn the maximum entropy state. To use the cross entropy, for F i with negative eigenvalues, we first renormalize p i s by adding ( −f i min + 1)I to F i , where f i min is the lowest eigenvalue of F i . This ensures p i s being positive, and adding unity operator to Hamiltonian has no effect on its thermal state. Second, since the p i and p i in cross entropy are probability distributions, which means i p i and i p i are both restricted to 1, we add normalization constants C and C in front of tr(ρF i ) and tr(τ F i ), respectively.
We test both the iterative algorithm and the QMB algorithm using MATLAB for the examples in appendix B. The iterative algorithm converges to the desired results precisely and effectively. The average time for an iterative algorithm for each case is about 0.0425 s. As a comparison, if we run the optimization using the functions provided by MATLAB, the time for each case is about 0.0148 s.
However, the method of QBM cannot provide a precise approximation to the original density matrix. This fact may be because the gradient is hard to obtain (notice that the forms of the matrix F i in our cases are far more complicated than the ones discussed in QBM (see [46]). Also, it may be due to the normalization of p i s we have introduced, which can introduce more issues in the learning process. There may be ways to improve the training method, which we will leave for future investigation.
Given that the iterative algorithm seems more effective and accurate for optimization, we compare our supervised learning method with the iterative algorithm. For the case that the measured set F possesses three 64 by 64 Hermitian operators, our method estimates the test set with 99.0% average fidelity (section 3.2), setting the error bound as 10 −10 . As a comparison, the iterative algorithm provides the outcome states with the fidelity of almost 1 for every data point. In terms of accuracy, the interactive algorithm is slightly stronger than ours.
By using the same computational device [47], our network can predict 5000 data in less than a second, while the iterative method requires about 10 min for 100 data. In this sense, our method is more efficient for estimation once trained without sacrificing much accuracy.

Discussion
This paper presents a deep learning method for QIPs. The method shows good performances for both numerical and experimental data, and robustness against experimental noise. As mentioned in the introduction, other techniques in IP can also be modified to solve particular QIPs. For conciseness and consistency, we only discuss simple networks in this paper. This choice left a vast room for future study.
The example we demonstrated is a quantum state learning problem, with the training data numerically generated. The network can also be trained with noisy experimental data to output idea (noiseless) states. The outcomes from the trained network would naturally mitigate the experimental error. One can use deliberate approaches to fight against noise in data, such as different NN structures, regularizers, training algorithms, and loss functions [53]. Our method can also be easily adapted to other setups such as Hamiltonian learning. An example and detailed discussion can be found in [48].
We show that our method is straightforward to implement for scenarios that prior information grants a mostly bijective relation between ρ and c. It can also be used to examine if the prior information is adequate to ensure the map is almost bijective. The 'almost' here means that it is true except a measure zero set of exceptions. After properly implementing this scheme, if the trained network does not perform well, we may consider that the prior information is not exactly pinned down an almost one-to-one correspondence.
For more ill-posed problems, there are several techniques can be applied, such as various regularizers [7], different network structures [14], statistical estimators [11,49], depending on the particular problem.
Last but not least, the neural networks in the inverse process can be replaced by quantum neural networks [50][51][52]. In this case, we may not need the parametrization process P. The modified method has the potential to implement on NISQ devices. Two testing data sets have been generated. The first set has 5000 data points, where 50 different β's have been uniformly drawn from every interval I i . Fidelity boxplots of every five intervals present in figure 7(a) ( p even ), figure 7(c) ( p β ) and figure 7(e) ( p flat ). For comparison purposes, we use the same scale for each plot. The network tuned with the even distribution p even data set has significantly poor performance when β ∈ (0, 5] and also has several exceptional outliers on other intervals ( figure 7(a)). The network of p β is expected to have high fidelity for β ∈ (0, 5] and substandard performance on other parts (figure 7(c)) because of the data concentration. The network of p flat has a balance in between (figure 7(e)).
The second test set has 1000 data points, which β's are uniformly taken from (0, 100 ]. The testing results are shown in figure 7(b) ( p even ), figure 7(d) ( p β ) and figure 7(f) ( p flat ).

Appendix B. The scaling of training data
In this section, by demonstrating with the randomly generated Hermitian set F (F = {F 1 , F 2 , F 3 }, d = 64, the case 1 in the main text), we show how the number of training data for NNs will affect the average fidelities of a fixed unseen test set. For comparison, the hyperparameters (such as epochs, learning rate, batch size etc) are chosen to be the same as the corresponding case in the main text.

Appendix C. Experimental details
The two qutrit operator sets F in our experiment are as follows here F ji stands for the ith operator in the jth set. To demonstrate the neural network's performance, we sample 300 ground-states {|ψ ji } of pseudo Hamiltonian {H j = 3 i=1 a ji F ji } by randomly ranging the parameter set {a ji }. As shown in figure 4, wave plates and a BD are used to distribute single photons in the superposition of optical polarization and spatial modes, realizing the preparation of these ground states. Note that only two configurable HWPs are enough for the preparation (no need for quarter-wave plates or other phase retarders), as the operator sets are all real operators and the ground states should also be real. The three eigen-modes of the qutrit state are defined as |0 = |H ⊗ |s1 , |1 = |H ⊗ |s2 , |2 = |V ⊗ |s2 , where |H (|V ) stands for the horizontal (vertical) polarization and |s1 (|s2 ) stands for the upper (lower) spatial mode.
As for the measurement of different operators F ji , we use linear optical devices such as wave-plates and BDs to construct a three-stage interferometer, that is capable of implementing arbitrary qutrit unitary operation [43]. For the same reason, only HWPs are needed here, and the setup is relatively simpler than implementing a universal unitary. To estimate tr(ρF ji ), we apply the unitary transformation on an input state ρ, here |λ (ji) k (k = 0, 1, 2) is the corresponding eigen-vector of F ji with eigen-value λ (ji) k . It transforms any state from the eigen-basis of F ji into computational or experimental basis. Therefore, from the measurement statistics measured by the following detectors, the expectation value tr(ρF ji ) of F ji can be estimated.
Throughout this experiment, the experimental errors are mainly contributed by systematic errors, as the statistical fluctuations are very low due to enough trials (above 35 000 registered photons) for each measurement. The systematic errors include inaccuracies of wave plate setting angles (typically ∼0.2