Enhancing quantum state tomography via resource-efficient attention-based neural networks

In this work, we propose a method for denoising experimental density matrices that combines standard quantum state tomography with an attention-based neural network architecture. The algorithm learns the noise from the data itself, without a priori knowledge of its sources. Firstly, we show how the proposed protocol can improve the averaged fidelity of reconstruction over linear inversion and maximum likelihood estimation in the finite-statistics regime, reducing the amount of necessary training data by at least an order of magnitude. Next, we demonstrate its use on out-of-distribution data in realistic scenarios. In particular, we consider squeezed states of a few spins in the presence of depolarizing noise and measurement calibration errors, and certify their metrologically useful entanglement content. The protocol introduced here targets experiments involving few degrees of freedom and afflicted by a significant amount of unspecified noise. These include NISQ devices and platforms such as trapped ions or photonic qudits.

Any conventional QST protocol inevitably suffers from a plethora of noise sources, ranging from measurement calibration errors, dark counts, and losses to technical noise, to name a few. Such effects are eminently challenging to model and will eventually decohere the system and wash out any quantum resource.
In recent years, machine learning (ML), artificial neural networks, and deep learning have entered the field of quantum technologies [40], offering many new solutions to QST [41-51]. These approaches learn the noise from the experimental data itself and are agnostic to its sources or models [52-56]. Thus, not only shot noise, which is inherent to the QST task, but also other disturbances can be mitigated by such methods. As only minimal assumptions about the system are required, they are especially suited for the certification task.
Despite their initial success, there are still setbacks in the application of neural-network-based methods for QST. In this work, we focus on two of them: (i) the learning ability of a network with a reduced training-set size [57], and (ii) the possibility of out-of-distribution (OOD) use of this class of methods [58]. OOD analysis is a subfield of ML that studies how models perform on new data that do not belong to the training data distribution, the latter being called the in-distribution (ID) dataset. To this end, we offer a computationally fast general protocol that combines established QST protocols with a supervised architecture trained to denoise the density matrix.
To begin with, we assess the learning and generalization abilities of the proposed method on generic states, as produced e.g. by a random emitter. Then, we use our denoising network, trained exclusively on density matrices affected by shot noise only, to reconstruct new ones obtained from a simulated real-case scenario. In particular, we consider squeezed states of a few spins under depolarizing and measurement calibration noise, and we certify their entanglement depth and usefulness for metrology applications.
The new protocol aims to certify quantum resources in low-dimensional systems afflicted by a significant level of unknown noise [52,59,60], where complete tomography is required.
The paper is organized as follows: in Sec. II, we introduce the main concepts behind QST; in Sec. III, we introduce the data generation protocol and the neural network architecture, and define QST as a denoising task. Sec. IV is devoted to benchmarking our method against known approaches, and we test it on quantum states of physical interest. In Sec. V, we provide practical instructions to implement our algorithm in an experimental setting. We conclude in Sec. VI with several possible future research directions.

II. PRELIMINARIES
Consider a d-dimensional Hilbert space. A set of informationally complete (IC) measurement operators π = {π_i}, i = 1, . . ., d², in principle allows unequivocally reconstructing the underlying target quantum state τ ∈ C^{d×d} in the limit of an infinite number of ideal measurements [61,62]. After infinitely many measurements, one can infer the mean values and construct a valid vector of probabilities p = {p_i} for any proper state τ ∈ S, where by S we denote the set of d-dimensional quantum states, i.e., all unit-trace, positive semi-definite (PSD) d × d Hermitian matrices. Alternatively, π can form a set of operators that spans the space of Hermitian matrices. In such a case, p can be evaluated from multiple measurement settings (e.g., the Pauli basis) and is generally no longer a probability distribution. In either case, there exists a one-to-one mapping Q : F_S → S from the mean values p to the target density matrix τ, i.e., τ = Q(p), where F_S is the space of accessible probability vectors.
In particular, by inverting Born's rule, Eq. (1), elementary linear algebra allows us to describe the map Q as τ = Q(p) = Σ_{i,j} (G⁻¹)_{ij} p_i π_j, where Ĝ is the Gram matrix of the measurement settings, with components G_ij = Tr(π_i π_j).
The inference of the mean values p is only perfect in the limit of an infinite number of measurement shots, N → ∞.
In a realistic scenario, with a finite number of experimental runs N, we have access to the relative frequencies f = {f_i := n_i/N}, where n_i is the number of times the outcome i is observed. Such counts allow us to estimate p within an unavoidable error dictated by the shot noise, whose amplitude typically scales as 1/√N [63]. With only the frequencies f available, we can use the mapping Q to obtain an estimate ρ of the target density matrix τ, i.e., ρ = Q(f). In the limit of infinitely many trials, N → ∞, f_i = p_i and ρ = τ. Yet, in the finite-statistics regime considered in this work, applying the mapping of Eq. (3) to the frequency vector f will generally lead to nonphysical results (i.e., ρ not PSD). In such a case, as examples of proper mappings Q we can consider different methods for standard tomography tasks, such as linear inversion (LI) or maximum likelihood estimation (MLE); see Appendix A. As operators π, we consider positive operator-valued measures (POVMs) and a more experimentally appealing Pauli basis (see Appendix B).
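As an illustration of the finite-statistics problem just described, the following sketch (our own illustrative example, not the paper's code) performs linear inversion for a single qubit in the Pauli basis and shows how plugging raw frequencies into the map Q need not return a physical matrix:

```python
import numpy as np

rng = np.random.default_rng(7)

# Pauli basis {I, X, Y, Z}; its Gram matrix is G_ij = Tr(pi_i pi_j) = 2*delta_ij
paulis = [np.eye(2, dtype=complex),
          np.array([[0, 1], [1, 0]], dtype=complex),
          np.array([[0, -1j], [1j, 0]], dtype=complex),
          np.array([[1, 0], [0, -1]], dtype=complex)]

def linear_inversion(means):
    """Map Q: mean values p_i = Tr(pi_i rho) -> density matrix.
    For the Pauli basis G = 2*I, so rho = sum_i (p_i / 2) pi_i."""
    return sum(p * P for p, P in zip(means, paulis)) / 2

tau = np.array([[1, 0], [0, 0]], dtype=complex)     # target: pure |0><0|
p_exact = [np.trace(P @ tau).real for P in paulis]
assert np.allclose(linear_inversion(p_exact), tau)  # perfect data -> perfect Q

# Finite statistics: estimate <X>, <Y>, <Z> from N projective shots each.
N = 30
f = [1.0]  # Tr(rho) = 1 is fixed
for P in paulis[1:]:
    prob_plus = (1 + np.trace(P @ tau).real) / 2    # Born rule for outcome +1
    n_plus = rng.binomial(N, prob_plus)
    f.append((2 * n_plus - N) / N)                  # frequency estimate of <P>

rho = linear_inversion(f)
print("eigenvalues:", np.linalg.eigvalsh(rho))      # may contain negatives
```

With a pure target state, the estimated Bloch vector can exceed unit length, pushing one eigenvalue below zero; this is precisely why proper estimators (LI with projection, MLE) or the denoising network discussed below are needed.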

III. METHODS
This section describes our density matrix reconstruction protocol, data generation, neural network training, and inference procedure.In Fig. 1, we show how these elements interact within the data flow.In the following paragraphs, we elaborate on the proposed protocol in detail.
The first step in our density matrix reconstruction protocol, called pre-processing, is the reconstruction of a density matrix ρ using finite-statistics QST with frequencies f obtained from measurements on the target state τ. Next, we feed the reconstructed density matrix ρ through our neural network acting as a noise filter; we call this stage post-processing. To enforce the positivity of the neural network output, we employ the Cholesky decomposition of the density matrices, that is, ρ = C_ρ C†_ρ and τ = C_τ C†_τ, where C_{ρ,τ} are lower-triangular matrices. Such a decomposition is unique provided that ρ and τ are positive [64]. We treat the Cholesky matrix C_ρ obtained from the finite-statistics QST protocol as a noisy version of the noiseless target Cholesky matrix C_τ calculated from τ. With these data, we prepare a supervised training for our architecture.
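A minimal numpy sketch (our own illustration) of why the Cholesky parametrization enforces physicality: any lower-triangular matrix C yields a PSD, unit-trace matrix after normalization, and a strictly positive state can be mapped back to a unique Cholesky factor:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

def cholesky_to_state(C):
    """Map an arbitrary lower-triangular matrix to a physical density matrix."""
    rho = C @ C.conj().T            # C C^dagger is PSD by construction
    return rho / np.trace(rho)      # normalize to unit trace

# Even for a random (unconstrained) network-like output, the state is valid.
C = np.tril(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
rho = cholesky_to_state(C)

assert np.allclose(rho, rho.conj().T)                  # Hermitian
assert np.isclose(np.trace(rho).real, 1.0)             # unit trace
assert np.linalg.eigvalsh(rho).min() >= -1e-12         # positive semi-definite

# Conversely, a strictly positive density matrix has a unique Cholesky factor:
C_back = np.linalg.cholesky(rho + 1e-12 * np.eye(d))   # tiny jitter for safety
assert np.allclose(cholesky_to_state(C_back), rho, atol=1e-5)
```

This is the property exploited when training on pairs (C_ρ, C_τ): whatever the network outputs, the reconstructed matrix can always be promoted to a valid quantum state.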

A. Data generation
To construct the training dataset, we start by generating N_train Haar-random d-dimensional target density matrices {τ_m}, m = 1, . . ., N_train. Next, we simulate the experimental measurement outcomes f_m for each τ_m in one of two ways: 1. Directly: When the measurement operators π form an IC-POVM, we can take the noise into account by simply simulating the experiment and extracting the corresponding frequency vector f_m = {n_i/N}_m, where N is the total number of shots (i.i.d. trials) and the counts {n_i}_m are sampled from the multinomial distribution.
2. Indirectly: As introduced in the preliminaries (Sec. II), with projective measurements π such as the Pauli basis (see Appendix B), p_m is no longer a probability distribution. In that case, we can inject an amount of noise equivalent to the direct case by taking f_m = p_m + δp_m, where δp_m is sampled from a multivariate normal distribution of mean zero and isotropic standard deviation σ ∼ 1/(2√N), saturating the shot-noise limit.
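The two data-generation routes can be sketched as follows (an illustrative example under our own mock dimensions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000          # shots per state
d2 = 9              # number of outcomes (illustrative, e.g. d^2 for a qutrit POVM)

def direct_frequencies(p, N, rng):
    """Route 1: multinomial sampling of an IC-POVM probability vector."""
    counts = rng.multinomial(N, p)
    return counts / N

def indirect_frequencies(p, N, rng):
    """Route 2: add Gaussian noise of std ~ 1/(2 sqrt(N)) to the mean values."""
    return p + rng.normal(0.0, 1.0 / (2 * np.sqrt(N)), size=len(p))

p = rng.dirichlet(np.ones(d2))          # a mock probability vector
f_direct = direct_frequencies(p, N, rng)
f_indirect = indirect_frequencies(p, N, rng)

assert np.isclose(f_direct.sum(), 1.0)              # still a distribution
assert np.abs(f_direct - p).max() < 5 / np.sqrt(N)  # deviations at shot-noise scale
```

Route 1 preserves normalization exactly, while route 2 mimics the shot-noise amplitude for mean values (such as Pauli expectation values) that need not form a probability distribution.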
Upon preparing the frequency vectors {f m }, we apply QST by mapping Q, Eq. (4), obtaining the set of reconstructed density matrices {ρ m }.We employ a rudimentary and scalable method, i.e., linear inversion [65], but other QST methods can also be used.

B. Neural network architecture
Our proposed architecture is inspired by other recent models [56,66,67], combining convolutional layers with a transformer layer that implements a self-attention mechanism [68,69]. The convolutional layers extract local features from the data, while the transformer captures global ones. By combining them, we aim to take the best of both approaches. The self-attention mechanism represents the input data as nodes of a graph [70] and aggregates the relationships between the nodes.
Architecture.-The neural network action can be described as a mapping h_θ that transforms the input vectorized Cholesky matrix ⃗C_ρ into an output h_θ(⃗C_ρ). The symbol θ denotes all the variational parameters, i.e., the weights and biases to be optimized during the training phase. The architecture considered here contains two convolutional layers h_cnn, a transformer layer h_tr in between, and a final linear layer h_l, i.e., h_θ = h_l ∘ tanh ∘ h_cnn ∘ h_tr ∘ γ ∘ h_cnn, where γ(y) = (y/2)[1 + Erf(y/√2)], y ∈ R, is the Gaussian Error Linear Unit (GELU) activation function [71], broadly used in modern transformer architectures, and tanh(y) is the hyperbolic tangent; both act elementwise on the neural network nodes. A detailed explication of the model is offered in App. D.
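The combination of local convolutions, GELU, single-head self-attention, and a tanh output stage can be sketched in plain numpy. This is a toy forward pass with our own simplified dimensions (and a sum in place of the second convolutional stage), not the paper's implementation:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(2)

def gelu(y):
    """GELU: gamma(y) = (y/2) * (1 + Erf(y / sqrt(2)))."""
    return 0.5 * y * (1.0 + np.vectorize(erf)(y / np.sqrt(2.0)))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv1d_bank(x, kernels):
    """Apply K one-dimensional kernels to a vector -> K feature vectors."""
    return np.stack([np.convolve(x, k, mode="same") for k in kernels])

def self_attention(F, Wq, Wk, Wv):
    """Single-head self-attention over the K feature vectors (rows of F)."""
    Q, Kt, V = F @ Wq, F @ Wk, F @ Wv
    A = softmax(Q @ Kt.T / np.sqrt(Kt.shape[1]))   # K x K attention weights
    return A @ V

# Toy dimensions: a vectorized 4x4 Cholesky matrix has 16 real entries.
n, K, width = 16, 8, 3
x = rng.normal(size=n)                             # stand-in for vec(C_rho)
kernels = rng.normal(size=(K, width)) / width
Wq, Wk, Wv = (rng.normal(size=(n, n)) / np.sqrt(n) for _ in range(3))
W_out = rng.normal(size=(n, n)) / np.sqrt(n)

F = gelu(conv1d_bank(x, kernels))                  # h_cnn + GELU: local features
F_tr = self_attention(F, Wq, Wk, Wv)               # h_tr: global correlations
y = np.tanh(F_tr.sum(axis=0))                      # pooled second stage + tanh
out = W_out @ y                                    # h_l: final linear layer
assert out.shape == (n,)
```

The attention matrix A is what "aggregates the relationships between the nodes": every output feature vector is a data-dependent mixture of all K input feature vectors.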

C. Neural network training
The neural network training process relies on minimizing the cost function defined as a mean squared error (MSE) of the network output with respect to the target density matrix τ .
with Tr[C_θ C†_θ] a regularization term; cf. Chapter 7 of Ref. [72] for details. We train the model with a dataset containing N_train training samples {ρ_l}. The equivalence between the MSE and the Hilbert-Schmidt (HS) distance is discussed in detail in Appendix C, where we also demonstrate that the mean squared error used in the cost function, Eq. (6), is a natural upper bound of the Bures distance (i.e., of the quantum infidelity). Hence, the cost function of Eq. (6) approximates the target state in a proper quantum metric. By minimizing Eq. (6), we obtain the set of optimal parameters θ for our model h_θ. Finally, the neural network allows for the reconstruction of the target density matrix τ from the Cholesky matrix C_ρ [73], i.e., ρ̄ = C_θ C†_θ / Tr[C_θ C†_θ], where C_θ is the Cholesky matrix assembled from the network output h_θ(⃗C_ρ).
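The MSE-HS correspondence invoked above can be checked numerically; this small sketch (our own illustration) verifies that the squared HS distance between two Cholesky factors equals the summed squared differences of their real vectorizations, i.e., the unnormalized MSE of a real-valued network output:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

def random_cholesky(d, rng):
    """A random lower-triangular factor normalized so C C^dag has unit trace."""
    C = np.tril(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return C / np.sqrt(np.trace(C @ C.conj().T).real)

def vectorize(M):
    """Real vectorization: flatten, then stack real and imaginary parts."""
    return np.concatenate([M.real.ravel(), M.imag.ravel()])

C_rho, C_tau = random_cholesky(d, rng), random_cholesky(d, rng)

diff = C_rho - C_tau
hs_sq = np.trace(diff.conj().T @ diff).real            # squared HS distance
mse_sum = np.sum((vectorize(C_rho) - vectorize(C_tau)) ** 2)

assert np.isclose(hs_sq, mse_sum)   # D_HS^2 equals the summed squared error
```

Dividing both sides by the number of real components (2d²) gives exactly the per-element MSE minimized during training.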

IV. RESULTS AND DISCUSSION
Following the presentation of our QST protocol, we demonstrate its advantages in scenarios of both computational and physical interest.To this aim, we consider two examples.
As the first example, we study an idealized random quantum emitter (see, e.g., Refs. [74,75] for recent experimental proposals) that samples high-dimensional mixed states from the Hilbert-Schmidt distribution. After probing the system using a single-setting square-root POVM, we show the usefulness of our neural network by improving the pre-processed LI and MLE states. This example enables us to assess the learning ability and expressivity of the QST neural network introduced in this work.
In the second example, we focus on a specific class of multi-qubit pure states of special physical relevance, namely states with a metrological resource, as quantified by the quantum Fisher information (QFI). Such states are generated via one-axis-twisting (OAT) dynamics [76,77]. We simulate realistic data by considering the experiment in the presence of depolarizing noise, which mixes the OAT state |Ψ⟩ according to ρ = (1 − p)|Ψ⟩⟨Ψ| + p I/d, where I is the identity operator and p is the noise strength. Furthermore, we reproduce miscalibration of the measurements by adding a bias error to the inferred expectation values. This example allows us to evaluate our protocol on OOD data.
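Both noise models are easy to simulate; the sketch below (our own illustration, with placeholder observables and the p = 0.3 and 10⁻⁴ bias values quoted later in the text) applies a depolarizing channel and a fixed calibration bias to the Born values:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16                                    # e.g. L = 4 qubits

def depolarize(rho, p):
    """Depolarizing channel: rho -> (1 - p) rho + p I/d."""
    return (1 - p) * rho + p * np.eye(len(rho)) / len(rho)

def biased_means(rho, operators, bias_scale, rng):
    """Born values with a fixed random bias mimicking miscalibration."""
    means = np.array([np.trace(P @ rho).real for P in operators])
    return means + rng.uniform(-bias_scale, bias_scale, size=len(means))

psi = np.zeros(d); psi[0] = 1.0           # stand-in pure state for |Psi>
rho = np.outer(psi, psi)                  # |Psi><Psi|
rho_noisy = depolarize(rho, p=0.3)

purity_before = np.trace(rho @ rho).real
purity_after = np.trace(rho_noisy @ rho_noisy).real
assert np.isclose(np.trace(rho_noisy).real, 1.0)   # trace preserving
assert purity_after < purity_before                # mixing reduces purity

# Example: three mock diagonal observables with miscalibrated readout.
ops = [np.diag(rng.choice([-1.0, 1.0], size=d)) for _ in range(3)]
f = biased_means(rho_noisy, ops, bias_scale=1e-4, rng=rng)
```

Crucially for the OOD experiment, neither channel is shown to the network during training; it only ever sees shot noise.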

A. Reconstructing high-dimensional random quantum states
Scenario.-Let us consider states {τ_j} sampled from the distribution uniform with respect to the Hilbert-Schmidt measure (see Appendix E) on a Hilbert space of dimension d = 9. Each target state τ_j is measured N_trial times using its copies. This scenario allows us to benchmark our algorithm against the protocol offered in Ref. [54].
We prepare measurements on each trial state τ_j using the informationally complete square-root POVM (IC), as defined in Eq. (B1). This allows obtaining the state reconstruction ρ_j via two standard QST protocols, i.e., the linear inversion (LI) and maximum likelihood estimation (MLE) algorithms, as well as via our neural-network-enhanced protocols, denoted LI-NN and MLE-NN; see Fig. 1. Finally, we evaluate the quality of the reconstruction using the squared Hilbert-Schmidt distance between the target and the reconstructed state, D²_HS(τ, ρ) = Tr[(τ − ρ)²]. Benchmarking.-Fig. 2(a) presents the averaged squared Hilbert-Schmidt distance D²_HS as a function of the number of trial states N_trial. To reach a given averaged HS distance, our neural-network-enhanced protocols require a much lower number N_trial of copies of the states, compared to linear inversion and maximum likelihood estimation alone. We note that the proposed protocols improve MLE for a relatively small number of trial states, N_trial < 10³ (see inset), which is important from an experimental point of view. As expected, the lowest HS distance is obtained for many trials with the MLE algorithm.
Recently, a state-of-the-art QST neural network protocol was proposed by Koutný et al. [54]. The authors report better performance than the MLE and LI algorithms with N_train = 8·10⁵ training samples for a qutrit as well as larger systems, d ≥ 3. In Fig. 2(b), we show that our protocol requires an order of magnitude less training data to achieve a comparable level of reconstruction.
Data preparation.-For this task, we generate our data focusing on the experimentally friendly Pauli operators. The states under consideration are prepared indirectly, with a fidelity of ∼ 85% with respect to the target OAT state. Then, we simulate the presence of a noise channel by depolarizing our input state with a strength factor p = 0.3, according to Eq. (8). Lastly, to simulate a calibration defect in the measurement apparatus, a fixed bias of random values of the order of 10⁻⁴ is added to the Born values. After applying all these noise sources, the LI reconstructions reach an average fidelity of 75.4 ± 1.1%. For the test set, we select 100 OAT states at evenly spaced times t ∈ (0, π) and assess the average reconstruction achieved by our neural network [118].
The description of the model under consideration can be found in Table I.
Inferring the quantum Fisher information.-Finally, we evaluate the metrological usefulness of the reconstructed states, as measured by the quantum Fisher information (QFI), F_Q[ρ, Ĝ]. The QFI is a non-linear function of the state and quantifies the sensitivity upon rotations generated by Ĝ. Notwithstanding its usefulness in metrology, it is highly nontrivial to evaluate experimentally due to its great sensitivity to noise disturbances, with state-of-the-art experiments reaching only up to 4 qubits [119,120]. We refer the reader to Appendix F for more details.
In this context, we consider the collective spin component Ĵ_v as the generator, Ĝ = Ĵ_v, with the orientation v ∈ R³ selected to achieve maximal sensitivity. The QFI related to collective rotations can also serve to verify quantum entanglement [80], specifically the entanglement depth k, which is the smallest number of genuinely entangled particles required to describe the state. If F_Q[ρ, Ĵ_v] > kL, then the quantum state ρ possesses an entanglement depth of at least k + 1 [121,122]. In particular, for states with depth k = 1 (i.e., separable), the metrological capability is limited to the shot-noise threshold [123]. This limit is reached by coherent spin states, such as the initial (t = 0) state of the OAT evolution, |+⟩^⊗L.
OOD results.-In Fig. 3, we present the evolution of the QFI (normalized by the coherent limit, L = 4) for the OAT target states (top of the solid blue lines). For this numerical experiment, we make full use of the OOD approach: using a network trained only to tackle sampling noise, we feed it at inference with a dataset that also includes depolarization and measurement noise. As Fig. 3 shows, for two different realizations of calibration noise, our network greatly improves the LI reconstructions, obtaining fidelities of 88.7 ± 2.3% and 91.0 ± 1.9% in the left and right panels, respectively. Thereby, we surpass the three-body bound (QFI/L = 3), thus revealing genuine 4-body entanglement, which is the highest depth possible in this system (since it is of size L = 4). For example, note that at time t = π/2 the OAT dynamics generates the cat state, which is genuinely L-body entangled, and so it is certified. For completeness, several complementary analyses are offered. First, in Appendix G, we analyze the QFI time evolution of OAT states for different sampling-noise values only, using Pauli and tomographically optimal SIC-POVM operators in the data generation. Next, in Appendix H, we benchmark our model against two different convolutional architectures to assess the advantage offered by our transformer-based model on quantum data. Lastly, we propose an alternative method to incorporate the statistical noise as a depolarizing channel in Appendix I.

V. CONCRETE EXPERIMENTAL IMPLEMENTATION
To recapitulate this contribution, as a complement to Fig. 1 and our repository provided in Ref. [124], we summarize the practical implementation of the protocol introduced in this work.
1. Scenario: We consider a finite-dimensional quantum system prepared in a target state τ. Our objective is to verify the preparation of the quantum state τ via QST. To this end, we set a particular measurement basis π to probe the system.
2. Experiment: After a finite number of experimental runs, we construct the frequency vector f from the counts.
3. Preprocessed quantum state tomography: From the frequency vector f and the basis π, we infer a first approximation of the state ρ via the desired QST protocol (e.g., one of those introduced in Appendix A).

4. Assessing the pre-reconstruction: We evaluate the quality of the reconstruction by, for example, computing D²_HS(τ, ρ), the quantum fidelity, or any other meaningful quantum metric.

5. Training the neural network: To improve such a score, we resort to our neural network solution to complete a denoising task. As with any deep-learning method, training is required. For example, if we reconstruct OAT states (Sec. IV B), we may train only in the permutation-invariant sector.
(c) Due to the greater generalization ability demonstrated in the mixed-state case, we can perform transfer learning to tailor a pre-trained model to a specific apparatus with fewer computational resources; in this way, the model acquires knowledge of that specific apparatus noise. For example, if we have a quantum random source to characterize (Sec. IV A), an amount of experimental data corresponding to 10-15% of the training dataset can be used to refine the model.

6. Feeding the neural network: We feed the preprocessed state ρ into our trained matrix-to-matrix neural network to recover the enhanced quantum state ρ̄.

7. Assessing the neural network: We compute the updated reconstruction metric on the post-processed state, D²_HS(τ, ρ̄). Finally, we assess the usefulness of the neural network by comparing how much smaller this value is than the pre-processed score D²_HS(τ, ρ).
The strength of our proposed protocol lies in its broad applicability, as the choice of the basis π and the QST pre-processing method is arbitrary.

VI. CONCLUSIONS
We proposed a novel deep learning protocol that improves standard quantum state tomography methods, such as linear inversion and maximum likelihood estimation. Based on a combination of transformer and convolutional layers, it greatly reduces the required training-dataset size and can denoise new, unseen data even when depolarization and measurement noise are present. First, the proposed method reduces the number of measurements necessary to reconstruct the target density matrix by at least an order of magnitude compared to other QST protocols supported by finite-statistics neural networks.
Secondly, for 4-qubit, Bell-correlated few-spin states generated with OAT, the inference stage was performed on OOD data. We tested our model, pre-trained only on statistical sampling noise, on data also accounting for depolarization and measurement noise, achieving average reconstruction fidelities of 88.7 ± 2.3% and 91.0 ± 1.9% for two different realizations of noise in the measurement setup. The superior learning ability demonstrated for mixed states makes our architecture an optimal candidate for transfer learning, when further refinement of a pre-trained model with few experimental data is desired. On the other hand, the OOD results demonstrate the potential of our protocol for plug-and-measure applications, given its high resilience to unknown levels of noise from different physical sources. Thus, it paves the way for the use of these novel methods in current quantum computers, NISQ devices, and quantum simulators based on spin arrays [19,125].
Another persistent challenge in this field concerns the scalability of the algorithms with the number of subsystems. In this regard, several strategies based on incomplete tomography have been proposed, like the well-known generative NN applications [42]. A limitation of our protocol is its scalability: the use of the Cholesky decomposition in data preprocessing restricts the method to Hilbert spaces of modest dimension. An extension to a Cholesky-free approach is left for future exploration.
Data and code availability.-Data and code are available at Ref. [124].
Pauli basis.-The last IC basis π for L-qubit systems is the Pauli basis, constructed as tensor products of single-qubit Pauli operators, π = {σ_{i_1} ⊗ · · · ⊗ σ_{i_L}}, i_k ∈ {0, x, y, z}. With respect to such a basis, expectation values can be evaluated experimentally by rotating each qubit individually. This is also true for the SIC-POVM if evaluated with multiple settings. Such expectation values p no longer form a probability distribution (note that, in particular, such mean values can be negative). The reason is that the basis does not form a POVM (that is, its elements are not PSD and do not sum to I). However, it covers the whole space of Hermitian matrices supported on [C²]^⊗L = H, as does any basis specified in this appendix.
Appendix C: From quantum fidelity to mean-squared error

Upper bound on the Bures distance.-The Bures distance between two states ρ and τ is defined as

D_B(ρ, τ) = [2(1 − √F(ρ, τ))]^{1/2},    (C1)

where

√F(ρ, τ) = Tr[(√ρ τ √ρ)^{1/2}]    (C2)

is the square root of the quantum fidelity between ρ and τ. The square root of the fidelity F(ρ, τ) can be expressed as ([128], Eq. 9.30)

√F(ρ, τ) = max_{A,B} |Tr(A†B)|,    (C3)

where the maximization is over the complex amplitudes {A, B} which constitute polar decompositions of {ρ, τ}, respectively, i.e., ρ = AA† and τ = BB†. Eq. (C3) is actually the original definition of quantum fidelity, motivated by the concept of transition probability. In fact, if both states are pure, {ρ = |ψ⟩⟨ψ|, τ = |ϕ⟩⟨ϕ|}, then {A = |ψ⟩, B = |ϕ⟩} and the RHS of Eq. (C3) reduces to the overlap |⟨ψ|ϕ⟩|, so that F = |⟨ψ|ϕ⟩|². Note that the decomposition admits a gauge degree of freedom, A → AU for U unitary (and similarly for B). Our work resolves this redundancy using the Cholesky decomposition defined in the main text. From Eq. (C3) we see that, for any particular polar decomposition (and in particular the Cholesky one as a canonical choice, ρ = C_ρ C†_ρ, τ = C_τ C†_τ), the following inequality always holds:

√F(ρ, τ) ≥ |Tr(C†_ρ C_τ)| ≥ Re Tr(C†_ρ C_τ).    (C4)

Finally, rewriting the square of Eq. (C1) as D²_B(ρ, τ) = 2 − 2√F(ρ, τ) and using Eq. (C4) together with Tr(C†_ρ C_ρ) = Tr(C†_τ C_τ) = 1, we obtain

D²_B(ρ, τ) ≤ 2 − 2 Re Tr(C†_ρ C_τ) = D²_HS(C_ρ, C_τ),    (C5)

where the HS distance defined in the main text is extended to complex matrices (not necessarily Hermitian) as

D²_HS(K, M) = Tr[(K − M)†(K − M)].    (C6)

Hilbert-Schmidt distance as mean squared error (MSE).-In the following, we connect the Hilbert-Schmidt distance [Eq. (C6)] between two Cholesky matrices {C_ρ, C_τ}, associated to the quantum states {ρ, τ}, with the mean-squared error of the matrix elements. First, consider a d × d complex matrix K. Let us introduce the vectorization ⃗K of its matrix elements as

⃗K := Re[K̄] ⊕ Im[K̄],    (C7)

where K̄ = (K_11, K_12, . . ., K_dd) is the flattening of the matrix and ⊕ denotes the direct sum of vectors. Let K = C_ρ − C_τ; then the squared HS distance, Eq. (C6), reads

D²_HS(C_ρ, C_τ) = Σ_{i,j} |K_ij|² = ||⃗K||²₂.    (C8)

Finally, we observe that this quantity, up to normalization, is the MSE, which is the natural cost function of a standard feed-forward neural network.
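The bound derived in this appendix, namely that the squared Bures distance never exceeds the squared HS distance between Cholesky factors, can be sanity-checked numerically. The sketch below (our own illustration) uses a pure-numpy matrix square root for Hermitian PSD matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4

def sqrtm_psd(M):
    """Matrix square root of a Hermitian PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.conj().T

def random_state(d, rng):
    """Full-rank random density matrix (Ginibre construction)."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

rho, tau = random_state(d, rng), random_state(d, rng)

s = sqrtm_psd(rho)
sqrt_fid = np.trace(sqrtm_psd(s @ tau @ s)).real   # square root of fidelity
bures_sq = 2.0 * (1.0 - sqrt_fid)                  # squared Bures distance

C_rho = np.linalg.cholesky(rho)                    # canonical decompositions
C_tau = np.linalg.cholesky(tau)
K = C_rho - C_tau
hs_sq = np.trace(K.conj().T @ K).real              # squared HS distance, Eq. (C6)

assert bures_sq <= hs_sq + 1e-10                   # the upper bound holds
```

Because the bound holds for any polar decomposition, minimizing the MSE between Cholesky factors is guaranteed to drive down the Bures distance, and hence the infidelity, of the reconstruction.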

Appendix D: Architecture details
The first layer h_cnn applies a set of K fixed-size trainable one-dimensional convolutional kernels to ⃗C_ρ, followed by a non-linear activation function, i.e., γ(h_cnn(⃗C_ρ)) → {F¹_cnn, . . ., F^K_cnn}. During the training process, the convolutional kernels learn different features of the dataset, which are then fed to the transformer block h_tr. The transformer block h_tr distills the correlations between the features extracted by the kernels through the self-attention mechanism, providing a new set of vectors, that is, h_tr(F¹_cnn, . . ., F^K_cnn) → {F¹_tr, . . ., F^K_tr}. In the last step, the outputs of the second convolutional layer h_cnn are summed to form the output ⃗C_θ, i.e., tanh(h_cnn(F¹_tr, . . ., F^K_tr)) → ⃗C_θ. The output is combined into C_θ as in Eq. (7), and a custom cost function η is used to apply the ReLU activation on the diagonal elements of the reconstructed matrix and the tanh on the off-diagonal elements. The role of the attention mechanism is explored in Appendix H, where we benchmark the transformer block against CNNs.
The training data and the considered architecture allow interpreting the trained neural network as a conditional debiaser (for details, see Appendix J). Although the proposed protocol cannot improve the predictions of unbiased estimators, any estimator that outputs valid quantum states (e.g., LI, MLE) must be biased due to boundary effects. In this framework, the task of the neural network is to learn such skewness and drift the distribution towards the true mean.
Computational details.-The architecture is trained on 10,000 samples for training and 1,500 for validation, with a batch size of 1,500. The code is run on an Nvidia A100 GPU card with 80 GB of memory (CUDA version 12.1). The total training time amounts to ∼ 25 minutes. A similar time can be obtained on a commercially available GPU virtual machine.
In practice, we use the Python package qutip to compute those. For further information on random states, we refer the reader to the book of Ref. [128].
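For readers without qutip at hand, Hilbert-Schmidt-distributed mixed states can be drawn with plain numpy via the standard Ginibre construction (the function name below is our own):

```python
import numpy as np

def hs_random_dm(d, rng):
    """Hilbert-Schmidt-random density matrix: rho = G G^dag / Tr(G G^dag),
    with G a square complex Ginibre matrix."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

rng = np.random.default_rng(6)
rho = hs_random_dm(9, rng)                      # d = 9, as in Sec. IV A

assert np.isclose(np.trace(rho).real, 1.0)      # unit trace
assert np.linalg.eigvalsh(rho).min() >= -1e-12  # positive semi-definite
```

A square Ginibre matrix yields the flat Hilbert-Schmidt measure; rectangular G of size d × k would instead produce the induced measures mentioned in Ref. [128].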
Appendix F: Capturing the metrological usefulness of OAT evolved states

Given a quantum state ρ with spectral decomposition ρ = Σ_k p_k |k⟩⟨k|, we can evaluate its sensitivity upon rotations generated by Ĝ, ρ(θ) = e^{−iθĜ} ρ e^{+iθĜ}, as quantified by the QFI:

F_Q[ρ, Ĝ] = 2 Σ_{k,l} (p_k − p_l)²/(p_k + p_l) |⟨k|Ĝ|l⟩|²,    (F1)

where the sum runs over pairs with p_k + p_l > 0. In the main text, we consider an ensemble of L spin-1/2 particles, or two-level atoms. In such a system, and in the context of magnetometry (e.g., via Ramsey interferometry), the phase is encoded in the same way in every constituent via the collective generator

Ĵ_v = v · Ĵ = Σ_{a∈{x,y,z}} v_a Ĵ_a,   Ĵ_a = (1/2) Σ_{i=1}^L σ_a^{(i)}.    (F2)

For a generic state ρ, it is not straightforward to find the optimal spatial direction v exploiting the maximal sensitivity. However, for pure states ρ = |Ψ⟩⟨Ψ| (like the OAT evolution that we consider), the QFI is (four times) the variance of the generator, and the best direction is then yielded by the eigenvector v_max associated with the maximal eigenvalue of the covariance matrix C:

C_ab = (1/2)⟨Ĵ_a Ĵ_b + Ĵ_b Ĵ_a⟩ − ⟨Ĵ_a⟩⟨Ĵ_b⟩,    (F3)

where the expectation value is taken against pure states, ⟨•⟩ := ⟨Ψ| • |Ψ⟩. The maximal value of the QFI achieved is consequently F_Q = 4λ_max, where λ_max indicates the maximal eigenvalue of C. Here, we introduce two basic examples of such results: • A coherent spin state pointing along x of length J = L/2, |+⟩^⊗L. This initial state was chosen to start the OAT dynamics, Eq. (9) (main text). In such a case, the optimal axis is any orientation orthogonal to x, i.e., contained in the yz plane. The QFI achieved is exactly L, which is the maximal value that can be reached by separable states (a.k.a. the shot-noise limit) [123].
• The GHZ or cat state aligned along x, (|+⟩^⊗L + e^{iϕ}|−⟩^⊗L)/√2, which is realized at time t = π/2 of the OAT dynamics. Now, the optimal generator points in the x direction, and the corresponding QFI (four times the variance) is L². Such a value is actually the maximal QFI achievable within the quantum framework and requires genuine L-partite entanglement [121,122].
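For pure states, the covariance-matrix recipe above is easy to verify numerically. This sketch (our own illustration) builds the collective spin operators for L = 4 qubits and reproduces the cat-state value F_Q = L²:

```python
import numpy as np
from functools import reduce

L = 4
I2 = np.eye(2)
sig = {"x": np.array([[0, 1], [1, 0]], dtype=complex),
       "y": np.array([[0, -1j], [1j, 0]], dtype=complex),
       "z": np.array([[1, 0], [0, -1]], dtype=complex)}

def collective(a):
    """J_a = (1/2) sum_i sigma_a^(i) acting on L qubits."""
    ops = []
    for i in range(L):
        factors = [sig[a] if j == i else I2 for j in range(L)]
        ops.append(reduce(np.kron, factors))
    return 0.5 * sum(ops)

J = {a: collective(a) for a in "xyz"}

# Cat state along x: (|+>^L + |->^L) / sqrt(2)
plus = np.array([1, 1]) / np.sqrt(2)
minus = np.array([1, -1]) / np.sqrt(2)
psi = (reduce(np.kron, [plus] * L) + reduce(np.kron, [minus] * L)) / np.sqrt(2)

def qfi_pure(psi, J):
    """Pure-state QFI: 4 * max eigenvalue of the covariance matrix C."""
    axes = "xyz"
    mean = {a: (psi.conj() @ J[a] @ psi).real for a in axes}
    C = np.array([[0.5 * (psi.conj() @ (J[a] @ J[b] + J[b] @ J[a]) @ psi).real
                   - mean[a] * mean[b] for b in axes] for a in axes])
    return 4 * np.linalg.eigvalsh(C).max()

print(qfi_pure(psi, J))   # L^2 = 16: Heisenberg limit, genuine 4-body entanglement
```

Repeating the computation with psi = |+⟩^⊗L returns the separable limit L instead, matching the first example of this appendix.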
Since the reconstructed OAT states ρ are of high purity, the same procedure can approximate the optimal orientation v. However, the QFI results are evaluated exactly as per Eq. (F1). For further study, we refer the reader to the excellent review of Ref. [78].

TABLE I. Comparison of the average fidelity and its standard deviation between the reconstructed and the target states of size d = 16 for various QST methods (rows), with varying numbers of measurement trials, N_trial = 10⁶, 10⁵, 10⁴, 10³, as indicated by the consecutive columns. The first row presents the average fidelity reconstruction for linear-inversion QST, averaged over OAT states evenly sampled from t = 0 to t = π. Employing our neural network provides an enhancement over the bare LI, as shown in the second row for the same target set. Finally, the third row shows data for NN-enhanced LI but averaged over general Haar-random states. All initial Born values are calculated by noiseless SIC-POVM.
This appendix assesses the usefulness of the transformer architecture in improving our QST matrix-to-matrix protocol. This becomes especially pronounced in the regime of highly correlated, that is, entangled quantum states, as shown in the latter part of this appendix. To test its role, we explore and compare our transformer model with two different convolutional-only architectures (CNNs) based on their reconstruction ability. In particular, we benchmark against two setups: (i) a four-layer convolutional neural network, where the transformer layer is replaced with two convolutional ones, and (ii) the simplest CNN model, consisting of two convolutional layers, h_l ∘ γ ∘ h_cnn ∘ γ ∘ h_cnn, where γ is the GELU activation function in both cases.
We set an equivalent number of trainable parameters for all three architectures and tested the two CNNs on the same datasets as previously used. In the following, we analyze the performance of the models for mixed- and pure-state reconstruction.
Mixed-state reconstruction.-In Fig. 5, we show the reconstruction of mixed states using the three setups, namely our transformer and the two CNN architectures. Firstly, as shown in panel (a), for the LI pre-processed data, our transformer-based model outperforms the CNNs, showing a higher expressivity in terms of better reconstruction ability for a large number of trials, N_trial > 10⁴. However, the three models are almost equivalent in the undersampled regime. Next, in panel (b), we show our numerical experiments conducted on the MLE pre-processed dataset, which has a comparatively lower amount of noise. We can see that the three models are almost equivalent.

[Figure caption: The red lines represent the whole setup with neural-network post-processing of the data from the corresponding green lines, indicating improvement over the LI method. The neural-network advantage over the bare LI method can be characterized by entanglement-depth certification, as shown by the horizontal lines denoting the entanglement-depth bounds, ranging from the separable limit (bottom line, bold) to the genuine L-body limit (top line). In particular, the presence of entanglement, k ≥ 2, is witnessed by QFI > L, as shown by the violation of the separable bound (bold horizontal line).]
Pure-state reconstruction.-For Haar-random pure states, using either of the two CNN models in the QST protocol causes an evident drop of ∼ 10% in the fidelity of reconstruction compared to our attention-based model in the undersampled regime (N_trial = 10³), as exemplified in Table II.
Finally, we apply our protocol to the reconstruction of pure states generated from one-axis twisting (OAT) dynamics. In Fig. 6 we present the quantum Fisher information extracted from the reconstructed states. A significant drop in the reconstructed quantum Fisher information is observed when using the CNN-based architectures compared to the transformer architecture. We conclude that our transformer-based architecture performs significantly better than the CNN-only architectures when reconstructing pure OAT states, with the QFI as the figure of merit. For mixed-state reconstruction, the three architectures are comparable, with the transformer-based approach performing better on the LI pre-processed states in the highly sampled regime.

Appendix I: Improving the averaged MLE in HS metric
The MLE is efficient asymptotically, i.e., when the number of measurements N → ∞. However, for few measurements there is no guarantee that MLE performs best. Such a situation takes place in the undersampled regime. In terms of the HS metric, it is then often more convenient to ignore any knowledge and simply take the maximally mixed state I/d as the estimate. Here, we propose a simple method to interpolate between the two extreme results by depolarizing the MLE state ρ,

ρ_p = p ρ + (1 − p) I/d,    (I1)

with 0 ≤ p ≤ 1. The parameter p then incorporates the statistical noise inherent in having a finite number of samples.
In Fig. 7 we show the average HS distance as a function of the number of trials N_trial, for different values of the parameter p. The first extreme case (p = 0), i.e., the maximally mixed state, is presented as a horizontal solid line, while the second extreme case (p = 1), i.e., the MLE-reconstructed state, is depicted as a dashed line. The curves for intermediate values of p form an envelope; for each number of trials there is a critical value p* that outperforms all other values of p.
To calculate the critical p*, let us notice that the average of the squared HS distance with respect to the state of Eq. (I1), D^2_p = ⟨D^2_HS(τ, ρ_p)⟩, can be expressed as a parabola in p,

D^2_p = ⟨Tr A^2⟩ − 2p ⟨Tr(AB)⟩ + p^2 ⟨Tr B^2⟩,    (I2)

where A = τ − I/d, B = ρ − I/d, and the averages ⟨·⟩ are taken over the ensemble of reconstructions. Its minimum is attained at p* = ⟨Tr(AB)⟩ / ⟨Tr B^2⟩. Finally, one should check how this method relates to Bayesian approaches [131]; in particular, how to incorporate partial information about the ensemble, e.g., by assuming only an average purity as prior knowledge.
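The parabola in p and its nontrivial minimum can be reproduced numerically. The following is a minimal numpy sketch under toy assumptions (all names, such as avg_hs, are hypothetical): the true states are drawn from the Hilbert-Schmidt ensemble, and the role of the finite-statistics MLE reconstructions is played by noisy copies of the targets rather than actual MLE outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 9           # Hilbert-space dimension, as in Fig. 7
n_states = 200  # size of the toy ensemble

def random_hs_state(d, rng):
    # Ginibre construction: samples from the Hilbert-Schmidt ensemble
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def hs_distance_sq(a, b):
    diff = a - b
    return np.trace(diff @ diff).real

identity = np.eye(d) / d
targets, estimates = [], []
for _ in range(n_states):
    tau = random_hs_state(d, rng)
    noise = rng.normal(scale=0.05, size=(d, d))
    noise = (noise + noise.T) / 2   # keep the "reconstruction" Hermitian
    est = tau + noise
    est /= np.trace(est).real       # and unit trace
    targets.append(tau)
    estimates.append(est)

def avg_hs(p):
    # averaged squared HS distance to the depolarized estimate rho_p of Eq. (I1)
    return float(np.mean([hs_distance_sq(t, p * e + (1 - p) * identity)
                          for t, e in zip(targets, estimates)]))

# Grid minimum vs. the closed-form minimizer p* = <Tr(AB)> / <Tr B^2>
ps = np.linspace(0, 1, 101)
p_star = ps[int(np.argmin([avg_hs(p) for p in ps]))]
num = np.mean([np.trace((t - identity) @ (e - identity)).real
               for t, e in zip(targets, estimates)])
den = np.mean([np.trace((e - identity) @ (e - identity)).real
               for e in estimates])
p_analytic = num / den
```

For this toy noise level the optimum sits strictly between the two extremes, reproducing the envelope behaviour of Fig. 7.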
Appendix J: Interpretation of the neural network as a conditional "debiaser"

Our neural network takes as input a state ρ reconstructed from the experimental results f through some estimator (e.g., LI or MLE), Q[f] = ρ. The neural network returns a state that, on average over the realizations (in f), better approximates the target τ than ρ. In the following, we outline the situations in which we expect a poor performance of our algorithm. From the above setting, we immediately see that it is useless for unbiased estimators.
Observation 1. If the estimator Q is unbiased, i.e., E_f[ρ_f] = τ, no further improvement can be achieved with our approach. In fact, the mean already provides the best estimate. Consequently, if some enhancement is observed, the inference of the input state must be biased. The bias here comes from the requirement that ρ must be a proper quantum state.
To see why this is so, let us focus on the simplest case of two projective quantum measurements, namely spin measurements of an electron in the x and y directions. Although both expectation values lie in the interval [−1/2, 1/2], not all pairs of measurement results are admissible. For example, there is no valid quantum state for which both expectation values equal 1/2, since the total spin would then be larger than 1/2.
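This admissibility constraint can be stated explicitly: for a spin-1/2, the expectation values are related to the Bloch vector r by ⟨S_i⟩ = r_i/2 with |r| ≤ 1, so ⟨S_x⟩^2 + ⟨S_y⟩^2 + ⟨S_z⟩^2 ≤ 1/4. A minimal check (pure Python; the helper name is hypothetical):

```python
def is_physical_spin_half(sx, sy, sz=0.0):
    # A qubit state has Bloch vector r with |r| <= 1 and <S_i> = r_i / 2,
    # hence the expectation values must satisfy sx^2 + sy^2 + sz^2 <= 1/4.
    return sx**2 + sy**2 + sz**2 <= 0.25 + 1e-12

# Each marginal result alone is attainable (eigenstates of S_x or S_y) ...
ok_x = is_physical_spin_half(0.5, 0.0)
ok_y = is_physical_spin_half(0.0, 0.5)
# ... but the pair (1/2, 1/2) admits no quantum state:
ok_pair = is_physical_spin_half(0.5, 0.5)
```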
Therefore, if we account for the noise inevitably present at finite statistics, the unphysicality of certain real-world measurement outcomes is unavoidable. There are two general QST strategies to overcome this obstacle: either discard the unphysical outcomes (e.g., MLE) or keep them all (e.g., LI). Each of these methods has its drawbacks and, for finite statistics, one cannot have a strategy that satisfies both the linearity and the physicality of the predicted states [132].
Observation 2. Any estimator Q which, as required by our neural network architecture, outputs a valid state ρ_f for any physical f ∈ F, is biased. This phenomenon becomes more prominent for τ close to the boundary of the quantum set S, i.e., for states of high purity [133,134]. By convexity, the boundary of F_S is attained by extremal elements of S. There, the chance of non-quantum outcomes is higher, and E_f[ρ_f | ρ_f ∈ S] becomes significantly displaced from τ (see the sketch in Fig. 10). In these terms, the task of the neural network can be interpreted as drifting the skewed probability distribution back towards the true mean, while using only quantum outcomes. In other words: from {f ∈ F_S}, find an ensemble {f_NN ∈ F_S} such that the output bias |E_{f_NN}(ρ_{f_NN}) − τ| is minimal. As a result, generic mixed states are harder to improve than pure states. More studies are needed to confirm this behavior; in particular, one has to verify that no improvement is possible if the reconstruction method used to obtain ρ_f already incorporates the skewness, and how the performance depends on the bias of the estimator Q.
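The skewed conditional distribution sketched in Fig. 10 is easy to reproduce in a toy model. The sketch below (pure Python; Gaussian Bloch-vector noise is an illustrative stand-in for shot noise, not the estimators used in the paper) draws noisy Bloch vectors around a pure qubit state, keeps only the physical ones (|r| ≤ 1), and shows that the conditional mean is displaced inward from the target:

```python
import random

random.seed(1)
target = (1.0, 0.0, 0.0)   # pure qubit: Bloch vector on the boundary of the ball
sigma = 0.3                # illustrative noise strength

samples, kept = [], []
for _ in range(20000):
    r = tuple(t + random.gauss(0.0, sigma) for t in target)
    samples.append(r)
    if sum(c * c for c in r) <= 1.0:   # keep only physical (quantum) outcomes
        kept.append(r)

def mean_x(points):
    return sum(p[0] for p in points) / len(points)

mean_all = mean_x(samples)   # unconditional mean: essentially unbiased
mean_cond = mean_x(kept)     # conditional mean: displaced inward from the target
```

The unconditional mean recovers the target, while conditioning on physicality pulls the mean strictly inside the Bloch ball, exactly the bias the network is trained to compensate.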

Figure 1. Schematic representation of the data pipeline of our hybrid QST protocol. Panel (a) shows data acquisition from a generic experimental set-up, during which the frequencies f are collected. Next, panel (b) presents the standard density-matrix reconstruction; in our work, we test the computationally cheap LI method together with the expensive MLE, to better analyze the network's reconstruction behaviour and ability. Panel (c) depicts the matrix-to-matrix deep-learning strategy for the reconstruction of Cholesky matrices. The architecture considered here combines convolutional layers for input and output with a transformer model in between. Finally, we compare the reconstructed state ρ with the target τ.
Finally, we construct the training dataset as N_train pairs (C⃗_ρ, C⃗_τ), where by C⃗ we indicate the vectorization (flattening) of the Cholesky matrix C (see Appendix C for the definition).
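The dimension counting behind this vectorization can be sketched as follows (a minimal numpy illustration; the exact ordering of the entries used in Appendix C may differ):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 4

# Toy density matrix sampled from the Ginibre (Hilbert-Schmidt) ensemble
g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = g @ g.conj().T
rho /= np.trace(rho).real

# Lower-triangular Cholesky factor: rho = C C^dagger
C = np.linalg.cholesky(rho)

def vectorize_cholesky(C):
    # d real diagonal entries + d(d-1)/2 complex lower-triangular entries
    # (real and imaginary parts) = d^2 real numbers in total
    d = C.shape[0]
    low = np.tril_indices(d, k=-1)
    return np.concatenate([np.real(np.diag(C)),
                           np.real(C[low]),
                           np.imag(C[low])])

vec = vectorize_cholesky(C)
```

Flattening both C_ρ and C_τ this way yields the fixed-length real input/output pairs used for training.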

Figure 2. Evaluation of the QST reconstruction quality, measured by the mean value of the squared Hilbert-Schmidt distance, D^2_HS, between the target and the reconstructed state for different QST protocols, averaged over 1000 target states. In both panels, the best-performing setups are those that lie as far right (better quality) and as far down (less costly) as possible. Panel (a) uses the number of measurements N_trial to compare four QST protocols: linear inversion (LI, green dots), neural-network-enhanced MLE (MLE-NN, orange crosses), neural-network-enhanced LI (LI-NN, blue diamonds) and maximum likelihood estimation (MLE, red squares). We add an inset focusing on the undersampled regime, N_trial ≤ 5 × 10^3. Panel (b) shows the quality of reconstruction as a function of the product N_trial × N_train for the latter two protocols and the network model proposed in Ref. [54] (violet triangles). Both panels depict resource costs on the horizontal axes in different scenarios: in (a), the cost is the number of performed measurements, while in (b), the training phase is additionally counted as a cost. Our proposed protocol achieves competitive averaged HS reconstruction with a training dataset an order of magnitude smaller than that of the method proposed in Ref. [54]. During the models' training, we used N_train = 2000 random pure states for the MLE-NN protocol, and N_train = 5000 for the LI-NN. Lines are to guide the eye; shaded areas represent one standard deviation.

Figure 3. Two different simulations for out-of-distribution (OOD) inference. In each panel, we evaluate the normalized Quantum Fisher Information (QFI) of 100 four-qubit states as the validation metric. The target, noiseless states are evolved according to the OAT dynamics given in (9), and are depicted by the purple dotted line. For these OOD tests, the neural network was trained exclusively to learn statistical sampling noise, while during inference the test data are also permeated by depolarization and measurement (calibration) errors. The green line represents the normalized QFI derived from reconstructions via the linear inversion (LI) algorithm; the red line illustrates the enhancement provided by the network when supplemented with LI reconstructions, underscoring the robustness of our protocol in mitigating noise effects.

5. Training strategies: Different training strategies can be implemented: (a) train over uniform ensembles (e.g., Haar, HS, Bures) if τ is a typical state or we have no information about it; (b) train over a subspace of states of interest.
is sufficient for the neural network output, the length of the real vector K⃗ is d^2 [d elements for the diagonal and d(d − 1)/2 × 2 for the lower triangle (×2 for the real and imaginary parts); the remaining elements are zero].

Figure 4. Time evolution of the normalized QFI during the OAT protocol for a system of L = 4 qubits. Solid blue lines represent the QFI calculated for the target quantum states. The mean values of the QFI calculated from tomographically reconstructed density matrices are denoted by green dashed lines (reconstruction via LI) and red dotted lines (reconstruction via neural-network-post-processed LI outputs). Shaded areas mark one standard deviation after averaging over 10 reconstructions. Panels (a) and (b) correspond to the LI protocol with SIC-POVM data, whereas (c) and (d) denote LI reconstruction inferred from Pauli measurements. In the upper row, the left (right) column corresponds to N_trial = 10^3 (10^4) trials; in the lower row, the left (right) column reproduces an initial LI fidelity of reconstruction of ∼74% (∼86%). The red lines represent the whole setup with neural-network post-processing of the data from the corresponding green lines, indicating improvement over the LI method. The neural-network advantage over the bare LI method can be characterized by entanglement-depth certification, as shown by the horizontal lines denoting the entanglement-depth bounds, ranging from the separable limit (bottom line, bold) to the genuine L-body limit (top line). In particular, the presence of entanglement, k ≥ 2, is witnessed by QFI > L, as shown by the violation of the separable bound (bold horizontal line).

Figure 5. Comparison of the efficiency of the QST reconstruction schemes, evaluated using the squared Hilbert-Schmidt distance D^2_HS for the transformer-based, 2-layer, and 4-layer attention-free CNN models, averaged over 1000 mixed states. All models share an equivalent number of training parameters. (a) Average reconstruction values for the 10 different LI pre-processed test datasets. Similarly to Fig. 2, we vary the number of trials N_trials to analyze the reconstruction efficiency, and we also use states of dimension 9 for a direct comparison. (b) The same analysis applied to the models trained on the MLE pre-processed data. To summarize, only for the MLE pre-processed data can the 4-layer CNN model outperform the transformer-based one, for N_trials = 10^6, 10^5, while for the LI pre-processed data our network shows better outcomes.
θ used inside our matrix-to-matrix protocol. In the second row, we show the outcomes obtained by applying the transformer-based model. Similarly to Fig. 4, we check different numbers of trials, N_trial = 10^5, 10^4, 10^3, to analyse the performance of the different architectures. All the initial Born values are calculated via noiseless SIC-POVM.

Figure 6. Time evolution of the normalized QFI during the OAT protocol for 4 qubits. The dotted dark grey line represents the QFI calculated for the target quantum state, and the light grey dashed line is the QFI upon LI reconstruction (our minimal threshold). Panels (a) and (b) correspond to the QFI obtained for the states reconstructed by the 2-layer and 4-layer CNN, respectively. We observe that, firstly, the transformer-based model outperforms the CNN models at all times, with a reconstruction ability very close to the OAT target states. Secondly, when considering the QFI as our reconstruction metric, the CNN models perform equivalently irrespective of the number of layers in the architecture.

Figure 7. Averaged HS distance of the reconstructed MLE states from the HS ensemble with d = 9, mixed according to Eq. (I1), for different values of p (coloured lines). We highlight the limiting cases, namely p = 0 (solid line), the average with respect to I/d, and p = 1 (dashed line), the MLE result. The envelope of this family of lines is marked with a dotted line. This bound is realized by an optimal p*, which depends on the number of trials via the reconstructed {ρ_MLE}.

Figure 8. Geometric interpretation of the optimal depolarization of the MLE state, incorporating the statistical noise stemming from a finite number of experimental runs.

Figure 9. Average reconstruction distance as a function of the mixing parameter p for a given set of numbers of trials. We verify the parabolic curves of Eq. (I2) and the nontrivial minima.

Figure 10. Action of the neural network as a conditional debiaser. (a) Inference of the state τ from many finite-size realizations {ρ_f}_f, each not necessarily a proper state (i.e., it might lie outside S). (b) Disregarding the non-physical realizations results in a skewed conditional distribution whose mean is displaced from the true state. The action of the neural network is then to shift the mean back to the target state by drifting the distribution.

Table II. Values of the averaged fidelity and its standard deviation between the reconstructed and the target Haar-random pure states of size d = 16, obtained by the two CNN architectures.