Transition role of entangled data in quantum machine learning

Entanglement serves as the resource that empowers quantum computing. Recent progress has highlighted its positive impact on learning quantum dynamics, wherein the integration of entanglement into quantum operations or measurements of quantum machine learning (QML) models leads to substantial reductions in the training data size required to surpass a specified prediction error threshold. However, an analytical understanding of how the degree of entanglement in data affects model performance remains elusive. In this study, we address this knowledge gap by establishing a quantum no-free-lunch (NFL) theorem for learning quantum dynamics using entangled data. Contrary to previous findings, we prove that the impact of entangled data on the prediction error exhibits a dual effect, depending on the number of permitted measurements. With a sufficient number of measurements, increasing the entanglement of the training data consistently reduces the prediction error, or decreases the required size of the training data to achieve the same prediction error. Conversely, when few measurements are allowed, employing highly entangled data could lead to an increased prediction error. The achieved results provide critical guidance for designing advanced QML protocols, especially for those tailored for execution on early-stage quantum computers with limited access to quantum resources.


I. Introduction
Quantum entanglement, an extraordinary characteristic of the quantum realm, drives the superiority of quantum computers over classical computers [1]. Over the past decade, diverse quantum algorithms leveraging entanglement have been designed to advance cryptography [2,3] and optimization [4][5][6][7][8], delivering runtime speedups over classical approaches. Motivated by the exceptional abilities of quantum computers and the astonishing success of machine learning, a nascent frontier known as quantum machine learning (QML) has emerged [9][10][11][12][13][14][15], seeking to outperform classical models in specific learning tasks [16][17][18][19][20][21][22][23][24][25]. Substantial progress has been made in this field, exemplified by the introduction of QML protocols that offer provable advantages in terms of query or sample complexity for learning quantum dynamics [26][27][28][29][30][31], a fundamental problem toward understanding the laws of nature [32]. Most of these protocols share a common strategy to gain advantages: the incorporation of entanglement into quantum operations and measurements, leading to reduced complexity. Nevertheless, an overlooked aspect in prior works is the impact of incorporating entanglement into the quantum input states, or entangled data, on the advancement of QML in learning quantum dynamics. Given the paramount role of data in learning [33][34][35][36][37][38], addressing this question will significantly enhance our comprehension of the capabilities and limitations of QML models.
A fundamental concept in machine learning that characterizes the capabilities of learning models in relation to datasets is the No-Free-Lunch (NFL) theorem [39][40][41][42].
The NFL theorem yields a key insight: regardless of the optimization strategy employed, the ultimate performance of models is contingent upon the size and type of training data. This observation has spurred recent breakthroughs in large language models, as extensive and meticulously curated training data consistently yield superior results [43][44][45][46][47]. In this regard, establishing a quantum NFL theorem enables us to elucidate the specific impact of entangled data on the efficacy of QML models in learning quantum dynamics. Concretely, the achieved theorem can shed light on whether the utilization of entangled data empowers QML models to achieve comparable or even superior performance relative to low-entangled or unentangled data, while simultaneously reducing the required sample complexity [48]. Building upon prior findings on the role of entanglement and the classical NFL theorem, a reasonable speculation is that highly entangled data contribute to improved performance of QML models together with reduced sample complexity, albeit at the cost of extensive quantum resources to prepare such data, which may be unaffordable in the early stages of quantum computing [49].
In this study, we negate the above speculation and exhibit the transition role of entangled data when QML models incoherently learn quantum dynamics, as shown in Fig. 1. In the incoherent learning scenario, the quantum learner is restricted to utilizing datasets with varying degrees of entanglement to operate on an unknown unitary, and to inferring its dynamics from the finite measurement outcomes collected under a projective measurement. The entangled data refer to quantum states that are entangled with a reference system, with the degree of entanglement quantitatively characterized by the Schmidt rank r. We rigorously show that, within the context of NFL, the entangled data have a dual effect on the prediction error according to the number of measurements m allowed. Particularly, with sufficiently large m, increasing r can consistently reduce the required size of training data for achieving the same prediction error. On the other hand, when m is small, training data with large r not only require a significant volume of quantum resources for state preparation, but also amplify the prediction error. As a byproduct, we prove that the lower bound of the query complexity for achieving a sufficiently small prediction error matches the optimal lower bound for quantum state tomography with nonadaptive measurements. Numerical simulations are conducted to support our theoretical findings. In contrast to the previous understanding that entanglement mostly confers benefits to QML in terms of sample complexity, the transition role of entanglement identified in this work deepens our comprehension of the relation between quantum information and QML, which facilitates the design of QML models with provable advantages.

FIG. 1. The goal of the quantum learner is to learn a unitary V_X that can accurately predict the output of the target unitary U_X under a fixed observable O, where the subscript X refers to the quantum system on which the operator O acts. The learning process is as follows. (a) A total number of N entangled bipartite quantum states living in the Hilbert space H_X ⊗ H_R (R denotes the reference system) are taken as inputs, dubbed entangled data. (b) The quantum learner proceeds with incoherent learning. The entangled data separately interact with the target unitary U_X (agnostic) and the candidate hypothesis V_X extracted from the same hypothesis set H. (c) The quantum learner is restricted to leveraging the finite measured outcomes of the observable O on the output states of U_X and V_X to conduct learning. (d) A classical computer is exploited to infer V* that best estimates U_X according to the measurement outcomes. For example, in the case of variational quantum algorithms, the classical computer serves as an optimizer to update the tunable parameters of the ansatz V_X. (e) The learned unitary V* is used to predict the output of unseen quantum states in the Hilbert space H_X under the evolution of the target unitary U_X and the measurement of O. A large Schmidt rank r can enhance the prediction accuracy when combined with a large number of measurements m, but may lead to a decrease in accuracy when m is small.

II. Main results
We first recap the task of learning quantum dynamics. Let U ∈ SU(2^n) be the target unitary and O ∈ C^{2^n × 2^n} be the observable, a Hermitian matrix acting on an n-qubit quantum system. Here we specify the observable as the projective measurement O = |o⟩⟨o|, since any observable reads out classical information from the quantum system via its eigenvectors. The goal of quantum dynamics learning is to predict functions of the form

f_U(ψ) = Tr(O U |ψ⟩⟨ψ| U†),    (1)

where |ψ⟩ is an n-qubit quantum state living in the 2^n-dimensional Hilbert space H_X. This task can be done by employing the training data S to construct a unitary V_S, i.e., the learned hypothesis has the form h_S(ψ) = Tr(O V_S |ψ⟩⟨ψ| V_S†), which is expected to accurately approximate f_U(ψ) on the unseen data. While the learned unitary acts on the n-qubit system H_X, the input state could be entangled with a reference system H_R, i.e., |ψ⟩ ∈ H_X ⊗ H_R. We suppose that all input states have the same Schmidt rank r ∈ {1, ..., 2^n}. Then the response of the state |ψ_j⟩ is given by the measurement output o_j = Σ_{k=1}^m o_{jk}/m, where m is the number of measurements and o_{jk} is the output of the k-th measurement of the observable O on the output quantum state (U ⊗ I_R)|ψ_j⟩. In this manner, the training data with N examples take the form S = {(|ψ_j⟩, o_j)}_{j=1}^N. The risk function is a crucial measure in statistical learning theory to quantify how well the hypothesis function h_S performs in predicting f_U, defined as

R_U(V_S) = ∫ dψ (h_S(ψ) − f_U(ψ))²,    (2)

where the integral is over the uniform Haar measure dψ on the state space. Intuitively, R_U(V_S) amounts to the average squared error between the true output f_U(ψ) and the hypothesis output h_S(ψ).
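To make the data-generation process concrete, the following minimal Python sketch (our own illustration, not the authors' code) simulates the response o_j for a Schmidt-rank-r input: it applies a Haar-random target U to the X register of a bipartite state, computes the outcome probability of the projector O = |o⟩⟨o|, and averages m single-shot outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_unitary(d, rng):
    # Haar-random unitary via QR decomposition of a complex Ginibre matrix
    z = (rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def response(psi, U, proj, m, rng):
    """Average of m single-shot outcomes of O = |o><o| (on system X),
    measured on (U ⊗ I_R)|psi>, for a bipartite state psi of shape (dX, dR)."""
    out = U @ psi                                   # apply U to the X register only
    # outcome probability p1 = sum_k |<o| out[:, k]>|^2
    p1 = np.real(np.einsum('i,ik,jk,j->', proj.conj(), out, out.conj(), proj))
    return rng.binomial(m, min(max(p1, 0.0), 1.0)) / m

n, r = 2, 2
dX = dR = 2 ** n
U = haar_unitary(dX, rng)
o_vec = np.zeros(dX); o_vec[0] = 1.0               # O = |0...0><0...0|
# a simple Schmidt-rank-r input, stored as a dX x dR coefficient matrix
psi = np.zeros((dX, dR), dtype=complex)
for k in range(r):
    psi[k, k] = 1.0 / np.sqrt(r)
o_j = response(psi, U, o_vec, m=1000, rng=rng)     # one training response
```

Here the bipartite state is stored as a d_X × d_R coefficient matrix, so applying U ⊗ I_R is a single matrix product and the Schmidt rank is simply the rank of the matrix.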
Under the above setting, we prove the following quantum NFL theorem in learning quantum dynamics, where the formal statement and proof are deferred to SM C.
Theorem 1 (Quantum NFL theorem in learning quantum dynamics, informal). Following the settings in Eqn. (1), suppose that the training error of the learned hypothesis on the training data S is less than ε = O(1/2^n). Then the lower bound of the averaged prediction error in Eqn. (2) yields

E R_U(V_S) ≥ Ω( 1 − N min{rn, m/(c_1 r)} / (2^n c_2) ),

where c_1 and c_2 are positive constants, and the expectation is taken over all target unitaries U, entangled states |ψ_j⟩, and measurement outputs o_j.
The achieved results indicate the transition role of the entangled data in determining the prediction error. Particularly, when a sufficient number of measurements m is allowed such that the Schmidt rank r obeys r² < m/(c_1 n), the prediction error is determined by the term rn, and hence increasing r can consistently decrease the prediction error. Accordingly, in the two extreme cases of r = 1 and r = 2^n, achieving zero averaged risk requires N = 2^n c_2/n and N = 1 training input states, respectively, where the latter achieves an exponential reduction in the number of training data compared with the former. This observation implies that the entangled data empower QML with provable quantum advantage, which accords with the results of Ref. [50] in the ideal coherent learning protocol with infinite measurements. By contrast, in the scenario with r² ≥ m/(c_1 n), increasing r could enlarge the prediction error. This result indicates that the entangled data can be harmful to achieving quantum advantages, which contrasts with previous results where entanglement (e.g., entangled operations or measurements) is believed to contribute to the quantum advantage [50][51][52]. This counterintuitive phenomenon stems from the fact that, when incoherently learning quantum dynamics, the information obtained from each measurement decreases with increased r, and hence a small m is incapable of extracting all the information about the target unitary carried by the entangled state.
Another implication of Theorem 1 is that although the number of measurements m contributes to a small prediction error, it is not decisive for the ultimate prediction performance. Specifically, when m ≥ r²c_1 n, further increasing m does not help decrease the prediction error, which is then determined by the entanglement degree and the size of the training data, i.e., r and N. Meanwhile, at least r²c_1 n measurements are required to fully utilize the power of entangled data. These results suggest that the value of m should be adapted to r to pursue a low prediction error.
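As a quick numeric illustration of the rule m ≳ r²c_1 n discussed above, the sketch below tabulates the measurement budget for n = 4 and several Schmidt ranks; the constant c_1 is left unspecified by the informal theorem, so setting c_1 = 1 here is purely an assumption for illustration.

```python
# c1 is an unspecified constant from Theorem 1; c1 = 1 is an illustrative assumption.
c1, n = 1, 4

def measurement_budget(r, c1=c1, n=n):
    """Smallest m that fully exploits Schmidt rank r, per the rule m >= r^2 * c1 * n."""
    return c1 * n * r ** 2

for r in [1, 2, 4, 8, 16]:
    print(f"r = {r:2d}  ->  m >= {measurement_budget(r)}")
```

The quadratic growth in r makes explicit why a highly entangled dataset demands a correspondingly large measurement budget before its advantage kicks in.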
We next examine the scenario in which the lower bound of the averaged risk in Theorem 1 reaches zero, and correlate it with the results in quantum state learning and quantum dynamics learning [26,27,29,30,53,54]. In particular, the main focus of those studies is proving the minimum query complexity of the target unitary needed to warrant zero risk. The results in Theorem 1 indicate that the minimum query complexity is Nm = Ω(2^n r c_1 c_2), implying a proportional relation between the entanglement degree r and the query complexity. Notably, this lower bound is tighter than that achieved in Ref. [26] in the same setting. The advance of our results stems from the fact that Ref. [26] simply employs Holevo's theorem to upper bound the information extracted in a single measurement, while our bound integrates a more refined analysis, including the consideration of the Schmidt rank r, the direct use of a connection between the mutual information of the target unitary U and the measurement outputs o_j, and the KL-divergence of related distributions (refer to SM C for more details). Moreover, the adopted projective measurement O in Eqn. (1) implies that the learning task explored in our study amounts to learning the pure state U†OU. From the perspective of state learning, the derived lower bound in Theorem 1 is optimal for nonadaptive measurements with a constant number of outcomes [55]. Taken together, while the entangled data hold the promise of gaining advantages in terms of the prediction error, they may be inferior to training data without entanglement in terms of the query complexity.
The transition role of entanglement explained above leads to the following construction rules for quantum learning models. First, when a large number of measurements is allowed, the use of entangled data is encouraged for improving the prediction performance. To this end, initial research efforts [56-61], which develop effective methods for preparing and storing entangled states, may contribute to QML. Second, when the total number of measurements is limited, it is advised to refrain from using entangled data for learning quantum dynamics.

Remark. The results on the transition role of entangled data achieved in Theorem 1 can be generalized to mixed states, because a mixed state can be produced by taking the partial trace of a pure entangled state.

III. Numerical simulations
We conduct numerical simulations to exhibit the transition role of entangled data, as well as the effect of the number of measurements and the training data size, in determining the prediction error. We focus on the task of learning an n-qubit unitary under a fixed projective measurement O = (|0⟩⟨0|)^{⊗n}. The number of qubits is n = 4. The target unitary U_X is chosen uniformly from a discrete set {U_i}_{i=1}^M, where M = 2^n refers to the set size and the operators U_j†OU_j with U_j in this set are orthogonal. The entangled states in S are uniformly sampled from the set of Schmidt-rank-r states of the form Σ_{k=1}^r c_k |ξ_k⟩ ⊗ |ζ_k⟩. The simulation results are displayed in Fig. 2. Particularly, Fig. 2(a) shows that for both the cases of N = 2 and N = 8, the prediction error constantly decreases with respect to an increased number of measurements m, and with an increased Schmidt rank r when the number of measurements is large enough, namely m > 1000. On the other hand, for a small number of measurements with m ≤ 100 in the case of N = 8, as the Schmidt rank is continually increased, the averaged prediction error initially decreases and then increases after the Schmidt rank surpasses a critical point, which is r = 3 for m = 10 and r = 4 for m = 100. This phenomenon accords with the theoretical results in Theorem 1 in the sense that the entangled data play a transition role in determining the prediction error for a limited number of measurements. This observation is also verified in Fig. 2(b) for varied sizes of training data, where for the small number of measurements m = 100, increasing the Schmidt rank may not help decrease the prediction error. By contrast, a large training data size consistently contributes to a small prediction error, which echoes Theorem 1.
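A plausible sampler for such Schmidt-rank-r training states, following the construction rules summarized in SM B (independent random orthonormal frames on X and R plus a random coefficient vector; the function name is ours), can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_rank_r_state(dX, dR, r, rng):
    """Sample a bipartite pure state with Schmidt rank exactly r:
    independent random orthonormal r-frames on X and R, and a random
    (strictly positive) normalized coefficient vector."""
    # random orthonormal r-frames on each subsystem (columns of Q from QR)
    qx, _ = np.linalg.qr(rng.normal(size=(dX, r)) + 1j * rng.normal(size=(dX, r)))
    qr_, _ = np.linalg.qr(rng.normal(size=(dR, r)) + 1j * rng.normal(size=(dR, r)))
    c = rng.random(r) + 0.1            # keep all Schmidt coefficients nonzero
    c = c / np.linalg.norm(c)
    # |psi> = sum_k c_k |xi_k>_X |zeta_k>_R, stored as a dX x dR matrix
    return (qx * c) @ qr_.T

psi = random_rank_r_state(16, 16, r=3, rng=rng)
schmidt = np.linalg.svd(psi, compute_uv=False)
rank = int(np.sum(schmidt > 1e-10))
```

The singular values of the coefficient matrix are exactly the Schmidt coefficients, so the SVD check verifies that the sampled state has the intended rank.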

IV. Discussion and outlook
In this study, we explored the effect of the Schmidt rank of entangled data on the performance of learning quantum dynamics with a fixed observable. Within the framework of the quantum NFL theorem, our theoretical findings reveal the transition role of entanglement in determining the ultimate model performance. Specifically, increasing the Schmidt rank below a threshold controlled by the number of measurements can enhance model performance, whereas surpassing this threshold can lead to a deterioration in model performance. Our analysis suggests that a large number of measurements is the precondition for using entangled data to gain potential quantum advantages. In addition, our results demystify the negative role of entangled data in the measure of query complexity. Last, as with the classical NFL theorem, we prove that increasing the size of the training data always contributes to better performance in QML.
Our results motivate several important issues and questions that warrant further investigation. The first research direction is exploring whether the transition role of entangled data exists for other QML tasks, such as learning quantum unitaries or learning quantum channels with the response being a measurement output [26,30,[62][63][64][65][66][67][68][69][70][71][72]. These questions can be considered in both the coherent and incoherent learning protocols, which are distinguished by whether the target and model systems can coherently interact and whether quantum information can be shared between them. Obtaining such results would have important implications for using QML models to solve practical tasks with provable advantages.
Another research direction is inquiring whether there exists a similar transition role when exploiting entanglement in quantum dynamics and measurements through the use of an ancillary quantum system. The answer for the case of entangled measurements has been given for many specific learning tasks [26,29,30,73], where learning protocols with entangled measurements are shown to achieve an exponential advantage over those without, in terms of the number of accesses to the target unitary. This quantum advantage arises from the powerful information-extraction capabilities of entangled measurements. In this regard, it is intriguing to investigate the effect of quantum entanglement on model performance when entanglement is introduced in both the training states and the measurements, as entangled measurements offer a potential remedy for the negative impact of entangled data resulting from insufficient information extraction via weak projective measurements. A positive result could further enhance the quantum advantage gained through entanglement exploitation.

Roadmap: Supplementary Material (SM) A provides preliminaries for the necessary mathematical background and introduces the related work. The problem setup of quantum dynamics learning in the framework of the quantum no-free-lunch (NFL) theorem is presented in SM B. We elucidate the results and the proof of Theorem 1 in SM C. Finally, SM D exhibits the numerical details omitted in the main text and more numerical results.

A. Preliminaries
In this section, we will first present some essential mathematical foundations for deriving the main results of this work.It encompasses several key aspects, such as the introduction of pertinent notations, random variables, information theory, and Haar integration, which are separately elaborated upon in SM A 1 to SM A 4. Moreover, to contextualize our work within the existing literature, we conduct a comprehensive review of relevant studies in SM A 5.

Notation
We unify the notations throughout the whole work. The number of qubits and the training data size are denoted by n and N, respectively. Let H_d denote a d-dimensional Hilbert space. The Hilbert space of an n-qubit system is denoted by H_{2^n}. The d-dimensional unitary group and special unitary group are denoted by U(d) and SU(d), respectively. The notation [N] refers to the set {1, 2, ..., N}. We denote ∥·∥_1 as the trace norm and ∥·∥_F as the Frobenius norm. The cardinality of a set is denoted by |·|. We use the standard bra-ket notation for pure quantum states. The identity operator on the d-dimensional Hilbert space is denoted by I_d. We denote |e_k⟩ as the computational basis state with the k-th entry being one and the other entries being zero.

Random variables
We denote random variables using capital letters, e.g., X, including matrix-valued random variables. We use lowercase letters (e.g., p, q) and capital letters (e.g., P) with appropriate subscripts to denote the probability density function (PDF) and the corresponding cumulative distribution function (CDF), which obey dP_X(x) = p_X(x)dx. For instance, suppose X is a random variable taking values in X according to some distribution P_X : A → [0, 1], where A is the set of Borel-measurable subsets [75] of X. Let g : X → R be any function of x. We denote E_X g(X) and E_{X∼p_X} g(X) interchangeably as the expectation of g(·) with respect to the distribution p_X, i.e.,

E_X g(X) = ∫_X g(x) dP_X(x),

where the latter notation is used when there may be some ambiguity about which distribution is meant. When no confusion occurs, we drop all subscripts and write E g(X). Next suppose we have random variables (X, Y) jointly distributed on X × Y. We use p_{Y|x}(y) and p(y|X = x) interchangeably to denote the conditional probability that Y = y given X = x.

Information theory
In this subsection, we review the basic definitions in information theory, including (Shannon) entropy, KL-divergence, mutual information, and their conditional versions. Refer to Refs. [76][77][78] for a deeper understanding. Throughout the whole paper, log denotes the logarithm with base 2. We first consider the scenario of discrete random variables taking values in the same space.
Entropy. We begin with a central concept in information theory, the (Shannon) entropy. Let P be a distribution on a finite set X, with associated PDF p. Let X be a random variable distributed according to P. The entropy of X (or of p) is defined as

H(X) = − Σ_{x∈X} p(x) log p(x),    (A1)

with the convention 0 log 0 = 0. This quantity is positive whenever p(x) < 1 for all x, and vanishes if and only if X is a deterministic variable, i.e., p(x) = 1 for some x. The Shannon entropy measures the uncertainty about the random variable X.
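The definition above can be checked with a few lines of Python; the convention 0 log 0 = 0 is implemented by summing only over the support of p.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # restrict to the support of p
    return float(-np.sum(nz * np.log2(nz)))

assert entropy([1.0, 0.0]) == 0.0          # deterministic variable: zero uncertainty
assert np.isclose(entropy([0.5, 0.5]), 1)  # a fair coin carries one bit
```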
Let us consider another discrete random variable, denoted by Y, which takes values in the set Y. The joint distribution of X and Y is given by p_{X,Y} : X × Y → [0, 1]. The joint entropy of these random variables is

H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p_{X,Y}(x, y) log p_{X,Y}(x, y),    (A2)

and the conditional entropy of X given Y refers to

H(X|Y) = H(X, Y) − H(Y),    (A3)

which can be intuitively interpreted as, respectively, the total information of X and Y, and the amount of information left in the random variable X after observing the random variable Y. We next review some important properties of Shannon entropy.
Property 1 (Subadditivity of entropy, [76]). Let X_1, ..., X_t be any sequence of random variables; then

H(X_1, ..., X_t) ≤ Σ_{i=1}^t H(X_i).

Property 2 (Maximum value, Property 10.1.5 in [77]). The maximum value of the entropy H(X) for a random variable X taking values in an alphabet X is log |X|:

H(X) ≤ log |X|.

Property 3. Let X, Y be any two random variables; then

H(Y) ≤ H(X, Y).

This property can be obtained by observing from the definition of conditional entropy that H(Y) = H(X, Y) − H(X|Y) and using the non-negativity of entropy.

KL-divergence. The KL-divergence is used to measure the distance between probability distributions. Consider two discrete distributions (PDFs) p, q : X → [0, 1], with X being the value space of the associated random variables; the KL-divergence between p and q is given by

D_KL(p∥q) = Σ_{x∈X} p(x) log (p(x)/q(x)).

Notably, the KL-divergence obeys D_KL(p∥q) ≥ 0, where the equality holds if and only if p(x) = q(x) for all x ∈ X. Particularly, employing the concavity of log and Jensen's inequality yields

−D_KL(p∥q) = Σ_{x∈X} p(x) log (q(x)/p(x)) ≤ log Σ_{x∈X} q(x) = 0.

Mutual information. The mutual information I(X; Y) between X and Y is defined as the KL-divergence between their joint probability distribution p_{X,Y} and the product of their (marginal) probability distributions p_X p_Y. Mathematically,

I(X; Y) = D_KL(p_{X,Y} ∥ p_X p_Y).    (A8)

According to the non-negativity of the KL-divergence, it is easy to see that I(X; Y) ≥ 0, and the equality holds if and only if the random variables X and Y are independent, i.e., p_{X,Y}(x, y) = p_X(x)p_Y(y). The mutual information between two random variables can be defined in several mathematically equivalent ways. Another definition used in this work expresses the mutual information in terms of the entropy of the random variables:

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y).    (A9)

We briefly derive the equivalence between the definition of mutual information given in the first equality of Eqn. (A9) and that in Eqn. (A8); the second and third equalities in Eqn. (A9) follow from the definition of conditional entropy in Eqn. (A3). Particularly, using Bayes' rule, we have p_{X,Y}(x, y) = p_Y(y) p_{X|Y}(x|y),
so

I(X; Y) = Σ_{x,y} p_{X,Y}(x, y) log ( p_{X|Y}(x|y) / p_X(x) ) = H(X) − H(X|Y).

In the end, the mutual information can be thought of as the amount of entropy removed (on average) in X by observing Y.
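The equivalence between the KL form of the mutual information and its entropy form is easy to verify numerically on a small joint PMF:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) PMF."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def kl(p, q):
    """KL-divergence D_KL(p || q) in bits, summing over the support of p."""
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# a small joint PMF p_{X,Y} on a 2x2 alphabet
pxy = np.array([[0.4, 0.1],
                [0.2, 0.3]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
I_kl = kl(pxy, np.outer(px, py))       # KL form of the mutual information
I_h = H(px) + H(py) - H(pxy)           # entropy form
assert np.isclose(I_kl, I_h)
assert I_kl >= 0                       # non-negativity of mutual information
```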
Analogous to the definition of conditional entropy, the conditional mutual information between X and Y given a third random variable Z is defined as

I(X; Y|Z) = H(X|Z) − H(X|Y, Z),

which refers to the mutual information between X and Y when Z is observed (on average). We now present three fundamental properties of the mutual information. These facts will be leveraged to derive stronger lower bounds on the prediction error than those obtained by using Holevo's theorem when the measurements are subject to certain restrictions.
Property 4 (Chain rule for mutual information). Let X, Y_1, Y_2, ..., Y_t be any t + 1 random variables; the mutual information between X and (Y_1, ..., Y_t) satisfies

I(X; Y_1, ..., Y_t) = Σ_{i=1}^t I(X; Y_i | Y_1, ..., Y_{i−1}).

Property 6 (Data processing inequality). Let the random variables X, Y, Z form a Markov chain X → Y → Z, such that, given Y, the random variables X and Z are independent. Then we have

I(X; Z) ≤ I(X; Y).
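The data processing inequality of Property 6 can be sanity-checked numerically by composing two random channels into a Markov chain X → Y → Z:

```python
import numpy as np

rng = np.random.default_rng(2)

def mutual_info(pxy):
    """Mutual information in bits of a joint PMF given as a 2-D array."""
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / np.outer(px, py)[mask])))

# Markov chain X -> Y -> Z: random p(x), channel p(y|x), channel p(z|y)
px = rng.dirichlet(np.ones(3))
p_y_x = rng.dirichlet(np.ones(3), size=3)   # row i: p(y | X = i)
p_z_y = rng.dirichlet(np.ones(3), size=3)   # row j: p(z | Y = j)
pxy = px[:, None] * p_y_x                   # joint p(x, y)
pxz = pxy @ p_z_y                           # joint p(x, z), valid by Markovity
assert np.isclose(pxy.sum(), 1) and np.isclose(pxz.sum(), 1)
assert mutual_info(pxz) <= mutual_info(pxy) + 1e-12   # I(X;Z) <= I(X;Y)
```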

Haar integration
In this subsection, we present a set of lemmas that enable the analytical computation of integrals of polynomial functions over the unitary group and the quantum state with respect to the unique normalized Haar measure.For a more comprehensive discussion on this subject matter, we refer the readers to Refs.[79][80][81].

Related work
In this subsection, we review prior literature related to the quantum NFL theorem, quantum dynamics learning, and the main methodologies employed in the proof of this work.
Quantum NFL theorem. The no-free-lunch (NFL) theorem is a celebrated result in learning theory that limits one's ability to learn a function from a training dataset. It is widely studied in classical learning theory [39-42, 82, 83]. Ref. [74] made the initial attempt to establish an NFL theorem in the context of quantum machine learning, where the inputs and outputs are quantum states. Ref. [50] reformulated the quantum NFL theorem for quantum-assisted learning protocols, where bipartite entangled states are prepared as the input states by introducing a reference quantum system. They demonstrated that the utilization of entangled data can remove the exponential cost in training data size incurred when learning unitaries with unentangled data. However, their results are established in the ideal setting with infinite measurements, and hence cannot quantify the practical power of entangled data.
Quantum dynamics learning. Quantum computers have the potential to enhance classical machine-learning models. One such application is quantum dynamics learning, which involves the conversion of an analogue quantum unitary into a digital form that can subsequently be examined on either a quantum or a classical computer. Numerous studies have delved into understanding the provable quantum advantage in terms of sample complexity, aiming to achieve approximate learning of quantum unitaries with small prediction errors. For instance, Refs. [19-22, 25, 63, 64, 84-88] studied, in variational quantum machine learning, how the sample complexity bounds are determined by the type of datasets and the complexity of the quantum model that acts as the unknown quantum unitary to be learned. These findings rest upon assumptions regarding the complexity of the unknown unitaries or the type of datasets, which are subsequently utilized to constrain the information-theoretic complexity of the learning task. As a result, these approaches are typically not applicable to general datasets and highly complex quantum unitaries without specific conditions. Recently, a growing body of literature proves the exponential separation in sample complexity between learning unitaries with and without quantum memory [26,27,29,71]. However, the proposed algorithms heavily depend on entangled measurements for optimal tomography, which may pose significant challenges from a practical standpoint.
Techniques employed in the proof. The main tool used for proving the lower bound of the prediction error is Fano's inequality. This inequality is widely used for obtaining lower bounds in both classical and quantum learning theory, with various formulations [76,77,89,90]. In the field of quantum machine learning, one of its applications is deriving the lower bound of the sample complexity of quantum state tomography [26,55,91,92]. Particularly, a fundamental framework for establishing such lower bounds builds upon insights from Fano's inequality [77,89], suggesting that performing tomography with sufficient precision reduces to distinguishing well-separated quantum states. However, the aim of our work is to prove lower bounds on prediction errors for dynamics learning. In this regard, we adopt another formulation of Fano's inequality, given in Ref. [76], which is widely used in classical learning theory for proving lower bounds on prediction errors. This formulation suggests that learning a unitary with limited training data reduces to distinguishing the well-separated PDFs of the related measurement outputs. Intuitively, the expression of Fano's inequality used in this work and that used in the context of quantum state tomography can be regarded as dual formulations, in the sense that we employ the amount of training quantum states to lower bound the prediction error, while previous works exploited a pre-fixed prediction error to lower bound the sample complexity.
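For intuition, Fano's inequality in its basic form, H(P_e) + P_e log(|X| − 1) ≥ H(X|X̂), can be verified on a randomly drawn joint distribution of a message X and its estimate X̂:

```python
import numpy as np

rng = np.random.default_rng(3)

def H(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) PMF."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def cond_entropy(pxy):
    # H(X | Xhat) = H(X, Xhat) - H(Xhat), with Xhat indexed by columns
    return H(pxy) - H(pxy.sum(axis=0))

k = 4                                               # alphabet size |X|
pxy = rng.dirichlet(np.ones(k * k)).reshape(k, k)   # random joint p(x, xhat)
p_err = 1.0 - np.trace(pxy)                         # P[X != Xhat]
h_err = H([p_err, 1.0 - p_err])
# Fano's inequality: H(P_e) + P_e * log(|X| - 1) >= H(X | Xhat)
assert h_err + p_err * np.log2(k - 1) >= cond_entropy(pxy) - 1e-12
```

The inequality holds for any joint distribution, which is precisely why it converts a bound on the conditional entropy (or mutual information) into a bound on the error probability.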
Despite the various formulations of Fano's inequality, a common and standard technique in the field of density estimation involves "discretizing" the learning problem. This technique is employed to achieve worst-case lower bounds, and can be seen as the classical counterpart of quantum tomography (see, for example, Chapter 2 of Ref. [89]). One way to rigorously establish this argument is by utilizing Fano's inequality in conjunction with Holevo's theorem, which offers an interpretation in terms of a communication protocol between two parties, namely Alice and Bob. Holevo's theorem is conventionally employed to provide an upper bound on the mutual information between the target message X, transmitted by Alice, and the decoded message X̂ received by Bob. However, there are two caveats to using Holevo's theorem in the context of the quantum NFL theorem. First, the utilization of Holevo's theorem does not take into account restrictions on the measurements we are allowed to perform on the N copies of the state. Second, it fails to consider the limitations imposed by the availability of a limited number of training states on the accessible information regarding X.
In contrast to previous studies, our approach distinguishes itself by directly exploiting the connection between the mutual information of two random variables and the Kullback-Leibler (KL) divergence of related distributions.Additionally, we utilize techniques for Haar integration with respect to the random training data and the random target unitary.Two crucial technical steps in our approach involve analyzing the KL divergence instead of individual measurement outcomes' probabilities, as well as investigating the maximal mutual information under the constraint of limited training states.These steps enable us to establish tight lower bounds on prediction error using a predetermined single-copy non-adaptive measurement strategy.

B. Problem setup of learning quantum dynamics in the view of quantum NFL theorem
For self-consistency, in this section we first recap the main results of the quantum NFL theorem established in the ideal setting by Ref. [50]. Then we introduce the learning problems of the entanglement-assisted quantum NFL theorem in a realistic scenario for further elucidation.
In this regard, they adopt the averaged error in the trace norm of the output quantum states, rather than the distance of classical measurement outputs between the learned hypothesis and the target hypothesis, as the risk function R_U(V_{S_Q}) to quantify the accuracy of the hypothesis V_{S_Q}. Formally, the risk function is defined as where |ϕ⟩ ∈ H_X and the second equality follows directly from a Haar-integration calculation. They showed that under the assumption of perfect training on the output states with infinite measurements, the unitary operator U†V_{S_Q} can be reduced to the form of where Y ∈ SU(d − Nr) follows the Haar distribution when the target unitary is Haar distributed. This leads to their main results, as shown in the following theorem.
Theorem 2 (Formulated according to the results of [50]). Let U and V_{S_Q} be the target unitary sampled from the Haar distribution and the hypothesis unitary learned on the entangled data S_Q, such that the assumption of perfect training on the output states holds. The averaged risk function defined in Eqn. (B2) over all target unitaries and training data yields It implies that entangled data can remove the exponential cost of training-data size in learning unitaries. Specifically, in the extreme cases of r = d and r = 1, the required training-data size for reaching zero risk in the former achieves an exponential advantage over that of the latter.
where o_{jk} is the k-th measurement outcome on the state. From the Schmidt decomposition of a pure state, each |ψ_j⟩ can be represented as The standard quantum NFL theorem considers the averaged risk function over all possible training data, target unitaries, and optimization algorithms for which the learned hypothesis unitary has the same output as the target unitary on the training input states; this average has the form E_S E_U R_U(V_S). In particular, uniform sampling of the target unitary in SU(d) corresponds to the Haar distribution. While random pure states uniformly sampled from the Hilbert space H_{XR} follow the Haar distribution, it cannot be assumed that states uniformly sampled from the set of entangled states defined in Eqn. (B6) follow the Haar distribution in SU(d²), due to the restriction of fixed Schmidt ranks. On the other hand, we observe that the uniform distribution of the entangled training state |ψ_j⟩ should satisfy the following conditions: (i) the state vectors |ξ_j⟩ and |ζ_j⟩ in the subsystems X and R should be uniformly distributed in the corresponding Hilbert spaces and be independent; (ii) the state vectors |ξ_j⟩ and |ξ_k⟩ (|ζ_j⟩ and |ζ_k⟩) in the subsystem X (R) should be orthogonal for j ≠ k; (iii) the vector |c_j⟩ consisting of the coefficients is also uniformly distributed. We summarize these observations as the following construction rules for the random entangled training states. In the realistic setting, where the aim is to learn the target unitary under a fixed observable f_U(ρ) = Tr(U†OUρ) with a limited number of measurements, the learned hypothesis h_S(ρ) = Tr(V_S†OV_Sρ) cannot achieve zero training error on the training data S, due to the statistical error in measuring the quantum system. Consequently, the assumption of perfect training, which is the pivotal condition in the proofs of the NFL theorem in previous literature [50,74], is not applicable. In this regard, we adopt an information-theoretic technique,
which does not rely on the perfect-training assumption, to prove the quantum NFL theorem when the number of measurements is finite. The core idea is to "reduce" the learning problem to a multi-way hypothesis-testing problem. This amounts to showing that the risk of the quantum dynamics learning problem can be lower bounded by the probability of error in the testing problem, for which well-developed tools exist. We then employ Fano's lemma to derive the information-theoretic bound for this hypothesis-testing problem.
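Schematically, the reduction step just described takes the following form (a generic Fano-method inequality; Φ is the increasing loss transform, ϱ the metric, ε the packing radius, and the infimum runs over all tests X̂ built from the measurement outputs o):

```latex
\mathbb{E}_{S}\,\mathbb{E}_{U}\,
  \Phi\bigl(\varrho(h_{S},\, f_{U})\bigr)
  \;\ge\; \Phi(\varepsilon)\,
  \inf_{\widehat{X}} \;
  \Pr\bigl[\widehat{X}(o) \neq X\bigr].
```

Any lower bound on the minimax testing error on the right-hand side thus immediately lower bounds the average risk on the left.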

Construction
The organization of this section is as follows. We first discretize the class of target functions through a 2ε-packing in SM C 1. In SM C 2 we demystify how to reduce the learning problem to a hypothesis-testing problem on the 2ε-packing with Fano's inequality. In this regard, learning the target unitary reduces to identifying the corresponding index X in the 2ε-packing. Then, we elucidate how to obtain the results of Theorem 1 by separately bounding the related terms in Fano's inequality, namely, the mutual information I(X; X̂) between the target index X and the estimated index X̂, and the cardinality of the 2ε-packing. The theoretical proof of Theorem 1 is given in SM C 3. We provide the theoretical guarantee of the reduction from the learning problem to the hypothesis-testing problem in SM C 4. SM C 5 and SM C 6 present the theoretical proofs for bounding the cardinality of the 2ε-packing and bounding the mutual information I(X; X̂), respectively. Finally, in SM C 7 we elucidate the tightness of the derived lower bound by establishing the connection to the results of quantum state tomography.

Discretizing the class of target functions through 2ε-packing
In the quantum dynamics learning problem under a fixed observable O described in SM B 2, the goal is to identify the target function f_U(ρ) = Tr(U†OUρ) from the hypothesis set F = {f_U(ρ) = Tr(U†OUρ) | U ∈ SU(d)} according to the measurement output o. This task is hard when F contains a large number of very different functions. In this regard, we discretize the set of target functions F by equipping it with a local ε-packing, as shown in Fig. 3.
Definition 1 (ε-packing and local ε-packing). For a given set of functionals F and a distance metric ϱ on this set, the ε-packing M_ε(F, ϱ) is a discrete subset of F whose elements are guaranteed to be separated from each other by a distance greater than or equal to 2ε. Namely, for any elements f_1, f_2 ∈ M_ε(F, ϱ), the distance between f_1 and f_2 satisfies Similarly, the local ε-packing additionally requires the distance between any two elements to be at most γε, where γ ≥ 2.
For a sufficiently large γ obeying γε ≥ max_{f_1,f_2∈F} ϱ(f_1, f_2), the local ε-packing reduces to the ε-packing. In this work, we adopt the local ε-packing instead of the ε-packing to obtain a tighter bound. This is because the learned hypothesis h_S only needs to approximate f_U(ρ) = Tr(U†OUρ) on the training input states {ρ_j}_{j=1}^N up to a small training error, which can be viewed as a relaxation of the perfect-training assumption in the ideal setting. In particular, a small bounded training error restricts the possible space in which the target unitary resides to the vicinity of the learned hypothesis. Consequently, the local ε-packing suffices to discretize the hypothesis set F in this context.
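As a toy illustration of the packing notion (not the construction used in the proof, which is probabilistic), one can greedily collect Haar-random pure states whose pairwise trace distance exceeds 2ε; for pure states |u⟩, |v⟩ the trace distance is √(1 − |⟨u|v⟩|²). The function names below are our own:

```python
# Hypothetical sketch: greedily build a set of qubit-register pure states with
# pairwise trace distance > 2*eps, i.e., a (non-local) epsilon-packing.
import numpy as np

rng = np.random.default_rng(0)

def haar_state(d):
    """Sample a Haar-random pure state in C^d."""
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

def trace_distance(u, v):
    """Trace distance between two pure states: sqrt(1 - |<u|v>|^2)."""
    return np.sqrt(max(0.0, 1.0 - abs(np.vdot(u, v)) ** 2))

def greedy_packing(d, eps, n_candidates=300):
    """Keep each candidate only if it is > 2*eps away from all kept states."""
    packing = []
    for _ in range(n_candidates):
        s = haar_state(d)
        if all(trace_distance(s, t) > 2 * eps for t in packing):
            packing.append(s)
    return packing

P = greedy_packing(d=4, eps=0.1)
```

For small ε and moderate d, almost every random candidate survives, reflecting the exponentially large packing cardinality exploited in the proofs.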
The packing cardinality |M^{(γε)}_ε(F, ϱ)| heavily depends on the choice of the distance metric ϱ, which is generally determined by the employed risk function. The following lemma gives the analytical expression of the metric ϱ for the risk function defined in Eqn. (B7).
Lemma 6 (Reformulation of the risk function). For any given projective measurement O = |o⟩⟨o| and any fixed random Haar state ρ, the risk function defined in Eqn. (B7) has the following equivalent expression where

and
Proof of Lemma 6. We recall that the risk function given in Eqn. (1) refers to where the second equality employs Tr(A)² = Tr(A ⊗ A), the third equality exploits Lemma 4, and SWAP is the swap operator on the space H_d^{⊗2}. When the observable O is the projective measurement |o⟩⟨o|, the operators U†OU and V_S†OV_S can be treated as density matrices of pure quantum states, which we denote as |µ⟩⟨µ| and |ν⟩⟨ν| with |µ⟩ = U|o⟩ and |ν⟩ = V_S|o⟩, respectively. With this observation, we can rewrite the risk function as where the first equality comes from substituting the density-matrix representations of the operators O, U†OU, V_S†OV_S into Eqn. (C6), the second equality is based on the fact that Tr(ρ²) = 1 for any pure state ρ, and the third equality employs the relation between trace distance and fidelity, 1 − |⟨u|v⟩|² = ∥|u⟩⟨u| − |v⟩⟨v|∥₁²/4. Eqn. (C3) can be directly obtained by denoting Φ(t) = t². The definition of ϱ in Eqn. (C4) satisfies the properties of a distance metric, e.g., non-negativity. With this well-defined distance metric ϱ on the space F, the local 2ε-packing is denoted as M^{(γε)}_{2ε}, dropping the dependence on F and ϱ for simplicity. Without loss of generality, we employ the positive integer set X^{(2γε)}_{2ε} to index the elements of the local 2ε-packing. In the following, we refer to the index set X^{(2γε)}_{2ε} as the local 2ε-packing.
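The trace-distance/fidelity identity used in the last step of the proof can be checked numerically; this is a minimal sanity-check sketch (our own code, not part of the paper):

```python
# Numerically verify 1 - |<u|v>|^2 = || |u><u| - |v><v| ||_1^2 / 4 for pure states.
import numpy as np

rng = np.random.default_rng(1)

def haar_state(d):
    """Sample a Haar-random pure state in C^d."""
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

def trace_norm(A):
    """Trace norm = sum of singular values."""
    return np.sum(np.linalg.svd(A, compute_uv=False))

d = 8
for _ in range(20):
    u, v = haar_state(d), haar_state(d)
    lhs = 1.0 - abs(np.vdot(u, v)) ** 2
    diff = np.outer(u, u.conj()) - np.outer(v, v.conj())
    rhs = trace_norm(diff) ** 2 / 4.0
    assert abs(lhs - rhs) < 1e-10
```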

Reducing the learning problem to hypothesis testing
We now elucidate how to reduce the quantum dynamics learning problem introduced in SM B 2 to a hypothesis-testing problem by exploiting Fano's method, which is extensively used in deriving lower bounds on the generalization error in statistical learning theory [76]. Assume that the target function lies in the discrete local 2ε-packing {f_{U_x}}_{x∈X^{(2γε)}_{2ε}}; then the learning problem aims to identify the underlying index x from X^{(2γε)}_{2ε} according to the training data S, which is exactly described by hypothesis testing. In particular, given the set of training input states {ρ_j}_{j=1}^N, we refer to a hypothesis test as any measurable mapping from the measurement outputs to the index set. The learning problem can be reduced to the following hypothesis-testing problem, as shown in Fig. 4: • first, nature chooses X = x according to the uniform distribution over X^{(2γε)}_{2ε}; • second, conditioned on X = x, the measurement outcomes on the output states are drawn from the distributions P^{(j)}_{o|x}; • third, a learning algorithm outputs a hypothesis from the training data S; • fourth, a hypothesis test estimates the index X from the measurement outputs. The choice of the uniform distribution in the first step is made because, in the context of the quantum NFL theorem, the target unitary is assumed to be sampled from the Haar distribution. The distribution P^{(j)}_{o|x} of the measurement outcome o_{jk} in the second step is assumed to be a Gaussian distribution whose mean u_x = Tr((U_x†OU_x ⊗ I_R)ρ_j) varies with the index x and the input state ρ_j, and whose variance σ² is assumed to be an n-dependent constant identified below (see Assumption 1). The learning algorithm in the third step is arbitrary, as long as the learned function achieves the minimal training error (which can be greater than 0); this can be regarded as a relaxation of the perfect-training assumption with zero training error. In the fourth step, the notation P_{o,X} denotes the joint distribution over the random index X and o. In particular, defining P̄_o = (1/|X^{(2γε)}_{2ε}|) Σ_{x∈X^{(2γε)}_{2ε}} P_{o|x} as the mixture distribution, the measurement output is drawn (marginally) from P̄_o, and our hypothesis-testing problem is to determine the randomly chosen index X given the training data S = {(ρ_j, o_j)}_{j=1}^N with o = (o_1, ..., o_N) sampled from this
mixture P̄_o. Denote the average risk function over the training sets S and target unitary U as where the second equality follows Lemma 6. Notably, the randomness of S comes from the randomness of the input-state set {ρ_j}_{j=1}^N and the randomness of the measurement outcomes o_j. To simplify the calculation of Haar integration, we make the following mild assumption on the distribution of the measurement output o_j. According to the central limit theorem, the binomial distribution is approximated by the normal distribution N(u_x^{(j)}, (σ_x^{(j)})²) with mean u_x^{(j)} = Tr((U_x†OU_x ⊗ I_R)|ψ_j⟩⟨ψ_j|) and variance (σ_x^{(j)})² = u_x^{(j)}(1 − u_x^{(j)})/m. Furthermore, the variance is assumed to take the expectation of (σ_x^{(j)})² over the random input state |ψ_j⟩, i.e., σ² = E_{|ψ_j⟩∼Haar}[(σ_x^{(j)})²]. Remark. We note that Assumption 1 is mild, as the convergence of the binomial distribution to the normal distribution is guaranteed by the central limit theorem for a large number of measurements [93]. Additionally, approximating the variance σ_x² by its expectation σ² is based on the observation that the variance σ_x² is smaller than the expectation of u_x by at least a multiplicative factor of 1/m and is exponentially concentrated in the number of qubits n, a phenomenon widely studied in the barren-plateau literature of variational quantum algorithms [81,94,95]. These observations enable the effective discrimination of the normal distributions of the measurement outputs corresponding to the different hypothesis unitaries in the local 2ε-packing.
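The binomial statistics underlying Assumption 1 can be checked in a few lines: the m-shot average of a {0,1}-valued projective outcome with success probability u has mean u and variance u(1 − u)/m, which is what the normal approximation inherits (a minimal sketch with hypothetical parameter values):

```python
# Check the mean and variance of the m-shot averaged outcome o_j = sum_k o_jk / m.
import numpy as np

rng = np.random.default_rng(2)
u, m, trials = 0.3, 100, 20000      # success probability, shots, repetitions
o = rng.binomial(m, u, size=trials) / m
assert abs(o.mean() - u) < 5e-3                  # mean ~ u
assert abs(o.var() - u * (1 - u) / m) < 5e-4     # variance ~ u(1-u)/m
```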
With this setup, the following lemma, whose proof is deferred to SM C 4, gives a theoretical guarantee of the reduction from the quantum dynamics learning problem to the hypothesis-testing problem.
where the probability measure P refers to the joint distribution of the random index X and the measurement output o.
Lemma 7 reduces the problem of lower bounding the risk function R_N(Φ ⊙ ϱ) to the problem of lower bounding the error of hypothesis testing. Fano's inequality gives an information-theoretic bound for the latter. Lemma 8 (Fano's inequality). Assume that X is uniform in X^{(2γε)}_{2ε}. The learning procedure can be depicted by the Markov chain X → S → X̂, where X̂ is returned by the hypothesis test, i.e., Φ_{ρ_1,...,ρ_N}(o) = X̂. Then we have where I(X; X̂) refers to the mutual information between the estimated index X̂ and X.
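For orientation, Lemma 8 instantiates the standard form of Fano's inequality for a uniform index on the packing (our rendering of the textbook statement, in the notation above):

```latex
\Pr\bigl[\widehat{X}\neq X\bigr]
  \;\ge\; 1 \;-\;
  \frac{I(X;\widehat{X}) + \log 2}
       {\log\bigl|\mathcal{X}^{(2\gamma\varepsilon)}_{2\varepsilon}\bigr|}.
```

Hence a small mutual information, or a large packing cardinality, forces a large testing error, and by Lemma 7 a large risk.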
The information-theoretic lower bound for the risk function defined in Eqn. (C9) can be achieved by using Fano's method for multiple hypothesis testing, established in Lemma 7 and Lemma 8. Moreover, it reduces to separately bounding the mutual information I(X; X̂) and the local 2ε-packing cardinality |X^{(2γε)}_{2ε}| with ε = O(1/d) in the ϱ-metric, such that the packing cardinality yields where ε̃ = 2√(2d(d + 1)) ε and γ refers to an arbitrary constant obeying 0 < 4γ²ε̃² − 1 < 1 and γ > 2.
Lemma 10 (Upper bound of the mutual information I(X; X̂)). Following the notations in Lemma 9 and Assumption 1, the average of the mutual information over the training states {ρ_j}_{j=1}^N sampled from the distribution described in Construction Rule 1 yields where γ refers to a constant obeying 0 < 4γ²ε̃² − 1 < 1 with ε̃ = 2√(2d(d + 1)) ε and γ > 2.
Hence, learning quantum dynamics in the framework of the quantum NFL theorem, taking into consideration the entangled quantum states and the finite number of measurements, is encapsulated in the following theorem.
Theorem (Formal statement of Theorem 1). Let {f_{U_x}}_{x∈X^{(2γε)}_{2ε}} be a local 2ε-packing, with the maximal distance being γε, of the function class F in the ϱ-metric. Assume that the index X corresponding to the target function f_{U_X} is uniformly sampled from the set X^{(2γε)}_{2ε}. Then it holds that where ε̃ = 2√(2d(d + 1)) ε and the expectation is taken over all target unitaries U.

Proof of Theorem 1
We are now ready to prove Theorem 1.
Proof of Theorem 1. Combining Lemma 8 with the reduction from learning to testing in Lemma 7, we have Employing the results of Lemma 9 and Lemma 10, we have where the first inequality employs the bound on log|X^{(2γε)}_{2ε}| from Lemma 9. This completes the proof. We construct a local ε-packing with the states ρ_{ε,W_i} = W_i|ϕ⟩⟨ϕ|W_i†, where |ϕ⟩ ∈ H_d is any fixed quantum state. We then apply standard concentration-of-measure results to argue that the probability of selecting an undesirable state set (i.e., one in which there exist two states whose trace distance is less than 2ε or larger than γε) is exponentially small. This in turn implies that such a state set is a local ε-packing with maximal distance γε. To this end, we first exploit the concentration of projector overlaps, which has been used in deriving lower bounds for quantum state tomography [26,55,92,96].
Lemma 11 (Lemma 3.2, [55]). Let W ∈ SU(d) be a Haar-random unitary operator and let Π_1, Π_2 : H_d → H_d be orthogonal projection operators with ranks r_1 and r_2, respectively. For all t ∈ (0, 1) it holds that and With this lemma, we can derive the lower bound on the local ε-packing cardinality of n-qubit pure states, which is encapsulated in the following lemma.
Lemma 12 (Lower bound of the packing cardinality for n-qubit quantum states). Under the distance metric of the trace norm ∥·∥₁, there exists a local ε-packing P^{(γε)}_ε of the set of all n-qubit pure quantum states |ψ⟩, where the distance between arbitrary two elements in P^{(γε)}_ε is less than γε, satisfying where γ refers to an arbitrary constant obeying 0 < 4γ²ε² − 1 < 1 and γ > 2.
Proof of Lemma 12. We give a probabilistic existence argument for the lower bound of the local ε-packing cardinality. In particular, we first construct a local ε-packing, in which the distance between arbitrary two elements is less than γε, by applying a probabilistic method. Let ρ_0 = |ϕ⟩⟨ϕ| be any fixed quantum state and W_1, ..., W_L be arbitrary unitary operators sampled from the Haar distribution. In the following, with the aim of showing the existence of such a local ε-packing with the desired lower bound, we will show that the event that there exist i ≠ j ∈ [L] such that the trace distance between ρ_i and ρ_j is less than 2ε or larger than γε occurs with small probability. Mathematically, the probability of this event has the form of where the first inequality employs the subadditivity of the probability measure, and the first equality exploits the definition ρ_W = W|ϕ⟩⟨ϕ|W† and the invariance of the trace distance under arbitrary unitary transformations; the last equality follows because the unitary operator W = W_j†W_i follows the Haar distribution when W_i, W_j ∈ SU(d) are sampled from the Haar distribution. Next, we separately consider the probabilities of the events in Eqn. (C21), ∥ρ_W − ρ_0∥₁ ≤ 2ε and ∥ρ_W − ρ_0∥₁ ≥ γε, under the Haar distribution. In particular, we first note that with the definition ρ_W = W|ϕ⟩⟨ϕ|W† for any W ∈ SU(d), we have This leads to the conclusion that where the last inequality employs Lemma 11 with t = 1 − 2ε.
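The concentration driving this argument can be probed numerically: for a Haar-random W, the overlap |⟨ϕ|W|ϕ⟩|² has mean 1/d and is typically tiny, so two random rotations of a fixed state are almost always far apart in trace distance (a sketch with our own helper functions; the Ginibre-QR construction is a standard way to sample Haar unitaries):

```python
# Estimate the overlap |<phi|W|phi>|^2 for Haar-random W and check E = 1/d.
import numpy as np

rng = np.random.default_rng(3)

def haar_unitary(d):
    """Haar-random unitary via QR decomposition of a Ginibre matrix."""
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / abs(np.diag(r)))  # fix the phases of R's diagonal

d, trials = 16, 2000
phi = np.zeros(d); phi[0] = 1.0
overlaps = np.array([abs(np.vdot(phi, haar_unitary(d) @ phi)) ** 2
                     for _ in range(trials)])
assert abs(overlaps.mean() - 1.0 / d) < 0.01
# trace distance between phi and W|phi> is sqrt(1 - overlap): typically near 1
assert np.median(np.sqrt(1 - overlaps)) > 0.9
```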
We are now ready to present the proof of Lemma 9.
the direct calculation of the Haar integration σ² = E_{ρ_j∼Haar}[u_x^{(j)}(1 − u_x^{(j)})]/m, and additionally the positivity of the second term.

More numerical results
Simulation results for orthogonal training states. Fig. 5 plots the prediction error when the sampled training states are orthogonal, i.e., ⟨ψ_i|ψ_j⟩ = 0 for any i ≠ j. In this case, the product of the Schmidt rank r and the size of the training data N obeys r × N ≤ 2^n, as the number of mutually orthogonal states in the Hilbert space H_X is at most the dimension of this space, d = 2^n. The simulation results show that the prediction error vanishes when the product of the Schmidt rank and the size of the training data, rN, equals the dimension d. Additionally, for the case of a small number of measurements m, increasing the Schmidt rank r may increase the prediction error, indicating the transition role of entangled data. These phenomena accord with Theorem 1.
Simulation results for independently sampled training states. Fig. 6 plots the prediction error with a varied number of measurements m, size of training data N, and Schmidt rank r. The simulation results show that for cases with the same magnitude of the product of the number of measurements and the training-data size (e.g., (m, N) ∈ {(100, 10), (500, 2)}), the tuple with more training data and fewer measurements per training state, (m, N) = (100, 10), achieves a smaller training error. This indicates that when the number of accesses to the unknown target unitary is limited, the measurement outcomes should be collected from more distinct training states rather than from a few quantum states with a large number of measurements. These phenomena accord with Theorem 1.

FIG. 1: Illustration of the quantum NFL setting with entangled data. The goal of the quantum learner is to learn a unitary V_X that can accurately predict the output of the target unitary U_X under a fixed observable O, where the subscript X refers to the quantum system on which the operator O acts. The learning process is as follows. (a) A total of N entangled bipartite quantum states living in the Hilbert space H_X ⊗ H_R (R denotes the reference system) are taken as inputs, dubbed entangled data. (b) The quantum learner proceeds with incoherent learning. The entangled data separately interact with the target unitary U_X (agnostic) and the candidate hypothesis V_X extracted from the hypothesis set H. (c) The quantum learner is restricted to leveraging the finite measured outcomes of the observable O on the output states of U_X and V_X to conduct learning. (d) A classical computer is exploited to infer the V* that best estimates U_X according to the measurement outcomes. For example, in the case of variational quantum algorithms, the classical computer serves as an optimizer to update the tunable parameters of the ansatz V_X. (e) The learned unitary V* is used to predict the output of unseen quantum states in the Hilbert space H_X under the evolution of the target unitary U_X and the measurement of O. A large Schmidt rank r can enhance the prediction accuracy when combined with a large number of measurements m, but may lead to a decrease in accuracy when m is small.
being the expectation value of the observable O on the state (U ⊗ I R ) |ψ j ⟩ and N being the size of the training data.

FIG. 2: Simulation results of the quantum NFL theorem when incoherently learning quantum dynamics. (a) The averaged prediction error with a varied number of measurements m and Schmidt rank r when N = 2 and N = 8. The z-axis refers to the averaged prediction error defined in Eqn. (1). (b) The averaged prediction error with varied sizes of training data. The label 'r = a & m = b' indicates that the Schmidt rank is a and the number of measurements is b. The label '(×2/d²)' indicates that the plotted prediction error is normalized by a multiplicative factor 2/d².

1. Quantum NFL theorem for learning quantum dynamics in the ideal setting

In this subsection, we briefly recap the results achieved in Ref. [50] for self-consistency. The learning problem studied in Ref. [50] aims to learn the full representation of the target unitary operator U through training a hypothesis unitary V_{S_Q} on the training data

2. Quantum NFL theorem for learning quantum dynamics in the realistic scenario

Let us first recall the required notation for the learning problems of the quantum NFL theorem. Let U ∈ SU(d) denote the target unitary and O = |o⟩⟨o| be the fixed projective measurement acting on the quantum system X. Denote by S the training data with size |S| = N. The entanglement-assisted protocol introduces a reference system R to prepare the entangled training states in the Hilbert space H_{XR} := H_X ⊗ H_R with dim(H_R) = d. The m measurement outcomes (o_{j1}, ..., o_{jm}) of the observable O on the output state (U ⊗ I_d)|ψ_j⟩ are collected to construct the response to the input |ψ_j⟩ by taking the average value o_j = Σ_{k=1}^m o_{jk}/m. This leads to the training data S. The Schmidt coefficients satisfy Σ_k c_{j,k} = 1, and hence |c_j⟩ = (√c_{j1}, ..., √c_{jr})⊤ forms a state vector in the Hilbert space H_r. Suppose that all training states |ψ_j⟩ ∈ S have the same Schmidt rank r ∈ {1, 2, ..., d} across the cut H_X ⊗ H_R. The problem of incoherent quantum dynamics learning aims to train a hypothesis unitary V_S on the training data S such that it approximates the output of the target unitary U under the observable O for any unseen input state |ϕ⟩ ∈ H_d. In this regard, we denote the target function and the learned hypothesis function with respect to the input state |ψ⟩ (or ρ = |ψ⟩⟨ψ| in the density-matrix representation) as f_U(ψ) = Tr(U†OU|ψ⟩⟨ψ|) and h_S(ψ) = Tr(V_S†OV_S|ψ⟩⟨ψ|), respectively. Then, to evaluate the quality of the learned unitary V_S, we define the risk function R_U(V_S) as
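The target function, hypothesis function, and risk can be sketched end to end in a small numerical toy (our own illustration with hypothetical dimensions and helper names; the risk is estimated by Monte Carlo over Haar-random inputs):

```python
# Toy estimate of the risk E_psi [f_U(psi) - h_V(psi)]^2 with O = |o><o|.
import numpy as np

rng = np.random.default_rng(4)

def haar_unitary(d):
    """Haar-random unitary via QR of a Ginibre matrix."""
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / abs(np.diag(r)))

def haar_state(d):
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

d = 4
o = np.zeros(d); o[0] = 1.0                 # projective observable O = |o><o|

def f(U, psi):
    """Tr(U^dag O U |psi><psi|) = |<o|U|psi>|^2 for O = |o><o|."""
    return abs(np.vdot(o, U @ psi)) ** 2

def risk(U, V, samples=5000):
    """Monte Carlo estimate of E_psi [f_U(psi) - f_V(psi)]^2 over Haar inputs."""
    return float(np.mean([(f(U, s) - f(V, s)) ** 2
                          for s in (haar_state(d) for _ in range(samples))]))

U = haar_unitary(d)
assert risk(U, U) < 1e-12                   # a perfect hypothesis has zero risk
assert risk(U, haar_unitary(d)) > 0.0       # a random hypothesis generally does not
```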

FIG. 3: The ε-packing of the hypothesis space. The left panel and the right panel refer to the ε-packing and the local ε-packing with maximal distance γε of F, respectively.
The positive integer set X^{(2γε)}_{2ε} indexes the elements in the local 2ε-packing, i.e., M^{(2γε)}_{2ε} = {f_{U_x}}_{x∈X^{(2γε)}_{2ε}}, where each index x in X^{(2γε)}_{2ε} uniquely corresponds to an element f_{U_x} in the local packing M^{(2γε)}_{2ε}.

FIG. 4: The paradigm of the reduction from quantum dynamics learning to hypothesis testing.

Assumption 1. For any index X = x, the outcome o_j = Σ_{k=1}^m o_{jk}/m of the projective measurement O = |o⟩⟨o| on the given output state (U_x ⊗ I_R)|ψ_j⟩ follows the binomial distribution B(m, u_x^{(j)}) with u_x^{(j)} = E[o_j] = Tr((U_x†OU_x ⊗ I_R)|ψ_j⟩⟨ψ_j|).

Lemma 7. Given the local 2ε-packing X^{(2γε)}_{2ε} of the functional class F, the average risk error in Eqn. (C9) is lower bounded by

which are given by the following two lemmas.

Lemma 9 (Lower bound of the local 2ε-packing cardinality for the output under the projective measurement). Let O = |o⟩⟨o| be the projective measurement, F = {f_U : ρ → Tr(U†OUρ) | U ∈ SU(d)} be the function class of the outputs of the quantum system given an arbitrary fixed Haar state ρ, and ϱ(f_{U_1}, f_{U_2}) = E_{ρ∼Haar} |Tr(O(U_1ρU_1† − U_2ρU_2†))| the distance measure. Then there exists a local 2ε-packing X^{(2γε)}_{2ε}

Conditional on X = x, we obtain the training set S = {ρ_j, o_j}_{j=1}^N, where ρ_j is the random entangled state of Schmidt rank r sampled from the distribution described in Construction Rule 1, and o_j = Σ_{k=1}^m o_{jk}/m is the measurement output of the observable O following Assumption 1. Then the averaged risk function defined in Eqn. (C9) is lower bounded by where the second inequality employs the property of the trace norm that ∥A∥₁ = max{|Tr(AW)| : W ∈ U(d)} for any square operator A and that I_d − 2|ϕ⟩⟨ϕ| ∈ U(d), and the last equality is obtained by denoting Π_2 = |ϕ⟩⟨ϕ| and Π_1 = I_d − Π_2.

Proof of Lemma 9. To measure the local 2ε-packing cardinality of F, we first consider the local packing cardinality of the operator group U_O = {U†OU | U ∈ SU(d)} and then employ the relation between ϱ(Tr(U_1†OU_1ρ), Tr(U_2†OU_2ρ)) and ϱ_A(U_1†OU_1, U_2†OU_2) = ∥U_1†OU_1 − U_2†OU_2∥₁ to obtain the local 2ε-packing cardinality of F in the ϱ-metric. Specifically, denoting ε̃ = 2√(2d(d + 1)) ε, we first construct a local ε̃-packing P^{(γε̃)}_{ε̃}(U_O, ϱ_A) of the operator group U_O in the manner of Lemma 12, as U†OU is the density-matrix representation of a quantum state for the projective measurement O = |o⟩⟨o|. In this regard, it holds according to Lemma 12 that for any U_1†OU_1, U_2†OU_2 ∈ P^{(γε̃)}_{ε̃}

where the second equality employs the representation of the input states |ψ_j⟩ = Σ_{k=1}^r √c_{j,k} |ξ_{j,k}⟩_X |ζ_{j,k}⟩_R defined in Eqn. (B6) and the property of the partial trace Tr_R[|ψ_j⟩⟨ψ_j|] = Σ_{k=1}^r c_{j,k} |ξ_{j,k}⟩⟨ξ_{j,k}|_X; the third equality follows from direct expansion of the square function and formulates the Haar integration over the random state |ψ_i⟩ ∈ S as Haar integrations with respect to |c_i⟩ and |ξ_{i,k}⟩, which follow the Haar distribution in SU(r) and SU(d), respectively, according to Construction Rule 1; the fourth equality employs Lemma 6 and Lemma 5 to derive the first term and the second term, respectively; the fifth equality uses c_{i,k} = |⟨e_k|c_i⟩|² = Tr(⟨e_k|c_i⟩⟨c_i|e_k⟩) with |c_i⟩ = (√c_{i,1}, ..., √c_{i,r})⊤ and |e_k⟩ the computational basis defined in Section A 1; the sixth equality exploits Lemma 4 to calculate the Haar integration with respect to |c_i⟩ ∈ SU(r); and the last equality employs the property of the local 2ε-packing in Lemma 9 that the distance between arbitrary two elements is less than 2γε, i.e., ϱ(f_{U_x}, f_{U_{x'}}) ≤ 2γε for any x, x' ∈ X^{(2γε)}_{2ε}.

FIG. 6: Simulation results of the quantum NFL theorem with independent training states. The averaged prediction error with a varied number of measurements m, size of training data N, and Schmidt rank r. The x-axis is arranged from left to right by the magnitude of the product of m and N. The label '(×2/d²)' indicates that the plotted prediction error is normalized by a multiplicative factor 2/d².