Learning quantum processes without input control

We introduce a general statistical learning theory for processes that take as input a classical random variable and output a quantum state. Our setting is motivated by the practical situation in which one desires to learn a quantum process governed by classical parameters that are out of one's control. This framework is applicable, for example, to the study of astronomical phenomena, disordered systems and biological processes not controlled by the observer. We provide an algorithm for learning with high probability in this setting with a finite number of samples, even if the concept class is infinite. To do this, we review and adapt existing algorithms for shadow tomography and hypothesis selection, and combine their guarantees with the uniform convergence on the data of the loss functions of interest. As a by-product we obtain sufficient conditions for performing shadow tomography of classical-quantum states with a number of copies which depends on the dimension of the quantum register, but not on the dimension of the classical one. We give concrete examples of processes that can be learned in this manner, based on quantum circuits or physically motivated classes, such as systems governed by Hamiltonians with random perturbations or data-dependent phase-shifts.


I. INTRODUCTION
The goal of science is to gain a better understanding of nature. Because 'nature isn't classical, dammit' [1], the tools of quantum information processing have been central to this pursuit. Insights transposed from statistical learning theory to the quantum domain have established rigorous guarantees for learners that predict properties of quantum states [2][3][4][5][6], classify phases of matter [5], learn quantum channels [7] or approximate models of physical dynamics [8].
A feature shared by many of these works is that they require the quantum learner to have precise control of the unknown object: for instance, the ability to request multiple identical copies of the unknown state or to run the unknown process on a string of well-chosen inputs. That is, the unknown process is treated as a black box, to be applied to inputs specially designed by the learner. However, this assumption is not always satisfied in practice. A scientist can typically observe but not fully control an unknown process of interest: think of an astronomer analyzing signals generated by rare celestial events, a biologist probing molecular mechanisms induced by biochemical signals in a noisy environment, or a physicist characterizing systems obeying Hamiltonians subject to random perturbations. In this work, we show that quantum processes can be learned even without the strong assumption of input control.
That is (see Fig. 1), the learner receives as examples the input-output pairs

(x_1, ρ(x_1)), ..., (x_n, ρ(x_n)),     (1)

where the x_i are classical inputs and the ρ(x_i) are the quantum states output by the unknown process on these inputs. Here the lack of input control is reflected in the fact that the x_i are not chosen by the learner but are samples from some distribution D.
Practically, the imperfect control could consist in the impossibility of choosing the input to a process, or in the inevitability of the input fluctuating across the various applications of the process. For instance, in optical imaging of celestial, atmospheric, planetary or biological events, ρ(x) could describe the quantum state of an optical signal emitted or back-scattered by an unknown object, while x is a set of environmental parameters influencing the state of the signal, e.g., temperature, optical depth, distance.
[FIG. 1: The learner observes classical inputs x_1, x_2, ..., x_n together with the corresponding quantum outputs, but does not control the classical context. The learner tries to find a theoretical explanation for the observed classical-quantum data using a hypothesis class of operator-valued functions h(x), possibly of infinite cardinality. We present quantum algorithms that are able to identify a near-optimal hypothesis and obtain sample complexity guarantees in terms of covering numbers of the hypothesis class.]

We model the situations described above as follows: the learner has access to a source that outputs
classical-quantum states where ρ : X → S and x ∼ D on X, with S being the set of states of some Hilbert space H and ρ the unknown process. Over n data collections, the received states form a product state ϱ = ρ(x_1) ⊗ ··· ⊗ ρ(x_n). This generalizes the setting of [7, 9] on probably approximately correct (PAC) learning of quantum channels.
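To make the data model concrete, here is a minimal numerical sketch in Python/NumPy. The toy process `unknown_process` and the choice of D (uniform on [0, π)) are our own illustrative assumptions, not taken from the paper; the point is only that the learner receives single copies at inputs it did not choose.

```python
import numpy as np

rng = np.random.default_rng(0)

def unknown_process(x):
    """Toy stand-in for the unknown map rho: X -> S.
    The classical parameter x rotates |0>; the output is a pure-state density matrix."""
    v = np.array([np.cos(x), np.sin(x)])
    return np.outer(v, v)

# The learner does NOT choose the inputs: the x_i arrive i.i.d. from D
# (here, uniform on [0, pi), purely for illustration).
n = 5
xs = rng.uniform(0.0, np.pi, size=n)
samples = [(x, unknown_process(x)) for x in xs]

# Each rho(x_i) is held only once; jointly they form the product state
# rho(x_1) (tensor) ... (tensor) rho(x_n).
for x, rho in samples:
    print(np.trace(rho))   # each output is a valid quantum state (unit trace)
```

Note that because the x_i fluctuate, the learner generally holds n pairwise different states, which is the key departure from standard tomography settings.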
Our goal is to learn a possible classical-quantum source from which the data are sampled, and we construct algorithms to do so with bounded sample complexity. A key object in the analysis is a certain distinguishability measure of the class of candidate classical-quantum processes, which we identify, and which can be bounded even if this class is infinite. As an application, we also improve the shadow tomography procedure of [3] for an unknown classical-quantum state: a naive application of shadow tomography would not be guaranteed to work if the classical register is infinite-dimensional, while we show that only the dimension of the quantum register matters.
Our theory also goes beyond previous works to encompass the agnostic case: the case when the unknown process is not included in the hypothesis set C. This is as opposed to the realizable case, where the unknown process is guaranteed to be from the set. Note that the agnostic case was mentioned by [7], who provided a lower bound on the sample complexity, and tackled in a specific case by [9]. This is in keeping with the setting of learning Nature: agnostic learning models the situation in which we learn a natural process using as hypotheses only the limited models that are within the reach of our theoretical understanding.
Our work overcomes a key technical hurdle not tackled by previous approaches: without input control, our learner cannot obtain identical copies of the process sampled at well-chosen points {x_i}. This assumption is crucial in, for instance, [8] and [10]. Instead, for every sampled x_i, she only gets a single copy of ρ(x_i), which in general differs from the other copies ρ(x_j), j ≠ i. Nevertheless, we design a measurement strategy that learns even without the luxury of identical copies. Ref. [7] hinted that a VC-dimension-like quantity for this setting might be impossible to define. Here, instead, we establish a fundamental prerequisite for the definition of a statistically meaningful dimension (analogous to the fundamental theorem for uniform convergence of [11]), providing sufficient conditions for learning concept classes of infinite cardinality with finite data: we introduce learning algorithms that succeed with high probability if a suitable covering number of the quantum concept class C grows sufficiently slowly with the sample size.
Our theory of learnability extends beyond the usual setting of learning quantum processes given by quantum circuits. In fact, our learning model also encompasses the following scenarios:

• We want to study how a small quantum system behaves in a variable environment. In this case, the classical random variable is a measurement of the status of the environment, for example a measurement of classical fields. The copy of the quantum state is a copy of the state of the system corresponding to the measured state of the environment. One can imagine applying this scenario to molecules or nanostructures. Notice that the border between environment and object is arbitrary; therefore, in the classical random variable one could include the outcomes of some predetermined measurement on the object of interest. With the same idea in mind, the quantum state could also be not the original state of the system but some post-processing of it, for example via a quantum sensor.
• We want to do imaging of a system, that is, to associate to each point of the system a quantum state or channel. When the detector clicks, we receive as experimental data a pair comprising the position and the quantum state corresponding to that position. If our experimental setup uses spontaneous/stimulated emission, such that in a specific time interval we cannot guarantee an observation at a specific position, our model correctly represents the fact that we receive data from random positions and cannot afford to receive multiple copies of the state corresponding to an arbitrary position.
• We are studying a class of stars with a combination of classical and quantum sensors: for example, we get classical electromagnetic signals from a star, and a quantum sensor collects information about gravitational waves.
We would like to study correlations between the electromagnetic and gravitational waves, but since these events are rare and unique we cannot repeat them at will.

Some toy models for concept classes inspired by these scenarios are discussed as applications. In these cases, we can find bounds on the covering number, from which sample complexity bounds can be obtained.

A. Setting: learning quantum processes with random classical input
We now go into more detail about our learning setting, which is a natural quantum generalization of supervised learning and builds on that of [7]. Suppose there is an unknown function ρ : X → S to be learned, with X possibly of infinite cardinality, and a distribution D : X → [0, 1]. In fact, we will focus on processes ρ that map from a classical domain X to a quantum set S ⊆ L(H), where L(H) is the set of linear operators on a finite-dimensional Hilbert space H. When we want to keep track of the dimension d of a Hilbert space, we use the notation H^(d). Furthermore, we will always be interested in the case where the unknown process ρ outputs a quantum state, that is, all operators in S are positive semidefinite and have unit trace.
The learner receives as input samples the pairs (x_i, ρ(x_i))_{i=1}^n, where x_i ∼ D. Furthermore, the learner has a set of hypotheses, C = {h : X → L(H)}, and would like to use the smallest possible number n of samples to choose a candidate h from the class that accounts well for the observations. The accuracy of the learner's output h relative to the true function ρ will be measured by the true risk R_ρ : C → R, defined via a loss function L as

R_ρ(h) = E_{x∼D}[L(h(x), ρ(x))].

Therefore, it is in the learner's interest to minimize the true risk, although she can only do so approximately, as detailed in the next section.
In what is known as the realizable setting of learning (studied by [7]), there is a promise that the unknown function comes from C. We present results for this setting too, but go one step further. In learning Nature, our scientific models are but approximate descriptions that correspond more closely to reality at some scales than others. Thus, we primarily treat the agnostic (or unrealizable) setting, in which no hypothesis in C is guaranteed to correspond exactly to the unknown function.
We will focus on concept classes C that output two types of quantum objects:

• (Case 1) Quantum states: C consists of hypotheses h(x) = σ_h(x) which are state-valued functions. The true function ρ(x) is also a state-valued function, and we use the trace distance L_s(σ_h, ρ) = d_tr(σ_h(x), ρ(x)), a natural notion of distance between quantum states, as the loss function.
• (Case 2) Quantum events/projectors: C consists of hypotheses h(x) = Π_h(x) which are projector-valued functions. Again, the true function ρ(x) is a state-valued function, and we will use as loss function the probability of the projector not accepting, i.e., L_p(Π_h, ρ) = 1 − Tr[Π_h(x) ρ(x)].

We note that in Case 2 we can switch out the projectors for general POVM elements, by a standard dilation argument with which we can represent them as projectors on a larger space. However, the dilation is not unique and what we will say in the following will depend on the dilation. Therefore, for simplicity, we will always speak only about projectors.
A quick note on the motivation for defining projector-valued concept classes. In the classical case, when the label y is not a deterministic function of x, one speaks of learning probabilistic concepts [12]. In [12], two possible approaches are considered. The first is to learn a deterministic concept which maximises the probability of correct prediction; the second is to learn the conditional probability distributions p(y|x) on average. Learning projector-valued classes is a generalization of the first approach, and learning state-valued classes is a generalization of the second. Moreover, estimating L_p(Π_h, ρ) for every h ∈ C (as we do later) encompasses shadow tomography [2,3], a task where one is given a fixed state and a list of observables and has to output the expectation values of all observables in the list. Shadow tomography corresponds to the case where ρ(x) and Π_h(x) are constant as functions of x. Most importantly, our algorithm for this risk estimation problem on projector-valued functions is a key part of our strategy to attack Case 1.
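The two loss functions above can be written down directly. A minimal NumPy sketch (the example states are our own, chosen so the values are easy to verify):

```python
import numpy as np

def trace_distance(a, b):
    """d_tr(a, b) = (1/2) ||a - b||_1, computed from the eigenvalues
    of the Hermitian difference a - b."""
    return 0.5 * np.sum(np.abs(np.linalg.eigvalsh(a - b)))

def rejection_loss(proj, rho):
    """L_p(Pi, rho) = 1 - Tr[Pi rho]: probability that the event Pi does not accept rho."""
    return 1.0 - np.real(np.trace(proj @ rho))

# Orthogonal pure states are perfectly distinguishable under both losses:
rho0 = np.diag([1.0, 0.0])   # |0><0|
rho1 = np.diag([0.0, 1.0])   # |1><1|
print(trace_distance(rho0, rho1))   # -> 1.0
print(rejection_loss(rho0, rho1))   # -> 1.0
```

Here `rejection_loss(rho0, rho1)` treats ρ_0 = |0><0| as the accepting projector of a hypothesis and ρ_1 as the true state, so the hypothesis never accepts and the loss is maximal.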

B. Results
In this section, we state our main results, based on the algorithms summarized in Figure 2 and obtained in Sections III, IV, V. Recall that the goal is to learn an unknown function ρ, and the output of our learning algorithm will be some hypothesis h ∈ C. While ρ is not required to be in C, our algorithms will always output some h ∈ C close to achieving the minimum empirical risk, i.e., the average loss computed on the examples (x_i, ρ(x_i))_{i=1}^n:

R̂_ρ(h) = (1/n) ∑_{i=1}^n L(h(x_i), ρ(x_i)).     (4)

That is to say, we use minimizing the empirical risk as a proxy for minimizing the true risk. This principle, known as empirical risk minimization (ERM), originated in classical statistical learning theory [11]; the contribution of this paper is to adapt it to learning quantum-valued classes. In this paper, we distinguish between two types of tasks: the term empirical risk minimization refers to the task of outputting a single hypothesis from C that approximately minimizes the risk. Along the way, we will also develop an algorithm for empirical risk estimation (ERE), a term that refers to estimating the risk for every hypothesis in C.
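Setting aside for a moment the quantum obstructions discussed below (single copies, non-commuting observables), the ERM principle itself is simple. A classical idealization in NumPy, with toy hypotheses and data of our own choosing:

```python
import numpy as np

def trace_distance(a, b):
    return 0.5 * np.sum(np.abs(np.linalg.eigvalsh(a - b)))

def empirical_risk(h, samples):
    """R_hat(h) = (1/n) * sum_i L(h(x_i), rho(x_i)), here with the trace-distance loss."""
    return float(np.mean([trace_distance(h(x), rho_x) for x, rho_x in samples]))

def erm(hypotheses, samples):
    """Return the index and empirical risk of the empirical risk minimizer."""
    risks = [empirical_risk(h, samples) for h in hypotheses]
    best = int(np.argmin(risks))
    return best, risks[best]

# Toy check: the true process always outputs |0><0|; hypothesis 0 matches it exactly.
samples = [(x, np.diag([1.0, 0.0])) for x in np.linspace(0, 1, 4)]
hypotheses = [lambda x: np.diag([1.0, 0.0]), lambda x: np.diag([0.0, 1.0])]
print(erm(hypotheses, samples))   # -> (0, 0.0)
```

The quantum difficulty, addressed by the theorems below, is that `empirical_risk` cannot be evaluated classically: each ρ(x_i) is available only as a single physical copy.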
As formulated by [11], the success of ERM depends on a property of a concept class known as uniform convergence [13,14] of the empirical risks, which is controlled by a certain measure of the effective size of the concept class that we define: the γ_{1,q} covering number (for general treatments of covering numbers, see, e.g., [13,15]).
Definition 1 (Covering number). Let G ⊆ L(H)^X be a class of functions mapping to linear operators, and let ϵ ∈ (0, 1]. The covering number of G is

γ_{1,q}(n, ϵ, G) = sup_{⃗x ∈ X^n} N_in(ϵ, G, ∥·∥_{1,q,⃗x}),

where, for a pseudometric d, N_in(ϵ, G, d) is the smallest cardinality of any internal ϵ-cover of G according to the pseudometric d.
Here we have chosen as pseudometric the one induced by the ∥·∥_{1,q,⃗x} seminorm, which depends on the observed examples ⃗x as

∥h∥_{1,q,⃗x} = (1/|⃗x|) ∑_{i=1}^{|⃗x|} ∥h(x_i)∥_q,

where ∥A∥_q is a Schatten norm (see definitions in Section II D) and |⃗x| is the length of ⃗x, i.e., the sample size.
Intuitively, γ_{1,q} describes the maximum number of hypotheses that can be pairwise distinguished on a dataset of size n with resolution ϵ, given information from the classical register only; in fact, the distinguishability between two hypotheses is measured by the average over classical outcomes of the appropriate distinguishability metric (operator norm for projectors and trace norm for states) for the corresponding values of the hypotheses.
Since the rate of convergence of the empirical risk to the true risk is controlled by the covering number γ_{1,q} [11,13,15], it is possible to minimize the risk by optimizing over an ϵ-net of the concept class, which is finite, rather than over the class itself, which in general is infinite. This is the basis of our technique. While the ERM principle is well-established classically, the non-trivial part in the quantum case is to minimize the empirical risk R̂_ρ(h) based on the string of samples (x_i, ρ(x_i))_{i=1}^n. The main difficulties in this respect are the lack of identical copies of the input states and the fact that the observables associated with naive estimators of the empirical risks do not commute.
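The sample-dependent seminorm, and a greedy internal ϵ-cover built from it, can be sketched directly for a finite class. This is an illustration in NumPy under our own toy choices (scalar "operators" θ·I and a specific ϵ), not the paper's construction for infinite classes:

```python
import numpy as np

def schatten_norm(a, q):
    """Schatten q-norm from singular values (q = 1: trace norm, q = inf: operator norm)."""
    s = np.linalg.svd(a, compute_uv=False)
    if q == np.inf:
        return float(np.max(s))
    return float(np.sum(s ** q) ** (1.0 / q))

def seminorm_1q(h, xs, q):
    """||h||_{1,q,x} = (1/|x|) * sum_i ||h(x_i)||_q on the observed sample xs."""
    return float(np.mean([schatten_norm(h(x), q) for x in xs]))

def greedy_eps_net(functions, xs, q, eps):
    """Greedy internal eps-cover of a finite class under the pseudometric induced by
    ||.||_{1,q,x}; its size witnesses the covering number on this particular sample."""
    net = []
    for f in functions:
        if all(seminorm_1q(lambda x: f(x) - g(x), xs, q) > eps for g in net):
            net.append(f)
    return net

# Three scalar "operators" theta * I; two of them are eps-close, so the net has size 2.
fs = [lambda x, t=t: t * np.eye(1) for t in (0.0, 0.05, 1.0)]
xs = np.linspace(0, 1, 3)
print(len(greedy_eps_net(fs, xs, q=1, eps=0.1)))   # -> 2
```

The default argument `t=t` pins each θ to its own lambda, a standard Python idiom to avoid late binding in the list comprehension.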
We now present our results establishing a quantum variant of ERM, for both types of concept classes considered.

Quantum empirical risk minimization and estimation
Let us first discuss projector-valued concept classes. Here, the naive strategy of measuring each ρ(x_i) with Π_h(x_i) for all h to estimate the empirical loss does not immediately work, since the projectors do not necessarily commute for different hypotheses. Nevertheless, we construct algorithms for both ERE and ERM on projector-valued concept classes, establishing the following theorems.
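The obstruction can already be seen with two hypotheses. A short NumPy check, with projectors of our own choosing:

```python
import numpy as np

# Accepting events of two different hypotheses evaluated at the same x_i:
P0 = np.array([[1.0, 0.0], [0.0, 0.0]])            # |0><0|
Pplus = 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]])   # |+><+|

# Nonzero commutator: there is no single measurement that evaluates both
# losses on the one available copy of rho(x_i) without disturbing it.
comm = P0 @ Pplus - Pplus @ P0
print(np.linalg.norm(comm))   # nonzero (about 0.707)
```

This is why the theorems below rely on gentle-measurement-style techniques from the shadow tomography literature rather than on measuring each projector independently.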

[FIG. 2: Summary of the algorithms for Case 1 (states) and Case 2 (events).]

Theorem 1 (Quantum empirical risk minimization for projector-valued functions, via generalized threshold search). There is an algorithm which outputs c* and μ̂_{c*} such that, with probability at least 1 − δ,

μ_{c*} ≥ max_{c∈[m]} μ_c − ϵ   and   |μ̂_{c*} − μ_{c*}| ≤ ϵ,

where μ_c = (1/n) ∑_{i=1}^n Tr[Π_c(x_i) ρ(x_i)] is the empirical acceptance probability of the c-th hypothesis, if n is large enough. In fact we can take n = Õ(log²(m)/ϵ²). Here, the notation Õ hides logarithmic dependence on log m, log δ, and ϵ. Note that the minimization in the name of Theorem 1 refers to the minimization of the loss function, which is 1 − μ_c, whereas the theorem statement is about maximizing μ_c, which is equivalent.
Theorem 2 (Quantum empirical risk estimation for projector-valued functions). Given access to a product state ϱ = ρ(x_1) ⊗ ··· ⊗ ρ(x_n) and a collection of lists of projectors {Π_c(x_i)}_{i=1,...,n}, c = 1, ..., m, there is an algorithm which outputs estimates μ̂_c, c ∈ [m], such that |μ̂_c − μ_c| ≤ ϵ for all c ∈ [m], with probability at least 1 − δ, if n is large enough. In fact we can take n = Õ(log²(m) log(d)/ϵ⁴). Here, the notation Õ hides logarithmic dependence on log m, log δ, log d, and ϵ.
Note that estimation, in comparison with minimization, shows a worse dependence on ϵ and an explicit dependence on d. However, it represents a key subroutine for realizing the ERM algorithm in the case of state-valued concept classes, which is the subject of the next theorem.
Recall that, in this case, the loss function is different from the projector-valued case: it is the trace distance, rather than the overlap. This prevents us from immediately applying the techniques used for projector-valued classes, which only work to estimate expectation values.
Nevertheless, using a trick from quantum hypothesis selection [4], we can reduce the task of risk minimization for state-valued concept classes to estimating expectation values of Helstrom projectors constructed from the state-valued class. The result is the following theorem.
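The Helstrom projector underlying this reduction is easy to write down: for a pair of candidate states it is the projector onto the nonnegative eigenspace of their difference, and its expectation gap recovers the trace distance. A NumPy sketch (the example states are ours):

```python
import numpy as np

def helstrom_projector(sigma_i, sigma_j):
    """Projector onto the nonnegative eigenspace of sigma_i - sigma_j:
    the optimal accepting event for distinguishing the two states."""
    w, v = np.linalg.eigh(sigma_i - sigma_j)
    pos = v[:, w >= 0]
    return pos @ pos.conj().T

# Tr[Pi (sigma_i - sigma_j)] = d_tr(sigma_i, sigma_j), turning a trace-distance
# comparison into an expectation-value estimation problem.
sigma_i = np.diag([0.75, 0.25])
sigma_j = np.diag([0.25, 0.75])
Pi = helstrom_projector(sigma_i, sigma_j)
print(np.real(np.trace(Pi @ (sigma_i - sigma_j))))   # -> 0.5
```

This identity is what lets the risk-estimation routine for projector-valued classes (Theorem 2) drive hypothesis selection among state-valued hypotheses.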
Theorem 3 (Quantum empirical risk minimization for state-valued functions). Let C = {σ_i}_{i=1}^m be a class of state-valued functions and let ϱ = ρ(x_1) ⊗ ··· ⊗ ρ(x_n) be the product of the sampled states. There exists an algorithm which, given ϱ, outputs i* such that

R̂_ρ(σ_{i*}) ≤ 3 min_{i∈[m]} R̂_ρ(σ_i) + ϵ,

with probability of error less than δ, if n is large enough. In fact we can take n = Õ(log²(m) log(d)/ϵ⁴). Here, the notation Õ hides logarithmic dependence on log m, log δ, log d, and ϵ.

Learning via empirical risk minimization
Combining the theorems just stated with uniform convergence guarantees from statistical learning theory, we show the following sufficient conditions for learning.
Theorem 4 (Learning quantum processes via ERM). Suppose the concept class C consists of classical-quantum processes mapping to projectors or states and let ϵ > 0 be the accuracy parameter. Furthermore, let S = (x_i, ρ(x_i))_{i=1}^n be the training set, with x_i ∼ D and ρ(·) an unknown classical-quantum channel.
Then, the appropriate ERM algorithms of Theorems 1 and 3, run on an ϵ-net of the concept class C (according to the appropriate pseudometric determined by x_1, ..., x_n), provide an agnostic learning algorithm A : X^n × L(H^(d))^⊗n → C. This algorithm outputs a hypothesis A(S) satisfying, for some fixed η, ξ ≥ 1, and n large enough,

R_ρ(A(S)) ≤ η min_{h∈C} R_ρ(h) + ξϵ,

with high probability. In particular, this applies to risks defined via the loss functions L_p (in this case η = 1, q = ∞) and L_s (in this case η = 3, q = 1) for projector-valued and state-valued concept classes C respectively.
We remark that Theorem 4 applies even to agnostic learners, going beyond Ref. [7], which considered only the realizable setting. In fact, our methods also imply that we can learn with an infinite concept class if lim_{n→∞} log²(γ_{1,q}(n, 0, C))/n = 0 (while Ref. [7] shows a sample complexity O(poly(log |C|))). Moreover, in Appendix C we analyze the algorithm of [7] for learning pure states using a statistical learning theory approach, to show that lim_{n→∞} log(γ_{1,1}(n, 0, C))/n = 0 suffices also with that algorithm.
As a consequence of proving Theorem 4 for the case when the concept class consists of projectors, we are also able to speed up shadow tomography of classical-quantum states, vis-à-vis Ref. [3]:

Theorem 5 (Shadow tomography of classical-quantum states). Suppose the concept class C consists of quantum processes mapping to projectors, ϵ > 0 is the accuracy parameter, and the learner is given the same input as in the previous theorem. Suppose also that lim_{n→∞} log²(γ_{1,∞}(n, ϵ, C))/n = 0. Then, the estimation algorithm of Theorem 2, run on an ϵ-net of the concept class C (according to the appropriate pseudometric determined by x_1, ..., x_n), provides an algorithm A that, with high probability, outputs an ϵ-accurate estimate of the risk of every hypothesis in C, for n large enough.
In words, we can not only find the minimum possible risk in the concept class, but also simultaneously estimate the risks of all concepts. In fact, when C is finite, algorithm A performs shadow tomography of classical-quantum states with a copy complexity of Õ(poly(log |C|, log d, 1/ϵ)), where d is the dimension of the quantum register only. By contrast, a naive application of shadow tomography [3] has a copy complexity scaling with the dimension of the full space, which may be infinite if |X| = ∞.

Remark: We mention that the algorithms of the aforementioned theorems perform ERM correctly with high probability on any possible data sequence, and not just on a subset of sequences which occur with high probability. If one is willing to accept that ERM can sometimes fail with non-negligible probability, learning becomes possible when lim_{n→∞} log(γ_{1,∞}(n, ϵ, C))/n = 0, rather than lim_{n→∞} log²(γ_{1,∞}(n, ϵ, C))/n = 0. We argue for this in Appendix D, by presenting a modification of the algorithms that performs ERM correctly only on a certain ϵ-net on a subset of the data, with high probability. We illustrate this fact only for risk minimization for a projector-valued class, but the same can be shown for risk estimation for a projector-valued class, and for risk minimization for a state-valued class.

Examples of learnable classes
In Section VI, we demonstrate the applicability of our results by giving examples of quantum processes covered by our theorems, for which we compute explicit upper bounds on the covering number γ_{1,q} in terms of the dimensionality of the quantum systems and the fat-shattering dimension of an appropriate concept class F. In all of the examples below, let F be a class of real-valued functions g(x). We build our concept classes using finite-dimensional circuits and matrix functions that depend on the data via real functions g(x) coming from concept classes F. In fact, for the function classes with explicit data dependence that we consider, the covering number can slowly grow to infinity, but our theorems show they are still learnable.
(1) Quantum circuits in a particular architecture, possibly data-dependent. Concept classes consisting of m-qubit quantum circuits chosen from a set S_m acting on arbitrary input states or projectors. That is, C_m = {c_U(x) := U ρ(x) U†}_{U∈S_m}, where ρ(x) is a process that prepares the input state to the circuit, or C_m = {c_U(x) := U Π(x) U†}_{U∈S_m}, where Π(x) is a projector. We give explicit upper bounds on the covering number for S_m being (a) the class of one-dimensional local quantum circuits on m qubits of depth ℓ [16], constructed by applying ℓ two-qubit nearest-neighbor gates on any pair of neighboring qubits, (b) brickwork quantum circuits, (c) the set of all unitaries, (d) data-dependent circuits with any of the previous architectures, but modified by inserting, in specific places in the sequence, a number ℓ′ of gates of the form e^{iH_{j′} g_{j′}(x)}, j′ = 1, ..., ℓ′, with the H_{j′} fixed.
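For architecture (d), a single data-dependent gate e^{iH g(x)} and the resulting hypothesis c_U(x) = U(x) ρ(x) U(x)† can be sketched as follows. The Hamiltonian, function g and input state below are illustrative assumptions of ours, not the paper's examples:

```python
import numpy as np

def data_dependent_gate(H, gx):
    """U = exp(i * H * g(x)) for Hermitian H, via its eigendecomposition."""
    w, v = np.linalg.eigh(H)
    return (v * np.exp(1j * w * gx)) @ v.conj().T

def concept(H, g, rho):
    """One hypothesis c_U(x) = U(x) rho(x) U(x)^dagger with U(x) = exp(i * H * g(x))."""
    def c(x):
        U = data_dependent_gate(H, g(x))
        return U @ rho(x) @ U.conj().T
    return c

# Illustration: H = Pauli-Z, g(x) = x, constant input state |+><+|.
Z = np.diag([1.0, -1.0])
plus = 0.5 * np.ones((2, 2))
h = concept(Z, lambda x: x, lambda x: plus)
out = h(0.3)
print(np.real(np.trace(out)))   # -> 1.0 (unitary conjugation preserves the trace)
```

A covering-number bound for such a class then reduces to covering the real-valued class F from which g is drawn, which is where the fat-shattering dimension enters.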
(2) Gibbs states and low-energy projectors of a perturbed Hamiltonian. Concept classes given by a set of Gibbs states obtained by perturbing a Hamiltonian with a field-dependent term, where the specific dependence on the field is not known, namely H_0 + g(x)V. Similarly, concept classes of projectors onto low-energy eigenspaces of H_0 + g(x)V.
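A member of such a Gibbs-state class can be computed directly. A small NumPy sketch, in which H_0, V, the value of g(x) and the inverse temperature β are illustrative assumptions:

```python
import numpy as np

def gibbs_state(H0, V, gx, beta=1.0):
    """Gibbs state exp(-beta*H)/Tr[exp(-beta*H)] of H = H0 + g(x) V."""
    w, v = np.linalg.eigh(H0 + gx * V)
    boltz = np.exp(-beta * (w - w.min()))   # shift the spectrum for numerical stability
    p = boltz / boltz.sum()
    return (v * p) @ v.conj().T

# As beta grows, the Gibbs state concentrates on the perturbed ground state:
# here H = diag(0,1) + 2*diag(1,-1) = diag(2,-1), whose ground state is |1>.
H0 = np.diag([0.0, 1.0])
V = np.diag([1.0, -1.0])
rho = gibbs_state(H0, V, gx=2.0, beta=50.0)
print(np.round(np.real(np.diag(rho)), 3))   # -> [0. 1.]
```

The learning problem of example (2) is then to identify which g ∈ F generated the observed states ρ(x_i) = gibbs_state(H_0, V, g(x_i)).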
(3) Consider a spatially local channel that acts as a power of an unknown unitary channel at position x, according to some classical variable g(x) at that position (for example a thickness), which we probe with some position-dependent state ρ(x). We can model this class as a state-valued (or projector-valued, when the ρ(x) are pure) function class.

C. Related work
There is a vast literature on how to learn quantum states and their properties. The exponential cost of full state tomography motivated the problem of 'pretty-good tomography' [2], in which the goal of the learner is relaxed: she is content with obtaining a predictor function for properties of the state instead of a full description of it. A similar task, consisting of estimating expectation values of a fixed list of observables, is known by the name of shadow tomography [3,4,17]. Even though we aim to learn processes and not quantum states, we have leveraged the tools developed in the shadow tomography literature for our work.
In this work, we delve into the considerably less studied realm of learning quantum processes that have a classical input variable and a quantum output. As mentioned, this model is motivated by the observation that many processes in nature leave a classical imprint of the conditions under which they occur, which the learner can detect with experimental equipment but not control. Our goal is to build a statistical learning theory for such quantum processes that takes into account the unique structure of quantum measurements.
Several previous works have investigated similar directions. Ref. [7] first introduced the notion of Probably Approximately Correct (PAC) learning of quantum channels, which captures our notion of lack of input control, but solved the problem exclusively in the realizable setting. Their algorithms, moreover, have sample complexity scaling polynomially in log(|C|), and thus cannot be applied when the concept class C is infinite. Our paper encompasses their setting (and improves their result for pure states) and solves the agnostic setting of channel learning. We also define an effective measure of the size of the concept class, the covering number, which controls learnability and can be finite even when C is infinite. The work [9] addresses this aspect for the case of binary functions with quantum output (i.e., two states corresponding to the outputs 0 and 1), finding that Helstrom measurements are enough to get an upper bound on the sample complexity depending on the VC dimension of the associated classical function class.
Ref. [8] also considers agnostic learning for polynomial-time quantum processes. However, that paper models each process as an experiment over which the experimenter has full control. Accordingly, the learner runs the process on identical copies of the input in order to implement the hypothesis selection procedure (Section F.3.b). We are instead interested in the setting where the experimenter must contend with non-identical copies of the input. Additionally, while the model in [8] applies to channels with bounded input and output dimensions, where one can always find an ϵ-net directly for the concept class (by its compactness), in our case we are interested in learning correlations between a classical and a quantum random variable, with the classical variable possibly living in an unbounded space. In this case, one cannot always define a finite ϵ-net on the sources, but we show that learning can still be guaranteed in terms of ϵ-nets on the datasets formed by pairs of classical and quantum variables.
Ref. [18] also works in a PAC setting for learning quantum channels acting on classical inputs drawn from a distribution, but their notion of learning is to find a good predictor function (mapping to real values) for the expected value of a fixed set of observables on the output of an unknown quantum process. The focus of their work is whether quantum machine learning algorithms can have a large advantage over classical machine learning for various notions of prediction error. By contrast, our notion of learning is different and is not about predicting observables accurately. Furthermore, our work provides constructive quantum algorithms that achieve the sample complexity bounds we mention, while no such algorithms are presented in that work.
Ref. [19] also considered the problem of learning classical-quantum processes, providing generalization bounds in terms of entropic quantities. However, their setting is closer to a discrimination problem: the task is to find, among a set of possible projectors, one that best approximates joint distributions of classical data x and a set of labels c, when measured on quantum states ρ(x) representing embeddings of x. Our setting is different, as their projectors do not depend on the specific value of x; moreover, their bounds on the generalization error depend on the source and not on the concept class, capturing a different aspect.
Other relevant works on learning quantum processes are [20], which deals with estimating local properties of the output of an unknown circuit, with guarantees on the average error for certain distributions of input states; [21], which shows how to efficiently learn the matrix elements of a channel in the Pauli basis, as a generalization of shadow tomography; and [22-24], which study how to estimate and test properties of an unknown channel with or without quantum memory.
Finally, the possibility of having generalization bounds for various loss functions evaluated on quantum circuits, using Rademacher complexities and covering numbers, has been considered extensively, especially with quantum machine learning applications in mind [25-32]. Relations between different combinatorial dimensions that cater to different modes of learning quantum states have also been explored in [33]. Our bounds on the covering numbers of processes described by quantum circuits are technically analogous to these existing results, which are devised for a different scenario with full control over the inputs. Indeed, such generalization bounds are valid if the sample complexity is defined as the number of unique data-point pairs (x_i, ρ(x_i)) seen by the learner, without restrictions on the number of times the pair (x_i, ρ(x_i)) can actually be produced (assuming that one is able to minimize the average of the loss function in some efficient way for every finite dataset). This analysis is not generically applicable in our case, since we do not have full control of the source, and a good model explaining the sample data cannot in general be inferred from finite samples. Naturally, since the covering numbers we consider are not constructed directly from the loss function but depend on the input data, they generally perform worse than those obtained for the cases with controlled input.

D. Learning algorithms: technical overview
In this section we explain the ideas behind our algorithms and their proofs (see also Figure 2). The key technical tool we establish is a generalization of the Threshold Search algorithm in [4], which takes as input many identical copies of a state σ together with a collection of pairs, each consisting of a projector and a threshold, (Π^{(c)}, θ_c), c = 1, ..., m, and reports whether there is one projector whose expectation value exceeds its threshold.
In our generalization, the algorithm takes as input only a single product state ϱ = ⊗_{i=1}^n ρ_i (of possibly non-identical factors) and a collection of pairs, each comprising a list of projectors and a threshold, ({Π_i^{(c)}}_{i=1}^n, θ_c); as output, the algorithm reports whether there is one projector list whose average on the sample exceeds the threshold. Lemma 1 gives the detailed formulation of this result. In the following we outline its main technical steps. For each list of projectors {Π_i^{(c)}}_i, one can construct the collective projectors that accept at least k times, denoted E_k; measuring these would yield the correct answer with high probability with n = O(1/ϵ²). However, after one of these measurements is applied, ϱ can change significantly, so that naively checking each threshold by measuring its accompanying projective measurement will not work. Therefore we construct a cleverer, "gentler" measurement: each threshold is associated with one such measurement, the list of measurements is performed sequentially on the same state, and each measurement disturbs the state in a controlled way when it rejects. We now explain how such measurements are constructed in more detail.
From the projectors {E_k}_k, one can construct the events B_c, where X is an exponential random variable, p_λ(X = x) := λe^{−λx}, for some 0 < λ < 1. When B_c is measured on ϱ, the probability of accepting is Pr(X + T^{(c)} > θn), where T^{(c)} counts the acceptances of the projectors in the c-th list. The exponential random variable has the role of smoothing the event B_c, so that the probability of accepting is still exponentially suppressed in n when the sample average (1/n) Σ_i E_{ρ_i}[Π_i^{(c)}] is below θ_c − ϵ, but at the same time, when B_c rejects, the state ϱ does not change much. More explicitly, we show in Theorem 9 that this holds for λ = 1/(D√n) and some constant D > 0, and that, denoting by the post-measurement state conditioned on rejecting B_c, the disturbance is controlled by some constant C > 0. Note that the last bound is a consequence of the Gentle Measurement lemma for the fidelity (e.g., Proposition 2.2 in [4]), together with a bound on the χ²-divergence between the distribution of X + T^{(c)} and that of X + T^{(c)} conditioned on X + T^{(c)} ≤ θn.
Based on the previous definitions, the algorithm for threshold search (Algorithm 1) consists in measuring the events B_c, c = 1, ..., m, in sequence until a first acceptance c* occurs. Then c* is declared to be the index of a list of projectors with expectation value above threshold. Lemma 1 gives a sufficient condition for Algorithm 1 to succeed with probability larger than 0.03. The proof is based on a quantum union bound for sequential measurements proved in [4].
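The role of the exponential smoothing can be checked with a purely classical Monte Carlo sketch (hypothetical numbers; T^{(c)} is simulated as a Poisson binomial count and D = 2 is an arbitrary choice): acceptance is near-certain when the per-copy averages are well above the threshold, and suppressed, although not perfectly, when they are below it.

```python
import numpy as np

rng = np.random.default_rng(1)

def accept_prob(p, theta, lam, trials=20000):
    """Monte Carlo estimate of Pr(X + T > theta * n): T is the Poisson
    binomial count of per-copy acceptances with probabilities p, and
    X has density lam * exp(-lam * x), smoothing the event."""
    n = len(p)
    T = (rng.random((trials, n)) < p).sum(axis=1)  # Poisson binomial counts
    X = rng.exponential(1.0 / lam, size=trials)    # numpy takes the scale 1/lam
    return float(np.mean(X + T > theta * n))

n, D = 400, 2.0                     # D is an arbitrary illustrative constant
lam = 1.0 / (D * np.sqrt(n))        # lambda = 1/(D sqrt(n)) as in the text
hi = accept_prob(np.full(n, 0.6), 0.5, lam)   # averages well above threshold
lo = accept_prob(np.full(n, 0.4), 0.5, lam)   # averages below threshold
assert hi > 0.95 and lo < 0.5
```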
In fact, our contribution is to note that the proofs of [4] can be adapted to the case of non-identical states and projectors, using the fact that the concentration properties of the Poisson binomial distribution suffice to reproduce the argument of [4] with appropriate adjustments. We do not comment further on the proof here, and we refer to [4] for additional explanations of the idea behind the algorithm and remarks on its connections with adaptive data analysis.

Quantum empirical risk minimization (summary of Section IV)
Projector-valued concept classes (Theorem 1 and Theorem 2) - The main idea that lets us perform ERM for projector-valued functions is to observe that, in this case, the empirical risk is the average of expectation values of a list of projectors. Hence, we can directly apply the techniques of Sec. I D 1 to check whether empirical risks are above or below a threshold. Furthermore, for our purposes, it is sufficient to consider concept classes of finite cardinality. Thus, in the following we assume that our concept class can be described as m lists of projectors {Π_i^{(c)}}_{i=1}^n, c = 1, ..., m. Our Algorithm 2 for ERM works as follows.
(I) Sampling-without-replacement step: given an input ϱ = ⊗_{i=1}^n ρ_i, form a sufficiently large number of random batches ϱ_s = ⊗_{k=1}^l ρ_{s,k} of size l, obtained by sampling without replacement from the list i = 1, ..., n. In the same way, one selects the corresponding lists of projectors {Π_{s,k}^{(c)}}_{k=1}^l, for all c. Via an argument based on concentration inequalities for sampling without replacement, we show that this procedure has the effect that the empirical risk on each batch ϱ_s is ϵ-close to that on the full training set ϱ. The precise relation between the batch size l, the number of batches, the batch-risk approximation error ϵ, and the probability of error of this procedure is given in Lemma 3.
(II) Binary search for the optimal threshold: the empirical risks take values in [0, 1], therefore we start with a candidate minimum empirical risk of 0.5 and use Algorithm 1 on the first batch to check whether there is a candidate below this threshold. If we find one, we guess a new minimum empirical risk of 0.25; otherwise we guess 0.75 (these are approximate values; see the technical treatment for the details). We iterate this procedure via binary search, using a new batch for each new threshold and selecting the upper or lower half of the candidate interval, until we approximate the value of the minimum empirical risk with precision ϵ; this terminates after O(log(1/ϵ)) steps. Since Algorithm 2 is guaranteed to work only with a constant probability, we may need to repeat it a certain number of times to ensure, with high probability, that we obtain a candidate below threshold if there is one. Moreover, we also need to check that the candidate risk is indeed below threshold, which is done by a further measurement of the corresponding empirical average.
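Step (II) can be mirrored classically with a stand-in oracle; a minimal sketch, assuming the oracle (playing the role of Algorithm 1 on a fresh batch) simply reads off precomputed batch risks:

```python
import numpy as np

def threshold_oracle(risks, theta):
    """Stand-in for one run of Algorithm 1 on a fresh batch: return the
    index of some concept whose batch empirical risk is below theta,
    or None if there is none."""
    below = np.flatnonzero(risks < theta)
    return int(below[0]) if below.size else None

def binary_search_erm(risks, eps):
    """Approximate the minimum empirical risk with O(log 1/eps) oracle
    calls, mirroring step (II); `risks` stands in for the batches."""
    lo, hi, best = 0.0, 1.0, None
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        hit = threshold_oracle(risks, mid)   # a fresh batch each round
        if hit is None:
            lo = mid                          # nothing below: raise the floor
        else:
            hi, best = mid, hit               # found one: lower the ceiling
    return best, hi

risks = np.array([0.62, 0.31, 0.44, 0.90])
best, bound = binary_search_erm(risks, 1e-3)
assert best == 1 and abs(bound - 0.31) < 1e-2
```

The interval [lo, hi] always brackets the minimum risk, so the final hi is an ϵ-accurate estimate of it.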
Crucially, since in all of these steps we cannot reuse batches ϱ s , the empirical risks must be close between different batches, which is guaranteed by the sampling-without-replacement step.By a union bound on the probability of the several types of error, we can obtain our guarantee for ERM, Theorem 1.
A similar approach can be used to obtain Theorem 2, viewing ERE as a generalization of shadow tomography. Shadow tomography [3] consists of the following task: given n copies of an unknown quantum state σ and a list of projectors Π_1, ..., Π_m, output approximations of Tr(σΠ_c) for all c ∈ [m]. The Threshold Search algorithm of [4] was indeed used to obtain the state-of-the-art sample complexity for shadow tomography.
The idea is to keep a list of candidate intervals for the true values of Tr(σΠ_c) for each c, and to use their extremes as thresholds in the Threshold Search algorithm, to be run with the projectors Π_1, ..., Π_m as well as 1 − Π_1, ..., 1 − Π_m. If one of the true values of Tr(σΠ_c) is far from the candidate values, one such c* with this property will be found, with high probability, after a certain number of attempts. If this happens, the list of candidates for the true expectation values is updated. Given a specific way to compute the candidate intervals, the process is guaranteed to terminate with intervals of size O(ϵ), containing the true values, after Õ(log d/ϵ³) rounds [3]. Our Algorithm 3 generalizes the i.i.d. case with two main modifications, as follows.
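The interval-maintenance logic can be sketched classically; in this hypothetical stand-in the threshold-search call is replaced by direct comparison with (here known) true values, and each flagged interval is halved toward the side the measurement indicates:

```python
import numpy as np

rng = np.random.default_rng(2)
m, eps = 8, 0.02
true_vals = rng.random(m)                  # stand-ins for the unknown Tr(sigma Pi_c)
intervals = [[0.0, 1.0] for _ in range(m)]

def flag_bad_estimate():
    """Stand-in for one threshold-search round: report some c whose midpoint
    estimate deviates from the true value by more than a quarter interval."""
    for c, (lo, hi) in enumerate(intervals):
        mid = 0.5 * (lo + hi)
        if hi - lo > eps and abs(true_vals[c] - mid) > 0.25 * (hi - lo):
            return c, (true_vals[c] > mid)
    return None

rounds = 0
while (hit := flag_bad_estimate()) is not None:
    c, is_high = hit
    lo, hi = intervals[c]
    mid = 0.5 * (lo + hi)
    intervals[c] = [mid, hi] if is_high else [lo, mid]  # keep the half with the true value
    rounds += 1

# Invariant: every interval still contains its true value, and no estimate is flagged.
assert all(lo <= true_vals[c] <= hi for c, (lo, hi) in enumerate(intervals))
assert flag_bad_estimate() is None
```

Each update halves one interval, so the loop terminates after at most m·log2(1/ϵ) rounds in this toy version.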
(I) Looking for bad estimates in the general product-state case: with our generalization of Threshold Search for general product states and our sampling-without-replacement step, we can follow the same scheme for ERE, given a list of candidate values for the empirical risks. (II) Update rule for the general product-state case: we generalize Aaronson's update procedure. We stress that both in Aaronson's procedure and in our generalization this step can be carried out by a classical computer, even if the computation involves quantum states.
In the original work, the estimates µ_{t,c} for the projector Π_c at update step t are obtained starting from q copies of a maximally mixed state, that is, from the state ρ*_0 := (1/d)^{⊗q}. At each step, the estimates µ_{t,c} are the expectations of the empirical average of Π_c on ρ*_t, which is also the predicted fraction of acceptances when Π_c is measured on each subsystem H_i in the tensor product ⊗_{i=1}^q H_i. If at step t = 1, ..., T some c* is detected as being associated with a prediction smaller (larger) than the true value, ρ*_t is obtained by post-selecting ρ*_{t−1} on the event in which measuring Π_{c*} on each subsystem accepts a fraction of times slightly larger (smaller) than previously predicted. Clearly, this makes the estimates progressively more accurate.
For the general product-state case, the main change is that we start at step t = 0 from a classical-quantum state in which the |i⟩⟨i| are orthonormal projectors of an auxiliary system. At each step, ρ*_t is used as a guess for the classical-quantum state in the following sense. Note that the quantities 1 − R_{ϱ_s}(c) can be expressed as expectation values of projectors Π^{(c)}; ρ*_t is a guess for σ^{⊗q} in the sense that, at each step t, the estimates of the risks are µ_{c,t} = Tr[Π^{(c)} ρ*_t]. The algorithm then runs exactly as Aaronson's. The key point needed to conclude is that ρ*_0 can be written as a mixture of σ^{⊗q}, with weight 1/d^q, and some positive semi-definite, trace-one ω. Therefore, with probability at least 1/d^q, ρ*_0 behaves as σ^{⊗q}, which would accept all the post-selection updates with high probability. On the other hand, if some estimate is incorrect at time t, one can show that ρ*_t should reject with high probability. These two facts give rise to a contradiction unless the number of necessary rounds is Õ(log d/ϵ³). The proof of Theorem 2 is then obtained through a union bound over all possible sources of error of Algorithm 3. We also note that, for projector-valued functions, the estimation of the empirical risks is similar to the diverse-state setting for shadow tomography considered in [17], where the state is a general product state but the projectors do not change with the subsystems.
State-valued concept classes (Theorem 3) - Here the loss function used is different from that of the projector-valued case: it is the trace distance, not the overlap. This prevents us from immediately recycling the technique used previously, which only works to estimate expectation values.
Also in this case, it suffices for the moment to consider finite-cardinality classes, described by lists of states {σ_c(s)}_s. The key idea is again inspired by [4], which used it for a task named quantum hypothesis selection, and rests on the following observations:
• The trace distance between pairs of states σ_i, σ_j can be written as the difference of two expectation values via Helstrom's theorem. Indeed, defining A_ij(s) := (σ_i(s) − σ_j(s))_+, where (•)_+ denotes the projector onto the positive part of the argument, we have d_tr(σ_i, σ_j) = Tr[A_ij σ_i] − Tr[A_ij σ_j].
• If a state ρ is close to σ_k, so are the expectation values. A good guess for ρ is then the state σ_{k*} whose Helstrom expectation values deviate the least. Indeed, suppose that i* is the true minimizer, with d_tr(σ_{i*}, ρ) ≤ η. Then, via triangle inequalities, d_tr(ρ, σ_{k*}) ≤ 3η.
• Finally, if the Tr[ρA_ij] are known only with precision ϵ, the above procedure is robust, and allows one to find k* such that d_tr(ρ, σ_{k*}) ≤ 3η + 2ϵ from approximate estimates of the expectation values of the Helstrom projectors.
Our crucial observation is that, for the classical-quantum states σ_c, the Helstrom projectors are themselves classical-quantum, of the form Σ_s |s⟩⟨s| ⊗ A_ij(s). In turn, their expectation values can be estimated via our ERE algorithm for the lists of projectors {A_ij(s)}_{s∈[n]}, i < j, given ϱ as input. This reduces our procedure for learning with state-valued functions to a post-processing of Algorithm 3 (ERE for projector-valued functions), and proves Theorem 3.
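A minimal numerical sketch of the selection rule (with hypothetical random states, and a minimax scoring that is an illustrative variant of the rule described above):

```python
import numpy as np

rng = np.random.default_rng(3)

def helstrom_projector(a, b):
    """Projector onto the positive eigenspace of a - b (the A_ij above)."""
    w, v = np.linalg.eigh(a - b)
    pos = v[:, w > 0]
    return pos @ pos.conj().T

def random_state(dim):
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    mat = g @ g.conj().T
    return mat / np.trace(mat)

d, m = 4, 5
sigmas = [random_state(d) for _ in range(m)]
rho = sigmas[2]                 # unknown state; here it equals sigma_2 exactly

def score(k):
    """Worst-case deviation of the Helstrom expectation values of sigma_k."""
    return max(
        abs(np.trace(helstrom_projector(sigmas[k], sigmas[j])
                     @ (sigmas[k] - rho)).real)
        for j in range(m) if j != k
    )

k_star = min(range(m), key=score)
assert k_star == 2
```

In this toy case ρ coincides with σ_2, so the score of candidate 2 vanishes and it is correctly selected.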

Statistical learning for classical-quantum processes (summary of Section V)
Building on the ERM algorithms just reviewed, we are able to give sufficient conditions for the minimization of the true risk in both the projector- and the state-valued case. The algorithm works as follows.
• From the knowledge of the classical variables x_i for all i, we can construct an ϵ-net of the function class using the appropriate pseudometric: for two projector-valued functions Π^{(c)}(x) and Π^{(c′)}(x), their pseudodistance on the data is (1/n) Σ_i ||Π^{(c)}(x_i) − Π^{(c′)}(x_i)||_∞, while for two state-valued functions σ_c(x) and σ_{c′}(x), their pseudodistance on the data is (1/n) Σ_i ||σ_c(x_i) − σ_{c′}(x_i)||_1. The cardinality of the ϵ-net is bounded by the γ_{1,∞} covering number in the case of projectors and by the γ_{1,1} covering number in the case of states (see Definition 1).
• We then run ERM on ϱ = ⊗_{i=1}^n ρ(x_i), using as concept class the ϵ-net found in the previous step.
Let us observe that our ERM algorithms are guaranteed to work if the covering numbers γ_{1,q} grow slowly with n, as they take the place of m in Theorems 1, 2 and 3. On the other hand, via classical statistical learning theory, uniform convergence of the estimated empirical risk to the true risk is controlled by the covering numbers of the loss-function class, which depend on the unknown states ρ(x).
Nevertheless, we are able to show that these covering numbers can be bounded by γ_{1,q}, which does not depend on ρ(x) and is in principle computable by the learner. Therefore, we get a sufficient condition for learning, independent of the data, in terms of the growth of γ_{1,q} with n, proving Theorem 4. Similarly, Theorem 5 can be obtained by running the ERE algorithm of Theorem 2 and checking that uniform convergence is also satisfied when γ_{1,∞} grows slowly with n.

E. Outlook
Our goal is to learn Nature. In the past few years, quantum information processing tools have been fundamental to building a theory of learning states produced by quantum circuits. However, the circuit picture may not be the most natural one to describe all quantum mechanical processes. Our work represents a first incursion into the territory of developing a statistical learning theory for physical processes beyond quantum circuits. Our intention is to build a general theory of what we can quantum-learn, going beyond the terra firma of quantum circuits and what they can model, reaching characterization protocols in quantum mechanics that may be best expressed outside the circuit model (for instance, metrology, sensing, calibration and verification), and eventually building a unified foundation for designing physical experiments.

A. Notation
We will consider random variables X valued in the set X, denoting by D(x) = Pr_{x∼D}(X = x) the probability that the random variable takes value x ∈ X; accordingly, Pr_{x∼D}(E) is the probability of an event E ⊆ X. We denote the set {1, ..., n} by [n]. We denote the n-fold Cartesian product of X by X^n, elements in it as vectors ⃗x ∈ X^n, and we use the notation |⃗x| := n to refer to the length of the vector. D^n denotes the probability distribution of ⃗x. We denote a Hilbert space of dimension d by H^{(d)}. We further denote by L(H^{(d)}) the set of linear operators on H^{(d)}, and by D(H^{(d)}) ⊆ L(H^{(d)}) the set of density matrices, that is, the subset of L(H^{(d)}) consisting of positive semi-definite operators with unit trace. These matrices describe quantum states. For brevity, the Hilbert space of n qubits is denoted H_n.
An arbitrary valid quantum operation on quantum states can be expressed as a quantum channel Φ, i.e., a completely positive trace-preserving map, and we denote the output of a channel applied to ρ as Φ[ρ]. A special case of quantum channels are unitary channels, defined from a unitary matrix U (meaning UU† = U†U = 1). An application of a unitary U to the state ρ results in the quantum state UρU†. Any quantum channel in finite dimension can be expressed in Kraus representation as Φ[ρ] = Σ_i K_i ρ K_i† for a suitable finite set of operators {K_i}.
In order to extract classical information out of a quantum state, one can perform a POVM (positive operator-valued measure), which is specified by a set of m positive semi-definite matrices {E_1, ..., E_m} satisfying Σ_i E_i = 1. A measurement on a state ρ using such a POVM returns a classical outcome i ∈ [m] with probability Tr[E_i ρ]. For any operator 0 ≤ E ≤ 1, we can define a two-outcome POVM {E, 1 − E}, which we say implements the measurement of the event E, and the probability associated to E is denoted, as in the classical case, by Pr(E) = E_ρ[E]. We will similarly use the notation E_ρ[A] := Tr[ρA] to express expectation values of operators A. We will denote the standard deviation of a classical random variable T by stddev(T) := (E[T²] − E[T]²)^{1/2}, where the expectation value is taken with respect to the probability distribution of T.
We will focus on classical-quantum states, i.e., states that can be written as ρ = Σ_{x∈X} D(x)|x⟩⟨x| ⊗ ρ(x), where the ρ(x) are states of H^{(d)}, D is a probability distribution on X, and {|x⟩}_{x∈X} is an orthonormal basis of a Hilbert space that we denote H_X. Note that for any operator A ∈ L(H_X ⊗ H^{(d)}) and a classical-quantum state ρ, E_ρ[A] = Σ_{x∈X} D(x) Tr[⟨x|A|x⟩ ρ(x)]. For a set of n operators on a Hilbert space H, A = {A_1, ..., A_n}, we denote by A the operator ⊗_{i=1}^n A_i on H^{⊗n}.
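A minimal numerical sketch of this construction (the source D and the states ρ(x) are toy choices):

```python
import numpy as np

xs = [0, 1, 2]
D = np.array([0.5, 0.3, 0.2])            # classical distribution D(x)

def rho_of_x(x):
    """Pure qubit state attached to the classical value x (toy choice)."""
    theta = 0.4 * x
    psi = np.array([np.cos(theta), np.sin(theta)])
    return np.outer(psi, psi)

dX, d = len(xs), 2
rho = np.zeros((dX * d, dX * d))
for i, x in enumerate(xs):
    ket = np.zeros((dX, 1))
    ket[i] = 1.0
    rho += D[i] * np.kron(ket @ ket.T, rho_of_x(x))   # D(x)|x><x| (x) rho(x)

assert np.isclose(np.trace(rho), 1.0)
# Tracing out the classical register leaves the average quantum state.
avg = sum(D[i] * rho_of_x(x) for i, x in enumerate(xs))
assert np.allclose(np.einsum('iaib->ab', rho.reshape(dX, d, dX, d)), avg)
```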

Distances between probability distributions, quantum states and quantum channels
We consider the following distances, defined for probability distributions on discrete supports.Let P, Q be two probability measures on the same support X .
• Total variation distance: d_TV(P, Q) := (1/2) Σ_{x∈X} |P(x) − Q(x)|.
• Chi-squared distance: χ²(P, Q) := Σ_{x∈X} (P(x) − Q(x))²/Q(x).
• Bhattacharyya coefficient: BC(P, Q) := Σ_{x∈X} (P(x)Q(x))^{1/2}.
We will also need notions of continuity for matrices. First of all, we introduce some matrix norms. Let M ∈ C^{d×d}; then:
• The Schatten p-norm is ||M||_p := (Σ_i σ_i(M)^p)^{1/p}, where the σ_i(M) are the singular values of M; in particular, the trace norm ||M||_1 is the sum of the singular values.
• The spectral norm is the maximum singular value of M, which coincides with ||M||_∞. This is also the operator norm induced by the 2-norm for vectors: ||M||_∞ = max_{||v||_2=1} ||Mv||_2.
We will also use the following fact: Fact 1. For any two vectors u, v ∈ C^n (not necessarily normalized), which can be verified by applying the definitions.
We will need in particular some measures of distance between quantum states and channels, defined as follows. For ρ, σ two quantum states in D(H^{(d)}):
• Trace distance: d_tr(ρ, σ) := (1/2)||ρ − σ||_1. (46)
Note that, in analogy with Eq. (38), this is half of the trace norm of the difference between ρ and σ.
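A few of these distances in code (a minimal numpy sketch; the diagonal-state check uses the fact that for commuting states the trace distance reduces to the total variation distance of the spectra):

```python
import numpy as np

def tv(p, q):
    """Total variation distance (half the l1 distance)."""
    return 0.5 * np.abs(p - q).sum()

def chi2(p, q):
    """Chi-squared distance of p from q (q > 0 assumed)."""
    return ((p - q) ** 2 / q).sum()

def bhattacharyya(p, q):
    """Bhattacharyya coefficient; equals 1 iff p == q."""
    return np.sqrt(p * q).sum()

def trace_distance(rho, sigma):
    """d_tr = half the trace norm of rho - sigma (Hermitian inputs)."""
    return 0.5 * np.abs(np.linalg.eigvalsh(rho - sigma)).sum()

p = np.array([0.5, 0.5])
q = np.array([0.8, 0.2])
assert np.isclose(tv(p, q), 0.3)
assert np.isclose(bhattacharyya(p, p), 1.0)
rho, sigma = np.diag(p), np.diag(q)
# For commuting (diagonal) states, trace distance equals the TV distance.
assert np.isclose(trace_distance(rho, sigma), tv(p, q))
```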

B. Sequential measurements
The algorithms we will develop make extensive use of sequential measurements, which require specifying the state of the quantum system conditioned on the outcome of a measurement. We stick to the usual convention, which states that if the outcome of a POVM {E_i}_{i=1}^r is i, the state after the measurement is ρ|_{√E_i} := √E_i ρ √E_i / Tr[E_i ρ] [34]. The following facts will be useful.
• [4, 35] Let P be the probability distribution on [r] determined by the measurement M = {Π_1, ..., Π_r} on ρ, and let Q instead be the distribution determined by M on ρ|_{√A}, where A = Σ_{i=1}^r a_i Π_i for some a_i > 0.
• [36, 37] For each i = 1, ..., m, let Π_0^{(i)} and Π_1^{(i)} be projectors. The probability p of always getting outcome 1 for the sequential measurements {Π_0^{(i)}, Π_1^{(i)}}_i obeys a quantum union bound.
• Lemma 2.5 in [4]: Let E_1, ..., E_m be events and consider a sequence of two-outcome POVMs implementing them on the state ρ, so that the state post-selected on the events occurring is ρ_i for i > 1.
• Lemma 2.6 in [4]: With the same notation as the previous point, let p_0 = 1 and ρ_0 = ρ. Moreover, a subset of the events can be fixed.
• Fix any POVM {E_1, ..., E_n}, a classical random variable X with values in R, and θ ∈ R. Furthermore, consider the classical random variable T, defined on a fixed quantum state ρ. The associated quantum event B is such that, when measured on a state ρ, it is accepted with probability E_ρ[B] = Tr[ρB] = Pr(X + T > θ). Furthermore, by Eq. (49), B has the property that it disturbs the state in a controlled way when it rejects.
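The post-measurement convention and the chain rule for sequential acceptance probabilities can be checked numerically; a minimal sketch with commuting (diagonal) toy events:

```python
import numpy as np

def post_state(rho, E):
    """Post-measurement state sqrt(E) rho sqrt(E) / Tr[E rho] for the
    two-outcome POVM {E, 1-E}, following the convention above."""
    w, v = np.linalg.eigh(E)
    sqrtE = v @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ v.conj().T
    p = np.trace(E @ rho).real
    return sqrtE @ rho @ sqrtE / p, p

d = 3
rho = np.eye(d) / d                    # toy input state
E1 = np.diag([0.9, 0.7, 0.1])          # hypothetical commuting events
E2 = np.diag([0.8, 0.2, 0.5])

rho1, p1 = post_state(rho, E1)         # accept E1 first...
p2 = np.trace(E2 @ rho1).real          # ...then measure E2 on the new state
p_joint = p1 * p2                      # chain rule for sequential acceptance
# For commuting events this reproduces Tr[E2 E1 rho].
assert np.isclose(p_joint, np.trace(E2 @ E1 @ rho).real)
assert np.isclose(np.trace(rho1).real, 1.0)
```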

C. Naive expectation estimation
A fundamental result on the concentration of sums of random variables is the following multiplicative Chernoff bound: for X_1, ..., X_n independent random variables with values in [0, 1], µ := E[Σ_i X_i], and 0 ≤ ϵ ≤ 1, Pr(|Σ_i X_i − µ| ≥ ϵµ) ≤ 2e^{−ϵ²µ/3}. A consequence of this bound is the following proposition, related to what has been called naive expectation estimation by [4], but with i.i.d. observables; we retain the name. Proposition 1 (Naive expectation estimation). The random variable X obtained by measuring the observable on the sample concentrates around its expectation. This implies that, for given θ and ϵ, and n ≥ (27/ϵ²) log(2/δ), this measurement allows one to distinguish the two cases with probability at least 1 − δ.
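A classical sanity check of the quoted sample-size expression (Monte Carlo, with hypothetical parameters θ = 0.5, ϵ = 0.1, δ = 0.05):

```python
import numpy as np

rng = np.random.default_rng(5)

def distinguish(p_true, theta, eps, delta):
    """Empirically decide whether the acceptance probability is >= theta + eps
    or <= theta - eps, using n = ceil((27/eps^2) log(2/delta)) independent
    two-outcome measurements (the sample-size expression quoted in the text)."""
    n = int(np.ceil(27 / eps ** 2 * np.log(2 / delta)))
    mean = (rng.random(n) < p_true).mean()
    return bool(mean >= theta)

eps, delta, theta = 0.1, 0.05, 0.5
assert distinguish(theta + eps, theta, eps, delta)       # reports "above"
assert not distinguish(theta - eps, theta, eps, delta)   # reports "below"
```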

D. Growth functions and covering numbers: measuring the effective size of a concept class
In classical statistical learning theory, conditions that guarantee the success of ERM in learning an unknown function in the concept class C are given in terms of effective measures of the size of C. We will be interested in two such measures: the growth function and the covering number.
For a finite sequence x_1, ..., x_n and a set of functions C = {f_α : X → Y}, we can identify a set of equivalence classes by grouping together the functions that give the same outcomes on the inputs x_1, ..., x_n. Naturally, if Y is finite, the cardinality of this set is finite. However, even if Y is infinite, C could still be such that there are only finitely many equivalence classes. For a fixed function class, the maximum number of equivalence classes induced by a sample of length n is called the growth function G(n). A formal definition is as follows [13, 14]: Definition 2 (Growth function). Let C ⊆ Y^X be a class of functions with target space Y. For every subset Ξ ⊆ X, define the restriction of C to Ξ as C|_Ξ := {f|_Ξ : f ∈ C}. (59) The growth function G assigned to C is then defined for all n ∈ N as G(n) := max_{Ξ⊆X : |Ξ|=n} |C|_Ξ|. By a slight abuse of notation, we will also write C|_⃗x for a concept class restricted to the domain given by the set of values appearing in ⃗x. We will also want to define ϵ-nets over classes of functions and will be interested in their size.
Definition 3 (Covering number N_in). Let (M, d) be a pseudometric space [38], let A, B ⊆ M, and fix ϵ > 0. The set A is an internal ϵ-net of B if A ⊆ B and, for every b ∈ B, there exists a ∈ A with d(a, b) ≤ ϵ. The ϵ-covering number of B, denoted N_in(B, ϵ, d), is the smallest cardinality of any internal ϵ-net of B.
The first pseudometric that will be of interest to us is the one built on the ∥•∥ p,⃗ x seminorm.
Definition 4 (∥•∥_{p,⃗x} seminorms). For any set X, ⃗x ∈ X^n and any function class G ⊆ [0, c]^X, define the ∥•∥_{p,⃗x}-seminorm on the linear span of G for p ∈ [1, ∞) as ∥g∥_{p,⃗x} := ((1/n) Σ_{i=1}^n |g(x_i)|^p)^{1/p}, and note that ∥g∥_{∞,⃗x} := max_{i∈[n]} |g(x_i)|.
Definition 5 (Loss-function covering numbers). Let G ⊆ [0, c]^X be a class of real-valued functions. For a positive integer p, the p-th loss-function covering number is Γ_p(n, ϵ, G) := max_{⃗x∈X^n} N_in(G, ϵ, ∥•∥_{p,⃗x}).
We make three remarks:
1. For a finite target space Y = {1, 2, ..., |Y|}, the covering numbers can be bounded in terms of the growth function.
2. The name of the above covering number comes from the fact that we will often be interested in choosing G to be the induced loss function class, defined below:
Definition 6 (Induced loss function class). For any function class F ⊆ Y^X and loss function L, the induced loss function class is G_{F,L} := {(x, y) ↦ L(f(x), y) : f ∈ F}.
3. On a related note, observe that, when the loss function is L(y, x) = |x − y|, for any h_1, h_2 ∈ F, their pseudodistance ∥h_1 − h_2∥_{1,⃗x} upper-bounds the difference between their empirical risks, R(h_1) − R(h_2). This fact will be crucial in the next few paragraphs.
We will also need to define covering numbers for classes mapping to operators instead of real intervals (in the following, H denotes some fixed Hilbert space). For this case we will use a different seminorm, defined for operator-valued functions: Definition 7 (∥•∥_{p,q,⃗x} seminorms). For any set X, ⃗x ∈ X^n and any function class C ⊆ {h : X → M}, where M ⊆ L(H), we define the ∥•∥_{p,q,⃗x}-seminorm on the linear span of C as ∥g∥_{p,q,⃗x} := ((1/n) Σ_{i=1}^n ∥g(x_i)∥_q^p)^{1/p}, where ∥•∥_q is a Schatten q-norm, and note that ∥g∥_{∞,q,⃗x} := max_{i∈[n]} ∥g(x_i)∥_q. The corresponding pseudometric is then given by d : (g_1, g_2) ↦ ∥g_1 − g_2∥_{p,q,⃗x}.
Accordingly, we will define the following covering number: Definition 8 (Operator-class covering numbers). Let C ⊆ L(H)^X be a class of operator-valued functions. For positive integers p, q, the (p, q)-th operator-class covering number is γ_{p,q}(n, ϵ, C) := max_{⃗x∈X^n} N_in(C, ϵ, ∥•∥_{p,q,⃗x}).
For the concept classes of interest, C ⊆ {h : X → M} with M ⊆ L(H), we elect to use the following natural operator-class covering numbers, which we can also relate to the corresponding loss-function covering numbers:
• (Case 1: Quantum states) In the case of M being quantum states, we can take p = 1, q = 1. The pseudodistance between two concepts becomes twice the average trace distance between the output states, ∥g_1 − g_2∥_{1,1,⃗x} = (2/n) Σ_{i=1}^n d_tr(g_1(x_i), g_2(x_i)). The empirical risks (defined via L_s) evaluated on the input source and the concepts g_1 and g_2 satisfy |R(g_1) − R(g_2)| ≤ ∥g_1 − g_2∥_{1,1,⃗x}, where the inequality follows from the triangle inequality. This means that if N is an ϵ-net for C with respect to ∥•∥_{1,1,⃗x}, then one can construct an ϵ-net for G_{C,L_s}, simply by associating with every point c ∈ N the corresponding induced loss function.
• (Case 2: Projectors) In the case of M being projectors, we can take p = 1, q = ∞. The pseudodistance between two concepts becomes the average operator-norm distance between the projectors, ∥g_1 − g_2∥_{1,∞,⃗x} = (1/n) Σ_{i=1}^n ∥g_1(x_i) − g_2(x_i)∥_∞. The empirical risks (defined via L_p) evaluated on the input source and the concepts g_1 and g_2 satisfy an analogous inequality. Similarly to the above, we can conclude that Γ_1(n, ϵ, G_{C,L_p}) ≤ γ_{1,∞}(n, ϵ, C).
Note that both properties above are a generalization of the decreasing property of covering numbers for real-valued function classes under composition with Lipschitz functions [13]. We need operator-class covering numbers since we can construct ϵ-nets with respect to the ∥•∥_{p,q,⃗x} seminorm, but not with respect to the loss function directly, since the value of the loss cannot be obtained immediately from the data.
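To make the construction concrete, here is a minimal sketch that builds a greedy internal ϵ-net for a toy state-valued concept class under the ∥•∥_{1,1,⃗x} pseudometric (the concept class and data are hypothetical; greedy selection yields an internal net, not necessarily the smallest one):

```python
import numpy as np

def trace_norm(A):
    """Schatten 1-norm of a Hermitian matrix."""
    return np.abs(np.linalg.eigvalsh(A)).sum()

def pseudodist(g1, g2, xs):
    """The ||.||_{1,1,x} pseudometric: data-averaged trace-norm distance."""
    return float(np.mean([trace_norm(g1(x) - g2(x)) for x in xs]))

def greedy_internal_net(concepts, xs, eps):
    """Greedy internal eps-net: keep a concept only if no kept one is
    already eps-close to it on the data."""
    net = []
    for c in concepts:
        if all(pseudodist(c, a, xs) > eps for a in net):
            net.append(c)
    return net

def concept(t):
    """Toy pure-state valued concept x -> |psi_t(x)><psi_t(x)|."""
    def c(x):
        psi = np.array([np.cos(t * x), np.sin(t * x)])
        return np.outer(psi, psi)
    return c

xs = np.linspace(0.0, 1.0, 10)                      # the observed x_i
concepts = [concept(t) for t in np.linspace(0.0, 1.0, 50)]
net = greedy_internal_net(concepts, xs, eps=0.1)
assert 0 < len(net) < len(concepts)
# Internal-net property: every concept is eps-close to some net element.
assert all(min(pseudodist(c, a, xs) for a in net) <= 0.1 for c in concepts)
```

Running ERM on `net` in place of `concepts` is exactly the reduction used in the learning algorithm above.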

E. Statistical learning theorems
The principle of empirical risk minimization dictates that, once a loss function L with values in the interval [0, c] has been fixed, in order to find a function in the concept class that minimizes the true risk relative to the unknown concept f, R(h) := E_{x∼D}[L(h(x), f(x))] (true risk), (76) it suffices, with high probability, to find one that minimizes the empirical risk over the given sample: R̂(h) := (1/n) Σ_{i=1}^n L(h(x_i), f(x_i)) (empirical risk). (77) In the classical case, the utility of this principle comes from the fact that the true risk cannot be estimated directly from the sample, but the empirical risk can. Crucially, the quality of the approximation and the rate of the above-mentioned convergence depend on the concept class under study.
In particular, discrete-output concept classes are very well understood. In this special case, the rate of convergence of the empirical risk to the true risk is quantified by growth functions. Namely, a well-known fact (see for instance [11, 13, 14]) states that, given a sample S = (x_i, f(x_i))_{i=1}^n, with probability at least 1 − δ with respect to repeated sampling of training data of size n we have sup_{h∈F} |R(h) − R̂_S(h)| ≤ O(√((log Γ_F(2n) + log(1/δ))/n)). (78) However, in our setting, we would like to be able to characterize channels that map to a continuous set of states. Since the restrictions of such concept classes to a finite sample are potentially infinite, the above statement is not useful. Nevertheless, a crucial observation is that, even for infinite concept classes F, we can obtain a PAC bound for learning F via the covering number of a class G related to F. G is induced by composing the chosen loss function with the functions in F, as stated earlier in Definition 6.
The learning theorem is then: Theorem 6 (PAC bound via uniform covering numbers; reported as in [14], also equivalent to Theorem 17.1 in [13]). For any concept class F and loss function L with values in [0, c], define G_{F,L} as in Definition 6.
For any ε > 0 and any probability measure D on X × Y it holds that Pr_{S∼D^n}[∃g ∈ G_{F,L} : |R(g) − R̂_S(g)| ≥ ε] ≤ 4 Γ_1(2n, ε/16, G_{F,L}) exp(−nε²/(32c²)), where S = ((x_i, y_i))_{i=1}^n is the training sample and (x_i, y_i) ∼ D. That is to say, the empirical risk R̂ converges to the true risk R at a rate governed by the speed of growth of Γ_1.
As noted above, for the loss functions we consider, the covering number Γ_1(n, ε, G_{F,L}) can be bounded by the corresponding operator-class covering number γ_{1,q}(n, ε, C), which effectively controls our learning algorithms.

III. THRESHOLD SEARCH FOR NON-IDENTICAL STATES
Recall the discussion in the introduction about the difficulties encountered in naively generalizing the strategy of classical ERM to our quantum setting: they stem from not having an arbitrary number of identical copies of a quantum state ρ(x_i) and from not being able to naively estimate all the empirical risks at the same time. In this section, we nevertheless introduce a tool for performing our "quantum" version of ERM: an adaptation of quantum threshold search [3, 4]. The gist of the algorithm is to use identical copies of some state of interest ρ to pick, from a given set of (observable, threshold) pairs, one pair whose observable exceeds its threshold on ρ, or to report that no such pair exists. We define it more formally: Problem 1 (Quantum Threshold Search [3, 4]). Given as input copies of an unknown quantum state ρ and a list of pairs (Π_1, θ_1), ..., (Π_m, θ_m), where the Π_c ∈ L(H^{(d)}) are projectors and θ_c ∈ [0, 1].

Output either
• "Tr[ρΠ_c] > θ_c − ε" for some particular c, or
• "Tr[ρΠ_c] ≤ θ_c for all c",
with probability of a correct statement at least 1 − δ.
This section is devoted to presenting an algorithm for threshold search that relaxes the need for access to identical copies of ρ, solving the same task on general product states. Intuitively, this is necessary for our setting because the learner, lacking control over the input to the process, receives as examples the sequence (x_i, ρ(x_i))_{i=1}^n, where the ρ(x_i) are not identical to each other. Thus, in contrast to the measurement in [4], which works on ρ^{⊗n}, our key tool will be a measurement on the product state ρ(x_1) ⊗ ... ⊗ ρ(x_n). The measurement reports if a threshold has been exceeded. We will eventually use this measurement to perform ERM for our setting. We define the properties of this measurement in Lemma 1. Our general proof strategy closely follows that of [4], adapting where needed.
Lemma 1 (Quantum threshold search on non-identical states). Given as input:
• a state ϱ = ρ(x_1) ⊗ ... ⊗ ρ(x_n), which is a product of generally non-identical qudit states;
• a collection of lists of known projectors {Π_1^{(c)}, ..., Π_n^{(c)}}_{c=1,...,m}, where each Π_i^{(c)} ∈ L(H^{(d)}), together with thresholds θ_1, ..., θ_m.
If the projectors and thresholds obey the promise that for at least one c it holds that (1/n) Σ_{i=1}^n Tr[Π_i^{(c)} ρ(x_i)] ≥ θ_c, there is an algorithm such that: if n ≥ C_1 (log m + C_2)²/ε² for appropriate constants C_1, C_2 > 0, it outputs a c such that (1/n) Σ_{i=1}^n Tr[Π_i^{(c)} ρ(x_i)] > θ_c − ε with probability at least 0.03.
In order to obtain this improvement, we need to generalize the measurement constructed in [4], which in a certain sense is gentler than the projectors of Proposition 1 that check whether the expectation is above or below threshold. Roughly speaking, gentle means that the state does not change much after the measurement, with high probability. Implementations of product measurements that are gentle on product states had already been obtained in [3], and were then reconsidered in [4] with a simpler analysis on identical states, which gave an improvement in the sample complexity of shadow tomography [4]. In these implementations, a parameter λ can be increased in order to obtain a gentler version of a projective measurement, sacrificing the amount of information that the measurement can reveal. We go one step further and generalize the stronger statements of [4] so that they apply to our setting, which involves products of non-identical states.
To do this we need to study the behavior, under perturbation, of a random variable which is a sum of non-identical Bernoulli random variables: T := Σ_{i=1}^n T_i, where the T_i ∼ Bernoulli(p_i) are independent. The distribution of T is also known as the Poisson binomial distribution, and we denote this as T ∼ PB(p_1, ..., p_n). We also need to consider exponential random variables X, which have density p_λ(x) := λe^{−λx} for x ≥ 0 and some λ > 0, and satisfy E[X] = 1/λ.
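The two ingredients can be simulated directly. The sketch below (our own illustration; the probabilities, threshold and sample sizes are arbitrary choices) samples a Poisson binomial T with heterogeneous p_i, adds exponential noise X with mean at least stddev[T], and shows empirically that conditioning on the rare threshold event barely shifts the distribution of T:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_poisson_binomial(p, size, rng):
    """Sample T = sum of independent Bernoulli(p_i) variables (Poisson binomial)."""
    return (rng.random((size, len(p))) < p).sum(axis=1)

# Heterogeneous acceptance probabilities, as when measuring different
# projectors Pi_i on non-identical states rho(x_i).
n = 200
p = rng.uniform(0.2, 0.8, size=n)

T = sample_poisson_binomial(p, 100_000, rng)
scale = max(np.sqrt((p * (1 - p)).sum()), 1.0)   # E[X] = 1/lambda >= stddev[T]
X = rng.exponential(scale=scale, size=T.shape)

theta = p.mean() + 0.15          # threshold above the mean, so B is rare
B = T + X > theta * n            # the event measured by the gentle test
print("Pr[B] ~", B.mean())
# Gentleness: conditioning on NOT B barely shifts the distribution of T.
print("mean shift:", abs(T[~B].mean() - T.mean()))
```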
We first present an entirely classical theorem. This theorem says that a measurement checking for the event that T + X exceeds a certain threshold is gentle when it rejects, in the sense that rejection only gently perturbs the distribution of T. This is a generalization of Theorem 1.2 of [4], which shows that adding exponential noise to a binomial random variable allows for gentle measurements.
Theorem 7 (Gentle classical measurement on the Poisson binomial distribution). Let T ∼ PB(p_1, ..., p_n), and write q_i = 1 − p_i. Assume that X is an independent exponential random variable with mean 1/λ at least stddev[T] = √(Σ_{i=1}^n p_i q_i) (and also at least 1). Let B be the event that T + X > θn, and assume that Pr[B] < 1/4. Then d_{χ²}((T | B̄), T) ≤ C · Pr[B]², where B̄ denotes the complement of B, for a sufficiently large constant C > 0.
Proof. See Appendix A. The proof of Theorem 7 requires modifying some key details of the original proof. The main observation is that the crucial properties of the binomial distribution used in the proof, i.e., the form of the generating function and the probability of exceeding the expectation, also hold, with some caveats, for the Poisson binomial distribution.
Now let us turn to the quantum problem. The reason we must consider Poisson binomial random variables is that they describe the probability distribution of outcomes when measuring a sum of (possibly non-identical) local projectors on a product quantum state. The link to the learning setting is that the local projectors are exactly the ones in the list {Π_1^{(c)}, ..., Π_n^{(c)}} describing the c-th hypothesis under consideration, in a way we will make formal in the next section.
In the agnostic learning setting, the learner receives the string of classical-quantum examples (x_1, ρ(x_1)) ⊗ ... ⊗ (x_n, ρ(x_n)). The quantum parts of these examples are n non-identical quantum states. We assume there is at least one hypothesis whose empirical risk on the examples goes below some threshold; this is captured by the promise in Lemma 1 that there is at least one list of projectors whose average expectation value on the examples exceeds the corresponding threshold. In order to perform ERM, therefore, it suffices to find some c satisfying such a guarantee. The next theorem (Theorem 8) shows that the learner can perform a gentle quantum measurement, in the sense we define next, on the product state, in order to find such a c with high probability. This is also the measurement referred to in Lemma 1. It is sequential and adaptive: the learner is presented with a projectors-threshold pair ({Π_1^{(c)}, ..., Π_n^{(c)}}, θ_c) at each step, and responds by making an appropriate measurement on her state (which is reused for multiple measurements) depending on past measurement outcomes. The c-th measurement depends on the presented projectors-threshold pair and the outcomes of the first c − 1 measurements.
Theorem 8 should be viewed as a quantum counterpart to the classical Theorem 7, as it introduces the gentle quantum measurement that is at the core of our learning algorithm. Here we show that to every list of projectors {Π_1, ..., Π_n} one can associate a gentle quantum event B that, similarly, only perturbs a quantum state by a small amount if a measurement of B rejects. In Theorem 7, the gentleness was in the sense that d_{χ²}((T | B̄), T) was bounded by Pr[B]². In the quantum Theorem 8, the gentleness is in the sense that the fidelity between the pre- and post-rejecting-measurement quantum states is exactly BC((T | B̄), T); this is analogous to Lemma 3.4 of [4]. The two classical distance measures on probability distributions that we have mentioned are related via the inequality 1 − BC(P, Q) ≤ (1/2) d_{χ²}(P, Q) for probability distributions P, Q.
Theorem 8 (Gentle quantum measurements on non-identical product states). Fix 0 ≤ θ ≤ 1. Let X be any classical random variable taking values in [0, ∞). For any list of projectors {Π_1, ..., Π_n}, Π_i ∈ L(H^{(d)}), there exists a quantum event B ∈ L((H^{(d)})^{⊗n}) such that when B is measured against a product state ϱ = ρ_1 ⊗ ... ⊗ ρ_n, we have Tr[Bϱ] = Pr[T + X > θn], (89) where T ∼ PB(p_1, ..., p_n) and p_i = Tr[Π_i ρ_i]. Proof. Observe that we may write the distribution of T explicitly as Pr[T = k] = Σ_{S ⊆ [n], |S| = k} Π_{i∈S} Tr[Π_i ρ_i] Π_{i∉S} Tr[(1 − Π_i) ρ_i]. (91) Then consider the projectors {E_k}_{k=0}^n, E_k := Σ_{S ⊆ [n], |S| = k} (⊗_{i∈S} Π_i) ⊗ (⊗_{i∉S} (1 − Π_i)), so that Tr[E_k ϱ] = Pr[T = k]. It is easy to see that the event B := Σ_{k=0}^n Pr[X > θn − k] E_k fulfils Eq. (89). The 'furthermore' part of the statement of the Lemma follows immediately from Eq. (55).
Theorem 8 applies to X being any classical random variable. Now we specialize to X being an exponential random variable. This specialization allows us to obtain a bound on the Bures distance between the state before and after the rejecting measurement, together with a bound on the probability of B. For the latter, in contrast to [4], we employ here a second-order Lagrange remainder that allows us to bypass the reduction to a fixed threshold, thus simplifying the overall proof procedure.
Theorem 9. Let X be an exponential random variable with mean 1/λ as in Theorem 7. Then there exists a quantum event B ∈ L((C^d)^{⊗n}) such that the gentleness bound of Eq. (95) and the acceptance-probability bound of Eq. (96) hold. Proof. Eq. (95) is an easy consequence of Theorems 7 and 8 and Eq. (87). Moreover, Eq. (96) follows from a chain of estimates, as desired. In the first inequality, we used that Pr[X > t] ≤ exp(−λt); in the fourth equality we used the MGF of a Poisson binomial random variable; in the second inequality we used the AM-GM inequality; in the third inequality we used the second-order Lagrange remainder; finally, in the fourth inequality, we used that 1 + x ≤ e^x for x ∈ R.

3: If the measurement accepts, output c and break. If the measurement rejects, let ϱ^{(c)} denote the post-measurement quantum state.
4: end for
5: If none of the measurements were accepted, output "pass on all".
In Lemma 1, we claimed that Algorithm 1 is such that if n ≥ C_1 (log m + C_2)²/ε² for appropriate constants C_1 and C_2 (and when ε < 1), then with probability at least 0.03 the algorithm halts and outputs a projector list exceeding the threshold, i.e. it halts on a c such that (1/n) Σ_{i=1}^n Tr[Π_i^{(c)} ρ_i] > θ_c − ε. We now prove this. We will adopt the following notation, which mirrors the notation in [4]. For c = 1, ..., m, let:
1. B_c be the rejecting outcome of the c-th measurement;
2. p_c be the probability that the c-th measurement accepts, given that all previous ones rejected;
3. ϱ^{(0)} = ρ_1 ⊗ ... ⊗ ρ_n, and ϱ^{(c)} be the quantum state after the c-th measurement, conditioned on the event B_j occurring for all 1 ≤ j ≤ c;
4. r_c = Tr[B_c ϱ^{(c−1)}] be the probability that the event B_c occurs assuming all the events B_j with 1 ≤ j ≤ c − 1 occurred;
5. q_c = r_1 ⋯ r_c be the probability that all of the events B_j with 1 ≤ j ≤ c occur.
As in [4], our proof works by bounding the probabilities of two bad events. We now state what these events are and give an intuition for why their probabilities should be bounded, before we formalize the intuition.
1. The algorithm outputs a false negative: it passes on all projector-threshold pairs, even though there was one that fulfilled the promise.
2. The algorithm outputs a false positive: it outputs a c that does not actually fulfil the promise, i.e. a c with (1/n) Σ_{i=1}^n Tr[Π_i^{(c)} ρ_i] ≤ θ_c − ε.
Intuition for why these two probabilities are bounded: Suppose hypothetically that the algorithm had the luxury of measuring the given projectors on R fresh copies of ρ_1 ⊗ ... ⊗ ρ_n each time. Then the promise of Lemma 1 and Chernoff's bound ensure that with R = O(1/ε²)-many copies one could identify some projector list fulfilling (109) with high probability, so that neither false negative nor false positive would come to pass. However, in Algorithm 1 we do not use R fresh copies on each of the m iterations of the for loop; instead, we make successive measurements on a single copy. Then the Damage Lemma (Eq. (51)), together with the properties of the specially constructed B_c in Theorem 8, ensures that the usage of a 'damaged' copy does not affect the probability of acceptance of a new measurement too much, relative to using R fresh copies. The caveat is that 'low damage' is only guaranteed if all of the previous measurements have rejected; care must then be taken to account for this.
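The hypothetical fresh-copy strategy can be made concrete with a short simulation (our own illustration; the per-position probabilities and target accuracies are arbitrary choices). With R = O(log(1/δ)/ε²) fresh copies, the empirical average of the local measurement outcomes concentrates around µ_c by Hoeffding/Chernoff:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fresh-copy strategy: estimate mu_c = (1/n) sum_i Tr[Pi_i rho_i]
# by measuring each local projector once per copy, on R fresh copies.
eps, delta = 0.1, 0.01
R = int(np.ceil(2 * np.log(2 / delta) / eps**2))  # Hoeffding: R = O(log(1/delta)/eps^2)

n = 50
mu = rng.uniform(0.3, 0.9, size=n)   # per-position acceptance probabilities
true_risk = mu.mean()

# Each "fresh copy" yields n Bernoulli outcomes; average over copies and positions.
outcomes = rng.random((R, n)) < mu
estimate = outcomes.mean()
print(R, abs(estimate - true_risk))
```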
Proof of Lemma 1. Controlling the false negative probability - The promise of Lemma 1 is that there is some c ∈ [m] with (1/n) Σ_{i=1}^n Tr[Π_i^{(c)} ρ_i] ≥ θ_c. For this particular c, let us bound the probability of acceptance of the corresponding measurement B_c (which we previously denoted p_c).
To do so, let us recall that Theorem 9 guarantees that the B_c's are constructed in such a way that their measurement outcome statistics follow those of a classical Poisson binomial random variable T = Σ_i T_i. This is a sum of independent and not identically distributed binary random variables. Their concentration is captured by the multiplicative Chernoff bound in Eqs. (56) and (57). Applying this to the sum Σ_i Tr[Π_i^{(c)} ρ_i], we have, by the promise θ_c ≤ µ_c and the non-negativity of X, that 1 − p_c ≤ e^{−1/4} for n = Ω(1/ε²), where we use Eq. (94) and the bound of Eq. (57). Since we have now established that there is some c such that 1 − p_c ≤ e^{−1/4}, there must also be some minimum index t such that (1 − p_1) ⋯ (1 − p_t) ≤ e^{−1/4}. If t = 1 we are done; otherwise t > 1 and, since t is minimal, (1 − p_1) ⋯ (1 − p_{t−1}) > e^{−1/4}. Taking logs on both sides, this implies that Σ_{j=1}^{t−1} p_j < 1/4 (where we have used the inequality x < −log(1 − x)). If t > 1, by Eq. (51), Theorem 7 and the fact that d_tr(ρ, σ) ≤ d_Bures(ρ, σ) (by the standard Fuchs-van de Graaf inequality [35]), we can bound the damage to the state accumulated over the first t − 1 rejecting measurements. Recall that X is an exponential random variable with E[X] = D√n, meaning that choosing D large enough we have q_t ≤ 4/5 and q_{t−1} ≥ 3/4. The former means that the algorithm will output some c ≤ t with probability larger than 1/5.
Controlling the false positive probability - Now we restrict our attention to the first t projector lists presented to the algorithm, since we have established in the previous section that the algorithm terminates with more than some constant probability after at most t iterations of the for loop. We now need to show that the algorithm, with sufficiently high probability, does not output a c with µ_c ≤ θ_c − ε. Let us denote the set of such c as B (for Bad): B := {c ∈ [t] : µ_c ≤ θ_c − ε}. By Eq. (96) in Theorem 9, whenever µ_c ≤ θ_c − ε (i.e. for all c ∈ B), we have that p_c is small, where we used that E[X] = 1/λ = D√n for some D > 0. Therefore p_c ≤ (η/m)² < 1/5 if (2 log(m/η) + eD^{−2}/2)² ≤ D^{−1} n ε² and η ≤ 0.01. The argument is now identical to the proof of Lemma 4.2 in [4]. There, using Eq. (53), F(ϱ, ϱ^{(t)}) < 1, Σ_{c∈[t]∖B} p_c < 1/4 and q_t < 4/5, one obtains a bound on Σ_{c∈[t]∖B} s_c. Since Σ_{c∈[t]∖B} s_c is the probability that the algorithm returns an index c ∈ [t] with µ_c ≥ θ_c − ε, it follows that the algorithm is correct with probability at least 0.03.

IV. QUANTUM EMPIRICAL RISK MINIMIZATION
In this section we present our main algorithm to perform ERM for both projector- and state-valued classes of quantum processes.

A. Empirical risk minimization for projector-valued functions
We first show how to estimate the empirical risk of a fixed number of projector-valued functions on a given product state.The main theorem we prove is the following.
Theorem 10 (Quantum empirical risk minimization - Theorem 1, refined). Given access to a product state ϱ = ρ_1 ⊗ ... ⊗ ρ_n and a collection of lists of projectors {Π_1^{(c)}, ..., Π_n^{(c)}}_{c=1,...,m}, with µ_c := (1/n) Σ_{i=1}^n Tr[Π_i^{(c)} ρ_i] = 1 − R̂_ϱ(c) (where the second term is the empirical risk of concept c on the entire product state), there is an algorithm which outputs c* together with an estimate µ̂_{c*} of µ_{c*} such that |µ̂_{c*} − max_{c∈[m]} µ_c| ≤ ε with probability at least 1 − δ, if n is large enough. In fact we can take n ≥ C_1 Tkl, with T, k, l as specified in the proof below, for some C_1 > 0.
Let us now convey the intuition behind the claimed algorithm (which is Algorithm 2). Observe that the task of finding the concept that attains the maximum overlap, max_{c∈[m]} µ_c, can be accomplished by binary search over the interval [0, 1]; namely, we search for the largest threshold such that at least one of the m risks lies above it. The remainder of this section gives the details of how to do so. The key idea is to divide the given product state ϱ into 2Tk blocks of l = n/(2Tk) states each, and to take the block size l large enough so that the average risk on each block concentrates towards µ_c for each c. This gives us sufficient confidence to apply our Algorithm 1 (ThresholdSearch) on each block to check if there exists some concept that exceeds the current candidate threshold. Depending on the result of this check, we adjust the candidate threshold accordingly. The guarantees of this algorithm (Algorithm 2) are given in Lemma 2. Subsequently, in Lemma 3, we derive the block size l needed to ensure the desired concentration.
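The binary-search skeleton can be sketched classically (our own caricature, not the paper's Algorithm 2): we replace ThresholdSearch and Check by an idealized oracle answering "does some concept exceed θ?", and halve the candidate interval until its width drops below ε. The hidden values and the accuracy are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

mu = rng.uniform(0, 1, size=8)   # hidden per-concept overlaps mu_c
eps = 0.05

def above(theta):
    # Idealized stand-in for ThresholdSearch + Check (which in the quantum
    # setting only succeed with boosted constant probability per block).
    return bool(np.any(mu > theta))

lo, hi = 0.0, 1.0
while hi - lo > eps:
    mid = (lo + hi) / 2
    if above(mid):
        lo = mid   # some concept exceeds mid: recurse on the upper half
    else:
        hi = mid   # no concept exceeds mid: recurse on the lower half

print(lo, mu.max())
```

Each halving costs O(k) blocks in the real algorithm, which is where the O(log(1/ε)) factor in the sample complexity comes from.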
Lemma 2 (Quantum empirical risk minimization given large product states). Given access to 2Tk blocks of states as in Eq. (126) and a collection of lists of projectors {Π_{s,j}^{(c)}}_{c=1,...,m; s=1,...,2Tk; j=1,...,l}, suppose that the following conditions hold:
1. the block size l is large enough that Eq. (127) holds;
2. at the same time, Eq. (128) holds;
3. the numbers 0 ≤ µ_c ≤ 1 are approximations of the expected values of {Π_{s,j}^{(c)}}_{s,j} with respect to the blocks of states for all c ∈ [m], i.e., they satisfy Eq. (129).
Then there is an algorithm that outputs c* and µ̂_{c*} such that |µ̂_{c*} − max_c µ_c| ≤ ε with high probability. Here we note that 'given large product states' in the title of the Lemma refers to the requirement that n = 2Tkl is large enough to guarantee that Eq. (127) holds.

9: if Check outputs 'yes' then
10: failures ← 0 and update the candidate interval to its upper half
12: else (Check outputs 'no')
13: failures ← failures + 1
Algorithm in words: Algorithm 2 runs binary search to find an interval containing max_c µ_c, starting with the candidate interval [0, 1], determining whether the desired value lies in the upper or lower half, and then updating the candidate interval to the relevant half and recursing. To determine in which half of the candidate interval the desired value lies, the algorithm uses up two blocks of samples, ϱ_{2s−1} and ϱ_{2s}: on the first block, we run ThresholdSearch, which also outputs a concept that exceeds the current candidate threshold. On the second block, we run a check that confirms that this concept indeed exceeds the threshold (as ThresholdSearch only succeeds with probability 0.03). We declare a failure if by the end of this process we have not received a 'yes' from both algorithms; after k consecutive failures, we conclude that there was actually no concept exceeding the candidate threshold, and move on by decreasing the candidate threshold (Line 17). Conversely, if at any point we receive a 'yes' from both algorithms, we conclude that the candidate threshold was too low and increase it (Line 10).
Error analysis: Let us now analyze the error in this algorithm.The algorithm will err if either Line 17 or Line 10 updates the candidate threshold wrongly.We are interested in the probability of either of these two events happening for a fixed candidate threshold; by a union bound, we will then multiply this probability by the number of candidate thresholds that are examined, which is O(log(1/ϵ)) -as the interval does not get updated once it becomes smaller than 6ϵ, and every update approximately halves the size of the interval (see Lines 10 and 17).
We thus define the following four error probabilities and their associated events:
• p_{FP}^{TS}: ThresholdSearch outputs a false positive, i.e. it outputs that there is a concept above threshold when there is none.
• p_{FN}^{TS}: ThresholdSearch outputs a false negative, i.e. it outputs 'no concept above threshold' when, in fact, there was one.
• p_{FP}^{C}: Check outputs a false positive, i.e. it confirms a concept that does not actually exceed the threshold.
• p_{FN}^{C}: Check outputs a false negative, i.e. it rejects a concept that does exceed the threshold.
Note that, by Lemma 1, it holds that p_{FN}^{TS} ≤ 0.97, whereas p_{FN}^{C} and p_{FP}^{C} can be made exponentially small in the block size l by the multiplicative Chernoff bound (Proposition 1). If Line 10 updates wrongly, it can only be because ThresholdSearch outputs a concept even though no concept exceeds the threshold, and Check outputs 'yes' on that wrong concept. The probability of this happening for a given interval is at most k p_{FP}^{TS} p_{FP}^{C}, as there are k rounds where this could potentially happen and any such event triggers an update of the interval.
If Line 17 updates wrongly, there must have been k consecutive failures when there actually was a concept above threshold. Each such failure is caused by one of the following events:
• ThresholdSearch executes correctly at some point but Check falsely outputs 'no'. The probability that this happens within the k rounds is at most p_{FN}^{C}. To see this, call τ_i the event that the first false negative of Check occurs at time i. The {τ_i} are mutually exclusive events, so the probability that there is a false negative at some time i is Σ_i Pr[τ_i] ≤ p_{FN}^{C}.
• ThresholdSearch wrongly outputs that no concept was above threshold (probability p_{FN}^{TS}) for k consecutive times. The probability of this event is bounded by 0.97^k, as explained above.
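The 0.97^k term above drops below any target failure probability with k = O(log(1/δ)); a quick numerical sanity check (the target δ is an arbitrary choice of ours):

```python
import math

# After k consecutive failures, the false-negative contribution 0.97^k
# drops below a target delta for k = O(log(1/delta)).
delta = 1e-3
k = math.ceil(math.log(delta) / math.log(0.97))
print(k, 0.97 ** k)
```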
Summing over the T rounds, the probability of error is then upper bounded by T times the per-round bound derived above. With the choices made for the parameters of the algorithm, we obtain the guarantee given in the Lemma statement.
Next, we control the probability that the (possibly non-identical) product states give an accurate approximation to the expectation values of the projectors associated with the possible concepts, i.e. the probability that Eq. (129) is satisfied. The main technical ingredient is the following Lemma, which shows, essentially, that Hoeffding's inequality remains effective for proving the concentration of the empirical mean of samples drawn without replacement from a finite population.
Lemma 3. Let Y = {1, ..., n}, n ≥ 3Kl, and let the K sets of indices {{X_{lk+i}}_{i=1}^l}_{k=0}^{K−1} be random samples drawn without replacement from Y. Furthermore, consider m different finite populations of n numbers {{x_{c,j}}_{j=1}^n}_{c=1}^m, with 0 ≤ x_{c,j} ≤ 1. From each population {x_{c,j}}_{j=1}^n obtain K subsets of size l, as {{x_{c,X_{lk+j}}}_{j=1}^l}_{k=0}^{K−1}. Then it holds that Pr[max_{c∈[m], 0≤k<K} |(1/l) Σ_{j=1}^l x_{c,X_{lk+j}} − x̄_c| ≥ ε] ≤ 2mK e^{−2lε²/64}, where x̄_c = (1/n) Σ_{j=1}^n x_{c,j}. The proof is given in Appendix B (see the equivalent Theorem 19). With these two Lemmas in hand, we are now able to prove that Algorithm 2 succeeds in finding the empirical risk minimizer amongst projector-valued functions.
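An empirical check of the phenomenon behind Lemma 3 (our own illustration, not the paper's proof; population, subset size and accuracy are arbitrary choices): sample means of subsets drawn without replacement deviate from the population mean no more often than the i.i.d. Hoeffding bound predicts.

```python
import numpy as np

rng = np.random.default_rng(3)

n, l, trials = 10_000, 400, 2_000
x = rng.uniform(0, 1, size=n)          # a population of bounded values
pop_mean = x.mean()

devs = np.empty(trials)
for t in range(trials):
    idx = rng.choice(n, size=l, replace=False)   # sample WITHOUT replacement
    devs[t] = abs(x[idx].mean() - pop_mean)

eps = 0.06
emp = (devs >= eps).mean()                 # empirical deviation frequency
hoeffding = 2 * np.exp(-2 * l * eps**2)    # i.i.d. Hoeffding tail bound
print(emp, hoeffding)
```

Sampling without replacement is in fact more concentrated than i.i.d. sampling, so the Hoeffding-type bound is conservative here.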
Proof of Theorem 10. Let n ≥ 6Tkl be the number of observations of the output of the unknown process (where we remind readers that each observation corresponds to a quantum state). That is, we set K = 2Tk in Lemma 3. Sampling without replacement 2Tk lists of length l from [n], let the s-th list define the product state ϱ_s and, for each concept c ∈ [m], the corresponding set of projectors {Π_{s,j}^{(c)}}_{j=1,...,l}. Now, by identifying x_{c,j} = Tr[Π_j^{(c)} ρ_j] in Lemma 3, we obtain the desired concentration, Eq. (136), with probability of failure p_{err}^{(1)} ≤ 4Tkm e^{−2lε²/64}. If Eq. (136) is true, then condition Eq. (129) is satisfied and we can apply Algorithm 2 to obtain a good estimate with probability of error p_{err}^{(2)}. We can make p_{err}^{(1)} + p_{err}^{(2)} ≤ δ with the choice of T, k as in Lemma 2 and l = O(max(log(Tkm/δ)/ε², (log m + C_2)²/ε²)).

B. Empirical risk estimation for projector-valued functions
In this section we will show that if the size of a product state is sufficiently large with respect to the logarithm of the local dimension, we can not only identify the concept yielding the minimum risk and estimate that risk, but in fact estimate the risks of all of the concepts at the same time. The key idea is built on the philosophy of shadow tomography: we find a function that predicts the expectation values µ_c, where c runs over the possible different concepts, given access to the product state. In the case of lists of identical projectors (Π_i^{(c)} independent of i), this task went under the name of the diverse-state setting for shadow tomography in [17], where it was noted that the algorithm proposed there also works in this setting. Here we show that the improvements given by [4] carry through in this setting with the appropriate generalizations. We also make explicit an appropriate modification of the procedure of [3] to update the state guess, where a reference quantum state (possibly stored on a classical computer as a matrix) is used to compute the expectation values of the projectors, and is updated to agree with the data from the experiment. Here we start with m lists of projectors and an unknown product state. As in the previous section, we denote by h_{c,s} the block projector associated with concept c and block s, and by the corresponding outcome the random variable obtained by measuring h_{c,s} on ϱ_s.
Note that the quantities 1 − R_ϱ(c) can also be expressed as expectation values of projectors on the classical-quantum state σ. The algorithm works by keeping track of a classical estimate of σ that is updated sequentially. The estimate is initialized at time t = 0 as in Eq. (143), and at each time t, ρ_t is obtained by picking one system of the q at random and computing the marginal of ρ*_t on that system. The form of the state to be updated is the main difference between our proposal and the strategy in [3]. We give the proof in its entirety for convenience.
At each time step t, the algorithm is also provided with some c ∈ [m] for which the current best estimate ρ_t makes poor predictions relative to σ, that is, the predicted expectation value is ε-far from the true one. (Later, in Lemma 5, we will explain how to find such a c.) The updating procedure then makes use of Π^{(c)} to update ρ*_t to ρ*_{t+1}. In doing so we can guarantee that at most a certain number T of updates will be required, in the following sense: Lemma 4 (Update). There exists a sequential procedure to update a classical estimate of some unknown quantum state σ that initializes the estimate at time t = 0 as the ρ*_0 given in Eq. (143), at time t takes as input the current estimate ρ*_t and some c ∈ C for which the estimate is ε-far in one of the two directions (Case 1 or Case 2), and outputs an updated estimate ρ*_{t+1}, such that after T updates the estimate ρ*_T fulfils the desired accuracy guarantee. Proof. First of all, we construct the following projectors: Π^{(c)}(l), the sum of all events that accept the projector Π^{(c)} at exactly l points in [q] and reject Π^{(c)} at the remaining q − l points in [q]. Finally, we define Π^{(c),+}(r) := Σ_{l≥r} Π^{(c)}(l) and Π^{(c),−}(r) := Σ_{l≤r} Π^{(c)}(l). (148) In words, Π^{(c),+}(r) is the sum of all events that accept Π^{(c)} at at least r points in [q] and reject it at all remaining points; Π^{(c),−}(r) is the sum of all events that accept Π^{(c)} at at most r points in [q] and reject it at all remaining points. We now describe the update procedure. In Case 1, we update ρ*_t to ρ*_{t+1} by post-selecting on the event F_t^+(c) = Π^{(c),+}((µ̂_{c,t} + ε/2)q). In Case 2, we update ρ*_t to ρ*_{t+1} by post-selecting on the event F_t^−(c) = Π^{(c),−}((µ̂_{c,t} − ε/2)q). Define F_t as the appropriate accepting event at time t, i.e. F_t := F_t^+ in Case 1 and F_t := F_t^− in Case 2.
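The post-selection update can be sketched numerically on a toy example (our own illustration; the dimensions, the projector and the estimate are arbitrary choices, and we build the Case 1 event "Π accepts on at least r of the q registers" by brute force):

```python
import numpy as np
from itertools import combinations

# Toy version of the Lemma 4 update: post-select the classical estimate
# (a state on q registers) on the event "Pi accepts on at least r registers".
d, q = 2, 4
Pi = np.diag([1.0, 0.0])                 # single-qubit projector |0><0|
I = np.eye(d)

def kron_list(ops):
    out = np.array([[1.0]])
    for op in ops:
        out = np.kron(out, op)
    return out

def accept_at_least(r):
    """Pi^{(c),+}(r): sum of events accepting Pi on exactly l >= r registers."""
    E = np.zeros((d**q, d**q))
    for l in range(r, q + 1):
        for S in combinations(range(q), l):
            E += kron_list([Pi if i in S else I - Pi for i in range(q)])
    return E

rho_local = np.diag([0.7, 0.3])          # current single-register estimate
rho = kron_list([rho_local] * q)         # product estimate rho*_t
F = accept_at_least(3)                   # Case 1 post-selection event

rho_new = F @ rho @ F
rho_new /= np.trace(rho_new)             # updated estimate rho*_{t+1}

# Post-selection raises the estimated acceptance probability of Pi.
marg_before = np.trace(rho @ kron_list([Pi] + [I] * (q - 1))).real
marg_after = np.trace(rho_new @ kron_list([Pi] + [I] * (q - 1))).real
print(marg_before, marg_after)
```

The update thus pushes the estimate's predicted expectation value in the direction indicated by the data, which is what drives the (1 − ε)^{Ω(t)} progress argument below.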
We will now upper bound the probability that the first t post-selection steps all succeed, following an argument of [3]. Call this probability p_t; we will be able to upper-bound it by p_t ≤ (1 − ε)^{Ω(t)}. Indeed, we can show this by considering the random variable X = number of acceptances resulting from measuring Π^{(c)} on each of the q registers. Applying Markov's inequality, we obtain the desired bound in Case 1, while an analogous calculation yields the bound in Case 2. This implies p_t ≤ (1 − ε)^{Ω(t)}. Now, we lower-bound p_t. Hypothetically, suppose that at time t we were to apply the measurement F_t (which depends on which case we are in) to σ^{⊗q}. By the promise that we are either in Case 1 or Case 2, and by Chernoff's bound, at each step the measurement rejects with probability 1 − Tr[F_t σ^{⊗q}] ≤ e^{−Ω(qε²)}. Applying the measurements in sequence, F_t ... F_0 also accepts with high probability by the quantum union bound (e.g. [37]); in particular, this bounds the probability of accepting on every step until step t. But then, for every ρ_i, i ∈ [n], it is possible to write it as a mixture involving σ and some positive semi-definite, trace-1 ω. This decomposition (specifically the fact that ω is PSD) elucidates a lower bound on p_t. By taking q = (C/ε²)(log log d + log(1/ε)) and putting this lower bound together with the upper bound p_t ≤ (1 − ε)^{Ω(t)}, we obtain the claimed bound on T, the number of updates after which we can estimate all the expectation values at the desired precision.
To obtain a c satisfying the promise of Lemma 4, we can use the following Lemma, which is a simpler variant of Lemma 2.
Lemma 5. Given access to 2k product states ϱ_1, ..., ϱ_{2k}, a collection of lists of projectors {Π_{s,j}^{(c)}}_{c=1,...,m; s=1,...,2k; j=1,...,l}, numbers µ_c as in Eq. (161), and numbers {λ_c}_{c=1,...,m}: suppose we are guaranteed that l ≥ C_1 (log m + C_2)²/ε² for appropriate constants C_1, C_2, and that at the same time k = O(log(1/δ) log(1/ε)) is large enough. Then there is an algorithm such that, with probability larger than 1 − δ, if there exists some c with |µ_c − λ_c| ≥ 2ε, either the algorithm declares failure or it outputs a c* such that |µ_{c*} − λ_{c*}| ≥ ε. Proof. Use ThresholdSearch (Lemma 1) on the states ϱ_s, s odd, with
• the lists of projectors {Π_{s,j}^{(c)}}_j and thresholds λ_c + (7/4)ε,
• together with the lists of projectors {1 − Π_{s,j}^{(c)}}_j and thresholds 1 − λ_c + (7/4)ε,
and precision parameter ε/4. For each odd s, if a c is output from the search, we measure the corresponding projectors on ϱ_{s+1} to confirm the finding; if no concept is selected, we continue with the next pair of states. By an argument identical to the analysis of the error probability in Lemma 2, the probability of error is bounded as p_err ≤ 0.97^k + 2(k + 1)e^{−lε²/6}, and the choice of k and l in the Lemma statement makes it less than δ.
Thus, our algorithm for ERE simply starts from an estimate dependent on the empirical distribution of measurement outcomes of the classical register, and then interleaves the ThresholdSearch and Update subroutines to progressively update this estimate. This is summed up in the following algorithm:

Algorithm 3 Empirical risk estimation
Input: Product states. Parameters: T, q.
1: Initialize, on a classical computer, the classical estimate.
2: for t = 1, ..., T do
3: c ← ThresholdSearch on the state ϱ_s with projectors and parameters as described in the proof of Lemma 5. If ThresholdSearch declares failure, then break.

We can now prove
Theorem 11 (Quantum empirical risk estimation for projector-valued functions - Theorem 2, refined). Given access to a product state ϱ = ρ_1 ⊗ ... ⊗ ρ_n and a collection of lists of projectors {Π_1^{(c)}, ..., Π_n^{(c)}}_{c=1,...,m}, there is an algorithm which outputs estimates µ̂_c such that |µ̂_c − µ_c| ≤ ε simultaneously for all c ∈ [m], with probability at least 1 − δ, if n is large enough. In fact, as in Theorem 10, we can sample without replacement a collection of lists of projectors {Π_{s,j}^{(c)}}_{c=1,...,m; s=1,...,2Tk; j=1,...,l} such that Eq. (173) holds with high probability.

C. Empirical risk minimization for state-valued functions
A key subroutine we will need to introduce is based on hypothesis selection, which is a way of choosing a classical or quantum probability distribution that best fits some observed data. We will use a generalized version of the hypothesis selection algorithm of [4] (which applied to i.i.d. states) to find the empirical risk minimizer, with the loss given by the trace distance, within a set of candidate state-valued processes.
There exists an algorithm which, given ϱ, outputs an index with the following guarantee. Let η = min_i d_tr(σ_i, ρ); the algorithm has to output a hypothesis k such that d_tr(σ_k, ρ) ≤ 3η + ε.
The algorithm works as follows: for each s, k and i < j, define the Helstrom projectors A_ij(s) := (σ_i(s) − σ_j(s))_+, where (·)_+ is the projector onto the positive part of the argument, and their block-sum. By construction, these projectors satisfy Tr[A_ij(s)(σ_i(s) − σ_j(s))] = d_tr(σ_i(s), σ_j(s)). The algorithm then uses Algorithm 2 on ϱ to perform ERE of the projector-valued functions {A_ij(s) : i < j}, outputting estimates µ_ij of the corresponding expectation values such that, with probability at least 1 − δ, the accuracy guarantee of Eq. (181) holds. Finally, the algorithm employs a classical subroutine to minimize the quantity ∆_k defined via these estimates, finding k* := argmin_k ∆_k. We can show that this is a good enough candidate, i.e., d_tr(σ_{k*}, ρ) is sufficiently small. Indeed, if we define the optimal hypothesis attaining Eq. (176) as i* := argmin_i d_tr(σ_i, ρ), by the triangle inequality it holds that d_tr(σ_{k*}, ρ) ≤ d_tr(σ_{k*}, σ_{i*}) + d_tr(σ_{i*}, ρ). (186) The first term in Eq. (186) can be bounded using the estimates µ_ij; the last two inequalities in that bound are implied by (181), and therefore the overall result holds with probability at least 1 − δ. Proceeding similarly for the second term in Eq. (186), we obtain an analogous bound as well. Therefore, by taking ε' = ε/4 we can conclude that d_tr(σ_{k*}, ρ) ≤ 3η + 4ε' = 3η + ε with probability at least 1 − δ.
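The selection rule can be sketched numerically in the simplest (single-block, realizable) case. This is our own illustration, not the paper's algorithm: the dimensions and states are arbitrary, we use exact expectation values in place of the estimates µ_ij, and ∆_k is the standard worst-case discrepancy used in quantum hypothesis selection.

```python
import numpy as np

rng = np.random.default_rng(4)

def rand_state(d, rng):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    return rho / np.trace(rho).real

def positive_part_projector(M):
    """Helstrom projector: projector onto the positive eigenspace of M."""
    w, V = np.linalg.eigh(M)
    P = V[:, w > 0]
    return P @ P.conj().T

def trace_dist(a, b):
    return 0.5 * np.abs(np.linalg.eigvalsh(a - b)).sum()

d, m = 4, 5
sigmas = [rand_state(d, rng) for _ in range(m)]
rho = sigmas[2]                           # unknown state; realizable case

# mu[(i, j)] plays the role of the estimates of Tr[A_ij rho] (here exact).
A = {(i, j): positive_part_projector(sigmas[i] - sigmas[j])
     for i in range(m) for j in range(m) if i != j}
mu = {key: np.trace(P @ rho).real for key, P in A.items()}

def Delta(k):
    """Worst-case mismatch between sigma_k's predictions and the data."""
    return max(abs(np.trace(A[(k, j)] @ sigmas[k]).real - mu[(k, j)])
               for j in range(m) if j != k)

k_star = min(range(m), key=Delta)
eta = min(trace_dist(s, rho) for s in sigmas)
print(k_star, trace_dist(sigmas[k_star], rho), eta)
```

With exact values and a realizable target, ∆ vanishes only at the true hypothesis, so the minimizer recovers it; the paper's analysis shows the same rule degrades gracefully to 3η + ε under ε'-accurate estimates.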

V. STATISTICAL LEARNING FOR CLASSICAL-QUANTUM PROCESSES
The previous section presented theorems for ERM on finite concept classes. In this section we extend our tools to learning even infinite concept classes. The main result of this section is a statistical learning theorem (Theorem 4) for classical-quantum processes, which gives conditions on their learnability in terms of the covering number of the concept class C from which they are drawn. Our proof is constructive and provides an explicit algorithm to learn the given concept class.
Note, however, that the algorithm relies on constructing empirical covering nets, which can be computationally demanding; we therefore expect that faster algorithms exist. We also remark that the following results are a consequence of the guarantees on our ERM algorithms for both projector-valued and state-valued concept classes, proved in Section IV, together with standard classical statistical-learning guarantees on uniform convergence of the empirical risks in terms of the growth functions Γ₁ (see Sections II D and II E). It is important to note that our sufficient condition for learnability is instead expressed in terms of the growth functions γ_{1,q} ≥ Γ₁, which suffice to guarantee both ERM and uniform convergence.
In Subsection V A we formulate a lemma that identifies, even for continuous-valued concept classes, a finite-cardinality concept class on which we apply the algorithm of the previous section. In Subsection V B, we prove our main Theorem 4 for concept classes that map to projector-valued functions. Through shadow tomography, this can be extended to estimating all the empirical risks, and we do this in Section V C. In Subsection V D, we prove Theorem 4 for concept classes that map to state-valued functions. In Appendix C we also show that the algorithm of [7] to learn concept classes that output pure states, in the realizable case, also works when the growth function is slowly growing.

A. Uniform convergence
In the rest of this section we will perform ERE and/or ERM with respect to an ϵ-net of the concept class which depends on the classical data. The following lemma ensures that ERE/ERM on the net also gives a good solution for the estimation/minimization of the true risks.
Lemma 6 (Uniform convergence on an empirical net). There exists an ϵ/2-net C_{ϵ/2}(⃗x) of the concept class on the data, with cardinality at most γ_{1,q}(l₀, ϵ/2, C), such that every true risk is ϵ-approximated by that of a net element, with probability of error at most 4γ_{1,q}(2l₀, ϵ/64, C) e^{−l₀ϵ²/512}.

Proof. This is a simple consequence of Theorem 6.
In fact, given a sample S consisting of l₀ samples of the random variable x ∈ X associated to the classical register, we have that Pr[sup_{c∈C} |R̂_S(c) − R(c)| > ϵ/4] ≤ 4γ_{1,q}(2l₀, ϵ/64, C) e^{−l₀ϵ²/512}.
By taking an ϵ/2-net C_{ϵ/2}(⃗x) according to the appropriate loss function L on the data ⃗x, we have that for every c ∈ C there is c*(c) ∈ C_{ϵ/2}(⃗x) such that, via Eqs. (68) and (72), the empirical risks of c and c*(c) are ϵ/2-close. If both conditions are satisfied, the true risks are ϵ-close as well. The bound on the cardinality comes from the definition of γ_{1,q}.

Via this lemma, whenever we want to solve risk estimation for a classical-quantum source, we can use part of the classical data to extract a good ϵ-net on the space of true risks (meaning exactly that for every concept c there is a known concept c*_c with true risk ϵ-close), and use the quantum data to obtain the best hypothesis on this finite-cardinality concept class, using the algorithms shown in the previous sections.

Remark: As the cardinality of the ϵ-net on the data can increase with the size of the sample, one could also construct the ϵ-net only on a part of the classical data, such that uniform convergence is ensured at the desired level. If lim_{n→∞} log γ_{1,q}(2n, ϵ/64, C)/n = 0, there will be a finite l₀ such that p_err ≤ δ. The cardinality of the ϵ-net will then be γ_{1,q}(l₀, ϵ/2, C). This observation is used in Appendix D.
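The net extraction step can be sketched classically as a greedy cover under the empirical pseudometric; the function names and the use of a finite concept sample standing in for C are our own illustrative assumptions.

```python
import numpy as np

def empirical_eps_net(concepts, xs, loss, eps):
    # empirical loss vector of each concept on the observed data
    vecs = np.array([[loss(c, x) for x in xs] for c in concepts])
    net = []
    for i, v in enumerate(vecs):
        # keep a concept only if it is eps-far (empirical pseudometric)
        # from every retained concept, so each discarded concept is
        # eps-close to some element of the net
        if all(np.mean(np.abs(v - vecs[j])) > eps for j in net):
            net.append(i)
    return [concepts[i] for i in net]
```

For constant concepts, e.g. values {0.0, 0.05, 1.0} at scale ϵ = 0.1, the net collapses the two nearby concepts into one representative.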
B. Learnability of quantum processes that map to projectors (proof of Theorem 4 for loss function L_p)

We now present one of our main technical results about learning unknown quantum processes without input control. This is Theorem 4, which we reproduce below for convenience.

Theorem 13 ((Theorem 4, repeated) Learning quantum processes via ERM). Suppose the concept class C consists of classical-quantum processes mapping to projectors or states, and let ϵ > 0 be the accuracy parameter. Furthermore, let S = (x_i, ρ(x_i))_{i=1}^n be the training set, with x_i drawn from the distribution D over X and ρ(·) an unknown classical-quantum channel.
Then the appropriate ERM algorithm of Theorems 1 and 3, run on an ϵ-net of the concept class C (according to the appropriate pseudometric determined by x₁, ..., x_n), provides an agnostic learning algorithm A : X^n × L(H^(d))^⊗n → C. For some fixed η, ξ ≥ 1 and n large enough, this algorithm outputs a hypothesis A(S) satisfying R(A(S)) ≤ η inf_{c∈C} R(c) + ξϵ. In particular, this applies to risks defined via the loss functions L_p (in this case η = 1, q = ∞) and L_s (in this case η = 3, q = 1) for projector-valued and state-valued concept classes C respectively.
We prove this theorem in two parts. In this section, we prove that Eq. (203) gives a sufficient condition for learning concept classes of projector-valued functions (i.e. using loss function L_p). In Section V D, we will prove that the same condition suffices for learning a concept class whose functions map to mixed states (i.e. using loss function L_s). Note that, in the case where the functions in the concept class map to pure states, the latter subcase would follow almost immediately from the former.
We will actually prove a more refined, quantitative statement.

Theorem 14 (Learning projector-valued functions via ERM). Suppose the concept class C consists of quantum processes mapping to projectors, let ϵ > 0 be the accuracy parameter, and suppose that the growth function γ_{1,∞} is slowly growing. Given as input a training set S = (x_i, ρ(x_i))_{i=1}^n with x_i drawn from D over X and ρ(·) an unknown classical-quantum channel, there is an agnostic learning algorithm A : X^n × L(H^(d))^⊗n → C.

We emphasize that this ThresholdSearch subroutine differs from the original ones [3, 4] because it does not require identical copies of the state to output a concept above threshold. Exploiting this algorithm and the convergence of the empirical risk to the true risk as stated in Theorem 6, we can finally prove Theorem 4 for the case of projectors.

Proof. We remind readers that the empirical risk for projector-valued functions can be written in terms of the expectation values Tr[Π^{(c)}(x_i) ρ(x_i)]. The algorithm looks at the classical register ⃗x and obtains an ϵ-net of the empirical risks C|_⃗x, which is also a 2ϵ-net on the true risks, with probability of error bounded as in Lemma 6, Eq. (210). If this holds, then for the selected concept c the true risk is close to the minimal one. The probability of error is then upper bounded by p_err,unif + p_err,ERM.
With the choices made for the parameters, we obtain the claimed bounds.

C. Shadow tomography for classical-quantum states
We can also estimate the empirical risks of all the concepts in a class, a task closely related to shadow tomography. By uniform convergence, these estimates will be close to the true risks. In fact, we can show an improved algorithm for shadow tomography of classical-quantum states. Using this algorithm, we can not only find the minimum empirical risk in the concept class, but also simultaneously estimate the empirical risk of every concept.
Theorem 15 ((Theorem 5, refined) Improved shadow tomography of classical-quantum states). Consider a collection of projector-valued functions C = {f_c : x ↦ Π^{(c)}(x)} with domain X and image in the projectors of a Hilbert space of dimension d. Given access to n copies of a classical-quantum state ρ = Σ_{x∈X} D(x)|x⟩⟨x| ⊗ ρ(x), the algorithm of Theorem 2 used with an ϵ-net of C outputs values {µ_c}_{c∈C} such that, except with probability δ, |µ_c − R(c)| ≤ 3ϵ for all f_c ∈ C, if either one of these conditions holds:

• C is finite. Then the minimal number of copies satisfies n = Õ(log²m · log d · log(1/δ)/ϵ⁴).

• γ_{1,∞} is slowly growing, i.e. lim_{n→∞} log γ_{1,∞}(2n, ϵ/64, C)/n = 0.

In the first case, the probability of error is bounded by the error probability of the ERE algorithm. On the other hand, the probability of error of the uniform convergence is the same as in the previous theorem. By summing the two probabilities of error we get the bound in the thesis. Notice that with Theorem 2 we get estimates of the empirical risks at precision 2ϵ; by Theorem 6 this translates into a precision 3ϵ on the full class.
D. Learnability of quantum processes that map to states (Theorem 4 with loss function L_s)

In this section, we will prove that quantum processes that output states are also efficiently learnable (having shown the analogous statement for processes that output projectors in Section V B). This also constitutes the second half of the statement of Theorem 4. In this setting the empirical risk takes the form of an average trace distance between hypothesis and data. We can now prove Theorem 4 for state-valued functions, with loss function L_s. In fact, we prove a refined statement.

Theorem 16 (Refinement of Theorem 4 for state-valued functions). Suppose the concept class C consists of quantum processes mapping to states, let ϵ > 0 be the accuracy parameter, and suppose that the growth function γ_{1,1} is slowly growing. Given as input a training set S = (x_i, ρ(x_i))_{i=1}^n with x_i drawn from D over X and ρ(x) unknown, there is a learning algorithm A : X^n × L(H^(d))^⊗n → C, obtained from the algorithm of Theorem 12 applied to an ϵ-net of C (according to the appropriate pseudometric defined by x₁, ..., x_n), such that R(c*) ≤ 3 inf_{c∈C} R(c) + 6ϵ with high probability. We can then apply Theorem 12 to the quantum data with the concept class C_⃗x. The probability of error of the algorithm of Theorem 12 is in fact the probability of error of the ERE algorithm for the projector-valued concept class constructed from C_⃗x. Putting everything together, we obtain the full error bound.

VI. CLASSES WITH BOUNDED COVERING NUMBERS
In this section we present some models of physically motivated concept classes for which upper bounds on their covering numbers can be found. We obtain these bounds using continuity arguments and known bounds on covering numbers of real function classes and matrix sets. This opens up the question of whether it is possible to define a combinatorial dimension (such as the VC dimension or the fat-shattering dimension) corresponding to the covering numbers we introduced.
Whenever needed in the following, we employ a real-valued function class F, for which we assume a bound on the covering number in terms of some combinatorial dimension, for example the fat-shattering dimension (see [13] for definitions and proofs of these bounds). For a class of functions F : ℝ → [0, B] with fat-shattering dimension D = fat_F(ϵ/4), the covering number is bounded in terms of D, B, n and ϵ (see [13]).
Furthermore, we can have finite covering numbers for sets of k × k matrices, seen as functions with trivial input to C^k × C^k. These techniques were used for studying generalization bounds in [31], and for uniform convergence in learning quantum channels on multiqubit systems of finite size, in the controlled-input setting, in [8]. The following lemma is crucial.
Lemma 7 (Size of an ϵ-net over unitary and bounded Hermitian matrices). For a set of unitaries S_U in dimension k, viewed as a subset of the unit ball B₁(0) in a real space of dimension K = (2k)², there is an ϵ-net of size at most (3/ϵ)^K; for a set of Hermitian matrices S_H in dimension k with norm bounded by b, which we denote by M_b^k, there is an ϵ-net of size at most (3b/ϵ)^K.

We can use this to calculate the size of ϵ-nets over matrices with respect to the operator norm. The bound on the covering number of the Euclidean ball is given in [15], and the argument generalizes to any norm in a finite-dimensional space. The first inequalities are a consequence of the triangle inequality. For specific matrix classes one can have improved bounds.
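Numerically, the lemma's counting reduces to the standard (1 + 2r/ϵ)^K volume bound for an ϵ-net of a radius-r ball in K real dimensions; the helper names below are our own illustrative choices.

```python
import math

def log_net_size_ball(K, radius, eps):
    # log of the (1 + 2r/eps)^K volume bound for an eps-net of a
    # radius-r ball in a K-dimensional normed space
    return K * math.log(1.0 + 2.0 * radius / eps)

def log_net_size_unitaries(k, eps):
    # unitaries on C^k, viewed inside the unit ball with K = (2k)^2
    return log_net_size_ball((2 * k) ** 2, 1.0, eps)

def log_net_size_hermitian(k, b, eps):
    # Hermitian matrices on C^k with operator norm at most b
    return log_net_size_ball((2 * k) ** 2, b, eps)
```

For example, for 2-qubit unitaries (k = 4) the log-cardinality scales as 64 · log(1 + 2/ϵ), matching the (3/ϵ)^{(2k)²} form of the lemma for small ϵ.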
In the following, we first exhibit concept classes based on circuits, which are valuable because they are associated to quantum states and measurements that can actually be produced efficiently on a quantum computer: not only could the data we receive be states of this form, but the resulting output hypothesis could also be produced efficiently for any future need. In a further subsection we also present concept classes that are more directly motivated by physical scenarios, exhibiting toy models for the setting described in the introduction, inspired by sensing and Hamiltonian learning.

A. Quantum Circuits
We consider concept classes based on circuits. By choosing circuits that do not depend on the data, we will first give several examples of concept classes with covering numbers that are not just slowly growing, but in fact bounded by a quantity independent of n, the length of the sample. Then, including a finite number of gates that depend on the data through real functions with finite fat-shattering dimension, we will obtain concept classes whose covering numbers grow slowly with n.
1. Quantum circuits that give rise to state-valued functions

First, we will look at an example of a concept class that maps to quantum states. We remind readers that this corresponds to choosing the trace distance L_s as the loss function. Consider the concept class C_m ⊆ {c : X → L(H_m)} consisting of m-qubit quantum circuits acting on arbitrary input states. That is, c_U(x) = Uρ(x)U†, where ρ : X → D(H_m) is a fixed process preparing a mixed state that depends on a classical random variable X, and U is an arbitrary m-qubit unitary chosen from a set S_m. For instance, we could be interested in studying processes corresponding to quantum circuits given by a particular architecture (specified by S_m) acting on pure computational-basis states. This class is obtained in our formalism by setting X = {0, 1}^m and noting that the process that prepares the pure computational-basis state |x⟩⟨x| corresponds to ρ(x) = |x⟩⟨x|. Specific relevant examples of S_m will be discussed later.
Then an upper bound on γ_{1,1}(n, ϵ, C_m) is given by the size of an ϵ/8-net over the m-qubit unitaries in S_m, where the distance metric is taken to be the spectral norm ∥·∥ (see Eq. (44)).
In the following, we remind readers that n is the number of data points observed, while m is a parameter of the concept class -the number of qubits in the circuits in the concept class.
Proof. To see this, observe that for an arbitrary state ρ and unitaries U, V it holds that ∥UρU† − VρV†∥₁ ≤ 2∥U − V∥. Indeed, writing the spectral decomposition ρ = Σ_i λ_i |ψ_i⟩⟨ψ_i|, the claim follows from the triangle inequality together with Fact 1. Therefore, for any two functions c_U, c_V ∈ C_m, the empirical distance between c_U and c_V is controlled by ∥U − V∥. Now consider the set C_m^{(ϵ)} induced by an ϵ-net of S_m.

Proposition 2 makes an important reduction: to upper-bound γ_{1,1}(n, ϵ, C_m), where C_m is a concept class induced by a set of m-qubit quantum circuits S_m, it suffices to upper-bound the size of an ϵ-net in the spectral norm over the set S_m.
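The perturbation bound underlying this reduction, ∥UρU† − VρV†∥₁ ≤ 2∥U − V∥, can be checked numerically; this sketch uses illustrative helper names of our own and samples unitaries via QR.

```python
import numpy as np

def random_unitary(d, rng):
    # QR decomposition of a complex Gaussian matrix yields a unitary
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))  # fix column phases

def trace_norm(a):
    return np.linalg.svd(a, compute_uv=False).sum()

rng = np.random.default_rng(0)
d = 4
U, V = random_unitary(d, rng), random_unitary(d, rng)
g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = g @ g.conj().T
rho /= np.trace(rho).real  # random density matrix
# continuity of conjugation: ||U rho U+ - V rho V+||_1 <= 2 ||U - V||
lhs = trace_norm(U @ rho @ U.conj().T - V @ rho @ V.conj().T)
rhs = 2 * np.linalg.norm(U - V, 2)
```

The design point: the trace-norm (ord `'nuc'`/singular-value sum) appears on the left and the spectral norm (`ord=2`) on the right, exactly the two norms paired in the proof.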
We now compute this covering number for circuit classes S m corresponding to particular architectures of relevance.
• Consider the class of one-dimensional local random quantum circuits on m qubits of depth ℓ, defined by Brandão, Harrow and Horodecki [16] as applying a Haar-random 2-qubit nearest-neighbor gate on a uniformly random pair of neighboring qubits at each time step t = 1, ..., ℓ. (Note that the distribution from which the circuits are drawn is actually immaterial to us; the concept class merely consists of the support of that distribution.) We shall call this class LQC(m, ℓ), and we will call the induced loss-function class G_LQC(m,ℓ).
Corollary 1 (Bound on covering net size for 1d random local quantum circuits).
and hence by Proposition 2 the covering number of the induced class is bounded accordingly.

Proof. The argument follows along the lines of those made in [8, 31], so we only sketch it briefly. Let N_ε be an ε = ϵ/ℓ-net over a single 2-qubit unitary; such a net exists with bounded size as a special case of Lemma 7. Now consider the net Ñ_ϵ = {U_{1,p(1)} ⋯ U_{ℓ,p(ℓ)} : U_i ∈ N_ε, p(i) ∈ [m − 1]}, where U_{i,p(i)} denotes the unitary U_i acting on qubits p(i), p(i) + 1. An arbitrary circuit in LQC(m, ℓ) has the form V₁ ⋯ V_ℓ, where each V_i is a 2-qubit nearest-neighbor gate. We now show that Ñ_ϵ is an ϵ-net over the set LQC(m, ℓ). For V_i acting on qubits p(i), p(i) + 1, there exists some U_i ∈ N_ε such that ∥U_i − V_i∥ < ε, and we can enforce that it acts on qubits p(i), p(i) + 1. This means that U₁ ⋯ U_ℓ ∈ Ñ_ϵ, and moreover ∥U₁ ⋯ U_ℓ − V₁ ⋯ V_ℓ∥ ≤ Σ_i ∥U_i − V_i∥ < ℓε = ϵ, where the first inequality is the triangle inequality, the second follows from unitary invariance of the spectral norm, and the full bound follows by repeating these steps on the remaining term in (251), using that N_ε is an ϵ/ℓ-net.
• Exactly analogously, we could also consider the set S_b(m, ℓ) of all 'brickwork' local quantum circuits, where every pair of neighboring qubits has a nearest-neighbor two-qubit gate acting on them at every time step. We then obtain an analogous bound.

• Alternatively, we could consider choosing our unitaries from the set of all possible unitaries on m qubits. Let us call this set S_m. The corresponding bound follows immediately from Lemma 7.

Observe that in all of these cases, the covering-number bound (which determines learnability of the corresponding class) is independent of the number of sampled points, n.
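The telescoping step used in all these proofs is just the triangle inequality plus unitary invariance; a quick numerical check (illustrative names; gates sampled and re-unitarized via QR):

```python
import numpy as np

rng = np.random.default_rng(1)
d, ell = 4, 5

def haar_unitary(d):
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return np.linalg.qr(z)[0]

def spec(a):
    return np.linalg.norm(a, 2)  # spectral norm

def prod(ms):
    out = np.eye(d, dtype=complex)
    for m in ms:
        out = out @ m
    return out

Us = [haar_unitary(d) for _ in range(ell)]
# perturb each gate slightly, then re-unitarize with QR
Vs = [np.linalg.qr(U + 0.01 * haar_unitary(d))[0] for U in Us]
# telescoping bound: ||U1...Ul - V1...Vl|| <= sum_i ||Ui - Vi||
lhs = spec(prod(Us) - prod(Vs))
rhs = sum(spec(U - V) for U, V in zip(Us, Vs))
```

This is why the per-gate net scale ϵ/ℓ suffices: errors accumulate at most additively over the ℓ layers.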

2. Quantum circuits that give rise to projector-valued functions
In the previous subsection, we computed the covering number of state-valued concept classes.In this subsection, we will give an example of a concept class consisting of projector -valued functions, and compute its covering number.
Side note: an immediate example of a projector-valued concept class is, in fact, the state-valued concept class obtained by circuits applied to computational-basis states, as a special case of the settings considered in the previous subsection. This is because the concepts output pure states, which are rank-1 density matrices, i.e. rank-1 projectors. Even for higher ranks, the calculation of the covering number for this concept class proceeds similarly; some slight tweaks are needed to account for the use of a different loss function (namely L_p) and a different notion of distance between concepts (the spectral norm instead of the trace norm).
We present a class of projector-valued concepts that have an explicit x-dependence but are each based on a single circuit. We fix a projector-valued function Π(x) and define C_Π ⊆ {c : X → L(H_m)}, where S_m is a set of circuits. To calculate the covering numbers, it suffices to find the size of an ϵ/8-net over the isometries S_m = {V_h}_h according to the diamond distance; to see that this suffices, we will need two standard consequences of the definition of operator norms.

The previous subsections have studied concept classes based on circuits that are independent of the data. As a result, the upper bounds on the covering numbers do not depend on the size of the dataset. Without deep changes in the structure of the proofs of the bounds, we can also treat cases where this dependence is explicit. Such data-dependent circuits are considered in variational quantum machine-learning models with data re-uploading (see, e.g., [39]). Consider circuits described by a circuit model S of k two-qubit gates, as in the previous sections, and modify it by inserting in specific places in the sequence a number k′ of gates of the form e^{iH_{j′} g_{j′}(x)}, j′ = 1, ..., k′, with ∥H_{j′}∥ ≤ b, H_{j′} fixed, and g_{j′}(x) belonging to a function class F with fat-shattering dimension D, where the functions in the class map to [0, B] as explained at the beginning of this section. Each circuit in the concept class will be made of a sequence of k data-independent gates and k′ data-dependent gates, in specified positions. For any unitarily invariant matrix norm we have [40] ∥e^{iA} − e^{iB}∥ ≤ ∥A − B∥ for Hermitian A, B. Now take two circuits U and V whose data-independent parts satisfy Σ_{j=1}^k ∥U_j − V_j∥ ≤ ϵ/2, and such that ∥g_{j′}(x) − g′_{j′}(x)∥_{1,⃗x} ≤ ϵ/(4bek′) for every j′. By the triangle inequality, for the two state-valued concepts c_U, c_V associated to circuits U, V, repeating the reasoning behind Eq. (249) and the following equations, we obtain that the empirical risks are ϵ-close. The same inequality holds if we take projector-valued classes based on the circuits. This means that the covering number for a data-dependent class of circuits C′ constructed from a qubit model S is bounded by the product of the covering numbers of the data-independent gate sequence and of the function class F.
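The key continuity fact for the data-dependent gates, ∥e^{iaH} − e^{ibH}∥ ≤ |a − b|·∥H∥ for Hermitian H, can be verified numerically; the eigendecomposition-based exponential and all names below are our own illustrative choices.

```python
import numpy as np

def herm_exp(H, t):
    # e^{i t H} for Hermitian H via eigendecomposition
    w, v = np.linalg.eigh(H)
    return (v * np.exp(1j * t * w)) @ v.conj().T

rng = np.random.default_rng(2)
d = 4
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
H = (A + A.conj().T) / 2   # a fixed Hermitian generator H_{j'}
g1, g2 = 0.3, 0.7          # two data-dependent phases g_{j'}(x), g'_{j'}(x)
# ||e^{i g1 H} - e^{i g2 H}|| <= |g1 - g2| * ||H||
lhs = np.linalg.norm(herm_exp(H, g1) - herm_exp(H, g2), 2)
rhs = np.linalg.norm(H, 2) * abs(g1 - g2)
```

This is the inequality that lets a net over the scalar function class F (at scale ϵ/(4bek′)) induce a net over the inserted gates.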

C. Learning physical processes
We now present some concept classes inspired by physical scenarios.We will always assume that the quantum register has dimension d.
We use facts about matrix continuity to obtain a bound on the covering number for our state-valued and projector-valued concept classes.In particular, we need the following.
• For any unitarily invariant matrix norm we have [40] ∥e^{iA} − e^{iB}∥ ≤ ∥A − B∥ for Hermitian A, B.

Proposition 7 (Phase-shifts with position-dependent depth). Consider a spatially local channel that acts as a power of an unknown unitary channel according to some classical variable at position x (for example a thickness), measured as g(x), which we probe with some position-dependent state ρ(x), possibly entangled with a different reference system for each x. We can model this as a state-valued class, or as a projector-valued function class if the states ρ(x) are pure: C = {f | f(x) = U^{g(x)} ρ(x) (U†)^{g(x)}, g ∈ F}. The covering number of this class is bounded by comparing, on the data points, ∥e^{iHg(x_i)} ρ(x_i) e^{−iHg(x_i)} − e^{iH′g′(x_i)} ρ(x_i) e^{−iH′g′(x_i)}∥_q (281), which implies that C_ϵ = {f | f(x) = U^{g(x)} ρ(x)(U†)^{g(x)}, g ∈ F_ϵ, V ∈ M} is an ϵ-net.

Proof. The proof is identical to the one of Proposition 7 in the projector-valued case, exchanging ρ(x) with Π_E(x).
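Combining the two continuity facts gives the estimate behind Eq. (281): ∥e^{iHg}ρe^{−iHg} − e^{iH′g′}ρe^{−iH′g′}∥₁ ≤ 2∥Hg − H′g′∥. A numerical sketch follows; the perturbed generator H′ and the values of g, g′ are illustrative assumptions of ours.

```python
import numpy as np

def herm_exp(H):
    # e^{iH} for Hermitian H via eigendecomposition
    w, v = np.linalg.eigh(H)
    return (v * np.exp(1j * w)) @ v.conj().T

def trace_norm(a):
    return np.linalg.svd(a, compute_uv=False).sum()

rng = np.random.default_rng(3)
d = 4
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
H = (A + A.conj().T) / 2                 # generator H
Hp = H + 0.05 * np.diag(np.arange(d))    # hypothetical perturbed H'
g, gp = 0.4, 0.45                        # g(x) and g'(x) at one data point
psi = np.zeros(d, dtype=complex); psi[0] = 1.0
rho = np.outer(psi, psi.conj())          # pure probe state
U, Up = herm_exp(g * H), herm_exp(gp * Hp)
lhs = trace_norm(U @ rho @ U.conj().T - Up @ rho @ Up.conj().T)
rhs = 2 * np.linalg.norm(g * H - gp * Hp, 2)
```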
Proposition 9 (Projectors on the low-energy subspace of perturbed Hamiltonians). Similarly, we can consider a class that projects onto the low-energy eigenspaces of a set of perturbed Hamiltonians, defining Π_E as the projector onto eigenstates of H₀ + g(x)V with energy less than E, where H₀ has no eigenvalues in the interval [E − 2δ, E + 2δ]; we denote this projector P_E(H₀ + g(x)V).
{δ/B}, (285) where we choose the parameters in such a way that the perturbation is small, ∥g(x)V∥ < δ. The covering number of this class is bounded analogously to the previous propositions.

arXiv:2211.05005v3 [quant-ph] 5 Mar 2024

[Figure: a quantum signal fed into a quantum algorithm; only the panel labels "quantum signal", "Quantum algorithm" and "POSSIBLE" are recoverable.]
1. Threshold search for general product states (summary of Section III)

We are given projectors Π^{(c)}_s, s ∈ [n], c ∈ [m]. Equivalently, we can encode this information into a collection of classical-quantum states σ_c = (1/n) Σ_{s=1}^n |s⟩⟨s| ⊗ σ_c(s), while the information about the input ϱ = ⊗_{s=1}^n ρ(s) is encoded in the classical-quantum state σ = (1/n) Σ_{s=1}^n |s⟩⟨s| ⊗ ρ(s). Then the empirical risk for ϱ and the hypothesis σ_c(s) is simply d_tr(σ, σ_c).
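The identification of the empirical risk with a trace distance of block-diagonal encodings can be checked directly: because the blocks are orthogonal, d_tr(σ, σ_c) equals the average of the per-input trace distances. A small numerical sketch (names are illustrative):

```python
import numpy as np

def trace_norm(a):
    return np.linalg.svd(a, compute_uv=False).sum()

def d_tr(a, b):
    return 0.5 * trace_norm(a - b)

rng = np.random.default_rng(4)
n, d = 3, 2

def rand_state():
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    r = g @ g.conj().T
    return r / np.trace(r).real

rhos = [rand_state() for _ in range(n)]
sigmas = [rand_state() for _ in range(n)]

def encode(states):
    # block-diagonal encoding (1/n) sum_s |s><s| (x) state(s)
    big = np.zeros((n * d, n * d), dtype=complex)
    for s, st in enumerate(states):
        big[s * d:(s + 1) * d, s * d:(s + 1) * d] = st / n
    return big

lhs = d_tr(encode(rhos), encode(sigmas))
rhs = np.mean([d_tr(a, b) for a, b in zip(rhos, sigmas)])
```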
• If the measurement accepts B_c, then the algorithm halts and outputs c; otherwise it passes to c + 1.

O(max(log(Tkm/δ), (log m + C₁)²)), with T = O(log d/ϵ) and k = O(log(T/δ)).

Proof. The algorithm for ERE runs as follows. Prepare the classical guess ρ*₀ as in Lemma 4. Divide the product states into T batches of 2k product states. For t = 1, ..., T, run the algorithm of Lemma 5 on the corresponding batch. If a concept c is selected, use it as an update for the algorithm in Lemma 4 and continue to the next t; otherwise terminate and update the collection of µ_c as obtained from ρ*_t. By Lemma 3, if n ≥ 6Tkl we can obtain 2Tk samples without replacement of length l from [n]. By identifying x_{c,i} = Tr[Π^{(c)}_i ρ_i], we get 2Tk product states.

(221) and setting n = 6Tkl, for large enough T = O(log d/ϵ) and k = O(log(T/δ)), there exist constants C₁, C₂, C₃, C₄ > 0 such that the stated bound holds.

As a special case of Lemma 7, an ε-net over the set of 2-qubit unitaries (a set of matrices with 4 × 4 × 2 = 32 real parameters) has size at most (6/ε)^{32}. Combining this with the fact that there are m − 1 choices of qubit pairs on which the unitary can act, Ñ_ϵ has size at most ((6/ε)^{32} (m − 1))^ℓ. An arbitrary circuit in LQC(m, ℓ) has the form V₁ ⋯ V_ℓ, where each V_i is a 2-qubit nearest-neighbor gate.
Learning projector-valued functions. Input: 2Tk product states {ϱ_s}_{s=1,...,2Tk} and 2Tkm sets of projectors {h_{c,s}}; θ_{c,s} is the random variable obtained by measuring h_{c,s} on ϱ_s. Run Algorithm 2; output θ and the last selected concept, if there is one, otherwise pick one at random.
The probability of error of the ERE, p_err,ERE, is bounded as in the proof of Theorem 2, with m = γ_{1,∞}(6Tkl, ϵ/2, C).

Proof of Theorem 16. The proof is a straightforward generalization of the proof of Theorem 12, using the algorithm of Theorem 15. In this case, C is possibly of infinite cardinality. The algorithm works as follows. Following the notation of Theorem 15, first we measure the classical register of the copies of ρ, obtaining a vector ⃗x and an ϵ-net C_⃗x with cardinality at most the covering number γ_{1,1}(6Tkl, ϵ/2, C). By Theorem 6, this ϵ/2-net gives an ϵ-net for the true risks, with probability of error bounded as p_err,unif ≤ 4γ_{1,1}(12Tkl, ϵ/16, C) e^{−6Tklϵ²/128}. (224) Therefore, if we can find a suitable c* on the net, the total probability of error is bounded by 4γ_{1,1}(12Tkl, ϵ/16, C) e^{−6Tklϵ²/128} + C₄ Tk (γ_{1,1}(6Tkl, ϵ/2, C))² e^{−lϵ²/32}. (223)