General Vapnik–Chervonenkis dimension bounds for quantum circuit learning

Quantifying the model complexity of quantum circuits provides a guide to avoid overfitting in quantum machine learning. Previously we established a Vapnik–Chervonenkis (VC) dimension upper bound for ‘encoding-first’ quantum circuits, where the input layer is the first layer of the circuit. In this work, we prove a general VC dimension upper bound for quantum circuit learning including ‘data re-uploading’ circuits, where the input gates can be single qubit rotations anywhere in the circuit. A linear lower bound is also constructed. The properties of the bounds and approximation-estimation trade-off considerations are discussed.

Using variational quantum circuits [17][18][19] as prediction models in supervised learning leads to the quantum circuit learning (QCL) method [16, 19-21]. In this setting, the learning task is similar to the classical one: the training data set and the predictions are restricted to classical data, and only the hypothesis set is constructed using variational quantum circuits. Theoretical efforts toward understanding the expressive power of QCL have been made by many groups [22][23][24][25][26].
One important question in supervised learning is the learnability of the hypothesis set being used. If the training data set is small but the model complexity is high, a learning machine could overfit to the data noise and hence fail to generalize well for future predictions. The uniform, non-asymptotic theory of generalization for supervised machine learning started with Vapnik–Chervonenkis (VC) theory [27] and is generally known as statistical learning theory [28][29][30][31][32]. The Probably Approximately Correct (PAC) framework proposed by Valiant [33] also includes computational requirements in its original form. For binary classification tasks, VC theory can be used to establish the generalization ability via the VC dimension of the model class [34].
Previous learnability results for quantum machine learning are based on the fat-shattering dimension [35], the pseudo-dimension [36], or quantum sample complexity [37]. Many recent learnability results based on various measures and settings can be found in the literature [38][39][40][41][42][43][44][45]. Another VC-dimension upper bound, different from our result, is proposed in [45]; their bound is related to the dimension of the vector-space sum of images of observable operators, and it is restricted to 'encoding-first' circuits. Caro et al [43] obtain a Rademacher complexity generalization bound for Lipschitz loss functions, which is asymptotically equivalent to our result. Abbas et al [39] and Huang et al [38] provide input-dependent results. Du et al [41] give a generalization result using a covering number bound [46] for Lipschitz loss functions. Bu et al [42] give a Rademacher complexity bound in terms of the L_{p,q} matrix norm of operators.
The limited expressibility of 'encoding-first' quantum circuits was observed by many groups [24,43,47], and the 'data re-uploading' circuit [47] was proposed to resolve this limitation. The learnability of data re-uploading QCL is shown in [43] using Rademacher complexity. Our previous study [48] shows that the growth of the VC dimension saturates for deep QCL in the 'encoding-first' scheme. This is different from classical deep neural networks (with |E| edges and |V| vertices), where the VC dimension grows asymptotically as O(|E| log |E|) (for the sign activation function) or O(|V|^2 |E|^2) (for the sigmoid activation function) [29,31,49,50]. In this work, we extend our previous VC dimension upper bound [48] to include the data re-uploading scheme [51]. The new results also cover more general cases such as mixed initial states and some hardware noise channels. A lower bound is also presented.
This paper is organized as follows. Section 2 provides brief explanations for quantum circuit learning method and statistical learning theory. Section 3 contains the main results and their proofs. Further discussions about the results are presented in section 4.

Preliminaries
Quantum circuit learning and statistical learning theory are introduced in this section.

Quantum circuit learning
For a supervised binary classification learning problem, we are given some classical training data set {(x⃗_j, y_j)}_{j=1}^N ⊂ X × Y. The goal of learning is to obtain a model h : X → Y such that the prediction error (out-of-sample error) is small. The QCL considered in this work uses quantum circuits to construct the hypothesis set H. Figure 1 depicts one example of data re-uploading QCL. For a d-dimensional input vector x⃗ and real variational parameters θ, the circuit gives a unitary evolution U_θ(ϕ⃗(x⃗)) acting on the all-zero initial state |0⟩^⊗n, where n denotes the number of qubits (circuit width). We do not assume any special structure for the variational gates and entanglers, while the encoding method is specified as follows. For each input dimension x_i, a single-qubit rotation gate R_s(ϕ_i(x_i)) is applied to the quantum circuit to upload the data. Data re-uploading means that the gate R_s(ϕ_i(x_i)) may be applied to the circuit several times for an i ∈ {0, . . . , d − 1}. The number n_i denotes the total number of R_s(ϕ_i(x_i)) gates applied for an i ∈ {0, . . . , d − 1}. The measurement result is used to compute the expectation value for some fixed observable O, f_θ(ϕ⃗(x⃗)) = ⟨0^⊗n| U_θ(ϕ⃗(x⃗))† O U_θ(ϕ⃗(x⃗)) |0^⊗n⟩. The expectation value is then thresholded to construct a hypothesis set H_QCL = {x⃗ → sign(f_θ(ϕ⃗(x⃗)) + c) : θ, c ∈ R} for binary classification.
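As a concrete illustration (not the paper's own construction), the setting above can be sketched in numpy for a single qubit: the encoding gate R_Y(ϕ(x)) is uploaded twice (n_0 = 2), interleaved with variational R_Z rotations, and the expectation of the Pauli-Z observable is thresholded. The encoding map ϕ(x) = πx and the gate layout are arbitrary choices for illustration.

```python
import numpy as np

def ry(phi):
    """Single-qubit rotation R_Y(phi) = exp(-i*phi*Y/2)."""
    c, s = np.cos(phi / 2), np.sin(phi / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(t):
    """Single-qubit rotation R_Z(t) = exp(-i*t*Z/2), used here as a variational gate."""
    return np.diag([np.exp(-1j * t / 2), np.exp(1j * t / 2)])

Z = np.diag([1.0, -1.0]).astype(complex)  # the fixed observable O

def f_theta(x, theta):
    """Expectation <0|U^dag O U|0> for a one-qubit data re-uploading circuit:
    the encoding gate R_Y(phi(x)) appears twice, interleaved with R_Z gates."""
    phi = np.pi * x  # encoding map phi(x); the pi scaling is an arbitrary choice
    U = rz(theta[2]) @ ry(phi) @ rz(theta[1]) @ ry(phi) @ rz(theta[0])
    psi = U @ np.array([1.0, 0.0], dtype=complex)
    return np.real(psi.conj() @ Z @ psi)

def h(x, theta, c):
    """Thresholded hypothesis sign(f_theta(x) + c), valued in {-1, +1}."""
    return 1 if f_theta(x, theta) + c >= 0 else -1
```

With all variational angles set to zero, the circuit reduces to R_Y(2ϕ(x)), so f_θ traces out cos(2πx) as x varies.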

Statistical learning theory
Under suitable measure-theoretic assumptions [52], VC theory provides a general theory of generalization ability for binary classification tasks. We use the definition that the generalization error is E_out − E_in, where E_out is the out-of-sample error (the prediction error, with the randomness over the input distribution) and E_in = (1/N) Σ_{j=1}^N 1[h(x⃗_j) ≠ y_j] is the in-sample error computed on the training data set.
The VC dimension is the maximum number of points that can be shattered by the hypothesis set. In general, d VC could be infinite for an uncountable hypothesis set. If d VC is finite, then the generalization ability of the learning machine is guaranteed by the VC bound and the hypothesis set is called 'PAC-learnable.' Several features of VC theory are worth noting [28]: (1) The VC bound is independent of the input distribution. (2) The VC bound is non-asymptotic, so it can be applied when the training data set is small. (3) The VC bound is uniform over the hypothesis set, which means that it is true for all the models in the set. After VC theory, there have been later developments for the generalization ability of learning machines. For real-valued functions, the pseudo-dimension [53] and the fat-shattering dimension [54,55] can be used for generalization bounds. VC theory is also extended to real-valued functions [28]. Rademacher complexity can be used to obtain generalization bounds for classification and regression [32]. PAC-Bayesian bounds are proposed for the Bayesian setting [56][57][58][59]. There are also other generalization bounds which are not the VC bound but use the VC dimension as a measure [60]. Introductions to and comparative studies of these measures can be found in [29,31,32,60].
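For concreteness, the quantitative statement behind these three features can be written down; the following is one common textbook formulation of the VC bound (the specific constants follow standard statements such as [28] and are quoted here as a reminder, not derived in this work). With probability at least $1-\delta$ over the draw of $N$ training samples,

$$
E_{\mathrm{out}}(h) \;\le\; E_{\mathrm{in}}(h) + \sqrt{\frac{8}{N}\,\ln\frac{4\,m_{\mathcal{H}}(2N)}{\delta}} \qquad \text{for all } h \in \mathcal{H},
$$

where $m_{\mathcal{H}}$ is the growth function, which by Sauer's lemma satisfies $m_{\mathcal{H}}(N) \le (eN/d_{\mathrm{VC}})^{d_{\mathrm{VC}}}$ whenever $d_{\mathrm{VC}}$ is finite. The bound holds for any input distribution, for any finite $N$, and uniformly over $\mathcal{H}$, matching the three features listed above.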

Main result
The main results are presented here. The proofs are extensions of the proofs in [48].

Theorem 1. For the data re-uploading hypothesis set H_QCL defined above, where each input dimension i ∈ {0, . . . , d − 1} is uploaded n_i times, the VC dimension satisfies d_VC(H_QCL) ⩽ ∏_{i=0}^{d−1} (2n_i + 1).

Proof.
We claim that f θ ( ⃗ ϕ(⃗ x)) is a real trigonometric polynomial of d variables, and the degree of the polynomial for each variable is at most n i . Then the theorem is proved by Dudley's theorem for VC dimension of thresholded real vector space function classes [29,31,50,61].
The proof of the claim is as follows. The initial density matrix ρ_0 = (|0⟩⟨0|)^⊗n has constant matrix elements. From the assumptions, all the variational unitaries and entanglers do not depend on the input vector x⃗.

Consider an input dimension x_i ∈ [−1, 1] and an encoding mapping ϕ_i(x_i), uploaded by the gate R_Y(ϕ_i) = exp(−iϕ_iY/2) on the kth qubit;
then the action of this gate on the kth qubit of the n-qubit Hilbert space is R_Y(ϕ_i)|_k = cos(ϕ_i/2) I_{2^n} + sin(ϕ_i/2) A, where I_M denotes the M × M identity matrix and A is some constant matrix. The action of this gate on a density matrix ρ is then ρ → R_Y(ϕ_i)|_k ρ R_Y(ϕ_i)|_k^†. If the matrix elements of ρ are trigonometric polynomials of ϕ⃗, then the matrix elements of the updated density matrix R_Y(ϕ_i)|_k ρ R_Y(ϕ_i)|_k^† are trigonometric polynomials where the degree for the variable ϕ_i is increased by at most one. A similar argument works if the dimension is uploaded by other single-qubit rotation gates. Therefore the expectation value can be written as f_θ(ϕ⃗(x⃗)) = Σ_k a_k(θ) f_k(ϕ⃗), where {f_k} is the real trigonometric polynomial basis and {a_k(θ)} are the Fourier coefficients. Since f_θ(ϕ⃗(x⃗)) is a real-valued function, the coefficients a_k(θ) = ⟨f_k|f_θ⟩ ∈ R ∀k. The claim is proved.
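The bounded-degree claim can be checked numerically on a small instance (an illustration, not part of the proof): for a one-qubit circuit in which the encoding angle ϕ is uploaded n_i = 2 times between randomly chosen variational R_Z gates, the expectation value should be exactly representable in the (2n_i + 1)-dimensional trigonometric basis {1, cos kϕ, sin kϕ : k ⩽ n_i}.

```python
import numpy as np

def ry(phi):
    c, s = np.cos(phi / 2), np.sin(phi / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(t):
    return np.diag([np.exp(-1j * t / 2), np.exp(1j * t / 2)])

Z = np.diag([1.0, -1.0]).astype(complex)
rng = np.random.default_rng(0)
t0, t1, t2 = rng.uniform(0, 2 * np.pi, size=3)  # arbitrary variational angles

n_i = 2  # the encoding angle phi is uploaded twice

def f(phi):
    """Expectation <0|U^dag Z U|0> with two uploads of the angle phi."""
    U = rz(t2) @ ry(phi) @ rz(t1) @ ry(phi) @ rz(t0)
    psi = U @ np.array([1.0, 0.0], dtype=complex)
    return np.real(psi.conj() @ Z @ psi)

# Least-squares fit of f to the (2*n_i + 1)-dimensional trig basis
# {1, cos(k*phi), sin(k*phi) : k = 1..n_i}; the residual should vanish.
phis = np.linspace(-np.pi, np.pi, 41)
basis = np.column_stack(
    [np.ones_like(phis)]
    + [np.cos(k * phis) for k in range(1, n_i + 1)]
    + [np.sin(k * phis) for k in range(1, n_i + 1)]
)
vals = np.array([f(p) for p in phis])
coef, *_ = np.linalg.lstsq(basis, vals, rcond=None)
residual = np.linalg.norm(basis @ coef - vals)
```

The residual is zero up to machine precision, consistent with the degree count in the proof: each upload contributes half-angle factors cos(ϕ/2), sin(ϕ/2) to the state, and the quadratic expectation therefore contains harmonics up to n_iϕ.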

Discussions
In this section, we provide some short discussions regarding the obtained theorem.

Applicability of the bounds
Regarding the upper bound, there is no requirement on the structure of the variational (trainable) gates and entangling gates of the circuit, except that they must not contain any input data x_i. There is no requirement on the encoding gates R_s(ϕ_i(x_i)), except that they must not contain any variational parameter. Notice that in practice one usually applies classical post-processing techniques to the output expectation values [16]; the VC dimension bound should be adjusted accordingly.
We provide some extensions.

Corollary 1 (Linear combinations of expectations). If the hypothesis set is the real linear combination of several observables O_i for a fixed circuit, such that H = {x⃗ → sign(Σ_i c_i f_θ^{(i)}(ϕ⃗(x⃗))) : θ, c_i ∈ R}, where f_θ^{(i)} is the expectation value of O_i, then the bound in Theorem 1 is still true.
Proof. Apply Theorem 1 to each O i . The corollary is then a direct consequence of Dudley's theorem.

Corollary 2 (Mixed states). If the initial state is some mixed state ρ which does not depend on the input vector x⃗, such that H_QCL = {x⃗ → sign(Tr[O U_θ(ϕ⃗(x⃗)) ρ U_θ(ϕ⃗(x⃗))†] + c) : θ, c ∈ R}, then the bound in Theorem 1 is still true.
Proof. The proof in Theorem 1 remains true if ρ 0 is an input-independent mixed state density matrix.

Corollary 3 (Kraus operations [62]). If any completely positive trace-preserving map ρ → ∑_k E_k ρ E_k^† is applied to the system, where the E_k's are independent of the input x⃗, then the bound in Theorem 1 is still true.

Proof. If the matrix elements of ρ are trigonometric polynomials of ϕ⃗, then the matrix elements of the updated density matrix ρ′ = ∑_k E_k ρ E_k^† are trigonometric polynomials of degrees no more than those of ρ.

Corollary 3 includes several types of hardware noise channels [4,6]. It does not include situations where the density matrix has to be renormalized, ρ → E_k ρ E_k^† / Tr(E_k ρ E_k^†).

For the special case where n_i = 1, the upper bound and the lower bound together give 3^d ⩾ d_VC ⩾ 2d + 1. For d = 1, the bound is saturated with the exact value d_VC = 3. For larger d, there remains a gap between the exponential upper bound and the linear lower bound to be explored.

Reduction to the previous results
We show how to obtain the special case in our previous work [48] for the ansatz in [20], namely d_VC ⩽ (2n/d + 1)^{2d}. This bound can be obtained from the general bound in Theorem 1 as follows. The encoding used in [20] can be understood as performing the feature maps x_i → x_i^2 to increase the feature dimension from d to 2d. The encoding maps ϕ_i(x_i) = arcsin(x_i) and ϕ′_i(x_i^2) = arccos(x_i^2) are used, uploaded by single-qubit rotation gates. Each dimension is uploaded n_i = n/d times, and hence we get the bound ∏(2n_i + 1) = (2n/d + 1)^{2d}. The lightcone bound can be calculated by counting the n_i covered by the lightcone for a specific ansatz.
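The reduction is a direct product count, which can be written out explicitly (the numbers n = 6, d = 2 are an arbitrary example):

```python
from math import prod

def general_bound(n_uploads):
    """Theorem 1 upper bound: prod_i (2*n_i + 1) over all uploaded features."""
    return prod(2 * n_i + 1 for n_i in n_uploads)

# The encoding of [20] seen through Theorem 1: the feature map x_i -> x_i^2
# doubles the feature dimension to 2d, and each of the 2d features is
# uploaded n/d times (n qubits, d input dimensions).
n, d = 6, 2
uploads = [n // d] * (2 * d)          # 2d features, each uploaded n/d times
special_case = (2 * (n // d) + 1) ** (2 * d)
assert general_bound(uploads) == special_case
```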

Independent of the number of variational gates
Notice that the upper bound is based on counting the number of basis functions, hence it does not depend on the number of variational parameters. This suggests that the bound is asymptotically tight (constant) with respect to the number of trainable parameters, but it cannot be tight in general (the constant could be too large). For example, if the number of variational parameters is zero, then the VC dimension is zero. It would be desirable to also have a scaling with respect to the number of variational gates, as in [40,44].

Approximation-estimation trade-off considerations
To achieve low prediction error in supervised learning, the approximation-estimation trade-off (also known as bias-variance trade-off) should be considered [31,32]. The generalization error bound discussed in this work is only for estimation error.
Barron [63] gives an approximation error bound for the single-layer classical neural network hypothesis set {Σ_{k=1}^n c_k ϕ(w⃗_k · x⃗ + b_k) + c_0}, where ϕ is a sigmoid function and n is the number of nodes. Barron also analyzed the approximation-estimation trade-off of neural networks [64]. It is shown that neural networks have an approximation advantage over linear combinations of fixed basis functions, in the sense that the approximation has a faster convergence rate for high-dimensional inputs.
One attempt to overcome the limitation of fixed basis functions in QCL was actually proposed in [47]: combining neural networks with QCL to construct, for example, a hypothesis set where the affine transformation W·x⃗ + b⃗ is composed with QCL. However, a simple special case {x → sign(sin(Wx)) : W ∈ R} has infinite VC dimension, and hence is not PAC-learnable [28,29,48]. This is because W provides arbitrarily high-frequency oscillations that can shatter arbitrarily many data points. One possible way to resolve this problem could be to use a sigmoid activation function ϕ for encoding; for example, the input x_i could be uploaded by the gate R_s(πϕ(W_i x_i + b_i)). This could be a future direction.
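The infinite VC dimension of the thresholded sine class can be demonstrated with the classic textbook construction (see e.g. [28]): the points x_i = 10^{-i} are shattered because, for any labeling, an explicit frequency W realizes it. The script below verifies all 2^5 labelings of five points.

```python
import numpy as np
from itertools import product

# Classic construction: the points x_i = 10^{-i} are shattered by
# {x -> sign(sin(W x)) : W in R}.  For any labeling y in {0,1}^m the
# frequency W = pi * (1 + sum_i y_i * 10^i) realizes it, so no finite
# set of points bounds the VC dimension.
m = 5
xs = np.array([10.0 ** (-i) for i in range(1, m + 1)])

for y in product([0, 1], repeat=m):
    W = np.pi * (1 + sum(y_i * 10 ** i for i, y_i in enumerate(y, start=1)))
    labels = (np.sin(W * xs) < 0).astype(int)  # 1 where sin(Wx) is negative
    assert tuple(labels) == y                  # every labeling is realized
```

The mechanism: at point x_i the phase W·x_i equals y_iπ plus an even multiple of π plus a small positive fraction of π, so the sign of sin(W·x_i) encodes the label y_i exactly.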

Conclusion
In this work, we give a general VC dimension upper bound and a lower bound for quantum circuit learning, and hence establish the PAC learnability of this hypothesis set. While this result provides a basis for quantum circuit supervised learning, many questions remain. For example, we did not address the issue of sampling error on quantum machines (due to finite readout samples), which could affect the generalization ability. We do not have a bound that scales with the number of trainable parameters. The approximation-estimation trade-off should also be addressed. We do not have experimental results; numerical simulations of overfitting for data re-uploading QCL can be found in [65], where entangling dropout is suggested as a regularization technique to avoid overfitting. It would be desirable to see comparisons between theory and experiments for large-scale circuits. These questions are left for future investigations.

Data availability statement
No new data were created or analyzed in this study.