On the practical usefulness of the Hardware Efficient Ansatz

Variational Quantum Algorithms (VQAs) and Quantum Machine Learning (QML) models train a parametrized quantum circuit to solve a given learning task. The success of these algorithms greatly hinges on appropriately choosing an ansatz for the quantum circuit. Perhaps one of the most famous ansatzes is the one-dimensional layered Hardware Efficient Ansatz (HEA), which seeks to minimize the effect of hardware noise by using native gates and connectivities. The use of this HEA has generated a certain ambivalence arising from the fact that while it suffers from barren plateaus at long depths, it can also avoid them at shallow ones. In this work, we attempt to determine whether one should, or should not, use a HEA. We rigorously identify scenarios where shallow HEAs should likely be avoided (e.g., VQA or QML tasks with data satisfying a volume law of entanglement). More importantly, we identify a Goldilocks scenario where shallow HEAs could achieve a quantum speedup: QML tasks with data satisfying an area law of entanglement. We provide examples for such a scenario (such as Gaussian diagonal ensemble random Hamiltonian discrimination), and we show that in these cases a shallow HEA is always trainable and that there exists an anti-concentration of loss function values. Our work highlights the crucial role that input states play in the trainability of a parametrized quantum circuit, a phenomenon that is verified in our numerics.


Introduction
The advent of Noisy Intermediate-Scale Quantum (NISQ) [1] computers has generated a tremendous amount of excitement. Despite the presence of hardware noise and their limited qubit count, near-term quantum computers are already capable of outperforming the world's largest supercomputers on certain contrived mathematical tasks [2][3][4]. This has started a veritable race to solve real-life tasks of interest on NISQ hardware.
One of the most promising strategies to make practical use of near-term quantum computers is to train parametrized hybrid quantum-classical models. Here, a quantum device is used to estimate a classically hard-to-compute quantity, while one also leverages classical optimizers to train the parameters in the model. When the algorithm is problem-driven, we usually refer to it as a Variational Quantum Algorithm (VQA) [5,6]. VQAs can be used for a wide range of tasks such as finding the ground state of molecular Hamiltonians [7,8], solving combinatorial optimization tasks [9,10] and solving linear systems of equations [11][12][13], among others. On the other hand, when the algorithm is data-driven, we refer to it as a Quantum Machine Learning (QML) model [14,15]. QML can be used in supervised [16,17], unsupervised [18] and reinforcement [19] learning problems, where the data processed in the quantum device can either be classical data embedded in quantum states [16,20], or quantum data obtained from some physical process [21][22][23].
Both VQAs and QML models train parametrized quantum circuits U(θ) to solve their respective tasks. One of the most important aspects in determining the success of these near-term algorithms is the choice of ansatz for the parametrized quantum circuit [24]. By ansatz, we mean the specifications for the arrangement and type of quantum gates in U(θ), and how these depend on the set of trainable parameters θ. Recently, the field of ansatz design has seen a Cambrian explosion where researchers have proposed a plethora of ansatzes for VQAs and QML [5,6]. These include variable structure ansatzes [25][26][27][28][29], problem-inspired ansatzes [30][31][32][33][34] and even the recently introduced field of geometric quantum machine learning, where one embeds information about the data symmetries into U(θ) [35][36][37][38][39][40][41].
Perhaps the most famous, and simultaneously infamous, ansatz is the so-called Hardware Efficient Ansatz (HEA). As its name implies, the main objective of the HEA is to mitigate the effect of hardware noise by using gates native to the specific device being used. This avoids the gate overhead that arises when compiling [52] a non-native gate-set into a sequence of native gates. While the HEA was originally proposed within the framework of VQAs, it is now also widely used in QML tasks. The strengths of the HEA are that it can be as depth-frugal as possible and that it is problem-agnostic, meaning that one can use it in any scenario. However, its wide usability could also be its greatest weakness, as it is believed that the HEA cannot perform well on all tasks [46] (this is similar to the famous no-free-lunch theorem in classical machine learning [53]). Moreover, it was shown that deep HEA circuits suffer from barren plateaus [42] due to their high expressibility [46]. Despite these difficulties, the HEA is not completely hopeless. In Ref. [43], the HEA saw a glimmer of hope, as it was shown that shallow HEAs can be immune to barren plateaus, and thus have trainability guarantees.
From the previous, the HEA was left in a sort of gray area of ansatzes, where its practical usefulness was unclear. On the one hand, there is a common practice in the field of using the HEA irrespective of the problem one is trying to solve. On the other hand, there is a significant push to move away from the problem-agnostic HEA, and instead develop problem-specific ansatzes. However, answers to questions such as "Should we use (if at all) the HEA?" or "What problems are shallow HEAs good for?" have not been rigorously tackled.
In this work, we attempt to determine for which VQA and QML problems HEAs should, or should not, be used. As we will see, our results indicate that HEAs should likely be avoided in VQA tasks where the input state is a product state, as the ensuing algorithm can be efficiently simulated via classical methods. Similarly, we will rigorously prove that HEAs should not be used in QML tasks where the input data satisfies a volume law of entanglement. In these cases, we connect the entanglement in the input data to the phenomenon of cost concentration, and we show that high levels of entanglement lead to barren plateaus, and hence to untrainability. Finally, we identify a scenario where shallow HEAs can be useful and potentially capable of achieving a quantum advantage: QML tasks where the input data satisfies an area law of entanglement. In these cases, we can guarantee that the optimization landscape will not exhibit barren plateaus. Taken together, our results highlight the critical importance that the input data plays in the trainability of a model.

Variational Quantum Algorithms and Quantum Machine Learning
Throughout this work, we will consider two related, but conceptually different, hybrid quantum-classical models. The first, which we will denote as a Variational Quantum Algorithm (VQA) model, can be used to solve the following task.

Definition 1 (Variational Quantum Algorithms).
Let O be a Hermitian operator whose ground state encodes the solution to a problem of interest. In a VQA task, the goal is to minimize a cost function C(θ), parametrized through a quantum circuit U(θ), so as to prepare the ground state of O from a fiduciary state |ψ0⟩.
In a VQA task one usually defines a cost function of the form

C(θ) = Tr[O U(θ)|ψ0⟩⟨ψ0|U†(θ)] ,    (1)

and trains the parameters in U(θ) by solving the optimization task arg min_θ C(θ). While Quantum Machine Learning (QML) models can be used for a wide range of learning tasks, here we will focus on supervised problems.

Definition 2 (Quantum Machine Learning). Let S = {y_s, |ψ_s⟩} be a dataset of interest, where the |ψ_s⟩ are n-qubit states and the y_s associated real-valued labels. In a QML task, the goal is to train a model, by minimizing a loss function L(θ) parametrized through a quantum neural network (i.e., a parametrized quantum circuit U(θ)), to predict labels that closely match those in the dataset.
The exact form of L(θ), and concomitantly the nature of what we want to "learn" from the dataset, depends on the task at hand. For instance, in a binary QML classification task one can minimize an empirical loss function such as the mean-squared error

L(θ) = (1/|S|) Σ_s (y_s − Tr[O_s U(θ)|ψ_s⟩⟨ψ_s|U†(θ)])² ,    (2)

with O_s being a label-dependent Hermitian operator. The parameters in the quantum neural network U(θ) are trained by solving the optimization task arg min_θ L(θ), and the ensuing parameters, along with the loss, are used to make predictions.

Figure 1 (b): The architecture of a HEA seeks to minimize the effect of hardware noise by following the topology, and using the native gates, of the physical hardware. Specifically, we consider the HEA as a one-dimensional alternating layered ansatz of two-qubit gates organized in a brick-like fashion. A first layer of gates is implemented at time t1 and a second layer at time t2. At the end of the computation, a local operator is measured.

While VQAs and QML models share some similarities, they also exhibit some differences. Let us first discuss their similarities. In both frameworks one trains a parametrized quantum circuit, which requires choosing an ansatz for U(θ) and using a classical optimizer to train its parameters. As for their differences, in a VQA task as described in Definition 1 and Eq. (1), the input state to the parametrized quantum circuit U(θ) is usually an easy-to-prepare state |ψ0⟩ such as the all-zero state, or some physically motivated product state (e.g., the Hartree-Fock state in quantum chemistry [5,54]). On the other hand, in a QML task as in Definition 2 and Eq. (2), the input states to U(θ) are taken from the dataset S, and thus can be extremely complex quantum states (see Fig. 1(a)).
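To make the loss in Eq. (2) concrete, here is a minimal numpy sketch of a mean-squared-error QML loss for a toy single-qubit dataset. The dataset, observable, and circuit are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Toy mean-squared-error loss: L = mean_s (y_s - <psi_s|U^dag O U|psi_s>)^2.
# The dataset {(y_s, |psi_s>)} and O = Z are hypothetical placeholders.
def mse_loss(U, states, labels, O):
    preds = [float(np.real(np.conj(U @ s) @ (O @ (U @ s)))) for s in states]
    return float(np.mean([(y - p) ** 2 for y, p in zip(labels, preds)]))

Z = np.diag([1.0, -1.0])                               # measurement operator
states = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # |0>, |1>
labels = [1.0, -1.0]                                   # target labels

print(mse_loss(np.eye(2), states, labels, Z))          # perfect fit: 0.0
```

With the identity circuit the predictions ⟨Z⟩ already match the labels, so the loss vanishes; a training loop would vary U(θ) to minimize this quantity.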

Hardware Efficient Ansatz
As previously mentioned, one of the most important aspects of VQAs and QML models is the choice of ansatz for U(θ). Without loss of generality, we assume that the parametrized quantum circuit is expressed as

U(θ) = Π_l V_l e^(−iθ_l H_l) ,

where the {V_l} are some unparametrized unitaries, the {H_l} are traceless Pauli operators, and where θ = (θ1, θ2, ...). While recently the field of ansatz design has seen a tremendous amount of interest, here we will focus on the HEA, one of the most widely used ansatzes in the literature. Originally introduced in Ref. [55], the term HEA is a generic name commonly reserved for ansatzes aimed at reducing the circuit depth by choosing gates {V_l} and generators {H_l} from a native gate alphabet determined by the connectivity and interactions of the specific quantum computer being used.
As shown in Fig. 1(b), throughout this work we will consider the most depth-frugal instantiation of the HEA: the one-dimensional alternating layered HEA. Here, one assumes that the physical qubits in the hardware are organized in a chain, where the i-th qubit can be coupled with the (i − 1)-th and (i + 1)-th. Then, at each layer of the circuit one connects each qubit with its nearest neighbors in an alternating, brick-like, fashion. We will denote as D the depth, or the number of layers, of the ansatz. This type of alternating-layered HEA exploits the native connectivity of the device to maximize the number of operations at each layer while preventing qubits from idling. For instance, alternating-layered HEAs are extremely well suited for the IBM quantum hardware topology, where only nearest-neighbor qubits are directly connected (see e.g. Ref. [56]). We note that henceforth, when we use the term HEA, we will refer to the alternating-layered ansatz of Fig. 1.
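The brick-like layering described above can be sketched in a few lines of numpy. The following builds the full unitary of a one-dimensional alternating layered circuit; Haar-random two-qubit gates stand in for a device's native parametrized gates (an assumption for illustration only).

```python
import numpy as np

# Sketch of the 1D alternating layered (brick-like) circuit: D layers of
# nearest-neighbour two-qubit gates on a chain of n qubits, with the brick
# offset alternating between layers. Gates are Haar-random placeholders.
def haar_2q(rng):
    g = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
    q, r = np.linalg.qr(g)
    return q * (np.diag(r) / np.abs(np.diag(r)))   # phase fix -> Haar

def two_qubit_on(gate, i, n):
    # embed a 4x4 gate acting on qubits (i, i+1) into the n-qubit space
    return np.kron(np.kron(np.eye(2**i), gate), np.eye(2**(n - i - 2)))

def brick_circuit(n, depth, rng):
    U = np.eye(2**n, dtype=complex)
    for layer in range(depth):
        for i in range(layer % 2, n - 1, 2):       # alternating offset
            U = two_qubit_on(haar_2q(rng), i, n) @ U
    return U

U = brick_circuit(n=4, depth=3, rng=np.random.default_rng(0))
print(np.allclose(U @ U.conj().T, np.eye(16)))     # unitarity check
```

Building the full 2^n x 2^n matrix is only feasible for small n; it is meant to make the layer structure explicit, not to be an efficient simulator.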
Trainability of the HEA

Review of the literature
In recent years, several results about the non-trainability of VQAs/QML have been pointed out [42][43][44][45][46][47][48][49][50][51]. In particular, it has been shown that quantum landscapes can exhibit the barren plateau phenomenon, which is nowadays considered to be one of the most challenging bottlenecks for the trainability of these hybrid models. We say that the cost, or loss, function exhibits a barren plateau if the optimization landscape becomes exponentially flat with the number of qubits. When this occurs, an exponential number of measurement shots are required to resolve and determine a cost-minimizing direction. In practice, the exponential scaling in the precision due to barren plateaus erases the potential quantum advantage, as the VQA or QML scheme will have complexity comparable to the exponential scaling of classical algorithms.
Being more concrete, let f(θ) = C(θ), L_s(θ), i.e., either the cost function C(θ) of a VQA, or the s-th term L_s(θ) in the loss function of a QML setting. For simplicity of notation, we will omit the "s" sub-index of O_s and ψ_s when f(θ) = L_s(θ). In a barren plateau, two types of concentration (or flatness) notions have been explored: deterministic concentration (the whole landscape is flat) and probabilistic concentration (most of the landscape is flat). Let us first define the deterministic notion of concentration:

Definition 3 (Deterministic concentration). Let the trivial value of the cost function be

f_trv := Tr[O]/2^n .

We say that f(θ) is (deterministically) ϵ-concentrated if |f(θ) − f_trv| ≤ ϵ for all θ.
The above definition puts forward a necessary condition for trainability. It is clear that if f(θ) is ϵ-concentrated, then f(θ) must be resolved within an error that scales as ∼ ϵ, i.e., one must use ∼ ϵ^−2 measurement shots to estimate f(θ). Thus, we define a VQA/QML model to be trainable if ϵ vanishes no faster than polynomially with n (ϵ ∈ Ω(1/poly(n))). Conversely, if ϵ ∈ O(2^−n), one requires an exponential number of measurement shots to resolve the quantum landscape, making the model non-scalable to a higher number of qubits. Deterministic concentration was shown in Refs. [57,58], which study the performance of VQA and QML models in the presence of quantum noise and prove that |f(θ) − f_trv| ∈ O(q^D), where 0 < q < 1 is a parameter that characterizes the noise. Using the results therein, it can be shown that if the depth D of the HEA is D ∈ O(poly(n)), then the noise acting through the circuit leads to an exponential concentration around the trivial value f_trv.
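The ∼ ϵ^−2 shot-count argument above can be illustrated with a toy two-outcome measurement (the observable and its expectation value here are hypothetical):

```python
import numpy as np

# Estimating <O> for a +/-1-valued observable from finite shots: the
# statistical error shrinks like shots**-0.5, so resolving a landscape
# concentrated at scale epsilon needs ~ epsilon**-2 shots.
rng = np.random.default_rng(5)
true_val, p = 0.1, 0.55                    # <O> = 2p - 1 = 0.1 (toy value)
errors = {}
for shots in [100, 10_000, 1_000_000]:
    outcomes = rng.choice([1.0, -1.0], size=shots, p=[p, 1 - p])
    errors[shots] = abs(outcomes.mean() - true_val)
print(errors)                              # error decreases roughly as 1/sqrt(shots)
```

If the landscape is ϵ-concentrated with ϵ ∈ O(2^−n), this square-root scaling translates into exponentially many shots, which is the non-scalability discussed above.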
Let us now consider the following definition of probabilistic concentration:

Definition 4 (Probabilistic concentration). Let ⟨·⟩_θ denote the average with respect to the parameters θ, taken over the domains {Θ}. We say that f(θ) is (probabilistically) ϵ-concentrated if

Var_θ[f(θ)] ≤ ϵ² ,

so that, by Chebyshev's inequality, the probability that f(θ) deviates from its mean ⟨f(θ)⟩_θ by more than δ is at most ϵ²/δ².
Here we make an important remark on the connection between probabilistic concentration and barren plateaus. The barren plateau phenomenon, as initially formulated in Ref. [42], indicates that the cost function gradients are concentrated, i.e., that

Var_θ[∂_ν f(θ)] ∈ O(2^−n) ,

where ∂_ν f(θ) := ∂f(θ)/∂θ_ν. However, one can prove that probabilistic cost concentration implies probabilistic gradient concentration, and vice-versa [47]. According to Definition 4, we can again see that if ϵ ∈ O(2^−n), one requires an exponential number of measurement shots to navigate through the optimization landscape. As shown in Ref. [42], such probabilistic concentration can occur if the depth is D ∈ O(poly(n)), as the ansatz then becomes a 2-design [42,59,60].
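The probabilistic notion above can be probed numerically by sampling f(θ) over random parameters and estimating its variance. The following sketch uses a small toy brick circuit (RY rotations plus CNOTs, an illustrative gate choice, not the paper's construction) with a local Z measurement:

```python
import numpy as np

# Sample f(theta) = <0|U(theta)^dag O U(theta)|0> over random parameters
# and estimate Var_theta[f] (Definition 4). Toy brick circuit: each brick
# is CNOT (RY x RY); O = Z on qubit 0. Gate choices are illustrative.
def ry(t):
    return np.array([[np.cos(t/2), -np.sin(t/2)],
                     [np.sin(t/2),  np.cos(t/2)]])

CNOT = np.array([[1,0,0,0],[0,1,0,0],[0,0,0,1],[0,0,1,0]], float)

def embed(gate, i, n):
    return np.kron(np.kron(np.eye(2**i), gate), np.eye(2**(n - i - 2)))

def f(theta, n, depth):
    psi = np.zeros(2**n); psi[0] = 1.0
    k = 0
    for layer in range(depth):
        for i in range(layer % 2, n - 1, 2):
            brick = CNOT @ np.kron(ry(theta[k]), ry(theta[k+1]))
            psi = embed(brick, i, n) @ psi
            k += 2
    O = np.kron(np.diag([1.0, -1.0]), np.eye(2**(n - 1)))  # local Z
    return float(psi @ O @ psi)

rng = np.random.default_rng(1)
n, depth = 4, 4
nparams = 2 * sum(len(range(l % 2, n - 1, 2)) for l in range(depth))
samples = [f(rng.uniform(0, 2*np.pi, nparams), n, depth) for _ in range(200)]
print(np.var(samples))   # non-vanishing at this small size
```

At these tiny sizes no exponential decay is visible, of course; the point of the sketch is only the estimator itself, which is how concentration is typically probed numerically as n grows.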
From the previous, we know that deep HEAs with D ∈ O(poly(n)) can exhibit both deterministic cost concentration (due to noise) and probabilistic cost concentration (due to high expressibility [46]). However, the question remained open of whether the HEA can avoid barren plateaus and cost concentration with sub-polynomial depths. This question was answered in Ref. [43], where it was shown that HEAs can avoid barren plateaus and have trainability guarantees if two conditions are met: locality and shallowness. In particular, one can prove that if D ∈ O(log(n)), then measuring global operators (i.e., O is a sum of operators acting non-identically on every qubit) leads to barren plateaus, whereas measuring local operators (i.e., O is a sum of operators acting on at most k qubits, for k ∈ O(1)) leads to gradients that vanish only polynomially in n.

A new source for untrainability
The discussions in the previous section provide a sort of recipe for avoiding expressibility-induced probabilistic concentration (see Definition 4), and noise-induced deterministic concentration: Use local cost measurement operators and keep the depth of the quantum circuit shallow enough.
Unfortunately, the previous is still not enough to guarantee trainability, as there are other, usually less explored, sources of untrainability. To understand what those are, we will recall a simplified version of the main result in Theorem 2 of Ref. [43]. First, let O act non-trivially only on two adjacent qubits, one of them being the ⌊n/2⌋-th qubit, and let us study the partial derivative ∂_ν f(θ) with respect to a parameter in the last gate acting before O (see Fig. 2). The variance of ∂_ν f(θ) is lower bounded as [43]

Var_θ[∂_ν f(θ)] ≥ G_n(D, ψ, O) ,    (6)

where we recall that D is the depth of the HEA and |ψ⟩ is the input state, and with

G_n(D, ψ, O) := α_D D_HS(O) Σ_(k,k') D_HS(ψ_(k,k')) ,    (7)

where α_D is a positive prefactor that decays at most exponentially in the depth D (so that α_D ∈ Ω(1/poly(n)) when D ∈ O(log(n))). Here,

D_HS(M) := Tr[(M − Tr[M] 1_(d_M)/d_M)²]

is the Hilbert-Schmidt distance between M and Tr[M] 1_(d_M)/d_M, where d_M is the dimension of the matrix M. Moreover, here we defined

ψ_(k,k') := Tr_(i∉[k,k'])[|ψ⟩⟨ψ|]

as the reduced density matrix on the qubits with index i ∈ [k, k']. As such, the ψ_(k,k') correspond to the reduced states of all possible combinations of adjacent qubits in the light-cone generated by O, over which the sum in Eq. (7) runs (see Fig. 2). Here, by light-cone we refer to the set of qubit indexes that are causally related to O via U(θ), i.e., the set of indexes over which U†(θ)OU(θ) acts non-trivially.
Equation (7) provides the necessary condition to guarantee trainability, i.e., to ensure that the gradients do not vanish exponentially. First, one recovers the condition on the HEA that D ∈ O(log(n)). However, a closer inspection of the above formula reveals that both the initial state |ψ⟩ and the measurement operator O also play a key role. Namely, one needs that O, as well as any of the reduced density matrices of |ψ⟩ on any set of adjacent qubits in the light-cone, not be close (in Hilbert-Schmidt distance) to the (normalized) identity matrix. This is due to the fact that if D_HS(O) or the D_HS(ψ_(k,k')) are exponentially small in n, the trainability guarantees are lost (the lower bound in Eq. (6) becomes trivial).
The previous results highlight that one should pay close attention to the measurement operator and the input states.Moreover, these results make intuitive sense as they say that extracting information by measuring an operator O that is exponentially close to the identity will be exponentially hard.Similarly, training an ansatz with local gates on a state whose marginals are exponentially close to being maximally mixed will be exponentially hard.
Here we remark that in a practical scenario of interest, one does not expect O to be exponentially close to the identity. For a VQA, one is interested in finding the ground state of O (see Eq. (1)), and as such, it is reasonable to expect that O is not trivially close to the identity [5,6]. Then, for QML there is additional freedom in choosing the measurement operators O_s in Eq. (2), meaning that one simply needs to choose an operator with non-exponentially-vanishing support on non-identity Pauli operators.
In the following sections, we will take a closer look at the role that the input state can have in the trainability of shallow-depth HEA.

Entanglement and information scrambling
Here we will briefly recall two fundamental concepts: that of states satisfying an area law of entanglement, and that of states satisfying a volume law of entanglement. Then, we will relate the concept of volume law of entanglement with that of scrambling.
First, let us rigorously define what we mean by area and volume laws of entanglement.
Definition 5 (Area and volume law of entanglement). Let Λ be a subsystem of the n qubits, let Λ̄ be its complement, and let ψ_Λ = Tr_Λ̄[|ψ⟩⟨ψ|]. The state |ψ⟩ possesses a volume law for the entanglement within Λ and Λ̄ if

S(ψ_Λ) ∈ Ω(|Λ|) ,

where S(ρ) = −Tr[ρ log(ρ)] is the entropy of entanglement. Conversely, the state possesses an area law for the entanglement within Λ and Λ̄ if

S(ψ_Λ) ∈ O(1) .

Note that the above definition of area vs. volume law of entanglement is nonstandard. In particular, it is completely agnostic to the geometry on which the given state resides. However, as we will show, it is the relevant one for the purposes of this work.
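The two regimes in Definition 5 are easy to see numerically. The sketch below computes S(ψ_Λ) across a bipartition via the Schmidt (singular value) decomposition, contrasting a product state (zero entropy) with a Haar random state, whose subsystem entropy is near-maximal, S ≈ |Λ| ln 2, by Page's argument:

```python
import numpy as np

# Entanglement entropy S(psi_Lambda) = -Tr[rho log rho] of the first k
# qubits, via the singular values of the reshaped state vector.
def entropy_of_subsystem(psi, n, k):
    M = psi.reshape(2**k, 2**(n - k))          # bipartition Lambda | complement
    s = np.linalg.svd(M, compute_uv=False)     # Schmidt coefficients
    p = s**2
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log(p)))

rng = np.random.default_rng(0)
n, k = 10, 3
product = np.zeros(2**n); product[0] = 1.0     # |0...0>: area law, S = 0
haar = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
haar /= np.linalg.norm(haar)                   # Haar-like random state

e_prod = entropy_of_subsystem(product, n, k)
e_haar = entropy_of_subsystem(haar, n, k)
print(e_prod, e_haar)                          # ~0 versus ~ k*ln(2)
```

This is only a small-n illustration of the two scaling behaviors, not a computation from the paper.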
From the definition of volume law of entanglement, the concept of scrambling of quantum information can be easily defined: the information contained in |ψ⟩ is said to be scrambled throughout the system if the state |ψ⟩ follows a volume law for the entropy of entanglement according to Definition 5 across any bipartition such that |Λ| ∈ O(log(n)).
Here we further recall an information-theoretic measure of the quantum information that can be extracted from a subsystem Λ:

Definition 6 (Scrambled information). Given a state |ψ⟩ and a subsystem Λ, let

I_Λ(ψ) := ‖ψ_Λ − 1_Λ/2^|Λ|‖₁ ,

which quantifies the maximum distinguishability between the reduced density matrix ψ_Λ and the maximally mixed state.

The connection with the definition of volume law for entanglement in Definition 5 easily follows. Indeed, if the information contained in |ψ⟩ is scrambled, then given a subsystem Λ with |Λ| ∈ O(log(n)), one has the following bound:

I_Λ(ψ) ∈ O(2^−cn)

for some c > 0.
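A small numeric sketch of this measure (here computed as the trace distance of a reduced state from the maximally mixed state, matching the "maximum distinguishability" reading above, which is an interpretive assumption): for a Haar random state the marginal is nearly maximally mixed, while for a product state it is pure and far from it.

```python
import numpy as np

# I_Lambda(psi): trace-norm distance between the reduced state on the
# first k qubits and the maximally mixed state 1/2^k.
def i_lambda(psi, n, k):
    M = psi.reshape(2**k, 2**(n - k))
    rho = M @ M.conj().T                       # reduced state on Lambda
    diff = rho - np.eye(2**k) / 2**k
    evals = np.linalg.eigvalsh(diff)
    return float(np.sum(np.abs(evals)))        # ||.||_1 of a Hermitian matrix

rng = np.random.default_rng(2)
n, k = 12, 2
psi = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
psi /= np.linalg.norm(psi)
i_haar = i_lambda(psi, n, k)                   # small: marginal ~ maximally mixed

prod = np.zeros(2**n); prod[0] = 1.0
i_prod = i_lambda(prod, n, k)                  # large: marginal is pure
print(i_haar, i_prod)
```

The contrast between the two values is the operational content of the scrambling bound: for the random (volume-law) state, almost nothing is learnable from the small subsystem.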

HEA and volume law of entanglement
As we show in this section, a shallow HEA will be untrainable if the input state satisfies a volume law of entanglement according to Definition 5. Before going deeper into the technical details, let us sketch the idea behind our statement with the following warm-up example.

A toy model
Let us consider for simplicity the case when O is a local operator acting non-trivially on a single qubit, and let us recall that

f(θ) = Tr[O U(θ)|ψ⟩⟨ψ|U†(θ)] .

In the Heisenberg picture we can interpret f(θ) as the expectation value of the backwards-in-time evolved operator O(θ) = U†(θ)OU(θ) over the initial state |ψ⟩. Thanks to the brick-like structure of the HEA, one can see from a simple geometrical argument that the operator O(θ) will act non-trivially only on a set Λ containing (at most) 2D qubits (see Fig. 2, and also see below for the rigorous proof). Thus, we can compute f(θ) as

f(θ) = Tr_Λ[O(θ) ψ_Λ] , with ψ_Λ = Tr_Λ̄[|ψ⟩⟨ψ|] ,

where Λ̄ is the complement set of Λ, and where we assume |Λ| ≪ |Λ̄|. Since the HEA is shallow, the cost function is evaluated by tracing out the majority of the qubits, i.e., |Λ̄| ∼ n. If the input state |ψ⟩ is highly entangled, since |Λ| ≪ |Λ̄|, and thanks to the monogamy of entanglement [61], we can assume that there is a subset of Λ̄, say Λ′, maximally entangled with Λ, i.e.,

|ψ⟩ = √(1 − ϵ²) |Φ⟩_(ΛΛ′) ⊗ |χ⟩ + ϵ|ϕ⟩ ,

where |Φ⟩_(ΛΛ′) is a maximally entangled state between Λ and Λ′, ϵ ≪ 1, and |ϕ⟩ is orthogonal to the rest. Neglecting terms in ϵ², and choosing ∥O∥∞ = 1, this results in a function 2ϵ-concentrated around its trivial value

f_trv = Tr[O]/2^n .

The previous shows that a highly entangled state, such as the one presented above, which satisfies a volume law of entanglement, will lead to a landscape that exhibits a deterministic exponential concentration according to Definition 3.
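The toy model above can be checked numerically. The sketch below applies the same random shallow brick circuits (random two-qubit gates are illustrative stand-ins for native parametrized gates) to a product input and to a Haar random (volume-law) input, measuring Z on qubit 0: cost values for the entangled input cluster tightly around the trivial value Tr[O]/2^n = 0, while the product input shows an order-one spread.

```python
import numpy as np

def haar_2q(rng):
    g = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
    q, r = np.linalg.qr(g)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def apply_2q(psi, gate, i, n):
    # apply a two-qubit gate on qubits (i, i+1) without building 2^n x 2^n matrices
    psi = psi.reshape(2**i, 4, 2**(n - i - 2))
    return np.einsum('ab,ibj->iaj', gate, psi).reshape(-1)

def cost(psi, circuit, n):
    for gate, i in circuit:
        psi = apply_2q(psi, gate, i, n)
    probs = np.abs(psi.reshape(2, -1))**2      # O = Z on qubit 0
    return float(probs[0].sum() - probs[1].sum())

rng = np.random.default_rng(3)
n, depth = 10, 2
prod = np.zeros(2**n, complex); prod[0] = 1.0  # product (area-law) input
haar = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
haar /= np.linalg.norm(haar)                   # Haar random (volume-law) input

f_prod, f_haar = [], []
for _ in range(100):
    circ = [(haar_2q(rng), i) for layer in range(depth)
            for i in range(layer % 2, n - 1, 2)]
    f_prod.append(cost(prod, circ, n))
    f_haar.append(cost(haar, circ, n))

print(np.std(f_prod), np.std(f_haar))          # product spread >> entangled spread
```

Even at n = 10 the separation between the two spreads is clearly visible, echoing the role of the input state emphasized in the text.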

Formal statement
In this section, we will present a deterministic concentration result. To begin, let us introduce the following definition:

Definition 8 (Support of a Pauli operator). Let P be a Pauli operator. We define the support supp(P) as the ordered set of natural numbers q_i labeling the qubits on which P acts non-trivially,

supp(P) := {q₁, ..., q_S} ,

with S the number of qubits on which P acts non-trivially.
We are finally ready to state a deterministic concentration result based on the information-theoretic measure I_Λ(ψ) for HEA circuits and for f(θ) = C(θ), L_s(θ), i.e., the VQA cost function or the QML loss function.

Theorem 1 (Concentration and measurement operator support). Let U(θ) be a HEA with depth D, and let O = Σ_i c_i P_i be a measurement operator written in terms of Pauli operators P_i. Denote by Λ_i the light-cone of P_i under U(θ), and let Λ be a light-cone of maximal size, |Λ| = max_i |Λ_i|. Then,

|f(θ) − f_trv| ≤ ∥O∥∞ I_Λ(ψ) .    (18)

Then, the following bound on the size of Λ holds:

|Λ| ≤ 2D + max_i |supp(P_i)| .    (19)

See App. A for the proof.
Let us discuss the implications of Theorem 1. First, we find that the difference between the training function f(θ) and its trivial value f_trv depends on the information-theoretic measure of information scrambling I_Λ(ψ) for |Λ| = max_i |Λ_i| (see Definition 6). From Eq. (19) it is clear that the size of Λ is determined by two factors: (i) the depth of the circuit D, and (ii) the locality of the operator O. As soon as either the depth D or max_i supp(P_i) starts scaling with the number of qubits n, the bound in Eq. (18) becomes trivial, as one can obtain information by measuring a large enough subsystem with |Λ| ∈ Θ(n). However, as explained above in Sec. 3, we already know that this regime is precluded, as the necessary requirements to ensure the trainability of the HEA (and thus to ensure trainability of the VQA/QML model) are: (i) the depth of the HEA circuit must not exceed O(log(n)), and (ii) the operator O must have local support on at most O(log(n)) qubits. Hence, the trainability of the model is solely determined by the scaling of I_Λ(ψ).
From Theorem 1, we can derive the following corollaries.

Corollary 1. Let U(θ) be a HEA with depth D ∈ O(log(n)), and let O be a local measurement operator. Then, if |ψ⟩ satisfies a volume law, or alternatively if the information contained in |ψ⟩ is scrambled, i.e., if I_Λ(ψ) ∈ O(2^−cn) for some c > 0, and if ∥O∥∞ ∈ Ω(1), then:

|f(θ) − f_trv| ∈ O(2^−cn) .

Here we can see that if the information contained in |ψ⟩ is too scrambled throughout the system, one has deterministic exponential concentration of cost values according to Definition 3.
Theorem 1 puts forward another important necessary condition for trainability and to avoid deterministic concentration: the information in the input state must not be too scrambled throughout the system.When this occurs, the information in |ψ⟩ cannot be accessed by local measurements, and hence one cannot train the shallow depth HEA.
At this point, we ask the question of how typical it is for a state to contain information scrambled throughout the system, hidden in non-local degrees of freedom, and resulting in I_Λ(ψ) ∈ O(2^−cn). To answer this question, we use tools from the Haar measure and show that for the overwhelming majority of states, the information cannot be accessed by local measurements, as it is too scrambled:

Corollary 2. Let |ψ⟩ be a Haar random state on n qubits, and let Λ be a subsystem with |Λ| ∈ O(log(n)). Then, with overwhelming probability, I_Λ(ψ) ∈ O(2^−cn) for some c > 0.
See App. A for a proof. Note that henceforth we will refer to overwhelming probability as a probability 1 up to an exponentially (in n, the size of the system) decaying correction. In many tasks, multiple copies of a quantum state |ψ⟩ are used to predict important properties, such as entanglement entropy [62][63][64], quantum magic [65][66][67], or state discrimination [68]. Thus, it is worth asking whether a function of the form f(θ) = Tr[O U(θ)(|ψ⟩⟨ψ|)^⊗2 U†(θ)] can be trained when U(θ) is a shallow HEA acting on 2n qubits. In the following corollary, we prove that for the overwhelming majority of states, I_Λ(ψ^⊗2) ∈ O(2^−n), and one has deterministic concentration according to Definition 3 even if one has access to two copies of a quantum state.

Corollary 3. Suppose one has access to 2 copies of a Haar random state |ψ⟩ and one computes the function

f(θ) = Tr[O U(θ)(|ψ⟩⟨ψ|)^⊗2 U†(θ)] .

Let D ∈ O(log(n)) be the depth of a HEA U(θ) acting on 2n qubits. Then, with overwhelming probability over the choice of |ψ⟩, one has I_Λ(ψ^⊗2) ∈ O(2^−n), where the probability of a larger deviation is bounded by 2e^(−c2^n) with c = (18π³)^−1, and:

|f(θ) − f_trv| ∈ O(2^−n) .

See App. A for a proof. Note that the generalization to more copies is straightforward.
The above results show us that there is indeed a no-free-lunch for the shallow HEA. The majority of states in the Hilbert space follow a volume law for the entanglement entropy and thus have quantum information hidden in highly non-local degrees of freedom, which cannot be accessed through local measurements at the output of a shallow HEA.

HEA and area law of entanglement
The previous results indicate that shallow HEAs are untrainable for states with a volume law of entanglement, i.e., they are untrainable for the vast majority of states. The question still remains of whether shallow HEAs can be used if the input states follow an area law of entanglement as in Definition 5. Surprisingly, we can show that in this case there is no concentration, as the following result holds.

Theorem 2 (Anti-concentration of expectation values). Let U(θ) be a shallow HEA with depth D ∈ O(log(n)), where each local two-qubit gate forms a 2-design on two qubits. Then, let O = Σ_i c_i P_i be a measurement operator composed of, at most, polynomially many traceless Pauli operators P_i having support on at most two neighboring qubits, and where Σ_i c_i² ∈ O(poly(n)). If the input state follows an area law of entanglement, then for any set of parameters θ_B and θ_A = θ_B + ê_AB l_AB with l_AB ∈ Ω(1/poly(n)),

⟨(f(θ_A) − f(θ_B))²⟩ ∈ Ω(1/poly(n)) .

See App. B for the proof.
Theorem 2 shows that if the input states to the shallow HEA follow an area law of entanglement, then the function f(θ) anti-concentrates. That is, one can expect that the loss function values will differ (at least polynomially) at sufficiently different points of the landscape. This naturally suggests that the cost function does not have barren plateaus or exponentially vanishing gradients. In fact, we can prove this intuition to be true, as formalized in the following result.

Proposition 1 (Absence of barren plateaus). Under the conditions of Theorem 2,

Var_θ[∂_ν f(θ)] ∈ Ω(1/poly(n)) .

See App. B for the proof.

Taken together, Theorem 2 and Proposition 1 suggest that shallow HEAs are ideal for processing states with an area law of entanglement, as the loss landscape is immune to barren plateaus. Evidently, this does not yet imply that such shallow HEAs are capable of achieving a quantum advantage. While determining whether a quantum advantage is feasible for such ansatzes is beyond the scope of this work (as it requires a detailed analysis of properties beyond the absence of barren plateaus, such as quantifying the presence of local minima), we can still further identify scenarios where a quantum advantage could potentially exist.
First, let us rule out certain scenarios where a provable quantum advantage will be unlikely.These correspond to cases where the input state |ψ⟩ satisfies an area law of entanglement but also admits an efficient classical representation [69][70][71][72][73].
The key issue here is that if the input state admits a classical decomposition, then the expectation value f(θ), for U(θ) a shallow HEA, can be efficiently classically simulated [74]. For instance, one can readily show that the following result holds.
Proposition 2 (Cost of classically computing f(θ)). Let U(θ) be an alternating layered HEA of depth D, and O = Σ_i c_i P_i. Let |ψ⟩ be an input state that admits a Matrix Product State (MPS) [75] description with bond dimension χ. Then, there exists a classical algorithm that can estimate f(θ) with a complexity that scales as O((χ · 4^D)³).
The proof of the above proposition can be found in Ref. [75]. From the previous proposition, we can readily derive the following corollary.

Corollary 4. Shallow-depth HEAs with depth D ∈ O(log(n)), and with an input state with a bond dimension χ ∈ O(poly(n)), can be efficiently classically simulated with a complexity that scales as O(poly(n)).
Note that Proposition 2 and its concomitant Corollary 4 do not preclude the possibility that shallow HEAs can be useful even if the input state admits an efficient classical description. This is due to the fact that, while requiring computational resources that scale polynomially with n (if χ is at most polynomially large in n), the order of the polynomial can still lead to prohibitively large (albeit polynomially growing) computational resources. Still, we will not focus on this fine line; instead, we will attempt to find scenarios where a quantum advantage can be achieved.
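The MPS simulation underlying Proposition 2 can be sketched in a few lines: starting from a product state (bond dimension χ = 1), each two-qubit gate is absorbed by contracting two neighboring tensors, applying the gate, and splitting back with an SVD, so the bond dimension grows by at most a constant factor per layer. The gates below are random unitaries standing in for HEA gates (an illustrative assumption).

```python
import numpy as np

# Minimal MPS sketch: apply one brick layer of two-qubit gates to |0...0>
# (an MPS with all bond dimensions 1) and watch the bond dimensions stay bounded.
def apply_gate_mps(A, B, gate):
    # A: (l, 2, m), B: (m, 2, r) -> contract, apply gate on the two physical
    # legs, split back with an exact SVD
    l, _, _ = A.shape
    r = B.shape[2]
    theta = np.einsum('lam,mbr->labr', A, B).reshape(l, 4, r)
    theta = np.einsum('cd,ldr->lcr', gate, theta).reshape(l * 2, 2 * r)
    U, S, Vh = np.linalg.svd(theta, full_matrices=False)
    keep = S > 1e-12                            # drop exact zeros only
    U, S, Vh = U[:, keep], S[keep], Vh[keep]
    return U.reshape(l, 2, -1), (S[:, None] * Vh).reshape(-1, 2, r)

n = 6
mps = [np.zeros((1, 2, 1), complex) for _ in range(n)]
for T in mps:
    T[0, 0, 0] = 1.0                            # |0...0>, chi = 1

rng = np.random.default_rng(4)
for i in range(0, n - 1, 2):                    # one brick layer
    g = np.linalg.qr(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))[0]
    mps[i], mps[i + 1] = apply_gate_mps(mps[i], mps[i + 1], g)

print([T.shape[2] for T in mps])                # bond dimensions remain small
```

Repeating this for D layers keeps χ ≤ 4^D, which is the origin of the O((χ · 4^D)³) cost quoted in Proposition 2 (here only the single-layer mechanics are shown).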
In particular, we highlight the seminal work of Ref. [76], which indicates that while states satisfying an area law of entanglement constitute just a very small fraction of all states (which is expected from the fact that Haar random states, the vast majority of states, satisfy a volume law), the subset of such area-law states that admit an efficient classical representation is exponentially small. This result is visualized in Fig. 3. This gives hope that one can achieve a quantum advantage with area-law, classically-unsimulable states.

Implications of our results
Let us here discuss how our results can help identify scenarios where shallow HEAs can be useful, and scenarios where they should be avoided. The vast majority of states satisfy a volume law, and hence a shallow HEA cannot be used to extract information from them. From the set of states satisfying an area law, only a very small subset admits an efficient classical representation. For these states, the effect of a shallow HEA can be efficiently simulated. As such, there exists a Goldilocks regime where the HEA can potentially be used to achieve a quantum advantage: non-classically-simulable area-law states.

Implications to VQAs
As indicated in Definition 1, in a VQA one initializes the circuit to some easy-to-prepare fiduciary quantum state |ψ0⟩. For instance, in a variational quantum eigensolver [7] quantum chemistry application, such an initial state is usually the unentangled mean-field Hartree-Fock state [54]. Similarly, when solving a combinatorial optimization task with the Quantum Approximate Optimization Algorithm [9], the initial state is an equal superposition of all elements of the computational basis, |+⟩^⊗n. In both of these cases, the initial states are separable, satisfy an area law, and admit an efficient classical decomposition. This means that while the shallow HEA will be trainable, it will also be classically simulable. This situation arises for most VQA implementations, as it is highly uncommon to prepare non-classically-simulable initial states. From the previous, we can see that the shallow HEA should likely be avoided in VQA implementations if one seeks a quantum advantage.
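The classical simulability of such product initial states can be made explicit: |+⟩^⊗n is an MPS with bond dimension χ = 1, and local expectation values can then be computed in O(n) time. The sketch below is an illustrative numpy construction (the helper name is ours), not the algorithm of Ref. [75].

```python
import numpy as np

# |+>^{⊗n} as a bond-dimension-1 MPS: one (1, 2, 1) tensor per qubit
n = 50
plus = (np.array([1.0, 1.0]) / np.sqrt(2)).reshape(1, 2, 1)
mps = [plus] * n

def local_expectation(mps, op, site):
    """<psi| op_site |psi> for a product (chi = 1) MPS, in O(n) time."""
    val = 1.0 + 0j
    for q, A in enumerate(mps):
        a = A[0, :, 0]                      # the single-qubit amplitude vector
        val *= a.conj() @ (op @ a) if q == site else a.conj() @ a
    return val

X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])
print(local_expectation(mps, X, 7))  # ≈ 1: |+> is the +1 eigenstate of X
print(local_expectation(mps, Z, 7))  # ≈ 0
```

A 50-qubit state is far beyond brute-force statevector simulation, yet these expectation values cost a handful of 2x2 contractions, illustrating why such VQA inputs fall in the classically simulable regime.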
problem-dependent, implying that the usability of the HEA depends on the task at hand. Our results indicate that HEAs should be avoided when the input states satisfy a volume law of entanglement, or when they follow an area law but also admit an efficient classical description. In fact, while the HEA is widely used in the literature, most cases where it is employed fall within the regimes where it should be avoided [45]. As such, we expect that many proposals in the literature should be revised. However, the trainability guarantees pointed out in this work narrow down the scenarios where the HEA should be used, and leave the door open for using shallow HEAs in QML tasks to analyze non-classically-simulable area-law states. In the following section, we give an explicit example, based on state discrimination between area-law states having no efficient MPS decomposition, with a possibly achievable quantum advantage.

Random Hamiltonian discrimination

7.1 General framework
In this section, we present an application of our results in a QML setting based on Hamiltonian discrimination. The QML problem is summarized as follows: the dataset contains states that are obtained by evolving an initial state either by a general Hamiltonian or by a Hamiltonian possessing a given symmetry. The goal is to train a QML model to distinguish between states arising from these two evolutions. In the example below, we show how the role of entanglement governs the success of the QML algorithm.
Let us begin by formally stating the problem. Consider two Hamiltonians H_G and H_S, and a local symmetry operator S (the dataset is built following Algorithm 1). We consider the case when U(θ) is a parametrized shallow HEA, and O is a local operator measured at the output of the circuit. We define the per-sample loss L_s(H, θ, t) as the expectation value of O over the time-evolved state, L_s(H, θ, t) = ⟨ψ_s| e^{iHt} U†(θ) O U(θ) e^{−iHt} |ψ_s⟩. In the following, we will drop the superscript in |ψ_s⟩ ∈ S to lighten the notation, unless necessary. Then, the goal is to minimize the empirical loss function L(θ, t), obtained by averaging the L_s over the dataset, where N is the size of the dataset S. There are two necessary conditions for the success of the algorithm: (i) the parameter landscape is not exponentially concentrated around its trivial value, and (ii) there exists θ_0 such that the model outputs are different for data in distinct classes. For instance, this can be achieved if U†(θ_0) O U(θ_0) = S, as here L_s(H_S, θ_0, t) = 1 for any s such that |ψ_s⟩ ≡ |ψ^{H_S}_s⟩_t. Then, one also needs L_s(H_G, θ_0, t) to not be close to one with high probability. Note that if the symmetry S is a local operator, and O is chosen to be local, there are cases in which a shallow-depth HEA can find the solution U†(θ_0) O U(θ_0) = S. Such an example is shown below.

Gaussian Diagonal Ensemble Hamiltonian discrimination
Let us now specialize the example to an analytically tractable problem. We first show how the growth of the evolution time t, and thus the entanglement generation, affects the HEA's ability to solve the task. Then, we show that there exists a critical time t* for which the states in the dataset satisfy an area law, and thus for which the QML algorithm can succeed. Since classically simulating random Hamiltonian evolution is a difficult task, the latter constitutes an example where a QML algorithm can enjoy a quantum speed-up with respect to classical machine learning.
Let H_G be a random Hamiltonian, i.e., H_G = ∑_i E_i Π_i, where the Π_i are projectors onto Haar-random states, and the E_i are normally distributed around 0 with standard deviation 1/2 (see App. C for additional details). This ensemble of random Hamiltonians is called the Gaussian Diagonal Ensemble (GDE), and it is the simplest non-trivial example where our results apply. In Fig. 5 we explicitly show how the time evolution under such a Hamiltonian can be implemented in a quantum circuit. Generalizations to more widely used ensembles, such as the Gaussian Unitary Ensemble (GUE), the Gaussian Symplectic Ensemble (GSE), the Gaussian Orthogonal Ensemble (GOE), or the Poisson Ensemble (P), are straightforward. We refer the reader to Refs. [77,78] for more details on these techniques.
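A GDE Hamiltonian can be sampled directly from this definition: draw a Haar-random eigenbasis and i.i.d. Gaussian eigenvalues. The following is a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def sample_gde_hamiltonian(n_qubits, rng):
    """Sample H = sum_i E_i |u_i><u_i| with a Haar-random eigenbasis and
    eigenvalues E_i ~ N(0, 1/2), i.e. a Gaussian Diagonal Ensemble draw."""
    d = 2 ** n_qubits
    # Haar-random unitary via QR of a complex Ginibre matrix
    z = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    q, r = np.linalg.qr(z)
    q = q * (np.diag(r) / np.abs(np.diag(r)))   # fix the phase ambiguity
    energies = rng.normal(0.0, 0.5, size=d)     # standard deviation 1/2
    return (q * energies) @ q.conj().T          # q @ diag(E) @ q^dagger

rng = np.random.default_rng(7)
H_G = sample_gde_hamiltonian(3, rng)
print(np.allclose(H_G, H_G.conj().T))           # Hermitian by construction
```

Sampling the full dense eigenbasis costs O(d^3) classically, which is exactly why such evolutions are hard to simulate at large n; on a quantum device, Fig. 5 shows the corresponding circuit implementation.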
Consider a bipartition of the n qubits, A ∪ B, such that |A| ≪ |B|. Let H_S be a random Hamiltonian commuting with all the operators on the local subsystem A, i.e., [H_S, P_A] = 0 for all P_A. We can choose H_S as H_S = 1_A ⊗ H_B, with H_B belonging to the GDE on the subsystem B. Let H_G be a random Hamiltonian belonging to the GDE on the subsystem A ∪ B. Since the Hamiltonian H_S commutes with all the operators in A, we choose the symmetry S to be S ≡ P_A ⊗ 1_B, i.e., a Pauli operator with local support on A. To build the dataset S, we thus identify the vector space containing all the eigenvectors of P_A with eigenvalue 1, V_{P_A} = span{|z⟩ ∈ (C^2)^⊗n | P_A |z⟩ = |z⟩}, and follow Algorithm 1. Note that, with this choice, dim(V_{P_A}) = 2^{|A|−1}, and thus we take |A| ≥ 2. The QML task is to distinguish states evolved in time by H_G from those evolved by H_S. Let us choose O to be a Pauli operator with support on a local subsystem. Then, the following proposition holds: Proposition 3. Let L_s(H, θ, t) be the expectation value defined in Eq. (24), for H ∈ {H_G, H_S}. If there exists θ_0 such that U†(θ_0) O U(θ_0) = S, then L_s(H_S, θ_0, t) = 1, while the GDE average of L_s(H_G, θ_0, t) is exponentially suppressed in t. See App. C for the proof. Notably, the symmetry of H_S ensures that if the HEA is able to find θ_0, then the output L_s(H_S, θ_0, t) = 1 is distinguishable from the expected value of L_s(H_G, θ_0, t), which is exponentially suppressed in t. While in principle it is possible to minimize the loss function L(θ, t) for any t, the following theorem states that as the time t grows, the parameter landscape gets increasingly concentrated, according to Definition 3.
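The construction above can be checked in a few lines of numpy. In the sketch below (our illustration, with |A| = 2 and |B| = 3), a generic random Hermitian matrix stands in for the GDE sample on B, which is enough to verify the commutation [H_S, S] = 0 and the invariance of ⟨S⟩ under the symmetric evolution.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = np.diag([1.0 + 0j, -1.0])
I2 = np.eye(2, dtype=complex)

def kron_all(ops):
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

nA, nB = 2, 3
dA, dB = 2 ** nA, 2 ** nB

# random Hermitian H_B on subsystem B (a stand-in for a GDE sample)
M = rng.standard_normal((dB, dB)) + 1j * rng.standard_normal((dB, dB))
H_B = (M + M.conj().T) / 2

H_S = np.kron(np.eye(dA), H_B)          # H_S = 1_A (x) H_B
S = kron_all([Z, I2] + [I2] * nB)       # S = P_A (x) 1_B, with P_A = Z (x) 1

assert np.allclose(H_S @ S - S @ H_S, 0)   # the symmetry commutes with H_S

# take a dataset state |z> with P_A |z> = |z> and evolve it under H_S
z = np.zeros(dA * dB, dtype=complex); z[0] = 1.0   # |00...0> lies in V_{P_A}
E, V = np.linalg.eigh(H_S)
t = 1.7
psi_t = V @ (np.exp(-1j * E * t) * (V.conj().T @ z))
# <psi_t| S |psi_t> stays 1 under the symmetric evolution (up to float error)
print(np.real(psi_t.conj() @ S @ psi_t))
```

This is precisely the mechanism behind L_s(H_S, θ_0, t) = 1: the dataset states are +1 eigenstates of S, and the symmetric evolution never rotates them out of that eigenspace.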
Theorem 3 (Concentration of loss for GDE Hamiltonians). Let L_s(H, θ, t) be the expectation value defined in Eq. (24) for H ∈ {H_G, H_S}. For random GDE Hamiltonians, one has the concentration bound proven in App. C. Note that the bound holds for both H_S and H_G, provided that |A| ≪ |B|. From the above, one can readily derive the following corollary: Corollary 5. Let L_s(H, θ, t) be the expectation value defined in Eq. (24) for H ∈ {H_G, H_S}. Then, for t ≥ 4α√n/log₂e, with α > 0 and ϵ = e^{−βn}, the loss concentrates with overwhelming probability. Taken together, Theorem 3 and Corollary 5 provide a no-go theorem for the success of the QML task, as they indicate that beyond t ∼ √n one encounters deterministic concentration with overwhelming probability. Crucially, the role of the entanglement generated by H ∈ {H_G, H_S} is hidden in the variable t of the bound in Theorem 3. Indeed, as shown in Refs. [77,78], the entanglement for random GDE Hamiltonians grows monotonically with t.
While the previous results indicate that a HEA-based QML model will fail on the random Hamiltonian discrimination task for t ∼ √n (due to high levels of entanglement), this does not preclude the possibility of the model succeeding for smaller evolution times. Notably, here we can show that for t ∈ O(log(n)) the conditions are ideal for a quantum advantage: the states in the dataset will satisfy an area law of entanglement, and, since GDE Hamiltonians are built out of a very deep random quantum circuit, their time evolution can be classically hard. In particular, the following theorem holds.
Proof. The corollary follows directly from Proposition 4, Theorem 2, and Proposition 1.
As shown above, for t ∈ O(log(n)), the states generated by the time evolution of GDE Hamiltonians obey an area law of entanglement with overwhelming probability. Thanks to Theorem 2, we also have that the loss function L(θ, t) anti-concentrates, giving strong evidence for the success of the Hamiltonian discrimination QML task.

Numerical simulations
In this section, we present numerical results which further explore the connection between the entanglement in the input state and the phenomenon of gradient concentration. In particular, we are interested in showing how the parameter landscape of a QML problem becomes more and more concentrated as the entanglement in the input state grows.
To create n-qubit states with different amounts of entanglement, we consider time-evolved states of the form |ψ_t⟩ = e^{−iHt}|ψ_0⟩, where |ψ_0⟩ is a random product state, and where H is the Heisenberg model with nearest-neighbor interactions and periodic boundary conditions (n + 1 ≡ 1), H = ∑_{i=1}^{n} (X_i X_{i+1} + Y_i Y_{i+1} + Z_i Z_{i+1}).
Here, σ_i with σ = X, Y, Z denotes a Pauli operator acting on qubit i. As we will see below, as t increases, so does the entanglement in |ψ_t⟩.
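This state-preparation step can be sketched with exact statevector numerics for a small chain (an illustration with n = 6; the unit coupling normalization of the Heisenberg model is assumed here):

```python
import numpy as np

def pauli_on(op, site, n):
    """Embed a single-qubit operator at `site` in an n-qubit system."""
    out = np.array([[1.0 + 0j]])
    for q in range(n):
        out = np.kron(out, op if q == site else np.eye(2))
    return out

def heisenberg(n):
    """Nearest-neighbour Heisenberg model with periodic boundaries."""
    X = np.array([[0, 1], [1, 0]], dtype=complex)
    Y = np.array([[0, -1j], [1j, 0]])
    Z = np.diag([1.0 + 0j, -1.0])
    H = np.zeros((2 ** n, 2 ** n), dtype=complex)
    for i in range(n):
        j = (i + 1) % n                     # n + 1 ≡ 1: periodic wrap-around
        for s in (X, Y, Z):
            H += pauli_on(s, i, n) @ pauli_on(s, j, n)
    return H

def half_chain_entropy(psi, n):
    """Von Neumann entropy of the first n//2 qubits."""
    rho = psi.reshape(2 ** (n // 2), -1)
    p = np.linalg.svd(rho, compute_uv=False) ** 2
    p = p[p > 1e-12]
    return float(-(p * np.log(p)).sum())

n = 6
rng = np.random.default_rng(1)
# random product state: a Haar-random single-qubit state on every site
psi0 = np.array([1.0 + 0j])
for _ in range(n):
    th, ph = rng.uniform(0, np.pi), rng.uniform(0, 2 * np.pi)
    psi0 = np.kron(psi0, np.array([np.cos(th / 2), np.exp(1j * ph) * np.sin(th / 2)]))

E, V = np.linalg.eigh(heisenberg(n))
for t in (0.0, 0.5, 2.0):
    psi_t = V @ (np.exp(-1j * E * t) * (V.conj().T @ psi0))
    print(t, half_chain_entropy(psi_t, n))  # entropy grows with t
```

At t = 0 the half-chain entropy is exactly zero (product state), and it grows as the evolution entangles the chain, which is the knob the numerics below turn.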
Next, we consider a learning task where we want to minimize a cost function of the form L(θ) = ⟨ψ_t| U†(θ) O_Z U(θ) |ψ_t⟩, where O_Z = ∑_i Z_i (i.e., O_Z is a sum of 1-local operators), and where U(θ) is a shallow HEA. Specifically, we employ the HEA architecture shown in Fig. 6, which is composed of an initial layer of general single-qubit rotations, followed by two-qubit gates on alternating pairs of qubits. The two-qubit gates are themselves composed of a CNOT gate followed by general single-qubit gates on each qubit.
In Figs. 7(a,b) we show the averaged norm of the gradient ∂_µL(θ), i.e., ∥∇L∥_∞ ≡ max_µ |∂_µL(θ)|, as a function of the evolution time t used to prepare the input state of the HEA, for different problem sizes. Gradients are computed by averaging over 400 random product states |ψ_0⟩, with two sets of random HEA parameters for each initial state. Here we can see that for small evolution times the cost exhibits large gradients independently of the system size. This result is expected, as in the limit t → 0 the input state |ψ_0⟩ is a tensor product state, which, along with 1-local measurements and the HEA structure, leads to gradients whose norms are independent of n. As t increases, the gradient norm decreases until a saturation value G_sat is reached. Moreover, we can see that the value of G_sat depends on the number of qubits in the system. In fact, as shown in Fig. 7(c), G_sat decays polynomially with n. We can further understand this behavior by noting that as t increases, the time evolution exp(−iHt) produces larger amounts of entanglement in the input state, and concomitantly smaller gradients (as indicated by our main results above). To see that this is the case, we compute the rescaled entropy S(ρ_2) = −Tr[ρ_2 log(ρ_2)]/2, where ρ_2 is the reduced state on two nearest-neighbor qubits, for a sufficiently large time t such that G_sat is achieved. The results, shown in Fig. 7(c), reveal a positive correlation between the decay of gradients and the increase in reduced-state entropy. Thus, the more entanglement in the input state, the smaller the gradients, and the more concentrated the landscape.
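The quantity ∥∇L∥_∞ can be reproduced in miniature with exact statevector numerics. The toy sketch below (our illustration with n = 4; the general single-qubit rotations of Fig. 6 are simplified to Ry gates) evaluates gradients with the parameter-shift rule and compares a product-state input against a random (highly entangled) input.

```python
import numpy as np

n = 4
I2 = np.eye(2, dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                 [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

def kron_all(ops):
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

O_Z = sum(kron_all([Z if q == i else I2 for q in range(n)]) for i in range(n))

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def hea(theta):
    """Two Ry layers interleaved with brick-like CNOTs (a toy shallow HEA)."""
    U = kron_all([ry(t) for t in theta[:n]])
    U = kron_all([CNOT, CNOT]) @ U                 # pairs (0,1), (2,3)
    U = kron_all([ry(t) for t in theta[n:]]) @ U
    U = kron_all([I2, CNOT, I2]) @ U               # pair (1,2)
    return U

def loss(theta, psi):
    phi = hea(theta) @ psi
    return float(np.real(phi.conj() @ O_Z @ phi))

def grad_inf_norm(theta, psi):
    """max_mu |d L / d theta_mu| via the parameter-shift rule for Ry gates."""
    g = []
    for mu in range(len(theta)):
        e = np.zeros_like(theta); e[mu] = np.pi / 2
        g.append(0.5 * (loss(theta + e, psi) - loss(theta - e, psi)))
    return max(abs(x) for x in g)

rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, 2 * n)

prod = np.zeros(2 ** n, dtype=complex); prod[0] = 1.0   # |0...0>, zero entanglement
rand = rng.standard_normal(2 ** n) + 1j * rng.standard_normal(2 ** n)
rand /= np.linalg.norm(rand)                            # volume-law-like input

print(grad_inf_norm(theta, prod), grad_inf_norm(theta, rand))
```

A single draw can fluctuate, but averaging the two printed norms over many random parameter sets and inputs qualitatively reproduces the trend of Fig. 7: product-state inputs yield larger gradients than highly entangled ones.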

Discussion and conclusions
Understanding the capabilities and limitations of VQA and QML algorithms is crucial to developing strategies that can achieve a quantum advantage. One of the most relevant ingredients in ensuring the success of a VQA/QML model is the choice of ansatz for the parametrized quantum circuit. In this work, we focused our attention on the shallow HEA, as it can avoid barren plateaus and is perhaps one of the most NISQ-friendly ansatzes. Currently, the HEA is widely used for a plethora of problems, irrespective of whether it is well-suited for the task and data at hand. In a sense, the HEA is still a "solution in search of a problem," as there was no rigorous study of the tasks where it should, or should not, be used. In this work, we establish rigorous results showing how, and in which contexts, HEAs are (and are not) useful and can potentially provide a signature of quantum advantage.
We first review relevant results from the literature, discussing the notion of cost and loss function concentration and the necessary conditions for trainability of HEAs (i.e., shallowness and locality of measurements). Here we highlight the existence of a new source of untrainability of shallow HEAs: the entanglement of the input states. On the one hand, we proved that HEAs are untrainable if the input states satisfy a volume law of entanglement, as the cost function is deterministically concentrated around its trivial value. On the other hand, if the input states follow an area law of entanglement, the HEA is trainable. In fact, here we prove that the loss function anti-concentrates, i.e., it differs, at least polynomially, at sufficiently different points of the parameter landscape.
While the role of entanglement in the trainability of VQA and QML models has been explored in Refs. [50,51], the results found therein are conceptually different from ours. Namely, in those references the authors point out that deep parametrized quantum circuit ansatzes create a volume law for the entanglement entropy, making the parameter landscape exponentially flat in the number of qubits and thus giving rise to entanglement-induced barren plateaus. As such, those results study the entanglement created during the circuit, but not that already present in the input states. For instance, the shallow HEA cannot create volume-law entanglement, yet it is still untrainable if such entanglement exists in the input state. Hence, our work provides a new source of untrainability for certain datasets.
Next, we also analyzed the still-open question of whether the HEA is able to achieve a quantum advantage in a VQA/QML setting. While the full answer is beyond the scope of this paper, we identified regimes in which the HEA can or cannot provide quantum speed-ups. Here we proved that, thanks to the shallowness of HEAs, input states with bond dimension at most polynomial in the number of qubits can be simulated with only a polynomial overhead on classical machines. This result rules out the use of the HEA in VQAs: as many examples show, the typical input state for a VQA is an easy-to-prepare product state, thus allowing an efficient classical decomposition. Conversely, for QML algorithms the question remains open: the portion of area-law states admitting an efficient classical description is exponentially small [76]. While this is not a guarantee for achieving quantum advantage, it is definitely the window to look at for applications beyond those solvable with classical capabilities.
We indeed push this intuition forward and provide an example to which our results apply. Namely, we present a Hamiltonian discrimination QML problem, where initial product states are evolved in time by two types of Hamiltonians, one possessing a given local symmetry, and one completely general. We show that, while the task becomes less and less feasible as the evolution time grows (since entanglement grows in time), for a given time window (scaling logarithmically with the number of qubits) such states possess an area law of entanglement, ensuring the absence of barren plateaus in the loss landscape.
Such an example serves as a pivotal one for future, and hopefully fruitful, usages of the HEA. Our recipe to prepare a barren-plateau-free QML problem is the following: take sufficiently entangled quantum input data, pass it through a shallow-depth circuit, and then measure with local operators. Importantly, if one wishes to use this scheme for classical data, it will be extremely important to find data-embedding schemes that lead to area-law states which are not themselves classically simulable. We expect that the search for such entanglement-tamed embeddings could be a fruitful research direction.
There are still several directions to be explored after the analysis of the present work. We emphasize that, while for unstructured HEAs this paper definitely rules out volume-law states as input states of QML algorithms with HEA ansatzes, there is the fascinating possibility that, with even a little prior knowledge of the input states, a problem-aware HEA could avoid exponential concentration in the parameter landscape. Indeed, the choice of a structured, problem-dependent ansatz can avoid barren plateaus: a prominent example is Geometric Quantum Machine Learning, which exploits the geometric symmetries of the input dataset to design symmetry-aware parametrized ansatzes [35][36][37][38][39][40][41].

A Proof of Theorem 1

Consider the Pauli decomposition O = ∑_i c_i P_i, where c_i := Tr[O P_i]/2^n and Tr[P_i P_j] = 2^n δ_ij. Then, we define the support supp(P_i) as the ordered set of qubits containing non-identity operators in the tensor-product structure of P_i, where each factor is a single-qubit Pauli operator acting on the j-th qubit. Explicitly, supp(P_i) is an ordered subset of natural numbers labeling the qubits on which P_i acts non-trivially, and S_i labels the number of qubits on which P_i acts; see Fig. 8. From this, we also define the support of O as the ordered subset of qubits given by the union of the supports of each P_i, i.e., supp(O) = ∪_i supp(P_i). Moreover, we note bounds on the coefficients which follow from Hölder's inequality and from a counting argument. Let us define the pairwise relative distance between the qubits q_k and q_{k+1} belonging to supp(P_i). It is then possible to define clusters, i.e., subsets of contiguous qubits whose pairwise relative distance is less than 2D. The definition can be carried out recursively, yielding L_i clusters of qubits for any i = 1, …, N, such that 1 ≤ L_i ≤ S_i. Note that we can write the operator O as a sum of cluster operators O_α, and consequently rewrite the cost function by expanding over these clusters. It is then possible to evaluate the distance between f(θ) and its trivial value f_trv, where ψ_Λ ≡ Tr_Λ̄[|ψ⟩⟨ψ|]. In the first inequality we use the triangle inequality for the absolute value, while in the second we use |Tr[AB]| ≤ ∥A∥_∞ ∥B∥_1 and the fact that unitary operators preserve the trace and any Schatten p-norm. In what follows we will use the following lemma and corollary, which we will prove at the end of this Appendix (see App. D).
Lemma 1. Let U(θ) be a HEA with depth D as in Fig. 1(b), and let O_α be an operator as in Eq. (35). Then the bound proven in App. D holds. The following corollary descends from the above lemma and the clustering of qubits.

A.2 Proof of Corollary 1
By Theorem 1, if the information is scrambled according to Definition 6, we have that, since max_i |supp(P_i)| ∈ O(log(n)), there exists a constant c′ such that n^{max_i |supp(P_i)|} ∈ O(n^{c′ log(n)}), and therefore, crudely upper bounding 3n < n² (for n > 3), the bound holds for any k ∈ Ω(1). Choosing k = c/2 for simplicity, we finally reach the final bound, where we considered ∥O∥_∞ ∈ Ω(1).

A.3 Proof of Corollary 2
The proof of Corollary 2 follows that in the work of Popescu et al. [79]. Here we report it for completeness.
Let ψ_Λ = Tr_Λ̄[|ψ⟩⟨ψ|], consider a complete set of (Hermitian) observables, and decompose the state on this complete set. Consider the 1-norm distance between ψ_Λ and the completely mixed state 1_Λ d_Λ^{−1}. In the first inequality we have used the norm equivalence, in the second we have expanded ψ_Λ as in Eq. (48), and the third inequality follows by taking the maximum over i in the summation. Since the expectation values are Lipschitz functions [79], by exploiting Lévy's lemma one can easily prove that they concentrate, with C = (18π³)^{−1}. By choosing ϵ = 2^{−n/3}, one gets the desired bound. To conclude, by virtue of Theorem 1, we can bound I_{Λ_i} < 2^{|Λ_i|} 2^{−n/3} with overwhelming probability. Thus, for D ∈ O(log(n)), there exist two constants c, c′ such that, loosely, for any constant k we reach the final bound, which holds with overwhelming probability, where we bound |Λ| < n.

A.4 Proof of Corollary 3
Now suppose one has as input k copies of a Haar random state |ψ⟩ on n qubits; denote the input as |Ψ^{(k)}⟩ = |ψ⟩^{⊗k}, and denote by d = 2^n the dimension of the Hilbert space in which |ψ⟩ lives. Let us decompose its reduced density matrix in a Hermitian operator basis P_i. Let us prove that each expectation value on |Ψ^{(k)}⟩ is a Lipschitz function with respect to |ψ⟩. Given a second state, in the first inequality we use the fact that ∥P_i∥_∞ = 1. In the second line, we use the following trick k times: denoting ψ = |ψ⟩⟨ψ|, we use ∥ψ^{⊗k}∥_1 = 1 for any k. Thanks to the fact that Tr[Ψ_Λ P_i] is Lipschitz, and denoting by the overline the Haar average of Tr[Ψ^{(k)}_Λ P_i] over the input state, we obtain a concentration bound with C = (18π³)^{−1}. We need to bound the probability that I_Λ(ψ) ≥ ϵ. Using the same trick as in Corollary 2, we find that for ϵ = 2^{−n/3} the bound holds asymptotically. The calculation becomes more intricate for k > 2. However, assuming that Λ is symmetric between the copies of |ψ⟩, with |Λ| = kλ qubits, and setting d_Λ ≡ 2^{kλ}, we can use the Haar average over permutations, where the sum runs over the conjugacy classes c(π) of the symmetric group S_k. As evidenced by that formula, the order of the trace depends on the conjugacy class c: for example, for c(π) = (12), (23), … one has c = 1, while for c(π) = (1234), (1243), … one has c = 3, and so on. Thus, we have the claimed bound provided that ϵ > O(2^{−n}). By applying Lévy's lemma, thanks to the typicality of Tr[Ψ_Λ], and choosing ϵ = 2^{−n/3}, the corollary follows.

B Proof of Theorem 2 and Proposition 1

B.1 Proof of Theorem 2
Let us consider the quantity f(θ_A) − f(θ_B), where θ_A and θ_B are such that θ_A = θ_B + ê_{AB} l_{AB} with l_{AB} ∈ Ω(1/poly(n)). Our goal will be to show that the variance of this quantity is at most polynomially vanishing. First, we will use the notation θ_m ≡ θ_A and θ_0 ≡ θ_B. Moreover, it is useful to further divide the path in parameter space into a sequence of single-parameter changes, θ_{i+1} = θ_i + ê_i l_i, with m ∈ O(poly(n)). Here ê_i is a unit vector with a single one, and at least one l_i is such that sin²(l_i) ∈ Ω(1/poly(n)). Note that we can guarantee both m ∈ O(poly(n)) and the existence of such an l_i from the fact that we have at most a polynomial number of parameters, and that θ_A and θ_B are at most polynomially close. This allows us to write f(θ_A) − f(θ_B) as a telescopic sum of the differences ∆f_{i+1,i} := f(θ_{i+1}) − f(θ_i). We then take the variance of Eq. (75), obtaining Eq. (76). In what follows we will use the following lemmas, which we prove at the end of this appendix (see App. D).
Let us now go back to Eq. (76); in its second line we have used Lemmas 2 and 3. Using the fact that θ_{i+1} = θ_i + ê_i l_i, and leveraging the parameter-shift rule for computing gradients [81,82], we obtain expressions in terms of gradients evaluated at the midpoints θ̃_i = θ_{i−1} + ê_i l_i/2. Then, recalling that m ∈ O(poly(n)) and that there exists an l_i such that sin²(l_i) ∈ Ω(1/poly(n)), we can use Lemma 4 to conclude.

B.2 Proof of Proposition 1
Here we note that in the previous section, where we proved Theorem 2, we showed that the absence of barren plateaus (through Lemma 4) implies anti-concentration of the cost values. In this section, we prove the converse. First, we note that by anti-concentration we mean that for any set of parameters θ_B and θ_A = θ_B + ê_{AB} l_{AB} with l_{AB} ∈ Ω(1/poly(n)), Eq. (85) holds. Then, let us use the parameter-shift rule, where θ_± = θ ± ê_ν π/2 and ê_ν is a unit vector with a one at the ν-th entry. Since the difference between θ_+ and θ_− is O(1), we can use Eq. (85) to lower bound the variance of the gradient, which completes the proof of Proposition 1.
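The converse direction can be summarized in a two-line derivation (a schematic sketch; we assume the gates are Pauli-generated, so the parameter-shift rule applies, and that the gradient has zero mean over the landscape):

```latex
% Parameter-shift rule for a Pauli-generated gate:
\partial_\nu f(\boldsymbol\theta)
   = \tfrac{1}{2}\bigl[f(\boldsymbol\theta_+)-f(\boldsymbol\theta_-)\bigr],
   \qquad \boldsymbol\theta_\pm = \boldsymbol\theta \pm \hat e_\nu \tfrac{\pi}{2}.
% Assuming E_theta[partial_nu f] = 0, anti-concentration (Eq. (85)) gives
\operatorname{Var}_{\boldsymbol\theta}\!\bigl[\partial_\nu f\bigr]
   = \tfrac{1}{4}\,
     \mathbb{E}_{\boldsymbol\theta}\!\Bigl[\bigl(f(\boldsymbol\theta_+)
       -f(\boldsymbol\theta_-)\bigr)^{2}\Bigr]
   \in \Omega\!\bigl(1/\operatorname{poly}(n)\bigr).
```

The last step uses that θ_+ and θ_− differ by an O(1) shift, so the anti-concentration bound on pairs of polynomially close parameters applies.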
C Isospectral twirling and proofs of Proposition 3, Theorem 3, and Proposition 4

In this section, we aim to prove Proposition 3, Theorem 3, Corollary 5, and Proposition 4.

C.1 Isospectral twirling
We first review the useful notion of the isospectral twirling introduced in Refs. [77,78]. Consider a Hamiltonian H written in its spectral decomposition H = ∑_k E_k Π_k, where the Π_k are its eigenprojectors and the E_k its eigenvalues. Consider the time evolution generated by H, i.e., W(t) = exp(−iHt) = ∑_k e^{−iE_k t} Π_k. Denote by U(n) the unitary group on n qubits, and define the ensemble of isospectral unitary evolutions whose representative element is H. The isospectral twirling of order k, denoted R^{(2k)}(W(t)), is the 2k-fold Haar channel of the operator W^{⊗k,k}(t) := W^{⊗k}(t) ⊗ W^{†⊗k}(t). Using the Weingarten functions [83][84][85], one can compute the isospectral twirling in terms of the symmetric group S_{2k}, where T_π is the unitary representation of the permutation π ∈ S_{2k}, and of the spectral form factors which govern the behavior of many figures of merit of random isospectral Hamiltonians, as noted in [77,78]. While the expression of the isospectral twirling of order k ≥ 2 is cumbersome to report, we do recall the expression for k = 1, in which T_{12} is the swap operator between the two copies. For the GDE, all the eigenvalues E_k are independent identically distributed Gaussian random variables with zero mean and standard deviation 1/2. Defining the normalized 2k-spectral form factor c̃_{2k}(t) := c_{2k}(t)/d^{2k}, its average over the GDE can be computed in closed form. Let us now set up notation used throughout the following proofs. Let F[W(t)] be a scalar function of the unitary evolution W(t) = exp(−iHt). Then we denote by E_GDE the isospectral twirling of the scalar function F[W(t)] followed by the average over the GDE ensemble of Hamiltonians. In the following section, we use the techniques introduced above to prove Proposition 3.
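The lowest-order normalized form factor depends only on the spectrum, so its GDE average can be estimated by Monte Carlo over eigenvalues alone. In the sketch below we assume the standard definition c̃₂(t) = |Tr e^{−iHt}|²/d²; the closed-form comparison curve, 1/d + (1 − 1/d)e^{−t²/4}, is our own elementary computation for i.i.d. N(0, 1/2) eigenvalues and serves only as a sanity check.

```python
import numpy as np

def c2_tilde(energies, t):
    """Normalized 2-spectral form factor |Tr e^{-iHt}|^2 / d^2,
    which depends only on the spectrum {E_k}."""
    d = len(energies)
    tr = np.exp(-1j * energies * t).sum()
    return float(np.abs(tr) ** 2) / d ** 2

rng = np.random.default_rng(4)
d = 2 ** 8
samples = 200
for t in (0.0, 1.0, 3.0, 10.0):
    est = np.mean([c2_tilde(rng.normal(0.0, 0.5, d), t) for _ in range(samples)])
    # exact average for i.i.d. N(0, 1/2) eigenvalues (our elementary computation)
    pred = 1 / d + (1 - 1 / d) * np.exp(-t ** 2 / 4)
    print(t, est, pred)
```

The estimate starts at c̃₂(0) = 1 and decays toward its late-time floor ~1/d, mirroring how figures of merit built from the isospectral twirling lose structure as t grows.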

C.3 Proof of Theorem 3
To prove the theorem, we use the well-known Markov inequality: let x be a non-negative random variable with average µ; then Pr[x ≥ a] ≤ µ/a. To prove the concentration, consider |L_s(H, θ, t)| for H ∈ {H_G, H_S} as defined in the main text, and apply Jensen's inequality. Taking the isospectral twirling, one obtains an expression in terms of the operator R^{(4)}(t), which can be found in Eq. (165) of Ref. [77]. Taking the average over the GDE spectra, the result is obtained after some algebra, under the hypotheses that ψ ≡ ψ_A ⊗ ψ_B and that Tr[O] = 0. Note that the above result holds for both H ∈ {H_G, H_S}; we indeed exploited the fact that |A| ≪ |B|, and thus d_B = O(d). Using Markov's inequality, Eq. (98), the result can be readily derived.

C.4 Proof of Theorem 4

C.4.1 An anti-concentration inequality
To prove the theorem, we make use of Cantelli's inequality: let x be a random variable with average µ and standard deviation σ; then Pr[x − µ ≥ λ] ≤ σ²/(σ² + λ²). To bound the measure I_Λ(ψ), we use the bound proven in [86] between the Hilbert-Schmidt distance and the trace distance, where ψ_Λ := Tr_Λ̄[|ψ⟩⟨ψ|] and d_Λ = 2^{|Λ|}. Note that P(ψ_Λ) ≡ Tr[ψ_Λ²] is the purity of the reduced state, and, because of Eq. (103), it is sufficient to obtain a bound on the anti-concentration of the purity P(ψ_Λ). Denoting P_GDE := E_GDE[P(ψ_Λ)], we can use Eq. (102). Thanks to the right/left invariance of the Haar measure, we can insert unitaries so that we compute the average purity over GDE Hamiltonians on the average product state between Λ and Λ̄. A straightforward computation, following the calculations in Sec. 3.3.1 of Ref. [77], leads to the result. Computing the second moment is trickier than computing P_GDE, because it involves averaging over the 8-fold power of G. It is well known that the permutation group S_8 contains 8! elements, making a brute-force calculation too expensive. It is also known that purity and OTOCs share many similarities (see Refs. [87][88][89][90][91]). The strategy will thus be to write Eq. (113) in terms of 8-OTOCs, where c̃_{2k}(t) is the normalized 2k-spectral form factor defined in Eq. (91). First, note that we can write the swap operator, and in the same fashion the state |0⟩⟨0|, as sums of Pauli operators. Note that the sum is over P_i for i = 1, …, 4 and runs over the subgroup P ∈ {1, Z}^n; moreover, we defined P̃_i := U_G P_i U_G† for i = 1, …, 4 to lighten the notation. It is useful for what follows to split off the identity part of the sum and write P²(ψ_Λ) accordingly, where each term in the first sum differs from the identity. To write it in terms of OTOCs, we use the following identity: let A and B be two operators and P a Pauli operator; then the summation runs over the whole Pauli group. Here, Q, K, L label global Pauli operators running over the whole Pauli group. In order to recover OTOCs, we still need to split the sums over Q, K, L between the identity and the other non-identity Paulis. We adopt the following convention: each primed sum runs over all the Paulis appearing in the summation, with the exception of the identity, on their respective supports; more precisely, P_1, …, P_4 run in the subgroup P_i ∈ {1, Z}^n minus the identity, Q, K, L run over the whole Pauli group minus the identity, while P_Λ^a, P_Λ^b run over the Pauli group defined on the Λ qubits minus the identity. Using Eq. (112), from Eq. (119) one thus arrives, after further algebraic simplifications, at the result. In order to properly use Eq. (114), one needs to make sure to absorb all the terms of order O(d^{−1}). Hence, we can readily compute Q_GDE by using Eq. (94), which concludes the proof.

D Useful Lemmas
Before proceeding to prove the lemmas, we recall that if a given two-qubit gate V in the HEA forms a 2-design, one can employ the element-wise Weingarten-calculus formulas of Refs. [85,93] to explicitly evaluate averages over V up to the second moment, where v_ij are the matrix elements of V and the integration is taken over U(2), the unitary group of degree 2. Consider a single-qubit Pauli operator σ_j^{(i)} possessing support only on the j-th qubit, and write U(θ) as in Eq. (3) layer by layer, where the V_k are the unitaries acting on each layer k = 1, …, D. For the sake of simplicity, we drop the parameter dependence. Each V_k can be further decomposed as a product of unitaries V_α, α = 1, …, n/2m, each acting on m qubits, with n a multiple of m (m even) and periodic boundary conditions, so that either (i) V_1 acts on the first m qubits, V_2 acts on the second m qubits, up to V_{n/2m} acting on the last m qubits; or (ii) V_1 acts from qubit m/2 + 1 to qubit 3m/2, V_2 acts from qubit 3m/2 + 1 to qubit 2m, up to V_{n/2m} acting from qubit n − m/2 + 1 to qubit m/2. At the first layer, only one V_α^{(1)}, for some α, acts on σ_j^{(i)} (Eq. (134)).

Figure 1 :
Figure 1: (a) Both VQA and QML models train parametrized quantum circuits U(θ) to minimize either a cost function C(θ) (VQAs) or a loss function L(θ) (QML models). While VQAs start from some fiducial, easy-to-prepare state |ψ0⟩, in QML one uses states from a dataset |ψs⟩ ∈ S as inputs to the parametrized circuit U(θ). Both models exploit the power of classical optimizers for the minimization task. (b) The architecture of a HEA seeks to minimize the effect of hardware noise by following the topology, and using the native gates, of the physical hardware. Specifically, we consider the HEA as a one-dimensional alternating layered ansatz of two-qubit gates organized in a brick-like fashion. In the figure, we show how a first layer of gates is implemented at time t1 and a second layer at time t2. At the end of the computation, a local operator is measured.

Figure 2 :
Figure 2: The sketch shows the light-cone of a local measurement operator at the end of a shallow HEA. Here we can see how the support of the local operator grows with the depth D of the HEA. Since the HEA gates act on neighboring qubits, the support increases by no more than 2D qubits.
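The light-cone bound in Figure 2 can be sketched with simple interval bookkeeping (a toy model of the brickwork layout; open boundary conditions are assumed here for simplicity):

```python
def lightcone_support(start, end, depth, n):
    """Width of the interval supporting a local operator on qubits
    [start, end] after conjugation through `depth` brickwork layers.
    Layer k applies two-qubit gates on pairs (0,1),(2,3),... if k is even,
    and on (1,2),(3,4),... if k is odd (open boundaries)."""
    lo, hi = start, end
    for k in range(depth):
        offset = k % 2
        # If the edge qubit is the right (left) member of a gate, that gate
        # extends the support by one qubit on that side.
        if lo > 0 and (lo - offset) % 2 == 1:
            lo -= 1
        if hi < n - 1 and (hi - offset) % 2 == 0:
            hi += 1
    return hi - lo + 1

n, D = 64, 5
initial = 2                      # a two-qubit observable on qubits 30, 31
size = lightcone_support(30, 31, D, n)
assert size <= initial + 2 * D   # support grows by at most one qubit per side per layer
```

Since each layer can enlarge the interval by at most one qubit on each side, D layers give growth at most 2D, matching the caption's bound.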
Let us consider a VQA or QML task from Definitions 1-2, where the ansatz for the parametrized quantum circuit U(θ) is a shallow HEA with depth D ∈ O(log(n)) (see Fig. 1(b)). Moreover, let f(θ) = C(θ), L_s(θ) be either the cost or the loss function in Eqs. (1) and (2).

Corollary 2 .
Let |ψ⟩ ∼ µ_Haar be a Haar random state, D ∈ O(log(n)), and max_i |supp(P_i)| ∈ O(log(n)). Here µ_Haar is the uniform Haar measure over the states in the Hilbert space. Then I
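The following sketch illustrates why Haar-random inputs are hostile to local measurements: the reduced state of a small subsystem Λ is exponentially close to maximally mixed (volume-law entanglement). It compares the Monte Carlo average purity of ψ_Λ against Lubkin's exact Haar average (d_Λ + d_Λ̄)/(d_Λ d_Λ̄ + 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, nA = 10, 2                  # total qubits, small subsystem Λ
d, dA = 2 ** n, 2 ** nA
dB = d // dA

trials = 200
purities = []
for _ in range(trials):
    # Haar-random pure state: normalized complex Gaussian vector.
    psi = rng.normal(size=d) + 1j * rng.normal(size=d)
    psi /= np.linalg.norm(psi)
    M = psi.reshape(dA, dB)          # bipartition Λ | complement
    rhoA = M @ M.conj().T            # reduced density matrix on Λ
    purities.append(np.real(np.trace(rhoA @ rhoA)))

mean_purity = np.mean(purities)
lubkin = (dA + dB) / (dA * dB + 1)   # exact Haar average of the purity
assert abs(mean_purity - lubkin) < 0.005
assert mean_purity < 0.27            # near the maximally mixed value 1/dA = 0.25
```

As n grows, the average purity approaches 1/d_Λ, i.e., ψ_Λ carries essentially no information accessible to a local observable.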

Proposition 1 .
Let f(θ) be a VQA cost function or a QML loss function where U(θ) is a shallow HEA with depth D ∈ O(log(n)). If the values of f(θ) anti-concentrate according to Theorem 2 and Eq. (23), then Var_θ[∂_ν f(θ)] ∈ Ω(1/poly(n)) for any θ_ν ∈ θ, and the function does not exhibit a barren plateau. Conversely, if f(θ) has no barren plateaus, then its values anti-concentrate as in Theorem 2 and Eq. (23).
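Gradient quantities such as ∂_ν f(θ) are typically evaluated with the parameter-shift rule. A self-contained sketch on a toy brickwork HEA (the gate set — RY rotations followed by CZ bricks — and the observable Z on the first qubit are illustrative assumptions, not fixed by the text), cross-checked against a finite difference:

```python
import numpy as np

n, D = 4, 2
dim = 2 ** n

def apply_1q(state, gate, q):
    """Apply a single-qubit gate to qubit q (qubit 0 = most significant bit)."""
    psi = state.reshape(2 ** q, 2, 2 ** (n - q - 1))
    return np.einsum('ab,ibj->iaj', gate, psi).reshape(dim)

def apply_cz(state, q):
    """Apply CZ on neighboring qubits (q, q+1)."""
    psi = state.reshape(2 ** q, 2, 2, 2 ** (n - q - 2)).copy()
    psi[:, 1, 1, :] *= -1
    return psi.reshape(dim)

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)  # exp(-i theta Y / 2)

def f(thetas):
    """Cost <psi(theta)| Z_0 |psi(theta)> for a brickwork HEA on |0...0>."""
    state = np.zeros(dim, dtype=complex); state[0] = 1.0
    t = iter(thetas)
    for layer in range(D):
        for q in range(n):
            state = apply_1q(state, ry(next(t)), q)
        for q in range(layer % 2, n - 1, 2):
            state = apply_cz(state, q)
    z0 = np.where(np.arange(dim) < dim // 2, 1.0, -1.0)  # Z on qubit 0
    return float(np.real(np.vdot(state, z0 * state)))

rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, size=n * D)

# Parameter-shift gradient for parameter 0 (exact for RY generators)...
shift = np.zeros_like(theta); shift[0] = np.pi / 2
g_ps = 0.5 * (f(theta + shift) - f(theta - shift))
# ...versus a central finite difference.
eps = np.zeros_like(theta); eps[0] = 1e-5
g_fd = (f(theta + eps) - f(theta - eps)) / 2e-5
assert abs(g_ps - g_fd) < 1e-6
```

Repeating this over many random θ yields the sample estimate of Var_θ[∂_ν f(θ)] that the Proposition relates to anti-concentration.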

Figure 3 :
Figure 3: Schematic representation of the Hilbert space. The vast majority of states satisfy a volume law, and hence a shallow HEA cannot be used to extract information from them. Of the states satisfying an area law, only a very small subset admits an efficient classical representation. For these states, the effect of a shallow HEA can be efficiently simulated. As such, there exists a Goldilocks regime where the HEA can potentially be used to achieve a quantum advantage: non-classically-simulable area-law states.

Figure 4 :
Figure 4: Schematic representation of the trainability of two QML tasks with two different datasets S. The first task is trainable since its dataset S is composed of states ρi possessing an area law of entanglement. Conversely, the second task is untrainable, since its dataset S is composed of states ρi possessing a volume law of entanglement. We remark that the first task can enjoy a quantum advantage, since not all area-law states are classically simulable; see Ref. [76] and Fig. 3.

Figure 7 :
Figure 7: Numerical results. We consider a problem where the input states |ψt⟩ of the HEA are determined by Eqs. (32) and (33), and where the loss function is given by Eq. (34). In panels (a) and (b) we show the norm of the gradient ∂µL(θ) (averaged over |ψt⟩ and θ) as a function of the evolution time t for different system sizes, with n even and odd respectively. Panel (c) shows the saturation value Gsat of the gradient norm from (a) and (b) versus the system size n. Panel (d) shows the norm of the gradient versus 1 − S(ρ2), where S(ρ2) is the entropy of a two-qubit subsystem, for an evolution time at which the saturation value Gsat is achieved. Different points correspond to different values of n. The results are averaged over 400 initial states |ψ0⟩ in Eq. (32) and two sets of angles θ for every initial state.
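The subsystem entropy S(ρ2) appearing in panel (d) is obtained by partial trace and diagonalization. A minimal sketch (entropy in bits; taking the first two qubits as the subsystem is an assumption made here for illustration):

```python
import numpy as np

def two_qubit_entropy(psi, n):
    """Von Neumann entropy (in bits) of the reduced state of the first two qubits."""
    M = psi.reshape(4, 2 ** (n - 2))   # split: first two qubits | rest
    rho2 = M @ M.conj().T              # partial trace over the remaining qubits
    evals = np.linalg.eigvalsh(rho2)
    evals = evals[evals > 1e-12]       # drop numerical zeros before taking logs
    return float(-np.sum(evals * np.log2(evals)))

n = 4
# Product state |0000>: no entanglement, so S = 0.
prod = np.zeros(2 ** n, dtype=complex); prod[0] = 1.0
assert two_qubit_entropy(prod, n) < 1e-9

# GHZ state (|0000> + |1111>)/sqrt(2): the reduced state is an even mixture
# of |00><00| and |11><11|, so S = 1 bit.
ghz = np.zeros(2 ** n, dtype=complex)
ghz[0] = ghz[-1] = 1 / np.sqrt(2)
assert abs(two_qubit_entropy(ghz, n) - 1.0) < 1e-9
```

Panel (d) then plots the gradient norm against 1 − S(ρ2), so low-entanglement (area-law-like) states sit on the right of the axis.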
ter for Nonlinear Studies at Los Alamos National Laboratory (LANL). L.C. was partially supported by the U.S. DOE, Office of Science, Office of Advanced Scientific Computing Research, under the Accelerated Research in Quantum Computing (ARQC) program. M.C. was initially supported by the Laboratory Directed Research and Development (LDRD) program of LANL under project number 20230049DR. The authors also acknowledge support by the U.S. DOE through a quantum computing program sponsored by the LANL Information Science & Technology Institute.

Lemma 2. Let U(θ) be a shallow HEA with depth D ∈ O(log(n)), and let O = Σ_i c_i Σ_α O^(i)_α with O^(i)_α being traceless operators having support on at most two neighboring qubits. Then, E_θ0[f(θ_i)] = 0, ∀i. (77)

Lemma 3. Let U(θ) be a shallow HEA with depth D ∈ O(log(n)), and let O = Σ_i c_i Σ_α O^(i)_α with O^(i)_α being traceless operators having support on at most two neighboring qubits. Then, E_θ0[∆f_{i+1,i} ∆f_{j+1,j}] = 0, ∀i ≠ j. (78)

Lemma 4. Let U(θ) be a shallow HEA with depth D ∈ O(log(n)) where each local two-qubit gate forms a 2-design on two qubits, and let O = Σ_i c_i Σ_α O^(i)_α be a measurement composed of at most polynomially many traceless Pauli operators O^(i)_α having support on at most two neighboring qubits, with Σ_i c_i^2 ∈ O(poly(n)). Then, if the input state follows an area law of entanglement, we have

(t) T_π] are spectral functions of the representative Hamiltonian H; π = e (the identity) defines the 2k-spectral form factor c̃_{2k}(t) ≡ |Tr[W(t)]|^{2k}. To recover Eq. (27), it is sufficient to note that Tr[ψ_s O(θ_0)] = 1, because O(θ_0) ≡ S and S|ψ_s⟩ = |ψ_s⟩ by definition.
(i) reduce each term of Eq. (113) to a sum of high-order OTOCs, and (ii) use the following asymptotic formula proven in [92]: let A_1, ..., A_k, B_1, ..., B_k be non-identity Pauli operators, and let B_l(t) ≡ W†_G(t) B_l W_G(t); then ⟨Tr[∏_l A_l B_l(t)]⟩_G = Tr[∏_l A_l B_l] c̃_{2k}(t) + O(d^{-1}), (114)

Lemma 1 .
Let U(θ) be a HEA with depth D as in Fig. 1(b), and let O = Σ_i c_i P_i = Σ_i c_i Σ_α O^(i)_α be an operator as in Eq. (35). Then for any O^(i)_α: |supp(U(θ) O^(i)_α U†(θ))| ≤ Σ_{k∈C^(i)} Proof. Given O = Σ_i c_i P_i = Σ_i c_i Σ_α O^(i)_α (from the clustering of qubits), to prove the statement one has to look at a given operator O^(i)_α, having support on the cluster C^(i)_α defined in Sec. A. As a starting point, let us introduce the local Pauli operator σ^(i)_j

Figure 10 :
Figure 10: For the proofs, it is useful to divide the HEA into different parts: those acting before or after V_i (a), or before, after, and in between V_i and V_j (b).
and thus, if |ψ⟩ follows a volume law of entanglement according to Definition 5 for any bipartition Λ ∪ Λ̄ with |Λ| ∈ O(log(n)), one indeed has exponential suppression of the information contained in Λ, i.e., I_Λ(ψ) ∈ O(2^{-n}). This motivates us to propose the following alternative definition for states following volume and area laws of entanglement.

Definition 7 (Volume law vs. area law). Let |ψ⟩ be a state in a bipartite Hilbert space H_Λ ⊗ H_Λ̄. Let Λ be a subsystem composed of |Λ| qubits, and let Λ̄ be its complement. Let ψ_Λ = Tr_Λ̄[|ψ⟩⟨ψ|] be the reduced density matrix on Λ. Then the state |ψ⟩ possesses a volume law for the entanglement between Λ and Λ̄ if [77, 78]

Since the Isospectral twirling of a scalar function depends upon the particular choice of the spectrum of H, one then averages over the spectra of a given ensemble of Hamiltonians E. Relevant examples are the Gaussian unitary ensemble E ≡ GUE, the Poisson ensemble E ≡ P, and the Gaussian diagonal ensemble E ≡ GDE. As one can see, in this picture spectra and eigenvectors become completely unrelated, since the average over the full unitary group erases the information about the eigenvectors. Although many ensembles of Hamiltonians have been considered in Refs. [77, 78], in this paper we are particularly interested in the GDE ensemble, which is the simplest ensemble of Hamiltonians: its 2k-point spectral form factors can readily be computed. Let sp(H) := {E_k}_{k=1}^{d}, where d ≡ 2^n. The GDE ensemble is characterized by the following probability distribution for sp(H): H, and c̃_2(t) is the 2-point spectral form factor in Eq. (91). One can consider the isospectral twirling of a scalar function of the time-evolution operator W(t) (characterized by the operator of interest O), i.e., F_O[W(t)], which can be written after the isospectral twirling as ⟨F_O[G† W(t) G]⟩_G := Tr[T_σ O R^{(2k)}(t)], where T_σ is a particular permutation operator. Its value (depending on the evolution time t) characterizes the average behavior, within the ensemble E_H, of all those Hamiltonians sharing the same spectrum as H.
) via the Isospectral twirling technique. Recall that L_s(H_G, θ, t) = Tr[W(t) |ψ_s⟩⟨ψ_s| W†(t) O(θ)], where |ψ_s⟩ is a completely factorized state. We are interested in computing E_GDE[L_s(H_G, θ, t)]. Note that L_s(H_G, θ, t) can be written as: where P(ψ_Λ) = Tr[ψ_Λ^2] is the purity of the reduced density matrix ψ_Λ. Let |ψ_t⟩ = exp(−iHt)|ψ_s⟩ be the state resulting from the time evolution under a GDE Hamiltonian acting on a completely factorized state, and let Λ be a subsystem such that |Λ| = O(log(n)). Note that if Pr(P(ψ_Λ In this section, we compute the second moment of the purity, i.e., ⟨P(ψ_Λ)^2⟩. Thanks to the left/right invariance of the Haar measure over G, and from its commutation with T_Λ, we can compute the average of the second moment for any factorized state |ψ_0⟩ ≡ |ψ_Λ⟩ ⊗ |ψ_Λ̄⟩ by computing it with the input state |ψ_0⟩ ≡ |0⟩^{⊗n}. Namely:
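The objects entering this computation can be illustrated numerically: a GDE Hamiltonian H_G = G diag(E_1, ..., E_d) G† with i.i.d. Gaussian eigenvalues, a factorized input state, and the purity P(ψ_Λ) of a small subsystem along the evolution. A rough sketch (G sampled once via QR of a Ginibre matrix; the system size and evolution time are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, nA = 8, 2                 # total qubits, subsystem Λ
d, dA = 2 ** n, 2 ** nA
dB = d // dA

# GDE Hamiltonian: Gaussian diagonal spectrum, random eigenvectors G.
energies = rng.normal(size=d)
G = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))[0]

def purity_at(t, psi0):
    """Purity Tr[psi_Λ(t)^2] of the first nA qubits of exp(-iHt)|psi0>."""
    phases = np.exp(-1j * energies * t)
    psi_t = G @ (phases * (G.conj().T @ psi0))   # evolve in H's eigenbasis
    M = psi_t.reshape(dA, dB)
    rhoA = M @ M.conj().T
    return float(np.real(np.trace(rhoA @ rhoA)))

# Completely factorized input |0...0>.
psi0 = np.zeros(d, dtype=complex); psi0[0] = 1.0

assert abs(purity_at(0.0, psi0) - 1.0) < 1e-9   # product state: purity 1
p_late = purity_at(10.0, psi0)
assert 1 / dA - 1e-9 <= p_late <= 1.0           # purity stays in [1/dA, 1]
```

Averaging purity_at over many draws of (energies, G) gives Monte Carlo estimates of the first and second moments ⟨P(ψ_Λ)⟩ and ⟨P(ψ_Λ)^2⟩ discussed in the text.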
This operation gives rise to 8 different terms, labeled t_1, ..., t_8, each of which is proportional to the spectral function c̃_8(t) thanks to Eq. (114). The coefficients are listed below: t_1 ≡ Tr[P^a_Λ P_1 P^a_Λ P_2 P^b_Λ P_3 P^b_Λ P_4] (120) t_2 ≡ Tr[P^a_Λ P_1 Q P^a_Λ P_2 Q P^b_Λ P_3 P^b_Λ P_4]