Quantum deep transfer learning

Quantum machine learning (QML) has attracted great interest because it has the potential to speed up established classical machine learning processes. However, present QML models can only be trained on a dataset from a single domain of interest. This severely limits the application of QML to scenarios where only small datasets are available. In this work, we propose a QML model that allows knowledge to be transferred from one domain, encoded by quantum states, to another; we call this quantum transfer learning. Using such a model, we demonstrate that the classification accuracy can be greatly improved for training on small datasets, compared with the results obtained by a former QML algorithm. Finally, we prove that the complexity of our algorithm is essentially logarithmic, which can be considered an exponential speedup over the related classical algorithms.

Unlike the ML algorithms above, whose models are trained on datasets drawn from the same feature space and the same distribution, there is another type of ML algorithm called transfer learning (TL). The main strategy of TL is to use the knowledge of other task domains to reduce the effort needed to train the model of interest. In many cases, such a strategy is found to be truly beneficial [14][15][16][17][18][19]. Generally, TL can be performed in four steps, shown in figure 1(a). (1) For a given task with its dataset, find a source domain dataset for knowledge transfer. (2) Train a model on the source domain dataset. (3) Build a criterion for the transfer process. Such a criterion acts as a filter for obtaining the helpful information contained in the output of the source domain model, or in the model parameters generated by the previous training process, etc., depending on the specific task. In other words, introduce a proper learning model. (4) Train the target task model on the target domain dataset using the learning information obtained in (3).
Finding the knowledge transfer structure is the key to TL, and it is enabled by steps (2) and (3). For different learning tasks, these steps are executed using different learning models. For example, in a feature-based TL task [20], the feature dataset is obtained from the source dataset and characterized by a kernel matrix. Then, the alignments of the feature data and the target data, each given by its own kernel matrix, are employed to evaluate their similarity. Thus, by comparing the alignments, one can pick out the relevant feature datasets as a supplement for the training of the task model. Such a strategy is especially helpful when the size of the target domain data is insufficient for training an ML model. In fact, TL has broad applications in everyday problems. It has already shown the potential to solve practical challenges in image recognition, social network analysis, document classification, etc. [14,15].
The exploration of quantum computing indicates that a speedup can be achieved using quantum algorithms when solving certain problems. As shown by a series of recent works [21][22][23][24][25][26][27], ML tasks are also on the list, and the corresponding solutions are termed quantum machine learning (QML) algorithms. So far, several QML algorithms have been proposed and demonstrated, including the quantum support vector machine [28,29], quantum deep learning [30][31][32][33][34][35][36], the quantum Boltzmann machine (QBM) [37,38], quantum generative adversarial learning [39][40][41] and so on. Also, in references [42,43], the authors have discussed three kinds of TL schemes: quantum to classical, quantum to quantum and classical to quantum. They employ a quantum variational circuit to learn the features for the classification task. The quantum variational circuit requires a classical computer to obtain the updates of the parameters of the quantum circuit. In essence, these are hybrid classical-quantum TL schemes, although they involve quantum to quantum TL. Recently, TL for the scalability of neural-network quantum states [44] has used a classical learning model to express the quantum state and reused the previously trained parameters obtained by TL, thereby improving the effectiveness of the ground state ansatz. However, ML strategies beyond dataset-based training, such as the knowledge transfer enabled by classical TL algorithms, have not yet been realized in a fully quantum (quantum to quantum) computing framework. Such strategies would naturally combine the advantages of TL and quantum computation, removing crucial requirements on the dataset, parameter updating, etc. for QML and providing a speedup in all stages of the processing.
In this work, we propose a complete scheme of quantum deep transfer learning (QDTL) based on an energy model. In analogy with the classical TL process, our scheme takes advantage of the knowledge from another domain and is completely described by quantum states. We not only propose a generative model for the quantum data to be transferred, but also define its generalized kernel matrices, which are applicable to mixed quantum states. In addition, the quantum alignment method is introduced to judge the similarity of two kernel matrices. As a practical illustration, we apply the scheme to the classification task on the Iris dataset and demonstrate that the classification accuracy can be improved greatly in comparison with the traditional QML algorithm.

Scheme of quantum deep transfer learning
The schematic diagram of the scheme is shown in figure 1(b). It also includes four steps, which have a one-to-one correspondence with the classical scheme shown in figure 1(a). All datasets in the task are encoded by quantum states. Because the feature dataset here is encoded by mixed states, we use density matrices to describe the corresponding theory. The extraction of the knowledge transfer structure is again the heart of the scheme and includes three parts: (1) generate quantum feature datasets using a generative model, so that a series of quantum datasets sharing similar features with the source domain can be obtained; (2) design a quantum subroutine to compute the kernel matrix of a dataset, which describes the features quantitatively; (3) evaluate the similarity of the target datasets and feature datasets by computing the alignments of the kernel matrices of the feature and target data respectively. In the following, we present the detailed algorithms of the three parts. Unless otherwise specified, we denote the quantum source domain dataset by $\{\rho_i^S\}_{i=1\ldots M_S}$. The integer index $i$, which varies from 1 to $M_S$, marks the different samples, which are encoded by qubits and represented by the density matrices $\rho_i^S$. The quantum target domain dataset is denoted by $\{(\rho_j^T, y_j)\}_{j=1\ldots M_T}$, whose sample index $j$ varies from 1 to $M_T$. Each sample is given by a pair of states, where $\rho_j^T$ is the density matrix of the encoded qubits and $y_j = |1\rangle$ or $|-1\rangle$ is the corresponding binary label of $\rho_j^T$. Without loss of generality, the basis of the qubits is denoted by $\{|1\rangle, |-1\rangle\}$, and we assume that $\rho_i^S$ and $\rho_j^T$ are density matrices of basis states in the multi-qubit Hilbert space.
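As a minimal sketch of the encoding assumed above, the following Python snippet builds the density matrix of a multi-qubit basis state from a sample of $\pm 1$ values, one qubit per value. The function name and the mapping of $+1$ and $-1$ to the index bits 0 and 1 are assumptions made for illustration; the text does not fix this convention.

```python
import numpy as np

def basis_state_density_matrix(sample):
    """Return |s><s| for a sample of +1/-1 values, one qubit per value."""
    # Assumed convention (not fixed by the text): +1 -> bit 0, -1 -> bit 1.
    bits = [0 if v == 1 else 1 for v in sample]
    index = int("".join(str(b) for b in bits), 2)
    dim = 2 ** len(sample)
    rho = np.zeros((dim, dim))
    rho[index, index] = 1.0  # a basis state has a single unit diagonal entry
    return rho

# Four-qubit sample, e.g. one binarized data point with four attributes.
rho = basis_state_density_matrix([1, -1, 1, -1])
```

Because every sample is a basis state, each density matrix is diagonal with unit trace, which keeps the later kernel and alignment computations easy to check numerically.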

Quantum generative model for feature data
A feature dataset for the QDTL scheme can be obtained in various ways. Here we design the subroutine for this purpose using the network model of the QBM. A QBM is a quantum machine described by an Ising-type Hamiltonian whose eigenstate probability distribution is a Boltzmann distribution. Starting from an initial state, a QBM can approximately reach the required Boltzmann distribution by an annealing process [37,45]. However, not all optimization problems can be solved using a QBM. Here, we use the QBM as an example; other models can be used in QDTL protocols in a similar way. Specifically, for a given source dataset $\{\rho_i^S\}_{i=1\ldots M_S}$, we train a QBM such that the marginal Boltzmann probability of the model becomes close to the probability of the source data. Then, the quantum feature dataset can be obtained by measuring the QBM according to the labeled target dataset. Figure 2 shows an example of a QBM model with equal numbers of visible and hidden nodes, depicted by blue and red circles respectively. Each node represents a two-level subsystem, or qubit, whose basis states are denoted by $|1\rangle$ and $|-1\rangle$. The QBM can be described by the transverse Ising model [28], so that the Hamiltonian is given by

$$H = -\sum_{a} \Gamma_a \sigma_a^x - \sum_{a} b_a \sigma_a^z - \sum_{a,b} w_{ab}\, \sigma_a^z \sigma_b^z, \tag{1}$$

where $\Gamma_a$ is the parameter of the transverse field, $b_a$ is the threshold of the network, and $w_{ab}$ is the weight between network nodes. $\sigma_a^x$ and $\sigma_a^z$ are the Pauli operators acting on node $a$, given by $\sigma_a^{x,z} = \underbrace{I \otimes \cdots \otimes I}_{a-1} \otimes\, \sigma^{x,z} \otimes \underbrace{I \otimes \cdots \otimes I}_{N-a}$, with Pauli matrices $\sigma^x$ and $\sigma^z$. $N$ is the total number of qubits. With the Hamiltonian given by equation (1), the partition function and density matrix of the QBM are $Z = \mathrm{Tr}\, e^{-H}$ and $\rho = Z^{-1} e^{-H}$ [35].
Following the training process in [35], we obtain a workable QBM model and then measure the qubits of its visible layer. The measurement operators are set to be the projection operators onto the sample states of the target domain; that is, the operator defined by the sample state $\rho_j^T$ projects the visible layer onto $\rho_j^T$ (equation (2)). The measured visible-layer states form the feature dataset $f_T = \{\rho_{vT}^j\}_{j=1\ldots M_T}$, whose number of sample states is the same as that of the target dataset.
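The relation between the parameters $\Gamma_a$, $b_a$, $w_{ab}$ and the thermal state $\rho = Z^{-1} e^{-H}$ can be illustrated with a small dense-matrix simulation. This is only a sketch for a handful of qubits, not the annealing or training procedure of the paper. The overall minus signs and the restriction of the coupling sum to pairs $a < b$ are assumptions (the common transverse Ising convention), and all function names are hypothetical.

```python
import numpy as np

I2 = np.eye(2)
SX = np.array([[0.0, 1.0], [1.0, 0.0]])
SZ = np.array([[1.0, 0.0], [0.0, -1.0]])

def site_operator(op, a, n):
    """Embed a single-qubit operator on qubit a (0-based) of an n-qubit register."""
    out = np.array([[1.0]])
    for k in range(n):
        out = np.kron(out, op if k == a else I2)
    return out

def qbm_hamiltonian(gamma, b, w):
    """Transverse Ising Hamiltonian (assumed sign convention, couplings for a < c only)."""
    n = len(b)
    H = np.zeros((2**n, 2**n))
    for a in range(n):
        H -= gamma[a] * site_operator(SX, a, n)
        H -= b[a] * site_operator(SZ, a, n)
        for c in range(a + 1, n):
            H -= w[a][c] * site_operator(SZ, a, n) @ site_operator(SZ, c, n)
    return H

def gibbs_state(H):
    """rho = exp(-H) / Tr exp(-H), computed from the eigendecomposition of H."""
    evals, evecs = np.linalg.eigh(H)
    weights = np.exp(-(evals - evals.min()))  # shift cancels after normalization
    weights /= weights.sum()
    return (evecs * weights) @ evecs.conj().T

rng = np.random.default_rng(0)
n_visible, n_hidden = 2, 2
n = n_visible + n_hidden
H = qbm_hamiltonian(gamma=np.full(n, 0.5), b=rng.normal(size=n), w=rng.normal(size=(n, n)))
rho = gibbs_state(H)

# Marginal probability of each visible basis state (hidden qubits summed out).
p_visible = np.real(np.diag(rho)).reshape(2**n_visible, 2**n_hidden).sum(axis=1)
```

The marginal `p_visible` is the quantity that training would push toward the distribution of the source data; here the parameters are random, so it only demonstrates the bookkeeping.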

Kernel matrix
To describe the characteristics of the datasets properly, we design a subroutine for representing the kernel matrix of the density matrices of the feature dataset. The computation of the kernel matrix for pure states has been proposed in reference [28]. Here, we provide a more general form for mixed states. In the scheme, a series of reference qubit systems $R = \{R_j\}_{j=1\ldots M_T}$ is introduced. $R_j$ has the same state space as $\rho_{vT}^j$, and its orthonormal basis states are denoted by $|n_{R_j}\rangle$, where $n_{R_j}$ is an integer ranging from 1 to $d$. Following the example of the last part, we set $d = 2^{N/2}$. With such reference systems, one can generate a series of pure states whose subsystem states are represented by $\{\rho_{vT}^j\}$ [46]. A precise purification procedure has been given in reference [47]. We also apply a quantum random access memory (QRAM) [48], which can create a superposition of feature data in a data register, correlated with the address register. The state of the address register of the QRAM is denoted by $\rho = \frac{1}{M_T}\sum_{k,r=1}^{M_T} |k\rangle\langle r|$, where $k$ and $r$ represent the $k$th and the $r$th memory cells. In the QRAM, a superposition state can be obtained by the oracle operations on all the sample states $\rho_{vT}^j$ of $f_T$ (equation (3)), where $|\psi_n^k\rangle$ ($|\psi_m^r\rangle$) represents the $n$th ($m$th) basis state of $\rho_{vT}^k$ ($\rho_{vT}^r$) and $p_n^k$ ($p_m^r$) is its probability. After the partial trace over the reference system $R$, a mixed state $K_1$ is obtained (equation (4)). Then, after the partial trace over the feature states register $\{\rho_{vT}^j\}$ of the mixed state $K_1$, a mixed state $K$ is obtained (equation (5)). In fact, the density matrix $K$ given by equation (5) corresponds closely to the classical kernel matrix [49].
To describe the characteristics of the labeled datasets, we also need to construct a state $K_{ideal}$ that encodes the normalized kernel matrix of the label states. Because the label states in our case are pure quantum states, we do not have to introduce the reference system. The state encoding the kernel matrix of the label data $y_j$ can be obtained directly. One can also obtain similar results using the algorithm in reference [28].
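For intuition, the pure-state special case attributed above to reference [28] can be checked numerically: preparing the superposition $\frac{1}{\sqrt{M}}\sum_k |k\rangle|x_k\rangle$ and tracing out the data register leaves an address-register state whose entries are the normalized overlaps $\langle x_r | x_k\rangle / M$. The sketch below covers only this pure-state case; it does not reproduce the mixed-state construction with reference systems, and the function name and sample vectors are illustrative assumptions.

```python
import numpy as np

def kernel_state_pure(samples):
    """Address-register state after tracing out the data register (pure-state case)."""
    X = np.asarray(samples, dtype=complex)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize each sample state
    M = X.shape[0]
    # Row k of C holds the amplitudes of |x_k> in (1/sqrt(M)) sum_k |k>|x_k>.
    C = X / np.sqrt(M)
    # Partial trace over the data register: K[k, r] = <x_r|x_k> / M.
    return C @ C.conj().T

samples = [[1, 0], [1, 1], [0, 1]]  # three illustrative single-qubit states
K = kernel_state_pure(samples)
```

The result is Hermitian, positive semidefinite and has unit trace, so it is a valid density matrix that can be fed into a trace-estimation routine.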

Quantum alignment
To compare the similarity of the feature dataset and the target dataset, we introduce the quantum alignment. It is defined through the trace of the product of the two kernel matrices of the quantum datasets. Our scheme generalizes the schemes of [50][51][52] to mixed-state cases. Specifically, consider two mixed states $K$ and $K_{ideal}$, which encode the kernel matrices of the data samples and of their labels, and an ancilla qubit $|-1\rangle$. In addition, two reference qubit systems $R_\alpha$ and $R_\beta$ are introduced. $R_\alpha$ ($R_\beta$) has the same state space as $K$ ($K_{ideal}$), and its orthonormal basis states are denoted by $\{|\alpha\rangle\}$ ($\{|\beta\rangle\}$), with integer index $\alpha$ ($\beta$). Similar to the trick in the kernel matrix part, the purified quantum states of $K$ and $K_{ideal}$ can be given by $\sum_\alpha \sqrt{p_\alpha}\,|\psi_\alpha\rangle|\alpha\rangle$ and $\sum_\beta \sqrt{q_\beta}\,|\phi_\beta\rangle|\beta\rangle$ respectively, where $|\psi_\alpha\rangle$ ($|\phi_\beta\rangle$) is the $\alpha$th ($\beta$th) basis state of $K$ ($K_{ideal}$) and $p_\alpha$ ($q_\beta$) is the corresponding probability. Then the input state $|\varphi\rangle$ of the quantum alignment algorithm is expressed as

$$|\varphi\rangle = |-1\rangle \otimes \sum_\alpha \sqrt{p_\alpha}\,|\psi_\alpha\rangle|\alpha\rangle \otimes \sum_\beta \sqrt{q_\beta}\,|\phi_\beta\rangle|\beta\rangle.$$

First, a Hadamard operation is performed on the ancilla qubit, which changes its state from $|-1\rangle$ to an equal superposition of $|-1\rangle$ and $|1\rangle$. Then, the Fredkin gate is applied to $\sqrt{p_\alpha}\,|\psi_\alpha\rangle$ and $\sqrt{q_\beta}\,|\phi_\beta\rangle$, with the ancilla qubit as the control qubit. Finally, another Hadamard operation is performed on the ancilla qubit, which yields the output quantum state $|\phi\rangle$. The probability of finding the ancilla qubit in $|-1\rangle$ can be estimated by measurements and is given by $p = \frac{1}{2}[1 + \mathrm{Tr}(K K_{ideal})]$. Then, one obtains

$$\mathrm{Tr}(K K_{ideal}) = 2p - 1. \tag{9}$$

In fact, equation (9) provides a method to estimate the trace of the product of two arbitrary density matrices. Following the spirit of classical TL, the quantum alignment of a labeled dataset, $\mathrm{Align}(K)$, is written in terms of such traces (equation (10)). The detailed demonstration is presented in section S1 of the supporting materials (https://stacks.iop.org/NJP/23/103010/mmedia). Applying equation (9), each trace in equation (10) can be estimated, so that the quantum alignment can be obtained.
Employing the above three subroutines, a complete QDTL can be realized by obtaining feature datasets multiple times. We compare the quantum alignment of each feature dataset $f_T^z$, denoted by $K_f^z$, with that of the target dataset. The three subroutines (quantum generative model for feature data, kernel matrix and quantum alignment) are executed in sequence in order to supplement the target dataset. The QBM is then trained on the marginal probability distribution (MPD) of the visible layer from the last run (the MPD of the first run comes from the source dataset), and the three subroutines are executed again.
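The trace estimate above can be sketched classically. Given the ideal ancilla probability $p = \frac{1}{2}[1 + \mathrm{Tr}(K K_{ideal})]$, the overlap is recovered as $2p - 1$. For the alignment itself, the snippet assumes the normalized form of classical kernel alignment, $\mathrm{Tr}(K K_{ideal}) / \sqrt{\mathrm{Tr}(K^2)\,\mathrm{Tr}(K_{ideal}^2)}$, which uses three traces; this form is an assumption and may differ from the exact expression used in the paper. Function names are hypothetical, and no measurement noise is simulated.

```python
import numpy as np

def swap_test_probability(rho, sigma):
    """Ideal ancilla probability p = (1 + Tr(rho sigma)) / 2, without sampling noise."""
    return 0.5 * (1.0 + np.real(np.trace(rho @ sigma)))

def overlap_from_probability(p):
    """Invert the swap-test relation: Tr(rho sigma) = 2p - 1."""
    return 2.0 * p - 1.0

def alignment(K, K_ideal):
    """Assumed normalized alignment built from three swap-test trace estimates."""
    t12 = overlap_from_probability(swap_test_probability(K, K_ideal))
    t11 = overlap_from_probability(swap_test_probability(K, K))
    t22 = overlap_from_probability(swap_test_probability(K_ideal, K_ideal))
    return t12 / np.sqrt(t11 * t22)

# Illustrative inputs: a maximally mixed qubit state and a pure |+><+| state.
K = np.eye(2) / 2.0
plus = np.array([1.0, 1.0]) / np.sqrt(2.0)
K_ideal = np.outer(plus, plus)

a = alignment(K, K_ideal)
```

With this normalization the alignment of any state with itself is 1, and lower values indicate kernel matrices that are less similar. On hardware, each probability would be estimated from repeated measurements of the ancilla rather than computed exactly.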

Examples
To show the effectiveness of the QDTL, we present a numerical simulation of its performance on the classification of the Iris dataset. The Iris dataset is frequently used to demonstrate classification tasks. It contains 150 samples describing three types of Iris flowers: setosa, versicolour and virginica. Each sample contains four attributes: sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW). Based on this dataset, we discuss the binary classification task, that is, classification of two types of Iris. We first specify the datasets for our binary tasks. Suppose that the Iris dataset is denoted by $I = I_1 \cup I_2 \cup I_3$, where $I_1$, $I_2$ and $I_3$ are the datasets of setosa, versicolour and virginica samples respectively. The target domain dataset $T$ is set to be $I_p \cup I_q$, with $p, q = 1, 2, 3$ and $p \neq q$. The corresponding source domain dataset $S$ is set to be $I_p \cup I_s$, with $s = 1, 2, 3$ and $s \neq p, q$. This means that $T$ and $S$ share one sub-dataset containing the same type of Iris samples. In the main text we only present the results for $T = I_2 \cup I_3$, for which the source domain dataset is chosen to be $S_1 = I_1 \cup I_2$ or $S_2 = I_1 \cup I_3$.
More numerical examples are given in sections S2 and S3 of the supporting materials.
Before applying the above QDTL algorithm, we preprocess these datasets. Each attribute of a sample is changed from a real number to a binary number. In other words, four thresholds are defined for the four attributes of the samples. When an attribute is greater than its threshold, it is marked as {1}. Otherwise, it is marked as {−1}. For such preprocessing, the four thresholds must be set properly so that identical binarized samples do not belong to contradictory types. For example, to meet this requirement we set the thresholds of SL, SW, PL and PW to 5.5, 3.1, 2.0 and 1.6 respectively.
The preprocessed datasets of versicolour and virginica are shown in figure 3(b). The 'category' denotes the different binary data types after preprocessing. For example, '1' represents the binary data (−1, −1, −1, −1), whose elements indicate that all four attribute values (SL, SW, PL and PW) are below the corresponding thresholds. '2' represents the binary data (−1, −1, −1, 1), whose elements indicate that only the last attribute value (PW) of the sample is greater than its threshold, while the rest are below the corresponding thresholds. The larger numbers follow logically. We can observe that the versicolour samples are mainly distributed in binary data '11' (blue), but there are also a small number of virginica samples in binary data '11' (black). For comparison, the initial data distributions are shown in figures 3(a1) and (a2). Figure 3(a1) shows the distribution of SL and SW, and figure 3(a2) shows that of PL and PW. It is clearly seen that the initial data points of versicolour and virginica are not linearly separable. After preprocessing, the two types of data are still not linearly separable. This indicates that the main character of the datasets does not change and that the preprocessing method contributes nothing to the classification task. The preprocessed datasets for the other cases are similar and are given in section S2 of the supporting materials.
After preprocessing, we run our QDTL algorithm by employing the source dataset with 100 samples in each binary classification task. According to the above theoretical scheme, we consider the QBM discussed in section 2.1. This model has 4 visible units and 4 hidden units. We allow full connectivity within the visible layer and within the hidden layer, and all-to-all connectivity between the layers. We train the QBM for 160 gradient steps in the update rule and set the learning rate to 0.05. The QBM is then trained again with the MPD of the visible layer from the last run. After 50 rounds of such training, we express the obtained density matrices of the feature dataset and the target dataset as kernel matrices according to the theory of section 2.2, and then apply the quantum alignment to compare the similarity of the feature dataset and the target dataset. Finally, we select 4 suitable feature datasets to expand the target data, so that the size of the target dataset is enlarged to 500. To complete the binary classification task, we consider the quantum neural network given in [36] and train a three-layer quantum neural network. We train the model for 1000 gradient steps in the update rule and set the learning rate to 0.03.
For comparison, we also train the network on the original target dataset. The results are shown in figure 4. Here the ratio of training samples to test samples is 4:1. The green lines in figures 4(a) and (b) display the classification accuracies of the QDTL cases as a function of training times. The classification accuracies are arithmetic averages over ten test results. The corresponding results based on the original target dataset with 100 samples, whose distributions are given in figure 3, are plotted in blue. In the two cases, the final accuracies increase by 16.9% when employing $S_1$ and by 24.6% when employing $S_2$. At the same time, we see that the standard errors for the TL data are smaller than those for the target data, indicating that the former gives more stable results on the test data. The results of the other examples are given in section S3 of the supporting materials. In conclusion, all the results show a significant improvement of the classification accuracies when the QDTL scheme is applied.
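The thresholding and category numbering described above can be sketched as follows. The thresholds are those quoted in the text; the category rule (one plus the binary number formed by reading −1 as 0 and 1 as 1, with PW as the least significant bit) is an assumption that is consistent with the stated examples '1' and '2'. The function names and the sample values are illustrative and not taken from the dataset.

```python
import numpy as np

# Thresholds for SL, SW, PL, PW as quoted in the text.
THRESHOLDS = np.array([5.5, 3.1, 2.0, 1.6])

def binarize(samples):
    """Map each attribute to 1 if it exceeds its threshold, otherwise to -1."""
    samples = np.asarray(samples, dtype=float)
    return np.where(samples > THRESHOLDS, 1, -1)

def category(binary_sample):
    """Assumed numbering: 1 + binary value of (SL, SW, PL, PW) with -1 -> 0, 1 -> 1."""
    bits = [(int(v) + 1) // 2 for v in binary_sample]
    return 1 + int("".join(str(b) for b in bits), 2)

# Illustrative measurement (SL, SW, PL, PW) in centimetres.
binary = binarize([[6.0, 2.7, 4.5, 1.5]])[0]
cat = category(binary)
```

Under this assumed rule, the illustrative sample falls into category '11', which corresponds to the pattern (1, −1, 1, −1): SL and PL above their thresholds, SW and PW below.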

Complexity analysis
In this section, we discuss the complexity of the quantum algorithm. We first estimate the complexity of generating the feature dataset, which is composed of the QBM training and the projective measurement process. In this module, we train the QBM model on the source dataset.
To make the error of sampling the gradient of the objective function less than or equal to $\epsilon$, the query complexity is $O(M/\epsilon^2 - M\epsilon_H^2)$ [38]. We assume that we have an oracle, $F_H(\epsilon_H)$, that takes the parameters $(\Gamma_a, b_a, w_{ab})$ of the QBM and outputs a state $\sigma$ such that $\| e^{-H}/Z - \sigma \|_{\mathrm{tr}} \le \epsilon_H$ for $\epsilon_H \ge 0$, where tr denotes the trace distance and $\epsilon_H$ is a constant. Further, let $D$ be an approximation to the gradient of the training objective function of the QBM, which is obtained by sampling the expectation values of the Hamiltonian. In reference [53], the gradient of the QBM, which is compatible with our scheme, is estimated and the finite-difference method is compared with the automatic differentiation approach. This comparison indicates that the number of circuits strongly affects the time and resources required, so that different methods can be chosen according to the actual problem. The number of circuits is $\Theta(tqp(q + p))$ for the former and $\Theta(tq^2(q + p))$ for the latter, where $p$ is the number of Hamiltonian parameters, $q$ is the number of trial state parameters, and $t$ is the number of time steps during the Gibbs state preparation. Another essential part of the QBM is sampling from the thermal distribution. The complexity of preparing a Gibbs state $\rho \in \mathbb{C}^{2^N \times 2^N}$ within error $\epsilon$ is $O\!\left(\sqrt{2^N/Z}\;\mathrm{polylog}\!\left(\frac{1}{\epsilon}\sqrt{2^N/Z}\right)\right)$ [54]. This is roughly quadratically better than existing approaches for preparing general Gibbs states [55] if constant $\epsilon$ is required, but constitutes an exponential improvement if $1/\epsilon$ is large. However, there are still some open problems in the QBM quantum algorithm. One example is how to estimate the number of epochs needed for the algorithm to converge. This number is unknown and depends sensitively on the training data as well as on the learning rate. If the QBM algorithm can be run efficiently on a quantum computer, it cannot be simulated efficiently by classical methods unless BQP = BPP [56,57].
According to the above complexity analysis of the gradient and sampling process, especially the analysis of the BQP-hard decision problem, QBM training offers the potential for exponential acceleration compared to classical ML methods.
Further, we obtain the feature dataset by projecting the target domain dataset onto the visible layer of the QBM. The probability of success of a measurement is $\Theta(1)$, so that the number of measurements is about $|\{P_v : P_v > 0\}|$ in the target dataset. For datasets of the same size, the module for generating the feature dataset can provide an exponential acceleration compared with classical ML methods.
In the kernel matrix part, we need to calculate the kernel matrices $K$ and $K_{ideal}$ respectively. According to the kernel matrix preparation method proposed in reference [28], one way to construct these quantum states is through QRAM, which uses $O(M_T d)$ hardware resources but requires only $O(\log(M_T d))$ operations to access them. One of the caveats of QRAM is that it either contains the partial sums of the probability amplitudes of the quantum states, or pointers to the large entries of the states [58]. Because we encode samples with qubits in $|1\rangle$ and $|-1\rangle$, our scheme can address this caveat. In this way, the superposition state of the QRAM is obtained through an oracle with the address state $\rho = \frac{1}{M_T}\sum_{k,r=1}^{M_T} |k\rangle\langle r|$. Then, by performing partial trace operations on the reference system $R$ and on the feature states register, we obtain the kernel matrix. Therefore, the complexity of the kernel matrix part is equal to the complexity of QRAM, that is, $O(\log(M_T d))$. A total of six kernel matrices need to be represented each time. In big O notation [46], constant factors do not affect the order of complexity and can be omitted. This means that the run time of the kernel matrix part is $O(\log(M_T d))$, to be compared with that of the classical counterpart [59]. For the QRAM, however, it is difficult to obtain a known oracle under existing experimental conditions, and this oracle is required to have an exponentially lower query complexity. Reference [60] proposed another method of preparing the kernel matrix in a new Bayesian deep learning algorithm on a quantum computer. This method avoids the use of an oracle, but provides only a polynomial speedup compared with the classical algorithm.
In the quantum alignment part, each trace is estimated by a swap test. In reference [50], the number of single-qubit quantum gates used in each swap test is $O(\log M_T)$. In the quantum circuit of the swap test, the number of quantum gates along the ancilla qubit path is the largest, and it is equal to the number of quantum gates in the whole swap test process. Therefore, we can estimate that the time complexity of the swap test is equal to the circuit depth, that is, $O(\log M_T)$. In the measurement of the ancilla qubit, if we require the measurement accuracy of the quantum alignment to be $\kappa$, then the complexity of each trace operation is $O(\kappa^{-1} \log M_T)$. Three trace operations are required for each calculation of $\mathrm{Align}(K)$. In big O notation, constant factors can be omitted. The complexity of the quantum alignment is therefore $O(\log M_T)$, whereas the complexity of the alignment in reference [20] is $O(M_T^2)$. The above provides a detailed analysis of the three main parts of the QDTL algorithm. In the module for generating the feature dataset, the QBM provides a potential exponential acceleration. In the kernel matrix module, the complexity is $O(\log(M_T d))$. In the quantum alignment module, the complexity is $O(\log M_T)$. Further, we also provide the probability of successful postselection of the quantum alignment, which is approximately 0.79. The detailed results are given in section S4 of the supporting materials. We therefore believe that the QDTL algorithm potentially offers an exponential advantage. However, the probability of successful postselection depends sensitively on the similarity of the target dataset and the source dataset. Further work will be needed to provide good empirical and theoretical bounds on what is required to obtain a sufficient feature dataset.

Discussion and conclusion
Recent developments in quantum annealing processors have made it possible to test some quantum machine-learning ideas experimentally with QBMs. With minor modifications to the hardware design, such a model can become available. However, as argued in reference [45], not all optimization problems can be solved using a QBM. Quantum annealers also depend significantly on annealing time and temperature. We hope this work could open new possibilities in the areas of ML and quantum information processing.
We would like to emphasize that for a simple dataset (e.g., a linearly separable one), larger networks are naturally inefficient, because searching for the optimal values of the objective function in a large parameter space is costly and unnecessary. In such cases, the QDTL scheme shows no great advantage. In fact, the advantage of the QDTL scheme lies in classification tasks involving complex datasets, not linearly separable ones. In these cases, specific linear classifiers do not work well.
In conclusion, we have proposed a complete QDTL process for feature-based knowledge transfer. The key component of the knowledge transfer consists of three parts: a model for generating quantum feature data based on the QBM, an algorithm for computing kernel matrices of mixed quantum states, and a quantum alignment algorithm for comparing the characteristics of the kernel matrices. By applying our scheme to the classification task on the Iris dataset, we have also demonstrated that the classification accuracy of the considered QML model can be greatly improved with the help of the QDTL scheme. Based on our scheme, an exponential speedup over the classical counterpart has been demonstrated.