Two-Stage Model-Agnostic Meta-Learning With Noise Mechanism for One-Shot Imitation

Given that humans and animals can learn new behaviors in a short time by observing others, a natural question is how to make robots behave similarly: through effective demonstration, a robot should quickly understand and learn a new ability. One possible solution is imitation-based meta-learning, but most related approaches are limited to a particular network structure or a specific task. In particular, meta-learning methods based on gradient updates are prone to overfitting. In this article, we propose a generic meta-learning algorithm that divides the learning process into two independent stages (skill cloning and skill transfer) with a noise mechanism, and that is compatible with any model. The skill cloning stage enables a good understanding of the demonstration, which helps the skill transfer stage when the robot applies the learned experience to new tasks. The experimental results show that our algorithm alleviates overfitting by introducing a noise mechanism. Our method not only performs well on the regression task but also significantly outperforms existing state-of-the-art one-shot imitation learning methods in the same simulation environments (i.e., simulated pushing and simulated reaching).


I. INTRODUCTION

A. MOTIVATION
As we know, humans and animals can learn new behaviors quickly by observing or imitating others and can effectively adapt to environmental changes by using previous knowledge and experience. We expect artificial agents to learn as fast as humans. Generally speaking, machine learning requires a large number of samples for training, while humans need only a small number of samples to learn new skills and concepts. For example, humans need to see only a few examples of cats and dogs to know the differences between their shapes and characteristics, and can then learn to distinguish between them. Since the application environment of robots has migrated from simple settings to unstructured and complex environments, it requires a large amount of expert knowledge, and the corresponding programming has become complicated, time-consuming, and expensive [1], [2]. Generally, we want robots to be as adaptable as humans, which is almost impossible to achieve through traditional programming. Through demonstrations, we can unambiguously communicate any manipulation task, and simultaneously provide clues about the specific motor skills the robot needs to perform it [3]. As such, the core problem of meta-learning is how to build a machine learning model or method that can quickly learn from a small number of samples.

(The associate editor coordinating the review of this manuscript and approving it for publication was Yangmin Li.)

B. LITERATURE REVIEW
In imitation learning, one of the leading methods is behavior cloning based on supervised learning (e.g., [4]). However, when faced with a changing environment, methods based on behavior cloning are often not adaptable and are prone to overfitting. Unlike behavior cloning, the main idea of reinforcement learning [5], [6] is to acquire skills through extensive trial and error, which has achieved remarkable success in many fields, such as continuous control in simulated or real environments [7]-[10], AlphaGo Zero [11], and Quake III Arena [12]. However, because these reinforcement learning methods require interaction with the environment and random behavioral exploration, they often visit unsafe or undefined state spaces during training, while still not effectively solving the problem of quickly adapting to a new environment. In contrast to reinforcement learning, inverse reinforcement learning [13] selects the optimal behavior by learning a reward function [14]-[16] as an estimate, but it requires additional expert knowledge to optimize rewards [16]. In general, reinforcement learning through random trial and error is time-consuming.
Although previous work has produced many impressive achievements in robotics, it has primarily considered each skill separately, rather than using one learned skill to speed up learning another. Against this background, meta-learning has been proposed in various concepts and forms [17]-[19] and has been successfully applied to generative modeling [20], image recognition [21]-[25], and weight optimization algorithms [26]-[28]. However, these one-shot or few-shot meta-learning algorithms have not been applied to imitation learning. In the past, most imitation learning (learning from demonstrations) methods operated at the level of configuration-space trajectories [29]. These trajectories are usually collected by teleoperation [30], kinesthetic teaching [31], or sensors [32].
Unlike traditional methods based on manual programming, Duan et al. [3] proposed an excellent meta-learning framework that attempts to make the robot learn from very few demonstrations of any given task and instantly generalize to new situations without requiring task-specific engineering. Similarly, James et al. [33] introduced Task-Embedded Control Networks, which leverage a task embedding to learn new tasks from demonstrations with meaningful results. Furthermore, Shao et al. [34] proposed a significant method for one-shot imitation combined with object detection, which is not an end-to-end framework. As it needs to train carefully designed autoencoder, object detection, and motion policy networks, and requires more manual labeling information, this method is complicated and hard to train. Since these methods mainly focus on a specific network structure, they may lack versatility and flexibility for other tasks.
Not aiming at a particular network, Finn et al. [35] proposed the model-agnostic meta-learning algorithm (MAML), which is simple, elegant, and powerful across various tasks. Different from previous meta-learning methods [14], [25]-[27] that learn update functions or learning rules, the MAML algorithm neither expands the number of learned parameters nor constrains the model architecture (e.g., by requiring a recurrent model [36] or a Siamese network [37]). As shown in Fig. 1 (left), the main idea of MAML is to make the model perform best on new tasks after updating its parameters through one or a few gradient steps. However, it is commonly thought that simple gradient descent can get trapped in poor local minima [38] (see Fig. 2).
Based on MAML, Finn et al. [39], [40] proposed significant work on one-shot visual imitation learning. The ideal goal of one-shot imitation is to make robots deduce actions in a new environment from a provided demonstration. However, unlike image classification and regression tasks, which provide labeled data during training and testing, imitation learning is more complicated: it is not easy to offer real-time actions corresponding to the demonstration in practical applications. In other words, we want the model to directly infer the corresponding behaviors in a new scene without additional information beyond the teaching videos. Therefore, it is tricky to make the model deduce how to act in new scenarios from a series of observations alone.
As shown in Fig. 1, the MAML algorithm for one-shot imitation (see Algorithm 1) tries to directly adopt the output on the support tasks as a loss function to perform an internal gradient update of the model θ, since expert actions are not provided to the inner gradient update. Because MAML's internal gradient update provides no supervision information for the inner loss, the inner loss is unpredictable (we found that the internal loss is quite large in our experiments). Therefore, it cannot guarantee that the model has fully understood the behaviors in the demonstrations (support tasks). After being updated by inner gradient descent, the model is directly forced to adapt to new scenes (target tasks). Intuition tells us that such a learning model is more like a direct mapping from pictures to pictures than a model that truly understands the meaning of the pictures.

C. PAPER CONTRIBUTIONS
In order to solve the above problems, the contributions of our work are as follows:
• We propose a general two-stage model-agnostic meta-learning algorithm (TMAML, see Fig. 1 and Algorithm 2) for one-shot imitation. Compared to the MAML algorithm, TMAML divides the learning process into two independent stages:
 - Skill cloning. In this stage, we try to let the model fully understand the intent of the demonstrations (support tasks) only.
 - Skill transfer. In this stage, we transfer the learned knowledge (learned from support tasks) to the new scenes (target tasks).
 Note that TMAML is not simply MAML with more inner steps, since each stage has its own outer gradient update step and the two stages are relatively independent.
In other words, we could consider skill cloning as pre-cognition or pre-training of skill transfer.
• In the process of meta-learning, we introduce a noise mechanism (see Algorithm 3) to alleviate the overfitting problem.
 - Note that our noise mechanism is different from the work based on a complex probabilistic model proposed by Finn et al. [41]. In our work, we simply inject global noise into the model parameters θ at the beginning of the inner gradient update, which is easy to implement and increases the robustness of the model's adaptability.

FIGURE 1. Diagrams of our two-stage model-agnostic meta-learning algorithm (TMAML) and the MAML algorithm when expert actions are not provided to the inner gradient update. (a) The MAML algorithm directly adopts the output on the support tasks as a loss function to perform an internal gradient update of the model θ, since expert actions are not provided to the inner gradient update. The updated model θ* is then applied to adapt to new environments (target tasks). (b) Our TMAML algorithm divides the learning process into two stages: skill cloning and skill transfer. The purpose is to ensure that the model has mastered the skills or knowledge from the given demonstrations (support tasks) before applying the learned experience to new tasks (target tasks). In the skill cloning stage, we use the same internal gradient update strategy as MAML, but we do not directly apply the updated model to a new environment. Instead, we provide expert actions for the support tasks in the outer loss so that the model understands which actions the demonstrations require. After that, we assume the model has grasped the skills learned from the support tasks and turn to skill transfer. In the skill transfer stage, we keep the same inner gradient update as before and perform a skill transfer at the outer gradient update to finish new tasks (target tasks) based on experience from the support tasks.
Note that we provide expert actions only to the outer gradient update, in both skill cloning and skill transfer.
• In our work, we successfully apply our algorithm to regression tasks, simulated pushing tasks, and simulated reaching tasks. We demonstrate that our algorithm achieves state-of-the-art performance in the same experimental setting compared to previous advanced methods.

II. PROBLEM FORMULATION OF META-IMITATION LEARNING
For one-shot imitation, the MAML algorithm aims to train a model that can achieve rapid adaptation (a vision-based policy needs to adapt to a new scene from a single demonstration). In this section, we define the visual meta-imitation learning problem and present the algorithm's general form.

A. PROBLEM STATEMENT
Now we consider a model, denoted f, whose function is to map demonstrations or observations o to corresponding outputs or actions â. Because we aim to apply our algorithm to various meta-learning tasks, we introduce the same generic notion of a learning task below for convenience. Formally, we denote each imitation task as

T_i = {τ = (o_1, a_1, . . . , o_H, a_H) ∼ π*_i, q(o_{t+1} | o_t, a_t), L(a_{1:H}, â_{1:H}), H},

where τ is demonstration data generated by an expert policy π*_i, q(o_{t+1} | o_t, a_t) is a transition distribution, L is a loss function used for imitation, and H is an episode length. We assume that the distribution over tasks p(T) is exactly what our model wants to learn, and that we can obtain successful demonstrations for each task. Feedback is evaluated by the loss function L(a_{1:H}, â_{1:H}) → R, which could be a cross-entropy loss for discrete actions or a mean squared error for continuous actions.

Algorithm 1 Model-Agnostic Meta-Learning (MAML) for One-Shot Imitation
Require: p(T): distribution over tasks
Require: α, β: step size hyperparameters
1: randomly initialize θ
2: while not done do
3:  Sample batch of tasks T_i ∼ p(T)
4:  Divide the batch into two mini-batches: A_i as support tasks, B_i as target tasks
5:  for all A_i and B_i do
6:   Evaluate L_inner(f_θ) with respect to |A_i| examples
7:   Compute adapted parameters with gradient descent: θ'_i = θ − α∇_θ L_inner(f_θ)
8:   Evaluate L_outer(f_{θ'_i}) with respect to |B_i| examples
9:  end for
10:  Update θ ← θ − β∇_θ Σ_i L_outer(f_{θ'_i})
11: end while
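For concreteness, the two loss choices above (cross-entropy for discrete actions, mean squared error for continuous ones) can be sketched as follows. This is a minimal NumPy sketch; the function names are ours, not from the paper.

```python
import numpy as np

def mse_loss(expert_actions, predicted_actions):
    """Mean squared error over an episode of continuous actions a_1:H vs. a-hat_1:H."""
    a, a_hat = np.asarray(expert_actions), np.asarray(predicted_actions)
    return float(np.mean((a - a_hat) ** 2))

def cross_entropy_loss(expert_labels, predicted_probs, eps=1e-12):
    """Cross-entropy over an episode of discrete action distributions."""
    p = np.clip(np.asarray(predicted_probs), eps, 1.0)
    idx = np.arange(len(expert_labels))
    return float(-np.mean(np.log(p[idx, expert_labels])))
```

Either loss maps a pair of episode-length sequences to a scalar, matching the signature L(a_{1:H}, â_{1:H}) → R.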
In the K-shot learning setting, we draw K samples of a task T_i ∼ p(T) as demonstrations for the model to train on. For one-shot imitation, the model needs to learn a new task T_i drawn from p(T) from only one demonstration generated for T_i. Regarding meta-training, the model is trained using one demonstration from the expert policy π*_i of a random task T_i sampled from p(T), and we test it on a new scene drawn from π*_i to obtain the test error. The learned policy π_i is improved by optimizing the test error with respect to the model parameters. Therefore, the test error serves as the training error of the meta-training process.

B. MODEL-AGNOSTIC META-LEARNING
In the meta-learning field, MAML has been successfully applied to various scenarios, such as regression, image recognition, and reinforcement learning. The MAML algorithm tries to learn a model represented by weights θ such that standard gradient descent yields fast adaptation on new tasks T_i drawn from p(T). Since the algorithm uses gradient descent as its inner optimizer, it does not need to introduce additional parameters, which makes it more parameter-efficient than other meta-learning methods. After learning from a demonstration, the model's parameters θ are updated to θ'_i to adapt to a new task T_i. In particular, the updated θ'_i is computed through one or more gradient descent steps. For convenience, we mainly consider the case of a single gradient update in the following sections; multiple gradient updates can be seen as an extension. During training, the model's parameters are optimized according to the test error of f_{θ'_i}:

θ'_i = θ − α∇_θ L_{T_i}(f_θ),  with meta-objective  min_θ Σ_{T_i ∼ p(T)} L_{T_i}(f_{θ'_i}),  (1)

where the hyperparameter α is the step size of the inner gradient descent of meta-learning.

Algorithm 2 Two-Stage Model-Agnostic Meta-Learning (TMAML) for One-Shot Imitation (differences from MAML in red)
Require: p(T): distribution over tasks
Require: α, β, γ, δ: step size hyperparameters
1: randomly initialize θ
2: while not done do
3:  Sample batch of tasks T_i ∼ p(T)
4:  Divide the batch into two mini-batches: A_i as support tasks, B_i as target tasks (A_i ≠ B_i)
5:  while Skill Cloning do
6:   for all A_i do
7:    Evaluate L_inner(f_θ) with respect to |A_i| examples
8:    Compute adapted parameters with gradient descent: θ'_i = θ − α∇_θ L_inner(f_θ)
9:    Evaluate L_outer(f_{θ'_i}) with respect to |A_i| examples
10:   end for
11:   Update θ ← θ − β∇_θ Σ_i L_outer(f_{θ'_i})
12:  end while
13:  while Skill Transfer do
14:   for all A_i and B_i do
15:    Evaluate L_inner(f_θ) with respect to |A_i| examples
16:    Compute adapted parameters with gradient descent: θ'_i = θ − γ∇_θ L_inner(f_θ)
17:    Evaluate L_outer(f_{θ'_i}) with respect to |B_i| examples
18:   end for
19:   Update θ ← θ − δ∇_θ Σ_i L_outer(f_{θ'_i})
20:  end while
21: end while
Note that the meta-optimization is performed through the inner gradient descent over parameters θ, and it uses the updated θ'_i to produce results on new tasks. The test error of the tasks is optimized by stochastic gradient descent (SGD).

Algorithm 3 MAML for One-Shot Imitation With Noise Mechanism
Require: p(T): distribution over tasks
Require: α, β, σ: step size hyperparameters
1: randomly initialize θ
2: while not done do
3:  Sample batch of tasks T_i ∼ p(T) and random vector g_i ∼ N(0, I)
4:  Divide the batch into two mini-batches: A_i as support tasks, B_i as target tasks
5:  for all A_i and B_i do
6:   Compute parameters with noise vector: θ* = θ + σ g_i
7:   Evaluate L_inner(f_{θ*}) with respect to |A_i| examples
8:   Compute adapted parameters with gradient descent: θ'_i = θ* − α∇_{θ*} L_inner(f_{θ*})
9:   Evaluate L_outer(f_{θ'_i}) with respect to |B_i| examples
10:  end for
11:  Update θ ← θ − β∇_θ Σ_i L_outer(f_{θ'_i})
12: end while
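The meta-update above can be made concrete with a toy scalar example. This is a first-order sketch of our own (it evaluates the test gradient at the adapted parameters and drops the second-derivative term; MAML itself backpropagates through the inner update):

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Toy task: minimize L(theta) = (theta - target)**2."""
    target: float
    def grad(self, theta):
        # Gradient of the quadratic loss at theta.
        return 2.0 * (theta - self.target)

def maml_step(theta, tasks, alpha=0.1, beta=0.01):
    """One meta-update: adapt per task with one inner gradient step,
    then move theta against the sum of post-adaptation gradients."""
    meta_grad = 0.0
    for task in tasks:
        theta_i = theta - alpha * task.grad(theta)  # inner update (adaptation)
        meta_grad += task.grad(theta_i)             # test-error gradient at theta_i
    return theta - beta * meta_grad
```

With two symmetric tasks the post-adaptation gradients cancel, so θ is already well placed for both; with a single task, θ drifts toward that task's optimum.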

C. MODEL-AGNOSTIC META-LEARNING FOR ONE-SHOT IMITATION
In this section, we detail the extension of the model-agnostic meta-learning algorithm (MAML) to the imitation learning setting. Please note that we only provide visual information as a demonstration (support task). No other action or state information is included in the inner gradient update, because such information is tricky to collect in practical applications.
We use o to represent the input (e.g., a video) of the model, o_t to represent the agent's observation at time t (e.g., an image), and â = f_θ(o_t) to indicate the predicted output (e.g., torques) at time t. For simplicity, we denote a trajectory of demonstration as τ = {o_{1:T}, a_{1:T}}.
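The trajectory τ = {o_{1:T}, a_{1:T}} can be represented by a simple paired container. This is an illustrative data structure of our own, not from the paper:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Demonstration:
    """A trajectory tau = {o_1:T, a_1:T}: time-aligned observations and actions."""
    observations: List[Any]  # o_1:T, e.g., video frames
    actions: List[Any]       # a_1:T, e.g., torque commands
    def __post_init__(self):
        # The two sequences must share the same episode length T.
        if len(self.observations) != len(self.actions):
            raise ValueError("o_1:T and a_1:T must have the same length T")
    def __len__(self):
        return len(self.observations)
```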
We assume that each task consists of at least two demonstrations for meta-learning. In meta-training, we randomly sample a batch of tasks T_i, each consisting of two demonstrations. For each task, we use one demonstration as the support task (A_i) and the other as the target task (B_i).
In the inner gradient update phase, we only provide visual information to the model without expert actions, and the adapted parameters are computed from the inner loss:

θ'_i = θ − α∇_θ L_{A_i}(f_θ),  (2)

where A_i is a minibatch of support tasks drawn from T_i ∼ p(T), and L_{A_i}(f_θ) is computed from the model's outputs on the observations alone (following [40]), since no expert actions are available in this phase.
Although we do not provide expert actions in the inner gradient update for one-shot imitation, expert actions are needed in the outer gradient update to obtain an objective loss. So in the outer gradient update phase, expert actions are provided along with the visual information, and the outer loss is, e.g., a mean squared error against the expert actions:

L_{B_i}(f_{θ'_i}) = Σ_{τ ∈ B_i} Σ_t ||f_{θ'_i}(o_t) − a_t||²,  (3)

where B_i is a minibatch of target tasks drawn from T_i ∼ p(T) (B_i ≠ A_i). We summarize MAML for one-shot imitation in Algorithm 1.
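The asymmetry between the two phases can be sketched as follows. This is a hedged sketch: the output-only inner loss below is one simple placeholder choice of ours, not the exact inner objective used in [40].

```python
import numpy as np

def inner_loss(policy, observations):
    """Inner phase: no expert actions are available, so the loss can only be
    a function of the policy's own outputs (placeholder: L2 penalty)."""
    preds = np.array([policy(o) for o in observations])
    return float(np.mean(preds ** 2))

def outer_loss(policy, observations, expert_actions):
    """Outer phase: supervised mean squared error against expert actions."""
    preds = np.array([policy(o) for o in observations])
    return float(np.mean((preds - np.asarray(expert_actions)) ** 2))
```

The key contrast is in the signatures: `inner_loss` never sees expert actions, while `outer_loss` requires them.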

III. IMPROVED ALGORITHM
The MAML algorithm has achieved excellent results in regression, image classification, image super-resolution [42], etc. However, unlike these fields, behavior imitation based on vision is a very troublesome problem. For example, in an image classification task, we can easily provide the classification label for reference. Due to the randomness and unpredictability of human behavior, it is not easy to provide information (e.g., actions and states) other than visual information to the model in real time without the assistance of additional devices. In practical terms, humans can learn new behaviors based on visual information alone, so we hope that agents can complete tasks as efficiently as humans.
Since we only provide visual information to the model in the inner gradient update phase of the MAML algorithm, the model loses supervision information as a reference during the intermediate learning process. In other words, the model is unable to fully understand what tasks the demonstrations are completing. We argue that this forcefully fits a mapping between support tasks and target tasks rather than producing true understanding. Table 1 lists the success rates of one-shot simulated pushing with varying demonstration information provided at test time, and Table 2 lists the success rates of one-shot simulated reaching with different demonstration information provided at test time. The results show that as the information provided in the demonstrations (support tasks) decreases, the model's cognitive ability declines sharply. Based on this, we propose a method that can further understand the visual information from demonstrations and give the model sufficient generalization capability.
A. TWO-STAGE MODEL-AGNOSTIC META-LEARNING ALGORITHM

As shown in Algorithm 2, we divide meta-learning during training into two stages: (i) skill cloning and (ii) skill transfer. As with the MAML algorithm, in order to ensure the generality of the algorithm, we make no assumptions about the form of the model; we only assume that it is parameterized by some parameter vector θ and that the loss function is smooth enough in θ that we can use gradient-based methods. We use f_θ to represent the model as a function of θ. At the beginning of training, we randomly sample tasks T_i ∼ p(T) and divide them equally into two mini-batches: A_i as support tasks and B_i as target tasks.
Different from MAML, we first perform skill cloning to let the model understand what actions to complete based on the demonstrations. Because the model does not need to switch learning scenarios (from support tasks to target tasks) at this stage, it is easier to understand the purpose of the tasks from the demonstrations. An intuitive example: we often practice fundamental problems repeatedly instead of directly attempting new questions. After we have summarized the pattern of a problem, we apply that pattern to new, similar problems, so we do not feel confused.

1) SKILL CLONING
In the skill cloning stage, the model learns only from demonstrations A_i ∼ p(T) (which serve as both support tasks and target tasks) and performs the internal gradient update without supervision information (e.g., actions and states) for inner learning:

θ'_i = θ − α∇_θ L_{A_i}(f_θ),  (5)

where the hyperparameter α is the step size of inner meta-learning. Here we mainly consider the case of one inner gradient update for convenience and directly adopt the internal output as the loss L_{A_i}(f_θ), following [40]; more gradient updates could produce better results. After internal meta-learning, the model's parameters θ are updated to θ'_i, and the performance of f_{θ'_i} is then optimized on the same tasks A_i ∼ p(T), which now serve as target tasks. Note that the meta-objective is to finish the target task according to the support task, so we must provide supervision information in the outer loss L_{A_i}(f_{θ'_i}). More concretely, unlike the unsupervised inner loss L_{A_i}(f_θ), the outer loss compares the model's outputs with the expert actions:

L_{A_i}(f_{θ'_i}) = Σ_{τ ∈ A_i} Σ_t ||f_{θ'_i}(o_t) − a_t||².  (7)

According to the external loss calculated by (7), the final step of skill cloning uses stochastic gradient descent (SGD) to optimize the model parameters θ:

θ ← θ − β∇_θ Σ_{A_i} L_{A_i}(f_{θ'_i}),  (8)

where the hyperparameter β is the meta step size of external meta-learning; we call this step the outer gradient update in this article. Although we do not provide supervision information in the internal gradient update, we can offer it directly in the outer gradient update. Notably, the support tasks and target tasks are kept the same in the skill cloning stage. Since no transfer from one scene to another is needed during this phase, it is more natural for the robot to understand the tasks without facing the ambiguity of different demonstrations (e.g., a pair of related videos A and B).
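One skill-cloning iteration on a single support task can be sketched as below. This is a first-order sketch of our own with caller-supplied gradient functions (the outer gradient is evaluated at the adapted parameters, dropping second derivatives for brevity):

```python
def skill_cloning_step(theta, task, grad_inner, grad_outer, alpha=0.001, beta=1e-6):
    """One skill-cloning iteration on a single support task A_i.
    grad_inner(theta, task): gradient of the unsupervised inner loss.
    grad_outer(theta, task): gradient of the supervised outer loss (expert actions).
    Note that both updates use the SAME task: support and target coincide here."""
    theta_i = theta - alpha * grad_inner(theta, task)  # unsupervised inner update
    return theta - beta * grad_outer(theta_i, task)    # supervised outer update on A_i
```

The point the code makes explicit is that `task` appears in both lines: unlike skill transfer, no scene switch happens between the inner and outer updates.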

2) SKILL TRANSFER
After completing a skill cloning stage, we assume that the model has understood the demonstrations of the support tasks, and we turn to the skill transfer stage. In the skill transfer stage, the model not only learns from demonstrations A_i ∼ p(T) (which serve as support tasks) but also applies the learned information to the new environments B_i ∼ p(T) (which serve as target tasks).
In the internal gradient update, the skill transfer stage is consistent with the skill cloning stage:

θ'_i = θ − γ∇_θ L_{A_i}(f_θ),  (9)

where the hyperparameter γ is the step size. Because we have trained on A_i ∼ p(T) independently in the skill cloning stage, we assume that the model has learned the corresponding skills in the internal gradient update, so they can be transferred to a new environment in the outer gradient update. Therefore, the meta-objective of the skill transfer stage is to optimize the loss on the new tasks:

L_{B_i}(f_{θ'_i}) = Σ_{τ ∈ B_i} Σ_t ||f_{θ'_i}(o_t) − a_t||².  (10)

According to the loss calculated by (10), the final step of skill transfer performs the meta-optimization through stochastic gradient descent (SGD) to update the model parameters θ:

θ ← θ − δ∇_θ Σ_{B_i} L_{B_i}(f_{θ'_i}),  (11)

where the hyperparameter δ is the meta step size of skill transfer. In the process of TMAML, we alternately perform skill cloning and skill transfer in each iteration (we do not provide supervision information for imitation learning in the internal updates of either stage). The most significant difference between them is that skill cloning updates the model parameters θ based only on tasks A_i sampled from p(T); in this stage, the model can learn skills without ambiguity since there is no scene transition. Skill transfer then updates θ based on both A_i and B_i sampled from p(T), with a scene transition. In the following experiments, we will show that skill cloning promotes the understanding required for skill transfer.
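The alternation between the two stages can be sketched as a single training iteration. `clone_step` and `transfer_step` stand in for the full per-stage updates; the names and decomposition are ours:

```python
def tmaml_iteration(theta, support_tasks, target_tasks, clone_step, transfer_step):
    """One TMAML iteration: a skill-cloning pass over the support tasks A_i,
    then a skill-transfer pass pairing each A_i with its target task B_i.
    Both step functions take theta and return the updated theta."""
    for a in support_tasks:                        # stage 1: skill cloning on A_i only
        theta = clone_step(theta, a)
    for a, b in zip(support_tasks, target_tasks):  # stage 2: skill transfer A_i -> B_i
        theta = transfer_step(theta, a, b)
    return theta
```

The sketch highlights that the stages are sequential and independent: cloning finishes its own outer updates before any target task B_i is seen.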
Please note that TMAML is not simply MAML with more internal steps for one-shot imitation, since we train skill cloning and skill transfer separately. Moreover, in our one-shot imitation setting, the supervision information of the support tasks is never provided to MAML's internal step during training. Instead, we offer it in the outer gradient update of the skill cloning stage to help the model understand the demonstration.

B. META-LEARNING BASED ON NOISE MECHANISM
Since the MAML algorithm relies heavily on the gradient direction provided by the demonstrations, the accuracy of that gradient direction becomes extremely important. However, as mentioned before, methods based on gradient updates often encounter overfitting problems, especially when facing new, unseen tasks or demonstrations. For example, we first performed regression experiments based on MAML and found that, after the number of gradient steps increased beyond a certain point, performance did not improve as we expected. We believe the internal gradient updates had overfitted, especially in the case of simple tasks.
As such, we introduce a noise mechanism into meta-learning. As shown in Fig. 2, suppose a small ball on a hill gets trapped in a poor local minimum. If we apply an external force to the ball, it may move and slide to the global minimum. Here we compare the parameters θ to be optimized to the small ball, and the noise is analogous to the external force exerted on it. More concretely, we add some random noise to the model's parameters θ at the beginning of each internal gradient update during training:

θ* = θ + σ g_i,  (12)

where σ is a step size and g_i could be random Gaussian noise, g_i ∼ N(0, I). Since we first perturb the model's parameters θ with some noise, we can alleviate the model's overfitting and increase its generalization ability. Based on the updated θ*, we perform the internal gradient update:

θ'_i = θ* − α∇_{θ*} L_{A_i}(f_{θ*}).  (13)

We describe the specific algorithm in Algorithm 3.
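The noise-then-adapt step can be sketched in NumPy as follows (`grad_fn` stands in for the inner-loss gradient ∇L_{A_i}; the helper name is ours):

```python
import numpy as np

def noisy_inner_update(theta, grad_fn, sigma=0.01, alpha=0.001, rng=None):
    """Perturb the parameters with Gaussian noise before the inner gradient
    update: theta* = theta + sigma * g with g ~ N(0, I), then take one step
    of gradient descent starting from theta*."""
    rng = rng if rng is not None else np.random.default_rng(0)
    theta_star = theta + sigma * rng.standard_normal(theta.shape)  # inject noise
    return theta_star - alpha * grad_fn(theta_star)                # inner update
```

Setting `sigma=0.0` recovers the plain inner update, which matches how the mechanism is switched off at test time in the experiments below.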

IV. EXPERIMENTAL RESULTS AND ANALYSIS
We mainly conduct experiments on three tasks: regression, simulated reaching, and simulated pushing. We design these experiments to answer the following questions: • What is the performance of TMAML compared with the previously mentioned methods?
• What effect does skill cloning have on skill transfer in our TMAML algorithm?
• What is the performance of merely training the MAML algorithm twice compared to the skill cloning phase of our TMAML algorithm?
• How does the noise mechanism affect the performance of meta-learning? In order to make our algorithm more convincing, we will compare it with some existing state-of-the-art experimental methods with the same problem settings and network structures.

A. REGRESSION
We begin with a simple regression problem and compare against MAML (see Fig. 5); the experimental settings are consistent with [35]. In the regression task, we map each input x to a specific sine wave output f(x), where the amplitude and phase of the sinusoid are randomly sampled. Note that the distribution p(T) is continuous, with the amplitude taking values in [0.1, 5.0] and the phase taking values in [0, π].
During meta-learning, the input x consists of K data points randomly sampled within [−5.0, 5.0]. We use the mean squared error (MSE) between the true output y and the predicted output f_θ(x), where f_θ is a neural network regressor with two hidden layers of size 40 and ReLU activations. In training, we use ten internal gradient updates with K = 10 examples and Adam [43] as the meta-optimizer (see [35] for more details regarding training and settings for MAML). For TMAML, we set α = γ = δ = 0.001 and β = 10^−6. Moreover, we also combine TMAML with the noise mechanism and set σ in Algorithm 3 to 0.01 for training (0.0 for testing).
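The task distribution used here can be sketched directly from the stated ranges (the helper names are ours):

```python
import numpy as np

def sample_sine_task(rng):
    """Sample one regression task: a sinusoid with amplitude in [0.1, 5.0]
    and phase in [0, pi]."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, np.pi)
    return lambda x: amplitude * np.sin(x - phase)

def sample_k_shot(task, k, rng):
    """Draw K input points uniformly from [-5.0, 5.0] with their targets."""
    x = rng.uniform(-5.0, 5.0, size=k)
    return x, task(x)
```

Each call to `sample_sine_task` yields a new task from the continuous distribution p(T), and `sample_k_shot` produces the K = 10 training examples used per inner update.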
The experimental results (see Fig. 5 (b) and (c)) show that as the internal gradient steps of the MAML algorithm increase, its performance sometimes does not improve but instead overfits. In contrast, our ''TMAML + noise'' algorithm can predict the regression well even when the demonstration provided for meta-learning contains some inaccurate points. We show that our algorithm achieves significant performance, and that conventional gradient-descent methods cannot adapt to changing regression problems without additional gradient update strategies. Moreover, we performed an ablation experiment on the MAML algorithm, which shows that merely increasing the number of training passes over the same samples does not significantly improve performance, thereby confirming our algorithm's effectiveness. In this regression experiment, we conclude that our algorithm is superior to the previous methods: the introduction of the noise mechanism increases the robustness of the model and alleviates overfitting, and the skill cloning step has a promoting effect on skill transfer.

FIGURE 4. Examples of simulated pushing. This experimental dataset is also provided by [39] and [40]. In the simulated pushing task, the robot should push the object to a specific place after watching a demonstration (support task) with a changeable goal in a different scene.

B. SIMULATED IMITATION
Unlike the previous regression problem, we hope our method can map image pixels to corresponding actions without labels or expert actions from demonstrations (support tasks). Since our algorithm is not aimed at a specific model structure, we adopt the excellent network structure proposed in [40] as an experimental model. Based on one-shot imitation learning, we mainly carry out experiments in simulated reaching and simulated pushing.

1) SIMULATED REACHING
As illustrated in Fig. 3, the goal of simulated reaching is to reach a particular color in the target environment after watching a demonstration (support task), disturbed by distractors of different colors. For the simulated reaching problem, we set α = γ = β = δ = 0.001, batch_size = 25, training_iterations = 30000, and the number of convolution layers to 5 with 30 (3 × 3) filters in the model proposed by [40] for TMAML. To combine TMAML with the noise mechanism, we set σ = 10^−8 during training and σ = 0.0 during testing, and set the number of internal gradient updates to 1.

2) SIMULATED PUSHING
As illustrated in Fig. 4, the goal of simulated pushing is to push a particular object to a specific place after watching a demonstration (support task), disturbed by distractor objects. For the simulated pushing problem, we set α = γ = 0.01, β = δ = 0.001, batch_size = 20, and training_iterations = 30000 in the model proposed by [40] for TMAML. Regarding the noise mechanism, we set σ = 10^−10 for ''MIL (temporal loss + noise)'' and σ = 10^−12 for ''TMAML + noise'' applied to skill transfer only, and set the number of internal gradient updates to 1.

C. DATA ANALYSIS
Regarding one-shot imitation learning, we show the experimental results in Fig. 6 and Table 3. We now combine the data to answer the four questions posed above: • Question: What is the performance of TMAML compared with the previously mentioned methods? - Answer: As shown in Table 3, we achieve an accuracy of 83.33% in simulated pushing and 95.28% in simulated reaching, both better than the current advanced methods.
• Question: What effect does skill cloning have on skill transfer in our TMAML algorithm? --Answer: As shown in Fig. 6, we record the training loss of simulated reaching. The loss in the skill cloning stage drops rapidly and gradually flattens, which indicates that the model has understood the content of the demonstrations (support tasks) without much ambiguity. With the assistance of skill cloning, the loss in skill transfer drops fast and remains relatively steady. In contrast, the loss of the other methods fluctuates greatly, which shows that it is difficult for a model to transfer learning between scenarios (from support tasks A_i to target tasks B_i) directly.
• Question: What is the performance of merely training the MAML algorithm twice compared to the skill cloning phase of our TMAML algorithm? --Answer: As illustrated in Fig. 6, in terms of loss, merely training MAML twice is not superior to TMAML. Combining the results in Table 3, we find that although this method reduces the training loss, the accuracy drops to 79.50% in the pushing task and 84.23% in the reaching task. We infer that the approach easily overfits the training set, which produces poor results on the testing set.
• Question: How does the noise mechanism affect the performance of meta-learning? --Answer: First, consider Fig. 5 (c). As the number of internal gradient updates increases, the loss of the MAML-based method does not decrease as we expected. However, with the introduction of the noise mechanism, the overall loss of the model is further reduced. In Table 3, the accuracies of ''MIL, temporal loss + noise (ours)'' improve in both pushing and reaching. As for ''TMAML + noise (ours)'', the accuracy of reaching increases to 95.47%, but drops to 82.88% in the pushing task. From these results, we infer that the noise mechanism can promote model performance and alleviate gradient-based overfitting in some cases (i.e., simulated reaching and regression). However, random noise can also reduce performance, especially for tasks or environments that are particularly sensitive to noise. For example, simulated pushing is sensitive to any change, since we amplify the loss by 50 times for training.

V. CONCLUSION AND FUTURE WORK
In this study, we presented an effective meta-learning method that is universal across tasks and model architectures and can quickly adapt to new, unseen scenarios based on demonstrations. We demonstrated the effectiveness of our approach on simulated pushing, simulated reaching, and regression tasks with state-of-the-art results. The experimental results show that our method has a better understanding of visual information and can effectively generalize knowledge and transfer it to new application scenarios of the same task. We also introduced a noise mechanism for the overfitting problem, which can further improve model performance at low cost. There are many meaningful research directions for the future, such as cross-task experience sharing and knowledge reuse with a universal algorithm. We plan to extend the algorithm to multitasking mechanisms. As we know, humans can handle a variety of tasks at the same time. Intuition tells us that there may be common knowledge between different tasks that can be shared and reused quickly, and how to efficiently exploit this commonality while avoiding mutual interference between tasks will be an essential topic in the future. Further, and most importantly, we hope to explore how to distill previously acquired learning experience and quickly apply it to entirely new and different unseen tasks.

He is currently an Expert in intelligent manufacturing and energy system engineering. In 1994, he joined the American Energy and Power Research Center, ABB Group. He served as a Project Manager, a Senior Researcher, the Research Center Director, and the Chief Scientist with the ABB Group. He presided over the development of ABB IRB's third and fourth generation robot controllers. He invented flexible intelligent control technology and completed various ABB controllers, from motion controllers to force-vision hybrid controllers.
He made significant contributions to the transformation and upgrading of smart controllers based on behavioral intelligence, so that ABB's controller performance ranked first in the area of industrial robots. Additionally, he has presided over one major demonstration project in China and two international scientific and technological cooperation projects. He holds 15 U.S. patents and 126 Chinese patents. He has published more than 40 articles and three books. His research interests include the underlying theory and critical technology of swarm intelligence, autonomous intelligent robots, and flexible automatic control.
Dr. Gan received the People's Republic of China International Science and Technology Cooperation Award. He plays a vital role at Fudan University, where he is the President of the Institute of Intelligent Robotics and the Vice President of the Institute of Engineering and Applied Technology. Meanwhile, he is the President of the Academy of the Intelligent Manufacturing Industry and the Emergent Group in Ningbo, China.
WEI LI received the B.Eng. degree in automation and the M.Eng. degree in control science and engineering from the Harbin Institute of Technology, China, in 2009 and 2011, respectively, and the Ph.D. degree from the University of Sheffield, U.K., in 2016. He is currently an Associate Professor with the Institute of AI and Robotics, Fudan University. He has published more than 20 academic articles in peer-reviewed journals and conferences, such as the IEEE TRANSACTIONS ON ROBOTICS and NeurIPS. His research interests include robotics and computational intelligence, specifically self-organized systems and evolutionary machine learning.

XUSHENG WANG (Student Member, IEEE) received the master's degree from the University of International Business and Economics. He is currently pursuing the Engineering degree in electronic information with Fudan University. He has been awarded the Honor of Outstanding Inventor in Hebei Province, China. He has participated in many national, provincial, and ministerial critical scientific research projects. He has published many academic articles and applied for 25 patents, including 19 invention patents and ten authorized patents. His research interests include medical service robots and swarm intelligence.