A Reinforcement One-Shot Active Learning Approach for Aircraft Type Recognition

Target recognition is an important aspect of air traffic management, but research on automatic aircraft identification is still at an exploratory stage. Rapid aircraft processing and accurate aircraft type recognition remain challenging tasks due to the high-speed movement of aircraft against complex backgrounds. Active learning, a promising research topic in machine learning in recent decades, can use less labeled data to obtain the same model accuracy as supervised learning, which greatly reduces the cost of labeling a dataset. Instead of manually developing policies for accessing the labels of desired instances, we developed an improved active learning approach that employs reinforcement learning in the decision-making process; it not only learns to classify samples with little supervision but also captures a relatively optimal label query strategy. The proposed model was first tested with the Amsterdam Library of Object Images (ALOI) dataset and then used to perform aircraft type recognition on one month of real-world flight track data. Our method offers a satisfactory solution for learning new concepts rapidly from a small amount of data, which meets the needs of the aircraft type recognition task in practical applications.


I. INTRODUCTION
With the rapid increase in the variety and quantity of aircraft, precise identification of aircraft types is not only an important everyday task of air traffic control but also a vital military mission. However, aircraft type recognition methods are still at an exploratory stage, and mature aircraft recognition theories and systems have not yet been formed. To achieve better recognition accuracy, aircraft type recognition still requires substantial human input. As a hot topic in both academia and industry, machine learning has made major advances in areas such as pattern analysis [1], image processing and natural language processing. Therefore, the use of machine learning methods to reduce the workload of human experts in aircraft type recognition tasks has become a meaningful research direction.
For many real-world tasks, labeled data are scarce whereas unlabeled data are abundant [2]. As is widely acknowledged in this domain, labeling is a straightforward way to process data, but it involves a great deal of human interaction. It is relatively easy to obtain a large number of unlabeled instances, while labeled instances are expensive to acquire (e.g., through manual annotation) and are not always available in large volumes [3,4]. Prior investigations have demonstrated that accessing the ground-truth labels of a dataset not only requires the effort of many experts in related fields but also takes more than 10 times longer per sample than collecting it [5]. As dataset volumes grow continuously, learning systems tend to generalize better, but the cost of annotation also increases dramatically [6]. To achieve better recognition accuracy, aircraft type recognition still requires considerable participation of human experts, since labeling is typically done manually and is time-consuming and labor-intensive. Thus, there is a strong demand for an accurate machine learning model that mitigates the heavy workload of human experts in aircraft type recognition tasks.
As a promising approach to this goal, active learning is a widely applicable machine learning framework that serves to reduce the cost of annotation without sacrificing model performance [7]. Active learning approaches simulate the human learning process in some way: they iteratively query the labels of certain instances, add them to the training set, and try to improve the generalization performance of the model with fewer queries. This method has been well studied in recent years and has benefited a variety of practical scenarios, such as information retrieval [8], image and speech recognition [9], text analysis [10][11][12] and automatic target recognition [13].
Humans are able to learn and generalize new concepts from only a few labeled instances [14,15]. One-shot (or few-shot) learning simulates this process to some extent [15]. Inspired by this, we aimed to design an artificial intelligence agent that could inherit similar capabilities and pose fewer requests for the labels of new instances during the training process [16]. In active learning, the ideal situation is that critical instances are still labeled but the number of queries is minimized. Thus, rather than relying on a human-designed criterion, we study a problem at the intersection of active learning, reinforcement learning and one-shot learning [17,18]. More specifically, the strategy for labeling new instances can be selected or designed automatically.
Our study introduces a novel learning model that not only learns to classify samples with little supervision but also captures a relatively optimal label query strategy. We treat active learning as a meta-learning problem [19] and train the active query strategy network with reinforcement learning. This paper is mostly inspired by the work of Woodward et al. [20,21] and can be viewed as a practical extension of it. We study the stream-based setting, in which the model considers a stream of instances and needs to classify one sample after another. Reinforcement learning is a natural fit for an active learner solving such a sequential decision problem, since the next decision is affected by the previous action (when and which instance to query next depends on the current state of the base learner). Therefore, an active query system trained by reinforcement learning can learn a cogent non-myopic strategy and make effective decisions with little supervision.
In particular, our contributions in this paper can be summarized as follows: (1) We address the challenge of aircraft type recognition in practical applications and formulate the aircraft type recognition task in a novel stream-based online learning way. We collect one month's worth of flight track data in a real-world environment rather than a simulated one, and we consider greater numbers of flights and aircraft types than previous studies. (2) We apply a novel reinforcement one-shot active learning approach [21] to the task of object recognition using the Amsterdam Library of Object Images (ALOI) dataset [22] and the aircraft dataset. This is thought to be the first study to consider how an aircraft recognition system can improve performance under limited resources with this meta-learning approach. (3) Compared with various state-of-the-art algorithms, we experimentally demonstrate the efficiency of the present method in exploring label query strategies end-to-end based on the uncertainty [23] of instances, and its ability to learn new concepts rapidly from a small amount of data, which meets the needs of practical applications.

II. RELATED WORK
Research on automatic aircraft recognition is still at an exploratory stage, and most existing studies focus on image processing methods [24][25][26][27][28]. Radar signal analysis has also been widely used in air traffic management [29]. Image-based and radar-based methods primarily use features of aircraft profiles to identify the type of aircraft. Contour-based aircraft recognition mainly seeks approximately invariant features [30][31][32].
In real-time applications, one common technique for identifying military aircraft is Identification Friend or Foe (IFF). Civil aircraft use an IFF-like technique called Secondary Surveillance Radar (SSR) [29]. The fundamental disadvantage of technologies such as IFF and SSR is the need for active pilot cooperation, which makes these technologies inefficient and less practical.
Aiming at lowering the cost of annotation without sacrificing model performance, active learning, a subfield of machine learning, has been well studied in recent years [9,34,35]. The idea of active learning benefits a variety of practical scenarios, including film recommendation [36][37][38], medical image classification [39], natural language processing and so on. A common approach to choosing the appropriate instance for labeling is to maximize the expected informativeness of the labeled instances [40]. Uncertainty sampling [41] is one of the most popular active learning methods, in which the classifier selects the sample with the highest measure of uncertainty to query. Query by committee is another well-motivated active learning framework, in which a committee of classifiers is trained on the same dataset and the next query is chosen according to the principle of maximal disagreement [42,43]. Ebert et al. proposed a diversity-promoting sampling method that uses graph density to determine the most representative points [44]. Konyushkova et al. proposed a data-driven approach called Learning Active Learning, whose key idea is to train a regressor that predicts the expected error reduction for a candidate sample in a particular learning state [45]. In general, most of these strategies rely heavily on heuristics or theoretical measures, such as similarity measures between previous and current instances [46] or the extent of uncertainty in label prediction [46][47][48]. However, heuristic-based active learning methods may fail when the data distribution of the underlying learning problem varies (e.g., when a new category appears).
To move away from engineered selection heuristics, we cast active learning as a decision process and use reinforcement learning to learn an action policy for an active learner. The premise of active learning is that the costs associated with label requests and false predictions can be reasonably modeled [20]. Those costs can be optimized by reinforcement learning through explicitly set rewards and punishments, and an action strategy can be determined directly. Thus, we believe that the combination of reinforcement learning and active learning is a reasonable and appealing approach to stream-based online cases. Some recent studies have explored a similar idea. Woodward et al. [20] were the first to focus on learning an optimal policy for an active learning task with the help of reinforcement learning. They use reinforcement learning with a recurrent-neural-network-based Q function in a sequential one-shot learning task to decide between predicting a label and acquiring the true label at a cost [7]. Bachman et al. [2] and Pang et al. [19] studied pool-based active learning algorithms in a meta-learning fashion. Puzanov and Cohen [16] developed an artificial intelligence classification system using the same idea. Recent methods such as meta-learning and one-shot learning are closely related to our model [15]. A supervised meta-learning model based on memory-augmented neural networks was proposed by Santoro et al. [49], which focused on the same learning task as ours.

III. MODEL DESCRIPTION
The framework of our proposed reinforcement one-shot active learning (ROAL) method is presented in this section. We mainly consider a single-pass stream-based online active learning scenario, in which the model observes instances that arrive continuously from the data stream in an exogenously determined order and decides whether to predict each instance's label or to pay a cost to query it. The learner observes one unlabeled instance from the stream in each cycle and has to choose the appropriate action (predict the label or query the label) for each arriving instance [40]. A deep recurrent neural network [50] is used as the function approximator for a Q-network, and the output of the network is connected to a fully connected layer, which produces the actual Q-values. Moreover, a cross-entropy [51] term is included in the loss function to improve the performance of the classifier.

A. TASK DESCRIPTION
Obtaining the ground-truth label of a data instance is time-consuming and expensive in the scenario of stream-based online learning. Therefore, the classification algorithm urgently needs to decide judiciously how many instances to label [35,52]. Under this setting [35,53], the algorithm decides whether to request the ground-truth label when each instance arrives. The classification task that we focus on is a stream of instances (e.g., images or aircraft target tracks) for which labels must be queried or predicted. In the setting of one-shot learning [15,49], the aim is to maximize the performance of the model on new classes that are not present in the training set, so the model must improve over short training episodes with a small number of instances per class. The structure of the active learning task we propose is shown in Fig. 1. At each time step $t$ of the episode, an instance $x_t$ is given to the model, and the model needs to decide which action to take. Assuming that there are up to $N$ possible classes in each episode, the action space is defined as follows. Let $a_t$ be the action that the model takes at time step $t$. When the model predicts the label of the instance as one of the $N$ possible classes without requesting the ground-truth label at time $t$, $a_t$ is a prediction action. When the model requests the true label $y_t$ of the instance, $a_t$ is the label-request action. The action $a_t$ is represented by a one-hot vector in which the first $N$ bits correspond to the optionally predicted label $\hat{y}_t$ and are followed by one bit for requesting the label. The model can perform only one action per time step, either predicting the label of the instance or requesting the label, since only one bit of the vector can be 1. If the model queries the label of instance $x_t$, then no prediction is made, and the true label $y_t$ is sent to the model along with a new instance $x_{t+1}$ at the next time step. If the model chooses to predict, then the ground-truth label is not requested, and a zero vector $\vec{0}$ is sent to the model along with the next instance instead of the true label.
Here $r_t$ is the reward or penalty received after taking action $a_t$ in state $s_t$, and $\gamma$ represents the discount factor for future rewards. At each time step, once the model performs an action, one of the following three rewards is given: $R_{cor}$ for correctly predicting the label, $R_{inc}$ for incorrectly predicting the label, and $R_{req}$ for requesting the label. The goal is to maximize the sum of rewards received in the episode.
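The action encoding and reward rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `step`, the constant names, and the use of plain Python lists are assumptions, and the reward values are the ALOI settings reported in the experiments.

```python
# Reward values as reported for the ALOI experiments; constant names are illustrative.
R_COR, R_INC, R_REQ = 1.0, -1.0, -0.05


def step(action, true_label, n_classes):
    """Score one time step of the stream.

    action: one-hot list of length n_classes + 1; the last bit means
            "request the label", the first n_classes bits are predictions.
    Returns (reward, label_feedback), where label_feedback is the vector
    shown to the model together with the next instance.
    """
    if len(action) != n_classes + 1:
        raise ValueError("action must have n_classes + 1 bits")
    if any(bit not in (0, 1) for bit in action) or sum(action) != 1:
        raise ValueError("action must be one-hot")

    choice = action.index(1)
    if choice == n_classes:
        # Label requested: pay the request cost and reveal the true label.
        feedback = [1 if k == true_label else 0 for k in range(n_classes)]
        return R_REQ, feedback

    # Prediction made: reward depends on correctness, and a zero vector
    # is fed back instead of the true label.
    reward = R_COR if choice == true_label else R_INC
    return reward, [0] * n_classes
```

With three classes, `step([0, 0, 0, 1], 2, 3)` is a label request, while `step([0, 1, 0, 0], 1, 3)` is a correct prediction of class 1.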

B. METHODOLOGY
The purpose of reinforcement learning is to find practical and superior strategies in complex control and prediction tasks through interaction with the environment. Through exploration and exploitation, the agent learns from the positive and negative reinforcement it receives after each action. In this paper, Q-learning, an efficient model-free reinforcement learning method, is employed to learn an optimal policy $\pi^*(s_t)$ that maximizes the expected reward for any initial state. Q-learning can estimate the expected utility of the available actions and adapt to random transitions without a model of the system [54]; thus, it has been widely used in various decision-making problems [55]. In this paper, a long short-term memory (LSTM) network is used to approximate the action-value function of Q-learning and is connected to a fully connected output layer that outputs the Q-values, as depicted in Fig. 2.

FIGURE 2. Schematic diagram of the proposed reinforcement one-shot active learning (ROAL).
In reinforcement learning, an objective function must be defined to indicate which actions are good in the long term. The idea of Q-learning is not to require a model of the environment but to optimize an action-value function $Q(s_t, a_t)$ directly, with future rewards weighted by a discount factor $\gamma$ between 0 and 1. The policy followed at state $s_t$ is denoted $\pi(s_t)$ and outputs an action $a_t$ at time $t$. An optimal policy $\pi^*(s_t)$ that is better than or equal to all other policies always exists; $\pi^*(s_t)$ is the strategy that maximizes the optimal action-value function $Q^*(s_t, a_t)$. The action values are updated continually after observing the rewards received for taking different actions in different states, which should ultimately yield a policy that approximates the optimal policy $\pi^*$. Thus, the action chosen by the model is given by the optimal policy: $a_t = \pi^*(s_t) = \arg\max_{a_t} Q^*(s_t, a_t)$. According to the Bellman equation, the optimal action-value function satisfies $Q^*(s_t, a_t) = \mathbb{E}_{s_{t+1}}[r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})]$. Normally, $Q(s_t, a_t)$ is represented by a function approximator whose parameters are optimized by minimizing the Bellman error. Woodward et al. [20] derived a loss function of this kind, in which $\Theta$ represents the model parameters and $o_t$ are the observations (instances) that the agent receives.
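To make the Bellman error concrete, the sketch below computes a summed squared Bellman error over one episode from a table of Q-values. It assumes the standard Q-learning target (reward plus discounted maximum next-step Q-value) and assumes that the final step of an episode is not bootstrapped; the function names are illustrative, and the exact form used in [20] should be checked against that paper.

```python
def bellman_targets(q_values, actions, rewards, gamma):
    """Pair each taken Q-value with its one-step Q-learning target.

    q_values: list of T lists, q_values[t][a] = Q(o_t, a)
    actions:  list of T action indices actually taken
    rewards:  list of T rewards r_t
    gamma:    discount factor in [0, 1]
    """
    pairs = []
    last = len(q_values) - 1
    for t, (q_t, a_t, r_t) in enumerate(zip(q_values, actions, rewards)):
        if t < last:
            # r_t + gamma * max_a Q(o_{t+1}, a)
            target = r_t + gamma * max(q_values[t + 1])
        else:
            # Assumption: no bootstrapping beyond the end of the episode.
            target = r_t
        pairs.append((q_t[a_t], target))
    return pairs


def bellman_loss(q_values, actions, rewards, gamma):
    """Sum of squared Bellman errors over the episode."""
    pairs = bellman_targets(q_values, actions, rewards, gamma)
    return sum((q - target) ** 2 for q, target in pairs)
```

In a trained system the targets would be treated as constants during the gradient step; the sketch only evaluates the loss value.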
However, the loss function in Woodward's work [20] only considers the maximum value of Q. Thus, in the early stages of training, this loss function tends to be inefficient and prone to vanishing gradients. Cross entropy, an important concept in Shannon's information theory, is mainly used to estimate the difference between two probability distributions and has been widely used to define loss functions in machine learning. Intuitively, we introduce a cross-entropy term into the loss function to bring the label prediction probability distribution output by the current model closer to the probability distribution of the real label [21], thus avoiding these shortcomings, speeding up training and improving the efficiency of the model. The loss function we design adds this cross-entropy term to the loss above, where $p(Q(o_t, a_t))$ is the probability distribution derived from $Q(o_t, a_t)$ and $p(y_t)$ is the probability distribution of the true label at time step $t$.
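One plausible way to realize such a cross-entropy term is sketched below; it is not the paper's exact loss. The assumptions are that the distribution $p(Q(o_t, a_t))$ is obtained by a softmax over the $N$ class-prediction Q-values (excluding the request bit), that $p(y_t)$ is one-hot, and that the two loss parts are combined with a hypothetical weight `ce_weight`.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_term(q_step, true_label, n_classes):
    """Cross entropy between a one-hot true label and the softmax of the
    first n_classes Q-values (the prediction actions) at one time step."""
    probs = softmax(q_step[:n_classes])
    return -math.log(probs[true_label])


def combined_loss(bellman_error, ce_terms, ce_weight=1.0):
    """Bellman error plus weighted cross entropy.

    ce_weight is a hypothetical hyper-parameter, not taken from the paper.
    """
    return bellman_error + ce_weight * sum(ce_terms)
```

Unlike the Bellman error alone, the cross-entropy term gives every class output a training signal at each step, which matches the motivation given in the text for faster early training.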
A long short-term memory (LSTM) network [50] is used here, connected to a fully connected layer that outputs the Q-values. Each bit of the output vector $Q(o_t)$ corresponds to an action:
$Q(o_t) = W^{hq} h_t + b^q$
where $b^q$ is the bias vector of the action values, $h_t$ is the hidden state vector (also known as the output vector) of the LSTM unit, and $W^{hq}$ is the weight matrix mapping from the LSTM output to the action values. The equations for the forward pass of the LSTM unit with a forget gate that we used are:
$\hat{g}^f, \hat{g}^i, \hat{g}^o, \hat{c}_t = W^o o_t + W^h h_{t-1} + b$ (10)
$g^f = \sigma(\hat{g}^f), \quad g^i = \sigma(\hat{g}^i), \quad g^o = \sigma(\hat{g}^o)$
$c_t = g^f \odot c_{t-1} + g^i \odot \tanh(\hat{c}_t)$
$h_t = g^o \odot \tanh(c_t)$
Here, $\hat{g}^f$, $\hat{g}^i$ and $\hat{g}^o$ represent the forget gates, input gates and output gates, respectively; $\hat{c}_t$ denotes the candidate cell state and $c_t$ the new LSTM cell state. $W^o$ denotes the weights mapping from the observation to the gates, and $W^h$ the weights mapping from the hidden state to the gates and the candidate cell state. $b$ denotes the bias vector, $\sigma(\cdot)$ an element-wise sigmoid function, $\odot$ the element-wise product, and $\tanh(\cdot)$ the hyperbolic tangent function.
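The forward pass can be illustrated with a short NumPy sketch. It assumes the conventional LSTM formulation with a forget gate, with the four pre-activations stacked into one vector and split in the order forget, input, output, candidate; the function and argument names are illustrative rather than taken from the authors' code.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_q_step(o_t, h_prev, c_prev, W_o, W_h, b, W_hq, b_q):
    """One LSTM step followed by the linear Q-value head.

    Shapes, with D = observation size, H = hidden units, A = actions (N + 1):
      o_t (D,), h_prev (H,), c_prev (H,),
      W_o (4H, D), W_h (4H, H), b (4H,), W_hq (A, H), b_q (A,)
    """
    pre = W_o @ o_t + W_h @ h_prev + b                 # stacked pre-activations
    g_f_hat, g_i_hat, g_o_hat, c_hat = np.split(pre, 4)
    g_f, g_i, g_o = sigmoid(g_f_hat), sigmoid(g_i_hat), sigmoid(g_o_hat)
    c_t = g_f * c_prev + g_i * np.tanh(c_hat)          # new cell state
    h_t = g_o * np.tanh(c_t)                           # hidden state / output
    q_t = W_hq @ h_t + b_q                             # one Q-value per action
    return q_t, h_t, c_t
```

Carrying `h_t` and `c_t` from one time step to the next is what lets the network remember labels revealed earlier in the episode.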

IV. EXPERIMENTS
Two classification tasks were examined using our proposed ROAL model under an active one-shot learning set-up, and the results of the ROAL model are compared with the results of previous studies.

1) SETUP
We performed our first experiments on the Amsterdam Library of Object Images (ALOI) dataset [22] to show the general performance for target recognition. ALOI is a color image collection consisting of 1000 classes of small objects, with 108 images of each object, giving 108,000 instances in total. The dataset was split into 700 objects for training and the remaining 300 objects for testing. To measure test performance, the model interacts with new objects that it did not encounter during training.
Following the episodic stream-based setup, every episode consists of a series of 50 images from the ALOI dataset. In each episode, these 50 instances were randomly selected from 5 different classes, which were randomly drawn without replacement before every episode. The number of instances from each class may therefore be unbalanced. Each selected class in the episode was labeled not with its true label but with a pseudo-label randomly assigned when constructing the episode. The pseudo-labels are simply one-hot vectors of size equal to the number of classes drawn, giving $y_t$. A single-layer LSTM with 200 hidden units was used to represent $Q$. We used Adam with the default parameters [56] to optimize the weights of the model. The hyper-parameters were chosen by grid search, and the values used for the results reported in this article are as follows. During training, the model employed an epsilon-greedy exploration strategy with $\epsilon = 0.05$. The discount factor $\gamma$ was set to 0.5. Unless otherwise stated, each training step consisted of a batch of 100 episodes, and the reward values were set as $R_{cor} = +1$, $R_{inc} = -1$, and $R_{req} = -0.05$. Training was carried out on 100,000 episodes. For evaluation, episodes from the test set were grouped in sets of 20, and the average accuracy, request rate, and precision rate were computed. After training, 10,000 episodes of evaluation were conducted.
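The episode construction and exploration strategy described above can be sketched as follows. This is an illustration under stated assumptions: `class_to_items` is a hypothetical mapping from class identifier to its instances, instances are drawn independently (so repeats within an episode are possible), and the helper names are not from the original work.

```python
import random


def build_episode(class_to_items, n_classes=5, length=50, rng=random):
    """Sample one episode of (instance, one-hot pseudo-label) pairs."""
    # Draw the episode's classes without replacement.
    classes = rng.sample(sorted(class_to_items), n_classes)
    # Assign each drawn class a random pseudo-label for this episode only.
    slots = list(range(n_classes))
    rng.shuffle(slots)
    pseudo = dict(zip(classes, slots))

    episode = []
    for _ in range(length):
        cls = rng.choice(classes)          # class counts may be unbalanced
        item = rng.choice(class_to_items[cls])
        label = [1 if k == pseudo[cls] else 0 for k in range(n_classes)]
        episode.append((item, label))
    return episode


def epsilon_greedy(q_values, epsilon=0.05, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Because pseudo-labels are reshuffled for every episode, the model cannot memorize a fixed class-to-label mapping and has to bind labels to instances within the episode.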

2) RESULTS
Here we present two experimental results of our model on the ALOI dataset. In the first experiment, both the active one-shot learning (AOL) model [20] and the ROAL model were tested on the task in Fig. 1 with the same parameter setup. During training, the 1st, 2nd, 5th, and 10th instances of each class in each episode were tracked. Notably, in this analysis, label requests are counted as incorrect label predictions when calculating the accuracy. The models were trained on 100,000 episodes from the training set. After that, training was stopped, and the models were evaluated on 10,000 further test episodes. In these episodes, no further updates occurred, and the models were run on never-before-seen classes drawn from a disjoint test set. We report the results in Fig. 3 and Fig. 4.
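The evaluation convention above, in which a label request counts as an incorrect prediction, can be written down directly. In this sketch actions are integer indices with the value `n_classes` meaning "request"; the definition of the precision rate (correct predictions divided by predictions actually made) is an assumption, since the text does not define it.

```python
def episode_metrics(actions, labels, n_classes):
    """Return (accuracy, request_rate, precision) for one episode.

    actions: list of ints in [0, n_classes]; n_classes means "request label"
    labels:  list of true class indices in [0, n_classes)
    """
    total = len(actions)
    requests = sum(1 for a in actions if a == n_classes)
    correct = sum(1 for a, y in zip(actions, labels) if a == y)

    accuracy = correct / total             # requests count as incorrect
    request_rate = requests / total
    predictions = total - requests
    # Assumed definition: share of correct labels among predictions made.
    precision = correct / predictions if predictions else 0.0
    return accuracy, request_rate, precision
```

Under this convention a model that requests every label scores zero accuracy, so accuracy and request rate have to be read together.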
As can be seen from the figures, the proposed ROAL model learns to query the label for early instances of a class and makes more predictions for later instances. Meanwhile, the accuracy of the model improves on subsequent instances of a class. Compared with AOL, ROAL converges faster, with higher accuracy and a lower request rate. ROAL introduces cross entropy into the loss function, which greatly speeds up training and saves time and computing resources. We then performed another experiment to explore whether the model can effectively reason about its own uncertainty. In the previous experiments, instances in each episode were randomly arranged. In this experiment, in order to explore the model's action strategy, the order of instances was designed manually. Under this setting, experiments were conducted on the trained model, and three test classes were randomly chosen for each episode. Two groups of experiments were carried out. In both groups, 1000 episodes were run without learning, and the percentage of episodes in which a label was requested was recorded for each time step. In the first group, the model was given three instances from three different classes at the beginning of each episode, followed by three further instances, one from each of those classes. We report the label request rate for each of the first six time steps of the episode. As can be seen in Fig. 5(a), once the model has seen an instance of a class, it is able to recognize that class the next time it sees an instance of it; thus, the request rate for later instances of the same class was greatly reduced. This result is consistent with the original intention of active learning: if representative samples can be effectively selected for labeling, the cost of manual labeling can be greatly reduced. However, this experiment alone cannot prove that the model chooses actions based on the uncertainty of instances, since the model may have learned a naive strategy that always requests labels in the first few steps. For further confirmation, the second group of experiments was set up as follows: two instances from the first class were given, followed by two instances from the second class and two instances from the third class. As shown in Fig. 5

1) SETUP
The aircraft type classification dataset covers 215 classes of aircraft, with each class consisting of 20 aircraft, for a total of 4300 aircraft. It is based on time-series data of one month of aircraft flight tracks collected by multiple sensors, and it contains the track information of both military and civilian aircraft. This form of flight track data can be collected passively from far away in almost any location, which differs from sound and radar data, both of which are limited in location, and the latter of which is active [34]. Each track record consists of samples taken at irregular intervals. We extracted the motion features as the inputs of the model [1]. The dataset was split into 152 classes for training and the remaining 67 classes for testing.
For the first experiment, each episode consisted of a series of 30 aircraft tracks randomly selected from 3 different classes, which were randomly drawn without replacement before every episode. The number of tracks per episode was changed to 50 or 70 when the number of classes per episode was changed to 5 or 7. $Q$ is represented by an LSTM with 600 units. We used Adam with the default parameters [56] to optimize the weights of the model. The following hyper-parameters were chosen by grid search. An epsilon-greedy exploration strategy with $\epsilon = 0.1$ was used for action selection. The discount factor $\gamma$ was set to 0.6. In the experiments on the aircraft type recognition task, unless otherwise stated, the reward values were set as $R_{cor} = +1$, $R_{inc} = -1$, and $R_{req} = -0.3$. Training was carried out on 100,000 episodes. For evaluation, episodes from the test set were grouped in sets of 20, and the average accuracy, request rate, and precision rate were computed. After training, 10,000 episodes of evaluation were conducted.

2) FEATURE EXTRACTION
Because aircraft differ in performance and pilots differ in flight habits, useful motion features such as maximum speed, cruising speed, maximum acceleration, and maximum rate of climb were extracted as the input [1].
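As a rough illustration of how such features might be derived from an irregularly sampled track, the sketch below uses finite differences over consecutive samples. The track format (time in seconds, ground speed, altitude), the use of the median speed as a stand-in for cruising speed, and the function name are all assumptions; the actual feature definitions are those of [1].

```python
import statistics


def motion_features(track):
    """Compute simple motion features from one flight track.

    track: list of (t_seconds, speed, altitude) tuples sorted by time,
           sampled at irregular intervals.
    """
    speeds = [speed for _, speed, _ in track]
    accelerations, climb_rates = [], []
    for (t0, s0, h0), (t1, s1, h1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicated or out-of-order timestamps
        accelerations.append((s1 - s0) / dt)
        climb_rates.append((h1 - h0) / dt)

    return {
        "max_speed": max(speeds),
        # Assumption: median speed used as a proxy for cruising speed.
        "cruise_speed": statistics.median(speeds),
        "max_acceleration": max(accelerations) if accelerations else 0.0,
        "max_climb_rate": max(climb_rates) if climb_rates else 0.0,
    }
```

Dividing by the actual time difference of each pair of samples, rather than assuming a fixed sampling period, is what keeps the rates meaningful for irregular intervals.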

3) RESULTS
In Fig. 6 and Fig. 7 we report the results of our active model on aircraft type recognition task.
As shown in Fig. 6, since the ROAL model learns to query the label for early instances of each class, first-instance accuracy is poor. The sharp drop in label request rates for later instances also shows that ROAL makes more label predictions for later instances. At the same time, the prediction accuracy of the model improves further on later instances of a class, reaching close to 85%. As shown in Fig. 7, compared with AOL, ROAL converges faster and achieves higher accuracy. Since the task shown here is relatively simple, with each episode containing only 3 different categories, the label request rates of AOL and ROAL are almost equally low. Student's paired t-test was conducted to evaluate the statistical significance of the comparison between ROAL and AOL. A result was considered significant when the p-value of the hypothesis test was less than 0.05. In our results, the p-values for accuracy in both the training and test stages are well below 0.05, indicating that the results of ROAL are significantly superior to those of AOL. These data show that, by introducing cross entropy into the loss function, ROAL greatly speeds up training, effectively avoids the inefficiency of the early training stage, and saves considerable time and computing resources. To further compare ROAL and AOL, Fig. 8 shows the receiver operating characteristic (ROC) curve analysis results for the multi-class classification task. The ROC curve is a graphical plot of the true positive rate (TPR) against the false positive rate (FPR) as the discrimination threshold is varied. It clearly illustrates the diagnostic ability of a classifier system. The ROC plane is defined with FPR as the X-axis and TPR as the Y-axis, and both axes range from 0 to 1. A random guess gives the dotted diagonal line connecting (0,0) to (1,1), which divides the ROC space. Any classifier that appears above the diagonal performs better than random guessing, whereas curves below the line represent worse classification performance. Since we study a multi-class problem, we present not only the ROC curves of the two algorithms for each class but also the macro-average ROC curves, which reflect the overall classification performance of the two algorithms. As can be seen in Fig. 8, the ROC curves of the ROAL method lie closer to the upper-left corner than those of the AOL method. The area under the ROC curve (AUC) is often used for model comparison in machine learning. The AUC can be calculated by accumulating the trapezoidal areas between successive ROC points. The AUC value lies between 0 and 1, and the higher the AUC value, the better the classification performance. As can be seen in Fig. 8, the macro-average AUC of ROAL is 0.87, higher than the macro-average AUC of 0.83 for the AOL method. The AUC values of ROAL for each class are also higher than those of AOL. The results of the ROC-AUC analyses show that, compared with AOL, the ROAL algorithm effectively improves classification performance.
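The trapezoidal AUC computation mentioned above can be sketched for a one-vs-rest setting as follows. The helper names are illustrative, the input is assumed to contain at least one positive and one negative example per class, and `macro_auc` averages per-class AUC values, which is a simplification of a macro-average ROC curve obtained by interpolating the per-class curves.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) for binary labels as the threshold is lowered.

    scores: predicted scores for the positive class
    labels: 1 for positive, 0 for negative (both must occur)
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all examples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by accumulating trapezoids between points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def macro_auc(per_class):
    """Mean one-vs-rest AUC over classes; per_class is a list of
    (scores, labels) pairs, one per class."""
    return sum(auc(roc_points(s, y)) for s, y in per_class) / len(per_class)
```

A perfectly separating score ordering gives an AUC of 1.0, and an ordering that alternates positives and negatives moves the value toward 0.5.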
It is a natural idea to increase the penalty for misprediction to improve the accuracy of the model.And prediction accuracy is the most important thing in aircraft recognition task.In reinforcement learning, this goal can be achieved by changing the setting of reward function.To explore the impact of this, we further trained models using different reward values, which are   = −1 ,   = −2,   = −3,   = −4 , and   = −5.At the same time, we show the results of the AOL model presented on the same problem.As shown in TABLE I, the prediction accuracy increases with the increase of the penalty of incorrect labeling.Compared to AOL, ROAL achieves higher accuracy and a lower request rate with the same reward value setting.The experimental results also verified that the ROAL model can make trade-offs between high prediction accuracy of numerous label requests and a small number of label requests with low prediction accuracy.Higher prediction accuracy can be achieved by increasing the penalty value for wrongly predicting labels.Previous state-ofthe-art aircraft recognition studies have established a baseline of over 90% recognition accuracy.As   becomes more negative, ROAL approaches the accuracy over 97%, with less than 50% label request rate.Notably, we can conclude from the table that with the increase of model accuracy, the request rate increases rapidly.When the model accuracy exceeds more than 95%, the cost of increasing 1% accuracy is the increment of more than 11% label request rate.Therefore, properly  The experiments were further expanded by increasing the number of classes per episode.In the same task, the ROAL model was compared to AOL, a supervised learning model and 5 active learning methods [57](Random Sampling (Random) [58], Diversity promoting sampling (Density) [44], Learning Active Learning (LAL) [45], Uncertainty sampling (Unc) [41], Query By Committee (QBC) [43]) in the same task, where the model must deal with never-before-seen classes in the test set.The 
results are shown in TABLE II, and the rewards for AOL and ROAL were set as R_cor = +1 and R_req = −0.3. For the active learning methods, one labeled instance of each class was needed for setup at the beginning of each episode. The loss of the supervised learning model is the cross entropy between the true label and the predicted label, and the true label is always presented in the subsequent time step. For consistency, we used the same LSTM model in this supervised task [49], with the softmax applied to the output without the extra bit for the "request label" action. The results show that the traditional supervised learning method and active learning methods cannot rapidly learn new concepts, so they may be incapable of recognizing new targets in a one-shot learning setting. As the number of classes per episode increases, the ability of the ROAL algorithm to handle more complex tasks is further demonstrated. At the same time, compared with the other methods, the ROAL model significantly reduces the number of label requests while achieving the same or even higher accuracy. However, we also found that as the complexity of the problem increases, the label request rate increases rapidly, and an excessive label request rate means a large consumption of human resources. Therefore, for more complex problems, LSTM-based networks will no longer be adequate, and a more powerful one-shot learning approach should be introduced.
Notably, as explained in [49], human performance is a relevant baseline for one-shot learning. However, the central memory store is limited to 3 to 5 meaningful items in young adults [59]. Therefore, for a task such as aircraft type recognition, in which the number of classes is far beyond 5, this type of binding surpasses human working memory capacity, which is limited to storing only a handful of arbitrary bindings [49]. Compared to our previous work using supervised learning methods for aircraft type recognition [1], methods based on reinforcement one-shot active learning can significantly reduce the dependence on labeled data and achieve the same or even better model accuracy.
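The reward setting stated above (R_cor = +1, R_req = −0.3) and the two reported quantities, accuracy and label request rate, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The penalty for an incorrect prediction (`R_INC`) is not given in this section and is an assumed value, and encoding the "request label" action as the extra index `num_classes` is likewise an assumption made for illustration. The function names are hypothetical.

```python
R_COR = 1.0   # reward for a correct prediction (value stated in the text)
R_REQ = -0.3  # reward for requesting the label (value stated in the text)
R_INC = -1.0  # ASSUMPTION: penalty for a wrong prediction, not given in this section


def step_reward(action: int, true_label: int, num_classes: int) -> float:
    """Reward for one time step.

    Actions 0..num_classes-1 are class predictions; the extra index
    num_classes stands for the "request label" action (assumed encoding).
    """
    if action == num_classes:
        return R_REQ
    return R_COR if action == true_label else R_INC


def episode_stats(actions, labels, num_classes):
    """Return (accuracy over predicted steps, request rate, total reward)."""
    requests = sum(a == num_classes for a in actions)
    predicted = [(a, y) for a, y in zip(actions, labels) if a != num_classes]
    correct = sum(a == y for a, y in predicted)
    accuracy = correct / len(predicted) if predicted else 0.0
    request_rate = requests / len(actions) if actions else 0.0
    total = sum(step_reward(a, y, num_classes) for a, y in zip(actions, labels))
    return accuracy, request_rate, total
```

With three classes, an episode in which the agent requests the label twice, predicts correctly once, and predicts incorrectly once yields an accuracy of 0.5 on the predicted steps and a request rate of 0.5. A larger magnitude of `R_REQ` discourages requests, which is the trade-off between accuracy and labeling cost discussed above.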

V. CONCLUSION
As an essential technology in air traffic management, aircraft type recognition is attracting increasing attention from scholars. Existing studies have mostly been based on supervised image processing, which is inherently deficient in highly dynamic real-time applications. In this paper, we first develop a model that learns actively via reinforcement learning, with a label query strategy based on data characteristics. Second, we apply this meta active one-shot learning approach to target recognition tasks using the ALOI and aircraft type recognition datasets. The experimental results demonstrate that the model is good at rapidly learning new concepts and can transform an engineering heuristic for sample selection into a learning strategy based on data. Compared to previous studies, we significantly accelerate convergence, improve stability, decrease the number of label requests, and improve the accuracy of the model. Notably, the proposed model can learn when to label examples and when to request a label instead; thus, it meets the needs of intelligent air traffic management and has a wide range of applications.
In future work, we plan to evaluate our approach on more complex datasets and expand the scope of the study to a wider range of targets. For this, we may need a more sophisticated one-shot learning approach such as Matching Networks [15] or Memory-Augmented Neural Networks [49].

FIGURE 1. Task structure. For instances in the datasets, the classes and their labels, as well as specific samples of each class, are shuffled and randomly presented at each episode.
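The episode construction described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' data pipeline: the function name `sample_episode`, its parameters, and the representation of the dataset as a dictionary from class name to a list of samples are all hypothetical.

```python
import random


def sample_episode(data, classes_per_episode, steps, rng):
    """Build one episode as a list of (sample, label) pairs.

    data: dict mapping a class name to a list of samples (assumed layout).
    Classes are drawn at random, label indices are re-shuffled for every
    episode, and a random sample of a random chosen class is presented
    at each time step.
    """
    chosen = rng.sample(sorted(data), classes_per_episode)
    labels = list(range(classes_per_episode))
    rng.shuffle(labels)  # a class gets a different label index in each episode
    label_of = dict(zip(chosen, labels))
    episode = []
    for _ in range(steps):
        cls = rng.choice(chosen)
        episode.append((rng.choice(data[cls]), label_of[cls]))
    return episode
```

Because the class-to-label binding changes between episodes, a model cannot memorize a fixed mapping and has to infer the binding within each episode, either from a requested label or from the true label revealed at the next time step.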

FIGURE 4. Comparison of overall accuracy (a) and request rate (b) results between ROAL and AOL.
(b), the label request rates at the second, fourth, and sixth time steps are greatly reduced, and the label request rates at the third and fifth time steps are greatly increased. The difference in request rates between these time steps, together with the similar percentages of label requests for the first instance of each class, indicates that the model chooses its action based on the uncertainty of instances: the model queries the label when a new class appears and rapidly learns the new concept after that.

FIGURE 8. ROC plot with AUC values for AOL and ROAL.
value function has a vital impact on the learning effect of the model.

TABLE I. TEST SET CLASSIFICATION ACCURACIES AND THE PERCENTAGE OF LABEL

TABLE II. RESULTS FOR VARIOUS ARCHITECTURES ON THE AIRCRAFT RECOGNITION