Detection and Tracking Strategies for Autonomous Aerial Refuelling Tasks Based on Monocular Vision

Detection and tracking strategies based on monocular vision are proposed for autonomous aerial refuelling tasks. The drogue attached to the fuel tanker aircraft has two important features. First, the grey values of the drogue's inner part differ from those of the external umbrella ribs in the image. Second, the shape of the drogue's inner dark part is nearly circular. Based on this prior knowledge, rough and fine positioning algorithms are designed to detect the drogue. A particle filter based on the drogue's shape is proposed to track the drogue. A strategy for switching between detection and tracking is proposed to improve the robustness of the algorithms. The inner dark part of the drogue is segmented precisely during detection and tracking, and the segmented circular part can be used to measure its spatial position. The experimental results show that the proposed method achieves good real-time performance with satisfactory robustness and positioning accuracy.


Introduction
Aerial refuelling, also referred to as in-flight refuelling (IFR) or air-to-air refuelling (AAR), is an operation whereby fuel is transferred from one aircraft (the tanker) to another aircraft (the receiver) during flight. IFR is an important method for extending the flying distance and speed of the aircraft and is widely used in military aircraft. For unmanned aerial vehicles (UAVs), autonomous air-to-air refuelling is needed to ensure flight endurance. Two kinds of hardware configurations are used for aerial refuelling. The first configuration, called the boom-and-receptacle refuelling system, includes a rigid boom extending from the tanker aircraft, with a probe and nozzle at its distal end. The boom also includes airfoils controlled by a boom operator stationed on the refuelling aircraft. The airfoils allow the boom operator to actively manoeuvre the boom with respect to the receiver aircraft, which flies in a fixed refuelling position below and aft of the tanker aircraft [1][2][3][4][5]. The second configuration, called the probe-and-drogue refuelling system, includes a refuelling hose with a drogue at its end, trailed behind the tanker aircraft, and a probe installed on the receiver aircraft. The probe must be placed or docked into the drogue in order to refuel successfully [6][7][8][9]. Autonomous aerial refuelling relies on three key technologies (target detection, tracking and measurement) that allow the receiver aircraft to determine control strategies for a robust and safe approach and coupling. This paper aims to provide detection and tracking strategies for probe-drogue autonomous aerial refuelling based on monocular vision.
In this paper, detection and tracking strategies for the drogue based on monocular vision are proposed for autonomous aerial refuelling tasks. Two important features of the drogue are used to design the detection and tracking strategies. The first feature is that the grey values of the drogue's inner part are almost uniform and differ from those of the external umbrella ribs. The second feature is that the shape of the drogue's inner dark part is nearly circular. The drogue's detection algorithm includes two parts: the rough location algorithm and the fine positioning algorithm. The rough location algorithm defines the potential regions in which the drogue may be located, while the fine positioning algorithm finds the drogue accurately within those potential regions if the drogue is in the image. Particle filters are widely used in target tracking because of their robustness [10][11][12][13]. A new particle filter algorithm based on the drogue's shape is proposed to track the drogue. In the new particle filter algorithm, unique principles of state transition are defined to ensure tracking robustness even when the drogue's position or size changes significantly between two adjacent frames. A switch strategy between detection and tracking is proposed to improve the algorithm's robustness by providing the link between detection and tracking. This is critical when the tracking has failed or the drogue is not in the image.
The paper is organized as follows. Section 2 gives an introduction to previous works. Section 3 describes the drogue's detection strategy. Section 4 describes the drogue's tracking strategy. Section 5 describes the switch strategy between detection and tracking. Section 6 presents the experimental results and Section 7 concludes the main points of the research.

Previous Works
Machine vision methods used for autonomous aerial refuelling tasks are becoming increasingly popular [6][7][14][15][16][17]. Advantages of using machine vision methods for autonomous aerial refuelling tasks include the potential for installation without modification of the target aircraft and increased measurement precision. Researchers have developed a variety of machine vision methods for the probe-drogue autonomous aerial refuelling system (shown in Figure 1).
John Valasek et al. [6] developed a vision-based navigation sensor and system for autonomous aerial refuelling tasks. For application to the endgame docking problem of automated aerial refuelling of aircraft, a VisNav sensor (a position-sensing diode) was mounted on a receiver aircraft and a set of LED beacons was mounted on a drogue being trailed from a tanker aircraft. When light energy from an individual beacon on the drogue was focused on the surface of the position-sensing diode, it generated an electrical current, which was measured with four pickoff leads, one on each side. The six-degrees-of-freedom position and attitude of the sensor aircraft with respect to the drogue can be computed from the four position-sensing signals. The main disadvantage of the method proposed in [6] is that the tanker equipment must be modified to provide electrical power for the beacons, since there is no such power in the hose to which the drogue is attached.
Fravolini et al. [18] proposed a docking control scheme for autonomous aerial refuelling of UAVs using a probe-drogue refuelling system. The docking control scheme was based on a fuzzy sensor fusion strategy featuring GPS and machine vision data. The GPS was used to measure the relative position between the tanker and the receiver, and the machine vision was used to measure the relative camera-drogue distance. Markers were placed on the drogue to measure its position and orientation. However, GPS receivers may be affected by interference from electronic devices and GPS signals may be blocked by the tanker.
Lorenzo Pollini et al. [19] placed light emitting diodes (LEDs) on the drogue and used a CCD webcam with an infra-red filter to identify the LEDs. The Lu, Hager and Mjolsness (LHM) algorithm [20] was used to determine iteratively the translation vector as well as the transformation matrix between the 3D reference systems on the object and the camera, respectively. As in [6], the main disadvantage of [19] is that modifications to the tanker equipment are required.
Carol Martinez et al. [7] proposed a vision-based strategy for autonomous aerial refuelling tasks. The proposed strategy consisted of four stages: detection, initialization, tracking and 3D position estimation. The detection stage was composed of two algorithms: the first based on edge-image template matching using the normalized cross correlation (NCC) method, and the second based on image threshold segmentation. The detection method is time-consuming because the template images must cover drogue images with different variations, such as scale, illumination and position. It is impossible to cover all the conditions of the drogue in the template images, so in order to reduce detection failures, an empirical threshold was used to segment the image and detect the inner part of the drogue when the edge-image template matching method failed. However, the empirical threshold is hard to define because the illumination of the scene may change significantly. The tracking algorithm was a Hierarchical Multi-Parametric and Multi-Resolution implementation of the Inverse Compositional Image Alignment technique (HMPMR-ICIA) [21].

Rough Location of the Drogue
The aim of rough location is to define the potential regions in which the drogue may be located. According to prior knowledge, the rough location stage is composed of two algorithms: the first is based on image segmentation using a series of thresholds, and the second is based on contour features of the image regions produced by that segmentation.
An accurate empirical threshold for segmenting the inner part of the drogue cannot be fixed in advance because the illumination of the scene may change significantly. A series of thresholds is therefore used to segment the same image, giving $F = \{g_{T_0}, g_{T_1}, \ldots, g_{T_{N-1}}\}$, where f is the input image and F is the set of output images, each $g_{T_i}$ being the result of segmenting f with the threshold $T_i$. The thresholds $\{T_0, T_1, \ldots, T_{N-1}\}$ are evenly spaced, $T_i = T_0 + i\,\Delta T$ for $i = 0, 1, \ldots, N-1$, where ΔT is the increment of the threshold.
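The multi-threshold segmentation can be illustrated with a minimal Python sketch. It assumes evenly spaced thresholds and assumes that a pixel is marked as foreground when its grey value is below the threshold (the inner part of the drogue is dark); the function names and the tiny test image are hypothetical and are not taken from the original implementation.

```python
def threshold_series(t0, delta_t, n):
    """Return n evenly spaced thresholds starting at t0 with step delta_t."""
    return [t0 + i * delta_t for i in range(n)]


def segment(image, threshold):
    """Binary mask: 1 where the grey value is below the threshold.

    The 'dark is foreground' polarity is an assumption of this sketch.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


def multi_threshold_segment(image, t0, delta_t, n):
    """Segment the same image once per threshold and return all masks."""
    return [segment(image, t) for t in threshold_series(t0, delta_t, n)]


# Hypothetical 2x2 grey image, segmented with the parameter values that the
# experiments section reports (lowest threshold 20, increment 3, 53 thresholds).
demo_image = [[10, 200], [90, 30]]
demo_masks = multi_threshold_segment(demo_image, t0=20, delta_t=3, n=53)
```

Each mask would then be passed to contour extraction; a region that is stable across several thresholds appears in several masks, which is why duplicate contours have to be merged later.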
Then all the contours of the output images are extracted, and the set of contours is expressed as C = {c0, c1, …, cn-1}. The contour of the inner part of the drogue is nearly circular, so the aspect ratio and the size of the minimum enclosing rectangle of each contour are used to select the potential contours of the drogue from the set C: a contour ci is kept only if $Ratio(c_i) \ge T_{Ratio}$, $T_{L1} \le L(c_i) \le T_{L2}$ and $T_{W1} \le W(c_i) \le T_{W2}$. Here Ratio(ci) is the aspect ratio of the minimum enclosing rectangle of the contour ci and TRatio is the threshold on this aspect ratio. L(ci) is the length of the minimum enclosing rectangle, and TL1 and TL2 are the lower and upper bounds of this length. W(ci) is the width of the minimum enclosing rectangle, and TW1 and TW2 are the lower and upper bounds of this width.
C′ is the set of the drogue's potential contours; that is, the position of the drogue can be obtained from the positions of the contours in C′.
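The contour selection rule can be sketched as follows. This is an illustrative sketch only: it assumes that the aspect ratio is the shorter side divided by the longer side of the enclosing rectangle (so that a circle gives a value near 1) and that a contour passes when this ratio is at least TRatio. The function name and the sample rectangle sizes are hypothetical.

```python
def is_potential_drogue(length, width, t_ratio, t_l1, t_l2, t_w1, t_w2):
    """Check the enclosing-rectangle tests for one contour.

    length, width -- side lengths of the minimum enclosing rectangle (pixels)
    The ratio definition (shorter side / longer side) is an assumption.
    """
    if length <= 0 or width <= 0:
        return False
    ratio = min(length, width) / max(length, width)
    return (ratio >= t_ratio
            and t_l1 <= length <= t_l2
            and t_w1 <= width <= t_w2)


# Hypothetical bounds for a 1440x900 image: one third of the height and width.
bounds = dict(t_ratio=0.7, t_l1=4, t_l2=300, t_w1=4, t_w2=480)
rectangles = [(30, 28), (30, 10), (3, 3), (120, 110)]
candidates = [r for r in rectangles if is_potential_drogue(r[0], r[1], **bounds)]
```

Elongated regions (such as the hose) and tiny noise blobs are rejected by the ratio test and the lower size bounds respectively.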
In order to improve the speed of detection, the Multi-Resolution (MR) hierarchical structure [22] is used. The MR structure is created by repeatedly downsampling the image by a factor of two in order to create the different levels of the pyramid. The number of levels pL is chosen by taking into account the size of the drogue in the image. The general idea of the acceleration strategy is that the rough location of the drogue is conducted at the lowest resolution level. A further advantage of the MR structure is that many small spurious contours are not segmented at low resolutions.
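A minimal sketch of such a pyramid is given below. It assumes that each level is obtained by averaging non-overlapping 2×2 blocks and that the full-resolution image counts as the first level; both choices are assumptions of this sketch, and the function names are hypothetical.

```python
def downsample_by_two(image):
    """Halve the resolution by averaging non-overlapping 2x2 pixel blocks."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(image, levels):
    """Return [full resolution, half, quarter, ...] with `levels` entries."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_by_two(pyramid[-1]))
    return pyramid


# Hypothetical 4x4 image with grey values 0..15 and three pyramid levels.
demo_image = [[r * 4 + c for c in range(4)] for r in range(4)]
demo_pyramid = build_pyramid(demo_image, levels=3)
```

Rough location would run on the last entry of the list; a position found there is scaled by a factor of two per level to map it back to the full-resolution image.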

Location of Edge Points of the Drogue's Inner Dark Part
For every contour in the set C′, circular least-squares fitting [23] is used to obtain the contour's centre $(c^i_{centX}, c^i_{centY})$ from the points (xi, yi) on the contour $c'_i$. Some contours in the set may belong to the same object because a series of thresholds is used to segment the same image. Thus, the distance between different contours in the set C′ is compared with a threshold Td to eliminate unnecessary contours; $C^l$ denotes the set of contours that remain after this elimination, and the function u(t) used in the comparison is defined in (9). The set of the centres of the contours in $C^l$ is denoted $C^c$.
Half-lines are drawn from each centre, and a small rectangle divided into two parts is moved along every half-line; one part is called the interior rectangle and the other is called the exterior rectangle. The ratio of the sum of the pixel values in the exterior rectangle to the sum of the pixel values in the interior rectangle is calculated as the small rectangle is moved along every half-line. All the ratios on the same half-line are compared to find the position at which the ratio of the small rectangle's two parts is maximal, where p is a point on the half-line k, Ratio(p) is the ratio at the point p on the half-line k and $p_k^{\max}$ is the position corresponding to the maximal ratio on the half-line k. The edge points corresponding to the jth contour in the set $C^l$ (or the jth centre in the set $C^c$) are the maximal-ratio positions found on its half-lines.
Most of the half-lines are neither horizontal nor vertical, so coordinate transformations are used to calculate the coordinates of the small rectangles on these half-lines. Figure 3 shows three coordinate systems. The coordinate system (X0, Y0) is centred on the upper-left corner of the image; the X0 axis points to the right horizontally and the Y0 axis points vertically downwards. The coordinate system (X1, Y1) is centred at the point (x1, y1), which is a translation relative to the point O0. The X1 axis is parallel to the X0 axis and the Y1 axis is parallel to the Y0 axis, but their directions are opposite. The coordinate system (X2, Y2) is also centred at the point (x1, y1). The X2 axis coincides with the half-line i and the Y2 axis is perpendicular to the X2 axis.
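The change of coordinates between the half-line frame and the image frame can be sketched as a rotation followed by a translation. The sketch below assumes that the half-line direction is given as an angle measured counter-clockwise from the X1 axis and that only the vertical axis is flipped between (X1, Y1) and the image frame (X0, Y0); the function name and the sample values are hypothetical.

```python
import math


def half_line_to_image(origin, angle_deg, x2, y2=0.0):
    """Map a point (x2, y2) in the half-line frame to image coordinates.

    origin    -- (x, y) image coordinates of the start of the half-line
    angle_deg -- half-line direction, counter-clockwise from the X1 axis
                 (assumed convention)
    """
    phi = math.radians(angle_deg)
    # Rotate from the half-line frame (X2, Y2) into the translated frame (X1, Y1).
    u = x2 * math.cos(phi) - y2 * math.sin(phi)
    v = x2 * math.sin(phi) + y2 * math.cos(phi)
    # Translate into the image frame (X0, Y0); the sign of v is flipped because
    # the image Y0 axis points downwards (assumption about axis directions).
    return origin[0] + u, origin[1] - v


# A point 10 pixels along a half-line pointing straight up in the image,
# and a point on a horizontal half-line with a 2-pixel perpendicular offset.
point_up = half_line_to_image((100.0, 50.0), 90.0, 10.0)
point_right = half_line_to_image((100.0, 50.0), 0.0, 10.0, 2.0)
```

Applying this mapping to the four corners of the small rectangle gives the pixel region whose grey values have to be summed at each step along the half-line.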
The pixel sums are computed with an integral image: s(y) is the sum of the pixels in the y-th column and ii(H, y) is the value in the last line of the integral image. In Figure 4, l1 is the boundary between the small rectangle's two parts, l2 is the left side of the small rectangle and l3 is the right side of the small rectangle. The value of the integral image at location 2 is the sum of the pixels in rectangle A. The value of the integral image at location 1 is the sum of the pixels in rectangle A+B and the value of the integral image at location 3 is the sum of the pixels in rectangle A+B+C. Therefore, the sum of the pixels in rectangle B is ii1-ii2 and the sum of the pixels in rectangle C is ii3-ii1. The ratio of the small rectangle's two parts is the quotient of these two sums, where ii1, ii2 and ii3 are the values of the integral image at locations 1, 2 and 3 respectively.
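The integral-image computation can be illustrated with a short sketch. It builds a standard integral image with a zero border and evaluates the two-part ratio for an axis-aligned rectangle split at column l1. Which part counts as exterior is an assumption here (the part between l1 and l3), and all function names are hypothetical.

```python
def integral_image(image):
    """Integral image with one extra zero row and column.

    ii[r][c] holds the sum of image[0..r-1][0..c-1].
    """
    height, width = len(image), len(image[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for r in range(height):
        row_sum = 0
        for c in range(width):
            row_sum += image[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows top..bottom-1 and columns left..right-1."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_part_ratio(ii, top, bottom, l2, l1, l3):
    """Ratio of the part between l1 and l3 to the part between l2 and l1."""
    interior = rect_sum(ii, top, l2, bottom, l1)
    exterior = rect_sum(ii, top, l1, bottom, l3)
    return exterior / interior if interior else float("inf")


demo_ii = integral_image([[1, 2, 3], [4, 5, 6]])
demo_ratio = two_part_ratio(demo_ii, top=0, bottom=2, l2=0, l1=1, l3=3)
```

Each rectangle sum costs four look-ups regardless of the rectangle size, which keeps the search along every half-line cheap.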

Getting Rid of the Bad Edge Points Using Vector Angles
Bad edge points of the drogue's inner dark part may be detected because of image noise or partial occlusion of the drogue. Assume that the detected edge points of the drogue's inner dark part are the red points {1, 2, …, 16} shown in Figure 5. We define the vectors as $\{\overrightarrow{1\,2}, \overrightarrow{2\,3}, \ldots, \overrightarrow{15\,16}, \overrightarrow{16\,1}\}$. As shown in Figure 5, the vector angle θ1 of the detected edge point 1 is the angle between the vector $\overrightarrow{16\,1}$ and the vector $\overrightarrow{1\,2}$. The vector angle θ7 of the detected edge point 7 is the angle between the vector $\overrightarrow{6\,7}$ and the vector $\overrightarrow{7\,8}$. The other vector angles are obtained in the same way. Bad edge points are of diverse types, but those with large vector angles are typical. In Figure 5, the points 4, 8, 13 and 15 are typical bad edge points that produce large vector angles. The vector angles can therefore be used to get rid of the bad edge points: in equation (18), an edge point is discarded when its vector angle θi exceeds the threshold θT. Here P={p0, p1, …, pn1} is the set of the edge points remaining after the bad edge points have been removed using vector angles, np is the number of the edge points detected in Section 3.2.1, n1 is the number of the remaining edge points, θT is the threshold of the vector angles, θi is the vector angle of the ith edge point and the function u(t) is defined in (9). The bad edge points 4, 8, 13 and 15 in Figure 5 can be eliminated by equation (18).
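The vector-angle test can be sketched in a few lines of Python. The sketch assumes that the edge points are ordered around the contour and that a point is kept when its vector angle does not exceed the threshold; the function names and the synthetic 16-point example are hypothetical.

```python
import math


def vector_angle(prev_pt, pt, next_pt):
    """Angle in degrees between the vectors prev_pt->pt and pt->next_pt."""
    v1 = (pt[0] - prev_pt[0], pt[1] - prev_pt[1])
    v2 = (next_pt[0] - pt[0], next_pt[1] - pt[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return 0.0
    cos_a = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))


def filter_by_vector_angle(points, theta_t):
    """Keep the ordered edge points whose vector angle is at most theta_t."""
    n = len(points)
    return [
        points[i]
        for i in range(n)
        if vector_angle(points[i - 1], points[i], points[(i + 1) % n]) <= theta_t
    ]


# Sixteen points on a circle of radius 50, with one point pushed far outward.
demo_points = [
    (50 * math.cos(2 * math.pi * k / 16), 50 * math.sin(2 * math.pi * k / 16))
    for k in range(16)
]
demo_points[4] = (0.0, 120.0)
demo_kept = filter_by_vector_angle(demo_points, theta_t=80.0)
```

On a clean circle sampled at 16 points every vector angle is 22.5°, so an 80° threshold leaves good points untouched while the displaced point, whose angle is roughly 150°, is removed.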

Location of the Drogue Using RANSAC
Bad edge points are of diverse types, so some of them cannot be eliminated by vector angles. For example, the bad point 9 in Figure 5 cannot be eliminated because its vector angle θ9 is less than the threshold θT. The shape of the drogue's inner dark part is nearly circular, so good edge points can be determined from this prior knowledge. RANSAC [24], an abbreviation for "random sample consensus", is an iterative method for estimating the parameters of a mathematical model from a set of observed data that contains outliers. Each contour $c^l_i$ in the set $C^l$ corresponds to a set $P^i = \{p^i_0, p^i_1, \ldots, p^i_{n1}\}$ of the edge points remaining after the bad edge points have been removed using vector angles. A threshold TW is used to decide whether the set $C^l$ includes the contour of the drogue. The result is the state of detection S: S=1 indicates that there is a drogue in the image and S=0 indicates that there is no drogue in the image. The pseudo-code of the algorithm for location of the drogue using RANSAC is presented in Figure 6.
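Because the pseudo-code of Figure 6 is not reproduced in the text, the following is only a generic RANSAC circle fit, not the authors' algorithm. It fits a circle to small random samples with an algebraic least-squares fit and keeps the model with the most inliers. The default sample size and tolerance merely echo the values ns=5 and Tr=6 reported in the experiments; treating Tr as an inlier distance tolerance, the iteration count and all function names are assumptions.

```python
import math
import random


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_circle(points):
    """Algebraic least-squares circle fit; returns (cx, cy, r) or None."""
    sx = sy = sxx = syy = sxy = sxz = syz = sz = 0.0
    for x, y in points:
        z = x * x + y * y
        sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y
        sxz += x * z; syz += y * z; sz += z
    a = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(len(points))]]
    b = [-sxz, -syz, -sz]
    d = det3(a)
    if abs(d) < 1e-9:
        return None  # degenerate sample, e.g. collinear points
    sol = []
    for col in range(3):  # Cramer's rule for the 3x3 normal equations
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        sol.append(det3(m) / d)
    cx, cy = -sol[0] / 2, -sol[1] / 2
    r2 = cx * cx + cy * cy - sol[2]
    return (cx, cy, math.sqrt(r2)) if r2 > 0 else None


def ransac_circle(points, n_samples=5, tol=6.0, iterations=200, seed=0):
    """Return (best circle, inliers); the circle is refitted on its inliers."""
    if len(points) < n_samples:
        return None, []
    rng = random.Random(seed)
    best, best_inliers = None, []
    for _ in range(iterations):
        model = fit_circle(rng.sample(points, n_samples))
        if model is None:
            continue
        cx, cy, r = model
        inliers = [p for p in points
                   if abs(math.hypot(p[0] - cx, p[1] - cy) - r) < tol]
        if len(inliers) > len(best_inliers):
            best, best_inliers = model, inliers
    if best is not None and len(best_inliers) >= 3:
        best = fit_circle(best_inliers) or best
    return best, best_inliers


# Sixteen edge points on a circle (centre (100, 80), radius 40), three of
# which are replaced by gross outliers at the same polar angle.
demo_points = [(100 + 40 * math.cos(2 * math.pi * k / 16),
                80 + 40 * math.sin(2 * math.pi * k / 16)) for k in range(16)]
for k, radius in ((2, 85.0), (7, 5.0), (12, 80.0)):
    angle = 2 * math.pi * k / 16
    demo_points[k] = (100 + radius * math.cos(angle), 80 + radius * math.sin(angle))
demo_model, demo_inliers = ransac_circle(demo_points)
```

A detection state in the spirit of S could then be derived by comparing the support of the best circle with a threshold, although the exact weighting used with TW is not recoverable from the text.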

Tracking Strategy
The tracking algorithm is based on particle filters which are sequential Monte Carlo methods [25] based on point mass. Particle filters are suitable for any non-linear system that could be represented by a state model. The tracking object is the inner dark part of the drogue in the tracking algorithm, so the centre of the inner dark part of the drogue can be defined as the state xt of the drogue at time t.

Selection of Particles and State Transition
The disturbance acting on the drogue in the air is uncertain, so it is hard to establish an accurate motion model of the drogue during air-to-air refuelling. The state xt-1 of the drogue at time t-1 is therefore selected as the particle at time t. The number of particles is Nin + Nex. Nin is the number of interior particles, whose state transition is limited to the interior of the circle Ccir with centre xt-1 and radius r, while Nex is the number of exterior particles, whose state transition is limited to a set of half-lines outside the circle Ccir. In Figure 7, the half-lines to which the exterior particles are limited start at a distance l from the centre xt-1 and extend outward. The angle between two adjacent half-lines is θ and the distance between two adjacent exterior state transition particles on the same half-line is Δs. In the state transition, $x_t^{in,k}$ denotes the kth interior state transition particle and $x_t^{ex,ij}$ denotes the exterior state transition particle that is the jth particle on the ith half-line.
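The layout of the state transition particles can be sketched as below. The transition equations themselves are not recoverable from the text, so this sketch makes two assumptions: interior particles are drawn uniformly inside the circle Ccir, and the first exterior particle on each half-line lies exactly at the distance l from the centre. The function names are hypothetical; the sample values are those reported for air scene 1.

```python
import math
import random


def interior_particles(centre, radius, count, rng):
    """Draw `count` particles uniformly inside the circle (assumed distribution)."""
    particles = []
    for _ in range(count):
        rho = radius * math.sqrt(rng.random())  # sqrt gives uniform area density
        phi = 2 * math.pi * rng.random()
        particles.append((centre[0] + rho * math.cos(phi),
                          centre[1] + rho * math.sin(phi)))
    return particles


def exterior_particles(centre, start, step, theta_deg, per_line):
    """Place particles on half-lines spaced theta_deg apart around the centre.

    On each half-line the j-th particle lies at distance start + j * step
    (that the first particle sits exactly at `start` is an assumption).
    """
    particles = []
    n_lines = int(round(360 / theta_deg))
    for i in range(n_lines):
        phi = math.radians(i * theta_deg)
        for j in range(per_line):
            d = start + j * step
            particles.append((centre[0] + d * math.cos(phi),
                              centre[1] + d * math.sin(phi)))
    return particles


demo_centre = (200.0, 150.0)
demo_interior = interior_particles(demo_centre, radius=15, count=20,
                                   rng=random.Random(0))
demo_exterior = exterior_particles(demo_centre, start=15, step=22,
                                   theta_deg=24, per_line=3)
```

With θ=24° there are 15 half-lines, and three particles per half-line give the 45 exterior particles reported in the experiments; the exterior particles are what allow the tracker to recover when the drogue jumps well outside Ccir between two frames.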

Particle Weights and Posterior Probability
The particle weights of the state transition particles in Section 4.1 are calculated from the following quantities: (Xt, Yt) are the coordinates of the drogue's state xt, $(X_t^i, Y_t^i)$ are the coordinates of the ith state transition particle, $r_t^i$ is the radius corresponding to the ith state transition particle, calculated with the same method as in Section 3.2.3, and rt is the radius of the inner dark part of the drogue at time t.

Switch Strategy between Detection and Tracking
The detection stage must be enabled automatically at the start of the run, when the drogue has gone out of the camera's field of view, or when the tracking algorithm has failed to track the drogue. Therefore, performance assessment criteria should be defined to switch between the detection algorithm and the tracking algorithm. The detection and tracking algorithm is initiated with the lost status L=1 (i.e., no drogue has been detected). The detection algorithm is then enabled to find the drogue. When the drogue is detected successfully, i.e., the state of detection in Section 3.2.3 is S=1, the lost status becomes L=0 and the tracking algorithm is enabled. The performance assessment criteria of the tracking algorithm are defined according to the weights of the drogue's states in ks successive frames, using the function u(t) defined in (9), where Wi is the state weight at time i and Tt is a fixed threshold. If the lost status is L=0, the tracking algorithm continues running. If the lost status is L=1, the tracking algorithm is stopped and the detection stage is enabled in the region of interest (ROI) of the image. If the drogue is then detected successfully, the lost status becomes L=0 and the tracking algorithm is enabled; otherwise the lost status remains L=1 and the detection stage is enabled in all regions of the image. The process of switching between detection and tracking is shown in Figure 8, the proposed visual detecting and tracking system for air-to-air refuelling.
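The switching logic can be summarised as a small state machine. The exact assessment criterion is not recoverable from the text, so this sketch assumes that tracking is declared lost when the state weight stays below Tt for ks successive frames; the class, method and mode names are hypothetical.

```python
FULL, ROI, TRACK = "detect_full", "detect_roi", "track"


class SwitchStrategy:
    """Decide per frame whether to detect (whole image or ROI) or to track."""

    def __init__(self, t_t, k_s):
        self.t_t = t_t          # weight threshold Tt
        self.k_s = k_s          # number of successive frames ks
        self.mode = FULL        # start with lost status L = 1
        self.low_count = 0

    def after_detection(self, detected):
        """Update the mode from the detection state S of the current frame."""
        if detected:
            self.mode = TRACK   # L = 0: hand over to the tracker
            self.low_count = 0
        else:
            self.mode = FULL    # still lost: search the whole image next

    def after_tracking(self, weight):
        """Update the mode from the state weight of the current frame."""
        self.low_count = self.low_count + 1 if weight < self.t_t else 0
        if self.low_count >= self.k_s:
            self.mode = ROI     # L = 1: first retry detection in the ROI
            self.low_count = 0


demo = SwitchStrategy(t_t=15, k_s=3)
demo_trace = [demo.mode]
demo.after_detection(False); demo_trace.append(demo.mode)
demo.after_detection(True); demo_trace.append(demo.mode)
for w in (20, 5, 5, 5):
    demo.after_tracking(w); demo_trace.append(demo.mode)
demo.after_detection(False); demo_trace.append(demo.mode)
demo.after_detection(True); demo_trace.append(demo.mode)
```

Trying the ROI first keeps the recovery cost low in the common case where the drogue has moved only slightly, and falling back to the whole image covers the case where it has left the ROI.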

Experimental Results
In this section, experiments were conducted on real air-to-air refuelling drogues in different air scenes. The performance of our method was compared with that of the algorithm proposed by Carol Martinez et al. [7]. Three experiments were carried out to detect and track different drogues in different air scenes. Processing speed and percentage of correct location are compared between Carol Martinez's method and ours. The speed indicators were the average time tfave for processing each image, the maximum time tmax, the minimum time tmin and the average time tave between adjacent outputs when the drogue was in the image. The percentages of correct location are compared between Carol Martinez's method and ours at different location error thresholds. The proposed algorithm was developed in C++, with the OpenCV libraries used for managing image data. The experiments were carried out on a PC with an AMD Athlon(tm) II X4 645 processor running at 3.1 GHz.
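The percentage of correct location at a given error threshold can be computed as in the sketch below. It assumes that the location error is the Euclidean distance in pixels between the estimated centre and a reference centre, and that frames with no output count as incorrect; both conventions, the function name and the sample data are assumptions, not taken from the paper.

```python
import math


def percent_correct(estimates, references, error_threshold):
    """Percentage of frames whose location error is within the threshold.

    estimates  -- list of (x, y) tuples, or None when no output was produced
    references -- list of reference (x, y) centres, one per frame
    """
    if not references:
        return 0.0
    correct = 0
    for est, ref in zip(estimates, references):
        if est is not None and math.hypot(est[0] - ref[0],
                                          est[1] - ref[1]) <= error_threshold:
            correct += 1
    return 100.0 * correct / len(references)


# Hypothetical data: errors of 0 and 5 pixels, one missing output, one
# 10-pixel error, evaluated at a five-pixel threshold.
demo_estimates = [(10, 10), (13, 14), None, (20, 30)]
demo_references = [(10, 10), (10, 10), (5, 5), (20, 40)]
demo_percent = percent_correct(demo_estimates, demo_references, 5)
```

Sweeping the threshold over a range of pixel values produces a curve of the kind used to compare the two methods.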

At Air Scene 1
A total of 125 frames with a size of 1440×900 pixels were used in the experiment at air scene 1. The experimental data were obtained from the website http://www.youtube.com/watch?v=nWmFpLVl8MQ. In Carol Martinez's method [7], eleven edge templates, as shown in Figure 9, were used to find the drogue in the lowest resolution image of the image pyramid, the number of pyramid levels was pL=3 and the threshold used to segment the image was 85.
In our method, the number of pyramid levels was pL=3. The lowest threshold was T0=20, the number of thresholds was k=53 and the increment of the threshold was ΔT=3 in equation (4). The aspect ratio threshold was TRatio=0.7, the lower bound of the length of the minimum enclosing rectangle was TL1=4 and the upper bound TL2 was equal to one third of the image's height in (6). The lower bound of the width of the minimum enclosing rectangle was TW1=4 and the upper bound TW2 was equal to one third of the image's width in (7) and (8). The threshold for eliminating unnecessary contours was Td=1.5 in (12). In Section 3.2.1, the number of half-lines corresponding to each contour was 20, the length of each half-line was 60, the height of the small rectangle was three and the width of the small rectangle was eight. The threshold of the vector angles was θT=80° in (18) and the threshold TW was 15 in (20). In the pseudo-code in Section 3.2.3, the number of samples was ns=5 and the threshold Tr was six.
In Section 4.1, the number of interior particles was Nin=20, the number of exterior particles was Nex=45, the radius of the circle Ccir was 15 and the parameter l was 15. The distance between two adjacent exterior state transition particles on the same half-line was Δs=22, three exterior particles were placed on each half-line and the angle between two adjacent half-lines was θ=24°. The threshold TP was 15 in Section 4.2 and the threshold Tt was 15 in Section 5.
Nine result frames of our method are shown in Figure 10. The magnified target is displayed in the top-right corner of each frame. In Figure 10, the green circle segmented during detection and tracking is the inner dark part of the drogue, the white point is the centre of the green circle and the red points are the state transition particles. The processing time of each frame in our method and in the method of Carol Martinez et al. is compared in Figure 11. The processing time of each frame in our method is clearly less than that in the method of Carol Martinez et al., and it is not affected by the scale of the drogue in the image. In Figure 11, the cyan triangle is the detection time (about 16ms) in the ROI in the 116th frame and the nine magenta pentagrams stand for the detection time (less than 172ms) in the whole image in our method. The green squares mark the frames in which no data is output (i.e., the algorithm decides that there is no target in the frame) in Carol Martinez's method. As shown in Figure 11, Carol Martinez's method outputs no data from the 32nd frame to the 39th frame even though the targets in these frames are clear, while our method gives the drogue's position in every frame.
In the method of Carol Martinez et al., the processing time is affected by the size of the reference image. The detecting algorithm finds the first reference image, with a size of 213×206 pixels, in the 0th frame as shown in Figure 11, and the average tracking time corresponding to the first reference image is 1099.5ms. The detecting algorithm finds the second reference image, with a size of 217×213 pixels, in the fifth frame as shown in Figure 11, and the average tracking time corresponding to the second reference image is 1163.3ms. The detecting algorithm finds the third reference image, with a size of 246×254 pixels, in the 40th frame. The speed indicators of both methods are shown in Table 1. The percentage of correct location in the method of Carol Martinez et al. is less than 30%, while in our method it is 100% when the location error threshold is five pixels, as shown in Figure 18.

At Air Scene 2
A total of 190 frames with a size of 1440×900 pixels were used in the experiment at air scene 2. The experimental data were obtained from the website http://www.youtube.com/watch?v=cG6rMZF6mIw. The appearance of the drogue at air scene 2 was different from that of the drogue at air scene 1, but both had the two important features. In Carol Martinez's method, eleven edge templates, as shown in Figure 12, were used to find the drogue and the threshold used to segment the image was 40. The parameters in our method are the same as in Section 6.1. Nine result frames of our method are shown in Figure 13 and the processing time of each frame is shown in Figure 14. The processing time of each frame in our method is clearly less than that in the method of Carol Martinez et al., as shown in Figure 14. In Figure 14, the seven cyan triangles are the detection times in the ROI and the 11 magenta pentagrams stand for the detection times in the whole frame in our method. The detection time of Carol Martinez's method, represented by the black pentagrams in Figure 14, is clearly greater than the detection time of our method. The green squares mark the frames in which no data is output in Carol Martinez's method. As shown in Figure 14, that method outputs no data from the 121st frame to the 129th frame, and a wrong target is detected and tracked from the 130th frame to the 136th frame. This is probably because the drogue is partially occluded. In our method, no data is output only in the 138th frame. The speed indicators of our method are better than those of the method of Carol Martinez et al., as shown in Table 1. The percentage of correct location in the method of Carol Martinez et al. is less than 20%, while in our method it is nearly 80% when the location error threshold is five pixels, as shown in Figure 18.

At Air Scene 3
A total of 123 frames with a size of 1440×900 pixels were used in the experiment at air scene 3. The experimental data were obtained from the website http://www.jokeroo.com/videos/cool/aerial-refueling.html. The appearance of the drogue at air scene 3 was different from that of the drogues at air scene 1 and air scene 2, but all of them had the two important features. In Carol Martinez's method, eleven edge templates, as shown in Figure 15, were used to find the drogue and the threshold used to segment the image was 50. The parameters in our method are the same as in Section 6.1. Nine result frames are shown in Figure 16 and the processing time of each frame in our method is shown in Figure 17. The processing time of each frame in our method is clearly less than that in the method of Carol Martinez et al., as shown in Figure 17, since the HMPMR-ICIA [21] algorithm adopted by Carol Martinez is a time-consuming iterative optimization method during tracking. The speed indicators of our method are better than those of the method of Carol Martinez et al., as shown in Table 1. The percentage of correct location in the method of Carol Martinez et al. is less than 70%, while in our method it is more than 80% when the location error threshold is five pixels, as shown in Figure 18.

Conclusions
Detecting and tracking strategies for aerial refuelling tasks based on monocular vision were proposed. Using prior knowledge of the drogue, multi-threshold segmentation and a shape-distinguishing algorithm were used to detect the drogue. Multi-threshold segmentation decreased the rate of missed detection and the shape-distinguishing algorithm pinpointed the drogue's position precisely. A new particle filter algorithm based on the drogue's shape was proposed to track the drogue. In the new particle filter algorithm, unique principles of state transition are defined to ensure tracking robustness even when the drogue's position or size changes significantly between two adjacent frames. A strategy for switching between detection and tracking was proposed, and this switching strategy enhanced the robustness of our method. In the method, the inner dark part of the drogue was segmented precisely during detection and tracking, and the segmented circular part can be used to measure the spatial position of the drogue. The proposed method is fast, is less affected by the size of the drogue in the image and is highly accurate. In the future, we will try to use the segmented circular part to measure the position of the drogue in Cartesian space based on monocular vision by combining the 3D model of the drogue with the imaging principle.

Acknowledgments
This work is partly supported by the National Natural Science Foundation of China under Grants 61227804 and 61105036.