International Journal of Advanced Robotic Systems Vision-aided Inertial Navigation for Small Unmanned Aerial Vehicles in Gps-denied Environments Regular Paper

This paper presents a vision-aided inertial navigation system for small unmanned aerial vehicles (UAVs) in GPS-denied environments. During visual estimation, image features in consecutive frames are detected and matched to estimate the motion of the vehicle with a homography-based approach. Afterwards, the visual measurement is fused with the output of an inertial measurement unit (IMU) by an indirect extended Kalman filter (EKF). A delay-based approach for the measurement update is developed to introduce the visual measurement into the fusion without state augmentation. This method supposes that the estimated error state is stable and invariant during the second half of one visual calculation period. Simulation results indicate that delay-based navigation can reduce the computational complexity by about 20% compared with general augmented Vision/INS (inertial navigation system) navigation, with almost the same estimate accuracy. Real experiments were also carried out to test the performance of the proposed navigation system by comparison with the augmented filter method and a referential GPS/INS navigation.


Introduction
Small UAVs typically use a low-cost IMU integrated with a GPS receiver to build a navigation system due to a limited on-board computational capability and payload capacity [1,2]. Although the IMU is able to track sudden motions both in angular velocity and linear acceleration of a short duration, it is subject to a growing unbounded error caused by low frequency bias. The GPS can bind the accumulated error by estimating the absolute velocity and position of the vehicle directly. However, it is easily disturbed by the weather and the flight environment. Under these GPS-denied circumstances, general GPS/INS navigation is no longer reliable. Therefore, other sensors are required for autonomous flight.
Computer vision has played an important role in both the control and navigation of small UAVs over the past decade [3]. Visual measurement can be used in navigation [4], terrain following [5], obstacle avoidance [6] and urban canyon flight [7]. According to the selection of the camera, there are two main methods for visual applications: stereo and monocular. Stereovision systems have been successfully applied in small UAVs for years [8,9], but the distance between the two cameras limits the useful altitude range [10]. Monocular vision also offers a good solution in terms of weight, accuracy and scalability for small UAVs, while the scale is usually estimated by fusion with other sensors [11,12].
Map-based approaches for the visual navigation of small UAVs focus on the SLAM (simultaneously localization and mapping) algorithm [12,13]. This estimates the state of the vehicle and builds a map at the same time. Another method using mosaics as environment representations is proposed in [10]. Optical flow (OF) is a typical mapless method for small/mini UAVs [14,15]. This method runs at a very high frequency, but is limited to domains where the motion between consecutive frames is expected to be small. OF usually requires an inertial measurement to compensate for the effect of the rotation on the optical flow field. Visual odometry (VO) is considered as another popular mapless approach for the visual navigation of small UAVs [16]. In this system, the visual sensor is in charge of the position and orientation estimation by tracking ground template and the attitude is measured by the on-board IMU. A major defect of the original VO algorithm is the accumulated error of the state estimate, as it fails to distinguish the translation from the rotation without inertial compensation.
As to the above defect, methods that can estimate the motion of the vehicle in six degrees of freedom have also been applied in visual navigation. An EKF-based visual and inertial navigation system with an L-M-based nonlinear estimation method has been proposed [17,18]. The state of the EKF was expanded by including the states of the vehicle at two consecutive moments. A method to combine inertial and visual measurements by an unscented Kalman filter (UKF) was developed in [19]. The state was augmented so that the homography estimated by the visual sensor could be directly used as the measurement of the filter. Moreover, a delay state EKF that allowed loose coupling of the inertial estimation and the VO-based visual measurement was described in [20]. The delayed state was created by appending the position and orientation of the vehicle at the last VO frame. Other methods have been proposed to build vision-aided inertial navigation systems and they also use state augmented approaches to introduce visual measurement into the filter [21,22]. In the above systems, the visual motion estimate is usually taken as a relative measurement, so the most intuitive method is state augmentation. One major defect is that the augmented state requires extra calculations for the filter state and the state covariance.
This paper describes the whole process of the visionaided inertial navigation for small UAVs, from feature detection to the fusion of inertial and visual measurements. A scale invariant feature transform (SIFT) approach is adopted to detect and describe the features from aerial images. Then the matched features in consecutive frames are used to estimate the motion of the vehicle in six degrees of freedom with a homographybased method. We propose a new approach to combine the inertial and visual measurements to estimate the state of the vehicle based on an indirect EKF. The main contribution of this paper is the derivation of a delaybased measurement update method for the EKF to fuse the visual and inertial measurements without any state augmentation. A simulation environment will be set up to compare the proposed and the general augmented methods by both computational complexity and estimate accuracy. The performance of this navigation system will also be verified by real experiments.

Homography-based visual navigation
The visual navigation that estimates the motion of the vehicle according to the visual information is described in this section. Stable features are detected and matched to estimate the homography between consecutive frames. Afterwards, the homography is decomposed into the translational and rotational motions of the vehicle.

Feature detection and match
In this paper, SIFT [23] is used for feature detection. This method searches for features not only in the geometry space, but also in the scale space established by the Gaussian difference of the image in different scales. A descriptor, which is invariant to rotation, translation and scale, is established to describe the corresponding feature. Since thousands of features might be extracted even from a small image, a fuzzy-based threshold adjustment method is proposed to stabilize the number of features from one image regardless of the change in the image sequence. This method can reduce the calculation of the image processing without affecting the accuracy of the posterior visual estimation. Further details of the proposed adaptive method are available in [24].
Matching the features in consecutive frames is implemented by the calculation of Euclidean distance between associated descriptors, as all the descriptors have been normalized in the process of feature detection. A fast nearest-neighbour algorithm is used to match the features quickly. A matched pair is accepted if and only if the distance between the two descriptors is the shortest, less than a threshold and shorter than 0.8 times the distance of the second nearest neighbour. A bi-direction match method is applied to improve the robustness of the match.

Robust homography calculation
Homography is modelled to indicate the transformation between two images, including scale, rotation and translation. The homography model is defined as follows: where �� � � � 1� � and�� � � � 1� � are homogeneous positions for the corresponding points of two consecutive frames in the pixel coordinate, � � �� �� � ��� is the homography and � is a scale factor. The homography has eight degrees of freedom, as it is defined up to the scale factor.
A random sample consensus (RANSAC) approach is applied to eliminate the erroneous matches as not all the pairs are matched correctly. A bucketing technique [25] is adopted in the step of the subset selection in RANSAC to avoid a dramatically enlarged error of the estimate when features from one subset are too close. Singular value decomposition (SVD) is used to estimate the homography by the matched feature pairs that pass the RANSAC filter. An M-estimator algorithm is applied to add weights to the matched pair according to the match accuracy to improve the robustness of the estimate.

Homography-based motion estimation
Suppose a small UAV with an on-board downwardlooking camera is flying above plane Π, � � and � � are two projections in the camera coordinate of a fixed point � as shown in Figure 1. � � � and � � � are defined as the homogeneous points corresponding to the projections in the pixel coordinate. Their relationships are given as: where A is the camera intrinsic matrix, which can be calibrated accurately before the flight and � � � and � � � are the positions in the z direction of the two projected points in the camera coordinate.
Without a loss of generality, it is considered that the world coordinate is fixed to the camera coordinate of position 1. Then � �� is defined as the rotation matrix and � �� is the translation vector. They are both described in the camera coordinate of position 1 to express the motion of the vehicle. The relationship between the two projections of the fixed point � is expressed as: Since point � is on plane Π, the above equation is rewritten as: where d is the Euclidean distance between position 1 and plane Π, n is the unit normal vector of the plane and I is a � � � identity matrix. Define the scale factor as � � � � � � � � ⁄ and the combination of Equations (2), (3) and (5) gives: The calibrated homography is defined as: Equations (6) and (7) indicate the relationship between the homography and the camera motion. The rotation and the translation are obtained by the SVD of the calibrated homography. Further details on the calibrated homography decomposition are available in [26]. Since the visual sensor is fixed to the vehicle, the transformation between the camera coordinate and the body coordinate can be accurately measured before flight. Then the motion of the camera estimated by homography calculation is converted to the motion of the UAV.

Fusion of inertial and visual estimates
In this section, the EKF is employed to establish a Vision/INS navigation system to improve the state estimation of the vehicle in GPS-denied environments. A continuous-discrete formulation is used in this filter. This means that the state estimation is propagated based on continuous nonlinear system dynamics, whereas the measurement is updated at discrete time iterations. The state update is propagated by the output of the IMU at 50Hz. The measurement update is driven each time when the visual computer finishes the motion estimation. A novel delay-based measurement update method is proposed, which can introduce visual measurement into the navigation state filter without any state augmentation.
The world coordinate is defined as a North, East and Down (NED) relative coordinate system with the origin at the initial location of the vehicle. The navigation coordinate is the same as the world coordinate in this paper. Quaternions are adopted to avoid the singularity, which may occur in Euler angles. The bias in the state can provide a continuous estimate of the current drift of IMU to compensate for the time-varying bias effects. Suppose that the installation of the IMU and GPS antenna has been compensated, the inertial navigation process model in continuous state-space form is written as: Here � ��� means the m order identity matrix � ��� is the zero matrix with m rows and n columns, C � � is the direction cosine matrix (DCM) from the body coordinate to the navigation (world) coordinate, � � is the gravity vector in the world coordinate, ���� is the system input from the inertial sensors, � � � and � � � are the outputs of the accelerometers and gyroscopes respectively and �������0� �� represents the process noise of the system. The attitude equation in the form of a quaternion is written as: where:

Error state propagation
The navigation state, as shown in Equation (10), should be linearized and discretized before the state propagation. The error state vector is applied in an indirect extended Kalman filter, which is defined as: where � � ��� is the estimate of the state. The state model is linearized by the first order Taylor expansion and discretized as: where: and �� is the sampling time interval of the IMU. The covariance matrix of the equivalent white noise sequence W � is derived as: where Q � � ��� � . Then the error state-based propagation of the discretized nonlinear extended Kalman filter is derived as:

Visual measurement update
The direct output of the visual process is the rotation matrix � �� and the translation vector � �� of the camera positions between two consecutive frames in the camera coordinates. As a relative measurement, the visual motion estimation is usually introduced into the system filter by the state augmentation. The augmented state of the filter contains the states of the vehicle at both of the two positions where the consecutive frames are taken and calculated. The visual measurement is treated as the motion between these two states of the vehicle. The major defect of this method is the unavoidable extra calculation required by the state augmentation, which has a great effect on the on-board navigation of small UAVs.
A novel measurement update method is proposed in this section. If the time interval of the visual process and the coordinate conversion between the camera and the vehicle are known, the transformation is converted to the motion of the vehicle in the body coordinate. Then, with the help of the DCM from the body coordinate to the navigation coordinate provided by the filter estimate at position 1, the attitude, displacement and velocity can be obtained. Since the velocity measurement requires a delay process, the visual update is divided into two parts: attitude and position update as well as velocity update.

Attitude and position measurement update
The DCM from the body coordinate at position 2 to the navigation coordinate is derived as: where � �� � is the DCM from the body coordinate at position 1 to the navigation coordinate, which can be calculated from the state of the EKF filter at that moment, � � � is the transformation matrix from the camera coordinate to the body coordinate and � � � is the transformation matrix from the body coordinate to the camera coordinate. Then the attitude of the vehicle � in the form of quaternion can be calculated from � �� � .
The displacement of the UAV from position 1 to position 2 in the navigation coordinate is derived as: Suppose the position of the vehicle � � � at position 1 has been estimated, then the position at position 2 is given as: The attitude and position calculated from the visual observation are the measurements for the undelayed filter. The measurement of the filter with a discrete model is written as: where The error state-based attitude measurement update is derived as follows:

Velocity measurement update
Suppose an image is taken and calculated at � ��� and another consecutive image is dealt after m steps of the state propagation at � � . The average velocity in the navigation coordinate during � ��� and � � is derived as: with � � � � � − � ��� . Since the velocity in the filter state is the instantaneous velocity, the above average velocity cannot be introduced into the filter directly. Thus, there is an assumption that during a short time interval in the flight, the velocity cannot fluctuate remarkably. Then, the average velocity during � ��� and � � can be considered as the instantaneous velocity at � ��� � ⁄ , as: Figure 2 illustrates the relationship between the average velocity and the instantaneous velocity. In this example, the real velocity of the vehicle increases gradually. It can be seen that the above assumption is reliable and reasonable. The difference between the average velocity and the instantaneous velocity at � ��� � ⁄ is much smaller than that at � ��� and � � . This assumption might be not so accurate under some extreme circumstances, but it is not the focus of this paper. We suppose that the UAV is in a stable flight and the above assumption is always tenable. Under the above assumption, it is noteworthy that the velocity at � ��� � ⁄ is obtained at � � . The visual velocity measurement of the vehicle suffers a delay of about � � � ⁄ compared with the state propagation. A delay-based approach for this measurement update is proposed and the main idea is that the estimated error state �� � is supposed to be invariant during the second half of a visual process, from � ��� � ⁄ to � � , as �� � � � �� � ��� � ⁄ . The measurement at � � is used to calculate the estimated error state �� � ��� � ⁄ and then �� � ��� � ⁄ is used to correct the state estimate at � � .
The delay-based measurement model for the velocity estimation is written as: where:  After the real flight, the acquired sensor data are analysed offline and the measurement covariance is optimized to get a better performance from the filter.
The initialization of the bias in the inertial sensors is another important element in the filter. In the simulation, the biases are initialized by the following approach. The UAV is put on the ground without any movement. The visual measurement is replaced by information on the changeless position, zero velocity and changeless attitude. Then, after a few seconds, the bias of the inertial sensors will be well estimated. Figure 6 and 7 show the initialized process of the three-axis gyroscopes and the three-axis accelerometers. Both these kinds of bias can be initialized accurately.

The effect of the measurement delay on the filter
Firstly, the effect of the measurement delay on the EKF filter is analysed. In this filter, the attitude estimated by the camera is a real-time measurement, while the velocity measurement is set with a different delay, from 0 to 2 seconds with an interval of 0.2 seconds. The ratio of the standard deviation of the velocity estimate after the stabilization is calculated for the analysis of the delay effect on the filter. The ratio is defined as ���� � ������ ����0� ⁄ , where ������ is the standard deviation with a delay of t seconds and ��0� � �. The simulation runs several times at each delay step and the average is taken as the final result.  Figure 8 shows the delay effect on the velocity estimation. It can be seen that the standard deviations of vx and vy increase linearly with the measurement delay. When the delay reaches 2 seconds, the standard deviation is about four times larger than that estimated by the non-delayed measurement. Our on-board avionics run at about 2Hz for the visual measurement. It can be concluded from this figure that a measurement delay of 0.5 seconds results in an increase in the standard deviation of about 50%. The delay of the visual measurement has a non-negligible effect on the estimate accuracy of the filter. The delay effect on the estimate of the velocity in the down direction is not as evident as others. One reason is that vz does not change as much in the desired trajectory, compared with the velocities in the horizontal directions. Hence, there is no clear difference between the delayed and non-delayed measurements.

The effect of the time interval of the visual process
In the velocity measurement update, the time interval � � between two visual processes is considered as an important factor to the filter. The frequency of the measurement updates is determined by the time interval. For example, if � � = 0.5 seconds, the measurement update runs at 2Hz. Moreover, the time interval affects the assumption in Equation (26). If � � is too long, this assumption might be invalid. So another simulation is carried out to test the effect of the time interval. Two kinds of measurements are simulated: one is the velocity generated under the assumption at Equation (26); another is the "real" instantaneous velocity. The output of the system is defined as ���� � ��� � ��� ��� � ��� ⁄ , where  The effect of the time interval on the velocity estimation is shown in Figure 9. The time interval changes from 0.1 to 1.9 seconds, but the standard deviation does not increase accordingly. The biggest is only 1.24 times the referential, when � � equals 1.7 seconds. The time interval has some effect on the filter by the frequency of the measurement update. However, the velocity estimate is stable in spite of the change in the time interval. This means that the assumption is reasonable in this application and the velocity calculated by our proposed method can be used in the measurement update of the filter.

Comparison with the augmented navigation
The last simulation test is about the comparison of the proposed navigation and the general augmented system. Both of them take the homography-based method, which is described in Section 2, to obtain the visual measurement. In the augmented filter, the state is expanded by including the state of the vehicle when the previous visual measurement is calculated. The augmented EKF is built mainly according to [17]. Matrix partition is applied for the optimization of the augmented filter calculation.  The error comparisons of the attitude and the velocity are shown in Figure 10 and Figure 11, respectively. Both the methods can limit the attitude and velocity errors within �−1 � , 1 � � and �−1m/s, 1m/s�. They have similar performances after the filters are stable. Figure 12 shows the comparison of the position error. The lines labelled "integrated" are the differences between the real position and the position calculated by the estimated velocity integration of the proposed filter. The correction of the proposed filter can be seen by comparing it with the integrated. The error of the position estimate of the proposed method is generally smaller than that of the velocity integration. Some drift is expected in the position estimation by these mapless methods. However, the drift has been remarkably reduced compared to approaches using inertial sensors alone. The comparison of the standard deviations, which is shown in Table 1, also verifies the above conclusion. In general, the proposed delay-based navigation system has almost the same performance as the augmented system in estimate accuracy. The average computing time is also recorded in Table 1. In the simualtion, these methods are  The attitude comparison of the whole flight is shown in Figure 16. The visual measurement does not work during the take-off and landing, in which periods the visual estimation is not accurate enough due to the rapid change of the aerial images. Thus, the visual measurement takes effect from the 16th second to 107th second in this figure.
The GPS signal is cut off when the visual measurement is available. The augmented filter is also included in this comparison. The error here means the difference between the estimate and the output of the GPS/INS navigation. Figure 17 and 18 illustrate the comparisons of velocity and position estimates in the navigation coordinates.  The effectiveness of the proposed Vision/INS method has been verified by this experiment. In the proposed system, the visual measurement is applied to fuse with the inertial sensors for the state estimation of the vehicle when GPS is unavailable. This filter is able to provide stable and reliable estimates of the attitude and velocity with high accuracy, while the position estimate is subject to drift. This is a well-known drawback for any mapless method. In this filter, the drift has been greatly limited by the visual measurement. The average error and standard deviation of the proposed method are shown in Table 2. This paper presented a vision-aided inertial navigation system for small UAVs in GPS-denied environments. A delay-based measurement update method was proposed to introduce the homography-based visual measurement into the indirect EKF without any state augmentation. The effects of the delay measurement and the time interval of the visual process on the filter have been analysed by simulations. The simulation results indicated that the proposed delay-based method can reduce the computational complexity by about 20%, compared with the general augmented filter, with almost the same estimate accuracies. The effectiveness of the proposed augmented proposed method has been verified by an experiment. The navigation system developed in this paper achieves its good performance by comparing with the referential GPS/INS navigation for small UAVs. In summary, the proposed vision-aided inertial navigation method can provide a reliable state estimate for small UAVs in GPSdenied environments.
Future work will be carried out in two aspects. Firstly, other sensors will be added to measure the relative distance more precisely. Secondly, as well as the attitude and the velocity, the position of the vehicle can also be estimated by the proposed navigation but suffers a drift. This drift is mainly due to the accumulation of the error from the velocity estimation. Additional observations are required if the full autonomy of the UAV in GPS-denied environment is considered. In the future, a monocular SLAM system will be developed based on this system to improve the position estimation and the autonomy of small UAVs.

Acknowledgments
This work was supported by the National High-