A direct optic flow-based strategy for inverse flight altitude estimation with monocular vision and IMU measurements

With tiny and limited nervous systems, insects demonstrate a remarkable ability to fly through complex environments. Optic flow has been identified as playing a crucial role in regulating flight conditions and navigation in flies and bees. In robotics, optic flow has been widely studied thanks to its low computational requirements. However, with only monocular visual information, optic flow is inherently devoid of the scale factor required for estimating absolute distance. In this paper, we propose a strategy for estimating the flight altitude of a flying robot with a ventral camera by combining optic flow with measurements from an inertial measurement unit. Instead of using the prevalent feature-based approach for calculating optic flow, we implement a direct method that evaluates the flow information via image gradients. We show that the direct approach notably simplifies the computation steps compared to the feature-based method. When combined with an extended Kalman filter for fusing the inertial measurements, the flight altitude can be estimated in real time. We carried out extensive flight tests in different settings. Over 31 hovering and vertical flights near an altitude of 40 cm, we achieved an RMS error in the altitude estimate of 2.51 cm. Further analysis of factors that affect the quality of the flow and the distance estimate is also provided.


Introduction
In recent years, we have witnessed a dramatic growth of research on and deployment of micro aerial vehicles (MAVs) owing to numerous foreseeable applications. At the decimeter scale, multi-rotor systems have thrived thanks to their mechanical simplicity and high maneuverability. Yet, small flying robots are still severely limited in terms of payload capability and computational power [1][2][3]. There remain several key challenges, associated with the energetic cost of staying airborne, autonomous navigation through unstructured environments, etc., that prevent more pervasive use of these small robots [4,5]. These issues pose considerable complications towards the goal of autonomous operation. Fortunately, for several tasks related to sensing, localization, and navigation, biological systems provide us with further inspiration for tackling the aforementioned challenges.
With poor-resolution compound eyes and only tiny nervous systems, insects demonstrate exceptional aerodynamic maneuvers such as grazing landings and agile collision avoidance. Scientists have discovered that birds and insects frequently use optic flow in short-range navigation [6], flight speed regulation [7], landing [8], as well as obstacle avoidance [9]. Instead of directly measuring distance, the optic flow technique measures image motion, providing the ratio of flight speed to the distance from the surface [10]. Inspired by insects, researchers have pioneered the use of optic flow for robotics applications, aiming to leverage the low computational requirement and to reduce the sensory components, which are crucial considerations for small aerial vehicles such as a 100 mg flapping-wing robot [11] or a 39 g pocket-sized quadrotor [12].
In robotics, optic flow-based algorithms have been proposed and implemented in mobile robots and aerial robots for guidance and navigation [13]. In [14], Fuller and Murray devised an insect-inspired controller for detecting patterns of optic flow for a mobile robot with flight-like dynamics to navigate a corridor. Using the concept of time-to-contact, Izzo and de Croon introduced a strategy for landing on a surface using ventral optic flow based on the expansion of the imaged ground [15]. In [3], a 2.6 g omnidirectional vision sensor enabled a 30 g miniature coaxial helicopter to control flight speed and stabilize the heading using wide-field optic flow. In addition, several optical devices have been developed to facilitate the use of insect-inspired strategies in robots. Examples include a miniature artificial compound eye with a panoramic field of view [16], a robust minimalistic high frame rate optic flow sensor [17], an open hardware camera with a CMOS image sensor for optic flow estimates designed for MAVs [18], and a 33 mg 1D optic flow sensor for a flying microrobot [11].
Recently, researchers have explored the integration of optic flow information with other known quantities [19] or measurements from other sensors to exploit the 'scaling' information, allowing true velocity or distance to be estimated [18,[20][21][22]. In [20], an opto-aeronautic algorithm was developed for the estimation of wind speed and flight height using inputs from optic flow and air velocity sensors, whereas in [18], an ultrasonic sensor was incorporated. In other instances, inertial measurement units (IMUs) are chosen as they often already exist for attitude stabilization. The fusion of measurements from two sensors usually requires a robust and efficient strategy. In [21,23], extended Kalman filters (EKFs) were employed to recover the scale factor. In [22], the authors formulated a nonlinear estimation scheme for the task, allowing the transient response of the observer to be tuned. Based on knowledge of the control inputs, optic flow was used to robustly estimate the distance for landing in [19].
Several researchers have also demonstrated other novel methods that use optic flow to aid distance estimation or landing tasks. By imposing control to maintain a constant time-to-contact and measuring the acceleration, the distance can be estimated with the approach proposed in [24]. Using only the knowledge of the control inputs and monitoring the oscillations of the robot, without any knowledge of the robot's acceleration, an optic flow-based flight controller was shown capable of providing an estimate of the flight altitude in hover as well as during landing [25].
Compared to popular alternative keyframe-based visual-SLAM methods that involve building and maintaining a map of visual features [26][27][28], these optic-flow based approaches are potentially less computationally intensive and more robust against failures as they do not need prolonged continuous feature tracking [21,22].
Thus far, the majority of research combining optic flow and inertial measurements calculates the flow by detecting and tracking features between consecutive image frames using the established Lucas-Kanade (LK) algorithm [9,22,25]. The LK method is suitable for computing optic flow for a sparse feature set [29]. It was suggested that the required feature extraction and matching steps occupy most of the processing time, up to 85% in the case of [21]. There exists, however, an alternative optic flow method developed by Horn et al [30], which directly provides time-to-contact information from image gradients without the need to identify and track image features. This principle has been extended to estimate the time-to-contact (without resolving a scale factor) for motion control of mobile robots in [31].
In this paper, we aim to exploit this direct optic flow method in the context of flight altitude estimation. To achieve that, we consider a flying robot with a downward-looking monocular camera. We outline a theoretical basis to relate the motion of the robot and its altitude to the optic flow from the camera. Detailed analysis is included to identify factors that affect the performance of the computed flow. Then we propose an estimation routine based on a Kalman filter framework that incorporates the measurements from the IMU to provide the scale factor needed to estimate the absolute distance. Static and flight experiments were performed to verify the models and test the estimation method. Further detailed analysis of the results is then discussed. In summary, our key contributions are threefold: (i) the modeling and implementation of the direct optic flow method in the context of a flying robot with monocular vision; that is, we offer a model relating the expected optic flow to the motion of a robot flying over a flat terrain; (ii) analysis of camera settings and flight parameters that have critical consequences for the accuracy of the optic flow data; and (iii) an extended Kalman filter for estimating the inverse of the flight altitude with a known observability condition. While alternative implementations have been previously shown [22,24], our method is simple, highly accurate at low altitude, and robust against vibrations from flight.
The next section begins with a brief introduction to optic flow and descriptions of how the LK method and the direct approach can be applied to deduce the motion (up to a scale factor) of a flying robot with ventral vision. The analysis includes the introduction of three dimensionless quantities that are relevant to the accuracy of the computed flow. Section 3 shows a formulation of the inverse flight altitude and other flow measurements in a state-space representation suitable for an EKF. An analytical expression for the observability condition is provided. In section 4, we perform two sets of experiments to demonstrate the advantage of the direct optic flow approach over the LK method and to verify the importance of the introduced dimensionless quantities. Flight experiments are carried out in section 5 to examine our proposed estimation framework. Finally, conclusions and further discussion are provided in section 6.

Optic flow
Suppose a stationary point of interest P = [X Y Z]^T in the inertial frame is projected onto the location (u, v) on the camera plane as depicted in figure 1. Let I(u, v, t) represent the intensity of an image pixel at this position at time t. Due to the movement of the camera, we expect the image of point P between two subsequent images to move from (u, v) to (u + ∆u, v + ∆v). Under the constant brightness assumption, or optic flow constraint, it is expected that I(u, v, t) = I(u + ∆u, v + ∆v, t + ∆t) [10]. To a first-order approximation [32], it follows that

(∂I/∂u)(du/dt) + (∂I/∂v)(dv/dt) + ∂I/∂t = 0, (1)

where du/dt and dv/dt are components of the optic flow field.

Optic flow and camera motion
To relate the motion of a camera with respect to the inertial frame to the resultant optic flow, we consider a camera with a focal length f located at the point P_c = [X_c Y_c Z_c]^T defined in the inertial frame o-x̂ŷẑ as shown in figure 1. Let the coordinate frame associated with the camera be o_c-x̂_cŷ_cẑ_c. The rotation matrix R = [x̂_c ŷ_c ẑ_c] relates the orientation of the inertial frame and the camera frame. The image plane is perpendicular to the camera axis (ẑ_c) at the distance f (focal length) away from the origin o_c. According to this arrangement, the point P is projected onto the image plane at the position (u, v) given by

u = f (e_1^T R^T (P − P_c))/(e_3^T R^T (P − P_c)), v = f (e_2^T R^T (P − P_c))/(e_3^T R^T (P − P_c)), (2)

where the e_i's are basis vectors (e.g. e_1 = [1 0 0]^T).
If we denote the rotational rate of the camera frame with respect to itself as ω = [ω_x ω_y ω_z]^T, the time derivatives of the camera frame axes become ẋ_c = ω_z ŷ_c − ω_y ẑ_c, ẏ_c = ω_x ẑ_c − ω_z x̂_c, and ż_c = ω_y x̂_c − ω_x ŷ_c. Assuming that point P is stationary with respect to the inertial frame (Ṗ = 0), it can be shown that the time derivatives of the image location (u, v), or optic flow, are

du/dt = −(f x̂_c^T Ṗ_c − u ẑ_c^T Ṗ_c)/(ẑ_c^T (P − P_c)) + (uv/f) ω_x − (f + u²/f) ω_y + v ω_z,
dv/dt = −(f ŷ_c^T Ṗ_c − v ẑ_c^T Ṗ_c)/(ẑ_c^T (P − P_c)) + (f + v²/f) ω_x − (uv/f) ω_y − u ω_z. (3)

From this point, we consider the scenario where the camera is pointing towards the ground such that the camera axis is almost vertical. Defining d as the distance from the origin of the camera frame to the ground plane along the camera axis, it turns out that the projection of point P along the camera axis, or ẑ_c^T (P − P_c), satisfies

ẑ_c^T (P − P_c) = d [1 + (R_31/R_33)(u/f) + (R_32/R_33)(v/f)]^(−1). (4)

Furthermore, in the case that the camera axis is almost vertical, such as a downward-looking camera attached to a hovering quadrotor, we have R_31, R_32 ≪ R_33. Keeping only the first-order terms (see [33] for a complete derivation), equation (4), together with equation (3), simplifies to

du/dt = −f (ϑ_x + ω_y) + u ϑ_z + v ω_z,
dv/dt = −f (ϑ_y − ω_x) + v ϑ_z − u ω_z, (5)

Figure 1. Definitions of the camera frame, the image plane, and the inertial frame.
Bioinspir. Biomim. 13 (2018) 036004

where we have introduced ϑ_x = x̂_c^T Ṗ_c / d, ϑ_y = ŷ_c^T Ṗ_c / d, and ϑ_z = ẑ_c^T Ṗ_c / d. These quantities will from now on be referred to as visual observables [25] (also defined as apparent egovelocity in [20]). They provide the velocity information of the camera frame up to a scale factor (d).
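As a concrete illustration, the simplified first-order flow model above (our reading of equation (5)) can be sketched in a few lines. The function name and the pixel-unit conventions below are our own, not from the original implementation:

```python
import numpy as np

def predicted_flow(u, v, f, theta, omega):
    """Predicted optic flow (du/dt, dv/dt) at image location (u, v) from
    the simplified first-order model: theta = (theta_x, theta_y, theta_z)
    are the visual observables, omega = (omega_x, omega_y, omega_z) the
    angular rates. All image quantities (u, v, f) are in pixels."""
    tx, ty, tz = theta
    wx, wy, wz = omega
    du = -f * (tx + wy) + u * tz + v * wz
    dv = -f * (ty - wx) + v * tz - u * wz
    return du, dv
```

For example, pure descent (ϑ_z > 0, all other terms zero) yields the flow (u ϑ_z, v ϑ_z), a field expanding radially from the image center, as expected for an approaching ground plane.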

Lucas-Kanade optic flow algorithm
Equation (5) relates the optic flow (du/dt, dv/dt) to the motion and attitude of the camera. A number of previous works use feature extraction methods to initially detect features on the image and then employ the Lucas-Kanade tracker (LK) [32,34] to determine the optic flow between consecutive image frames [13,[22][23][24][25]. To estimate the visual observables with the LK method, the flow Ψ_i = [du_i/dt dv_i/dt]^T measured at the ith feature satisfies

Ψ_i = Λ_i b, (6)

where we have defined a 2 × 7 matrix Λ_i of known image coordinates and a vector of unknowns b containing the visual observables and angular rates. With multiple features, we can express the equations as a linear least-squares problem, Ψ = Λb, to solve for b. The approach enables us to efficiently compute, for example, ϑ_x + ω_y, ϑ_y − ω_x, and ϑ_z using the optic flow data from various locations on the image. With the knowledge of the angular velocities from the IMU measurements, ϑ_x and ϑ_y are readily obtained.
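To illustrate this least-squares step, the sketch below solves the reduced model (only the three unknowns ϑ_x + ω_y, ϑ_y − ω_x, and ϑ_z; see section 4) from tracked feature positions and their flows. The function name and data layout are assumptions for illustration:

```python
import numpy as np

def observables_from_features(pts, flows, f):
    """Reduced-model least squares: each feature at (u, v) with measured
    flow (du, dv) contributes the two rows
        du = -f*b1 + u*b3,    dv = -f*b2 + v*b3,
    with unknowns b = (theta_x + omega_y, theta_y - omega_x, theta_z)."""
    rows, rhs = [], []
    for (u, v), (du, dv) in zip(pts, flows):
        rows.append([-f, 0.0, u]); rhs.append(du)
        rows.append([0.0, -f, v]); rhs.append(dv)
    b, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
    return b
```

In practice the (pts, flows) pairs would come from feature tracking (e.g. FAST plus `cv2.calcOpticalFlowPyrLK`, as described in section 4), and the gyro rates ω_y and −ω_x would then be subtracted from b[0] and b[1] to isolate ϑ_x and ϑ_y.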

Direct optic flow method
Instead of relying on feature detection and tracking, it has been shown in [30,31] that we can directly combine the optic flow constraint in equation (1) with the equation describing the camera motion (in the scenario described in figure 1, this corresponds to equation (5)). If we let I_ij denote the pixel intensity I(u_i, v_j, t), the result directly relates the image gradients to the visual observables and the camera's attitude:

Υ_ij = χ_ij b, (7)

where we have defined a scalar Υ_ij = −∂I_ij/∂t, a 1 × 7 matrix χ_ij constructed from the image coordinates and the spatial gradients ∂I_ij/∂u and ∂I_ij/∂v, and a vector of unknowns b. Once again, we have neglected two terms, ϑ_x (R_31/R_33) u and ϑ_y (R_32/R_33) v, to arrive at equation (7). Similar to equation (6), this equation can be directly solved for b using the least-squares method in the form Υ = χb (with Υ = [Υ_11 Υ_12 ...]^T and χ = [χ_11 χ_12 ...]^T) using the values of the I_ij's from consecutive image frames. In contrast to the LK method, this direct method does not require the feature tracking process. Featureless areas in the image produce pixels with values of ∂I/∂t, ∂I/∂u, and ∂I/∂v close to zero. Consequently, these points do not invalidate equation (7). This significantly reduces the computational requirement of the process.
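A minimal sketch of this gradient-based solve, using the reduced three-unknown model and NumPy finite differences in place of the actual pipeline (the gradient scheme and image-centering conventions here are our choices, not the authors'):

```python
import numpy as np

def direct_observables(I1, I2, f):
    """Direct estimate of the reduced unknowns
    b = (theta_x + omega_y, theta_y - omega_x, theta_z)
    from two consecutive frames, via the optic flow constraint combined
    with the reduced flow model:
        -dI/dt = (-f*Iu)*b1 + (-f*Iv)*b2 + (u*Iu + v*Iv)*b3 .
    No feature detection or tracking is required."""
    I1 = I1.astype(float); I2 = I2.astype(float)
    Im = 0.5 * (I1 + I2)            # gradients evaluated at the temporal midpoint
    Iv, Iu = np.gradient(Im)        # axis 0 ~ v (rows), axis 1 ~ u (columns)
    It = I2 - I1                    # frame period taken as the time unit
    h, w = I1.shape
    v, u = np.mgrid[0:h, 0:w]
    u = u - w / 2.0                 # image coordinates about the optical center
    v = v - h / 2.0
    chi = np.column_stack([(-f * Iu).ravel(),
                           (-f * Iv).ravel(),
                           (u * Iu + v * Iv).ravel()])
    b, *_ = np.linalg.lstsq(chi, -It.ravel(), rcond=None)
    return b
```

A pattern translating by one pixel per frame along u corresponds to a uniform flow du/dt = 1, so the solver should return b1 ≈ −1/f with b3 ≈ 0, which provides a quick sanity check.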

Effects of pixelation, frame rate, and feature size on the accuracy of visual observables
In this section, we inspect the effects of camera pixelation, feature size, and frame rate on the accuracy of the optic flow estimate. To understand the relationship between different parameters, we consider a simplified situation where a hypothetical one-dimensional camera (the camera produces a one-dimensional array of pixels instead of the typical two-dimensional arrays) resides in a two-dimensional plane (o −xẑ). This setting can be regarded as a simplified version of the setup in figure 1.

Horizontal motion
Suppose the camera stays at a constant distance d from the ground, but traverses laterally (along the x̂_c-axis) at speed v_x. Furthermore, let the camera x̂_c-axis coincide with the x-axis of the inertial frame and let L be a length that describes the dominant lengthscale of the pattern on the floor. Assuming a uniform ideal lighting condition and a purely sinusoidal pattern on the floor, the brightness of the point P(x) can be written as I(x) = (I_0/2) [sin(2πx/L) + 1]. Without loss of generality, the pixel intensity at point u on the image can be expressed as

I(u, t) = (I_0/2) [sin(2π(x_c(t) + ud/f)/L) + 1], (8)

where f is the camera's focal length and x_c(t) is the lateral position of the camera (ẋ_c = v_x). Notice that the pixel brightness varies over time as the camera moves at speed v_x in the direction parallel to the ground.
In principle, to calculate the visual observable ϑ_x according to equation (7), we obtain ∂I/∂t and ∂I/∂u as

∂I/∂t = (πI_0 v_x/L) cos(2π(x_c + ud/f)/L), ∂I/∂u = (πI_0 d/(fL)) cos(2π(x_c + ud/f)/L), (9)

and ϑ_x is found to be

ϑ_x = (1/f)(∂I/∂t)/(∂I/∂u) = v_x/d, (10)

as anticipated.
In practice, however, the camera sensor has a finite pixel size. If we let w denote the width of each pixel, the image brightness of a pixel located at u is averaged over the distance w such that

Ī(u, t) = (1/w) ∫_{u−w/2}^{u+w/2} I(u′, t) du′ = (I_0/2) [(sin γ_d/γ_d) sin(2π(x_c + ud/f)/L) + 1], (11)

where γ_d = πdw/(fL). Here, we introduce the notation δ[·] to represent the difference between the ideal value and the practical value of a quantity of interest. The difference between I(u, t) and Ī(u, t), or δ[I(u, t)], is the consequence of pixelation. We find that

δ[I(u, t)] = (I_0/2) (1 − sin γ_d/γ_d) sin(2π(x_c + ud/f)/L). (12)

To numerically evaluate the time and spatial derivatives of the pixel value for the calculation of visual observables, we consider the central differences

∂Ī/∂t ≈ (f_p/2) [Ī(u, t + 1/f_p) − Ī(u, t − 1/f_p)], ∂Ī/∂u ≈ (1/(2w)) [Ī(u + w, t) − Ī(u − w, t)], (13)

where f_p is the camera frame rate. After some algebraic manipulation, it can be shown that

δ[∂I/∂t]/(∂I/∂t) = 1 − (sin γ_d/γ_d)(sin γ_x/γ_x), δ[∂I/∂u]/(∂I/∂u) = 1 − (sin γ_d/γ_d)(sin 2γ_d/(2γ_d)), (14)

where γ_x = 2πv_x/(f_p L). From equations (12) and (14), it can be seen that the numerical errors of ∂I/∂t and ∂I/∂u can be quantified using the two dimensionless quantities γ_d = πdw/(fL) and γ_x = 2πv_x/(f_p L). Physically, γ_d represents the ratio of the pixel size to the feature size as expressed in the image frame, whereas γ_x is the ratio of the camera movement to the feature size between consecutive frames. Both quantities in equation (14) approach zero as γ_d, γ_x → 0. Subsequently, we can approximate the error in the calculated visual observable due to the pixelation and numerical differentiation as

δ[ϑ_x]/ϑ_x = 1 − (sin γ_x/γ_x)(2γ_d/sin 2γ_d). (15)

For illustrative purposes, we consider the case where γ_d, γ_x → 0. It follows that equation (15) simplifies to

δ[ϑ_x]/ϑ_x ≈ γ_x²/6 − 2γ_d²/3. (16)

This suggests that both γ_d and γ_x directly affect the estimation of ϑ_x. We anticipate the accuracy of the estimate of ϑ_x to deteriorate as, for example, the feature size becomes too small compared with the altitude, or the camera moves excessively fast relative to the frame rate.
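The scaling above can be checked numerically. The sketch below simulates the one-dimensional camera with finite pixel width and frame rate, recovering ϑ_x from central differences of the pixel-averaged sinusoid; all parameter values are illustrative, not taken from the experiments:

```python
import numpy as np

def pixel_intensity(x_c, u, d, f, L, w, n_sub=32):
    """Sinusoidal ground pattern, averaged over the finite pixel width w."""
    uu = u[None, :] + np.linspace(-w / 2, w / 2, n_sub)[:, None]
    ground_x = x_c + uu * d / f          # ground point imaged by each sub-pixel
    return 0.5 * (np.sin(2 * np.pi * ground_x / L) + 1.0).mean(axis=0)

def estimate_theta_x(d, v_x, f, L, w, f_p, n_pix=240):
    """Estimate theta_x = v_x/d from three frames via central differences."""
    u = (np.arange(n_pix) - n_pix / 2) * w   # 1D pixel array, centered
    T = 1.0 / f_p
    I_prev = pixel_intensity(-v_x * T, u, d, f, L, w)
    I_mid  = pixel_intensity(0.0,      u, d, f, L, w)
    I_next = pixel_intensity(+v_x * T, u, d, f, L, w)
    dI_dt = (I_next - I_prev) / (2 * T)      # temporal central difference
    dI_du = np.gradient(I_mid, u)            # spatial central difference
    # with pure horizontal motion, dI/dt = f * theta_x * dI/du;
    # solve for theta_x by least squares over all pixels
    A = f * dI_du
    return float(A @ dI_dt / (A @ A))
```

With a slow lateral speed (small γ_x) the estimate is within a fraction of a percent of v_x/d, while increasing v_x (hence γ_x) degrades it noticeably, consistent with the trend predicted above.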

Vertical motion
In this case, we assume the camera, with focal length f, is in an identical situation to the previous scenario, but it traverses vertically instead of horizontally. The expression for the image intensity is similar to equation (8):

I(u, t) = (I_0/2) [sin(2πuz(t)/(fL)) + 1], (17)

where z = z(t) is the instantaneous height of the camera, i.e. z(t) = d(t), such that ż(t) = v_z. Partially differentiating equation (17) with respect to t and u yields the solution for the corresponding visual observable according to equation (7) as ϑ_z = −(∂I/∂t)/(u ∂I/∂u). The effects of pixelation and sampling, nevertheless, only allow us to obtain the estimates of ∂I/∂t and u ∂I/∂u through numerical differentiation in a similar fashion to equation (13). Employing the same method, it can be shown that

δ[∂I/∂t]/(∂I/∂t) = 1 − (sin γ_d/γ_d)(sin(γ_z u/u*)/(γ_z u/u*)), (18)

where u* = max u is the maximum value of u, which is limited by the camera sensor's size, and γ_z = 2πv_z u*/(f_p fL). Notice that the expression of δ[∂I/∂t]/(∂I/∂t) in equation (18) is distinct from the one in equation (14) owing to the presence of u. In the limit that γ_d, γ_z → 0, we find that, at the edge of the image (u = u*),

δ[ϑ_z]/ϑ_z ≈ γ_z²/6 − 2γ_d²/3. (19)

The uncertainty of ϑ_z has the same structure as that of ϑ_x in equation (16). However, the value of γ_d = πdw/(fL), which is a function of the altitude, is now time-varying.

Estimation of inverse distance
In contrast to stereo vision, information from monocular vision such as optic flow or visual velocity is inherently devoid of a scale factor. The scale ambiguity can be resolved by fusing the visual observables from optic flow with acceleration measurements from the IMU. To begin, we evaluate the time derivative of d from the definition e_3^T (P_c + dẑ_c) = 0 in section 2.1 as

ḋ = −[e_3^T Ṗ_c + d(ω_y R_31 − ω_x R_32)]/R_33. (20)

Define α = d^(−1) as the approximate inverse distance of the camera from the ground. With the assumption that ẑ and ẑ_c are almost parallel (R_31, R_32 ≪ R_33), equation (20) reduces to ḋ ≈ −ẑ_c^T Ṗ_c = −dϑ_z, and hence α̇ = −ḋα² ≈ αϑ_z. Subsequently, we obtain the time derivatives of the visual observables according to the definitions given in section 2.1:

ϑ̇_x = ϑ_x ϑ_z + ω_z ϑ_y − ω_y ϑ_z + α x̂_c^T P̈_c,
ϑ̇_y = ϑ_y ϑ_z + ω_x ϑ_z − ω_z ϑ_x + α ŷ_c^T P̈_c,
ϑ̇_z = ϑ_z² + ω_y ϑ_x − ω_x ϑ_y + α ẑ_c^T P̈_c. (21)

Notice that the terms with P̈_c represent the acceleration of the camera in the camera frame. In the circumstance where the camera and an IMU are part of the same object, these terms are equivalent to the readings from the accelerometers (a_x, a_y, and a_z) combined with the projected gravity (where the angles, also used for flight attitude control, are obtained from fusion algorithms such as [35]). For instance, we have

x̂_c^T P̈_c = a_x − gR_31, (22)

where a_x is the reading from the x axis of the accelerometer and R_31 is approximately the pitch angle of the robot as internally estimated by the IMU. Treating X = [α ϑ_x ϑ_y ϑ_z]^T as a state variable, its dynamics are inherently nonlinear. Nevertheless, we can still express Ẋ in a form resembling the state-space representation:

Ẋ = F(t)X, F(t) = [[ϑ_z, 0, 0, 0], [a_x − gR_31, ϑ_z, ω_z, −ω_y], [a_y − gR_32, −ω_z, ϑ_z, ω_x], [a_z − gR_33, ω_y, −ω_x, ϑ_z]], (23)

where F(t) functions as a state matrix. The corresponding output vector of the system Y contains the visual observables as estimated by the algorithm in section 2:

Y = CX, C = [0_{3×1} I_3], (24)

where C is an output matrix. Together, equations (23) and (24) describe the nonlinear dynamics of the state variable in a state-space representation form.

Extended Kalman filter
The dynamics of the system described by equations (20)–(22) are nonlinear in nature. While we can calculate the visual observables directly from the captured images and IMU measurements, the distance d, or its inverse α, is not directly measured.
In theory, an observer can be designed to estimate α. In the case of a linear system, a Kalman filter can be employed to recursively find an optimal estimate of the state vector. Nonlinear systems similar to ours typically require nonlinear observers or variants of the Kalman filter such as the extended Kalman filter or the unscented Kalman filter. Expressing the dynamics of the state vector in the standard state-space representation as in (23) and (24) allows us to promptly estimate the state vector using the extended Kalman filter. This approach significantly simplifies the computation as well as the hardware implementation while providing decent performance.
To employ the Kalman framework, we discretize the system in equations (23) and (24) using the forward Euler method [36]. Let T be the sample time and k be a time index such that

X_{k+1} = A_k X_k + µ_k, Y_k = CX_k + ν_k, A_k = I + TF(kT), (25)

where we define A_k as a discrete-time state matrix, and µ_k and ν_k as the 4 × 1 process noise and 3 × 1 observation noise, assumed to be drawn from zero-mean multivariate normal distributions with covariances Q_k and R_k: µ_k ∼ N(0, Q_k) and ν_k ∼ N(0, R_k). Equation (25) represents the system model. The update equation for the state estimate X̂_k is then

X̂_k = X̂_{k|k−1} + K_k (Y_k − CX̂_{k|k−1}), (26)

with K_k being the optimal Kalman gain as described in [37]. The scheme allows us to iteratively obtain the estimate of X_k, which includes the inverse altitude, from the measurements of visual observables in Y_k.
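To make the filter concrete, here is a minimal EKF sketch for a vertical-only simplification of the problem (state reduced to [α, ϑ_z]; the function name, tuning values, and the gravity-compensated accelerometer input are illustrative assumptions, not the flight implementation):

```python
import numpy as np

def ekf_inverse_altitude(dt, meas, accel, x0, P0, Q, R):
    """Minimal EKF for the vertical-only case, state x = [alpha, theta_z].

    meas  : noisy theta_z measurements from optic flow
    accel : gravity-compensated vertical specific force (z_c^T P_ddot_c)
    """
    x, P = x0.copy(), P0.copy()
    H = np.array([[0.0, 1.0]])          # only theta_z is measured
    history = []
    for y, u in zip(meas, accel):
        a, tz = x
        # forward-Euler prediction of the nonlinear dynamics:
        #   alpha_dot = alpha*theta_z,  theta_z_dot = theta_z**2 + alpha*u
        x = x + dt * np.array([a * tz, tz**2 + a * u])
        F = np.array([[tz, a],
                      [u, 2.0 * tz]])   # Jacobian of the dynamics
        A = np.eye(2) + dt * F          # discrete-time state matrix
        P = A @ P @ A.T + Q
        # measurement update with the optic-flow observable
        S = H @ P @ H.T + R
        K = P @ H.T / S
        x = x + (K * (y - x[1])).ravel()
        P = (np.eye(2) - K @ H) @ P
        history.append(x.copy())
    return np.array(history)
```

The altitude estimate is then recovered as d̂ = 1/α̂; with some vertical excitation (nonzero acceleration, per the observability discussion below) the estimate converges from a poor initial guess.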

Observability
To ensure that the proposed filter can reliably estimate α, the discrete-time system described in equation (25) must be observable. Herein, we analyze the observability of the respective system by inspecting the observability Gramian of a simplified scenario.
For a time-varying discrete-time system in the form of equation (25), the k-step observability Gramian (denoted Q_k here, distinct from the process noise covariance) is defined as

Q_k = Σ_{i=1}^{k} Φ_i^T C^T C Φ_i, with Φ_1 = I and Φ_i = A_{i−1} ··· A_1 for i > 1. (27)

The system is observable over k steps if and only if Q_k is full rank, i.e. Q_k is nonsingular [38]. For the case of our system described by equation (25), the 2-step observability Gramian is Q_2 = C^T C + A_1^T C^T C A_1. Assuming minimal angular velocity (ω_x, ω_y, ω_z ≈ 0), the determinant of the observability Gramian is

det Q_2 = T² [1 + (1 + Tϑ_z)²]² [(a_x − gR_31)² + (a_y − gR_32)² + (a_z − gR_33)²]. (28)

It can be seen that, for a finite sample time T, the system is observable as long as

(a_x − gR_31)² + (a_y − gR_32)² + (a_z − gR_33)² ≠ 0. (29)

In fact, each of these terms corresponds to the acceleration of the camera along one of its body axes. The summation is the square of the total acceleration. In other words, the system is observable as long as its net acceleration is not zero. In theory, this means we are unable to obtain the estimate of α if the camera is mounted on a robot in a perfect hovering condition. In practice, slight movement is likely adequate to ensure that the system is observable. This observability condition, however, does not indicate the degree of observability.
One measure of observability is the square root of the condition number of Q_k [39]. This condition number, κ, indicates how strongly a change in the initial condition along one direction affects the output compared with another direction. The task of recovering the state vector from the observed outputs becomes ill-conditioned at an excessively large condition number.
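The observability condition can be probed numerically. The sketch below forms the 2-step Gramian for hovering and accelerating cases (the state matrix follows the near-hover, ω ≈ 0 simplification above; parameter values are illustrative):

```python
import numpy as np

def gramian_2step(accel, theta_z, T):
    """2-step observability Gramian Q2 = C^T C + A1^T C^T C A1 for the
    inverse-altitude system, assuming negligible angular velocity.
    accel = (ax - g*R31, ay - g*R32, az - g*R33)."""
    c1, c2, c3 = accel
    F = np.array([[theta_z, 0.0, 0.0, 0.0],
                  [c1, theta_z, 0.0, 0.0],
                  [c2, 0.0, theta_z, 0.0],
                  [c3, 0.0, 0.0, theta_z]])
    A1 = np.eye(4) + T * F                       # forward-Euler state matrix
    C = np.hstack([np.zeros((3, 1)), np.eye(3)]) # only the observables measured
    return C.T @ C + A1.T @ C.T @ C @ A1
```

In perfect hover (zero net acceleration) the Gramian is singular and α is unobservable; any nonzero net acceleration restores full rank, and κ = sqrt(cond(Q2)) then quantifies the degree of observability.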

Performance evaluation
In this section, we perform a set of experiments to compare the performance of the direct optic flow method to the feature-based method outlined in section 2. In addition, the effects of pixelation, frame rate, and feature size are investigated in terms of the dimensionless parameters (γ_d and γ_x) introduced in section 2.4.

Flow calculation and experimental setup
The setup for the following experiment is illustrated in figure 2 (in the handheld experiment configuration). We use an oCam-5CRO-U camera (Hardkernel) and a single board computer Odroid-XU4 (Hardkernel) for capturing images, processing, and computing the visual observables. The camera is physically attached to an AscTec Hummingbird quadrotor (Ascending Technologies). The onboard computer communicates with the quadrotor via serial communication. This allows us to access the measurements from the IMU on the robot at the rate of ≈150 Hz. Command signals from the ground station for the robot are transmitted to the single board computer via UDP over wifi at 100 Hz.
We use a Python script for managing the communication between the onboard computer and other devices (the camera, robot, and ground station). Another Python routine is implemented to process images for visual observables. For the LK method, the pipeline starts by detecting 50 features in each image using the FAST corner detector [40]. The features are matched to the next image with the Lucas-Kanade optical flow algorithm to directly provide du/dt and dv/dt [34]. The corner detector and LK algorithm were implemented in Python using OpenCV wrappers. The visual observables are then calculated from the 50 points using equation (6) and the method of least squares. To eliminate the angular velocity components present in equation (6), the corresponding measurements from the IMU are subtracted.
The direct flow method involves the spatial and time derivatives of pixel values, which can be susceptible to image noise. Therefore, each image is initially smoothed using a 5 × 5 averaging kernel. The spatial derivatives, ∂I/∂u and ∂I/∂v, are computed using normalized 3 × 3 Sobel kernels in their respective directions. The time derivative is directly obtained by subtracting two consecutive smoothed frames. The visual observables are computed using the method of least squares based on equation (7), and the angular velocities are subtracted out using the IMU measurements. To further reduce the computational requirement, only one in every sixteen pixels (in a grid arrangement) was used in the least-squares regression. This amounts to 4800 pixels out of the total of 76 800 pixels for a 320 × 240 image. These computations were carried out in Python.
Moreover, for both methods, we may opt to apply equations (6) or (7) to estimate all seven unknowns; then only the first three elements of b are used for the EKF, discarding the other elements. Instead of using the full models, we may neglect the last four columns of Λ_i and χ_ij to estimate only a truncated b (a 3 × 1 vector in this case). This approximation is reasonable due to the assumption that the camera axis is almost vertical (R_31, R_32 ≪ R_33) and the angular velocities are relatively small compared with the visual velocities. These reduced models potentially decrease the computational load.
The combined camera-computer-robot system was placed in the 3.0 × 3.0 × 2.5 m arena fitted with six motion capture cameras (OptiTrack Prime 13w). The motion capture system tracks four retroreflective markers placed on the robot to provide real-time position and orientation feedback for the ground computer. This information can be processed to obtain the expected visual observables, acceleration, velocities, and angular velocities of the camera. Thanks to the accuracy (∼0.1 mm) and sample rate (240 Hz) of the motion capture system, we treat these quantities as ground truth measurements for evaluating the performance of the estimated visual observables or IMU measurements.

Comparison between LK and direct methods
First, we aim to compare the visual observables calculated from the traditional LK method and the direct method. We placed a checkerboard pattern with a square size of 5 × 5 cm on the floor. The camera-robot prototype was commanded to take 320 × 240 images at 30 and 60 frames per second (fps) and store the images locally. Simultaneously, IMU measurements from the robot were logged. The communication between the onboard computer and the ground station allows us to synchronize and associate each image frame with the exact pose of the robot as determined by the motion capture system. The robot and the camera were handheld and manually moved around at distances of ≈30 to 100 cm from the ground. Several image sequences were recorded, providing more than 2500 images for each frame rate. The recorded data were then post-processed to compute the visual observables using the LK method (equation (6)) and the direct method (equation (7)), both with full and reduced models. Since the same sets of images were used for both methods, we can make a direct comparison of the results. Figure 3 shows the experimental results of the calculated visual observables from the LK and direct methods using reduced and full models. The computed visual observables are compared with the ground truth values calculated from the camera trajectory provided by the motion capture system to obtain RMS errors for comparison. The results show that at 30 fps, both methods produced visual observables with similar RMS errors. The difference between the full and reduced models is also insignificant. The accuracy of both approaches improves dramatically (the RMS errors are ≈60-70% smaller) when the frame rate is 60 fps. The improvement is slightly more pronounced for the LK method. Overall, there is no significant difference between the direct method and the LK method in terms of RMS errors.
The time plot in figure 3 reveals a similar trend in more detail: compared to the ground truth, the computed ϑ_z's are subject to high-frequency disturbances. With the frame rate increased from 30 to 60 fps (effectively reducing γ_x, γ_y, γ_z by a factor of two), both methods provide smoother ϑ_i's that follow the ground truth more closely with fewer oscillations. Furthermore, we compare the time the Python script used to produce the visual observables by both methods in full and reduced modes (including only the image processing time, i.e. excluding the capture or image acquisition process). Figure 4 presents the distribution of the time taken per image frame. The data reveal that the median computational times per frame (solid horizontal lines) are approximately 10 ms, similar for all configurations. Nevertheless, it can be seen that the reduced direct method consistently used only 9.3-10.0 ms (25th-75th percentile) per frame, whereas the reduced LK method required 1.7-22.3 ms (25th-75th percentile) per frame. The noticeably small spread in the processing time of the direct method is due to the deterministic nature of the algorithm. In contrast, the feature-based method requires more computational time for complex images or when features are lacking. As a result, on average, the LK methods require slightly more computational time than the direct methods (presented as solid dots in figure 4). The time used in the image processing task is crucial as it was shown to occupy as much as 85% of the processing time in the estimation routine in [21]. The consistency in computation time is desirable for a real-time estimation process.

Comparison between different patterns
In the previous performance tests and the subsequent experiments, we used a checkerboard pattern to provide visual texture for optic flow. The advantages of a checkerboard pattern include the presence of clear edges and corners and a precisely defined lengthscale. However, it is possible that a particular pattern may affect the performance of the direct or LK methods differently.
To verify this hypothesis, we carried out further handheld experiments under conditions identical to the tests in section 4.2. In these experiments, however, the checkerboard pattern was replaced with two different patterns: connected rings and leaves, as shown in figure 5. The two new patterns differ from the original checkerboard pattern in the absence of straight edges. The connected-ring pattern is also devoid of sharp corners. Figure 3 shows the calculated visual observables from the LK and direct methods using the reduced models on images taken at 60 fps. The RMS errors of ϑ i 's from the two new patterns are consistently smaller than those of the checkerboard pattern, but the improvement is marginal. The tests here suggest that both methods are reasonably robust to the texture used in indoor environments.

Effects of pixelation, speed, and feature size on the accuracy of visual observables
In section 2.4, we examined how the finite pixel size, camera frame rate, and relative motion affect the estimates of visual velocities under some simplifying assumptions. Three dimensionless quantities, γ d , γ x , and γ z , were found to play a role in the estimates of ϑ x and ϑ z . To validate our analysis for the case of a camera with some horizontal speed, we systematically carried out a static experiment with the details as follows.
A camera (oCam-5CRO-U) and a single-board computer (Odroid XU4) identical to those in the previous experiment were statically mounted as shown in figure 2. A checkerboard pattern was affixed to a linear motorized stage such that the pattern was parallel to the image plane, approximately 50 cm away from the camera. Here, the checkerboard pattern approximately represents the sinusoidal pattern in the analysis in section 2.4. To elaborate, for a checkerboard pattern with 5 × 5 cm squares, the texture repeats itself every 10 cm. The corresponding lengthscale L is, therefore, 10 cm (see figure 5 for an example). The motorized stage allows the checkerboard pattern to travel programmatically along the camera's x c -axis. The motion of the pattern relative to the camera effectively replicates a moving camera pointing towards a fixed ground, as previously assumed.
The camera was commanded to record images at 60 fps for 200 s. In the meantime, the stage moved the pattern back and forth at a constant speed of approximately 3.8 cm · s −1 as measured by the motion capture system. This generates a constant ϑ x . The experiment was repeated with checkerboard patterns of different sizes, including 1 × 1, 2 × 2, 3 × 3, and 5 × 5 cm, to vary the values of γ d without altering d. The stored images were then downsampled to simulate camera frame rates as low as 1 fps, resulting in different values of γ x . We then calculated the visual observable ϑ x for the different cases using the reduced direct method. With the ground truth value of ϑ x provided by the motion capture system, we obtained the RMS errors of ϑ x under different conditions. The errors are shown in figure 6(a) in terms of γ x and γ d , along with the bounds theoretically predicted by the formula from section 2.4.1.
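The dimensionless quantities can be evaluated directly from the experimental parameters. The sketch below assumes the forms γ i = 2πv i /(fL) and γ d = 2πd/(Lf px ), with frame rate f, texture lengthscale L, and focal length f px expressed in pixels; these forms and the value f px ≈ 628 px are our reconstruction, chosen to be consistent with the γ values quoted in this section and in the flight experiments, not a definitive statement of the definitions in section 2.4.

```python
import math

def gamma_motion(v, frame_rate, L):
    """Motion parameter: texture phase traversed per frame.
    v: relative speed (m/s), frame_rate (fps), L: texture lengthscale (m)."""
    return 2.0 * math.pi * v / (frame_rate * L)

def gamma_pixel(d, L, f_px=628.0):
    """Pixelation parameter: texture phase spanned per pixel.
    d: distance to the pattern (m), L: lengthscale (m),
    f_px: focal length in pixels (assumed value for this camera)."""
    return 2.0 * math.pi * d / (L * f_px)
```

With these forms, the four checkerboard sizes at d = 50 cm give γ d ≈ 0.25, 0.12, 0.08, and 0.05 (the lengthscale L being twice the square size), matching the values quoted in this section.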
The trend seen in figure 6(a) generally agrees with the predictions that γ x and γ d have important roles in the accuracy of the visual observable estimates. That is, the errors grow as γ x increases. For the same value of γ x , γ d = 0.25 produces greater errors than γ d = 0.12. There, nevertheless, exist two anomalies. First, we observe relatively large errors in ϑ x at small γ x (∼0.3 or smaller). Second, at very small γ d (0.05 and 0.08), the errors are unexpectedly substantial, exceeding the predicted bounds.
To further understand the cause of the relatively large errors at low γ x , we reproduce figure 6(a) to show the RMS errors as a function of the image velocity in units of pixels per frame (this quantity corresponds to v x f/(wdf p )). Apart from those with γ d = 0.25, which have relatively large RMS errors owing to the large γ d , the other sets of data suggest that the RMS errors in ϑ x are minimized when the motion between image frames is just below 5 pixels. In other words, the errors grow as the motion drops below 5 pixels per frame. We speculate that the 5 × 5 averaging kernel used to smooth the images in the processing step prevents the detection of minuscule movements between two consecutive image frames, causing the rise in the RMS errors of ϑ x .
Other factors may also contribute to the observed anomalies. For example, we used checkerboard patterns for the experiments to approximate the sinusoidal functions. The implementation of the algorithm to calculate visual velocities also deviates slightly from the theoretical method as the average filter and Sobel operators were used to deal with noisy images. The large errors from exceptionally small γ d = 0.05 (from a 5 × 5 cm pattern) could be a result of relatively sparse data points as the majority of image pixels are black or white with no spatial or temporal variation.

Flight experiments
To verify that the proposed inverse distance estimation framework can reliably estimate the flight altitude despite the noise from the calculation of optic flow, the evaluation of the time derivatives of the visual observables, and the use of accelerometer measurements, in this section we perform flight experiments with a quadcopter in the indoor flight arena to investigate the performance of the proposed strategy under various flight conditions.

Flight tests
To perform flight experiments, we set up the arena and prepared the robot as in the experiment in section 4.1, depicted in figure 2. Here, the onboard computer processes the captured images for visual observables in real time at 60 fps using the reduced direct flow method implemented in a Python script. The arena ground was fitted with a 5 × 5 cm checkerboard pattern. Based on the characterization result in figure 6, we aimed to keep γ x below 0.5 in flight to minimize the errors of the computed visual observables. For a hypothetical horizontal flight speed of 0.5 m · s −1 and a 5 × 5 cm checkerboard pattern, γ x is expected to be ≈0.5. In contrast, if a 1 × 1 cm pattern were chosen, γ x would be excessively large at ≈2.5.
The image processing routine is identical to that in the handheld experiment in section 4.1. The single-board computer is also responsible for retrieving IMU measurements and passing on the position feedback that keeps the robot at the commanded position. The IMU measurements and computed visual observables are transmitted to the ground station via UDP. We found that the visual observable data consistently lagged behind the IMU measurements by ≈84 ms. This number represents the onboard image processing time; the lag can be compensated for by delaying the IMU measurements by the same amount. The ground controller executes the estimation algorithm in real time using the xPC target environment (MathWorks Matlab) at a rate of 500 Hz. The computation involved in the inverse distance estimation is relatively inexpensive compared to the calculation of the visual observables. We opted to implement this part of the algorithm on the ground station for ease of implementation and debugging. In flight conditions, the single-board computer generally reports a CPU usage below 10% on one of the eight available cores.
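A simple way to realize the latency compensation is a fixed-length delay line on the IMU stream. The sketch below assumes the 500 Hz estimator rate stated above, at which the ≈84 ms visual latency corresponds to 42 samples; the class name and interface are ours for illustration.

```python
from collections import deque

class DelayLine:
    """Delays a sample stream by a fixed number of ticks so that IMU
    samples line up with the (later-arriving) visual observables."""
    def __init__(self, delay_s=0.084, rate_hz=500.0):
        n = round(delay_s * rate_hz)       # 0.084 s * 500 Hz = 42 samples
        self.buf = deque([None] * n, maxlen=n + 1)

    def push(self, sample):
        """Insert the newest IMU sample; return the one from delay_s ago
        (None while the buffer is still filling)."""
        self.buf.append(sample)
        return self.buf.popleft()
```

Each estimator tick then fuses the delayed IMU sample with the most recent visual observable, so both refer to approximately the same instant.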
In the implementation of the Kalman estimator, the covariance matrices associated with the process noise and the measurement noise (Q k and R k ) can in theory be estimated from the current system state and empirical data from the IMU. For simplicity, we treated them as constants and tuned them experimentally. The gyroscopic readings were used for the angular velocities, the accelerometer outputs were taken as a x , a y , and a z needed for the matrix F in equation (23), and the pitch and roll angles from the IMU fusion used for flight attitude stabilization also provided R 31 and R 32 for the matrix F. Moreover, a 3rd-order Butterworth low-pass filter with a cutoff frequency of 6 Hz was applied to all IMU measurements and computed visual observables to minimize measurement noise and vibration effects.
To investigate the performance of the proposed estimation scheme, we carried out flight tests with three flying patterns: hovering, vertical, and horizontal flights. First, the robot was commanded to hover at 40 cm altitude (γ d ≈ 0.04). Second, the robot was instructed to follow a sinusoidal altitude setpoint, centered at 40 cm with an amplitude of 10 cm and a period ranging from 5 to 10 s (γ z < 0.01). Third, we let the robot fly in a horizontal circular pattern with a radius of 30 cm at a constant altitude of 40 cm. The periods were set to values between 8 and 20 s (γ x , γ y < 0.25).
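The prefiltering stage can be reproduced with a standard filter design, e.g. with SciPy. The sketch uses the causal form suitable for online filtering (the paper does not state whether a zero-phase variant was used offline):

```python
from scipy.signal import butter, lfilter

# 3rd-order low-pass Butterworth: 6 Hz cutoff at the 500 Hz estimator rate
b, a = butter(3, 6.0, btype="low", fs=500.0)

def prefilter(signal):
    """Causal low-pass filtering of an IMU or visual-observable channel,
    attenuating vibration content well above the 6 Hz cutoff."""
    return lfilter(b, a, signal)
```

At 6 Hz the filter passes the flight dynamics of interest (the commanded trajectories have periods of 5 s and longer) while strongly attenuating motor-vibration frequencies.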
In total, we conducted 8 hovering flights, 23 vertical flights, and 23 horizontal flights. For each flight, we extracted the estimation data from the middle portion of the flight, excluding the take-off and landing periods. Each flight provides 30-45 s of valid data for examination. Figure 7 shows the altitude estimation results (from inversion of the estimated α's) and relevant quantities from three flights selected to represent the different flight modes (hovering, vertical, and horizontal). The RMS errors in α and altitude (d) of these flights are 0.12, 0.08, and 0.21 m −1 and 1.91, 1.48, and 3.86 cm, respectively. According to the plots, the estimate of α from the horizontal flight is visibly poorer than the others. In fact, the same trend is observed across several flights.

Estimation results
In terms of visual observables, the hovering flight produced relatively small visual observables in all directions. The vertical flight resulted in a noticeably greater ϑ z , corresponding to a vertical flight speed of up to ≈12 cm · s −1 . On the other hand, the horizontal flight reached a maximum speed of ≈40 cm · s −1 , causing ϑ x and ϑ y to peak periodically at ≈±1 s −1 . When compared to the ground truth calculated from the motion capture measurements, however, there exists no significant correlation between the RMS errors of the visual velocities and their magnitudes. Figure 7 further suggests that, among these flight conditions, the robot did not deviate considerably from the upright orientation, as the pitch and roll angles (R 31 /R 33 and R 32 /R 33 ) remained bounded by 0.12 rad, or 7 • . To understand what contributes to the relatively large estimation errors seen in horizontal flights, we analyzed the results further.

Analysis of the results
In fact, figure 9(a) suggests that the magnitudes of ϑ x and ϑ y might play an important role in the quality of the inverse altitude estimates. The plots between the RMS errors in α and the RMS of other quantities, including pitch and roll angles, acceleration, and the condition number, do not show any visible trend. To quantify the correlation between the RMS errors of the estimate and these parameters, we calculated the correlation coefficients between the pairs, ρ (·, ·). The magnitudes of the correlation coefficients are presented in figure 9(b), with 0 indicating no correlation and 1 indicating total correlation.
One may expect some relationship between the estimation errors and the pitch/roll angles owing to the assumption R 31 , R 32 ≪ R 33 used in the derivation of the estimation method. In addition, the absence of acceleration (P̈ c ) may violate the observability condition according to equation (26). Similarly, the square root of the condition number, κ 1/2 , could play a similar role, as a large condition number leads to reduced observability. However, figure 9(b) shows that only ϑ x , ϑ y and ϑ x R 31 /R 33 , ϑ y R 32 /R 33 have a significant correlation with the estimation errors, with coefficients of 0.88 and 0.81.
A possible explanation for the observed phenomena is that, for hovering flights or flights with low ground speeds, the variations in the pitch/roll angles and acceleration are insignificant. It can be seen in figure 9(a) that the RMS values of R 31 /R 33 , R 32 /R 33 , and P̈ c of horizontal flights cannot be distinguished from those of hovering or vertical flights. The same observation applies to the condition number. This means that all flights are similar in attitude and degree of observability.
The likely explanation for why the magnitudes of ϑ x and ϑ y have considerable effects on the quality of the estimates lies in the approximation made to obtain equation (7). Therein, we assumed ϑ z − ϑ x R 31 /R 33 ≈ ϑ z and ϑ z − ϑ y R 32 /R 33 ≈ ϑ z based on the assumption on attitude. This approximation becomes less accurate when ϑ x , ϑ y ≫ ϑ z , which is expected for horizontal flight with little variation in altitude. A similar approximation was also made to obtain equation (21) from equation (20). Therefore, for horizontal flights, we anticipate that the algorithm would perform better without assuming ϑ z − ϑ x R 31 /R 33 ≈ ϑ z and ϑ z − ϑ y R 32 /R 33 ≈ ϑ z . This would, nonetheless, inevitably increase the complexity of the estimation routine.
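The size of the neglected term can be checked with the numbers reported above: using the peak horizontal observable from the horizontal flights (ϑ x ≈ 1 s −1 ), the maximum observed tilt ratio (R 31 /R 33 ≈ 0.12), and a vertical observable on the order of 0.3 s −1 , the dropped term is an appreciable fraction of ϑ z .

```python
# peak values taken from the horizontal-flight results reported above
theta_x = 1.0         # s^-1, peak horizontal visual observable
tilt = 0.12           # R31/R33, maximum observed tilt ratio (0.12 rad, ~7 deg)
theta_z = 0.3         # s^-1, typical vertical visual observable in flight

neglected = theta_x * tilt            # the term dropped by the approximation
relative_error = neglected / theta_z  # fraction of theta_z it represents
print(relative_error)                 # 0.4, i.e. up to ~40% of theta_z
```

This back-of-the-envelope check is consistent with the horizontal flights showing visibly poorer estimates of α than the hovering and vertical flights.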

Conclusion and discussion
In this paper, we have tackled the problem of online altitude estimation for flying robots with monocular vision and an IMU using an optic flow-based approach. We experimentally demonstrated that the direct method provides the visual observables with accuracy and computational speed comparable to those of the traditional feature-based methods in our specific test conditions. The quality of the computed visual observables was also shown to depend on various factors related to the camera and flight configurations. We introduced three dimensionless parameters to serve as guidelines for appropriately determining the camera settings under various flight conditions.
With the proposed EKF scheme, we successfully estimated the flight altitude with the average RMS error of 2.51 cm considering 31 hovering and vertical flights at ≈40 cm altitude. The quality of the estimate was found to deteriorate upon an increase in the horizontal components of the visual observables. This is likely owing to the approximations made in the calculation of the visual observables and the estimation scheme.
Our estimation results compare favorably with other prominent works. Among the 31 flights with an average RMS error of 2.51 cm, our best three flights have RMS errors of 1.42, 1.43, and 1.48 cm. We believe one factor contributing to the relatively small errors is the computational efficiency of the direct optic flow approach, which enables us to compute the visual observables at 60 fps and thereby significantly improves the accuracy of the estimation. Thus far, however, our work still critically relies on the assumption of flat ground with no camera occlusion. In contrast, in [22], a random sample consensus (RANSAC) algorithm was employed to reject outliers in the optic flow measurements. That strategy improved the robustness of the estimation in more realistic, cluttered environments.

Effects of vibration on altitude estimation
Unlike applications on mobile robots or controlled experiments [24,31], in the context of flying robots, one major challenge in accurately estimating the inverse distance from the combination of IMU measurements and images is the vibration present in flight. As evidence of this, it was reported in [21] that the RMS errors in velocity estimates in flight were higher than those from a handheld experiment, citing vibrations and motion blur as the cause of the inaccuracies. Similarly, the results in [22] showed that the RMS errors of the estimated velocity from flight are approximately one order of magnitude larger than the values obtained from simulated data or a handheld experiment.
To further inspect this claim, we performed additional handheld experiments to imitate the vertical flights in section 5. In this situation, the robot was manually moved vertically with an amplitude of ∼10 cm at an altitude of ∼40 cm, similar to the previous vertical flights. The camera captured a series of images, the visual observables were computed, and the inverse altitude was estimated using the extended Kalman filter in a similar fashion. Five simulated flights were recorded. The estimation results and some parameters are presented together with the previous vertical flight data in figure 10. The figure also verifies that the handheld experiments were performed at a mean altitude of ∼40 cm, comparable to the previous vertical flights.
We found that the errors in the altitude estimates from the handheld experiments are marginally larger than those from the vertical flights. To explain the results, we note that, when compared to the ground truth, the quality of the computed visual observables (ϑ z ) from the handheld experiments did not improve over that of the vertical flights (see figure 10). This suggests that the mechanical vibration did not significantly affect the quality of the images. On the other hand, the vertical component of the acceleration from the IMU (a z − R 33 g in equation (23)) was found to be more accurate in the handheld experiments than in flights when compared to the ground truth. This factor alone would be expected to degrade the altitude estimates in flight.
The oscillation from flight, nevertheless, also brings about greater overall acceleration, which is required for the observability of the inverse altitude as described by equation (27). As evidence of this, we compare the condition number of the observability Gramian (Q k ) in the form of log 10 κ 1/2 as shown in figure 10. It can be seen that the condition numbers of the handheld experiments are consistently and noticeably higher than those from actual flights, indicating that the handheld experiments have relatively poor observability. This leads us to believe that the vibration from flight adversely affects the IMU measurements but simultaneously improves the observability of the estimation scheme.
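The role of the condition number can be reproduced generically. Below is a sketch of a finite-window observability Gramian for a discretized pair (A, C); the window length and the example matrices are illustrative, not the paper's exact time-varying Q k .

```python
import numpy as np

def gramian_cond(A, C, n=50):
    """Condition number of the n-step discrete observability Gramian
    G = sum_{i=0}^{n-1} (A^i)^T C^T C A^i. A large (or infinite) value
    indicates weakly observable directions in the state space."""
    G = np.zeros((A.shape[0], A.shape[0]))
    Ai = np.eye(A.shape[0])
    for _ in range(n):
        G += Ai.T @ (C.T @ C) @ Ai
        Ai = A @ Ai
    return np.linalg.cond(G)
```

With A = I (no excitation) and a measurement of only the first state, the Gramian is singular and κ diverges; coupling the states through A makes κ finite. This mirrors how the acceleration induced by flight oscillation excites the coupling needed to observe the inverse altitude.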
While we experimentally found that the flight vibration can be beneficial to the estimates, there are several experimental conditions that may affect the final outcomes. For instance, the degree of oscillations may vary between robots and flight controllers used. Therefore, we believe that our findings here should be interpreted with care.

Altitude estimation during landing maneuvers
Using a strategy borrowed from biology, flying robots have demonstrated an ability to achieve smooth landings by holding constant the time to contact with the surface [15,41], similar to how honey bees control their speed by holding constant the rate of optic flow [31]. The technique enables insects and robots to automatically reduce their flight speed as they approach contact without knowledge of the actual distance to the surface.
With an estimate of the distance, aerial robots can potentially deploy a landing gear at a suitable moment or approach the surface at an arbitrary speed [42,43]. That is, the landing task is no longer constrained to a constant rate of optic flow. To illustrate that our proposed strategy can reliably provide the distance estimate as the robot approaches the ground, we plot the landing trajectories from five representative flights in figure 11. The plot shows that the EKF accurately predicts the altitude at various landing speeds, with RMS errors on the order of 2 cm. Upon approaching the surface at approximately constant speed, the respective visual observable ϑ z rises radically and reaches a value of ≈4 s −1 in the final moments (significantly higher than the 0.3 s −1 observed in flights, as shown in figure 7). The values of ϑ z calculated from the direct method were highly accurate until the robot was ≈10 cm from the ground, at which point the field of view likely became too small for the pattern's features to be captured by the camera.
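The sharp rise in ϑ z follows directly from the inverse-distance formulation: assuming ϑ z = v z /d, a constant-speed descent makes ϑ z diverge as the ground approaches. The landing speed of 0.2 m · s −1 below is our illustrative value, chosen because it yields ϑ z = 4 s −1 at d = 5 cm.

```python
def theta_z(v_z, d):
    """Vertical visual observable (s^-1), assuming theta_z = v_z / d,
    for descent speed v_z (m/s) at altitude d (m)."""
    return v_z / d

v = 0.2  # m/s, illustrative constant landing speed
for d in (0.40, 0.20, 0.10, 0.05):
    print(d, theta_z(v, d))   # grows from 0.5 s^-1 toward 4 s^-1 near the ground
```

This also illustrates why the final centimeters are difficult: the same vertical speed maps to ever larger image motion per frame as d shrinks.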

Estimation errors at higher flight altitude
In the flight experiments, the robot was commanded to perform various flights near the altitude of 40 cm, with resultant RMS estimation errors of 2.51 cm. For the landing flights with flight altitudes below 40 cm, we found that the RMS errors were generally below 2.50 cm. Because of the nonlinear dependence of the visual observables on the flight altitude d (for example, at very large d, the visual observables converge to zero for the same flight speed), it is reasonable to assume that the RMS errors of α or d depend on the flight altitude, with the estimation likely more accurate at lower altitudes.
To demonstrate this, we performed ten more hovering flights at higher setpoint altitudes ranging from ∼70 to 125 cm. The RMS errors in α and d are plotted against the average flight altitude, with the previous hovering flight data included, in figure 12. The results suggest that as d 0 increases, the RMS error of α declines, but the estimation error of the flight altitude still increases. When the robot hovers at ∼1.2 m, the altitude estimation error is ∼10 cm. This is comparable to the results in [22], where a robust feature-based ego-motion estimation algorithm was shown to provide an RMS error of 9.23 cm for a flight at 1 m altitude. Our results indicate the effectiveness of the proposed implementation of the direct optic flow method and the EKF scheme, which are relatively simple in comparison.
To gain further insight into the relationship between the expected estimation errors and the flight altitude, we consider the steady-state condition of the Kalman estimator in a simplified situation. Assume that the robot is only moving vertically. The vertical trajectory oscillates about the nominal altitude d 0 with some speed ẑ c T Ṗ c = v̄ z and acceleration a z − R 33 g = ā z . If we treat v̄ z , ā z , and d 0 as approximately constant, the state matrix originally defined in equation (23) becomes constant, as given in equation (28). With this constant-trajectory assumption, we assume further that the process noise and measurement noise are independent of d 0 and are also constant. In such a situation, the extended Kalman filter predicts a steady-state covariance matrix Σ that can be evaluated from the Riccati equation (29) [37], where A = I + FT according to the discretization previously used in equation (25). In other words, if the robot were to follow the same trajectory at different altitudes, the only difference would be d 0 , as the actual velocity and acceleration are independent of altitude. Evaluating the steady-state covariance matrix of the state estimate as d 0 varies therefore allows us to understand the effect of d 0 on the anticipated altitude estimation error.
Using v̄ z = 0.2 m · s −1 and ā z = 0.2 m · s −2 as nominal conditions, we solve the Riccati equation for Σ. The expected variance of α is given by the first element of Σ. If we let σ α 2 denote this value, it is conceivable that σ α relates to the RMS error of α from the flight experiments. We find that σ α also reduces as d 0 increases, as shown by a dashed line in figure 12. In terms of the actual flight altitude, the fact that α = d −1 translates to σ α ∼ σ d /d 0 2 . Therefore, we can also predict the uncertainty in d from σ α . This prediction, σ d , is also shown in figure 12 as a dashed line. The trend agrees with our experimental results that the estimation error of the flight altitude grows as the altitude increases.
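The steady-state analysis can be reproduced numerically with a simplified two-state sketch (our own reduction for illustration, not the paper's full state matrix): x = [α, v z ] with α̇ = α²v z linearized about (1/d 0 , v̄ z ), measurement ϑ z = αv z , and illustrative noise levels. The steady-state covariance follows from the discrete algebraic Riccati equation via duality.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def steady_state_sigma(d0, v_z=0.2, T=0.002, q=1e-6, r=1e-4):
    """Steady-state Kalman uncertainty of alpha = 1/d for a simplified
    two-state model x = [alpha, v_z] linearized about altitude d0.
    T = 0.002 s matches the 500 Hz estimator rate; q, r are illustrative."""
    a0 = 1.0 / d0
    # alpha_dot = alpha**2 * v_z  ->  Jacobian about (a0, v_z)
    F = np.array([[2.0 * a0 * v_z, a0 ** 2],
                  [0.0,            0.0    ]])
    A = np.eye(2) + F * T                  # discretization A = I + F*T
    C = np.array([[v_z, a0]])              # measurement theta_z = alpha * v_z
    Q, R = q * np.eye(2), r * np.eye(1)
    # Kalman-filter Riccati equation (dual form of the control DARE)
    Sigma = solve_discrete_are(A.T, C.T, Q, R)
    sigma_alpha = float(np.sqrt(Sigma[0, 0]))
    return sigma_alpha, sigma_alpha * d0 ** 2   # sigma_d from alpha = 1/d
```

In this sketch σ α decreases mildly with d 0 while σ d = σ α d 0 2 grows, reproducing the qualitative trend in figure 12; scaling Q and R by a common factor scales Σ by the same factor, consistent with the insensitivity noted below.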
To further interpret this simplified analysis, it should be mentioned that the role of d 0 in the value of σ α stems from the diagonal element of F in equation (28). The characteristics of the plots in figure 12(b) are insensitive to the simulation parameters. For instance, if the covariance matrices Q and R are scaled up by a factor of γ, equation (29) shows that Σ is scaled up by the same factor. That is, the shapes of the lines in figure 12 remain intact.
It is also important to note that the trend predicted in figure 12 is a result of the particular implementation of the extended Kalman filter we propose, which differs from previous works [19,22]. The proposed scheme can be used whether the visual observables are obtained via the LK method or the direct method, and the estimation errors are likely to depend on the altitude with a similar characteristic. Other implementations, however, will likely yield a different relationship between the estimation error and the altitude from our results in figure 12. The quality of the altitude estimation depends not only on the accuracy of the visual observables but also on the estimation method.