Passive Initialization Method Based on Motion Characteristics for Monocular SLAM

Visual SLAM techniques have proven to be effective methods for robustly estimating position and attitude in the field of robotics. However, current monocular SLAM algorithms cannot guarantee timely system startup, because initialization takes a long time and has a low success rate. This paper introduces a rectilinear platform motion hypothesis and thereby converts the estimation problem into a verification problem to achieve fast monocular SLAM initialization. The proposed method is tested in simulation on a fixed-wing UAV. Tests show that the proposed method initializes visual SLAM faster and that its advantages are more pronounced on systems with sparse image features.


Introduction
In recent years, with the application of Graph-based Optimization [1] and Bundle Adjustment (BA) [2] in Visual Simultaneous Localization and Mapping (vSLAM) and the emergence of excellent open-source libraries [3,4], vSLAM systems are increasingly used in autonomous motion platforms. Exceptional open-source vSLAM systems also help popularize vSLAM techniques. Presently, vSLAM systems have been applied to UAV autonomous navigation [5,6] and obstacle avoidance [7,8] problems in GPS-denied environments. However, current vSLAM systems usually take a long time to initialize [9], posing difficulties for real-world engineering problems.
Generally speaking, feature-based vSLAM techniques rely on epipolar geometry constraints or homography constraints [17]; they obtain the R and t corresponding to the minimum Reprojection Error with RANSAC or Least Squares methods. Direct methods, in contrast, are usually initialized through randomized approaches: exact point-to-point mappings cannot be obtained directly, so R and t cannot be computed.
Most classical monocular visual SLAM methods do not consider the motion characteristics of the platform during the initialization phase. However, the basic equations of a SLAM system consist of both equations of motion and observation equations, and most current research focuses on the observation equations. This paper holds that a reasonable motion hypothesis can effectively improve the robustness of observations, especially in the initialization phase.
PTAM's initialization works with the hypothesis that captured images are mainly composed of flat surfaces; the initial camera motion R and t are then computed from the homography matrix (H) accordingly. ORB-SLAM algorithms are effective extensions of PTAM that compute the essential matrix (E) and H simultaneously; the final initialization method is then selected by comparing the respective scores. LSD-SLAM and DSO, as direct methods, cannot compute R and t through reprojection. Therefore, they initialize through random variables. When the camera motion covers enough distance, initialization is completed by locking onto specific depths. SVO's initialization is similar to that of PTAM, except that SVO adds the assumption that the motion direction is perpendicular to the photographed plane, as SVO was originally designed for rotor UAV use. These five classic methods are renowned in the field of monocular SLAM/VO, each possessing unique strengths. They have been successfully applied in their respective environments with satisfactory performance. The initialization workflows of the feature-based algorithms are summarized in Figure 1.
Theoretically, the initialization process depicted herein can initialize any movement except pure rotation. Firstly, corresponding points from separate frames are identified through feature-based or optical flow methods. These point mappings are then used, along with monocular-camera imaging characteristics, to compute H or E within the epipolar geometry framework. H or E is then decomposed to produce R, t, and finally the initial map points, under the additional assumption that the mapped points are static. This concludes the traditional initialization process, where the frames can be adjacent or nonadjacent, and the computation uses RANSAC, the Eight-Point Method, or Bundle Adjustment. Subsequent processes use the initial R, t, and map points (3D) for the chain processes that maintain the monocular SLAM system. Due to the scale uncertainty of the monocular visual SLAM system, no initialization method can produce real-world distances for the map points; dimensionless depths are provided instead. The initial map points (3D) play an important role for subsequent frames. The indirect 3D-to-2D correspondences between pixel points and map points (3D), together with the geometric solvers DLT/P3P [18]/EPnP [19]/UPnP [20] or the optimized BA, are used to determine the positions and orientations of subsequent frames relative to the preceding key frame. Frame I is an initialization frame as well as a key frame. As the camera moves on, the number of indirect 3D-to-2D correspondences that can be established gradually decreases, which can cause the aforementioned chain processes to fail. It is then necessary to insert new key frames to replenish the map points (3D) needed for the chain processes. Complete SLAM algorithms also involve another important process termed loop closure, which is not discussed further, as it is not closely related to the present paper.
It can be seen from the above that the E or H obtained from point correspondences is the initial enabler of the entire monocular SLAM system. However, when point correspondences are insufficient or too inaccurate, the obtained E or H may contain large errors, affecting the accuracy of the map points (3D) and thus compromising the subsequent processes. Current methods presuppose that the errors of R and t are limited. Actual implementations therefore impose strict checks on the computed E or H, leading to low success rates in many cases. When monocular SLAM systems are applied to fixed-wing unmanned aerial vehicles (UAVs), the initialization success rates are even more worrying [5,6].
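The link between a camera motion (R, t) and the essential matrix can be illustrated with a small sketch. It assumes the convention X_T = R X_I + t for a point expressed in the two camera frames and normalized image coordinates; the function names and the numeric example are hypothetical and not taken from the paper. Under that convention E = [t]x R, and a correct motion gives a zero epipolar residual for every true correspondence.

```python
import math

def skew(t):
    """Return the 3x3 cross-product matrix [t]x."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(a, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def essential_from_motion(R, t):
    """E = [t]x R, assuming the convention X_T = R X_I + t."""
    return matmul3(skew(t), R)

def epipolar_residual(E, x1, x2):
    """x2^T E x1 for normalized homogeneous image points (u, v, 1)."""
    Ex1 = matvec3(E, x1)
    return sum(x2[i] * Ex1[i] for i in range(3))

# Hypothetical example: slight rotation about the camera y-axis, mostly forward motion.
angle = 0.05
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [0.1, 0.0, 1.0]
X1 = [0.5, -0.2, 5.0]                                  # point in the first camera frame
X2 = [a + b for a, b in zip(matvec3(R, X1), t)]        # same point in the second frame
x1 = [X1[0] / X1[2], X1[1] / X1[2], 1.0]
x2 = [X2[0] / X2[2], X2[1] / X2[2], 1.0]

residual = epipolar_residual(essential_from_motion(R, t), x1, x2)
wrong = epipolar_residual(essential_from_motion(R, [1.0, 0.0, 0.0]), x1, x2)
print(abs(residual), abs(wrong))
```

The residual under the true motion vanishes up to rounding, while a wrong translation direction leaves a clearly nonzero residual. This is the sense in which a hypothesized motion can be checked against correspondences instead of being estimated from them.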
In this paper, we add a generalized motion characteristic hypothesis to the initialization process, which transforms the problem of solving for the camera motion R and t into an error elimination problem. In this way, the success rate of initialization is increased. To counter the error introduced by the hypothesis, this paper applies a multiframe optimization method, thus improving the accuracy of the initialization process.
1.1. Contribution. Firstly, this paper introduces the notion of platform motion characteristics, which represent the motion state of the platform most of the time. Secondly, this paper brings the platform motion characteristics into the initialization phase of monocular visual SLAM and uses optimization to avoid solving for the essential matrix and the homography matrix. Finally, this paper uses the subsequent BA to turn the initialization from a transient process into a convergent process over several consecutive frames.

Monocular SLAM Initialization Method Based on Platform Motion Characteristics and Optimization
2.1. Overview. The present paper proposes a monocular SLAM initialization method based on platform motion characteristics and optimization, the flowchart of which is shown in Figure 2.
The proposed method contains an offline process and an online process. In the offline process, the initial motions R and t are computed from the platform motion characteristics and the camera installation mode. In the online process, feature points are first detected and matched between Frame I and Frame T, and initial map points are generated under the initial motion hypothesis; their initial errors are then reduced by Init BA. Finally, the subsequent BA is performed with the matched feature points of Frame I and Frame n, further reducing the errors of R, t, and the map points. The initialization is considered to have succeeded when the errors converge.
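Generating map points under a hypothesized motion amounts to triangulating each matched pair with R and t fixed in advance. The following is a minimal sketch that uses midpoint triangulation, which is a swapped-in textbook technique rather than the triangulation the paper necessarily uses. It assumes normalized image coordinates and the convention X_T = R X_I + t; all names are hypothetical.

```python
def transpose(m):
    """Transpose a 3x3 matrix given as nested lists."""
    return [[m[j][i] for j in range(3)] for i in range(3)]

def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(x_i, x_t, R, t):
    """Triangulate one match under a fixed motion (R, t).

    x_i, x_t: normalized (u, v) coordinates in Frame I and Frame T.
    Returns the 3D point in Frame I coordinates, or None for parallel rays.
    """
    d1 = [x_i[0], x_i[1], 1.0]                  # ray of Frame I through its origin
    Rt = transpose(R)
    c2 = [-c for c in matvec(Rt, t)]            # center of camera T in Frame I
    d2 = matvec(Rt, [x_t[0], x_t[1], 1.0])      # ray of Frame T in Frame I
    # Minimize |s*d1 - (c2 + u*d2)|^2 over the two ray parameters s and u.
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    e, f = dot(d1, c2), dot(d2, c2)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None                             # rays are (nearly) parallel
    s = (e * c - b * f) / denom
    u = (b * e - a * f) / denom
    p1 = [s * v for v in d1]
    p2 = [c2[i] + u * d2[i] for i in range(3)]
    return [(p1[i] + p2[i]) / 2.0 for i in range(3)]

# Hypothetical example: pure forward motion by one (dimensionless) unit.
R_init = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_init = [0.0, 0.0, -1.0]
point = triangulate_midpoint((0.1, -0.04), (0.125, -0.05), R_init, t_init)
print(point)
```

Because the length of t is fixed by the hypothesis rather than measured, the recovered depths are dimensionless, in line with the scale uncertainty of monocular initialization. A match on the direction of motion produces parallel rays and is rejected.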

Initial Motion Hypothesis Combining Platform Motion Characteristics and Camera Installation.
The proposed method utilizes an initial motion hypothesis to initialize the system. Cars on the ground generally run along straight lines, while aerial vehicles usually fly at a fixed angle of attack. This is a very broad description, as ground vehicles may turn and aircraft may roll. General motion characteristics can be expressed with Equation (1), which is a 6-DOF rectilinear description of any motion platform. For monocular vSLAM initialization, the rectilinear hypothesis of the platform motion needs to be expressed in the coordinate system of the camera. The required conversion is derived from the mounting characteristics of the camera and is expressed with (2), where V is the transformation matrix of the camera coordinate system with respect to the platform coordinate system. This matrix can be obtained from the camera installation characteristic. A general form of V is given in (3). Substituting (1) and (3) into (2) yields the camera motion model under the aforementioned rectilinear hypothesis.
The hypothesized camera motion T embodies the rectilinear hypothesis rather than a measured motion; therefore, it is necessary to restrain the errors it causes in the map points' coordinates.
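How a rectilinear platform motion might be expressed in the camera frame can be sketched as follows. The sketch assumes that the rotation part of V is a Z-Y-X Euler rotation built from three installation angles (named roll, pitch, and yaw here for illustration) and that V maps camera-frame vectors into the platform frame, so its transpose maps the platform's direction of travel into the camera frame. It also takes the rectilinear hypothesis to mean a rotation-free straight-line motion. These conventions and names are assumptions, not the paper's definitions.

```python
import math

def rot_zyx(roll, pitch, yaw):
    """Rotation matrix Rz(yaw) * Ry(pitch) * Rx(roll), angles in radians."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cp * cy, sr * sp * cy - cr * sy, cr * sp * cy + sr * sy],
        [cp * sy, sr * sp * sy + cr * cy, cr * sp * sy - sr * cy],
        [-sp, sr * cp, cr * cp],
    ]

def camera_motion_hypothesis(install_angles, platform_direction=(1.0, 0.0, 0.0)):
    """Return (R_init, t_init) of a straight-line platform motion in the camera frame.

    install_angles: (roll, pitch, yaw) of the camera mounting, in radians.
    platform_direction: direction of travel in the platform frame (assumed forward x).
    """
    V = rot_zyx(*install_angles)
    # V^T applied to the platform direction (assumed camera -> platform convention).
    t = [sum(V[k][i] * platform_direction[k] for k in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in t))
    t_init = [c / norm for c in t]              # unit length: monocular scale is free
    # A pure translation has no rotation in any frame: V^T * I * V = I.
    R_init = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    return R_init, t_init

# Hypothetical mounting: camera yawed 90 degrees relative to the platform.
R_init, t_init = camera_motion_hypothesis((0.0, 0.0, math.pi / 2))
print(t_init)
```

Since the installation angles are fixed by the mounting, this computation can run once offline, and only the resulting R_init and t_init are needed online.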
{X, R, t} = arg min … (6)
Equation (6) describes the Init BA optimization, which optimizes the map points once and reduces the error caused by the hypothetical model. Due to the quality of the matched feature points, Init BA alone cannot reduce the errors of X, R, and t to an acceptable margin, so the proposed method uses the subsequent BA to reduce them further.
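A bundle-adjustment step of this kind minimizes a reprojection cost over the map points and the motion. The sketch below only evaluates such a cost for two frames, in a standard textbook form that is assumed here and is not a reconstruction of Equation (6). In practice a nonlinear least-squares solver would perform the minimization. It assumes normalized image coordinates and the convention X_T = R X_I + t; names and data are hypothetical.

```python
def project(X):
    """Pinhole projection of a 3D point to normalized image coordinates."""
    return (X[0] / X[2], X[1] / X[2])

def transform(R, t, X):
    """Apply X_T = R X_I + t."""
    return [sum(R[i][k] * X[k] for k in range(3)) + t[i] for i in range(3)]

def reprojection_cost(points, obs_i, obs_t, R, t):
    """Sum of squared reprojection errors over Frame I and Frame T.

    points: map points in Frame I coordinates.
    obs_i, obs_t: observed normalized (u, v) of each point in the two frames.
    """
    cost = 0.0
    for X, x_i, x_t in zip(points, obs_i, obs_t):
        for P, x in ((X, x_i), (transform(R, t, X), x_t)):
            u, v = project(P)
            cost += (u - x[0]) ** 2 + (v - x[1]) ** 2
    return cost

# Hypothetical data: two points seen before and after a forward motion.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_true = [0.0, 0.0, -1.0]
points = [[0.5, -0.2, 5.0], [-1.0, 0.3, 8.0]]
obs_i = [project(X) for X in points]
obs_t = [project(transform(identity, t_true, X)) for X in points]

cost_true = reprojection_cost(points, obs_i, obs_t, identity, t_true)
cost_hypothesis = reprojection_cost(points, obs_i, obs_t, identity, [0.05, 0.0, -1.0])
print(cost_true, cost_hypothesis)
```

A motion hypothesis that deviates from the true motion yields a positive cost, which is the quantity an optimizer would drive down by adjusting the map points and the motion together.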

Error Reduction with Subsequent BA.
The errors contained in X, R, and t after Init BA cannot be evaluated, because they are limited by the number and distribution of the matched feature points. This paper therefore introduces subsequent BA to achieve further error suppression and to evaluate the initialization accuracy. The main idea of subsequent BA is to optimize X, R_init, and t_init with each subsequent frame and then decide whether to continue the optimization by judging its convergence (Algorithm 2):
(1) Extract the feature points of Frame n and match them with x_I to obtain the matched feature points x_n.
(2) Optimize the Reprojection Error with subsequent BA and thereby obtain the optimized X, R, t.
For each subsequent input frame, the coordinates of the map points are optimized with the error reduction process shown in Figure 3 and in (7). When the optimization converges, the error elimination process is considered complete. It can be seen from (7) that the scale of the optimization grows with each additional subsequent frame, which ensures to some extent the reliability of the optimized results.
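The frame-by-frame stopping decision can be sketched as a convergence test on a per-frame quality value. The criterion used here (relative change below a tolerance for several consecutive frames) and the parameters `window` and `rel_tol` are assumptions chosen for illustration; the paper's own criterion may differ.

```python
def has_converged(history, window=3, rel_tol=0.01):
    """True when the last `window` frame-to-frame changes are all relatively small."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    for prev, cur in zip(recent, recent[1:]):
        if abs(cur - prev) > rel_tol * max(abs(prev), 1e-12):
            return False
    return True

def frames_until_converged(evaluator_values, window=3, rel_tol=0.01):
    """Feed one evaluator value per subsequent frame.

    Returns the number of frames consumed when convergence is declared,
    or None if the values never settle.
    """
    history = []
    for n, value in enumerate(evaluator_values, start=1):
        history.append(value)
        if has_converged(history, window, rel_tol):
            return n
    return None

# Hypothetical evaluator values, one per subsequent frame (not measured data).
values = [10.0, 5.0, 2.0, 1.0, 0.999, 0.998, 0.9975]
print(frames_until_converged(values))
```

The returned count plays the role of a convergence frame number: the smaller it is, the sooner the optimized map points and pose become usable.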

The optimized map points are viewed as the initialization results, providing input for the subsequent chain processes. The quality of initialization is evaluated with an evaluator defined in terms of E, where E is the sum of the Reprojection Errors of all map points in all frames participating in the optimization.

Simulation
3.1. Simulation System. In order to better reproduce vehicle motion characteristics, the present study builds a hardware-in-the-loop (HIL) simulation system, as illustrated in Figure 4. It consists of four parts, namely, the Xplane10 flight simulation software, the Pixhawk2 flight controller, QGroundControl, and the data logger. Xplane10 plays the most important role in the entire simulation system, providing aircraft models and simulation images. The Pixhawk2 controller performs autonomous control of the fixed-wing aircraft in Xplane10, with QGroundControl acting as the data relay. Specifically, Xplane10 sends the aircraft states to QGroundControl through local loopback UDP; QGroundControl forwards the aircraft data to Pixhawk2 via the USB port using the MAVLink protocol; Pixhawk2 sends out the control commands through the same protocols. The data logger records the uplink-downlink data through UDP and the first-person view (FPV) simulation images through the video capture card.
Data transmitted in the simulation system can be roughly classified as periodic data and sporadic data. Periodic data includes the control commands and the aircraft states. Sporadic data includes the start signal, the waypoint-planning instruction, etc. After careful testing, the frequency of the periodic commands is set at 65 Hz, and the image sampling frequency is set at 25 Hz.

Performance Indicators.
Reasonable and balanced performance indicators are needed to evaluate the initialization methods. This paper proposes two groups of performance indicators, for self-evaluation (Table 2) and comparative evaluation (Table 3), respectively.
This study holds that the convergence frame number and the initial error are key indicators for assessing the proposed passive initialization algorithm. As the initial error is only affected by the aircraft state at the initial time and the rectilinear motion hypothesis, the convergence frame number is a stronger indicator of the usability of the proposed method.
The optimized map points and pose information are only usable after convergence. Since the rotation error is quite unintuitive, this paper decomposes it into three angular components to facilitate the evaluation of performance. The difference in length between the estimated t and its true value is not considered due to the depth uncertainty of monocular vSLAM initializations; only the angle between the two is considered.
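These two error measures can be sketched as follows. The translation error is the angle between the estimated and true translation directions, which ignores their lengths. The rotation error is decomposed here into Z-Y-X Euler angles of R_est * R_true^T. Naming the components roll, pitch, and yaw, and choosing this Euler convention, are assumptions made for illustration.

```python
import math

def translation_angle_deg(t_est, t_true):
    """Angle in degrees between two translation vectors, independent of length."""
    dot = sum(a * b for a, b in zip(t_est, t_true))
    n_est = math.sqrt(sum(a * a for a in t_est))
    n_true = math.sqrt(sum(a * a for a in t_true))
    cosine = max(-1.0, min(1.0, dot / (n_est * n_true)))   # guard against rounding
    return math.degrees(math.acos(cosine))

def rotation_error_angles_deg(R_est, R_true):
    """Roll, pitch, yaw (degrees) of the error rotation R_est * R_true^T."""
    R_err = [[sum(R_est[i][k] * R_true[j][k] for k in range(3)) for j in range(3)]
             for i in range(3)]
    pitch = -math.asin(max(-1.0, min(1.0, R_err[2][0])))
    roll = math.atan2(R_err[2][1], R_err[2][2])
    yaw = math.atan2(R_err[1][0], R_err[0][0])
    return tuple(math.degrees(a) for a in (roll, pitch, yaw))

# Hypothetical example: estimate is off by 0.1 rad in yaw, translation differs only in length.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
c, s = math.cos(0.1), math.sin(0.1)
R_est = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
print(rotation_error_angles_deg(R_est, identity))
print(translation_angle_deg([0.0, 0.0, 2.0], [0.0, 0.0, 1.0]))
```

Two translations that differ only in length give a zero angle, so the unobservable monocular scale does not count as an error.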

Test Design.
In order to thoroughly test the proposed initialization method, we devise a simple self-evaluation test and an advanced comparative-evaluation test. The self-evaluation test measures the inherent capabilities of the new method, while the comparative-evaluation test runs the competing algorithms on different terrains. The test scenarios include taxiing, climbing, level flight, BTT turn, diving, and landing.

Self-Evaluation Test and Result Analysis.
The test results of the proposed algorithm are shown in Figure 6 as convergence curves. Figure 6(a) gives the convergence curve for initializing in the taxiing state. In this state the initial error is small, because the motion of the aircraft while taxiing is very close to the motion assumption, and the error stays within 1° even without error elimination. Figure 6(b) gives the convergence curve for initializing at the moment of takeoff. Here there is a large discrepancy between the motion state of the aircraft and the motion assumption. A fixed-wing aircraft is mostly in level flight during cruise, and in this state its motion is similar to the motion assumption. Therefore, several special states are selected for the test, including climbing to level flight (Figure 6(d)), level flight to BTT turn (Figure 6(e)), level flight to dive (Figure 6(f)), etc. Figure 6 thus gives the convergence of the algorithm in each typical state during a complete flight, but it does not reflect the ability of the algorithm over the whole process.
Figures 7 and 8 and Table 4 summarize the convergence-related initialization performance at different poses throughout one complete flight. Figure 7 shows the convergence time statistics of the algorithm over the whole flight. Figure 8 shows the error distribution of the algorithm under different thresholds. Table 4 gives the exact values behind Figures 7 and 8.

Comparative-Evaluation Test and Result Analysis.
This paper selects ORB-SLAM2 and DSO as the competing classical algorithms for the comparative-evaluation test. In order to better reflect their performance, this study runs all methods on plain terrain (Figure 5) and mountainous terrain (Figure 9). Considering the stochastic nature of ORB-SLAM2, the study conducts five comparative-evaluation subtests on each terrain. The best subtest results are taken as the illustrative test results.
Figure 10 gives the initialization results of the three algorithms on the two terrains. The threshold parameter of the proposed method is set to 0.5 during the test. It can be seen that the SRI of the proposed method on both plain and mountainous terrain is greater than that of ORB-SLAM2 or DSO. Considering the effect of the threshold on the proposed method, the SRI under different threshold values is compared, as shown in Table 5. In addition to the comparative-evaluation performance indicators introduced in Table 3, this paper also compares the number of matched feature points (ANMFP) needed by ORB-SLAM2, DSO, and the proposed method, respectively, to achieve successful initialization (Table 6).
It can be seen from Table 6 that the ANMFP value of the proposed algorithm is between 50 and 70, while the ANMFP of ORB-SLAM2 is above 200. The DSO algorithm requires a larger ANMFP, because it uses a direct method framework. The number of feature points required by the proposed algorithm is thus much smaller than that of ORB-SLAM2 and DSO. This result follows from the basic structure of the proposed algorithm. The algorithm does not directly calculate H or E from the correspondences between the feature points of two adjacent frames; instead, it continuously optimizes the initial pose by using the feature point correspondences that can be continuously observed in successive frames. In other words, the method does not need so many feature points in the initial frame. It can obtain an acceptable initial attitude as long as enough points can be continuously observed in successive frames. This also explains, from another angle, why the algorithm achieves a higher SRI. Therefore, the proposed method can achieve better results from few feature points when dealing with sparse image features.
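The notion of points that are continuously observed in successive frames can be made concrete with a small sketch. It assumes each subsequent frame yields the set of Frame I feature identifiers that were matched in it; the identifiers and the helper name are hypothetical.

```python
def continuously_observed(matches_per_frame):
    """Return the Frame I feature ids that were matched in every subsequent frame.

    matches_per_frame: list with one collection of matched ids per subsequent frame.
    """
    if not matches_per_frame:
        return set()
    tracked = set(matches_per_frame[0])
    for ids in matches_per_frame[1:]:
        tracked &= set(ids)          # keep only ids that survive this frame too
    return tracked

# Hypothetical match sets for three subsequent frames.
frames = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 9}]
print(sorted(continuously_observed(frames)))
```

Only the surviving identifiers contribute observations to every frame of a multiframe optimization, which is why the size of this set, rather than the feature count of the initial frame alone, limits such a method.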

Conclusion
In this paper, we propose a rectilinear hypothesis of platform motion and thereby derive a passive initialization method for monocular SLAM. Init BA and subsequent BA are used to reduce the errors between the actual motion and that of the proposed hypothesis. A simulated fixed-wing aircraft is selected as the test platform for the proposed method. Results show that the success rate of monocular SLAM initialization is greatly improved compared with that of ORB-SLAM2. However, this method is only effective on platforms with strong motion characteristics and cannot be used indiscriminately on platforms characterized by randomized motions, such as humans and animals. At present, the method has yet to be tested in real-world environments, which will be addressed in future work.

Require:
Frame I: Initialization Frame
Frame n: Subsequent Frame
x_I: Feature points of Frame I
x_1, x_2, ..., x_{n−1}: Feature points of previous frames corresponding to x_I
R_1, R_2, ..., R_{n−1}: Optimized rotation matrices of previous frames
t_1, t_2, ..., t_{n−1}: Optimized translation vectors of previous frames
X: Initialized Map Points
Ensure: Map points X, R, t and Initialization Evaluator V.
[Figure: waypoints Home (alt: 0 m), TakeOff (alt: 100 m), Land Point (alt: 50 m)]

Table 1: Summary of initialization methods.
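Require:
Frame I: Initialization Frame
Frame T: Target Frame
T: Camera Motion Hypothesis
Ensure: Initialized Map Points X, R, t
(1) Perform feature point matching for Frame I and Frame T, resulting in matched points x_I and x_T.
(2) Triangulate x_I and x_T using the hypothesis T to generate map points X.
(3) Optimize X, R, t with Init BA and update accordingly.
(4) return X, R, t

Rotation block of the transformation matrix, with its three angles written as \phi, \theta, \psi:
\begin{bmatrix} \cos\theta\cos\psi & \sin\phi\sin\theta\cos\psi - \cos\phi\sin\psi & \cos\phi\sin\theta\cos\psi + \sin\phi\sin\psi \\ \cos\theta\sin\psi & \sin\phi\sin\theta\sin\psi + \cos\phi\cos\psi & \cos\phi\sin\theta\sin\psi - \sin\phi\cos\psi \\ -\sin\theta & \sin\phi\cos\theta & \cos\phi\cos\theta \end{bmatrix}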

Table 5: Initialization results under different terrains.

Table 6: Number of matched feature points needed.