Vision-Based Object Recognition and Precise Localization for Space Body Control

Space motion control is an important issue for space robots, rendezvous and docking, small-satellite formation, and other on-orbit services. Motion control requires robust object detection and high-precision object localization. Among the many sensing systems available, such as laser radar, inertial sensors, and GPS navigation, vision-based navigation is better suited to noncontact applications at close range and in high-dynamic environments. In this work, a vision-based system serving a free-floating robot inside a spacecraft is introduced, and a method to measure the 6-DOF position and attitude of a space body is presented. First, a deep-learning method is applied for robust object detection against a complex background; after the object has been navigated to close range, a reference marker is used for more precise matching and edge detection. Once accurate coordinates are obtained from the image sequence, the object's spatial position and attitude are calculated by a geometric method and used for fine control. The experimental results show that recognition based on deep learning at a distance, combined with marker matching at close range, effectively eliminates false target recognition and improves positioning precision at the same time. The tests show a recognition accuracy rate of 99.8% and a localization error well below 1% within 1.5 meters. A high-speed camera and a GPU-driven embedded electronic platform are applied to accelerate image processing, so that the system runs at up to 70 frames per second. The contribution of this work is to introduce the deep-learning method into precision motion control while ensuring both the robustness and the real-time performance of the system, aiming to make such vision-based systems more practicable in real space applications.


Introduction
Space programs on space robots, debris removal, rendezvous and docking, satellite formation, and other on-orbit service applications all involve the technology of moving-body control [1][2][3][4]. The precondition for moving-body control is first to acquire the body's movement information, such as inertia, position, attitude, and velocity. Figure 1 shows several examples of on-orbit service applications with vision systems.
Generally, there are many techniques for measuring the relative position and attitude between two objects. Sensors such as GPS, gyroscopes, accelerometers, and star sensors are commonly used for self-navigation, and the resulting position information is exchanged between the objects by wireless communication.
Optical-electronic sensors such as laser radar and vision-based systems may be more suitable for measuring relative position and attitude when the two objects are at close range, especially for autonomous vehicles or aircraft [5][6][7]. In addition, vision-based systems built on computer vision are widely used for object localization in industrial manufacturing lines, medical instruments, and other intelligent applications. Since vision-based measurement is low cost and flexible to set up, vision systems are increasingly applied in space body control.
Vision systems come in many schemes, such as monocular vision [8], stereo vision [9], and active vision with structured light [10]. Besides, active cameras such as Flash LIDARs can be used to detect unknown objects [11]. For noncooperative system localization, stereo vision and monocular vision [12] can both recognize unknown objects by edge detection and feature matching. Active vision with structured light obtains the object's 3D information as the structured light scans the object surface; it can not only help to recognize the object but also reconstruct the 3D shape of an unknown object. Although monocular vision has more difficulty handling unknown objects, its precision and speed are no worse than stereo or active vision in applications with a known target and environment.
This work concerns an on-orbit service application. A free-floating robot can move inside the spacecraft, and its functions include routing inspection, astronaut assistance, and autonomous docking and charging. Many similar programs have been carried out on satellites or the Space Station, such as SPHERES [13], SCAMP [14], mini AERCam [15], and Astrobee [16], which are shown in Figure 2. This kind of robot does not require space orbit control and serves only for relative movement inside the spacecraft; however, provided the environment permits, it can also work outside the spacecraft. In this work [17], a vision navigation camera is configured on the robot, and another camera is fixed at the docking place to recognize the robot. These two kinds of vision system have the same function of measuring relative position and attitude.
The main problem for the positioning system is that the complex background and lighting environment may disturb image recognition. Another problem is the control system's requirement for real-time processing speed and high precision. To resolve these problems, firstly, a deep-learning method is introduced for robust object detection and localization in the image sequence. Secondly, after the object is detected, the geometry of the object's position and attitude is solved by the P4P (perspective from 4 points) method, and an explicit solution is obtained. An embedded electronic platform driven by a GPU is applied to accelerate the image processing.
A ground test platform was established, and its results indicate that these measures greatly improve the recognition rate, the object localization error is below 1%, and the embedded platform can process the image sequence at up to 70 frames per second. This work aims at making such vision-based systems more practicable in real dynamic environments.

System Design and Working Mode
The vision-based positioning system works as shown in Figure 2. The system consists of a light source, a camera and lens, and an embedded computer based on a GPU and an ARM processor. The system is tested in scenes with both simple and complex backgrounds that simulate the real space environment. Reference markers with robust Hamming-code patterns are set up both at the fixed place and on the robot, as shown in Figure 2.
The system works in two modes as follows:

(1) Long-Distance Navigation and Control. When the object is at a distance, the system recognizes the target in the image sequence and estimates its coarse distance and position. This is realized by identifying the position of the object in the image and its image size d; the coarse distance of the object can then be inferred from its known real size. The robot is controlled to approach the object with the help of the navigation system. In this working mode, only the position and distance of the object are estimated, without calculating its accurate attitude.

(2) Close Docking and Precise Control. When the two objects are less than 1.5 meters apart, the camera seeks the known marker on the target and extracts accurate information from it to implement precise control.
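The coarse ranging in mode (1) is the pinhole similar-triangles relation between the object's real size and its apparent image size. A minimal sketch, assuming the focal length is known in pixels (function and parameter names are ours, not the paper's):

```python
def coarse_distance(real_size_m, image_size_px, focal_length_px):
    """Pinhole estimate of the object distance: Z ~= f * D / d, where D is
    the object's known real size in meters, d its apparent size in pixels,
    and f the focal length expressed in pixels."""
    return focal_length_px * real_size_m / image_size_px
```

For example, a 0.2 m object imaged at 100 pixels by a 1000-pixel focal length camera lies roughly 2 m away; halving the apparent size doubles the estimated range.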

Object Detection
According to the system working modes, the object detection process includes two parts: one is coarse recognition to find the object at a distance in the complex scenes, and the other is fine localization for calculating the accurate spatial position at close range.

Object Detection by Deep Learning
The traditional way to localize the target in an image is to directly apply precise image matching, such as with SIFT or ORB features [18, 19], but this approach does not distinguish the target well against a complex background and may lead to mismatches and reduced positioning accuracy.
With the development of machine learning, deep learning has shown higher robustness and effectiveness in solving target detection in complex scenes [20, 21]. In this work, we use a target detection method based on a convolutional neural network.
The target detection process is as follows: first, prepare the training set, label the positive and negative samples, and train a model on this dataset. Then the algorithm is embedded in the computer, and the software automatically determines whether the target is in the scene. After 768 iterations, the training accuracy rate is 99.976% and the testing error rate is 0.157%, and the system has basically reached a steady state; after 30000 iterations, the model fully meets the needs of this experiment. The training is done offline and takes about 60 hours on CPU or 24 hours on GPU.

Recognition Algorithm Test.
To verify the reliability of the method, we match against a database of 4110 samples, as shown in Table 1. By varying the threshold value, 10 groups of FPR (false positive rate) and TPR (true positive rate) are obtained. The recognition algorithm is compared with the traditional image matching method using ORB features; the former's results are FPR1 and TPR1, while the latter's are FPR2 and TPR2. It is often recommended to take the threshold as 0.8 for image matching. In our application, we set the threshold between 0.4 and 0.6, where the matching result was best; with a greater threshold, the generalization performance decreases.
Among the 4110 tested samples, the CNN-based method identified no negative samples as positive, and only 21 positive samples were missed. The accuracy rate reached 99.847%, far better than the traditional object detection method, which benefits from CNN and deep-learning technology.
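The threshold sweep behind Table 1 can be sketched as follows; the scores and labels here are hypothetical stand-ins for the real detector outputs, not the paper's data:

```python
def tpr_fpr(scores, labels, threshold):
    """Compute (TPR, FPR) for a detector that accepts a sample when its
    matching score is at least `threshold`. labels: 1 = positive sample,
    0 = negative sample. Sweeping the threshold yields the ROC groups."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return tp / (tp + fn), fp / (fp + tn)
```

Evaluating this at 10 threshold values produces the FPR/TPR pairs compared in Table 1.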

Marker Recognition and Fine Localization
Marker recognition is performed when the object is within 1.5 meters and after the object has been localized to a small region. Marker detection includes the following steps: firstly, the image is normalized and converted to a binary image with a certain threshold. Then, after image segmentation and contour extraction, polygons are approximated and the four vertices of each polygon are found. According to the polygon position, the polygon image is segmented and represented by 1s and 0s. This binary vector is matched against Hamming codes, and the number of the AprilTag and its orientation in the image are recognized, as shown in Figure 6. The marker is not the only target we can use during this stage; in fact, we tried other targets, as shown in Figure 7. However, considering the marker's self-correcting property, its stability is better.
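The decoding step above — matching the extracted bit grid against the code set in each of the four possible orientations — can be sketched as follows. A toy two-bit codebook stands in for the real AprilTag code set, and a production decoder would also tolerate a small Hamming distance rather than require an exact match:

```python
def decode_marker(grid, codebook):
    """Match a square bit grid read from the marker against a codebook,
    trying all four 90-degree orientations. Returns (marker_id, rotations)
    or None if no entry matches."""
    def rot90(g):
        # rotate the bit grid 90 degrees clockwise
        return tuple(zip(*g[::-1]))
    g = tuple(map(tuple, grid))
    for orientation in range(4):
        for marker_id, ref in codebook.items():
            if g == tuple(map(tuple, ref)):
                return marker_id, orientation
        g = rot90(g)
    return None
```

The returned orientation is what lets the system order the four vertices consistently before the pose calculation.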
After the four vertices of the marker are confirmed, they are used for further calculation. If one marker covers 50 pixels in the 640 × 480 image, then a precision of one pixel leads to a 1% localization error, that is, 10 mm at a range of 1 m. For more precise localization, the coordinates of the four control points are corrected to subpixel accuracy, and the error decreases to about 0.1%. There are many methods for subpixel edge detection; in this work we directly use the classic OpenCV function for subpixel processing. The subpixel edge detection is shown in Figure 8. Since subpixel detection is time-consuming, it is used only in fine operation control.

The translation vector T represents the three-dimensional position of the target in the camera coordinate system, and the rotation matrix R characterizes the attitude of the target in the camera coordinate system. Suppose P0 is the original position of the moving object. The position measurement problem is then to calculate the robot's position and attitude relative to the camera, which is equivalent to resolving the rotation matrix R and the translation vector T.
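The principle behind subpixel refinement can be illustrated by fitting a parabola through three samples around a discrete gradient peak; this is only a sketch of the idea, not the OpenCV routine the system actually uses:

```python
import numpy as np

def subpixel_peak(g, i):
    """Refine the location of a discrete peak at index i of a sampled
    gradient profile g by fitting a parabola through the three samples
    (i-1, i, i+1) and returning the vertex position."""
    denom = g[i - 1] - 2.0 * g[i] + g[i + 1]
    return i + 0.5 * (g[i - 1] - g[i + 1]) / denom
```

On a smooth edge profile this recovers the edge position to a small fraction of a pixel, which is what shrinks the 1% localization error toward 0.1%.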

Position-Attitude Measurement
Let P0, P1, P2, P3 be the four coordinates of the marker vertices in space and q0, q1, q2, q3 their projected coordinates in the camera plane, as shown in Figure 9. Suppose that P0P1 // P2P3, |P0P1| = d1, |P0P3| = d2, and that the camera intrinsic parameters are known. Resolving R and T is then a classical P4P problem [22].
Define the control points in the camera coordinate system as q0 = (x0, y0, f), q1 = (x1, y1, f), and q2 = (x2, y2, f), where f is the camera focal length, and define the optical center as OC. The plane π1 is formed by OC, q0, and q1; its normal vector N1 = (nx, ny, nz)^T can be calculated from the linear equation of q0 and q1 and the camera intrinsic parameters. Let P0 = k0·q0 and P1 = k1·q1, where k0 and k1 are unknown scale factors along the two viewing rays. Since vector P0P1 is parallel to P2P3, it is perpendicular to the normal vector N2 of the plane π2 formed by OC, q2, and q3:

N2 · (k1 q1 − k0 q0) = 0,    (3)

and its length is known:

|k1 q1 − k0 q0| = d1.    (4)

From equations (3)-(4), k0 and k1 can be solved, and correspondingly P0 and P1 are obtained. The XW axis in the camera coordinate system is (xP1 − xP0, yP1 − yP0, zP1 − zP0), normalized as (r11, r21, r31), which is the first column of the rotation matrix R.
The YW axis in the camera coordinate system is (xP3 − xP0, yP3 − yP0, zP3 − zP0), normalized as (r12, r22, r32), which is the second column of the rotation matrix R.
The ZW axis in the camera coordinate system, (r13, r23, r33), is calculated by the cross product

(r13, r23, r33) = (r11, r21, r31) × (r12, r22, r32).    (5)

It is the last column of R, and the rotation matrix R is now fully solved.

A platform was set up to validate the precision of the system, as shown in Figure 10. The reference marker is fixed on a 6-DOF precision displacement table, and the camera is installed on another mechanical table. We record the displacement of the 6-DOF table as the true value and the displacement calculated by the vision system as the test value; the difference between the two represents the precision of the system.
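The column-by-column construction of R described above can be sketched as follows, assuming the three vertices P0, P1, P3 have already been recovered in camera coordinates (the function name is ours):

```python
import numpy as np

def rotation_from_vertices(P0, P1, P3):
    """Assemble R column by column as in the text: the normalized P0->P1
    direction gives (r11, r21, r31), the normalized P0->P3 direction gives
    (r12, r22, r32), and their cross product gives (r13, r23, r33).
    The translation T is simply P0, the marker origin in camera coordinates."""
    x = (P1 - P0) / np.linalg.norm(P1 - P0)   # X_W axis, first column
    y = (P3 - P0) / np.linalg.norm(P3 - P0)   # Y_W axis, second column
    z = np.cross(x, y)                         # Z_W axis, third column
    return np.column_stack([x, y, z])
```

For a square marker the two edge directions are orthogonal, so the result is a proper rotation matrix; with noisy vertices one would re-orthonormalize.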

Precision and Speed Testing
When the mechanical platform is moving, the computer software samples the image sequence, calculates the accurate position and attitude for each picture, and draws the moving trajectory in real time, as shown in Figure 11. The red square indicates the marker in camera coordinates. To reflect the direction of the marker, its upper-left corner is defined as the first corner when the marker is upright, and the other three corners follow in counterclockwise order.

Camera Calibration.
Before setup, the camera should first be calibrated offline. The camera intrinsic parameters are commonly described by the matrix

K = | αx  γ   u0 |
    | 0   αy  v0 |
    | 0   0   1  |

where αx = f/dX is the normalized focal length along the u axis, αy = f/dY that along the v axis, γ is the skew (distortion) factor, and (u0, v0) is the image coordinate of the optical center. These parameters depend only on the camera itself and can be calibrated in advance. Camera calibration is a common operation: according to Zhang's method [23], one 2D checkerboard marker and pictures of it at various angles and positions suffice, and the intrinsic parameters are then obtained by the least-squares method.
In this work, we use a 5 mm lens and a 1/3-inch grayscale camera with 640 × 480 resolution; the camera parameters are calibrated as shown in Table 2.
According to Table 2, the normalized focal lengths of the u and v axes are (αx, αy) = (1338.84, 1345.02), the image coordinate of the optical center is (u0, v0) = (618.15, 475.73), the skew factor is γ = 0, and the radial distortion parameters are k = (−0.1116, 0.1242).
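With these calibrated values, the intrinsic matrix and the pinhole projection can be written down directly; this sketch ignores the radial distortion terms:

```python
import numpy as np

# Intrinsic matrix assembled from the calibrated values in Table 2
# (alpha_x, alpha_y, u0, v0; skew gamma = 0; distortion ignored here).
K = np.array([[1338.84, 0.0, 618.15],
              [0.0, 1345.02, 475.73],
              [0.0, 0.0, 1.0]])

def project(P):
    """Pinhole projection of a camera-frame point P = (Xc, Yc, Zc)
    to pixel coordinates (u, v)."""
    u, v, w = K @ P
    return u / w, v / w
```

A point on the optical axis projects exactly to (u0, v0), which is a quick sanity check on the assembled matrix.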

Single-Dimension Precision Test
Move the mechanical table separately along the X, Y, and Z axes and about the Rx, Ry, and Rz axes, and test whether the calculated value is consistent with the true value. Figure 12 shows the test results in Z, Ry, and Rz; the error is less than 1%, except for some points on the side face of Ry.

Multidimension Decoupling Test.
Since the measurement of multiple degrees of freedom may involve coupling, and its error model is difficult to analyze, the coupling influence must be evaluated by an actual test. The experimental schematic of the multidimension decoupling test is shown in Figure 13, and Figure 14 shows the curves when the object moves in a 40 × 60 cm rectangle on a plane. At the same time, we record the changes of the Z axis and RZ; the results show that the coupling of Z with XY and of RZ with XY is less than 1 mm and 2 degrees, respectively. This error may be caused by the camera calibration, the position calculation, or other systematic errors.
5.3. Real-Time Speed Test Results. For real-time operation, we chose a high-speed camera with a low exposure time and used a GPU embedded platform to implement and speed up the image processing algorithms. Here, the camera is a Basler gc300 (640 × 480, 300 fps), and the GPU is NVIDIA's Tegra TX1. These industrial components are for the ground test and may be hardened for the space environment.
GPUs perform well on large images and can accelerate image processing and CNN applications [24]. FPGAs perform well on small images and parallel processing, and heterogeneous computing with GPU and FPGA together would be even more powerful for real-time applications. The figures below are for the GPU as the sole processor.
The vision-based system works in the following steps:

(1) Image sequence sampling
(2) Image normalization and preprocessing
(3) Image recognition and matching
(4) Position and attitude calculation

Among them, image sequence sampling mainly depends on the camera performance, and this step costs the most time. For image processing, the processing time depends on the working mode. Image recognition by CNN is somewhat time-consuming, since it seeks and matches features over the whole image. When the objects are less than 1.5 meters apart, the recognition algorithm is much faster, since the reference marker is matched within a small region and the image has already been converted to binary. The timing is shown in Table 3.
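A per-stage timing harness in the spirit of Table 3 might look like the following; the stage names and frame type are illustrative, not the paper's implementation:

```python
import time

def time_pipeline(stages, frame):
    """Run the processing steps on one frame and record per-stage
    wall-clock time, then derive the achievable frame rate from the
    total. `stages` is a list of (name, function) pairs."""
    timings = {}
    data = frame
    for name, fn in stages:
        t0 = time.perf_counter()
        data = fn(data)
        timings[name] = time.perf_counter() - t0
    fps = 1.0 / max(sum(timings.values()), 1e-12)
    return data, timings, fps
```

Running this separately in navigation mode (CNN over the whole image) and docking mode (marker matching in a small region) reproduces the kind of breakdown Table 3 reports.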
In conclusion, the ground experiments show that the system sampling and fastest processing speed reach 70 Hz in docking mode and 30 Hz in navigation mode, which satisfies the demands of most vision-based control systems. There are further strategies to improve the system speed, such as using image tracking to reduce the search region of the image, or optimizing the GPU code with the CUDA accelerator; these will be discussed in the future.

Conclusions
A vision-based system mainly for space body 6-DOF position-attitude measurement and control is introduced in this paper. The configuration of the system, the image processing and object detection algorithms, and the position-attitude measurement formulas are given. Our work mainly focuses on the practical problems of such a system. Firstly, we bring forward a CNN-based deep-learning method for object detection and achieve a 99.8% accuracy rate. Secondly, we resolve the P4P problem for object position and attitude and test it on the ground; the results show the error is well below 1% of range. Thirdly, we use a high-speed camera and a GPU processor to accelerate the system to nearly 70 frames per second. The fast development of computer technologies such as GPUs and deep learning brings great benefits to object detection applications. In the future, we will continue to optimize the algorithms and reduce the time consumed, making this kind of vision-based system more robust and faster for use in more space applications.

Figure 1: Examples of on-orbit service applications with vision systems. (a) On-orbit servicing, (b) SPHERES, (c) mini AERCam, and (d) our system on the ground.

3.3. Training Configuration. In this paper, the neural network model is divided into eight layers: five convolutional layers and three fully connected layers. Each convolutional layer includes the ReLU (rectified linear unit) excitation function and LRN (local response normalization), followed by downsampling for pooling. The batch_size is set to 50, and the training set has 9590 images, so it takes 192 iterations to pass over all samples once. The test set has 4110 images; with batch_size 50, it takes 83 batches to complete one test pass. We set the snapshot interval to 1000, and the loss curve and accuracy curve are shown in Figure 5.
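The iteration counts stated above follow directly from the dataset sizes and batch size (a partial final batch still counts as one iteration):

```python
import math

# Batches per pass over the data: 9590 training images and 4110 test
# images, both with batch_size = 50, as described in the text.
batch_size = 50
train_batches = math.ceil(9590 / batch_size)  # -> 192 iterations per epoch
test_batches = math.ceil(4110 / batch_size)   # -> 83 batches per test pass
```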

Figure 2: System working modes: long-distance navigation and close docking.

Figure 3: Positive sample set in different environment.

Figure 4: Negative sample set in different environment.

Figure 5: Loss curve and accuracy curve during training.

4.1. The Localization Algorithm. In the camera projective model, define the center of the marker as (XW, YW, ZW) in the marker coordinate system and (XC, YC, ZC) in the camera coordinate system, related by

(XC, YC, ZC)^T = R · (XW, YW, ZW)^T + T.

Here, the translation vector is T = (TX, TY, TZ)^T and the rotation matrix is

R = | r11 r12 r13 |
    | r21 r22 r23 |
    | r31 r32 r33 |

Figure 6: Reference marker recognition and decoding process.

Figure 7: Various targets we can choose.

Figure 11: Real-time trajectory of the marker.

Figure 12: The calculated value when the object moves in different dimensions. (a) Move in the Z axis, (b) move about the Rz axis, and (c) move about the Ry axis.

Figure 14: The coupling value when the object moves in an X-Y plane. (a) The XY curve, (b) the Z-XY curve, and (c) the Rz-XY curve.

Table 1: Image matching accuracy with different thresholds.

Table 2: Intrinsic parameter results of camera calibration.