Relative pose estimation of cooperative target based on monocular vision

With the vigorous development of the global logistics industry, warehousing automation has become a popular applied research direction in computer vision. This paper takes the relative pose between the fork and the fork hole during the forklift picking process as the research object. The relative pose of the two is solved with a PnP algorithm based on an ArUco cooperative marker. The initial value of the relative pose is obtained by the linear direct linear transformation (DLT) algorithm and then refined by Bundle Adjustment (BA): an iterative error observation function is constructed from the visual pose estimation model, and the Gauss-Newton method is used to carry out least-squares fitting over repeated iterations, finally yielding a high-precision relative pose between the fork and the fork hole.


Introduction
The forklift is an industrial handling vehicle that is often used to load, unload, and stack commodities [1]. It plays a significant role in the warehousing and logistics industry. In actual operation, however, aligning the fork with the fork hole is difficult and time-consuming, which incurs significant time, labor, and material costs. Therefore, for automated warehousing, the forklift's autonomous picking of goods needs to be solved [2]. High-precision estimation of the relative position and attitude of the fork and the fork hole is the critical link in this autonomous picking process [3]. This paper presents a method for estimating the relative pose of a cooperative target based on monocular vision, which is used to obtain the relative pose between the fork and the fork hole.
Vision-based pose estimation has clear advantages such as low cost, high precision, non-contact measurement, and a large amount of information [4]. Traditional visual pose estimation methods are divided into two types: those based on cooperative targets and those based on non-cooperative targets. Methods based on a cooperative target require artificially placing markers with known geometric structure on the target, which can significantly improve detection accuracy [5][6]. Compared with other cooperative markers, the ArUco code provides enough point correspondences to recover the camera pose; at the same time, its internal binary coding keeps the marker highly robust in error checking and correction, and its corners are easy to extract. Therefore, this article chooses the ArUco code as the cooperative marker, placed near the fork hole; the positional relationship is shown in Figure 1. Most existing vision-based pose estimation methods involve two necessary steps: feature extraction and pose parameter solution. The extracted features are used to construct 2D-3D correspondences, and the pose is estimated from these correspondences. Because the corner points of the ArUco code are rich in information, the features used by the solution method in this paper are mainly feature points.
When the camera's internal parameters are known, solving the relative pose between the target's space coordinate system and the camera coordinate system from the correspondence between feature points and their image points is the classic PnP (Perspective-n-Point) problem [7]. PnP solution methods fall roughly into two types: linear algorithms and nonlinear iterative algorithms. P3P has at most four solutions, and Gao et al. gave the complete solution classification using Wu's method [8]. Quan and Lan gave linear SVD algorithms for the P4P and P5P problems [9]; they transformed the PnP problem into multiple P3P problems, but this increased the algorithm's computational complexity. Lepetit and Moreno-Noguer proposed EPnP, an efficient and high-precision linear algorithm with O(n) computational complexity [10]. Still, experimental verification shows that the EPnP algorithm is sensitive to the depth of the scene [12]: when the camera is too close to the target, the algorithm becomes unstable and inaccurate. Another typical linear algorithm is direct linear transformation (DLT) [11], which ignores the orthogonality constraints of the rotation matrix; through matrix manipulation, linear constraint equations in 12 independent variables are obtained. The computational complexity of the DLT algorithm is low, but it is strongly affected by noise and its accuracy is limited. The classic nonlinear iterative approach takes minimizing a particular cost function as the goal, transforms the pose solution into a nonlinear least-squares problem, and solves it iteratively with the Gauss-Newton or Levenberg-Marquardt method. Nonlinear iterative algorithms achieve high precision and strong noise resistance, but their computational cost is much greater than that of linear algorithms.
The iterative solution process is also time-consuming and depends on the initial value of the iteration. This paper proposes a pose estimation algorithm based on feature point positioning that can solve the relative pose of the fork and the fork hole quickly and effectively: the algorithm obtains a low-precision initial estimate from the DLT solution and then performs several nonlinear iterations on this initial value to obtain a high-precision six-dimensional relative pose of the fork and the fork hole.

Pose transformation theory
Transforming from one coordinate system to another requires two processes: attitude transformation and translation, that is, a rotation through three attitude angles followed by a translation along the three coordinate axes:

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + t \qquad (1)$$

This article involves three coordinate systems: the camera coordinate system, the fork coordinate system (reference coordinate system), and the fork hole coordinate system (world coordinate system). Their positional relationship is shown in Figure 1.
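As a minimal numerical sketch of this composition (assuming, for illustration only, a Z-Y-X Euler angle sequence; the paper does not state which convention it uses):

```python
import numpy as np

def rotation_from_euler(rx, ry, rz):
    """Rotation matrix from three attitude angles (radians), Z-Y-X order.
    The axis convention here is an assumption for illustration."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def transform(R, t, p):
    """Map a point p from the world frame into the camera frame: rotate, then translate."""
    return R @ p + t

# Rotate 90 degrees about Z, then shift 0.1 m along X.
R = rotation_from_euler(0.0, 0.0, np.pi / 2)
p_cam = transform(R, np.array([0.1, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]))
# The point on the X axis rotates onto the Y axis, then shifts along X.
```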

Vision estimation model
The estimation model is established according to the camera perspective projection imaging model, in which $O_c\,X_cY_cZ_c$ is the camera coordinate system, $O_w\,X_wY_wZ_w$ is the world coordinate system, $o\,uv$ is the image coordinate system, and the image center point is the intersection of the optical axis with the image plane. The conversion relationship between the homogeneous pixel coordinate vector $(u, v, 1)^T$ of a point $Q$ and its homogeneous camera coordinate vector $(X_c, Y_c, Z_c, 1)^T$ is

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} \qquad (2)$$

where the left 3×3 block $K$ is the internal parameter matrix of the camera, $f_x$ and $f_y$ are the equivalent focal lengths of the camera, and $(u_0, v_0)$ are the pixel coordinates of the image center point. The internal camera parameters can be obtained by camera calibration; in this paper, the MATLAB calibration toolbox is used to calibrate the camera.
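The projection of equation (2) can be sketched as follows, with hypothetical intrinsic values standing in for the calibrated ones:

```python
import numpy as np

# Hypothetical intrinsics: f_x, f_y equivalent focal lengths in pixels,
# (u0, v0) the image center point. Real values come from calibration.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(K, p_cam):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates:
    multiply by K, then divide by the depth Z_c (equation (2))."""
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

uv = project(K, np.array([0.1, -0.05, 1.0]))
# u = 800 * 0.1 / 1.0 + 320 = 400, v = 800 * (-0.05) / 1.0 + 240 = 200
```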
Combining formula (2), the correspondence between the homogeneous pixel coordinates and the homogeneous world coordinate vector $(X_w, Y_w, Z_w, 1)^T$ is

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & t \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \qquad (3)$$

where $R$ is a 3×3 rotation matrix and $t$ is the translation vector between the camera coordinate system and the world coordinate system.

Direct Linear Transformation
Writing the homogeneous pixel coordinates as $(u, v, 1)^T$ and the homogeneous world coordinates as $P = (X, Y, Z, 1)^T$, equation (3) defines the 3×4 projection matrix $H = K \begin{bmatrix} R & t \end{bmatrix}$ of the pinhole camera, mapping homogeneous 3D points to homogeneous 2D image points:

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = H \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (4)$$

Denoting the $j$-th row of $H$ by $h_j^T$ and expanding (4) gives three equations in $H$:

$$Z_c u = h_1^T P, \qquad Z_c v = h_2^T P, \qquad Z_c = h_3^T P \qquad (5)$$

If the orthogonality constraints of the rotation matrix are ignored and each element of $H$ is treated as an independent variable, eliminating $Z_c$ yields two independent equations:

$$h_1^T P - u\, h_3^T P = 0, \qquad h_2^T P - v\, h_3^T P = 0 \qquad (6)$$

Each feature point provides the two independent equations of (6), and the projection matrix $H$ has 12 unknowns, so when the number of feature points $N \ge 6$ the projection matrix $H$ can be determined (up to scale). Through formulas (1) and (4) we can then obtain the six-dimensional pose of the world coordinate system relative to the camera coordinate system. However, the pose accuracy achieved by the DLT solution is low, and we can improve it by minimizing the reprojection error.
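A minimal sketch of the DLT step: stack the two equations of (6) for each point and take the null-space vector of the coefficient matrix via SVD. The setup below is synthetic (non-coplanar random points and a known ground-truth $H$, used only to check the recovery up to scale):

```python
import numpy as np

def dlt_projection_matrix(pts3d, pts2d):
    """Solve the 3x4 projection matrix H from N >= 6 point pairs.
    Each pair contributes the two rows of equation (6); the solution is
    the right singular vector of the smallest singular value."""
    A = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        P = [X, Y, Z, 1.0]
        A.append(P + [0.0] * 4 + [-u * c for c in P])
        A.append([0.0] * 4 + P + [-v * c for c in P])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)

# Synthetic check: project 6 known non-coplanar points with a known H,
# then recover H up to scale from the correspondences alone.
rng = np.random.default_rng(0)
H_true = np.hstack([np.eye(3), [[0.1], [0.2], [1.0]]])
pts3d = rng.uniform(-1, 1, (6, 3)) + [0.0, 0.0, 4.0]   # points in front of camera
proj = (H_true @ np.hstack([pts3d, np.ones((6, 1))]).T).T
pts2d = proj[:, :2] / proj[:, 2:]
H = dlt_projection_matrix(pts3d, pts2d)
H = H / H[2, 3] * H_true[2, 3]   # fix the overall scale for comparison
```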

Bundle Adjustment
The essence of Bundle Adjustment is an optimization model that minimizes the reprojection error: the difference between the real projection of a 3D space point on the image plane (the observed pixel) and its reprojection (the virtual pixel obtained by projecting the 3D point again with the currently estimated projection matrix $H$). This article uses the Gauss-Newton method to iteratively optimize the six-dimensional pose obtained by the DLT algorithm. Dropping the quadratic and higher terms of the Taylor expansion of the error function gives

$$e(x + \Delta x) \approx e(x) + J \Delta x \qquad (9)$$

where $J$ is the Jacobian matrix of first-order partial derivatives of the error function. The ideal value of the reprojection error $\|e\|$ is 0, so setting the linearized function to zero in the least-squares sense yields the iterative optimization model of BA:

$$J^T J \, \Delta x = -J^T e \qquad (10)$$

$$x \leftarrow x + \Delta x \qquad (11)$$

In summary, this paper constructs two iterative error observation functions, $e_u$ and $e_v$, one for each pixel coordinate. The variable of the iteration is the six-dimensional pose vector $(\alpha, \beta, \gamma, t_x, t_y, t_z)$, and the first-order partial derivatives of $e_u$ and $e_v$ with respect to it form the Jacobian of the BA optimization model. With the camera extrinsics calibrated relative to the fork, and with the number of feature points $N \ge 6$, the rotation and translation of the camera relative to the fork coordinate system can be obtained according to formulas (1) and (3) when the fork coordinate system coincides with the world coordinate system; this constitutes the estimation model of this paper.

In the experiment, an industrial-grade infrared camera was used. Images were collected and poses solved with the fork coordinate system coinciding with the fork hole coordinate system (both relative six-dimensional poses are 0). During the experiment, the camera was 0.97 m from the cooperative marker, and the fork, fork hole, and camera were kept relatively static.
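The DLT-initialized Gauss-Newton refinement described above can be sketched as follows. This is a synthetic illustration with hypothetical intrinsics and marker points, a perturbed pose standing in for the DLT initial value, and a forward-difference Jacobian in place of the analytic partial derivatives derived in the paper:

```python
import numpy as np

def rot(rx, ry, rz):
    """Z-Y-X Euler rotation (the axis convention is an assumption)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    return (np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
            @ np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
            @ np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]]))

def residual(x, K, pts3d, obs):
    """Stacked reprojection errors e(x) for pose x = (rx, ry, rz, tx, ty, tz)."""
    R, t = rot(*x[:3]), x[3:]
    e = []
    for P, uv in zip(pts3d, obs):
        p = K @ (R @ P + t)
        e.extend(p[:2] / p[2] - uv)
    return np.asarray(e)

def gauss_newton(x, K, pts3d, obs, iters=10):
    """Iterate J^T J dx = -J^T e, x <- x + dx (equations (10)-(11))."""
    for _ in range(iters):
        e = residual(x, K, pts3d, obs)
        J = np.empty((e.size, 6))
        for j in range(6):             # forward-difference Jacobian
            xp = x.copy()
            xp[j] += 1e-6
            J[:, j] = (residual(xp, K, pts3d, obs) - e) / 1e-6
        x = x + np.linalg.solve(J.T @ J, -J.T @ e)
    return x

# Hypothetical intrinsics and marker-like feature points (meters).
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1.0]])
pts3d = np.array([[-0.1, -0.1, 0], [0.1, -0.1, 0], [0.1, 0.1, 0],
                  [-0.1, 0.1, 0], [0.0, 0.0, 0.05], [0.05, 0.0, 0.1]])
x_true = np.array([0.10, -0.05, 0.20, 0.02, -0.01, 1.0])
obs = residual(x_true, K, pts3d, np.zeros((6, 2))).reshape(6, 2)
x0 = x_true + 0.05                 # perturbed start, as if from a coarse DLT
x_est = gauss_newton(x0, K, pts3d, obs)
```

On noise-free correspondences the iteration converges back to the ground-truth pose; with real detections the residual settles at the reprojection noise floor instead.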
The experiment collected 500 groups of data. Figure 4 shows the experimental results of pose estimation, comparing the actual and estimated values of the translation distances and rotation angles about the X, Y, and Z axes. From the comparison, the X-axis translation distance had a static error of about 2 mm and a dynamic error in the range [0 mm, 4 mm]; the X-axis rotation angle had a static error of about 0.05° and a dynamic error in the range [-0.05°, 0.1°]; the Y-axis translation distance had a static error of about -1.5 mm and a dynamic error in the range [-3 mm, 0.5 mm]; the Y-axis rotation angle had a static error of about 0.05° and a dynamic error in the range [0°, 0.15°]; the Z-axis translation distance had a static error of about 0 mm and a dynamic error in the range [-1.5 mm, 2 mm]; and the Z-axis rotation angle had a static error of about -0.05° and a dynamic error in the range [-0.15°, 0.05°].
Experimental verification shows that the translation accuracy of the pose estimation algorithm in this paper reaches the millimeter level, and the rotation angle accuracy is within 0.2°.

Conclusions
In this paper, a pose estimation algorithm based on at least six coplanar feature points is studied. Based on the pinhole camera perspective projection model, the DLT linear algorithm is used to obtain the initial value of the relative pose of the fork and the fork hole, and the Gauss-Newton nonlinear iterative algorithm is used to optimize it, improving the accuracy of pose estimation. Experimental results show that under ideal conditions the translation distance error does not exceed 4 mm and the rotation angle error does not exceed 0.2°. However, the estimated pose still fluctuates over a relatively wide range, and further optimization is needed.