Visual identification and pose estimation algorithms of nut tightening robot system

To realize the automatic tightening of nuts, a nut tightening robot system strategy based on machine vision was designed in this work. The strategy consists of three stages: nut image calibration, nut identification, and nut pose estimation. In the first stage, the template pose image of the nut and the coordinates of the nut center in this image were obtained by calibration. In the second stage, a nut identification algorithm that improves the backbone feature extraction network and region proposal network of Faster-RCNN was presented, which improved the efficiency and accuracy of nut identification. In the last stage, a nut pose estimation algorithm based on Fourier and log-polar coordinate transformation was presented to solve for the rotation angle, translation, and scale of the nut relative to the template nut image and thereby obtain the rotation angle of the sleeve and the center coordinates of the nut. An experimental nut tightening robot platform was also set up in this work. The results of 50 tests showed that the proposed detection method could identify nuts with 100% accuracy, and with the proposed pose estimation method, the average error of the rotation angle of the nut was 0.057°, the average error of the center position of the nut in the x and y directions was ± 0.05 mm, and that in the z direction was ± 0.5 mm. The experimental results showed that the nut tightening robot scheme and algorithms designed in this work were feasible for nut identification and pose estimation and met the insertion accuracy requirements of the nut tightening process.


Introduction
Bolt connection is the most common connection method used in assembly. It is widely used in automotive parts, the petrochemical industry, aerospace, power equipment, and other fields. The efficiency and quality of nut tightening directly affect the reliability of the bolt connection and the efficiency of assembly. Realizing high precision and high universality is a difficult problem in the field of automatic nut tightening. At present, the nut tightening process is mainly carried out semi-automatically, which has the disadvantages of low efficiency, high labor cost, and high labor intensity. In addition, under harsh working conditions such as high temperature, high pressure, and nuclear power environments, manual nut tightening cannot be carried out. Therefore, it is very important to develop a solution for automatic nut tightening with high precision and good robustness.
At present, the mainstream methods of target identification fall into two categories: methods based on image features and methods based on deep learning [1]. Zou et al. [2] presented a new spherical feature descriptor for spherical target identification, obtained by extracting and fitting the contour and then processing the convex hull. Cristian et al. [3] presented a 3D object recognition algorithm based on a 2D image gradient histogram. In recent years, deep learning has become the mainstream technology of target identification. Classic one-stage target detection algorithms include the SSD series, YOLOv1, YOLOv2, YOLOv3, and YOLOv4, and two-stage target detection algorithms include RCNN, SPPNet, Fast RCNN, Faster RCNN, and Mask RCNN [4][5][6][7][8]. More and more companies and scientific research institutions apply deep learning target identification algorithms to robots and implement them in industrial production. Wang et al. [9] proposed a point cloud-based end-to-end deep learning method, PointNetRGPE, to estimate the grasping poses of a SCARA robot. In the PointNetRGPE model, the point cloud and class number are fused into a point-class vector, and several PointNet-like networks are used to estimate the robot grasping pose, comprising a 3D translation and a 1D rotation. Wang et al. [10] proposed a pose estimation system that can meet the accuracy requirements of robotic spray painting. The image of the target is segmented using a modified U-SegNet segmentation network, and the resulting segmentation is registered with pre-scanned model candidates using iterative closest point (ICP) registration to obtain the estimated pose. The traditional single pose estimation architecture based on deep learning is mature in low-precision scenarios such as robot grasping and spraying, but problems such as low positioning accuracy and poor robustness remain in the nut assembly field, which has high precision requirements.
At present, great progress has been made in image target pose acquisition methods, and relevant research results have been applied to robot vision [11][12][13][14]. The mainstream pose identification methods for workpiece targets are based either on image features or on template matching. In view of the slow detection speed of traditional feature point extraction algorithms, Calonder et al. [15] presented the BRIEF (binary robust independent elementary features) algorithm. Soetedjo and Yamada [16] presented a pose estimation algorithm based on a three-dimensional arc convex target for vehicle pose detection. Yoo and Han [17] presented a template matching algorithm based on NCC using image Fourier transform, correlation analysis, and image integration. Gavrila and Philomin [18] proposed a template matching algorithm based on distance transformation and object shape for pose recognition of objects with different shapes. These pose identification methods lack the robustness and accuracy required in part assembly with high accuracy requirements. A bin picking solution based on a modified RANSAC approach [19] and a fast object registration method [20] have also been developed. The former is more robust but less accurate and also time-consuming; the latter is fast but accurate only to within ± 2 mm. The robotic bin picking solutions in [21] are accurate and fast, but they use a feature-based object detection method that requires the feature to be visible to the vision system at all times. Oh et al. [22] developed a bin picking method for automotive sub-assembly automation, which identifies and localizes objects by fitting primitives such as planes to the surface of the object. This method has a very long process runtime and also has large errors in estimating the object pose: more than 10 mm in translation and 1.5° in rotation. Moreover, it is only suitable for components that can be decomposed into geometric primitives.
The pose alignment is essential before assembly [23][24][25][26][27], and vision-based pose alignment methods are widely investigated [24]. Liu et al. [25] proposed a coarse-to-fine method for the pose alignment of long cylindrical components. Shen et al. [26] designed an automatic alignment system to align the viewing samples from multiple directions with the nanorobot's rotation axis. Liu et al. [28] proposed a two-stage pose alignment method, in which orientation and position alignments are conducted in sequence. Although the methods mentioned above realize pose alignment, the efficiency is low.
The above research work has investigated the pose estimation and alignment algorithms themselves in depth but has not considered the influence of interference, modeling error, and various uncertainties in a real system. Therefore, it is also crucial to introduce nonlinear control algorithms into the robot tightening system [29, 30]. Some scholars have put forward advanced optimal control algorithms for hydraulic drives. Nedic et al. [31] proposed a new cascade load force control design for a parallel robot platform, in which a parameter search based on the firefly algorithm (FA) is suggested to effectively tune the parameters of the cascade controller. With the rapid development of industrial informatization and deep learning technology, modern data-driven fault diagnosis methods based on deep learning have been receiving attention from industry. Tao et al. [32] designed a parameter optimization and feature metric-based fault diagnosis method with few samples for the problems of sparse fault samples and cross-domain datasets in real industrial environments. In addition, Zhuang et al. [33] proposed an effective iterative learning control (ILC) approach based on a successive projection scheme, whose design objective considers the mathematical expectation of the tracking error to evaluate task performance. Such algorithms can be introduced into the robot tightening system, and the performance of the proposed algorithms is verified using a numerical model.
Our motivation is to achieve identification and pose estimation for a nut tightening robot. We mainly focus on identifying the nut and estimating its pose so that the robot can insert the sleeve into the nut accurately. The main contributions of this work are as follows.
(1) A nut tightening robot system is developed for precise insertion of the sleeve into the nut during nut tightening. In the classic nut tightening process, identification and pose alignment are completed manually and only the torque application is automatic; the proposed system has higher flexibility than the existing semi-automatic process.
(2) A vision-guided nut tightening robot system strategy is proposed, which consists of three phases: calibrating, identifying, and aligning. In the nut image calibration phase, the template pose image of the nut and the coordinates of the nut center in this image are obtained by calibration. In the nut identification phase, a nut identification algorithm that improves the backbone feature extraction network and region proposal network of Faster-RCNN is proposed, which improves the efficiency and accuracy of nut identification.
(3) A pose alignment strategy based on Fourier and log-polar coordinate transformation is proposed. The efficiency and accuracy are improved compared to classic nut tightening methods. Finally, the feasibility of this system strategy is demonstrated by analyzing all the measured data from the experimental equipment.

Nut tightening robot system
A nut tightening robot system is designed as shown in Fig. 1, which mainly consists of an industrial robot, a depth camera, a pneumatic wrench, and a laser ranging sensor. The industrial robot is a serial robot with six degrees of freedom. The pneumatic wrench is the actuator that applies the torque required to tighten the nut; it is installed on the end effector and supplied with pressure by an air compressor. The design of the fixed bracket of the pneumatic wrench ensures that the center of the spindle of the pneumatic wrench is aligned with the center of the flange at the end of the robot. The depth camera is fixed at the end of the robot and adopts the eye-in-hand working mode. The hand-eye relationship is obtained through calibration and is used to identify the nut and estimate its pose. The laser ranging sensor is installed on the robot end effector to estimate the depth in the z direction during the nut insertion process. The position relationship between the luminous point of the laser sensor and the sleeve of the robot end effector is obtained through calibration. The translational and rotational accuracies of recognition should be less than 1 mm and 1°, respectively, to comply with the specifications required for industrial applications. Hand-eye calibration is not the focus of this study, so it is not described in detail.

Image calibration of nut template orientation
The nut tightening robot system strategy proposed in this work is based on a nut template orientation image, i.e., an image in which the nut is axisymmetric. By calculating the angle transformation between the current nut and the template, the robot end sleeve is rotated by the corresponding angle to realize pose matching between the sleeve and the nut. The accuracy of the nut template orientation directly affects the attitude transformation accuracy of subsequent nuts, so the nut orientation must be calibrated. The orientation calibration process of the nut template is as follows. The end effector of the robot moves to pose F, maintains the end sleeve pose, and moves directly above the nut. The pose of the nut is adjusted so that the sleeve can be accurately inserted into it, and the end sleeve of the robot is then moved out of the nut. The robot moves the camera to 200 mm above the nut (at this position the nut image collected by the camera is relatively complete and within the depth measurement range of the camera), and the camera captures the pose image of the nut.
Then, the nut image is processed with OpenCV to obtain a grayscale, denoised nut image. Because the gray values of the background differ between nut images, the image is binarized with a manually selected binarization threshold, and the image edges are then extracted. The Hough transform is used to detect all the straight lines in the image, and the desired line angle is filtered manually; this angle is the rotation of the nut relative to the template posture. Finally, the corrected nut template orientation image is obtained by affine transformation, and the corrected nut image is cropped to 380 × 380 pixels (the value ensures that the nut is within the image and changes according to nut type). The calibration process is shown in Fig. 2, and a code sketch of the pipeline is given below.
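The following listing is a minimal OpenCV sketch of this calibration pipeline, under stated assumptions: the threshold value, the Hough parameters, and the rule for selecting the desired line are illustrative placeholders (in this work they are filtered manually), and the mapping from the detected line angle to the image rotation is chosen per nut type.

```python
# Minimal sketch of the template orientation calibration (assumptions noted above).
import cv2
import numpy as np

def calibrate_template_orientation(image, thresh=120, crop=380):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)               # denoise
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    edges = cv2.Canny(binary, 50, 150)                     # edge extraction
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 80)      # all straight lines
    # In the paper the desired line is filtered manually; the first detected
    # line is taken here purely for illustration.
    _, theta = lines[0][0]
    angle = np.degrees(theta)                              # nut rotation vs. template
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    corrected = cv2.warpAffine(image, M, (w, h))           # affine correction
    cx, cy = w // 2, h // 2                                # crop 380x380 around centre
    return corrected[cy - crop // 2:cy + crop // 2, cx - crop // 2:cx + crop // 2]
```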
The rotation angle θ₁ of the nut is converted to the robot end sleeve through the hand-eye relationship, giving the rotation angle θ₂ of the sleeve. After the robot end is rotated by θ₂, the sleeve posture that allows insertion into the template nut is obtained. The system thus identifies the rotation angle of the nut relative to the nut template posture and then rotates the sleeve to match the nut posture.

Nut template image center position calibration
The nut tightening robot system strategy proposed in this work is based on a template nut image with a known center pixel location of the nut, so the center pixel location of the template nut image needs to be calibrated. For this purpose, a method for calculating the center position of the template nut image based on the largest inscribed circle is proposed in this work. The algorithm must traverse all contour pixels of the image, which makes the calculation slow. Moreover, the binarization threshold must be filtered and the inner circle identified manually, so the method is not robust; it is only suitable for calibration and cannot be used for online identification of the nut center.
The nut image is processed with OpenCV to obtain a grayscale, denoised nut image, and an appropriate binarization threshold is manually filtered out. The image is then binarized so that the nut background becomes white. Contour detection is performed on the binary image, and the detected contours are traversed. For each pixel in the image, a point-to-polygon test gives the distance from the pixel to the current contour. The distance from each pixel to the contour is stored in a Mat object; the largest distance value is the radius of the largest inscribed circle, and its location is the center coordinate of the largest inscribed circle. The image processing steps that yield the calibration results are shown in Fig. 3, and a code sketch follows.
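A minimal OpenCV sketch of this calibration, assuming the binarization threshold has already been chosen manually; contour selection by largest area is an illustrative stand-in for the manual identification described above.

```python
# Minimal sketch of the largest-inscribed-circle centre calibration.
import cv2
import numpy as np

def template_center(image, thresh=120):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)       # manual in the paper
    h, w = gray.shape
    dist = np.zeros((h, w), np.float32)
    for y in range(h):                                 # traverse every pixel (slow,
        for x in range(w):                             # hence calibration-only)
            dist[y, x] = cv2.pointPolygonTest(contour, (x, y), True)
    # Maximum distance = radius of the largest inscribed circle;
    # its location = centre pixel of the nut.
    _, radius, _, center = cv2.minMaxLoc(dist)
    return center, radius
```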
The maximum inner circle of the nut can be well identified by filtering the binarization threshold, and the center pixel coordinate of the nut is (189.001, 192.187). Since the obtained pixel coordinates are decimals, the nearest neighbor interpolation method is used in this paper to obtain the center pixel coordinate of the nut in the template nut image as (189,192). Meanwhile, the calibration method is also applicable to nuts of different colors and types. The orientation and center position of the nut can be obtained by adjusting the binary threshold value in the above calibration process.

Nut recognition algorithm
The result of nut recognition directly determines whether the pose of the nut is estimated and how accurate the pose estimation is. In order to precisely locate the center of the nut, it is necessary to identify the nut on a nut installation plane without texture or color features. Based on the classic Faster-RCNN algorithm, an improved nut recognition method is presented below.

Nut recognition based on improved Faster-RCNN
Faster-RCNN is a classic two-stage target detection algorithm. Compared with one-stage target detection algorithms, two-stage algorithms are more complex and have higher detection accuracy, but their training and identification speed is slower [34][35][36]. Faster-RCNN adds an RPN (region proposal network) to Fast-RCNN; the region proposal network integrates proposal (candidate box) extraction, bounding box regression, and classification into one network, which greatly improves the overall performance [37]. This paper takes the nut as the detection object; the improved Faster-RCNN detection process is shown in Fig. 4. Before the nut image is input to Faster-RCNN, the short edge of the image is resized to 600 pixels with the aspect ratio unchanged, so the image is not distorted at the network input.
First, the backbone (backbone feature extraction network) is used to extract features from the image, and ResNet50 is used as the improved backbone. The feature map (shared feature layer) is obtained after the image passes through the backbone feature extraction network; in this paper, the size of the feature map is 38 × 38 × 1024, and the subsequent operations are carried out on this feature map. As can be seen from the figure, the shared feature layer feeds two branches: the left branch is a 3 × 3 convolution, and the right branch is the ROIPooling (region of interest pooling) layer.
The left branch is the improved region proposal network. The 3 × 3 convolution result is convolved twice by 1 × 1 convolutions with 6 and 12 channels, respectively; the 6 channels can be split into 2 × 3 and the 12 channels into 4 × 3. This split is related to the shared feature layer: its shape of 38 × 38 × 1024 is equivalent to dividing the input image into 38 × 38 grids with 3 prior (anchor) boxes per grid. The predictions of the two 1 × 1 convolutions judge whether a prior box really contains an object and then adjust the prior box into a proposal. The 2 × 3 convolution judges whether the interior of the prior box really contains an object: the number 2 represents the probabilities that the prior box is background and that it is the interior of an object. The 4 × 3 convolution predicts the adjustment of each prior box at each grid point on the shared feature layer: the number 4 represents the adjustment parameters of the prior box, since four parameters suffice to determine a box. These four parameters adjust the prior box to obtain the proposal.
After the two convolutions, the proposals are obtained and combined with the shared feature layer. The ROIPooling layer crops the shared feature layer using the proposals. Because the obtained proposals differ in size, the local feature layers cropped from the shared feature layer also differ in shape; the ROIPooling layer therefore pools all local feature layers region-wise so that they all have the same shape. All the resulting local feature layers are then used for classification prediction and regression prediction. The regression prediction adjusts the size of the proposal to obtain the final prediction box. The classification prediction judges whether the interior of the proposal really contains an object and of which type. Finally, the prediction box is obtained and the type of object within it is determined. The improvements are described below.

Backbone network section
The original Faster-RCNN algorithm used VGG-16 [38] as the backbone network. VGG-16 was proposed by Simonyan and Zisserman in 2014. The network is deep and has wide feature maps, so it is suitable for classification and positioning on large datasets. Its disadvantages are also obvious: it has a large number of parameters, consumes a lot of computing resources, and cannot solve the problems of vanishing gradients and network degradation as the network deepens. Therefore, ResNet50 [39] is used in this work as the backbone feature extraction network in place of VGG-16. By adding residual connections, ResNet50 feeds the output of earlier layers directly into the input of later layers, so that the content of a later feature layer is partially contributed by an earlier one.
For deep networks, if all the later layers were identity mappings, the deep network would degenerate into a shallow network, but it is difficult for layers to fit the identity mapping function H(x) = x directly. ResNet50 therefore uses residual blocks. ResNet50 has two basic blocks, the Conv Block and the Identity Block, both of which are residual structures. The biggest difference between them is whether there is convolution on the residual edge: the residual edge of the Conv Block carries a convolution and batch normalization, while the residual edge of the Identity Block carries no operation. When there is convolution on the residual edge, the dimension of the output feature layer can be changed by changing the stride and the number of channels of the convolution; when there is no convolution, the input and output dimensions are the same. Therefore, the function of the Conv Block is to change the dimension of the network, and the function of the Identity Block is to deepen the network. A sketch of the two blocks is given below.
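The following is a minimal PyTorch sketch of the two residual blocks, following the standard ResNet50 bottleneck design; it illustrates the structure described above and is not code from this work.

```python
# Minimal ResNet50-style bottleneck: conv_shortcut=True gives the Conv Block
# (convolution + batch norm on the residual edge, output dimension can change),
# conv_shortcut=False gives the Identity Block (bare skip connection).
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, stride=1, conv_shortcut=False):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, stride, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        if conv_shortcut:                      # Conv Block residual edge
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))
        else:                                  # Identity Block residual edge
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))
```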
The network layer used in the backbone feature extraction network of the improved Faster-RCNN is shown in Fig. 5. The input image is set to 600 × 600. This part only includes four compressions of the network dimension. After the last layer, the shared feature layer will be output, and it will be compressed for the fifth time in ROIPooling.

Region proposal network part
The original Faster-RCNN algorithm adds an attention mechanism, relying mainly on the RPN to tell the network where to look. It slides a 3 × 3 window over the feature map in order to generate proposals. This 3 × 3 window corresponds to a large receptive field in the original image; with VGG as the backbone feature extraction network, the corresponding receptive field is 228 × 228 pixels. An anchor judges whether there is an object according to this 228 × 228 region, and multiple anchor boxes are mapped to it. The classification and regression of the proposals are performed within the receptive field. There are 9 anchor boxes in each receptive field, combining three areas {128 × 128, 256 × 256, 512 × 512} and three aspect ratios {1:1, 1:2, 2:1}; some are larger than 228 pixels and some smaller, representing objects of different sizes and shapes, and objects are finally extracted based on these anchor boxes.
The improved region proposal network is shown in Fig. 6. The size of the feature map obtained through the ResNet backbone feature extraction network is 38 × 38 × 1024, so a 3 × 3 convolution with 1024 channels is first performed on the feature map. Each anchor after convolution corresponds to a receptive field of 176 × 176. In this paper, nuts are identified as a single class, and the proposal of a nut can be assumed to be square. Therefore, in order to reduce the training time and improve the identification speed, the original 9 anchor boxes of different shapes and sizes in each receptive field are changed to 3 square anchor boxes of sizes {128 × 128, 256 × 256, 512 × 512}. Two 1 × 1 convolutions are then performed with 2 × 3 = 6 and 4 × 3 = 12 channels: the 3 in 2 × 3 corresponds to the 3 anchor boxes in each grid, and the 2 judges whether an anchor box really contains an object; the 3 in 4 × 3 likewise corresponds to the 3 anchor boxes in each grid, and the 4 corrects the coordinates of the anchor box to obtain the proposal. The improved anchor boxes are shown in Fig. 7, which is equivalent to a 600 × 600 image with a 38 × 38 grid on it. The blue points in the image are the grid centers, and each grid center carries three square anchor boxes of different sizes. The predictions of the proposal network determine whether these boxes contain objects and then adjust the boxes to obtain the proposals. A sketch of this head is given below.
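A minimal PyTorch sketch of the improved RPN head and the square anchor layout described above; the module and function names are illustrative, not the authors' code.

```python
# Improved RPN head: 3x3 conv on the 38x38x1024 shared feature layer, then two
# 1x1 convs with 2*3=6 (background/object score) and 4*3=12 (box offsets)
# channels for the 3 square anchors per grid cell.
import torch
import torch.nn as nn

ANCHOR_SIZES = (128, 256, 512)                 # 3 square anchor boxes

class ImprovedRPNHead(nn.Module):
    def __init__(self, in_ch=1024, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.cls = nn.Conv2d(in_ch, 2 * num_anchors, 1)   # object vs. background
        self.reg = nn.Conv2d(in_ch, 4 * num_anchors, 1)   # box adjustment params

    def forward(self, feat):                   # feat: (B, 1024, 38, 38)
        x = torch.relu(self.conv(feat))
        return self.cls(x), self.reg(x)        # (B, 6, 38, 38), (B, 12, 38, 38)

def square_anchors(grid=38, image=600, sizes=ANCHOR_SIZES):
    # Three square anchors centred on every cell of the 38x38 grid,
    # expressed as (x1, y1, x2, y2) in 600x600 image coordinates.
    stride = image / grid
    boxes = []
    for i in range(grid):
        for j in range(grid):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            boxes += [(cx - s / 2, cy - s / 2, cx + s / 2, cy + s / 2)
                      for s in sizes]
    return boxes                               # 38 * 38 * 3 = 4332 anchors
```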
The training process of Faster-RCNN is divided into two parts. First, the proposal network is trained to obtain the proposal predictions. Second, the ROI (region of interest) network is trained, and the predicted positions of the proposal network are cropped; in effect, the previously predicted proposals serve as anchor boxes for the ROI network, which produces the corresponding prediction results. The post-processing NMS (non-maximum suppression) module and the loss function in this paper are taken from the original Faster-RCNN algorithm, so they are not described further.

Model training
The experiment used 5000 nut images as the dataset, comprising 4500 training images and 500 validation images. The dataset was trained on an NVIDIA Quadro P5000AQ3 graphics card. The training is divided into two stages: freezing and unfreezing. During the freezing stage, the network backbone is frozen and the feature extraction network does not change, so the GPU memory usage is small; the batch size is therefore set to 30 and the learning rate to 1 × 10⁻⁴, and 400 epochs are trained. During the unfreezing stage, the network backbone is unfrozen and the feature extraction network changes, which uses a large amount of GPU memory; the batch size is therefore set to 5 and the learning rate to 1 × 10⁻⁵, and 200 epochs are trained. The change of the two loss values during training is shown in Fig. 8, and a sketch of the schedule follows.
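A minimal PyTorch-style sketch of the two-stage schedule with the reported hyper-parameters; model, model.backbone, and train_one_epoch are assumed placeholders for the detector, its feature extraction network, and one pass over the training set.

```python
# Two-stage training: freeze the backbone first, then unfreeze it.
import torch

def set_backbone_frozen(model, frozen):
    for p in model.backbone.parameters():
        p.requires_grad = not frozen

def two_stage_training(model, train_one_epoch):
    # Stage 1: backbone frozen, batch size 30, learning rate 1e-4, 400 epochs.
    set_backbone_frozen(model, True)
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad),
                           lr=1e-4)
    for _ in range(400):
        train_one_epoch(model, opt, batch_size=30)
    # Stage 2: backbone unfrozen, batch size 5, learning rate 1e-5, 200 epochs.
    set_backbone_frozen(model, False)
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(200):
        train_one_epoch(model, opt, batch_size=5)
```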
There are two loss curves in Fig. 8: the solid line is the classification loss $L_{cls}$ and the dashed line is the regression loss $L_{reg}$. The classification loss uses the log loss, and the regression loss uses the robust (smooth L1) loss. Figure 8 shows that after 400 epochs the loss drops very slowly, and training is completed. The mAP, F1, recall, and precision values of the training results are shown in Table 1.
AP averages the precision of each category over the precision-recall curve (an area measure), and mAP is the mean of the AP values over all categories. In this paper there is only the single nut class, so AP equals mAP. Precision represents the proportion of results predicted as positive by the model that are actually positive. Recall represents the proportion of all actually positive samples that the model predicts correctly. F1 comprehensively considers precision and recall. FPS represents the frames per second.

Nut pose estimation algorithm
Using the nut recognition algorithm in the previous section, the nut can be identified and its coordinates in the pixel coordinate system obtained. The depth camera converts the pixel coordinates into three-dimensional coordinates in the world coordinate system, and the robot end is controlled to move to about 200 mm above the nut to achieve initial positioning. At this point the camera is directly above the nut, and the Faster-RCNN algorithm is used to identify it. As can be seen from Fig. 9, when the camera faces the nut directly, the Faster-RCNN algorithm has a better recognition effect and can accurately identify the nut image. The recognition result of Faster-RCNN is used to crop the nut prediction box to 380 × 380 pixels. In this section, a nut pose estimation algorithm based on Fourier and log-polar coordinate transformation is proposed, using the extracted nut image and the template nut image.
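A minimal sketch of the initial positioning step, assuming a pinhole camera model: the intrinsics fx, fy, cx, cy and the camera-to-base transform T_base_cam are placeholders obtained from camera and hand-eye calibration.

```python
# Back-project the nut centre pixel to a 3D point in the robot base frame.
import numpy as np

def pixel_to_base(u, v, depth, fx, fy, cx, cy, T_base_cam):
    # Pinhole back-projection from pixel (u, v) at the measured depth.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    p_cam = np.array([x, y, depth, 1.0])       # homogeneous camera-frame point
    # 4x4 homogeneous transform from camera frame to robot base frame,
    # composed from the hand-eye calibration and the current robot pose.
    return (T_base_cam @ p_cam)[:3]
```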

Two-dimensional discrete Fourier transform
The Fourier transform is a tool for decomposing a waveform (a function or signal) into an alternative representation characterized by sines and cosines [40]. In the physical world, most signals are finite aperiodic signals, and computers can only process discrete signals of finite length, hence the discrete Fourier transform (DFT). The fast Fourier transform (FFT) is used in practical computations to compute the DFT efficiently [41]. Taking the Fourier transform of an image means computing the 2D DFT of the gray data of the image, transforming the image from the spatial domain to the frequency domain, and analyzing and processing the Fourier spectral characteristics of the image. The frequency of an image is the gradient with which the gray value changes in image space: low-frequency regions are pixels with gentle grayscale changes, and high-frequency regions are pixels with sharp grayscale changes.
The two-dimensional Fourier transform is defined as follows. Given a function f(x, y) representing the gray value of point (x, y) in the pixel coordinate system, its 2D Fourier transform is

$$F(u, v) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(x, y)\, e^{-j2\pi(ux + vy)}\, \mathrm{d}x\, \mathrm{d}y$$

where F(u, v) is the amplitude value of f(x, y) in frequency space; (u, v) and (x, y) in the image are not in one-to-one correspondence. The 2D inverse Fourier transform is

$$f(x, y) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} F(u, v)\, e^{j2\pi(ux + vy)}\, \mathrm{d}u\, \mathrm{d}v$$

For an image matrix of size M × N, the 2D DFT is defined as

$$F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j2\pi\left(\frac{ux}{M} + \frac{vy}{N}\right)}$$

where u = 0, 1, …, M − 1, v = 0, 1, …, N − 1, (x, y) is the coordinate position of the pixel in the pixel coordinate system, and (u, v) is the frequency value. The 2D inverse DFT is

$$f(x, y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u, v)\, e^{j2\pi\left(\frac{ux}{M} + \frac{vy}{N}\right)}$$

From the 2D DFT, the spectrum, power spectrum, and phase angle can be obtained:

$$|F(u, v)| = \sqrt{R^2(u, v) + I^2(u, v)}, \qquad P(u, v) = |F(u, v)|^2, \qquad \varphi(u, v) = \arctan\frac{I(u, v)}{R(u, v)}$$

where R(u, v) is the real part of F(u, v) and I(u, v) is the imaginary part of F(u, v).
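A minimal NumPy sketch computing the quantities just defined for a grayscale image (the spectrum is shifted so the zero frequency sits at the centre, as in the spectrograms later in this section).

```python
# 2D DFT of a grayscale image: magnitude spectrum, power spectrum, phase angle.
import numpy as np

def dft_features(gray):                        # gray: (M, N) float array
    F = np.fft.fftshift(np.fft.fft2(gray))     # 2D DFT, zero frequency centred
    R, I = F.real, F.imag
    magnitude = np.sqrt(R**2 + I**2)           # spectrum |F(u, v)|
    power = magnitude**2                       # power spectrum
    phase = np.arctan2(I, R)                   # phase angle
    return magnitude, power, phase
```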

Phase correlation method
The phase correlation method was proposed by Kuglin et al. [42] and is used to find the translation between two images. In image information, the correlation between two images is obtained by convolving them, and the result of the convolution measures the similarity between the images. After the discrete Fourier transform of an image, the modulus and phase information of the Fourier spectrum are obtained: the modulus carries the gray-level information of the image, and the phase carries the positional change information. The phase correlation is calculated from the phase information, from which the changes between the images can be obtained. The phase correlation method is based on the Fourier shift theorem and exploits the properties of the phase spectrum, so the translation between images can be found.
Let the image f₀(x, y) be f(x, y) translated by x₀ and y₀ in the x and y directions, respectively, i.e.,

$$f_0(x, y) = f(x - x_0,\; y - y_0)$$

Their corresponding Fourier transforms F₁(u, v) and F₀(u, v) are then related by

$$F_0(u, v) = F_1(u, v)\, e^{-j2\pi(ux_0 + vy_0)}$$

The cross-power spectrum of the two images f(x, y) and f₀(x, y) with Fourier transforms F₁(u, v) and F₀(u, v) is defined as

$$\frac{F_0(u, v)\, F_1^{*}(u, v)}{\left| F_0(u, v)\, F_1^{*}(u, v) \right|} = e^{-j2\pi(ux_0 + vy_0)} \tag{10}$$

where F₁* is the complex conjugate of F₁. The shift theorem guarantees that the phase of the cross-power spectrum is equivalent to the phase difference between the images. Taking the inverse Fourier transform of Eq. (10) yields a function that is an impulse at (x₀, y₀); that is, it is approximately zero everywhere except at the displacement needed to optimally register the two images. By searching the pulse position, the translation x₀ and y₀ between the two images is obtained (Fig. 10).
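A minimal NumPy implementation of the phase correlation step (OpenCV's cv2.phaseCorrelate provides the same with sub-pixel refinement); the wrap-around handling follows the periodicity of the DFT.

```python
# Phase correlation: normalized cross-power spectrum -> impulse at (x0, y0).
import numpy as np

def phase_correlation(f, f0):
    F1 = np.fft.fft2(f)
    F0 = np.fft.fft2(f0)
    cross = F0 * np.conj(F1)
    cross /= np.abs(cross) + 1e-12             # keep only the phase difference
    impulse = np.abs(np.fft.ifft2(cross))      # ~zero except at the displacement
    y0, x0 = np.unravel_index(np.argmax(impulse), impulse.shape)
    h, w = impulse.shape                       # shifts beyond half the image
    if y0 > h // 2: y0 -= h                    # size wrap around (periodicity)
    if x0 > w // 2: x0 -= w
    return x0, y0
```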

Nut pose estimation based on Fourier and log-polar coordinate transformation
In order to insert the sleeve into the nut accurately, a nut pose estimation method based on the Fourier transform and log-polar coordinate transformation is proposed in this paper. The rotation angle, translation, and scale of the nut relative to the template nut image are found, and finally, the rotation angle of the sleeve and the center coordinates of the nut are determined.
Assume first that there is only translation and rotation between the template nut image and the current nut image, with no scale change. If f₂(x, y) is a translated and rotated replica of f₁(x, y) with translation (x₀, y₀) and rotation θ₀, then

$$f_2(x, y) = f_1(x\cos\theta_0 + y\sin\theta_0 - x_0,\; -x\sin\theta_0 + y\cos\theta_0 - y_0)$$

According to the Fourier translation and rotation properties, the transforms of f₁ and f₂ are related by

$$F_2(u, v) = e^{-j2\pi(ux_0 + vy_0)}\, F_1(u\cos\theta_0 + v\sin\theta_0,\; -u\sin\theta_0 + v\cos\theta_0) \tag{12}$$

Let M₁ and M₂ be the magnitudes of F₁ and F₂. From Eq. (12), we have

$$M_2(u, v) = M_1(u\cos\theta_0 + v\sin\theta_0,\; -u\sin\theta_0 + v\cos\theta_0) \tag{13}$$

From Eq. (13), it is easy to see that the magnitudes of both spectra are the same, but one is a rotated replica of the other. The rotation (without translation) can be recovered with phase correlation by representing it as a translational displacement in polar coordinates:

$$M_2(\rho, \theta) = M_1(\rho,\; \theta - \theta_0) \tag{14}$$

Applying phase correlation (Sect. 5.2) to Eq. (14), the angle θ₀ can easily be found.
In the actual recognized nut image, the nut exhibits rotation, scale, and translation, so all three must be considered together when calculating the pose change of the nut. Since the nut images are taken from directly above the nut, the scale of the nut is equal in the x and y directions of the image. The nut pose estimation considering translation, rotation, and scale is obtained as follows. If f₂(x, y) is a scaled replica of f₁(x, y) with scale factor a in the x and y directions, then according to the Fourier scale property their Fourier transforms are related by

$$F_2(u, v) = \frac{1}{a^2}\, F_1\!\left(\frac{u}{a}, \frac{v}{a}\right)$$

Combining the translation, rotation, and scale properties (Eqs. (16)~(19)), if f₂(x, y) is a translated, rotated, and scaled replica of f₁(x, y), converting the axes of the Fourier magnitude spectra to log-polar coordinates reduces the scale to a translation (ignoring the multiplication factor 1/a²), which gives

$$M_2(\log\rho,\; \theta) = M_1(\log\rho - \log a,\; \theta - \theta_0) \tag{22}$$

i.e.,

$$M_2(\xi, \theta) = M_1(\xi - d,\; \theta - \theta_0) \tag{23}$$

where ξ = log ρ and d = log a.
Using Eqs. (22) and (23) and the phase correlation technique, the scale a and angle θ₀ can be found. Once the scale and angle are obtained, the high-resolution nut image is scaled and rotated by a and θ₀, respectively, and the translational movement is then found using the phase correlation technique. Finally, the rotation angle of the sleeve and the center coordinates of the nut are determined. A sketch of the full procedure is given below.
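A minimal sketch of the full estimation chain under stated assumptions: it reuses the phase_correlation helper above and OpenCV's cv2.warpPolar for the log-polar mapping; the sign conventions for the recovered angle and scale depend on the orientation of the log-polar axes and would be fixed empirically, and the magnitude spectrum leaves a 180° ambiguity in the angle.

```python
# Rotation/scale from log-polar magnitude spectra, then residual translation.
import cv2
import numpy as np

def log_polar_magnitude(gray):
    mag = np.abs(np.fft.fftshift(np.fft.fft2(gray))).astype(np.float32)
    h, w = mag.shape
    max_r = min(h, w) / 2
    lp = cv2.warpPolar(mag, (w, h), (w / 2, h / 2), max_r,
                       cv2.INTER_LINEAR + cv2.WARP_POLAR_LOG)
    return lp, w / np.log(max_r)               # image and log-radius scale Klog

def estimate_pose(template, current):
    lp_t, klog = log_polar_magnitude(template)
    lp_c, _ = log_polar_magnitude(current)
    dx, dy = phase_correlation(lp_c, lp_t)     # shift in the log-polar spectra
    angle = 360.0 * dy / lp_t.shape[0]         # vertical shift -> rotation (deg)
    scale = np.exp(dx / klog)                  # horizontal shift -> log(scale)
    # De-rotate and de-scale the current image, then find the translation.
    h, w = current.shape
    R = cv2.getRotationMatrix2D((w / 2, h / 2), -angle, 1.0 / scale)
    aligned = cv2.warpAffine(current, R, (w, h))
    x0, y0 = phase_correlation(aligned.astype(float), template.astype(float))
    return angle, scale, (x0, y0)
```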

Model performance analysis
The performance comparison between the improved Faster-RCNN model proposed in this paper and the original Faster-RCNN, under the same training set and training parameters, is shown in Table 2. The results show that the improved Faster-RCNN, which introduces the residual network, increases precision and mAP, so the recognition accuracy is improved. Second, the improved proposal network in the RPN reduces the number of anchor boxes used in training and prediction, so the amount of computation is reduced and the training and recognition speeds are greatly improved.

Model test results
The nut was identified based on the improved Faster-RCNN detection model in this paper. The recognition effect is shown in Fig. 11. It can be seen from the figure that the model shows relatively good detection performance for nuts of different scenes and sizes. Therefore, the improved Faster-RCNN algorithm has better recognition efficiency, higher accuracy, and better generalization performance. The four vertex coordinates of the prediction box can be obtained through the prediction results. The spatial position of the nut can be calculated by combining the central coordinates of the prediction box with the internal parameters of the depth camera, and the spatial position can be used as the initial positioning result of the nut. Next, the end of the robot was moved directly above the nut. Then, using the recognition results of Faster-RCNN at this time, the nut image was cropped according to the four vertex coordinates of the prediction box, and the pose estimation algorithm was used to further estimate the pose of the nut.

Experimental results and analysis of nut pose estimation
Using the nut recognition and initial positioning results in the previous section, the current nut posture image was obtained, as shown in Fig. 12. By taking the FFT of the current nut pose image and template nut pose image, the frequency domain image was obtained after centralization, as shown in Fig. 13.
The spectrograms of the template nut image and the current nut image in the Cartesian coordinate system are mapped to the log-polar coordinate system, as shown in Fig. 14.
The complex transformation of rotation between images is converted into a translation along the polar axis by log-polar mapping. Considering scale as well, the scale a and rotation θ₀ of the current nut pose image relative to the template nut pose image in Cartesian coordinates are mapped to translations in log-polar coordinates. By performing phase correlation on the log-polar coordinate spectra, the Fourier cross-power spectrum is obtained. The inverse FFT is then performed, and an impulse function is obtained at the point (x₀, y₀), as shown in Fig. 15. By searching for the extreme position of the impulse, the rotation and scale are obtained. Once the nut image has been derotated and descaled, any remaining translation between the two images can be estimated using the phase correlation method again.
The nut template pose calibration method proposed in Sect. 3 was used to obtain 10 nut images with known rotation angles and center positions. The nut pose estimation algorithm based on Fourier and log-polar coordinate transformation proposed in this paper was then used to find the rotation angle and nut center position of the 10 nut images. The comparison with the calibrated data is shown in Table 3; the average estimation error of the rotation angle of the nut was 0.057°.

Nut identification and positioning experiment
The result of nut recognition affects the initial positioning of the nut, and the result of initial positioning directly affects the proportion of the nut in the image. In order to keep the proportion of the nut in each cropped nut image as close as possible to that in the template nut image and so improve the accuracy of the nut pose estimation algorithm, the ideal cropping method is to recognize and crop the nut with the camera at the same position (200 mm) directly above the nut. Fifty experiments were carried out to test the accuracy of nut recognition and initial positioning before the robot tightened the nut; the depth camera was used to identify the nut, and the robot-mounted camera was moved to 200 mm above the nut. In each experiment, the moving camera collected the image of the nut at 200 mm, and the nut image was cropped using the recognition result to obtain the cropped nut image. The process of nut recognition and initial positioning is shown in Fig. 16, in which the robot moves from (a) to (d).
After 50 experiments, the experimental results of nut recognition and robot positioning 200 mm above the nut were obtained. The positioning results and errors are shown in Fig. 17.
It can be seen from Fig. 17 that Faster-RCNN is reliable for nut recognition and initial positioning. The nut was identified and positioned in all 50 experiments, giving a positioning success rate of 100%, and the robot-mounted camera was moved to the initial position each time. The average error of the initial positioning over the 50 experiments was 1.03 mm, which meets the requirements of the subsequent nut pose estimation.

Robot insert nut experiment
Using the nut image identified by the initial positioning in the previous section, combined with the nut pose estimation algorithm, 50 sleeve-nut pose matching experiments were carried out. The center pixel position of the nut was identified in each experiment, and the center coordinate of the nut was transformed from the image coordinate system to the robot coordinate system. The sleeve was inserted into the nut by adjusting the center position of the robot end effector sleeve (orientation adjustment) to align the center of the sleeve with the center of the nut. During the nut insertion process, the robot moved in Cartesian space at a speed of 0.01 m/s. The depth information of the insertion process was obtained in real time using the laser ranging sensor. The sleeve was moved to a distance of 3 mm from the nut mounting platform, at which point the insertion was completed. The insertion process is shown in Fig. 18. The error of the calculated nut center position in the x and y directions for each insertion in the 50 experiments is shown in Fig. 19, and the average errors Δx and Δy in the x and y directions were ± 0.05 mm.
After the 50 robot insertion experiments, the average distance between the sleeve and the nut installation plane measured by the laser ranging sensor was 3.4 mm. The depth error in the z direction of each experiment is shown in Fig. 20, and the average error Δz in the z direction was ± 0.5 mm (Table 4). In the 50 repeated insertion experiments, the robot successfully inserted the sleeve into the nut in 47 experiments, with an average distance of 3.4 mm between the sleeve and the nut installation plane, and the insertion failed in 3 experiments. The experimental results show that the nut recognition and pose estimation scheme designed in this paper can reliably insert onto the nut and then tighten it, although the experimental outcome fell slightly short of expectations. Analysis of the failures showed that two insertion failures were due to failure of the nut center positioning: when the algorithm in Sect. 5.3 was used to estimate the nut center position, the scale solution was inaccurate owing to external light and the nut itself, so the calculated nut center position deviated. The remaining failure was a pose matching failure caused by inaccurate estimation of the nut rotation angle.
Finally, five nuts of different sizes and colors were selected for the above calibration tests to obtain the template position and orientation of each type of nut. The accuracy and universality of the proposed scheme were tested by solving the translation and scaling between the template position and orientation images and then performing nut insertion comparison experiments. The experimental results are shown in Table 5. As can be seen from the results, the larger the nut size, the greater the error in the x, y, and z directions, but all errors are within the allowable range for insertion. The results show that the nut calibration and pose estimation schemes proposed in this paper have good robustness and versatility for nuts of different types and colors.

Conclusion
In this work, the authors presented the development of new identification and pose estimation algorithms, tested on a robot system in a nut tightening environment.
A nut tightening robot system strategy based on machine vision was introduced in this article. A demonstrator rig was presented, comprising a robot, a depth camera, a tightening tool, and a laser ranging sensor, and a series of laboratory experiments were carried out. The results of 50 tests showed that the proposed detection methods can identify nuts with 100% accuracy, and with the proposed pose estimation methods, the average error of the rotation angle of the nut was 0.057°, the average error of the center position of the nut in the x and y directions was ± 0.05 mm, and that in the z direction was ± 0.5 mm. This is considerably more accurate than existing identification methods (reported in the literature as 20 mm and 1.5°) and is well within the range required for industrial applications (i.e., 1 mm and 1°). Furthermore, compared with existing methods, a two-stage pose estimation algorithm was proposed in this paper: the accuracy of identification and initial positioning by the improved Faster-RCNN algorithm reached 100%, and the pose of the nut was then accurately estimated by Fourier transform and log-polar transformation. The success rate of insertion reached 94%, which meets the requirements of practical industrial tightening.
The authors are not aware of any other similar method capable of identifying nuts and accurately estimating the nut pose; hence, a direct comparison of the performance and accuracy of the proposed approach with existing methods was not possible.
The main contributions of this work include the developed robotic nut tightening system with an industrial robot and the proposed nut identification and pose estimation strategies. The two important aspects of precision assembly, i.e., precision and efficiency, are improved compared to precision assembly systems with multiple manipulators, which helps realize batch precision assembly in industry. In particular, an automatic insertion strategy combining depth vision and laser ranging was proposed. The tightening method can tighten other types of nuts by replacing the sleeve, and it provides an effective method for automatic tightening of nuts. Experiments on nut recognition and pose alignment were carried out on the designed robot assembly platform, and the effectiveness of the proposed method was verified.
In future work, we will improve and extend our method, which currently has some weaknesses: for example, it is only suitable for flat-mounted nut workpieces, and we will adapt it to different nut connection scenarios. The accuracy and intelligence of the robotic assembly system also need to be further increased; in the future, high-precision motion stages will be combined with the industrial robot, and we will concentrate on assembly skill learning to improve the intelligence of the robotic assembly system. As a final point, the authors conclude that in almost every automated fastening method developed to date, including the one proposed here, the speed of certain fastening processes, such as tightening a nut on a complex low-volume and high-value component, may not be higher than manual operation by a worker. However, the existing demand for automation in this sector of industry is not only due to the need for a faster process: other factors such as recruiting, training, and avoiding human errors have proved very costly and could be mitigated by such automated systems. In terms of implementation and the added value for industrial applications, it was concluded that the proposed automated nut fastening system is potentially reliable enough to be implemented in an industrial environment; however, further tests on actual industrial components would be necessary to evaluate the robustness of the proposed system for industrial applications. Nonetheless, this research is a proof of concept that demonstrates to what extent automated technologies can be considered for some current manual tasks that rely heavily on operator skills. While the results of this work may not be recognized as a financially viable solution for some industries, they could help evaluate the potential of automation technologies in terms of speed, precision, and decision-making in various complex industrial scenarios.
Author contribution Zhou Yibang was a major contributor in writing the manuscript. All authors read and approved the final manuscript.

Funding
The authors wish to express their gratitude to the financial support from the National Natural Science Foundation of China (No. 51975213).

Data availability Not applicable.
Code availability Code is available on request.

Declarations
Competing interests The authors declare no competing interests.
Open access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.