Grasp Detection Based on Light-Weight Hierarchical Fusion Convolutional Neural Network

This paper presents a light-weight Hierarchical Fusion Convolutional Neural Network (HF-CNN) for grasp detection. The network mainly employs residual structures, atrous spatial pyramid pooling (ASPP) and encoding-decoding based feature fusion. Compared with typical grasp detection networks, the network in this paper greatly improves robustness and generalizability on detection tasks by extensively extracting feature information from the images. In our tests on the Cornell University dataset, we achieve 85.1% accuracy when detecting unknown objects.


Introduction
With the development of robotics in recent years, there has been an increasing focus on the ability of robots to perform tasks and interact with the environment. Among these capabilities, robot grasping, especially of unknown objects, plays a critical role as a basic robot function and is receiving more and more attention. However, many problems remain, such as how to grasp dynamic objects with high accuracy, how to detect a target object among many objects, and how to improve the robustness and generalizability of the grasping function. Therefore, more and more researchers have devoted themselves to the study of related algorithms and have proposed approaches to address these problems. [1,2] obtain the optimal gripping posture by calculating it from the geometric relationships of objects. Although this method can perform a variety of grasping tasks, it encounters difficulties with objects of complex shape. [3] proposed capturing objects by repeatedly extracting visual information and continually reconstructing the point cloud. This method solves the problem of grasping unknown objects to a certain extent, but it cannot adapt well to grasping tasks in different scenes, has low generalizability, and lacks the ability to select a grasp among multiple objects. [4] addressed these problems by simplifying the gripper into a C-shaped cylinder, which improves grasping efficiency; however, this method still does not substantially improve grasping accuracy.
All of the above solutions use mathematical theory based on the geometric information of the object. Their accuracy is therefore strongly affected by occlusion relationships between objects, ambient lighting and other conditions; they are also limited by object shape, lack generalization performance, and are not highly robust. On the other hand, with the development of deep learning, the advantages of convolutional neural networks in object detection and feature extraction are constantly being highlighted. Compared with traditional methods [5,6], convolutional neural networks adapt better to different environments with strong robustness, and adapt to objects of various shapes with strong generalizability. Convolutional neural networks also do not require the features to be learned to be pre-selected and, more importantly, they greatly improve the accuracy of object recognition and grasping. With the creation of new types of networks, this accuracy is still improving, and such networks are increasingly becoming an important part of the grasping function.
Therefore, convolutional neural networks are gradually becoming the choice of many researchers faced with object recognition and grasping problems. [7] used this learning-based recognition approach to design a convolutional neural network based on dilated convolution that can perform tasks such as static and dynamic object recognition and multiple-object recognition. However, that convolutional neural network also has several problems. First, its accuracy is not high, its convergence is slow, and its robustness to object recognition in different environments is low. Secondly, it still has many parameters, with limited accuracy and poor real-time performance. In this paper, a real-time light-weight hierarchical fusion convolutional neural network (Fig. 1.1) is proposed. Its advantages are: first, the network adopts a structure combining residual modules and atrous spatial pyramid pooling, which greatly improves the generalization performance of the network and performs well on multiple-object recognition. Secondly, the key feature information in the image is further extracted by an encoding-decoding based feature fusion approach, which improves the accuracy of object recognition while keeping the number of parameters low.
We evaluate the performance of the network on the Cornell dataset, whose test set contains 249 images, and achieve 91.1% accuracy on it, showing very high robustness and generalization performance.

Related Work
Pose generation. For grasping unknown objects, obtaining the object's pose is always an important issue. The correct pose can largely reduce the difficulty of the grasping task and increase the success rate of grasping. Generally speaking, there are two methods of pose generation: one is pose acquisition based on mathematical geometric models; the other is based on deep learning.
Mathematical geometric model algorithms determine the grasping strategy from the shape of the objects acquired in real space. The most important among them are methods based on visually acquired object features, of which the 3D point cloud method is one; [3,4] use this approach, constructing a 3D model of the object by combining visual and point cloud methods. However, simply applying an object model and determining poses from geometric relationships does not build a good bridge between the real situation and the simulated scene. For unknown objects, only local information can be obtained when there are occlusions, ambient lighting effects, etc., so the ability to model unknown objects is greatly reduced, which leads to a decrease in accuracy.
Another important method among mathematical geometric models is pose generation by means of object edge detection [8]. This method is based on morphological image processing: it compares the pixel gradients of object edges and the environment to distinguish the two and generate the corresponding poses. It solves, to some extent, the influence of the environment on accuracy, maintaining good accuracy and high robustness without complete information about the object and without being strongly influenced by the environment. However, this method has difficulty generating effective grasping strategies for complex objects.
The deep learning based method is very effective at improving both the success rate and the accuracy of grasping. Based on existing datasets, it achieves a very high accuracy on learned objects, using images or models as input to a convolutional neural network and acquiring recognition features through deep learning. While ensuring high robustness and accuracy, it also copes well with unknown objects and unknown environments.
Therefore, this paper also uses a deep learning method, taking the depth image as the input of a convolutional neural network for feature extraction, which greatly improves the accuracy of grasp recognition. It also ensures robustness and stability and performs better on unknown objects.
Deep learning. For the methods in this paper, the deep learning performance directly determines how well features are extracted, so building the convolutional neural network is a crucial part. Many convolutional neural network techniques have been developed for learning-based grasping tasks. [9,10] used residual modules to design network structures containing fewer parameters, but the resulting robustness is not high. [11,12] used R-CNN to improve object recognition performance by continuously generating candidate regions and computing the intersection-over-union, making it better suited to image tasks. This approach clearly improves the robustness of the neural network and performs relatively well in various environments, although it is slow and memory-intensive because of the cropping of candidate boxes.
Faster R-CNN [13,14] improves on the R-CNN approach by proposing an ROI pooling layer structure, inserting the ROI pooling layer before the fully connected layer, and therefore does not require the cropping step of R-CNN, which greatly speeds up the computation. However, this network has poor recognition and generalization performance for smaller objects and for objects distinguished only by certain details.
In contrast, the convolutional neural network of [9] evaluates each pixel and generates the corresponding grasp pose using a smaller network, which enhances generalization performance and is applicable to objects of different types, as well as to scenes with mutual occlusion and relative displacement between objects. However, the accuracy of that network is not high.
Therefore, the network in this paper adopts a composite structure of residual modules and atrous spatial pyramid pooling, which greatly improves the generalization performance and convergence speed of the network, and further extracts the key feature information in the image by a feature-fusion-based method. This is an end-to-end convolutional neural network, and at the same time this approach improves the accuracy of object recognition while keeping the number of parameters low. For the grasping problem we need to pay particular attention to the grasp pose; in this paper the gripper is described by its center point. For a gripper, the center point determines its relative position in 3D space, while the parameters of the gripper itself determine the grasp pose. Let g = (p, φ, w) define a grasp, executed perpendicular to the x-y plane, where p = (x, y, z) is the gripper's center position, w is the opening width of the gripper, and φ is the rotation angle around the z-axis. To evaluate the merit of a grasp pose, we define a contribution factor q; the magnitude of q determines how strongly we approve of a grasp. Adding the contribution coefficient to the position and pose, we obtain the complete definition of a grasp, g = (p, φ, w, q). In order to detect grasp poses on the image, we translate this definition from 3D space to the image plane. Following the basic principles of coordinate transformation, we obtain the intrinsic and extrinsic parameters of the camera and perform two transformations: from the world coordinate system to the camera coordinate system, with transformation matrix T_wc, and from the camera coordinate system to the image coordinate system, with transformation matrix T_ci.
Finally, for the purpose of pose identification at each point, we transform the image coordinate system to pixel coordinates, defining the transformation matrix as T_ip, and set the plane pixel coordinates s = (u, v). We refer to the set of grasps in image space as the grasp map, denoted G = (Φ, W, Q). We can then write the transformation from the world coordinate system to the pixel coordinate system as T = T_ip T_ci T_wc. Further, for a pixel in the image we write g̃ = (s, φ̃, w̃, q̃), where φ̃ and w̃ are the grasp angle and width, and q̃ the contribution factor, in the pixel coordinate system. This gives the pixel-wise grasp map G = (Φ, W, Q) ∈ ℝ^(3×H×W). Let the input image be I. There is a correspondence between each input image and the grasp map we obtain; we define this correspondence as M, so that G = M(I). The goal of our task is then to find the grasp g̃ corresponding to the maximum value of Q over all pixels.
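As an illustration of the final step above, the following minimal NumPy sketch (not the paper's code) reads off the best grasp from hypothetical per-pixel maps Q, Φ and W by taking the argmax over Q:

```python
import numpy as np

def best_grasp(quality, angle, width):
    """Pick the pixel with the highest quality score and read the
    corresponding grasp angle and gripper width at that pixel."""
    v, u = np.unravel_index(np.argmax(quality), quality.shape)
    return (u, v), angle[v, u], width[v, u]

# Toy 4x4 grasp map with one clearly best pixel at row 2, column 1.
q = np.zeros((4, 4))
q[2, 1] = 0.9
phi = np.full((4, 4), 0.5)    # constant angle map (radians)
w = np.full((4, 4), 20.0)     # constant width map (pixels)
(u, v), ang, wid = best_grasp(q, phi, w)
```

In the full pipeline, the returned pixel coordinates would then be mapped back to the world frame through the camera transformations described above.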

Method
Convolutional neural network. Having defined the best pose, we estimate the correspondence M with a convolutional neural network. The network realizes an approximation M_θ of the true correspondence M, and we seek M* = argmin_θ ℒ(G, M_θ(I)); that is, a loss function ℒ is applied to make the network output approximate G. Here G̃ = M_θ(I) represents the estimated pixel-wise grasp map. It contains the pose parameters and the contribution coefficient for each pixel. The pose parameters, determined by the image input, comprise two quantities Φ, W ∈ ℝ^(H×W): Φ describes the grasp angle at each pixel, and W describes the gripper width at each pixel.
Q evaluates the contribution coefficient of each pixel's grasp: for a successful grasp the contribution coefficient is set to 1, and for a failed grasp it is set to 0. Network structure. The network (Fig. 2) in this paper mainly consists of two downsampling and two upsampling stages. In order to obtain information over a wider range of target scales, we add an atrous spatial pyramid pooling structure in the middle of the residual links. In this structure, the input passes through a normal 3 × 3 convolution and three 3 × 3 atrous convolutions with dilation rates of {6, 12, 18}, each of which captures information at a different scale to further improve network performance. The input is also passed through an adaptive pooling and convolution branch. Finally, the features of the above outputs are fused by a 1 × 1 output convolution.
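A minimal PyTorch sketch of such an ASPP module is shown below. The channel counts and the nearest-neighbour upsampling of the pooled branch are our own assumptions for illustration, not details taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: one plain 3x3 convolution, three
    3x3 atrous convolutions with dilation rates {6, 12, 18}, and a global
    pooling branch, all fused by a final 1x1 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
             for r in (6, 12, 18)])
        # Adaptive pooling + 1x1 convolution branch for global context.
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1))
        self.fuse = nn.Conv2d(5 * out_ch, out_ch, 1)  # 1x1 fusion

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        g = F.interpolate(self.pool(x), size=x.shape[2:], mode='nearest')
        return self.fuse(torch.cat(feats + [g], dim=1))

x = torch.randn(1, 16, 32, 32)
y = ASPP(16, 16)(x)    # spatial size is preserved: (1, 16, 32, 32)
```

Setting padding equal to the dilation rate keeps the spatial size of each 3 × 3 branch unchanged, so the branches can be concatenated directly before the 1 × 1 fusion.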
After that we further deepen the network by using a residual structure between the sampling stages [13,14], because as the number of layers increases, the network faces problems such as long and computationally intensive training and overfitting, and even degradation can occur. By adding the input value to the output of the block and taking the sum as the next input, we obtain the input-output relation of each residual structure as

x_{n+1} = f(x_n, W_n) + x_n

In this paper, we also add a layer of convolution on the input (bypass) side, giving the further residual input-output relation

x_{n+1} = f(x_n, W_n) + g(x_n, W_s)

where W_n represents the weight parameters of the residual branch and W_s the parameters of the bypass branch; x_n is the input of one stage, and x_{n+1} is its output and the input of the next stage; f is the mapping realized by the residual structure and g is the mapping of the bypass. Each residual structure contains two identical convolutional units, each consisting of a convolutional layer, normalization and an activation function. The residual blocks share the same structure, and in this paper the number of residual blocks per layer is set to 2.
This linkage solves the degradation problem well while improving the accuracy of the network, and it raises the upper limit on the number of network layers that can be added.
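The residual relation above can be sketched in PyTorch as follows; the channel count and the choice of batch normalization and ReLU as the "normalization and activation" are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two identical conv-BN-ReLU units f(.) plus a 1x1 bypass
    convolution g(.), implementing x_{n+1} = f(x_n) + g(x_n)."""
    def __init__(self, ch):
        super().__init__()
        def unit():
            return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                 nn.BatchNorm2d(ch),
                                 nn.ReLU())
        self.body = nn.Sequential(unit(), unit())   # f: two identical units
        self.bypass = nn.Conv2d(ch, ch, 1)          # g: learned shortcut

    def forward(self, x):
        return self.body(x) + self.bypass(x)

x = torch.randn(2, 8, 16, 16)
y = ResidualBlock(8)(x)    # shape preserved: (2, 8, 16, 16)
```

Because the shortcut is itself a learned 1 × 1 convolution rather than an identity, the block can still be used when the two branches need different scalings of the input.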

Experimental Setup
The network was trained on a Windows 10 machine with an i5-8300H CPU and an NVIDIA GeForce GTX 1050 Ti graphics card to accelerate computation. The Python version is 3.9.4, the acceleration library is CUDA 11.2, and the deep learning framework is PyTorch 1.8.1 (GPU version). During training, the Adam optimizer is used, the number of epochs is 50, and the batch size is set to 8.
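The training configuration above can be sketched as follows. The single convolution standing in for the grasp network, the toy tensors, and the MSE loss are our own assumptions for illustration; only the optimizer, epoch count and batch size come from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the grasp network: maps a 1-channel depth
# image to three per-pixel maps (Q, Phi, W).
model = nn.Conv2d(1, 3, 3, padding=1)
opt = torch.optim.Adam(model.parameters())  # paper's optimizer: Adam
EPOCHS, BATCH = 50, 8                       # paper's settings

depth = torch.randn(BATCH, 1, 32, 32)       # toy batch of depth maps
target = torch.randn(BATCH, 3, 32, 32)      # toy Q/Phi/W labels
loss_fn = nn.MSELoss()
for _ in range(2):                          # 2 steps just to illustrate
    opt.zero_grad()
    loss = loss_fn(model(depth), target)
    loss.backward()
    opt.step()
```

In the real run the inner loop would iterate over all batches of the training set for each of the 50 epochs.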
The dataset used in this paper is the Cornell University dataset with 2490 samples, of which 90% are used as the training set and 10% as the test set. We processed the dataset in the same way as [11], resizing the images to 300 × 300, and increased the sample size by random rotation and cropping to generate a dataset containing 8840 depth maps.
We generate a rectangle in each image to calibrate the grasp, the pose of this rectangle representing the pose of the gripper, and we take the central third of the rectangle's length as the valid region for grasping. The contribution factor of all pixels inside valid regions is set to 1, and of all others to 0. The angle Φ of the pixels in the valid region of the image is set to the angle of the valid-region rectangle with respect to the horizontal axis, and the gripper width W of all pixels in the valid-region rectangle is set to the rectangle's width. These three matrices are used as the labels of the dataset.
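The label construction above can be sketched as follows. For simplicity this NumPy example uses an axis-aligned rectangle; the paper's labels use rotated rectangles, so this is only an illustration of the Q/Φ/W map layout:

```python
import numpy as np

def rect_labels(shape, top_left, size, angle, width):
    """Build Q, Phi, W label maps for one axis-aligned grasp rectangle.
    Only the central third of the rectangle along its length is marked
    valid; Q is 1 there, and Phi/W carry the rectangle's angle/width."""
    Q = np.zeros(shape)
    Phi = np.zeros(shape)
    W = np.zeros(shape)
    r0, c0 = top_left
    h, l = size                        # h: rect height, l: rect length
    c_start = c0 + l // 3              # keep the central third only
    c_end = c0 + 2 * l // 3
    Q[r0:r0 + h, c_start:c_end] = 1.0
    Phi[r0:r0 + h, c_start:c_end] = angle
    W[r0:r0 + h, c_start:c_end] = width
    return Q, Phi, W

Q, Phi, W = rect_labels((100, 100), (40, 30), (10, 30), 0.3, 25.0)
```

Pixels outside the valid region keep contribution factor 0, so the network is only rewarded for predicting grasps near the center of a labeled rectangle.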

Experiments
We evaluated our network by designing several sets of related experiments. To ensure the validity of the experiments and the reliability of the results, all experiments were conducted in the same environment, using the same training set and dataset, for a total of 50 epochs.
To evaluate the basic performance of the network itself, we denote the valid rectangle of the label of the i-th test image as A_i and the valid rectangle generated by the recognition result as B_i, and compute the IoU value of the model for each test image:

IoU_i = |A_i ∩ B_i| / |A_i ∪ B_i|

A value greater than 0.25 is recorded as a positive sample P, i.e. a successful grasp recognition, and otherwise as a negative sample N. Finally, the ratio of the number of successful grasps to the total number of grasps is taken as the accuracy:

acc = P / (P + N)
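The metric above can be computed as in the following NumPy sketch, where the rectangles are represented as binary masks (our own choice of representation):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def accuracy(pairs, threshold=0.25):
    """Fraction of (label, prediction) pairs whose IoU exceeds the
    threshold: acc = P / (P + N)."""
    hits = sum(iou(a, b) > threshold for a, b in pairs)
    return hits / len(pairs)

a = np.zeros((10, 10), bool); a[2:6, 2:6] = True   # label rectangle
b = np.zeros((10, 10), bool); b[3:7, 3:7] = True   # predicted rectangle
# Intersection 9 px, union 23 px: IoU = 9/23 ≈ 0.39 > 0.25, a positive sample.
```

With the paper's 249-image test set, the reported accuracy is simply this ratio over all test images.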
First, we use GG-CNN and GG-CNN2 for comparison with networks that have not undergone knowledge distillation. Secondly, to verify that the network structure contains no redundant links, we compare the performance of the base network (four convolutional layers and two upsampling layers) with the residual structure, with ASPP, and with both, and investigate the role each link plays in performance and robustness. Fig. 3 shows the accuracy curves of the different models. It can be seen that our network achieves higher accuracy with fewer parameters, converges faster, and its loss decreases more rapidly, demonstrating the advantage of the proposed structure for hierarchical extraction of image features.

Comparison with other networks
After analysis, structurally speaking, the network in this paper has smaller convolutional kernels but is deeper and uses broader feature fusion. Therefore, the number of channels and the number of convolutional kernels are considered to contribute little to network performance while adding many parameters; instead, network depth and the breadth of feature fusion play the more critical role. To illustrate the effectiveness of each improvement, we evaluated Method 1: the base network, with four convolutions and two upsamplings; Method 2: the network with the ASPP module added; Method 3: the network with the residual module added; and Method 4: the network with both the residual and ASPP modules added.

Comparison with different structures
Observing the experimental results, we found that the base network has 34,388 parameters in total, but its best accuracy is only 71.1% (177/249); its performance is very poor. In experimental group (1) we added the ASPP structure after the two downsampling stages; the parameters total 41,140, while the accuracy improves by 6.4% to 77.5% (193/249). In experimental group (2), two layers of residual structure were added; the parameters total 53,940, while the accuracy improves by 11.6% to 82.7% (206/249). The final network built in this paper contains both the residual structure and the ASPP structure; the parameters total 60,692, while the accuracy improves by 14.0% to 85.1%.
By incorporating these structures, the network in this paper improves overall performance and improves generalization and robustness while adding only a small number of parameters. Both the residual structure and the ASPP module clearly improve network performance, and the hierarchical fusion of the two yields a new network with better overall performance.

Conclusion
Deep learning is an important tool for the task of grasping unknown objects. The hierarchical fusion convolutional neural network proposed in this paper improves the recognition accuracy for grasping unknown objects to 85.1% with a low number of parameters, by applying the residual structure and atrous spatial pyramid pooling, and further applies the knowledge distillation method to increase the accuracy to 91.1%. The method has high robustness and generalization performance. It offers a useful approach to light-weighting neural networks, which has great value in practical applications.