AGV monocular vision localization algorithm based on Gaussian saliency heuristic


position of the robot, completing the autonomous navigation and localization function of the robot. At present, most AGV vision localization systems are divided into three steps: visual information collection, target detection, and localization. Among them, target detection is the key factor affecting the accuracy of visual localization. For existing AGV-based visual object detection, it is very difficult to detect targets accurately and quickly in complex environments.
Therefore, to solve these problems, this paper proposes a new AGV vision localization method built around an AGV visual detection network (GAGV-net) based on the Gaussian saliency heuristic. First, to improve the network's ability to extract target features, we introduce a Gaussian salient target feature extraction module in the feature extraction part of the network to enrich the feature expression of the target. Because this module extracts features efficiently, the network needs far fewer parameters, which makes the model lightweight. Second, in the classification decision part of the network, we introduce a joint multi-scale classification module to improve the classification accuracy. The experimental results show that the proposed detection method has better detection performance than existing advanced detection methods and that its model size is much smaller, which greatly improves the detection speed. The contributions of this paper are as follows.
(1) A new AGV monocular vision localization framework based on the Gaussian saliency heuristic is proposed.
(2) An AGV vision detection network (GAGV-net) based on the Gaussian saliency heuristic is proposed. Compared with existing methods, the network has higher detection accuracy and detection speed, which provides technical support for rapid and accurate AGV vision-aided localization. The experimental results show that, compared with existing detection methods, the detection accuracy of the proposed network is improved by 12% and the detection speed is improved by 27.38 FPS.
(3) In GAGV-net, an efficient target salient feature extraction module is proposed. This module greatly improves the feature extraction ability of the network, thus reducing the parameters required for model fitting.
(4) In GAGV-net, a joint multi-scale classification module is proposed, which greatly improves the classification accuracy of the network.

AGV visual positioning
In general, an accurate target detection algorithm is the key to AGV vision localization. Many researchers have therefore studied AGV visual localization algorithms, aiming to achieve more accurate target localization through accurate target detection. Kang et al. [6] used cameras to capture tags and SVM predictors to classify them. Ding et al. [7] used a decision tree model to pre-classify targets and a long short-term memory (LSTM) network to distinguish uncertain state data, improving fault detection accuracy. Kuang et al. [8] proposed a fuzzy inference algorithm based on the Hough transform to address slow inspection speed and improve the real-time performance of the entire system. Yang et al. [9] improved the YOLOv5 model to achieve more accurate target inspection. Their method introduces an attention mechanism into YOLOv5 to improve the feature representation of the target and constrains the network more effectively through an improved loss function. Liu et al. [10] proposed an end-to-end edge detection method that combines a traditional adaptive threshold method with deep learning to overcome non-uniformity and achieve accurate target detection. Dong et al. [11] proposed a vision-aided localization and navigation system that integrates the advantages of cameras and 2D LiDAR, enhancing the intelligence and capability of traditional AGVs equipped with 2D LiDAR sensors and making them more robust in various environments. Li et al. [12] proposed a vision-based adaptive localization algorithm for global attitude correction and a visual servo motion controller, which realized the automatic driving function of the AGV.
Although existing AGV visual inspection algorithms have achieved good detection performance, their performance degrades as application scenarios become more complex. In addition, existing detection methods often have a large number of parameters, which leads to poor real-time performance.

Deep learning target detection
Target detection is a vital problem in computer vision, focusing on identifying and localizing specific objects in images or videos [13][14][15][16]. In recent years, the performance of target detection has improved significantly with the advancement of deep learning, and deep learning-based algorithms have become the mainstream approach. Classic algorithms such as Fast R-CNN [13], YOLO [14], SSD [15], and RetinaNet [16] have gained popularity in this field. These algorithms use different network structures and loss functions, each with its own strengths and limitations.
The R-CNN series, including R-CNN, Fast R-CNN, and Faster R-CNN [17], are well-known target detection algorithms. Their fundamental idea is to convert target detection into candidate region extraction and classification. First, a set of candidate regions is extracted using techniques such as selective search. Then, features are extracted from each candidate region and classified, yielding the target's location and category [18]. The YOLO series, represented by YOLO, YOLOv3 [19], YOLOv5 [20], and others, consists of single-stage target detection algorithms that transform target detection into a regression problem. The image is divided into a grid, and each grid cell predicts the target's location and category. This approach offers fast detection and high real-time performance, but it may not yield optimal accuracy for small targets. SSD is another single-stage algorithm that uses multi-scale feature maps: it predicts the target's location and category on feature maps of varying scales and fuses these predictions to obtain the final detection result. This provides fast detection and robust performance for small targets.
RetinaNet [16] addresses category imbalance in target detection by employing focal loss, which replaces the traditional cross-entropy loss and prioritizes hard-to-classify samples. RetinaNet offers improved detection accuracy and better generalization while maintaining fast detection speed. Mask R-CNN [21] is developed from the R-CNN series. In addition to detecting the target's position and category, it generates a segmentation mask for the target by adding a segmentation branch to the detector. Mask R-CNN provides improved segmentation accuracy and higher detection accuracy. In conclusion, deep learning-based target detection algorithms offer fast detection speed and high accuracy. However, they still face challenges such as ineffective detection of small targets and category imbalance. Researchers are actively exploring new algorithms and technologies to enhance the performance of target detection.

Method
In this part, a new AGV visual localization framework based on the Gaussian saliency heuristic is introduced. Figure 1 shows the proposed framework. The target is first imaged by a monocular camera to obtain the target image. The image is then fed into the proposed GAGV-net for target detection. Finally, feature points are extracted from the detection result image, and the target is located using the PnP [22] algorithm.
Fig. 1 The proposed AGV vision localization framework

AGV vision-aided localization comprises visual imaging, target detection, and visual localization. Among them, target detection is the key step, and accurate target detection can greatly improve the localization accuracy. Therefore, this paper proposes a Gaussian saliency inspired AGV visual detection network (GAGV-net) to improve the localization accuracy. Figure 3 shows the GAGV-net detection network framework proposed in this paper. In the feature extraction stage, the input image first passes through the backbone network for initial feature extraction. To obtain good initial features while keeping the network lightweight, GAGV-net selects ResNet [23] as the backbone network. After the input image passes through the backbone network, the feature matrix of the target is obtained.
Then, the initial target features are passed through two target salient feature extraction modules to obtain the salient features of the target. In the classification and regression stage, to enrich the feature expression of the target and obtain excellent detection performance, the extracted initial features and salient features are classified and regressed by the joint multi-scale classification module, which produces the final detection results. The network parameters of the backbone network are shown in Table 2.

Problem definition
In order to provide a more intuitive analysis of the AGV visual localization problem we are addressing, we have formulated it as follows.
$I_{in} \in \mathbb{R}^{H \times W \times C}$ represents an input image, where $H$, $W$, and $C$ are the height, width, and number of channels of the image, respectively. $y \in \{1, 2, 3, \ldots, K\}$ represents the category of the target in the input image, where $K$ is the total number of target categories. $b \in \mathbb{R}^{4}$ represents the bounding box of the target in the input image, where $b = (x_{min}, y_{min}, x_{max}, y_{max})$ gives the coordinates of the top-left and bottom-right corners of the target. The target detection network can be represented as a function $f_{det}(\cdot)$, which detects the target in the input image $I_{in}$ and marks its bounding box $b$. The PnP localization algorithm can be represented as a function $f_{loc}(\cdot)$, which receives the predicted bounding box $b$ and outputs the coordinate position $P$ of the target in the real world. $P_1$ is the actual target location. Therefore, our goal is to find an optimal model $f_{det}(\cdot)$ that minimizes the error between $P$ and $P_1$. The specific formula for our objective function is as follows.
where $L\{\cdot\}$ is the mean square error function.
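The displayed objective did not survive in this text. A form consistent with the definitions above, given as a reconstruction under that assumption rather than as the authors' exact formula, is sketched below. The sample index $i$ and the sample count $N$ are introduced only for illustration.

```latex
\begin{equation*}
f_{det}^{*} = \arg\min_{f_{det}} L\{P, P_1\},
\qquad P = f_{loc}(b), \quad b = f_{det}(I_{in}),
\end{equation*}
\begin{equation*}
L\{P, P_1\} = \frac{1}{N} \sum_{i=1}^{N} \left\| P^{(i)} - P_1^{(i)} \right\|_2^2 .
\end{equation*}
```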

Target salient feature extraction module
In the field of target detection [24][25][26][27][28], the performance of the detection network is limited by its ability to extract target features. Deeper networks generally extract target features better, but they often result in larger models and slower detection, which makes them unsuitable for AGV visual detection tasks. To address this, we propose a target salient feature extraction module. This module extracts the salient features of the target, enriching the feature expression and improving the network's feature extraction capability. Instead of deepening the network, which can lead to parameter redundancy, this module reduces the number of parameters required for model fitting while maintaining excellent feature extraction ability.
Inspired by the human visual mechanism, where the target stands out from the surrounding environment, we simulate this effect using Gaussian convolution. In convolutional networks, as the receptive field increases, the scale of the target in the feature matrix decreases but its saliency increases. We leverage Gaussian convolution to extract salient features of the target, enhancing the contrast between the target and the background.
By using different scales of Gaussian convolution kernels, the proposed object salient feature extraction module improves the contrast of the target.This allows the network to focus more on the target rather than the background.The module employs Gaussian convolutions with different kernel sizes to extract saliency features of different scales, enriching the feature expression of the object.Figure 4 illustrates the structure of the target salient feature extraction module, and the specific feature extraction process can be described as follows.
where $I_{in}$ is the input image, $Backbone(\cdot)$ is the backbone network, and $F_L$ is the initial target feature. $F_o$ is a feature obtained by ordinary convolution, and $F_{S1}$ and $F_{S2}$ are salient features. $Concat(\cdot)$ is the splicing operation, and $Gconv_{5 \times 5}(\cdot)$ and $Gconv_{3 \times 3}(\cdot)$ are Gaussian convolutions with $5 \times 5$ and $3 \times 3$ kernels, respectively, whose kernel function is $G(m, n)$.
In GAGV-net, considering the computational cost, we set the sizes of the Gaussian convolution kernels to $3 \times 3$ and $5 \times 5$. It is worth noting that $\theta_m$ and $\theta_n$ in $G(m, n)$ are updated and optimized during network training and do not need to be set manually. Unlike ordinary convolution, only the scale parameters $\theta_m$ and $\theta_n$ of the Gaussian saliency convolution are updated during training, so the kernel of the Gaussian convolution always remains a two-dimensional Gaussian distribution. This enables GAGV-net to effectively extract the salient features of the target.
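To make the idea of a Gaussian convolution kernel with two scale parameters concrete, the following is a minimal sketch in plain Python. It assumes an axis-aligned Gaussian of the form exp(-(m^2 / (2 theta_m^2) + n^2 / (2 theta_n^2))), normalized to sum to 1; the exact form of G(m, n) used in GAGV-net is not given in this text, and the function names and the fixed theta values here are hypothetical. In the network these scales would be learned during training rather than set by hand.

```python
import math


def gaussian_kernel(size, theta_m, theta_n):
    """Build a size x size kernel sampled from an axis-aligned 2D Gaussian.

    Assumed form (not taken from the paper):
    G(m, n) = exp(-(m^2 / (2 theta_m^2) + n^2 / (2 theta_n^2))), normalized to sum to 1.
    """
    if size % 2 == 0:
        raise ValueError("kernel size must be odd")
    half = size // 2
    kernel = [
        [
            math.exp(-(m * m / (2.0 * theta_m ** 2) + n * n / (2.0 * theta_n ** 2)))
            for n in range(-half, half + 1)
        ]
        for m in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


def conv2d_same(image, kernel):
    """Slide the kernel over a 2D list with zero padding; output keeps the input shape.

    The kernel is not flipped, which makes no difference for a symmetric Gaussian.
    """
    height, width = len(image), len(image[0])
    size = len(kernel)
    half = size // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for a in range(size):
                for b in range(size):
                    y, x = i + a - half, j + b - half
                    if 0 <= y < height and 0 <= x < width:
                        acc += image[y][x] * kernel[a][b]
            out[i][j] = acc
    return out


# Illustrative scale values only; in training they would be updated by the optimizer.
k3 = gaussian_kernel(3, theta_m=1.0, theta_n=1.0)
k5 = gaussian_kernel(5, theta_m=1.5, theta_n=1.5)

# A single bright pixel on a dark background stands in for a small target.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0
response = conv2d_same(image, k5)
```

Because only the two scales change, the kernel stays Gaussian for any positive theta values, which is the property the paragraph above relies on.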

Joint multi-scale classification module
Through the backbone network and the target salient feature extraction module, the rich features of the target are fully extracted. To further enrich the target feature expression and improve the detection performance, we designed a joint multi-scale classification module to perform the final detection. The network's ability to express object features is the most important factor in determining detection accuracy. Therefore, unlike existing target detection methods, we fuse target features from different levels after feature extraction. We use the proposed joint multi-scale classification module to fuse these features and enrich the target feature representation.
Our joint multi-scale classification module is shown in Fig. 3. As can be seen from the figure, it includes two parts: the global feature fusion module and the decision feature fusion and classification module. First, the global feature fusion module integrates the global features by introducing the attention mechanism [29]. The feature expression of the target is greatly enriched in this module, which provides the basis for accurate classification. In addition, to further enrich the feature expression of the target, we fuse the global fusion features of different depths in the decision feature fusion and classification module, which finally produces the detection of the target.
The attention mechanism has been widely used in the field of target detection, so in this paper we use it to design a global feature fusion module that enriches the feature expression of the target. Figure 3b shows the proposed global feature fusion module, which includes channel attention and spatial attention. Its specific operation can be expressed as follows.
where $F_{out}$ is the global feature after fusion, $F_{in}$ is the input feature, $F_c$ is the feature matrix after the channel attention operation, and $F_s$ is the feature matrix after the spatial attention operation. $C_{attention}(\cdot)$ and $S_{attention}(\cdot)$ are the channel and spatial attention operations, respectively.
The feature expression of the target is greatly improved after global feature extraction by this module. However, to make fuller use of the target features and improve the detection performance, we designed the decision feature fusion and classification module, which fuses the global features of different depths.
The specific operation process of the decision feature fusion and classification module can be described as follows.
where $F_J$ is the fused multi-scale feature of the target. $F_{S1}$, $F_{S2}$, and $F_L$ are shallow features, which can be seen in Fig. 5.
The fused feature matrix is then used for bounding box regression and target classification, which yields the final detection results.

Loss function
To constrain the training of the network, we use the cross-entropy [24,30] and IoU loss functions. The IoU loss constrains the bounding box regression of the network; its calculation is illustrated in Fig. 6 and Formula (8). The cross-entropy loss constrains the classification task of the network.
where $L_{Box}$ is the border loss, $L_{cls}$ is the classification loss, and $L$ is the total loss. $W_{box}$ is the weight of the border loss, and $W_{cls}$ is the weight of the classification loss.
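The loss terms described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the IoU loss is taken as 1 - IoU (other forms such as -ln(IoU) are also common), the total loss is taken as the weighted sum W_box * L_Box + W_cls * L_cls implied by the symbol definitions, class labels are 0-based indices here, and all function names and default weights are hypothetical.

```python
import math


def iou(box_g, box_d):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_g[0], box_d[0])
    iy_min = max(box_g[1], box_d[1])
    ix_max = min(box_g[2], box_d[2])
    iy_max = min(box_g[3], box_d[3])
    # Clamp at zero so that non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    area_d = (box_d[2] - box_d[0]) * (box_d[3] - box_d[1])
    union = area_g + area_d - inter
    return inter / union if union > 0 else 0.0


def iou_loss(box_g, box_d):
    """Border loss, assumed here to be 1 - IoU."""
    return 1.0 - iou(box_g, box_d)


def cross_entropy(probs, label):
    """Classification loss for one sample; probs are class probabilities, label is a 0-based index."""
    return -math.log(max(probs[label], 1e-12))


def total_loss(box_g, box_d, probs, label, w_box=1.0, w_cls=1.0):
    """Weighted sum of border loss and classification loss (weights are illustrative)."""
    return w_box * iou_loss(box_g, box_d) + w_cls * cross_entropy(probs, label)


ground_truth = (0.0, 0.0, 2.0, 2.0)
detection = (1.0, 0.0, 3.0, 2.0)
overlap = iou(ground_truth, detection)
loss = total_loss(ground_truth, detection, probs=[0.5, 0.5], label=0)
```

Here G and D follow the naming of Fig. 6: the ground-truth box and the detected box.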

Experimental setup
To verify the effectiveness and advancement of the proposed method, we conducted experiments. We collected a large amount of data to train our detection model. The dataset used in the experiment consists of 2000 images captured by the camera on the AGV, covering many scenes. The targets in the dataset include electricity meters, keys, and digital displays showing 0-9 and F, for a total of 13 target categories. Figure 7 shows some of the images in the dataset. We use 50% of the dataset for training, 30% for testing, and 20% for validation. The experiments were carried out on a computer with an NVIDIA 3080 Ti graphics card, with Python 3.7, PyTorch 1.1, and PyCharm 2021.2.5 installed.
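The 50/30/20 split described above can be sketched as follows. This is a minimal illustration only: the file names, the random seed, and the `split_dataset` helper are hypothetical, and the text does not say how the images were actually assigned to each subset.

```python
import random


def split_dataset(items, train=0.5, test=0.3, seed=0):
    """Shuffle items reproducibly and split them into train, test, and validation lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_test = round(len(items) * test)
    train_part = items[:n_train]
    test_part = items[n_train:n_train + n_test]
    val_part = items[n_train + n_test:]  # whatever remains, 20% with the defaults
    return train_part, test_part, val_part


# Placeholder names standing in for the 2000 collected images.
image_ids = [f"img_{i:04d}.jpg" for i in range(2000)]
train_set, test_set, val_set = split_dataset(image_ids)
```

With 2000 images this gives 1000 training, 600 test, and 400 validation images.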
In the field of target detection, recall (R) and precision (P) [24][25][26][27][28]30] are two important indicators for verifying the performance of a method. P and R are defined as follows.
$$R = \frac{TP}{TP + FN}$$
$$P = \frac{TP}{TP + FP} \qquad (16)$$
where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.
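As a small worked example of these two indicators, the sketch below computes P and R from raw counts. The counts are made up for illustration and are not results from the paper; the function name is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Illustrative counts: 90 correct detections, 10 false alarms, 30 missed targets.
p, r = precision_recall(tp=90, fp=10, fn=30)
```

With these counts, precision is 90 / 100 = 0.9 and recall is 90 / 120 = 0.75.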

Comparative experimental results and discussion
To verify the effectiveness and advancement of the proposed method, we compared it with current state-of-the-art (SOTA) methods. For fairness, all comparison methods were retrained on our dataset, with the same experimental settings as in their original literature. These methods include Faster R-CNN [24], SSD [31], YOLOv3 [32], RetinaNet [33], and EfficientDet [34]. The experimental results are shown in Fig. 8 and Table 3. Figure 8 shows the P-R curve of each comparison method, from which we can clearly see that the proposed method has the best detection performance. In terms of recall R, the method proposed in this paper improves on the comparison methods by at least 5%. In addition, Table 3 also shows the detection speed of the comparison methods, from which we can see that the proposed method has the highest detection speed, improved by at least 27.38 FPS.
Figure 9 shows the detection results of the detection algorithm on the AGV trolley. It is worth noting that the detection results in Fig. 9 were obtained through the camera on the AGV and captured on the host computer software for display. From the figure, we can clearly see that our algorithm has high detection accuracy and can effectively deal with targets of different sizes. This also indicates that the proposed Gaussian convolution can effectively extract the salient features of the target, enabling the detection network to focus on the target itself and thereby greatly improving detection performance. In addition, since the proposed method uses the Gaussian convolution $G(m, n)$ to extract the salient features of the target, the detection performance will in theory be affected by Gaussian noise. However, because $\theta_m$ and $\theta_n$ in the Gaussian convolution are set adaptively during training, the impact of Gaussian noise is greatly reduced.
To verify the positioning accuracy of the proposed method, we conducted a further experiment. Figure 10 shows the performance of the proposed method in visual positioning, with the blue curve representing the actual target position and the red curve representing our prediction. It can be clearly seen from the figure that the proposed method locates the target with minimal error. It is worth noting that, for the convenience of display, x and y in Fig. 10 are the normalized coordinates of the target position.

Ablation experiment
To further verify the effectiveness of the proposed method, we conducted ablation experiments. The experimental results are shown in Table 4. From the table, we can clearly see that the target salient feature extraction module and the joint multi-scale classification module proposed in GAGV-net are both effective in improving the detection performance.

Conclusion
To solve the problems of poor detection performance and slow detection speed in existing AGV-based visual localization methods, this paper proposes GAGV-net, an AGV visual inspection network based on visual saliency. The network enriches the feature representation of the target through the designed target salient feature extraction module, thereby reducing the parameters required for model fitting. At the same time, to improve the detection accuracy, a joint multi-scale classification module is proposed in GAGV-net, which improves the detection accuracy by fusing target features of different depths. The experimental results show that the proposed method has better detection performance than existing advanced methods.

Figure 2 shows the flowchart of the AGV visual localization algorithm proposed in this paper. The acronyms appearing in the article are shown in Table 1.

Fig. 2 The flowchart of the proposed AGV vision localization algorithm

Fig. 3 The proposed GAGV-net. The joint multi-scale classification module includes two parts, a and d, where a is the global feature fusion module and d is the decision feature fusion and classification module. The global feature fusion module integrates global features by introducing an attention mechanism to improve the feature expression of the target. b and c are the channel attention operation and the spatial attention operation in the global feature fusion module, respectively

Fig. Decision feature fusion and classification module

Fig. 6 Schematic diagram of IoU calculation. G is the ground truth and D is the detection result

Fig. 8 P-R curves of different comparison methods

Fig. 9 The detection result image of GAGV-net deployed on the AGV trolley

Table 1 Acronyms description

Table 2 The network parameters of the backbone network

Table 3 shows the P and R values of the comparison methods. From the table, we can clearly see that the proposed method is superior to the existing advanced detection methods in all performance indicators. For the detection accuracy P, the detection method proposed in this paper is improved by at least 12%. For the recall rate R, the detection method proposed in this paper is improved by at least 5%.

Table 3 P and R of different detection methods