Monocular vision based on the YOLOv7 and coordinate transformation for vehicles precise positioning

Logistics tracking and positioning, which is widely applied in many fields (e.g. industry and transport), is a critical part of the discrete digital workshop. However, such workshops are characterised by dispersed manufacturing machinery, frequent material flows, and complicated noise environments, which severely degrade the positioning accuracy of the conventional radio frequency positioning approach. The latest panoramic vision positioning technology relies on binocular cameras and cannot be applied to the monocular cameras used in industrial scenarios. This paper proposes a monocular vision positioning method based on YOLOv7 and coordinate transformation to solve the problem of positioning accuracy in the digital workshop. Positioning beacons are placed at a uniform height on top of the moving vehicles. The coordinate position of the beacon in the image is obtained through a YOLOv7 model trained with transfer learning. Then, coordinate transformation is applied to obtain the real-space coordinates of the vehicle. Experimental results show that the proposed monocular vision system can improve positioning accuracy in the digital workshop. The code and pre-trained models are available at https://github.com/ZS520L/YOLO_Positioning.


Introduction
The smart workshop has attracted extensive interest from all sectors because of its highly digital, informative, and intelligent characteristics, and it has progressively come to be seen as the future development path for discrete production workshops. Positioning technology for workshop logistics trolleys and related mobile devices is a fundamental part of realising the smart workshop. The authors encountered the problem of positioning logistics trolleys in the discrete workshop of smart manufacturing while implementing the digital workshop construction of a smart factory. This paper arose from the need to resolve that practical problem and from its significance for logistics trolley positioning in discrete workshops.
In recent years, indoor positioning has become a fundamental requirement for mobile users, driven by indoor location-based services (ILBS). With the rise of the Internet of Things (IoT), heterogeneous smartphones and wearable devices are becoming ubiquitous. However, ILBS for heterogeneous IoT devices face significant challenges, such as differences in received signal strength (RSS) due to hardware heterogeneity, multi-path reflections in complex environments, and positioning times limited by computational resources (Ye et al., 2021). At present, indoor positioning technologies mainly rely on WiFi (Yang & Shao, 2015), radio frequency identification (RFID) (Merenda et al., 2021), wireless sensor networks (WSN) (Gao et al., 2021; Huang et al., 2022), and ultra-wideband (UWB) (Yao et al., 2021). However, these methods require the appropriate equipment to be installed in advance, and because of signal interference and attenuation they are only suitable for indoor positioning with low accuracy requirements. Vision-based indoor localisation techniques rely on a priori maps and feature descriptors for image retrieval and image matching. In enclosed, semi-enclosed, and multilayered interior environments with strong electromagnetic interference, they are able to localise with a high degree of accuracy. In addition, vision-based positioning is an accurate and cost-effective solution for indoor positioning. It relies on cameras to collect information about the building structure, texture differences and static objects (doors, windows, etc.) from the environment to confirm the position, avoiding the interference from reflection and refraction that radio signals suffer when they encounter obstacles (Li et al., 2020). As a result, multistorey enclosed or semi-enclosed indoor spaces are well suited to vision-based indoor localisation systems. However, current vision-based positioning approaches incur high deployment costs and cannot adapt to simple changes in the environment. Therefore, there is huge application potential for indoor positioning techniques with minimal equipment costs and straightforward deployment.
Existing vision-based indoor localisation techniques include mobile camera-based techniques (Chen & Chen, 2021), which are mainly based on image matching (Xia et al., 2018): the location is calculated by matching the current photo with a photo stored in an image database. Such methods suffer from two problems. On the one hand, because there are so many photographs in the offline database, image retrieval takes a long time. On the other hand, the localisation accuracy is unsatisfactory, as there are many mismatched pairs between the query image and the matched image, and it is difficult to establish an accurate coordinate transformation relationship (Jia et al., 2021).
To address the problems mentioned above, this paper proposes an indoor localisation scheme based on target detection and image-to-space mapping, which can achieve centimetre-level localisation accuracy. Prior to this work, our team proposed a mine video monitoring system based on cloud-side collaboration and a real-time video processing system for underground coal mines based on the edge-cloud collaboration framework. The main contributions of this paper can be summarised as follows:
(1) The idea of two-dimensionalised space is proposed for the first time as a way to unify the placement height of positioning beacons, and a three-point approach is developed for creating a coordinate mapping from two-dimensional images to two-dimensional space.
(2) The target detection algorithm is applied to the indoor movable object localisation problem, successfully transforming the accuracy problem of spatial object localisation into the validity problem of image target detection.
(3) The proposed localisation system is training-free across different scenarios. As the pretrained target detection model is stable, no secondary training is required as long as the localisation beacon is unchanged. The system can also work with existing security surveillance systems, which dramatically reduces the deployment cost and makes it easy to promote.

Related work
With the rapid development of computer vision and deep learning technologies, it has become possible to achieve real-time target detection and obtain location information from images. Additionally, image-based visual localisation provides excellent visualisation, is highly resistant to interference, and is rich in contextual data. As a result, researchers around the world have paid considerable attention to visual localisation techniques. Depending on whether artificial markers are used, visual localisation techniques currently fall into two basic groups.

Indoor positioning techniques that rely only on pre-existing environments
This type of approach typically involves three steps: extraction of environmental information to construct a feature database (Liao et al., 2019), image retrieval to find the best-matching feature map, and image-to-space coordinate transformation to determine the camera's position (Zhang et al., 2021). For example, Yu et al. (2021) analyzed image data and converted it into mobile phone movement distance and pose using a coordinate transformation method (a four-parameter fitting model), Zhou et al. (2022) used an improved convolutional neural network based on monocular vision for indoor localisation, and Jung et al. (2021) used point cloud and RGB feature information to accurately acquire indoor 3D space. Chae et al. (2016) used a stereo vision system to find the saliency map, calculated parallax and distance from the stereo vision images, and then found the absolute distance based on camera characteristics (e.g. focal length) and the parallax caused by the difference in viewpoint between the cameras. Such methods, however, require a sizeable database of environmental features to be constructed beforehand for image retrieval and are sensitive to changes in the environment, making them unsuitable for large-scale deployment. The large database makes retrieval slow, and the mainstream solution to this problem is to divide the database according to semantics (Dai et al., 2019; Jia et al., 2021; Zatout & Larabi, 2022). For example, Dai et al. (2019) proposed a semantic and content-based image retrieval (SCBIR) approach: by dividing the offline database into semantic databases of different semantic types, the scope of image retrieval is narrowed and the retrieval time is reduced. Jia et al. (2021) also proposed a semantic-based indoor visual localisation method, in which representative infrastructure objects are first selected using semantic extraction and classification to build a semantic-based offline database; a semantic constraint-based feature point selection method is then used for image retrieval, and the best matching images are obtained to perform user location estimation. Databases can also be partitioned by clustering (Jia et al., 2020).
The second issue with image retrieval is that it struggles to adapt to environmental changes. Extracting key semantic features is an effective solution to this problem (Jia et al., 2021), and Wen et al. (2018) proposed the idea of lifelong learning by iterative compression to obtain reliable features. The third problem of image retrieval is low matching accuracy, which can be addressed by improving the image resolution and deblurring the image to facilitate feature extraction (Jia et al., 2022). Jia et al. (2021) suggested matching multiple images by considering the contextual information of the environment. Other processing strategies include wavelet denoising (Wang et al., 2019) and foreground-background separation (Zheng et al., 2021).

Indoor positioning method relying on pre-arranged beacons
For example, Bookmark (Pearson et al., 2017) supports scanning the barcodes of books in a library to obtain the current location relative to the library. Robinson et al. (2014) demonstrated the potential of barcodes for localisation in real, large library scenarios. Kunhoth et al. (2019) also proposed a system that uses QR codes and BLE beacons to locate the user's position. Because ceilings are not easily obscured, the use of mobile robots to identify ceiling features is also an active area of research (Xu et al., 2009; Zhang et al., 2018). Tyukin et al. (2016) proposed an image-processing-based robot navigation and positioning system consisting of a simple monocular camera and non-illuminated coloured beacons. However, these techniques require a sizeable number of beacons to be arranged in advance, are not robust to moved or missing beacons, and do not provide localisation accuracy guarantees.

Indoor localisation method based on target detection
From the perspective of image processing, the above methods all belong to the category of image classification (Shereena & David, 2014). In contrast to these methods, Wang et al. (2018) proposed a binocular visual localisation method based on regions of interest. Inspired by this, this paper introduces monocular camera-based target detection to the indoor localisation problem for the first time. A coordinate mapping from the two-dimensional image to two-dimensional space is first established by the three-point method; target detection is then performed on fiducials placed at a uniform height; finally, an image-to-space coordinate transformation of the detection centroid yields the actual spatial coordinates of the object to be located. Without the need for extra hardware, this solution can be implemented on top of already installed security surveillance systems. In addition, as the positioning beacon is placed on top of the object to be located, there are few problems with line-of-sight obstruction and the method is not affected by changes in the environment.

Proposed system
In this section, the proposed method is first described in detail. Then the key modules and important algorithms are analyzed, including the initialisation of the system parameters, YOLOv7-based target detection and the mapping of images to spatial coordinates.

System architecture
We have designed an indoor movable object condition monitoring system based on a single image, which is capable of achieving centimetre-level positioning accuracy.
The system consists of three parts, as shown in Figure 1. The process consists of (a) initialisation of the system parameters and (b) determination of the spatial location from the image through the model.
The system parameters are initialised as shown in Figure 1a. First, we place the camera in a suitable position and then collect the 3D actual coordinates and the corresponding 2D pixel coordinates of the marker at three different positions, which requires manual assistance. We use the 3D actual coordinates and 2D pixel coordinates to obtain the internal parameter matrix of the camera, which is described below.
The image-to-position model is shown in Figure 1b. We obtain the image coordinates of the positioning beacon centroid through the YOLOv7 model, and then obtain the real-world location of the forklift through the image-to-space coordinate mapping.

System parameters
In this section, the method of initialising the system parameters is described. The real-world coordinates of the beacon centroid and the corresponding image coordinates can always be obtained manually with the beacon at different positions; the input to the system is three such pairs of corresponding coordinates that are not collinear.
According to the principle of pinhole imaging, the midpoint of a line segment in the real world corresponds to the midpoint of that line segment in the image. For a planar coordinate system, two non-collinear vectors can represent any vector in that plane, so only three non-collinear corresponding points are needed to determine the parameters of the system. Assuming that the camera shoots without distortion, Figures 2 and 3 show the rules for establishing the spatial and image coordinate systems respectively. As shown in Figure 2, we constructed a world coordinate system with a corner of a wall in real space as the origin, the X and Y axes parallel to the walls, and the Z axis perpendicular to the ground. The camera coordinate system takes the camera optical centre as its origin, and its X and Y axes are parallel to the X and Y axes of the image coordinate system (shown in Figure 3). For the experiments we placed the auxiliary positioning beacons at a uniform height on top of the machines. Since the logistics vehicles in the workshop are always of equal height, we can ignore the height dimension and thus simplify the problem. As shown in Figure 3, the point O is the midpoint of the line segment AB, and the three points A, B and C are the inputs to the system, i.e. their corresponding real-world coordinates (Xw, Yw) are known. The spatial coordinates of the point O are calculated by the following equation (1).
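Since O is defined as the midpoint of AB, equation (1) presumably takes the form of the standard midpoint formula. The subscripted names below (for example X_{A_w} for the real-world X coordinate of A) are assumed notation rather than the authors' own:

```latex
X_{O_w} = \frac{X_{A_w} + X_{B_w}}{2}, \qquad
Y_{O_w} = \frac{Y_{A_w} + Y_{B_w}}{2}
```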
Considering the linear relationship between the three points A, O and B, equation (2) can be obtained.
The calculation of the midpoint coordinates shows that we need a relative reference point and a scaling (deflation) ratio on each of the X and Y axes, which leads first to the two key parameters of the image coordinate system, Kcwx and Kcwy, calculated by equation (3).
where Kcwx and Kcwy are the image-to-space scaling ratios on the X and Y axes respectively.
Finally, the point Oc is chosen as the relative reference point, given by equation (4).
In the above derivation, it is in fact sufficient to use two points whose connecting line is not parallel to either image coordinate axis. To ensure normal initialisation of the system parameters, it is nevertheless recommended to choose three non-collinear points to balance the error; the formula for determining the system parameters from any two of them is the same as above. In addition, in the special case where AC and AB in Figure 3 are parallel to the image x and y axes, the system takes the midpoint O of BC for the calculation.
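The initialisation described above can be illustrated with a minimal Python sketch. It assumes an axis-aligned linear model, world = K × (pixel − O), applied independently to each axis, which is one reading of the description of Kcwx, Kcwy and the reference point Oc. The function names, the averaging over point pairs, and the sample coordinates are hypothetical and are not the authors' implementation.

```python
from itertools import combinations
from statistics import mean


def estimate_axis(pixels, worlds, eps=1e-9):
    """Estimate (scale, origin) for one axis.

    Assumed model (not confirmed by the source): world = scale * (pixel - origin).
    """
    pairs = list(zip(pixels, worlds))
    # One scale estimate per pair of points that differ on this pixel axis.
    scales = [
        (w2 - w1) / (p2 - p1)
        for (p1, w1), (p2, w2) in combinations(pairs, 2)
        if abs(p2 - p1) > eps
    ]
    if not scales:
        raise ValueError("all points share the same pixel coordinate on this axis")
    scale = mean(scales)
    if abs(scale) < eps:
        raise ValueError("degenerate scale; check the input correspondences")
    # Averaging over all points balances the error of individual clicks.
    origin = mean(p - w / scale for p, w in pairs)
    return scale, origin


def init_parameters(image_pts, world_pts):
    """Return (k_cwx, k_cwy, o_cx, o_cy) from corresponding point pairs."""
    xs, ys = zip(*image_pts)
    world_xs, world_ys = zip(*world_pts)
    k_cwx, o_cx = estimate_axis(xs, world_xs)
    k_cwy, o_cy = estimate_axis(ys, world_ys)
    return k_cwx, k_cwy, o_cx, o_cy


# Hypothetical calibration clicks (pixels) and measured positions (metres).
image_points = [(100, 50), (300, 50), (100, 250)]
world_points = [(2.0, 1.0), (6.0, 1.0), (2.0, 5.0)]
print(init_parameters(image_points, world_points))
```

Pairs of points that share a pixel coordinate on an axis are skipped for that axis, which mirrors the requirement that the points used must not be parallel to the image axes.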

YOLOv7-based target detection
In this section, the rationale for model selection is explained, the construction of the training dataset is analyzed, and the transfer learning strategy used to speed up training is introduced.
As an end-to-end target detection model, YOLOv7 is 509% faster and 2% more accurate than the Transformer-based detector SWIN-L Cascade-Mask R-CNN, and 551% faster and 0.7% more accurate than the convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN (Wang et al., 2022). The YOLOv7 series models include YOLOv7-E6E, YOLOv7-D6, YOLOv7-E6, YOLOv7-W6, YOLOv7-X, YOLOv7 and YOLOv7-tiny-SiLU. Among them, YOLOv7-tiny-SiLU has the lowest number of parameters and a GPU operating speed of up to 286 FPS. YOLOv7-tiny-SiLU was tried first in order to meet the real-time requirements of industrial application scenarios; since its accuracy in testing met expectations, it was adopted. Traditional neural networks require learning parameters from large amounts of data, and although small-sample training methods such as those proposed by Liu et al. (2021) have emerged, they require the model to be modified. Methods such as Dropout and regularisation also require changing the model structure to reduce model complexity, and they all limit the distribution of model parameters, making the model more difficult to understand. Data augmentation, on the other hand, does not reduce the complexity of the network, nor does it increase the computational complexity or tuning effort, and it acts as an implicit regularisation method. It is more meaningful in practical applications and reflects the centrality of data. This paper therefore chooses data augmentation to expand the dataset.
Data augmentation is a machine learning technique that improves the performance of a model by adding new samples to the training data. It improves the generalisation of the model, making it more adaptable to new data, and reduces the risk of overfitting, which improves robustness. It also effectively expands the training set, providing richer information to the model.
After varying degrees of rotation, panning (translation), blurring, noise addition and colour interference, an augmented dataset 81 times larger than the manually acquired dataset was obtained. Note that rotation and panning are applied first, and blurring, noise addition and colour interference are applied on top of them, as shown in Figure 4. Figure 5 shows a before-and-after comparison.
These augmentation methods have the following main advantages. Rotating the image increases the robustness of the model to the object's orientation. Panning the image increases the robustness of the model to the position of the object. Adjusting the image contrast increases the model's ability to adapt to changes in object brightness. Adding noise increases the robustness of the model to disturbances. In summary, augmenting the data in these ways improves the model's ability to generalise, making it more adaptable to new data.
The dataset for target detection differs from that for image classification in that the network also requires the centre coordinates (x, y) of the beacon to be detected and the height h and width w of the anchor frame. To reduce the manual annotation workload, we designed a method for automatically generating these parameters for the augmented images. First, we manually annotate the original dataset. If the centre of the anchor frame is (x, y) before the move, with +x0 denoting the upward move and +y0 the rightward move, then the centre of the anchor frame after the move is (x + x0, y + y0).
Rotating the image not only changes the centre coordinates of the anchor frame but also makes the original w and h meaningless. Assume that the centre of rotation is (x1, y1), the angle of rotation is θ and the scale is β. The parameters of the anchor frame after rotation are then calculated as follows.
$$\alpha = \arctan\frac{y - y_1}{x - x_1} + \theta \qquad (5)$$
The coordinate system is reconstructed with (x, y) as the origin, and α is the angle, relative to the x-axis of the new coordinate system, of the line connecting the centre of rotation to the centre of the anchor frame after rotation.
where d is the length from the centre of rotation, after scaling, to the centre of the anchor frame after rotation.
where f(x, y) is the rotational coordinate transformation function, and f(x, y)max and f(x, y)min are the maximum and minimum values of the horizontal and vertical coordinates of the four boundary points of the anchor frame after rotation, respectively. Blurring, noise addition and colour interference do not change the parameters of the anchor frame, so the parameters remain the same as before processing.
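The automatic label generation described above can be sketched in Python as follows. This is a minimal illustration that uses the standard 2D rotation about a centre, followed by the minimum and maximum of the four rotated corner points; it is not the authors' code. The function names, the (x, y, w, h) box layout and the sign convention of the rotation angle are assumptions.

```python
import math


def translate_box(box, dx, dy):
    """Shift a box (x, y, w, h) given by its centre; width and height are unchanged."""
    x, y, w, h = box
    return (x + dx, y + dy, w, h)


def rotate_box(box, centre, theta_deg, scale=1.0):
    """Axis-aligned box enclosing `box` after rotation about `centre` and scaling.

    The four corners are rotated, then the new frame is taken from the minimum
    and maximum of their coordinates, as described in the text.
    """
    x, y, w, h = box
    cx, cy = centre
    theta = math.radians(theta_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = [
        (x - w / 2, y - h / 2),
        (x + w / 2, y - h / 2),
        (x + w / 2, y + h / 2),
        (x - w / 2, y + h / 2),
    ]
    rotated = [
        (
            cx + scale * ((px - cx) * cos_t - (py - cy) * sin_t),
            cy + scale * ((px - cx) * sin_t + (py - cy) * cos_t),
        )
        for px, py in corners
    ]
    xs, ys = zip(*rotated)
    return (
        (min(xs) + max(xs)) / 2,  # new centre x
        (min(ys) + max(ys)) / 2,  # new centre y
        max(xs) - min(xs),        # new width
        max(ys) - min(ys),        # new height
    )


# Hypothetical label: a 4 x 2 box centred at (10, 0), rotated 90 degrees about the origin.
print(rotate_box((10.0, 0.0, 4.0, 2.0), (0.0, 0.0), 90.0))
```

Blurring, noise addition and colour interference leave the box untouched, so in such a pipeline the label of the geometrically transformed image would simply be copied to those variants.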
To speed up model training, a popular choice is to start from existing model weights; in this paper, training starts from weights pre-trained on the COCO dataset.

Image to space coordinate mapping
In the System parameters section, we constructed a coordinate mapping between 2D space and camera-captured images in order to obtain the transformation parameters of the system. These parameters are used in this section to detail the inverse mapping from image to space.
The coordinates of the centroid of the positioning beacon, denoted (Xp, Yp), are obtained from the pre-trained target detection model, and their conversion to the two-dimensionalised space is calculated as follows.
where (Xw, Yw) are the two-dimensionalised spatial coordinates, (Ocx, Ocy) are the camera image coordinate origins, and Kcwx and Kcwy are the scaling coefficients in the x and y directions respectively.
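As a rough illustration of this mapping, the Python sketch below converts a detection box to its centroid and then to plane coordinates. It assumes the conversion has the form Xw = Kcwx × (Xp − Ocx) and Yw = Kcwy × (Yp − Ocy), which is consistent with the symbol definitions above but is not confirmed by the source. The function names, the (x1, y1, x2, y2) box layout and the numeric values are hypothetical.

```python
def box_centroid(box):
    """Centre (Xp, Yp) of a detection box given as (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def image_to_space(xp, yp, o_cx, o_cy, k_cwx, k_cwy):
    """Map a pixel coordinate to plane coordinates under the assumed linear model."""
    return (k_cwx * (xp - o_cx), k_cwy * (yp - o_cy))


# Hypothetical detection and system parameters.
xp, yp = box_centroid((290, 240, 310, 260))
print(image_to_space(xp, yp, o_cx=100.0, o_cy=50.0, k_cwx=0.02, k_cwy=0.02))
```

Because the mapping is a per-axis scale and offset, it costs only a few arithmetic operations per detection, so the detector dominates the run time.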

Experimental results
To test the effectiveness and reliability of the proposed method, both simulation and field deployment experiments were carried out. Error tests were also carried out for different placement heights of the positioning markers and for both stationary and moving states.

Simulation model
In the experimental setup, the 3D modelling software Solid Edge ST8 was used to build the simulated scene shown in Figure 6. The site was 20 m long and 8 m wide, and the positioning beacon measured 30 cm × 30 cm and was placed at a uniform height of 1.5 m above the ground. To calculate the system parameters, we modelled the centre of the positioning beacon as a through-hole, which helps to collect the coordinate parameters needed for the three-point method.
Without moving the main viewpoint, the object to be positioned is moved so as to simulate the camera's view. We used the grid method with a grid size of 2 m × 2 m and collected 27 experimental sample points within the field of view; the distribution of sample points is shown in Figure 7. To verify the reliability of the three-point method, three sets of data required for the initialisation of the system parameters were collected; the corresponding results are given in Table 1.
Without changing the camera angle, almost identical image origins and scaling factors are obtained for any three pairs of corresponding coordinate points, which verifies the correctness of the system parameter initialisation method.
The results of the target detection are shown in Figure 8. As the positioning beacon is small relative to the entire field of view of the camera, the error between the centre of the detection frame and the centre of the beacon is also relatively small.
We fed the 270 sample points collected into the pre-trained model, and the average error profile is shown in Figure 9.
The true distribution of the sample points and a comparison with the predicted results are shown in Figure 10.

Error and reliability analysis
The requirement for positioning beacons of uniform height is a major drawback of the system. To assess the impact of height differences, we designed the following four sets of experiments, the results of which are shown in Table 2. Without changing the system parameters, only the height of the positioning beacon was modified. Unsurprisingly, the positioning error increased significantly whether the beacon was raised or lowered, but even though the modifications were of the same magnitude, the errors were not the same because the directions were opposite. After analysis, we believe that this is because target detection itself is subject to a certain amount of error, and error neutralisation occurs here. In addition, a change in beacon height corresponds to a movement in the image, so with x and y remaining constant, the error is theoretically affected by the angle of camera placement, which is not discussed further here. Although a difference in height introduces a small error, the system is well suited to locating personnel, who only need to wear a helmet as the positioning beacon.
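The dependence on camera placement noted above can be illustrated with a simplified geometric sketch in Python. It assumes an ideal pinhole camera mounted at a known height, and uses similar triangles to compute how far a beacon appears to move on the calibration plane when its real height differs from the calibrated one. The function name, its parameters and the sample values are hypothetical, and the model ignores detection error entirely.

```python
def height_induced_shift(distance, camera_height, calib_height, beacon_height):
    """Apparent horizontal shift (same unit as the inputs) on the calibration plane.

    distance:      horizontal distance from the camera to the beacon
    camera_height: height of the camera's optical centre above the floor
    calib_height:  beacon height used when the system was initialised
    beacon_height: actual beacon height
    """
    if camera_height <= max(calib_height, beacon_height):
        raise ValueError("the camera must be mounted above the beacon")
    # The ray from the camera through the beacon meets the calibration plane at
    # distance * (camera_height - calib_height) / (camera_height - beacon_height).
    return distance * (beacon_height - calib_height) / (camera_height - beacon_height)


# Hypothetical setup: camera at 3 m, calibration at 1.5 m, beacon 5 m away.
raised = height_induced_shift(5.0, 3.0, 1.5, 1.7)
lowered = height_induced_shift(5.0, 3.0, 1.5, 1.3)
print(round(raised, 3), round(lowered, 3))
```

Under these assumptions the shift grows with the horizontal distance from the camera, and equal height changes in opposite directions give shifts of unequal magnitude.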
Although the simulation experiments achieved good results, they only prove the feasibility of the theory, so a field test is necessary. To deploy the model, the layout of the workshop must be taken into account so that a suitable coordinate origin and coordinate system can be chosen. Figure 11 shows a scaled plan of the workshop. A plane coordinate system is first established with the top left corner as the origin, the long side as the x-axis and the short side as the y-axis. The cart is parked in the auxiliary marker area, and the image coordinates are obtained by manually clicking on the centre of the marker at the monitoring end, from which the coordinate transformation factors of the camera are obtained. To improve the accuracy of target detection models, a common idea (Sun et al., 2022) is to design novel feature extraction networks that generate high-quality feature representations. Meanwhile, Xia et al. (2020) proposed an efficient framework for salient target detection based on distributed edge guidance and iterative Bayesian optimisation, taking full account of colour, spatial and edge information. Inspired by this, we propose a new idea in this paper. As shown in Figure 12, the sign to be detected is placed on top of the logistics cart at a uniform height of 2.1 m above the ground. When manually labelling the dataset, we found that one subconsciously locates the logistics cart first, so a joint detection frame was designed that takes into account the relationship between the sign and the logistics cart; it is characterised by the fact that the sign frame always lies in the upper interior area of the cart frame.
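The geometric property of the joint detection frame, that the sign frame lies in the upper interior of the cart frame, could for instance be used as a plausibility check on detections. The Python sketch below shows one such check. The source does not state that the authors apply the constraint in this way, and the function name, the (x1, y1, x2, y2) box layout and the `upper_fraction` threshold are hypothetical.

```python
def sign_in_upper_cart(sign_box, cart_box, upper_fraction=0.5):
    """True if the sign box lies inside the top part of the cart box.

    Boxes are (x1, y1, x2, y2) in pixels, with y increasing downward.
    upper_fraction is an assumed share of the cart height that counts as "upper".
    """
    sx1, sy1, sx2, sy2 = sign_box
    cx1, cy1, cx2, cy2 = cart_box
    upper_limit = cy1 + upper_fraction * (cy2 - cy1)
    inside_horizontally = cx1 <= sx1 and sx2 <= cx2
    inside_upper_part = cy1 <= sy1 and sy2 <= upper_limit
    return inside_horizontally and inside_upper_part


# Hypothetical detections.
cart = (100, 100, 300, 400)
print(sign_in_upper_cart((180, 120, 220, 160), cart))  # sign near the top of the cart
print(sign_in_upper_cart((180, 300, 220, 340), cart))  # box near the bottom of the cart
```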
In training the YOLOv7 model, the number of training epochs is set to 300, the batch size to 64, the Adam optimiser is used, and the momentum parameter is set to 0.999. Figure 13 shows the results of model training, where the mAP_0.5, precision and recall metrics are all close to 1. The results of model training are as expected.
Figure 14 shows the results of the target detection: the enclosing frame is always well chosen, so that the centroid of the detection is very close to the coordinates of the sign centroid in the image. The error at each test point is shown in Table 3. The industrial scenario in which the experiments were conducted requires an accuracy of 0.2 m or better, and the model meets this requirement; our positioning accuracy is also better than that of traditional positioning methods (Huang et al., 2022).
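A check of this kind can be sketched in a few lines of Python. The sketch takes the positioning error to be the Euclidean distance between the predicted and measured plane coordinates, which is an assumption about how the errors in Table 3 are defined. The function names and sample coordinates are hypothetical, and only the 0.2 m tolerance comes from the text.

```python
import math


def positioning_errors(predicted, actual):
    """Euclidean distance between each predicted and measured point (metres)."""
    return [math.dist(p, a) for p, a in zip(predicted, actual)]


def summarise(errors, tolerance=0.2):
    """Mean and maximum error, and whether every point meets the tolerance."""
    return {
        "mean": sum(errors) / len(errors),
        "max": max(errors),
        "within_tolerance": all(e <= tolerance for e in errors),
    }


# Hypothetical test points, not taken from Table 3.
predicted = [(1.0, 1.0), (3.0, 4.0)]
actual = [(1.0, 1.1), (3.0, 4.0)]
print(summarise(positioning_errors(predicted, actual)))
```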
In addition to the stationary error test, we also designed a movement error test in which the cart travels in a straight line and the cart trajectory estimated by the model is compared with its real trajectory, as shown in Figure 15.
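One simple way to quantify such a comparison is the perpendicular distance of each estimated position from the straight line the cart actually followed. The Python sketch below shows this measure. It is an assumed metric rather than the one used for Figure 15, and the function name and sample values are hypothetical.

```python
import math


def deviation_from_line(point, start, end):
    """Perpendicular distance from `point` to the line through `start` and `end`."""
    (px, py), (ax, ay), (bx, by) = point, start, end
    length = math.hypot(bx - ax, by - ay)
    if length == 0:
        raise ValueError("start and end must be different points")
    # The magnitude of the 2D cross product gives twice the triangle area.
    return abs((bx - ax) * (py - ay) - (by - ay) * (px - ax)) / length


# Hypothetical track along the x-axis and three estimated positions (metres).
track_start, track_end = (0.0, 0.0), (10.0, 0.0)
estimates = [(1.0, 0.05), (5.0, -0.1), (9.0, 0.0)]
print([deviation_from_line(p, track_start, track_end) for p in estimates])
```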
The above results show that there is some error both when the cart is at rest and when it is moving, but the overall error is within an acceptable range. Compared with systems that require the deployment of a large number of devices to obtain high-precision positioning results, the system proposed in this paper is built on existing security monitoring systems, and its almost zero deployment cost makes wider deployment more feasible.
The brightness of the environment is an important factor affecting the accuracy of target detection. To test the adaptability of the pre-trained model to different light intensities, we selected scenes at the same location and different time periods for the experiments. The results are shown in Figure 16. The pre-trained model has good detection accuracy in both brighter and darker scenes. This shows that the model has a high resistance to interference and adapts well to the environment.

Conclusions
Obtaining three-dimensional spatial coordinates requires a more complex system design and remains a rather difficult problem at present. In this paper, we propose the idea of two-dimensionalized space, which simplifies space into a plane so that only plane-to-plane coordinate transformation relations need to be constructed. This is relatively easy to achieve, and the experiments show that the error is within the acceptable range.
The approach proposed in this paper shifts the burden of the indoor localisation problem to the accuracy of target detection. As the localisation beacons can be reused, the parameters of the model do not require redundant training, making the system easy to deploy and easy to generalise. The experiments validate the feasibility of treating space in two dimensions and demonstrate that a certain height difference between beacons does not cause excessive errors; the height variation may even have a positive effect because of the inherent error in the centroid coordinates obtained through target detection. This means that it is feasible, for example, to locate a person precisely by using the head as the beacon. The system can be effectively combined with security monitoring systems to make the best use of them; in addition, the precise positioning provides the basis for motion tracking and path planning.
The limitation of this paper is that although the process of coordinate conversion is reliable, there is an unavoidable error in the coordinates prior to conversion. To address this issue, we can consider improving the localisation beacon to make the results of target detection more accurate. In our next step, to further improve the detection accuracy, we will discuss and validate ways of combining multiple positioning beacons.

Figures 2 and 3. Rules for establishing the spatial and image coordinate systems, respectively.

Figure 5. Comparison of before and after image processing.

Figure 6. Simulation of the experimental environment.

Figure 7. Distribution of sample points.

Figure 16. Detection results of the model at different light intensities.

Table 1. Initialisation of system parameters.

Table 2. Positioning errors corresponding to different heights.

Table 3. Positioning errors for different test points.