An Ultralightweight Object Detection Network for Empty-Dish Recycling Robots

The emergence of empty-dish recycling robots has alleviated problems, such as labor shortages, caused by an aging population. The detection and grasping of dishes play a crucial role in empty-dish recycling robots. However, due to the limited resources of edge devices, traditional object detection models require more space to store parameters and much computational overhead, limiting the development of empty-dish recycling robots. Therefore, this article proposes an ultralightweight dish detection model YOLO-GS for an empty-dish recycling robot. We use the modified CSPDarknet as the backbone structure and design an ultralightweight neck structure for efficient feature fusion. Meanwhile, we design a lightweight head structure for object classification and bounding box coordinate regression by combining ghost shuffle convolution (GSConv2D) and the anchor-free method. For the empty-dish recycling robot to grasp the dishes, we design a dish grasp point extraction algorithm using image processing. Finally, TensorRT is used to optimize and accelerate the model for efficient and intelligent detection of dishes on the NVIDIA Jetson Xavier NX. The experimental results show that YOLO-GS achieves 99.380% mean average precision (mAP) with a parameter amount of 0.606 M. The inference speed of the TensorRT-optimized YOLO-GS algorithm reaches 31.371 FPS, which meets the needs of real-time dish detection by the empty-dish recycling robot. The image of the empty-dish recycling robot demo is available at https://www.youtube.com/watch?v=pCBo1nzm3qU&t=22s.


I. INTRODUCTION
WITH the aging of the global population and the relative reduction of the total labor force, many countries have experienced serious labor shortages. The labor shortage and the increasing labor cost are seriously restricting the development of the food service industry. With the rapid development of artificial intelligence technology, intelligent food service robots built on these technologies have emerged to address the labor shortage and rising labor costs [1]. This article studies the empty-dish recycling robot, one kind of intelligent food service robot.
In the food service industry, empty-dish recycling robots need to solve indoor location, dish detection, dish grasping, recycling, walk-off, and so on. Among them, the effective detection and grasp of dishes scattered on the desktop are the critical problems of the empty-dish recycling robot. With the development of convolution neural networks (CNN), computer vision-based object detection algorithms have made breakthroughs in the field of dish detection [2], [3]. Yue et al. [1] use traditional YOLOv4 to detect dishes and achieve more than 96.00% high accuracy on precision, recall, and F1 values. Wang et al. [4] use traditional YOLOv3 to detect 16 classes of dishes and achieve a mean average precision (mAP) of 96.40%. Yue et al. [5] propose a dish grasp point extraction algorithm based on image processing technology, which can extract the grasp point coordinates of dishes in a 2-D plane. The empty-dish recycling robot needs to have high mobility. Therefore, edge devices are selected as the control and inference platform. Due to the limited resources of edge devices, traditional object detection models take a long time to load models on edge devices and require ample space to store models. Therefore, designing an object detection model suitable for running on an edge platform and meeting the requirements of real-time and accurate dish detection have become an urgent problem to be solved.
To solve the abovementioned problems, we develop an ultralightweight object detection model YOLO-GS for the empty-dish recycling robot, where GS indicates that the model uses ghost shuffle convolution (GSConv2D). GSConv2D reorders the output of the Ghost module (a structure that generates a large number of feature maps with a few computations) through channel shuffle, thereby preserving the hidden connections between channels. We choose a lightweight CSPDarknet network as the backbone structure and adjust the structure to reduce the parameters while ensuring feature extraction capabilities. We propose an ultralightweight neck structure using GSConv2D to reduce the number of parameters and calculations while enhancing the interaction ability of features and improving feature fusion efficiency. Simultaneously, we design object classification and bounding box coordinate regression as two parallel branches, which reduces the number of parameters and floating point operations (FLOPs) and forms a lightweight head structure. Finally, we design an image processing-based dish grasp point extraction algorithm to grasp the dishes. In addition, we use TensorRT to quantize YOLO-GS to 16-bit floating point and deploy it on the Jetson Xavier NX, the control platform of the empty-dish recycling robot.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The contributions of this work are summarized as follows.
1) We design an ultralightweight neck structure for efficient feature fusion with minimal parameters and FLOPs.
2) We propose an ultralightweight dish detection model YOLO-GS for the empty-dish recycling robot. The model has only 0.606 M parameters, and the mAP reaches 99.380%. The model requires very little storage space, and the load time is significantly reduced.
3) We design a dish grasp point extraction algorithm that extracts the grasp points of the detected dishes through image processing and obtains the grasp point coordinates of the dishes in the 3-D space of the empty-dish recycling robot.
4) We quantize YOLO-GS to 16-bit floating point without degrading precision and deploy it on the Jetson Xavier NX, the control system of the empty-dish recycling robot.
The remainder of this article is arranged as follows. Section II reviews previous work on empty-dish recycling robots, lightweight object detection models, and object detection applications on edge platforms. In Section III, we describe the proposed method in detail. The experiments, discussions, and future work are presented in Section IV. Finally, Section V concludes this article.

II. RELATED WORK
In this section, we first summarize the recent progress of empty-dish recycling robots and then review the literature on lightweight object detection models, model quantization, and deployment on edge platforms.

A. Overview of Empty-Dish Recycling Robot
In the intelligent service robot, the empty-dish recycling robot needs to solve tasks, such as indoor location, dish detection, dish grasping, and so on, which have high complexity. Therefore, robots are mainly used for automatic food serving, table cleaning, and so on. There is little research on empty-dish recycling robots [6].
Yin et al. [7] propose a table cleaning and inspection method using a human support robot (HSR). A lightweight deep convolutional neural network (DCNN) recognizes the food litter on the table, and cleaning paths are then generated from the detected litter to perform cleaning operations. Yue et al. [1] apply YOLOv4 to the dish detection of the empty-dish recycling robot, quantize the YOLOv4 model through TensorRT, and deploy it on Jetson Nano. However, the final inference speed of the model is only 2.3 FPS. Yue et al. [5] propose a lightweight object detection model YOLO-GD, which detects dishes in images, such as cups, chopsticks, bowls, towels, and so on, and design image processing-based methods for extracting the grasp point coordinates of different types of dishes. Notably, the dish detection model has only 11.17 M parameters, and its mAP reaches 97.42%.

B. Lightweight Object Detection Model
Object detection algorithms based on deep learning are mainly divided into two categories. The first is the two-stage object detection algorithm based on candidate regions [8], including R-CNN [9], Fast R-CNN [10], Faster R-CNN [11], and so on. The other is the one-stage object detection algorithm, which treats detection as a regression problem, including YOLO [12], [13], SSD [14], RetinaNet [15], and so on.
In many real-world applications, object detection must be performed in a timely and power-saving manner with computational resource constraints. Many other vision tasks have built lightweight models using methods, such as weight quantization [16], [17], network compression [18], computationally efficient architecture design [19], [20], [21], and so on. For some vision tasks, lightweight networks aim to achieve the best tradeoff between accuracy and efficiency, showing their superiority by reducing the model size and FLOPs with a little performance drop [22].
Meanwhile, several studies have proposed object detection models based on lightweight backbone structures. Guan et al. [23] propose a lightweight three-stage detection framework consisting of a coarse region proposal (CRP) module, a lightweight railway obstacle detection network (RODNet), and a postprocessing stage for recognizing obstacles in a single-railway image. Fan et al. [24] propose a lightweight meter recognition method that combines deep learning and traditional computer vision techniques for an automatic meter reading. Cai et al. [25] propose a one-stage object detection framework based on YOLOv4 for object detection in autonomous driving. At the same time, an optimization network pruning algorithm is proposed to solve the problem that the computing resources of the vehicle-mounted computing platform are limited and cannot meet the real-time performance.

C. Application of Object Detection in Edge Platform
High-end graphics processing units (GPUs) are commonly used in high-performance deep learning applications. However, building a high-performance platform is expensive in terms of cost and power consumption. In real-world application scenarios, the object detection network needs to be deployed on the edge platform. Due to the limited resources of the edge platform, quantized deployment of the object detection network becomes a key factor [26], [27].
Wang et al. [4] propose a YOLOv3-based dish detection network on an FPGA platform, and through different sparse training and pruning methods, the model size is reduced from 62 to 12 MB. Koubaa et al. [28] present a real-world case study of deploying a face recognition application using the MTCNN detector and FaceNet recognizer and demonstrate that TensorRT optimization provides the fastest execution on edge devices. Liu et al. [16] propose a fast and accurate power line edge intelligent detection method called RepYOLO by using the C++ language combined with TensorRT that is employed to optimize and accelerate the model on the NVIDIA Jetson Xavier NX embedded platform, fulfilling efficient power line edge intelligent detection. Tu et al. [17] propose a real-time defect detection method for tracking components based on an improved lightweight instance segmentation network, using the TensorRT inference framework to accelerate the defect detection network and realize edge platform deployment.

III. METHODOLOGY
A. Overview of Empty-Dish Recycling Robot
Fig. 1 shows an illustration of the empty-dish recycling robot at work. The robot consists of a robotic body, arms, fingers, and a camera that collects information about the dishes. The workflow of the empty-dish recycling robot is as follows.
1) The robot identifies the table in the restaurant whose dishes need to be collected and uses the sensor and the drive system to arrive accurately at the empty-dish recycling location based on the stored location information.
2) The robot loads the ultralightweight dish detection model YOLO-GS and moves the camera (Intel RealSense D435) to the top of the table, waiting for the camera to take images.
3) The camera takes images, the loaded model detects the type and position of the dishes in the image, and the proposed extraction algorithm calculates the grasp points of the different types of dishes.
4) The obtained coordinates of the dish grasp point are sent to the robotic arm control system, which calculates the rotation angle of each joint through the inverse kinematics equation, moves the robotic arm to the grasp point position, and grasps the dish with the fingers.
5) The robot puts the dish into the recycling station and repeats steps 4 and 5 until all the dishes are recycled.

B. Overview of the YOLO-GS Framework
We aim to build an ultralightweight and efficient object detection network for the dish detection task of the empty-dish recycling robot. Therefore, we consider many factors, such as the convolution method, lightweight backbone, lightweight feature fusion structure, computational efficiency, computational cost-effectiveness, and so on, and design an ultralightweight object detection model YOLO-GS. The YOLO-GS network structure is mainly composed of three parts, namely, backbone, neck, and head. The backbone network is used for feature extraction, and its output is three effective feature layers. The neck is a feature fusion network that fuses the features of different scales output by the backbone network. The head is a prediction network that predicts objects and bounding boxes on the feature maps output by the neck network. In addition, YOLO-GS adopts the mosaic data augmentation method, which splices images through random scaling, cropping, and arrangement, enriching the diversity of data and reducing the use of GPU memory.

1) Ghost Shuffle Convolution:
In the conventional CNN model, many feature maps are similar and have more redundancy. Han et al. [20] propose the ghost module to generate a large number of feature maps with only a few computations (cheap operations). As shown in Fig. 2, in the ghost module, conventional convolution is first used to generate partial feature maps, then the cheap operation is used to generate redundant features on the generated feature maps, and finally, all the feature maps are concatenated. The dense convolution computation preserves the hidden connections between each channel, while the cheap operation severs these connections completely. Therefore, in this work, the output of the ghost module is reordered by the channel shuffle [19], which improves the flow of global information.
The computational cost of GSConv2D is only 60%-70% of that of standard convolution, but its contribution to the model learning ability is comparable to that of standard convolution [29]. Fig. 2 is a schematic of GSConv2D. Specifically, GSConv2D uses the "halved" convolution operation to retain the interaction information between channels. The features generated by this convolution then undergo simple linear operations (cheap operations) to generate more similar feature maps, that is,
$$F_{\mathrm{GS}} = S\big(F_{1\times 1}(F) \odot \Phi(F_{1\times 1}(F))\big) \tag{1}$$
where ⊙ represents the concatenate operation, F denotes the feature map, F_{1×1}(·) is the stacked structure of a 1 × 1 convolution operation producing half of the output channels, the batch normalization (BN) operation, and the nonlinear sigmoid linear unit (SiLU) activation function, Φ(·) is the linear operation for generating a feature map, and S(·) represents the channel shuffle operation.
Fig. 4. Overview architecture of the ultralightweight neck and head. The GS bottleneck is a stacked structure of two GSConv2D layers that adds the input to the output. Up-sample enlarges the width and height of the data using the nearest neighbor sampling method.
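The GSConv2D data flow described above (a convolution to half of the output channels, a cheap operation on the result, concatenation, and channel shuffle) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: BN is omitted, the cheap operation is assumed to be a 3 × 3 depthwise convolution, and all function names are hypothetical.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def pointwise_conv(x, w):
    # 1x1 convolution on an (H, W, C_in) map with weights (C_in, C_out)
    return x @ w

def depthwise_conv3x3(x, k):
    # Assumed cheap operation: 3x3 depthwise convolution, zero padding, stride 1
    h, w, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + h, j:j + w, :] * k[i, j, :]
    return out

def channel_shuffle(x, groups=2):
    # Interleave the channels of the concatenated groups
    h, w, c = x.shape
    x = x.reshape(h, w, groups, c // groups)
    return x.transpose(0, 1, 3, 2).reshape(h, w, c)

def gsconv2d(x, w_pw, k_dw):
    primary = silu(pointwise_conv(x, w_pw))      # half of the output channels
    cheap = depthwise_conv3x3(primary, k_dw)     # similar features at low cost
    return channel_shuffle(np.concatenate([primary, cheap], axis=-1))

x = np.ones((4, 4, 3))
out = gsconv2d(x, np.full((3, 8), 0.1), np.full((3, 3, 8), 1.0 / 9.0))
print(out.shape)
```

After the shuffle, channels from the dense branch and the cheap branch alternate, so every later layer sees both kinds of features side by side.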
2) Backbone: CSPDarknet in YOLOX-tiny [30] is an excellent feature extraction network that satisfies most feature extraction tasks for dish detection scenarios. Since the features of the dish object are relatively simple, we adjust the structure and parameters based on the CSPDarknet network to reduce the number of parameters while ensuring feature extraction capability. The structure of the backbone network is shown in Fig. 3. At the input of the backbone, the image is downsampled using the focus operation without losing feature information: a slice operation splits the high-resolution feature map into multiple low-resolution feature maps. The backbone network employs the residual structure, and residual skip connections mitigate the vanishing gradient problem. The CSPNet [31] structure produces richer gradient combination information while requiring less calculation. The SiLU activation function provides the nonlinearity of the backbone network. As seen in (2), SiLU is unbounded above, bounded below, smooth, and nonmonotonic, which plays an essential role in optimization and generalization
$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}. \tag{2}$$
3) Ultralightweight Detection of Neck and Head Models:
The features have been described in the backbone structure, and when these feature maps reach the neck, they are already slender enough (the channel dimension reaches the maximum, and the width and height dimensions reach the minimum) and no longer need to be transformed. Therefore, using GSConv2D in the neck structure can better describe the features than using it in the backbone. The low level in the backbone network has less semantic information but accurate object locations. The high level has richer semantic information but coarse object locations. We aim to efficiently fuse low-level and high-level features using fewer parameters. Through the research on the FPN [32] structure, the PANet [33] structure, and other methods, we design an ultralightweight neck structure, as shown in Fig. 4.
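Returning briefly to the backbone input stage, the focus slicing and the SiLU activation can be illustrated with a short NumPy sketch. It assumes the YOLOX-style slicing order (every second pixel, four phase offsets, stacked along the channel axis); the function names are hypothetical, and the convolution that normally follows the slicing is left out.

```python
import numpy as np

def focus_slice(img):
    # (H, W, C) -> (H/2, W/2, 4C); H and W are assumed to be even.
    # Each slice keeps every second pixel, so no value is discarded.
    return np.concatenate(
        [img[0::2, 0::2], img[1::2, 0::2], img[0::2, 1::2], img[1::2, 1::2]],
        axis=-1,
    )

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

img = np.arange(416 * 416 * 3).reshape(416, 416, 3)
print(focus_slice(img).shape)

# SiLU dips below zero for negative inputs and then rises back toward zero,
# which is what makes it nonmonotonic and bounded below.
print(silu(np.array([-5.0, -1.0, 0.0, 1.0])))
```

The slicing halves the spatial resolution while quadrupling the channel count, which is why it downsamples without losing pixel information.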
Through GSConv2D and spatial pyramid pooling (SPP), the low-level features improve the scale invariance of the image, enrich the expression ability, and expand the receptive field. SPP can be expressed as in (3), where ⊙ and C(·) both denote the concatenate operation, F denotes the feature map, f^{k×k} denotes a k × k filter, and MaxP denotes the max pooling operation.
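As a rough illustration of the SPP idea, the sketch below max-pools a feature map at several window sizes with stride 1 and concatenates the results with the input along the channel axis. The kernel sizes 5, 9, and 13 are an assumption borrowed from the common YOLO convention, not values stated in this article; the 1 × 1 convolutions that usually surround the pooling are omitted, and the function names are hypothetical.

```python
import numpy as np

def max_pool_same(x, k):
    # Stride-1 max pooling with a k x k window on a float (H, W, C) map.
    # Padding with -inf keeps the output the same size as the input.
    h, w, _ = x.shape
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    out = np.full_like(x, -np.inf)
    for i in range(k):
        for j in range(k):
            out = np.maximum(out, xp[i:i + h, j:j + w, :])
    return out

def spp(x, kernels=(5, 9, 13)):
    # Concatenate the input with its pooled versions along the channel axis
    return np.concatenate([x] + [max_pool_same(x, k) for k in kernels], axis=-1)

features = np.random.default_rng(0).normal(size=(13, 13, 4))
print(spp(features).shape)
```

Larger windows summarize a wider neighborhood, which is how the pooled branches expand the receptive field without adding parameters.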
YOLOv3 [12], YOLOv4 [13], and YOLOv5 all follow the original anchor-based method, but there are many known problems with the anchor mechanism. First, to achieve optimal detection performance, cluster analysis is required before training to determine a set of optimal anchors. Second, the anchor mechanism increases the complexity of the detection head and the number of predictions per image. The anchor-free method proposed in YOLOX [30] does not need to preset anchors but only needs to regress the object center point and the width and height on feature maps with different scales.

Algorithm 1 Extraction of Grasp Points
Input: The object classes and coordinates; the depth image of the dishes; the number of dishes in the image (Num_dish).
Output: Grasp point coordinates of the object in 3-D space.
We split the head at each position into two outputs. One predicts the class of the object at each feature point. The other predicts the regression parameters of each feature point and determines whether each feature point contains an object. This method reduces the number of parameters and FLOPs, alleviates the imbalance of positive and negative samples, and avoids the adjustment of anchor parameters.
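To make the anchor-free regression concrete, the following sketch decodes raw per-cell outputs into boxes. It assumes the decoding convention used by YOLOX (center offsets added to the grid index, width and height passed through an exponential, both scaled by the stride); the article does not spell out its decoding, so the function name and the exact formula should be read as assumptions.

```python
import numpy as np

def decode_anchor_free(raw, stride):
    # raw: (H, W, 4) array of (tx, ty, tw, th) for one feature map
    # returns (H, W, 4) boxes as (cx, cy, w, h) in input-image pixels
    h, w, _ = raw.shape
    gy, gx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    cx = (raw[..., 0] + gx) * stride     # center x: grid column plus offset
    cy = (raw[..., 1] + gy) * stride     # center y: grid row plus offset
    bw = np.exp(raw[..., 2]) * stride    # width, always positive
    bh = np.exp(raw[..., 3]) * stride    # height, always positive
    return np.stack([cx, cy, bw, bh], axis=-1)

boxes = decode_anchor_free(np.zeros((2, 2, 4)), stride=8)
print(boxes[0, 1])
```

Each feature point yields exactly one box, so no anchor sizes have to be clustered or tuned before training.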

4) Loss Function:
The loss function measures the difference between the predicted value and the true value. Like the prediction result of the network, the loss of the network is composed of three parts, namely, the Cls part, the Obj part, and the Reg part, as formulated in (4). The Cls part is the class of objects contained in the feature points; its loss is the binary cross-entropy (BCE) loss calculated from the class of the real bounding box and the class prediction result of the feature points. The Obj part evaluates whether the feature points contain objects; its loss is the BCE loss calculated from the positive and negative samples and the prediction results of whether the feature points contain objects. The BCE loss is calculated as follows:
$$L_{\mathrm{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}\left[t_i \log(y_i) + (1 - t_i)\log(1 - y_i)\right] \tag{5}$$
where n represents the total number of samples, t_i ∈ {0, 1} is the binary label, and y_i is the predicted probability of the label.
The Reg part predicts the regression parameters of the feature points; its loss is the CIoU loss calculated from the real bounding box and the predicted bounding box. The Reg loss is expressed as follows:
$$L_{\mathrm{Reg}} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\nu$$
$$\alpha = \frac{\nu}{(1 - \mathrm{IoU}) + \nu}$$
$$\nu = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}$$
where IoU is the intersection over union, b and b^{gt} represent the center points of the predicted bounding box and the real bounding box, respectively, ρ is the Euclidean distance between the two center points, and c represents the diagonal length of the smallest enclosing area that contains both the predicted bounding box and the real bounding box. α is the tradeoff parameter, and ν is a parameter used to measure the consistency of the aspect ratio. w^{gt} and h^{gt} represent the width and height of the real bounding box, and w and h represent the width and height of the predicted bounding box, respectively.
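The BCE and CIoU terms can be checked numerically with a small pure-Python sketch. It follows the standard definitions of both losses; boxes are assumed to be given as (center x, center y, width, height), predictions are clipped to avoid log(0), and the function names are hypothetical.

```python
import math

def bce_loss(labels, probs, eps=1e-7):
    # Mean binary cross-entropy over n samples; probs are clipped to (eps, 1 - eps)
    total = 0.0
    for t, y in zip(labels, probs):
        y = min(max(y, eps), 1.0 - eps)
        total += t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return -total / len(labels)

def ciou_loss(pred, gt):
    # pred, gt: boxes as (cx, cy, w, h)
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    p1x, p1y, p2x, p2y = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g1x, g1y, g2x, g2y = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2
    inter_w = max(0.0, min(p2x, g2x) - max(p1x, g1x))
    inter_h = max(0.0, min(p2y, g2y) - max(p1y, g1y))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + gw * gh - inter)
    # Squared diagonal of the smallest box enclosing both boxes
    c2 = (max(p2x, g2x) - min(p1x, g1x)) ** 2 + (max(p2y, g2y) - min(p1y, g1y)) ** 2
    rho2 = (px - gx) ** 2 + (py - gy) ** 2   # squared center distance
    v = 4.0 / math.pi ** 2 * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1.0 - iou) + v) if v > 0 else 0.0
    return 1.0 - iou + rho2 / c2 + alpha * v

print(bce_loss([1, 0], [0.9, 0.1]))
print(ciou_loss((0, 0, 2, 2), (1, 0, 2, 2)))
```

Unlike a plain IoU loss, the center-distance and aspect-ratio terms still give a useful gradient signal when two boxes overlap poorly or have mismatched shapes.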

C. Extraction of Grasp Points
Effective grasping of the dish by a robotic arm is a difficult task in the recycling process. When the grasp points are extracted from the whole image, mutual interference occurs between the individual dishes. By segmenting the individual dishes, we can extract the grasp points effectively. At the same time, we use different methods to extract grasp points for different types of dishes. The height information of the corresponding grasp point is obtained through the RealSense D435 sensor, and finally, the coordinate information of the grasp point in the 3-D space of the empty-dish recycling robot is determined. The extraction method of grasp points is shown in Algorithm 1.
We divide all dishes into five types: circle, ellipse, square, polygon, and irregular. The process of extracting grasp points is shown in Fig. 5, where circle denotes round dishes, square denotes square dishes, polygon denotes polygonal dishes, irregular denotes irregular dishes, and ellipse denotes oval dishes. In the figure, original represents the original image segmented according to the detection result, gray represents the grayscale-converted image, Canny represents Canny edge detection, Gauss represents Gaussian filtering, line represents line detection, threshold represents binarization processing, and result represents the grasp point extraction result.
We first segment each detected dish from the image based on its detected coordinates. Grasp-point extraction is then performed according to the dish type. For circular dishes, we perform grayscale conversion, strengthen the edge information through Canny edge detection, and filter out the redundant information through Gaussian filtering. Finally, Hough circle detection is used to find the dish's contour and the grasp points. For square dishes, we apply grayscale conversion and Canny edge detection, keep the lines in the image that intersect at an angle between 85° and 95°, obtain the intersection points of these lines, calculate the smallest circumscribed rectangle of the intersection points, and finally use the coordinates of the four vertices of this rectangle to calculate the grasp points. For polygonal dishes, we use grayscale conversion, Gaussian filtering, and binarization to convert the original image into a clear binarized image, directly find the largest contour in the image, and then calculate its smallest circumscribed rectangle, whose center is the grasp point. For irregular-shaped dishes, we apply grayscale conversion and Canny edge detection to extract the outline of the dish, use Hough line detection to retain all the straight lines in the image, and perform polygon fitting on the vertices of these lines; the center of the fitted polygon is the grasp point. For elliptical dishes, we find the contour through grayscale processing, Canny edge detection, and the closing operation; ellipse fitting then yields the center, the major axis, the minor axis, and the rotation angle of the ellipse, from which the grasp point is calculated.
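The square-dish branch can be sketched in pure Python once the lines have been detected. The sketch assumes the lines are already available in Hough normal form (rho, theta), as a Hough line detector would return them; it substitutes an axis-aligned bounding rectangle for the smallest circumscribed rectangle to keep the example short, and it takes the rectangle center as the grasp point, which is an illustrative choice rather than the article's stated rule. All names are hypothetical.

```python
import math
from itertools import combinations

def intersect(l1, l2):
    # Each line satisfies x*cos(theta) + y*sin(theta) = rho
    (r1, t1), (r2, t2) = l1, l2
    det = math.cos(t1) * math.sin(t2) - math.sin(t1) * math.cos(t2)
    x = (r1 * math.sin(t2) - r2 * math.sin(t1)) / det
    y = (r2 * math.cos(t1) - r1 * math.cos(t2)) / det
    return x, y

def square_grasp_point(lines, lo=85.0, hi=95.0):
    # Keep only pairs of lines that meet at roughly a right angle
    points = []
    for l1, l2 in combinations(lines, 2):
        angle = math.degrees(abs(l1[1] - l2[1])) % 180.0
        if lo <= angle <= hi:
            points.append(intersect(l1, l2))
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Simplification: axis-aligned bounding rectangle of the intersections
    return (min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0

# Four edges of a square dish plus one spurious diagonal line
lines = [(10, 0.0), (50, 0.0), (20, math.pi / 2), (60, math.pi / 2), (5, math.pi / 4)]
print(square_grasp_point(lines))
```

The angle filter discards the diagonal line because it meets every edge at 45°, so only the four true corners contribute to the rectangle.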

D. Model Optimization Based on TensorRT and Deployment on Edge Platform
TensorRT is a high-performance deep learning inference SDK launched by NVIDIA, which provides low latency and high throughput for deep learning inference applications. TensorRT supports INT8, FP16, and FP32 calculations and achieves the purpose of accelerating inference by achieving an ideal tradeoff between reducing the amount of calculation and maintaining accuracy. More importantly, TensorRT reconstructs and optimizes the network structure. Fig. 6 shows the inference optimization process of TensorRT.
By analyzing the network model, TensorRT eliminates useless output layers in the network to reduce computation. Through the vertical fusion of the network structure, the three layers of convolution, batch normalization, and ReLU found in current mainstream neural networks are integrated into one layer. Through the horizontal fusion of the network, layers that take the same input tensor and perform the same operations are fused together. Finally, the input of the concat layer is sent directly to the following operations, which reduces the transmission throughput and speeds up the inference process to a certain extent [16]. Moreover, TensorRT can quantize the 32-bit floats in the network to 16-bit half floats or 8-bit integers to speed up inference.
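The arithmetic behind fusing a convolution with the batch normalization that follows it can be shown in a few lines of NumPy. This sketch only illustrates the general folding identity on a 1 × 1 convolution; it is not TensorRT's internal implementation, and the function names and the epsilon value are assumptions.

```python
import numpy as np

def conv1x1(x, w, b):
    # 1x1 convolution on an (H, W, C_in) map with weights (C_in, C_out)
    return x @ w + b

def batch_norm(x, gamma, beta, mean, var, eps=1e-3):
    # Inference-time batch normalization with fixed statistics
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-3):
    # Absorb the per-channel scale and shift of BN into the conv weights and bias
    scale = gamma / np.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5, 3))
w, b = rng.normal(size=(3, 4)), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)

two_layers = batch_norm(conv1x1(x, w, b), gamma, beta, mean, var)
w_f, b_f = fold_bn_into_conv(w, b, gamma, beta, mean, var)
one_layer = conv1x1(x, w_f, b_f)
print(np.allclose(two_layers, one_layer))
```

Because the folded layer produces the same output with one pass over the data instead of two, this kind of fusion saves memory traffic without changing the result.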
We evaluate YOLO-GS on two edge platforms, Jetson Nano and Jetson Xavier NX. Table I shows the parameter comparison of the two devices. Jetson Nano contains a quad-core CPU and a GPU with 472 giga floating-point operations per second (GFLOPS). Jetson Xavier NX contains a six-core CPU and a GPU with 21 tera operations per second (TOPS). All devices have the same underlying GPU architecture, so the underlying hardware instruction set remains constant and the results are comparable [28].

IV. EVALUATION
A. Experimental Configurations
1) Implementation Details: We conduct all experiments on an i9-10900 CPU and a single NVIDIA GeForce RTX 3090 Ti GPU. The operating system is Ubuntu 21.04, the CUDA version is 11.4, and the GPU acceleration library cuDNN is 8.2.4. The proposed method is implemented using the TensorFlow library.
All experiments are trained using the Adam optimizer with parameters β_1 = 0.937 and β_2 = 0.999. We decay the learning rate at each epoch with warm-up cosine annealing, where η_min is the minimum learning rate, η_max is the maximum learning rate, T_cur is the number of epochs that have been trained, T_i is the total number of epochs, and W_i is the number of warm-up epochs. In the whole training process, η_max is set to 1e-3, and η_min is set to 1e-5. We train the proposed model for 300 epochs, and the batch size is set to 4. During the evaluation, the confidence threshold is set to 0.5, and the IoU threshold for nonmaximum suppression is set to 0.3.
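A warm-up cosine annealing schedule of this kind can be sketched as below. The exact formula used in the article is not recoverable from the text, so this is one common form and an assumption: a linear warm-up from η_min to η_max over the warm-up epochs, followed by cosine decay back to η_min. The warm-up length of 3 epochs is a hypothetical value; the learning rates and epoch count are the ones stated above.

```python
import math

def warmup_cosine_lr(t_cur, t_total=300, warmup=3, eta_max=1e-3, eta_min=1e-5):
    # Assumed linear warm-up from eta_min to eta_max over `warmup` epochs
    if warmup > 0 and t_cur < warmup:
        return eta_min + (eta_max - eta_min) * t_cur / warmup
    # Cosine decay from eta_max down to eta_min over the remaining epochs
    progress = (t_cur - warmup) / max(1, t_total - warmup)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 1, 3, 150, 300):
    print(epoch, warmup_cosine_lr(epoch))
```

The warm-up phase avoids large, unstable updates while the Adam statistics are still poorly estimated, and the slow cosine tail lets the model settle near the end of training.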
2) Dataset: We use the public dish dataset Dish-20 (available at http://www.ihpc.se.ritsumei.ac.jp/obidataset.html), which contains 506 images in 20 classes. Among them, 409 images are used for training, 46 images are used for validation, and 51 images are used for testing [5]. The images of the dataset are resized beforehand to the YOLO-GS default input size (416 × 416).
3) Evaluation Metrics: To evaluate the effect of the object detection approach, this article mainly uses AP, mAP, the number of model parameters, FLOPs, and inference speed (FPS) as evaluation metrics. AP and mAP represent the accuracy of the model. The number of parameters, FLOPs, and FPS of the model represent the computational resources required by the model [34]. The meanings of these evaluation metrics are as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
where TP represents true positives, FP represents false positives, and FN represents false negatives. P means precision, and R means recall. AP is calculated as the area under the precision-recall curve (P-R curve), expressed as
$$AP = \sum_{n} (R_{n+1} - R_n)\, P_{\max}[R_n, R_{n+1}]$$
where R_n represents the n-th recall value and P_max[R_n, R_{n+1}] represents the maximum precision value in the range [R_n, R_{n+1}]. The mAP is shown in (15), where C is the number of classes and AP_j is the AP of the jth class
$$mAP = \frac{1}{C}\sum_{j=1}^{C} AP_j. \tag{15}$$
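These metrics can be computed with a few lines of pure Python. The sketch below uses the common all-point interpolation of the P-R curve (precision is replaced by the maximum precision at any higher recall before summing the rectangles); whether the article uses exactly this interpolation is an assumption, and the function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    # P = TP / (TP + FP), R = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    # recalls must be sorted in ascending order, one precision per recall value
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas under the interpolated P-R curve
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    # mAP is the mean of the per-class AP values
    return sum(ap_per_class) / len(ap_per_class)

print(precision_recall(tp=8, fp=2, fn=2))
print(average_precision([0.5, 1.0], [1.0, 0.5]))
```

In the second call, precision is 1.0 up to a recall of 0.5 and 0.5 up to a recall of 1.0, so the area is 0.5 × 1.0 + 0.5 × 0.5 = 0.75.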

B. Performance Comparison
We compare the proposed YOLO-GS with 18 state-of-the-art object detection methods, including Faster R-CNN [11], EfficientDet [35], SSD [14], YOLOv3 [12], the YOLOv4 series [13], the YOLOv5 series, the YOLOX series [30], and YOLO-GD [5]. Table II shows the quantitative results. In all tables, −1.000 means no relevant data is detected. The results demonstrate that YOLO-GS achieves accuracy on par with state-of-the-art object detection methods, especially in terms of mAP, AP_11, and AP_50. For example, our method achieves performance comparable to the two-stage detector Faster R-CNN and the YOLOX series but with significantly fewer parameters and FLOPs. YOLO-GS has only 0.606 M parameters, about one-third of YOLOv5-Nano (1.800 M), which has the fewest parameters among the compared models. Our method achieves an inference speed of 108.006 FPS, which is comparable to the inference speed of YOLOX-S and YOLOX-Tiny, but at equivalent performance, its parameter count is only 1/8 that of YOLOX-Tiny and 1/14 that of YOLOX-S. Although its FPS is 1/2 that of YOLOv4-Tiny, our method needs only 1/9 of the parameters of YOLOv4-Tiny and also achieves a higher mAP. The FLOPs of YOLO-GS are only 2.131 G, which is lower than all other compared models except YOLOv5-Nano (1.796 G), whose parameter count is about three times that of YOLO-GS. Because the number of FLOPs is related to energy consumption, the low FLOPs and low complexity of YOLO-GS make it friendly to embedded devices with limited energy. Moreover, YOLO-GS has the highest potential to improve inference speed further.
To better illustrate the tradeoff between accuracy and efficiency, we present three plots in Fig. 7, showing mAP against the number of parameters, the number of FLOPs, and inference speed, respectively. In the plots of mAP versus parameters and mAP versus FLOPs, YOLO-GS is in the top-left corner, which means that YOLO-GS combines an ultralightweight design with good accuracy.
In the figure of mAP versus FPS, YOLO-GS is in the upper middle region, demonstrating its good tradeoff between accuracy and inference speed. Therefore, we can conclude that YOLO-GS achieves a good tradeoff between accuracy, the number of parameters, FLOPs, and inference speed.
Table III verifies the effectiveness of using the Ghost module and GSConv2D in the neck. We found that the results of using GSConv2D in the neck are significantly better than those of the Ghost module. For example, after using GSConv2D, mAP increased by 0.56%, AP_11 increased by 1.7%, AP_50 increased by 0.6%, and AP_75 increased by 1.7%. We found that GSConv2D significantly improves the accuracy of the model while improving its generalization ability.
Table IV shows the results of combining our proposed backbone structure with different state-of-the-art neck + head structures. Compared with Table II, it can be seen that after using our backbone in YOLOv3, the number of parameters is reduced from 61.679 to 20.915 M, the FLOPs are reduced from 65.520 to 17.802 G, the speed is increased from 83.364 to 144.608 FPS, and the mAP is increased from 84.160% to 94.710%. The same effect is observed for YOLOv4. On YOLOv5-Nano, the effect of our backbone is less satisfactory, although it still reduces the number of parameters and improves the inference speed. For YOLOX-Tiny, our backbone significantly reduces the number of parameters and FLOPs at the same mAP. These results prove the effectiveness of our proposed backbone. At the same time, the comparison between the five models shows that our proposed neck + head has the smallest number of parameters and FLOPs. Although its mAP is slightly lower than that of YOLOX-Tiny, our model has only 1/5 of the parameters, and the inference speed is comparable. This shows that our proposed neck + head structure performs feature fusion with the least number of parameters.
Table V shows the results of our proposed YOLO-GS on the test set.
It can be seen that 13 of the 20 dish classes reach an AP of 100.00%. Among the classes, "squarebowl" has the lowest recall and AP values (95.83%), "fish-dish" has the lowest precision (93.10%), and "toweldish" has the lowest F1 (0.90). At the same time, the AP50 of YOLO-GS is 0.990, so the test accuracy meets the work requirements of empty-dish recycling robots. Fig. 8 shows the results of dish detection and grasp point extraction using our proposed ultralightweight object detection model YOLO-GS and grasp point extraction algorithm. Even in a complex desktop environment, YOLO-GS detects the target dishes well. However, some dishes do not appear completely in the image, so their contour fitting is incomplete during grasp point extraction, and no valid grasp point information is extracted for them. For every detected dish that appears completely in the image, our method effectively extracts the grasp points. As a result, our grasp point extraction algorithm satisfies the requirements of the empty-dish recycling robot.
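The size and speed comparisons quoted in this subsection reduce to simple ratios. The following Python sketch recomputes a few of them using only the figures stated in the text; it does not reproduce the full Table II or Table IV. The dictionary layout, the key names, and the helper functions are illustrative assumptions and are not part of the authors' code.

```python
# Figures quoted in the text: parameters in millions (M), FLOPs in G,
# speed in FPS, mAP in percent. The layout and names are illustrative only.
QUOTED = {
    "YOLO-GS": {"params_m": 0.606, "flops_g": 2.131, "fps": 108.006},
    "YOLOv5-Nano": {"params_m": 1.800, "flops_g": 1.796},
    "YOLOv3": {
        "params_m": 61.679, "flops_g": 65.520, "fps": 83.364, "map": 84.160,
    },
    "YOLOv3 + proposed backbone": {
        "params_m": 20.915, "flops_g": 17.802, "fps": 144.608, "map": 94.710,
    },
}


def ratio(a: str, b: str, key: str) -> float:
    """Return model a's value for `key` as a fraction of model b's value."""
    return QUOTED[a][key] / QUOTED[b][key]


def latency_ms(fps: float) -> float:
    """Convert throughput in frames per second to milliseconds per frame."""
    return 1000.0 / fps


if __name__ == "__main__":
    # YOLO-GS versus YOLOv5-Nano: roughly one third of the parameters,
    # but slightly more FLOPs.
    print(f"params ratio: {ratio('YOLO-GS', 'YOLOv5-Nano', 'params_m'):.3f}")
    print(f"FLOPs ratio:  {ratio('YOLO-GS', 'YOLOv5-Nano', 'flops_g'):.3f}")

    # Effect of swapping the proposed backbone into YOLOv3.
    swapped, base = "YOLOv3 + proposed backbone", "YOLOv3"
    for key in ("params_m", "flops_g", "fps"):
        print(f"{key} after/before: {ratio(swapped, base, key):.3f}")
    gain = QUOTED[swapped]["map"] - QUOTED[base]["map"]
    print(f"mAP gain: {gain:.2f} points")

    # Per-frame latency implied by the reported YOLO-GS throughput.
    print(f"latency: {latency_ms(QUOTED['YOLO-GS']['fps']):.2f} ms")
```

Expressing each comparison as a ratio makes it easy to see that the parameter saving relative to YOLOv5-Nano comes with a modest increase in FLOPs rather than a reduction.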

E. Model Optimization and Deployment on Edge Devices
The performance of our proposed YOLO-GS on different quantization methods of GPU and Jetson edge platforms is compared, as shown in

F. Robotic Fingers Grasp Dishes
In practical applications, we found that three pneumatic fingers cannot reliably grasp "Chopsticks," "Paper," "Spoons," "Fish-dish," and similar items, so we use a two-finger gripper with a suction cup to solve this problem, as shown in Fig. 9(a). Fig. 9(b) shows the robot using the suction cup to pick up and recycle a dish that has a smooth surface and is not easy for fingers to grasp. Fig. 9(c) shows the robot using the two-finger gripper to grasp and recycle shallow dishes. In our experimental evaluation, all 20 classes of dishes can be grasped by the two-finger gripper with a suction cup. The results show that the extracted grasp points work well with the robotic fingers for grasping the dishes.

G. Discussion and Future Work
The best-performing model, YOLOX, achieves slightly higher detection performance than our proposed YOLO-GS. However, the parameters and FLOPs of the YOLOX model are too large, so it takes more time to load the model on the edge device and requires more storage space. We therefore use operations that reduce the number of parameters and FLOPs. For example, the computational overhead of GSConv2D is only 60%-70% of that of standard convolution. However, its inference is slower than that of standard convolution because of the memory access cost and memory usage limitations of edge devices. Although YOLOv4-Tiny is twice as fast as YOLO-GS, its accuracy is lower than our 99.380% mAP.
In practical applications, when the empty-dish recycling robot works, factors such as the external environment significantly affect the quality of the dish images. Errors then occur in the feature extraction process of the dish images, eventually leading to errors in the contour fitting and shifts in the position of the grasp point. For example, since the "winecup" is a transparent dish, the image processing stage can easily mistake the bottom contour for the overall contour, resulting in deviations in the extracted grasp points. However, there is a gap between the robot fingers, and the fingers can grasp the dish as long as the grasp point lies within this gap. Hence, the robot fingers work well within a margin of error. In our tests, all dishes are effectively grasped. In future work, we will optimize the grasp point extraction algorithm to solve the grasp point extraction error caused by environmental factors. For all the scattered dishes on the table, we will design the optimal grasping path to solve the problem of overlapping dishes.
At the same time, the inference speed of a model on specific hardware is affected not only by the amount of computation but also by many other factors, such as memory access, hardware characteristics, software implementation, and system environment. The proposed YOLO-GS has achieved an ultralightweight structure with fewer parameters and less computation (FLOPs), and it also has the greatest potential for significantly improving inference speed. In future work, we will focus on addressing the factors other than parameters and FLOPs that slow down inference, such as the amount of memory access and hardware characteristics, to improve the inference speed of our model.
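To put the parameter count in storage terms, the sketch below estimates the raw weight size of a 0.606 M-parameter model at 32-bit and 16-bit floating-point precision. It counts weight bytes only, so the size of a serialized model or TensorRT engine file will differ because of metadata and format overhead. The function name `weight_size_mb` is a hypothetical helper introduced for illustration.

```python
# Bytes needed to store one weight at each floating-point precision.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2}


def weight_size_mb(num_params: float, precision: str) -> float:
    """Estimate raw weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_VALUE[precision] / 1e6


if __name__ == "__main__":
    yolo_gs_params = 0.606e6  # parameter count reported for YOLO-GS
    for precision in ("fp32", "fp16"):
        size = weight_size_mb(yolo_gs_params, precision)
        print(f"{precision}: {size:.3f} MB")
```

Under these assumptions the weights occupy only a few megabytes at either precision, which is consistent with the claim that storage capacity is not the limiting factor for this model on edge devices.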

V. CONCLUSION
Instead of focusing only on model accuracy, this article explores a new direction of object detection, namely, ultralightweight object detection networks, aiming to achieve a good tradeoff between accuracy, efficiency, parameters, and FLOPs. To this end, we propose an ultralightweight object detection model, YOLO-GS, and design an algorithm to extract the grasp points of dishes. Experimental results show that YOLO-GS has only 0.606 M parameters and 2.131 G FLOPs and achieves 99.380% mAP on the Dish-20 dataset. The model is quantized to 16-bit floating point (FP16) through TensorRT, and an inference speed of 31.371 FPS is obtained on the Jetson Xavier NX edge device while maintaining the same accuracy.
To the best of our knowledge, this is the first attempt at object detection toward an accuracy-efficiency tradeoff and ultralightweight models. We demonstrate that our proposed ultralightweight object detection model YOLO-GS effectively detects dishes and extracts the coordinates of grasp points. YOLO-GS has only 0.606 M parameters, far fewer than current object detection models, so it is not constrained by the storage capacity of edge devices. This has far-reaching significance for the development of empty-dish recycling robots.