Waste detection in Pomerania: non-profit project for detecting waste in environment

Waste pollution is one of the most significant environmental issues in the modern world. The importance of recycling is well known, for both economic and ecological reasons, and the industry demands high efficiency. Our team conducted comprehensive research on the use of Artificial Intelligence in waste detection and classification to fight the world's waste pollution problem. As a result, an open-source framework that enables the detection and classification of litter was developed. The final pipeline consists of two neural networks: one that detects litter and a second responsible for litter classification. Waste is classified into seven categories: bio, glass, metal and plastic, non-recyclable, other, paper and unknown. Our approach achieves up to 70% average precision in waste detection and around 75% classification accuracy on the test dataset. The code used in the studies is publicly available online.


Introduction
Litter is scattered in a wide variety of environments. The massive production of disposable goods in recent years has resulted in an exponential increase in produced garbage, reaching about 2 billion tons per year [1]. It was estimated that humans eat up to 250 grams of microplastics each year [2]. Almost 90 percent of this plastic comes from bottled and tap water. Today, more than 300 million tons of plastic is produced annually, and only 30 years are left before the amount of garbage in the ocean exceeds the number of sea creatures [3].
The most difficult challenge in waste management is the complexity and lack of unified guidelines regarding segregation rules. Due to the large energy requirements and related costs, plastic waste separation on many litter sorting lines is done manually. However, new possibilities for the automatic selection of these materials and their reuse are constantly being tested [4].
The Inception network [13] stacks modules called Inception blocks instead of single layers. Inception blocks incorporate multiscale convolutional transformations utilizing a split-transform-merge strategy. Moreover, this design allows the number of units at each stage to be increased without an uncontrolled blow-up in computational complexity. The Inception models have evolved by utilizing batch normalization and factorization of convolutional layers [14], simplification, and the use of asymmetric filters [15]. The proposed Inception-ResNet structure [15] combines the power of Inception modules and residual connections.
The improved connectivity pattern in the DenseNet structure [16] facilitated training and improved accuracy. For each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs to all subsequent layers. DenseNets have several significant advantages: they mitigate the vanishing gradient problem, improve feature propagation, encourage feature reuse, and reduce the number of parameters.
The ResNeXt [17] structure, similarly to Inception, utilizes a split-transform-merge strategy. However, unlike Inception or Inception-ResNet modules, the multiple paths share the same topology and are aggregated by summation. The authors introduced the concept of cardinality, the number of paths in one module. It was observed that increasing cardinality is more effective than going deeper or wider when increasing capacity.
The EfficientNet [18] architecture consists of modules constructed by a neural architecture search process that optimizes for both accuracy and FLOPS. The mobile-size baseline model, called EfficientNet-B0, was built by stacking those modules. Scaling strategies were used to produce the more complex and accurate models EfficientNet-B1 to B7. Scaling of a convolutional neural network is most often performed in one of the following dimensions: width, depth, or resolution. EfficientNets, in contrast, scale all three dimensions jointly, significantly improving efficiency and accuracy. EfficientNetv2 [19] improves on EfficientNet by providing faster training and better parameter efficiency through the elimination of EfficientNet bottlenecks. Neural architecture search was used to optimize accuracy, training speed, and parameter size. Unlike standard EfficientNet, EfficientNetv2 uses non-uniform scaling of depth, resolution, and width. Moreover, the increase in resolution was limited to bound the computational cost.
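The joint scaling of depth, width, and resolution described above can be illustrated with a short sketch. The base coefficients below are the ones reported in the EfficientNet paper [18], not values stated in this text, and the function name `compound_scale` is hypothetical; the multipliers of the released B1 to B7 models may deviate from this idealized rule.

```python
# Compound scaling sketch. ALPHA, BETA, GAMMA are assumed from the
# EfficientNet paper [18]; they do not appear in this text.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # bases for depth, width, resolution


def compound_scale(phi: float) -> dict:
    """Return multipliers for a compound coefficient phi (phi = 0 is the baseline)."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPS of a CNN grow roughly linearly with depth and
    # quadratically with width and input resolution.
    flops = depth * width ** 2 * resolution ** 2
    return {"depth": depth, "width": width, "resolution": resolution, "flops": flops}


for phi in range(4):
    s = compound_scale(phi)
    print(f"phi={phi}: depth x{s['depth']:.2f}, width x{s['width']:.2f}, "
          f"resolution x{s['resolution']:.2f}, FLOPS x{s['flops']:.2f}")
```

With these bases, each unit increase of phi multiplies the estimated FLOPS by about 1.92, i.e. the cost roughly doubles per scaling step.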
Inspired by the successes in Natural Language Processing, recent advances in computer vision architectures rely on the Transformer module. The authors of [20] trained the Vision Transformer (ViT), a pure transformer architecture applied directly to sequences of image patches that performs very well on image classification tasks. The authors stated that training transformers for vision tasks requires large amounts of training samples and extensive computing resources. This particular trait was addressed in Data-efficient image Transformers (DeiT) [21], in which a new model distillation procedure is shown. In [22] the authors present the Pyramid Vision Transformer (PVT), which is designed for dense prediction tasks such as object detection. PVT brings the pyramid structure known from CNNs to transformers. Therefore, it is more suitable for object detection tasks than ViT, which was designed for image classification. The application of transformers in vision is further investigated in [23]. The authors add convolutions to the ViT architecture to introduce shift, scale, and distortion invariance. The proposed Convolutional vision Transformer (CvT) architecture accomplishes this goal through two main modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. The application of transformers in vision tasks is not yet a well-investigated topic, and many teams are currently working on improving it.

Object Detection
Object detection is a well-studied task in computer vision [24]. It is defined as the localization of an Axis-Aligned Bounding Box (AABB) and classification, i.e. the assignment of a single label or multiple labels. In previous works, object detection was approached using two types of techniques, namely one-stage and two-stage detection. One-stage architectures provide both locations and classes for each object in a single step, while two-stage detectors first find class-agnostic object proposals and then classify them into class-specific detections in the second stage.
Two-stage detectors were the first object detection methods. They used the sliding window approach on an image pyramid to generate object proposals at multiple scales. Then, in the second stage, a classifier such as a cascade classifier [25] was used. Such a system ran at 15 Hz for face detection, which is a single-class object detection problem. A significant improvement was presented in [26], where the authors introduced a novel descriptor, namely Histograms of Oriented Gradients (HOG), and showed a method of human detection using a Support Vector Machine (SVM) to classify objects. The authors used a detection window to scan across the image at all positions and scales. Classification is invoked multiple times in similar regions and on multiple image pyramid levels.
Using convolutional neural networks for object detection was a challenging task because of CNNs' inability to localize features. In most approaches, this problem was solved by using the recognition-using-regions paradigm [27]. Identified regions can yield richer information than single pixels or pixels with a fixed local neighbourhood. Regions are used successfully for object detection in the Selective Search algorithm [28]. Its authors generate initial regions, which are grouped by similarity and merged in an iterative process. Object hypotheses from regions were used to extract features using keypoint feature descriptors, and those proposals were classified using an SVM. Selective Search was also used in R-CNN [29] to extract region proposals, on which CNN features were computed and finally classified using per-class SVMs. Fast R-CNN [30] solved R-CNN's main problems: the multiple training sessions required because SVMs are involved, and the inefficient processing of regions through the CNN. Fast R-CNN computes convolutional features for the whole image in the first step to reduce computations for overlapping regions. The other improvement is integrating classification into the network architecture, which allows training of the entire network in a single multi-task training session. When Faster R-CNN [31] was introduced, state-of-the-art object detection networks used region proposal algorithms to generate object location hypotheses. In this architecture, a Region Proposal Network (RPN) was used instead of the Selective Search algorithm. The RPN is integrated into the Fast R-CNN architecture and uses shared weights. In the following years, many researchers used Faster R-CNN while replacing its backbone (the feature extraction CNN) with newer architectures. Faster R-CNN is, like its predecessors, a two-stage object detection method.
Single-stage detection was popularized in Deep Learning mainly by two detector architectures: Single Shot MultiBox Detector (SSD) [32] and You Only Look Once (YOLO v1, 9000, v3, v4, scaled v4: [33,34,35,36,37]). Those networks perform both object detection and object classification in a single step. Therefore, the detection time can be reduced. Such a result is obtained by generating a constant number of predictions per class per image. For SSD it was 8732 detections for each class for a 300x300px image. Each detection is characterized by a category score for each category and 4 offsets for a fixed set of default bounding boxes. Default boxes have different aspect ratios at each location in several feature maps with different scales. In this type of method, many false detections need to be removed considering objectness or classification score and overlaps between detections. YOLOv1 used a single-scale image for prediction with 98 detections per class and did not use default boxes. YOLOv2 (YOLO9000) predicts nine bounding boxes at each cell of a 13x13 grid, resulting in 1521 boxes. YOLOv3 introduces a multi-label approach to classification since, in datasets such as Open Images [38], class labels do not have a guaranteed hierarchy level. It uses 3 image pyramid levels for detection. YOLOv4 introduces a new backbone network while using the same detector as YOLOv3. The YOLOv4 paper also studies data augmentation and post-processing methods.
A month after YOLOv4 [36] was proposed, a new approach to object detection was introduced in [39]. Detection Transformer (DETR) introduced Transformers, known from Natural Language Processing tasks [40], into object detection while preserving image processing with CNNs. DETR predicts a sparse, fixed number of objects, which are matched with ground truth labels during training using bipartite matching. DETR's main limitations were slow convergence and limited feature spatial resolution. It required 10 times more epochs than other approaches to converge to a similar error level. Limited spatial resolution lowered the Average Precision metric for small objects. The following year, the main issues were mitigated in Deformable DETR [41] thanks to the introduced deformable multi-head attention module.
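The removal of redundant predictions mentioned for single-stage detectors (filtering by score and by overlap between detections) is commonly done with greedy non-maximum suppression. The following is a minimal sketch of that general technique, not the post-processing of any specific detector discussed here; the thresholds and the helper names `iou` and `nms` are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned bounding boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        score_thr: float = 0.3, iou_thr: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    # Drop low-confidence predictions, then visit the rest by descending score.
    order = sorted((i for i, s in enumerate(scores) if s >= score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


# The fixed prediction budget quoted above for YOLOv2: 9 boxes per cell, 13x13 grid.
yolov2_boxes = 13 * 13 * 9

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = nms(boxes, scores)
print(yolov2_boxes, kept)
```

In this toy input the second box overlaps the first with an IoU of about 0.82 and is suppressed, while the last box is discarded by the score threshold.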
EfficientDet, a single-stage object detector, was introduced in [42] a month after [39]. The main goal of the authors was to improve the efficiency of object detection models. The goal was separated into two challenges, namely efficient multi-scale feature fusion and model scaling. They introduced a weighted bi-directional feature pyramid network (BiFPN) to address the first challenge. BiFPN adds a bottom-up pathway to fuse multi-scale features, similarly to the Neural Architecture Search Feature Pyramid Network (NAS-FPN) [43], which used NAS to find the best top-down and bottom-up pathways for feature fusion. The second challenge was addressed using a compound scaling method for object detectors inspired by [18]. It jointly scales up the resolution, depth, and width of the backbone, the feature network, and the box/class prediction networks. EfficientDet is proposed in 8 variants, D0-D7, with backbone networks EfficientNet B0-B7. The approach shown in [42] reduced the number of parameters and the latency of the object detection network and simultaneously increased the Average Precision metrics.
Instance segmentation is another task in automated image processing. The goal of this task is to provide a segmentation mask for each object instance. Approaches introduced in the pre-Deep Learning era relied on methods such as edges or superpixels. This changed with DeepMask [44], which was one of the first deep neural networks used for the instance segmentation task. The proposed network predicts object proposals for the whole image. Those proposals are class-agnostic segmentation masks, and for each of them an object likelihood score is computed. Mask R-CNN [45] is an extension of Faster R-CNN that adds an instance mask branch in parallel to the classification and bounding box regression branch. In Mask R-CNN recognition precedes segmentation, which is faster and more accurate according to the authors.

Data collections
In recent years, multiple attempts to detect, classify, and segment waste using deep learning have been made. Litter classification into common waste categories based on images has been attempted using several pre-trained convolutional neural networks (AlexNet, MobileNet, InceptionResNetV2, DenseNet, Xception [46,47,48]), achieving average accuracy ranging from 22% [46] to 98.2% [48] for pictures of waste on a plain background. Experiments have also been conducted on the detection of litter on streets and in homes, as well as the segmentation of different types of waste, using Faster R-CNN, Mask R-CNN, SSD, and different versions of YOLO (even Tiny-YOLO) [49,50,51,52,53,54,55,56]. The reported mAP for the detection task varies between datasets and architectures, from 15.9% for TACO [53] with Mask R-CNN up to 81% for Trash-ICRA19 [51] with Faster R-CNN.
Nonetheless, very few attempts at detection and classification into well-recognized recyclable classes have been made. As deep learning requires diverse data to fit the model's parameters optimally, a comprehensive review of existing litter datasets in terms of litter detection is provided. Over ten different data collections and their crucial statistics (e.g. number of images, annotation type) are presented in Table 1 (see Fig. 1).
TrashNet. The TrashNet dataset [46] contains over 2100 labeled images. Each of them belongs to one of six classes: glass, paper, cardboard, plastic, metal, and trash. The pictures were taken with a mobile phone camera using sunlight and/or room lighting. Photographed objects were placed on a white background or fill the whole frame (cardboard). All images have a size of 512x384px.
Open Litter Map. Open Litter Map [57] is a free, open, and crowd-sourced dataset with over 100k images taken by phone cameras. All images are provided with information such as type of presented litter, coordinates, timestamp or phone model. Images come from all over the world, taken by different people. Therefore, they differ significantly from one another.
Waste Pictures. Waste Pictures [58] contains almost 24000 waste images scraped from Google search, divided into 34 classes. The type of images is very diverse, including even x-rays and drawings of garbage. Sizes also differ significantly. However, most of the photos are below the size of 2000x2000px. Due to the origin of images, they should be manually reviewed for use in a classification task.
Extended TACO. Trash Annotations in Context (TACO) [53] is a crowd-sourced dataset of waste in the wild with high-resolution mobile phone images. The TACO dataset contains 1500 annotated images with almost 5000 objects. All objects have been assigned to one of 60 classes that belong to 28 super (top) categories, including the category Unlabeled litter for hard-to-recognize or heavily obscured objects. The annotations are provided in the well-known COCO format [59] at the instance segmentation level with an extra background description: Trash, Vegetation, Sand, Water, Indoor, Pavement. Additionally, TACO offers around three thousand unannotated images, which we have taken advantage of: we provided detection-level annotations for over 3000 images, reaching over 14 000 instances in total. A great advantage of TACO is its variety of litter types and high diversity of backgrounds, from tropical beaches to London streets. However, due to the crowd-sourced nature of the dataset, labels may contain some user-induced errors and bias, i.e., not all objects in TACO can be categorized strictly as litter, as their category is often based on context.
Wade-AI. The Wade-AI dataset [60] contains images of waste in the wild, provided by Google Street View. It consists of nearly 1400 images with 2200 manually labeled instance mask annotations in COCO format with only one class, called rubbish. The environment and size of the images vary due to the source of the images. Most images are smaller than 1000x1000px.
UAVVaste. Another publicly available dataset that provides instance segmentation masks in the COCO format is the UAVVaste dataset [56]. It contains 772 hand-labeled aerial images of waste with over 3700 objects of one class, "rubbish". Data was collected in urban and natural environments, e.g. streets, parks, and lawns, using Unmanned Aerial Vehicles (UAVs). The annotated litter is usually relatively small (the median object size is 76x68px, while the median image size is 3840x2160px).
TrashCan and Trash-ICRA. TrashCan [52] and Trash-ICRA [51] are both datasets containing underwater images. They consist of video frames showing trash, remotely operated underwater vehicles (ROVs), and undersea flora and fauna. Both datasets are sourced from the JAMSTEC E-Library of Deep-sea Images (J-EDI) dataset [citation], curated by the Japan Agency of Marine-Earth Science and Technology (JAMSTEC). The footage was captured in real-world environments and provides a variety of objects. The clarity of the water and the quality of the light vary significantly between images, creating a diverse dataset. The image sizes in these datasets are 480x270px and 480x360px. Provided annotations are in COCO format. The TrashCan dataset is annotated at the instance segmentation level (7212 images and 6214 annotations) with 16 classes in the Material Version (8 classes are trash-related and follow the trash_ name pattern) or 22 in the Instance Version. The Trash-ICRA19 dataset, on the other hand, is annotated at the detection level (7668 images and 6706 annotations). It contains seven categories based on the material of the objects.
Drinking Waste. Drinking Waste [61] contains over 4800 images of drinking waste belonging to 4 classes: Aluminium Cans, Glass bottles, PET bottles, and HDPE. Provided bounding-box annotations are in YOLO format. The dataset was created with a 12 MP phone camera. Images look similar: there is usually one object in the center on a plain indoor background. Most of the images have a size of 512x683px.
MJU-Waste. MJU-Waste dataset [55] is comprised of 2475 indoor trash images manually annotated in the form of an instance mask in COCO format. It allows two-class semantic segmentation (waste and background). For each color image, the co-registered depth image captured using an RGBD camera is provided. Objects are hand-held and mostly in the center of the image. In most cases there is only one object per image. The only image size is 640x480px.
Cigarette butt. The cigarette butt dataset [62] consists of more than 2k images, which are synthetically composed photos of small cigarette butts lying on the ground. It is an artificial dataset created by applying random scale, rotation, or brightness to foreground cutouts of 25 different cigarette butts, placed in around 300 ground photos. Instance mask annotations are provided for each cigarette object, with a median size of 66x65px. The original resolution of the images is 3024x4032px.
Places. Places [63] is a repository of 10 million scene photographs, labeled with 434 scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Images were downloaded using online image search engines (Google Images, Bing Images, and Flickr). The minimum image size is 200x200px. Although this is not a trash dataset, it can be used to identify natural and urbanized places without trash.
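Since the collections above mix annotation formats (COCO for most, YOLO for Drinking Waste) and class vocabularies, combining them requires a conversion step. The sketch below shows one way this could look, assuming the standard YOLO text layout (normalized center x, center y, width, height) and the COCO bounding box layout (top-left x, top-left y, width, height in pixels). The function names and the category id are hypothetical and are not taken from the project code.

```python
from typing import Dict, List


def yolo_to_coco_bbox(cx: float, cy: float, w: float, h: float,
                      img_w: int, img_h: int) -> List[float]:
    """Convert a normalized YOLO box (center, size) to a COCO pixel box."""
    box_w, box_h = w * img_w, h * img_h
    x_min = cx * img_w - box_w / 2
    y_min = cy * img_h - box_h / 2
    return [x_min, y_min, box_w, box_h]


def to_single_class(annotations: List[Dict], litter_id: int = 1) -> List[Dict]:
    """Map every annotation to one class id, leaving other fields untouched."""
    return [{**ann, "category_id": litter_id} for ann in annotations]


# A centered object covering a quarter of the width and half of the height
# of a 640x480 image.
bbox = yolo_to_coco_bbox(0.5, 0.5, 0.25, 0.5, img_w=640, img_h=480)

anns = [{"image_id": 7, "category_id": 3, "bbox": bbox},
        {"image_id": 7, "category_id": 12, "bbox": [0.0, 0.0, 10.0, 10.0]}]
merged = to_single_class(anns)
print(bbox, [a["category_id"] for a in merged])
```

Collapsing all category ids to a single value is what allows datasets with incompatible label sets to be used together for class-agnostic localization.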

Two-stage framework to detect and sort litter
Litter sorting methods around the world differ significantly. On the one hand, this provides a large variety of objects (litter), as well as diverse backgrounds: waste can be commonly found indoors and outdoors, in environments such as households, offices, roads and pavement scenes, and sometimes even under water. On the other hand, this affects available datasets, which do not provide a large enough number of annotated images. Moreover, the huge variety of categories and annotation levels (from classification to instance segmentation) resulted in the need for well-defined waste categories that could be applied in the industry. Hence, seven litter categories are provided based on the real-world waste segregation system in Gdańsk (Poland): bio, glass, metal and plastic, non-recyclable, other (e.g. batteries, large household appliances, tires), paper, and unknown (old, degraded litter).
To utilize all of the available data, we tackle the problem by dividing the detection into two separate steps: litter localization and litter classification. The localization model is used to find regions in the image containing garbage. Then, each region is extracted and passed to the classification model, which assigns its category. The proposed pipeline is presented in Fig. 2.
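The control flow of the two steps described above can be sketched as follows. This is only an illustration of the localize, crop, classify sequence: the two models are passed in as plain callables, and the toy stand-ins `toy_localize` and `toy_classify`, as well as the nested-list image representation, are hypothetical placeholders for the trained networks and real image tensors.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
Image = List[List[int]]           # toy grayscale image: rows of pixel values

CATEGORIES = ["bio", "glass", "metal and plastic", "non-recyclable",
              "other", "paper", "unknown"]


def crop(image: Image, box: Box) -> Image:
    """Cut the region described by the bounding box out of the image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def detect_and_sort(image: Image,
                    localize: Callable[[Image], Sequence[Box]],
                    classify: Callable[[Image], str]) -> List[Tuple[Box, str]]:
    """Stage 1 finds class-agnostic litter boxes, stage 2 labels each crop."""
    return [(box, classify(crop(image, box))) for box in localize(image)]


# Hypothetical stand-ins for the two trained networks.
def toy_localize(image: Image) -> Sequence[Box]:
    return [(0, 0, 2, 2), (2, 2, 4, 4)]


def toy_classify(patch: Image) -> str:
    mean = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
    return "glass" if mean > 100 else "paper"


image = [[200, 200, 0, 0],
         [200, 200, 0, 0],
         [0, 0, 10, 10],
         [0, 0, 10, 10]]
predictions = detect_and_sort(image, toy_localize, toy_classify)
print(predictions)
```

Keeping the two stages behind separate callables mirrors the motivation given in the text: each stage can be trained on whichever datasets carry the matching annotation type.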
This section outlines the details behind the training procedure, the methodology, and our experiments. We compare various architectures, as well as numerous hyperparameters, to ensure the efficiency of the solution. Additionally, we present how each dataset can be used in the training process: separately for waste localization, classification, or both, depending on the annotation type.
Figure 2: A pipeline of a two-stage framework to detect and sort litter.

Litter in images
During the Detect Waste in Pomerania project, twelve publicly available datasets and additional data collected using Google Images Download [64] were used. The segmentation and detection datasets were combined into a new detect-waste dataset [53,56,52,51,55,61,60] (or, with the addition of the cigarette butt dataset [62], the detect-waste+ dataset) with a single class, Litter, used in the first stage of our framework: Localization (see Fig. 2, top panel). For the second stage, Classification, additional images of bio, glass, other, and paper waste [46,58,64] were added to the set of clippings created by cutting out litter instances from some photos used in the detection stage [53,61,57]. These images form the final classification dataset named classify-waste. Furthermore, the dataset was supplemented with more than a thousand images containing garbage-free backgrounds [63] and around 55k pseudo-labeled images [57], creating the classify-waste+ dataset.
The localization task was mainly performed on the detect-waste dataset with a single class called litter. In the last experiments, the dataset was extended to the detect-waste+ dataset, which is the detect-waste dataset with additional pictures of cigarette butts [62] to boost the detection of small objects. Fig. 3 shows the datasets included in the detect-waste+ dataset along with the number of images used, divided into groups depending on the trash environment (in the wild, underwater, inside homes, and artificial). Additional experiments were performed for multi-class detection on Extended TACO. The division of this dataset into the Gdańsk segregation categories is presented in Fig. 4.
In the classification task, in addition to instances cut out from Extended TACO and Drinking Waste, some classes from the TrashNet (paper and glass) and Waste Pictures (bio and other) datasets were added. Additionally, we scraped the remaining data from the web. The Google Images Download [64] software was used to search for and collect more images of bio and other waste, such as batteries, medicines, or rubble, as well as images depicting scenes without garbage (background) from the Places dataset [63]. Semi-labeled (with predicted localization) images from Open Litter Map [57] were also utilized. Fig. 5 shows the number of images used per category (according to the obligatory segregation rules of the city of Gdańsk) without images from Open Litter Map, while Table 2 shows which dataset the images in each category come from. Table 3 presents the numbers of images from the semi-labeled Open Litter Map with a preliminary assignment to the seven established litter classes.
Most of the trash found in the classification dataset is metal and plastic. Unfortunately, the second most numerous category is unknown: litter that has probably decomposed so much that it is hard to classify. This makes our dataset highly imbalanced and requires special attention in the following steps. Another problem is related to annotation errors coming from datasets without strict annotation rules (especially when it comes to the trash label); unfortunately, some annotations are of poor quality. Moreover, we manually rejected some of the images that were malformed or mislabeled.
To ensure that the training and test data distributions approximately match, we randomly selected 80% of the original images as the training set. We kept the rest as the validation/testing set, for each part of the used dataset separately. The split was done by preserving the percentage of samples for each class.
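A split that keeps 80% of the samples for training while preserving the per-class proportions can be sketched with the standard library alone. This is a generic illustration of stratified splitting under assumed inputs (one label per sample), not the script used in the project; the function name and the seed are arbitrary.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(labels: Sequence[Hashable], train_fraction: float = 0.8,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                        # random choice within a class
        cut = round(train_fraction * len(indices))  # per-class 80% boundary
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Imbalanced toy labels, loosely mimicking a dominant class.
labels = ["metal and plastic"] * 50 + ["paper"] * 10 + ["bio"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting inside each class, rather than over the whole pool, guarantees that rare categories such as bio appear in both subsets.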

Litter Detection
In the first step, the proposed framework localizes litter in the image without recognizing its class. Three different neural networks popular in common object detection tasks were analyzed, namely EfficientDet [65], DETR [66], and Mask R-CNN [31]. The first two architectures perform object detection, whereas the third one also implements segmentation. All models were trained to predict trash at each location and scale (the solution is not targeted at a specific object size). To properly evaluate each proposed architecture, a set of tests was conducted using a wide range of datasets, both combined and individually. Defined classes varied depending on the dataset used, as described in Section 3.1.

Training details
To ensure efficient and fast performance, a one-stage detector, EfficientDet [42], was utilized to localize litter in images. The model was initialized with pre-trained weights. In the final experiments, the decay rate was set to 0.95, the learning rate to 1e-3, and the number of epochs to 20. Several approaches to data augmentation [67] were also tested. During training, Gaussian blur, random brightness, rain, fog, and snow were added. Additionally, images were rotated and cropped around the annotated bounding boxes, with padding if needed. However, the best results were obtained with a simple resize (to match the default EfficientDet input size) and normalization using the per-channel mean and standard deviation values of the COCO dataset [59]. We hypothesize that augmentation did not significantly improve the results because the data was already naturally diverse, coming from a wide range of mixed datasets. In the conducted studies, the PyTorch implementation of EfficientDet [65] was used.
A Transformer-based approach was also applied to the waste detection task. DETR (which stands for DEtection TRansformer) with ResNet-50 and ResNet-101 backbones was initialized with pre-trained weights. In all experiments, the AdamW [68] optimizer was used, with a learning rate set to 1e-4 in the transformer and 1e-5 in the backbone. The number of queries, which represents the maximum number of instances that can be found in an image, was set to 100, the default value. In the final experiments, loss values saturated after about 63 epochs, and longer training reduced neither the training nor the validation error. The DETR litter detection model is based on the Facebook team's implementation [66].
In the case of instance segmentation, Mask R-CNN [45] with a ResNet-50 backbone was used. In the final experiments, the model was trained for 26 epochs with the learning rate set to 1e-3 and the decay set to 1e-4. Transfer learning was also used in this study, as each of the models was initially pre-trained on the COCO 2017 [69] subset. The implementation of the model is based on the standard PyTorch library.

Results
Conducted experiments were divided into two phases. In the first phase, the selected architectures were tested on the detect-waste dataset. The model that exhibited the best performance was selected for further training. The second phase showed how the results are distributed across the constituent datasets. This allowed for a deeper analysis of the effectiveness of the selected architecture and better preparation for the final training. As the basic evaluation metric, Average Precision (AP) at an intersection over union (IoU) threshold of 0.50 was used (AP@0.50 [70]). In the case of multi-label detection, this metric is averaged over all categories, giving mAP@0.50. As presented in Table 4, the average precision of litter detection on the detect-waste dataset with one class ranged from 28.0% for Mask R-CNN with a ResNet-50 backbone to 65.5% for EfficientDet-D2. Due to the significant gap between EfficientDets and the other tested architectures, the rest of the experiments were limited to this network family. More complex networks from this family (i.e. EfficientDet-D3) were also tested and exhibited similar performance. As the EfficientDet-D2 network reached the best evaluation results while being smaller (in terms of the number of parameters) and requiring less computing power, we decided to proceed with it exclusively.
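To make the metric concrete, the sketch below computes AP at a fixed IoU threshold for a single class on a single image. It is a simplified illustration: it uses the all-point interpolated area under the precision-recall curve, whereas the COCO evaluation tooling samples the curve at 101 recall levels and aggregates over the whole dataset, so exact values can differ. All names and the toy boxes are assumptions made for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections: List[Tuple[float, Box]],
                      ground_truth: List[Box], iou_thr: float = 0.5) -> float:
    """AP for one class: detections are (score, box) pairs."""
    if not ground_truth:
        return 0.0
    matched, hits = set(), []
    # Visit detections by descending confidence; each ground-truth box
    # can be matched by at most one detection.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(ground_truth):
            if j not in matched and iou(box, gt) > best_iou:
                best_iou, best_j = iou(box, gt), j
        if best_iou >= iou_thr:
            matched.add(best_j)
            hits.append(1)   # true positive
        else:
            hits.append(0)   # false positive
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / len(ground_truth))
    # Interpolate: precision at a recall level is the best precision
    # achieved at that recall or any higher one.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 31))]
ap50 = average_precision(dets, gt, iou_thr=0.5)
print(round(ap50, 4))
```

In this toy case the second detection is a false positive ranked above a true positive, which lowers the interpolated precision at full recall and yields an AP of 5/6.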
Results of a comprehensive study using EfficientDet-D2 on selected datasets separately are presented in Table 5. The mAP@0.50 calculated per dataset is the highest for images presenting indoor scenarios, namely the Drink-waste (cans, plastic and glass bottles) and MJU-Waste (one hand-held waste object per image) datasets. However, EfficientDet-D2 also reached a very high score, mAP@0.50 above 90%, for TrashCan 1.0 with the 8 selected underwater waste categories. The worst result, mAP@0.50 below 10%, was obtained for the underwater images from the Trash-ICRA19 dataset, which could be related to the poor quality of the photos (blurred video frames). On the other hand, detection performance for images taken against natural or urban backgrounds ranged from 56.8% for Extended TACO (trash in various environments) to 74.1% for UAVVaste (small objects, constituting over 80% of the dataset, shown from a bird's eye view), which shows that precise detection of garbage in different environments is possible. Corresponding sample predictions are shown in Fig. 6.
The achieved results demonstrate the importance of data quality in the learning process of a DL-based system. However, apart from the quality of the photos and the environments in which the waste was photographed, the number and kind of waste classes also varied between the analyzed datasets. Moreover, annotated trash objects differ in shape and size. This suggests that one-class detection (with a single litter category) using the EfficientDet-D2 network might lead to much better performance. To provide more details, in the final experiments with EfficientDet-D2 we calculated the AP score at different IoU threshold levels, along with AP@[0.50:0.95] [59] (AP averaged over IoU thresholds from 0.50 to 0.95 with a step of 0.05). These results are presented in Table 6. The selected neural network was tested in four different scenarios. In two of them the Extended TACO dataset was used, while the third and fourth experiments were conducted using different variants of the detect-waste dataset.
Figure 6: Example EfficientDet-D2 predictions for the diverse waste datasets. Selected images were taken in different locations such as a beach, pavement, indoor, and underwater. Detected objects vary in size and number: images show from one to five small, medium or large objects.
EfficientDet-D2 trained on the detect-waste+ dataset reached the highest mAP for IoU thresholds of both 0.50 (66.4%) and 0.75 (51.3%). With respect to object size, precision was better for large (59.8%) and medium (51.3%) objects than for small (9.3%) instances. Moreover, as expected, switching from the detect-waste to the detect-waste+ dataset boosted the result for small objects from 5.9% to 9.3% and at the same time increased AP@0.50 from 65.5% to 66.4%. The model trained on the Extended TACO dataset with one class reached an AP@0.50 of 56.8%, which is 9.6 percentage points less than the model trained on the detect-waste+ dataset (66.4%). However, with respect to object size, it reached the best result among all conducted experiments for small objects (19.8%).
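To make the reported metrics more concrete, the following minimal sketch shows how the IoU between two boxes can be computed and how per-threshold AP values can be averaged into AP@[0.50:0.95]. The box format (x_min, y_min, x_max, y_max) and all function names are assumptions made for illustration; this is not the evaluation code used in the study.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clipped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds(start=0.50, stop=0.95, step=0.05):
    """IoU thresholds 0.50, 0.55, ..., 0.95 (ten values with the defaults)."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def mean_ap_over_iou(ap_values):
    """Average AP values that are ordered like iou_thresholds()."""
    thresholds = iou_thresholds()
    if len(ap_values) != len(thresholds):
        raise ValueError("expected one AP value per IoU threshold")
    return sum(ap_values) / len(ap_values)
```

A detection counts as a true positive at threshold t only if its IoU with a ground-truth box is at least t, which is why AP@0.75 is never higher than AP@0.50 for the same model.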
On the other hand, experiments in which detected trash was divided into seven classes (bio, glass, metals and plastic, non-recyclable, other, paper, unknown) resulted in significantly reduced mAP for all analyzed evaluation metrics, with values almost 4 times smaller than those of the best solution for IoU thresholds of both 0.50 (16.2%) and 0.75 (13.0%). Regarding object size, for small objects the multi-class detector trained on the Extended TACO dataset reached a better result (6.4%) than the one trained on the detect-waste dataset with one class (5.9%). This may be because approximately 45% of instances in Extended TACO are small (area < 32² pixels), while in detect-waste they account for almost 25% of the whole dataset. For that reason, supplementing the detect-waste dataset with cigarette butts improved the quality of prediction. It is worth emphasizing that the network detecting directly into 7 waste categories achieved average precision similar to the one-class detectors for only one category, metals and plastic, reaching AP@0.50 = 43.3%. Results for the remaining six classes ranged from AP@0.50 = 0.1% for bio to AP@0.50 = 8.9% for unknown litter. For that reason, it was decided to perform classification in a separate stage.
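The share of small instances quoted above can be reproduced with a few lines of code. The sketch below assumes the COCO size convention: the small threshold (area < 32² pixels) is the one given in the text, while the medium/large boundary of 96² pixels and the use of bounding-box area (rather than mask area) are assumptions made here.

```python
SMALL_MAX_AREA = 32 ** 2   # small: area < 32^2 px (threshold quoted in the text)
MEDIUM_MAX_AREA = 96 ** 2  # medium: area < 96^2 px (COCO convention, assumed here)


def size_bucket(width, height):
    """Assign an object to a size bucket based on its box area in pixels."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def small_fraction(boxes):
    """Fraction of small objects among (width, height) pairs."""
    boxes = list(boxes)
    if not boxes:
        return 0.0
    small = sum(1 for w, h in boxes if size_bucket(w, h) == "small")
    return small / len(boxes)
```

Applying `small_fraction` to the annotation sizes of two datasets gives a quick way to compare how strongly each one is dominated by small litter.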

Litter Classification
In the second stage of our approach, we performed multi-class classification into seven waste categories. Because the classify-waste dataset is imbalanced, with a significant dominance of the metals and plastic class, the classification networks were trained on cropped trash objects supplemented with underrepresented classes from other classification datasets, as presented in Fig. 5 and described in Section 3.1. The boundaries of the cropped litter were given by the annotated bounding boxes for labeled images and by objects detected with EfficientDet-D2 [42] for unlabeled data. All these waste instances combined were used as input images for the classification problem.
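The cropping step can be sketched as follows. This is an illustration only: it assumes COCO-style boxes [x, y, width, height] and represents an image as a list of pixel rows so that the example needs no external libraries; the function names are hypothetical.

```python
def clamp_box(box_xywh, image_width, image_height):
    """Convert an [x, y, w, h] box to integer (left, upper, right, lower) inside the image."""
    x, y, w, h = box_xywh
    left = max(0, int(round(x)))
    upper = max(0, int(round(y)))
    right = min(image_width, int(round(x + w)))
    lower = min(image_height, int(round(y + h)))
    return left, upper, right, lower


def crop_objects(image_rows, boxes_xywh):
    """Cut every box out of an image stored as a list of pixel rows."""
    height = len(image_rows)
    width = len(image_rows[0]) if height else 0
    crops = []
    for box in boxes_xywh:
        left, upper, right, lower = clamp_box(box, width, height)
        if right <= left or lower <= upper:
            continue  # skip boxes that fall entirely outside the image
        crops.append([row[left:right] for row in image_rows[upper:lower]])
    return crops
```

The same clamped coordinates could be passed to an image library's crop routine; boxes may come either from annotations or from detector output, as described above.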
To take advantage of the semi-labeled OpenLitterMap dataset [57], the pseudo-labeling [71] technique was applied. Pseudo-labeling is a semi-supervised learning method. Its main concept is to train a network on data both with and without annotations. First, unlabeled data is assigned to a category by a pretrained model, and afterwards it is used in further training. The assignment is repeated during training every batch or every epoch, as presented in Fig. 7. This provides a considerably larger, yet only partially annotated, dataset.
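The per-epoch variant of this procedure can be sketched as below. The callables `predict_proba` and `train_one_epoch` are hypothetical placeholders for the current model's prediction and training steps; taking the most probable class as the pseudo-label is an assumption, since the text does not state whether any confidence threshold was used.

```python
def assign_pseudo_labels(predict_proba, unlabeled_samples):
    """Label each unlabeled sample with the class the current model finds most likely."""
    pseudo_labeled = []
    for sample in unlabeled_samples:
        probs = predict_proba(sample)
        label = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
        pseudo_labeled.append((sample, label))
    return pseudo_labeled


def train_with_pseudo_labels(labeled, unlabeled, predict_proba, train_one_epoch, epochs):
    """Refresh pseudo-labels at the start of every epoch, then train on both sets."""
    for _ in range(epochs):
        pseudo_labeled = assign_pseudo_labels(predict_proba, unlabeled)
        train_one_epoch(labeled + pseudo_labeled)
```

For the per-batch variant, the call to `assign_pseudo_labels` would move inside the batch loop and be applied only to the unlabeled samples of the current batch.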

Training details
In our research we tested two architectures dedicated to the object classification problem, ResNet-50 [12] and EfficientNet-B2 [18], to choose the one that achieves better performance in the waste classification task. We thoroughly examined a number of training hyperparameters. In all experiments, we set the learning rate to 1e-4 and the batch size to 16, and trained each network for 20 epochs. The output of the network had either 7 or 8 classes. The first 7 classes refer to the previously mentioned waste categories, while the 8th was a background class added to reduce the number of false positives. Also, two kinds of samplers, random and weighted, were used to reduce the impact of data imbalance. Because of this imbalance, it was crucial to investigate the results for each class, not only for the whole dataset. Another hyperparameter was the "pseudo-labelling type", which indicates whether the pseudo-label update was performed every batch, every epoch, or not at all.
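A weighted sampler can be illustrated with inverse class-frequency weights, so that every class is drawn with roughly equal probability. The weighting scheme and function names below are assumptions, as the exact weights used in the experiments are not specified in the text.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """One weight per sample, inversely proportional to the size of its class."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def draw_balanced_indices(labels, num_samples, seed=0):
    """Draw sample indices with replacement so that classes are roughly balanced."""
    weights = inverse_frequency_weights(labels)
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)
```

In PyTorch, the same per-sample weights could be passed to `torch.utils.data.WeightedRandomSampler`, whereas a random sampler corresponds to drawing all indices uniformly.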
Moreover, to further increase the diversity of our training dataset, we applied data augmentation [67]. First, we extended the dimensions of our images so that parts of them could be randomly cropped. Horizontal and vertical flips and shift-scale-rotate transformations, which do not change the image's content but only the position of an object in the image, were also applied. In addition, random brightness and contrast changes and cutout were used to alter the image's content slightly. Finally, both training and testing sets were normalized and resized to 224×224, as this was the EfficientNet-B2 [18] input size. The second stage of our approach was implemented by us in PyTorch Lightning, based on the repository [72].
Figure 7: Flowchart for pseudo-labeling semi-supervised learning technique.

Results
A comparison of the performance of the two neural networks is presented in Table 7. One of them is EfficientNet-B2, the backbone of the network used in the detection task. The results achieved with this classifier were markedly higher (by over 10 percentage points) than those of ResNet-50. EfficientNet-B2, a state-of-the-art architecture compared to ResNet-50, provides better performance in the classification task, similarly to detection, where the backbones are EfficientNet-B2 for EfficientDet-D2 and ResNet-50 for Mask R-CNN or DETR (see Table 4). This leads to the conclusion that the EfficientNet family of networks is a preferred choice for dealing with waste images.
The experiments showed that updating pseudo-labels every batch can slightly raise accuracy. Although the best achieved accuracy was 74.6%, it was obtained with a random sampler. While analyzing the confusion matrices of each training run, we noticed that applying a weighted sampler provides more balanced results across classes. Therefore, with the weighted sampler we achieved an accuracy of 73% (and 86.7% on the training set), with test images making up almost 25% of our dataset.
Although the confusion matrix clearly shows that most of the predictions are accurate (see Fig. 8 and Table 8), it also indicates a significant data imbalance. The metals and plastic class was predicted with the highest precision of 87%, which is connected to the large number of samples of this class. Still, its recall is relatively low, which means that many metals and plastic objects are assigned to other classes. There was a noticeable problem with identifying the unknown and non-recyclable classes, whose precision was equal to 52%. The other classes were recognized with higher precision but, due to the data imbalance, still not fully correctly.
The biggest confusion was observed between the metals and plastic and unknown categories, probably because of partly degraded or destroyed trash. Glass was rarely confused with the other classes, as evidenced by a high recall of 82%. The F1-score for this class reached the highest result, 83%, which reflects a balance between false positives and false negatives. It is worth noting that separating background from the waste classes was extremely successful, at the level of 97% for both precision and recall. Adding a separate class for background improved the performance and, as assumed, reduced the number of false positives.
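For reference, the per-class metrics discussed above follow directly from a confusion matrix. The sketch below assumes that rows correspond to true classes and columns to predicted classes; it is an illustration of the definitions rather than the evaluation code used in the study.

```python
def per_class_metrics(confusion):
    """Precision, recall and F1 per class; confusion[i][j] counts true class i predicted as j."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # other classes predicted as k
        fn = sum(confusion[k]) - tp                       # class k predicted as something else
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


def accuracy(confusion):
    """Share of all samples that lie on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total if total else 0.0
```

Low precision for a class means that objects of other classes are often assigned to it, while low recall means that objects of that class are often assigned elsewhere; the F1-score is the harmonic mean of the two.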

Conclusions and Future Work
There is a visibly increasing demand for artificial intelligence in numerous human activities. In response, we proposed a DL-based framework that can localize trash in an image and then identify its class using two separate neural networks. The datasets used in this study were created from various publicly available waste data collected in diverse environments: inside houses, in natural or urban environments, and even underwater.
First, three neural network architectures were adapted to garbage detection. The data used was naturally augmented, which allowed for precise localization of waste with AP@0.5 equal to 66.4% for the EfficientDet-D2 model. This is an excellent result compared to other recent reports that were presented for some of the used datasets separately (15.9% for TACO with Mask R-CNN [53], or 55.4% for TrashCan 1.0 with Mask R-CNN [52]). Qualitatively good results were also observed for the semi-labeled OpenLitterMap dataset, allowing for its further use in the next stage.
In the case of classification, litter was divided into seven categories that follow the sorting policy introduced in the city of Gdańsk, with an extra background class to eliminate false positives coming from the detector. Additionally, the pseudo-labeling technique was applied during training, which made it possible to utilize unlabeled data. However, this gave only a slight performance boost in litter classification, which can be related to the high imbalance and small amounts of labeled data for specific litter categories, especially bio and other waste. In the end, an accuracy of up to 75% was achieved with the EfficientNet-B2 network. To the best of our knowledge, we present one of the first results on classifying litter in the wild.
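The interplay of the two stages, including the role of the background class, can be summarized in a short sketch. The callables `detect`, `crop` and `classify` are hypothetical stand-ins for the detector, the cropping step and the classifier; the code only illustrates the control flow under these assumptions.

```python
BACKGROUND = "background"  # name of the extra class, assumed for this sketch


def detect_and_classify(image, detect, crop, classify):
    """Run the detector, classify every cropped detection, and drop background crops."""
    results = []
    for box in detect(image):
        label = classify(crop(image, box))
        if label == BACKGROUND:
            continue  # the classifier rejects this detection as a false positive
        results.append((box, label))
    return results
```

In this arrangement the classifier acts as a second filter: a box proposed by the one-class detector is kept only if the classifier assigns it to one of the seven waste categories.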
As Artificial Intelligence is required to be more accurate than a human, the main future direction for the proposed system is to improve its performance. The selected detectors work well when localizing medium and large objects, but recognition of small litter is still challenging. For that reason, exploring different state-of-the-art approaches, such as Deformable DETR [41], seems to be a good idea. On the other hand, a more balanced dataset and the use of the latest EfficientNetV2 [19] could also boost the classification accuracy.
Despite this, the presented framework showed the great potential of the DL-based methodology for waste management.
In the future, with the assistance of DL models, it would be possible to mount robotic arms in waste management plants to automatically distinguish between different classes of objects and sort garbage without human intervention. Additionally, the high precision of litter localization in a large variety of environments shows the possibility of using neural networks for waste monitoring in cities or for detecting illegal dumps in nature, for example with the use of drones.
All authors provided critical feedback and helped shape the research, analysis and manuscript.