Multi-format open-source weed image dataset for real-time weed identification in precision agriculture

Weeds are considered noxious and a hindrance to crop yield. Due to their uneven spatial distribution, ground or aerial robots are deployed to spot-spray herbicides. This herbicidal application depends entirely on computer vision algorithms that perform in-field weed identification prior to spot spraying. Developing such advanced computer vision algorithms therefore requires large agricultural weed datasets. Public-domain weed datasets have been released in the past, but most were acquired using ground-based technologies. The dataset discussed in this paper is unique in that it incorporates data captured both from a handheld camera and an unmanned aerial system (UAS), thus catering to both ground-based and aerial weeding robots. The dataset comprises 3,975 images featuring five weed species commonly found in North Dakota: kochia (Bassia scoparia), common ragweed (Ambrosia artemisiifolia), horseweed (Erigeron canadensis), redroot pigweed (Amaranthus retroflexus), and waterhemp (Amaranthus tuberculatus). These images have been meticulously annotated in multiple formats to facilitate the development and advancement of computer vision algorithms. Furthermore, various augmentation techniques have been applied to ensure that the dataset closely represents real-world field conditions. Additionally, this dataset is open-source to assist precision weeding technologies with real-time in-field weed identification followed by herbicidal spot spraying, ultimately contributing to more efficient and sustainable agricultural practices.

• This dataset has the potential to be added to the custom datasets used by other researchers or users. This addition would enhance their datasets, leading to better algorithm development and improved generalization abilities.
• Due to the added advantage of multiple formats within this dataset, researchers would be able to deploy various deep learning models on-the-go for weed identification tasks, thereby eliminating the need to convert the dataset into specific model training formats.
• With the inclusion of two categories, aerial weeds and individual weeds, this dataset can assist both ground-based and aerial technologies in identifying and locating weeds for precise herbicidal application.

Objective
This dataset has been developed with both ground-based and aerial weed identification technologies in mind. The greenhouse images, along with clipped aerial images, would assist ground technologies in identifying weeds within a limited field of view. Meanwhile, the weed instances captured in high-resolution aerial images could help drone technologies locate weeds across a large field of view. A combination of both categories would advance computer vision models' ability to extract weed information from complex backgrounds, thus enhancing their capability to generalize weed identification amongst crop plants in unseen locations. The dataset comprises multiple model training formats that can be used by researchers working on specific computer vision models. It explores the applicability of computer vision models not only for identifying specific weed species but also for locating them in preparation for spot spraying applications.

Nature of in-field experiment
The major objective of this dataset was to create weed categories that could assist ground-based as well as aerial weeding technologies for precise herbicidal spraying. Therefore, after determining the end goal, the dataset was divided into two categories, aerial weeds and individual weeds (Fig. 1).
The experimental plots for in-field data collection were designed as shown in the orthomosaic (Fig. 2a). To imitate real field conditions, multiple crops were placed side-by-side with weeds in the centre (Fig. 2b). The aerial images were captured in multiple locations (Fig. 2b), from which individual weeds were clipped (Fig. 2c). A similar plot design and image acquisition process was carried out in all three locations.

Dataset Categories
Within the Mendeley dataset repository, two zipped files have been added, named Aerial_weeds.rar (A) and Individual_Weed.rar (I). Within the Aerial_Weeds file, there are two folders named "images" and "labels." The "images" folder consists of high-resolution aerial images in JPG format, while the "labels" folder contains three subfolders, each dedicated to a specific annotation format: JSON, TXT, and XML. The same organizational scheme has been applied to the "Individual_weed" category, except that the images and labels of each weed species are stored within a folder for that respective species class.
The aerial weeds dataset was captured using an off-the-shelf unmanned aerial system (UAS), the DJI Phantom 4 Pro (V2.0), in three different locations: Casselton, Carrington, and Grand Farm. The UAS was flown at an average altitude of 12 ft (∼3.7 m) at varied speeds in multiple environmental settings. This altitude was carefully chosen to ensure the acquisition of distinct weed images and features for deep learning (DL) applications. The images in Casselton were captured during the summer season, specifically from late May to late June 2021, while in Carrington the data collection occurred from mid-July to late August 2021. In Grand Farm, data collection occurred from mid-August to late September 2022. The two-year time frame was deliberate, aiming to incorporate both temporal and location-based diversity into the dataset for robust data generation. The aerial images were captured at a resolution of 5,472 × 3,648 pixels in JPG format and encompassed a range of diverse in-field settings. A sample of the aerial images captured in multiple locations is depicted in Fig. 3. Furthermore, approximately 550 images were collected for which manual annotations of weeds were created, resulting in over 2,390 individual instances. Table 1 provides specific details on these instances, including manually drawn bounding boxes, categorized into five distinct weed species classes. To create the individual weeds dataset, specific classes of each weed species were clipped from the aerial images and subsequently manually annotated in multiple formats. In addition to these images, several greenhouse images, captured using a handheld Canon 90D camera, were included to introduce diversity within the training set. In total, over 3,424 images were manually annotated, resulting in the generation of 7,700 instances covering four weed classes (as detailed in Table 1). Fig. 4 showcases sample images of the clipped weed species that were annotated and exported in multiple data formats.
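The clipping of individual weed instances from the high-resolution aerial frames amounts to cropping annotated bounding boxes out of the image array. A minimal NumPy sketch of this step (the box coordinates and frame size below are illustrative, not values from the dataset):

```python
import numpy as np

def clip_instances(frame, boxes):
    """Crop annotated bounding boxes out of an aerial frame.

    frame : H x W x 3 uint8 array (e.g. a 3,648 x 5,472 aerial image).
    boxes : list of (x_min, y_min, x_max, y_max) pixel coordinates.
    Returns one crop per box; each crop can then be re-annotated
    as an individual weed image.
    """
    crops = []
    for x0, y0, x1, y1 in boxes:
        crops.append(frame[y0:y1, x0:x1].copy())
    return crops
```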

Dataset augmentation
To enhance the volume and diversity of the dataset [1], data augmentation techniques were applied to the individual weed category. The original numbers of images for kochia, horseweed, ragweed, and redroot pigweed (RRPW) were 785, 448, 355, and 115, respectively. These images underwent augmentation, and the final image counts are outlined in Table 1. The Keras ImageDataGenerator API was employed to execute various augmentation types, including featurewise centering, shear range, zoom range, horizontal flip, and rotation, within a Python platform. The specific library versions used for pre-processing and data augmentation are listed in Table 2 (Fig. 5).
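As a dependency-free illustration of two of the listed transforms, a horizontal flip and a rotation can be expressed directly with NumPy. This is a simplified sketch, not the exact Keras ImageDataGenerator pipeline used for the dataset (which samples transform parameters randomly per image):

```python
import numpy as np

def horizontal_flip(img):
    # Mirror the image left-to-right, as horizontal_flip=True would.
    return img[:, ::-1]

def rotate90(img, k=1):
    # Rotate in 90-degree steps; ImageDataGenerator's rotation_range
    # samples arbitrary angles, so this is a simplified stand-in.
    return np.rot90(img, k)
```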

Dataset annotation
After the images were augmented, both categories of the dataset were annotated and exported using the LabelImg [2] tool. LabelImg is an open-source software tool that can be used to annotate objects of interest and export the annotations in multiple formats for training various deep learning models. Fig. 6 showcases a sample aerial image that was imported into the software and subsequently annotated manually with weed species (Fig. 6b). The software itself records the four coordinates of each annotated weed species. Finally, after manual annotation, a text file (*.txt) was exported containing information organized in two columns: labels and coordinates (Fig. 6c). The labels correspond to specific classes of weed species, with their respective coordinates in the second column. The TXT format thus exported has been widely adopted to train several state-of-the-art YOLO models based on the COCO dataset [3].
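The exported TXT lines follow the YOLO convention of a class index followed by box coordinates normalized to the image size. A short sketch of how a pixel-space bounding box, such as one drawn in LabelImg, maps to such a line (the image size and box values below are illustrative):

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel bounding box to a YOLO-format label line:
    '<class> <x_center> <y_center> <width> <height>',
    with all four box values normalized to [0, 1] by the image size."""
    xc = (x_min + x_max) / 2 / img_w
    yc = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
```

Because the coordinates are normalized, the same label file remains valid if the image is resized before training.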

Exporting the Dataset in Multiple Formats
Following the output discussed in the previous section (*.txt format), JSON (JavaScript Object Notation) and XML (Extensible Markup Language) formats were also exported using the same LabelImg software. Fig. 7(a & b) showcases the exported output of these sample formats. The JSON format comprises five parameters that define the annotated object of interest. In Fig. 7a, ragweed has been annotated with the coordinates x, y, width, and height. Similarly, Fig. 7b presents the XML format, which includes the image resolution followed by four coordinates defining the object of interest, namely ragweed. These formats have been used in creating the PASCAL VOC dataset to train various detection and segmentation models [4].
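As a sketch of the XML structure shown in Fig. 7b, a minimal PASCAL VOC-style annotation can be generated with the Python standard library. The tag names follow the common VOC layout; the exact fields LabelImg emits may differ slightly:

```python
import xml.etree.ElementTree as ET

def voc_xml(filename, width, height, objects):
    """Build a minimal PASCAL VOC-style annotation string.
    objects: list of (name, x_min, y_min, x_max, y_max) in pixels."""
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = filename
    size = ET.SubElement(ann, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    for name, x0, y0, x1, y1 in objects:
        obj = ET.SubElement(ann, "object")
        ET.SubElement(obj, "name").text = name
        box = ET.SubElement(obj, "bndbox")
        for tag, val in zip(("xmin", "ymin", "xmax", "ymax"), (x0, y0, x1, y1)):
            ET.SubElement(box, tag).text = str(val)
    return ET.tostring(ann, encoding="unicode")
```

Unlike the normalized YOLO TXT coordinates, VOC-style boxes are stored in absolute pixel units alongside the image resolution.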

Limitations
This dataset was captured while keeping all the field parameters in mind. However, it does not include images that were distorted by the downward gust of wind created by the drone while descending to capture images. This downwash can alter the physiological appearance of the weeds, making it challenging for deep learning algorithms to detect and locate the weed species, potentially resulting in incorrect identification and improper spot spraying. This aspect of data acquisition was not considered during either the in-field image capturing procedure or the pre-processing steps.

Ethics Statement
This dataset does not involve experiments on humans or animals, nor does it collect data from any social media platforms.

Data Availability
ImageWeeds: An Image dataset consisting of weeds in multiple formats to advance computer vision algorithms for real-time weed identification and spot spraying application (Original data) (Mendeley Data)

Fig. 1 .
Fig. 1. An overview of the dataset flowchart consisting of two dataset categories, aerial weeds and individual weed species, exported in multiple formats.

Fig. 2 .
Fig. 2. The orthomosaic of the experimental plot used to capture in-field images for further annotation, (a) orthomosaic created by flying UAS at 32 ft ( ∼9.7 m) displaying the nature of in-field experiment, (b) aerial images that were captured by flying UAS at an average altitude of 12 ft ( ∼3.7 m), and (c) individual weed species clipped from the aerial images captured in b.

Fig. 3 .
Fig. 3. Sample images of the Aerial Weeds dataset along with several in-field parameters that were captured while collecting the dataset in two locations. These in-field parameters were: (a) cloudy environment, (b) drone shadow on a weed species, (c) weeds occluded by corn crops, (d) dark environment using low-light settings, (e) images captured around noon when the sun was overhead, and (f) motion blur caused by drone speed.

Fig. 5 .
Fig. 5. Code snippet showcasing various data augmentation techniques applied on the individual weed category of the dataset.

Fig. 6 .
Fig. 6. Manual annotation of the image data consists of three steps: (a) acquiring the aerial image, (b) manually annotating each class of weed species, and (c) exporting a text file consisting of labels and coordinates.

Fig. 7 .
Fig. 7. JSON and XML formats as exported using the LabelImg tool: (a) JSON format, and (b) XML format.

Table 1
Number of images and instances (annotations) for each weed species in both categories of the dataset.

Table 2
Python libraries and their respective versions used to perform image pre-processing and data augmentation.