GrapeMOTS: UAV vineyard dataset with MOTS grape bunch annotations recorded from multiple perspectives for enhanced object detection and tracking

Object Detection and Tracking have become valuable tools for many time-consuming and error-prone tasks, including in-field fruit counting. Counting fruit is challenging for humans because of the large quantity of fruit involved, which makes it a mentally taxing operation. It is therefore worthwhile to ease farmers' work by applying Object Detection and Tracking algorithms to fruit counting. However, these algorithms tend to undercount because of occlusion, that is, fruit hidden behind a leaf or a branch, which complicates detection. Consequently, acquiring data from multiple viewing angles is essential to increase the likelihood of recording each fruit from its most visible point of view. Furthermore, the most prominent open-source datasets do not include labels for certain fruits, such as grape bunches. This study addresses the scarcity of public labelled datasets for training grape bunch Detection and Tracking algorithms by providing videos acquired with a UAV from multiple angles to overcome fruit occlusion.

Value of the Data
• Datasets, along with annotations, are helpful for researchers and professionals working with Computer Vision techniques to perform grape bunch detection and tracking [4].
• Datasets with multiple-perspective videos are crucial to avoid occlusion, which may lead to underestimation of the number of grape bunches in each row.
• Grape bunch tracking allows for counting the number of grape bunches on each side of a vineyard row, which is relevant to estimating yield. Additionally, when coupled with ground truth information in the annotations, phenotypic traits can be extracted [5], further contributing to yield estimation.
• The dataset is helpful for winegrowers and field technicians as it provides high-quality videos for visual inspection of bunch monitoring and disease development, eliminating the need to be physically present in the field.
• Datasets, together with annotations, address the lack of public agricultural datasets.
• This dataset can be integrated with other datasets from the same vineyard that contain key information such as the position of the plant trunks or LiDAR point clouds [4, 6, 7], enabling researchers to go further and achieve a more precise understanding of the vineyard.

Background
In agricultural research, the importance of datasets should not be underestimated, and their applications in vineyards are particularly notable. They help in the identification and classification of diseases [8], as well as in the detailed analysis of yield factors [9]. Following the idea of [10], which introduced the use of different angles with a handheld camera to avoid occlusions and provided 11,000+ images, this dataset offers Unmanned Aerial Vehicle (UAV) videos with grape bunch annotations recorded in a commercial vineyard under challenging conditions, such as occlusion. This endeavour aims not only to enrich the data available for precision agriculture but also to overcome specific hurdles in viticulture. Like [11], which provided instances to locate the bunches in the images, it supports object detection, and it additionally supports tracking by keeping the same ID for each grape bunch across frames. By capturing footage from multiple vantage points around the vineyard rows, this dataset allows for an in-depth analysis, enabling algorithms to count bunches more accurately despite the frequent obstructions caused by foliage.
Moreover, the significance of this dataset extends beyond its immediate utility. It can be combined with other existing datasets from the same vineyard [4, 6, 7], which encompass a diverse range of data types, including videos, UAV orthoimages, and LiDAR information. This diversity enhances the potential for data fusion and enables a multifaceted analysis of the vineyard ecosystem, both on the same dates and across different years. Such temporal and spatial coverage offers an opportunity to study the dynamics of vineyard ecosystems in depth. Further, it enriches the data available for the vineyard, supporting the training of models capable of generalizing under different operational conditions. This fusion of datasets opens up new avenues for research and application, allowing for a more detailed examination of bunch visibility, phenotypic trait extraction, and yield estimation under varying conditions, among other characteristics.
Therefore, to obtain a complete view of the vineyard, recording the side of the row from multiple perspectives becomes essential. Consequently, this dataset supports the training of Object Detection and Tracking algorithms in real vineyard conditions, with the aim of more accurate bunch counting.

Data Description
The dataset was collected during the 2023 harvesting campaign, on September 19th and 20th, in a 1.06-hectare commercial vineyard (Vitis vinifera cv. Loureiro) located in Tomiño, Spain (X: 516989.02, Y: 4644806.53; ETRS89 / UTM zone 29N) (Fig. 1). The plants, managed in a vertical trellis system, were planted in 1990 with an NE-SW orientation. The spacing is 3 m between rows and 2.5 m between plants, and no leaf removal was performed, resulting in a dataset marked by leaf occlusion.
The dataset was collected by flying the UAV over the adjacent vineyard row, recording the side of the row of interest. Two types of videos were acquired: (1) a basic type that observed the canopy from a frontal point of view only, serving as control videos, and (2) videos following a planned path, optimized with the Ant Colony Optimization (ACO) algorithm [12, 13], that includes multiple-angle perspectives to address occlusion. Fig. 2 illustrates the perspectives obtained of the grape bunches when acquiring the data from multiple viewing points. The videos with names starting with NoPathPlanning_* belong to the first category, while those starting with PathPlanning_* were recorded using ACO.
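The naming convention above can be used to separate control videos from multiple-perspective videos programmatically. The following is a minimal sketch; the function name, the category labels, and the example file names with an .mp4 extension are illustrative assumptions, and only the two prefixes come from the dataset description.

```python
def video_category(filename: str) -> str:
    """Classify a video by the filename prefix described in the text.

    The returned labels ("control", "multi_perspective", "unknown") are
    hypothetical names chosen for this sketch.
    """
    if filename.startswith("NoPathPlanning_"):
        return "control"            # frontal point of view only
    if filename.startswith("PathPlanning_"):
        return "multi_perspective"  # ACO-planned path, multiple angles
    return "unknown"


# Example file names are assumptions; only the prefixes are documented.
for name in ["PathPlanning_1.mp4", "NoPathPlanning_2.mp4", "notes.txt"]:
    print(name, "->", video_category(name))
```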

Experimental Design, Materials and Methods
The UAV platform used in this study was the DJI Phantom 4 RTK (DJI Sciences and Technologies Ltd., Shenzhen, Guangdong, China), equipped with an integrated RGB sensor. The flights were conducted under a clear sky at 3 m AGL over the vineyard rows, with wind below 0.5 m/s.

Data annotation
A total of 11 vineyard videos were annotated using CVAT software in MOTS style for grape bunch Detection and Tracking. The MOTS annotations were labelled with per-pixel accuracy, and each grape bunch instance was kept coherent throughout the video sequence. Furthermore, even shaded grape bunches were annotated to support generalization to multiple illumination scenarios. The annotation covered the grape bunches exclusively, excluding the peduncle and surrounding leaves. In the videos with multiple perspectives, grape bunches appear from different viewpoints, resulting in various shapes. The same ID was maintained for grape bunches seen from different perspectives to enhance Object Tracking. To increase the efficiency of the annotation task, and given the similarity of adjacent frames in the video, a frame step of 2 was selected for most of the videos, except PathPlanning_1 and the three videos without Path Planning.
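As an illustration of what a frame step means for the number of annotated frames, the sketch below lists the frame indices that would be selected for a given step. It assumes that selection starts at frame 0 and that a step of 1 means every frame is annotated; neither detail is stated in the dataset description, and the function name is hypothetical.

```python
def annotated_frame_indices(total_frames: int, frame_step: int = 2,
                            start: int = 0) -> list[int]:
    """Return the indices of frames kept when sampling every `frame_step` frames.

    `start` = 0 is an assumption of this sketch, not a documented property
    of the dataset.
    """
    if frame_step < 1:
        raise ValueError("frame_step must be a positive integer")
    return list(range(start, total_frames, frame_step))


# A 7-frame clip with a frame step of 2 keeps 4 frames.
print(annotated_frame_indices(7, frame_step=2))
# With a frame step of 1, every frame is kept.
print(len(annotated_frame_indices(7, frame_step=1)))
```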
Table 1 summarizes the dataset, providing details on the videos, including the total number of frames in each video, the number of annotated frames, the frame step used for annotation, and the size of both images and annotations. The dataset, totalling 78.8 GB excluding the original videos, includes 5958 labelled frames. Videos are available in MP4 format, while the images, along with the annotations, are provided in PNG. Moreover, the instances folder also includes a txt file, which contains the label(s) of the annotations.
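To show how PNG instance annotations in MOTS style are typically interpreted, the sketch below decodes pixel values into class and instance IDs and counts the distinct tracks across frames. It assumes the common MOTS convention in which each pixel stores class_id * 1000 + instance_id, with 0 as background and 10000 as an ignore region; whether the files in this dataset follow that convention exactly should be checked against the data. The frames here are small hand-written grids rather than real images, and loading the actual PNG files would require an image library, which is left out.

```python
ID_DIVISOR = 1000      # assumed MOTS convention: value = class_id * 1000 + instance_id
BACKGROUND = 0         # assumed background value
IGNORE_REGION = 10000  # assumed ignore-region value in the MOTS convention


def decode_mots_value(value: int) -> tuple[int, int]:
    """Split an encoded pixel value into (class_id, instance_id)."""
    return divmod(value, ID_DIVISOR)


def unique_tracks(frames) -> set[tuple[int, int]]:
    """Collect the distinct (class_id, instance_id) pairs seen in any frame.

    `frames` is an iterable of 2D grids (lists of rows) of encoded values.
    """
    tracks = set()
    for frame in frames:
        for row in frame:
            for value in row:
                if value in (BACKGROUND, IGNORE_REGION):
                    continue
                tracks.add(decode_mots_value(value))
    return tracks


# Toy frames (hypothetical): class 1 with instances 1, 2 and 3.
frame_a = [[0, 1001, 1001],
           [0, 0, 1002]]
frame_b = [[1001, 0, 0],
           [1003, 1003, 0]]

tracks = unique_tracks([frame_a, frame_b])
print(sorted(tracks))                    # distinct (class_id, instance_id) pairs
print("bunches tracked:", len(tracks))   # each track ID is counted once
```

Because the same ID is kept for a bunch across frames and perspectives, counting distinct IDs rather than per-frame detections avoids counting the same bunch more than once.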

Table 1
Description of the videos and annotations provided, along with the number of annotated frames, and the size of the zip file containing frames and instances.

Limitations
None.
Dataset link: MOTS-annotated UAV Vineyard Dataset captured using Multiple Perspectives to avoid Leaf Occlusion for Object Detection and Tracking (Original data)

Fig. 2. Vineyard row acquired from multiple viewing points. The red masks represent the grape bunch annotations. (a) Vineyard row recorded from the left. (b) Vineyard row observed from a frontal point of view. (c) UAV perspective when recording while rotated from the right.
© 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

[2] provide a multiple-angle view, each from a different vine plant. The other three videos (named NoPathPlanning_*) offer a frontal view of the canopy's side. These record the same plants as those with multiple perspectives, allowing for comparison.
Recording details: The videos were captured between September 19 and September 20, 2023, during the harvesting period. Both days had sunny conditions and a wind speed below 0.5 m/s.
Annotation Information: All the videos have been annotated using CVAT software [1], employing the Multiple Object Tracking and Segmentation (MOTS) annotation style [2].