Dataset of bird's eye chilies farm for stereo image semantic segmentation

This paper presents a dataset of bird's eye chilies in a single farm for semantic segmentation. The dataset is generated using two cameras that are aligned left and right, forming a stereo-vision video capture. By analyzing the disparity between corresponding points in the left and right images, algorithms can calculate the relative distance of objects in the scene. This depth information is useful in various applications, including 3D reconstruction, object tracking, and autonomous navigation. The dataset consists of 1150 left and right compressed images extracted from ten sets of stereo videos taken at ten different locations within the chili farm, of bird's eye chilies of the same age. Since the dataset is intended for semantic segmentation, manually annotated ground truth segmentation images are also provided. The dataset can be used for 2D and 3D semantic segmentation of the bird's eye chili farm. The object classes in this dataset include sky, living things, plantation, flat, construction, nature, and misc.

Keywords: Bird's eye chili farm; Semantic segmentation; Autonomous navigation

Value of the Data
• This data is valuable because it is specific agricultural data, namely bird's eye chili farm data captured for 2D and 3D image analysis such as segmentation, object recognition, and classification.
• This data can be used in the automation of agricultural activities that require navigating towards target plants and understanding the position and parts of the plants, such as harvesting, irrigating, and weeding.
• These data can be reused to evaluate the performance of developed algorithms for object recognition, semantic segmentation, and target detection and navigation.

Data Description
The data consists of frames from ten left and right stereo videos of five-week-old bird's eye chili trees, taken on 15 September 2021 at ten different places within the bird's eye chili farm, as implemented in [1]. Each video contributes 50 video-frame images of 1920 by 1080 pixels and 50 mask images annotated according to the object classes. The ground truth image, or mask file, should be opened as a grayscale image; the value of each pixel in the mask is the object ID, as listed in Table 1. A sample video frame is shown in Fig. 1. The classes included in the mask are listed in Table 1 together with their object IDs, which are the pixel values in the mask images. The mask image corresponding to the frame in Fig. 1, rendered with the reference colors in Table 1, is shown in Fig. 2.
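The mask-reading convention above can be sketched in Python. This is a minimal illustration, not code shipped with the dataset: it synthesises a tiny stand-in mask so the snippet is self-contained, where in practice you would open an actual mask file from the dataset instead.

```python
import numpy as np
from PIL import Image

# Synthesise a tiny stand-in mask so this snippet runs on its own.
# In practice, replace "demo_mask.png" with a real mask file from the
# dataset (a PNG in one of the "mask" folders).
demo = Image.fromarray(np.array([[0, 1], [2, 7]], dtype=np.uint8), mode="L")
demo.save("demo_mask.png")

# Open the mask as a grayscale image: each pixel value is an object ID
# that maps to a class via Table 1.
mask = np.array(Image.open("demo_mask.png").convert("L"))
object_ids = np.unique(mask)
print(object_ids.tolist())  # → [0, 1, 2, 7]
```

Opening the mask in grayscale mode is important: loading it as RGB would reinterpret the stored IDs and break the mapping to Table 1.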
The frame images are in JPG format while the masks are in PNG format. Altogether, there are 1150 video-frame and mask images included in this dataset. As shown in Fig. 3, the dataset contains ten folders named "Video 1" to "Video 10". Each of these folders contains two folders called "left" and "right". In the left folder of each video, there are two folders named "frame" and "mask", respectively. The files in the frame and mask folders are numbered consecutively from 1 to 50, followed by the file extension. The mask image for a particular frame has the same filename as that frame; for example, the mask for "1.jpg" is "1.png".
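The folder layout described above can be expressed as a small path-building sketch. The function name and the exact numbering scheme (no zero-padding) are assumptions based on the description, so verify them against the actual dataset before use.

```python
from pathlib import PurePosixPath

def frame_mask_pairs(video: int, side: str = "left"):
    """Build the (frame, mask) file paths for one video folder.

    Assumes the layout described above:
    "Video <n>/<side>/frame/<i>.jpg" paired with
    "Video <n>/<side>/mask/<i>.png", for i = 1..50.
    """
    base = PurePosixPath(f"Video {video}") / side
    return [
        (str(base / "frame" / f"{i}.jpg"), str(base / "mask" / f"{i}.png"))
        for i in range(1, 51)
    ]

pairs = frame_mask_pairs(1)
print(len(pairs))  # → 50
print(pairs[0])    # → ('Video 1/left/frame/1.jpg', 'Video 1/left/mask/1.png')
```

Because frame and mask share a filename stem, pairing them is a pure string operation and needs no directory listing.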

Experimental Design, Materials and Methods
For the image capture, a high-definition web camera is used, as shown in Fig. 4. The camera uses a CMOS optical sensor with a maximum resolution of 1280 by 960 pixels and a default frame rate of 30 frames per second. To capture the videos in stereo, two identical cameras are mounted side by side on a pole, as shown in Fig. 5. Camera calibration for stereo vision is done in MATLAB using the standard toolbox established in [2,3], processing 30 pairs of checkerboard images captured beforehand. The checkerboard used is an 8 × 5 grid of squares, each with a physical size of 35 mm × 35 mm. Table 2 lists the resulting intrinsic and extrinsic camera parameters, and a visualization of these values is shown in Fig. 6.
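Once the stereo pair is calibrated, pixel disparities can be converted to metric depth via the standard relation Z = f·B/d. The focal length and baseline below are placeholder values for illustration only; the actual values should be taken from the calibration results in Table 2.

```python
import numpy as np

# Placeholder calibration values (NOT from Table 2): focal length in
# pixels and stereo baseline in metres. Substitute the real calibrated
# values before using this on the dataset.
focal_px = 1000.0
baseline_m = 0.06

def depth_from_disparity(disparity_px):
    """Convert a disparity map (in pixels) to metric depth: Z = f * B / d."""
    d = np.asarray(disparity_px, dtype=float)
    with np.errstate(divide="ignore"):
        # Zero disparity corresponds to a point at infinity.
        return np.where(d > 0, focal_px * baseline_m / d, np.inf)

print(depth_from_disparity(np.array([30.0, 60.0])))  # → [2. 1.] metres
```

The inverse relationship between disparity and depth means that depth resolution degrades quadratically with distance, which is worth keeping in mind when evaluating 3D segmentation on far-field regions such as the sky class.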

Limitations
Although the data is captured using a stereo camera setup, only the left videos are included in this first release of the dataset. This is because ground truth images for stereo image segmentation are still lacking: ground truth is currently available only for a single camera, so only the left-camera data is provided. The dataset will later be extended as our research progresses and further ground truth images are generated.

Fig. 1. A sample video frame from the dataset.

Fig. 2. A sample semantic segmentation ground truth (mask) image for a video frame in the dataset.

Fig. 6. Visualization of the camera parameters estimated from the checkerboard image pairs.

Table 1
The object IDs and descriptions of the object classes in the dataset.

Table 2
Details of the intrinsic and extrinsic camera parameters.