KFuji RGB-DS database: Fuji apple multi-modal images for fruit detection with color, depth and range-corrected IR data

This article contains data related to the research article entitled “Multi-modal Deep Learning for Fruit Detection Using RGB-D Cameras and their Radiometric Capabilities” [1]. The development of reliable fruit detection and localization systems is essential for future sustainable agronomic management of high-value crops. RGB-D sensors have shown potential for fruit detection and localization since they provide 3D information with color data. However, the lack of substantial datasets is a barrier to exploiting these sensors. This article presents the KFuji RGB-DS database, which is composed of 967 multi-modal images of Fuji apples on trees captured using Microsoft Kinect v2 (Microsoft, Redmond, WA, USA). Each image contains information from three different modalities: color (RGB), depth (D) and range-corrected IR intensity (S). Ground truth fruit locations were manually annotated, labeling a total of 12,839 apples across the dataset. The current dataset is publicly available at http://www.grap.udl.cat/publicacions/datasets.html.


Data
The KFuji RGB-DS database contains a total of 967 multi-modal images of Fuji apples on trees and the corresponding ground truth fruit location annotations. Each image contains data from three different modalities: color (RGB), depth (D), and range-corrected IR intensity (S). Fig. 1 illustrates three selected images from the dataset, showing ground truth annotations and the modalities that compose each image.
This dataset was built to be used for training, validation and benchmarking of fruit detection algorithms using RGB-D sensors. For instance, in Ref. [1], the deep convolutional neural network Faster R-CNN [2] was used to detect and localize fruits from the presented dataset.
Images are 548 × 373 px and were saved in three different files:
RGB hr (high resolution color image): Raw color image. These images are saved in 8-bit JPG files.
RGB p (projected color image): Projection of the color 3D point cloud onto the camera focal plane. The RGB p and the D-S modalities are obtained following the same procedure, allowing the comparison between these modalities for fruit detection. These images are saved in 8-bit JPG files.

Specifications table
Subject area
Machine learning, computer vision, deep learning, agronomy

More specific subject area
Image fusion, precision agriculture

Type of data
Multi-modal images with color (RGB), depth (D), and range-corrected IR intensity (S).

How data was acquired
The images were acquired using Microsoft Kinect v2.

Data format
Raw images: JPG
Raw point clouds: MAT
Pre-processed images: JPG (color channels) and MAT (depth and range-corrected IR channels)
Annotations: CSV and XML

Experimental factors
Different image modalities have been registered to have pixel-wise correspondence between image channels.

Experimental features
All captures were carried out during the night, using artificial lighting. S and D data were normalized between 0 and 255 (like RGB images) to achieve similar mean and variance between channels. This normalization allows faster learning convergence of machine learning algorithms (such as deep convolutional neural networks).
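The rescaling of a channel to the 0 to 255 range can be illustrated with a minimal sketch. The exact normalization used to build the dataset is not specified here, so the min-max scaling below, the function name `normalize_to_255` and the sample depth values are assumptions made only for illustration.

```python
def normalize_to_255(values):
    """Min-max scale a flat list of numbers to the range [0, 255].

    Assumption: plain min-max scaling; the dataset authors may have
    used a different normalization.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant channel carries no contrast; map it to zeros.
        return [0.0 for _ in values]
    scale = 255.0 / (hi - lo)
    return [(v - lo) * scale for v in values]


# Hypothetical depth readings in metres, for illustration only.
depth_m = [0.8, 1.2, 2.0, 3.2]
depth_scaled = normalize_to_255(depth_m)
```

After this step the smallest reading maps to 0 and the largest to 255, so the D and S channels share the numeric range of the 8-bit color channels.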
All images were manually annotated with rectangular bounding boxes, labeling a total of 12,839 apples across the dataset. Annotations are provided in XML and CSV formats, where each row corresponds to an apple annotation, giving the following information: item, topleft-x, topleft-y, width, height, label id.
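As an illustration of the annotation layout described above, the following sketch parses CSV rows into bounding boxes. It assumes comma-separated fields in the stated order (item, topleft-x, topleft-y, width, height, label id) and numeric coordinates; the class `AppleBox`, the function `read_annotations` and the sample rows are hypothetical and are not taken from the dataset files.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class AppleBox:
    item: str
    x: float       # topleft-x
    y: float       # topleft-y
    width: float
    height: float
    label_id: str

    def corners(self):
        """Return the box as (x_min, y_min, x_max, y_max)."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)


def read_annotations(text):
    """Parse annotation rows; skip short rows and non-numeric (header) rows."""
    boxes = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 6:
            continue
        item, x, y, w, h, label = (field.strip() for field in row[:6])
        try:
            boxes.append(AppleBox(item, float(x), float(y), float(w), float(h), label))
        except ValueError:
            continue  # e.g. a header line, if the file has one
    return boxes


# Made-up rows in the assumed field order, for illustration only.
sample = "1,10,20,30,40,0\n2,100.5,50,25,25,0\n"
boxes = read_annotations(sample)
```

Converting each row to corner coordinates with `corners()` is convenient because many detection frameworks expect boxes in (x_min, y_min, x_max, y_max) form rather than top-left corner plus size.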

Experimental design, materials, and methods
The data acquisition was carried out in a commercial Fuji apple orchard (Malus domestica Borkh. cv. Fuji), three weeks before harvesting (85 BBCH growth stage [3]). The RGB-D sensors used were two Microsoft Kinect v2 (Microsoft, Redmond, WA, USA), each composed of an RGB camera and a time-of-flight (ToF) depth sensor. For each capture, the sensor provides a 3D point cloud with RGB and backscattered IR intensity data, and a raw RGB image. Because the performance of the depth sensor drops under direct sunlight exposure [4], data was acquired at night using artificial lighting.
Pre-processing of data was carried out to build the multi-modal images with pixel-wise correspondence between channels. Fig. 2 shows an outline of the data preparation steps. To overcome the IR signal attenuation, the IR intensity data was range-corrected (Fig. 2a) following the methodology described in Ref. [1]. Then the acquired 3D point clouds were projected onto the camera focal plane (Fig. 2b), generating the RGB, range-corrected IR and depth projected images. These images were geometrically warped and registered (Fig. 2c) with RGB hr so that the different image modalities have pixel-wise correspondence. Finally, to reduce the number of fruits per image, and considering that fruit size is small compared with the image size, each capture was split into 9 images of 548 × 373 px (Fig. 2d).
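The final splitting step can be sketched as follows. The text states only that each capture was split into 9 images of 548 × 373 px; the non-overlapping 3 × 3 grid, the function names `tile_boxes` and `split_capture`, and the toy image are assumptions introduced for illustration.

```python
def tile_boxes(tile_w=548, tile_h=373, cols=3, rows=3):
    """Return (left, top, right, bottom) crop boxes, row by row.

    Assumption: tiles form a regular, non-overlapping cols x rows grid.
    """
    return [
        (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
        for r in range(rows)
        for c in range(cols)
    ]


def split_capture(image, tile_w=548, tile_h=373, cols=3, rows=3):
    """Split an image given as a list of pixel rows into a list of tiles."""
    return [
        [row[left:right] for row in image[top:bottom]]
        for left, top, right, bottom in tile_boxes(tile_w, tile_h, cols, rows)
    ]


# Toy 6 x 4 image split into 3 x 2 tiles of 2 x 2 pixels, for illustration.
toy = [[10 * y + x for x in range(6)] for y in range(4)]
tiles = split_capture(toy, tile_w=2, tile_h=2, cols=3, rows=2)
```

Because the registered modalities have pixel-wise correspondence, the same crop boxes can be applied to the color, depth and range-corrected IR channels so that each tile stays aligned across modalities.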