Fuji-SfM dataset: A collection of annotated images and point clouds for Fuji apple detection and location using structure-from-motion photogrammetry

The present dataset contains colour images acquired in a commercial Fuji apple orchard (Malus domestica Borkh. cv. Fuji) to reconstruct the 3D model of 11 trees by using structure-from-motion (SfM) photogrammetry. The data provided in this article are related to the research article entitled “Fruit detection and 3D location using instance segmentation neural networks and structure-from-motion photogrammetry” [1]. The Fuji-SfM dataset includes: (1) a set of 288 colour images and the corresponding annotations (apple segmentation masks) for training instance segmentation neural networks such as Mask-RCNN; (2) a set of 582 images defining a motion sequence of the scene, which was used to generate the 3D model of 11 Fuji apple trees containing 1455 apples by using SfM; (3) the 3D point cloud of the scanned scene with the corresponding apple position ground truth in global coordinates. This is therefore the first fruit detection dataset containing images acquired as a motion sequence for building the 3D model of the scanned trees with SfM, together with the corresponding 2D and 3D apple location annotations. These data allow the development, training, and testing of fruit detection algorithms based either on RGB images, on coloured point clouds, or on a combination of both types of data.



Value of the data
These data are useful for the research community for the following reasons:
• First dataset for fruit detection with 3D coloured point clouds generated by applying structure-from-motion photogrammetry. It differs from existing fruit detection datasets based on RGB, RGB-D and LiDAR sensors [2][3][4] by providing 3D point clouds obtained with SfM. Furthermore, it includes the set of motion-sequence images used for point cloud generation and a set of images manually annotated with instance segmentation masks.
• The computer vision community can use these data to test new object detection and segmentation algorithms based on either 2D or 3D data.
• The annotations provided can be used for training machine learning systems for agricultural applications such as yield prediction, yield mapping and automated harvesting [5][6][7][8].
• The dataset provides, for the first time, a highly detailed annotated benchmark for fruit detection based on machine learning and 3D computer vision techniques.
• The dataset allows all methods and results reported in the corresponding research article [1] to be reproduced.

Data
The Fuji-SfM dataset includes annotated data for 2D and 3D fruit detection. This dataset can be downloaded at http://doi.org/10.5281/zenodo.3712808 [9] . Once the dataset is unpacked, data is organized as shown in Fig. 1 .
The 1-Mask-set folder includes 12 raw images of Fuji apple trees. This set of images was used in [1] to train and validate Mask-RCNN [10]. Since the performance of object detection and segmentation neural networks decreases when detecting small objects [11], each Mask-set image was divided into 24 sub-images of 1024 × 1024 px (Fig. 2a and b). The resulting 288 sub-images were split into training (231 sub-images) and validation (57 sub-images) sub-sets and were manually annotated to generate the apple segmentation mask ground truth (Fig. 2c). Image annotations were saved in CSV and JSON file formats. The remaining folders (Fig. 1b and c) contain the SfM-set images used to reconstruct the scene and the resulting 3D point cloud with its fruit location ground truth (Fig. 3). A total of 1455 apples were annotated. Each fruit location annotation was saved in a TXT file in which the first row gives the position [x, y, z] of the apple centre, while the following eight rows give the positions of the bounding box corners.
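The 24-tile split of each 18 MP Mask-set image can be reproduced with evenly spaced overlapping crops: since 6 × 1024 px and 4 × 1024 px both exceed the 5184 × 3456 px frame, neighbouring tiles must overlap slightly. The sketch below is a minimal illustration assuming a 6 × 4 grid, which the dataset description does not state explicitly:

```python
def tile_offsets(image_size, tile, n_tiles):
    """Evenly spaced top-left offsets for n_tiles overlapping crops."""
    if n_tiles == 1:
        return [0]
    step = (image_size - tile) / (n_tiles - 1)
    return [round(i * step) for i in range(n_tiles)]

def tile_image(width=5184, height=3456, tile=1024, nx=6, ny=4):
    """Yield (left, top, right, bottom) crop boxes for nx*ny overlapping tiles
    covering the full frame. The 6 x 4 grid layout is an assumption."""
    for top in tile_offsets(height, tile, ny):
        for left in tile_offsets(width, tile, nx):
            yield (left, top, left + tile, top + tile)
```

With the default parameters this yields exactly 24 crops whose last tile ends at (5184, 3456), i.e. the full frame is covered.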
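Given the TXT layout described above (one centre row followed by eight corner rows), an annotation file could be parsed along these lines; the whitespace delimiter is an assumption, as the dataset description does not specify it:

```python
def read_apple_annotation(lines):
    """Parse one Fuji-SfM fruit location annotation: the first row is the
    apple centre [x, y, z]; the following eight rows are the bounding-box
    corner positions. Assumes whitespace-separated numeric values."""
    rows = [[float(v) for v in line.split()] for line in lines if line.strip()]
    centre, corners = rows[0], rows[1:]
    if len(corners) != 8:
        raise ValueError(f"expected 8 corner rows, got {len(corners)}")
    return centre, corners
```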

Experimental design, materials, and methods
Images provided in the Fuji-SfM dataset were acquired in September 2017 in a commercial Fuji apple orchard located in Agramunt, Catalonia, Spain (E: 336,297 m; N: 4,623,494 m; 312 m a.s.l., UTM 31T - ETRS89). The scanned trees were trained in a tall spindle system, with a maximum canopy height of 3.5 m and a width of approximately 1.5 m. Mask-set images were taken in different randomly selected zones of the orchard, while SfM-set images were acquired from both sides of 11 consecutive trees containing a total of 1455 apples. All data were acquired three weeks before harvest, at BBCH phenological growth stage 85 [12]. The camera used for data acquisition was a Canon EOS 60D DSLR (Canon Inc., Tokyo, Japan), with an 18 MP (5184 × 3456 px) CMOS APS-C sensor (22.3 × 14.9 mm) and a Canon EF-S 24 mm f/2.8 STM lens (35 mm film equivalent focal length of 38 mm). All images were taken freehand from a distance of approximately 3 m from the tree centres and at a height of 1.7 m (Fig. 3). Images of the east side of the tree row were taken in the morning (11:53-12:26 h), while the west side was photographed in the afternoon (15:27-16:05 h), in both cases under natural illumination conditions. Fig. 3 illustrates the data acquisition process followed for the SfM-set: yellow circles represent the camera centres at the different photographic positions. The separation between two consecutive positions was 0.2 m, giving a total of 53 photographic positions per side of the tree row. From each camera position, a vertical sweep of 5-6 photographs was taken (black lines). With this configuration, a total of 291 images were taken per side, with vertical and horizontal overlaps between neighbouring images higher than 30% and 90%, respectively (as shown in [1], Fig. 2). SfM-set images were used to reconstruct the 3D model of the 11 scanned trees.
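The reported overlap figures are consistent with the acquisition geometry. Under a simple pinhole-camera approximation with the sensor and lens parameters above, the horizontal scene footprint at 3 m and the overlap for a 0.2 m step between positions can be estimated as follows (a back-of-envelope check, not part of the dataset):

```python
def ground_coverage(sensor_mm, focal_mm, distance_m):
    """Scene width covered by the frame at a given distance (pinhole model)."""
    return sensor_mm / focal_mm * distance_m

def overlap(step_m, coverage_m):
    """Fractional overlap between two views separated by step_m."""
    return 1.0 - step_m / coverage_m

width = ground_coverage(22.3, 24.0, 3.0)  # ~2.79 m horizontal footprint
h_overlap = overlap(0.2, width)           # ~0.93, i.e. above the stated 90 %
```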
Multi-view structure-from-motion photogrammetry based on bundle adjustment [13] was applied to generate a 3D point cloud of each side of the tree row. The 3D model generation was carried out using Agisoft PhotoScan Professional software (v1.4, Agisoft LLC, St. Petersburg, Russia). A set of known markers in the scene was used to scale and georeference the obtained point clouds. The point clouds from the two sides of the tree row were then merged, yielding a complete representation of the scanned trees in a single point cloud.
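Scaling and georeferencing a photogrammetric point cloud from known markers amounts to estimating a similarity transform (scale, rotation, translation) between the marker coordinates in the arbitrary SfM frame and their surveyed positions. PhotoScan performs this internally; the sketch below shows the underlying closed-form (Umeyama) least-squares solution with NumPy, using illustrative coordinates rather than the actual marker positions:

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares scale s, rotation R, translation t such that
    dst ~= s * R @ src + t, via Umeyama's closed-form solution."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(0), dst.mean(0)
    A, B = src - mu_s, dst - mu_d               # centred coordinates
    U, S, Vt = np.linalg.svd(B.T @ A / len(src))  # cross-covariance SVD
    d = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (A ** 2).sum(axis=1).mean()
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Applying the recovered transform to the whole point cloud scales and places it in the global (surveyed) coordinate frame.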
Mask-set images were manually labelled with apple segmentation masks, allowing this set of images to be used to train and test 2D instance segmentation algorithms. The annotation was performed with the VIA annotation software [14], enclosing individual apples with polygonal region shapes. The point cloud of the 11 scanned trees was also manually labelled. Similarly to [15], the 3D annotation was carried out using CloudCompare (GPL software, v2.9 Omnia), placing 3D rectangular bounding boxes around each apple, as can be seen in the zoomed-in region of Fig. 3.
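VIA stores each polygonal region as paired `all_points_x`/`all_points_y` vertex lists, which can be rasterised into binary masks for training. The following is a minimal pure-Python sketch using the even-odd rule; a real pipeline would typically rely on an image library instead:

```python
def point_in_polygon(x, y, xs, ys):
    """Even-odd rule test for point (x, y) against polygon vertices (xs, ys)."""
    inside = False
    j = len(xs) - 1
    for i in range(len(xs)):
        if (ys[i] > y) != (ys[j] > y):
            # x-coordinate where the ray from (x, y) crosses edge j-i
            x_cross = xs[j] + (y - ys[j]) * (xs[i] - xs[j]) / (ys[i] - ys[j])
            if x < x_cross:
                inside = not inside
        j = i
    return inside

def polygon_to_mask(width, height, xs, ys):
    """Rasterise one VIA polygon region into a binary mask (list of rows),
    testing each pixel at its centre."""
    return [[point_in_polygon(x + 0.5, y + 0.5, xs, ys) for x in range(width)]
            for y in range(height)]
```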

Acknowledgments
This work was partly funded by the Secretaria d'Universitats i Recerca del Departament d'Empresa i Coneixement de la Generalitat de Catalunya (grant 2017 SGR 646), the Spanish Ministry of Economy and Competitiveness (project AGL2013-48297-C2-2-R) and the Spanish Ministry of Science, Innovation and Universities (project RTI2018-094222-B-I00). Part of the work was also developed within the framework of project TEC2016-75976-R, financed by the Spanish Ministry of Economy, Industry and Competitiveness and the European Regional Development Fund (ERDF). The Spanish Ministry of Education is thanked for Mr. J. Gené's pre-doctoral fellowship (FPU15/03355). We would also like to thank Nufri (especially Santiago Salamero and Oriol Morreres) and Vicens Maquinària Agrícola S.A. for their support during data acquisition, and Ernesto Membrillo and Roberto Maturino for their support in dataset labelling.