Simultaneous Fruit Detection and Size Estimation Using Multitask Deep Neural Networks

The measurement of fruit size is of great interest for estimating yield and planning harvest resources in advance. This work proposes a novel technique for in-field apple detection and measurement based on Deep Neural Networks. The proposed framework was trained with RGB-D data and consists of an end-to-end multitask Deep Neural Network architecture specifically designed to perform the following tasks: 1) detection and segmentation of each fruit from its surroundings; 2) estimation of the diameter of each detected fruit. The methodology was tested with a total of 15335 annotated apples at different growth stages, with diameters varying from 27 mm to 95 mm. The method achieved an F1-score of 0.88 for apple detection and a mean absolute error of 5.64 mm for diameter estimation. These are state-of-the-art results with the additional advantages of: a) using an end-to-end multitask trainable network; b) an efficient and fast inference speed; and c) being based on RGB-D data, which can be acquired with affordable depth cameras. In contrast, the main disadvantage is the need to annotate a large amount of data with fruit masks and diameter ground truth to train the model. Finally, a fruit visibility analysis showed an improvement in the prediction when limiting the measurement to apples above 65% of visibility (mean absolute error of 5.09 mm). This suggests that future work should develop a method for automatically identifying the most visible apples and discarding the predictions for highly occluded fruits.


Introduction
According to the Food and Agriculture Organization (FAO), by 2050 the agriculture industry will need to produce 70% more food while only being able to use 5% more land. Since most land suitable for farming is already in use, this production growth has to come from another source. The introduction of Precision Agriculture has enabled farmers to measure, map and manage crops at their different stages to increase production while optimising used resources and costs.
Interest in vision techniques for Precision Agriculture has grown in recent years. Such techniques have contributed to triggering improvements in field conditions and have also helped farmers to better estimate their production through the use of fruit counting (Gené-Mola et al., 2020a) or fruit size estimation methods (Casagrande et al., 2021; Tsoulias et al., 2020), among others. This project focuses on the use of Deep Neural Networks (DNNs) for the detection of fruits on the tree and the estimation of their size.
Fruit growing and production are of great importance worldwide and in-field fruit monitoring contributes to the optimisation of its management (Anderson et al., 2021). Some of the most interesting tasks for farmers in this regard are fruit detection since it is used for fruit counting, and fruit size estimation, which is an important fruit quality parameter. In-field fruit counting and sizing also serve to estimate yield load and plan for its future transport, determine whether an automated harvesting system can be supported, or assess the validity of different cultivation techniques (Longchamps et al., 2022).
Traditionally, in the agriculture industry, yield prediction of orchards has been a challenging task since fruit measurement is usually done manually using a Vernier calliper. Therefore, only an approximation is obtained, since not all the fruits are measured due to the time-consuming nature of the task. Another challenge is the generalization capacity of the methodology so fruits can be measured at different growth stages (Neupane et al., 2023).
Nowadays, most fruit detection works are based on Neural Networks, using object detectors (Aguiar et al., 2021; Ghiani et al., 2021), semantic segmentation (Afonso et al., 2020; Peng et al., 2021) or instance segmentation (Liu et al., 2019). In contrast, most fruit size estimation methods are based on classical techniques such as geometrical (circle, ellipse, sphere…) fitting algorithms (Gené-Mola et al., 2023; Kurtser et al., 2020; Neupane et al., 2022) or on segmenting the detected fruits and measuring the segmented area (Apolo-Apolo et al., 2020; Costa et al., 2021; Lu et al., 2022). These methods are highly affected by the quality of the fruit segmentation and the degree of fruit visibility (Wang et al., 2020). In addition, most of them are 3D-based and therefore computationally expensive, which limits real-time operation (Gené-Mola et al., 2021a). Methods based on 2D images require calibration targets that must be placed at the same distance from the camera as the fruits, which adds complexity to the data acquisition process (Lu et al., 2022; Wang et al., 2018).
Alternatively, in this work, we propose a novel Multitask Neural Network specifically designed for simultaneous fruit detection and sizing using RGB-D images captured under field conditions, without the need for calibration targets.
To the best of the authors' knowledge, this is the first end-to-end trainable Multitask DNN to detect and estimate the diameter of fruits by combining two architectures: one for detection and the other for diameter regression.
The paper itself is divided into the following sections. First, Section 2 describes the proposed methodology, presenting the dataset and how to pre-process it. An explanation is then provided of how and why depth maps are used, and finally, the developed Neural Network architecture is presented in detail. Section 3 presents and analyses the most relevant validation and test results in terms of fruit detection and size estimation. In Section 4, we discuss the results, while the main conclusions and future research lines are discussed in Section 5.

Data and data pre-processing
The data used in this project was generated by annotating images from the PFuji-Size dataset (Gené-Mola et al., 2021b). The original dataset includes: (1) raw images used to generate the 3D point clouds of apple trees using structure from motion (SfM) and multi-view stereo (MVS) techniques; (2) the resulting 3D point clouds of the Fuji apple trees; (3) 3D instance (fruit) segmentation annotations; (4) fruit size (diameter) annotations, which were manually obtained by measuring the maximum horizontal diameter using a Vernier calliper; and (5) the centre position of each apple. In addition, the fruit diameter and centre position were used to obtain a 3D spherical mask of each apple. Part of the data was captured in October 2018, when apples were at an advanced ripening stage (growth stage BBCH85 on the Meier (2001) scale) (Fig. 1a), while the rest of the data was acquired in July 2020, when the apples were at 70% of their final size (growth stage BBCH77 on the Meier (2001) scale) (Fig. 1b). Each apple was assigned a unique identifier, which helped to manage the data.
To adapt the original dataset to the needs of the present research, additional data curation was required to: (1) generate depth images; and (2) obtain 2D image annotations (apple masks) and establish the correspondence between each mask and its ground truth diameter using the apple IDs. [...] information was projected onto the 2D images in order to obtain the apple ID and, consequently, its diameter as provided in the PFuji-Size dataset. Furthermore, the 2D projection of the spherical masks was also obtained (Fig. 3c). This projection was carried out following the pinhole camera model (Faugeras, 1993). This allowed us to estimate the percentage of visibility of each annotated apple, which is defined as the ratio between the area (in pixels) of the instance segmentation mask (Fig. 3b) and the area of the projected spherical mask (Fig. 3c).
This definition can be written as the following expression, which gives the visibility of an apple with a unique ID as the number of pixels in its instance mask (the visible part of the apple) over the number of pixels in its projected spherical mask (the whole area of the fruit), expressed as a percentage:

$$\text{visibility}_{ID} = \frac{\text{number of pixels in the instance mask}_{ID}}{\text{number of pixels in the projected spherical mask}_{ID}} \times 100$$

Finally, the result of this automatic annotation was manually corrected using the VIA annotation software (Dutta and Zisserman, 2019). The manual correction consisted of: (1) deleting wrongly identified apple masks; (2) correcting wrongly matched apple IDs and ground truth diameters; (3) labelling apples that the automatic annotation had missed. The result of this annotation is shown in Fig. 3.
The generated dataset was split into training, validation and test sets. Images acquired from the west side of the row of trees were used for training, while east-side images were used for validation and testing. Table 1 details the number of images and apple annotations from the BBCH77 (green apples) and BBCH85 (ripe apples) data, split into the training, validation and test sets. All the data generated and used for this project (RGB-D images and annotations) has been made publicly available at http://www.grap.udl.cat/en/publications/papple_rgb-d-size-dataset/.
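The visibility ratio described above can be illustrated with a minimal Python sketch. The function name `visibility_percent` and the toy masks are hypothetical and are not taken from the published code; masks are assumed to be binary nested lists of equal size.

```python
def visibility_percent(instance_mask, sphere_mask):
    """Visible-apple pixels over projected-sphere pixels, as a percentage."""
    visible = sum(1 for row in instance_mask for v in row if v)
    full = sum(1 for row in sphere_mask for v in row if v)
    if full == 0:
        raise ValueError("projected spherical mask is empty")
    return 100.0 * visible / full


# Toy 4 x 4 example (invented data): the sphere projects onto 8 pixels,
# 6 of which are visible in the instance mask.
sphere = [[0, 1, 1, 0],
          [1, 1, 1, 1],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
instance = [[0, 1, 1, 0],
            [1, 1, 1, 0],
            [0, 1, 0, 0],
            [0, 0, 0, 0]]
print(visibility_percent(instance, sphere))
```

In this toy case the apple would be reported as 75% visible, so it would pass a 65% visibility filter.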

Instance segmentation branch
The Mask R-CNN architecture was used as the baseline. It is a well-known two-stage network that detects objects in an image while also generating a segmentation mask for each detection. It is an extension of the Faster R-CNN (Ren et al., 2017) network and, in our case, we used the Detectron2 implementation. [...] four feature maps (1, 3, 4, 5). Next, to generate the final feature maps, the Feature Pyramid Network (FPN) was used. Identifying the same object at different scales is known to be a challenging task and sometimes the network is not able to generalise well in this regard. For this reason, FPNs are of great use in such situations, since they take the input features at different scales and transform them so that, at smaller scales, the network focuses on the larger objects, while, at bigger scales, the network can focus on extracting features for the smaller objects. The outputs of the backbone are the final feature maps P2, P3, P4 and P5, while the feature map P6 is generated from a max-pooling operation on P5.
• The k proposals are parameterised relative to the k reference boxes (anchors) generated with cluster analysis over the training set. Mask R-CNN, like Faster R-CNN, uses three scales and three aspect ratios by default, yielding k = 9 anchors at each cell. The output boxes of the RPN are called proposal boxes.
• Box Head: The region of interest (RoI) Align process crops the rectangular regions of the feature maps that are specified by the proposal boxes. RoI Align is a more precise way to perform RoIPooling, as it matches the feature map level that is most convenient for each bounding box. The result is fed to the Box Head, which has two fully connected layers and performs regression to obtain the final corners of each bounding box. L1-loss is used to calculate the error.
• Mask Head: With the new bounding boxes estimated by the Box Head, RoI Align is again performed, and the output is the input for the Mask Head. This branch of the network is formed by three convolutional layers and a deconvolutional one. The loss used is the cross-entropy loss.
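As an illustration of how three scales and three aspect ratios yield k = 9 anchors per cell, the following sketch enumerates them for a single location. The scale and ratio values are the defaults reported for Faster R-CNN and are used here only as an assumption; they are not confirmed as the values used in this work, and `make_anchors` is a hypothetical helper.

```python
import math


def make_anchors(cx, cy, scales, ratios):
    """Return (x1, y1, x2, y2) anchors centred on (cx, cy).

    Each anchor has an area of scale**2 and a height/width ratio of `ratio`.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Assumed values: 3 scales x 3 aspect ratios.
anchors = make_anchors(0.0, 0.0, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0))
print(len(anchors))
```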
Usually, the input of Mask R-CNN is a 3-channel colour image. However, this project considers the depth information to be relevant for fruit size estimation purposes, so the input has four channels: RGB+D (Fig. 4a). In order to boost speed, the depth channel was added following an early-fusion strategy (Sa et al., 2016). Due to this additional channel, the filters of the first convolutional layer increase in depth (from 3 to 4). This modification does not affect the detection accuracy, as stated in previous works (Gené-Mola et al., 2019), and it is a way of representing a three-dimensional space using two-dimensional data, which is very helpful for the fruit size estimation.

Diameter regression branch
The Diameter Head is a regression branch added to the baseline (Mask R-CNN) that aims to estimate the maximum horizontal diameter of detected apples. Fig. 4c (region coloured in green) illustrates a conceptual representation of the architecture of this branch inspired by the Mask Head. Its main differences from the Mask Head are the addition of depth information as an input and a final linear layer to predict the diameter.
The input of this Diameter Head comes from different parts of the network and has to be properly combined. The output from the FPN is fed into the sub-network. These four groups of feature maps (levels) have different sizes: P2 (256 × 256), P3 (128 × 128), P4 (64 × 64) and P5 (32 × 32). In addition, to ensure that the depth information plays a major role in the network's final weights, the depth maps of the corresponding images in the batch are resized to each of these four sizes and concatenated to the FPN feature maps. The feature maps, together with the resized depth information, are matched with the bounding boxes coming from the Box Head using RoI Align.
The Diameter Head architecture (Fig. 4c) is formed by three 2D convolutions with a kernel of size 3 × 3, a stride of 1 × 1 and padding of 1 × 1. Each of the three convolution layers has a ReLU activation. Then, there is a deconvolution layer with a kernel of size 2 × 2 and a stride of 2 × 2, followed by a ReLU activation. This deconvolution up-samples the feature map from 14 × 14 (default pooling resolution) to 28 × 28. After the deconvolution, the data is flattened and fed to a linear layer that predicts the diameter for that mask.
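The depth concatenation described above can be sketched as follows. The text does not state which interpolation is used to resize the depth maps, so nearest-neighbour sampling is assumed here; `resize_nearest` and `append_depth` are hypothetical names, and the sizes are scaled down for readability.

```python
def resize_nearest(depth, out_h, out_w):
    """Resize an H x W nested list with nearest-neighbour sampling (assumed)."""
    in_h, in_w = len(depth), len(depth[0])
    return [
        [depth[(i * in_h) // out_h][(j * in_w) // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def append_depth(feature_maps, depth):
    """Append the resized depth map as one extra channel of a pyramid level.

    feature_maps: list of channels, each an H x W nested list.
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return feature_maps + [resize_nearest(depth, h, w)]


# Toy example: an 8 x 8 depth map fused with a 2-channel, 4 x 4 pyramid level.
depth = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
level = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
fused = append_depth(level, depth)
print(len(fused), len(fused[-1]), len(fused[-1][0]))
```

The same call would be repeated once per pyramid level, each time resizing the depth map to that level's resolution.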
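The spatial sizes quoted for the Diameter Head can be checked with the standard output-size formulas for convolutions and transposed convolutions (dilation 1, no output padding). This is only a shape calculation, not the published implementation; the channel count of 256 is an assumption borrowed from the usual Mask Head configuration.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding=0):
    """Output size of a transposed convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + kernel


size = 14  # default RoI Align pooling resolution
for _ in range(3):
    size = conv_out(size, kernel=3, stride=1, padding=1)  # resolution preserved
after_convs = size
size = deconv_out(size, kernel=2, stride=2)  # up-sampling by a factor of 2

channels = 256  # assumption, not stated in the text
flattened = channels * size * size  # input width of the final linear layer
print(after_convs, size, flattened)
```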
The developed network was implemented in the PyTorch framework and the code has been made publicly available together with the presented dataset at http://www.grap.udl.cat/en/publications/papple_rgb-d-size-dataset/.

2.2.3. Network training and inference details
a) Weight initialisation: Mask R-CNN has a set of weight initialisations pre-trained with different backbones on ImageNet (Deng et al., 2009). In our case, the weights used were pre-trained with a ResNet50 backbone. However, the new Diameter Head added to the baseline in this project also needs its own initial weights. We tried both a standard MSRA initialisation (He et al., 2015) and re-using the Mask Head weights. Our experiments showed better results when re-using the Mask Head weights.
b) Data augmentation: One of the most popular techniques to increase the accuracy of the model is performing data augmentation. Creating "new" data from the existing images allows the models to generalise better and helps avoid overfitting. However, the scenario we are observing is quite monotonous, at least in the short term (and at certain hours of the day), so many augmentations such as 2D rotations might not be of great use. After some trial and error processes, we concluded that the best data augmentation technique was simply to apply a horizontal flip of the image.
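A horizontal flip has to be applied consistently to the colour channels, the depth channel and the instance masks, while the diameter labels remain unchanged. The sketch below illustrates this on nested lists; the sample layout (`channels`, `masks`, `diameters_mm`) is a hypothetical structure, not the format of the published dataset.

```python
def hflip(grid):
    """Mirror an H x W nested list left to right."""
    return [row[::-1] for row in grid]


def hflip_sample(sample):
    """Flip every image channel and every mask; diameters are left untouched."""
    return {
        "channels": [hflip(c) for c in sample["channels"]],
        "masks": [hflip(m) for m in sample["masks"]],
        "diameters_mm": list(sample["diameters_mm"]),
    }


# Toy sample (invented data): four 1 x 3 channels (R, G, B, D) and one mask.
sample = {
    "channels": [[[1, 2, 3]] for _ in range(4)],
    "masks": [[[0, 1, 1]]],
    "diameters_mm": [71.5],
}
flipped = hflip_sample(sample)
print(flipped["channels"][3], flipped["masks"][0], flipped["diameters_mm"])
```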

c) Evaluation metrics:
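To evaluate the fruit detection results, we used precision (P), recall (R), F1-score, and average precision (AP) metrics. Diameter estimation was evaluated in terms of the mean absolute error (MAE), the mean bias error (MBE), the mean absolute percentage error (MAPE), the root mean square error (RMSE) and the coefficient of determination (R²):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|d_i - d_{gt,i}\right|$$

$$\mathrm{MBE} = \frac{1}{n}\sum_{i=1}^{n}\left(d_i - d_{gt,i}\right)$$

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\frac{\left|d_i - d_{gt,i}\right|}{d_{gt,i}}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(d_i - d_{gt,i}\right)^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(d_i - f(d_{gt,i})\right)^2}{\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}$$

where $n$ is the number of observations, $d$ is the diameter estimation, $d_{gt}$ is the diameter ground truth, $\bar{d}$ is the mean of estimated diameters and $f$ is the linear regression model that relates $d$ and $d_{gt}$.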
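The diameter metrics named above can be computed as in the following sketch, which uses their standard definitions and takes the error as estimate minus ground truth, so that a positive MBE means overestimation. The function name and the toy diameters are illustrative only, and R² is obtained from an ordinary least-squares line fitted to the estimates as a function of the ground truth.

```python
import math


def diameter_metrics(pred, gt):
    """MAE, MBE, MAPE, RMSE and R^2 for paired diameter lists (in mm)."""
    n = len(pred)
    err = [p - g for p, g in zip(pred, gt)]
    mae = sum(abs(e) for e in err) / n
    mbe = sum(err) / n
    mape = 100.0 * sum(abs(e) / g for e, g in zip(err, gt)) / n
    rmse = math.sqrt(sum(e * e for e in err) / n)

    # Least-squares line pred ~ slope * gt + intercept, then R^2.
    mean_p, mean_g = sum(pred) / n, sum(gt) / n
    sxy = sum((g - mean_g) * (p - mean_p) for p, g in zip(pred, gt))
    sxx = sum((g - mean_g) ** 2 for g in gt)
    slope = sxy / sxx
    intercept = mean_p - slope * mean_g
    ss_res = sum((p - (slope * g + intercept)) ** 2 for p, g in zip(pred, gt))
    ss_tot = sum((p - mean_p) ** 2 for p in pred)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mbe": mbe, "mape": mape, "rmse": rmse, "r2": r2}


# Invented example values, in millimetres.
m = diameter_metrics(pred=[62.0, 70.0, 81.0, 90.0], gt=[60.0, 72.0, 80.0, 92.0])
print(m)
```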

d) Training hyperparameter optimisation:
The Stochastic Gradient Descent (SGD) optimiser was used, and the optimal hyperparameters were found by means of a grid search. We considered a parameter to be optimal if the model yielded the smallest diameter error in the inference process. A grid search was performed over the following hyperparameters:
I. Learning rate: In this work, we performed a grid search over three candidate values, each of the form 2 · 10^(−k). The second of the three candidates was found to be optimal.
II. Non-maximum suppression (NMS) threshold: NMS is used by the model to determine the number of accepted bounding box predictions, since it suppresses boxes whose IoU with a higher-scoring box is bigger than the specified threshold. We selected the optimum NMS threshold by finding the one that maximised the average precision (AP) in the validation set (Section 3.1). This analysis was carried out using the AP, as this metric is not affected by the selected confidence score.
III. Confidence score threshold: The minimum confidence score to consider a detection as positive was selected by analysing the P, R and F1-score curves for confidence values ranging from 0 to 1 (Section 3.1).
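To make the role of the NMS threshold concrete, the following is a minimal sketch of greedy non-maximum suppression on axis-aligned boxes. It is a generic textbook version, not the Detectron2 implementation used in this work, and the boxes and scores are invented.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.95]
print(nms(boxes, scores, iou_threshold=0.1))
print(nms(boxes, scores, iou_threshold=0.9))
```

With a restrictive threshold the second box is suppressed because it overlaps the first one heavily, whereas a permissive threshold keeps all three boxes.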

Experiments on the validation set
The presented model was trained using the parameters detailed in the section above. Fig. 5a represents the loss curves for training and validation. The training was stopped when the model's generalisation capacity reached its limit and the validation curve started to flatten. The presented model does not show signs of overfitting, as can be observed in more detail in Fig. 5b and Fig. 5c, where both curves, for the fruit detection and diameter estimation tasks, descend.

Fruit detection
To find the optimal NMS threshold, the AP metric was used to analyse the effect of applying different levels of NMS in the validation set. Fig. 6a represents the evolution of the AP curve with respect to the NMS threshold. The highest values of AP correspond to the most restrictive NMS threshold (NMS threshold = 0.1), which means that the bounding boxes that have more than 10% of overlap are eliminated. Based on these results, an NMS threshold of 0.1 was subsequently used to assess the fruit detection performance in the test set (Section 3.2).
The P, R and F1-score curves obtained in the validation set were analysed to choose the best confidence score value. Fig. 6b shows the behaviour of these parameters with respect to confidence. Note that the F1-score curve remains almost constant, although it slightly decreases with the confidence score. This might be due to the fact that the NMS threshold was already optimised, and so some possible outliers were already filtered. Furthermore, the P curve is an increasing function since the higher the confidence in the prediction, the less likely it is to encounter an FP. In contrast, the R curve tends to decrease, which is due to the fact that a more restrictive confidence threshold results in fewer predictions being accepted, and therefore, fewer TPs. Since the F1-score metric is maximised for a confidence value of 0.7, such confidence value was used to assess the fruit detection performance in the test set (Section 3.2).
Fig. 6. Fruit detection results on the validation set: (a) average precision depending on the NMS threshold; (b) P, R and F1-score curves depending on the confidence score.
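The behaviour of the P, R and F1-score curves with respect to the confidence threshold can be reproduced with a small sweep such as the one sketched below. It assumes that each detection has already been matched against the ground truth and flagged as a true or false positive; the function names and the toy detections are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def sweep_confidence(detections, n_ground_truth, thresholds):
    """detections: list of (score, is_true_positive) pairs."""
    rows = []
    for t in thresholds:
        kept = [ok for score, ok in detections if score >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = n_ground_truth - tp  # ground-truth fruits left undetected
        rows.append((t,) + precision_recall_f1(tp, fp, fn))
    return rows


# Invented detections for an image set with 4 annotated apples.
detections = [(0.95, True), (0.9, True), (0.8, False), (0.6, True), (0.3, False)]
rows = sweep_confidence(detections, n_ground_truth=4, thresholds=[0.5, 0.85])
for t, p, r, f1 in rows:
    print(t, round(p, 3), round(r, 3), round(f1, 3))
```

Raising the threshold in this toy case increases precision and lowers recall, which mirrors the trend described for Fig. 6b.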

Fruit size estimation
The diameter estimation error is also affected by the degree of confidence in the prediction. Fig. 7a shows the evolution of the MAE of the estimated diameter with respect to the confidence threshold. The MAE decreases with higher confidence values, which means that diameter estimation improves when measuring fruits detected with higher confidence. This improvement is also observed in the increase of the coefficient of determination (R²) between predicted and actual fruit sizes (Fig. 7b). Since the confidence value that minimises the MAE is 0.99, the results presented in Section 3.2.2 were obtained using it as a threshold. Although 0.99 is restrictive, the number of fruits detected using it (about 2300 apples) is a representative sample of all detections (about 3150 apples).

Fruit detection
The model was evaluated using the test set. Table 2 shows the detection results using the optimal parameters (NMS threshold = 0.1, confidence > 0.7). Similar F1-score results were obtained at different growth stages (F1 = 0.88). In terms of AP, the neural network presented a better performance detecting ripe apples (AP = 0.75 on the BBCH85 set) than detecting green apples (AP = 0.69 on the BBCH77 set). We attribute this difference to two main reasons: (1) the green colour of apples from the BBCH77 set makes the task more challenging due to the similarity of apple and leaf colour; and (2) the number of training samples in the BBCH85 set is larger than in the BBCH77 set. Fig. 8 shows the input image and the comparison between the actual ground truth and the model's prediction regarding both detection and diameter estimation. In terms of detection, the model performed as expected and the majority of fruits were detected. The critical cases were those with high occlusions or with apples at the margins of the image.
In terms of computational speed, the average processing time per image was 0.1439 s/img, which corresponds to a throughput of 6.95 img/s. These processing times were obtained using an NVIDIA GeForce GTX 1080 Ti GPU.
Fig. 8. Fruit detection and size estimation results in four randomly selected images. The top two rows show results on two images from the BBCH85 set (ripe apples), while the bottom two rows show results on images from the BBCH77 set (green unripe apples). The first column corresponds to the original images. The second column illustrates the instance segmentation masks and apple diameter ground truth. The third column illustrates the fruit detections (instance segmentation masks) and the estimated diameters.

Fruit size estimation
A detailed comparison of the performance of the model at different apple maturity stages can be found in Table 3. Results showed MAE between 4.16 mm and 6.22 mm, with higher errors for ripe apples (with bigger sizes). The absolute error presented a standard deviation (σ) between 3.88 mm and 5.73 mm. This is considered a high standard deviation compared to the reported MAE. The authors attribute this high dispersion of the error to the higher errors obtained when measuring highly occluded apples. The percentage error was found to be similar at different growth stages (about 8%), indicating that the mean error is proportional to the size of the measured apples. Fig. 9 compares the distribution of the predicted diameters to the ground truth. The predicted distribution for both datasets fits the actual ground truth closely. One thing to note is that, for the smaller apples, the tendency is to slightly overestimate their size (MBE = 0.93 mm), whereas the size of the bigger apples tends to be underestimated (MBE = −1.74 mm). This effect is caused by the fact that the model was not trained for a specific apple size, so it tends to look for the middle ground.

Discussion
The presented model was robust enough to detect a significant number of apples at different growth stages and degrees of visibility, reporting an F1-score of 0.88 on the task. These results are comparable with other state-of-the-art works based on neural networks, which have reported F1-score values between 0.73 and 0.97 (Chu et al., 2021;Koirala et al., 2019;Wang and He, 2022). The apples that were not detected were highly occluded by other structural elements (leaves, trunks, other apples, …) or placed in the margins of the images with a small amount of the apple surface visible in the field of view of the camera. These fruit detection issues were also observed in previous works (Gené-Mola et al., 2019).
Having a robust fruit detector is of extreme importance for fruit counting but also for fruit sizing purposes, since it ensures that the size measures will be representative of the crop. The proposed methodology was able to predict the diameter of apples at different ripening stages, reporting a MAE of 5.64 mm. As presented in the previous section, it tends to overestimate the size of the smaller apples and underestimate the size of the bigger ones. However, we argue that this bias is negligible, since Fig. 9 showed that the distribution of the predicted diameters adapts properly to the ground truth diameter distribution.
The results we show might differ if images with very different lighting conditions or different tree-camera distances are used. Nevertheless, the data-gathering settings are easy to reproduce and are public alongside the whole dataset (Gené-Mola et al., 2021b). Although it is difficult to compare methodologies tested with different datasets, we can state that, in terms of mean diameter errors, our method performed similarly to other state-of-the-art methods, which reported MAE results between 3.5 mm and 12.4 mm, as we can see in Table 4.
The main contribution of this work is that, for the first time, an end-to-end deep learning architecture has been designed and tested for the simultaneous detection and measurement of fruits. Besides its performance in terms of detection and sizing, the method presents other significant advantages: it overcomes the limitations of traditional sizing methodologies where calibration targets are required and must be placed at the same distance from the camera as the fruits to be measured (Lu et al., 2022; Wang et al., 2018). With these techniques, only the fruits around the calibration targets can be measured, while our proposed methodology could measure larger areas efficiently. In addition, previous fruit sizing works required the identification of feature points on the apple images to subsequently perform a geometrical measurement (Wang et al., 2020). Alternatively, the presented method directly estimates the diameter of the apples without the need to identify specific key points to measure, which results in a more efficient method. Furthermore, since it is based on a CNN that can be processed with graphics processing units (GPUs) and parallel computing, our method is suitable for real-time and edge-computing applications (Mazzia et al., 2020). We obtained a competitive throughput of 6.95 img/s thanks to using an early-fusion network architecture. This is considered a high inference speed for simultaneous fruit detection and sizing compared with other state-of-the-art [...] based on LiDAR, which required a processing time of 13 s per tree.
Another advantage of our method is that it is based on the use of RGB-D images, which allows us to include the 3D information without the computational complexity of adding another dimension. In this paper, we used highly precise depth maps that were created using SfM which requires a great number of images of the studied area. We propose to study the different effects that the depth maps obtained using commercial sensors have on the presented framework in future work.
Some fruit sizing works from the literature limit the evaluation of their methods to fully visible fruits (Gongal et al., 2018; Herrero-Huerta et al., 2015; Wang et al., 2017, 2020). However, the present work presents an analysis of results at different fruit visibility percentages. Results showed that the more visible an apple is, the better its diameter will be predicted. The MAE improved from 5.64 mm to 5.09 mm when limiting the measurement to fruits with visibility percentages higher than 65%. This suggests that future works should explore the development of a method for automatically identifying the most visible apples and discarding the predictions for fruits with low visibility.
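The visibility analysis amounts to recomputing the MAE on the subset of apples whose visibility exceeds a threshold. The sketch below shows the idea with a hypothetical record layout (predicted diameter, ground truth diameter, visibility percentage) and invented values that are unrelated to the reported results.

```python
def mae_above_visibility(records, min_visibility):
    """MAE (mm) over apples whose visibility (%) exceeds min_visibility.

    records: iterable of (predicted_mm, ground_truth_mm, visibility_percent).
    """
    kept = [(p, g) for p, g, v in records if v > min_visibility]
    if not kept:
        raise ValueError("no apples above the visibility threshold")
    return sum(abs(p - g) for p, g in kept) / len(kept)


# Invented records: the highly occluded apple carries the largest error.
records = [(70.0, 72.0, 90.0), (65.0, 66.0, 80.0), (80.0, 70.0, 30.0)]
print(mae_above_visibility(records, 0.0))   # all apples
print(mae_above_visibility(records, 65.0))  # only apples above 65% visibility
```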

Conclusions
This project proposes a deep learning approach for simultaneous fruit detection and size estimation. The method presented can be used to measure fruits at different growth stages and, as stated in the introduction, such insights can provide farmers with much-needed data to manage their crops more efficiently. The baseline for this work was the Mask R-CNN instance segmentation network, which was extended with a regression branch in order to compute the diameter of the detected apples, yielding successful results both in terms of fruit detection and fruit size estimation.
Regarding apple detection, our method achieves state-of-the-art performance, with an F1-score of 0.88. Furthermore, the presented architecture was able to estimate fruit size with a MAE of 5.64 mm. Results were robust at different degrees of visibility but, when discarding the measurements of highly occluded apples, the correlation between actual and estimated diameter slightly improved (from R² = 0.66 to R² = 0.77). These results are similar to other state-of-the-art methodologies, but our proposed method has the following advantages: a) it simultaneously detects and estimates the size with a single end-to-end trainable network; b) it is efficient and fast, so it can be used for real-time applications; and c) it uses RGB-D data, which can be acquired with affordable depth cameras.
The method presented successful results, demonstrating the promising future of deep learning approaches in the field of fruit sizing. However, there is still room for improvement. A combination of the proposed method with automatic estimation of fruit visibility would help to select the best candidate apples to be measured. In addition, an unexplored and promising path for fruit size computation would be to use Graph Neural Networks, which use 3D data. Finally, although this work deals with apples, it could be extended to other fruit varieties.