GBCNet: In-Field Grape Berries Counting for Yield Estimation by Dilated CNNs

We introduce the Grape Berries Counting Net (GBCNet), a tool for accurate fruit yield estimation from smartphone camera images, obtained by adapting Deep Learning algorithms originally developed for crowd counting. We test GBCNet with a cross-validation procedure on two original datasets, CR1 and CR2, of grape pictures taken in-field before veraison. A total of 35,668 berries were manually annotated for the task. GBCNet performs well both on CR1, which covers seven grape varieties, although with an accuracy level that depends on the variety, and on the single-variety dataset CR2: in particular, the Mean Absolute Error (MAE) ranges from 0.85% for Pinot Gris to 11.73% for Marzemino on CR1 and reaches 7.24% on the Teroldego CR2 dataset.


Introduction
The recent adoption of digital technologies to better assess the conditions of agricultural fields and to improve production processes [1][2][3], commonly known as precision agriculture, represents a growing trend with high economic impact, potentially triggering wider societal changes as indicated by the author in [1].
Precision agriculture aims to increase productivity while reducing the amount of treatments on crops, eventually increasing the availability of safer food at lower costs, a critical aim for the near future [4]. The main pillar of such a breakthrough is the systematic use of technology, including the widespread adoption of sensors, both in-field and in-lab for quality control processes.
In addition to the expensive and highly accurate analytical instruments used in the lab, sensors on portable devices are constantly being developed in precision agriculture to support quality control, dramatically reducing costs while obtaining results comparable to those obtained in labs with traditional technologies. Examples are the use of small sophisticated tools [5][6][7] or even portable generic cameras [8,9] mounted on tractors or robots for in-field image acquisition, or the use of remote sensing imagery [10]. An even more appealing opportunity for farmers is to employ the smartphones [11][12][13][14] they already have and use in their daily activities. This simplified approach can overcome the current procedure based on destructive sampling (cutting off and weighing a collection of grape bunches) to obtain a yield estimate, as proposed in a rich line of research.

Preliminaries
Measuring grape weight is a crucial task for wine producers, also in view of quality control aspects, for example to decide whether to thin the clusters or defoliate the shoots. As the amount of nutrients present in the ground and transmitted to the grapes is substantially constant [40], regulating the grape weight has a critical impact on wine quality. The standard procedure estimates yield as a function of the number of vines per surface unit N_v, the number of grape bunches per vine N_b and the average weight of the bunch P_b, combined as follows to obtain the yield:

Yield = N_v × N_b × P_b. (1)

Clearly, the method has practical limitations, in particular connected to the possibility of obtaining long-term forecasts. In fact, the average weight of the clusters P_b can be accurately determined only close to the harvest phase, and estimation based only on historical data is difficult because the weight of the clusters can change significantly from year to year. For the varieties considered in this study, the cluster weights collected in the last five years by the CAVIT s.c. laboratory are presented in Table 1. From there we see that there are cases where the relative deviation (https://mathworld.wolfram.com/RelativeDeviation.html) V = |x − x̄| / x̄ through the years can reach 10%. Last but not least, this is a destructive sampling technique.
In Table 2, the average weight of single berries is reported: comparing Table 1 with Table 2, we can see that in most cases the average berry weight is more stable through the years than the cluster weight (throughout the manuscript we use the term "weight", both for berries and clusters, to conform to the literature on the subject). This suggests that by combining the historical series of berry weights with accurate berry counting we can deliver better results than by using cluster weights alone. Moreover, the use of historical data opens the possibility of having a yield estimate immediately after the fruit sets.
Following this approach, Equation (1) becomes:

Yield = N_v × N_b × N_a × P_a, (2)

with N_a the average number of berries per bunch and P_a the average berry weight.
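As a worked illustration of Equation (2), the sketch below combines the four factors into a yield estimate; all numeric values here are invented for illustration and do not come from Tables 1 and 2.

```python
def yield_estimate(n_vines, bunches_per_vine, berries_per_bunch, berry_weight_g):
    """Equation (2): Yield = N_v * N_b * N_a * P_a, in grams.

    n_vines:          N_v, vines per surface unit
    bunches_per_vine: N_b, grape bunches per vine
    berries_per_bunch: N_a, average berries per bunch (e.g., from GBCNet counts)
    berry_weight_g:   P_a, average single-berry weight from historical data
    """
    return n_vines * bunches_per_vine * berries_per_bunch * berry_weight_g

# Hypothetical figures: 4000 vines/ha, 8 bunches/vine, 110 berries/bunch, 1.5 g/berry
total_kg = yield_estimate(4000, 8, 110, 1.5) / 1000.0
```

The berry count N_a is the only factor estimated from images; the other three are either known from the vineyard layout or taken from historical series.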
In this work, we seek non-destructive approaches for grape yield estimation, applicable immediately after the fruit set. GBCNet is based on images taken with standard smartphones and on the application of deep learning algorithms to count the number of berries in the images. With our solution, the agronomist can obtain a prediction of the yield by simply taking pictures in the field with a smartphone. The production estimate is then obtained by processing the images with GBCNet and deriving the value of N_a, the average number of berries per bunch in Equation (2), as a function of the GBCNet output.
Counting is the core step of fruit yield estimation; for grapes, previous automatic image analysis approaches used 3D bunch reconstruction or artificial illumination at night [41,42], while Android-based solutions used a capturing box as a synthetic background [43].
GBCNet does not require any particular preparation for image acquisition, enabling an easier and faster AI-based yield estimation system. This opens the possibility of testing two different strategies for yield estimation: the first is based on the evaluation of the average number of berries per bunch in Equation (2); the second consists of taking a picture of the whole grape field (for example as a panoramic view), estimating the total number of berries and then simply multiplying it by the average berry weight. The inputs of the networks for the two methods are images with slightly different characteristics. In Section 4, we show the results obtained on datasets optimized for the two different approaches.

From Crowd to Berries Counting
GBCNet stems from the family of Dilated CNNs [44] and integrates geometry-adaptive kernels [45] to handle grape cluttering in the images. We demonstrate the potential of GBCNet on two original datasets: CR1, with 7 different varieties, and CR2, with only one variety; good performances are achieved in terms of Mean Absolute and Squared Error (MAE/MSE), with variability induced by the different grapevine varieties. Overall, the MAE varies from 0.85% for Pinot Gris to 11.73% for Marzemino on CR1 and reaches 7.24% on the Teroldego CR2 dataset, supporting the claim that GBCNet achieves a good compromise between minimal sensor cost, in-field efficiency and yield estimate reliability. The core of GBCNet for yield estimation is the accurate automatic counting of berries from pictures taken in the grape fields (Figure 1). We will show that, for this task, techniques developed in the context of automatic crowd counting can be successfully adapted [39,46]: in the congested scene recognition setting presented in [44], the input picture is processed by the Deep Neural Network CSRNet, which returns a density map whose integral is the estimated number of subjects to count, in our case the number of berries in the image, as shown in Figure 2.

Figure 1. GBCNet architecture: the model takes in-field smartphone images as the input and estimates a density map whose integral represents the berry count. The second block uses a dilation factor of 2. Every convolutional layer is followed by a ReLU operation, except for the last one.

The CSRNet architecture employs the first ten convolutional layers of VGG16 [47], pretrained on ImageNet [48], as feature extractor and a dilated CNN [49,50] for density map generation. Training the full network from scratch requires an enormous amount of annotated data, and annotation is an expensive operation, in particular with grape images, where labeling is required at the level of the single berry.
To reduce the number of annotations required for training, GBCNet adopts a transfer learning approach in which a pre-trained VGG16 model is used as a generalized feature extractor while only the last part of the network is trained. The use of dilated convolutions, i.e., convolutions with non-contiguous kernels and a larger receptive field, aggregates multi-scale contextual information while maintaining the same spatial resolution.
The training phase is based on the generation of density maps as ground truth. This requires annotating the images at the single-berry level: given an input image, a berry at position x_i is represented as a Dirac delta function δ(x − x_i), i.e., a binary mask with only the point x_i set to 1. After the annotation, the image is represented as:

H(x) = ∑_{i=1}^{N} δ(x − x_i),

where N is the number of labeled points.
To obtain a continuous density function F(x) from the discrete representation H(x) [45], GBCNet employs a convolution with a Gaussian kernel G_σ, using F(x) = H(x) * G_σ(x), where σ fixes the level of smoothing in the mask. Additionally, to tackle the presence of dense scenes in the images, GBCNet is endowed with geometry-adaptive kernels [45], which evaluate the distribution of the neighbors of a labeled point. Geometry-adaptive kernels are available in the Python module Scikit-Image [52,53], and they are defined as follows:

F(x) = ∑_{i=1}^{N} δ(x − x_i) * G_{σ_i}(x), with σ_i = β d̄_i,

where d̄_i is the average distance of the k nearest neighbors of x_i and β is a regularization parameter.
In all the experiments we use the same configuration as Li and colleagues in [44], setting k = 3 and β = 0.3. The k and β parameter space was preliminarily explored through a grid search on an initial subset of images to obtain the target density map in both sparse and highly dense regions, similarly to the original method for crowd counting [44]. As shown in Figure 1, the model is divided into two main components: a VGG-based feature extraction module and the density estimation module. The number of detected berries is obtained by integrating the estimated density map, i.e., by summing all pixel values. To tackle highly congested scenes, ground truth density maps are generated from dot annotations employing geometry-adaptive kernels. Separated berries result in distinct regions of the corresponding ground truth density maps, and GBCNet is thus forced to learn this trait and estimate consistent density maps.
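A minimal sketch of the ground-truth density map generation with geometry-adaptive kernels, assuming SciPy for the nearest-neighbor search and the Gaussian smoothing; the fallback σ used when an image contains a single isolated point is our assumption, not a value from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import cKDTree

def density_map(points, shape, k=3, beta=0.3):
    """Ground-truth density map from dot annotations with geometry-adaptive
    Gaussian kernels: sigma_i = beta * (mean distance to the k nearest
    neighbors), as in Zhang et al. [45]. `points` is a list of (x, y)."""
    dmap = np.zeros(shape, dtype=np.float64)
    if not points:
        return dmap
    tree = cKDTree(points)
    # query k+1 neighbors: the nearest hit is the point itself (distance 0)
    dists, _ = tree.query(points, k=min(k + 1, len(points)))
    for (x, y), d in zip(points, np.atleast_2d(dists)):
        h = np.zeros(shape)
        h[min(int(y), shape[0] - 1), min(int(x), shape[1] - 1)] = 1.0
        if len(points) > 1:
            sigma = beta * d[1:].mean()  # skip the self-distance
        else:
            sigma = 15.0  # fallback for a lone point: our assumption
        dmap += gaussian_filter(h, sigma, mode="constant")
    return dmap
```

By construction, the integral (pixel sum) of the map approximates the number of annotated berries, which is exactly the quantity the network is trained to reproduce.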
The GBCNet source code is jointly owned by FBK and CAVIT s.c. and cannot be publicly shared.

In-Field Images
The GBCNet models were validated on two in-field image datasets, CR1 and CR2, for a total of more than 35,000 berries, all manually annotated. The main descriptive statistics of the datasets are summarized in Table 3.
The images in CR1 were collected by CAVIT s.c. agronomists during routine management operations, while CR2 was acquired by one of the authors. Both datasets were manually annotated by the authors using the open source annotation software Sloth [54]. The CR1 dataset is composed of 128 close-up, manually labeled images belonging to 7 different varieties, taken with 8 Mpx and 2 Mpx smartphone cameras, from which we extracted 17,006 single-berry annotations. The CR2 dataset collects 18,622 manually labeled single-berry annotations, derived from 17 images of the Teroldego variety taken with an 8 Mpx smartphone camera (2448 × 3268 pixels) from a medium distance (1-1.5 m). Examples of the images in the two datasets are presented in Figure 3.

The CR1 images were taken at a stage where berries are still small and well separated; therefore, clusters are characterized by a low degree of occlusion. In addition, the dataset was collected trying to include only one bunch in every picture. For the evaluation of GBCNet performance, the dataset was randomly split into 102 images for training and 26 for testing, corresponding to 13,353 berries in training and 3653 in test. The same 80-20% split is adopted, for example, in [30,31]. Resampling by 5-fold Cross Validation (5-CV) was applied on the training dataset. The dataset CR1 is jointly owned by FBK and CAVIT s.c. and cannot be publicly shared.
In the CR2 dataset, each image contains more than one cluster, with different sizes both in the foreground and in the background. The images are randomly split into 11 images for training and 6 for validation, corresponding on average to 12,415 berries in training and 6207 in validation, respectively. In this case, 3-CV was applied. Dataset CR2 is publicly available at https://github.com/MPBA/CR2/.
For both datasets, since the pictures are taken in an uncontrolled environment, there is a large variance between images under several aspects. First, the clusters are visually very different in brightness and saturation, while there is little difference in color between the grapes and the surrounding leaves. This represents a challenge, given that the intra-class variance (e.g., colors between bunches) is higher than the inter-class variance (e.g., bunches versus leaves).
For CR2 an additional challenge is given by the main cluster dimension, which ranges from 1000 px (around 40% of the total height with the landscape orientation of the image) to 70 px (around 3%). Finally, the CR2 dataset is characterized by images of grapes before veraison, at a stage where berries are almost at their final size, presenting a high degree of occlusion between berries, which increases the task difficulty.
Input images have different resolutions, since they were collected with different devices. To ensure homogeneity between the training and test data used as input for GBCNet, we resized images to a height of 800 px. In addition, since the first part of the model consists of the first ten VGG-16 layers, it is important to normalize the images with the same preprocessing techniques: to this end, we employed channel normalization with the same parameters used by VGG-16 on both CR1 and CR2. Finally, to increase the number of images available for training, we applied data augmentation: at training time we randomly select patches whose size is 1/4 of the original image size, and then randomly flip them in the horizontal direction with probability 0.5.
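The preprocessing and augmentation steps above can be sketched as follows. We read the 1/4-size crop as half width and half height (as in CSRNet [44]), and we use the standard ImageNet channel statistics commonly paired with VGG-16; both readings are assumptions.

```python
import random
import numpy as np
from PIL import Image

# ImageNet channel statistics typically used with pretrained VGG-16
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def preprocess(img):
    """Resize to 800 px height (keeping aspect ratio), scale to [0, 1]
    and normalize each channel with the VGG-16 statistics."""
    w, h = img.size
    img = img.resize((int(w * 800 / h), 800))
    arr = np.asarray(img, dtype=np.float64) / 255.0
    return (arr - MEAN) / STD

def random_patch(arr, flip_p=0.5):
    """Training-time augmentation: random crop of half width x half height
    (1/4 of the image area), plus a horizontal flip with probability 0.5."""
    h, w = arr.shape[:2]
    top = random.randint(0, h - h // 2)
    left = random.randint(0, w - w // 2)
    patch = arr[top:top + h // 2, left:left + w // 2]
    if random.random() < flip_p:
        patch = patch[:, ::-1]  # horizontal flip
    return patch
```

When patches are cropped, the corresponding region of the ground-truth density map must be cropped and flipped identically so that counts stay aligned.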

Performance Metrics
To evaluate the GBCNet model performance we adopt the most common metrics employed in both the agricultural yield estimation and crowd-counting domains [13,17,44,45], i.e., Mean Absolute Error (MAE) and Mean Squared Error (MSE). These are defined as follows:

MAE = (1/N) ∑_{i=1}^{N} |C_i − C_i^GT|,  MSE = √( (1/N) ∑_{i=1}^{N} (C_i − C_i^GT)² ),

where C_i is the estimated count, C_i^GT is the ground truth count associated with image i, and N is the number of images. The estimated count is equal to the integral of the output density map. These two metrics represent a measure of the accuracy (MAE) and of the robustness (MSE) of the model.
To estimate crop yield, it is also important to consider the performances obtained on the cumulative sum of the outputs and ground truths. To this end, we also employ the Overall MAE, defined as

Overall MAE = | ∑_{i=1}^{N} C_i − ∑_{i=1}^{N} C_i^GT |,

which provides information on the performances that can be obtained in practical applications of the system.
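A sketch of the three metrics computed on per-image counts. Following the crowd-counting convention of [44,45], "MSE" here denotes the root of the mean squared error; taking this as the paper's exact formula is an assumption.

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error over per-image counts."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean(np.abs(pred - gt)))

def mse(pred, gt):
    """'MSE' in the crowd-counting convention: root of the mean squared
    error of the per-image counts (an assumption about the exact formula)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def overall_mae(pred, gt):
    """Absolute error between cumulative predicted and ground-truth counts;
    over- and under-estimations on single images cancel out."""
    return float(abs(np.sum(pred) - np.sum(gt)))
```

The cancellation in `overall_mae` is why the overall error can be far lower than the per-image error, as observed in the results on both datasets.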

Results and Discussion
As explained in Section 1, we explore two different strategies for yield estimation using deep learning. The former, based on Equation (2), uses images taken at small distance with only one grape bunch in focus, while the latter considers panoramic images collected from a distance of 1-2 m, which can potentially capture a wide portion of the field (on the order of thousands of berries). In the first case, the majority of the image pixels consist of berries, while in the panoramic view the fraction of the image containing background is much larger.
We present here the berry counting performances of GBCNet on the two datasets CR1 and CR2 as a test of the feasibility of the two approaches. By applying five-fold cross validation on CR1, an average of 2671 berries was selected for each fold and 3653 berries were used for testing. Results on CR2 are reported using three-fold cross validation, with an average of 6207 berries per fold. In all the experiments we employed the Adam optimizer [55], setting the initial learning rate to 10^−5 for CR1 and 10^−4 for CR2, and dropping it by an order of magnitude every 50 epochs. Considering the small number of images in the two datasets, we froze the feature extraction layers (i.e., the first ten VGG-16 layers) and updated only the dilated CNN layer weights for density map generation. With this approach, all the training processes converged in fewer than 200 epochs, and we evaluated the performances of GBCNet using the weights of the last training epoch. Finally, the number of patches per iteration (i.e., the batch size) was set to 20 for CR1 and 4 for CR2, given the memory restrictions on the machine used for training and the larger size of CR2 images. Each patch contains an average of 71 berries for CR1 and 427 berries for CR2.
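The training configuration above can be sketched as follows, assuming PyTorch; `frontend` and `backend` are hypothetical attribute names standing for the VGG-16 feature extractor and the dilated density head.

```python
import torch
import torch.nn as nn

def configure(model, dataset="CR1"):
    """Freeze the VGG-16 feature extractor and set up Adam with the
    per-dataset learning rates and the 50-epoch decay described above.
    `model.frontend` / `model.backend` are hypothetical attribute names."""
    for p in model.frontend.parameters():
        p.requires_grad = False  # frozen feature extraction layers
    lr = 1e-5 if dataset == "CR1" else 1e-4  # initial learning rates
    opt = torch.optim.Adam(model.backend.parameters(), lr=lr)
    # drop the learning rate by an order of magnitude every 50 epochs
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.1)
    return opt, sched
```

A typical epoch loop would step the scheduler once per epoch (`sched.step()`), so that the rate becomes 10^−6 (CR1) or 10^−5 (CR2) after epoch 50, and so on.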
In Table 4, the results for GBCNet on CR1 are presented both for 5-CV and for test. We report both the error per image and the overall error. The latter is important under the assumption of having a single grape bunch in the picture and being interested in the average number of berries per bunch: considering the full dataset helps to average out the over/under-estimations of the network on the single images. It is remarkable to observe the drop in the percentage error when considering the whole dataset, from 10% to less than 1% in test, showing the importance of averaging over many pictures. While the error on single-image predictions is similar to the CR2 one, the overall MAE suggests that GBCNet reaches better performances with close-up images.

An interesting aspect of the network behavior emerges with a single cluster in focus in the CR1 dataset. Due to the closeness of the camera to the main photographed cluster, bunches in the background are out of focus. Since only the foremost clusters were labeled in CR1 images, the network automatically learns to ignore background berries and considers only those in the foreground. The network probably learns to use features like the sharpness and size of berry edges as discriminants (Figure 4). However, there are cases in which GBCNet highlights green background regions as berries. This effect, which leads to an overestimation error, is associated with patterns affected by a high local variability in brightness and contrast, as shown in Figure 5.

The CR1 dataset collects pictures of seven different varieties: the performance of the network on each variety is reported in Tables 5 and 6. The differences in performance reflect the fact that the pictures were collected on the same days for all varieties, implying a non-uniform phenological state and thus highly different visual features, exemplified in Figure 6.
Although this difference among varieties impacts GBCNet performance on single-image predictions, the model obtains a low MAE by aggregating the output predictions for almost all varieties. Table 7 collects the results for GBCNet tested on the CR2 dataset with 3-CV. Considering single-image predictions, with an average of 1113.9 berries per picture, the model reaches a MAE of 117.36 berries per validation fold (10.74%). The overall MAE, obtained by comparing the cumulative sums of predictions and ground truth (6288.3 berries on average per fold), results in a lower value, i.e., 466.53 (7.24%), benefiting from the balancing effect of over- and underestimation when aggregating predictions. In Figure 7 we graphically report the MAE(%) for all varieties and all the experimental conditions: these results are fully comparable with those obtained by the alternative methods available in the literature, where, however, the images are taken in a controlled environment or employ a capturing box to limit background interference.
As stated in Section 2 (Preliminaries), the estimate of the number of berries is the crucial parameter for an accurate prediction of the yield. Having proved that the error on counting berries is of the order of a few percentage points, we can use Equation (2) to arrive at the final goal of estimating the grape production.

Figure 7. MAE(%) in cross-validation and test mode, by image (I) and overall (O), for all the 8 grape varieties in the two datasets CR1 and CR2. All varieties with a white background belong to CR1, while Teroldego, with a gray background, is CR2. Results are reported as mean ± sd, where sd may be larger than the mean.

Conclusions
We demonstrated that crop yield estimation for grape berries can be obtained from smartphone cameras with fixed-focus, small-aperture, wide-angle optical systems using the DL architecture GBCNet, an adaptation of algorithms for crowd counting. Although other factors (see Equations (1) and (2)) have to be considered for an actual yield estimate, the average test error of about 5% for the berry counting model is within the bound considered valid for operational application (10%, depending on the vine training system). In this study, all data were collected directly in the field, without requiring special precautions or additional constraints such as a backing board. Notably, the average test error systematically decreases when estimating over more than three pictures from the same parcel. For Pinot Gris, with a test set of seven images (for a total of about 1300 berries), the percentage MAE is less than 1%.
Further research will investigate, in the same setup, the problem of estimating cluster weights, thus considering a correction factor for non-visible berries.
Acknowledgments: The authors thank Andrea Faustini and CAVIT s.c. for providing data and scientific support throughout all phases of the research.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: