Use of Tomato Analyzer software for apple shape detection
The Tomato Analyzer software was used to process images of apples, but manual editing was required because the software generally failed to detect the calyx and stalk cavities (see Supplementary Information 4). The data obtained with the software included measures of both fruit size (4) and shape (11) attributes. All attributes were continuous quantitative variables. Some of the shape descriptors (Supplementary Information 2) were obtained from ratios between fruit size attributes, such as the fruit shape index external I (FSII), which was derived from the ratio between fruit height and width. The fruit blockiness and triangular shape descriptors (PFB, DFB and FST) were also ratios between widths at different fruit positions and reflected conical proportions. Other shape parameters described the degree to which the fruit section resembled an ellipsoid, circle, or rectangle (E, C, R), and the extent of deviation from a circular form, which was referred to as eccentricity (ECC, FSIINT). Measurements were also taken in specific areas of the apple, such as the angles at the stalk and calyx regions, using PAMa and DAMa (see Fig. 1).
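As a minimal sketch of how the ratio-based descriptors are formed, the snippet below computes FSII as height over width and FST as the proximal over the distal shoulder width. The function names and the measurements are hypothetical illustrations; they are not Tomato Analyzer output and do not reflect its interface.

```python
def fruit_shape_index(height, width):
    """FSII: ratio between fruit height and fruit width."""
    return height / width


def proximal_distal_ratio(proximal_width, distal_width):
    """FST: ratio between the widths at the stalk-side and eye-side shoulders."""
    return proximal_width / distal_width


# Hypothetical measurements (mm) for a single apple section.
fsii = fruit_shape_index(height=66.0, width=80.0)
fst = proximal_distal_ratio(proximal_width=55.0, distal_width=50.0)
print(f"FSII = {fsii:.3f}, FST = {fst:.2f}")
```

By construction, an FSII below 1 corresponds to a fruit that is wider than tall, and an FST above 1 to a section that is wider at the stalk end than at the eye end.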
Data and descriptive statistics for all measurements and years can be found in Supplementary Information 5 and 6, respectively, and are visualized through boxplots in Fig. 2. Among the measurements, those with the lowest coefficient of variation (CV) were the rectangular shape (R), with an average CV of 2.6%, and the Distal Fruit Blockiness (DFB), with an average CV of 5%. In contrast, fruit area (A) showed a much higher CV, averaging 22.1%. In general, size attributes had higher CV values than shape attributes, with average CVs of 14.18% and 10.87%, respectively.
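For reference, the CV expresses the standard deviation as a percentage of the mean, which makes attributes measured on different scales comparable. The sketch below uses hypothetical values and assumes the sample standard deviation (n - 1 denominator); the source does not state which estimator was used.

```python
import statistics


def coefficient_of_variation(values):
    """Return the CV in percent: sample standard deviation over the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical values for one shape attribute and one size attribute.
rectangular = [0.52, 0.53, 0.51, 0.54, 0.52]
area_cm2 = [31.0, 42.5, 55.0, 38.2, 61.3]

cv_rectangular = coefficient_of_variation(rectangular)
cv_area = coefficient_of_variation(area_cm2)
print(f"CV(R) = {cv_rectangular:.1f}%, CV(A) = {cv_area:.1f}%")
```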
Using the data from each year, strong to very strong correlations were observed between the size attributes, with coefficients ranging from 0.71 to 0.95 (p-value < 0.05) across the years (Fig. 3; Supplementary Information 7). Regarding the shape measures, FST showed strong positive and negative correlations with PFB and DFB, respectively, all three being measures of fruit conicity. The indices FSII and FSIINT were also strongly correlated, with coefficients between 0.70 and 0.74 (p-value < 0.05). In addition, a moderate to strong correlation was observed between FSIINT, ECC, and the homogeneity measures (E, C, and R).
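A pairwise correlation of this kind can be sketched as follows. The source does not name the coefficient, so Pearson's r is an assumption here, the measurements are hypothetical, and the significance test behind the reported p-values is left out.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical width and height measurements (mm) for six fruits.
width = [70.1, 75.4, 80.2, 68.9, 85.0, 78.3]
height = [58.0, 61.5, 67.1, 57.2, 69.8, 62.0]
r_size = pearson_r(width, height)
print(f"r(width, height) = {r_size:.2f}")
```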
Variance between years
The number of scanned genotypes differed among the three years of evaluation, but 94 genotypes were common to the three sample sets, which allowed the estimation of environmental effects on the traits. In the subset of 94 genotypes, the data of five attributes (FST, ECC, FSIINT, E and C) were normally distributed (p > 0.01) after logarithmic transformation (Supplementary Information 8). Homoscedasticity failed for PFB, PAMa, DAMa and E. For a proper analysis of the values across years, we removed the values causing the heteroscedasticity. This meant a slight reduction of the sample size from 94 to 90 for PFB, to 91 for PAMa, and to 92 for E. For DAMa, all 2020 data were discarded.
An ANOVA or a Kruskal-Wallis test was conducted for normally and non-normally distributed traits, respectively. Differences between means (p < 0.001) were observed in five out of the 15 attributes, all of them measuring fruit shape: FST, ECC, FSIINT, E and C (Supplementary Information 8). The multiple comparison test showed that in three of them, the differences occurred with the 2020 data. We did not observe significant variations between years in the size attributes (area, width, and height).
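The non-parametric branch of this comparison can be illustrated with a small, dependency-free Kruskal-Wallis implementation for three groups (one per year). This is a sketch on hypothetical values, not the analysis pipeline used here; in practice a statistics library would normally be used, for example SciPy's `scipy.stats.kruskal` and `scipy.stats.f_oneway`.

```python
import math


def kruskal_wallis_three_groups(a, b, c):
    """Kruskal-Wallis H and p-value for exactly three groups.

    With three groups the chi-square approximation has 2 degrees of
    freedom, whose survival function is exp(-H / 2).
    """
    groups = [a, b, c]
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n = len(pooled)
    rank_sums = [0.0, 0.0, 0.0]
    tie_term = 0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += mean_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    h = 12 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    h /= 1 - tie_term / (n ** 3 - n)  # correction for ties
    return h, math.exp(-h / 2)


# Hypothetical values of one shape attribute in three evaluation years.
year_1 = [0.70, 0.72, 0.69, 0.74, 0.71]
year_2 = [0.78, 0.80, 0.77, 0.79, 0.81]
year_3 = [0.71, 0.73, 0.70, 0.72, 0.75]
h_stat, p_value = kruskal_wallis_three_groups(year_1, year_2, year_3)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```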
Trait broad-sense heritability (H2) ranged from 0.15 (DAMa) to 0.82 (FSII) (Supplementary Information 9). In general, size traits showed higher H2 than shape traits (0.72 vs 0.45 on average, respectively). The shape-related traits showing the highest H2 after FSII were C, FSIINT and PAMa (with H2 of 0.62, 0.57 and 0.52, respectively).
Main fruit shape descriptors
The Fruit Shape Index (FSI), which is the ratio between fruit height and width, and the conical aspect of the apple are two of the most important attributes used for the classification of fruits into categorical classes (Dapena et al. 2009). In our dataset, the FSII obtained with the Tomato Analyzer software corresponds to the FSI, and the blockiness measures (FST, PFB and DFB) resemble the conical aspect. These traits exhibited low coefficients of variation (CV) between years, ranging from 7.90% to 9.59%, indicating moderate variation in the evaluated sample subsets.
The FSII ranged from 0.67 to 1.14 (Fig. 4a, e; Supplementary Information 5 and 6), with a mean of 0.825 and an average CV of 8.22%. The varieties ‘Gros Api’ (MUNQ 33) and ‘Belle Flavoise’ (MUNQ 270), described as “flat” in the National Fruit Collection (NFC) database (http://www.nationalfruitcollection.org.uk/index.php), were among those with the lowest FSII, together with others such as ‘Grenadier’ (MUNQ 93) and ‘Szaszpap Alma’ (MUNQ 2990), described as “broad globose conical”. In contrast, the variety with the highest FSII value was ‘Skovfoged’ (MUNQ 345), described as “narrow conical”.
The ratio between the width at the stalk cavity (proximal) and eye basin (distal) shoulders (FST) showed a mean value of 1.1, with an average CV of 9.3% when considering the three years of data (Fig. 4b, f). Some varieties showed contrasting values between years (Supplementary Information 5). This was the case, for example, for ‘Reinette d’Anthezieux’ (MUNQ 405) (FST2019 = 1.743; FST2020 = 1.052), described as “broad globose conical”, and ‘Reinette Sanguine du Rhin’ (MUNQ 491) (FST2019 = 1.155; FST2020 = 0.643), described as “oblong”. A review of the images revealed a detection error of the Tomato Analyzer software, which does not reliably recognize the shoulder limits in asymmetric apple sections (Supplementary Information 4).
Supervised machine learning
The apples were annotated following two visual classifications: 1) a simple one considering only three classes (spheroid oblate, spheroid and spheroid oblong), which we called CAT-own, and 2) the ECPGR catalog (Supplementary Information 3). To identify the key attributes for fruit classification, we employed Random Forest, a supervised machine learning classifier. We evaluated the importance of these traits in both the CAT-own and ECPGR classification methods. In the CAT-own classification, the model achieved high accuracy (0.90) and F1-scores ranging from 0.82 to 0.92 across classes, as shown in Table 1. Similarly, in the ECPGR classification, the model achieved an accuracy of 0.90 and F1-scores ranging from 0.71 to 0.98 across classes.
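The general workflow (fit a random forest, report per-class precision, recall and F1-score, then inspect feature importances) can be sketched as below. Everything specific in this example is an assumption: the use of scikit-learn, the synthetic data in which only FSII separates the classes, the hyperparameters, and the 5-fold cross-validation, none of which are stated in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
feature_names = ["FSII", "FST", "ECC"]
classes = ["spheroid oblate", "spheroid", "spheroid oblong"]
fsii_means = [0.75, 0.88, 1.02]  # assumed class centres, not measured values

# Synthetic data: FSII differs between classes, FST and ECC do not.
blocks, labels = [], []
for label, fsii_mean in zip(classes, fsii_means):
    n = 60
    blocks.append(np.column_stack([
        rng.normal(fsii_mean, 0.03, n),
        rng.normal(1.10, 0.10, n),
        rng.normal(0.70, 0.05, n),
    ]))
    labels += [label] * n
X = np.vstack(blocks)
y = np.array(labels)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Out-of-fold predictions, so each fruit is scored by a model that never saw it.
y_pred = cross_val_predict(model, X, y, cv=5)
print(classification_report(y, y_pred))
print(confusion_matrix(y, y_pred, labels=classes))

# Impurity-based feature importances from a model fitted on all samples.
model.fit(X, y)
importances = dict(zip(feature_names, model.feature_importances_))
print(importances)
```

Because the synthetic classes differ only in FSII, that feature dominates the importances here by design; the ranking says nothing about the real dataset.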
The confusion matrix revealed that in the CAT-own model, 10 out of the 102 spheroid fruits (10%) were misclassified as spheroid oblate, while nine out of 118 spheroid oblate fruits (8%) were misclassified as spheroid (Fig. 5a). Additionally, the model incorrectly predicted three out of 10 spheroid oblong fruits (30%) as spheroid (Fig. 5b). Regarding trait importance, we found that the FSII attribute was the most relevant trait in both classification methods, as depicted in Fig. 5c. Furthermore, in the ECPGR classification we found that the FST (a descriptor of fruit conicity) and DAMa (a descriptor of the eye basin) were the second and third most important traits, respectively (Fig. 5d). We computed the confusion matrix for the five most frequent categories observed in the sample set. We observed 22 misclassifications, of which 15 (70%) were between the broad-globose-conical and flat globose classes.
Table 1
Classification results from the random forest models using CAT-own and ECPGR classes.
Model | Categories | Precision | Recall | F1-score | Support |
CAT-own | Spheroid | 0.88 | 0.90 | 0.89 | 102 |
CAT-own | Spheroid Oblate | 0.92 | 0.92 | 0.92 | 118 |
CAT-own | Spheroid Oblong | 1.00 | 0.70 | 0.82 | 10 |
CAT-own | Accuracy | | | 0.90 | 230 |
CAT-own | Macro avg | 0.93 | 0.84 | 0.88 | 230 |
CAT-own | Weighted avg | 0.91 | 0.90 | 0.90 | 230 |
ECPGR | Broad-globose-conical | 0.90 | 0.94 | 0.92 | 110 |
ECPGR | Flat | 1.00 | 0.97 | 0.98 | 32 |
ECPGR | Flat globose | 0.75 | 0.67 | 0.71 | 27 |
ECPGR | Globose | 0.85 | 0.85 | 0.85 | 13 |
ECPGR | Globose conical | 0.93 | 0.93 | 0.93 | 46 |
ECPGR | Accuracy | | | 0.90 | 228 |
ECPGR | Macro avg | 0.89 | 0.87 | 0.88 | 228 |
ECPGR | Weighted avg | 0.90 | 0.90 | 0.90 | 228 |
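The CAT-own rows of Table 1 can be cross-checked against the misclassification counts reported in the text. The sketch below rebuilds the confusion matrix from those counts and the class supports, assuming that every cell not mentioned in the text is zero, and then derives precision, recall, F1-score and accuracy.

```python
classes = ["spheroid", "spheroid oblate", "spheroid oblong"]
support = {"spheroid": 102, "spheroid oblate": 118, "spheroid oblong": 10}

# (true class, predicted class) -> misclassified count, as reported in the text.
errors = {
    ("spheroid", "spheroid oblate"): 10,
    ("spheroid oblate", "spheroid"): 9,
    ("spheroid oblong", "spheroid"): 3,
}

# Build the matrix; cells not mentioned in the text are assumed to be zero.
matrix = {(t, p): errors.get((t, p), 0) for t in classes for p in classes}
for c in classes:
    row_errors = sum(matrix[(c, p)] for p in classes if p != c)
    matrix[(c, c)] = support[c] - row_errors

metrics = {}
for c in classes:
    true_positives = matrix[(c, c)]
    predicted = sum(matrix[(t, c)] for t in classes)
    precision = true_positives / predicted
    recall = true_positives / support[c]
    f1 = 2 * precision * recall / (precision + recall)
    metrics[c] = (round(precision, 2), round(recall, 2), round(f1, 2))

accuracy = sum(matrix[(c, c)] for c in classes) / sum(support.values())
for c in classes:
    print(c, metrics[c])
print("accuracy", round(accuracy, 2))
```

Under that assumption, the rounded values reproduce the precision, recall and F1-score of the three CAT-own classes in Table 1, as well as the overall accuracy of 0.90.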
Data dispersion for the five important shape parameters highlighted in the random forest analysis (FSII, FST, PAMa, DAMa, ECC) is shown in Fig. 6. Confirming the random forest results, the FSI was the attribute that best discriminated the samples classified into the CAT-own classes, while some class overlap was observed when using the ECPGR classes, for example between the broad-globose-conical and flat globose classes and between the globose conical and globose classes.