Non-destructive identification of single hard seed via multispectral imaging analysis in six legume species

Physical dormancy (hard seed) occurs in most species of the Leguminosae family and has major consequences not only for ecological adaptation but also for the agricultural practice of these species. A rapid, nondestructive and on-site screening method to detect hard seeds within species is fundamentally important for maintaining seed vigor and germplasm storage as well as for understanding seed adaptation to various environments. In this study, the potential of multispectral imaging with object-wise multivariate image analysis was evaluated as a way to identify hard and soft seeds in Acacia seyal, Galega orientulis, Glycyrrhiza glabra, Medicago sativa, Melilotus officinalis, and Thermopsis lanceolata. Principal component analysis (PCA), linear discriminant analysis (LDA), and support vector machine (SVM) methods were applied to classify hard and soft seeds according to their morphological features and spectral traits. The performance of the discrimination models varied with species. For M. officinalis, M. sativa, and G. orientulis, excellent classification was achieved in an independent validation data set. The LDA model had the best calibration and validation abilities, with an accuracy of up to 90% for M. sativa. SVM gave excellent seed discrimination results, with classification accuracies of 91.67% and 87.5% for M. officinalis and G. orientulis, respectively. However, both the LDA and SVM models failed to discriminate hard and soft seeds in A. seyal, G. glabra, and T. lanceolata. Multispectral imaging together with multivariate analysis could be a promising technique to identify single hard seeds in some legume species with high efficiency. More legume species with physical dormancy need to be studied in future research to extend the use of multispectral imaging techniques.


Background
Physical dormancy (PY, also referred to as hard seed) occurs in at least 18 angiosperm plant families including Fabaceae [1,2], and is caused by a water-impermeable seed or fruit coat [1,3,4]. This kind of dormancy prevents seeds from imbibing water even under favorable environmental conditions, and it may play a role in determining the time and place of seed germination in the field. Also, physical dormancy may help to ensure long-term seed survival, especially for wild species growing in harsh environments [5]. For example, the storage life of physically dormant soybean seeds is longer than that of nondormant seeds [6]. Furthermore, physically dormant seeds generally exhibit greater vigor than those without physical dormancy in Codariocalyx motorius [7], Glycyrrhiza uralensis [8], and Lespedeza bicolor [9]. However, from an agronomic perspective, physical dormancy is an undesirable trait because it prevents rapid imbibition and synchronous germination, leading to non-uniform seedling establishment [10]. Therefore, distinguishing seeds with and without physical dormancy has great practical significance, as it is important for maintaining seed vigor and germplasm storage as well as for understanding seed adaptation to various environments.

Since hard seeds are impermeable to water, checking whether a seed imbibes when soaked in water is the most common method to determine physical dormancy [1,11]. However, this process destroys the seed coat structure of soft seeds and is thus not suitable for online measurement and sorting. Moreover, this method is very time consuming, as it often takes several days to a month to detect the presence of physical dormancy, depending on species [12]. Thus, a rapid, nondestructive and on-site screening method to detect hard seeds is necessary not only for research purposes but also for seed grading and sorting in the seed industry.
Morphological, structural and compositional properties of the seed coat have been reported to affect seed dormancy status [3,4,13-15]. Intraspecific or even intra-individual variation in seed size has been found to influence seed dormancy status [1,16]. Also, seed coat compositional properties such as the content of polyphenols, including flavonoids, lignin and lignans, showed a positive relationship with dormancy in faba bean [17] and pea [14]. These results imply that discriminating soft and hard seeds through their morphological and compositional traits is possible. Indeed, a previous study [18] found that near infrared spectroscopy can identify hard seeds of three legume species with high accuracy. However, that method did not use seed image analysis, and seed spectral traits were measured one seed at a time, which was time consuming and impractical.
Multispectral imaging is an emerging technology that integrates conventional imaging and spectroscopy to simultaneously obtain both spatial and spectral information about an object [19]. Because it is nondestructive and straightforward and requires no sample pre-treatment, multispectral imaging analysis is well suited for online process monitoring and quality control. Recently, this technique has been increasingly used to assess food safety and quality, for example in contaminant detection, defect identification, constituent analysis, and quality evaluation [20,21]. With regard to seed identification, multispectral imaging has been applied to the discrimination of transgenic rice seeds from their non-transgenic counterparts [20], the discrimination of rice seeds among different varieties [22], and the classification of maize kernels [23]. Given the potential morphological and chemical differences between hard and soft seeds, multispectral imaging may have great potential for distinguishing seeds with and without physical dormancy.
Six common legume species, Acacia seyal, Galega orientulis, Glycyrrhiza glabra, Medicago sativa, Melilotus officinalis, and Thermopsis lanceolata, were used in this study. Among these species, G. orientulis, M. sativa, and M. officinalis [24,25] are important forage species that are widely cultivated around the world. G. glabra [26] and T. lanceolata [27] have been used in traditional Chinese medicine. A. seyal has medicinal and ecological value [28]. According to previous studies [24-28], seeds of these six species exhibit physical dormancy, which restricts their cultivation. Thus, discriminating hard and soft seeds of these species is important for both research and practical purposes.
Herein, we describe a nondestructive, rapid and high-throughput approach to discriminate hard and soft seeds of legume species, based on the VideometerLab 4 spectral imaging system in combination with multivariate analysis.

Morphological features of hard and soft seeds
The differences in morphological traits between hard and soft seeds varied with species (Table 1). For M. sativa and M. officinalis, hard and soft seeds differed significantly in the mean values of seed area, length, Width/Length Ratio, compactness circle, BetaShape a, BetaShape b, CIELab L*, CIELab a*, CIELab b*, and saturation, while no significant difference existed in the mean values of compactness ellipse and vertical orientation. However, for A. seyal, G. glabra and T. lanceolata, almost all morphological traits, except for the area of T. lanceolata and the Width/Length Ratio and CIELab L* of G. glabra, showed no significant difference between hard and soft seeds within each species. For G. orientulis, a significant difference was found between hard and soft seeds in the mean values of seed area, length, compactness ellipse, CIELab a* and hue, while no significant difference existed in length, width, Width/Length Ratio, compactness circle, BetaShape a, BetaShape b, vertical skewness, CIELab L*, CIELab b*, and saturation.

Table 1 Morphological features of hard and soft seeds for six species.

Spectroscopic analysis of hard and soft seeds
Except for A. seyal, the spectroscopic analysis revealed a significant difference in mean reflectance between hard and soft seeds of the other five species (Fig. 1). For M. sativa and M. officinalis, soft seeds had significantly higher reflectance than hard seeds across the whole wavelength region. Consistent with M. sativa and M. officinalis, soft seeds of G. glabra also showed higher reflectance than hard seeds, but the difference was statistically significant only in the spectral ranges from 405 nm to 590 nm and from 850 nm to 970 nm. In contrast, the soft seeds of G. orientulis showed significantly lower reflectance than the hard seeds in the spectral range from 515 nm to 570 nm, while the opposite trend was observed in the spectral range from 660 nm to 970 nm. For T. lanceolata, no significant difference in reflectance was detected across the whole spectral range except at 970 nm.

Principal component analysis (PCA)
There was no distinct difference in terms of PCA score between hard and soft seeds for all species regardless of dimensionality applied (Additional file 1: Figure S1, Additional file 2: Figure S2 and Additional file 3: Figure S3). Here, we took the first two principal components as an example.
The first two principal components extracted from the morphological and spectral traits explained 55.11%, 55.52%, 61.18%, 57.42%, 55.40% and 56.80% of the original variance for A. seyal, G. orientulis, G. glabra, M. officinalis, M. sativa, and T. lanceolata, respectively (Fig. 2). However, the PCA biplot did not reveal a distinct separation between hard and soft seeds for any of the species, suggesting that discrimination between these two kinds of seed within species through PCA is difficult.

Seed classification based on linear discriminant analysis (LDA) model
The performance of the LDA model in classifying hard and soft seeds varied with species (Table 2). For M. officinalis, M. sativa, and G. orientulis, the LDA model had high average accuracies of 90%, 90% and 85%, respectively, in classifying hard and soft seeds in independent validation data sets. Meanwhile, the sensitivity and specificity for hard seed classification in these three species were reasonably good, ranging from 82.69% to 86.67% and from 84.29% to 95.59%, respectively, for independent validation data sets. For G. glabra and T. lanceolata, high classification accuracy and specificity were observed for independent validation data sets, while the classification sensitivity for hard seeds was quite low, at 50% and 33.33%, respectively. In contrast, the average classification accuracy and sensitivity for A. seyal were 87.5% and 98.11%, respectively, whereas the classification specificity for hard seeds of A. seyal was only 7.14%.

Seed classification based on support vector machine (SVM) model
In agreement with the LDA model, the performance of the SVM model in classifying hard and soft seeds differed among species (Table 3). The SVM model had average accuracies as high as 91.67%, 89.17% and 87.5% for independent validation data sets of M. officinalis, M. sativa, and G. orientulis, respectively. Meanwhile, the sensitivity and specificity for hard seed classification in these three species were reasonably good, ranging from 76.67% to 88% and from 87.14% to 96.67%, respectively. For A. seyal, G. glabra, and T. lanceolata, the average classification accuracy was 88.33%, 80% and 77.5%, respectively. However, the classification sensitivity for hard seeds in G. glabra and T. lanceolata was quite low, at 46.88% and 7.41%, respectively. Similarly, the classification specificity for A. seyal was 0.
For all species, reflectance in the near infrared region (840-970 nm) contributed more to the SVM model than morphological traits. For example, reflectance at 970 nm, 940 nm, 880 nm and 850 nm ranked among the top five traits contributing to the SVM model, and explained 35.2%, 33.9% and 36.1% of the total variation for M. officinalis, M. sativa and G. orientulis, respectively (Fig. 4).

Discussion
Previous studies [1,14,17] have indicated that morphological and spectral traits may differ between hard and soft seeds of a species, and thus can be employed as a tool for seed classification. Consistent with this, our study shows a significant difference in at least one morphological or spectral trait between hard and soft seeds in the six tested species. However, it is worth noting that hard and soft seeds overlap even where their mean values differ significantly, suggesting that it is not appropriate to discriminate hard and soft seeds of a species with any single trait. Moreover, the difference between hard and soft seeds varies considerably across species. For example, hard and soft seeds of M. officinalis, M. sativa and G. orientulis differ significantly in most of the morphological and spectral traits measured. However, for the other three species, a significant difference between hard and soft seeds is detected in only a few traits.
Also, no consistent difference between hard and soft seeds is observed across species. For example, hard seeds of M. officinalis, M. sativa and G. orientulis are smaller than soft seeds, while the opposite trend is observed in T. lanceolata. In addition, soft seeds of M. officinalis, M. sativa and G. glabra have higher reflectance than hard seeds across the whole wavelength region, whereas the opposite trend is observed in the short wavelength region (365-590 nm) in G. orientulis. These variations among species may also explain why the performance of the discrimination models differs among species.
The PCA scatter plots clearly show that PCA could not separate hard and soft seeds in any of the tested species. A possible reason is that PCA aims to maximize the variance of the variables rather than the discriminability of hard and soft seeds. In this case, if the variables have very similar mean values between groups and a large variation, the total variance will be composed mainly of variance within groups rather than between groups. Thus, PCA would not detect the difference among groups. Indeed, both morphological and spectral traits have very close mean values and largely overlapping distributions between hard and soft seeds in the six species. Also, part of the information is lost after PCA, since the first two principal components explained only 55.11% to 62.69% of the total variance; this loss of information may further contribute to the failure of PCA to separate the groups. Unlike PCA, supervised methods such as LDA and SVM aim to minimize the distance within classes and to maximize the distance between classes, and thus show good discriminability among groups [29-31]. Consistent with this, our study shows that both the LDA and SVM models provide high classification accuracy for hard and soft seeds in M. officinalis, M. sativa and G. orientulis. Interestingly, although both the LDA and SVM models achieve high accuracy in seed discrimination, they seem to work in different ways. A close look at the relative importance of each feature shows that the SVM model relies mainly on spectral traits in the NIR region, while the LDA model relies mainly on seed morphology. Hu et al. [32] reported a similar finding in seed discrimination between alfalfa and sweet clover via multispectral imaging analysis. However, we could not determine the underlying reason for this difference between the two methods. Further studies combining LDA and SVM may achieve higher accuracy in multispectral analysis of hard seeds.
In contrast, although the classification accuracy is reasonably good in A. seyal, G. glabra, and T. lanceolata, the models lack specificity or sensitivity: the classification specificity for hard seeds in A. seyal is only 7.14%, and the sensitivity for hard seeds in G. glabra and T. lanceolata is only 50% and 33.33%, respectively. In the former case, most soft seeds are misclassified as hard seeds, while in the latter case, the model classifies hard seeds as soft seeds with high probability. This inconsistency between classification accuracy and sensitivity or specificity is mainly attributable to the unbalanced data sets. For instance, the numbers of hard and soft seeds of A. seyal were unbalanced, with 30 and 250 in the calibration set and 14 and 106 in the independent validation data set. In this case, when the model classifies most seeds as hard seeds, it still has a high average classification accuracy and sensitivity, but a very low specificity. However, it is worth noting that the unbalanced data is not the reason for the poor performance of the model, since there is no reliable empirical evidence that an unbalanced data set has a negative effect on the performance of LDA [29]. Indeed, when some hard seeds of A. seyal were randomly removed from the sample, the average classification accuracy and sensitivity decreased while the classification specificity increased. These results suggest that the current models could not discriminate hard and soft seeds in A. seyal, G. glabra, and T. lanceolata. This is possibly because the differences in morphological and spectral traits between hard and soft seeds of these species were not large enough in our study. Sun et al. [33] reported that near infrared spectroscopy can identify hard seeds of G. uralensis with high accuracy, and their results showed a significant difference in light absorbance at wavelengths above 1000 nm. Consistent with this, our study also showed that the difference in spectral traits between hard and soft seeds in G. glabra increased with increasing wavelength. In addition, a significant difference between hard and soft seeds in A. seyal is detected only at 970 nm. Thus, a wider wavelength range, such as that of near infrared spectroscopy, may help to improve data quality and favor discrimination model building. Furthermore, other machine learning tools, such as random forest (RF) and back propagation neural network (BPNN), which have proved effective in the discrimination of soybean seeds [22] and high-quality watermelon seeds [21], can be applied to separating hard and soft seeds in future studies.
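The effect of class imbalance on accuracy, sensitivity and specificity discussed above can be illustrated with a small calculation. The following is a minimal sketch in Python; the function name and the seed counts are hypothetical and are not the counts of this study. It assumes the conventional definitions with hard seeds as the positive class.

```python
def classification_metrics(th, fs, ts, fh):
    """Return (sensitivity, specificity, accuracy) for hard-seed detection.

    th: hard seeds classified as hard    fs: hard seeds classified as soft
    ts: soft seeds classified as soft    fh: soft seeds classified as hard
    """
    sensitivity = th / (th + fs)
    specificity = ts / (ts + fh)
    accuracy = (th + ts) / (th + fs + ts + fh)
    return sensitivity, specificity, accuracy

# Hypothetical unbalanced validation set: 90 hard seeds and 10 soft seeds,
# scored by a model that labels every seed as hard.
result = classification_metrics(th=90, fs=0, ts=0, fh=10)
print(result)  # (1.0, 0.0, 0.9)
```

Accuracy stays at 0.9 although no soft seed is recognized, which is why accuracy alone can be misleading for unbalanced seed lots.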

Seed sample
Seeds of Acacia seyal, Galega orientulis, Glycyrrhiza glabra, Medicago sativa, Melilotus officinalis, and Thermopsis lanceolata (Fig. 5) were provided by the Official Herbage and Turfgrass Seeds Testing Center, Ministry of Agriculture and Rural Affairs, China. Seeds were kept in waterproof bags under laboratory conditions (20 °C, 35% relative humidity) until image acquisition in April 2019. The initial moisture contents of A. seyal, G. orientulis, G. glabra, M. sativa, M. officinalis, and T. lanceolata were 8.5%, 7.3%, 6.8%, 6.5%, 6.7% and 8.1%, respectively. For each species, 400 seeds were used for hard and soft seed classification: 280 seeds were randomly selected as the calibration set and the remaining 120 seeds were used as the independent validation set.
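The random division of 400 seeds into 280 calibration and 120 validation seeds can be sketched as follows. This is only an illustration in Python: the function name, the integer seed identifiers and the fixed random seed are assumptions, and the software actually used for the split is not stated here.

```python
import random

def split_calibration_validation(seed_ids, n_calibration, rng_seed=0):
    """Randomly split seed identifiers into calibration and validation sets."""
    rng = random.Random(rng_seed)            # fixed seed so the split is reproducible
    calibration = set(rng.sample(seed_ids, n_calibration))
    validation = [s for s in seed_ids if s not in calibration]
    return sorted(calibration), validation

seed_ids = list(range(400))                  # 400 seeds of one species (hypothetical IDs)
calibration, validation = split_calibration_validation(seed_ids, 280)
print(len(calibration), len(validation))     # 280 120
```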

Multispectral imaging system
Multispectral images were acquired with a VideometerLab 4 (Videometer, Hørsholm, Denmark) multispectral imaging system. Seed samples (400 seeds per species) in petri dishes were placed beneath a hollow integrating sphere, with a camera located at the top of the sphere. During image capture, the sphere closes over the sample stage to create optically closed conditions, allowing even lighting with minimal shadows and specular reflection. Samples were illuminated by 19 high power light emitting diodes (LEDs) at specific wavelengths: 365, 405, 430, 450, 470, 490, 515, 540, 570, 590, 630, 645, 660, 690, 780, 850, 880, 890, and 970 nm. The LEDs strobe successively within a scan time of approximately five seconds, producing one monochrome image at each of the 19 wavelengths. The images consisted of 2192 × 2192 pixels, with a spatial resolution of approximately 40 μm/pixel. Before acquiring multispectral images, the system was fully calibrated radiometrically and geometrically using three successive plates: a white one for reflectance correction, a dark one for background correction and a dotted one for geometric pixel position alignment, followed by a light setup calibration.

Determination of hard seed
Following image acquisition, each seed was placed on two sheets of filter paper (Hangzhou Shuangquan, Hangzhou, Zhejiang, China) moistened with 10 ml distilled water in 12-cm-diameter petri dishes and incubated at 20 °C for 14 days. The numbers of imbibed (soft) and unimbibed (hard) seeds in each dish were monitored daily. When a seed imbibed, there was a visible change in its size/volume, so imbibed and unimbibed seeds could easily be distinguished from each other. The numbers of true hard and soft seeds in each data set are shown in Table 4.

Multispectral image analysis
Besides the seeds, the acquired multispectral images contain other objects, such as the petri dish and the surrounding background, which should be removed before extracting the spectral information of the individual seeds. Image segmentation was performed using the VideometerLab software version 3.10. All items except the seeds were removed by a normalized canonical discriminant analysis (nCDA) [34], and the seeds were segmented using a simple threshold. Then, the morphological traits and main spectral features of all individual seeds were extracted. The morphological traits included area, length, width, Width/Length Ratio, compactness circle, compactness ellipse, BetaShape a, BetaShape b, vertical skewness, CIELab L*, CIELab a*, CIELab b*, saturation, hue and vertical orientation [19,35]. Explanations of the morphological traits are listed in Additional file 4: Table S1. The extracted spectral signatures of the seeds represent the mean intensity of the reflected light at each single wavelength calculated from all seed pixels in the image.
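The spectral signature described above, a mean reflected intensity per wavelength over the seed pixels, can be sketched as follows. This is an illustrative Python example rather than the VideometerLab implementation: the function name, the nested-list image layout and the boolean seed mask are assumptions, and the mask stands in for the result of the nCDA and threshold segmentation.

```python
def mean_spectrum(bands, mask):
    """Mean intensity per band over the pixels where mask is True.

    bands: list of 2-D lists, one image per wavelength, all of the same shape.
    mask:  2-D list of booleans marking the pixels that belong to one seed.
    """
    seed_pixels = [(r, c) for r, row in enumerate(mask)
                   for c, inside in enumerate(row) if inside]
    if not seed_pixels:
        raise ValueError("mask contains no seed pixels")
    return [sum(band[r][c] for r, c in seed_pixels) / len(seed_pixels)
            for band in bands]

# Toy 2 x 3 image with two bands; the last column is background.
bands = [
    [[10, 20, 0], [30, 40, 0]],
    [[1, 2, 9], [3, 4, 9]],
]
mask = [[True, True, False], [True, True, False]]
spectrum = mean_spectrum(bands, mask)
print(spectrum)  # [25.0, 2.5]
```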

Multivariate data analysis
Multivariate analysis including PCA, LDA and support vector machines (SVM) using FactoMineR, MASS, and

PCA
To identify patterns hidden in the extracted morphological features and spectral data of all seeds, PCA was carried out as an explorative multivariate data analysis technique, which is commonly used to get an overview of the systematic variation in the data and to explore the possibility of grouping seeds with similar morphology and spectral profiles [36-38]. PCA scores were calculated based on the first two, the first three, and all PCs.
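As an illustration of what PCA scores capture, the sketch below runs PCA on toy two-feature data using the closed-form eigen-decomposition of the 2 × 2 covariance matrix. It is written in plain Python for self-containment and is not the FactoMineR workflow; the function name and all data values are hypothetical. The two groups differ only in the first feature, while the second feature carries most of the variance.

```python
import math

def pca_2d(points):
    """PCA for two features; returns (PC1 variance share, PC1 direction, PC1 scores)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    a = sum(x * x for x, _ in centered) / (n - 1)   # variance of feature 1
    c = sum(y * y for _, y in centered) / (n - 1)   # variance of feature 2
    b = sum(x * y for x, y in centered) / (n - 1)   # covariance
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + root, half_trace - root   # eigenvalues, lam1 >= lam2
    if abs(b) > 1e-12:
        v = (b, lam1 - a)                                # eigenvector for lam1
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    v = (v[0] / norm, v[1] / norm)
    scores = [x * v[0] + y * v[1] for x, y in centered]
    return lam1 / (lam1 + lam2), v, scores

# Hypothetical data: groups differ slightly in feature 1, vary widely in feature 2.
hard = [(0.0, -3.0), (0.1, -1.0), (-0.1, 1.0), (0.0, 3.0)]
soft = [(1.0, -3.0), (1.1, -1.0), (0.9, 1.0), (1.0, 3.0)]
explained, direction, scores = pca_2d(hard + soft)
hard_scores, soft_scores = scores[:4], scores[4:]
print(round(explained, 2), direction)
print(hard_scores, soft_scores)
```

In this toy case PC1 follows the high-variance second feature, so the PC1 scores of the two groups overlap even though the groups are cleanly separated along the first feature.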

LDA
LDA is a well-known algorithm that calculates a surface separating the sample groups by establishing a linear discriminant function that maximizes the ratio of the between-class to the within-class variance [37]. In this study, the seeds were randomly divided into calibration (70% of the total sample) and validation (the remaining 30%) sets, as shown in Table 4. LDA classification models were developed using the calibration set, and the models obtained were validated using the independent validation set, which was not used during model building. To reduce potential overfitting, the LDA models were developed under leave-one-out cross-validation, in which one seed was taken out at a time and the LDA model was built on the remaining seeds. The model was then used to classify the seed left out, and the same routine was repeated until every seed had been left out once [39]. The performance of the classification method was evaluated by its ability to detect the presence of hard seeds in seed lots of each species, through the sensitivity (eq. 1), specificity (eq. 2), and accuracy (eq. 3).
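Assuming the conventional definitions of these three measures, with hard seeds taken as the positive class, equations 1 to 3 can be written in terms of the counts TH, FS, TS and FH as follows (a reconstruction under that assumption):

```latex
\begin{align}
\text{Sensitivity} &= \frac{TH}{TH + FS} \tag{1}\\
\text{Specificity} &= \frac{TS}{TS + FH} \tag{2}\\
\text{Accuracy}    &= \frac{TH + TS}{TH + FS + TS + FH} \tag{3}
\end{align}
```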
where TH is the number of true hard seeds, FS the number of false soft seeds, TS the number of true soft seeds and FH the number of false hard seeds.
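The leave-one-out routine described above can be sketched with a two-class Fisher discriminant. The following Python example is a simplified illustration, not the MASS-based analysis: it handles only two features, assumes equal class priors, and uses hypothetical function names and hypothetical (area, reflectance) values.

```python
def fit_lda(X, y):
    """Two-class Fisher LDA on two features; returns (weights, threshold)."""
    groups = {k: [x for x, t in zip(X, y) if t == k] for k in (0, 1)}
    means = {k: [sum(col) / len(rows) for col in zip(*rows)]
             for k, rows in groups.items()}
    s = [[0.0, 0.0], [0.0, 0.0]]             # pooled within-class scatter matrix
    for k, rows in groups.items():
        for row in rows:
            d = [row[0] - means[k][0], row[1] - means[k][1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [means[1][0] - means[0][0], means[1][1] - means[0][1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(means[0][i] + means[1][i]) / 2 for i in range(2)]
    return w, w[0] * mid[0] + w[1] * mid[1]  # threshold at the projected midpoint

def predict_lda(model, x):
    w, threshold = model
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0

def leave_one_out_accuracy(X, y):
    """Refit without seed i, classify seed i, repeat for every seed."""
    correct = 0
    for i in range(len(X)):
        model = fit_lda(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += predict_lda(model, X[i]) == y[i]
    return correct / len(X)

# Hypothetical (area, mean reflectance) values; label 1 = hard, 0 = soft.
X = [(4.0, 20.0), (4.2, 22.0), (3.8, 21.0), (4.1, 19.0), (3.9, 23.0),
     (5.0, 30.0), (5.3, 29.0), (4.9, 32.0), (5.1, 31.0), (5.2, 28.0)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
loo_accuracy = leave_one_out_accuracy(X, y)
print(loo_accuracy)
```

Because each seed is classified by a model that never saw it, the leave-one-out accuracy is a less optimistic estimate than the accuracy on the seeds used for fitting.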

SVM
Least squares support vector machine (LS-SVM) is a variant of the support vector machine (SVM), a supervised learning algorithm for classification and regression tasks proposed by Cortes and Vapnik [40]. Compared with other analysis methods, SVM can learn in a high-dimensional feature space with fewer calibration variables or samples; details of the SVM algorithm can be found in previously reported research [41,42]. It has been used effectively for multivariate function estimation and non-linear classification. In this study, the linear kernel was used for classification. To reduce potential overfitting, the LS-SVM models were developed under leave-one-out cross-validation as described above. The quality of classification was evaluated by calculating sensitivity, specificity and classification accuracy as described above.
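For readers unfamiliar with LS-SVM, the sketch below shows the standard linear-kernel LS-SVM classifier, which is trained by solving one linear system rather than a quadratic program. It is a plain-Python illustration, not the software used in this study: the function names, the regularization value gamma = 10 and the data are hypothetical, and labels are coded +1 (hard) and -1 (soft).

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def fit_ls_svm(X, y, gamma=10.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1] with a linear kernel."""
    n = len(X)
    a = [[0.0] + [float(t) for t in y]]
    for i in range(n):
        row = [float(y[i])]
        for j in range(n):
            omega = y[i] * y[j] * dot(X[i], X[j])
            row.append(omega + (1.0 / gamma if i == j else 0.0))
        a.append(row)
    solution = solve_linear_system(a, [0.0] + [1.0] * n)
    return solution[0], solution[1:]          # bias b, multipliers alpha

def predict_ls_svm(model, X, y, x_new):
    b, alpha = model
    score = sum(al * t * dot(xi, x_new) for al, t, xi in zip(alpha, y, X)) + b
    return 1 if score >= 0 else -1

# Hypothetical (area, mean reflectance) values; +1 = hard, -1 = soft.
X = [(4.0, 20.0), (4.2, 22.0), (3.8, 21.0), (4.1, 19.0), (3.9, 23.0),
     (5.0, 30.0), (5.3, 29.0), (4.9, 32.0), (5.1, 31.0), (5.2, 28.0)]
y = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]
model = fit_ls_svm(X, y)
predictions = [predict_ls_svm(model, X, y, x) for x in X]
print(predictions)
```

In practice the features would normally be scaled before fitting, and gamma would be tuned by cross-validation; both steps are omitted here for brevity.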

Conclusion
In brief, our study clearly shows that multispectral imaging together with multivariate analysis could be a promising technique to identify hard seeds in some legume